Talk title: Towards LLMs that Understand, Remember, and Adapt to Users
Abstract: Large language models (LLMs) are trained on the collective knowledge of the internet, but they are used to serve billions of individual users. Recent years have witnessed increasing interest in adapting population-level LLMs to accommodate the diverse goals, preferences, and contexts of individual users. To build personalised LLMs, we need models that can understand users, maintain memory over time, learn from sparse and heterogeneous personal data, and align with what each user values. In this talk, I will illustrate how each of these challenges can be addressed through examples from our group’s recent work. I will conclude by discussing open problems and potential directions for personalised LLM research.
Bio: Yulan He is a Professor in Natural Language Processing at King’s College London and a Turing AI Fellow. Her research focuses on improving the robustness of Large Language Model (LLM) reasoning, agentic AI, long-context QA, model interpretability, safety alignment, AI for education, health, and science. She has received several prizes and awards for her research. Her five-year Turing AI Fellowship project, “Event-Centric Framework for Natural Language Understanding”, was awarded Best Research Project (Research Excellence) at the RAi UK AI & Robotics Awards 2026. She also received a Best Research Paper Award at the same awards ceremony. In addition, she is the recipient of a SWSA Ten-Year Award and a CIKM Test-of-Time Award, and was named an inaugural Highly Ranked Scholar by ScholarGPS.
Talk title: Evaluating for LLM uncertainty and bias in healthcare
Abstract: In this talk, I discuss two of our recent papers (EACL 2026 and ACL 2026) and how they combine uncertainty (quantification) and bias (evaluation) for LLMs in healthcare. First, we explore uncertainty quantification methods for LLMs tailored to clinical applications. We look at how different uncertainty quantification methods for LLMs fare when applied to different clinical specialties (e.g. cardiology vs. speech-language pathology) and to different types of clinical concepts (e.g. diagnoses vs. procedures). Second, we investigate how a patient’s sexual orientation and religious affiliation distort uncertainty signals and model accuracy. We show that these identity markers cause a "calibration crisis" in LLMs, with harms to calibration that compound non-additively, a significant risk to LLM fairness and safety.
Bio: Iacer Calixto has a background in Computer Science, Natural Language Processing, and Machine Learning, and obtained his PhD from Dublin City University (2017) on the topic of integrating visual information in machine translation. He was a Marie-Curie Postdoctoral Fellow visiting the New York University Courant Institute of Mathematical Sciences, and currently leads the NLP4Health Lab in the Department of Medical Informatics of the University of Amsterdam. His lab, including 6 PhD students and 2 postdocs, tackles the methodological bottlenecks that must be overcome to take NLP methods from bench to clinical practice: guaranteeing patient privacy, quantifying the uncertainty of large language models (LLMs), integrating data from multiple modalities (e.g., structured data, medical images, time series, text), and making NLP models interpretable and explainable. Finally, Iacer focuses on real-world problems across a number of high-impact clinical specialties, such as cardiology and intensive care medicine. He holds an NWO AiNED Fellowship (2024-2029) and is involved in the EU projects DataTools4Heart (whose goal is the development of a toolbox for clinicians, researchers, and data scientists) and Medispeech (whose goal is to automate the reporting of doctor-patient consultations using LLMs).
Talk title: Responsibly Building Multilingual Language Models for Hundreds of Languages
Abstract: Large Language Models (LLMs) have transformed artificial intelligence, yet their development remains heavily skewed toward high-resource languages, leaving many global communities underserved. Multilingual LLMs aim to bridge this gap but face steep hurdles, including severe data scarcity, biased tokenization systems, and a lack of representative evaluation benchmarks. This presentation explores these critical challenges and introduces practical, technically innovative solutions designed to build fairer AI systems. The discussion details new insights into how multilingual models transfer knowledge across diverse languages, alongside architectural enhancements that boost cross-lingual reasoning. To tackle data and evaluation constraints, the research introduces a novel contrastive learning approach for low-resource language identification, a parity-aware tokenization algorithm to eliminate script biases, and a comprehensive evaluation benchmark spanning 44 languages. Ultimately, these contributions provide the theoretical frameworks and practical tools necessary to advance equitable AI technologies that leave no language behind.
Bio: Negar Foroutan is a research scientist at Google Research, Zurich, and a recent PhD graduate from EPFL. Her research broadly encompasses NLP and machine learning, with a passionate focus on improving the multilingual capabilities of LLMs, especially in low-resource settings. She is actively involved in the full pipeline of training multilingual LLMs, including pretraining data construction, data mixtures, language-aware tokenization, and robust evaluation. When she isn't training models, you can find her hiking, keeping up with politics, or annoying her friends with unsolicited etymology facts :)
Talk title: Compositional approaches in modelling language and reasoning
Abstract: Neural approaches to modelling language and concepts have proven quite effective, with a proliferation of large models trained on correspondingly massive datasets. However, these models still fail on some tasks that humans, and symbolic approaches, can easily solve. Large neural models are also, to a certain extent, black boxes, particularly those that are proprietary. There is therefore a need to integrate compositional and neural approaches, firstly to potentially improve the performance of large neural models, and secondly to analyze and explain the representations that these systems are using. In this talk I will present results showing that large neural models can fail at tasks that humans are able to do, and discuss alternative, theory-based approaches that have the potential to perform more strongly. I will give applications in language, reasoning, and vision. Finally, I will present some future directions in understanding the types of reasoning or symbol manipulation that large neural models may be performing.
Talk title: How Much Reasoning Can Prediction Buy? Context-Directed Extrapolation and the Limits of Generalisation in LLMs
Abstract: Large language models have demonstrated exceptional performance across diverse tasks for which they were not explicitly trained, including those that require complex reasoning abilities in people. However, despite their impressive capabilities, LLMs tend to fail unexpectedly even on tasks that young children can solve with relative ease. How do we reconcile these conflicting capabilities and limitations of state-of-the-art LLMs? To answer this question, I will draw on a series of papers (ACL, TMLR) and argue that both follow from one mechanism. What we call reasoning is “context-directed extrapolation”: extrapolation from training priors, directed by the prompt. This gives one account of capability and failure alike. More importantly, from this perspective, it becomes possible to predict, and therefore begin to mitigate, specific failure modes in even the most advanced “thinking” models (EMNLP, book chapters). I will close by exploring the open problem this leaves, characterising the boundary of generalisation itself and how it moves as models scale.
Bio: Harish Tayyar Madabushi is a Senior Lecturer (Associate Professor) in Artificial Intelligence at the University of Bath. His research focuses on the fundamental mechanisms that underpin the performance and functioning of large language models such as ChatGPT. His work was included in the discussion paper on the Capabilities and Risks of Frontier AI that informed discussions at the UK AI Safety Summit held at Bletchley Park. His research on the constructional information encoded in language models has been influential in bringing together the fields of construction grammar and pre-trained language models. His work also includes collaborative industrial research aimed at correcting biases in speech-to-text systems widely used across the UK. Before starting his PhD in automated question answering at the University of Birmingham, he founded and led a social media data analytics company based in Singapore.
Talk title: Illuminating Generative AI: Mapping Knowledge in Large Language Models
Abstract: Millions of everyday users are interacting with technologies built with generative AI. While these AI-based systems are being increasingly integrated into modern life, they can also magnify risks, inequities, and dissatisfaction when providers deploy unreliable systems. A primary obstacle to having more reliable systems is the opacity of the underlying large language models: we lack a systematic understanding of how models work, where critical vulnerabilities may arise, why they arise, and how models must be redesigned to address them. In this talk, I will first describe my work in investigating large language models to illuminate when models acquire knowledge and capabilities. Then, I will briefly touch upon our efforts to build tools that enable greater data transparency for large language models. I will then describe our work on understanding why large language models produce incorrect knowledge, or hallucinate, and implications for building the next generation of responsible AI systems.
Bio: Abhilasha Ravichander is tenure-track faculty at the Max Planck Institute for Software Systems. Prior to joining MPI, Abhilasha was a postdoctoral scholar at the University of Washington and the Allen Institute for Artificial Intelligence. She received her PhD from Carnegie Mellon University in 2022. Her research focuses on improving the factuality, robustness, and transparency of large-scale language models. Abhilasha’s work has been presented at several top NLP conferences, receiving an Outstanding Paper Award at ACL 2025, a Best Resource Paper Award at ACL 2024, a Best Theme Paper Award at ACL 2024, and an Area Chair Favorite Paper Award at COLING 2018. She has been recognized as a "Rising Star in Generative AI" (2024), "Rising Star in EECS" (2022), and "Rising Star in Data Science" (2021).
Talk title: How expert is your "expert AI" in the real world?
Abstract: Large-scale data annotation, training, and alignment bring the promise of "Expert AI". Depending on who you ask, this ranges from models capable of assisting domain experts in their work to superhuman AGI, which may replace you in the near future. Evidence of this "expertise" often comes from benchmarks, leaderboards, and sometimes even ad-hoc cases. In this talk, we take a closer (and perhaps, critical) look at this evaluation paradigm: how "expert" are these models when we apply them to really challenging real-world scenarios? We will cover three hard cases: irrelevant information, subjective preferences, and enterprise QA. Our "expert AI" might gracefully collapse, teaching us a lesson on what we need to do before we deploy these systems in sensitive domains.
Bio: Simone Balloccu is a computer scientist with 9 years of research experience in NLP & AI. He has worked on several EU-funded projects, including Horizon 2020 and ERC projects, focusing on AI for mental health and behaviour change, human evaluation, and more generally on AI applied to expert domains. He leads the “NLP for expert domains” lab at TU Darmstadt, researching Expert-AI interaction and cooperation. His current research involves efficient RAG systems over corporate knowledge, multimodal NLP applied to mental health, and modelling expert preferences in LLMs.