2026 Cohort
The field of AI evaluation doesn’t have enough people in it.
These 40 are doing something about that.
They looked at a field still figuring itself out —
and decided that was exactly the right moment to show up, lean in, and do the hard thing.
They are the kind of people who ask:
How do we know this system is safe?
What are we actually measuring?
What are we missing?
Different backgrounds. Very different paths.
One shared instinct:
Don’t trust the system. Test it — rigorously.
Meet the 2026 cohort of the AI Evaluation Programme 🌍
*includes students who consented to being featured
Aashiq Muhamed
India
I'm a Ph.D. student in the Machine Learning Department at Carnegie Mellon University, advised by Professors Mona Diab and Virginia Smith. My research focuses on technical preparedness for transformative AI, spanning mechanistic interpretability and collaborative AI frameworks.
I'm drawn to AI evaluation through my work on LLM-as-a-judge frameworks and my belief that robust evaluation paradigms are essential for safe AI deployment. Before my Ph.D., I spent five years as an Applied Scientist at Amazon, which grounds my research in practical impact. I'm an Anthropic Fellow, Amazon AI PhD Fellow, and Siebel Scholar, and I've published at venues including COLM, EMNLP, CVPR, and NeurIPS workshops.
Ana Simina Stoian
Romania
Ana Simina Stoian brings 8+ years of experience in AI training and evaluation (RLHF). Her work focuses on developing rewards, reasoning frameworks, and expert benchmarks for frontier AI agents. She holds a Master’s in Digital Technology and Public Policy from Sciences Po Paris (Hons).
Transitioning into AI Safety and Governance to bridge development and policymaking, Simina focuses on agent behavior in multi-agent systems and the risks involved in ensuring alignment with human values. She also explores the empirical and ethical inquiries raised by the possibility of AI consciousness.
Simina believes that without robust model evaluations to indicate capabilities and safety, effective AI Governance is impossible. This motivated her to join the International Programme on AI Evaluation.
Anna Liashenko
Ukraine
I'm a Sr. Product Manager at Sprinklr, deploying AI products to enterprise customers. I came to tech after doing neuroscience research on decision-making. First, I fell in love with edtech, and then with GenAI's potential to transform how we learn. I spent 3+ years at Amazon ideating, piloting, and scaling a GenAI learning platform from zero to thousands of users globally, and co-founded a growing community of 700+ Ukrainian educators, empowering them to use AI to improve teaching and learning. AI evaluation doesn't get much spotlight, but without it AI won't deliver. What drew me here is the opportunity to meet forward-thinking people from various backgrounds and to learn future-minded approaches I can apply at scale and bring home to Ukraine.
Antoni Mestre Gascón
Spain
Antoni Mestre is a policy officer at the AI Office of the European Commission, working on the governance and responsible deployment of artificial intelligence. His background combines academic research in human-centered and explainable AI with experience in public policy and institutional advisory roles.
Previously, he was a researcher at the Valencian Research Institute for Artificial Intelligence and an assistant professor at the Open University of Catalonia. He has collaborated with international and governmental institutions, including the International Atomic Energy Agency (IAEA) and the European Commission, and has served as an external advisor to the Corts Valencianes and the Spanish Ministry of Youth and Adolescence.
Bernardo Villegas
Ecuador
I am a PhD candidate at the Center for Human-Inspired Artificial Intelligence (CHIA), University of Cambridge, supervised by Dr. Umang Bhatt and co-supervised by Prof. Anna Korhonen. I am a member of Wolfson College.
My research investigates the externalities of human-AI interaction: what happens when AI-assisted work flows to colleagues, patients, or decision-makers who weren’t present. Current AI systems are optimizers with hidden costs: they enhance individual performance while the ripple effects go unmeasured.
I bridge AI, cognitive science, social science, and system design to make these hidden costs visible and ensure AI augments not just individual users, but the broader human networks they serve.
Bijna Balan
India
As a Digital Regulatory Expert in the Defence and Space sector, I bridge the gap between innovation and accountability. Holding a Master’s in Robotics Engineering and pursuing doctoral research in AI Ethics, I specialize in translating complex frameworks, like the EU AI Act, into operational governance. My background spans Engineering, Data Privacy, and Project Management, and will now be further strengthened by the International AI Evaluation Programme. I look forward to collaborating with global experts to implement the latest safety standards, ensuring that AI deployment remains both transformative and ethically sound.
Bolun Sun
China
I am a Ph.D. student at the Kellogg School of Management at Northwestern University. My research lies at the intersection of computational social science, human-centered AI, and natural language processing, with applications in social psychology, policy communication, and AI-mediated information systems. I have published in top-tier conferences such as ICLR, NeurIPS, and ACL, and serve as a reviewer for such venues. I am drawn to AI evaluation because many real-world failures of AI systems stem not from model capability, but from misalignment between evaluation benchmarks and human, social, and institutional contexts. I am particularly interested in developing evaluation frameworks that better capture the societal impact, robustness, and normative implications of AI systems.
Chaoran Chen
China
Chaoran Chen is a PhD candidate at the University of Notre Dame specializing in Human-Centered AI and Usable Security. His research addresses a new state of the world where users delegate critical security and privacy decisions to AI agents. To ensure these agents uphold the properties of safety and trust, Chaoran focuses on bridging the gap between model behavior and human reasoning. He is drawn to AI evaluation as a means to restore user agency, creating tools like contextual assistants and auditing sandboxes. His goal is to make AI-driven security and privacy judgments cognitively legible, allowing users to verify and align agent behavior with their expectations.
Christina Wiremu-Brook
New Zealand
Christina is an AI ethics and data strategy expert focused on high-risk, high-impact sectors.
She is the founder of Kuria, an advisory firm that helps organisations embed responsible AI in practice—ensuring AI is innovative, fair, safe, and aligned with human values.
Christina studied at Tsinghua University, completing a Master’s in Global Affairs (AI Ethics).
She later served as Head of Safe AI at the NSW Department of Education, supporting the rollout of NSWEduChat—at the time, one of the largest education AI pilots globally.
Earlier in her career, she was a data scientist at a large consulting firm, working across health, justice, education and more to strengthen data governance and evaluation.
Christina is Ngāpuhi Māori from Aotearoa, New Zealand.
Claire Liu
China
I currently work on foundation model safety evaluation at a technology company. My day-to-day work has given me substantial practical insight into AI evaluation, while also raising deeper questions about how such evaluations can be conducted more scientifically and reliably. I am interested in contributing toward the development of a more robust and standardized paradigm for safety evaluation.
Haomiaomiao Wang
China
Harmonia Wang is a Research Assistant at Dublin City University’s Insight SFI Research Centre for Data Analytics, working at the intersection of cognitive science and AI. Her research focuses on decision-making, human-AI interaction, and evaluating how language models behave under uncertainty and constraint. She is drawn to AI evaluation as a tool for systematically identifying model limitations and risks, and for informing safer and more responsible AI deployment.
Isabel Barberá
Spain
I am a Senior AI Advisor at the Dutch Coordinating Supervisor on AI and Algorithms (DCA), part of the Dutch Data Protection Authority (AP). In this role, I lead AI safety and security initiatives and contribute to AI standardization. I am also an independent researcher with a focus on AI risks and safety.
I am particularly interested in how AI evaluations can be operationalised to help detect, assess, and mitigate real-world harms. As AI evaluation and safety are still advancing fields, it is important to develop expertise and establish shared methods and standards that regulators, industry, and other stakeholders can trust. I am proud to be part of this first cohort taking steps toward establishing AI Evaluation & Safety as a formal academic discipline.
Ishan Kavathekar
India
I'm a final-year master's student at IIIT-Hyderabad, advised by Dr. Ponnurangam Kumaraguru. I am interested in the safety and robustness of LLM agents, multi-agent systems, and evaluating emergent behaviors in AI models. Previously, I interned at Microsoft Research India and Adobe Research India.
Jodi Casabianca
United States
Jodi Casabianca, PhD, is the Founder and Chief Scientist of BroadMetrics, a psychometric and assessment firm focused on principled measurement of both AI systems and humans. A psychometrician by training, she specializes in assessment design, human data systems, AI scoring, and the emerging field of AI/LLM psychometrics. Her work is grounded in years of hands-on experience evaluating AI scoring models for large-scale, high-stakes assessments—work that shaped her conviction that the opacity of AI systems makes rigorous psychometric evaluation not just useful, but essential. She brings principles of validity, reliability, and fairness from traditional psychometrics into AI evaluation, developing frameworks that make claims about AI behavior interpretable, defensible, and ethically grounded.
Jodie Levy
United Kingdom
Jodie Levy was a founding member of the UK AI Security Institute (AISI), where she played a key role in building the organization’s global research efforts across a $100M portfolio. She designed the Systemic AI Safety grants programme at AISI and evaluated projects delivering impactful progress to mitigate AI risks. Jodie was also a lead contributor to early regulatory approaches to frontier AI oversight.
Now serving as Global Head of AI Governance & Policy at Dream, she leads research on AI policy to advance the governance of frontier AI and agentic systems. She has been close to AI evaluations for several years but has never performed one, and is keen to gain this practical skillset and to support the cohort with insights into AI policy and AI safety. She is excited to join this great community!
Juan Vazquez
Spain
I am a Software and Research Engineer at Arb Research. My research interests span evaluations, multi-agent environments, model internals, and language model creativity. In the past, I have also worked on computational materials research and energy modelling for sustainable architecture.
I care about reliable AI evaluation because I want to be able to trust model outputs beyond task completion. My goal is to develop standards that researchers want to use, and users can understand.
Julie Fitzgerald
United States
I am currently a contractor with a leading research agency, helping to develop and manage emerging technology programs. I took a chance on AI over 25 years ago, leaving graduate studies in philosophy to pursue work in ontology design and reasoning. After improving internal evaluations, I began to design and run evaluations for large government-funded research efforts, eventually founding my own consulting company. I am deeply interested in how evaluation of AI systems can support responsible and ethical use of technology, especially in how we can prevent cognitive harms to users and society more broadly.
Kai Rawal
India
I am currently a doctoral student at the University of Oxford, with prior experience working in AI both in industry and in academia. My research focus until now has been in AI explainability and interpretability, and I am interested in exploring intersections with AI evaluation.
Konstantinos Krampis
Greece
I have been working in bioinformatics for 15 years and pivoted towards teaching and research in deep learning and AI interpretability over the past 4 years. I believe there is a critical asymmetry in AI development: while LLM alignment techniques optimize outward compliance, they cannot verify whether internal objectives genuinely align with stated goals or merely simulate alignment opportunistically. My goal is to research AI interpretability, long-horizon agent tasks, and evaluations of whether LLM behaviors indicate sophisticated cognition—including situational awareness, hidden reasoning, and strategic planning—and whether these are implemented through interpretable features, directly informing questions about the nature of AI cognition. https://kkrampis.github.io/blog/
Li Kong
China
I am Li Kong, a first-year Master's student at the Gaoling School of Artificial Intelligence, Renmin University of China, under the guidance of Professor Qi Qi. My main research interests lie in game theory, operations research, and LLMs. I am a passionate researcher shaping the future through technology, and I am always open to new collaborations and conversations!
Lorena Almaraz De La Garza
Mexico
Lorena is an AI Governance professional focused on AI Trust and Safety. In particular, her focus is the effective embedding of regulatory and policy frameworks into everyday science and engineering practices. In other words, Lorena is a translator across fields: research, engineering, product, legal, compliance, and risk. Building a common language across these perspectives, Lorena has contributed to the development and adoption of AI Governance programs in the software industry and to applied engineering programs in AI research. Lorena has a Master of Information with concentrations in Human-Centred Data Science and Critical Information Policy from the University of Toronto.
Marysia Winkels
Netherlands
Marysia is an AI Safety Engineer at Gray Swan AI, a company founded by pioneers in AI vulnerability research that offers enterprise-grade security solutions for large language models (LLMs). In her spare time, she actively participates as a community member and organizer for the Amsterdam AI Safety group and serves as a board member of Encode EU.
Maxwell Igweogu
Nigeria
Maxwell Igweogu is a Nigerian-trained lawyer and PhD researcher at the Nigeria Artificial Intelligence Research Lab, University of Lagos. His work focuses on AI governance, with particular interest in bias detection, fairness, and non-discrimination in algorithmic systems. His doctoral research examines how Nigerian discrimination law aligns, or fails to align, with technical concepts of algorithmic fairness. He is drawn to AI evaluation because it sits at the intersection of law, technology, and social justice, offering practical ways to ensure AI systems are accountable, fair, and context-sensitive.
Muhua Huang
China
Muhua Huang is a PhD student in Organizational Behavior at Stanford University. She studies the characteristics of humans and AI (e.g., cognition, personality, values, and culture), treating AI either as “alien participants” or as a vast knowledge base. Her prior work in psychometrics and human-centered AI shaped her interest in context-sensitive approaches to AI evaluation. Coming from an interdisciplinary background, she aims to bring perspectives from psychology, sociology, and management science into AI evaluation. Before Stanford, she earned an MA in Computational Social Science at UChicago and a BA in Computer Science and Psychology at UBC.
Rishav Punatar
Rishav Punatar is a Senior Manager at the UK Government’s Incubator for AI (i.AI), working on AI in public services. He leads the end-to-end delivery of a UK government-led benchmark for AI in education under strategic oversight from No.10 Downing Street and the Prime Minister’s AI Adviser. He works with leading AI labs and safety institutes on evaluation design and construct validity. He holds a degree in Philosophy, Politics and Economics and previously worked as a GenAI consultant for Fortune 500 clients. He is an Imperial AI Policy Fellow.
Scott Blain
United States
Scott Blain is a postdoctoral researcher at Ohio State University studying personality and social cognition across human and artificial minds. His work applies psychometrics, experimental psychology, and cognitive neuroscience to AI evaluation and alignment—particularly how AI systems model other agents, detect deception, and represent mental states. He holds a BS in cognitive science from Vanderbilt and a PhD in psychology from the University of Minnesota. What draws him to AI evaluation is the conviction that the same tools psychologists use to understand human minds—careful measurement, validated constructs, and attention to individual differences—are essential for understanding the minds we’re building.
Sonia Fereidooni
United States
I am a Digital Humanities PhD student at the University of Cambridge studying the militarization of Generative AI for harm. I completed my BS in Computer Science & Data Science, BA in Sociology, and MS in Computer Science & Engineering at the University of Washington. My research has spanned AI bias, commonsense reasoning, improving unsupervised Computer Vision models, and designing equitable Computer Science education. I previously worked at Google Brain and DeepMind, developing and open-sourcing the AI/ML development frameworks T5X and SeqIO. I was also a Responsible AI Research Fellow at Google's Impact Lab, analyzing systemic biases within Google's search algorithms, and researched AI bias in the commonsense reasoning of SOTA models with the Mosaic team at the Allen Institute for AI (Ai2).
Xinhui Kang
Singapore
I’m a Computer Science Master’s student at Tsinghua University working at the intersection of LLM agents, AI safety/alignment, and human-centered computing. I’m excited about AI evaluation because it bridges rapid progress and trustworthy deployment, making safety claims auditable and linking observed behaviors to internal evidence.
Yuhuai Wei
China
I am an LLM Evaluation Specialist at a leading tech company in China, where I design evaluation frameworks for large language models. With an interdisciplinary background in sociology and China Studies from Peking University, my work focuses on bridging the gap between static benchmarks and real-world performance. I also contribute to academic research on reasoning and writing benchmarks. I am drawn to AI evaluation to help build shared, trusted standards that ensure frontier models are safe, reliable, and context-aware.
Zachary Yahn
United States
I'm a PhD student at Georgia Tech, where I research adversarial machine learning and computer vision for privacy. I'm also a machine learning fellow at 10a Labs, where I work on various projects related to AI safety and security. Broadly, I'm interested in probing and measuring the capabilities of frontier AI systems to understand their potential for societal harm. I approach this through the lens of jailbreaks, adversarial attacks, red-teaming, and other techniques.