Everyone's building AI evaluations. But who's evaluating the evaluations?
As AI systems become more powerful and more embedded in society, governments, companies, and researchers are racing to build evaluation methods — benchmarks, red teams, audits. But many of these evaluations lack basic scientific rigour: they're not reproducible, their results don't generalise beyond the test setting, and there's no shared standard for what "good enough" looks like.
This session explores the emerging field of meta-evaluation — the effort to build standards, best practices, and institutional frameworks for AI evaluation itself.
Firstly, from the methodological side: what statistical best practices should evaluations follow? What does a "ladder of evaluations" look like — from quick checks to deep assessments? And how should we think about evaluating open-weight models, where anyone can modify the system after release?
Secondly, from the institutional side: what initiatives are emerging to build a robust third-party evaluation ecosystem? The session covers the AI Evaluator Forum, AVERI's forthcoming statement, the regulatory markets approach (from researchers at JHU and others), and whether traditional standards bodies like ISO are the right fit for AI evaluation.
Part of Module 10: Governance, Policy, and Regulation.
Patricia Paskov is an AI evals and policy researcher at RAND and a Research Affiliate at the Oxford Martin AI Governance Initiative. Her work focuses on building the scientific and institutional foundations for AI evaluation — from methodological guidance on human baselines and benchmark design to frameworks for third-party AI auditing.
She is the Resilience Section Lead of the 2026 International AI Safety Report and a Working Group Chair at the EvalEval Coalition. Her research has been published at ICML (where it received a 2025 Spotlight Award) and FAccT, and in policy venues including RAND and Carnegie. She has presented to stakeholders including the US Department of Defense, the EU AI Office, and Microsoft Research.