
(Past) Measurement Theory for AI Evaluation: Cognitive Constructs, IRT, and Latent Factor Models — Dr. Sanmi Koyejo

How do we know whether an AI evaluation is actually measuring what it claims to measure? This talk introduces key ideas from measurement theory and psychometrics that can make AI evaluation more rigorous and meaningful.

Dr. Sanmi Koyejo covers three foundational concepts: cognitive constructs and how to define what we're trying to evaluate, Item Response Theory (IRT) as a framework for understanding test items and model abilities, and latent factor models for uncovering structure beneath surface-level performance.
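
For readers who want to see what the IRT part looks like in practice, here is a minimal, illustrative sketch (not code from the talk). It fits a two-parameter logistic (2PL) IRT model, in which each item gets a difficulty and a discrimination parameter and each evaluated model a latent ability, to a synthetic matrix of model-by-item benchmark results. The synthetic data, variable names, and the simple joint gradient-ascent fit are assumptions made for illustration; real IRT analyses typically rely on dedicated packages and marginal maximum likelihood.

    # Minimal 2PL Item Response Theory sketch (illustrative only).
    # Rows = models ("test takers"), columns = benchmark items; entries are 1/0 correct.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data (an assumption for this sketch): 50 models, 30 items,
    # generated from a known 2PL process so we can check ability recovery.
    n_models, n_items = 50, 30
    true_theta = rng.normal(0, 1, n_models)     # latent ability per model
    true_a = rng.uniform(0.5, 2.0, n_items)     # item discrimination
    true_b = rng.normal(0, 1, n_items)          # item difficulty
    logits = true_a * (true_theta[:, None] - true_b)
    X = (rng.random((n_models, n_items)) < 1 / (1 + np.exp(-logits))).astype(float)

    # Joint maximum-likelihood fit by gradient ascent (simplified for clarity).
    theta = np.zeros(n_models)
    a = np.ones(n_items)
    b = np.zeros(n_items)
    lr = 0.05
    for _ in range(2000):
        p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))  # predicted P(correct)
        err = X - p                                      # Bernoulli log-likelihood gradient term
        theta += lr * (err * a).sum(axis=1) / n_items
        a += lr * (err * (theta[:, None] - b)).sum(axis=0) / n_models
        b += lr * (-err * a).sum(axis=0) / n_models
        theta = (theta - theta.mean()) / theta.std()     # fix the latent scale

    print("correlation of recovered vs. true ability:",
          np.corrcoef(theta, true_theta)[0, 1].round(2))

Latent factor models generalize the same idea to more than one latent dimension, so that groups of items loading on a shared factor can be read as measuring a common underlying capability rather than a single overall score.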

Mixing conceptual framing with technical detail, this session offers a foundation for thinking critically about what benchmarks and evaluations can — and cannot — tell us about AI systems.

Dr. Sanmi Koyejo is an Assistant Professor of Computer Science at Stanford University and an Adjunct Associate Professor at the University of Illinois at Urbana-Champaign. He leads the Stanford Trustworthy AI Research (STAIR) group, which focuses on developing measurement-theoretic foundations for trustworthy AI systems, including AI evaluation science, algorithmic accountability, and privacy-preserving machine learning.

Dr. Koyejo's applied research extends to healthcare, neuroimaging, and scientific discovery. He's affiliated with several Stanford institutes and groups, including SAIL, HAI, CRFM, AIMI, AI Safety, Machine Learning Group, and Bio-X.


Watch the Recording

Sanmi argued that the problem isn't that benchmarks are bad tools; it's that we've been asking them to do something they were never designed for. Benchmark scores aren't capability verdicts; they're evidence for limited claims.

Fields like psychology and education have been measuring complex constructs for decades; they ran into exactly the same issues and built frameworks to deal with them. AI has largely ignored that work. That's a problem now that benchmark scores are being used to make real decisions about deployment, regulation, and procurement, rather than just to compare models in a research lab.

Session Summary

Key takeaways, concepts, and references from this session, compiled by our team to help you revisit the ideas and digest the content.

Want to know when new sessions go live?

Sign up to get notified about upcoming lectures.

Previous
February 27

(Past) Evaluating Multi-Agent / Social Systems — Prof. Joel Z. Leibo

Next
March 25

(Past) Evaluating AI Agents — Dr. Cozmin Ududec