
(Past) Measurement Theory for AI Evaluation: Cognitive Constructs, IRT, and Latent Factor Models — Dr. Sanmi Koyejo

How do we know whether an AI evaluation is actually measuring what it claims to measure? This talk introduces key ideas from measurement theory and psychometrics that can make AI evaluation more rigorous and meaningful.

Dr. Sanmi Koyejo covers three foundational concepts: cognitive constructs and how to define what we're trying to evaluate, Item Response Theory (IRT) as a framework for understanding test items and model abilities, and latent factor models for uncovering structure beneath surface-level performance.
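
For readers who want to see what the IRT part looks like in practice, here is a minimal, illustrative sketch (not code from the talk). It fits a two-parameter logistic (2PL) IRT model, in which each item gets a difficulty and a discrimination parameter and each evaluated model a latent ability, to a synthetic matrix of model-by-item benchmark results. The synthetic data, variable names, and the simple joint gradient-ascent fit are assumptions made for illustration; real IRT analyses typically rely on dedicated packages and marginal maximum likelihood.

    # Minimal 2PL Item Response Theory sketch (illustrative only).
    # Rows = models ("test takers"), columns = benchmark items; entries are 1/0 correct.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data (an assumption for this sketch): 50 models, 30 items,
    # generated from a known 2PL process so we can check ability recovery.
    n_models, n_items = 50, 30
    true_theta = rng.normal(0, 1, n_models)     # latent ability per model
    true_a = rng.uniform(0.5, 2.0, n_items)     # item discrimination
    true_b = rng.normal(0, 1, n_items)          # item difficulty
    logits = true_a * (true_theta[:, None] - true_b)
    X = (rng.random((n_models, n_items)) < 1 / (1 + np.exp(-logits))).astype(float)

    # Joint maximum-likelihood fit by gradient ascent (simplified for clarity).
    theta = np.zeros(n_models)
    a = np.ones(n_items)
    b = np.zeros(n_items)
    lr = 0.05
    for _ in range(2000):
        p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))  # predicted P(correct)
        err = X - p                                      # Bernoulli log-likelihood gradient term
        theta += lr * (err * a).sum(axis=1) / n_items
        a += lr * (err * (theta[:, None] - b)).sum(axis=0) / n_models
        b += lr * (-err * a).sum(axis=0) / n_models
        theta = (theta - theta.mean()) / theta.std()     # fix the latent scale

    print("correlation of recovered vs. true ability:",
          np.corrcoef(theta, true_theta)[0, 1].round(2))

Latent factor models generalize the same idea to more than one latent dimension, so that groups of items loading on a shared factor can be read as measuring a common underlying capability rather than a single overall score.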

Mixing conceptual framing with technical detail, this session offers a foundation for thinking critically about what benchmarks and evaluations can — and cannot — tell us about AI systems.

Dr. Sanmi Koyejo is an Assistant Professor of Computer Science at Stanford University and an Adjunct Associate Professor at the University of Illinois at Urbana-Champaign. He leads the Stanford Trustworthy AI Research (STAIR) group, which focuses on developing measurement-theoretic foundations for trustworthy AI systems, including AI evaluation science, algorithmic accountability, and privacy-preserving machine learning.

Dr. Koyejo's applied research extends to healthcare, neuroimaging, and scientific discovery. He's affiliated with several Stanford institutes and groups, including SAIL, HAI, CRFM, AIMI, AI Safety, Machine Learning Group, and Bio-X.


Watch the Recording

Sanmi argued that the problem isn't that benchmarks are bad tools; it's that we've been asking them to do something they were never designed for. Benchmark scores aren't capability verdicts; they're evidence for limited claims.

Fields like psychology and education have been measuring complex constructs for decades; they ran into exactly the same issues and built frameworks to deal with them. AI has largely ignored that work. That's a problem now that benchmark scores are being used to make real decisions about deployment, regulation, and procurement, rather than just to compare models in a research lab.

Session Summary

Key takeaways, concepts, and references from this session, compiled by our team to help you revisit the ideas and digest the content.

Want to know when new sessions go live?

Sign up to get notified about upcoming lectures.

Previous
February 27

(Past) Evaluating Multi-Agent / Social Systems — Prof. Joel Z. Leibo

Next
March 25

(Past) Evaluating AI Agents — Dr. Cozmin Ududec