How do we know whether an AI evaluation is actually measuring what it claims to measure? This talk introduces key ideas from measurement theory and psychometrics that can make AI evaluation more rigorous and meaningful.
Dr. Sanmi Koyejo covers three foundational concepts: cognitive constructs and how to define what we're trying to evaluate, Item Response Theory (IRT) as a framework for understanding test items and model abilities, and latent factor models for uncovering structure beneath surface-level performance.
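For readers unfamiliar with IRT, here is a minimal sketch of the standard two-parameter logistic (2PL) model commonly used in the field; the item parameters and values below are illustrative assumptions, not material from the talk itself.

```python
import numpy as np

def irt_2pl(theta, a, b):
    """Two-parameter logistic (2PL) IRT model.

    Probability that a model with latent ability `theta` answers an
    item correctly, given the item's discrimination `a` and
    difficulty `b`.
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Example: three benchmark items of increasing difficulty,
# evaluated for a model of moderate ability (theta = 0.5).
theta = 0.5
difficulties = np.array([-1.0, 0.0, 1.5])    # easy, medium, hard
discriminations = np.array([1.2, 1.0, 0.8])  # how sharply each item separates abilities

print(irt_2pl(theta, discriminations, difficulties))
# Harder items yield lower predicted success probabilities.
```

The key idea is that item difficulty and model ability live on the same latent scale, so two benchmarks can be compared even when their raw accuracy numbers are not directly comparable.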
Mixing conceptual framing with technical detail, this session offers a foundation for thinking critically about what benchmarks and evaluations can — and cannot — tell us about AI systems.
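As a companion illustration of the latent factor idea (again an assumed sketch, not the speaker's code), the snippet below generates synthetic benchmark scores from two hidden abilities and checks whether factor analysis recovers that low-dimensional structure from the surface-level score matrix.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical setup: 200 models scored on 6 benchmark tasks.
# Scores are driven by 2 latent abilities (say, "reasoning" and
# "knowledge") plus noise; factor analysis should recover the
# low-dimensional structure underneath the 6 observed scores.
rng = np.random.default_rng(0)
abilities = rng.normal(size=(200, 2))   # latent factors per model
loadings = rng.normal(size=(2, 6))      # how strongly each task taps each ability
scores = abilities @ loadings + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2).fit(scores)
print(fa.components_.round(2))  # estimated task loadings on each latent factor
```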
Dr. Sanmi Koyejo is an Assistant Professor of Computer Science at Stanford University and an Adjunct Associate Professor at the University of Illinois at Urbana-Champaign. He leads the Stanford Trustworthy AI Research (STAIR) group, which focuses on developing measurement-theoretic foundations for trustworthy AI systems, including AI evaluation science, algorithmic accountability, and privacy-preserving machine learning.
Dr. Koyejo's applied research extends to healthcare, neuroimaging, and scientific discovery. He is affiliated with several Stanford institutes and groups, including SAIL, HAI, CRFM, AIMI, the Stanford Center for AI Safety, the Machine Learning Group, and Bio-X.