AI systems are no longer just answering questions.
They’re browsing the web, writing code, booking flights, and making decisions across multiple steps.

These are AI agents.
And evaluating them is a fundamentally different challenge.

We know, more or less, how to test a model that produces an answer.
We still do not fully know how to test a system that acts.

With a standard model, evaluation is relatively simple: give it an input, inspect the output.
With an agent, the picture gets much messier.

The same agent can succeed or fail on the same task depending on its scaffolding — the prompts, tools, memory, and infrastructure around the model. So what exactly are you evaluating: the model itself, or the wider system built around it?

Failure is not straightforward either. An agent might arrive at the right answer through a flawed or brittle process. Or it might reason well for most of the task, only to fail at the final step. Because agents operate in sequences, a small mistake early on can cascade through everything that follows.
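
To make the cascade concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every step succeeds independently and with the same probability. Real agents break both assumptions (errors are correlated, and agents sometimes recover), so read the numbers as intuition rather than measurement. The function name and the 0.95 figure are hypothetical.

```python
def chance_of_clean_run(per_step_success: float, num_steps: int) -> float:
    """Probability that every step succeeds.

    Assumption (illustrative only): steps are independent and
    share the same success probability.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


# How quickly does a "95% reliable" step degrade over longer tasks?
for steps in (1, 5, 10, 20):
    print(steps, round(chance_of_clean_run(0.95, steps), 3))
```

Under these simplified assumptions, a step that works 95% of the time gives roughly a 36% chance of a fully clean 20-step run, which is why longer tasks are so much harder to complete reliably.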

Then there is the compute question: if you give an agent more time to “think,” does it actually perform better — or does it just explore more dead ends with greater confidence?

This session introduces the foundations of agentic evaluation.

We’ll explore:

- what agent tasks look like in practice
- how scaffolding (the tools, prompts, and environment around a model) shapes its behaviour
- how inference-time compute scaling affects performance
- and how to do quality assurance by analysing decision traces step by step

Because when systems act over time, the output alone is no longer enough.
You need to understand the process.
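
As a minimal sketch of what step-by-step trace analysis could look like, the snippet below separates the final outcome from the quality of the process. The `Step` record, its fields, and the category labels are hypothetical choices made for illustration; they are not taken from the session or from any particular evaluation framework, and the per-step `ok` judgement is assumed to come from a human reviewer or an automated checker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    action: str  # what the agent did, e.g. "search" or "run_code"
    ok: bool     # assumed per-step judgement from a reviewer or checker


def first_bad_step(trace: List[Step]) -> Optional[int]:
    """Return the index of the first step judged faulty, or None."""
    for index, step in enumerate(trace):
        if not step.ok:
            return index
    return None


def classify(trace: List[Step], final_correct: bool) -> str:
    """Combine the final outcome with the process judgement."""
    bad = first_bad_step(trace)
    if final_correct and bad is None:
        return "clean success"
    if final_correct:
        return f"right answer, flawed process (first bad step: {bad})"
    if bad is None:
        return "failure with no flagged step (check the grader or labels)"
    return f"failure traced to step {bad}"


trace = [Step("search", True), Step("parse_results", False), Step("answer", True)]
print(classify(trace, final_correct=True))
```

Looking only at the final answer would score this run as a success. The trace shows that it passed despite a faulty intermediate step, which is exactly the kind of brittle result that output-only evaluation misses.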

Part of Module 8: Agentic, Alignment & Control Evaluations.

Dr. Cozmin Ududec leads the Science of Evaluations team at the UK AI Security Institute. He has a PhD in Quantum Physics from the University of Waterloo and Perimeter Institute for Theoretical Physics, where he worked on reconstructing quantum theory from first principles.

Before joining AISI, he spent a decade leading applied research at the intersection of machine learning, probabilistic modelling, and complex systems — most recently as Chief Scientist at Invenia Labs, where he led a team of 25 researchers optimising decision-making in electricity grids.