The maturity phases of running evals — Phil Hetzel, Braintrust

Categories: AI, Tools

Summary

Evals aren't exhaustive unit tests; they're risk management tools focused on specific agent failure modes. Phil Hetzel of Braintrust argues that LLM-as-judge scoring doesn't need 100% accuracy, only directional consistency, and that eval practices mature through predictable stages as agent complexity increases.

Key Takeaways

  1. Evals should target specific failure modes identified by you or subject matter experts, not exhaustively cover every potential problem. This keeps you from spending all your time writing tests instead of shipping.
  2. LLM-as-judge scoring doesn't require 100% accuracy; directional trending is sufficient. Non-deterministic evaluation results are acceptable as long as the scores trend in the right direction.
  3. Every eval has three core components: a task (agent/prompt under test), a dataset of examples to initiate it, and scoring functions to judge output quality.
  4. Evals serve both defensive (risk mitigation: reputational, systems, compliance) and offensive (measuring improvement per tweak) purposes for agent quality.
  5. There are discrete maturity stages for eval practices that correlate directly with agent complexity—simpler agents require fewer evaluation layers, but the progression is necessary.
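
The three components in takeaway 3 (task, dataset, scoring functions) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Braintrust's API: `Example`, `run_eval`, `exact_match`, and the two stand-in tasks are hypothetical names invented here, and the lookup-table "agents" stand in for a real prompt or agent call. Running the same dataset and scorer against two task variants also shows the "offensive" use from takeaway 4: measuring whether a tweak moved the score.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Example:
    """One dataset row: an input that starts the task and the answer we hope for."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 when the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: list[Example],
    scorers: dict[str, Callable[[str, str], float]],
) -> dict[str, float]:
    """Run the task on every example and return the mean score per scorer."""
    scores: dict[str, list[float]] = {name: [] for name in scorers}
    for example in dataset:
        output = task(example.input)
        for name, scorer in scorers.items():
            scores[name].append(scorer(output, example.expected))
    return {name: mean(values) for name, values in scores.items()}


# Hypothetical stand-ins for the agent or prompt under test.
def baseline_task(question: str) -> str:
    answers = {"capital of France?": "Paris"}
    return answers.get(question, "I don't know")


def tweaked_task(question: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(question, "I don't know")


dataset = [
    Example(input="capital of France?", expected="paris"),
    Example(input="2 + 2?", expected="4"),
]
scorers = {"exact_match": exact_match}

baseline = run_eval(baseline_task, dataset, scorers)
tweaked = run_eval(tweaked_task, dataset, scorers)
print("baseline:", baseline)  # {'exact_match': 0.5}
print("tweaked: ", tweaked)   # {'exact_match': 1.0}
```

An LLM-as-judge scorer would slot into `scorers` with the same `(output, expected) -> float` shape. Its scores would vary from run to run, so the comparison between `baseline` and `tweaked` would be read as a trend across runs rather than as an exact number.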

Transcript Excerpt

Welcome, everyone. It's always a challenge to be a presenter directly after lunch, because that's typically when the energy level goes from right around here to around here, but I'm going to try to make this session worth your while today. We've got 18 very quick minutes together, and during that time I'm going to be talking about the different maturity levels that I see people go through as they perform evals for their agents. Before we get into that, just a rough, quick agenda for today: I'll explain a little bit about myself and the company that I work for. We'll spend most of the time today on more theoretical concepts, not product concepts. And then we'll talk about where I think this field is going in the future. I'll also make sure to leave enough t…
