Why Tejal Patwardhan stopped underestimating the models - Episode 21

Categories: AI, Product

Summary

OpenAI research lead Tejal Patwardhan explains why traditional benchmarks lose their value once models saturate them. The real opportunity is building frontier evals that measure real-world usefulness and account for capability overhang: the gap between what models can do and what users actually adopt. Early reasoning models trained only on math outperformed human baselines on science benchmarks, a sign that founders systematically underestimate AI model capabilities.

Key Takeaways

  1. Move beyond saturated benchmarks like GPQA to frontier evals that measure real-world utility and capability overhang: the period in which models can already do things that cultural, legal, or regulatory adoption has not yet caught up with.
  2. Never underestimate model capabilities when designing evals. Early reasoning models trained exclusively on math achieved human-level performance on biology, chemistry, and physics problems, defying initial expectations.
  3. Capability overhang is a strategic advantage for early adopters. Most people judge AI by the hallucinations they see in current ChatGPT outputs, but the slope of improvement means transformative capabilities arrive faster than they expect.
  4. Build evals that measure preparedness and threat modeling alongside capability gains. OpenAI's preparedness team ran threat modeling exercises to understand release implications before reasoning model breakthroughs.
  5. Focus on slope, not current state. Even when individual model outputs look mediocre, a steep improvement curve can produce large capability gains within a 6-month window, so forecast ahead rather than planning around today's outputs.
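
To make the ideas of saturation and slope concrete, here is a minimal sketch of how an eval team might track them. Everything in it is an assumption for illustration: the names `EvalPoint`, `headroom`, `slope_per_month`, and `is_saturated`, the 5% headroom threshold, and the scores themselves are hypothetical and do not come from the episode or from any OpenAI tooling.

```python
from dataclasses import dataclass


@dataclass
class EvalPoint:
    months: float  # months since the first measurement
    score: float   # fraction of benchmark items solved, 0.0 to 1.0


def headroom(score: float, ceiling: float = 1.0) -> float:
    """Room left on a benchmark before it stops separating models."""
    return max(ceiling - score, 0.0)


def is_saturated(score: float, ceiling: float = 1.0,
                 min_headroom: float = 0.05) -> bool:
    """Treat a benchmark as saturated once headroom drops below a threshold.

    The 5% default is an arbitrary illustrative choice.
    """
    return headroom(score, ceiling) < min_headroom


def slope_per_month(points: list[EvalPoint]) -> float:
    """Ordinary least-squares slope of score against time."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_t = sum(p.months for p in points) / n
    mean_s = sum(p.score for p in points) / n
    cov = sum((p.months - mean_t) * (p.score - mean_s) for p in points)
    var = sum((p.months - mean_t) ** 2 for p in points)
    if var == 0:
        raise ValueError("measurements must cover more than one point in time")
    return cov / var


# Hypothetical scores, invented purely for illustration.
history = [EvalPoint(0, 0.40), EvalPoint(6, 0.65), EvalPoint(12, 0.90)]

if __name__ == "__main__":
    latest = history[-1].score
    print(f"slope per month: {slope_per_month(history):.4f}")
    print(f"headroom left:   {headroom(latest):.2f}")
    print(f"saturated:       {is_saturated(latest)}")
```

Read this way, a low current score with a steep slope says more about where a model will be in six months than the score alone, and shrinking headroom is the signal that a benchmark needs replacing with a harder one.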

Transcript Excerpt

Andrew Mayne: Hello, I'm Andrew Mayne, and welcome to the OpenAI podcast. On today's episode, we're talking to the research lead, Tejal Patwardhan, about the need to build frontier evals as old benchmarks get saturated.

Tejal Patwardhan: Generally bad. Benchmarking is bad. How can we make these models useful for people in their real work? We were really nervous because we were like, this human baseline is kind of hard. We don't know if the model is going to beat it. But we should never underestimate the model.

Andrew Mayne: Tejal, I have a question. How did you end up where you were? What brought you into OpenAI?

Tejal Patwardhan: O…
