Really Big Test-Time Compute in AI Changes Benchmarks, Safety and Research with OpenAI's Noam Brown

By No Priors Podcast

Categories: AI, VC

Summary

Modern AI models like o1 can productively think for weeks before plateauing on benchmarks, far longer than is practical to test. Noam Brown argues that current evaluation frameworks are fundamentally broken because they ignore test-time compute as a variable: without controlling for inference budget, performance comparisons are meaningless.

Key Takeaways

Model capability is now a direct function of inference budget. A $10M compute budget produces substantially different results than a $10K budget, but existing benchmarks report a single performance number without specifying the compute spent, which makes cross-model comparisons invalid.
Current benchmark grids are misleading because they don't control for test-time compute. o1 showed only 'a few percentage points' improvement over o1-preview on paper, but was substantially better in practice due to superior thinking efficiency at equivalent budgets.
Performance plateaus are now too far out to test in practice. GPT-3 models plateaued quickly, but modern models can keep improving for 100+ million tokens. Evaluation therefore requires either explicit token/cost budgets or performance curves showing improvement slopes, not single-point measurements.
Projection-based evaluation could replace exhaustive testing. For tasks where 100M-token runs show consistent improvement slopes, researchers can evaluate up to a reasonable budget and extrapolate performance from there, which avoids endless eval cycles without sacrificing much accuracy.
Safety and responsible scaling policies don't account for test-time compute. Existing frameworks evaluate static model capability but miss the dynamic scaling behavior that emerges at inference time, which creates policy gaps for frontier research.
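
As a rough illustration of the projection idea in the takeaways above, the sketch below fits a straight line to benchmark score against the logarithm of the token budget and extrapolates to a larger budget. Everything in it is an assumption made for illustration: the episode does not specify a functional form, the log-linear shape is only one plausible choice, the helper names `fit_log_linear` and `project` are hypothetical, and the numbers are made up rather than measured.

```python
import math


def fit_log_linear(token_budgets, scores):
    """Least-squares fit of score = a + b * log10(tokens).

    Assumes (for illustration only) that score improves linearly
    in the log of the test-time token budget.
    """
    xs = [math.log10(t) for t in token_budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def project(intercept, slope, token_budget, ceiling=1.0):
    """Extrapolate the fitted line to a new budget, capped at the max score."""
    return min(ceiling, intercept + slope * math.log10(token_budget))


# Made-up measurements: (token budget, accuracy) pairs, NOT real results.
budgets = [1e4, 1e5, 1e6, 1e7]
accuracy = [0.30, 0.40, 0.50, 0.60]

intercept, slope = fit_log_linear(budgets, accuracy)
projected = project(intercept, slope, 1e8)

# Report the curve, not a single point: slope per 10x budget plus a projection.
print(f"gain per 10x tokens: {slope:.3f}")
print(f"projected accuracy at 1e8 tokens: {projected:.3f}")
```

Reporting the slope alongside the budgets actually tested keeps the compute variable visible, and the projection should be read as an estimate that only holds while the measured trend continues.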


Transcript Excerpt

With GPT-3, you couldn't scale test-time compute. Like if you gave it a budget of $10 million and said, "Okay, well, let's see what GPT-3 can do," it really can't do that much. The current frameworks and responsible scaling policies, they don't really account for the amount of test-time compute. They just say, "Okay, well, what's the capability of the model?" The problem is we're in a world now where the capability of the model is a function of how much money you put into it. Basically, if you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10. Give it a budget of $10 million, it can do even more. At what budget should you evaluate these models? The policies that exist today don't really address that question.

Hi listeners, I'm Sarah Guo and welcome ba…
