I benchmarked the NEW Sonnet 5. The results shocked me.
Summary
Claude Sonnet 5 delivers near-Opus performance at 71% lower pricing ($2 input / $10 output per million tokens), making it viable for everyday agentic tasks. The real value, though, is learning how to build repeatable, human-graded benchmarks for your own model evaluations instead of relying on one-off vibe checks.
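To make the quoted pricing concrete, here is a minimal sketch of the per-run cost arithmetic. The $2 and $10 per-million-token prices come from the summary; the token counts in the example are hypothetical, chosen only to illustrate the calculation.

```python
# Prices quoted in the summary, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 2.00
OUTPUT_PRICE_PER_MTOK = 10.00


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run at the quoted per-million-token prices."""
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


# Hypothetical agentic task: 400k tokens read, 50k tokens written.
cost = run_cost(400_000, 50_000)
print(f"${cost:.2f}")  # prints "$1.30"
```

Output tokens are five times the price of input tokens at these rates, so tasks that generate a lot of text cost disproportionately more than tasks that mostly read.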
Key Takeaways
- Sonnet 5 scores 69% on Agentic Coding SWE-Bench versus Opus's 82%, but is 71% cheaper at $2 input / $10 output per million tokens through summer 2025. The performance is close enough that most teams won't notice the difference.
- Build repeatable, human-graded benchmarks rather than one-off vibe tests. Include frozen inputs, blind scoring, rubrics, and tasks relevant to your specific workflow (PRDs, bug-fixing, design).
- Use Claude Code's session history feature to let Claude review your past work patterns and recommend future benchmarks. This creates institutional memory for eval design.
- Sonnet 5 excels at agentic tool use and computer vision tasks (80%+ pass rate on browser/computer use), making it suitable for multi-step automation at a fraction of Opus's cost.
- Don't lose subjective taste in evaluations. Keep human judgment in benchmark scoring rather than relying on LLM-as-judge, so the scores reflect what actually matters for your use cases.
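The frozen-input, blind-scoring, rubric-based setup described in the takeaways can be sketched roughly as follows. This is an illustration only, not the harness built in the video: the task prompts, the rubric criteria, the 1-5 scale, and the names `Submission`, `blind`, and `unblind` are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from statistics import mean

# Frozen inputs: written once and reused unchanged for every model tested.
# (Hypothetical prompts for illustration.)
TASKS = {
    "prd": "Write a PRD for a team calendar feature.",
    "bug": "Fix the failing date-parsing test in the attached repo.",
}

# Hypothetical rubric: each criterion is scored 1-5 by a human grader.
RUBRIC = ("completeness", "correctness", "taste")


@dataclass(frozen=True)
class Submission:
    task_id: str
    model: str
    output: str


def blind(submissions: list[Submission], seed: int = 0) -> dict[str, Submission]:
    """Shuffle submissions and hide model names behind opaque labels."""
    shuffled = list(submissions)
    random.Random(seed).shuffle(shuffled)
    return {f"S{i + 1}": sub for i, sub in enumerate(shuffled)}


def unblind(key: dict[str, Submission],
            scores: dict[str, dict[str, int]]) -> dict[str, float]:
    """Average rubric scores per model after grading is finished."""
    per_model: dict[str, list[float]] = {}
    for label, rubric_scores in scores.items():
        missing = set(RUBRIC) - set(rubric_scores)
        if missing:
            raise ValueError(f"{label} is missing scores for {sorted(missing)}")
        per_model.setdefault(key[label].model, []).append(
            mean(rubric_scores[c] for c in RUBRIC)
        )
    return {model: mean(vals) for model, vals in per_model.items()}
```

In use, the grader would be shown only the label, the task prompt, and the output (for example `S1`, `TASKS[sub.task_id]`, `sub.output`), would record a score per rubric criterion, and only then would `unblind` map the labels back to model names. Fixing the seed keeps the shuffle reproducible, so a benchmark round can be audited later.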
Transcript Excerpt
We've got a new model, people, and it's from Anthropic. Now, is it Mythos? No. Is it Fable? No. But it is Claude Sonnet 5. Anthropic is claiming it's the most agentic Sonnet model yet, and we will get Opus-level tasks at Sonnet-level prices. Now, I've been testing a lot of models, and I'm starting to get bored of doing the vibe check. What I want to start developing is a set of benchmarks you can regularly test these new models against, ones that you'll care about. So today I'm going to be introducing the How I AI Bench, a set of AI- and Claire Vo-graded benchmarks that are going to tell us if this model, and any model, is good at writing PRDs, solving bugs, and one-shotting designs. I'm going to show you exactly how I built this benchmark using Claude Code, and we're going to see on a blind test what com…