Dark Factory: How OpenClaw Ships Faster Than You Can Read the Diff — Vincent Koc

Categories: AI, Tools

Summary

Static AI benchmarks break down for agentic systems that change faster than a fixed test suite can follow. Vincent Koc argues that evaluations must become as adaptive and malleable as the software they test, shifting from handcrafted test suites toward chaos engineering approaches that surface failures before they reach production.
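To make the chaos engineering idea concrete, here is a minimal Python sketch of fault injection around a tool call. It is not taken from the talk or from OpenClaw: `FlakyTool`, `call_with_retry`, and `lookup_price` are hypothetical names invented for illustration, and the retry loop is only a stand-in for whatever recovery policy a real agent harness would use.

```python
import random


class FlakyTool:
    """Hypothetical wrapper that injects failures into a tool at a set rate."""

    def __init__(self, tool, failure_rate, seed=0):
        self.tool = tool
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so runs are reproducible
        self.injected = 0               # number of faults injected so far

    def __call__(self, *args, **kwargs):
        # random() returns a float in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never fails.
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise TimeoutError("injected fault")
        return self.tool(*args, **kwargs)


def call_with_retry(tool, *args, attempts=3):
    """Assumed stand-in for an agent's recovery policy: retry on timeout."""
    last_error = None
    for _ in range(attempts):
        try:
            return tool(*args)
        except TimeoutError as exc:
            last_error = exc
    raise last_error


def lookup_price(item):
    """Toy tool used only for this example."""
    return {"apple": 3, "pear": 5}[item]


if __name__ == "__main__":
    flaky = FlakyTool(lookup_price, failure_rate=0.3, seed=42)
    try:
        print("price:", call_with_retry(flaky, "apple"))
    except TimeoutError:
        print("agent gave up after retries")
    print("faults injected:", flaky.injected)
```

The point of the sketch is the shape of the check: the evaluation asserts on how the system behaves when a dependency misbehaves, not only on the answer it gives when everything works.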

Key Takeaways

  1. Traditional static benchmarks can't keep pace with rapidly evolving AI systems. OpenClaw demonstrates harnesses that self-adapt as application capabilities change, which means benchmarks must evolve alongside them rather than remain fixed.
  2. AI evaluation still relies on pre-deployment static testing (unit tests, manual regression) while missing chaos engineering practices that stress-test systems in production, leaving organizations vulnerable to unforeseen failures.
  3. The industry is over-fixated on publishing new benchmarks (e.g., 'I created a benchmark for adding numbers') without solving the core problem: how benchmarks actually help builders ship safer systems faster.
  4. Adaptive testing frameworks that choose what to test based on observed system behavior offer a mindset shift from static validation toward continuous, intelligent evaluation, treating AI like malleable software rather than a fixed artifact.
  5. As software becomes malleable and ships at scale (like OpenClaw), evaluation strategies must mirror this velocity; observability and real-time feedback loops matter more than perfect pre-deployment test coverage.
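The selective testing described in takeaway 4 can be sketched as a small prioritisation routine. This is an illustrative assumption, not the framework discussed in the talk: `EvalCase`, `select_cases`, `record`, and the capability tags are all hypothetical, and the ranking rule (changed capabilities first, then highest observed failure rate) is just one plausible policy.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical eval case, tagged with the capability it exercises."""
    name: str
    capability: str
    runs: int = 0
    failures: int = 0

    @property
    def failure_rate(self) -> float:
        # Cases that have never run are treated as maximally uncertain,
        # so they are scheduled ahead of cases with a known track record.
        return 1.0 if self.runs == 0 else self.failures / self.runs


def select_cases(cases, changed_capabilities, budget):
    """Pick up to `budget` cases: changed capabilities first, then flakiest."""
    def priority(case):
        touched = case.capability in changed_capabilities
        return (touched, case.failure_rate)

    return sorted(cases, key=priority, reverse=True)[:budget]


def record(case, passed):
    """Feed a result back so the next selection reflects observed behavior."""
    case.runs += 1
    if not passed:
        case.failures += 1


if __name__ == "__main__":
    suite = [
        EvalCase("sum_two_numbers", "arithmetic", runs=10, failures=0),
        EvalCase("book_flight", "tool_use", runs=10, failures=1),
        EvalCase("summarise_thread", "summarisation", runs=10, failures=4),
        EvalCase("new_browser_skill", "browsing"),
    ]
    for case in select_cases(suite, {"tool_use"}, budget=3):
        print(case.name)
```

With a budget of three and a change to the hypothetical `tool_use` capability, the always-passing arithmetic case is the one that gets skipped. The suite stays the same size on disk, but what actually runs shifts with the system's recent behavior.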

Transcript Excerpt

Cool. Hey everyone. Thanks for joining the session. Sorry if my sound's a little croaky. I've done three talks back-to-back: one on Wednesday, one keynote yesterday, and then a workshop-style session today. So, I'm Vincent. I'm going to be talking about malleable evals, from static AI measuring to adaptive systems. Now, let's jump into who I am, what I do. I call myself the friendly can car. I use AI, use technology. I'm always on the edge. For those of you that haven't seen my keynote, yeah, I just live on the edge and just do some fun stuff. So, this is me using VR goggles back in 2013, when people hadn't even heard of VR. It came with a warning label that said only use it for 5 minutes. I used it for 3 hours, then I vomited for 3 hours after …