Inference Chips for Agent Workflows
Categories: VC, Startup, Design
Summary
Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit only 30–40% of peak utilization on these bursty workloads.
Transcript Excerpt
Most AI chips are designed for a world where inference means prompt in, response out. Agents don't work that way. They loop, calling tools, branching, backtracking, holding context across dozens of steps. That's a completely different hardware problem. Current GPUs hit 30 to 40% of peak utilization on these workloads because the work is bursty, bouncing between memory-bound model calls, IO-bound tool use, and CPU-bound orchestration. That gap is where purpose-built silicon wins. Nvidia bought Groq for 20 billion because it saw this coming. Google built TPU v7 for inference specifically, but nobody's designing for the agent loop itself: fast context switching between models, native speculative decoding, memory built for KV caches that persist across an entire execution…
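To make the utilization argument concrete, here is a minimal Python sketch of the loop the speaker describes. The names model_call, run_tool, and plan_next_step are hypothetical stand-ins, not any real framework's API, and the kv_cache dict is a toy proxy for the KV-cache memory the excerpt mentions; the comments mark which phase keeps the accelerator busy and which leaves it idle.

```python
import time

def model_call(prompt: str, kv_cache: dict) -> str:
    """Memory-bound phase: the GPU streams weights and KV-cache entries."""
    kv_cache[len(kv_cache)] = prompt          # context accumulates across steps
    return f"thought about: {prompt}"

def run_tool(action: str) -> str:
    """IO-bound phase: the GPU sits idle while a tool call round-trips."""
    time.sleep(0.05)                          # stand-in for network/disk latency
    return f"result of {action}"

def plan_next_step(observation: str, step: int) -> str | None:
    """CPU-bound phase: orchestration decides whether to loop, branch, or stop."""
    return None if step >= 3 else f"refine using {observation}"

def agent_loop(task: str, max_steps: int = 10) -> str:
    kv_cache: dict = {}    # persists across the WHOLE loop, not one request
    prompt = task
    observation = ""
    for step in range(max_steps):
        thought = model_call(prompt, kv_cache)        # GPU busy, memory-bound
        observation = run_tool(thought)               # GPU idle, IO-bound
        prompt = plan_next_step(observation, step)    # GPU idle, CPU-bound
        if prompt is None:
            break
    return observation

if __name__ == "__main__":
    print(agent_loop("summarize the quarterly report"))
```

On hardware designed for request-at-a-time inference, the GPU does useful work only inside model_call; the other two phases leave it stalled, and the kv_cache that should stay resident for the whole execution gets treated as per-request state. That is the gap the excerpt argues purpose-built silicon could close.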