Inference Chips for Agent Workflows
Categories: VC, Startup, Design
Summary
Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit only 30–40% of peak utilization on these bursty workloads.
Transcript Excerpt
Most AI chips are designed for a world where inference means prompt in, response out. Agents don't work that way. They loop, calling tools, branching, backtracking, holding context across dozens of steps. That's a completely different hardware problem. Current GPUs hit 30 to 40% of peak utilization on these workloads because the work is bursty, bouncing between memory-bound model calls, IO-bound tool use, and CPU-bound orchestration. That gap is where purpose-built silicon wins. Nvidia bought Groq for 20 billion because it saw this coming. Google built TPU v7 for inference specifically, but nobody's designing for the agent loop itself: fast context switching between models, native speculative decoding, memory built for KV caches that persist across an entire execution…
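To make the utilization argument concrete, here is a minimal Python sketch of the loop the speaker describes. The names model_call, run_tool, and plan_next_step are hypothetical stand-ins, not any real framework's API, and the kv_cache dict is a toy proxy for the KV-cache memory the excerpt mentions; the comments mark which phase keeps the accelerator busy and which leaves it idle.

```python
import time

def model_call(prompt: str, kv_cache: dict) -> str:
    """Memory-bound phase: the GPU streams weights and KV-cache entries."""
    kv_cache[len(kv_cache)] = prompt          # context accumulates across steps
    return f"thought about: {prompt}"

def run_tool(action: str) -> str:
    """IO-bound phase: the GPU sits idle while a tool call round-trips."""
    time.sleep(0.05)                          # stand-in for network/disk latency
    return f"result of {action}"

def plan_next_step(observation: str, step: int) -> str | None:
    """CPU-bound phase: orchestration decides whether to loop, branch, or stop."""
    return None if step >= 3 else f"refine using {observation}"

def agent_loop(task: str, max_steps: int = 10) -> str:
    kv_cache: dict = {}    # persists across the WHOLE loop, not one request
    prompt = task
    observation = ""
    for step in range(max_steps):
        thought = model_call(prompt, kv_cache)        # GPU busy, memory-bound
        observation = run_tool(thought)               # GPU idle, IO-bound
        prompt = plan_next_step(observation, step)    # GPU idle, CPU-bound
        if prompt is None:
            break
    return observation

if __name__ == "__main__":
    print(agent_loop("summarize the quarterly report"))
```

On hardware designed for request-at-a-time inference, the GPU does useful work only inside model_call; the other two phases leave it stalled, and the kv_cache that should stay resident for the whole execution gets treated as per-request state. That is the gap the excerpt argues purpose-built silicon could close.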