Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, CEO, Trigger.dev

By ai.engineer

Categories: AI, Tools

Summary

Agents fundamentally break 30 years of stateless backend architecture—the LLM now orchestrates code instead of vice versa. Eric Allam presents two competing durability models (Replay vs. Snapshot) to solve the critical challenge of making agent loops production-ready with error recovery and state persistence.

Key Takeaways

The 'shared nothing architecture' (request + DB = response) dominated backends for 30 years across CGI, PHP/LAMP, Rails, Node.js, and serverless. Agents break this paradigm because the LLM orchestrates code execution, not HTTP requests.
Replay model wraps every side effect as a cached step in execution journals, enabling resume-from-failure without duplicate operations (e.g., charging credit card twice). Tradeoff: rigid code structure and versioning complexity when deploying new code.
The agent loop (LLM → tool calls → code execution → repeat) requires both LLM calls and tool calls to become durable steps. This is fundamentally different from workflow engines that only managed side effects outside request/DB cycles.
Two competing durability strategies exist: Replay (journals every step, resumes from checkpoints) vs. Snapshot (likely captures state at points in time). Choosing between them affects code flexibility, versioning, and production complexity.
Trigger.dev's core mission is lowering deployment friction for production agents—making long-running meaningful work durable across code versions with automatic error recovery on backend servers.

Related topics

Transcript Excerpt

[music] >> How's everyone doing? It's a full room. Look at this thing. >> [laughter] >> Um okay, let's get started. Um okay. So, here is our, you know, agent. You know, it's got the turn loop, it's got the LM loop. You know, this little example sort of works well enough running on your own machine, but what if we want to sort of deploy these two production backends and, you know, run them on our servers? So, what do we want them to do, right? When they run on our servers, we want them to do, you know, long-running meaningful work. Uh should be durable across turns and and versions of our code, and it should be able to, you know, recover from errors. So, I'm Eric, I'm one of the founders of trigger.dev, and we've been sort of trying to make it easy to deploy these types of agents to product…

More from ai.engineer