Text Diffusion — Brendon Dillon, Google DeepMind
Summary
Text diffusion generates an entire token sequence in parallel over multiple refinement iterations rather than one token at a time. This gives significantly faster inference on GPUs/TPUs and allows bidirectional attention, which enables self-correcting generation. In a research preview a year ago, Google DeepMind's Gemini Diffusion matched Gemini 2.0 Flash quality with better latencies.
Key Takeaways
- Diffusion-based text generation initializes a long sequence (hundreds to thousands of tokens) as random noise, then refines all positions in parallel over multiple denoising steps, whereas autoregressive models generate sequentially, one token at a time.
- Text diffusion uses bidirectional attention, so each position can attend to later tokens as well as earlier ones. This lets the model detect errors in its reasoning and backtrack to fix them, something causal autoregressive architectures cannot do.
- Diffusion models can implement adaptive computation: they can be trained to spend more denoising steps on harder problems and fewer on easier ones, optimizing compute allocation per query.
- Main tradeoff: text diffusion achieves better per-token latency and hardware utilization, but its throughput on large batches is lower than that of autoregressive models, which limits cost efficiency for high-volume serving.
- Text diffusion enables in-place editing (holding specific tokens fixed and generating the corresponding prefixes) and parallel generation of multiple output variations.
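The takeaways above can be made concrete with a toy sketch. This is not Gemini Diffusion's method: the summary does not specify the noise process or the sampler, so the code below assumes a simple confidence-based scheme in which every position is proposed in parallel and the most confident open positions are committed at each step. The names `toy_denoiser`, `diffusion_sample`, `VOCAB`, and `TARGET` are hypothetical, and the "denoiser" is a hard-coded stand-in for a trained network.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "a", "dog", "ran"]
# The toy denoiser simply "knows" this sentence; a real model would predict it.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]


def toy_denoiser(tokens, committed):
    """Hypothetical stand-in for a trained denoising network.

    Returns one (proposed_token, confidence) pair per position in a single
    parallel pass. Confidence rises when the left or right neighbour is
    already committed, mimicking bidirectional context.
    """
    n = len(tokens)
    proposals = []
    for i in range(n):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        support = sum(1 for j in neighbours if j in committed)
        proposals.append((TARGET[i], (1 + support) / 3))
    return proposals


def diffusion_sample(length, steps, fixed=None, seed=0):
    """Refine a noisy sequence over `steps` parallel passes (steps >= 1).

    `fixed` maps position -> token for positions that are clamped and never
    overwritten, a simplified picture of in-place editing.
    """
    rng = random.Random(seed)
    fixed = dict(fixed or {})

    # Start from pure noise: a random token at every position.
    tokens = [rng.choice(VOCAB) for _ in range(length)]
    for i, tok in fixed.items():
        tokens[i] = tok
    committed = set(fixed)

    # Commit roughly the same number of positions per step (ceiling division).
    per_step = max(1, -(-(length - len(committed)) // steps))
    history = [list(tokens)]

    for _ in range(steps):
        open_positions = [i for i in range(length) if i not in committed]
        if not open_positions:
            break
        proposals = toy_denoiser(tokens, committed)  # one pass, all positions
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:per_step]:
            tokens[i] = proposals[i][0]
            committed.add(i)
        history.append(list(tokens))

    return tokens, history


if __name__ == "__main__":
    final, history = diffusion_sample(length=6, steps=3)
    for step, snapshot in enumerate(history):
        print(step, " ".join(snapshot))
```

In this toy setting, six tokens are produced with three denoiser passes, where a one-token-at-a-time decoder would need six. Changing `steps` trades passes for tokens committed per pass, which is the knob an adaptive-computation scheme would tune per query. Passing `fixed={5: "dog"}` clamps the last position while the rest of the sequence is generated around it.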
Transcript Excerpt
Some people are still filtering into the room. But it's mostly intro stuff for the first couple of slides so they won't miss anything. Okay, welcome everybody. My name is Brendan. I'm a research scientist at DeepMind. I'm talking today about text diffusion, which is kind of a more forward-looking research area at DeepMind. So you're probably familiar with image and video diffusion, which is kind of state of the art for these modalities right now, where, you know, you take ground truth, say an image, you add noise to it in training, and then you train a neural network to remove that noise gradually, and then at inference time you just initialize the picture with pure noise, and then you iteratively refine out the noise to recover back to, you know, whatever image or video or audio or…