Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
By ai.engineer
Categories: AI, Tools
Summary
Google DeepMind's Gemma 4 achieves state-of-the-art performance with models 20x smaller than competitors by combining interleaved local/global attention layers with grouped query attention. The models are released under the Apache 2.0 license, permitting unrestricted commercial deployment.
Key Takeaways
- 31B dense model ranks #3 on LM Arena while outperforming models 20x its size through architectural innovations, demonstrating an efficiency breakthrough for open-source development.
- 5:1 ratio of local-to-global attention layers, with 1,024-token sliding windows in the larger models, dramatically reduces memory overhead, while the interleaved global layers preserve long-range information flow across the stack.
- Grouped query attention scales to 8 query heads per key-value head; in global layers the key-value length is doubled (512 vs. 256) to recover performance lost to the parameter reduction.
- 26B mixture-of-experts model activates only 3.9 billion parameters per forward pass, routing each token to a small subset of its 128 experts, enabling efficient edge deployment on phones and laptops without dense-model overhead.
- Apache 2.0 licensing removes commercial restrictions, allowing developers to integrate Gemma across full development lifecycle from testing through production deployment without legal friction.
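The interleaved local/global scheme above can be sketched in a few lines. This is a minimal illustration, not Gemma's actual code: the layer-schedule convention (every sixth layer global) and function names are assumptions; the 5:1 ratio and 1,024-token window come from the talk.

```python
WINDOW = 1024          # sliding-window size for local attention layers
LOCAL_PER_GLOBAL = 5   # 5:1 local-to-global layer ratio

def layer_kinds(n_layers):
    """Every (LOCAL_PER_GLOBAL + 1)-th layer is global; the rest are local."""
    return [
        "global" if (i + 1) % (LOCAL_PER_GLOBAL + 1) == 0 else "local"
        for i in range(n_layers)
    ]

def visible_positions(kind, query_pos):
    """Causal attention span for one query position in a given layer kind."""
    if kind == "global":
        lo = 0                               # global layers see the full prefix
    else:
        lo = max(0, query_pos - WINDOW + 1)  # local layers see only a 1,024-token window
    return range(lo, query_pos + 1)
```

The memory saving follows directly: a local layer's KV cache is capped at `WINDOW` entries per head regardless of sequence length, and only one layer in six pays the full-context cost.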
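Grouped query attention's bookkeeping can be shown with a toy head mapping. A hedged sketch: the head counts (16 query heads, 2 key-value heads) are illustrative assumptions chosen to give the 8-queries-per-KV-head grouping from the talk.

```python
N_Q_HEADS = 16   # illustrative query-head count
N_KV_HEADS = 2   # illustrative key-value-head count
GROUP = N_Q_HEADS // N_KV_HEADS  # 8 query heads share each key-value head

def kv_head_for(query_head):
    """Which key-value head a given query head attends with."""
    return query_head // GROUP

# The KV cache stores only N_KV_HEADS sets of keys/values instead of
# N_Q_HEADS, shrinking cache size by the grouping factor.
cache_reduction = N_Q_HEADS // N_KV_HEADS  # 8x fewer cached KV tensors
```

The trade-off named in the takeaway is visible here: sharing KV heads cuts parameters and cache memory, so the key-value length is widened elsewhere to claw back quality.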
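The sparse-activation idea behind the mixture-of-experts model can be sketched as top-k routing. The 128-expert count matches the talk; `top_k=2`, the scoring, and the function names are illustrative assumptions, not the model's actual router.

```python
N_EXPERTS = 128  # expert count from the talk

def route(scores, top_k=2):
    """Pick the top_k experts by router score; only those experts run
    for this token, so most parameters stay inactive each forward pass."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[e] for e in chosen)
    # Normalized mixing weights for combining the chosen experts' outputs.
    return {e: scores[e] / total for e in chosen}

# Example: a router score vector where two experts dominate.
scores = [0.0] * N_EXPERTS
scores[3], scores[42] = 2.0, 1.0
```

Because only the routed experts execute, the active parameter count per token is a small fraction of the total, which is what makes edge deployment on phones and laptops feasible.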
Topics
- Grouped Query Attention
- Mixture of Experts Architecture
- On-Device AI Models
- Local-Global Attention Interleaving
- Open Source LLM Efficiency
Transcript Excerpt
Hi everyone, my name is Cassidy and I'm a researcher at Google DeepMind. Today I'm really excited to share with you some of the technical improvements and architecture that we have with Gemma 4. Last week we launched Gemma 4 which is the latest addition to our family of open source models. Gemma 4 brought incredible improvements at a scale that has not been seen before. We have a family of very small models with incredible performance setting a new precedent for what's possible with small open s...