Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
By ai.engineer
Categories: AI, Tools
Summary
Google DeepMind's Gemma 4 achieves state-of-the-art performance with models 20x smaller than competitors by combining interleaved local/global attention layers with grouped query attention. The models are released under the Apache 2.0 license, permitting unrestricted commercial deployment.
Key Takeaways
- 31B dense model ranks #3 on LM Arena while outperforming models 20x its size through architectural innovations, demonstrating an efficiency breakthrough for open-source development.
- 5:1 ratio of local-to-global attention layers, with 1,024-token sliding windows in the larger models, dramatically reduces memory overhead, while the interleaved global layers preserve long-range information flow across the stack.
- Grouped query attention scales to 8 query heads per key-value head; in global layers the key-value length is doubled (512 vs. 256) to recover performance lost to the parameter reduction.
- 26B mixture-of-experts model activates only 3.9 billion parameters per forward pass, routing each token to a small subset of its 128 experts, enabling efficient edge deployment on phones and laptops without dense-model overhead.
- Apache 2.0 licensing removes commercial restrictions, allowing developers to integrate Gemma across full development lifecycle from testing through production deployment without legal friction.
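The interleaved local/global scheme above can be sketched in a few lines. This is a minimal illustration, not Gemma's actual code: the layer-schedule convention (every sixth layer global) and function names are assumptions; the 5:1 ratio and 1,024-token window come from the talk.

```python
WINDOW = 1024          # sliding-window size for local attention layers
LOCAL_PER_GLOBAL = 5   # 5:1 local-to-global layer ratio

def layer_kinds(n_layers):
    """Every (LOCAL_PER_GLOBAL + 1)-th layer is global; the rest are local."""
    return [
        "global" if (i + 1) % (LOCAL_PER_GLOBAL + 1) == 0 else "local"
        for i in range(n_layers)
    ]

def visible_positions(kind, query_pos):
    """Causal attention span for one query position in a given layer kind."""
    if kind == "global":
        lo = 0                               # global layers see the full prefix
    else:
        lo = max(0, query_pos - WINDOW + 1)  # local layers see only a 1,024-token window
    return range(lo, query_pos + 1)
```

The memory saving follows directly: a local layer's KV cache is capped at `WINDOW` entries per head regardless of sequence length, and only one layer in six pays the full-context cost.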
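Grouped query attention's bookkeeping can be shown with a toy head mapping. A hedged sketch: the head counts (16 query heads, 2 key-value heads) are illustrative assumptions chosen to give the 8-queries-per-KV-head grouping from the talk.

```python
N_Q_HEADS = 16   # illustrative query-head count
N_KV_HEADS = 2   # illustrative key-value-head count
GROUP = N_Q_HEADS // N_KV_HEADS  # 8 query heads share each key-value head

def kv_head_for(query_head):
    """Which key-value head a given query head attends with."""
    return query_head // GROUP

# The KV cache stores only N_KV_HEADS sets of keys/values instead of
# N_Q_HEADS, shrinking cache size by the grouping factor.
cache_reduction = N_Q_HEADS // N_KV_HEADS  # 8x fewer cached KV tensors
```

The trade-off named in the takeaway is visible here: sharing KV heads cuts parameters and cache memory, so the key-value length is widened elsewhere to claw back quality.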
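The sparse-activation idea behind the mixture-of-experts model can be sketched as top-k routing. The 128-expert count matches the talk; `top_k=2`, the scoring, and the function names are illustrative assumptions, not the model's actual router.

```python
N_EXPERTS = 128  # expert count from the talk

def route(scores, top_k=2):
    """Pick the top_k experts by router score; only those experts run
    for this token, so most parameters stay inactive each forward pass."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[e] for e in chosen)
    # Normalized mixing weights for combining the chosen experts' outputs.
    return {e: scores[e] / total for e in chosen}

# Example: a router score vector where two experts dominate.
scores = [0.0] * N_EXPERTS
scores[3], scores[42] = 2.0, 1.0
```

Because only the routed experts execute, the active parameter count per token is a small fraction of the total, which is what makes edge deployment on phones and laptops feasible.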
Topics
- Grouped Query Attention
- Mixture of Experts Architecture
- On-Device AI Models
- Local-Global Attention Interleaving
- Open Source LLM Efficiency
Transcript Excerpt
Hi everyone, my name is Cassidy and I'm a researcher at Google DeepMind. Today I'm really excited to share with you some of the technical improvements and architecture that we have with Gemma 4. Last week we launched Gemma 4 which is the latest addition to our family of open source models. Gemma 4 brought incredible improvements at a scale that has not been seen before. We have a family of very small models with incredible performance setting a new precedent for what's possible with small open s...