We’re introducing three audio models in the API

Categories: AI, Product

Summary

OpenAI released three real-time audio models in its API, enabling live translation across 70 languages and voice agents with reasoning capabilities. The standout is natural conversation flow: models wait for key words before translating and handle interruptions gracefully, which makes voice viable as the primary interface for apps. The audio in the demo is the model's live output, with no editing.

Key Takeaways

  1. GPT Realtime Translate handles 70 languages with sentence-level understanding, waiting for verbs before translating to enable natural dialogue instead of word-by-word output. Critical for global media, customer support, and education platforms.
  2. Voice agents now support parallel tool calling and can communicate their reasoning in real time. Use preambles to have the model explain its reasoning and acknowledge actions that take several seconds, which keeps users informed during background processing.
  3. Models maintain conversation context passively without interrupting. They listen during human exchanges and only respond when signaled, creating seamless multi-turn interactions versus forced turn-taking in traditional voice interfaces.
  4. Voice agents can connect directly to dashboards, external services, and connected devices to take live action: CRM updates, calendar reads, and system commands execute within the conversation flow without modal breaks.
  5. The demo's audio is the model's live output, captured directly with no editing, which shows latency low enough for voice to serve as the primary UI. Builders can now prioritize voice-first products instead of treating voice as a secondary interaction.
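
To make takeaways 2 and 4 concrete, here is a minimal sketch of the preamble-plus-parallel-tools pattern. It does not call the OpenAI API. The tool functions (`update_crm`, `read_calendar`), the `speak` callback, and the sample data are all hypothetical stand-ins, and the concurrency uses plain `asyncio` from the Python standard library. It only illustrates the ordering: acknowledge first, run the slow actions side by side, then report.

```python
import asyncio

# Shared log so the ordering of tool starts and finishes can be inspected.
event_log: list[str] = []

# Hypothetical tool: stands in for a CRM update (not a real API).
async def update_crm(contact: str, note: str) -> dict:
    event_log.append("start:update_crm")
    await asyncio.sleep(0.05)  # simulated network latency
    event_log.append("end:update_crm")
    return {"tool": "update_crm", "contact": contact, "saved": True}

# Hypothetical tool: stands in for a calendar read (not a real API).
async def read_calendar(day: str) -> dict:
    event_log.append("start:read_calendar")
    await asyncio.sleep(0.05)  # simulated network latency
    event_log.append("end:read_calendar")
    return {"tool": "read_calendar", "day": day, "events": ["10:00 standup"]}

async def handle_turn(speak) -> list[dict]:
    # Preamble: tell the user what is about to happen before the slow work starts.
    speak("One moment, I'm updating the CRM and checking your calendar.")
    # Parallel tool calling: both coroutines run concurrently, not back to back.
    results = await asyncio.gather(
        update_crm("Ada", "Asked about pricing"),
        read_calendar("Monday"),
    )
    # Follow-up: report the outcome once every tool has returned.
    count = len(results[1]["events"])
    speak(f"Done. The note is saved and you have {count} event(s) on Monday.")
    return list(results)

def run_demo() -> tuple[list[str], list[dict]]:
    spoken: list[str] = []
    event_log.clear()
    results = asyncio.run(handle_turn(spoken.append))
    return spoken, results

if __name__ == "__main__":
    spoken, results = run_demo()
    for line in spoken:
        print("agent:", line)
    print(event_log)
```

In a real voice agent, `speak` would stream audio rather than append to a list, and the tool list would come from the model's own tool-call output. The point of the sketch is that the user hears the acknowledgement before any waiting begins, and that total wait time is roughly that of the slowest tool, not the sum of all of them.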

Transcript Excerpt

Hey everyone, we're introducing new real-time audio models in the OpenAI API. In this demo, I'll show two of them: GPT Realtime Translate for live translations and GPT Realtime 2 for voice agents that can follow instructions and take actions. Let's start with translations cuz that one feels so magical. I speak French, but say I need to present to an audience around the world. The English you'll hear is the model's live audio output captured directly from this laptop with transcriptions. Now, as I start speaking in French, we'll lower the volume of my mic and increase the one from the model so you can have a real feel for it. No edit to the audio. Let's give it a try. What's really impressive is that the model can listen to me and translate while I'm speaking. It waits for the key word like …