OpenAI launched three new voice models through its Realtime API on May 7, 2026, and this is the kind of release that actually changes what voice AI can do in production. The flagship model, GPT-Realtime-2, brings GPT-5-class reasoning into live conversational voice for the first time, while two companion models handle real-time translation and transcription.
What Launched
OpenAI released three models simultaneously through its Realtime API developer platform:
GPT-Realtime-2 is the main event. Unlike its predecessor GPT-Realtime-1.5, this model routes reasoning through the audio loop itself rather than switching between transcription and synthesis steps. That matters because the previous architecture added latency and broke the conversational flow when the model needed to think. GPT-Realtime-2 also expands the context window from 32K to 128K tokens, making longer sessions and multi-step agentic workflows viable without external state management.
The model supports five reasoning effort levels (minimal, low, medium, high, and xhigh), letting developers tune the tradeoff between response speed and reasoning depth. It can also call multiple tools in parallel while making those actions audible with natural phrases like “checking your calendar” or “looking that up now,” which keeps interactions from going silent mid-task.
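As a rough illustration, the effort tradeoff could be set per session. This is a hypothetical sketch: the payload shape and the "reasoning_effort" field name are assumptions based on this article, not confirmed Realtime API parameters.

```python
# Hypothetical session configuration for tuning reasoning effort.
# The "session.update" event type mirrors the Realtime API's event style;
# the "reasoning_effort" field and "gpt-realtime-2" model id are assumptions.

EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_config(effort: str) -> dict:
    """Build a session-update payload with the chosen reasoning effort."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",   # assumed model identifier
            "reasoning_effort": effort,  # hypothetical field name
        },
    }
```

In practice, a developer might run quick FAQ-style turns at "low" and reserve "xhigh" for multi-step agentic requests, since higher effort trades response speed for reasoning depth.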
On OpenAI’s benchmarks, GPT-Realtime-2 scores 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio at high effort, and 13.8% higher on Audio MultiChallenge (an instruction-following benchmark) at xhigh effort.
GPT-Realtime-Translate handles live translation across more than 70 input languages into 13 output languages. The model is designed to keep pace with natural speech rather than lagging behind, which has been the limiting factor for voice-based translation in customer support and global commerce settings.
GPT-Realtime-Whisper is a streaming speech-to-text model built for low-latency transcription. It handles specialized terminology, proper nouns, and domain-specific vocabulary better than previous versions, which matters for industries like healthcare, legal, and financial services where generic transcription accuracy falls short.
The Realtime API also exits beta with this release, moving to general availability and adding support for remote MCP servers, image inputs, and phone calling via Session Initiation Protocol (SIP).
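For orientation, Realtime sessions are opened over a WebSocket rather than a request/response HTTP call. The sketch below follows the Realtime API's established endpoint pattern; the "gpt-realtime-2" model name is taken from this article and should be treated as an assumption.

```python
# Minimal sketch of the connection parameters for a Realtime session.
# Only builds the URL and auth headers; actually opening the socket
# would require a WebSocket client library.
from urllib.parse import urlencode

REALTIME_URL = "wss://api.openai.com/v1/realtime"

def connection_params(model: str, api_key: str) -> tuple[str, dict]:
    """Return the WebSocket URL and headers needed to open a session."""
    url = f"{REALTIME_URL}?{urlencode({'model': model})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers
```

SIP support sits alongside this: instead of a custom WebSocket client, calls can arrive over standard telephony infrastructure.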
Pricing
GPT-Realtime-2 is priced at $32 per million audio-input tokens, $0.40 per million for cached input tokens, and $64 per million audio-output tokens.
GPT-Realtime-Translate runs at $0.034 per minute. GPT-Realtime-Whisper is $0.017 per minute.
The cached token pricing is significant: cached input runs 80x cheaper than fresh audio input ($0.40 vs. $32 per million tokens). For enterprise voice agents handling repeat interactions, caching shared context cuts input costs dramatically in high-volume production deployments.
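A back-of-the-envelope cost model makes the cache effect concrete. The rates below come from the pricing above; the per-call token counts are illustrative assumptions, not OpenAI figures.

```python
# Per-call cost model for GPT-Realtime-2 using the published rates.
INPUT_PER_M = 32.00    # $ per 1M audio-input tokens
CACHED_PER_M = 0.40    # $ per 1M cached input tokens
OUTPUT_PER_M = 64.00   # $ per 1M audio-output tokens

def call_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one call given its token counts."""
    return (fresh_in * INPUT_PER_M
            + cached_in * CACHED_PER_M
            + out * OUTPUT_PER_M) / 1_000_000

# Illustrative call: 8K tokens of shared system context plus 2K tokens
# of fresh user audio, producing 3K output tokens.
uncached = call_cost(10_000, 0, 3_000)      # context re-sent fresh every call
cached = call_cost(2_000, 8_000, 3_000)     # context served from cache
```

In this hypothetical scenario the cached call costs roughly half the uncached one, and the gap widens as the shared context grows relative to the fresh audio.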
What This Means for Business
Voice AI has been stuck in demo-land for most businesses because the technology worked in controlled conditions but fell apart in real conversations. The core problem was that voice models with genuine reasoning capability were too slow for live use, while fast models were too shallow to handle anything beyond simple queries.
GPT-Realtime-2 closes that gap by running GPT-5-class reasoning in the audio loop. The practical implication is that voice agents can now handle the kind of complex, multi-step requests that previously required a human or a handoff to a text-based interface.
A few specific shifts this enables:
Customer service that actually resolves issues. Current voice AI excels at routing and FAQ responses. With deeper reasoning and parallel tool calling, voice agents can now pull context from multiple systems, reason across them, and complete a resolution during a single call rather than transferring the customer.
Multilingual operations without interpreter costs. For businesses serving customers across language barriers, real-time translation at conversational pace changes the economics of international support entirely.
Longer, context-aware sessions. The 128K context window means a voice agent can maintain coherent context across an extended interaction (a product configuration call, a complex support session, or a guided onboarding flow) without losing the thread.
Phone-based automation. SIP support means voice agents can now receive and make calls through standard telephony infrastructure without custom integration work, which removes one of the biggest friction points for enterprise deployments.
The Realtime API moving to general availability signals that OpenAI considers this production-ready. That shifts the conversation for any business exploring voice AI from “is it mature enough” to “what would this do for our customer interactions.”
For businesses that handle high call volumes, serve multilingual customers, or want to extend AI capabilities beyond chat interfaces, this release is worth examining seriously.
Enterprise DNA builds AI voice employees through Omni Voice, helping businesses deploy production-ready voice AI without the infrastructure overhead.
Source
TechCrunch