
May 7, 2026

OpenAI has launched three new voice models through its API that promise to make voice interactions with AI feel more natural and intelligent. The new models can reason through complex requests, translate conversations in real time, and transcribe speech as people speak.
The release marks a significant step forward in voice AI technology, moving beyond simple back-and-forth exchanges toward systems that can actually understand context and take meaningful action during conversations. This development comes as voice interfaces become increasingly important for everything from customer service to travel planning, where users expect AI to handle complex, multi-step tasks through natural speech.
The three new models each serve different aspects of voice interaction. GPT-Realtime-2 is the flagship model that brings GPT-5-level reasoning capabilities to voice conversations, allowing it to handle difficult requests while maintaining natural conversation flow. GPT-Realtime-Translate enables live translation between more than 70 input languages and 13 output languages, keeping pace with speakers in real-time. GPT-Realtime-Whisper provides streaming speech-to-text transcription that works as people talk, rather than after they finish speaking.
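The announcement does not spell out how developers select between the three models, but OpenAI's existing Realtime API connects over WebSocket with the model passed as a query parameter. A minimal sketch under that assumption (the lowercase model identifiers are guesses derived from the announced names):

```python
# Hypothetical sketch: selecting one of the three announced models via
# the Realtime API's WebSocket endpooint pattern. The endpoint shape
# follows OpenAI's existing Realtime API; the exact model identifiers
# are assumptions based on the names in the announcement.

REALTIME_URL = "wss://api.openai.com/v1/realtime"

MODELS = {
    "conversation": "gpt-realtime-2",         # flagship reasoning model
    "translation": "gpt-realtime-translate",  # live speech translation
    "transcription": "gpt-realtime-whisper",  # streaming speech-to-text
}

def realtime_url(task: str) -> str:
    """Build the WebSocket URL for a given task (naming is illustrative)."""
    return f"{REALTIME_URL}?model={MODELS[task]}"

print(realtime_url("translation"))
# → wss://api.openai.com/v1/realtime?model=gpt-realtime-translate
```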
These capabilities address a growing need in the software industry. Voice has become one of the most natural ways people interact with technology, especially when typing isn’t practical – while driving, walking through an airport, or seeking help in another language. However, building useful voice products requires more than fast responses or natural-sounding speech.
Companies are already testing these models in real-world scenarios. Zillow is building an assistant that can listen to complex housing requests like “find me homes within my budget, avoid busy streets, and schedule a tour for Saturday,” then reason through the requirements and take action. Priceline is working toward voice-managed travel experiences where customers can handle entire trips through conversation, from booking to managing changes and getting real-time updates.
The improvements in GPT-Realtime-2 are substantial. The model can now:
- Use short phrases like “let me check that” to keep users informed while processing requests
- Call multiple tools simultaneously while narrating its actions
- Recover gracefully from errors with natural explanations
- Handle context windows of 128K tokens, up from 32K, for longer conversations
- Retain specialized terminology and proper nouns more reliably
- Adjust its tone based on the situation – calm during problem-solving, empathetic when users are frustrated
- Scale reasoning effort from minimal to extra-high depending on request complexity
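Behaviors like these are typically configured per session. OpenAI's Realtime API does use a `session.update` event for this, but the specific fields below – in particular `reasoning_effort` and the filler-phrase instruction – are illustrative assumptions, not documented parameter names:

```python
# Hypothetical sketch of a session.update event tuning the behaviors
# listed above. "session.update" is a real Realtime API event type;
# the reasoning_effort field and effort level names are assumptions
# drawn from the announcement, not confirmed schema.
import json

EFFORT_LEVELS = ("minimal", "low", "medium", "high", "extra-high")

def make_session_update(effort: str = "minimal") -> str:
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    event = {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",   # name from the announcement
            "reasoning_effort": effort,  # assumed field name
            "instructions": (
                "Acknowledge long-running work with short phrases like "
                "'let me check that', and narrate tool calls as you make them."
            ),
        },
    }
    return json.dumps(event)
```

Scaling `effort` up for complex requests and down for simple ones is the trade the article describes: more deliberation when the task demands it, lower latency when it doesn't.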
The performance gains are measurable. GPT-Realtime-2 scores 15.2% higher than its predecessor on Big Bench Audio tests for audio intelligence and 13.8% higher on Audio MultiChallenge tests for following instructions in conversations.
Live translation capabilities fill a crucial gap in global communication. Deutsche Telekom is testing the translation model for multilingual customer support, where lower latency and better fluency can make cross-language conversations feel natural rather than stilted. The model needs to preserve meaning while keeping pace with speakers, even when people use regional pronunciation or industry-specific language.
The streaming transcription model addresses the lag that makes current voice interfaces feel clunky. Instead of waiting for someone to finish speaking before transcribing, GPT-Realtime-Whisper works continuously, enabling faster captions for meetings and broadcasts, real-time note-taking, and more responsive voice agents.
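A client consuming such a stream typically folds incremental delta events into a running transcript rather than waiting for a final result. A minimal sketch, assuming events shaped like the Realtime API's transcription deltas (the event type names here are assumptions):

```python
# Minimal sketch of folding streaming transcription events into
# finalized segments. The "delta"/"completed" event pattern mirrors
# OpenAI's Realtime API transcription events, but the exact type
# names used below are illustrative assumptions.

def fold_transcript(events):
    """Accumulate delta events; a 'completed' event finalizes a segment."""
    segments, current = [], []
    for event in events:
        if event["type"] == "transcript.delta":        # assumed name
            current.append(event["text"])
        elif event["type"] == "transcript.completed":  # assumed name
            segments.append("".join(current))
            current = []
    return segments

events = [
    {"type": "transcript.delta", "text": "find me "},
    {"type": "transcript.delta", "text": "homes within my budget"},
    {"type": "transcript.completed"},
]
print(fold_transcript(events))  # → ['find me homes within my budget']
```

Because partial text arrives continuously, captions and notes can render each delta as it lands, then replace the provisional text with the finalized segment.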
Safety remains a priority with multiple layers of protection. OpenAI uses active classifiers to monitor conversations in real-time, stopping sessions that violate content guidelines. The company also requires developers to clearly indicate when users are interacting with AI, unless it’s obvious from context.
The models are available now through OpenAI’s Realtime API. GPT-Realtime-2 costs $32 per million audio input tokens and $64 per million output tokens. GPT-Realtime-Translate is priced at $0.034 per minute, while GPT-Realtime-Whisper costs $0.017 per minute. All three models support EU data residency requirements for European applications.
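For planning purposes, the published prices translate into simple per-call estimates. The token counts and call lengths below are illustrative, not from the announcement:

```python
# Back-of-envelope cost estimates using the prices quoted in the article.

PRICE = {
    "gpt-realtime-2": {"audio_in": 32 / 1e6, "audio_out": 64 / 1e6},  # $/token
    "gpt-realtime-translate": 0.034,  # $/minute
    "gpt-realtime-whisper": 0.017,    # $/minute
}

def realtime2_cost(in_tokens: int, out_tokens: int) -> float:
    """Cost of a GPT-Realtime-2 call from audio token counts."""
    p = PRICE["gpt-realtime-2"]
    return in_tokens * p["audio_in"] + out_tokens * p["audio_out"]

def per_minute_cost(model: str, minutes: float) -> float:
    """Cost of a translation or transcription session by duration."""
    return PRICE[model] * minutes

# e.g. 50K audio input tokens + 20K output tokens:
print(round(realtime2_cost(50_000, 20_000), 2))                  # → 2.88
# a 30-minute translated call:
print(round(per_minute_cost("gpt-realtime-translate", 30), 2))   # → 1.02
```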
This release puts OpenAI in direct competition with other tech giants working on voice AI, including Google’s conversation AI and Amazon’s Alexa improvements. The ability to reason through complex requests while maintaining natural conversation flow could be a significant differentiator, especially for business applications where current voice assistants often fall short of user expectations.