Introducing OpenAI gpt-realtime: Say Goodbye to Latency in AI Voice Conversations
OpenAI announces its latest voice model, gpt-realtime, and a major update to the Realtime API. Experience ultra-low latency, high-fidelity audio, and multimodal interaction with support for SIP calls, image input, and a 20% price reduction, opening a new chapter for developers and enterprises building next-generation voice assistants.
Have you ever been fed up with AI voice assistants that sound robotic and respond slowly? The lag and stiff intonation constantly remind you, “This isn’t a real person.” Honestly, that experience is a far cry from a smooth “conversation.”
However, that era may be officially coming to an end.
On August 28, 2025, OpenAI dropped a bombshell, officially launching its most advanced speech-to-speech model to date—gpt-realtime—and simultaneously making the Realtime API generally available. This isn’t just a routine update; it’s a complete revolution aimed at enabling developers and enterprises to build truly reliable, production-ready voice AI agents.
What does this mean? Simply put, we are one giant step closer to the natural, real-time, and emotionally rich AI interactions seen in the movie Her.
Not Just Dialogue, but “Conversation”: The Core Breakthrough of gpt-realtime
Previous voice AIs mostly followed a traditional process: Speech-to-Text, processing the text, and then Text-to-Speech. This chain was not only lengthy but also lost many of the subtle emotions and tones of speech during conversion.
gpt-realtime completely subverts this model.
It uses a single, end-to-end model that directly processes and generates audio. It’s like shifting from hearing a story retold by someone else to listening to the original storyteller. The benefits of this architecture are obvious:
- Extremely low latency: Conversations have almost no delay, with responses as quick as a real person’s.
- Preserves tonal details: It can capture and reproduce the tone, emotion, and rhythm of speech, making the voice sound more natural and expressive.
- New voices: This update also introduces two brand-new voices designed specifically for the Realtime API—Cedar and Marin—offering more diverse voice options.
It Truly “Understands”: A Leap in Intelligence and Comprehension
A good conversation partner not only needs to sound good but, more importantly, needs to understand. gpt-realtime demonstrates astonishing progress in intelligence and comprehension.
It can now:
- Capture non-verbal cues: For instance, the model understands laughter in a conversation as an expression of emotion, not just noise.
- Adapt its tone: Developers can give more nuanced instructions, such as asking the model to speak in a “lively and professional” or “gentle and empathetic” tone.
- Switch languages seamlessly: The model can handle different languages mixed within a single sentence smoothly.
- Accurately identify complex information: Its accuracy in recognizing alphanumeric sequences like phone numbers and Vehicle Identification Numbers (VINs) has significantly improved, with excellent performance in languages like Spanish, Chinese, Japanese, and French.
The data speaks for itself. In the Big Bench Audio benchmark, which measures reasoning ability, gpt-realtime achieved an accuracy of 82.8%, far surpassing the 65.6% of its predecessor. This proves it’s not just “parroting” but truly possesses stronger understanding and reasoning capabilities.
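As a concrete sketch of how tone steering looks in practice: the Realtime API is driven by JSON events over a WebSocket, and per-session style guidance lives in a `session.update` event’s `instructions` field. The event shape below follows OpenAI’s published Realtime API docs at the time of writing, but treat the exact field names as assumptions to verify against the current reference.

```python
import json

# Sketch: a session.update event asking the model to adopt a specific tone.
# The "instructions" field carries the style guidance; the surrounding event
# shape follows the Realtime API docs and may evolve.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Speak in a gentle, empathetic tone. Keep answers brief, "
            "and switch languages mid-sentence if the caller does."
        ),
        "voice": "marin",  # one of the two new voices; "cedar" is the other
    },
}

# In a real client this JSON would be sent over the Realtime WebSocket,
# e.g. await ws.send(payload) inside an async connection handler.
payload = json.dumps(session_update)
```

Because the instructions are plain text, swapping “gentle and empathetic” for “lively and professional” is a one-line change with no redeployment.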
Precise Instruction Following and Smarter Tool Calling
For developers, the biggest concern is whether the model actually does what it’s told. gpt-realtime has been significantly optimized for instruction following, reliably picking up and executing even fine-grained commands embedded in a prompt.
More importantly, the Function Calling feature has also become more powerful. A capable voice assistant must know when to call the right tool to solve a problem. gpt-realtime has made three major improvements in this area: calling relevant functions, calling them at the right time, and using the correct parameters, leading to a significant increase in overall accuracy.
Most exciting is the native support for asynchronous function calling. This solves a long-standing pain point: when the AI needs time to look up information, the conversation no longer has to fall into an awkward silence. Now, the model can continue to converse smoothly with the user while waiting for results, ensuring an uninterrupted interactive experience.
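To make the flow above concrete, here is a sketch of the two halves of a function call: declaring the tool in the session, and later handing the result back as a conversation item once the (possibly slow) lookup finishes. The event shapes follow the Realtime API’s function-calling documentation; the `look_up_order` function itself is a hypothetical example, and field names should be checked against the current reference.

```python
import json

# Sketch: declaring a tool so the model can decide when to call it.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "function",
                "name": "look_up_order",  # hypothetical tool for illustration
                "description": "Fetch the status of a customer order by ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

def function_result_event(call_id: str, result: dict) -> str:
    """Build the event that returns a finished function call's output to
    the model. With asynchronous calling, this can arrive long after the
    call was issued while the model keeps chatting in the meantime."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    })
```

The key design point is that the tool result is just another conversation item, which is what lets the model bridge the wait with small talk instead of dead air.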
Making Development Easier: Killer New Features of the Realtime API
After talking so much about the model’s strengths, what new tools can developers actually use? This Realtime API update brings several killer features.
Remote MCP Server Support
This makes extending the capabilities of a voice agent easier than ever. Developers simply point the session at a remote MCP server’s URL, and the API discovers and invokes that server’s tools automatically, eliminating tedious manual integration. Want to add a new capability? Just change the server address.
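A minimal sketch of what that configuration might look like, assuming the MCP tool entry mirrors the shape OpenAI uses in its Responses API; the server label and URL below are made-up placeholders, and the field names are assumptions to confirm against the Realtime API reference.

```python
# Sketch: pointing a Realtime session at a remote MCP server.
# "server_label" and "server_url" are placeholder values.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "support-tools",          # placeholder
                "server_url": "https://example.com/mcp",  # placeholder
                "require_approval": "never",
            }
        ],
    },
}
# Swapping in a different toolset is just a URL change; the API handles
# tool discovery and invocation on the model's behalf.
```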
Image Input: Letting the AI See What You See
This is a game-changing feature. Users can now add images, photos, or screenshots to voice or text conversations. This allows the AI’s conversation to be based on real visual information.
You can ask it:
- “What do you see?”
- “Help me read the text in this screenshot.”
The system treats the image as a photo in the conversation, not a live video stream, giving developers full control over what the model “sees” and when it responds.
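A sketch of how an app might share one of those snapshots: the image is attached to a `conversation.item.create` event alongside the user’s question. The `input_text`/`input_image` content shapes follow the Realtime API docs; double-check the exact field names against the current reference.

```python
import base64
import json

def image_item_event(image_bytes: bytes, question: str) -> str:
    """Build a conversation.item.create event pairing a PNG snapshot with a
    question about it. The image travels as a base64 data URL."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_text", "text": question},
                {"type": "input_image", "image_url": data_url},
            ],
        },
    })

# Because the app builds this event explicitly, the model only ever sees
# the frames the developer chooses to send, never a continuous feed.
```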
SIP Support: Direct Connection to the Telephone Network
Session Initiation Protocol (SIP) support means you can connect your AI voice agent directly to the public telephone network, enterprise Private Branch Exchanges (PBX), or other SIP endpoints. This paves the way for building enterprise-grade AI call centers, automated response systems, and more.
Reusable Prompts
Developers can now save and reuse prompts composed of developer messages, tools, variables, and examples, significantly simplifying the development process and improving efficiency.
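As a sketch, a saved prompt would then be referenced by ID in the session configuration rather than re-sent in full each time. The `prompt` object shape below mirrors the one used elsewhere on the OpenAI platform, but the ID and variable names are hypothetical, and the exact Realtime field names are assumptions to verify.

```python
# Sketch: referencing a saved, reusable prompt with per-session variables.
# "pmpt_example123" and "customer_name" are hypothetical placeholders.
session_update = {
    "type": "session.update",
    "session": {
        "prompt": {
            "id": "pmpt_example123",
            "variables": {"customer_name": "Ada"},
        },
    },
}
```

Updating the saved prompt centrally then rolls the change out to every agent that references it.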
Security, Privacy, and More Affordable Pricing
With great power comes great responsibility. OpenAI emphasizes that the Realtime API has multiple layers of built-in security safeguards and actively detects conversations that violate its content policy. At the same time, the API uses default voices to prevent malicious actors from impersonating others. For European users, the API fully supports EU Data Residency regulations.
Finally, what everyone cares about most—the price. The good news is, the more powerful gpt-realtime comes with a 20% price cut relative to the previous gpt-4o-realtime-preview model.
- Audio Input: $32 per million tokens
- Audio Output: $64 per million tokens
Additionally, the API has added more granular conversation context control, allowing developers to intelligently set token limits, thereby significantly reducing the cost of long conversations.
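A quick back-of-envelope helper using the listed rates makes the economics tangible. Note that how many audio tokens a minute of speech consumes varies with format and content, so any per-minute cost derived from this is an estimate.

```python
# Dollar rates from the announcement: $32 per 1M audio input tokens,
# $64 per 1M audio output tokens.
INPUT_PER_M = 32.0
OUTPUT_PER_M = 64.0

def audio_cost(input_tokens: int, output_tokens: int) -> float:
    """Total audio-token cost in dollars for one conversation."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# Example: a session consuming 500k input and 250k output audio tokens
# costs $16 + $16 = $32.00.
cost = audio_cost(500_000, 250_000)
```

Trimming retained context with the new token limits shrinks the input-token term directly, which is where long conversations accumulate most of their cost.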
Conclusion: The Future of Voice Interaction Is Here
gpt-realtime and the new Realtime API are not just a technological evolution; they are redefining the way we interact with AI. From real estate tours (as Zillow is exploring) to personal assistants and interactive education, a more natural, efficient, and even more interesting era of voice AI has arrived.
For developers, now is undoubtedly the best time to explore and innovate. Experiencing the power of this new model firsthand and starting to build your own next-generation voice applications is no longer a distant dream.
More information: https://openai.com/index/introducing-gpt-realtime/