Microsoft's VibeVoice is here: 90-minute audio and multi-speaker conversations. Is this the future of AI podcasts?

Explore Microsoft’s latest open-source text-to-speech (TTS) model, VibeVoice. Available in 1.5B and 7B versions, it supports up to 90 minutes of speech generation, conversations with up to 4 speakers, strong Chinese language performance (with a slight foreign accent), and background music, which could change the way audiobooks and podcasts are created.


Have you ever imagined that creating a high-quality podcast episode or an entire audiobook could be as simple as typing text? In the past, this sounded like a fantasy, but now, Microsoft seems to have a resounding answer.

Recently, the field of AI speech technology welcomed a heavyweight player: Microsoft’s open-source text-to-speech (TTS) model, VibeVoice. It comes in two sizes, 1.5B and 7B parameters (the 7B version was not yet available as of this article’s last update), to meet different needs. Its arrival is more than a small update: it sets a new starting point for long-form audio, multi-speaker conversations, and Chinese speech synthesis.

Honestly, the potential of this technology is truly exciting.

The promise of “long-form,” finally delivered

For content creators, one of the biggest pain points has always been the time limit of speech generation. Traditional TTS models can often only handle a few minutes of audio. Creating long-form content, such as a 30-minute podcast episode or an audiobook chapter, requires constant generation, splicing, and adjustment, a process that is both tedious and time-consuming.

VibeVoice directly breaks this shackle.

Its most striking breakthrough is support for generating up to 90 minutes of continuous speech in a single pass. That means anything from short stories to complete online courses, in-depth interviews, or entire audiobooks can be produced without stitching segments together, which greatly improves creative freedom and efficiency. It’s like going from only being able to send short messages to suddenly writing a full-length novel in one sitting.

No longer a one-man show: Let AI host a roundtable discussion

In the past, AI speech was mostly a “one-person recitation.” Even if some models supported two-person conversations, it was difficult to achieve natural fluency, often sounding like two robots having a stiff conversation.

VibeVoice takes multi-person conversations to a whole new level, capable of fluently generating conversations with up to 4 different characters. More importantly, it has been deeply optimized in handling voice consistency and natural turn-taking between speakers.

You can imagine using it to generate a multi-person roundtable forum, a radio drama, or an interactive scene with virtual characters. The effect is almost comparable to live recording, with smooth and natural transitions between voices, allowing the audience to be fully immersed in the conversation.
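To make the idea of a multi-speaker script concrete, here is a minimal sketch of how a conversation could be prepared as plain text before handing it to a TTS model. The `Speaker N: text` labeling, the `Turn` class, and the `build_script` helper are assumptions made for illustration; the article does not describe VibeVoice's actual input format. Only the 4-speaker limit comes from the text above.

```python
from dataclasses import dataclass

MAX_SPEAKERS = 4  # limit stated in the article


@dataclass
class Turn:
    speaker: int  # 1-based speaker index (hypothetical convention)
    text: str     # what this speaker says in this turn


def build_script(turns):
    """Render turns as 'Speaker N: text' lines.

    The line format is an assumption for illustration, not a confirmed
    VibeVoice input format.
    """
    speakers = {t.speaker for t in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers requested, at most {MAX_SPEAKERS} supported"
        )
    for t in turns:
        if t.speaker < 1:
            raise ValueError("speaker indices start at 1")
        if not t.text.strip():
            raise ValueError("each turn needs non-empty text")
    return "\n".join(f"Speaker {t.speaker}: {t.text.strip()}" for t in turns)


roundtable = [
    Turn(1, "Welcome to the show."),
    Turn(2, "Thanks for having me."),
    Turn(3, "Glad to be here too."),
    Turn(1, "Let's get started."),
]
print(build_script(roundtable))
```

Validating the speaker count up front keeps a script from silently exceeding the stated limit before any audio is generated.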

Chinese speech, this time it’s not just “perfectly enunciated”

For Chinese users, whether an AI speech model is “down-to-earth” depends on its Chinese performance. Many foreign models, while having standard pronunciation when handling Chinese, always lack that “human touch,” sounding flat and emotionless.

VibeVoice demonstrates impressive strength in this area. It not only supports high-quality Chinese speech synthesis but also reaches a high level of natural intonation, pronunciation accuracy, and emotional richness, although a slight foreign accent remains. This gives VibeVoice huge application potential in fields such as Chinese podcasts, online education, and smart customer service, providing developers with a truly useful localized speech solution.

Maxing out the atmosphere! What’s it like to have a podcast with its own BGM?

For good audio content, besides the voice itself, the background atmosphere is equally important. VibeVoice also has a surprise feature—it supports adding background music while generating speech.

This feature allows creators to easily add finishing touches of background sound effects to their podcasts or stories, creating a more immersive and professional listening experience. Whether you need a relaxed background melody or want to create a tense and suspenseful atmosphere, VibeVoice can seamlessly blend vocals and music, making your work sound like it was produced by a professional team.

The data speaks for itself: VibeVoice’s amazing performance

Talk is cheap, and VibeVoice’s strengths are backed by published evaluation results, not just a feature list. The published charts show VibeVoice in the lead, especially the 7B version.

In the subjective evaluation, VibeVoice was compared with Google’s Gemini-2.5-Pro-Preview-TTS and the well-known Eleven-V3 (Alpha). The evaluation was divided into three dimensions:

  • Preference: VibeVoice-7B leads with a score of 3.75.
  • Realism: VibeVoice-7B wins again with a score of 3.71. The smaller 1.5B version scores 3.55, just behind Gemini-2.5-Pro-Preview-TTS at 3.58, so both versions sound close to a real person.
  • Richness: In terms of voice richness and expressiveness, VibeVoice-7B also ranks first with a high score of 3.81.

Subjective Evaluation

This table compares the subjective scores of four models in three dimensions: preference, realism, and richness.

| Model | Preference | Realism | Richness |
|---|---|---|---|
| VibeVoice-7B | 3.75 | 3.71 | 3.81 |
| Gemini-2.5-Pro-Preview-TTS | 3.43 | 3.58 | 3.58 |
| VibeVoice-1.5B | 3.65 | 3.55 | 3.77 |
| Eleven-V3 (Alpha) | 3.37 | 3.33 | 3.47 |
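As a quick way to read the scores above, the following sketch ranks the models on one dimension and computes an unweighted mean across the three. The mean is only an illustrative summary computed from the numbers in the table, not a metric reported in this article, and the helper names `average_scores` and `rank_by` are made up for the example.

```python
from statistics import mean

# Subjective scores copied from the table above.
scores = {
    "VibeVoice-7B": {"preference": 3.75, "realism": 3.71, "richness": 3.81},
    "Gemini-2.5-Pro-Preview-TTS": {"preference": 3.43, "realism": 3.58, "richness": 3.58},
    "VibeVoice-1.5B": {"preference": 3.65, "realism": 3.55, "richness": 3.77},
    "Eleven-V3 (Alpha)": {"preference": 3.37, "realism": 3.33, "richness": 3.47},
}


def average_scores(table):
    """Unweighted mean of the three dimensions per model (illustrative only)."""
    return {model: round(mean(dims.values()), 2) for model, dims in table.items()}


def rank_by(table, dimension):
    """Model names sorted from highest to lowest score on one dimension."""
    return sorted(table, key=lambda m: table[m][dimension], reverse=True)


print(average_scores(scores))
print(rank_by(scores, "realism"))
```

Ranking by realism puts VibeVoice-7B first, with Gemini-2.5-Pro-Preview-TTS narrowly ahead of VibeVoice-1.5B; on preference and richness, both VibeVoice versions come out ahead of the other two models.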

Model Output Speech Length Trend

This table lists each model and its approximate output speech length (in seconds), based on the trend lines and scatter plots in the figure.

| Approx. Time | Model | Output Speech Length (seconds) |
|---|---|---|
| 2023 | VALL-E | ~50 |
| 2023 | NaturalSpeech-2 | ~200 |
| 2024 | CosyVoice | ~500 |
| 2024 | SpeechSSM | ~900 |
| 2025 | MoonCast | ~1000 |
| 2025 | HiggsAudio-V2 | ~200 |
| 2025 | Eleven-V3 (Alpha) | ~300 |
| 2025 | Gemini-2.5-Pro-Preview-TTS | ~400 |
| 2025 | MOSS-TTSD | ~600 |
| 2025 | Nari-Labs-Dia | ~800 |
| 2025 | SesameAILabs-CSM | ~1100 |
| 2025 | VibeVoice | ~5500 |
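The lengths above are in seconds, which makes them hard to compare with the 90-minute headline figure. The short sketch below converts the approximate values from the table to minutes and compares VibeVoice with the next-longest model. The numbers are the rough readings listed above, so the results are approximate too; the helper name `to_minutes` is just for illustration.

```python
# Approximate output lengths in seconds, copied from the table above.
lengths_s = {
    "VALL-E": 50,
    "NaturalSpeech-2": 200,
    "CosyVoice": 500,
    "SpeechSSM": 900,
    "MoonCast": 1000,
    "HiggsAudio-V2": 200,
    "Eleven-V3 (Alpha)": 300,
    "Gemini-2.5-Pro-Preview-TTS": 400,
    "MOSS-TTSD": 600,
    "Nari-Labs-Dia": 800,
    "SesameAILabs-CSM": 1100,
    "VibeVoice": 5500,
}


def to_minutes(seconds):
    """Convert seconds to minutes, rounded to one decimal place."""
    return round(seconds / 60, 1)


vibevoice_min = to_minutes(lengths_s["VibeVoice"])

# Longest output among all other models in the table.
runner_up = max((m for m in lengths_s if m != "VibeVoice"), key=lengths_s.get)
ratio = lengths_s["VibeVoice"] / lengths_s[runner_up]

print(f"VibeVoice: about {vibevoice_min} minutes")
print(f"Next longest: {runner_up}, about {to_minutes(lengths_s[runner_up])} minutes")
print(f"Ratio: about {ratio:.0f}x")
```

Roughly 5500 seconds works out to about 91.7 minutes, which lines up with the 90-minute claim and is about five times the next-longest entry in the table.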

The power of open source: Everyone can be a voice magician

What’s even more exciting is that Microsoft has chosen to open-source VibeVoice. This model has been officially released on GitHub and Hugging Face, which means that developers, researchers, and even individual creators around the world can freely access, modify, and integrate this cutting-edge technology.

Microsoft’s move has undoubtedly injected strong vitality into the entire AI developer community. It greatly lowers the barrier to entry for high-quality TTS technology, so that innovation is no longer the exclusive domain of large companies. Whether you want to develop a unique voice application or just want to dub your own videos, VibeVoice provides you with an excellent starting point.

In summary, the birth of VibeVoice is not just another new AI tool. By solving core pain points such as duration, multi-person conversations, and localization, it truly brings revolutionary changes to the creation of audio content. The future of AI podcasts and audiobooks may be coming sooner than we think.


Seeing is believing: try VibeVoice for yourself!


© 2025 Communeify. All rights reserved.