Microsoft's VibeVoice is here: 90-minute audio and multi-person conversations. Is this the future of AI podcasts?

Posted on: 2025-08-26 • Updated on: 2025-08-26 • 6 min read

Explore Microsoft’s latest open-source text-to-speech (TTS) model, VibeVoice. Available in 1.5B and 7B versions, it supports up to 90 minutes of speech generation, conversations with up to 4 speakers, strong Chinese-language performance (with a slight foreign accent), and background music, and it could change how audiobooks and podcasts are made.

Have you ever imagined that creating a high-quality podcast episode or an entire audiobook could be as simple as typing text? In the past, this sounded like a fantasy, but now, Microsoft seems to have a resounding answer.

Recently, the field of AI speech technology welcomed a heavyweight: Microsoft’s open-source text-to-speech (TTS) model, VibeVoice. It comes in two sizes, 1.5B and 7B (the latter not yet available as of the update date), to meet different needs. Rather than an incremental update, it sets a new starting point for long-form audio, multi-person conversations, and Chinese speech synthesis.

Honestly, the potential of this technology is truly exciting.

The promise of “long-form,” finally delivered

For content creators, one of the biggest pain points has always been the time limit of speech generation. Traditional TTS models can often only handle a few minutes of audio. Creating long-form content, such as a 30-minute podcast episode or an audiobook chapter, requires constant generation, splicing, and adjustment, a process that is both tedious and time-consuming.

VibeVoice directly breaks this shackle.

Its most striking breakthrough is support for generating up to 90 minutes of continuous speech in a single pass. That means anything from short stories to complete online courses, from in-depth interviews to entire audiobooks, can be produced without splicing segments together, which greatly improves creative freedom and efficiency. It’s like going from only being able to send short messages to writing a full-length novel in one sitting.

No longer a one-man show: Let AI host a roundtable discussion

In the past, AI speech was mostly a “one-person recitation.” Even if some models supported two-person conversations, it was difficult to achieve natural fluency, often sounding like two robots having a stiff conversation.

VibeVoice takes multi-person conversations to a new level: it can generate fluent conversations with up to 4 different speakers. More importantly, it is optimized for keeping each voice consistent and for natural turn-taking between speakers.

You can imagine using it to generate a multi-person roundtable forum, a radio drama, or an interactive scene with virtual characters. The effect is almost comparable to live recording, with smooth and natural transitions between voices, allowing the audience to be fully immersed in the conversation.
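
Before sending a long script to any multi-speaker model, it can help to check that it stays within the 4-speaker limit mentioned above. The sketch below is only an illustration: the "Name: text" script layout and the helper names `parse_script` and `check_speaker_limit` are assumptions made for this example, not VibeVoice's documented input format or API.

```python
MAX_SPEAKERS = 4  # limit stated in this article


def parse_script(script: str) -> list[tuple[str, str]]:
    """Split a hypothetical 'Name: text' script into (speaker, text) turns."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        speaker, sep, text = line.partition(":")
        if not sep or not speaker.strip() or not text.strip():
            raise ValueError(f"Expected 'Name: text', got: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


def check_speaker_limit(turns, limit: int = MAX_SPEAKERS) -> list[str]:
    """Return the distinct speakers in order of appearance, or raise if over the limit."""
    speakers = list(dict.fromkeys(speaker for speaker, _ in turns))
    if len(speakers) > limit:
        raise ValueError(f"{len(speakers)} speakers found, limit is {limit}")
    return speakers


demo = """
Host: Welcome back to the show.
Guest A: Thanks for having me.
Guest B: Glad to be here.
Host: Let's get started.
"""
print(check_speaker_limit(parse_script(demo)))
```

A check like this catches a fifth speaker before a 90-minute generation run starts, rather than after.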

Chinese speech, this time it’s not just “perfectly enunciated”

For Chinese users, the real test of an AI speech model is how natural its Chinese sounds. Many models developed outside China pronounce Chinese correctly but lack a “human touch,” sounding flat and emotionless.

VibeVoice demonstrates impressive strength in this area. It not only supports high-quality Chinese speech synthesis but also performs well on natural intonation, pronunciation accuracy, and emotional richness. This gives VibeVoice significant potential in fields such as Chinese podcasts, online education, and smart customer service, and gives developers a genuinely useful localized speech solution.

Maxing out the atmosphere! What’s it like to have a podcast with its own BGM?

For good audio content, besides the voice itself, the background atmosphere is equally important. VibeVoice also has a surprise feature—it supports adding background music while generating speech.

This feature lets creators add background sound to their podcasts or stories for a more immersive, professional listening experience. Whether you need a relaxed background melody or a tense, suspenseful atmosphere, VibeVoice can blend vocals and music so that your work sounds like it was produced by a professional team.

The data speaks for itself: VibeVoice’s amazing performance

Talk is cheap. VibeVoice’s strength is not just a list of features; it is also backed by published evaluation data. The published charts show VibeVoice in the lead, especially the larger 7B version.

In the subjective evaluation, VibeVoice was compared with Google’s Gemini-2.5-Pro-Preview-TTS and the well-known Eleven-V3 (Alpha). The evaluation was divided into three dimensions:

Preference: VibeVoice-7B leads with a score of 3.75.
Realism: VibeVoice-7B leads again with a score of 3.71. The smaller 1.5B version scores 3.55, just behind Gemini-2.5-Pro-Preview-TTS at 3.58.
Richness: In terms of voice richness and expressiveness, VibeVoice-7B also ranks first with a score of 3.81, followed by the 1.5B version at 3.77.

Subjective Evaluation

This table compares the subjective scores of four models in three dimensions: preference, realism, and richness.

Model	Preference	Realism	Richness
VibeVoice-7B	3.75	3.71	3.81
Gemini-2.5-Pro-Preview-TTS	3.43	3.58	3.58
VibeVoice-1.5B	3.65	3.55	3.77
Eleven-V3 (Alpha)	3.37	3.33	3.47
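
To compare the models at a glance, here is a minimal Python sketch over the table above. The unweighted average across the three dimensions is an assumption made for illustration only; the published evaluation reports the dimensions separately and does not define an overall score.

```python
from statistics import mean

DIMENSIONS = ("preference", "realism", "richness")

# Scores copied from the subjective evaluation table above.
scores = {
    "VibeVoice-7B": (3.75, 3.71, 3.81),
    "Gemini-2.5-Pro-Preview-TTS": (3.43, 3.58, 3.58),
    "VibeVoice-1.5B": (3.65, 3.55, 3.77),
    "Eleven-V3 (Alpha)": (3.37, 3.33, 3.47),
}

# Assumed summary metric: plain average of the three dimensions.
mean_scores = {model: mean(values) for model, values in scores.items()}
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)

# Best model per dimension.
leaders = {
    dim: max(scores, key=lambda m: scores[m][i])
    for i, dim in enumerate(DIMENSIONS)
}

for model in ranking:
    print(f"{model}: {mean_scores[model]:.2f}")
print(leaders)
```

Under this averaging, both VibeVoice versions rank ahead of the two commercial models, even though Gemini-2.5-Pro-Preview-TTS edges out VibeVoice-1.5B on realism alone.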

Model Output Speech Length Trend

This table lists each model and its approximate output speech length (in seconds), based on the trend lines and scatter plots in the figure.

Approx. Time	Model	Output Speech Length (seconds)
2023	VALL-E	~50
2023	NaturalSpeech-2	~200
2024	CosyVoice	~500
2024	SpeechSSM	~900
2025	MoonCast	~1000
2025	HiggsAudio-V2	~200
2025	Eleven-V3 (Alpha)	~300
2025	Gemini-2.5-Pro-Preview-TTS	~400
2025	MOSS-TTSD	~600
2025	Nari-Labs-Dia	~800
2025	SesameAILabs-CSM	~1100
2025	VibeVoice	~5500
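
For a sense of scale, the sketch below converts the approximate values above into minutes and compares the longest output with the runner-up. The numbers are read off a chart, so treat the results as rough.

```python
# Approximate output lengths in seconds, copied from the table above.
lengths_s = {
    "VALL-E": 50,
    "NaturalSpeech-2": 200,
    "CosyVoice": 500,
    "SpeechSSM": 900,
    "MoonCast": 1000,
    "HiggsAudio-V2": 200,
    "Eleven-V3 (Alpha)": 300,
    "Gemini-2.5-Pro-Preview-TTS": 400,
    "MOSS-TTSD": 600,
    "Nari-Labs-Dia": 800,
    "SesameAILabs-CSM": 1100,
    "VibeVoice": 5500,
}


def to_minutes(seconds: float) -> float:
    """Convert a duration in seconds to minutes."""
    return seconds / 60


# Sort models from longest to shortest output.
ranked = sorted(lengths_s, key=lengths_s.get, reverse=True)
longest, runner_up = ranked[0], ranked[1]
ratio = lengths_s[longest] / lengths_s[runner_up]

print(f"{longest}: ~{to_minutes(lengths_s[longest]):.1f} min, "
      f"{ratio:.1f}x {runner_up}")
```

With these values, VibeVoice’s ~5500 seconds works out to roughly 91.7 minutes, in line with the 90-minute figure, and about 5 times the length of the next-longest model in the table.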

The power of open source: Everyone can be a voice magician

What’s even more exciting is that Microsoft has chosen to open-source VibeVoice. This model has been officially released on GitHub and Hugging Face, which means that developers, researchers, and even individual creators around the world can freely access, modify, and integrate this cutting-edge technology.

Microsoft’s move has injected new energy into the AI developer community. It greatly lowers the barrier to entry for high-quality TTS technology, so that innovation is no longer the exclusive preserve of large companies. Whether you want to develop a unique voice application or just want to dub your own videos, VibeVoice provides you with an excellent starting point.

In summary, the birth of VibeVoice is not just another new AI tool. By solving core pain points such as duration, multi-person conversations, and localization, it truly brings revolutionary changes to the creation of audio content. The future of AI podcasts and audiobooks may be coming sooner than we think.

Seeing is believing: experience VibeVoice for yourself!

Try the online demo: No need to install any software, just enter text in your browser and experience the speech generated by VibeVoice.
- Online Demo Experience
Explore the model and code: For developers and tech enthusiasts, you can delve into the technical details behind it and even integrate it into your own projects.
- Official GitHub Repository
- Hugging Face 1.5B Model Page


