AI voice synthesis has a new challenger. SoulX-Podcast claims to generate over 90 minutes of AI podcast conversation with support for multiple dialects and natural emotion. Can this new technology really overcome the awkwardness earlier models showed in multi-speaker scenarios? This article digs into the technical details and the potential behind it.
In the wave of artificial intelligence, text-to-speech (TTS) technology is nothing new. We are used to the clear guidance of mobile navigation and the gentle responses of smart speakers. Yet when we ask an AI to simulate a real, fluent, multi-person podcast conversation, the results are often unsatisfying: stiff voices, flat tones, and a sense of chaos when speakers switch all act as an invisible wall, reminding us that a gap remains between AI and real people.
In the past, models such as VibeVoice-1.5B, despite their promise, consistently fell short when handling rapid multi-speaker dialogue switching. This has left many developers and content creators wondering: how far are we from an AI that can generate truly convincing multi-person conversations?
Now a new model called SoulX-Podcast has arrived on the scene. Judging from its demonstration page, it appears to take a big step toward solving this "nightmare-level" problem.
Not Just Mono: Born for Real Conversation
Traditional TTS systems are mostly designed for a single speaker; think of one as an actor performing a monologue. But a podcast, or any real conversation, is more like a stage play with multiple characters, full of interaction, interruption, and emotional exchange.
The core design goal of SoulX-Podcast is to generate exactly this kind of multi-turn, multi-speaker conversational speech. Rather than simply converting text into sound, it tracks the context of the conversation, letting each "speaker's" tone and rhythm shift naturally as the dialogue progresses. The AI knows not only what to say, but how to say it, a huge leap forward in naturalness.
Can it handle accents? Amazing dialect and tone control
It’s not difficult to make an AI speak, but to make it speak with a “human touch”, or even with a local accent, is a big challenge. SoulX-Podcast brings a surprise in this regard.
It not only supports standard Chinese and English, but also integrates a variety of Chinese dialects, including Sichuanese, Henanese, and Cantonese. From the examples shown in the official demonstration, the AI-generated dialects sound quite authentic, retaining the unique charm and intonation of the dialects.
More important is the addition of "paralinguistic control". What does this mean? Simply put, it refers to non-verbal sound signals marked inline in the script (a sample script is sketched after the list), such as:
- Laughter (<laughter>)
- Sighs (<sigh>)
- Throat clearing (<throat_clearing>)
- Coughing (<coughing>)
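To make this concrete, here is a hypothetical example of what such a script might look like, written as a small Python snippet. The speaker labels ([S1]/[S2]) and the overall layout are illustrative assumptions, not SoulX-Podcast's documented input format; only the four tags themselves come from the official demo.

```python
# Hypothetical multi-speaker script with inline paralinguistic tags.
# [S1]/[S2] speaker labels and the layout are illustrative, not the
# model's documented syntax; the tags are the ones the demo showcases.
script = """
[S1] Welcome back! Today we're talking about AI-generated podcasts.
[S2] <laughter> Fitting, considering a model may have written this intro.
[S1] <throat_clearing> Right. So, how does it keep two voices apart?
[S2] <sigh> That's the hard part. Let's dig in.
"""
print(script)
```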
These details are the key to making a conversation feel alive. Imagine an AI host laughing naturally while discussing an interesting topic instead of saying "haha" in a flat tone; the two are on completely different levels of appeal.
Stable for 90 straight minutes, with no midway "voice swap"
Long-form speech generation is another huge technical hurdle. Many models start to drift in timbre (the speaker's vocal identity) after only a few minutes of audio, making it sound as if a different person has taken over midway.
The technical report of SoulX-Podcast states that it can continuously generate more than 90 minutes of conversation while maintaining a stable speaker timbre and smooth transitions. This is undoubtedly a very attractive feature for creators of podcasts, audiobooks, or long-form educational content. This means that in the future, it may be possible to automatically generate an entire season of a show just from a script, without worrying about inconsistent sound quality.
The secret behind it: powerful data processing and model architecture
Sounds amazing, right? The credit for this goes to a complex and sophisticated system.
First is its SoulX-Data-Pipeline. Before training the model, the team meticulously processed a large amount of speech data, including speech enhancement, audio segmentation, speaker diarization (determining who is speaking), text transcription, and quality filtering. This is like a team of chefs meticulously washing, selecting, and processing every ingredient before cooking a big meal to ensure the final taste is the best.
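For intuition, here is a minimal Python sketch of such a pipeline. The stage order mirrors the steps named in the report; every function body is a hypothetical placeholder, not SoulX-Data-Pipeline's actual code or API.

```python
# Sketch of the preprocessing chain described in the report:
# enhancement -> segmentation -> diarization -> transcription -> filtering.
# All functions are hypothetical placeholders, not SoulX's real API.

def enhance(audio: bytes) -> bytes:
    """Speech enhancement: denoise and clean the raw recording (placeholder)."""
    return audio

def segment(audio: bytes) -> list[bytes]:
    """Cut long audio into utterance-sized chunks (placeholder)."""
    return [audio]

def diarize(chunk: bytes) -> str:
    """Speaker diarization: decide who is speaking in this chunk (placeholder)."""
    return "S1"

def transcribe(chunk: bytes) -> str:
    """Speech-to-text transcription for the chunk (placeholder)."""
    return "(transcript)"

def good_enough(chunk: bytes, text: str) -> bool:
    """Quality filtering: keep only clean, well-aligned samples (placeholder)."""
    return True

def build_dataset(raw_audio: bytes) -> list[dict]:
    """Run the full chain and collect training samples."""
    clean = enhance(raw_audio)
    samples = []
    for chunk in segment(clean):
        speaker, text = diarize(chunk), transcribe(chunk)
        if good_enough(chunk, text):
            samples.append({"speaker": speaker, "text": text, "audio": chunk})
    return samples

print(build_dataset(b"raw podcast audio"))
```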
At its core, SoulX-Podcast appears to be built on a large language model (LLM) backbone such as Qwen3-1.7B. This lets the model process not only sound but also the deep structure of language and conversation, producing more natural tones and rhythms.
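As a rough illustration of how an LLM backbone fits into TTS, here is a conceptual Python sketch of the common "speech-as-discrete-tokens" design that such systems typically adopt. The architecture family is an assumption on our part, and none of these functions are SoulX-Podcast's real code or API.

```python
# Conceptual sketch: the LLM reads the tokenized script (tags included)
# and autoregressively predicts discrete audio-codec tokens, which a
# separate decoder turns into a waveform. All stand-ins, not real APIs.

def tokenize_text(script: str) -> list[int]:
    """Stand-in text tokenizer; tags like <laughter> are just text here."""
    return [ord(c) % 256 for c in script]

def llm_next_token(context: list[int]) -> int:
    """Stand-in for the LLM backbone predicting the next audio token."""
    return sum(context) % 256  # deterministic dummy value

def decode_audio(tokens: list[int]) -> bytes:
    """Stand-in for the neural codec that converts tokens to audio."""
    return bytes(tokens)

def synthesize(script: str, max_new_tokens: int = 8) -> bytes:
    context = tokenize_text(script)
    audio_tokens: list[int] = []
    for _ in range(max_new_tokens):
        # Condition on the full script plus all audio generated so far.
        audio_tokens.append(llm_next_token(context + audio_tokens))
    return decode_audio(audio_tokens)

print(synthesize("[S1] Hello! [S2] <laughter> Hi!"))
```

The point of the sketch is the loop: because the model conditions on the entire script plus everything generated so far, dialogue context can shape tone and rhythm in a way a sentence-by-sentence TTS cannot.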
So, is it really different this time?
Judging from the official examples and technical details, SoulX-Podcast does show impressive strength. It not only performs at the top level in single-speaker speech synthesis, but also makes breakthroughs in the far more challenging scenarios of multi-speaker, multi-dialect, long-form conversation.
Of course, demo examples are always cherry-picked. Its performance in messier, less predictable real-world applications still needs broader testing by the community and developers (its Hugging Face page is now live).
But in any case, the emergence of SoulX-Podcast paints an exciting future for the field of AI speech synthesis, especially for the content creation industry. Perhaps in the near future, when we listen to a wonderful multi-person podcast, we will no longer be able to tell whether the voice in our headphones is from a human or an AI.