dots.tts In-Depth: A Next-Gen Open Source TTS Model Ditching Discrete Tokens

Ditching Discrete Tokens: Analyzing the Fully Continuous Architecture and Practical Tips for dots.tts, the Open Source Speech Synthesis Star

Has speech synthesis technology hit a plateau? A much-discussed newcomer in the open-source community suggests otherwise: dots.tts, released by RedNote. The model has 2 billion (2B) parameters and uses a fully continuous architecture. In simple terms, it discards the discrete audio tokens that earlier systems commonly relied on, with the goal of making generated speech smoother and more natural.

Developers who want to try it firsthand can visit the official dots.tts demo page or get the source code from the dots.tts GitHub project. The project is open-sourced under the Apache-2.0 license, which permits commercial use.

Next, let’s take a closer look and uncover the secrets behind the system that has ignited such enthusiastic discussions.

Why Abandon Discrete Tokens? Uncovering the Secrets of the Full-Process Architecture

Traditional speech synthesis systems often quantize audio into discrete tokens. This is like converting a smooth gradient image into a pixelated image with only a handful of colors: some detail is inevitably lost in the process.
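To make the analogy concrete, here is a minimal sketch that snaps a smooth signal to a fixed number of levels and measures what is lost. It illustrates scalar quantization of a toy sine wave only; it is not how any particular TTS codec tokenizes audio, and the function names are made up for this example.

```python
import math


def quantize(x: float, levels: int) -> float:
    """Snap x in [-1, 1] to the nearest of `levels` evenly spaced values."""
    step = 2.0 / (levels - 1)
    return round((x + 1.0) / step) * step - 1.0


def rms_error(signal: list[float], levels: int) -> float:
    """Root-mean-square difference between a signal and its quantized copy."""
    squared = [(x - quantize(x, levels)) ** 2 for x in signal]
    return math.sqrt(sum(squared) / len(squared))


# One period of a sine wave stands in for a smooth, continuous signal.
signal = [math.sin(2 * math.pi * n / 1000) for n in range(1000)]

for levels in (8, 64, 256):
    print(f"{levels:>3} levels -> RMS error {rms_error(signal, levels):.5f}")
```

The error shrinks as the number of levels grows but never reaches zero, which is the loss a continuous representation is meant to sidestep.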

The emergence of dots.tts aims to solve this pain point. It adopts a full-process design that generates continuous audio latent variables directly from text. The entire operating mechanism is built on the close integration of several key components:

First is the AudioVAE, which handles the audio itself. This module operates at 48 kHz and compresses mono waveforms into continuous latent variables, so that the final output retains high fidelity and detail. Next is the language model backbone, initialized from Qwen2.5-1.5B-Base. Notably, this language model does not process traditional phonemes; it reads BPE text directly and produces the corresponding hidden states.

So, how are text and audio connected? This is the job of the Causal Semantic Encoder. It strips away acoustic details that are too variable or trivial, allowing the language model to focus on the meaning and coherence of the whole sentence. Finally, the language model's hidden states are handed to the Autoregressive (AR) Flow-matching Head, which predicts and denoises the audio latents patch by patch in continuous space.
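The idea of patch-by-patch denoising in continuous space can be sketched with a toy example. Everything below is an assumption made for illustration: the velocity function is a hand-written stand-in for a learned network, the "hidden states" are plain lists of floats, and the way each patch is conditioned on the previous one is invented. It is not the dots.tts implementation.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity head: points straight at `target`.

    A real flow-matching head is a neural network conditioned on the
    language model's hidden state; here `target` plays that role.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_patch(target, num_steps, rng):
    """Integrate dx/dt = v(x, t) from t=0 (noise) towards t=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(hidden_states, num_steps=4, seed=0):
    """Produce latent patches in order; each one depends on the previous patch."""
    rng = random.Random(seed)
    patches = []
    previous = [0.0] * len(hidden_states[0])
    for hidden in hidden_states:
        # Toy conditioning: mix the current hidden state with the last patch.
        target = [h + 0.5 * p for h, p in zip(hidden, previous)]
        previous = sample_patch(target, num_steps, rng)
        patches.append(previous)
    return patches


print(generate([[1.0, 1.0], [0.0, 0.0]]))
```

In this toy field the path from noise to target is a straight line, so Euler integration lands on the target with any number of steps. A learned velocity field is generally not that simple, which is why the number of sampling steps matters in practice.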

Because there is no quantization step anywhere in the pipeline, this continuous modeling avoids quantization distortion altogether.

The Test Data Speaks: How Strong Is This Model, Really?

Objective test data often best reflects real capabilities. In the Seed-TTS-Eval comprehensive evaluation, this system performed remarkably well in zero-shot voice cloning.

Measured against other models of similar scale, such as the 1.5B-parameter CosyVoice 3 and the 1.7B Qwen3-TTS, dots.tts achieved a Word Error Rate (WER) of 0.94% on the Chinese evaluation set and an average speaker similarity (SIM) of 79.2. On these numbers it surpasses open-source models in its class, and it also remains highly stable in multilingual testing.
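For readers unfamiliar with the metric, an error rate of this kind is the edit distance between the reference transcript and the recognized transcript, divided by the reference length. The sketch below is a generic implementation, not the Seed-TTS-Eval scoring script; treating each Chinese character as one token is an assumption made here for the example.

```python
def edit_distance(ref, hyp):
    """Minimum substitutions, insertions, and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the number of reference tokens."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# English: tokens are whitespace-separated words.
print(error_rate("a b c d".split(), "a x c".split()))
# Chinese: tokens are individual characters (an assumption for this sketch).
print(error_rate(list("今天天气很好"), list("今天天汽很好")))
```

A score of 0.94% therefore corresponds to fewer than one token-level error per hundred reference tokens.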

Its performance in the Emergent-TTS-Eval evaluation is more striking still. On sentences of extremely high syntactic complexity, it scored 65.7%, ahead of some well-known closed-source commercial systems. In the Emotions category, it scored 72.7%. In other words, the generated speech is no longer a cold, robotic voice; it captures the shifts in tone and emotion.

Overview of the Three Major Model Versions: Which One Should Beginners Choose?

The project officially provides three sets of weights, and developers are often unsure which to pick. In fact, the division is clear.

If you simply want the strongest voice cloning results, choose dots.tts-soar, the officially recommended version. It has undergone Self-Correction Alignment (SCA), and it offers the highest voice fidelity and stability of the three.

If it is for academic research or architectural verification, you can choose the base pre-trained version, dots.tts-base.

What if compute is limited, or generation speed is critical? In that case, choose dots.tts-mf, a student model distilled with MeanFlow. By default it needs only 4 sampling steps, which makes it lightweight and fast to run.
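The selection guidance above can be written down as a small lookup. The variant names come from the article; the priority labels and the helper function are hypothetical and exist only for this sketch.

```python
# Priority labels ("quality", "research", "speed") are invented for this example.
VARIANTS = {
    "quality": "dots.tts-soar",   # SCA-aligned, officially recommended
    "research": "dots.tts-base",  # base pre-trained weights
    "speed": "dots.tts-mf",       # MeanFlow-distilled student, 4 steps by default
}


def choose_variant(priority: str) -> str:
    """Map a stated priority to the variant the article suggests for it."""
    try:
        return VARIANTS[priority]
    except KeyError:
        options = ", ".join(sorted(VARIANTS))
        raise ValueError(f"unknown priority {priority!r}; expected one of: {options}") from None


print(choose_variant("quality"))
```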

Hands-On Practice: Operational Tips to Avoid Common Pitfalls

With the theory covered, it is time for practice. A few operational details make a real difference to the quality of the results.

For zero-shot cloning, the system provides two main modes. The first is “Continuation Mode,” the top choice when you want the highest similarity. You provide a piece of reference audio together with the exact transcript of that audio, and the model continues speaking in the original voice. The second is “X-vector-only Mode.” This mode needs only the reference audio: the model extracts the speaker’s timbre characteristics from it and uses them to generate new content.

When preparing the Prompt Audio, it is best to control the length to about 10 seconds. Many people mistakenly think that the longer the audio, the better, but this is a misunderstanding. Audio that is too long might interfere with the generation process. In addition, it is necessary to ensure that the audio is clear and free of background noise.
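A quick way to check a prompt clip before using it is to read its duration with Python's standard `wave` module. The sketch below assumes an uncompressed PCM WAV file; the 5-second slack around the 10-second target is an arbitrary threshold chosen for this example, not a value from the project.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def prompt_length_advice(duration: float, target: float = 10.0, slack: float = 5.0) -> str:
    """Classify a clip relative to the target; `slack` is an assumed tolerance."""
    if duration < target - slack:
        return "too short"
    if duration > target + slack:
        return "too long"
    return "ok"


def write_silence(path: str, seconds: float, sample_rate: int = 16000) -> None:
    """Write a mono 16-bit silent WAV, handy for trying the functions above."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
```

Duration is only part of the advice: background noise and clarity still need to be checked by ear.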

Another problem many people encounter is why the model sometimes mispronounces polyphonic characters. When encountering this situation, do not change the underlying code. The simplest and most effective solution is to directly replace that character in the input text with pinyin with tones. For example, write “好” as “hào”. Please note that you should not add numbers to mark the tone (e.g., hao4 is invalid); you must use standard tone marks.
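If your text pipeline already produces numbered pinyin, it can be converted to tone marks before it reaches the model. The helper below is a sketch for single syllables that follows the standard placement rule (mark a or e if present, the o in "ou", otherwise the last vowel). It is not part of dots.tts, and accepting `v` or `u:` as spellings of ü is an assumption about the input.

```python
import re
import unicodedata

U_UMLAUT = "\u00fc"  # the letter ü
# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
SYLLABLE = re.compile("([a-z" + U_UMLAUT + ":]+)([1-5])")


def numbered_to_marked(syllable: str) -> str:
    """Convert one numbered syllable such as 'hao4' to its tone-marked form."""
    match = SYLLABLE.fullmatch(syllable.lower())
    if match is None:
        return syllable  # not numbered pinyin; leave untouched
    letters, tone = match.group(1), int(match.group(2))
    letters = letters.replace("u:", U_UMLAUT).replace("v", U_UMLAUT)
    if tone == 5:
        return letters  # neutral tone carries no mark
    # Standard placement: a or e if present, the o in "ou", else the last vowel.
    if "a" in letters:
        index = letters.index("a")
    elif "e" in letters:
        index = letters.index("e")
    elif "ou" in letters:
        index = letters.index("o")
    else:
        positions = [i for i, ch in enumerate(letters) if ch in "iou" + U_UMLAUT]
        if not positions:
            return letters
        index = positions[-1]
    # Compose base vowel + combining mark into a single precomposed character.
    marked = unicodedata.normalize("NFC", letters[index] + TONE_MARKS[tone])
    return letters[:index] + marked + letters[index + 1:]


for s in ("hao3", "hao4", "gui4", "lv4"):
    print(s, "->", numbered_to_marked(s))
```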

If you are not satisfied with the generated tone or rhythm, change the --seed value in the command, and the model will produce a different cadence. After a few tries, you will usually find a version you like.
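Trying several seeds is easy to script. The sketch below only builds command lines; it uses the `--seed` and `--num-steps` flags that this article mentions, while the base command is a hypothetical placeholder that you would replace with the project's real inference command and its other arguments.

```python
import shlex


def seed_sweep(base_command, seeds, num_steps=None):
    """Return one shell-quoted command line per seed."""
    commands = []
    for seed in seeds:
        args = list(base_command) + ["--seed", str(seed)]
        if num_steps is not None:
            args += ["--num-steps", str(num_steps)]
        commands.append(shlex.join(args))
    return commands


# Hypothetical placeholder: substitute the project's real inference command.
BASE = ["python", "your_inference_script.py"]

for line in seed_sweep(BASE, seeds=[0, 1, 2]):
    print(line)
```

Keeping each output alongside the seed that produced it makes it simple to regenerate the take you prefer.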

Active Community Support and Limitations That Cannot Be Ignored

For an open-source project to develop in the long term, the activity of the community plays a key role. Currently, the community has developed dedicated Apple Silicon optimized versions for this model (including dots-tts-mlx and mlx-swift-dots-tts), allowing iOS and macOS users to deploy it easily. Creators who prefer graphical interfaces can also find corresponding ComfyUI extension nodes.

Of course, every technology has its limitations. Because the system is built on a BPE text model, long-tail languages with less training data (such as Arabic, Hindi, or Vietnamese) show a higher word error rate, even though voice similarity is unaffected. In addition, the training data is concentrated on speech, so the model currently cannot generate singing voices or special sound effects.

Finally, accompanying the powerful cloning capability is the unavoidable responsibility of safety and ethics. The voice generated by this technology is extremely realistic. Developers must add AI-generated tags and watermarks when using it, and must never use it for any forgery or fraud without consent.

dots.tts has indeed brought a brand-new direction for the field of speech generation. By abandoning discrete tokens, it has successfully retained the rich details of audio, demonstrating extremely high similarity and emotional expression, and making people look forward to future voice interaction applications.

Q&A

Q1: What is dots.tts, and what are its biggest features?
A1: dots.tts is a fully continuous, end-to-end autoregressive (AR) text-to-speech system with 2 billion (2B) parameters. Its main innovation is that no stage of the pipeline uses “discrete tokens.” The architecture combines a Causal Semantic Encoder, a Large Language Model (LLM) backbone based on Qwen2.5, and an autoregressive Flow-matching acoustic head, paired with a 48 kHz AudioVAE for high audio fidelity.

Q2: Three versions of the model have been officially released (base, soar, mf). How should I choose?
A2:

dots.tts-base: The base pre-trained version.
dots.tts-soar: The version that has undergone Self-Correction Alignment (SCA). This is the officially recommended version, with the strongest voice cloning and emotional expression.
dots.tts-mf: A student model distilled with MeanFlow. Choose this version if inference speed and compute cost matter most; by default it needs only 4 sampling steps to complete generation.

Q3: How long should the Prompt Audio be for voice cloning?
A3: Keep the Prompt Audio to about 10 seconds. Longer audio does not bring better results and may waste compute. Also make sure the “Prompt Text” matches what is actually spoken in the audio exactly; otherwise generation stability suffers, and word-level errors may appear.

Q4: What can I do if the model mispronounces polyphonic characters?
A4: Replace the Chinese character in the input text with “pinyin with tone marks” to force the correct pronunciation. For example, to force “好” to be read in the fourth tone, write it as hào. Note that the system supports only standard tone marks (such as hǎo, hào); numbered tones are not supported (for example, hao4 is invalid).

Q5: What if I am not satisfied with the rhythm or audio quality of the generated speech?
A5: Try changing the --seed value in the command. Different seeds produce different rhythms and intonations, and a few attempts usually turn up a suitable version. If the audio quality is the problem, increase --num-steps to raise the number of sampling steps, trading more compute for cleaner and more expressive audio.

Q6: Does dots.tts support multilingual input and low-latency streaming?
A6: Yes. For multilingual or mixed Chinese and English input, use --language auto_detect to let the system detect the language automatically, or force a specific language (such as EN or ZH). The architecture also supports low-latency streaming generation, outputting audio chunk by chunk, which suits integration with conversational language models.

Q7: What technical limitations or ethical risks of dots.tts need attention?
A7:

Technical limitations: Although timbre cloning is very strong, the word error rate (WER) is higher for long-tail languages with less data (such as Arabic, Hindi, and Vietnamese). In addition, the current training data is mainly speech, so the model cannot yet generate singing or special sound effects.
Ethical risks: Because its zero-shot voice cloning is extremely realistic, the project strongly urges that output be clearly marked as “AI generated,” and strictly forbids using it for forgery, fraud, or spreading false information without consent. The project is open-sourced under the Apache-2.0 license and is suitable for research and legally authorized commercial deployment.
