Abandoning Traditional Spectrograms! Meituan Open Sources 3.5B Parameter LongCat-AudioDiT, Analyzing Waveform-Space Speech Generation Technology
Speech synthesis technology has achieved a breakthrough. Meituan’s LongCat team has officially launched LongCat-AudioDiT, a new non-autoregressive text-to-speech model that operates directly in the waveform latent space, sidestepping the error accumulation that plagues traditional multi-stage architectures. This article walks developers through its core technology, its Adaptive Projection Guidance (APG) inference optimization, and its permissively licensed open-source resources.
Generating synthesized speech that sounds almost identical to a real person has long been a difficult engineering problem. Traditional speech synthesis systems usually require multiple conversion steps: from input text to acoustic features, and then from those features back into sound waveforms. This pipeline is not only cumbersome but also tends to lose precious vocal detail at each conversion.
This is precisely the challenge that Meituan’s LongCat team aims to overcome with their latest open-source project. They have introduced LongCat-AudioDiT, a non-autoregressive (NAR) text-to-speech (TTS) model based on a diffusion architecture. Upon its release, it quickly caught the eye of the global developer community with its stunning zero-shot voice cloning capabilities.
Frankly, the fidelity of the cloned voices is impressive. On the highly challenging Seed test sets, the 3.5-billion-parameter version, LongCat-AudioDiT-3.5B, outperformed Seed-TTS, which was previously considered an industry gold standard. Most notably, it abandons complex multi-stage training pipelines and does not rely on massive amounts of time-consuming, manually annotated high-quality data. The research team achieved this with a greatly simplified, single-stage architecture.
Next, let’s break down the technical brilliance behind this innovation.
Farewell to Mel-spectrograms? The Magic of Operating Directly in Waveform Space
Traditional speech diffusion models often face a persistent pain point. Most models (such as the well-known F5-TTS) rely heavily on “Mel-spectrograms” as an intermediate feature during the generation process. This means the system must be equipped with an additional vocoder to convert the predicted spectrogram data back into an actual waveform.
While this process might sound trivial, it hides significant risks. Multi-stage data conversion is highly susceptible to “error accumulation.” Imagine photocopying a piece of paper and then photocopying the copy—each iteration inevitably loses original clarity. In the realm of speech, this translates to the loss of high-frequency details and a decline in overall audio quality.
LongCat-AudioDiT proposes an incredibly sleek solution: it simply abandons the traditional tool of Mel-spectrograms.
The entire architecture retains only two core components: a Waveform Variational Autoencoder (Wav-VAE) and a Diffusion Transformer (DiT). During the training phase, the model directly compresses raw audio into continuous latent representations. In the inference phase, these latent variables are decoded directly back into waveforms. This significantly simplifies the workflow while preserving the original, delicate texture of the sound.
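As a rough illustration of this data flow, here is a shape-level sketch. The functions below are stand-ins (the real components are neural networks), and the 24 kHz sample rate is an assumption for illustration; the 64-dimension, 11.72 Hz latent configuration is the one reported later in this article.

```python
import numpy as np

SAMPLE_RATE = 24_000   # assumed audio sample rate; not stated in this article
LATENT_DIM = 64        # latent channels per frame (the paper's reported sweet spot)
LATENT_RATE = 11.72    # latent frames per second

def vae_encode(waveform: np.ndarray) -> np.ndarray:
    """Stand-in Wav-VAE encoder: waveform -> (frames, LATENT_DIM) latents."""
    n_frames = int(round(len(waveform) / SAMPLE_RATE * LATENT_RATE))
    return np.zeros((n_frames, LATENT_DIM))

def vae_decode(latents: np.ndarray) -> np.ndarray:
    """Stand-in Wav-VAE decoder: latent frames -> waveform samples."""
    n_samples = int(round(latents.shape[0] / LATENT_RATE * SAMPLE_RATE))
    return np.zeros(n_samples)

# Ten seconds of audio collapses to roughly 117 latent frames -- this short
# sequence, not the 240,000 raw samples, is what the DiT has to model.
latents = vae_encode(np.zeros(10 * SAMPLE_RATE))
print(latents.shape)  # (117, 64)
```

The DiT denoises sequences shaped like `latents`; the decoder then maps the result straight back to audio, with no spectrogram or separate vocoder in between.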
Two Major Inference Optimizations: Saving Audio from Clipping and Distortion
Beyond architectural simplification, the LongCat team put significant effort into the inference algorithms of the diffusion model. They identified two long-standing hidden issues in the generation process and provided elegant solutions.
The first pain point is the “train-inference mismatch.” When given an audio prompt for voice cloning, the diffusion model’s predictions for the prompt region during inference gradually drift from the true trajectory as the number of sampling steps grows, and over time the synthesized voice becomes unnatural. To correct this, the team adopted a forced-override strategy: at each inference step, the prompt region of the latent sequence is overwritten with the prompt’s ground-truth latents noised to the current timestep. This small change stabilizes the model’s sampling trajectory.
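The override is easy to picture in code. The sketch below is an assumed, simplified sampler (linear flow-matching-style noising, a toy denoising step), not the official implementation; it only shows where in the loop the prompt region gets pinned.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_to_t(x0, eps, t):
    # Linear interpolation between clean latents and noise; the real noise
    # schedule used by LongCat-AudioDiT may differ.
    return (1.0 - t) * x0 + t * eps

def sample(model_step, prompt_latents, total_frames, dim, n_steps=10):
    n_prompt = prompt_latents.shape[0]
    eps = rng.standard_normal((total_frames, dim))
    x = eps.copy()
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        # Forced override: replace the prompt region with its ground-truth
        # latents re-noised to the current timestep, pinning the trajectory
        # so only the target region is actually generated.
        x[:n_prompt] = noise_to_t(prompt_latents, eps[:n_prompt], t)
        x = model_step(x, t)
    x[:n_prompt] = prompt_latents  # the prompt region is known exactly at t=0
    return x

# Toy "denoiser": shrinks the state each step, just to make the loop runnable.
toy_step = lambda x, t: x * (1 - 1.0 / (t * 10 + 1))
out = sample(toy_step, np.zeros((20, 64)), total_frames=100, dim=64)
print(out.shape)  # (100, 64)
```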
The second innovation is the replacement of traditional Classifier-Free Guidance (CFG). While CFG is effective at improving generation quality, raising the guidance scale even modestly often introduces “oversaturation” artifacts and audible, clipping-like distortion.
To solve this, they introduced Adaptive Projection Guidance (APG). APG decomposes the guidance signal and suppresses the component parallel to the model’s prediction, the part most associated with oversaturation and distortion. This significantly enhances the naturalness of synthesized speech, making the overall listening experience smoother and more pleasant.
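A minimal sketch of the general projection-guidance idea makes the decomposition concrete. This follows the generic APG recipe rather than LongCat’s exact variant: split the CFG update into components parallel and orthogonal to the conditional prediction, then keep only a fraction `eta` of the parallel part.

```python
import numpy as np

def cfg(cond, uncond, scale):
    """Standard classifier-free guidance, for reference."""
    return uncond + scale * (cond - uncond)

def apg(cond, uncond, scale, eta=0.0):
    """Projection guidance sketch: suppress the part of the guidance update
    that is parallel to the conditional prediction (the oversaturating part),
    keep the orthogonal part in full."""
    diff = (cond - uncond).ravel()
    axis = cond.ravel()
    parallel = (diff @ axis) / (axis @ axis + 1e-12) * axis
    orthogonal = diff - parallel
    update = (orthogonal + eta * parallel).reshape(cond.shape)
    return cond + (scale - 1.0) * update

rng = np.random.default_rng(0)
c, u = rng.standard_normal((2, 8))
# With eta=1 the parallel part is fully kept, and APG reduces to plain CFG.
print(np.allclose(apg(c, u, 3.0, eta=1.0), cfg(c, u, 3.0)))  # True
```

With `eta` near zero, large guidance scales push the sample mostly in directions orthogonal to the prediction, which is why the saturation-style distortion of high-scale CFG is avoided.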
Surprising Experimental Results: Better Encoders Don’t Always Mean Better Output?
For many engineers, intuition suggests that “finer compression leads to better final generation quality.” However, while conducting ablation experiments, the LongCat team discovered an extremely counter-intuitive phenomenon.
Experimental data showed that as the reconstruction fidelity of the Wav-VAE increased (achieved by significantly raising the dimensionality of the latent space), the generation quality of the downstream TTS model actually decreased rather than improved. An overly large latent dimensionality appears to impose too heavy a modeling burden on the diffusion model. This is a critical insight: pushing a single component to its limit does not necessarily benefit the overall system.
After repeated testing, the team finally found a perfect “sweet spot.” They set the latent space to 64 dimensions with a frame rate of 11.72 Hz. This combination achieved the optimal balance between computational efficiency and audio quality.
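Some back-of-the-envelope arithmetic puts those numbers in perspective. The 24 kHz source sample rate below is an assumption for illustration; the article does not state it.

```python
sample_rate = 24_000   # assumed source audio rate
latent_rate = 11.72    # latent frames per second
latent_dim = 64        # values per latent frame

samples_per_frame = sample_rate / latent_rate   # raw samples each latent frame summarizes
values_per_second = latent_rate * latent_dim    # floats the DiT must model per second

print(round(samples_per_frame))  # 2048
print(round(values_per_second))  # 750
```

Each latent frame stands in for roughly 2,048 raw samples, and the DiT models about 750 values per second instead of 24,000, a roughly 32x reduction in the sequence payload under these assumptions.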
They also showed unique ingenuity in handling multilingual text. To seamlessly support both Chinese and English, the team chose UMT5 as the text encoder. Interestingly, they found that simply using the hidden state of the final layer led to a serious loss of underlying phonetic spelling details, resulting in a significant drop in the clarity of synthesized speech. Therefore, they cleverly summed the original word embedding values with the final layer’s hidden state. This effectively compensated for the low-level speech features, making pronunciation crystal clear.
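The fix amounts to a one-line residual connection from the embedding layer to the encoder output. The sketch below is illustrative only (a toy table and a stand-in encoder), not the actual LongCat code.

```python
import numpy as np

def text_features(token_ids, embedding_table, encoder):
    """Sum raw token embeddings with the encoder's final hidden states, so
    low-level spelling cues survive the deep encoder stack."""
    embeds = embedding_table[token_ids]   # (seq_len, d) raw word embeddings
    hidden = encoder(embeds)              # (seq_len, d) final-layer states
    return embeds + hidden                # element-wise sum

rng = np.random.default_rng(0)
table = rng.standard_normal((100, 16))   # toy vocabulary of 100 tokens
deep_encoder = lambda x: np.tanh(x)      # stand-in for the UMT5 encoder stack
feats = text_features(np.array([3, 7, 42]), table, deep_encoder)
print(feats.shape)  # (3, 16)
```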
Impressive Evaluation Data and Developer-Friendly Open-Source Resources
With all these technical details, how does the model actually perform?
The results are outstanding. LongCat-AudioDiT-3.5B achieved a speaker similarity of 0.818 in the Seed-ZH (Chinese) test set and a remarkable score of 0.797 in the Seed-Hard test set. This not only surpasses many closed-source commercial models but also sets a new standard for the open-source community.
For the global developer community, the most exciting news is the full openness of resources. The Meituan team has completely open-sourced the code and model weights, including a 1B version suitable for lightweight applications and a 3.5B version for those seeking ultimate quality. Better yet, all resources are released under the extremely friendly MIT license, allowing anyone to freely use and modify them.
You can go directly to the LongCat-AudioDiT HuggingFace page to download the required weights. For a look at the full architecture, the LongCat-AudioDiT GitHub project page also provides detailed documentation and scripts.
If you want to quickly implement this in your own environment, the official Python API is very intuitive. With just a few lines of code, you can easily load the model and start generating speech:
from audiodit import AudioDiTModel
# Load the 1B model and enable fp16 inference to save memory
model = AudioDiTModel.from_pretrained("meituan-longcat/LongCat-AudioDiT-1B").to("cuda")
model.vae.to_half()
# Now you can pass in text and prompt audio to start your speech generation task
FAQ for Developers
To help everyone get started faster, here are answers to some highly discussed technical questions in the community, based on the research paper.
Q: Why not use the popular ByT5 as the text encoder for multilingual processing? A: While ByT5 supports many languages, it uses byte-level tokenization. This leads to exceptionally long sequences for languages like Chinese, which not only slows down computation but also creates difficulties in training alignment. UMT5 uses subword tokenization, resulting in more reasonable sequence lengths that perfectly fit the practical needs of this architecture.
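The byte-length blow-up behind this answer is easy to verify: every CJK character costs three bytes in UTF-8, so a byte-level tokenizer sees a sequence roughly three times longer than the character count before any modeling even begins.

```python
text = "语音合成技术"                   # six Chinese characters ("speech synthesis technology")
print(len(text))                       # 6 characters
print(len(text.encode("utf-8")))       # 18 bytes -> 18 byte-level tokens for ByT5
```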
Q: Can a standard consumer-grade graphics card run this model? A: Absolutely. This is why the official release includes two versions. If hardware resources are limited, it is recommended to use the 1B parameter version with half-precision (fp16) operations, which can run smoothly on most modern consumer GPUs. If you are pursuing commercial-grade audio quality, you may then consider using server resources to run the 3.5B version.
Q: Does the REPA module used in the model directly help with final audio quality? A: According to the official experimental observations, the REPA (Representation Alignment) module does not directly improve the synthesized audio quality. However, it plays another critical role: it significantly accelerates the convergence speed in the early stages of training. This can save considerable computational cost and time for developers who want to fine-tune or train from scratch.
Summary and Next Technical Steps
The emergence of LongCat-AudioDiT strongly proves that waveform-level latent modeling indeed holds greater potential than traditional intermediate features. It uses the purest architecture to solve the complexity issues that have long plagued the field of speech synthesis.
Through the official announcement on X, we can glimpse the team’s future ambitions. They plan to introduce reinforcement learning (RLHF for audio) that doesn’t rely on timeline correspondence to further push the limits of naturalness in speech generation. Meanwhile, to meet the massive demand for real-time applications, significantly accelerating inference speed through knowledge distillation has already been included in the development roadmap.
What other surprises will future speech generation technology bring? Let’s wait and see.