The Strongest Competitor to GPT-4o Audio? StepFun Open-Sources Step-Audio 2 mini, with Full Performance Data Revealed!

The world of AI voice models has welcomed another heavyweight contender. Step-Audio 2 mini, the latest open-source end-to-end speech model from StepFun, has not only taken the top open-source spot on several international benchmarks but has also surpassed the highly anticipated GPT-4o Audio on some key metrics. This article takes a close look at what makes the model so capable and the innovative technology behind it.


The AI scene has been lively lately. Just as the big players finished flexing their muscles, a startup called “StepFun” quietly made a major move: it officially open-sourced its latest end-to-end speech model, Step-Audio 2 mini.

You might be thinking, another voice model? Is there anything special about it?

To be honest, this time really is different. Step-Audio 2 mini is not just “another” model: it has achieved state-of-the-art (SOTA) results on multiple authoritative international benchmarks, causing a considerable stir in the open-source community. It integrates audio understanding, reasoning, and generation into a single unified architecture, making it an attractive option for everything from real-time speech translation to fine-grained emotion analysis.

Not just “understanding,” but also “being able to chat”

A good voice model does far more than convert sound into text. It needs to understand the subtext, tone, and emotion behind a conversation, and that is precisely where Step-Audio 2 mini excels.

On MMAU, a benchmark that measures multimodal audio understanding, Step-Audio 2 mini scored 73.2, securing its position as the top open-source voice model.

Even more interesting is its performance on URO Bench, a test designed specifically to evaluate spoken dialogue. On both the basic track, which simulates everyday conversation, and the pro track, which is full of specialized terminology, Step-Audio 2 mini achieved the highest scores among open-source models. In other words, it can not only understand what you say but also hold a logical, in-depth conversation like a real person.

Let’s look at the data directly and compare its performance with other well-known models:

| Model | MMAU (All) | URO Bench (EN basic) | URO Bench (ZH basic) | URO Bench (EN pro) | URO Bench (ZH pro) | CoVoST 2 (ZH-EN) | CVSS (ZH-EN) | StepEval-Audio-Paralinguistic (All) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Open-Source LALMs* | | | | | | | | |
| Step-Audio 2 mini | 73.2 | 74.4 | 77.8 | 61.3 | 69.6 | 39.3 | 29.1 | 80.0 |
| Qwen-Omni | 71.5 | 70.6 | 69.0 | 51.0 | 59.1 | 35.4 | 15.4 | 44.2 |
| Kimi-Audio | 69.6 | 60.0 | 73.6 | 49.8 | 66.1 | / | / | 49.6 |
| *Proprietary LALMs* | | | | | | | | |
| GPT-4o Audio | 58.1 | 84.5 | 78.6 | 67.5 | 67.1 | 29.6 | 23.7 | 43.5 |
| Step-Audio 2 | 78.0 | 83.9 | 83.3 | 66.1 | 68.3 | 39.3 | 30.9 | 83.1 |

As can be clearly seen from the table, Step-Audio 2 mini even surpasses top closed-source models like GPT-4o Audio in comprehensive understanding (MMAU) and Chinese-English translation (ZH-EN) tasks.

Proficient in translation and recognition, the data speaks for itself

Beyond its excellent dialogue capabilities, Step-Audio 2 mini also holds its own on traditional automatic speech recognition (ASR) and translation tasks.

On CoVoST 2 and CVSS, the authoritative benchmarks for Chinese-English speech translation, it achieved scores of 39.3 and 29.1 (BLEU) respectively, once again leading a field of competitors that includes GPT-4o Audio.
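For context, scores on these benchmarks are BLEU-based: a text-overlap measure between the system's translation and a human reference. A minimal sketch of how such a score is computed with the sacrebleu library, using invented example sentences:

```python
# Minimal corpus-level BLEU with the sacrebleu library
# (pip install sacrebleu). The sentence pairs are invented
# examples, not actual Step-Audio 2 mini outputs.
import sacrebleu

hypotheses = [
    "the weather in beijing is sunny today",
    "please book a table for two people",
]
# One reference stream, aligned one-to-one with the hypotheses.
references = [[
    "the weather in beijing is clear today",
    "please reserve a table for two people",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```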

And on speech recognition, the most fundamental test of a voice model, its performance is even more impressive. On accuracy metrics (the lower the error rate, the better):

  • Chinese recognition: The character error rate (CER) on the open-source Chinese test set is as low as 3.19%.
  • English recognition: The word error rate (WER) on the open-source English test set is 3.50%.

These two results are on average more than 15% better than similar open-source models. To put it bluntly, it hears more accurately and is less prone to errors. What’s more, it also has good adaptability to dialects and accents from different regions, which is crucial for developing applications for a broad market.
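For readers who want to check numbers like these themselves: both CER and WER are edit-distance metrics, the minimum number of substitutions, insertions, and deletions needed to turn the model's output into the reference transcript, divided by the reference length. A self-contained sketch:

```python
# Word/character error rate via Levenshtein edit distance.
# Real evaluations normally normalize text (casing, punctuation,
# number formats) before scoring; this sketch skips that step.

def edit_distance(ref, hyp):
    """Minimum substitutions + insertions + deletions turning hyp into ref."""
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,              # skip a reference token
                d[j - 1] + 1,          # skip a hypothesis token
                prev_diag + (r != h),  # substitution (free on a match)
            )
    return d[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over words / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: same idea, computed over characters."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the") in 6 words.
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # 33.33%
```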

All figures below are error rates in percent (CER/WER; lower is better).

| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
| --- | --- | --- | --- | --- | --- | --- | --- |
| English | Common Voice | 9.20 | 2.71 | 7.83 | 8.33 | 5.95 | 6.76 |
| English | FLEURS English | 7.22 | 9.30 | 4.47 | 5.05 | 3.03 | 3.05 |
| English | LibriSpeech clean | 2.92 | 1.75 | 1.49 | 2.93 | 1.17 | 1.33 |
| English | LibriSpeech other | 5.32 | 4.23 | 2.91 | 5.07 | 2.42 | 2.86 |
| English | Average | 6.17 | 4.50 | 4.18 | 5.35 | 3.14 | 3.50 |
| Chinese | AISHELL | 0.98 | 3.52 | 0.64 | 1.17 | 0.63 | 0.78 |
| Chinese | AISHELL-2 | 3.10 | 4.26 | 2.67 | 2.40 | 2.10 | 2.16 |
| Chinese | FLEURS Chinese | 2.92 | 2.62 | 2.91 | 7.01 | 2.68 | 2.53 |
| Chinese | KeSpeech phase1 | 6.48 | 26.80 | 5.11 | 6.45 | 3.63 | 3.97 |
| Chinese | WenetSpeech meeting | 4.90 | 31.40 | 5.21 | 6.61 | 4.75 | 4.87 |
| Chinese | Average | 3.81 | 14.05 | 3.75 | 4.81 | 3.08 | 3.19 |
| Multilingual | FLEURS Arabian | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| Multilingual | Common Voice yue | 9.20 | 11.10 | 38.90 | 7.89 | 7.90 | 8.32 |
| Multilingual | FLEURS Japanese | N/A | 3.27 | N/A | 10.49 | 3.18 | 4.67 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| In-house | Guangdong accent | 4.99 | 7.83 | 3.76 | 4.03 | 3.81 | 4.44 |
| In-house | Guangxi accent | 3.37 | 7.09 | 4.29 | 3.35 | 4.11 | 3.51 |
| In-house | Shanxi accent | 20.26 | 55.03 | 34.71 | 25.95 | 12.44 | 15.60 |
| In-house | Sichuan dialect | 3.01 | 32.85 | 5.26 | 5.61 | 4.35 | 4.57 |
| In-house | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |
| In-house | Average | 14.66 | 40.49 | 25.52 | 19.40 | 8.85 | 9.85 |

Unveiling the secret sauce: abandoning the traditional three-stage architecture

The success of Step-Audio 2 mini is largely due to its innovative architectural design.

The traditional speech processing flow is like a production line, requiring three independent steps:

  1. ASR (Automatic Speech Recognition): Convert audio to text.
  2. LLM (Large Language Model): Understand the text and generate a text response.
  3. TTS (Text-to-Speech): Convert the text response back to audio.

This flow is not only cumbersome, but every hop also adds latency and loses information: the moment audio is flattened into plain text, cues such as tone, emotion, and prosody are discarded.

Step-Audio 2 mini breaks out of this three-stage framework and achieves true end-to-end processing: it generates an audio response directly from the raw audio input in a single pass. It is like merging three independent factories into one highly automated smart plant; the architecture is simpler, the response is faster, and the interaction feels far more fluid.
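Conceptually, the contrast looks something like the toy sketch below; every function in it is a stand-in stub, not the actual Step-Audio 2 API:

```python
# A toy contrast between the two designs. Every function here is a
# stand-in stub, not the actual Step-Audio 2 API.

def asr_transcribe(audio: bytes) -> str:
    return "what's the weather like"          # stub: speech -> text

def llm_generate(text: str) -> str:
    return "It looks sunny today."            # stub: text -> text

def tts_synthesize(text: str) -> bytes:
    return text.encode()                      # stub: text -> speech

def cascaded_pipeline(audio: bytes) -> bytes:
    """Traditional three-stage flow: every hop adds latency, and tone,
    emotion, and prosody are lost once the audio becomes plain text."""
    return tts_synthesize(llm_generate(asr_transcribe(audio)))

def end_to_end(audio: bytes) -> bytes:
    """Step-Audio-2-style flow: one model maps input audio straight to
    output audio in a single pass, keeping paralinguistic cues intact."""
    return b"<audio generated in one forward pass>"  # stub

print(cascaded_pipeline(b"<mic input>"))
print(end_to_end(b"<mic input>"))
```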

In addition, the model jointly optimizes Chain-of-Thought (CoT) reasoning with reinforcement learning. This lets it think through information step by step, much like a person, so it can better grasp nuances of tone and emotion and produce more natural, appropriate responses.
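What might such a reasoning trace look like? Here is an invented illustration; the model's real internal format may differ:

```python
# An invented illustration of chain-of-thought reasoning over audio:
# the model first notes acoustic cues, reasons about them in text,
# and only then forms the spoken reply. This is not the model's
# actual internal format.
cot_trace = {
    "acoustic_cues": "rising pitch, faster tempo, slight laughter",
    "reasoning": (
        "Rising pitch and a quickening tempo usually signal excitement "
        "rather than anger, and the laughter supports that reading."
    ),
    "reply": "You sound really excited! Did something good happen?",
}
print(cot_trace["reply"])
```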

Solving AI hallucinations? It can also surf the Internet for information!

A common problem with large language models is “hallucination”: confidently talking nonsense with a straight face. This happens because their knowledge is frozen at training time.

Step-Audio 2 mini tackles this problem with a feature called “audio knowledge enhancement.” When it encounters a question beyond its knowledge, it can call external tools (such as a search engine) to search the web in real time, retrieve the most accurate and up-to-date information, and then answer you in a natural voice.
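Under the hood this is essentially a tool-calling loop. The sketch below is a hypothetical illustration; the function names and tool-call protocol are invented, not the documented Step-Audio 2 interface:

```python
# Hypothetical sketch of the "retrieve, then answer" loop. The function
# names and the tool-call protocol are invented for illustration; they
# are not the documented Step-Audio 2 interface.

def model_generate(audio_question: bytes, context=None) -> dict:
    # Stub: a real speech LM would decide whether to answer directly
    # or to emit a tool call for information it does not have.
    if context is None:
        return {"tool_call": "web_search", "query": "today's exchange rate"}
    return {"tool_call": None, "audio_reply": b"<spoken, grounded answer>"}

def web_search(query: str) -> list:
    return [f"fresh result for: {query}"]     # stub search engine

def answer_with_retrieval(audio_question: bytes) -> bytes:
    output = model_generate(audio_question)
    if output["tool_call"] == "web_search":
        results = web_search(output["query"])
        # Feed the results back so the spoken answer is grounded in
        # up-to-date information instead of stale training data.
        output = model_generate(audio_question, context=results)
    return output["audio_reply"]

print(answer_with_retrieval(b"<question audio>"))
```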

This innovation greatly enhances the model’s practicality and reliability, and also opens up broader avenues for its application in various real-world scenarios.

Try it now and get involved

The greatest appeal of Step-Audio 2 mini as an open-source model is that its doors are open to everyone. Whether you are a developer, researcher, or AI enthusiast, you can experience its capabilities for yourself.

StepFun has published the model's code and weights on major platforms; everyone is welcome to try it out, contribute code, and help push voice AI technology forward together.
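As a starting point, the weights can be fetched from Hugging Face with the huggingface_hub library. The repo id below is believed correct at the time of writing; the inference code itself lives in StepFun's official repository, so follow its README to actually run the model:

```python
# Fetch the open-sourced weights from Hugging Face
# (pip install huggingface_hub). The repo id is believed correct at
# the time of writing; the inference code itself lives in StepFun's
# official repository, so follow its README to actually run the model.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="stepfun-ai/Step-Audio-2-mini",
    local_dir="./Step-Audio-2-mini",
)
print(f"Model files downloaded to: {local_dir}")
```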

In summary, Step-Audio 2 mini not only gives the open-source community an excellent, high-performing tool, but also proves once again that innovation and openness are the core driving forces of progress in the AI race.
