The Strongest Competitor to GPT-4o Audio? StepFun Open-Sources Step-Audio 2 mini, with Full Performance Data Revealed!

The world of AI voice models has welcomed another heavyweight contender. Step-Audio 2 mini, the latest open-source end-to-end speech model from StepFun, has not only taken the top open-source spot on several international benchmarks but has also surpassed the highly anticipated GPT-4o Audio on some key metrics. This article takes a close look at what makes the model so capable and the innovative technology behind it.


The AI scene has been lively lately. Just as the big players finished flexing their muscles, a startup called “StepFun” quietly made a major move: it officially open-sourced its latest end-to-end speech model, Step-Audio 2 mini.

You might be thinking, another voice model? Is there anything special about it?

To be honest, this time really is different. Step-Audio 2 mini is not just “another” model: it has achieved state-of-the-art (SOTA) results on multiple authoritative international benchmarks, causing a considerable stir in the open-source community. It integrates audio understanding, reasoning, and generation into a single unified architecture, making it an attractive option for everything from real-time speech translation to fine-grained emotion analysis.

Not just “understanding,” but also “being able to chat”

A good voice model does far more than convert sound into text. It needs to understand the subtext, tone, and emotion behind a conversation, and that is precisely where Step-Audio 2 mini excels.

On MMAU, a benchmark that measures multimodal audio understanding, Step-Audio 2 mini scored 73.2, securing its position as the top open-source voice model.

Even more interesting is its performance on URO Bench, a test designed specifically to evaluate spoken dialogue. On both the basic track, which simulates everyday conversation, and the pro track, which is full of specialized terminology, Step-Audio 2 mini achieved the highest scores among open-source models. In other words, it can not only understand what you say but also hold a logical, in-depth conversation like a real person.

Let’s look at the data directly and compare its performance with other well-known models:

| Model | MMAU (All) | URO Bench (EN basic) | URO Bench (ZH basic) | URO Bench (EN pro) | URO Bench (ZH pro) | CoVoST 2 (ZH-EN) | CVSS (ZH-EN) | StepEval-Audio-Paralinguistic (All) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Open-Source LALMs* | | | | | | | | |
| Step-Audio 2 mini | 73.2 | 74.4 | 77.8 | 61.3 | 69.6 | 39.3 | 29.1 | 80.0 |
| Qwen-Omni | 71.5 | 70.6 | 69.0 | 51.0 | 59.1 | 35.4 | 15.4 | 44.2 |
| Kimi-Audio | 69.6 | 60.0 | 73.6 | 49.8 | 66.1 | / | / | 49.6 |
| *Proprietary LALMs* | | | | | | | | |
| GPT-4o Audio | 58.1 | 84.5 | 78.6 | 67.5 | 67.1 | 29.6 | 23.7 | 43.5 |
| Step-Audio 2 | 78.0 | 83.9 | 83.3 | 66.1 | 68.3 | 39.3 | 30.9 | 83.1 |

As can be clearly seen from the table, Step-Audio 2 mini even surpasses top closed-source models like GPT-4o Audio in comprehensive understanding (MMAU) and Chinese-English translation (ZH-EN) tasks.

Proficient in translation and recognition, the data speaks for itself

Beyond its excellent dialogue capabilities, Step-Audio 2 mini also holds its own on traditional automatic speech recognition (ASR) and translation tasks.

On CoVoST 2 and CVSS, the authoritative benchmarks for Chinese-English speech translation, it achieved scores of 39.3 and 29.1 (BLEU) respectively, once again leading a field of competitors that includes GPT-4o Audio.
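For context, scores on these benchmarks are BLEU-based: a text-overlap measure between the system's translation and a human reference. A minimal sketch of how such a score is computed with the sacrebleu library, using invented example sentences:

```python
# Minimal corpus-level BLEU with the sacrebleu library
# (pip install sacrebleu). The sentence pairs are invented
# examples, not actual Step-Audio 2 mini outputs.
import sacrebleu

hypotheses = [
    "the weather in beijing is sunny today",
    "please book a table for two people",
]
# One reference stream, aligned one-to-one with the hypotheses.
references = [[
    "the weather in beijing is clear today",
    "please reserve a table for two people",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```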

And on speech recognition, the most fundamental test of a voice model, its performance is even more impressive. On accuracy metrics (the lower the error rate, the better):

  • Chinese recognition: The character error rate (CER) on the open-source Chinese test set is as low as 3.19%.
  • English recognition: The word error rate (WER) on the open-source English test set is 3.50%.

These two results are on average more than 15% better than similar open-source models. To put it bluntly, it hears more accurately and is less prone to errors. What’s more, it also has good adaptability to dialects and accents from different regions, which is crucial for developing applications for a broad market.
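For readers who want to check numbers like these themselves: both CER and WER are edit-distance metrics, the minimum number of substitutions, insertions, and deletions needed to turn the model's output into the reference transcript, divided by the reference length. A self-contained sketch:

```python
# Word/character error rate via Levenshtein edit distance.
# Real evaluations normally normalize text (casing, punctuation,
# number formats) before scoring; this sketch skips that step.

def edit_distance(ref, hyp):
    """Minimum substitutions + insertions + deletions turning hyp into ref."""
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,              # skip a reference token
                d[j - 1] + 1,          # skip a hypothesis token
                prev_diag + (r != h),  # substitution (free on a match)
            )
    return d[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over words / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: same idea, computed over characters."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the") in 6 words.
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # 33.33%
```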

All figures below are error rates in percent (CER/WER; lower is better).

| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
| --- | --- | --- | --- | --- | --- | --- | --- |
| English | Common Voice | 9.20 | 2.71 | 7.83 | 8.33 | 5.95 | 6.76 |
| English | FLEURS English | 7.22 | 9.30 | 4.47 | 5.05 | 3.03 | 3.05 |
| English | LibriSpeech clean | 2.92 | 1.75 | 1.49 | 2.93 | 1.17 | 1.33 |
| English | LibriSpeech other | 5.32 | 4.23 | 2.91 | 5.07 | 2.42 | 2.86 |
| English | Average | 6.17 | 4.50 | 4.18 | 5.35 | 3.14 | 3.50 |
| Chinese | AISHELL | 0.98 | 3.52 | 0.64 | 1.17 | 0.63 | 0.78 |
| Chinese | AISHELL-2 | 3.10 | 4.26 | 2.67 | 2.40 | 2.10 | 2.16 |
| Chinese | FLEURS Chinese | 2.92 | 2.62 | 2.91 | 7.01 | 2.68 | 2.53 |
| Chinese | KeSpeech phase1 | 6.48 | 26.80 | 5.11 | 6.45 | 3.63 | 3.97 |
| Chinese | WenetSpeech meeting | 4.90 | 31.40 | 5.21 | 6.61 | 4.75 | 4.87 |
| Chinese | Average | 3.81 | 14.05 | 3.75 | 4.81 | 3.08 | 3.19 |
| Multilingual | FLEURS Arabian | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| Multilingual | Common Voice yue | 9.20 | 11.10 | 38.90 | 7.89 | 7.90 | 8.32 |
| Multilingual | FLEURS Japanese | N/A | 3.27 | N/A | 10.49 | 3.18 | 4.67 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| In-house | Guangdong accent | 4.99 | 7.83 | 3.76 | 4.03 | 3.81 | 4.44 |
| In-house | Guangxi accent | 3.37 | 7.09 | 4.29 | 3.35 | 4.11 | 3.51 |
| In-house | Shanxi accent | 20.26 | 55.03 | 34.71 | 25.95 | 12.44 | 15.60 |
| In-house | Sichuan dialect | 3.01 | 32.85 | 5.26 | 5.61 | 4.35 | 4.57 |
| In-house | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |
| In-house | Average | 14.66 | 40.49 | 25.52 | 19.40 | 8.85 | 9.85 |

Unveiling the secret sauce: abandoning the traditional three-stage architecture

The success of Step-Audio 2 mini is largely due to its innovative architectural design.

The traditional speech processing flow is like a production line, requiring three independent steps:

  1. ASR (Automatic Speech Recognition): Convert audio to text.
  2. LLM (Large Language Model): Understand the text and generate a text response.
  3. TTS (Text-to-Speech): Convert the text response back to audio.

This flow is not only cumbersome, but every hop also adds latency and loses information: the moment audio is flattened into plain text, cues such as tone, emotion, and prosody are discarded.

Step-Audio 2 mini breaks out of this three-stage framework and achieves true end-to-end processing: it generates an audio response directly from the raw audio input in a single pass. It is like merging three independent factories into one highly automated smart plant; the architecture is simpler, the response is faster, and the interaction feels far more fluid.
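Conceptually, the contrast looks something like the toy sketch below; every function in it is a stand-in stub, not the actual Step-Audio 2 API:

```python
# A toy contrast between the two designs. Every function here is a
# stand-in stub, not the actual Step-Audio 2 API.

def asr_transcribe(audio: bytes) -> str:
    return "what's the weather like"          # stub: speech -> text

def llm_generate(text: str) -> str:
    return "It looks sunny today."            # stub: text -> text

def tts_synthesize(text: str) -> bytes:
    return text.encode()                      # stub: text -> speech

def cascaded_pipeline(audio: bytes) -> bytes:
    """Traditional three-stage flow: every hop adds latency, and tone,
    emotion, and prosody are lost once the audio becomes plain text."""
    return tts_synthesize(llm_generate(asr_transcribe(audio)))

def end_to_end(audio: bytes) -> bytes:
    """Step-Audio-2-style flow: one model maps input audio straight to
    output audio in a single pass, keeping paralinguistic cues intact."""
    return b"<audio generated in one forward pass>"  # stub

print(cascaded_pipeline(b"<mic input>"))
print(end_to_end(b"<mic input>"))
```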

In addition, the model jointly optimizes Chain-of-Thought (CoT) reasoning with reinforcement learning. This lets it think through information step by step, much like a person, so it can better grasp nuances of tone and emotion and produce more natural, appropriate responses.
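What might such a reasoning trace look like? Here is an invented illustration; the model's real internal format may differ:

```python
# An invented illustration of chain-of-thought reasoning over audio:
# the model first notes acoustic cues, reasons about them in text,
# and only then forms the spoken reply. This is not the model's
# actual internal format.
cot_trace = {
    "acoustic_cues": "rising pitch, faster tempo, slight laughter",
    "reasoning": (
        "Rising pitch and a quickening tempo usually signal excitement "
        "rather than anger, and the laughter supports that reading."
    ),
    "reply": "You sound really excited! Did something good happen?",
}
print(cot_trace["reply"])
```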

Solving AI hallucinations? It can also surf the Internet for information!

A common problem with large language models is “hallucination”: confidently talking nonsense with a straight face. This happens because their knowledge is frozen at training time.

Step-Audio 2 mini tackles this problem with a feature called “audio knowledge enhancement.” When it encounters a question beyond its knowledge, it can call external tools (such as a search engine) to search the web in real time, retrieve the most accurate and up-to-date information, and then answer you in a natural voice.
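Under the hood this is essentially a tool-calling loop. The sketch below is a hypothetical illustration; the function names and tool-call protocol are invented, not the documented Step-Audio 2 interface:

```python
# Hypothetical sketch of the "retrieve, then answer" loop. The function
# names and the tool-call protocol are invented for illustration; they
# are not the documented Step-Audio 2 interface.

def model_generate(audio_question: bytes, context=None) -> dict:
    # Stub: a real speech LM would decide whether to answer directly
    # or to emit a tool call for information it does not have.
    if context is None:
        return {"tool_call": "web_search", "query": "today's exchange rate"}
    return {"tool_call": None, "audio_reply": b"<spoken, grounded answer>"}

def web_search(query: str) -> list:
    return [f"fresh result for: {query}"]     # stub search engine

def answer_with_retrieval(audio_question: bytes) -> bytes:
    output = model_generate(audio_question)
    if output["tool_call"] == "web_search":
        results = web_search(output["query"])
        # Feed the results back so the spoken answer is grounded in
        # up-to-date information instead of stale training data.
        output = model_generate(audio_question, context=results)
    return output["audio_reply"]

print(answer_with_retrieval(b"<question audio>"))
```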

This innovation greatly enhances the model’s practicality and reliability, and also opens up broader avenues for its application in various real-world scenarios.

Try it now and get involved

The greatest appeal of Step-Audio 2 mini as an open-source model is that its doors are open to everyone. Whether you are a developer, researcher, or AI enthusiast, you can experience its capabilities for yourself.

StepFun has published the model's code and weights on major platforms; everyone is welcome to try it out, contribute code, and help push voice AI technology forward together.
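As a starting point, the weights can be fetched from Hugging Face with the huggingface_hub library. The repo id below is believed correct at the time of writing; the inference code itself lives in StepFun's official repository, so follow its README to actually run the model:

```python
# Fetch the open-sourced weights from Hugging Face
# (pip install huggingface_hub). The repo id is believed correct at
# the time of writing; the inference code itself lives in StepFun's
# official repository, so follow its README to actually run the model.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="stepfun-ai/Step-Audio-2-mini",
    local_dir="./Step-Audio-2-mini",
)
print(f"Model files downloaded to: {local_dir}")
```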

In summary, Step-Audio 2 mini not only gives the open-source community an excellent, high-performing tool, but also proves once again that innovation and openness are the core driving forces of progress in the AI race.
