The OpenMOSS team released MOSS-Transcribe-Diarize at the beginning of 2026: an end-to-end multimodal large language model that not only performs accurate speech transcription but also tackles the long-standing problems of recognizing multi-person overlapping dialogue and emotional speech. This article takes a deep look at how the technology surpasses GPT-4o and Gemini, and at its practical application in complex speech scenarios.
Have you ever had this experience? When reviewing a video-conference recording or organizing interview audio, the moment two or three people speak at once, the subtitle software starts "speaking gibberish" and produces a pile of unintelligible text. And when a speaker uses a dialect or gets emotional, the AI often simply waves the white flag.
This situation might soon become history.
Just on the first day of 2026, the OpenMOSS team at MOSI.AI released a new model named MOSS-Transcribe-Diarize. This is not just another speech recognition tool: it adopts a brand-new multimodal architecture and claims to understand, much like a human would, who is speaking and what is being said in noisy environments, even accurately capturing the emotion in a speaker's tone.
What is unique about this technology? Let’s take a closer look.
What is MOSS-Transcribe-Diarize?
Simply put, this is an “End-to-End” multimodal model specifically designed to handle complex speech transcription tasks.
Past speech processing systems usually split the job into two steps: transcription and speaker diarization (working out who spoke when). That is like asking one person to write down the words they hear and a second person to guess who said each sentence. This division of labor is prone to errors, especially when the conversation moves quickly.
MOSS-Transcribe-Diarize takes a different path. It adopts a unified audio-text multimodal architecture: the model projects the voice signals of multiple speakers directly into the feature space of a pre-trained large language model (LLM). While understanding the sound, it simultaneously performs semantic analysis, speaker attribution, and timestamp prediction.
All these tasks are completed within a single framework, which greatly improves its stability when dealing with complex dialogues. You can visit the Official HuggingFace Demo to experience its capabilities yourself.
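The "project audio into the LLM feature space" idea can be sketched in a few lines. To be clear, this is an illustrative sketch only, not the actual MOSS-Transcribe-Diarize architecture: the stand-in encoder, the random projection weights, and all dimensions here are assumptions, standing in for trained components.

```python
import numpy as np

# Hypothetical dimensions (assumptions, not from the paper)
AUDIO_DIM = 128   # audio-encoder output size per frame
LLM_DIM = 512     # LLM token-embedding size

rng = np.random.default_rng(0)

def audio_encoder(waveform: np.ndarray, frame: int = 160) -> np.ndarray:
    """Stand-in encoder: chop the waveform into frames and embed each one."""
    n = len(waveform) // frame
    frames = waveform[: n * frame].reshape(n, frame)
    W = rng.standard_normal((frame, AUDIO_DIM)) * 0.01  # stand-in for trained weights
    return frames @ W  # shape: (n_frames, AUDIO_DIM)

# A learned projection would map audio features into the LLM embedding space,
# so audio frames become "tokens" the LLM can attend to alongside text tokens.
W_proj = rng.standard_normal((AUDIO_DIM, LLM_DIM)) * 0.01

def project_to_llm_space(audio_feats: np.ndarray) -> np.ndarray:
    return audio_feats @ W_proj  # shape: (n_frames, LLM_DIM)

waveform = rng.standard_normal(16000)  # 1 second of fake 16 kHz audio
audio_tokens = project_to_llm_space(audio_encoder(waveform))
print(audio_tokens.shape)  # (100, 512)
```

Once audio frames live in the same space as text embeddings, a single decoder can emit transcription, speaker labels, and timestamps in one pass, which is what makes the end-to-end formulation possible.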
Solving “Talking Past Each Other”: Breakthrough in Multi-person Overlapping Dialogue
In real-world conversations, people rarely speak obediently in turns. Interruptions, talking over each other, and background noise are the norm. For traditional models, this is simply a nightmare.
The most impressive capability of MOSS-Transcribe-Diarize lies in its handling of Highly Overlapping Multi-speaker Dialogue.
In the official demo clip “Hua Qiang Buys Melons,” the dialogue rhythm between the two characters is extremely fast, with obvious voice overlap. The model not only accurately transcribed the speech into text but also precisely marked the time segment of each sentence (e.g., 00:01.08-00:02.96) and the corresponding speaker label (such as [S01], [S02]). This ability is undoubtedly a huge boon for generating meeting minutes, call analysis, or long video content processing.
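Output in that shape is easy to post-process. The snippet below parses lines of the form `[S01] 00:01.08-00:02.96 text` into structured segments; note that this exact line format is an assumption based on the labels shown in the demo, and the model's real output format may differ.

```python
import re

# Assumed line format (based on the demo's labels; the actual
# MOSS-Transcribe-Diarize output format may differ):
#   [S01] 00:01.08-00:02.96 spoken text...
LINE = re.compile(r"\[(S\d+)\]\s+(\d+):(\d+\.\d+)-(\d+):(\d+\.\d+)\s+(.*)")

def parse_line(line: str):
    """Turn one transcript line into a dict with speaker, times (s), and text."""
    m = LINE.match(line.strip())
    if not m:
        return None
    spk, m1, s1, m2, s2, text = m.groups()
    return {
        "speaker": spk,
        "start": int(m1) * 60 + float(s1),  # minutes:seconds -> seconds
        "end": int(m2) * 60 + float(s2),
        "text": text,
    }

seg = parse_line("[S01] 00:01.08-00:02.96 这瓜保熟吗?")
print(seg["speaker"], seg["start"], seg["end"])  # S01 1.08 2.96
```

With segments in this form, detecting overlapping speech is just a matter of checking whether one segment's start time falls before the previous segment's end time.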
Emotion and Dialect: Understanding the “Temperature” Behind Words
Language is not just a combination of words; tone, intonation, and local slang often carry more information.
The model performs remarkably well at capturing highly dynamic emotional speech. Whether it is a fierce quarrel (like the argument scene in "Tiny Times"), loud screaming, or crying, it segments the speech accurately. This was a blind spot for many past speech recognition systems, because pronunciation deforms severely when emotions run high.
In addition, it also demonstrates strong robustness for Regional Accents and Informal Slang recognition. This means that even if the speaker is not speaking standard broadcast Mandarin or mixes in internet slang, the model can still accurately understand and transcribe.
Friends who want to know more technical details can refer to their paper on Arxiv.
Challenging Extreme Speed: From “Sloth” to “Fast Mouth”
Human speaking speeds vary greatly. Sometimes we fire like a machine gun, and sometimes we are as slow as Flash the Sloth in “Zootopia.”
MOSS-Transcribe-Diarize demonstrates its ability to handle extreme speech-rate variations. In tests, it successfully transcribed sloth-slow sentences full of long pauses while also keeping up with rapid turn-taking. This suggests the model is not just "identifying words by sound" but genuinely tracking the flow of the dialogue.
Performance Showdown: Surpassing GPT-4o and Gemini?
The question everyone cares about most is definitely: How does it compare with top models on the market?
According to data charts released by MOSI.AI, on key metrics such as Character Error Rate (CER) and concatenated minimum-permutation Character Error Rate (cpCER), MOSS-Transcribe-Diarize outperforms Doubao, ElevenLabs, GPT-4o, as well as Gemini 2.5 Pro and Gemini 3 Pro.
Especially in the cpCER metric for handling multi-person mixed conversations, MOSS’s error rate is significantly lower than other competitors, which directly proves its advantage in complex scenarios. This data has extremely high reference value for professional users who need high-precision transcription. More detailed data can be viewed on the MOSI Official Website.
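For readers unfamiliar with these metrics: CER is the character-level edit distance between the model's output and the reference transcript, divided by the reference length. cpCER extends this to multi-speaker output by concatenating each speaker's transcript and scoring the best permutation of speaker assignments; the sketch below shows plain CER only.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (substitutions + insertions + deletions)
    via Levenshtein edit distance, divided by the reference length."""
    r, h = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance, one row at a time
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (rc != hc),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1] / max(len(r), 1)

print(cer("今天天气很好", "今天天汽很好"))  # one substitution out of 6 chars
```

A lower CER means fewer wrongly transcribed characters, which is why a visibly lower cpCER on overlapping dialogue is such a meaningful result for this model.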
FAQ
To help everyone get up to speed on this new technology, we have compiled answers to the most common questions:
Q1: What problem does MOSS-Transcribe-Diarize mainly solve?
It mainly solves the problem where traditional speech recognition models cannot accurately distinguish speakers and transcribe content when facing “multiple people speaking at the same time,” “noisy backgrounds,” or “strong emotions or accents.” It can simultaneously output precise text, speaker labels (who said it), and timestamps.
Q2: Is this model free for commercial use?
Current information shows that this model is released by MOSI.AI (OpenMOSS Team). For specific licensing terms, it is recommended to refer directly to instructions on their official website or GitHub page to confirm whether commercial use is allowed and relevant restrictions.
Q3: Which languages does it support?
Judging from the official demo, the model can already smoothly handle Chinese (including dialects), English, and Japanese. Considering its architecture based on Large Language Models (LLM), the possibility of expanding to more languages in the future is very high.
Q4: Where can I try this model?
The OpenMOSS team has provided an online Demo on HuggingFace for the public to experience. You can click here to try it out, upload your own audio files, or use default examples to test its effect.
The emergence of this technology marks another big step for AI in the field of speech understanding. It is no longer just coldly turning sound into text but starting to try to understand the context and flow of dialogue. For developers, creators, and even general users, this will bring a significant improvement in work efficiency.


