The competition in the AI world has reached a fever pitch! A benchmark testing platform called Design Arena is comprehensively examining the true capabilities of major AIs in fields such as programming, website building, and generating images, videos, and even audio through large-scale crowd voting. The latest leaderboard shows that Claude narrowly defeated GPT-5 in overall strength, while Midjourney is simply unmatched in the field of video generation, and OpenAI’s voice model has achieved a mythical 100% win rate. What industry trends does this list reveal? Who are the true kings of each field? Let’s find out.
Not Just an Arena, But an All-Powerful “AI Strength Detector”
You may have heard of Design Arena (https://www.designarena.ai), a platform that pits AI models against each other in design. But its ambitions go far beyond that. Today, Design Arena has evolved into a comprehensive benchmark testing platform covering multiple creative and technical fields. Through “blind test” voting by thousands of users, it reveals the true performance of major AI tools without the interference of marketing hype.
The platform's core mechanism is simple but extremely effective: give two AIs the same task anonymously, then have real people vote for the winner. The resulting ranking, based on the Elo rating system, reflects a model's actual performance on a specific task far better than any feature list.
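An Elo vote works like a chess match: the winner takes rating points from the loser, and the transfer is larger when the result is more surprising. Design Arena does not publish its exact parameters, so the K-factor of 32 below is an illustrative assumption; the update rule itself is the standard Elo formula. A minimal sketch:

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    # Expected score of A under the standard Elo logistic curve.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta

# Two near-equal models (e.g. 1362 vs 1361): the upset bonus is tiny,
# so the winner gains only slightly less than k/2 points.
a, b = elo_update(1362, 1361, a_won=True)
```

Over thousands of such votes the ratings converge toward each model's true head-to-head strength, which is why the battle count matters so much for rating stability.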
Now, let’s dive into the latest battle situation on the four core battlefields of Design Arena.
The Fiercest Frontline: Comprehensive AI Model Strength Compared (Models)
This is the earliest and most watched battlefield in Design Arena, mainly testing the performance of AI in comprehensive tasks such as code generation, UI design, and data visualization. The competition here can be described as a “battle of the gods,” with rankings changing rapidly.
| Rank | Model | Elo Rating | Record (W/L) | Win Rate | MoE | Battles | Organization | Time |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.1 (No Thinking) | 1362 | 293W / 111L | 71.8% | ±4.4% | 394 | Anthropic | 2m 4s |
| 2 | Claude Opus 4 (No Thinking) | 1362 | 1933W / 759L | 71.8% | ±1.7% | 2,692 | Anthropic | 1m 29s |
| 3 | GPT-5 (Minimal Reasoning) | 1361 | 268W / 106L | 71.7% | ±4.6% | 374 | OpenAI | 1m 59s |
| 4 | Claude Sonnet 4 (No Thinking) | 1342 | 2019W / 892L | 69.4% | ±1.7% | 2,911 | Anthropic | 1m 13s |
| 5 | DeepSeek-R1-0528 | 1339 | 1135W / 509L | 69.0% | ±2.2% | 1,644 | DeepSeek | 1m 17s |
Battle Analysis: The data shows Anthropic's Claude duo (Opus 4.1 and Opus 4) tied at 1362 Elo for the top spot, edging OpenAI's GPT-5 into third place. The top three ratings sit within a single point of each other and their win rates are nearly identical, showing that the frontier models are effectively on par here. Notably, Anthropic occupies multiple seats in the top ranks, underscoring its strength in code generation and logical reasoning.
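The Win Rate and Battles columns follow directly from each model's W/L record (battles = W + L, win rate = W / battles). Reproducing a few rows from the table above:

```python
# W/L records taken from the Models leaderboard above.
records = {
    "Claude Opus 4 (No Thinking)": (1933, 759),
    "Claude Sonnet 4 (No Thinking)": (2019, 892),
    "DeepSeek-R1-0528": (1135, 509),
}
for name, (wins, losses) in records.items():
    battles = wins + losses
    win_rate = 100 * wins / battles
    print(f"{name}: {battles} battles, {win_rate:.1f}% win rate")
```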
Not Just a Designer, But an Architect: AI Website Builder (Builders) Leaderboard
After watching the duel at the model level, let’s turn to a more practical field: AI Website Builders. These tools are not just for generating code snippets, but are AI agents that can directly build websites or applications based on instructions.
| Tool | Win Rate |
|---|---|
| new.website | 73.1% |
| Sana.new | 62.6% |
| Devin | 61.1% |
| Lovable | 59.0% |
| Figma Make | 58.1% |
| Replit | 55.7% |
| Magic Patterns | 55.6% |
| Cursor | 55.1% |
| Floot | 54.9% |
| Base 44 | 54.2% |
Battle Analysis: new.website leads this field with a remarkable 73.1% win rate, well ahead of the competition, showing how effectively it translates user requirements into working websites. The much-hyped AI software engineer Devin ranked third at 61.1%, a solid showing but hardly a crushing advantage. The list also features tools developers know well, such as Replit and Cursor, making it a useful reference when choosing an efficient AI development partner.
A Feast for the Eyes: Diffusion Model Image and Video Generation Showdown
Diffusion models have been the most dazzling star in the AIGC field in recent years. Design Arena has also opened up a special battlefield for them, divided into two categories: “Image” and “Video”.
Image Generation
| Model | Win Rate |
|---|---|
| GPT-Image-1 | 69.9% |
| Imagen 4 Ultra Generate Preview 06-06 | 67% |
| Imagen 3 Generate 002 | 59.3% |
| FLUX.1 Kontext Max | 57.6% |
| Ideogram 3.0 | 48.1% |
Battle Analysis: In static images, OpenAI's GPT-Image-1 took the crown with a win rate of nearly 70%. Google's Imagen series followed closely, showing strong competitiveness, and Ideogram, best known for its text rendering, also made the list despite a sub-50% win rate.
Video Generation
| Model | Win Rate |
|---|---|
| Midjourney | 77.6% |
| Wan 2.2 Plus | 62.0% |
| Pika | 41.0% |
| Higgsfield | 17.6% |
Battle Analysis: Video generation is a one-horse race. Midjourney dominates the field with a commanding 77.6% win rate; the quality and creativity of its generated videos clearly win users over. By contrast, once-popular tools like Pika trail by a wide margin. On these numbers, Midjourney is the undisputed king of AI video generation today.
Whose Voice is the Most Pleasant? AI Audio Generation Rankings
Finally, let’s take a look at the “voice” of AI. This list mainly evaluates the naturalness and emotional expressiveness of text-to-speech.
| Model | Win Rate |
|---|---|
| OpenAI Coral | 100% |
| OpenAI Sage | 80% |
| OpenAI Ash | 57.1% |
| OpenAI Alloy | 57.1% |
| ElevenLabs Domi | 42.9% |
| ElevenLabs Rachel | 37.5% |
Battle Analysis: This list produced the most jaw-dropping result: OpenAI Coral posted a perfect 100% win rate, meaning users preferred its voice in every single matchup (though, as Q3 below explains, a perfect record usually implies a small battle count, so some regression is likely as votes accumulate). The other OpenAI voices (Sage, Ash, Alloy) also crowd the top of the rankings, forming a near-monopoly that underlines OpenAI's lead in speech synthesis: the naturalness and realism of its voices have reached a very high level.
Frequently Asked Questions (FAQ)
Q1: Why is the Design Arena ranking worthy of our attention?
A1: Because it uses a “blind test” and Elo rating system based on large-scale user voting. This eliminates the interference of brand halo and marketing hype, and directly reflects the “real performance” and “user preference” of different AI tools in completing specific tasks. It is one of the most objective and practical AI strength rankings at present.
Q2: What is the difference between “Models” and “Builders”?
A2: The “Models” list focuses more on the core capabilities of the underlying AI, such as generating code, answering questions, and designing UI elements. The “Builders” list, on the other hand, evaluates application-level tools or AI agents that integrate AI models and can directly produce complete projects (such as websites), which is more inclined to practical engineering applications.
Q3: Why do some models have a high win rate but a low number of battles?
A3: This usually happens with models that have recently joined the platform. Fewer battles mean a larger margin of error (MoE) on the rating, so the ranking's stability has yet to be proven over time. By contrast, a model like Claude Opus 4, with nearly 3,000 battles behind it, has a rating that is statistically very solid.
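The MoE values in the Models table are consistent with a standard 95% normal-approximation confidence interval on the observed win rate, which shrinks with the square root of the battle count (quadrupling the battles roughly halves the margin). A sketch, assuming that formula:

```python
import math

def win_rate_moe(wins, losses, z=1.96):
    """95% normal-approximation margin of error for an observed win rate."""
    n = wins + losses
    p = wins / n
    return z * math.sqrt(p * (1 - p) / n)

# Claude Opus 4 (1933W / 759L): large sample, tight interval.
# Claude Opus 4.1 (293W / 111L): small sample, wide interval.
print(f"Opus 4:   ±{win_rate_moe(1933, 759):.1%}")  # matches the table's ±1.7%
print(f"Opus 4.1: ±{win_rate_moe(293, 111):.1%}")   # matches the table's ±4.4%
```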
Design Arena provides us with a unique window to observe this ever-changing AI arms race. From code to video, from website to sound, this all-round duel has just begun. Who will be the next hegemon in the field? Let’s wait and see.