AI Daily: AI Reasoning Breakthrough: Gemini 3 Deep Think Arrives, Major Updates from Cursor and Anthropic

December 5, 2025
Updated Dec 5
6 min read

In late 2025, with AI technology evolving at a breakneck pace, a technological mini-revolution seems to arrive every few days. It's no longer just about model parameters getting larger, but about models getting smarter, and about how we coexist with these digital brains. Today's news is exciting: Google's new mode pushes the limits of machine logic, Cursor has fundamentally overhauled its agent for GPT-5.1-Codex-Max, and Anthropic has run a sociological experiment to understand how people actually feel about AI. Each is worth savoring.

Google Gemini 3 Deep Think: Breaking the Logic Ceiling with Parallel Reasoning

To be honest, there is something oddly satisfying about watching AI work through complex math problems. Google has just announced the rollout of Gemini 3 Deep Think mode to AI Ultra subscribers in the Gemini App. This isn't just a "stronger" version; it represents a fundamental shift in how problems are processed.

You may have had this experience: you ask an AI a genuinely hard logic question, it gives an answer, and something still feels off. The core of Gemini 3 Deep Think is its use of advanced "parallel reasoning." What does that mean? Simply put, the model no longer walks a single path in the dark. When facing a complex mathematical, scientific, or logical puzzle, it explores multiple hypotheses simultaneously, much like a team brainstorming rather than one person working alone.
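Google has not published the internals of Deep Think, but the "explore multiple hypotheses, then converge" idea resembles a well-known technique called self-consistency sampling: run several independent reasoning attempts and keep the answer most of them agree on. The sketch below simulates that with a toy solver (`solve_once` is a stand-in, not a real model call):

```python
from collections import Counter
import random

def solve_once(question: str, rng: random.Random) -> str:
    """Stand-in for one independent reasoning attempt.
    A real system would sample a full chain of thought from a model;
    here we simulate noisy answers to a toy arithmetic question."""
    # Pretend 80% of attempts reach the right answer, 20% a distractor.
    return "42" if rng.random() < 0.8 else "41"

def parallel_reason(question: str, n_paths: int = 16, seed: int = 0) -> str:
    """Explore n_paths hypotheses independently, then keep the answer
    that the most paths agree on (majority vote)."""
    rng = random.Random(seed)
    answers = [solve_once(question, rng) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

print(parallel_reason("What is 6 * 7?"))
```

The payoff is that a single flaky reasoning path no longer decides the outcome; errors have to be correlated across many paths to survive the vote.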

The results are striking. On the notoriously difficult "Humanity's Last Exam" benchmark, it scored 41.0% without using external tools. Even more impressive is its performance on ARC-AGI-2, where it reached an unprecedented 45.1% when combined with code execution. Keep in mind that ARC has long been regarded as a touchstone for general reasoning: handling unfamiliar, abstract patterns at this level leaves many competitors behind. The technology builds on the earlier Gemini 2.5 Deep Think variants, which had recently achieved gold-medal-level results at the International Mathematical Olympiad.

Cursor Integrates GPT-5.1-Codex-Max: Hardcore Developers Return to the Shell

For developers wrestling with code daily, Cursor is definitely one of the hottest tools recently. Their newly released changelog reveals how they tamed OpenAI’s latest and most powerful GPT-5.1-Codex-Max model.

This update reflects an interesting "back to basics" trend. OpenAI's team found that the new Codex model leans heavily on the shell (the command-line interface) during training. So Cursor decided to go with the flow, adjusting its agent framework so the model prefers shell commands for searching, reading files, and making edits, rather than falling back on embedded Python scripts.

Why do this? When the model struggles with a complex edit, it may try to write a Python script to brute-force the problem. That is powerful, but sometimes it's a sledgehammer cracking a nut, and it may not be safe enough either. By adjusting the tool definitions (for example, naming search tools to feel more like ripgrep), Cursor guides the model to call the right tool directly, improving both safety and fluidity.
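Cursor's actual tool schemas are not public, so the following is a hypothetical sketch of what "shell-first" tool definitions might look like: tool names and descriptions deliberately echo familiar CLI commands (`grep`/`rg`), and anything outside the allowlist, such as ad-hoc Python execution, is rejected:

```python
# Hypothetical tool schema, illustrating the design described in the changelog.
SHELL_FIRST_TOOLS = [
    {
        "name": "grep",  # a ripgrep-flavoured name nudges the model toward CLI habits
        "description": "Search the repository for a regex pattern, like `rg <pattern>`.",
        "parameters": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
    {
        "name": "shell",
        "description": "Run a read-only shell command (cat, ls, sed -n) to inspect files.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]

def allowed_tool(name: str) -> bool:
    """Gate model tool calls: anything outside the declared schema is rejected."""
    return any(tool["name"] == name for tool in SHELL_FIRST_TOOLS)

print(allowed_tool("grep"), allowed_tool("python_exec"))  # True False
```

The design choice here is that tool naming is itself a steering mechanism: a model trained on shell transcripts will reach for a tool called `grep` far more naturally than one called `semantic_code_search`.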

Another point worth noting is the preservation of the "reasoning process." OpenAI's reasoning models generate a series of internal monologues (a chain of thought) while thinking. Cursor's experiments found that if these reasoning traces are discarded, GPT-5-Codex's performance drops by roughly 30%. It's like taking away an engineer's scratchpad and only letting them write the final answer; they would naturally be at a loss. Cursor has now added an alert mechanism to ensure these thought processes are fully preserved, so the model doesn't lose the thread in multi-turn conversations.
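Mechanically, "preserving reasoning traces" means not stripping the model's scratchpad entries when you assemble the next request in a multi-turn loop. The message shapes below are hypothetical (real APIs use their own formats), but they illustrate the failure mode Cursor now alerts on:

```python
# Illustrative only: the "reasoning" role is a made-up stand-in for the
# internal chain-of-thought items a reasoning model emits between turns.
def build_next_request(history, new_user_msg, keep_reasoning=True):
    """Assemble the message list for the next model call.

    If keep_reasoning is False, the model's internal 'reasoning' entries
    (its scratchpad) are stripped -- the lossy behaviour that was found
    to hurt coding performance.
    """
    msgs = [m for m in history if keep_reasoning or m["role"] != "reasoning"]
    return msgs + [{"role": "user", "content": new_user_msg}]

history = [
    {"role": "user", "content": "Rename foo() to bar() everywhere."},
    {"role": "reasoning", "content": "foo() is defined in utils.py; 3 call sites."},
    {"role": "assistant", "content": "Edited utils.py and the 3 call sites."},
]

full = build_next_request(history, "Now update the docs too.")
lossy = build_next_request(history, "Now update the docs too.", keep_reasoning=False)
print(len(full), len(lossy))  # 4 3 -- the lossy variant has dropped the scratchpad
```

In the lossy variant the model re-enters the conversation without its own notes about where `foo()` lives, so it has to rediscover that context or, worse, guess.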

Anthropic Interviewer: When AI Starts Interviewing Humans

Technology always comes from humanity, but do we really understand how humans feel in the AI era? This time, Anthropic isn't releasing a new model but a research tool called Anthropic Interviewer, along with published interview data from 1,250 professionals.

This study is fascinating because the interviewer itself is an AI. Powered by Claude, it conducts deep conversations with humans lasting 10 to 15 minutes. The results show that people's feelings are genuinely mixed. Office workers are broadly optimistic; they are happy to offload repetitive, boring work to AI so they can focus on more valuable things. Sounds reasonable, right?

But in the creative field, the atmosphere is more tense. Although many writers and artists concede that AI improves productivity, they are also troubled by imposter syndrome and peer pressure. One writer even said that while AI-written novels have perfect structure, they always seem to lack the subtle emotion unique to humans. As for scientists, they crave a powerful assistant that can help generate hypotheses, but current AI hasn't yet won their full trust; after all, in scientific research, accuracy is everything.

Anthropic has opened this tool for public testing. If you are a long-time user of Claude, you might receive an interview invitation recently. This is not just a tech showcase, but an important attempt to let the public’s voice feed back into the model development process.

Hugging Face OpenEvals: An Evaluation Guide for Model Builders

Finally, as we watch these powerful models clash like titans, you might wonder: “How do we actually define if a model is good?” Hugging Face offers a great perspective. Their OpenEvals Guide provides a set of evaluation standards for those building models.

This guide is not just a list of test data; it's more like a playbook, prompting developers to ask: How does my model perform on specific tasks? Does it truly solve user pain points? At a time when new models appear almost weekly, having the right evaluation mindset may matter more than blindly chasing benchmark scores.


FAQ

Q: How do I use the Gemini 3 Deep Think mode? Currently, this mode is only open to Google AI Ultra subscribers. If you are already a subscriber, just open the Gemini App, select “Deep Think” in the prompt input box, and confirm that “Gemini 3 Pro” is selected in the model dropdown menu to experience it.

Q: Will Cursor’s updates for the Codex model affect my existing usage habits? Most changes happen under the hood. You will feel that the Agent becomes smarter and makes fewer mistakes when executing tasks, especially when handling complex file edits. You don’t need to change how you operate, but you might find that it now does the “right thing” more often without needing you to repeatedly correct it.

Q: Can I participate in this interview research by Anthropic? Yes! Anthropic is conducting a public pilot. If you are an existing user of Claude.ai (including Free, Pro, or Max plans) and have been registered for more than two weeks, you might see a pop-up window for participating in the interview on the web page. This is a good opportunity to share your views on AI.

Q: Why is preserving AI’s “reasoning process” so important for coding? Imagine solving a math problem; if you forget how the first few steps were derived, later calculations are prone to errors. AI is the same, especially in highly logical tasks like coding. Preserving the thought trail of “why it did this” helps it maintain goal consistency across continuous steps, avoiding the creation of self-contradictory code.


© 2026 Communeify. All rights reserved.