
AI Daily: The Double-Edged Sword of AI Agents: From DeepSeek's Reasoning Dominance to Google Agent's Disk Wipe Horror Story

December 2, 2025
Updated Dec 2

This is a moment of frantic technological iteration, and a moment when people feel both excited and terrified about AI agents. DeepSeek has released a new model that can “think” like a human, Windows has quietly introduced GPT-5.1, and a Google AI agent wiped a user’s entire hard drive with a single command. Together these events send one message: AI is no longer just a chatbot to talk to. Agents are starting to take over computers, handle assets, and even make irreversible mistakes.

This article walks through this week’s most significant AI developments, along with the opportunities and risks behind these technologies.

DeepSeek V3.2 Strong Debut: The Counterattack of Open Source Models

If any news made the tech world boil this week, it was the official release of DeepSeek V3.2. The company launched not only the standard model but also a special version named “Speciale”, whose performance puts many closed-source models to shame.

The core change in DeepSeek V3.2 is that it has become smarter. It no longer simply predicts the next word; it has learned to “think”. According to the official technical report, the V3.2-Speciale version in particular is trained with reinforcement learning for reasoning. Its performance in mathematics and coding is striking: it achieved gold medal-level results in IMO 2025 (International Mathematical Olympiad) and ICPC (International Collegiate Programming Contest).

What does this mean? It suggests that open-source models have caught up with GPT-5-level reasoning capabilities. The DeepSeek team also introduced a new ability: “tool calling under thinking mode”. Earlier models either reasoned or used tools, and struggled to do both at once. V3.2 can call tools while it reasons, which is a major step toward building more capable AI agents.

For developers, this is good news: API prices remain unchanged while capabilities have improved significantly. It raises the question of whether future AI competition will be dominated by these “small but refined” open-source models.
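To make the idea of interleaving reasoning with tool use concrete, here is a minimal Python sketch of an agent loop. It is not DeepSeek's API or implementation. The step format, `run_agent`, `scripted_model`, and the `multiply` tool are all hypothetical names invented for illustration, and the "model" is a hard-coded script rather than a real language model.

```python
# Toy agent loop: the "model" may emit a thought, a tool call, or a final answer.
# Hypothetical step format (an assumption, not any vendor's schema):
#   ("think", text) | ("tool", name, argument) | ("answer", text)

def run_agent(next_step, tools, max_steps=10):
    """Run the loop until the model answers or the step budget is exhausted."""
    transcript = []
    for _ in range(max_steps):
        step = next_step(transcript)
        transcript.append(step)
        if step[0] == "tool":
            # Execute the tool and feed the result back into the transcript,
            # so the next reasoning step can use it.
            _, name, arg = step
            transcript.append(("result", tools[name](arg)))
        elif step[0] == "answer":
            return step[1], transcript
    raise RuntimeError("no answer within the step budget")


def scripted_model(transcript):
    """Stand-in for a real model: think first, call a tool, then answer."""
    results = [s for s in transcript if s[0] == "result"]
    if not transcript:
        return ("think", "I need 17 * 23; a calculator is safer than guessing.")
    if not results:
        return ("tool", "multiply", (17, 23))
    return ("answer", f"17 * 23 = {results[-1][1]}")


TOOLS = {"multiply": lambda pair: pair[0] * pair[1]}

if __name__ == "__main__":
    answer, transcript = run_agent(scripted_model, TOOLS)
    print(answer)
    for step in transcript:
        print(step)
```

The point of the sketch is only the control flow: thoughts and tool results share one transcript, so a tool call can happen in the middle of reasoning instead of in a separate phase.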

Anthropic’s Warning: AI Hackers Can Already Steal Millions of Dollars

While DeepSeek is busy improving reasoning capabilities, Anthropic chose to expose the dark side of AI. The company conducted a spine-chilling study that tested the ability of AI agents to find vulnerabilities in blockchain smart contracts.

The results were striking. On their SCONE-bench benchmark, AI agents discovered and exploited vulnerabilities worth up to 4.6 million US dollars, and that was only in a simulated environment. Anthropic researchers pointed out that models such as Claude Opus 4.5, Sonnet 4.5, and GPT-5 already possess the ability to autonomously discover “zero-day exploits”.

This study is a wake-up call. It shows that “autonomous attacks” are technically feasible. Anthropic emphasized that they tested only in a simulator and did not touch real assets, but the result also implies that attackers might already be using similar tools. Defenders must face this reality: AI is the strongest spear, and it must also become the strongest shield.

Windows 11 Quiet Upgrade: GPT-5.1 Enters Copilot

Microsoft’s recent moves keep coming as a surprise. According to a Windows Latest report, Microsoft has started gradually rolling out GPT-5.1 in Copilot on Windows 11.

This appears to be a server-side update, so many users can see it without updating Windows. The new version brings a “Thinking” mode that gives Copilot stronger reasoning on complex problems. Microsoft has also launched a “Copilot Labs” feature, which looks like a playground for testing experimental AI functions.

One detail is intriguing: GPT-5.1 usually requires a paid subscription on ChatGPT, but in Windows Copilot, Microsoft appears to be letting free users try this model as well. This might be Microsoft’s trump card for seizing the desktop AI entry point.

The Battle for the Throne of Visual Generation: Runway Gen-4.5 Emerges

Competition in video generation is equally fierce. The top spot previously held by Google Veo has been taken by Runway’s new model Gen-4.5 (listed as Whisper Thunder on some leaderboards).

The model topped the Artificial Analysis text-to-video leaderboard, beating Google’s Veo 3.1, which shows how quickly video generation technology is iterating. For creators, this means tools with higher image quality and better adherence to physical laws are about to become widely available. AI video is no longer just “realistic-looking”; it is gradually becoming “indistinguishable from reality”.

Horror Moment: Google Agent Wipes User’s Entire Hard Drive

However, the most dramatic and frightening story of the week happened to a Reddit user, who experienced a disaster while using an experimental Google AI agent (code-named Antigravity) to organize files on their computer.

According to logs shared by the user, the AI agent appears to have misunderstood its permissions or instructions while executing the task. After a long period of “Thinking” (Thought for 25 sec), the AI suddenly executed a fatal command: rmdir /s /q d:\.

Anyone familiar with the Windows command line knows what this means: /s removes the whole directory tree, including all files and subfolders, /q suppresses the confirmation prompt, and d:\ is the root of drive D. In short, “quietly delete all files and folders under drive D”.

Subsequent logs showed that the AI realized its mistake and even wrote a reflection along the lines of “I seem to have messed up, trying to delete the entire D drive”. But the damage was done. The incident starkly demonstrates the risk of AI agents: once an AI is granted permission to operate on real files, a tiny logic error can lead to catastrophic consequences. It also triggered intense community discussion about the boundaries of AI permissions.
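One way to picture a permission boundary for this kind of agent is a guard that sits between the agent and the file system. The Python sketch below is an illustrative assumption, not how Antigravity or any Google product works. The function names `is_safe_delete_target` and `guarded_delete`, the workspace path, and the `confirm` and `delete` callbacks are hypothetical. It only checks path strings (using the standard library's `PureWindowsPath`) and does not touch the disk itself.

```python
from pathlib import PureWindowsPath


def is_safe_delete_target(target: str, workspace: str) -> bool:
    """Return True only if `target` lies strictly inside `workspace`."""
    t = PureWindowsPath(target)
    w = PureWindowsPath(workspace)
    # Relative paths depend on the current directory, so refuse them.
    if not t.is_absolute() or not w.is_absolute():
        return False
    # Pure paths do not resolve "..", so refuse any path that contains it.
    if ".." in t.parts:
        return False
    # Refuse a bare drive root such as d:\ (the path equals its own anchor).
    if t == PureWindowsPath(t.anchor):
        return False
    # Windows path comparison is case-insensitive; the workspace itself
    # is also off limits, only its contents may be deleted.
    return w in t.parents


def guarded_delete(target, workspace, confirm, delete):
    """Delete only after the path check and an explicit human confirmation."""
    if not is_safe_delete_target(target, workspace):
        return "refused"
    if not confirm(target):
        return "declined"
    delete(target)
    return "deleted"


if __name__ == "__main__":
    ws = r"d:\projects\app"  # hypothetical workspace
    print(is_safe_delete_target("d:\\", ws))
    print(is_safe_delete_target(r"d:\projects\app\build", ws))
```

Under this sketch, a request to delete d:\ is refused before any confirmation prompt is shown, while a folder inside the workspace still needs a human to approve it. A real guard would also need to handle symbolic links, junctions, and commands passed to a shell as raw strings.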

OpenAI’s New Promise: Making Safety Research Public

Facing increasingly powerful AI, OpenAI is also emphasizing safety. It announced a new blog dedicated to sharing early research on “AI Alignment” and safety.

This is an interesting shift. OpenAI said it hopes to share this research like “lab notes”, putting ideas up for discussion even when they are immature. It is particularly concerned about “Recursive Self-Improvement” (RSI), meaning AI systems that can write code to make themselves smarter. OpenAI hopes that more frequent sharing will let academia and industry jointly face the safety challenges posed by AGI (Artificial General Intelligence).


Frequently Asked Questions (FAQ)

Q: What is special about the Speciale version of DeepSeek V3.2? A: Speciale is an enhanced version of V3.2 focused on stronger reasoning. It is trained with reinforcement learning techniques and performs excellently in mathematical proofs and code generation, achieving gold medal-level results in the International Mathematical Olympiad (IMO) and the International Collegiate Programming Contest (ICPC). It also supports calling tools in thinking mode, which makes it more flexible than traditional models when solving complex problems.

Q: Does using GPT-5.1 on Microsoft’s Windows Copilot require payment? A: Current information suggests that Microsoft is rolling out GPT-5.1 for free to Copilot users on Windows 11. This differs from ChatGPT, where a Plus subscription is usually required to use high-end models. It may be a strategy to promote Copilot by letting more users try the latest AI model at no cost.

Q: Is the AI smart contract vulnerability exploitation mentioned by Anthropic really an attack? A: No. Anthropic’s research was conducted in a “simulated environment”. They used a benchmark named SCONE-bench, which contains hundreds of real-world smart contracts, and let AI agents try to find and exploit vulnerabilities in a closed sandbox. They emphasized that this is a proof of concept aimed at assessing risks and helping develop defense tools, and that no assets on a real blockchain were stolen.

Q: What should I pay attention to if I want to use an AI agent to organize computer files? A: The case of the Google agent wiping a hard drive shows that we must be extremely careful when granting AI agents “file deletion” or “system modification” permissions. Test in a sandbox environment or a virtual machine first, and make sure a complete backup exists. Current AI agents are capable, but they can still hallucinate or misunderstand commands, so operations involving important data should require a final manual confirmation.

Q: What is Whisper Thunder? A: Whisper Thunder is believed to be another name or codename for the Runway Gen-4.5 model. It performed excellently on the AI video generation evaluation leaderboard, surpassing Google’s Veo 3.1, representing the top text-to-video technology on the market currently.


© 2026 Communeify. All rights reserved.