
AI Daily: DeepSeek OCR 2 Open Sourced, Google AI Plus Rollout: New Battleground for Vision Models and Subscriptions

January 28, 2026

This week’s AI developments can only be described as dazzling. This is not just an arms race over model parameters; it is a revolution in how AI views the world the way a human does.

DeepSeek has once again demonstrated its open-source spirit by releasing the OCR 2 model, which introduces “Visual Causal Flow” in an attempt to break the deadlock of traditional visual scanning. Google, not to be outdone, has launched the more affordable AI Plus subscription plan on one hand, while showcasing Agentic Vision in Gemini 3 Flash, capable of “active investigation,” on the other. And Tongyi Lab’s Z-Image foundation model is injecting new vitality into the field of image generation.

Let’s take a closer look at the details and impact behind these technological updates.

Evolution of Visual Logic: DeepSeek-OCR 2’s “Causal Flow” Revolution

If you follow document-processing technology, you surely know the pain points of traditional OCR (Optical Character Recognition): such systems usually scan rigidly from top left to bottom right. But humans don’t read that way. When we look at a complex report or magazine, our gaze jumps according to semantic logic.

This is exactly the core problem DeepSeek-OCR 2 attempts to solve. The DeepSeek team didn’t just improve recognition rates; they introduced an architecture concept very similar to humans: Visual Causal Flow.

Why is “Causal Flow” Important?

Imagine that the model no longer passively receives pixels but actively “decides” which visual block to look at next based on context. DeepSeek-OCR 2 introduces a “causal flow query,” equipping the visual encoder with reasoning capability. This means that when interpreting complex layouts, formulas, or tables, the model can reorganize the visual information more accurately instead of outputting gibberish.

On the technical side, the release is generous:

  • Powerful Architecture: Adopts a Vision Tokenizer (based on SAM-base) coupled with an LLM-like visual encoder (Qwen2 0.5B).
  • High Performance: Supports input resolutions up to 1024x1024 and compresses visual tokens to between 256 and 1120, a budget comparable to Gemini 3 Pro’s visual processing, while achieving excellent results on the OmniDocBench benchmark.
  • Open Source Spirit: The code and weights are currently available on GitHub and HuggingFace.
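The practical impact of that token compression is easy to quantify. A rough back-of-the-envelope sketch, assuming a conventional 16×16-pixel patch grid for the naive baseline (the patch size is my assumption, not something stated in the release):

```python
# Rough visual-token budget comparison (16 px patch size is an assumption).
def naive_patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Token count for a plain ViT-style patch grid with no compression."""
    return (width // patch) * (height // patch)

naive = naive_patch_tokens(1024, 1024)   # 64 * 64 = 4096 patches
low, high = 256, 1120                    # DeepSeek-OCR 2's reported token range
print(naive, naive / high, naive / low)  # compression factor: ~3.7x to 16x
```

Even at the top of the range, that is a several-fold reduction in visual tokens per page, which is where the document-throughput savings come from.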

For developers who need to process a large number of complex documents, this is undoubtedly a powerful tool. It proves that even small parameter models, with proper architectural design, can demonstrate amazing “reading comprehension” capabilities.

Google’s Double-Sided Strategy: Affordable Subscription and Active Vision

Turning the lens to Google, the tech giant is playing a delicate balancing game. On one hand, expanding market share through new subscription tiers, and on the other, flexing muscles through stronger technology.

Google AI Plus: Filling the Middle Ground

For a long time, users lacked a compromise choice between the free version and the expensive Pro version. Google finally heard this call and launched Google AI Plus.

This new plan is priced at $7.99 per month (with a half-price offer for the first two months for new users), and its positioning is very precise:

  • Upgrade Privileges: Access to stronger models like Gemini 3 Pro and Nano Banana Pro.
  • Creative Tools: Includes access to Flow’s AI movie-making tools and advanced features of NotebookLM.
  • Family Sharing: Comes with 200GB of storage space and can be shared with up to five family members.
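For readers weighing the offer, the first-year cost under the launch promotion works out as a quick sum (a simple sketch that assumes the half-price promo covers exactly the first two months; check Google’s billing terms for the fine print):

```python
# First-year cost of Google AI Plus under the launch promotion (assumed terms).
MONTHLY = 7.99
promo_months = 2
promo_price = MONTHLY / 2                 # half price for the first two months
first_year = promo_months * promo_price + (12 - promo_months) * MONTHLY
print(f"${first_year:.2f}")               # first-year total before tax
```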

The service has fully launched in 35 new countries/regions, including the US. For users who find the Pro version too expensive but feel constrained by the free tier, this is a highly attractive entry ticket.

Gemini 3 Flash Introduces Agentic Vision

If AI Plus is a commercial layout, then Agentic Vision is a technical show-off.

Current AI models usually view images “statically”—take a look, then guess the details. What if the serial number in the picture is too small to see clearly? Traditional models can only guess blindly. But the Agentic Vision introduced by Google in Gemini 3 Flash changes this.

This feature gives the model “agent-like” mobility. It follows a “Think -> Act -> Observe” cycle.

  1. Think: The model analyzes user needs.
  2. Act: The model writes and executes Python code to manipulate the image (e.g., crop, rotate, zoom in on specific areas).
  3. Observe: Check the processed image to obtain more precise information.
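The cycle above can be sketched as a minimal loop. This is a conceptual illustration, not Google’s implementation: the “Think” step is a hard-coded plan, and the “Act” step is a pure-Python crop-and-zoom standing in for the image-manipulation code the model writes for itself.

```python
# Conceptual sketch of the Think -> Act -> Observe cycle (not Google's code).
# The "image" is a plain 2D grid of pixel values.

def crop(img, left, top, right, bottom):
    """Act: cut out the region of interest."""
    return [row[left:right] for row in img[top:bottom]]

def zoom(img, scale):
    """Act: nearest-neighbour upscale so fine detail becomes legible."""
    return [[px for px in row for _ in range(scale)]
            for row in img for _ in range(scale)]

image = [[0] * 64 for _ in range(48)]       # stand-in for a user photo

# Think: suppose the model decides the serial number sits in the top-left corner.
region = crop(image, left=0, top=0, right=16, bottom=8)

# Act: enlarge that region 4x before re-reading it.
closeup = zoom(region, scale=4)

# Observe: the model now inspects a 64x32 close-up instead of squinting at 16x8.
print(len(closeup[0]), len(closeup))
```

In the real feature the loop can repeat: if the observation is still ambiguous, the model writes another round of code against the new crop.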

For example, if you ask it to “count the number of fingers in the picture,” it doesn’t eyeball the answer; it writes code to draw a box around each finger to ensure an accurate count. This “active investigation” capability turns visual understanding from passive into active.

Developers Note: The Shrinking Free Lunch

However, alongside this good news comes a change that will give developers a headache. Google’s Developer Relations Lead Logan Kilpatrick confirmed that free-tier UI usage limits in Google AI Studio have been lowered, and are expected to be lowered further.

The official advice is clear: if you want to continue high-intensity use, please switch to API Key mode, or consider upgrading to a paid plan. The good news is that the “Vibe Coding” experience in AI Studio is temporarily unaffected. This reflects a reality—AI computing costs are high, and the era of completely free playgrounds may be slowly coming to an end.
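Switching to API-key mode means calling the Generative Language REST endpoint directly rather than the Studio UI. A minimal sketch of assembling such a request (the model name and API version string are assumptions on my part; consult the current Gemini API docs before relying on them, and note the request is built here but not sent):

```python
import json

# Assumed API version; verify against the current Gemini API documentation.
API_BASE = "https://generativelanguage.googleapis.com/v1beta"

def build_generate_request(model: str, prompt: str, api_key: str):
    """Assemble the URL and JSON body for a generateContent call (not sent)."""
    url = f"{API_BASE}/models/{model}:generateContent?key={api_key}"
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, json.dumps(body)

url, payload = build_generate_request("gemini-3-flash", "Hello", "YOUR_API_KEY")
print(url.split("?")[0])   # endpoint without the key
```

From here, any HTTP client can POST the payload with a `Content-Type: application/json` header; usage then counts against your API quota rather than the shrinking Studio UI allowance.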

Tongyi Z-Image: Returning to Pure Image Generation

In the field of image generation, many models are highly distilled or specifically tuned, which is convenient but limits the space for secondary development. Z-Image released by Tongyi Lab takes a different path.

Z-Image bills itself as an “undistilled foundation model.” That sounds technical, but it matters a lot for creators and developers: the model retains complete training signals and supports full classifier-free guidance (CFG), which is crucial for professional workflows that need fine-grained prompt control.
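Full CFG support means the sampler combines a conditional (positive-prompt) and an unconditional or negative-prompt prediction at every denoising step. A toy numeric sketch of the standard CFG blend (the guidance scale and the noise values are illustrative, not from Z-Image):

```python
# Standard classifier-free guidance blend (toy numbers, illustrative scale).
def cfg(uncond, cond, scale):
    """Push the conditional prediction away from the unconditional/negative
    one by the guidance scale: eps = uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

eps_uncond = [0.10, 0.20, 0.30]   # negative-prompt / unconditional branch
eps_cond   = [0.40, 0.10, 0.50]   # positive-prompt branch
print(cfg(eps_uncond, eps_cond, scale=7.5))
```

A distilled model typically bakes a fixed guidance behaviour into its weights; keeping the two branches separate is what lets users tune the scale and the negative prompt per image.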

According to its GitHub page, Z-Image’s advantages lie in:

  • Extreme Diversity: Whether it’s hyper-realistic photography or anime style, it can handle it, and performs excellently in randomly generated compositions and lighting.
  • Fine-tuning Friendly: Because it retains original characteristics, it is very suitable as a training base for LoRA or ControlNet.
  • Negative Prompt Control: It is very sensitive to Negative Prompts, effectively suppressing image collapse.

Compared to its Turbo version (which pursues speed at the cost of some controllability), the standard Z-Image takes more sampling steps (28-50), in exchange for higher visual quality and editing flexibility.

Frequently Asked Questions (FAQ)

Q: What is the fundamental difference between DeepSeek-OCR 2 and traditional OCR software? A: Traditional OCR usually scans in a fixed order (such as top-left to bottom-right), easily messing up complex layouts. DeepSeek-OCR 2 mimics human visual logic, possessing “Visual Causal Flow,” and can actively judge the reading order based on content semantics, making it particularly suitable for handling complex magazines, forms, or academic papers.

Q: I already have the Google One 2TB plan, do I need to purchase AI Plus additionally? A: No. Google stated that existing Google One Premium 2TB subscribers will automatically receive all AI Plus benefits in the coming days.

Q: How does Agentic Vision make Gemini see more clearly? A: It doesn’t just “look,” it “acts.” Agentic Vision allows the model to write Python code to zoom, crop, or annotate images. This is like when a human can’t see something clearly, they lean in or use their finger to point and count, obtaining precise information through interaction.

Q: Should I choose Z-Image or Z-Image-Turbo? A: If you are a developer looking to train your own style models (LoRA) or need extremely high image control, please choose the standard version Z-Image. If you just need to generate high-quality images quickly and don’t need too complex negative prompt control, the Turbo version will be more efficient.

Q: What should developers do after the free limits of Google AI Studio are lowered? A: It is recommended that developers start getting used to using API Keys for calls, as UI interface (Playground) limits will become stricter. If you are a high-frequency user, you may need to evaluate whether to upgrade to the paid AI Pro or Ultra plans.


© 2026 Communeify. All rights reserved.