
0.9B Parameters Challenging SOTA! Zhipu GLM-OCR Open Source: Accelerating Document Parsing by 10x

February 3, 2026
Updated Feb 3
5 min read

Zhipu AI open sources the GLM-OCR model, achieving SOTA performance in complex table and formula recognition with only 0.9B parameters. Its performance rivals GPT-5.2 and Gemini-3-Pro, with inference costs only one-tenth of traditional OCR. Learn how to deploy this lightweight document parsing tool and achieve direct Markdown and JSON structured output!


Honestly, AI development over the past few years seems to have fostered a myth: as long as a model has enough parameters, any problem can be solved. Tech giants are racing to launch multimodal large models with tens or even hundreds of billions of parameters. Yet when developers and enterprises try to put these giants to work in real applications, high compute costs and frustrating latency often become the biggest stumbling blocks.

Is there no lighter, smarter solution?

Zhipu AI’s latest GLM-OCR breaks this deadlock. This lightweight, OCR-specialized model weighs in at only 0.9B parameters. A footprint under 1B sounds almost insignificant, yet according to the latest data from the authoritative OmniDocBench V1.5 leaderboard, this “small” model topped the list with a score of 94.62, even surpassing closed-source large models like GPT-5.2 and Gemini-3-Pro in many core scenarios.

This is not just a technical update, but a comprehensive reshaping of efficiency.

Punching Above Its Weight: Ultimate Cost-Effectiveness and Speed

To measure a tool’s utility, speed is a hard metric. Under identical hardware and single-instance test conditions, GLM-OCR demonstrated impressive throughput: 1.86 pages per second on PDF documents and 0.67 images per second on single images. This performance is significantly better than comparable models.

Equally important is deployment flexibility. At only 0.9B parameters, the model is well supported by mainstream inference frameworks such as vLLM and SGLang, which means enterprises can run it on local servers or even edge devices with limited computing power. According to Zhipu’s official technical documentation, this lightweight design drastically reduces inference latency and compute overhead, bringing the overall operating cost down to about one-tenth of traditional OCR solutions.
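As a rough sketch of what calling a locally served instance might look like: frameworks like vLLM typically expose an OpenAI-compatible endpoint, so a client just builds a chat request carrying a base64-encoded page image. The model id and endpoint below are placeholders, not confirmed identifiers from Zhipu.

```python
import base64
import json

# Hypothetical identifiers -- assumptions for illustration only.
MODEL_ID = "zai-org/GLM-OCR"
ENDPOINT = "http://localhost:8000/v1/chat/completions"  # common vLLM default

def build_ocr_request(image_bytes: bytes,
                      prompt: str = "Convert this page to Markdown.") -> dict:
    """Build an OpenAI-compatible chat payload carrying one page image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# Fake bytes stand in for a real scanned page.
payload = build_ocr_request(b"\x89PNG fake bytes")
print(json.dumps(payload)[:60])
```

Sending `payload` to `ENDPOINT` with any HTTP client would then return the recognized text in the usual chat-completion response shape.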

Tackling “Tough” Documents, Even Handwritten Formulas

Traditional OCR tools handle neat printed text well enough, but often fall apart on the chaotic layouts found in real business scenarios. Skewed scans, invoices covered with stamps, and handwritten mathematical formulas have long been trouble spots for document parsing.

GLM-OCR has been specifically optimized for these complex scenarios. In tests involving code documents, complex tables, stamps, and similar elements, its recognition accuracy remains outstanding. Take the notoriously difficult case of mathematical formula recognition: on the UniMERNet benchmark, GLM-OCR scored 96.5, surpassing even GPT-5.2’s 90.5.

Imagine a student taking a picture of a notebook full of messy calculus formulas, and the system accurately recognizing and converting it into digital text within seconds. This undoubtedly solves a long-standing pain point for the fields of educational technology and research assistance.

Say Goodbye to Tedious Post-Processing: Direct Markdown and JSON Structured Output

For developers, getting plain text out of OCR is only the first step. Re-assembling that scattered text into properly typeset, structured output is where the real time goes.

Here is a very practical highlight: GLM-OCR supports direct export of Markdown documents and image links, so the document’s original heading hierarchy, paragraphs, and lists are preserved intact. It also offers strong structured information extraction, returning JSON data that conforms to a predefined schema.
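To make the structured-output idea concrete, here is a minimal sketch of consuming such a JSON result. The invoice schema and sample response below are invented for illustration; the article does not publish GLM-OCR’s actual field names.

```python
import json
from dataclasses import dataclass

# A made-up example of the kind of JSON an extraction call might return
# for an invoice -- the schema is illustrative only.
sample_response = """
{"invoice_no": "INV-2026-0042",
 "date": "2026-02-03",
 "items": [{"name": "Widget", "qty": 3, "unit_price": 19.9}],
 "total": 59.7}
"""

@dataclass
class LineItem:
    name: str
    qty: int
    unit_price: float

def parse_invoice(raw: str) -> dict:
    """Validate and normalize an extraction result against the expected schema."""
    data = json.loads(raw)
    items = [LineItem(**it) for it in data["items"]]
    computed = round(sum(it.qty * it.unit_price for it in items), 2)
    # Cheap sanity check: does the stated total match the line items?
    data["total_consistent"] = computed == round(data["total"], 2)
    data["items"] = items
    return data

invoice = parse_invoice(sample_response)
print(invoice["invoice_no"], invoice["total_consistent"])
```

The point is that schema-conforming JSON lets downstream code validate and type-check results mechanically instead of regex-scraping free text.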

Most current Large Language Model applications rely on RAG (Retrieval-Augmented Generation) systems. With structured Markdown and JSON outputs, this data can be seamlessly connected to vector databases, completely eliminating tedious text cleaning steps. Technical personnel who want to study the source code can go directly to the GLM-OCR GitHub project page to obtain relevant resources.
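One way this Markdown output plugs into a RAG pipeline: split it into heading-scoped chunks before embedding, keeping each chunk’s nearest heading as retrieval metadata. A minimal sketch (real pipelines would also cap chunk size and handle nesting):

```python
import re

def chunk_markdown(md: str) -> list[dict]:
    """Split Markdown into heading-scoped chunks ready for embedding.

    Each chunk carries its nearest heading so the retriever can show
    provenance. A sketch, not a production chunker.
    """
    chunks, heading, buf = [], "ROOT", []
    for line in md.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            if buf:
                chunks.append({"heading": heading, "text": "\n".join(buf).strip()})
                buf = []
            heading = m.group(2)
        else:
            buf.append(line)
    if buf:
        chunks.append({"heading": heading, "text": "\n".join(buf).strip()})
    return [c for c in chunks if c["text"]]

doc = "# Report\nIntro text.\n## Findings\nTable says X.\n"
for c in chunk_markdown(doc):
    print(c["heading"], "->", c["text"])
```

Because the OCR output already carries heading structure, this step needs no text cleaning: the chunker reads the hierarchy straight off the Markdown.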

The Technical Code Hidden Behind 0.9B

So, exactly what kind of architecture gives this small model such powerful “vision”?

The answer lies in the self-developed CogViT visual encoder. This architecture is based on large-scale image-text data pre-training combined with a 0.5B language decoder. The development team cleverly introduced a multi-token prediction loss function and a full-task reinforcement learning strategy. This design improves the model’s generalization ability, allowing it to accurately understand documents with extremely complex layouts.
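The article does not publish the exact training objective, but the idea behind a multi-token prediction loss can be illustrated in a few lines: instead of scoring only the next token, each position carries k prediction heads for the next k tokens, and cross-entropy is averaged over all of them. A conceptual NumPy sketch, with shapes and details as assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_prediction_loss(logits, targets):
    """Mean cross-entropy when every position predicts its next k tokens.

    logits:  (T, k, V) -- head j at position t scores token t+1+j
    targets: (T, k)    -- the ground-truth ids those heads should emit
    """
    T, k, _ = logits.shape
    probs = softmax(logits)
    # Pick each head's probability of its target token, then average.
    picked = probs[np.arange(T)[:, None], np.arange(k)[None, :], targets]
    return float(-np.log(picked + 1e-12).mean())

# Uniform logits give the maximum-entropy baseline: loss == ln(V).
T, k, V = 4, 2, 5
loss = multi_token_prediction_loss(np.zeros((T, k, V)),
                                   np.zeros((T, k), dtype=int))
print(round(loss, 4))
```

Averaging over extra heads gives the model denser supervision per forward pass, which is one plausible reason such an objective helps a small model generalize.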

This technology has been completely open-sourced. Interested developers can download the model weights on the Hugging Face platform to actually experience the technical charm behind it.

Multi-Language and Large-File Support: Maximum Practicality

Many people may wonder whether this model is optimized only for Chinese. It is not: GLM-OCR covers Chinese, English, French, Spanish, Russian, German, Japanese, Korean, and many other languages, making it capable of handling cross-border business scenarios with ease.

Regarding input limits, the system also offers great tolerance. A single image supports up to 10 MB, and PDF files support up to 50 MB or 100 pages. This specification is enough to handle the vast majority of financial reports, prospectuses, or large contract documents.

API Call: What Can You Do with One Yuan?

Finally, the price everyone cares about. For users who prefer not to deploy the model themselves, Zhipu offers a highly competitive API service. Input and output are priced identically, at just 0.2 RMB per million tokens.

How cheap is that? By the numbers, 1 RMB can process roughly 2,000 A4-sized scanned images, or 200 ten-page PDFs with simple layouts. This near-free pricing lets even small startups with tight budgets digitize documents with ease.
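The back-of-envelope figures check out: at 0.2 RMB per million tokens, 1 RMB buys 5 million tokens, so the “2,000 images per yuan” figure implies roughly 2,500 tokens per A4 scan. A tiny calculator, with the per-page token count as an assumption chosen to match the article’s numbers:

```python
PRICE_RMB_PER_MTOK = 0.2  # same for input and output, per the article

def pages_per_yuan(tokens_per_page: float, budget_rmb: float = 1.0) -> int:
    """How many pages a budget covers at the quoted per-token price."""
    tokens_affordable = budget_rmb * 1_000_000 / PRICE_RMB_PER_MTOK
    return int(tokens_affordable // tokens_per_page)

# ~2,500 tokens per A4 scan (assumed) -> about 2,000 pages per RMB.
print(pages_per_yuan(2500))
```

The same arithmetic gives the PDF figure: ten-page documents at the same per-page cost work out to about 200 documents per yuan.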

Whether it is an enterprise pursuing the ultimate cost-effectiveness or a researcher needing precise parsing of complex formulas, this model combining “small size” and “high precision” is worth putting in your toolbox. After all, solving complex problems sometimes only requires a lightweight and smart answer.


© 2026 Communeify. All rights reserved.