Explore Nanonets’ latest open-source OCR2 model suite. From automatically converting LaTeX math formulas and intelligently describing charts to accurately processing handwritten documents and complex tables, Nanonets-OCR2 is redefining the limits of document processing. This article will provide an in-depth analysis of its powerful features, the technology behind it, and how it can completely change your workflow.
Have you ever wondered what it would be like if a computer could “read” a document like a human? Not just recognizing text, but truly understanding the document’s structure, content, and even the meaning behind charts and signatures. This used to sound like science fiction, but now, the newly released and open-sourced OCR2 series of models from Nanonets makes it all within reach.
This is not just a small upgrade to Nanonets-OCR-s, but a complete revolution. Nanonets-OCR2 is a set of advanced models designed to convert complex image documents into structured Markdown, with the addition of powerful Visual Question Answering (VQA) capabilities. Imagine being able to instantly transform any academic paper, financial report, or handwritten contract into a machine-readable and easy-to-process format.
This model series includes three versions: Nanonets-OCR2-Plus, Nanonets-OCR2-3B, and Nanonets-OCR2-1.5B-exp, to meet the needs of different scenarios. All of this is the result of fine-tuning based on the powerful Qwen2-VL model. Among them, the 3B version has been trained on over 3 million pages of real-world documents, covering papers, financial reports, contracts, medical records, tax forms, receipts, and even multilingual and handwritten documents, ensuring its amazing accuracy in complex scenarios.
Let’s take a look at the cutting-edge features packed into this tool, which has been hailed as a “document processing powerhouse.”
No Longer Just Text Recognition, But True “Document Understanding”
The task of traditional OCR tools is simple: extract the text from an image. But Nanonets-OCR2’s ambitions clearly go beyond that. It pursues genuine “semantic understanding” of documents: it can identify and tag the various elements in a document, making the output not only readable but also readily processable and analyzable by large language models (LLMs).
No Fear of Math Formulas: Automatic Conversion of LaTeX Equations
For anyone in academia or engineering, dealing with mathematical formulas in documents has always been a headache. Traditional OCR often produces a jumble of garbled characters when it encounters complex equations.
Nanonets-OCR2 tackles this pain point head-on. It automatically converts the mathematical equations and formulas in a document into correctly formatted LaTeX syntax. Even better, it distinguishes between inline formulas (enclosed in $...$) and display formulas (enclosed in $$...$$), faithfully preserving the document’s academic formatting.
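As a hypothetical illustration of this convention (the equation and surrounding text are invented for the example, not actual model output), a page containing the mass–energy relation might be rendered like this:

```markdown
The rest energy is given by $E = mc^2$, where $m$ is the rest mass.
For a moving particle, the full relation is:

$$
E^2 = (pc)^2 + (mc^2)^2
$$
```

The short reference inside the sentence becomes an inline `$...$` formula, while the standalone equation is wrapped in `$$...$$`, so any Markdown renderer with math support displays the page as it appeared in the original document.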
Let Pictures Speak: Smart Image Description
In a report or paper, charts often carry the most essential information. Nanonets-OCR2 can intelligently describe the various types of images in a document, including logos, charts, graphs, and more, placing the descriptions inside structured <img> tags. This is not just a simple label but a detailed description of the image’s content, style, and context, so that large language models can also “understand” this visual information.
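A sketch of what this could look like in the Markdown output (the tag placement and description text here are hypothetical, based on the convention described above):

```markdown
## Q4 Results

<img>A bar chart comparing quarterly revenue for 2022 and 2023; the 2023
bars are consistently taller, peaking in Q4 at roughly 1.5M.</img>

Revenue grew steadily throughout the year...
```

Because the description is plain text embedded at the image’s original position, a downstream LLM can reason about the chart without ever seeing the pixels.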
A Boon for Contract Document Processing: Accurate Extraction of Signatures and Watermarks
When processing legal or business documents, the handling of signatures and watermarks is crucial. Nanonets-OCR2 can accurately identify signatures in documents and separate them from other text, outputting them independently in <signature> tags. Similarly, it can also detect and extract watermark text in documents and put it in <watermark> tags to ensure that important information is not missed.
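Because signatures and watermarks are emitted in their own tags, pulling them out of the model’s output is straightforward. Below is a minimal sketch, assuming the `<signature>...</signature>` and `<watermark>...</watermark>` tag format described above; the sample page text is invented for illustration.

```python
import re

def extract_tagged(markdown: str, tag: str) -> list[str]:
    """Return the text inside every <tag>...</tag> pair in the model's output."""
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(markdown)]

# Hypothetical model output for a signed, watermarked contract page.
page = (
    "This agreement is entered into on 2024-01-15.\n"
    "<watermark>CONFIDENTIAL</watermark>\n"
    "Agreed and accepted:\n"
    "<signature>Jane Doe</signature>\n"
)

signatures = extract_tagged(page, "signature")
watermarks = extract_tagged(page, "watermark")
```

A compliance pipeline could use this to verify that every page of a contract carries a signature before archiving it.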
The Savior of Form Processing: Smart Checkbox Handling
When processing questionnaires and forms, have you ever been overwhelmed by checkboxes in every imaginable style? Nanonets-OCR2 converts the checkboxes and radio buttons in forms into standardized Unicode symbols (☐, ☑, ☒), ensuring consistent and reliable downstream data processing.
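Standardized symbols make form responses trivially machine-countable. Here is a minimal sketch that tallies checkbox states in extracted text, assuming the three Unicode symbols mentioned above; the sample form is invented for illustration.

```python
# Unicode symbols the article says the model emits for form controls.
UNCHECKED = "\u2610"  # ☐
CHECKED = "\u2611"    # ☑
CROSSED = "\u2612"    # ☒

def tally_checkboxes(markdown: str) -> dict[str, int]:
    """Count each checkbox state in the extracted form text."""
    return {
        "unchecked": markdown.count(UNCHECKED),
        "checked": markdown.count(CHECKED),
        "crossed": markdown.count(CROSSED),
    }

# Hypothetical output for a short questionnaire.
form = "☑ Email me updates\n☐ Share my data with partners\n☒ Not applicable\n"
counts = tally_checkboxes(form)
```

Because every scanned form maps to the same three symbols regardless of how the boxes were drawn, aggregating thousands of questionnaires becomes a simple string count.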
From Complex Tables to Flowcharts, the Ultimate Display of Structured Data Extraction
Beyond individual elements, Nanonets-OCR2 is equally outstanding at processing complex structured data, and this is what truly sets it apart.
Complex Tables Can Be Handled with Ease
Processing tables in scanned documents is often a nightmare: merged cells and multi-level headers routinely cause traditional tools to “go crazy.” Nanonets-OCR2 accurately extracts complex tables from documents and converts them into both Markdown and HTML formats, so you can move straight to data analysis or web presentation.
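HTML output matters precisely because merged cells cannot be expressed in plain Markdown tables. A hypothetical example of what an extracted table with merged headers might look like (the figures are invented for illustration):

```html
<table>
  <tr><th rowspan="2">Region</th><th colspan="2">Revenue</th></tr>
  <tr><th>2022</th><th>2023</th></tr>
  <tr><td>EMEA</td><td>1.2M</td><td>1.5M</td></tr>
  <tr><td>APAC</td><td>0.9M</td><td>1.1M</td></tr>
</table>
```

The `rowspan` and `colspan` attributes preserve the original merged-cell layout, which a flat Markdown table would have to approximate by duplicating headers.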
Flowcharts and Organization Charts Can Also Be Digitized
Even more impressive, it can extract flowcharts and organization charts directly from documents and convert them into Mermaid code. This means you can seamlessly embed these visualized processes in your digitized documents, making them truly dynamic and interactive.
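For readers unfamiliar with Mermaid: it is a text-based diagramming syntax rendered natively by GitHub, many wikis, and documentation tools. A scanned approval flowchart might, hypothetically, come out as something like this (the process itself is invented for illustration):

```mermaid
flowchart TD
    A[Receive invoice] --> B{Approved?}
    B -- Yes --> C[Schedule payment]
    B -- No --> D[Return to vendor]
```

Once a diagram exists as Mermaid text, it can be version-controlled, edited, and re-rendered like any other source file, rather than living on as a static image.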
Breaking Down the Barriers of Language and Writing
A powerful document processing tool must not be limited by language or writing style.
Handwritten Documents Are No Longer Gibberish
Nanonets-OCR2 has been trained on a large number of handwritten documents, enabling it to effectively process handwritten characters in different languages and styles. This is undoubtedly a great boon for institutions that need to process a large number of handwritten medical records, notes, or historical archives.
Crossing the Barriers of Multiple Languages
In today’s globalized world, multilingual document processing is a basic requirement. Nanonets-OCR2 supports multiple languages, including English, Chinese, French, Spanish, Japanese, Korean, Arabic, and more, making it a truly global tool.
Visual Question Answering (VQA): Talk Directly to Your Documents
This is perhaps the most futuristic feature of Nanonets-OCR2. It does more than extract information: you can “ask” the document questions directly, as if you were talking to a person.
Its Visual Question Answering (VQA) function has been specially trained to focus on extracting answers from the context of the document. When you ask a question, the model will directly search for the answer in the document and provide it. If there is no relevant information in the document, it will clearly answer “Not mentioned,” which greatly reduces the “hallucination” or random guessing common in large language models, providing more reliable responses.
How to Get Started with Nanonets-OCR2?
The Nanonets team has generously open-sourced this powerful tool, allowing everyone to use and contribute to it. You can get started in the following ways:
- Live Demo: Upload your document directly on the official DocStrange website to experience its powerful features immediately.
- Official Blog: Want to learn more about the technical details behind it? You can read their research blog.
- GitHub: For developers, you can go directly to GitHub to get the source code and integrate it into your own applications.
- Hugging Face Models: You can also find and download all open-source models on Hugging Face.
Conclusion: The Next Chapter in Document Processing
The emergence of Nanonets-OCR2 not only delivers a more powerful OCR tool; it also heralds the arrival of a new era, one in which we can truly interact with documents intelligently. From academic research to business applications, from legal contracts to medical records, it has shown great potential to free us from tedious, repetitive document processing tasks, allowing us to focus on more valuable and creative work.
The open-sourcing of this technology will also inspire more developers to enter this field and jointly create a more intelligent and automated future. The next chapter in document processing has already been written by Nanonets-OCR2.
Frequently Asked Questions (FAQ)
Q1: What is the difference between Nanonets-OCR2 and general OCR tools?
Traditional OCR mainly converts the text in images into plain text. Nanonets-OCR2 goes a step further by understanding the overall structure and semantics of the document, identifying and marking complex elements such as LaTeX formulas, tables, signatures, and images, and converting them into structured Markdown, making it easier for other programs or large language models to process. In addition, it also has a Visual Question Answering (VQA) function.
Q2: What languages does Nanonets-OCR2 support?
It supports multiple languages, including but not limited to English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, and Arabic.
Q3: Can Nanonets-OCR2 process handwritten documents?
Yes. The model has been trained on a large number of multilingual handwritten documents and performs well at recognizing handwriting.
Q4: What is the Visual Question Answering (VQA) function?
This is a function that allows users to directly ask questions about the content of a document. For example, you can upload a financial report and then directly ask, “What was the total revenue in 2023?” The model will scan the document and provide the answer directly. If it cannot be found, it will reply “Not mentioned,” which effectively avoids the problem of the model guessing answers out of thin air.
Q5: Is Nanonets-OCR2 free?
Yes, models in the Nanonets-OCR2 series, such as Nanonets-OCR2-3B and Nanonets-OCR2-1.5B-exp, have been open-sourced on Hugging Face, and developers can download and use them for free.