Title: Beyond Fragmented Scanning: A Practical Guide to Baidu’s Unlimited-OCR with Constant KV Cache
Does processing long PDFs crash your server’s memory? This article explores Baidu’s 2026 open-source project, Unlimited-OCR, focusing on its R-SWA attention mechanism, Constant KV Cache technology, and providing a complete SGLang deployment guide for high-concurrency 32K token parsing.
Processing long documents has always been a technical nightmare. When development teams attempt to feed a fifty-page financial report or a complex technical manual into a model, server memory is inevitably overwhelmed. Engineers are often forced to write scripts to fragment the document, leading to broken tables and lost logical connections across context, followed by complex code to piece the fragmented information back together.
To be honest, this compromise is incredibly frustrating.
However, a breakthrough has arrived. On June 22, 2026, Baidu officially unveiled the Unlimited-OCR project, focusing on “single-pass long-context parsing.” This open-source solution directly addresses the memory constraints that have historically plagued OCR technology. The project quickly gained over 550 stars and 43 forks on GitHub. Today, we’ll break down the logic behind this technology to see how it allows models to ingest up to 32,000 tokens in one pass.
Memory No Longer a Monster: The Magic of Constant KV Cache
Developers often ask how this model differs from traditional workflows. The answer lies in its memory management mechanism.
Traditional models experience linear or geometric growth in Key-Value (KV) Cache as input length increases during long-sequence generation. It’s like trying to memorize a long string of numbers—the brain crashes at the end. To prevent crashes, systems are forced to reduce concurrency or limit input length.
Unlimited-OCR introduces “Constant KV Cache” as a key innovation. Through highly optimized cache management strategies, the model locks memory consumption during the decoding process within a near-constant range. This means whether you process a ten-page contract or a hundred-page specification, the GPU memory consumption remains stable. Server stability has significantly improved, avoiding unexpected downtime caused by sudden long documents.
Human-like Reading: R-SWA Reference Sliding Window Mechanism
Compressing memory isn’t enough for long-context parsing; the model must “understand” the context. This is where R-SWA (Reference-based Sliding Window Attention) comes in.
Imagine how humans read thick books. When reading a specific technical term on page fifty, you might use your finger to hold the table of contents or glossary page to reference global architecture. R-SWA does exactly that.
Traditional sliding window mechanisms save computational resources but often suffer from “amnesia.” R-SWA smartly replaces the traditional attention layer in the base model’s decoder. It retains global reference tokens while processing local details through the sliding window. Because of this mechanism, when the model parses data on the last page, it still maintains the context of the first page, solving the issue of context fragmentation.
Standing on the Shoulders of Giants: Tech Fusion
The industry already has excellent visual parsing models. The research team didn’t reinvent the wheel but fused the valuable insights of frontier models.
This architecture’s base multimodal understanding ability draws heavily from Deepseek-OCR and Deepseek-OCR-2, particularly benefiting from precision in complex layout identification. Additionally, the team leveraged the industrial-proven stability of PaddleOCR. Integrating these advantages successfully birthed this monster-level application capable of parsing 32K tokens in one pass.
Practical Exercise: High-Concurrency Deployment from Huggingface to SGLang
Now for the hard-core implementation. This powerful model uses a very developer-friendly MIT open-source license.
Another common question is whether the system can directly ingest PDF files and if special hardware is required. The answer is clear. The project not only natively integrates the PyMuPDF package for PDF-to-image conversion but also offers high deployment flexibility. Just prepare an NVIDIA GPU with sufficient memory, Python 3.12.3, and CUDA 12.9, to start inference via the Huggingface transformers interface.
If you intend to deploy to production, using SGLang to set up a local inference server is highly recommended. SGLang provides an OpenAI-compatible API endpoint, making streaming requests as natural as drinking water.
To ensure a pristine environment, using uv to manage virtual environments is a smart choice. You can refer to the following configuration logic:
# Set up and activate virtual environment with uv
uv venv
source .venv/bin/activate
# Install specific versions of SGLang and PDF processing packages
pip install ./wheel/sglang-*.whl
pip install kernels==0.9.0 PyMuPDF
# Launch high-efficiency inference server on port 30000
python -m sglang.launch_server --model-path ./path_to_model --port 30000
Once the server is running, the real time-saving begins. The project includes a small tool called infer.py, a lifesaver for processing massive amounts of files. It automatically starts the server and sends high-concurrency requests directly to a directory filled with historical PDF files or images. This clean, batch-processing architecture definitely keeps server load manageable and engineers’ stress levels down.
Future Potential Beyond OCR
Unlimited-OCR’s impact goes beyond parsing dozens of pages of financial reports. R-SWA is a general parsing attention mechanism. Since it solves long-sequence difficulties in visual documents with extremely low computational cost, this logic can naturally be applied to other domains. Imagine extending this mechanism to Automatic Speech Recognition (ASR) to process hours of meeting recordings, or applying it to machine translation, allowing models to maintain character personality and tone consistency while translating entire novels. The potential of this technology is just beginning to unfold.
When single-pass ultra-long-context processing becomes the norm, developers can finally focus on business logic rather than battling memory overflow errors every day. It’s highly recommended to pull the source code from GitHub and run it yourself to experience the fluidity of parsing fifty pages in one go.
Questions & Answers (Q&A)
Q: What is Baidu Unlimited-OCR, and what pain point does it solve? A: Unlimited-OCR is an open-source OCR project launched by Baidu on June 22, 2026, focusing on the “era of single-pass long-context parsing.” It solves the pain point where traditional OCR models experience memory explosions when processing long documents (like multi-page PDFs), forcing engineers to “fragment” files and lose logical connections.
Q: What is the core technology of Unlimited-OCR? How does it handle 32K tokens? A: Its core technology introduces “Reference-based Sliding Window Attention (R-SWA)” and “Constant KV Cache.” This locks GPU memory consumption during decoding, drastically reducing computational cost of the attention mechanism, and allows the model to retain global reference tokens while processing local details in a sliding window, ensuring context remains unbroken.
Q: Which frameworks are recommended for local deployment? A: Deployment is highly flexible. Developers can use Huggingface transformers directly on NVIDIA GPU environments (supporting Python 3.12.3/CUDA 12.9). For high-concurrency production environments, using SGLang is highly recommended to set up a local server, providing OpenAI-compatible APIs.
Q: Does the project support batch processing for many PDF files?
A: Yes. During environment setup, installing PyMuPDF for PDF-to-image conversion is recommended. The project also includes an infer.py tool that can automatically launch an SGLang server and send high-concurrency batch requests directly to a directory of files.
Q: Is the project’s open-source license friendly for commercial use? A: Yes, it is very friendly. Unlimited-OCR uses the MIT license, allowing enterprises and developers to download and apply it to commercial projects freely.



