1. Project Positioning
Unlimited-OCR is an OCR model for long-view document parsing. README clearly writes that it wants to take the Deepseek-OCR a step further and welcome the one-shot long-horizon parsing era.
Traditional OCR is often processed by page, block, and line, and then spliced through rules or post-processing structures. The selling point of Unlimited-OCR is to use visual language model to process longer document context at one time, especially suitable for semantic analysis of multi-page PDF, scanned documents and complex layout materials.
2. What does it mostly do?
| Competence | Official Material Basis | Pre-Sales Understanding |
|---|---|---|
| Single image document resolution | 'model.infer(... image_file =...)' | Suitable for single-page scans, screenshots, bills, and forms |
| Multi-page parsing | 'model.infer_multi(... image_files =[...])' | Suitable for multi-page documents such as contracts, reports, manuals, tenders, etc. |
| PDF Analysis | README uses PyMuPDF to transfer pictures and then goes through multi-page parsing | Accessible to document management, knowledge base and file digitization process |
| Long context output | 'max_length = 32768' | Suitable for preserving long document content and structure |
| Multi-inference backend | Transformers, vLLM, SGLang | The deployment path from PoC to production is more flexible |
| OpenAI-compatible API | The SGLang example provides '/v1/chat/completions' | It is convenient to connect to the existing enterprise model gateway/application layer |
3. Applicable Scenario
Document digitization and file storage
Suitable for government, finance, manufacturing, education, medical and other industries of historical archives, scanned PDF, contract, manual storage. Pre-sales can be said: first convert unstructured scans into retrievable text, and then enter the RAG, full-text search or business review process.
Contract/Report/Tender Analysis
Multi-page PDFs often have headers and footers, chapters, tables, and pictures. The Unlimited-OCR multi-page parsing capability is suitable for the first stage extraction, followed by the LLM for clause identification, risk summary, and chapter indexing.
Knowledge Base/RAG Pre-Processing
Many RAG projects fail not because of the retrieval model, but because of poor parsing quality of the source document. Unlimited-OCR can be used as a front-end capability of "scanned documents-> Markdown/text/structured content" to improve the quality of knowledge base storage.
Bill/Form/Screenshot Analysis
The 'gundam' or 'base' configuration of a single figure can be used for bills, page screenshots, and table pictures. Formal PoC should compare accuracy and cost between traditional OCR, PaddleOCR, DeepSeek-OCR, and document large models.
4. Not quite the scene
| Scenario | Reason |
|---|---|
| Only low-cost batch recognition of simple printed text | Traditional OCR is cheaper and faster |
| Strong real-time mobile OCR | Model inference relies on GPU, high deployment pressure on the end side |
| 100% of structured field extraction is required. | OCR is only the first step. Field schema, checksum and business rules are also required. |
| Very high-security documents but cannot deploy GPU locally | Cloud model calls may not meet compliance |
5. Deploy and use
Transformers
from transformers import AutoModel, AutoTokenizer
model_name = "baidu/Unlimited-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_name,
trust_remote_code=True,
use_safetensors=True,
torch_dtype="bfloat16",
).eval().cuda()
A single graph supports two configurations:
-'gundam': 'base_size = 1024, image_size = 640, crop_mode = True'
-'base': 'base_size = 1024, image_size = 1024, crop_mode = False'
Multi-page/PDF only use' base',README example uses PyMuPDF to convert PDF into 300 DPI PNG and then call' infer_multi '.
vLLM
The project has announced support for vLLM inference in 2026-06-28 and provides mirroring:
docker pull vllm/vllm-openai:unlimited-ocr
docker pull vllm/vllm-openai:unlimited-ocr-cu129
SGLang
The SGLang route can start the OpenAI-compatible services:
python -m sglang.launch_server \
--model baidu/Unlimited-OCR \
--served-model-name Unlimited-OCR \
--context-length 32768 \
--host 0.0.0.0 \
--port 10000
Suitable for connecting OCR services to existing application platforms.
6. What can be said before sales
Unlimited-OCR is not an ordinary OCR for single-line text recognition, but a visual language model for long document parsing. It is suitable for solving the problems of "many pages, complex layout, long context and subsequent knowledge base" when enterprises scan PDF, report, contract and file storage. PoC should not only look at a few screenshots, but use customer real PDF samples to evaluate accuracy, structural fidelity, throughput and GPU cost.
7. PoC Recommendations
| Phases | Work | Indicators |
|---|---|---|
| Sample Preparation | Select 50-100 real scans to cover clear/fuzzy/form/long documents | Sample representativeness |
| baseline comparison | contrast PaddleOCR, traditional OCR, DeepSeek-OCR, manual annotation | character accuracy, structure fidelity |
| Multi-page Test | Multi-page parsing after PDF to Picture | Page Order, Chapter, Table Loss Rate |
| Cost Evaluation | Transformers/vLLM/SGLang Three-Route Test | Single Page Time-consuming, Display and Storage, Concurrency |
| Business Closed Loop | RAG or Field Extraction | Search Hit Rate, Field Accuracy |
8. Risks and Considerations
-The README dependency version is relatively new, such as Python 3.12, CUDA 12.9, torch 2.10, etc. The customer environment should be checked in advance.
-'trust_remote_code = True' is a sensitive point for enterprise security review and requires source code audit or isolation environment.
-Document OCR results still need to be manually sampled and accepted, and the model output cannot be directly used as the legal/financial final result.
-PDF to image DPI, cropping, page orientation will significantly affect the effect.
-Multi-page output is very long, and subsequent LLM/retrieval links need to handle chunk, page number, and reference positioning.
9. My Pre-Sales Judgment
Unlimited-OCR is very suitable as a "enterprise document intelligence" front-end capability, rather than selling OCR alone. The best program narrative is:
扫描 PDF / 图片 -> Unlimited-OCR 长文档解析 -> 结构化清洗 -> RAG / 审核 / 字段抽取 / 检索
If the customer has a large number of scanned documents, files, contracts and reports, this project is worth PoC. If the customer only wants ordinary invoice/ID card/simple form OCR, traditional OCR or mature commercial OCR may be more economical.
10. REFERENCE
-GitHub:baidu/Unlimited-OCR
-Hugging Face:baidu/Unlimited-OCR
-arXiv:Unlimited OCR Works
-vLLM Recipe:recipes.vllm.ai/baidu/Unlimited-OCR
-Illustration:Unlimited-OCR.png
-Demo animation:long-horizon-ocr.gif
Information verification date: 2026-06-30. GitHub API anonymous access triggers stream limiting, this note does not write real-time stars/forks.