PaddleOCR-VL: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing¶
Conference: CVPR 2026 arXiv: 2603.24326 Code: https://github.com/PaddlePaddle/PaddleOCR Area: Multimodal VLM / Document Understanding Keywords: Document parsing, coarse-to-fine visual processing, vision-language model, OCR, visual token compression
TL;DR¶
This paper proposes PaddleOCR-VL, a coarse-to-fine document parsing framework. The coarse stage employs a lightweight VRFM module to identify effective visual regions, while the fine stage applies a compact 0.9B VLM to process only those regions. With minimal visual tokens and parameters, the framework achieves state-of-the-art performance on OmniDocBench v1.5, substantially reducing latency and resource consumption.
Background & Motivation¶
- Background: Document parsing is a critical technology for constructing LLM training corpora and RAG systems. High-resolution inputs are essential for document parsing, but they make the visual token count grow quadratically with resolution.
- Limitations of Prior Work: Pipeline-based methods (detection + recognition + structure reconstruction) are prone to error propagation; general-purpose VLMs hallucinate on handwritten or highly structured documents; specialized VLMs either have large parameter counts or suffer from coordinate drift.
- Key Challenge: High resolution is necessary for fine-grained recognition → visual token count explodes → high computational cost. Yet effective information in documents is highly non-uniform in distribution: PPT slides contain only ~39% effective regions, and newspapers ~60%.
- Goal: Eliminate visual redundancy while preserving high-resolution accuracy, thereby improving efficiency.
- Key Insight: The sparsity of effective visual regions—large portions of background and decorative elements carry no useful information.
- Core Idea: A coarse stage rapidly identifies effective regions (localization + contextual relation prediction), and a fine stage processes only those regions.
Method¶
Overall Architecture¶
The framework consists of two stages: (1) Coarse stage—VRFM (Valid Region Focusing Module) rapidly localizes document elements (text, formulas, tables, etc.); (2) Fine stage—PaddleOCR-VL-0.9B performs detailed recognition on the extracted effective regions. The decoupled design allows independent optimization of each stage.
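The two-stage flow can be sketched as follows. This is a hypothetical illustration only: `detect_regions` stands in for VRFM, `recognize` for the 0.9B VLM, and a list of text rows replaces a real page image; none of these names come from the actual PaddleOCR-VL API.

```python
# Hypothetical sketch of the coarse-to-fine flow. `detect_regions` stands
# in for VRFM and `recognize` for the 0.9B VLM; the toy "page" (a list of
# text rows) replaces a real image.

def detect_regions(page):
    """Coarse stage: return effective regions with a reading-order index
    (here, simply every non-empty row)."""
    return [
        {"order": i, "box": (0, i, len(row), i + 1), "content": row}
        for i, row in enumerate(page)
        if row.strip()
    ]

def recognize(region):
    """Fine stage: recognize one cropped effective region."""
    return region["content"].strip()

def parse_document(page):
    """Detect coarsely, recognize finely, assemble in reading order."""
    regions = sorted(detect_regions(page), key=lambda r: r["order"])
    return [recognize(r) for r in regions]

page = ["  Title  ", "", "Body text", "   "]
print(parse_document(page))  # background rows are never sent to the VLM
```

The point of the decoupling is visible even in this toy version: the fine stage sees only the regions the coarse stage keeps, so background rows cost nothing downstream.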
Key Designs¶
- VRFM (Valid Region Focusing Module):
- Function: Rapidly locates semantically effective regions within a document.
- Mechanism: A lightweight detector that simultaneously predicts region locations and inter-region contextual relationships (reading order). Compared to general-purpose object detectors, VRFM is optimized for document elements, achieving higher speed and accuracy.
- Design Motivation: Standard VLMs that directly process full-page images waste substantial computation on background regions. Locating regions prior to recognition significantly reduces the input volume for the fine stage.
- PaddleOCR-VL-0.9B:
- Function: Performs detailed text/formula/table recognition on effective visual regions.
- Mechanism: A compact VLM with only 0.9B parameters, guided by VRFM outputs to process cropped effective regions rather than full-page images. The smaller model scale is compensated for by a high-quality data pipeline (30M+ samples).
- Design Motivation: Small model + precise input > large model + coarse-grained input. The 0.9B scale enables edge deployment.
- Large-Scale High-Quality Data Pipeline:
- Function: Provides sufficient training signal for the small model.
- Mechanism: Over 30 million broadly distributed samples are collected from public sources and synthetic data, constituting one of the key factors behind model performance.
- Design Motivation: Smaller models have lower data efficiency and require more high-quality data to compensate.
Loss & Training¶
VRFM: standard detection loss + reading order prediction loss. VLM: standard language modeling loss.
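The paper only names the loss families, so the following is an illustrative stand-in rather than the confirmed formulation: a smooth-L1 box-regression term (a typical "standard detection loss" component) plus a cross-entropy term over reading-order targets.

```python
import math

# Illustrative forms only: the summary names "standard detection loss +
# reading-order prediction loss" without restating the exact formulation.
# Smooth-L1 regression and cross-entropy are common stand-ins.

def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Typical box-regression term in detection losses."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def order_cross_entropy(logits: list[float], target: int) -> float:
    """Cross-entropy over candidate reading-order positions."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def vrfm_loss(box_preds, box_targets, order_logits, order_target, w=1.0):
    """Detection term + weighted reading-order term."""
    det = sum(smooth_l1(p, t) for p, t in zip(box_preds, box_targets))
    return det + w * order_cross_entropy(order_logits, order_target)

# Perfect boxes, maximally uncertain 2-way order prediction -> loss = ln 2
print(vrfm_loss([0.0, 1.0], [0.0, 1.0], [0.0, 0.0], 0))
```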
Key Experimental Results¶
Main Results¶
| Metric | PaddleOCR-VL | GOT-OCR | Qwen2.5-VL-7B | InternVL3 |
|---|---|---|---|---|
| Text Score | SOTA | 2nd | 3rd | 4th |
| Formula Score | SOTA | 2nd | 3rd | 4th |
| Table Score | SOTA | 2nd | 3rd | 4th |
| Reading Order | SOTA | 2nd | 3rd | 4th |
| Visual Tokens | Fewest | More | Many | Most |
| Parameters | 0.9B | Larger | 7B | Larger |
Ablation Study¶
| Configuration | Overall Score | Notes |
|---|---|---|
| Full PaddleOCR-VL | SOTA | VRFM + 0.9B VLM |
| w/o VRFM (direct full-page) | Drops + slower | Token count explodes |
| w/o data pipeline | Significant drop | Data is critical for small models |
Key Findings¶
- VRFM reduces the number of visual tokens processed by 40–60% while simultaneously improving performance.
- A 0.9B model focused on effective regions can outperform a 7B general-purpose model.
- The data pipeline is one of the key factors enabling the 0.9B model to achieve SOTA performance.
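The 40–60% figure is consistent with the effective-region fractions cited earlier. A back-of-envelope check, assuming a ViT-style tokenizer with 14×14-pixel patches and an arbitrary page size (both assumptions, not the paper's actual encoder settings):

```python
# Back-of-envelope check of the claimed 40-60% token reduction. The patch
# size and page dimensions below are illustrative assumptions.

PATCH = 14  # pixels per patch side (assumed ViT-style tokenizer)

def visual_tokens(width: int, height: int) -> int:
    """Patch-token count for an image of the given size."""
    return (width // PATCH) * (height // PATCH)

full = visual_tokens(1330, 1834)  # a full high-resolution page
# Effective-region fractions cited above: ~39% (PPT), ~60% (newspapers)
for name, frac in [("PPT slide", 0.39), ("newspaper", 0.60)]:
    kept = int(full * frac)
    print(f"{name}: {full} -> ~{kept} tokens "
          f"(~{100 * (full - kept) / full:.0f}% fewer)")
```

Under these assumptions, processing only effective regions cuts roughly 40% of tokens for newspaper-like pages and about 60% for slide-like pages, matching the reported range.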
Highlights & Insights¶
- The design philosophy of "small model + precise input > large model + coarse-grained input" has important implications for resource-constrained deployment.
- The coarse-stage cost of VRFM is far lower than that of the VLM, yet it produces substantial savings in fine-stage computation, an ROI of roughly 10–100×.
- The decoupled design allows VRFM and the VLM to be independently optimized and upgraded.
Limitations & Future Work¶
- Detection errors in VRFM propagate to the fine stage (missed regions cannot be recognized).
- Robustness to extremely complex layouts (e.g., nested tables + formulas + images) requires further improvement.
- Reading order prediction may fail under multi-column layouts.
Related Work & Insights¶
- vs. GOT-OCR: GOT-OCR processes entire pages end-to-end with high computational cost; PaddleOCR-VL reduces redundancy through the coarse-to-fine strategy.
- vs. Qwen2.5-VL: General-purpose VLMs have large parameter counts (7B+) and are over-engineered for document scenarios; PaddleOCR-VL achieves superior results with only 0.9B parameters.
- vs. Traditional Pipelines: Traditional pipelines lack global semantic understanding; the VLM fine stage in PaddleOCR-VL provides semantic comprehension capability.
Rating¶
- Novelty: ⭐⭐⭐⭐ The coarse-to-fine document parsing framework and VRFM design demonstrate practical innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation on OmniDocBench across four dimensions, achieving SOTA on all.
- Writing Quality: ⭐⭐⭐⭐ Efficiency analysis is intuitive and comparisons are clearly presented.
- Value: ⭐⭐⭐⭐⭐ The open-source framework offers significant value to the document AI community.