# Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

- Conference: NeurIPS 2025
- arXiv: 2506.20168
- Code: https://huggingface.co/datasets/bytedance-research/KIE-HVQA (dataset publicly available)
- Area: Multimodal VLM
- Keywords: OCR hallucination, document understanding, reinforcement learning, GRPO, visual degradation

## TL;DR
This paper addresses OCR hallucinations in MLLMs under degraded document conditions. It introduces KIE-HVQA, the first benchmark for evaluating hallucinations in degraded document scenarios, and proposes a multi-objective-reward reinforcement learning framework based on GRPO. The resulting 7B-parameter model exceeds GPT-4o in hallucination-suppression accuracy by roughly 28 absolute points (58.05 vs. 30.21 average).
## Background & Motivation
- Background: Multimodal large language models (MLLMs) have made significant progress in document understanding, handling diverse document types such as identity cards, invoices, and contracts.
- Limitations of Prior Work: Existing models exhibit a fundamental paradigm flaw in real-world scenarios: when confronted with visual degradation (e.g., blur, occlusion, low contrast), models fail to strictly follow visual signals, instead over-relying on language priors or generating cross-modal hallucinated content.
- Key Challenge: The problem arises at three stages: (1) pretraining lacks key information extraction (KIE) data with clean annotations for degraded scenes; (2) instruction fine-tuning broadly ignores degraded visual scenarios, since researchers typically assume clean OCR inputs; (3) evaluation lacks benchmarks that specifically quantify OCR hallucinations in document understanding. Consequently, when processing identity cards occluded by glare or low-contrast reports, models fall back on language priors rather than observable visual evidence.
- Goal: Train the model toward visually faithful reasoning, so that every extracted field is grounded in observable visual evidence and illegible content is flagged rather than guessed.
- Key Insight: KIE answers are exactly verifiable, so OCR hallucination can be framed as a precisely rewardable problem and suppressed with reinforcement learning.
## Method

### Overall Architecture

The framework consists of three stages: data collection and construction, cold-start supervised fine-tuning (SFT), and rule-based reinforcement learning (GRPO). Cross-modal reasoning data containing visual image descriptions is first collected; the model's reasoning capability is then initialized via SFT; finally, GRPO with a multi-objective reward function tailored to degraded OCR strengthens the model's generalization.

### Key Designs
- KIE-HVQA Benchmark: The first benchmark dataset specifically designed to evaluate OCR hallucinations in degraded documents. It contains 2,000 training samples and 400 test instances, covering three document types: identity cards, receipts, and invoices. Data sources include OCRBench (100 queries), WildReceipt (entity answer reconstruction), and synthetic templates generated by GPT-4o (200 privacy-compliant virtual documents). Each sample simulates realistic degradation such as motion blur and low contrast (see the degradation sketch after this list), with pixel-level annotations and OCR reliability scores. Evaluation metrics include clear-character accuracy, degraded-character accuracy, and global OCR performance.
- Cold-start Initialization: This addresses the inability of language-only reasoning models to directly process multimodal data. GPT-4o first converts image–question–answer triples into plain-text pseudo chain-of-thought (CoT) data (image descriptions plus reasoning traces); these are merged with detailed image descriptions generated by the MLLM and fed into DeepSeek-R1 to produce high-quality CoT data. The textual CoT is finally paired with the corresponding images to construct a multimodal CoT dataset for cold-start initialization (a schematic of this data flow follows the list), ensuring the reasoning process closely mirrors human cognitive behavior.
- Multi-objective Reward Function for Degraded OCR: The paper's central design. Characters are divided into three classes by degree of degradation: (a) fully clear characters, which must be accurately recognized and retained; (b) partially occluded but human-readable characters, which are marked as "anomalous" but still included in the output; (c) completely unrecognizable characters, which must not appear in the OCR output and are replaced by spaces to prevent hallucination. For example, in "Beautiful", "B, a, u, f, u, l" are clear, "e" is partially occluded, and "t, i" are completely occluded. The reward jointly scores three dimensions: clear-character distance, unclear-character distance, and final-answer distance, using Levenshtein edit distance as the base metric (a minimal reward sketch follows the list).
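To make the degradation types concrete, here is a minimal sketch of how motion blur and low contrast could be synthesized with OpenCV. This is illustrative only, not the authors' generation pipeline, and the file names are placeholders.

```python
import cv2
import numpy as np

def motion_blur(img: np.ndarray, ksize: int = 15) -> np.ndarray:
    """Convolve with a horizontal line kernel to mimic camera motion."""
    kernel = np.zeros((ksize, ksize), dtype=np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize  # energy spread along one row
    return cv2.filter2D(img, -1, kernel)

def low_contrast(img: np.ndarray, alpha: float = 0.4, beta: float = 80.0) -> np.ndarray:
    """Compress the dynamic range: out = alpha * img + beta, clipped to uint8."""
    return cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

doc = cv2.imread("id_card.png")                      # placeholder path
degraded = low_contrast(motion_blur(doc, ksize=21))
cv2.imwrite("id_card_degraded.png", degraded)
```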
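The cold-start data flow can also be summarized in code. The three helpers below are stand-ins for the actual GPT-4o, MLLM, and DeepSeek-R1 calls (hypothetical stubs, shown only to make the pipeline explicit):

```python
def gpt4o_pseudo_cot(image_path: str, question: str, answer: str) -> str:
    """Stub for GPT-4o: plain-text pseudo-CoT (image description + reasoning trace)."""
    return f"[pseudo-CoT for {question!r}]"

def mllm_describe(image_path: str) -> str:
    """Stub for the MLLM's detailed image description."""
    return f"[detailed description of {image_path}]"

def deepseek_r1_cot(context: str, question: str) -> str:
    """Stub for DeepSeek-R1: refines the merged text into a high-quality CoT."""
    return f"[refined CoT over {len(context)} chars of context]"

def build_cold_start_sample(image_path: str, question: str, answer: str) -> dict:
    pseudo = gpt4o_pseudo_cot(image_path, question, answer)  # step 1: pseudo-CoT
    desc = mllm_describe(image_path)                         # step 2a: merge descriptions
    cot = deepseek_r1_cot(desc + "\n" + pseudo, question)    # step 2b: refine with R1
    # Step 3: pair the textual CoT back with the image -> multimodal CoT sample for SFT.
    return {"image": image_path, "question": question, "cot": cot, "answer": answer}
```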
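Finally, a minimal sketch of the three-part reward, assuming normalized edit-distance similarities combined with equal weights; the field layout and the weights are assumptions for illustration, not the paper's exact formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb))) # substitute
        prev = curr
    return prev[-1]

def similarity(pred: str, gold: str) -> float:
    """1 - normalized edit distance, in [0, 1]."""
    denom = max(len(pred), len(gold))
    return 1.0 if denom == 0 else max(0.0, 1.0 - levenshtein(pred, gold) / denom)

def ocr_reward(pred: dict, gold: dict, w=(1/3, 1/3, 1/3)) -> float:
    """Combine clear-character, unclear-character, and final-answer terms."""
    return (w[0] * similarity(pred["clear"], gold["clear"])
          + w[1] * similarity(pred["unclear"], gold["unclear"])
          + w[2] * similarity(pred["answer"], gold["answer"]))

# "Beautiful" example: "t" and "i" are unreadable and must become spaces.
gold = {"clear": "Bauful", "unclear": "e", "answer": "Beau  ful"}
pred = {"clear": "Bauful", "unclear": "e", "answer": "Beautiful"}  # hallucinated t, i
print(ocr_reward(pred, gold))  # < 1.0: hallucinating occluded characters costs reward
```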
### Loss & Training
Training follows a two-stage strategy:
- SFT Stage: Qwen2.5-VL-7B-Instruct is used as the backbone and fine-tuned on the cold-start data for 5 epochs (learning rate 1e-6, batch size 512) with the LLaMA-Factory framework; training takes approximately 4 hours.
- GRPO Stage: Building on the SFT model, reinforcement learning is performed on a mixture of TextOCR, WildReceipt, and other OCR datasets. GRPO samples \(G\) candidate responses per input, computes advantages via intra-group normalization (a minimal sketch follows), and optimizes the policy to produce higher-reward outputs while a KL-divergence penalty constrains deviation from the reference model. Implemented with the EasyR1 framework.
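A minimal sketch of the group-relative advantage computation at the core of GRPO (simplified; the full objective with importance ratios, clipping, and the KL term is omitted):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage over the G responses to one prompt:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# G = 4 candidates scored by the multi-objective OCR reward (toy values):
rewards = np.array([0.82, 0.55, 0.91, 0.40])
print(grpo_advantages(rewards))
# Above-average responses receive positive advantage and are reinforced;
# the KL penalty toward the frozen SFT reference keeps updates conservative.
```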
## Key Experimental Results

### Main Results
| Model | Clear (Clr, %) | Unclear (Nc, %) | Final OCR (%) | Avg (%) |
|---|---|---|---|---|
| GPT-4o | 22.78 | 36.13 | 31.74 | 30.21 |
| Claude 3.7 Sonnet | 19.77 | 33.73 | 26.17 | 26.56 |
| Gemini 2.5 Pro | 36.94 | 34.64 | 33.53 | 35.03 |
| Qwen2.5-VL-72B | 20.02 | 24.19 | 20.37 | 21.53 |
| InternVL3-78B | 6.09 | 8.59 | 6.43 | 7.04 |
| Ours (SFT+RL) | 55.45 | 61.34 | 57.35 | 58.05 |
### Ablation Study
| Configuration | Clear (Clr, %) | Unclear (Nc, %) | Final OCR (%) | Note |
|---|---|---|---|---|
| Clear reward only | 50.64 | 44.15 | 53.34 | Significant drop on unclear characters |
| Final reward only | 51.06 | 54.06 | 54.24 | Inferior to combined reward |
| All rewards | 55.45 | 61.34 | 57.35 | Best across all dimensions |
| SFT only | 49.65 | 57.25 | 49.72 | Already reasonably strong baseline |
| SFT + RL | 55.45 | 61.34 | 57.35 | RL yields additional gains |
### Key Findings
- The 7B-parameter model improves degraded-document hallucination suppression by approximately 28 absolute points over GPT-4o (58.05 vs. 30.21 average).
- On unclear character recognition, the proposed model (61.34%) substantially outperforms GPT-4o (36.13%), validating the effectiveness of the uncertainty-aware mechanism.
- General OCR capability is not compromised: the model scores 180/179/183 on the Scene, Doc, and Info subsets of OCRBench, comparable to GPT-4o (180/167/163) and the original Qwen2.5-VL-7B (181/181/182).
- The multi-objective reward combination is critical for handling real-world degradation patterns; single-reward variants underperform the combined reward on every evaluation dimension.
## Highlights & Insights
- This work is the first to combine the quantifiable nature of KIE task answers with reinforcement learning, reformulating OCR hallucinations as a precisely optimizable problem.
- The three-level character degradation taxonomy (clear / partially occluded / completely occluded) is elegantly designed, enabling the reward function to guide model behavior with precision.
- A refusal-to-answer mechanism increases task difficulty, teaching the model to actively abstain when uncertain rather than fabricate answers.
- The cold-start stage leverages a GPT-4o + DeepSeek-R1 collaborative pipeline to address the challenge of multimodal CoT data generation.
## Limitations & Future Work
- The benchmark dataset is relatively small in scale (2,000 training + 400 test instances); degradation types and document categories can be further diversified.
- Validation is currently limited to Qwen2.5-VL-7B; the effectiveness of the approach on larger-scale models remains to be explored.
- Evaluation metrics based on edit distance may not fully capture semantic-level comprehension quality.
- Degradation simulation relies primarily on synthetic methods, which may differ from the distribution of real-world degradation.
## Related Work & Insights
- The GRPO algorithm from DeepSeek-R1 provides the foundation for the reinforcement learning framework; however, this paper designs a specialized multi-objective reward mechanism tailored to OCR tasks.
- Existing OCR benchmarks (OCRBench, DocLocal4K) primarily focus on line-level recognition and document understanding, overlooking hallucinations under degraded conditions.
- Reasoning-enhancement works (VisRL, Visual-RFT) have begun applying reinforcement learning to MLLM visual reasoning, but no prior study specifically targets OCR hallucinations.
- Key insight: for tasks requiring precise answers, constructing fine-grained reward functions that exploit the quantifiable properties of target answers is an effective approach.
## Rating
- Novelty: ⭐⭐⭐⭐ (Novel problem formulation; first degraded document hallucination benchmark; however, the GRPO framework itself is not original)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive comparisons and complete ablations; dataset scale is somewhat limited)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure; motivation is well articulated)
- Value: ⭐⭐⭐⭐ (Provides an important direction for document understanding reliability)