DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding¶
Conference: CVPR 2026 arXiv: 2604.12812 Code: https://github.com/yh-hust/DocSeeker Area: Multimodal VLM / Document Understanding Keywords: Long document understanding, evidence grounding, structured reasoning, reinforcement learning, visual RAG
TL;DR¶
DocSeeker achieves structured reasoning and evidence grounding for long document understanding through an ALR (Analyze–Locate–Reason) visual reasoning paradigm combined with two-stage training (SFT + EviGRPO). The model is trained exclusively on short documents yet generalizes robustly to documents of extreme length.
Background & Motivation¶
Background: MLLMs suffer severe performance degradation on document VQA as document length grows. Pure-vision approaches treat each page as an image input to avoid OCR error propagation.
Limitations of Prior Work: (1) Low signal-to-noise ratio: key evidence is buried among a large number of irrelevant pages; (2) Supervision scarcity: datasets provide only short final answers without intermediate reasoning steps. Visual RAG faces a Top-\(k\) dilemma—large \(k\) introduces noise while small \(k\) misses evidence.
Key Challenge: Models learn fragile shortcuts (memorization) rather than genuine reasoning ability, resulting in poor interpretability and weak OOD generalization.
Goal: Train models to adopt a structured "locate first, then reason" workflow rather than directly predicting answers.
Key Insight: Inspired by human cognitive processes—first analyze intent, then locate evidence, and finally reason.
Core Idea: The ALR paradigm requires the model to explicitly produce a structured chain of thought following the "Analyze → Locate → Reason" sequence, combined with two-stage training via SFT and evidence-aware GRPO.
Method¶
Overall Architecture¶
Built upon Qwen-2.5-VL-7B, with page IDs prefixed to each page as pointers. The output is constrained to follow the ALR structure: \(\mathbf{Y} = (\mathbf{Y}_A \oplus \mathbf{Y}_L \oplus \mathbf{Y}_R) \oplus (\mathbf{Y}_E \oplus \mathbf{Y}_F)\), comprising question analysis, evidence localization (with cited page numbers), reasoning process, evidence page list, and final answer.
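To make the page-aware input and the ALR output concrete, here is a minimal Python sketch of what a structured response might look like, together with a parser. The tag names and the example content are illustrative assumptions; the summary above only prescribes the Analyze → Locate → Reason → Evidence → Final-answer ordering and the cited page list.

```python
import re

# Hypothetical ALR output template (section tags are assumptions; only the
# Analyze -> Locate -> Reason -> Evidence -> Final-answer order is specified).
ALR_EXAMPLE = """
<analyze>The question asks for the total revenue reported in 2021.</analyze>
<locate>Page 3 contains the income statement; page 7 repeats the figure in a summary table.</locate>
<reason>The income statement on page 3 lists 2021 revenue as $4.2M, consistent with page 7.</reason>
<evidence>[3, 7]</evidence>
<answer>$4.2M</answer>
""".strip()

def parse_alr(output: str) -> dict:
    """Split a structured ALR response into its five components."""
    fields = {}
    for tag in ("analyze", "locate", "reason", "evidence", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", output, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields

print(parse_alr(ALR_EXAMPLE)["evidence"])  # -> "[3, 7]"
```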
Key Designs¶
- ALR Visual Reasoning Paradigm:
- Function: A structured "locate first, then reason" workflow.
- Mechanism: Page-aware input (interleaved page IDs and visual tokens) combined with three-stage structured output. The model must first analyze user intent, then scan the document to locate relevant pages with justification, and finally synthesize reasoning from the grounded evidence.
- Design Motivation: Enforced evidence localization compels the model to distinguish visual tokens across different pages, counteracting interference caused by long visual inputs.
- Evidence-Aware GRPO (EviGRPO):
- Function: Jointly optimizes evidence localization and reasoning via reinforcement learning.
- Mechanism: A multi-dimensional reward function \(R = \lambda_1 R_{format} + \lambda_2 R_{evidence} + \lambda_3 R_{answer}\). The format reward enforces the ALR template; the evidence reward scores page localization with an \(F_\beta\) measure (\(\beta > 1\), biasing it toward recall); the answer reward scores the final answer with ANLS (a sketch of the full reward appears under Loss & Training below).
- Design Motivation: Reasoning paths produced by SFT tend to be suboptimal; RL enables the model to learn directly from outcome signals, surpassing imitation learning.
- Evidence-Guided Resolution Allocation (EGRA):
- Function: Supports longer document inputs during training.
- Mechanism: Evidence pages are kept at high resolution; 70% of non-evidence pages are downsampled (1024→256) while the remaining 30% retain high resolution. At inference time, all pages are processed at high resolution.
- Design Motivation: This not only alleviates GPU memory constraints but also improves the signal-to-noise ratio of training data—outperforming the strategy of simply discarding non-evidence pages.
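Below is a minimal sketch of the EGRA allocation described in the last item, assuming a simple random split of non-evidence pages; the function name and sampling details are illustrative assumptions rather than the paper's implementation.

```python
import random

def egra_resolutions(num_pages, evidence_pages, high=1024, low=256,
                     downsample_ratio=0.7, seed=0):
    """Assign a target resolution (longest side, in pixels) to every page.

    Evidence pages stay at high resolution; a random 70% of the remaining
    pages are downsampled to low resolution, and the rest keep high
    resolution. At inference time all pages would simply use `high`.
    """
    rng = random.Random(seed)
    evidence = set(evidence_pages)
    non_evidence = [p for p in range(num_pages) if p not in evidence]
    num_low = int(round(downsample_ratio * len(non_evidence)))
    low_pages = set(rng.sample(non_evidence, num_low))
    return {p: (low if p in low_pages else high) for p in range(num_pages)}

# Example: a 10-page document whose evidence lies on pages 2 and 5.
print(egra_resolutions(10, evidence_pages=[2, 5]))
```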
Loss & Training¶
Stage I: Standard cross-entropy SFT on 13,986 ALR CoT samples distilled from Gemini-2.5-Flash. Stage II: EviGRPO (rollout group size 16; format/evidence/answer reward weights of 0.1/0.3/0.6). Training is conducted exclusively on documents with \(\leq 20\) pages.
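The following is a hedged sketch of the EviGRPO reward under the stated 0.1/0.3/0.6 weights. The concrete \(\beta\) value (here \(\beta = 2\)) is an assumption, since only \(\beta > 1\) is specified, and the ANLS cutoff uses the conventional 0.5 threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """ANLS: 1 - normalized edit distance, zeroed below the usual 0.5 threshold."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred, gold) / max(len(pred), len(gold), 1)
    score = 1.0 - nl
    return score if score >= tau else 0.0

def f_beta(pred_pages, gold_pages, beta: float = 2.0) -> float:
    """Recall-weighted F-beta over cited evidence pages (beta > 1 favors recall)."""
    pred, gold = set(pred_pages), set(gold_pages)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def evigrpo_reward(follows_format, pred_pages, gold_pages, pred_answer, gold_answer,
                   weights=(0.1, 0.3, 0.6)):
    """Combine format, evidence, and answer rewards with the 0.1/0.3/0.6 weights."""
    r_format = 1.0 if follows_format else 0.0
    r_evidence = f_beta(pred_pages, gold_pages)
    r_answer = anls(pred_answer, gold_answer)
    w1, w2, w3 = weights
    return w1 * r_format + w2 * r_evidence + w3 * r_answer

# A rollout that cites one extra page and has a minor formatting difference
# in the answer still earns a high, but not perfect, reward.
print(evigrpo_reward(True, [3, 7], [3], "$4.2M", "$4.2 M"))
```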
Key Experimental Results¶
Main Results¶
| Method | Params | DUDE↑ | MPDocVQA↑ | MMLong↑ | LongDocURL↑ |
|---|---|---|---|---|---|
| Baseline | 7B | 35.2 | 70.1 | 25.4 | 37.8 |
| InternVL3 | 8B | 47.4 | 80.8 | 24.1 | 38.7 |
| GPT-4o | — | 54.1 | 67.4 | 42.8 | 64.5 |
| DocSeeker | 7B | 56.8 | 87.2 | 48.5 | 58.3 |
Ablation Study¶
| Configuration | DUDE | MPDocVQA | Notes |
|---|---|---|---|
| Full DocSeeker | 56.8 | 87.2 | SFT + EviGRPO + EGRA |
| SFT only | 52.1 | 84.5 | No RL |
| SFT + GRPO (no evidence reward) | 54.3 | 85.8 | Standard GRPO |
| Without EGRA | 50.8 | 82.1 | Uniform resolution |
Key Findings¶
- Improvements of 30–60% over the baseline demonstrate the effectiveness of the ALR paradigm.
- Trained exclusively on documents with \(\leq 20\) pages, the model generalizes robustly to documents of up to 468 pages.
- DocSeeker's localization capability naturally complements visual RAG and can serve as a backbone model for RAG systems.
Highlights & Insights¶
- "Train short, generalize long" is a surprising result: the ALR paradigm learns transferable reasoning skills rather than memorized patterns.
- The EGRA strategy is simple yet effective: differentiated resolution allocation simultaneously reduces memory consumption and improves signal-to-noise ratio, outperforming the strategy of discarding non-evidence pages.
- The evidence-aware reward design makes the RL stage more targeted and purposeful.
Limitations & Future Work¶
- Training data is drawn solely from MP-DocVQA and DUDE, limiting domain coverage.
- Reliance on Gemini-2.5-Flash for distillation constrains data quality to the teacher model's capability.
- The pure-vision approach still has limitations on dense-text pages.
- The framework could be extended to cross-document reasoning over multiple documents.
Related Work & Insights¶
- vs. VisRAG/SV-RAG: These are retrieval-augmented methods; the ALR paradigm makes DocSeeker an end-to-end approach with explicit localization capability.
- vs. mPLUG-DocOwl2: DocOwl2 employs visual token compression, whereas DocSeeker applies differentiated resolution via EGRA.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Both the ALR paradigm and EviGRPO represent significant innovations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive in-domain and out-of-domain evaluation with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Method and experiments are articulated clearly.
- Value: ⭐⭐⭐⭐⭐ Substantial advancement for long document understanding.