DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding¶
Conference: CVPR 2026 arXiv: 2604.12812 Code: https://github.com/yh-hust/DocSeeker Area: Multimodal VLM / Document Understanding Keywords: Long document understanding, evidence grounding, structured reasoning, reinforcement learning, visual RAG
TL;DR¶
DocSeeker achieves structured reasoning and evidence grounding for long document understanding through an ALR (Analyze–Locate–Reason) visual reasoning paradigm combined with two-stage training (SFT + EviGRPO). The model is trained exclusively on short documents yet generalizes robustly to documents of extreme length.
Background & Motivation¶
Background: MLLMs suffer severe performance degradation on document VQA as document length grows. Pure-vision approaches treat each page as an image input to avoid OCR error propagation.
Limitations of Prior Work: (1) Low signal-to-noise ratio: key evidence is buried among a large number of irrelevant pages; (2) Supervision scarcity: datasets provide only short final answers without intermediate reasoning steps. Visual RAG faces a Top-\(k\) dilemma—large \(k\) introduces noise while small \(k\) misses evidence.
Key Challenge: Models learn fragile shortcuts (memorization) rather than genuine reasoning ability, resulting in poor interpretability and weak OOD generalization.
Goal: Train models to adopt a structured "locate first, then reason" workflow rather than directly predicting answers.
Key Insight: Inspired by human cognitive processes—first analyze intent, then locate evidence, and finally reason.
Core Idea: The ALR paradigm requires the model to explicitly produce a structured chain of thought following the "Analyze → Locate → Reason" sequence, combined with two-stage training via SFT and evidence-aware GRPO.
Method¶
Overall Architecture¶
Built upon Qwen-2.5-VL-7B, with page IDs prefixed to each page as pointers. The output is constrained to follow the ALR structure: \(\mathbf{Y} = (\mathbf{Y}_A \oplus \mathbf{Y}_L \oplus \mathbf{Y}_R) \oplus (\mathbf{Y}_E \oplus \mathbf{Y}_F)\), comprising question analysis, evidence localization (with cited page numbers), reasoning process, evidence page list, and final answer.
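To make the page-aware input and the ALR output concrete, here is a minimal Python sketch of what a structured response might look like, together with a parser. The tag names and the example content are illustrative assumptions; the summary above only prescribes the Analyze → Locate → Reason → Evidence → Final-answer ordering and the cited page list.

```python
import re

# Hypothetical ALR output template (section tags are assumptions; only the
# Analyze -> Locate -> Reason -> Evidence -> Final-answer order is specified).
ALR_EXAMPLE = """
<analyze>The question asks for the total revenue reported in 2021.</analyze>
<locate>Page 3 contains the income statement; page 7 repeats the figure in a summary table.</locate>
<reason>The income statement on page 3 lists 2021 revenue as $4.2M, consistent with page 7.</reason>
<evidence>[3, 7]</evidence>
<answer>$4.2M</answer>
""".strip()

def parse_alr(output: str) -> dict:
    """Split a structured ALR response into its five components."""
    fields = {}
    for tag in ("analyze", "locate", "reason", "evidence", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", output, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields

print(parse_alr(ALR_EXAMPLE)["evidence"])  # -> "[3, 7]"
```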
Key Designs¶
- ALR Visual Reasoning Paradigm:
- Function: A structured "locate first, then reason" workflow.
- Mechanism: Page-aware input (interleaved page IDs and visual tokens) combined with three-stage structured output. The model must first analyze user intent, then scan the document to locate relevant pages with justification, and finally synthesize reasoning from the grounded evidence.
- Design Motivation: Enforced evidence localization compels the model to distinguish visual tokens across different pages, counteracting interference caused by long visual inputs.
- Evidence-Aware GRPO (EviGRPO):
- Function: Jointly optimizes evidence localization and reasoning via reinforcement learning.
- Mechanism: A multi-dimensional reward function \(R = \lambda_1 R_{format} + \lambda_2 R_{evidence} + \lambda_3 R_{answer}\). The format reward enforces the ALR template; the evidence reward scores page localization with an \(F_\beta\) measure (\(\beta > 1\), biasing it toward recall); the answer reward scores the final answer with ANLS (a sketch of the full reward appears under Loss & Training below).
- Design Motivation: Reasoning paths produced by SFT tend to be suboptimal; RL enables the model to learn directly from outcome signals, surpassing imitation learning.
- Evidence-Guided Resolution Allocation (EGRA):
- Function: Supports longer document inputs during training.
- Mechanism: Evidence pages are kept at high resolution; 70% of non-evidence pages are downsampled (1024→256) while the remaining 30% retain high resolution. At inference time, all pages are processed at high resolution.
- Design Motivation: This not only alleviates GPU memory constraints but also improves the signal-to-noise ratio of training data—outperforming the strategy of simply discarding non-evidence pages.
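Below is a minimal sketch of the EGRA allocation described in the last item, assuming a simple random split of non-evidence pages; the function name and sampling details are illustrative assumptions rather than the paper's implementation.

```python
import random

def egra_resolutions(num_pages, evidence_pages, high=1024, low=256,
                     downsample_ratio=0.7, seed=0):
    """Assign a target resolution (longest side, in pixels) to every page.

    Evidence pages stay at high resolution; a random 70% of the remaining
    pages are downsampled to low resolution, and the rest keep high
    resolution. At inference time all pages would simply use `high`.
    """
    rng = random.Random(seed)
    evidence = set(evidence_pages)
    non_evidence = [p for p in range(num_pages) if p not in evidence]
    num_low = int(round(downsample_ratio * len(non_evidence)))
    low_pages = set(rng.sample(non_evidence, num_low))
    return {p: (low if p in low_pages else high) for p in range(num_pages)}

# Example: a 10-page document whose evidence lies on pages 2 and 5.
print(egra_resolutions(10, evidence_pages=[2, 5]))
```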
Loss & Training¶
Stage I: Standard cross-entropy SFT on 13,986 ALR CoT samples distilled from Gemini-2.5-Flash. Stage II: EviGRPO (rollout group size 16; format/evidence/answer reward weights of 0.1/0.3/0.6). Training is conducted exclusively on documents with \(\leq 20\) pages.
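The following is a hedged sketch of the EviGRPO reward under the stated 0.1/0.3/0.6 weights. The concrete \(\beta\) value (here \(\beta = 2\)) is an assumption, since only \(\beta > 1\) is specified, and the ANLS cutoff uses the conventional 0.5 threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """ANLS: 1 - normalized edit distance, zeroed below the usual 0.5 threshold."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred, gold) / max(len(pred), len(gold), 1)
    score = 1.0 - nl
    return score if score >= tau else 0.0

def f_beta(pred_pages, gold_pages, beta: float = 2.0) -> float:
    """Recall-weighted F-beta over cited evidence pages (beta > 1 favors recall)."""
    pred, gold = set(pred_pages), set(gold_pages)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def evigrpo_reward(follows_format, pred_pages, gold_pages, pred_answer, gold_answer,
                   weights=(0.1, 0.3, 0.6)):
    """Combine format, evidence, and answer rewards with the 0.1/0.3/0.6 weights."""
    r_format = 1.0 if follows_format else 0.0
    r_evidence = f_beta(pred_pages, gold_pages)
    r_answer = anls(pred_answer, gold_answer)
    w1, w2, w3 = weights
    return w1 * r_format + w2 * r_evidence + w3 * r_answer

# A rollout that cites one extra page and has a minor formatting difference
# in the answer still earns a high, but not perfect, reward.
print(evigrpo_reward(True, [3, 7], [3], "$4.2M", "$4.2 M"))
```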
Key Experimental Results¶
Main Results¶
| Method | Params | DUDE↑ | MPDocVQA↑ | MMLong↑ | LongDocURL↑ |
|---|---|---|---|---|---|
| Baseline | 7B | 35.2 | 70.1 | 25.4 | 37.8 |
| InternVL3 | 8B | 47.4 | 80.8 | 24.1 | 38.7 |
| GPT-4o | — | 54.1 | 67.4 | 42.8 | 64.5 |
| DocSeeker | 7B | 56.8 | 87.2 | 48.5 | 58.3 |
Ablation Study¶
| Configuration | DUDE | MPDocVQA | Notes |
|---|---|---|---|
| Full DocSeeker | 56.8 | 87.2 | SFT + EviGRPO + EGRA |
| SFT only | 52.1 | 84.5 | No RL |
| SFT + GRPO (no evidence reward) | 54.3 | 85.8 | Standard GRPO |
| Without EGRA | 50.8 | 82.1 | Uniform resolution |
Key Findings¶
- Improvements of 30–60% over the baseline demonstrate the effectiveness of the ALR paradigm.
- Trained exclusively on documents with \(\leq 20\) pages, the model generalizes robustly to documents of up to 468 pages.
- DocSeeker's localization capability naturally complements visual RAG and can serve as a backbone model for RAG systems.
Highlights & Insights¶
- "Train short, generalize long" is a surprising result: the ALR paradigm learns transferable reasoning skills rather than memorized patterns.
- The EGRA strategy is simple yet effective: differentiated resolution allocation simultaneously reduces memory consumption and improves signal-to-noise ratio, outperforming the strategy of discarding non-evidence pages.
- The evidence-aware reward design makes the RL stage more targeted and purposeful.
Limitations & Future Work¶
- Training data is drawn solely from MP-DocVQA and DUDE, limiting domain coverage.
- Reliance on Gemini-2.5-Flash for distillation constrains data quality to the teacher model's capability.
- The pure-vision approach still has limitations on dense-text pages.
- The framework could be extended to cross-document reasoning over multiple documents.
Related Work & Insights¶
- vs. VisRAG/SV-RAG: These are retrieval-augmented methods; the ALR paradigm makes DocSeeker an end-to-end approach with explicit localization capability.
- vs. mPLUG-DocOwl2: DocOwl2 employs visual token compression, whereas DocSeeker applies differentiated resolution via EGRA.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Both the ALR paradigm and EviGRPO represent significant innovations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive in-domain and out-of-domain evaluation with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Method and experiments are articulated clearly.
- Value: ⭐⭐⭐⭐⭐ Substantial advancement for long document understanding.