Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://huggingface.co/ApsaraStackMaaS (Model Weights)
Area: Multimodal VLM / Document Retrieval
Keywords: Visual document retrieval, curriculum learning, hard negative mining, LLM meta-controller, late interaction
TL;DR
Evo-Retriever treats the model and its training curriculum as a co-evolving pair: multi-viewpoint alignment and bidirectional contrastive learning stabilize the representations, while an external LLM meta-controller adjusts hard negative difficulty based on real-time training state. It reports new SOTA nDCG@5 scores of 65.2% on ViDoRe V2 and 77.1% on MMEB (VisDoc).
Background & Motivation
Background: Complex Visual Document Retrieval (CVDR) entails precisely locating pages related to multimodal queries from large-scale corpora. Mainstream approaches have shifted from "parse-then-retrieve" (OCR + chunking) to directly feeding page screenshots into VLMs to encode them into dense vectors. Among these, multi-vector late-interaction models like ColPali index pages at the token level and calculate pair-wise similarity with query tokens at runtime for fine-grained alignment, representing the current SOTA.
Limitations of Prior Work: The authors identify three specific flaws in current SOTA models on real-world complex documents: (a) Insufficient spatial awareness: Relying on a single fixed viewpoint makes it difficult to integrate spatially dispersed information (e.g., aggregating "Plastic 12%" and "Metal 8%" from different areas); (b) Inherent vulnerability to textual confusion: Existing contrastive learning only mines hard negatives that are "visually similar but semantically different," while neglecting "textually similar but visually mismatched" negatives (e.g., misidentifying a text block as a match for a query requesting a chart); (c) Stagnation due to static curricula: Even with data synthesis (GME), the curriculum for selecting hard negatives is pre-defined. Models quickly learn the initial hard negatives, after which these samples provide no challenge, leading to gradient signal decay and degraded discriminative power.
Key Challenge: Model capability evolves during training, but the training curriculum (hard negative difficulty) stays static. A fixed threshold supplies useful gradients early on, then yields only trivial negatives with near-zero gradients later (the core motivation in Fig. 2).
Core Idea: Enable model–curriculum co-evolution. First, establish robust base representations through Viewpoint-Pathway Collaboration (multi-view + bidirectional paths). Then, employ an LLM meta-controller to adaptively adjust difficulty ranges for hard negative mining based on training state summaries, ensuring the supervision signal remains challenging throughout the training process.
Method
Overall Architecture
Evo-Retriever is built on a Qwen2.5-VL backbone with multi-vector late interaction and integrates three components. MVA (Multi-Viewpoint Alignment) adds spatial awareness and BCL (Bidirectional Contrastive Learning) mitigates textual confusion; together they form the "Viewpoint-Pathway Collaboration". LLM-EC (LLM-guided Evolutionary Curriculum) acts as a meta-controller that dynamically adjusts hard negative mining. MVA and BCL produce robust representations and a candidate pool of hard negatives (images and queries); LLM-EC then determines the difficulty level at each step, with the model alternating between the estimator role (difficulty estimation) and the learner role (sample learning).
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Page Screenshot + Query<br/>Qwen2.5-VL Late Interaction Backbone"] --> B["Multi-Viewpoint Alignment MVA<br/>Original+Downsampled+Rotated Stitching<br/>Consistency Alignment"]
B --> C["Bidirectional Contrastive Learning BCL<br/>Q→D Hard Negative Images + D→Q Synthetic Hard Queries"]
C --> D["Offline Candidate Pool Generation<br/>warm-up→Global Mining top-N"]
D --> E["LLM-EC Online Evolutionary Curriculum<br/>State Summary→LLM Decision→Update Difficulty Range"]
E -->|2 Hard Neg Images + 2 Hard Neg Queries per step| C
E --> F["Late Interaction Retrieval<br/>nDCG@5 Output"]
Key Designs
1. Multi-Viewpoint Alignment (MVA): Forcing Geometric Invariance within a Fixed Token Budget
To address "insufficient spatial awareness," the authors construct a multi-view jigsaw \(I_{aug}\) for each image \(I\): the original image, a downsampled version, and a rotated version (sampled from \([-180°, 180°]\)) are horizontally stitched. Leveraging Qwen-VL’s smart resizing, the jigsaw generates different patch layouts within the same token budget, allowing the model to see multi-scale and multi-orientation views without increasing inference costs. The original image \(I\) and the jigsaw \(I_{aug}\) share the same mapping path and are aligned with the matching query \(Q\). Two alignment losses, \(\mathcal{L}_{Q\to I}\) and \(\mathcal{L}_{Q\to I_{aug}}\), enforce "representation consistency across scales/orientations under the same query." Preserving the original view is critical: ablations show performance drops when using only downsampling (-1.42%) or only stitching without the original image (-2.58%), indicating the necessity of both high-fidelity global context and geometric perturbations. MVA is used only during training and incurs zero additional inference overhead.
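As a rough illustration of the jigsaw construction, the sketch below stitches three views side by side, using nested lists as a stand-in for image tensors. The helper names (`downsample`, `rotate_180`, `pad_to_height`, `make_jigsaw`), the stride-2 downsampling, the zero padding, and the single fixed 180° rotation are all assumptions made for brevity; the paper samples the angle from the full range and relies on the backbone's resizing rather than manual padding.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def downsample(img: Image, factor: int = 2) -> Image:
    """Keep every `factor`-th pixel along both axes (assumed scheme)."""
    return [row[::factor] for row in img[::factor]]


def rotate_180(img: Image) -> Image:
    """Rotate by 180 degrees; one fixed angle stands in for a sampled one."""
    return [row[::-1] for row in img[::-1]]


def pad_to_height(img: Image, height: int, fill: int = 0) -> Image:
    """Append rows of `fill` so all views share one height before stitching."""
    width = len(img[0])
    return img + [[fill] * width for _ in range(height - len(img))]


def make_jigsaw(img: Image) -> Image:
    """Horizontally stitch original, downsampled, and rotated views."""
    views = [img, downsample(img), rotate_180(img)]
    height = max(len(view) for view in views)
    padded = [pad_to_height(view, height) for view in views]
    # Concatenate the r-th row of every view into one wide row.
    return [sum((view[r] for view in padded), []) for r in range(height)]


page = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
jigsaw = make_jigsaw(page)
```

In training, both `page` and `jigsaw` would be encoded by the same backbone and aligned with the same query; at inference only the original page is used.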
2. Bidirectional Contrastive Learning (BCL): Adding the D→Q Path to Mine "Textually Similar, Visually Mismatched" Negatives
To address "textual confusion," where mainstream retrievers only use query→document (Q→D) contrast, BCL adds constraints in both directions. The core is an automated Hard Negative Query Synthesis (HNQS) pipeline: given a positive pair \((I_{pos}, Q_{pos})\), a VLM (Qwen2.5-VL-72B) synthesizes 20 candidate queries that are syntactically/contextually similar to \(Q_{pos}\) but semantically inconsistent with image \(I_{pos}\). These "textually similar but visually mismatched" hard queries force the model to anchor semantics to visual evidence rather than surface-level text matching. Selection is dynamically determined by the LLM-EC.
The overall objective combines both directions using a softplus-based margin loss (rather than softmax-normalized InfoNCE), so that each hard negative contributes gradients independently.
The forward term includes the dual views of MVA: \(\mathcal{L}_{forward} = \mathcal{L}(Q_{pos}, I_{ori}, \{I_{neg}\}) + \beta \cdot \mathcal{L}(Q_{pos}, I^{aug}_{ori}, \{I^{aug}_{neg}\})\). The general margin loss sums over \(K\) hard negatives: \(\mathcal{L}(Q, I_{pos}, \{I_{neg}\}) = \sum_{k=1}^{K} \log\big(1 + \exp(\frac{\text{sim}(Q, I_{neg}^{(k)}) - \text{sim}(Q, I_{pos})}{\tau})\big)\), where similarity follows ColBERT late interaction: \(\text{sim}(Q, I) = \sum_{l=1}^{L_Q} \max_{j=1}^{L_I}\big(E_Q(Q)_l \cdot E_I(I)_j^T\big)\) (L2-normalized token embeddings, dot product as cosine). The backward term symmetrically constrains D→Q using synthetic hard queries \(\{Q_{neg}\}\).
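The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, token embeddings are plain lists of floats, and `tau` is left as a required argument because the text does not state its value.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def late_interaction_sim(query_tokens: Sequence[Vector],
                         doc_tokens: Sequence[Vector]) -> float:
    """ColBERT-style MaxSim: for each query token take its best-matching
    document token, then sum over query tokens."""
    q = [l2_normalize(t) for t in query_tokens]
    d = [l2_normalize(t) for t in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(q_tok, d_tok)) for d_tok in d)
        for q_tok in q
    )


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def margin_loss(sim_pos: float, sim_negs: Sequence[float], tau: float) -> float:
    """Sum of softplus((sim_neg - sim_pos) / tau) over the K hard negatives."""
    return sum(softplus((s - sim_pos) / tau) for s in sim_negs)


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 2.0]]
score = late_interaction_sim(query, doc)
```

The same `margin_loss` would serve the D→Q direction by passing the similarity between the positive image and each synthetic hard query as `sim_negs`.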
3. LLM-Guided Curriculum Evolution (LLM-EC): Closed-Loop Scheduling of Negative Sample Difficulty
This design directly addresses "static curriculum stagnation." It operates in two phases. Offline Candidate Pool Generation: A warm-up round using in-batch negatives establishes an initial representation space. The model then switches from learner to estimator for a one-time global offline mining—retrieving and storing the top-N most similar negative documents for each query \(q\) from the entire corpus \(\mathcal{D}\) to form a candidate pool \(C_q\).
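The one-time global mining step might look like the sketch below. The function name `mine_candidate_pool`, the dictionary layout, and the decision to exclude known positives are assumptions; `score_fn` stands in for whatever similarity the warmed-up model provides, and a real system would use an approximate index rather than a full scan.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def mine_candidate_pool(
    queries: Dict[str, object],
    corpus: Dict[str, object],
    positives: Dict[str, Set[str]],
    score_fn: Callable[[object, object], float],
    top_n: int,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each query, keep the top_n highest-scoring non-positive documents."""
    pool: Dict[str, List[Tuple[str, float]]] = {}
    for query_id, query in queries.items():
        known_positives = positives.get(query_id, set())
        scored = (
            (doc_id, score_fn(query, doc))
            for doc_id, doc in corpus.items()
            if doc_id not in known_positives
        )
        # nlargest returns the results sorted from hardest to easiest.
        pool[query_id] = heapq.nlargest(top_n, scored, key=lambda pair: pair[1])
    return pool


def dot(a, b):
    """Toy similarity used only for this example."""
    return sum(x * y for x, y in zip(a, b))


toy_pool = mine_candidate_pool(
    queries={"q1": (1.0, 0.0)},
    corpus={"d1": (1.0, 0.0), "d2": (0.9, 0.1), "d3": (0.0, 1.0), "d4": (0.5, 0.5)},
    positives={"q1": {"d1"}},
    score_fn=dot,
    top_n=2,
)
```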
Online LLM-Guided Curriculum Evolution: The curriculum is formalized into \(M\) discrete difficulty ranges, each defined by \([\tau_{low}, \tau_{high}]\) on a positive-aware difficulty metric. The action of "selecting a range" is assigned to an external LLM in a three-step cycle: (i) State Summary: Aggregating key metrics (e.g., average hard negative loss, loss trends) into structured reports; (ii) LLM Deliberation: The LLM selects the next action based on a "three-stage decision protocol"; (iii) Curriculum Update: Implementing the new range for the next training phase. Unlike fixed threshold schedulers, this controller makes decisions conditioned on training dynamics, allowing for non-monotonic adjustments (e.g., rolling back to simpler ranges if training becomes unstable).
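To make the notion of a difficulty range concrete, the sketch below filters a candidate pool by a window \([\tau_{low}, \tau_{high}]\). It assumes the positive-aware metric is the ratio of a negative's score to the positive's score, that the positive score is greater than zero, and that the \(M\) ranges are equal-width; none of these details is confirmed by the text, and the names `make_ranges` and `select_in_range` are hypothetical.

```python
from typing import List, Tuple


def make_ranges(m: int, low: float = 0.0, high: float = 1.0) -> List[Tuple[float, float]]:
    """Split [low, high] into m equal-width difficulty ranges (assumed layout)."""
    step = (high - low) / m
    return [(low + i * step, low + (i + 1) * step) for i in range(m)]


def select_in_range(
    candidates: List[Tuple[str, float]],
    pos_score: float,
    window: Tuple[float, float],
    k: int,
) -> List[str]:
    """Return up to k negatives whose score ratio to the positive falls in the
    window, hardest first. Assumes pos_score > 0."""
    tau_low, tau_high = window
    eligible = [
        (doc_id, score / pos_score)
        for doc_id, score in candidates
        if tau_low <= score / pos_score <= tau_high
    ]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in eligible[:k]]


candidates = [("a", 9.5), ("b", 9.0), ("c", 8.5), ("d", 5.0), ("e", 9.9)]
chosen = select_in_range(candidates, pos_score=10.0, window=(0.80, 0.98), k=2)
```

Here `e` is skipped because it scores too close to the positive (a likely false negative under this assumed metric) and `d` is skipped as too easy.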
The Three-Stage Decision Protocol mimics human curriculum design: ① Exploration: Systematically mapping the "difficulty-performance landscape" by monitoring loss across various ranges; ② Transition: Identifying "effective learning" actions where loss falls within an ideal range (\(0.3 \le \text{loss} \le 1.2\)) and selecting the hardest as the main anchor; ③ Lock-in: Fine-tuning during the main training phase by evaluating "learning speed"—increasing difficulty when mastered, decreasing when struggling, or maintaining otherwise.
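The protocol can be approximated by a small rule-based stand-in, shown below. This is not the LLM controller: only the ideal loss band of 0.3 to 1.2 comes from the text, while the function names and the use of that same band as the "mastered" and "struggling" test in the lock-in stage are simplifying assumptions (the paper's controller judges learning speed from a state summary).

```python
from typing import Dict, Optional

IDEAL_LOW, IDEAL_HIGH = 0.3, 1.2  # "effective learning" loss band from the text


def pick_anchor(explored: Dict[int, float]) -> Optional[int]:
    """Transition stage: among explored ranges (higher index = harder) whose
    loss lies in the ideal band, return the hardest; None if there is none."""
    effective = [
        idx for idx, loss in explored.items() if IDEAL_LOW <= loss <= IDEAL_HIGH
    ]
    return max(effective) if effective else None


def lock_in_step(current: int, recent_loss: float, num_ranges: int) -> int:
    """Lock-in stage: move one range harder when the loss is below the band,
    one range easier when above it, and stay put otherwise (assumed rule)."""
    if recent_loss < IDEAL_LOW:
        return min(current + 1, num_ranges - 1)
    if recent_loss > IDEAL_HIGH:
        return max(current - 1, 0)
    return current


# Exploration stage result: average hard negative loss observed per range index.
explored_losses = {0: 0.1, 5: 0.6, 10: 1.1, 15: 2.4}
anchor = pick_anchor(explored_losses)
```

Because `lock_in_step` may return a lower index, the schedule can roll back to easier ranges, which is the non-monotonic behavior a fixed threshold scheduler lacks.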
Loss & Training
Backbone: Qwen2.5-VL-3B / 7B-Instruct, projected to 128 dimensions, with at most 1024 visual tokens per image. Loss weights: \(\alpha = \beta = 1\). Training consists of two stages: 1 epoch of warm-up on 480K pairs with in-batch negatives and InfoNCE, after which the top-N=200 candidate pool is built; then 1 epoch with the LLM-EC dynamic curriculum. Training uses LoRA (rank 32), paged_adamw_8bit, a learning rate of \(2\times10^{-5}\), and a global batch size of 32. Each step uses 2 synthetic hard negative queries and 2 hard negative images. The meta-controller is Qwen3-235B-A22B.
Key Experimental Results
Main Results
Benchmarks: ViDoRe V2 (Zero-shot, nDCG@5) and MMEB VisDoc (nDCG@5).
| Benchmark | Metric | Evo-Retriever-7B | Prev. SOTA | Gain |
|---|---|---|---|---|
| ViDoRe V2 (Avg) | nDCG@5 | 65.2 | 63.5 (llama-nemoretriever-3b) | +1.7 |
| ViDoRe V2 · Economics Macro | nDCG@5 | 59.1 | 55.9 | +3.2 |
| MMEB (VisDoc) Avg | nDCG@5 | 77.12 | 75.18 (gme-Qwen2-VL-7B) | +1.94 |
| MMEB · VisRAG | nDCG@5 | 89.28 | 84.99 | +4.29 |
Notably, Evo-Retriever-3B (63.3%) outperforms the architecturally similar colqwen2.5-v0.2 (59.3%) by 4.0 points on ViDoRe V2, which suggests the gains come from the training strategy rather than from scale alone.
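For readers unfamiliar with the metric, here is a minimal sketch of nDCG@5 for a single query. It uses the linear-gain form of DCG, which coincides with the exponential-gain form when relevance labels are binary; the function names are illustrative and the benchmarks' official evaluation code may differ in details such as tie handling.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 is index 0)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 5) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# The only relevant page is retrieved at rank 2.
example = ndcg_at_k(["a", "b", "c"], {"b": 1.0}, k=5)
```

The benchmark figure is the mean of this per-query value over all queries in a dataset.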
Ablation Study (ViDoRe V2, 3B Model, Baseline Net0 = InfoNCE in-batch = 61.17%)
| Configuration | nDCG@5 | Note |
|---|---|---|
| Baseline (Net0) | 61.17 | In-batch negative only |
| + MVA (Net1) | 62.25 | Multi-view alignment, +1.08 |
| Downsample-only | 59.75 | Simplified augmentation, -1.42 |
| Stitched-only | 58.59 | Removing original image, -2.58 |
| + BCL (Net2) | 61.84 | Bidirectional contrast, +0.67 |
| Net0+MVA+BCL | 62.39 | Base representation, +1.22 |
| Fixed Window 80-98% | 62.10 | Strong static curriculum, +0.93 |
| Rule-based Oracle | 62.81 | Dynamic curriculum with fixed loss thresholds |
| LLM-EC (Ours) | 63.05 | Outperforms Oracle by +0.24 |
| Full Model | 63.30 | All components |
Key Findings
- Dual Views are Essential: MVA requires both the original image (fidelity) and the augmented jigsaw (invariance); the simplified variants (downsample-only, or stitched without the original image) both fall below the baseline.
- Exploration Phase is Vital: Disabling it leads to a 1.19% drop, showing that adaptively determining the starting difficulty is more effective than pre-defined starts.
- Difficulty Granularity Matters: Neither too coarse (10 ranges) nor too fine (21 ranges) performs as well as 16 ranges, which balances convergence within each interval.
- Controller Scale is Not the Bottleneck: Qwen3-32B achieved 63.30%, slightly higher than the 235B model, suggesting that with a proper protocol, curriculum control relies on flexible logical interpretation rather than extreme scale.
Highlights & Insights
- Model-Curriculum Co-Evolution is a Clean Abstraction: Explicitly modeling hard negative difficulty as an Action space for an external agent turns the "curriculum" from a dead parameter into a closed-loop controllable object.
- Efficient Token Jigsawing: Using VLM smart resizing to pack multi-scale views into the same token budget achieves geometric invariance at almost zero marginal cost.
- LLM as "Coach" rather than Generator: The LLM does not generate data; it reads structured summaries and makes difficulty decisions. This "meta-level scheduling" is a lightweight way to use LLMs in training.
- Margin Loss vs. InfoNCE: Margin loss allows hard negative signals to remain distinct, avoiding the dilution of gradients that often occurs with softmax normalization in InfoNCE.
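The margin-versus-InfoNCE point can be checked numerically. The sketch below computes the gradient of each loss with respect to every negative's similarity score, using the standard closed forms: a sigmoid for the softplus margin loss and a softmax probability for InfoNCE. The function names and the toy scores are illustrative assumptions.

```python
import math
from typing import List


def margin_grads(sim_pos: float, sim_negs: List[float], tau: float) -> List[float]:
    """d(loss)/d(sim_neg_k) for the softplus margin loss:
    sigmoid((sim_neg_k - sim_pos) / tau) / tau, independent of other negatives."""
    return [1.0 / (1.0 + math.exp(-(s - sim_pos) / tau)) / tau for s in sim_negs]


def infonce_grads(sim_pos: float, sim_negs: List[float], tau: float) -> List[float]:
    """d(loss)/d(sim_neg_k) for InfoNCE: softmax probability of negative k
    divided by tau, so every negative competes for a shared probability mass."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total / tau for e in exps[1:]]


# Gradient on the same negative (score 0.8), alone and next to a harder one (0.99).
margin_alone = margin_grads(1.0, [0.8], tau=0.1)[0]
margin_crowded = margin_grads(1.0, [0.8, 0.99], tau=0.1)[0]
infonce_alone = infonce_grads(1.0, [0.8], tau=0.1)[0]
infonce_crowded = infonce_grads(1.0, [0.8, 0.99], tau=0.1)[0]
```

Under the margin loss the gradient on the first negative is unchanged when a harder negative is added, whereas under InfoNCE it shrinks because the softmax normalizer grows.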
Limitations & Future Work
- Reliance on External Models: Generating hard queries and the candidate pool requires powerful VLMs and global mining, making the pipeline relatively heavy.
- Static Candidate Pool: While selection is dynamic, the pool itself is indexed after warm-up. If representations evolve significantly, the pool may no longer contain the "hardest" negatives.
- Controller Scale Ambiguity: The 32B model outperforming the 235B model lacks deep variance analysis; it is unclear if this is a trend or noise.
- Single-Domain Validation: The co-evolution paradigm has yet to be tested on general image-text retrieval or basic cross-modal alignment.
Related Work & Insights
- vs. ColPali / Late Interaction: This work adopts the late-interaction architecture but adds multi-view consistency to stabilize alignment under layout changes. It addresses spatial invariance, textual de-confusion, and curriculum evolution on top of fine-grained matching.
- vs. GME / NV-Retriever: While prior work uses state-agnostic thresholds or linear interpolation, this work uses an LLM meta-controller to ensure supervision remains challenging based on training dynamics.
- vs. DocReRank: The HNQS synthesis prompt design is inspired by DocReRank but integrated into a bidirectional contrastive framework and dynamic curriculum rather than just a reranking stage.
Rating
- Novelty: ⭐⭐⭐⭐⭐ "Model-curriculum co-evolution + LLM meta-controller" is a distinct and self-consistent paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐ Solid results on primary benchmarks with three-layer ablations, though candidate pool updates and controller variance could be explored further.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear mapping between pain points and components; intuition is well-communicated.
- Value: ⭐⭐⭐⭐ SOTA performance with a transferable paradigm, though reproduction costs are high.