RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations¶

Conference: CVPR 2026
arXiv: 2602.22013
Code: https://robustvisrag.github.io/
Area: Information Retrieval
Keywords: VisRAG, Robustness, Causal Inference, Visual Degradation, Dual-Path Encoding

TL;DR¶

This paper proposes RobustVisRAG, a causality-guided dual-path framework that decouples semantic–degradation entanglement in VisRAG by capturing degradation signals via a non-causal path and learning clean semantics via a causal path. Under real-world degradation conditions, the framework achieves improvements of 7.35%, 6.35%, and 12.40% in retrieval, generation, and end-to-end performance, respectively, while preserving performance on clean data.

Background & Motivation¶

Background: VisRAG encodes document images directly with a VLM for retrieval and generation, avoiding OCR errors, and has become the dominant paradigm for document question answering.

Limitations of Prior Work:
- Both TextRAG and VisRAG suffer significant performance degradation under corrupted inputs (blur, noise, low light, shadow, etc.).
- Semantic and degradation factors are entangled in the visual encoder of VisRAG: degradation distorts the embedding space, causing retrieval mismatches and generation instability.
- Dual failure modes exist: corrupted representations may lead to incorrect document retrieval, and even when retrieval is correct, degradation can mislead generation.

Key Challenge: In the representation space of existing VLM encoders, the semantic factor \(S\) and degradation factor \(D\) are entangled. The observed image \(X\) is a collider of \(S\) and \(D\) (\(S \to X \leftarrow D\)), so conditioning on \(X\), or on a representation \(Z\) encoded from it, opens the spurious path \(S \leftrightarrow D\).
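The collider effect can be checked numerically. The sketch below is a toy simulation, not the paper's SCM: it assumes scalar Gaussian factors and an additive observation \(X = S + D\), both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# S (semantics) and D (degradation) are drawn independently.
s = rng.normal(size=n)
d = rng.normal(size=n)

# Toy collider (assumption): the observation depends on both factors.
x = s + d

# Marginally, S and D are uncorrelated.
corr_marginal = np.corrcoef(s, d)[0, 1]

# Conditioning on X (here: keeping only samples with X close to 1)
# induces a strong negative dependence between S and D.
band = np.abs(x - 1.0) < 0.1
corr_given_x = np.corrcoef(s[band], d[band])[0, 1]

print(f"corr(S, D)         = {corr_marginal:+.3f}")
print(f"corr(S, D | X ~ 1) = {corr_given_x:+.3f}")
```

Once \(X\) is fixed, knowing more about \(D\) tells you more about \(S\), which is how an encoder that only sees \(X\) can end up mixing the two factors.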

Goal: Enable VisRAG to remain robust under degraded inputs without additional inference cost and without compromising performance on clean inputs.

Key Insight: A structural causal model (SCM) is used to analyze how degradation propagates through VisRAG, and causal intervention (do-operator) is applied to block the non-causal path.

Core Idea: Learn a disentangled representation \(Z = [Z_{sem}, Z_{deg}]\) such that the semantic component is invariant to degradation, equivalent to the causal intervention \(P(A|do(D=d_0))\).

Method¶

Overall Architecture¶

A dual-path architecture is introduced into the VLM visual encoder. The Non-Causal Path extracts degradation representations via unidirectional attention, while the Causal Path encodes clean semantics via bidirectional attention. Both paths are jointly optimized through two learning objectives: NCDM and CSA.

Key Designs¶

Non-Causal Path:
- A learnable non-causal token \(z_{nc}^{(0)}\) is introduced.
- Unidirectional attention constraint: the non-causal token can attend to all patch tokens, but patch tokens cannot attend to the non-causal token.
- Degradation cues are aggregated as: \(z_{nc}^{(l+1)} = z_{nc}^{(l)} + \sum_j \alpha_{nc \leftarrow j}^{(l)} v_j^{(l)}\)
- The final degradation representation is \(Z_{deg} = z_{nc}^{(L)}\).
- Design Motivation: Unidirectional attention prevents degradation information from flowing back into semantic tokens, achieving structured isolation.
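The unidirectional constraint can be expressed as an attention mask. The following is a minimal single-head sketch in NumPy, not the authors' implementation. The placement of the non-causal token at the last index, the exclusion of self-attention for that token, and the names `build_mask` and `masked_attention_layer` are all assumptions of this example.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def build_mask(num_patches):
    """mask[i, j] is True when query i may attend to key j.

    Index num_patches holds the non-causal token (placement is an assumption).
    """
    n = num_patches + 1
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_patches, :num_patches] = True  # patch -> patch, bidirectional
    mask[num_patches, :num_patches] = True   # non-causal token -> patches only
    return mask

def masked_attention_layer(tokens, w_q, w_k, w_v, mask):
    """One residual attention update: x <- x + sum_j alpha_j v_j."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # blocked pairs get zero weight
    alpha = softmax(scores)
    return tokens + alpha @ v

rng = np.random.default_rng(0)
T, dim = 6, 8
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
patches = rng.normal(size=(T, dim))
mask = build_mask(T)

# Two runs that differ only in the initial non-causal token.
out_a = masked_attention_layer(
    np.vstack([patches, rng.normal(size=(1, dim))]), w_q, w_k, w_v, mask)
out_b = masked_attention_layer(
    np.vstack([patches, rng.normal(size=(1, dim))]), w_q, w_k, w_v, mask)

# Patch outputs do not depend on the non-causal token.
patches_isolated = np.allclose(out_a[:T], out_b[:T])
z_nc_next = out_a[T]  # z_nc after one layer
```

Because the last column of the mask is entirely `False`, nothing the non-causal token accumulates can reach a patch token, which is the isolation property described above.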
Causal Path:
- Bidirectional attention operates among patch tokens, excluding the non-causal token.
- Semantic representation: \(Z_{sem} = \text{Agg}(x_1^{(L)}, ..., x_T^{(L)})\)
- This path follows the causal route \(S \to Z_{sem}\) and is designed to be unaffected by degradation.
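A small sketch of the aggregation step follows. The summary does not specify \(\text{Agg}\), so mean pooling is used here purely as a stand-in, and the non-causal token is assumed to occupy the last index.

```python
import numpy as np

def aggregate_semantic(final_tokens, num_patches):
    """Z_sem as the mean of the final-layer patch tokens.

    Mean pooling stands in for Agg(.), which is an assumption of this sketch.
    The non-causal token is assumed to sit at index num_patches and is skipped.
    """
    return final_tokens[:num_patches].mean(axis=0)

rng = np.random.default_rng(0)
T, dim = 6, 8
final_tokens = rng.normal(size=(T + 1, dim))  # T patch tokens + 1 non-causal token

z_sem = aggregate_semantic(final_tokens, T)

# Overwriting the non-causal slot leaves Z_sem untouched.
altered = final_tokens.copy()
altered[T] = 1e3
z_sem_altered = aggregate_semantic(altered, T)
same = np.array_equal(z_sem, z_sem_altered)
```

Whatever pooling is actually used, the key property is that the non-causal token never enters the computation of \(Z_{sem}\).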
Non-Causal Distortion Modeling (NCDM):
- A degradation contrastive learning objective: representations of the same degradation type are pulled closer, while those of different types are pushed apart.
- \(\mathcal{L}_{NCDM} = \max(0, \|Z_{deg}^a - Z_{deg}^p\|_2^2 - \|Z_{deg}^a - Z_{deg}^n\|_2^2 + \delta)\)
- Ensures that the non-causal token genuinely learns to encode degradation feature patterns.
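The NCDM formula above is a triplet margin loss on squared Euclidean distances. A minimal NumPy sketch is shown below. The function name and the margin value \(\delta = 0.5\) are assumptions for illustration, with the anchor and positive sharing a degradation type and the negative coming from a different one.

```python
import numpy as np

def ncdm_triplet_loss(z_anchor, z_pos, z_neg, margin=0.5):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin), computed per triplet.

    margin plays the role of delta; 0.5 is a hypothetical value.
    """
    d_pos = np.sum((z_anchor - z_pos) ** 2, axis=-1)
    d_neg = np.sum((z_anchor - z_neg) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

anchor = np.array([1.0, 0.0])    # e.g. Z_deg of a blurred page
same_type = np.array([0.9, 0.1]) # another blurred page
other_type = np.array([-1.0, 0.0])  # e.g. a low-light page

# Well-separated triplet: the constraint is already satisfied.
loss_ok = ncdm_triplet_loss(anchor, same_type, other_type)

# Swapping positive and negative violates the margin and is penalised.
loss_bad = ncdm_triplet_loss(anchor, other_type, same_type)
```

The loss only becomes non-zero when a different-type sample is not at least \(\delta\) farther from the anchor (in squared distance) than a same-type sample.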
Causal Semantic Alignment (CSA):
- Aligns the semantic representations of degraded and clean images to prevent degradation from leaking into the causal path.
- Keeps \(Z_{sem}\) stable under degradation conditions.
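The summary does not give the exact form of the CSA objective. One plausible instantiation, shown here only as an assumption, is a cosine-distance penalty between the semantic embedding of a degraded page and that of its clean counterpart.

```python
import numpy as np

def csa_loss(z_sem_degraded, z_sem_clean):
    """Mean cosine distance between degraded and clean semantic embeddings.

    The cosine form is an assumption of this sketch; the paper's objective
    may differ. In training, the clean embedding would typically be treated
    as a fixed target.
    """
    a = z_sem_degraded / np.linalg.norm(z_sem_degraded, axis=-1, keepdims=True)
    b = z_sem_clean / np.linalg.norm(z_sem_clean, axis=-1, keepdims=True)
    return np.mean(1.0 - np.sum(a * b, axis=-1))

clean = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]])

loss_aligned = csa_loss(2.0 * clean, clean)  # same direction, different scale
loss_flipped = csa_loss(-clean, clean)       # opposite direction
```

A loss of this kind is minimised when degradation leaves the direction of \(Z_{sem}\) unchanged, which matches the stability goal stated above.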

Loss & Training¶

NCDM, CSA, and the original retrieval/generation losses are jointly optimized. The two paths produce \(Z_{sem}\) and \(Z_{deg}\), respectively, within a single forward pass, so inference requires no extra pass over the image.
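A joint objective of this kind is commonly written as a weighted sum. The form below is a sketch under that assumption: \(\mathcal{L}_{task}\) stands for the original retrieval or generation loss, and the weights \(\lambda_{1}, \lambda_{2}\) are hypothetical symbols not taken from the paper.

```latex
\mathcal{L}_{total}
  = \mathcal{L}_{task}
  + \lambda_{1}\,\mathcal{L}_{NCDM}
  + \lambda_{2}\,\mathcal{L}_{CSA},
\qquad \lambda_{1}, \lambda_{2} \ge 0
```

Setting \(\lambda_{1} = \lambda_{2} = 0\) recovers ordinary fine-tuning on the task loss alone.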

Key Experimental Results¶

Main Results¶

| Method | Retrieval (Real-Degrade) | Generation (Real-Degrade) | End-to-End (Real-Degrade) |
|---|---|---|---|
| VisRAG baseline | ~70% | ~55% | ~45% |
| VisRAG-FT (full finetune) | ~73% | ~57% | ~48% |
| Two-Stage Restoration | ~72% | ~56% | ~47% |
| RobustVisRAG | ~77% | ~61% | ~57% |
| Gain | +7.35% | +6.35% | +12.40% |

Distortion-VisRAG Dataset¶

- QA Pairs: 367K
- Document Types: 7 domains (papers, charts, tables, slides, handwritten notes, etc.)
- Synthetic Degradations: 12 types (blur, noise, compression, etc.)
- Real Degradations: 5 types (low light, shadow, paper damage, etc.)
- Multiple Severity Levels: ✓
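To make "synthetic degradation at multiple severity levels" concrete, here is a small NumPy sketch of two such corruptions. It is not the dataset's generation code: the function names, the three severity levels, and the noise and kernel parameters are all hypothetical.

```python
import numpy as np

def add_gaussian_noise(img, severity, rng):
    """Additive Gaussian noise; sigma per severity level is hypothetical."""
    sigma = (0.02, 0.05, 0.10)[severity - 1]
    return np.clip(img + rng.normal(scale=sigma, size=img.shape), 0.0, 1.0)

def box_blur(img, severity):
    """Box blur with a k x k kernel; kernel size per level is hypothetical."""
    k = (3, 5, 7)[severity - 1]
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

# Toy grayscale "page" in [0, 1]: white background with dark text-like rows.
page = np.ones((64, 64))
page[::8] = 0.0
page[1::8] = 0.0

rng = np.random.default_rng(0)
noise_mse = [np.mean((add_gaussian_noise(page, s, rng) - page) ** 2) for s in (1, 2, 3)]
blur_mse = [np.mean((box_blur(page, s) - page) ** 2) for s in (1, 2, 3)]
```

On this toy page the pixel-level error grows with the severity level for both corruptions, which is the kind of graded difficulty a multi-severity benchmark is meant to provide.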

Key Findings¶

- RobustVisRAG does not degrade performance on clean data, indicating that causal disentanglement does not impair normal comprehension.
- Perceptual quality improvements from image restoration (Two-Stage Restoration) do not necessarily translate into retrieval or generation gains.
- Full-parameter fine-tuning (VisRAG-FT) improves degradation robustness but causes forgetting of pretrained knowledge and fails to separate semantics from degradation.
- The end-to-end gain (12.40%) substantially exceeds the individual retrieval (7.35%) and generation (6.35%) gains, suggesting that improvements at the two stages compound.

Highlights & Insights¶

Elegant Application of Causal Modeling: An SCM is used to analyze the degradation propagation path in VisRAG, theoretically deriving the necessity of representation disentanglement, which is then realized through a concrete network design (non-causal token + unidirectional attention). The connection between theory and implementation is tight.
Distortion-VisRAG Dataset: The first benchmark specifically designed for VisRAG under degradation conditions, covering both synthetic and real degradations, filling a critical evaluation gap.
Zero Additional Inference Overhead: The causal and non-causal paths are computed within the same forward pass; only \(Z_{sem}\) is used at inference time, making the approach highly practical.

Limitations & Future Work¶

- The degradation modeling in the non-causal path relies on simple contrastive learning, which may be insufficient for complex mixed degradations.
- Training requires degradation–clean paired data (or degradation type labels), which can be costly to obtain in practice.
- Generalization to unseen degradation types remains to be verified.
- The current work addresses document images only; robustness to degradation in natural scene images is not explored.

Comparison with Related Approaches¶

vs. TeCoA / FARE (adversarial robustness): These methods target small perturbations under \(\ell_p\) norm constraints and are not suited for natural degradations (blur, low light, shadow). RobustVisRAG handles a broader range of degradation types by explicitly learning degradation representations.
vs. Image Restoration → VisRAG Pipeline: Restored images exhibit improved perceptual quality but do not guarantee semantic consistency, whereas RobustVisRAG achieves semantic protection directly at the encoder level.
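The gap between \(\ell_p\)-bounded perturbations and natural degradations can be illustrated with a toy measurement. The sketch below blurs a sharp edge and compares the resulting change with an \(\ell_\infty\) budget of \(4/255\); that budget is an illustrative value assumed here, not a setting reported in this summary.

```python
import numpy as np

eps = 4 / 255  # illustrative l_inf budget (assumption of this sketch)

# A sharp vertical edge, as at the border of a printed glyph.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# 3-tap horizontal box blur with edge padding.
padded = np.pad(img, ((0, 0), (1, 1)), mode="edge")
blurred = (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3.0

# Largest per-pixel change introduced by the blur.
linf = np.max(np.abs(blurred - img))
exceeds_budget = linf > eps
ratio = linf / eps
```

Even this mild blur changes the pixels next to the edge by one third of the intensity range, roughly 21 times the assumed budget, so a model trained only against perturbations inside such a ball has no guarantee of handling it.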

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of causal modeling and dual-path encoder design is novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ A complete benchmark with both synthetic and real degradations is constructed.
Writing Quality: ⭐⭐⭐⭐ The causal analysis section is formally rigorous.
Value: ⭐⭐⭐⭐ VisRAG robustness is an important practical problem.