Quantifying and Improving the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data

Conference: ACL2026
arXiv: 2503.05587
Code: https://github.com/maybenotime/RAG-SpuriousFeatures
Area: Information Retrieval / RAG Robustness
Keywords: RAG, Spurious Features, Robustness Evaluation, Perturbation Benchmark, SFT and DPO

TL;DR

This paper proposes the SURE framework to systematically evaluate how sensitive the generation side of RAG (the reader LLM) is to semantically irrelevant spurious features in retrieved documents, such as style, source, logic, format, and metadata. It then substantially improves RALM robustness through SFT/DPO on synthetic data generated by SURE.

Background & Motivation

Background: RAG alleviates LLM hallucinations by retrieving external documents and has become a common paradigm for factual QA and knowledge-intensive applications. Existing robustness research mostly focuses on explicit noise: retrieved documents that contain semantic errors, are irrelevant, contradict one another, or appear in unfavorable positions.

Limitations of Prior Work: Real-world internet retrieval results contain not only semantic noise but also a large number of semantically irrelevant features that influence model behavior: HTML/Markdown/YAML/JSON formats, sentence order, source domains, timestamps, stylistic complexity, and traces of LLM rewriting. Existing benchmarks rarely systematically measure the impact of these "spurious features" in RAG scenarios.

Key Challenge: Given the same golden document, the correct answer remains unchanged even if the format or metadata is altered; however, the RAG reader's output may flip from correct to incorrect. Traditional dataset-level accuracy often only observes aggregate changes and fails to capture flips of individual instances before and after perturbation.

Goal: Establish an automated framework capable of injecting spurious features in bulk while maintaining the causal semantics of documents, providing instance-level robustness metrics, and further generating robust training data.

Key Insight: The authors decompose RAG input into instructions, grounding data, and queries, modifying only the surface attributes of the grounding data that are semantically irrelevant to the answer, then comparing model outputs under original versus perturbed inputs.

Core Idea: Use a "perturb-preserve-evaluate" controlled experimental framework to explicitly isolate spurious features from RAG, quantifying model sensitivity and transforming non-robust samples into training signals.

Method

The complete SURE pipeline includes four parts: a spurious feature taxonomy, perturbation injection, causal feature preservation, and robustness evaluation. The authors then construct SURE_Wiki and SIG_Wiki/SIG_Trivial based on this process, exploring mitigation methods such as scaling, Chain-of-Note, reasoning models, SFT, and DPO.

Overall Architecture

Given a query, the retriever returns documents, and the reader LLM receives a prompt \(P=(I,G,Q)\) to generate an answer. SURE defines a perturbation function \(g(\cdot)\), which transforms grounding data \(G\) into \(g(G)\), creating a counterfactual input \(\hat{P}=(I,g(G),Q)\). If \(G\) and \(g(G)\) support the same answer but the correctness of the model's output changes, the RALM is not robust to that spurious feature.
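The counterfactual construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Prompt` class, the prompt template, and the two perturbation functions (`to_json`, `reverse_sentences`) are hypothetical stand-ins for one Format and one Logic perturbation, and the sentence splitter is deliberately naive.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Prompt:
    instruction: str      # I
    grounding: List[str]  # G: retrieved passages
    query: str            # Q

    def render(self) -> str:
        # Hypothetical template; the real prompt wording is not given here.
        docs = "\n\n".join(self.grounding)
        return f"{self.instruction}\n\nDocuments:\n{docs}\n\nQuestion: {self.query}"


Perturbation = Callable[[str], str]  # g(.) applied to one passage


def to_json(passage: str) -> str:
    """Format perturbation: wrap the unchanged passage text in a JSON object."""
    return json.dumps({"text": passage})


def reverse_sentences(passage: str) -> str:
    """Logic perturbation: reverse sentence order (naive split on '. ')."""
    sentences = [s.strip().rstrip(".") for s in passage.split(". ") if s.strip()]
    return ". ".join(reversed(sentences)) + "."


def counterfactual(prompt: Prompt, g: Perturbation) -> Prompt:
    """Build P_hat = (I, g(G), Q): only the grounding data changes."""
    return Prompt(prompt.instruction, [g(d) for d in prompt.grounding], prompt.query)


p = Prompt(
    "Answer using the documents.",
    ["Paris is the capital of France. It lies on the Seine."],
    "What is the capital of France?",
)
p_hat = counterfactual(p, to_json)
```

Feeding `p.render()` and `p_hat.render()` to the same reader and comparing the correctness of the two outputs gives one instance-level observation for the chosen perturbation.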

Key Designs

  1. Taxonomy of five categories of spurious features:

    • Function: Covers surface attributes common in RAG that should not change the answer.
    • Mechanism: Defines five major categories (Style, Source, Logic, Format, and Metadata) containing 13 types of perturbations. Style includes simple/complex; Source includes LLM-generated/self-generated; Logic includes reverse/random/LLM-reranked; Format includes JSON/HTML/YAML/Markdown; Metadata includes timestamp (pre/post) and datasource (wiki/twitter).
    • Design Motivation: Internet retrieved documents are naturally heterogeneous. RAG systems cannot guarantee uniform document formats, sources, or writing styles upon deployment; thus, these features represent practical risks rather than toy perturbations.
  2. Causal Feature Preservation Mechanism:

    • Function: Ensures perturbations only change spurious features without altering the semantic content the answer depends on.
    • Mechanism: For perturbations produced by an LLM, the prompt explicitly requests semantic equivalence; a bi-directional entailment check then verifies that \(G\) entails \(g(G)\) and vice versa. String matching ensures that the ground-truth answer in golden documents remains present in the perturbed documents, and that noise documents do not accidentally acquire the correct answer.
    • Design Motivation: If a perturbation alters the factual answer, it becomes impossible to determine if the model error is due to spurious features or causal content changes. Bi-directional NLI and answer string checks bring the evaluation closer to a controlled experiment.
  3. Instance-level Robustness Metrics and Training Data Generation:

    • Function: Simultaneously evaluates single-instance flips and supports subsequent training.
    • Mechanism: Correctness is judged for both the original output \(y\) and perturbed output \(\hat{y}\), calculating Win Rate (WR), Lose Rate (LR), and Robustness Rate (RR). For non-robust instances, SURE records the query, correct answer, incorrect answer, original golden passage, and perturbed golden passage for SFT or DPO.
    • Design Motivation: Dataset accuracy may mask the instability of "the same question flipping back and forth," while instance-level pairs are naturally suited for constructing preference training or consistency training samples.
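The string-level part of the preservation check and the instance-level metrics can be sketched as follows. This is a hedged illustration under stated assumptions: WR is taken to be the share of instances that flip from incorrect to correct, LR the share that flip from correct to incorrect, and RR the share whose correctness is unchanged, all as percentages. The function names are hypothetical, matching is a simple case-insensitive substring test, and the bi-directional NLI check is omitted.

```python
from typing import Dict, List, Tuple


def contains_answer(text: str, answers: List[str]) -> bool:
    """Case-insensitive substring match against any gold answer alias."""
    lowered = text.lower()
    return any(a.lower() in lowered for a in answers)


def answer_preserved(original: str, perturbed: str,
                     answers: List[str], is_golden: bool) -> bool:
    """String-matching preservation check (assumed form).

    Golden passages must still contain the answer after perturbation;
    noise passages must not contain it after perturbation.
    """
    if is_golden:
        return contains_answer(original, answers) and contains_answer(perturbed, answers)
    return not contains_answer(perturbed, answers)


def robustness_rates(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """pairs holds (correct_before, correct_after) for each instance."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one instance")
    win = sum(1 for before, after in pairs if not before and after)
    lose = sum(1 for before, after in pairs if before and not after)
    robust = n - win - lose  # correctness unchanged by the perturbation
    return {
        "RR": 100.0 * robust / n,
        "WR": 100.0 * win / n,
        "LR": 100.0 * lose / n,
    }
```

For the four pairs `[(True, True), (True, False), (False, True), (False, False)]`, accuracy is 50% both before and after perturbation, yet `robustness_rates` returns RR 50.0, WR 25.0, and LR 25.0. This is the kind of instability that a dataset-level accuracy comparison would hide.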

Loss & Training

The evaluation phase of SURE involves no training; RR/WR/LR are obtained purely by perturbing inputs and evaluating outputs. For mitigation, the authors use two training strategies. SFT pairs both the original and the perturbed golden passage with the correct answer, training the model to output the correct answer stably. DPO treats the correct answer as "preferred" and the incorrect answer as "rejected," constructing preference samples over both the original and the perturbed passages. Experiments use Llama-3.1-8B-Instruct as the backbone, training for 2 epochs on over 30k samples, and evaluating on SIG_Wiki and cross-domain SIG_Trivial.
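One way to turn a recorded non-robust instance into SFT and DPO records is sketched below. The record layout is an assumption: the `prompt`/`completion` and `prompt`/`chosen`/`rejected` field names follow a convention common in fine-tuning libraries rather than the authors' released format, and `build_prompt` uses a hypothetical template.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NonRobustInstance:
    query: str
    correct_answer: str
    incorrect_answer: str
    original_passage: str
    perturbed_passage: str


def build_prompt(passage: str, query: str) -> str:
    # Hypothetical template, for illustration only.
    return f"Document:\n{passage}\n\nQuestion: {query}\nAnswer:"


def to_sft(inst: NonRobustInstance) -> List[Dict[str, str]]:
    """Two SFT records: original and perturbed passage, same correct answer."""
    return [
        {"prompt": build_prompt(p, inst.query), "completion": inst.correct_answer}
        for p in (inst.original_passage, inst.perturbed_passage)
    ]


def to_dpo(inst: NonRobustInstance) -> List[Dict[str, str]]:
    """Two preference records: correct answer chosen, incorrect answer rejected."""
    return [
        {
            "prompt": build_prompt(p, inst.query),
            "chosen": inst.correct_answer,
            "rejected": inst.incorrect_answer,
        }
        for p in (inst.original_passage, inst.perturbed_passage)
    ]


example = NonRobustInstance(
    query="What is the capital of France?",
    correct_answer="Paris",
    incorrect_answer="Lyon",
    original_passage="Paris is the capital of France.",
    perturbed_passage='{"text": "Paris is the capital of France."}',
)
sft_records = to_sft(example)
dpo_records = to_dpo(example)
```

Both variants expose the model to the same fact under two surface forms; they differ only in whether the incorrect answer is used as an explicit negative.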

Key Experimental Results

Main Results

| Data / Model | Style RR | Source RR | Logic RR | Format RR | Meta RR | Description |
|---|---|---|---|---|---|---|
| SIG_Trivial, Mistral-7B | 88.0 | 94.0 | 94.5 | 94.0 | 99.0 | Bing + TriviaQA, string eval |
| SIG_Trivial, Mistral-7B (Judge) | 90.5 | 91.5 | 92.0 | 93.8 | 96.0 | LLM-as-Judge results are close |
| SIG_Trivial, Llama-3.1-8B | 87.5 | 93.5 | 93.0 | 90.8 | 97.0 | Open-source reader |
| SIG_Trivial, Llama-3.1-8B (Judge) | 85.0 | 92.0 | 91.0 | 90.8 | 93.3 | Validates the string metric |

Ablation Study

| Method | Style | Source | Logic | Format | Meta | Dataset |
|---|---|---|---|---|---|---|
| Llama3.1-8B | 10.0 | 15.5 | 20.0 | 24.0 | 94.0 | SIG_Wiki |
| + SFT | 96.5 | 94.5 | 99.0 | 99.5 | 99.7 | SIG_Wiki |
| + DPO | 96.5 | 96.0 | 96.0 | 98.0 | 98.0 | SIG_Wiki |
| Llama3.1-8B | 87.5 | 93.5 | 93.0 | 90.8 | 97.0 | SIG_Trivial |
| + SFT | 88.5 | 91.5 | 95.0 | 96.3 | 99.0 | SIG_Trivial |
| + DPO | 94.5 | 94.5 | 97.3 | 95.8 | 98.0 | SIG_Trivial |

Key Findings

  • On SURE_Wiki, the impact of different perturbation categories varies significantly; RR within the same category is similar, but WR/LR can differ significantly, indicating some spurious features occasionally "correct" the model.
  • For Mistral-7B-Instruct, HTML in format perturbations reaches a Lose Rate of 9.30 on Known-Golden samples, higher than JSON/YAML/Markdown, suggesting structural formats significantly impact the reader.
  • Six SOTA models each show their own sensitivity points on SIG_Wiki; even GPT-4o reaches only approximately 89% RR on datasource (twitter).
  • Chain-of-Note and DeepSeek-R1 do not reliably solve the problem: DeepSeek-V3's Style RR is 96.5, whereas DeepSeek-R1 drops to 84.5, suggesting stronger reasoning does not equate to higher stability against spurious features.
  • Attention analysis shows that Win/Lose samples with changed outputs exhibit larger changes in attention on the answer span compared to Robust samples; the \(\Delta A\) for Robust is \(6.52 \times 10^{-5}\) while Lose is \(1.15 \times 10^{-4}\) (Welch t-test \(p=0.046\)).

Highlights & Insights

  • The paper systematically introduces "semantically invariant but surface-variant" features into RAG evaluation, which is closer to real search environments than simply adding irrelevant documents.
  • The RR/WR/LR instance-level paired metrics are highly practical: they do not just indicate how much accuracy changed, but also distinguish whether perturbations made the answer better or worse.
  • The training data reuse design is natural. SURE is not just a benchmark; it transforms non-robust samples found during evaluation into SFT/DPO data, forming a closed loop.
  • The results serve as a reminder that document cleaning, format preservation, and metadata processing in RAG pipelines are not neutral preprocessing steps; they can directly alter model outputs.

Limitations & Future Work

  • The authors acknowledge that the taxonomy cannot exhaust all spurious features; real web pages may contain more complex factors like ads, templates, navigation bars, table layouts, and footnotes.
  • Current evaluations primarily focus on QA tasks and English Wikipedia / open web; long documents, multi-hop reasoning, cross-lingual RAG, and private corporate documents still require verification.
  • While efficient, string matching might lack flexibility for aliases, paraphrased answers, and numerical formats; LLM-as-Judge was used only for supplementary validation.
  • SFT is extremely strong on in-domain SIG_Wiki, but some metrics are inferior to DPO on SIG_Trivial, suggesting training-based mitigations still face domain generalization issues.
  • Training-based mitigation requires full-parameter fine-tuning and A100-level resources; lightweight adapters or inference-time normalization strategies are worth further research.

Comparison with Related Work

  • vs Explicit Noise RAG benchmarks: Past studies often examined irrelevant documents, contradictions, and document positioning; this work focuses on semantically invariant spurious features, resembling a RAG-oriented extension of prompt sensitivity.
  • vs Prompt format sensitivity: Work by Sclar et al. and He et al. showed that LLMs are sensitive to prompt formats; this paper moves that sensitivity to the grounding-data level, finding that the format of retrieved documents is equally critical.
  • vs Chain-of-Note: CoN is designed for explicit noise and requires the model to write a rationale first; this paper's experiments show that CoT-style methods are not necessarily effective against spurious features.
  • vs DPO/SFT Robust Training: This work does not perform generic preference optimization but uses paired original/perturbed passages to construct data, aiming more specifically to align consistency across "different surfaces of the same fact."

Rating

  • Novelty: ⭐⭐⭐⭐ Systematizing spurious features into RAG grounding data provides significant value in problem definition.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ The taxonomy, two benchmarks, multiple models, prompting, scaling, training mitigation, and attention analysis are comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ The framework is clear; tables are dense but information-rich.
  • Value: ⭐⭐⭐⭐⭐ Highly relevant for real-world RAG deployment, document preprocessing, and robustness training.