Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement

Conference: ICLR2026
arXiv: 2506.05154
Code: lcy80366872/knowledgeable-R1
Area: Causal Reasoning
Keywords: RAG, Parametric Knowledge, Reinforcement Learning, Knowledge Conflict, GRPO

TL;DR

This paper proposes Knowledgeable-R1, a reinforcement learning framework that jointly samples trajectories from parametric knowledge (PK) and contextual knowledge (CK), combining local/global advantage estimation with an adaptive asymmetric advantage transformation. This enables LLMs to resist misleading retrieved context in RAG scenarios while preserving their ability to leverage reliable context.

Background & Motivation

RAG reduces hallucination and factual errors in LLMs by incorporating externally retrieved content. However, when retrieved context contains noisy, counterfactual, or internally contradictory information, LLMs tend to over-rely on such external information and suppress their own parametric knowledge — a phenomenon known as context dominance. Existing approaches exhibit notable shortcomings:

  • Prompting methods (e.g., Astute-RAG): guide the model to verify/filter context, but increase computational complexity and lack generalizable decision rules
  • Decoding methods (e.g., CK-PLUG): adjust token distributions to mitigate conflicts, but also lack generalization ability
  • Fine-tuning methods (e.g., Self-RAG, InFO-RAG): require complex data annotation pipelines, limiting flexibility and scalability
  • Standard GRPO: sampling space is confined to query+context inputs, making it difficult for the model to explore the critical yet rare decision of "ignoring context and falling back to parametric knowledge"

Core Problem

How can LLMs dynamically decide within RAG systems: when to trust retrieved contextual knowledge (CK), and when to fall back to their own parametric knowledge (PK) — significantly improving robustness against misleading context without degrading normal RAG performance?

Method

1. Three-Strategy Joint Sampling

For each query \(q\), three decoding strategies are defined:

| Strategy | Input | Output | Behavior |
| --- | --- | --- | --- |
| PK (Parametric Knowledge) | \(p\) = query | \(o\): answer from parametric knowledge | Pure parametric-knowledge response |
| CK (Context-Aware) | \(p'\) = query + context | \(o'\): answer leveraging context | Utilizes reliable context |
| RPK (Robust Parametric Knowledge) | \(p'\) = query + context | \(o\): answer consistent with PK | Falls back to PK under misleading context |

Key design: RPK does not independently generate an answer; instead, it reuses the PK trajectory \(o^{pk}\) as the target and re-evaluates its log-probability under the query+context input \(p'\), encouraging the model to maintain parametric knowledge tokens even in the presence of misleading context.
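To make this concrete, here is a minimal sketch of the re-evaluation step (assuming a Hugging Face causal LM; function and variable names are illustrative, not the authors' released code): the PK trajectory's tokens are scored, without regeneration, under the context-augmented prompt \(p'\).

```python
import torch

def rpk_token_logprobs(model, tokenizer, prompt_ctx: str, pk_answer: str) -> torch.Tensor:
    """Score the PK answer tokens under the context-augmented prompt p'.

    `model` is assumed to be a Hugging Face causal LM and `tokenizer` its
    tokenizer; during training the same computation runs with gradients on.
    """
    prompt_ids = tokenizer(prompt_ctx, return_tensors="pt").input_ids
    answer_ids = tokenizer(pk_answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

    logits = model(input_ids).logits            # (1, T, vocab)

    # Logits at position t-1 predict token t: slice out the answer span.
    start = prompt_ids.shape[1]
    preds = logits[:, start - 1 : -1, :]        # one prediction per answer token
    logprobs = torch.log_softmax(preds, dim=-1)
    return logprobs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)  # (1, |o_pk|)
```

Because the RPK branch only re-scores an existing trajectory, it adds a forward pass rather than a full rollout, which is what keeps the exploration of "context present but ignored" cheap.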

2. Local-Global Advantage Estimation

  • PK advantage: uses only local advantage \(A_i^{pk\text{-}local}\) (within-group Z-score normalization), ensuring query-only responses are as accurate as possible
  • CK advantage: sum of local and global advantages \(A_j' = A_j^{ck\text{-}local} + A_j^{ck\text{-}global}\), where the global term is normalized over the joint pool of CK and RPK trajectories under \(p'\), giving CK priority when both knowledge sources are correct (as context is more up-to-date)
  • RPK advantage: global advantage only \(\hat{A}_i^{global}\), competing with CK trajectories under the same input \(p'\); RPK receives positive advantage when context is misleading

The global advantage mechanism resolves the issue of distinguishing CK vs. RPK preferences when within-group trajectories receive uniform rewards.
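A small NumPy sketch of one plausible reading of this advantage split; the exact grouping and normalization details are assumptions based on the description above, not the official implementation.

```python
import numpy as np

def local_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Within-group Z-score normalization (GRPO-style local advantage)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def advantages(pk_r: np.ndarray, ck_r: np.ndarray, rpk_r: np.ndarray, eps: float = 1e-8):
    """Local/global advantage split as we read it from the paper summary.

    - PK: local only (normalized within the PK group).
    - CK: local + global, the global term normalized over the joint
      CK + RPK pool generated under the same input p'.
    - RPK: global only, competing with CK trajectories under p'.
    """
    a_pk = local_advantage(pk_r, eps)

    joint = np.concatenate([ck_r, rpk_r])
    g = (joint - joint.mean()) / (joint.std() + eps)
    g_ck, g_rpk = g[: len(ck_r)], g[len(ck_r):]

    a_ck = local_advantage(ck_r, eps) + g_ck
    a_rpk = g_rpk
    return a_pk, a_ck, a_rpk
```

Note how the joint pool makes the mechanism work: even when every CK trajectory earns the same reward (zero local advantage), the global normalization still ranks CK against RPK, so the policy receives a signal about which knowledge source to prefer.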

3. Knowledge Balance Modulation

An asymmetric advantage transformation \(T(\hat{A}_i; \beta)\) is introduced: positive advantages remain unchanged, while negative advantages are scaled by \(\beta \in [0.01, 1]\). \(\beta\) is dynamically adjusted based on the cumulative advantages of CK and RPK within each mini-batch:

\[\beta \leftarrow \text{clip}\left(\frac{S_{ck} - S_{rpk+}}{S_{rpk-}},\, 0.01,\, 1\right)\]

where \(S_{ck}\) is the cumulative CK advantage and \(S_{rpk+}\), \(S_{rpk-}\) are the sums of the positive and negative RPK advantages within the mini-batch.

When CK substantially outperforms RPK, \(\beta\) decreases to reduce the penalty on RPK negative advantages, encouraging more parametric knowledge exploration; as the gap narrows, \(\beta\) increases for more conservative training. \(\beta\) converges to a stable value within approximately 8 steps.
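A hedged sketch of the asymmetric transformation and the \(\beta\) update; the sign convention for \(S_{rpk-}\) (kept as a signed, non-positive sum) and the division guard are our assumptions, not taken from the released code.

```python
import numpy as np

def asymmetric_transform(adv: np.ndarray, beta: float) -> np.ndarray:
    # T(A; beta): positive advantages pass through, negatives scaled by beta.
    return np.where(adv >= 0.0, adv, beta * adv)

def update_beta(a_ck: np.ndarray, a_rpk: np.ndarray) -> float:
    # Cumulative mini-batch quantities (our reading of the summary):
    # S_ck: summed CK advantages; S_rpk+ / S_rpk-: sums of the positive /
    # negative RPK advantages. S_rpk- is kept signed (<= 0) -- assumption.
    s_ck = float(a_ck.sum())
    s_rpk_pos = float(a_rpk[a_rpk > 0.0].sum())
    s_rpk_neg = float(a_rpk[a_rpk < 0.0].sum())
    denom = s_rpk_neg if s_rpk_neg != 0.0 else -1e-8  # guard: no negatives
    return float(np.clip((s_ck - s_rpk_pos) / denom, 0.01, 1.0))
```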

4. Policy Optimization

PPO-style clipped updates are adopted, with the total objective as a weighted sum of three components:

\[\mathcal{J}(\theta) = \lambda_{pk} J_{PK} + \lambda_{ck} J_{CK} + \lambda_{rpk} J_{RPK}\]

In experiments, \(\lambda_{pk} = \lambda_{ck} = \lambda_{rpk} = 1.0\) with clipping parameter \(\epsilon = 0.2\).
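For concreteness, a PPO-style sketch of the combined objective under these settings; the per-term clipped surrogate is standard, while the tuple packaging and the assumption that CK/RPK advantages arrive already \(\beta\)-transformed are illustrative choices.

```python
import torch

def clipped_objective(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO-style clipped surrogate (to be maximized) for one group."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return torch.min(unclipped, clipped).mean()

def total_objective(pk, ck, rpk, lam=(1.0, 1.0, 1.0), eps=0.2):
    """J(theta) = lam_pk * J_PK + lam_ck * J_CK + lam_rpk * J_RPK.

    Each of pk / ck / rpk is a (logp_new, logp_old, advantage) tuple;
    CK and RPK advantages are assumed to be beta-transformed already.
    """
    terms = [clipped_objective(*t, eps=eps) for t in (pk, ck, rpk)]
    return sum(l * j for l, j in zip(lam, terms))
```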

Key Experimental Results

Evaluated across 5 contextual scenarios (correct / adversarial / self-conflicting / irrelevant / partially relevant), with Qwen2.5-7B-Instruct as the base model:

| Scenario | RAG Prompting | GRPO w/ RAG | Knowledgeable-R1 | Gain |
| --- | --- | --- | --- | --- |
| S1 Correct Context (PC-QA) | 74.35% | 80.03% | 80.90% | +6.54% |
| S2 Adversarial Context (NC-MR) | 13.47% | 26.94% | 43.94% | +30.47% |
| S2 Adversarial Context (NC-MC) | 8.06% | 19.74% | 37.34% | +29.28% |
| S3 Self-Conflicting Context (SC) | 59.50% | 75.33% | 76.33% | +15.92% |
| S4 Irrelevant Context (ExplainPE) | 62.21% | 66.50% | 67.57% | +5.36% |
| S5 Partially Relevant (HotpotQA) | 20.36% | 27.93% | 31.45% | +11.09% |

On the subset answerable by parametric knowledge, Knowledgeable-R1 averages +22.89% over GRPO w/ RAG across NC-MR/MC/QA. Consistent gains are also observed on Llama3.1-8B-Instruct.

Key Ablation Findings:

  • Removing \(J_{RPK}\) causes the largest performance drop in TIFE (parametrically correct, contextually incorrect) scenarios: MC falls by 33.12%
  • Removing the adaptive \(\beta\) drops TIFE performance (MC) by 27.39%
  • Removing the global advantage \(A^{ck\text{-}global}\) also leads to significant TIFE degradation

Highlights & Insights

  • Precise problem formulation: knowledge conflict in RAG is explicitly decomposed into three sub-objectives (parametric correctness, context utilization, robust fallback), with targeted joint sampling strategies designed accordingly
  • Elegant RPK design: rather than generating new trajectories, RPK reuses PK trajectories and re-evaluates them under query+context inputs, enabling low-cost exploration of "context present but ignored" behavior
  • Adaptive \(\beta\) requires no manual hyperparameter tuning, remains robust across datasets, and converges rapidly
  • Strong generalization: achieves significant gains on 2WikiMultiHopQA and MuSiQue without task-specific fine-tuning
  • Training with only 1% erroneous context in the data still outperforms GRPO, indicating the model learns genuine decision boundaries rather than dataset statistics

Limitations & Future Work

  • Performance gains in S3 (self-conflicting) and S5 (partially relevant) scenarios are relatively modest; handling intra-context contradictions remains an open challenge
  • Sensitivity to varying conflict ratios (e.g., 1 incorrect vs. 4 incorrect out of 5 retrieved passages) is not analyzed in depth
  • Joint sampling allocates approximately half the rollout budget to query-only PK trajectories; S1 (correct context) performance is slightly below GRPO w/ RAG (mitigable by adjusting \(\lambda_{ck}\))
  • Validation is limited to knowledge-intensive QA tasks; more complex multi-source retrieval environments remain unexplored

Comparison with Baselines

  • vs. GRPO w/ RAG: standard GRPO samples only under query+context inputs and lacks parametric-knowledge exploration; Knowledgeable-R1 explicitly encourages parametric fallback via its PK/RPK branches, achieving an average gain of +22.89% in S2 scenarios
  • vs. Self-RAG / InFO-RAG: these SFT methods rely on complex annotation pipelines; Knowledgeable-R1 learns decision rules automatically via RL, with no explicit labeling of "when to trust context"
  • vs. CK-PLUG: CK-PLUG adjusts token probabilities at decoding time with limited effect (even degrading in S2); Knowledgeable-R1 directly optimizes the knowledge-utilization strategy during training
  • vs. Astute-RAG: Astute-RAG uses prompting to guide context filtering but underperforms in retrieval-irrelevant scenarios; Knowledgeable-R1 outperforms it across the board

The approach generalizes to any "multi-source information fusion" scenario, such as knowledge selection under vision-text conflicts in multimodal settings. The RPK paradigm of "shared trajectory evaluated under different conditions" can be adopted in other RL training frameworks to reduce additional sampling overhead. The adaptive \(\beta\) reward shaping strategy offers a general solution to insufficient exploration in RL training.

Rating

  • Novelty: 8/10 — Three-strategy joint sampling and RPK design are genuine contributions, though PPO-style optimization itself is not novel
  • Experimental Thoroughness: 8/10 — 5 scenarios, 4 base models, detailed ablations, but lacks conflict-ratio sensitivity analysis
  • Writing Quality: 7/10 — Method description is clear, but notation is dense and could be simplified
  • Value: 8/10 — Addresses a critical knowledge conflict problem in RAG with a concise and practical approach