
Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

Conference: AAAI 2026
arXiv: 2504.08202
Code: GitHub
Area: LLM Efficiency / Long-Context Evaluation
Keywords: Long-context language models, parametric knowledge, extrinsic retrieval ability, Needle-in-a-Haystack, knowledge conflict

TL;DR

This paper presents the first systematic study of how parametric knowledge influences generation in long-context language models (LCLMs), finding that such influence grows with context length and that methods designed to improve extrinsic retrieval suppress parametric recall. Based on these findings, the paper proposes the Hybrid Needle-in-a-Haystack (Hybrid NIAH) benchmark to jointly evaluate both capabilities.

Background & Motivation

Background: LCLMs can now process contexts ranging from 128K to 1M tokens. Existing benchmarks (NIAH, RULER, InfiniteBench, HELMET, etc.) primarily assess the model's ability to retrieve and utilize information from external contexts (extrinsic retrieval ability), while largely neglecting the role of knowledge encoded in model parameters (parametric knowledge) during generation.

Limitations of Prior Work:

  • Underestimation of parametric knowledge: Existing NIAH tests deliberately use fictitious information to avoid interference from parametric knowledge, yet in real-world applications the interaction between parametric knowledge and external context is unavoidable.
  • Side effects of improving extrinsic retrieval: Position-encoding improvements such as STRING enhance context retrieval in long-context scenarios, but their impact on parametric knowledge utilization remains unknown.
  • Blind spots in evaluation: All current long-context benchmarks assess only one capability, failing to reflect a model's ability to synthesize information from both knowledge sources.

Key Challenge: A potential trade-off exists between extrinsic retrieval ability and parametric recall ability—strengthening the former may suppress the latter—yet existing evaluations entirely overlook this tension.

Goal:

  • Verify the significance of parametric knowledge in long-context generation and how its influence varies with context length.
  • Reveal the trade-off between extrinsic retrieval ability and parametric recall ability.
  • Design a new benchmark capable of jointly evaluating both capabilities.

Key Insight: The paper approaches the problem from the perspective of knowledge conflict, constructing datasets where parametric knowledge either aligns with or conflicts with the external context, and employing controlled experiments to expose the interaction between the two knowledge sources.

Core Idea: Parametric knowledge influence in long-context models grows as context length increases; improving extrinsic retrieval suppresses parametric recall; and Hybrid NIAH is needed to jointly evaluate both capabilities.

Method

Overall Architecture

The study is structured in three parts: (1) validating the role of parametric knowledge in long-context generation via the I-WhoQA dataset; (2) revealing the trade-off between extrinsic retrieval and parametric recall by comparing RoPE and STRING; and (3) proposing the Hybrid NIAH evaluation framework.

Key Designs

  1. I-WhoQA Dataset Construction:

    • Function: Constructs a long-context QA dataset for probing the influence of parametric knowledge.
    • Mechanism: Based on the WhoQA dataset, answers are generated for each entity, retaining only those for which the model consistently produces a single definitive answer (ensuring unambiguous parametric knowledge). Two subsets are then constructed for each entity: I-WhoQA-Parametric (external context consistent with parametric knowledge) and I-WhoQA-Conflict (external context contradicts parametric knowledge).
    • Design Motivation: Comparing performance across the two subsets at varying context lengths directly quantifies the degree of parametric knowledge influence. A dataset of 300 samples is constructed separately for each model, as parametric knowledge differs across models (a construction sketch follows this list).
  2. Trade-off Experiment Design:

    • Function: Validates whether improving extrinsic retrieval ability degrades parametric recall ability.
    • Mechanism: RoPE (baseline) and STRING (improved RoPE) are compared across three datasets:
      • I-WhoQA-Irrelevant: External context is entirely unrelated; answers require parametric knowledge.
      • HotpotQA-Context: Answers are located in the external context, requiring extrinsic retrieval.
      • HotpotQA-Parametric: External context is related but unhelpful; answers rely on parametric knowledge.
    • Key Finding: STRING shows clear gains on HotpotQA-Context (enhanced extrinsic retrieval) but consistently declines on I-WhoQA-Irrelevant and HotpotQA-Parametric (suppressed parametric recall), confirming the existence of a trade-off.
  3. Hybrid Needle-in-a-Haystack Test:

    • Function: Designs a test that jointly evaluates parametric recall and extrinsic retrieval capabilities.
    • Mechanism: Traditional NIAH inserts an answer directly into the haystack for the model to retrieve. Hybrid NIAH instead formulates two-step questions—e.g., "What's the favorite thing of the person who wrote {Book_Name}?"—requiring the model to first recall the book's author from parametric knowledge (parametric recall) and then retrieve the inserted needle about that author from the context (extrinsic retrieval).
    • Distractor Insertion: To prevent models from directly matching the needle via syntactic patterns without genuinely leveraging parametric knowledge, 0–3 random facts are inserted as distractors. Results show that distractors significantly reduce Hybrid NIAH scores (by up to 25%), while standard NIAH scores are unaffected.
    • Design Motivation: By modifying only the question formulation (adding one step of parametric recall), the test is upgraded from single-capability to dual-capability evaluation—an elegant and nearly zero-cost design choice (a sample-construction sketch also follows below).
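
To make the I-WhoQA construction concrete, below is a minimal Python sketch of the consistency filter and the Parametric/Conflict split. The `ask()` helper and the entity field names (`supporting_passage`, `alternative_answers`, `passage_template`) are hypothetical stand-ins for illustration, not the authors' released pipeline.

```python
import random

def ask(model, question: str, n_samples: int = 5) -> list[str]:
    """Hypothetical helper: sample n_samples answers from the target model."""
    raise NotImplementedError

def build_iwhoqa(model, whoqa_entities: list[dict], n_samples: int = 5):
    """Filter WhoQA entities and build the Parametric / Conflict subsets."""
    parametric, conflict = [], []
    for entity in whoqa_entities:
        question = entity["question"]  # e.g., "Who is the author of <book>?"
        answers = ask(model, question, n_samples)
        # Consistency filter: keep only entities for which the model always
        # produces one definitive answer (unambiguous parametric knowledge).
        if len(set(answers)) != 1:
            continue
        param_answer = answers[0]
        # I-WhoQA-Parametric: the external passage agrees with the model.
        parametric.append({
            "question": question,
            "context": entity["supporting_passage"],
            "answer": param_answer,
        })
        # I-WhoQA-Conflict: the external passage names a different answer.
        alternative = random.choice(
            [a for a in entity["alternative_answers"] if a != param_answer]
        )
        conflict.append({
            "question": question,
            "context": entity["passage_template"].format(answer=alternative),
            "answer": alternative,  # the context-supported answer
        })
    return parametric, conflict
```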
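
Likewise, here is a minimal sketch of Hybrid NIAH sample construction with distractor insertion. The templates, the `fact_pool` of needle-like facts, and the sentence-boundary insertion strategy are assumptions for illustration; the paper's exact phrasing and placement scheme may differ.

```python
import random

# Illustrative templates; the paper uses two-step questions of this form.
NEEDLE = "{author}'s favorite thing is {thing}"
QUESTION = "What's the favorite thing of the person who wrote {book}?"

def build_hybrid_niah(book: str, author: str, thing: str, haystack: str,
                      fact_pool: list[str], n_distractors: int = 3) -> dict:
    """Build one Hybrid NIAH sample with 0-3 needle-like distractor facts."""
    needle = NEEDLE.format(author=author, thing=thing)
    # Distractors share the needle's surface pattern ("X's favorite thing
    # is Y"), so matching the pattern alone cannot identify the correct
    # needle; the model must first recall the author parametrically.
    distractors = random.sample(fact_pool, n_distractors)
    sentences = haystack.split(". ")  # insert at sentence boundaries
    for fact in distractors + [needle]:
        sentences.insert(random.randrange(len(sentences) + 1), fact)
    return {
        "context": ". ".join(sentences),
        "question": QUESTION.format(book=book),
        "answer": thing,  # step 1: recall author; step 2: retrieve needle
    }
```

With `n_distractors=0` this reduces to the question-rephrased test; with 3 distractors it reproduces the anti-shortcut setting in which Hybrid NIAH scores drop by up to 25% while standard NIAH is unaffected.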

Key Experimental Results

Main Results

Validation of parametric knowledge influence growth with context length (I-WhoQA):

| Observation | Phenomenon | Quantification |
| --- | --- | --- |
| Parametric knowledge aids generation | I-WhoQA-Parametric consistently outperforms I-WhoQA-Conflict | Holds across all context lengths |
| Influence grows with length | Reliance on parametric knowledge in the Conflict subset increases with context length | Up to 10 percentage points on Llama-3.1-8B |
| STRING trade-off | STRING improves the Context subset but degrades the Parametric subset | Clear gains on HotpotQA-Context; consistent drops on Parametric/Irrelevant |

Hybrid NIAH core results:

| Model | NIAH (0 facts) | Hybrid (0 facts) | Hybrid (3 facts) | Drop |
| --- | --- | --- | --- | --- |
| Llama-3.1-8B | 100.0 | 92.97 | 67.47 | −25.5% |
| Llama-3.1-70B | – | 95.73 | 71.81 | −23.9% |
| Qwen2.5-7B | 99.78 | 86.38 | 73.14 | −13.2% |
| Qwen2.5-72B | – | 64.01→94.16* | 97.77→98.86* | recovers with generation length |

*Qwen2.5-72B scores are deflated at gen_length=32 because the model generates a refusal before answering; scores recover at gen_length=64.

Ablation Study

| Random Facts Count | Llama-3.1-70B (gen=32) | Qwen2.5-72B (gen=64) |
| --- | --- | --- |
| 0 | 95.73 | 94.16 |
| 1 | 86.39 | 99.53 |
| 2 | 74.48 | 99.84 |
| 3 | 71.81 | 98.86 |

Key Findings

  • Qwen2.5 substantially outperforms Llama-3.1: Qwen2.5 exhibits near-linear improvement in parametric knowledge utilization as model scale increases, whereas Llama-3.1 shows no meaningful improvement even when scaling from 8B to 70B.
  • Larger is not always better (Llama): Llama-3.1-70B does not meaningfully outperform the 8B version on Hybrid NIAH, indicating that simply scaling parameters does not resolve the parametric knowledge utilization problem.
  • "Refuse-then-answer" behavior in larger Qwen2.5 models: The 14B and 72B variants first generate refusal text before retrieving the answer, causing deflated scores at short generation lengths; this behavior disappears in the multi-needle setting.
  • Random facts serve as an effective anti-shortcut mechanism: Standard NIAH is barely affected (98–100%), while Hybrid NIAH scores drop substantially, confirming that models exploit syntactic patterns rather than genuinely retrieving information in standard NIAH.

Highlights & Insights

  • Discovery of an overlooked phenomenon: Parametric knowledge influence increases—rather than decreases—in long-context settings, contrary to the intuition that more external context reduces reliance on parametric knowledge. This finding has important implications for RAG and long-document processing in practice.
  • Elegant design of Hybrid NIAH: By merely rephrasing the question to require one additional parametric recall step, the test pushes scores down from near 100% to 67–95% and effectively discriminates between model families. The cost is negligible while the informational yield is substantial.
  • Random facts as an anti-shortcut mechanism: This reveals that models may engage in pattern-matching rather than genuine retrieval in standard NIAH. This technique should become standard practice in all NIAH variants.

Limitations & Future Work

  • Only open-source models are studied: Closed-source models such as GPT-4o and Claude are not evaluated, and they may exhibit different behavior with respect to parametric knowledge utilization.
  • I-WhoQA construction relies on consistency filtering: Retaining only entities for which models "always produce a single answer" may discard many interesting and informative cases.
  • Hybrid NIAH uses only the [author] entity type: The knowledge type is narrow (book→author); different categories of parametric knowledge (e.g., scientific facts, temporal information, geography) may yield different results.
  • No method is proposed to resolve the trade-off: The paper remains at the diagnostic level without offering a solution for jointly improving both capabilities, which constitutes an important open question.
  • The "refuse-then-answer" behavior in Qwen2.5 is not deeply analyzed: The authors offer only a tentative hypothesis (possibly related to chunked attention), leaving this phenomenon to future investigation.

Comparison with Related Work

  • vs. NIAH/RULER/HELMET: These benchmarks evaluate only extrinsic retrieval ability; Hybrid NIAH is a natural extension that adds the parametric knowledge dimension.
  • vs. Knowledge conflict studies (Xie et al. 2023, etc.): Prior knowledge conflict research is limited to short-context settings; this paper is the first to extend it to long-context scenarios and uncovers a new trend—influence grows with context length.
  • vs. STRING (An et al. 2024b): STRING focuses on improving long-context extrinsic retrieval; this paper exposes its side effect of suppressing parametric recall.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic study of the role of parametric knowledge in long-context settings; Hybrid NIAH is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across multiple datasets, models, and settings, though closed-source models are absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ Three-stage progression (phenomenon → trade-off → solution) is logically coherent and clearly presented.
  • Value: ⭐⭐⭐⭐ Hybrid NIAH should become a standard component of long-context evaluation; the trade-off finding carries important implications for model design.