Complex Reasoning with Natural Language Contexts and Background Knowledge

Conference: ACL 2025
Code: None
Area: LLM Reasoning
Keywords: Complex Reasoning, Natural Language Context, Background Knowledge, Knowledge-Enhanced Reasoning, Commonsense Reasoning

TL;DR

This paper proposes a complex reasoning framework that integrates natural language contexts with structured background knowledge. Using knowledge graph retrieval augmentation and context-aware reasoning chain generation, it improves accuracy by 1.8 to 3.5 points over the strongest baseline on five benchmarks whose multi-step reasoning requires external knowledge support.

Background & Motivation

Background: LLMs have demonstrated excellent performance on simple reasoning tasks, but they still struggle in complex scenarios that require multi-step reasoning integrated with external background knowledge. Existing approaches mainly include: (1) direct CoT prompting to stimulate internal model knowledge; (2) RAG retrieval of external documents to provide auxiliary information; (3) knowledge graph-assisted reasoning to provide structured knowledge. However, each of these methods has its own limitations.

Limitations of Prior Work: CoT reasoning relies solely on the internal knowledge of the model, which is prone to hallucinations when dealing with niche domains or up-to-date information. Documents retrieved by RAG are unstructured, making it difficult to support complex problems requiring multi-hop reasoning. Knowledge graphs are structured but have limited coverage and a semantic gap with natural language queries. More importantly, existing methods ignore the importance of context—the same question under different contexts requires different background knowledge and reasoning paths.

Key Challenge: Complex reasoning simultaneously requires understanding the natural language context, retrieving relevant background knowledge, and performing multi-step logical deduction, whereas existing methods can only handle one or two of these stages.

Goal: To design an end-to-end reasoning framework capable of (1) understanding the contextual environment of the question; (2) dynamically retrieving required background knowledge; and (3) generating multi-step reasoning chains guided by that knowledge.

Key Insight: The authors observe that when humans perform complex reasoning, they first understand the context of the problem, then activate relevant background knowledge, and finally derive the conclusion step-by-step. This paper simulates this cognitive process by designing a three-stage "comprehend-retrieve-reason" framework.

Core Idea: Dynamically select the background knowledge most relevant to the current reasoning step through Context-Aware Knowledge Retrieval, and ensure each reasoning step is backed by knowledge facts using Knowledge-Conditioned Chain-of-Thought generation.

Method

Overall Architecture

Given a question \(Q\) and a natural language context \(C\), the system first parses the context to extract key entities and relations, then retrieves relevant background knowledge snippets \(\{k_1, k_2, \ldots, k_n\}\) from the knowledge graph and text knowledge base, and finally feeds both the context and the retrieved knowledge into the LLM, guiding it to generate a grounded reasoning chain and output the final answer.

Key Designs

Context-Aware Knowledge Retrieval:
- Function: Dynamically retrieve the most relevant background knowledge based on the contextual situation of the question.
- Mechanism: Retrieval proceeds in two stages followed by a ranking step. The first stage is entity linking: key entities are identified in the context and question and linked to knowledge graph nodes. The second stage is subgraph expansion: the \(k\)-hop neighbors of the linked nodes are collected to form a candidate knowledge subgraph. A context encoder then scores the relevance of each candidate, and the top-\(m\) knowledge triples are kept. The context encoder is trained with contrastive learning, where positive samples are knowledge actually used during the reasoning process and negative samples are random knowledge from the graph.
- Design Motivation: Static retrieval cannot capture the impact of context on knowledge requirements—the same entity requires completely different background knowledge under different contexts. For example, for "Li Bai", a literary context requires poetry knowledge, while a historical context requires biographical knowledge.
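
The two retrieval stages and the ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy triples, the function names (`expand_subgraph`, `score`, `retrieve`), and the word-overlap scorer are all assumptions, and the overlap scorer merely stands in for the trained context encoder. Entity linking is assumed to have already produced the seed entity.

```python
import re
from collections import deque

# Toy knowledge graph as (head, relation, tail) triples; illustrative data only.
TRIPLES = [
    ("Li Bai", "wrote", "Quiet Night Thought"),
    ("Li Bai", "friend_of", "Du Fu"),
    ("Li Bai", "lived_during", "Tang dynasty"),
    ("Quiet Night Thought", "genre", "poetry"),
    ("Tang dynasty", "capital", "Chang'an"),
]

def expand_subgraph(triples, seeds, k):
    """Collect every triple reachable within k hops of the linked seed entities."""
    neighbors = {}
    for triple in triples:
        head, _, tail = triple
        neighbors.setdefault(head, []).append(triple)
        neighbors.setdefault(tail, []).append(triple)
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for triple in neighbors.get(node, []):
            if triple not in found:
                found.append(triple)
            other = triple[2] if triple[0] == node else triple[0]
            if other not in seen:
                seen.add(other)
                frontier.append((other, depth + 1))
    return found

def tokens(text):
    """Lowercased word tokens, with underscores in relation names split apart."""
    return set(re.findall(r"\w+", text.replace("_", " ").lower()))

def score(context, triple):
    """Stand-in relevance score: word overlap between the context and the triple."""
    return len(tokens(context) & tokens(" ".join(triple)))

def retrieve(context, seeds, k=2, m=2):
    """Expand k hops around the seeds, then keep the top-m triples for this context."""
    candidates = expand_subgraph(TRIPLES, seeds, k)
    return sorted(candidates, key=lambda t: score(context, t), reverse=True)[:m]

literary = retrieve("Discuss the poetry he wrote", ["Li Bai"])
historical = retrieve("Under which dynasty did he live, and what was its capital", ["Li Bai"])
```

With the same seed entity, the literary context selects the triples about the poem and the historical context selects the triples about the dynasty, which mirrors the "Li Bai" example in the design motivation.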

Knowledge-Conditioned Chain-of-Thought Generation:
- Function: Generate grounded multi-step reasoning chains guided by retrieved background knowledge.
- Mechanism: On top of standard CoT prompting, for each reasoning step, the model is required to: (a) select the knowledge items that the current step depends on from the retrieved knowledge; (b) generate the current reasoning step based on the selected knowledge; (c) determine whether extra knowledge is needed for the next step. If so, a new round of retrieval is triggered. This forms an iterative retrieval-reasoning loop. Structured outputs are formatted using special tokens [Knowledge: ...] and [Reasoning: ...] to track knowledge utilization.
- Design Motivation: Traditional CoT does not provide knowledge sources, preventing readers from verifying the correctness of the reasoning. Knowledge-conditioned generation makes each step of reasoning traceable, enhancing both interpretability and verifiability.
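
The iterative retrieval-reasoning loop and the tagged output format can be sketched as below. The callables `generate_step` and `retrieve_more`, their return signatures, the `max_steps` cap, and the toy stubs are hypothetical; they stand in for the LLM and the retriever, whose actual interfaces the summary does not specify. The regular expression also assumes the tagged text contains no nested `]`.

```python
import re

# Matches one "[Knowledge: ...] [Reasoning: ...]" pair in the model output.
STEP_PATTERN = re.compile(r"\[Knowledge:\s*(.*?)\]\s*\[Reasoning:\s*(.*?)\]", re.DOTALL)

def parse_steps(text):
    """Return (knowledge, reasoning) pairs extracted from tagged model output."""
    return [(k.strip(), r.strip()) for k, r in STEP_PATTERN.findall(text)]

def reason(question, knowledge, generate_step, retrieve_more, max_steps=5):
    """Alternate between generating a grounded step and retrieving more knowledge.

    generate_step(question, knowledge, chain) -> (tagged_text, needs_more, done)
    retrieve_more(question, chain) -> list of additional knowledge items
    """
    chain = []
    for _ in range(max_steps):
        text, needs_more, done = generate_step(question, knowledge, chain)
        chain.extend(parse_steps(text))
        if done:
            break
        if needs_more:
            # Step (c): the model asked for extra knowledge, so retrieve again.
            knowledge = knowledge + retrieve_more(question, chain)
    return chain

def toy_generate_step(question, knowledge, chain):
    """Hypothetical stand-in for the LLM: emits one tagged step per call."""
    if not chain:
        text = f"[Knowledge: {knowledge[0]}] [Reasoning: The poem was written by Li Bai.]"
        return text, True, False
    text = f"[Knowledge: {knowledge[-1]}] [Reasoning: The poem dates from the Tang dynasty.]"
    return text, False, True

def toy_retrieve_more(question, chain):
    """Hypothetical stand-in for a second retrieval round."""
    return ["Li Bai lived during the Tang dynasty"]

demo_chain = reason(
    "When was Quiet Night Thought written?",
    ["Li Bai wrote Quiet Night Thought"],
    toy_generate_step,
    toy_retrieve_more,
)
```

Each entry of `demo_chain` pairs a reasoning step with the knowledge item it cites, which is what makes the chain traceable.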

Reasoning Consistency Verification:
- Function: Verify whether the generated reasoning chain is logically consistent with the background knowledge.
- Mechanism: A lightweight verification module is designed to check two conditions for each step of the reasoning chain: (a) whether the cited knowledge actually exists in the knowledge base (to prevent hallucinated knowledge); (b) whether the logical deduction from the cited knowledge to the reasoning conclusion is valid (using an NLI model to judge entailment). If verification fails, the model backtracks to the previous step to re-reason, allowing up to 3 backtracks. This self-correction mechanism captures errors during the reasoning process rather than waiting for the final answer.
- Design Motivation: LLMs are prone to error accumulation during reasoning; detecting and correcting errors early is more effective than post-hoc verification.
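
A minimal sketch of the two checks and the backtracking budget follows. It rests on several assumptions not stated in the summary: the backtrack limit of 3 is treated as a budget for the whole chain, "backtracking" is interpreted as regenerating the failed step from the last verified prefix, and `toy_entails` is a crude substring check standing in for the NLI model. All function names are hypothetical.

```python
def verify_step(cited, conclusion, knowledge_base, entails):
    """Check (a) the cited knowledge exists and (b) it entails the conclusion."""
    if cited not in knowledge_base:
        return False  # hallucinated knowledge
    return entails(cited, conclusion)

def reason_with_backtracking(propose_step, knowledge_base, entails,
                             num_steps, max_backtracks=3):
    """Build a chain step by step, regenerating any step that fails verification.

    propose_step(chain, attempt) -> (cited_knowledge, conclusion)
    """
    chain = []
    backtracks = 0
    attempt = 0
    while len(chain) < num_steps:
        cited, conclusion = propose_step(chain, attempt)
        if verify_step(cited, conclusion, knowledge_base, entails):
            chain.append((cited, conclusion))
            attempt = 0
        elif backtracks < max_backtracks:
            backtracks += 1
            attempt += 1  # retry from the last verified step
        else:
            break  # budget exhausted; return the verified prefix
    return chain, backtracks

KB = {"Penguins are birds", "Penguins cannot fly"}

def toy_entails(premise, hypothesis):
    """Stand-in for an NLI model: entailed only if the hypothesis restates the premise."""
    return premise.lower() in hypothesis.lower()

def toy_propose(chain, attempt):
    """Hypothetical generator whose first attempt cites knowledge absent from KB."""
    if not chain:
        if attempt == 0:
            return "Penguins can fly", "Penguins can fly, so they migrate by air."
        return "Penguins are birds", "Penguins are birds, so they lay eggs."
    return "Penguins cannot fly", "Penguins cannot fly, so they do not migrate by air."

verified_chain, used_backtracks = reason_with_backtracking(
    toy_propose, KB, toy_entails, num_steps=2
)
```

In the toy run the first proposal is rejected by check (a), one backtrack is spent, and the remaining two steps pass both checks.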

Loss & Training

The context encoder is trained using the InfoNCE contrastive loss to pull together context-relevant knowledge pairs and push apart context-irrelevant knowledge pairs. The NLI verification module is fine-tuned using standard SNLI/MNLI data. The backbone LLM is instruction-tuned on a small amount of reasoning data with knowledge annotations to learn to output structured knowledge citation formats.
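
For reference, the InfoNCE objective for a single context can be written as the negative log-probability of the positive knowledge item under a softmax over similarity scores. The sketch below is a plain-Python illustration; the cosine similarity, the temperature value of 0.1, and the function name `info_nce` are assumptions, since the summary does not give the encoder's similarity function or hyperparameters.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(context_vec, positive_vec, negative_vecs, temperature=0.1):
    """InfoNCE loss: -log( exp(s_pos / t) / sum_j exp(s_j / t) )."""
    logits = [cosine(context_vec, positive_vec) / temperature]
    logits += [cosine(context_vec, neg) / temperature for neg in negative_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# A positive aligned with the context gives a lower loss than a misaligned one.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss raises the similarity between a context and the knowledge actually used for it while lowering the similarity to randomly sampled knowledge.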

Key Experimental Results

Main Results

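| Dataset | Metric | Ours | CoT (GPT-4) | RAG+CoT | KG-CoT | Gain (vs. best baseline) |
|---|---|---|---|---|---|---|
| StrategyQA | Acc | 82.3 | 74.5 | 77.8 | 79.1 | +3.2 |
| OpenBookQA | Acc | 86.7 | 78.2 | 82.4 | 83.5 | +3.2 |
| ARC-Challenge | Acc | 89.5 | 83.6 | 86.2 | 87.1 | +2.4 |
| CommonsenseQA | Acc | 84.1 | 79.3 | 81.5 | 82.3 | +1.8 |
| MedQA | Acc | 71.8 | 62.4 | 68.3 | 66.5 | +3.5 |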
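
The Gain column appears to be the difference between the proposed method and the strongest baseline in each row. The short check below recomputes it from the reported accuracies; the dictionary layout and the `gain` helper are illustrative only.

```python
# Accuracies copied from the main results table.
RESULTS = {
    "StrategyQA": {"Ours": 82.3, "CoT (GPT-4)": 74.5, "RAG+CoT": 77.8, "KG-CoT": 79.1},
    "OpenBookQA": {"Ours": 86.7, "CoT (GPT-4)": 78.2, "RAG+CoT": 82.4, "KG-CoT": 83.5},
    "ARC-Challenge": {"Ours": 89.5, "CoT (GPT-4)": 83.6, "RAG+CoT": 86.2, "KG-CoT": 87.1},
    "CommonsenseQA": {"Ours": 84.1, "CoT (GPT-4)": 79.3, "RAG+CoT": 81.5, "KG-CoT": 82.3},
    "MedQA": {"Ours": 71.8, "CoT (GPT-4)": 62.4, "RAG+CoT": 68.3, "KG-CoT": 66.5},
}

def gain(row):
    """Accuracy of the proposed method minus the best baseline, to one decimal."""
    best_baseline = max(acc for name, acc in row.items() if name != "Ours")
    return round(row["Ours"] - best_baseline, 1)

gains = {dataset: gain(row) for dataset, row in RESULTS.items()}
```

The strongest baseline is KG-CoT on every dataset except MedQA, where RAG+CoT is ahead.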

Ablation Study

| Configuration | StrategyQA | OpenBookQA | Description |
|---|---|---|---|
| Full model | 82.3 | 86.7 | Full system |
| w/o Context-Aware Retrieval | 79.1 | 83.5 | Degenerates to static KG retrieval |
| w/o Knowledge Conditioning | 77.8 | 82.4 | Degenerates to RAG+CoT |
| w/o Consistency Verification | 80.5 | 84.8 | Without self-correction |
| w/o Iterative Retrieval | 80.9 | 85.1 | Only one round of retrieval |
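
To read the ablation at a glance, the accuracy drop of each configuration relative to the full model can be computed directly from the table. The variable names below are illustrative.

```python
# Accuracies copied from the ablation table.
FULL = {"StrategyQA": 82.3, "OpenBookQA": 86.7}
ABLATIONS = {
    "w/o Context-Aware Retrieval": {"StrategyQA": 79.1, "OpenBookQA": 83.5},
    "w/o Knowledge Conditioning": {"StrategyQA": 77.8, "OpenBookQA": 82.4},
    "w/o Consistency Verification": {"StrategyQA": 80.5, "OpenBookQA": 84.8},
    "w/o Iterative Retrieval": {"StrategyQA": 80.9, "OpenBookQA": 85.1},
}

# Drop in accuracy points when each component is removed.
drops = {
    name: {dataset: round(FULL[dataset] - acc, 1) for dataset, acc in scores.items()}
    for name, scores in ABLATIONS.items()
}

# Component whose removal hurts StrategyQA the most.
largest = max(drops, key=lambda name: drops[name]["StrategyQA"])
```

On these numbers the drops are 3.2/3.2 for context-aware retrieval, 4.5/4.3 for knowledge conditioning, 1.8/1.9 for consistency verification, and 1.4/1.6 for iterative retrieval (StrategyQA/OpenBookQA).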

Key Findings

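Context-aware retrieval contributes 3.2 points on both StrategyQA and OpenBookQA, which underlines the importance of selecting the correct background knowledge. In the ablation table, however, the largest drop comes from removing knowledge conditioning (4.5 and 4.3 points), not from removing context-aware retrieval.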
Consistency verification corrected an average of 8.3% of reasoning errors, showing the most significant effect in domains requiring precise reasoning, such as medical QA (MedQA).
Iterative retrieval mainly shows advantages in multi-hop reasoning tasks; for single-hop problems, the first round of retrieval is sufficient.
The proposed method has a clear advantage on tasks requiring external knowledge, but shows no significant improvement on tasks testing the internal reasoning capability of the model (such as GSM8K math reasoning).

Highlights & Insights

The three-stage "comprehend-retrieve-reason" framework simulates human cognitive processes, with a design philosophy that is both natural and effective. Unlike pure RAG, this paper dynamically triggers retrieval during reasoning, which aligns better with actual reasoning needs.
The consistency verification module provides real-time quality assurance for the reasoning process, presenting a "verification-during-generation" approach that can be widely applied to other reasoning systems.
The structured output format of knowledge citations enhances interpretability, allowing users to trace the knowledge source of each reasoning step.

Limitations & Future Work

The system depends on the coverage of the knowledge graph; a small-scale KG may fail to provide sufficient background knowledge.
The accuracy of entity linking directly impacts the quality of subsequent retrievals; ambiguous entities can lead to erroneous retrievals.
Multi-round iterative retrieval increases reasoning latency, which may be unacceptable in real-time application scenarios.
The verification module relies on an NLI model to judge logical consistency, but NLI models are themselves unreliable on complex multi-step inferences.

Comparison with Related Work

vs IRCoT (Interleaving Retrieval with CoT): IRCoT also alternates retrieval during reasoning, but uses unstructured documents instead of a knowledge graph. This paper leverages the structured information of KGs to locate the required knowledge more precisely.
vs KAPING: KAPING converts KG triples into natural language as part of the prompt. This paper goes further by achieving dynamic retrieval and consistency verification.
vs Self-RAG: Self-RAG decides when to retrieve via reflection tokens. This paper's context-aware retrieval and consistency verification provide finer retrieval control and quality assurance.

Rating

Novelty: ⭐⭐⭐⭐ The integration of context-aware retrieval and knowledge-conditioned reasoning is novel, and consistency verification increases reliability.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple reasoning benchmarks with detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear description of the framework and sufficient explanation of the motivation for each component.
Value: ⭐⭐⭐⭐ Has practical application value for knowledge-intensive reasoning tasks.