Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models¶

Conference: ICML2025
arXiv: 2410.13080
Code: RManLuo/graph-constrained-reasoning
Area: Graph Learning
Keywords: Knowledge Graph Question Answering, LLM Reasoning, Constrained Decoding, Trie Index, Hallucination Elimination

TL;DR¶

This paper proposes Graph-constrained Reasoning (GCR), which encodes knowledge graphs into a KG-Trie and embeds it into the LLM decoding process to achieve faithful reasoning with zero hallucinations, achieving SOTA on KGQA benchmarks and demonstrating zero-shot cross-graph transferability.

Background & Motivation¶

LLMs demonstrate excellent performance on reasoning tasks but still face two core challenges in scenarios requiring factual knowledge: knowledge deficiency and hallucination. Existing methods that leverage knowledge graphs (KGs) to enhance LLM reasoning can be categorized into two paradigms:

Retrieval-based: e.g., RoG, GNN-RAG, which extract relevant facts from the KG using an external retriever and feed them into the LLM. However, retriever precision is limited, capturing graph structures is difficult, and generalization to unseen issues is insufficient.
Agent-based: e.g., ToG, EffiQA, which treat the LLM as an agent interacting with the KG via multiple turns. The limitation lies in high computational overhead and latency (e.g., ToG requires an average of 11.6 LLM calls).

Key Finding: Even the leading KG-enhanced method, RoG, still suffers from a 33% hallucination error rate in graph reasoning (Sui et al., 2024). This implies that the reasoning paths generated by LLMs are not truly grounded in the KG, casting doubt on their credibility.

Design Motivation: Can structured KG constraints be directly integrated into the LLM decoding process to fundamentally eliminate reasoning hallucinations?

Method¶

The GCR framework consists of three core components: KG-Trie construction, graph-constrained decoding, and graph inductive reasoning.

1. KG-Trie Construction¶

Goal: To convert the KG into a structured index that can constrain LLM decoding.

Given a question \(q\) and a KG \(\mathcal{G}\), starting from the question entities \(\mathcal{E}_q\), all reasoning paths within \(L\) hops are retrieved using BFS. These paths are formatted as text, tokenized by the LLM tokenizer, and used to build a prefix tree:

\[\mathcal{W}_z = \text{BFS}(\mathcal{G}, \mathcal{E}_q, L)\]

\[\mathcal{T}_z = \text{Tokenizer}(\mathcal{W}_z)\]

\[\mathcal{C}_\mathcal{G} = \text{Trie}(\mathcal{T}_z)\]

Advantages: KG-Trie supports path traversal in constant time \(O(|\mathcal{W}_z|)\); it can be pre-built offline or constructed on-the-fly dynamically; the average 2-hop construction time is only 0.28 seconds, requiring 0.5 MB of space.

2. Graph-constrained Decoding¶

Core Idea: Use the KG-Trie to constrain token generation during LLM decoding, guaranteeing that the outputted reasoning paths strictly exist in the KG.

The generation probability is modeled as:

\[P_\phi(a, \boldsymbol{w}_z | q) = \underbrace{P_\phi(a|q, \boldsymbol{w}_z)}_{\text{常规解码}} \cdot \underbrace{\prod_{i=1}^{|\boldsymbol{w}_z|} P_\phi(w_{z_i}|q, w_{z_{1:i-1}}) \cdot \mathcal{C}_\mathcal{G}(w_{z_i}|w_{z_{1:i-1}})}_{\text{图约束解码}}\]

The constraint function is a hard constraint (0/1 mask):

\[\mathcal{C}_\mathcal{G}(w_{z_i}|w_{z_{1:i-1}}) = \begin{cases} 1, & \exists \text{prefix}(w_{z_{1:i}}, \boldsymbol{w}_z), \boldsymbol{w}_z \in \mathcal{W}_z \\ 0, & \text{otherwise} \end{cases}\]

The Trie constraint is applied during the generation of the reasoning path (between <PATH>...</PATH> tags). Once the path generation is complete, it switches back to standard decoding to generate the hypothetical answer.

Training: Fine-tune a lightweight KG-specific LLM (e.g., Qwen2-0.5B to Llama-3.1-8B). The loss function is standard autoregressive loss:

\[\mathcal{L} = \mathbb{E}_{(q, \boldsymbol{w}_z, a) \sim \mathcal{D}_\mathcal{G}} \left[ \log \prod_{i} P_\phi(a_i|q, \boldsymbol{w}_z, a_{1:i-1}) \prod_{j} P_\phi(w_{z_j}|q, w_{z_{1:j-1}}) \right]\]

Training data: The shortest paths between question entities and answer entities are utilized as reasoning paths. The WebQSP training set contains 28,307 samples, and the CWQ training set contains 181,602 samples.

3. Graph Inductive Reasoning¶

Beam search is utilized to generate \(K\) candidate reasoning paths and hypothetical answers, which are then fed into a powerful general LLM (e.g., ChatGPT, GPT-4o-mini) for inductive reasoning to produce the final answer:

\[\mathcal{Z}_K = \{a_k, \boldsymbol{w}_{z_k}\}_{k=1}^K = \arg\text{top-}K \, P_\phi(a, \boldsymbol{w}_z | q)\]

\[P_\theta(\mathcal{A}|q, \mathcal{Z}_K) \simeq \prod_{k=1}^K P_\theta(\mathcal{A}|q, a_k, \boldsymbol{w}_{z_k})\]

Using a FiD-like architecture, all paths are processed in a single LLM call without requiring additional fine-tuning of the general LLM.

Key Experimental Results¶

Main Results: KGQA Reasoning Performance (Table 1)¶

Method	WebQSP Hit	WebQSP F1	CWQ Hit	CWQ F1
ChatGPT+Self-Consistency	83.5	63.4	56.0	48.1
RoG (Llama-2-7B)	85.7	70.8	62.6	56.2
GNN-RAG+RA	90.7	73.5	68.7	60.4
GCR (Llama-3.1-8B + ChatGPT)	92.6	73.2	72.7	60.9
GCR (Llama-3.1-8B + GPT-4o-mini)	92.2	74.1	75.8	61.7

GCR achieves SOTA on both datasets, outperforming the second-best method by 2.1% in WebQSP Hit and 9.1% in CWQ Hit.

Efficiency Comparison (Table 2, WebQSP)¶

Method	Hit	Avg Inference Time (s)	Avg LLM Calls	Avg Input Tokens
RoG	85.7	2.60	2	521
ToG (Agent)	75.1	16.14	11.6	7,069
GCR	92.6	3.60	2	231

GCR requires only 2 LLM calls and 231 input tokens, far exceeding the efficiency of agent-based methods.

Hallucination Elimination¶

GCR achieves a 100% faithful reasoning rate on both WebQSP and CWQ (the reasoning paths are fully verifiable in the KG).
Removing the KG constraints drops the faithful reasoning rate on CWQ to 48.1%, verifying the necessity of the Trie constraint.

Zero-Shot Cross-Graph Transfer (Table 6)¶

Model	FreebaseQA	CSQA	MedQA
GPT-4o-mini	89	91	75
GCR (GPT-4o-mini)	94	94	79

Without any additional training, directly plug in the Trie of a new KG (e.g., ConceptNet, medical KG); FreebaseQA improves by +5% and CSQA improves by +3%.

Ablation Study: KG-Specific LLM Ablation (Table 4, WebQSP)¶

The fine-tuned Qwen2-0.5B model achieves a Hit score of 87.5, surpassing the zero-shot Llama-3.1-70B (38.5), demonstrating the critical role of fine-tuning for graph reasoning. The optimal configurations are beam size \(K=10\) and hop limit \(L=2\).

Highlights & Insights¶

Paradigm Innovation: Diverging from retrieval-based and agent-based paradigms, GCR establishes a third paradigm of "constrained decoding"—directly embedding the graph structure into the LLM decoding process, offering an elegant and highly efficient design.
Dual-LLM Collaborative Architecture: Lightweight KG-specific LLM (for graph reasoning) + powerful general LLM (for inductive reasoning), leveraging the strengths of both sides with no fine-tuning needed for the general LLM.
Zero-Hallucination Guarantee: Achieves theoretical zero hallucinations via hard-constraint masking, rather than soft constraints or post-hoc filtering.
Plug-and-Play Cross-Graph Transfer: Relies solely on swapping the KG-Trie to transfer to new graphs, highlighting the modularity of the proposed method.

Limitations & Future Work¶

Limitation of the Zero-Hallucination Definition: It only guarantees that the reasoning paths exist in the KG, but the KG itself might be incomplete or contain erroneous facts, making it unable to detect false positives inherent to the KG.
Scalability on Complex Multi-Hop Questions: As \(L\) increases, the KG-Trie construction overhead grows exponentially (\(O(E^L)\)), leading to a significant efficiency drop beyond 3 hops. The paper suggests combining with planning-decomposition approaches, but this has not been empirically verified.
Irrelevant Reasoning Paths: The LLM occasionally selects paths that exist in the graph but are irrelevant to the question (e.g., in Case 1, querying for a constituency but wandering into political office paths), leading to incorrect answers.
Dependence on KG Coverage: If the KG is incomplete (e.g., missing character-actor relationships), the method fails completely.
General LLM Cost: Inductive reasoning relies on closed-source models like ChatGPT/GPT-4o-mini, raising concerns about deployment costs and privacy.

RoG (Luo et al., ICLR 2024): The predecessor of GCR, presenting a plan-retrieve-reason framework, but still suffering from a 33% hallucination rate.
GNN-RAG (Mavromatis & Karypis, 2024): Performs graph retrieval utilizing GNNs, which is complementary to GCR—GNN-RAG outputs can be used as subgraph inputs to the KG-Trie.
Constrained Decoding (De Cao et al., ICLR 2022): The Trie-constrained decoding in GCR directly inherits the autoregressive entity retrieval concept of GENRE.
Insight: The concept of constrained decoding can be generalized to other structured data (e.g., database schemas, code ASTs) to achieve structure-aware, faithful generation.

Rating¶

Novelty: ⭐⭐⭐⭐ — Introducing Trie-constrained decoding to KG reasoning is a clever cross-domain adaptation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage across 5 datasets, 22 baselines, ablation studies, efficiency analyses, hallucination rates, and transferability testing.
Writing Quality: ⭐⭐⭐⭐ — Clear mathematical formulations, well-motivated, and detailed appendices.
Value: ⭐⭐⭐⭐ — High practicality with zero-hallucination guarantees, high efficiency, and zero-shot transferability.