DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA¶
Conference: ICML 2026
arXiv: 2510.16302
Code: https://anonymous.4open.science/r/DTKG-621F
Area: Graph Learning / Knowledge Graph / Multi-Hop QA / RAG
Keywords: Knowledge Graph, Multi-Hop Reasoning, Dual-Process Theory, Fact Checking, Path Pruning
TL;DR¶
DTKG dichotomizes multi-hop questions into "parallel fact-checking vs. chained reasoning." It first routes questions to the appropriate branch using a few-shot classifier. The parallel branch verifies atomic facts using KG triples, while the chained branch performs DFS path expansion and score pruning on Wikidata, supplemented by "task-aware" denoising. It improves by 5%–29.5% over single-strategy baselines like KGR and ToG across six datasets.
Background & Motivation¶
Background: RAG has become the mainstream solution to mitigate LLM hallucinations, and multi-hop QA is the most challenging sub-task within RAG, requiring entity-relation chain reasoning across multiple knowledge units. Current mainstream approaches either focus on LLM-centric fact-checking (e.g., KGR) or KG path-based chain construction (e.g., ToG).
Limitations of Prior Work: Both types of methods suffer from a "one-size-fits-all" error. LLM-based fact-checking excels at decomposing answers into independent facts for KG comparison but loses intermediate conclusions and breaks the reasoning chain in problems requiring sequential attributes. Conversely, KG path-based methods excel at step-by-step relation traversal but generate massive redundant branches for parallel tasks, wasting computation on irrelevant paths.
Key Challenge: Multi-hop questions are inherently heterogeneous in topology—some sub-questions are mutually independent (parallel), while others are strongly dependent on predecessor conclusions (chained). Existing architectures use a single "reasoning core," which is inevitably suboptimal for one class of problems. Stanovich refers to this as the "cognitive miser" tendency, where all problems are subjected to the same cheap heuristics.
Goal: (i) Dynamically route multi-hop questions to the correct processing branch; (ii) design specialized KG reasoning operators for both branches; (iii) differentiate denoising strategies by task type, as parallel task noise (redundant triples) differs fundamentally from chained task noise (divergent paths).
Key Insight: The authors adopt the dual-process theory of Tversky & Kahneman—humans use "unconscious fast classification + conscious deep processing" for different tasks. This corresponds to a two-stage pipeline: "first classify, then branch."
Core Idea: A few-shot LLM classifier simulates the "unconscious stage" to quickly determine the question type. Two distinct KG processing pipelines (fact-checking vs. path construction) then simulate the "conscious stage" to focus on their respective strengths, paired with task-related denoising to eliminate strategy-task mismatch.
Method¶
Overall Architecture¶
The input is a natural language multi-hop question \(Q\), and the output is the answer \(A\). The pipeline is divided into two stages:
- Classification Stage (Unconscious): A 6-shot prompted LLM classifier \(\mathcal{C}: Q \to \{para, chain\}\) identifies the question as "parallel fact-checking" or "chained reasoning." The prompt includes five boundary rules and six labeled examples. The classification signal is "dependency on intermediate reasoning conclusions."
- Branch Processing Stage (Conscious): Based on the classification, the question is routed to either the Parallel Fact-Checking Branch or the Chained Reasoning Branch. Both branches utilize Wikidata as the external KG and share a two-stage hybrid scoring module (embedding similarity + reranker) for relation/triple matching.
The authors formalize the multi-hop reasoning space as a bipartite partition \(\mathcal{Q} = \mathcal{Q}_{para} \cup \mathcal{Q}_{chain}\) and propose the Optimal Strategy Alignment theorem: the optimal core \(\mathcal{K}^* = \arg\max_{\mathcal{K}} \mathbb{P}(A|Q,\mathcal{K})\) is achieved if and only if \(\mathcal{K}\) is topologically aligned with \(Q\).
Key Designs¶
-
Few-Shot Task Classifier (Unconscious Stage):
- Function: Categorizes the input question to decide which branch to execute.
- Mechanism: Employs 6-shot prompting to let the LLM determine the key signal of "dependency on intermediate conclusions." The prompt consists of five explicit rules and six annotated examples; the classifier outputs "yes" for chained and "no" for parallel. It relies on in-context learning without requiring supervised training data.
- Design Motivation: Simulates human "unconscious" fast pattern recognition to avoid expensive labeling and training. It addresses the root cause of strategy-task mismatch at the front of the pipeline. Ablations show that Random Classification causes HotpotQA ACC to drop from 85.8% to 73.0%, proving that classification quality is critical.
-
Dual-Track Processing Engine (Conscious Stage):
- Function: Performs atomic fact decomposition and KG verification for parallel questions, and DFS path expansion with pruning for chained questions.
- Mechanism (Parallel Branch): The LLM decomposes the candidate answer \(R\) into a set of atomic facts \(F = \{f_1, \ldots, f_n\}\). For each fact \(f_i\), the subject entity \(e_i\) is mapped to a Wikidata QID. Candidate triples are filtered using cosine similarity \(\text{sim}_{\cos}(f_i, t_j) = \mathbf{h}(f_i)\cdot \mathbf{h}(t_j) / (|\mathbf{h}(f_i)||\mathbf{h}(t_j)|)\) to select the top-\(K\), followed by a reranker. Scores are fused via \(\text{score}_{\text{combined}} = \alpha \cdot \text{score}_{\text{rerank}} + (1-\alpha)\cdot \text{sim}_{\cos}\). If inconsistent with the top triple \(t^*\), the fact is rewritten as \(f_i' = f_{\text{rewrite}}(f_i, t^*)\).
- Mechanism (Chained Branch): Extracts the central entity \(e_0\) and retrieves head/tail relations \(R(e_0) = R_{\text{head}} \cup R_{\text{tail}}\). It performs DFS along paths \(P_k = [e_0 \xrightarrow{r_1} e_1 \cdots \xrightarrow{r_k} e_k]\), scoring paths by \(\text{score}_{\text{path}}(P_k) = \prod_{i=1}^k \text{score}_{\text{combined}}(r_i)\). Complexity is controlled by a maximum depth \(D_{\max}=3\), keeping top-\(W_{\max}\) per step, threshold \(\theta\) truncation, and an LLM-based selection of at most three candidates. An early stop condition \(\text{stop}(P_k) = \mathbb{I}[\text{info}(P_k) \supseteq \text{info}(Q)]\) is checked at each layer.
- Design Motivation: Optimal operators differ by task. Parallel tasks require a flat "decomposition-matching-rewriting" pipeline focused on independent facts. Chained tasks require a deep "entity-relation-scoring-expansion-earlystop" pipeline to link relations. Sharing the scoring module optimizes implementation while keeping pruning and generation independent to avoid interference.
-
Task-Aware Denoising:
- Function: Cleans "cross-subquestion redundant triples" (parallel) and "off-track branch paths" (chained) based on the branch type.
- Mechanism: A two-layer mechanism. The first is rule-based filtering using a static keyword library \(K_{\text{invalid}} = \{\text{ID, source, version, metadata}\}\): \(\text{filter}_{\text{rule}}(r) = \text{True}\) if \(\exists k \in K_{\text{invalid}}, k \in \text{label}(r)\). This removes KG administrative metadata. The second is dynamic filtering where the LLM scores remaining relations \(\text{score}_n(r, Q) = \text{LLM}(\text{Prompt}_n \oplus r \oplus Q)\); relations with \(\text{score}_n < \theta\) are discarded.
- Design Motivation: Universal threshold pruning often deletes critical information. Categorizing noise into "administrative" and "contextual" allows rule-based efficiency for the former and LLM-based precision for the latter.
Loss & Training¶
DTKG is a training-free reasoning framework. All "learning" is achieved through LLM in-context prompting with no additional parameters. Key hyperparameters include embedding top-\(K\), reranker fusion weight \(\alpha\), DFS max depth \(D_{\max}=3\), width \(W_{\max}\), and relation score thresholds. The base LLM is Llama-3-8B.
Key Experimental Results¶
Main Results¶
Evaluated on six datasets against baselines like COT, CRITIC, KGR, and ToG using Llama-3-8B. Metrics are EM and ACC (semantic matching accuracy judged by BERTScore).
| Dataset | Metric | Ours (DTKG) | Top Baseline | Gain |
|---|---|---|---|---|
| HotpotQA | EM/ACC | 38.2 / 85.8 | ToG 37.1 / KGR 83.5 | +1.1 / +2.3 |
| Mintaka | EM/ACC | 67.6 / 93.9 | COT 66.6 / KGR 92.0 | +1.0 / +1.9 |
| CWQ | EM/ACC | 46.3 / 90.0 | KGR 45.0 / ToG 88.1 | +1.3 / +1.9 |
| QALD10-en | EM/ACC | 50.0 / 85.0 | COT 50.0 / KGR 83.0 / CRITIC 83.0 | tie / +2.0 |
| GraphRAG-Bench | EM/ACC | 14.5 / 87.1 | COT 14.4 / ToG 84.5 | +0.1 / +2.6 |
| MuSiQue | EM/ACC | 18.5 / 83.0 | COT 18.0 / ToG 80.5 | +0.5 / +2.5 |
The absolute ACC improvement relative to the Original (no KG) reaches up to 29.5% (53.5% → 83.0%) on MuSiQue.
Ablation Study¶
Classifier ablations (Llama-3-8B, EM/ACC):
| Configuration | HotpotQA | Mintaka | CWQ | QALD10-en | Description |
|---|---|---|---|---|---|
| Only-Fact Verification | 35.6 / 83.5 | 62.3 / 91.5 | 40.5 / 86.0 | 49.5 / 83.5 | All tasks follow the parallel branch; performance drops on chained datasets. |
| Only-Reasoning Chain | 36.5 / 85.5 | 57.0 / 91.5 | 40.6 / 87.1 | 50.0 / 81.0 | All tasks follow the chained branch; EM on parallel task Mintaka plummets. |
| Random Classification | 30.5 / 73.0 | 56.0 / 86.0 | 41.0 / 79.0 | 47.5 / 76.0 | Mismatch cost is massive; HotpotQA ACC drops by 12.8%. |
| Full DTKG | 38.2 / 85.8 | 67.6 / 93.9 | 46.3 / 90.0 | 50.0 / 85.0 | Complete dual-track. |
Key Findings¶
- The classifier is essential: Random classification reduces HotpotQA ACC from 85.8% to 73.0%, proving that a hybrid strategy is worse than a single branch if not based on question type. This justifies the dual-process design.
- Single branches perform well only in their domain: Only-Chain is close to DTKG on the chain-heavy CWQ (EM 40.6 vs 46.3) but significantly worse on the parallel-heavy Mintaka (EM 57.0 vs 67.6), validating the need for strategy-task alignment.
- ACC shows the largest gain: The significant boost on MuSiQue ACC suggests KG grounding recovers semantically correct but differently phrased answers.
Highlights & Insights¶
- Adopting psycholgical dual-process theory as an architectural metaphor is not just for show: the "cheap fast judgment" (Type 1) and "deep processing" (Type 2) explain why routing is handled upfront and specialized operators are handled downstream. This can generalize to other heterogeneous tasks.
- The bipartite denoising (administrative vs. contextual) is highly reusable. Rule-based methods handle absolute noise, while LLM scoring handles relative noise, avoiding the precision-recall trade-off of single-threshold pruning.
- The "embedding top-\(K\) → reranker → weighted fusion" two-stage scoring is a pragmatic engineering design that maintains recall while improving precision.
Limitations & Future Work¶
- The classifier relies on 6-shot prompting, and its robustness depends on the LLM's in-context capabilities. Stability across models other than Llama-3-8B was not cross-validated.
- Strong dependency on Wikidata: Coverage is limited by KG completeness. In sparse scenarios (e.g., medical, legal, or non-English KGs), the chained branch's recall might collapse.
- Fixed \(D_{\max} = 3\) constraint: While most multi-hop questions are \(\leq 3\) hops, datasets like MuSiQue contain 4-hop samples that would be truncated.
- Cumulative LLM calls: The pipeline requires multiple calls (classification + scoring + generation) per question, which likely increases latency and cost compared to single-branch baselines.
Related Work & Insights¶
- vs KGR (Guan et al., 2024): KGR uses KG for fact verification, essentially equivalent to running only the parallel branch of DTKG. It lags on chained datasets like CWQ.
- vs ToG (Sun et al., 2024): ToG performs iterative path exploration, equivalent to running only the chained branch. It wastes computation on parallel datasets like Mintaka.
- vs CoT (Wei et al., 2022): CoT is pure LLM reasoning without KG grounding, making it hallucination-prone. DTKG outperforms CoT significantly in accuracy (e.g., 87.1 vs 77.2 on GraphRAG-Bench).
Rating¶
- Novelty: ⭐⭐⭐ Limited single-point innovation (classifier + known branches), but the overall "dual-process guided routing" is a novel combination.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covered 6 datasets and 3 classification ablations, though denoising ablations and multi-hop depth analyses are missing.
- Writing Quality: ⭐⭐⭐⭐ Clear logic from motivation to theory and method, though the cognitive science background is occasionally over-emphasized.
- Value: ⭐⭐⭐⭐ Provides a useful template for multi-strategy RAG systems by applying the "classify + route" paradigm to KG-QA.