NeuroPath: Neurobiology-Inspired Path Tracking and Reflection for Semantically Coherent Retrieval¶
Conference: NeurIPS 2025 | arXiv: 2511.14096 | Code: GitHub | Area: Video Understanding / RAG | Keywords: RAG, multi-hop QA, knowledge graph, place cells, semantic path tracking
TL;DR¶
Inspired by the place-cell navigation and memory consolidation mechanisms of the hippocampus, this paper proposes NeuroPath, a RAG framework based on semantic path tracking. Through LLM-driven, goal-directed path construction and a post-retrieval completion strategy, it achieves average improvements of 16.3% in recall@2 and 13.5% in recall@5 on multi-hop QA tasks.
Background & Motivation¶
Background: RAG significantly enhances LLM performance on knowledge-intensive tasks. Naive RAG retrieves documents based on vector similarity but cannot capture inter-document associations, making it ill-suited for multi-hop reasoning.
Limitations of Prior Work:
- Naive RAG: Flat knowledge organization with no cross-document association.
- Graph-based RAG (HippoRAG): Uses the Personalized PageRank (PPR) algorithm to propagate node importance but ignores edge semantics, so retrieval favors structural relevance over semantic coherence.
- Graph-based RAG (LightRAG): Builds subgraphs by collecting direct neighbors, which introduces substantial noise.
Key Challenge: The advantage of graph structures lies in explicit semantic reasoning paths, yet existing graph-based methods focus more on topological structure than on path-level semantic coherence, failing to fully exploit this advantage.
Goal: (1) Address the loss of semantic coherence in retrieval results; (2) eliminate irrelevant noise introduced during node matching and subgraph construction.
Key Insight: Drawing an analogy to the navigation mechanism of hippocampal place cells—place cells preplay future path sequences during navigation and replay them during rest to consolidate memory.
Core Idea: Entities in the knowledge graph are treated as place cells and triples as place fields; dynamic retrieval is performed via LLM-driven, goal-directed semantic path tracking.
Method¶
Overall Architecture¶
The framework follows a three-step pipeline: (1) Static Indexing: an LLM extracts a knowledge graph from documents and constructs coreference sets; (2) Dynamic Path Tracking: simulating the place cell preplay mechanism, the LLM performs goal-directed path filtering and expansion from seed nodes; (3) Post-Retrieval Completion: simulating the replay mechanism, a two-stage retrieval is conducted using intermediate reasoning chains and the original query to fill in missing information.
Key Designs¶
- Static Indexing and Pseudo-Coreference Resolution:
- An LLM extracts the entity set \(\mathcal{E}\) and relation triple set \(\mathcal{T}\) from each document \(d_i\) in a single pass.
- A potential coreference set \(\mathcal{R}_i\) is constructed for each entity \(e_i\) from the candidate entities most similar to it:
  \(\text{Sim}(i,j) = \text{CosSim}(\text{Enc}(i), \text{Enc}(j))\)
  \(\mathcal{R}_i = \text{argtopk}_j \, \text{Sim}(i,j), \quad i,j \in \mathcal{E}\)
  By default, the top-5 most similar entities whose cosine similarity exceeds 0.8 are retained as the coreference set.
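As a concrete sketch, the top-k-plus-threshold logic of coreference-set construction can be reproduced with any text encoder. The character-trigram embedding below is purely illustrative (the paper's encoder \(\text{Enc}\) is a neural model, and the entity names here are invented for the demo); only the defaults of 0.8 similarity and top-5 come from the paper.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy character-trigram embedding, a stand-in for the paper's
    neural encoder Enc(.) -- illustrative only."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cos_sim(a: Counter, b: Counter) -> float:
    dot = sum(v * b[k] for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def coreference_sets(entities, threshold=0.8, top_k=5):
    """R_i: up to top_k most similar entities whose similarity exceeds
    the threshold (0.8 and top-5 are the paper's defaults)."""
    vecs = {e: embed(e) for e in entities}
    result = {}
    for e in entities:
        scored = sorted(((cos_sim(vecs[e], vecs[o]), o)
                         for o in entities if o != e), reverse=True)
        result[e] = [o for s, o in scored[:top_k] if s > threshold]
    return result

entities = ["Barack Obama", "Barack H. Obama", "Michelle Obama", "Paris"]
sets = coreference_sets(entities)
```

Note that near-duplicate surface forms clear the 0.8 bar while merely related entities ("Michelle Obama") and unrelated ones ("Paris") do not; with a real neural encoder the threshold also captures aliases with less lexical overlap.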
- Dynamic Path Tracking (Simulating Preplay):
- Seed Node Selection: Key entities are extracted from the query and matched to the most similar nodes in the graph; coreference sets are expanded as initial seeds \(\mathcal{S}^0\).
- Path Expansion: Triples \(\mathcal{P}_{sub}^h\) connected to the seed nodes are retrieved and concatenated with the currently expanded paths \(\mathcal{P}_{exp}^h\) to form the candidate paths: \(\mathcal{P}_{cur}^{h+1} = \mathcal{P}_{val}^h + \text{Cat}(\mathcal{P}_{exp}^h, \mathcal{P}_{sub}^h)\)
- LLM Tracking: The LLM filters candidate paths, marks valid paths \(\mathcal{P}_{val}^h\), decides whether further expansion is needed, and generates expansion requirements \(g^h\).
- Pruning Based on Expansion Requirements: The expansion requirements generated by the LLM at the previous hop are used to prune new paths by similarity, preventing exponential growth: \(\mathcal{P}_{cur}^{h'} = \text{argtopk}_p \text{Sim}(g^{h-1}, p), \quad p \in \mathcal{P}_{cur}^h\) Top-30 paths are retained by default.
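The steps above can be sketched as a single hop loop. In this sketch, `llm_track` stands in for the LLM call that marks valid paths, decides whether to continue, and emits the expansion requirement \(g^h\) (the stub below is a keyword heuristic, not the paper's prompt), and `sim` stands in for the embedding similarity used in pruning; the toy triples and the 2-hop question are invented for illustration.

```python
def track_paths(seeds, triples, llm_track, sim, max_hops=2, top_p=30):
    """Goal-directed path tracking (the preplay analogue), simplified:
    expand paths hop by hop, let the LLM keep valid paths, and prune
    new candidates against the previous expansion requirement g^{h-1}."""
    # Start from single-triple paths touching a seed entity.
    paths = [(t,) for t in triples if t[0] in seeds or t[2] in seeds]
    valid, goal = [], None
    for _hop in range(max_hops):
        if goal is not None:
            # Prune: keep the top_p paths most similar to g^{h-1}
            # (top-30 is the paper's default).
            paths = sorted(paths, key=lambda p: sim(goal, p),
                           reverse=True)[:top_p]
        kept, to_expand, goal, done = llm_track(paths)
        valid.extend(kept)
        if done or not to_expand:
            break
        # Cat(P_exp, P_sub): extend each kept path with triples
        # adjacent to its tail entity, avoiding revisits.
        paths = [p + (t,) for p in to_expand
                 for t in triples if t[0] == p[-1][2] and t not in p]
    return valid

# Toy 2-hop example: "Which country was Alice born in?"
triples = [
    ("Alice", "born in", "Paris"),
    ("Paris", "capital of", "France"),
    ("France", "member of", "EU"),
]

def llm_stub(paths):
    # Stand-in for the LLM tracker: a path is valid once it reaches France.
    valid = [p for p in paths if p[-1][2] == "France"]
    rest = [p for p in paths if p not in valid]
    return valid, rest, "find the country Alice was born in", bool(valid)

chains = track_paths({"Alice"}, triples, llm_stub, sim=lambda g, p: 0.0)
```

The returned `chains` contain the coherent 2-hop path Alice → Paris → France; each hop is one LLM round trip, which is where the method's inference cost concentrates.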
- Post-Retrieval Completion (Simulating Replay):
- After finalizing the paths, source documents on the paths are collected as candidates \(\mathcal{D}_p\).
- The reasoning chain from the last hop \(c_{\text{last}}\) and the expansion requirement \(g_{\text{last}}\) are concatenated with the original query \(q\) to perform a second-stage retrieval that fills in missing information.
- The final document set is \(\mathcal{D}_{ret} = \mathcal{D}_p \cup \mathcal{D}_e\), where \(\mathcal{D}_e\) denotes the complementary documents returned by the second-stage retrieval.
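A minimal sketch of this completion step, assuming each triple carries a pointer to its source document (the `source_of` mapping) and that `retrieve` is any dense retriever over the corpus; both names and the toy data are assumptions for illustration, not the paper's interfaces.

```python
def post_retrieval_completion(query, valid_paths, source_of, retrieve,
                              last_chain, last_goal):
    """Replay analogue: D_p collects the source documents of the
    triples on the final paths; a second-stage retrieval with the
    augmented query fetches complementary documents D_e."""
    d_p = {source_of[t] for path in valid_paths for t in path}
    # Augment the query with the last hop's reasoning chain c_last and
    # expansion requirement g_last to target the missing information.
    expanded_query = " ".join([query, last_chain, last_goal])
    d_e = set(retrieve(expanded_query))
    return d_p | d_e  # D_ret = D_p ∪ D_e

# Toy usage with a stub retriever.
paths = [(("Alice", "born in", "Paris"),
          ("Paris", "capital of", "France"))]
source_of = {("Alice", "born in", "Paris"): "doc1",
             ("Paris", "capital of", "France"): "doc2"}
docs = post_retrieval_completion(
    "Which country was Alice born in?", paths, source_of,
    retrieve=lambda q: ["doc3"],
    last_chain="Alice born in Paris; Paris capital of France",
    last_goal="confirm France is a country",
)
```

The union keeps every document already evidenced by the reasoning chain while letting the augmented query recover evidence the path-tracking stage missed, which the ablation (hop=2 without completion dropping from 48.0 to 41.8 R@2 on MuSiQue) shows is substantial.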
Loss & Training¶
- No additional training is required—zero-shot prompting is used throughout.
- Graph indexing uses GPT-4o-mini.
- Path tracking can use either GPT-4o-mini or Qwen-2.5-14B.
- The maximum number of reasoning hops is set to 2 by default.
Key Experimental Results¶
Main Results — Retrieval Performance (Contriever Retriever)¶
| Method | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 | HotpotQA R@2 | HotpotQA R@5 | Avg R@2 | Avg R@5 |
|---|---|---|---|---|---|---|---|---|
| BGE-M3 (Naive) | 40.4 | 54.2 | 64.9 | 71.8 | 71.8 | 84.7 | 59.0 | 70.2 |
| HippoRAG 2 (Graph-based) | 41.8 | 55.5 | 62.5 | 74.2 | 65.3 | 83.4 | 56.5 | 71.0 |
| Iter-RetGen (Iterative) | 46.0 | 59.8 | 62.1 | 76.5 | 78.3 | 90.6 | 62.1 | 75.6 |
| NeuroPath | 48.0 | 62.7 | 77.2 | 92.5 | 75.6 | 90.4 | 66.9 | 81.9 |
QA Performance (GPT-4o-mini + Contriever)¶
| Method | MuSiQue EM | 2Wiki EM | HotpotQA EM | Avg EM |
|---|---|---|---|---|
| HippoRAG | 27.8 | 58.6 | 43.3 | 43.2 |
| HippoRAG 2 | 27.4 | 46.0 | 50.7 | 41.4 |
| Iter-RetGen | 29.9 | 51.5 | 48.7 | 43.4 |
| NeuroPath | 31.4 | 63.4 | 50.5 | 48.4 |
Ablation Study¶
| Component | MuSiQue R@2 | 2Wiki R@2 | HotpotQA R@2 | Token Consumption Change |
|---|---|---|---|---|
| Full model (p=30) | 48.0 | 77.2 | 75.9 | Baseline |
| w/o pruning | 48.7 | 76.8 | 75.7 | Tokens increase ~45% |
| p=20 | 47.3 | 76.5 | 74.9 | Tokens decrease ~7% |
| w/o post-retrieval completion (hop=2) | 41.8 | 73.6 | 67.5 | — |
| w/o post-retrieval completion (hop=1) | 35.5 | 61.0 | 61.3 | — |
Key Findings¶
- Compared to state-of-the-art graph-based RAG methods, recall@2 improves by an average of 16.3% and recall@5 by 13.5%.
- Compared to iterative RAG methods, NeuroPath achieves higher accuracy while reducing token consumption by 22.8%.
- The largest gains are observed on the most challenging MuSiQue dataset, which is specifically designed for difficult multi-hop reasoning.
- NeuroPath is robust to the choice of retriever, whereas iterative methods and HippoRAG 2 exhibit high sensitivity (differences up to 20%).
- Robust performance is maintained across 4 smaller LLMs (Llama3.1, GLM4, Mistral0.3, Gemma3).
- The post-retrieval completion (Replay mechanism) contributes approximately 6–8% of the recall improvement.
Highlights & Insights¶
- Novel Neuroscience-Inspired Analogy: The mapping from place cell preplay/replay to path tracking/post-retrieval completion is both conceptually elegant and empirically effective.
- Path-Level Retrieval Outperforms Node/Subgraph-Level Retrieval: Explicit semantic paths ensure coherence in retrieval results, avoiding the noise introduced by subgraph-based methods.
- Active LLM Participation in Retrieval: Rather than passive matching, the LLM actively reasons, filters, and predicts expansion directions at each hop, realizing a form of "thinking-driven retrieval."
- High Token Efficiency: Token consumption is reduced by 22.8% compared to iterative RAG while simultaneously achieving higher accuracy.
Limitations & Future Work¶
- The framework relies on LLMs for path tracking, and inference costs (number of LLM API calls) remain relatively high.
- Knowledge graph quality is constrained by the LLM's extraction capability, and extraction errors propagate to downstream retrieval.
- Coreference resolution relies on a simple vector similarity threshold (0.8), which may miss coreferent entities with significant name variation.
- The maximum hop count is limited to 2; scalability to deeper reasoning chains remains to be validated.
- On tasks with lower knowledge integration demands, such as HotpotQA, the advantage over simpler methods is less pronounced.
Related Work & Insights¶
- HippoRAG: The primary competing method, which uses the PPR algorithm but ignores edge semantics. NeuroPath addresses this through explicit path-level semantic coherence.
- LightRAG: Subgraph construction introduces excessive noise (in the paper's case study it retrieved 60 entities and 169 relations yet still answered incorrectly), demonstrating that "more retrieval" does not equate to "better retrieval."
- PathRAG: Another path-based method, but it applies uniform resource allocation and disregards edge importance and semantics.
- Place Cell Theory (O'Keefe, 1971): Provides an elegant conceptual framework for the method design.
- Insight: The shift in RAG from "retrieve more" to "retrieve more precise paths" may represent a key future direction.
Rating¶
- Novelty: ⭐⭐⭐⭐ The neuroscience analogy is novel, practically grounded, and effective; the path tracking concept exhibits genuine originality.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three primary datasets plus three additional datasets, with ablations across multiple LLMs and retrievers.
- Writing Quality: ⭐⭐⭐⭐ The structure is clear and case studies are intuitive, though the depth of the neuroscience analogy could be strengthened.
- Value: ⭐⭐⭐⭐ Substantially outperforms state-of-the-art on multi-hop QA and provides important reference value for the RAG community.