MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains
Conference: AAAI 2026 · arXiv: 2508.18260 · Code: GitHub · Area: Medical Imaging · Keywords: Medical QA, Knowledge Graph, Multi-Chain Reasoning, Test-Time Inference Scaling, Retrieval-Augmented Generation
TL;DR
This paper proposes MIRAGE, a framework that extends conventional linear reasoning chains into a parallel multi-chain reasoning paradigm. It combines adaptive retrieval from structured medical knowledge graphs (via neighborhood expansion and multi-hop traversal) with cross-chain verification to resolve contradictions, consistently outperforming GPT-4o, ToT, and Search-o1 on three medical QA benchmarks.
Background & Motivation
Large Reasoning Models (LRMs) have demonstrated remarkable multi-step reasoning capabilities through chain-of-thought (CoT) prompting during test-time scaling. Models such as OpenAI o1 and DeepSeek-R1 enhance reasoning performance by extending reasoning chains without retraining. Agentic frameworks like Search-o1 further integrate retrieval-augmented generation (RAG) into the reasoning loop.
However, existing approaches face two fundamental limitations:
Limitation 1: Fragility of Linear Scaling. Current methods primarily rely on linear scaling through sequential reasoning chains or iterative retrieval rounds. If an early reasoning step is erroneous or based on incomplete evidence, the entire extended reasoning chain is compromised. While Tree-of-Thoughts (ToT) attempts to explore multiple reasoning paths, it lacks a coherent mechanism for coordinating parallel reasoning chains and performing explicit cross-chain verification. In the medical domain, where reasoning errors can have severe consequences, the linear scaling paradigm fails to effectively leverage additional computational budget.
Limitation 2: Flat Knowledge Coverage. Existing retrieval-augmented methods typically retrieve unstructured text and integrate it into the reasoning process in a flat, context-free manner. This approach neglects the structured relationships and semantic hierarchies inherent in domain knowledge—particularly in medicine, where understanding often depends on complex inter-entity relationships, causal chains, and hierarchical taxonomies. Even when more computational resources are allocated to retrieve additional information, the flat integration of isolated text fragments limits the capacity for precise multi-hop reasoning.
Core Idea: Transform linear scaling into parallel scaling—decompose complex queries into entity-anchored sub-problems, execute multiple parallel reasoning trajectories, acquire evidence through adaptive retrieval from structured knowledge graphs, and integrate answers via cross-chain verification.
Method
Overall Architecture
MIRAGE operates through four core components: a Question Decomposer that breaks complex clinical queries into entity-anchored sub-problems; an Evidence Retriever that acquires structured evidence from a knowledge graph within a reasoning-retrieval loop; an Answer Synthesizer that integrates all sub-answers and resolves contradictions through consistency verification; and a Coordinator that manages the execution of these three components and facilitates communication via a shared memory workspace.
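The paper describes this coordination only at a high level; a minimal blackboard-style sketch of the idea, in which the coordinator fires a module once all of its required inputs appear in the shared workspace, might look like the following (all class names, function names, and the toy pipeline are my own assumptions, not the authors' code):

```python
# Hypothetical sketch of MIRAGE's coordinator + shared memory workspace.
# Each module declares the workspace keys it needs and the key it produces;
# the coordinator activates a module as soon as its inputs are available.
from typing import Callable, Dict, List, Tuple

class Workspace:
    """Shared memory that all components read from and write to."""
    def __init__(self) -> None:
        self.data: Dict[str, object] = {}

class Coordinator:
    """Runs registered modules whenever their required inputs are present."""
    def __init__(self) -> None:
        # Each entry: (required input keys, output key, callable)
        self.modules: List[Tuple[List[str], str, Callable[..., object]]] = []

    def register(self, requires: List[str], produces: str,
                 fn: Callable[..., object]) -> None:
        self.modules.append((requires, produces, fn))

    def run(self, ws: Workspace) -> Workspace:
        done: set = set()
        progress = True
        while progress:  # keep sweeping until no module can fire
            progress = False
            for i, (req, out, fn) in enumerate(self.modules):
                if i in done or not all(k in ws.data for k in req):
                    continue
                ws.data[out] = fn(*(ws.data[k] for k in req))
                done.add(i)
                progress = True
        return ws

# Toy pipeline mirroring the paper's flow: decompose -> retrieve -> synthesize.
coord = Coordinator()
coord.register(["query"], "sub_problems",
               lambda q: [f"{q} :: part {i}" for i in range(2)])
coord.register(["sub_problems"], "evidence",
               lambda subs: {s: f"evidence for {s}" for s in subs})
coord.register(["sub_problems", "evidence"], "answer",
               lambda subs, ev: " | ".join(ev[s] for s in subs))

ws = Workspace()
ws.data["query"] = "Q"
coord.run(ws)
print(ws.data["answer"])
```

The event-driven registration is what lets the sub-problem branches proceed independently: each retrieval call only waits on its own inputs, not on a global sequential order.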
Key Designs
- Question Decomposer:
- Function: Decomposes complex medical queries into focused, entity-anchored sub-problems.
- Mechanism: Follows two domain-specific principles—(1) decomposition is triggered only when the query involves multiple distinct medical entities; (2) ambiguous references are replaced with explicit entities extracted from the original query, ensuring each sub-problem is self-contained and directly mappable to entities \(\mathcal{E}\) in the knowledge graph. At most \(N_q\) sub-problems are generated per query to prevent over-fragmentation.
- Design Motivation: Addresses the weakness of prior decomposers that generate entity-agnostic, free-form sub-problems, thereby enabling more precise downstream retrieval.
- Graph-Augmented Evidence Retriever:
- Function: Retrieves relevant evidence from a structured medical knowledge graph \(\mathcal{G} = (\mathcal{E}, \mathcal{R})\) within an iterative reasoning-retrieval loop.
- Mechanism: During sub-problem decoding, the model may emit special search blocks \(\vartheta\), where contained entities are soft-matched to knowledge graph entities via embedding similarity. Retrieval operates in two modes:
- Anchor Mode: For single-entity queries, retrieves the fixed neighborhood \(\mathcal{N}(e)\) of the entity, returning at most \(k\) neighbors per relation type.
- Bridge Mode: For two-entity queries, searches for typed relation chains \(\mathcal{P}_h(e_1, e_2)\) of length at most \(h\) between the two entities, supporting cross-entity reasoning (e.g., linking symptoms to complications).
- Retrieved results are verbalized in natural language (e.g., "Diabetes has symptom Fatigue") and inserted back into the model's context; the model may issue additional queries within its budget.
- Design Motivation: Maintains contextual focus (injecting only relevant facts), supports iterative refinement (updating as new evidence is retrieved), and ensures all claims are traceable to specific graph paths.
- Answer Synthesizer:
- Function: Integrates answers from all sub-problems, detects contradictions, and resolves conflicts prior to generating the final response.
- Mechanism: (1) Medical terminology is normalized to canonical synonyms and dosage units are standardized; (2) pairwise comparisons across all answers identify mutually exclusive diagnoses or conflicting treatment recommendations; (3) when conflicts arise, the answer whose supporting chain covers a broader relevant relational neighborhood or better matches the original query is retained (majority-vote verification strategy); (4) the final response is constrained to 1–2 paragraphs of patient-facing text.
- Design Motivation: Reduces hallucinations and enhances clinical accuracy by detecting contradictions before generation and suppressing unsupported claims.
- Coordinator:
- Function: Manages execution of the other three components and mediates their communication through a shared memory workspace.
- Mechanism: Monitors the workspace and automatically activates a downstream module once its required inputs become available.
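Under the Anchor/Bridge semantics described above, the two retrieval modes can be sketched over a toy triple store. The graph, entity names, and function signatures below are illustrative assumptions, not the paper's implementation:

```python
# Toy sketch of Anchor mode (capped neighborhood lookup) and Bridge mode
# (bounded-length relation chains between two entities) over a tiny
# hand-written medical knowledge graph. All data here is illustrative.
from collections import defaultdict
from typing import Dict, List, Tuple

# Knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("Diabetes", "has_symptom", "Fatigue"),
    ("Diabetes", "has_symptom", "Polyuria"),
    ("Diabetes", "has_complication", "Neuropathy"),
    ("Neuropathy", "has_symptom", "Numbness"),
]

ADJ: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
for h_, r_, t_ in TRIPLES:
    ADJ[h_].append((r_, t_))

def anchor_mode(entity: str, k: int = 2) -> List[str]:
    """Verbalize at most k neighbors per relation type for one entity."""
    per_rel: Dict[str, int] = defaultdict(int)
    facts = []
    for rel, tail in ADJ[entity]:
        if per_rel[rel] < k:
            per_rel[rel] += 1
            facts.append(f"{entity} {rel.replace('_', ' ')} {tail}")
    return facts

def bridge_mode(e1: str, e2: str, h: int = 2) -> List[List[str]]:
    """Find typed relation chains of length <= h from e1 to e2 (DFS)."""
    paths: List[List[str]] = []
    def dfs(node: str, path: List[str], depth: int) -> None:
        if node == e2 and path:
            paths.append(list(path))
            return
        if depth == h:  # hop budget exhausted
            return
        for rel, tail in ADJ[node]:
            dfs(tail, path + [f"{node} -{rel}-> {tail}"], depth + 1)
    dfs(e1, [], 0)
    return paths

print(anchor_mode("Diabetes", k=1))
print(bridge_mode("Diabetes", "Numbness", h=2))
```

Verbalized outputs of either mode (e.g. "Diabetes has symptom Fatigue") would then be inserted back into the model's context, keeping every injected claim traceable to an explicit graph path.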
Loss & Training
MIRAGE is a purely test-time framework and involves no model training or fine-tuning. It uses the open-source Qwen-QWQ-32B model as the backbone LLM for all core components, with a maximum input length of 32,768 tokens. The central contribution of the method lies in more efficiently allocating computational resources at inference time—shifting from linear chain extension to parallel multi-chain reasoning.
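Since no parameters are updated, answer quality hinges entirely on these test-time procedures. The Answer Synthesizer's normalize-then-resolve step, for example, can be sketched as follows (the synonym table, coverage scores, and function names are hypothetical stand-ins for the paper's high-level description):

```python
# Illustrative sketch of the Answer Synthesizer's conflict resolution:
# normalize terminology, detect conflicting sub-answers pairwise, and keep
# the answer whose supporting chain covers a broader graph neighborhood.
from typing import List

SYNONYMS = {"tiredness": "fatigue"}  # toy normalization table

def normalize(answer: str) -> str:
    """Map terms to canonical synonyms (dosage standardization omitted)."""
    return " ".join(SYNONYMS.get(w.lower(), w.lower())
                    for w in answer.split())

def resolve(answers: List[str], coverage: List[int]) -> str:
    """Pairwise-compare normalized answers; on conflict, keep the one
    with the larger relational-neighborhood coverage score."""
    normed = [normalize(a) for a in answers]
    best = 0
    for i in range(1, len(normed)):
        if normed[i] != normed[best]:          # conflicting claims
            if coverage[i] > coverage[best]:   # broader graph support wins
                best = i
    return normed[best]

print(resolve(["Tiredness", "fatigue"], [1, 3]))   # agree after normalization
print(resolve(["insulin", "metformin"], [2, 5]))   # conflict: coverage decides
```

The key point is that apparent conflicts are checked only after normalization, so synonym-level disagreements ("tiredness" vs. "fatigue") never trigger the resolution step.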
Key Experimental Results
Main Results
| Method | GenMedGPT-5k F1 | GenMedGPT-5k Rank↓ | CMCQA F1 | CMCQA Rank↓ | ExplainCPE Acc |
|---|---|---|---|---|---|
| GPT-4o | 0.825 | 7.4 | 0.849 | 7.2 | 77.8% |
| GPT-4o+ToT | 0.841 | 5.9 | 0.850 | 6.7 | 80.2% |
| QWQ-32B | 0.836 | 4.4 | 0.849 | 4.6 | 82.8% |
| MindMap | 0.841 | 3.8 | 0.847 | 3.1 | 84.6% |
| Search-o1 | 0.849 | 3.3 | 0.852 | 3.0 | 80.7% |
| MIRAGE | 0.852 | 1.8 | 0.853 | 2.8 | 84.8% |
Ablation Study (GPT-4o pairwise comparison: full MIRAGE vs. ablated variant, Win/Tie/Lose %)
| Configuration | Win | Tie | Lose | Notes |
|---|---|---|---|---|
| w/o Question Decomposer | 40.72 | 44.97 | 14.31 | Removal has greatest impact on disease identification |
| w/o Answer Synthesizer | 44.03 | 43.23 | 12.73 | Removal has greatest impact on treatment recommendations |
| w/o Both | 48.27 | 38.68 | 13.05 | Removing both yields the largest overall degradation |
Key Findings
- MIRAGE consistently achieves the best GPT-4o ranking and answer accuracy across all three datasets: Rank 1.8 on GenMedGPT-5k (substantially better than Search-o1's 3.3) and 84.8% accuracy on ExplainCPE.
- QWQ-32B alone outperforms GPT-4o+ToT, suggesting that reasoning capabilities built into a model during training generalize better than prompt-engineered search strategies such as ToT.
- Static retrieval methods (BM25, embedding retrieval) yield inconsistent performance, whereas structured retrieval via knowledge graphs (MindMap) provides more consistent improvements.
- Search-o1 exhibits degraded performance on ExplainCPE, likely due to noise in web content; MIRAGE maintains robustness by relying on structured knowledge.
- There exists an optimal value for the sub-problem threshold \(N_q\) (approximately 4); excessively large values introduce noise through over-fragmentation. Increasing the retrieval threshold \(N_r\) yields diminishing but positive returns.
- MIRAGE also achieves the best performance on the DeepSeek-R1-32B backbone (Rank 2.9, Acc 84.4%), demonstrating the generalizability of the framework.
- Human evaluation confirms that MIRAGE receives the highest overall preference rate, strongly consistent with GPT-4o rankings.
Highlights & Insights
- The paradigm shift from "linear scaling" to "parallel scaling" is both intuitive and effective—decomposing complex problems into independent sub-problems enables parallel reasoning, improves computational efficiency, and naturally supports cross-chain verification and error correction.
- The Anchor and Bridge retrieval modes elegantly correspond to two distinct knowledge demands: local attribute queries versus cross-entity relational reasoning.
- The contradiction detection and majority-vote verification strategy in the answer synthesis stage provides an important safeguard for generating reliable medical responses.
- Operating entirely at test time with no additional training required, the framework can be applied as a plug-and-play enhancement for existing LLMs.
Limitations & Future Work
- The framework relies heavily on high-quality domain knowledge graphs; performance may degrade in domains where knowledge graphs are incomplete or inaccurate.
- The knowledge graphs employed are drawn from existing resources; the handling of novel knowledge or rare diseases absent from the graph is not discussed.
- Parallel multi-chain reasoning introduces additional computational overhead at inference time; the paper does not report specific latency or cost figures.
- Validation is primarily conducted on Chinese and English medical QA; transferability to other domains (e.g., legal, financial) remains to be investigated.
- Sub-problem decomposition depends entirely on LLM prompt engineering, and decomposition quality may vary substantially across different backbone models.
Related Work & Insights
- The key distinction from Search-o1 is that Search-o1 integrates retrieval into a linear reasoning chain, whereas MIRAGE achieves a triple improvement: parallel reasoning + structured graph retrieval + cross-chain verification.
- MindMap also employs knowledge graphs but lacks parallel reasoning and cross-chain verification mechanisms.
- The framework has general applicability to other knowledge-intensive reasoning tasks—any scenario requiring multi-step reasoning with access to structured knowledge sources can benefit from this "decompose–parallel retrieval–synthesize and verify" paradigm.
- The combination of parallel reasoning chains and knowledge graphs warrants further exploration across a broader range of tasks.
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐