RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models¶
Conference: ACL 2025
arXiv: 2412.02830
Code: https://github.com/fatebreaker/RARE
Area: Information Retrieval
Keywords: Retrieval-Augmented Reasoning, MCTS, Factuality Scoring, Medical Question Answering, Commonsense Reasoning
TL;DR¶
Proposes RARE, which adds two retrieval-augmented actions to the MCTS reasoning framework of rStar (A6: generate search queries from the original question and retrieve; A7: retrieve for sub-questions and re-answer them) and replaces the original discriminator with a Retrieval-Augmented Factuality Scorer (RAFS). With RARE, LLaMA 3.1 matches or exceeds GPT-4o on medical and commonsense reasoning tasks.
Background & Motivation¶
Background: LLM reasoning enhancement methods include CoT, Self-Consistency, and MCTS search (rStar). For knowledge-intensive tasks (such as medical QA), RAG methods (such as MedRAG, i-MedRAG) are also needed to introduce external knowledge. However, reasoning enhancement and retrieval augmentation are usually designed separately.
Limitations of Prior Work:
- Pure reasoning enhancement (rStar/MCTS) lacks external knowledge, limited by pre-training knowledge on knowledge-intensive questions.
- Pure RAG methods (MedRAG) perform only a single retrieval, failing to dynamically acquire information at each step of multi-step reasoning.
- Although i-MedRAG supports iterative retrieval, it lacks a systematic reasoning framework and path selection mechanism.
Key Challenge: Knowledge-intensive tasks require a combination of structured multi-step reasoning + dynamic retrieval, whereas existing methods only achieve one of the two.
Key Insight: Introduce two new retrieval-augmented actions into the MCTS action space of rStar, and replace rStar's answer discriminator with a retrieval-augmented factuality scorer, achieving deep integration of reasoning and retrieval.
Core Idea: Integrate retrieval augmentation into the MCTS action space and scoring mechanism, allowing each branch of the reasoning tree to dynamically acquire external knowledge and verify factuality.
Method¶
Overall Architecture¶
Phase 1 - Retrieval-Augmented Generator: Explores reasoning paths based on MCTS, adding A6 (question-level retrieval) and A7 (sub-question-level retrieval) to the original 5 actions (A1-A5) of rStar, to generate multiple candidate reasoning paths.
→ Phase 2 - Retrieval-Augmented Factuality Scorer (RAFS): Performs factuality verification on each reasoning path and selects the path with the highest score as the final answer.
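The two phases can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the MCTS search itself (selection, expansion, rollouts) is omitted and the candidate paths are given directly. The names `Action`, `ReasoningPath`, `uses_retrieval`, and `select_final_answer` are hypothetical, and the scores are made up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Action(Enum):
    # A1-A5 are rStar's original actions; their details are not covered here.
    A1 = 1
    A2 = 2
    A3 = 3  # generates the sub-questions that A7 later re-answers
    A4 = 4
    A5 = 5
    A6 = 6  # added by RARE: question-level query generation + retrieval
    A7 = 7  # added by RARE: sub-question-level retrieval + re-answering


RETRIEVAL_ACTIONS = frozenset({Action.A6, Action.A7})


@dataclass
class ReasoningPath:
    actions: List[Action]  # actions taken from the root to a leaf
    answer: str


def uses_retrieval(path: ReasoningPath) -> bool:
    """True if the path contains at least one retrieval-augmented action."""
    return any(action in RETRIEVAL_ACTIONS for action in path.actions)


def select_final_answer(
    candidates: List[ReasoningPath],
    factuality_scorer: Callable[[ReasoningPath], float],
) -> str:
    """Phase 2: keep the candidate path with the highest factuality score."""
    best = max(candidates, key=factuality_scorer)
    return best.answer


# Phase 1 output is assumed to be available as a list of candidate paths.
candidates = [
    ReasoningPath([Action.A1, Action.A2], "B"),
    ReasoningPath([Action.A3, Action.A7], "C"),
]
toy_scores = {"B": 0.5, "C": 0.75}  # made-up scores, for illustration only
final = select_final_answer(candidates, lambda path: toy_scores[path.answer])
```

Keeping the scorer as a plain callable mirrors the separation between the generator and the scorer: any function that maps a path to a number can be plugged in for Phase 2.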
Key Designs¶
- A6: Search Query Generation + Information Retrieval:
  - The LLM generates multiple search queries based on the original question.
  - ColBERT is used to retrieve relevant documents from the corpora (PubMed, StatPearls, medical textbooks, Wikipedia).
  - The retrieved information is integrated with the original question to generate the final answer.
  - Applicable to single-step knowledge-based questions requiring external knowledge supplementation.
- A7: Sub-question Retrieval + Re-answering:
  - Independent retrieval is performed for the sub-questions generated in A3.
  - Each sub-question is re-answered based on the retrieved contextual information.
  - The final sub-question is a reformulation of the original question, whose answer serves as the answer to the original question.
  - Applicable to complex questions requiring iterative retrieval.
- Retrieval-Augmented Factuality Scorer (RAFS):
  - Step 1: Deconstruct the reasoning path into independent claims.
  - Step 2: The LLM generates retrieval queries for each claim.
  - Step 3: Retrieve relevant documents.
  - Step 4: Compare each claim with the retrieved evidence, labeling it as Supported/Not Supported.
  - Factuality Score: The proportion of Supported claims -> select the path with the highest score.
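The RAFS scoring rule can be sketched as below. This is a minimal sketch under stated assumptions: claim extraction (Step 1) is taken as already done, and Steps 2-4 (query generation, retrieval, and the Supported/Not Supported judgment) are collapsed into a hypothetical `is_supported` callable, here backed by a toy lookup set. Returning 0.0 for a path with no claims is also an assumption, not something the summary specifies.

```python
from typing import Callable, Dict, List


def factuality_score(claims: List[str], is_supported: Callable[[str], bool]) -> float:
    """Proportion of claims labeled Supported."""
    if not claims:
        return 0.0  # assumption: a path with no claims gets the lowest score
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)


def select_best_path(
    paths: Dict[str, List[str]], is_supported: Callable[[str], bool]
) -> str:
    """Return the name of the path with the highest factuality score."""
    return max(paths, key=lambda name: factuality_score(paths[name], is_supported))


# Toy stand-in for Steps 2-4: a claim counts as Supported if it is in this set.
supported_claims = {"claim-1", "claim-2", "claim-4"}
paths = {
    "path_a": ["claim-1", "claim-2", "claim-3"],  # 2 of 3 supported
    "path_b": ["claim-4", "claim-5"],             # 1 of 2 supported
}
best = select_best_path(paths, supported_claims.__contains__)  # "path_a"
```

In a real system `is_supported` would wrap an LLM call over retrieved evidence, so the judgment bias noted under Limitations would enter through that callable.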
Key Experimental Results¶
Main Results (Medical Reasoning Accuracy)¶
| Model | Method | MedQA | MedMCQA | MMLU-Med | Avg |
|---|---|---|---|---|---|
| LLaMA3.1 8B | CoT | 61.51 | 55.15 | 71.63 | 62.76 |
| LLaMA3.1 8B | rStar | 70.40 | 62.13 | 79.16 | 70.56 |
| LLaMA3.1 8B | RARE | 75.57 | 64.32 | 81.63 | 73.84 |
| LLaMA3.1 70B | RARE | 87.43 | 75.18 | 90.91 | 84.51 |
| GPT-4o | CoT | 85.55 | 74.70 | 90.45 | 83.57 |
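By the numbers in the table above, RARE + LLaMA3.1-70B exceeds GPT-4o with CoT on all three benchmarks: MedQA (87.43 vs 85.55), MMLU-Med (90.91 vs 90.45), and, by a narrow margin, MedMCQA (75.18 vs 74.70).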
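The Avg column and the margin over GPT-4o can be recomputed from the table. The snippet below is only an arithmetic check on the values copied from the rows above; the variable names are illustrative.

```python
# Accuracy values copied from the table above: (MedQA, MedMCQA, MMLU-Med).
rows = {
    ("LLaMA3.1 8B", "CoT"): (61.51, 55.15, 71.63),
    ("LLaMA3.1 8B", "rStar"): (70.40, 62.13, 79.16),
    ("LLaMA3.1 8B", "RARE"): (75.57, 64.32, 81.63),
    ("LLaMA3.1 70B", "RARE"): (87.43, 75.18, 90.91),
    ("GPT-4o", "CoT"): (85.55, 74.70, 90.45),
}

# Unweighted mean over the three benchmarks, rounded like the Avg column.
averages = {key: round(sum(scores) / len(scores), 2) for key, scores in rows.items()}

# Per-benchmark margin of RARE + LLaMA3.1 70B over GPT-4o with CoT.
margins = tuple(
    round(rare - gpt, 2)
    for rare, gpt in zip(rows[("LLaMA3.1 70B", "RARE")], rows[("GPT-4o", "CoT")])
)
# averages[("LLaMA3.1 70B", "RARE")] -> 84.51, averages[("GPT-4o", "CoT")] -> 83.57
# margins -> (1.88, 0.48, 0.46)
```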
Commonsense Reasoning Comparison¶
| Model | Method | StrategyQA | CommonsenseQA | PIQA | Avg |
|---|---|---|---|---|---|
| LLaMA3.1 8B | rStar | 71.57 | 76.58 | 79.65 | 75.93 |
| LLaMA3.1 8B | RARE | 78.02 | 80.84 | 82.52 | 80.46 |
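The per-benchmark gain of RARE over rStar follows directly from the two rows above. This is a small arithmetic sketch on the table values only; it does not reproduce the CoT comparison, whose numbers are not listed in this table.

```python
benchmarks = ("StrategyQA", "CommonsenseQA", "PIQA")
rstar = (71.57, 76.58, 79.65)  # LLaMA3.1 8B + rStar, from the table above
rare = (78.02, 80.84, 82.52)   # LLaMA3.1 8B + RARE, from the table above

# Gain in accuracy points for each benchmark.
gains = {name: round(r - s, 2) for name, s, r in zip(benchmarks, rstar, rare)}
largest = max(gains, key=gains.get)
avg_gain = round(sum(rare) / len(rare) - sum(rstar) / len(rstar), 2)
# gains -> {"StrategyQA": 6.45, "CommonsenseQA": 4.26, "PIQA": 2.87}
# largest -> "StrategyQA", avg_gain -> 4.53
```

Against rStar, the largest gain is again on StrategyQA, the multi-hop benchmark of the three.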
Key Findings¶
- RARE yields its largest gains on multi-step reasoning tasks: on StrategyQA it improves over CoT by 10.19%, compared with 7.22% on CommonsenseQA, which suggests that retrieval augmentation is most valuable for multi-hop reasoning.
- RAFS is more effective than the original rStar discriminator: factuality scoring is more reliable than pure self-consistency voting.
- A6 and A7 are complementary: A6 is suitable for single-step knowledge queries, while A7 is suitable for complex compound reasoning.
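The difference between A6 and A7 can be made concrete with a toy sketch. Everything here is an assumption for illustration: `retrieve` ranks documents by word overlap as a stand-in for ColBERT, the LLM calls are passed in as hypothetical callables (`make_queries`, `answer_with_context`), and A7 is simplified so that each sub-question is answered from its own retrieved documents only.

```python
from typing import Callable, List, Sequence


def retrieve(query: str, corpus: Sequence[str], k: int = 1) -> List[str]:
    """Toy retriever: rank documents by shared lowercase words (stand-in for ColBERT)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return list(ranked[:k])


def action_a6(question, make_queries, answer_with_context, corpus):
    """A6: several queries from the original question, one answer over pooled documents."""
    docs = [doc for query in make_queries(question) for doc in retrieve(query, corpus)]
    pooled = list(dict.fromkeys(docs))  # de-duplicate while keeping order
    return answer_with_context(question, pooled)


def action_a7(sub_questions, answer_with_context, corpus):
    """A7: retrieve and re-answer each sub-question; the last one restates the original."""
    answers = [answer_with_context(sq, retrieve(sq, corpus)) for sq in sub_questions]
    return answers[-1]


# Toy corpus and stub "LLM" callables, for illustration only.
corpus = ["alpha beta", "gamma delta", "epsilon zeta"]
echo = lambda question, docs: (question, tuple(docs))
a6_result = action_a6("alpha question", lambda q: ["alpha", "gamma"], echo, corpus)
a7_result = action_a7(["sub alpha", "final epsilon"], echo, corpus)
```

A6 spends one answer call on a pooled context, which suits a single knowledge lookup, while A7 spends one retrieval and one answer call per sub-question, which suits compound questions.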
Highlights & Insights¶
- Deep integration of retrieval and reasoning: Instead of simple "retrieve first, then reason," retrieval is treated as an atomic action in MCTS, allowing the search tree to naturally explore "when to retrieve and what to retrieve."
- The factuality verification in RAFS can be used on its own: decomposing a reasoning path into claims, verifying each claim, and scoring the proportion that is supported applies to any scenario requiring factuality evaluation.
- That open-source LLaMA 3.1 with RARE outperforms GPT-4o shows how much reasoning framework design matters.
Limitations & Future Work¶
- The computational overhead of MCTS is still relatively large (multiple rollouts + retrieval).
- The retrieval corpus needs to be pre-constructed, requiring adaptation for new domains.
- Using LLMs to judge Supported/Not Supported in RAFS may introduce judgment bias.
- Only evaluated on multiple-choice questions; effectiveness on open-ended generation tasks remains unknown.
Related Work & Insights¶
- vs rStar: rStar provides a reasoning framework but lacks external knowledge; with LLaMA3.1 8B, RARE outperforms rStar by 3.28 points on average across the three medical benchmarks (73.84 vs 70.56).
- vs i-MedRAG: i-MedRAG features iterative retrieval but lacks a systematic reasoning framework; RARE outperforms i-MedRAG by 2.63% on average.
- vs MedRAG: Single retrieval in MedRAG is insufficient for complex questions; RARE leads by a large margin.
Rating¶
- Novelty: ⭐⭐⭐⭐ Integrates RAG into the MCTS action space; the idea behind RAFS factuality scoring is clear.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 3 medical + 3 commonsense benchmarks, 3 scales of LLaMA, compared with GPT-4o, and detailed ablation.
- Writing Quality: ⭐⭐⭐⭐ Clear flowcharts, detailed case analyses, and strong motivation.
- Value: ⭐⭐⭐⭐⭐ A practical way for open-source models to outperform GPT-4o, with direct application value for medical AI.