PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning¶
Conference: AAAI 2026 | arXiv: 2509.22315 | Code: Available | Area: Information Retrieval | Keywords: Dual-system reasoning, fast-and-slow thinking, retrieval augmentation, multi-agent, planning
TL;DR¶
Inspired by dual-process cognitive theory, PRIME is a multi-agent reasoning framework in which a Quick Thinking Agent (System 1) rapidly generates intuitive answers, a Reflection Agent evaluates their confidence, and—when uncertainty is detected—six specialized System 2 agents (Planning / Hypothesis / Search / Reading / Integration / Decision) are triggered for deep retrieval-based reasoning. The framework enables open-source LLaMA 3 to approach GPT-4o performance on medical and multi-hop QA benchmarks.
Background & Motivation¶
Background: LLM reasoning enhancement methods include CoT, RAG, and agent frameworks, among others. However, most approaches uniformly apply slow reasoning to all questions, wasting computational resources.
Limitations of Prior Work:
- Simple questions do not require deep reasoning—invoking System 2 for "What is the capital of France?" is wasteful.
- Existing RAG methods lack explicit planning—what to retrieve and when to retrieve it are left unaddressed.
- Single-agent reasoning lacks specialization—the same model must simultaneously handle search, reasoning, and verification.
Key Challenge: Deep reasoning is effective but expensive—intelligent selection of when to activate it is required.
Goal: Design a multi-agent framework that adaptively triggers deep reasoning.
Key Insight: Kahneman's dual-process theory—System 1 for fast intuition, System 2 for slow analysis—with a Reflection Agent deciding when to switch.
Core Idea: System 1 fast answering + Reflection self-evaluation + System 2 six-agent deep reasoning = efficient and accurate inference.
Method¶
Overall Architecture¶
Input question → Quick Thinking Agent (decompose sub-questions, answer sequentially) → Reflection Agent (self-assess confidence) → output if confident → otherwise trigger System 2: Planning Agent (formulate reasoning plan) → Hypothesis Agent (generate hypotheses) → Search Agent (retrieve evidence) → Reading Agent (careful extraction) → Integration Agent (merge evidence) → Decision Agent (final judgment).
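The control flow above can be sketched as a simple confidence gate. This is a minimal illustration, not the paper's implementation: the functions `quick_think`, `reflect`, and `system2` are hypothetical stand-ins for PRIME's prompt-based agent calls.

```python
# Minimal sketch of PRIME's adaptive dual-system control flow.
# All agent functions are hypothetical stand-ins for prompted LLM calls.

def quick_think(question: str) -> str:
    """System 1: decompose into sub-questions and answer without retrieval."""
    return f"intuitive answer to: {question}"

def reflect(question: str, answer: str) -> bool:
    """Reflection Agent: return True if the answer looks confident
    (no logical gaps, uncertainty markers, or unverified assumptions)."""
    uncertain_markers = ("not sure", "possibly", "unknown")
    return not any(m in answer.lower() for m in uncertain_markers)

def system2(question: str) -> str:
    """System 2: run the six specialized agents in sequence."""
    plan = f"plan for {question}"                              # Planning Agent
    hypotheses = [f"hypothesis about {question}"]               # Hypothesis Agent
    evidence = [f"retrieved doc for {h}" for h in hypotheses]   # Search Agent
    readings = [f"key facts from {e}" for e in evidence]        # Reading Agent
    merged = " | ".join(readings)                               # Integration Agent
    return f"final answer from {merged}"                        # Decision Agent

def prime(question: str) -> str:
    answer = quick_think(question)
    if reflect(question, answer):
        return answer           # fast path: System 1 suffices
    return system2(question)    # slow path: deep retrieval reasoning
```

The essential design choice is that escalation is decided per question at inference time, so the expensive six-agent pipeline only runs when the Reflection Agent flags uncertainty.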
Key Designs¶
- Quick Thinking Agent (System 1):
- Function: Rapidly generate intuitive answers.
- Mechanism: Decompose the question into sub-questions and answer them sequentially without external retrieval.
- Design Motivation: The majority of questions can be handled by System 1, avoiding unnecessary expensive reasoning.
- Reflection Agent (switching gate):
- Function: Assess the confidence of System 1 outputs.
- Mechanism: Explicit self-reflection—check whether the answer contains logical gaps, uncertainty, or reliance on unverified assumptions.
- Design Motivation: The key innovation—determines when to switch from fast to slow thinking.
- System 2 six-agent reasoning pipeline:
- Function: Deep knowledge retrieval and multi-step reasoning.
- Division of labor: Planning (formulate reasoning path) → Hypothesis (generate hypotheses) → Search (external retrieval) → Reading (careful evidence reading) → Integration (multi-source synthesis) → Decision (final judgment).
- Design Motivation: Each agent focuses on a single cognitive sub-task—planning ≠ search ≠ reasoning.
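The Reflection Agent's explicit self-check can be phrased as a verification prompt. The template below is a hypothetical illustration of the mechanism described above; PRIME's actual prompt wording is not reproduced in these notes.

```python
# Hypothetical prompt template for the Reflection Agent's confidence check.
# The wording is illustrative; the paper's exact prompts may differ.
REFLECTION_PROMPT = """\
Question: {question}
Proposed answer: {answer}

Check the proposed answer for:
1. Logical gaps in the reasoning chain.
2. Expressions of uncertainty or hedging.
3. Reliance on unverified factual assumptions.

If any issue is found, reply exactly "ESCALATE"; otherwise reply "CONFIDENT".
"""

def needs_system2(model_reply: str) -> bool:
    """Parse the Reflection Agent's verdict: escalate on anything
    other than an explicit CONFIDENT."""
    return model_reply.strip().upper() != "CONFIDENT"

# Fill the template for a concrete question.
prompt = REFLECTION_PROMPT.format(
    question="What is the capital of France?",
    answer="Paris",
)
```

Defaulting to escalation on any non-"CONFIDENT" reply is the conservative choice: a false trigger costs compute, while a missed trigger risks a wrong answer.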
Loss & Training¶
- No training required—purely prompt-based agent coordination.
- Backbone model: LLaMA 3 (8B / 70B).
Key Experimental Results¶
Medical Reasoning Tasks¶
| Model / Method | MedQA | MedMCQA | MMLU-Medical | Avg. |
|---|---|---|---|---|
| LLaMA3.1 8B + CoT | 61.51 | 55.15 | 71.63 | 62.76 |
| LLaMA3.1 8B + MedRAG | 63.00 | 56.87 | 74.56 | 64.81 |
| LLaMA3.1 8B + Search-O1 | 73.13 | 62.13 | 79.16 | 71.47 |
| LLaMA3.1 8B + PRIME | 76.91 | 67.49 | 83.56 | 75.99 |
| LLaMA3.3 70B + Search-O1 | 83.17 | 73.11 | 87.23 | 81.17 |
| LLaMA3.3 70B + PRIME | 87.51 | 78.94 | 92.74 | 86.39 |
| GPT-4 | 83.97 | 69.88 | 89.44 | 81.10 |
| GPT-4o | 85.55 | 74.71 | 90.45 | 83.57 |
- PRIME enables LLaMA3.3 70B to achieve an average score of 86.39%, surpassing both GPT-4 (81.10%) and GPT-4o (83.57%).
Multi-Hop Reasoning Tasks¶
| Method | Musique F1 | 2Wiki F1 | HotpotQA F1 |
|---|---|---|---|
| Naive RAG | 30.52 | 38.22 | 40.06 |
| Search-O1 | 41.94 | 74.24 | 54.81 |
| PRIME | 48.81 | 79.81 | 60.68 |
Ablation Study¶
| Configuration | Effect |
|---|---|
| System 1 only | Strong on simple questions, poor on hard ones—over 80% answered correctly, but severe hallucination on the remaining questions |
| System 2 only (full deep reasoning) | Accurate but slow—computational resources wasted |
| PRIME (adaptive switching) | Accurate and efficient—approximately 60% of questions handled by System 1 |
Key Findings¶
- LLaMA3.3 70B + PRIME surpasses GPT-4o (86.39 vs. 83.57)—the multi-agent framework bridges the open- vs. closed-source gap.
- LLaMA3.1 8B + PRIME (75.99) outperforms GPT-4o-mini (74.59)—even an 8B model benefits substantially.
- Approximately 60% of questions are resolved by System 1, saving significant computational resources.
- The quality of the Reflection Agent is critical—false triggers waste System 2 resources; missed triggers lead to incorrect answers.
- On multi-hop reasoning, PRIME outperforms Search-O1 by an average of 6–7 F1 points—reflecting the combined advantage of knowledge retrieval and hypothesis testing.
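The efficiency claim can be made concrete with a back-of-the-envelope cost model. The 60% fast-path fraction comes from the paper; the relative per-question costs `c1` and `c2` are illustrative assumptions, not reported figures.

```python
# Back-of-the-envelope cost model for adaptive switching.
# c1: cost of one System 1 pass; c2: cost of the full six-agent
# System 2 pipeline with retrieval. Both values are assumptions.
c1, c2 = 1.0, 10.0
p_fast = 0.60  # fraction resolved by System 1 (reported by the paper)

always_slow = c2                    # run full deep reasoning on every question
adaptive = c1 + (1 - p_fast) * c2   # PRIME: System 2 only on escalation

savings = 1 - adaptive / always_slow
print(f"adaptive cost: {adaptive:.1f}, savings vs. always-slow: {savings:.0%}")
```

Under these assumed costs, adaptive switching halves the expected cost per question; the actual savings depend on the true System 1 / System 2 cost ratio and the Reflection Agent's trigger accuracy.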
Highlights & Insights¶
- A faithful implementation of dual-process theory in LLMs—not a metaphor but an actual architectural design.
- The specialization of six agents yields more reliable deep reasoning than a single-agent approach.
- Direct practical guidance for reasoning efficiency: not every question warrants deliberate, slow thinking.
Limitations & Future Work¶
- The Reflection Agent may misjudge—false negatives (hard questions failing to trigger System 2) and false positives (easy questions incorrectly triggering it) both degrade efficiency.
- The communication overhead of six agents is not cost-effective for simple questions—further cost analysis of System 2 is needed.
- Validation is limited to QA tasks; performance on creative writing, code generation, and other scenarios remains unknown.
- The Search Agent depends on external retrieval quality—retrieval failure can cause the entire System 2 pipeline to fail.
- No persistent memory mechanism exists—similar questions must be reasoned through from scratch each time.
Related Work & Insights¶
- vs. ReAct: Single-agent loop. PRIME uses multi-agent coordination with adaptive triggering.
- vs. Self-Consistency: Multiple sampling with majority voting. PRIME performs one round of deep reasoning.
- The dual-system framework is generalizable to any scenario requiring an efficiency–accuracy trade-off.
Rating¶
- Novelty: ⭐⭐⭐⭐ Faithful implementation of dual-system + multi-agent design
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-domain QA + ablation study
- Writing Quality: ⭐⭐⭐⭐ Cognitive-science motivation clearly articulated
- Value: ⭐⭐⭐⭐ Practical value for efficient LLM reasoning