Disentangling Memory and Reasoning Ability in Large Language Models
Conference: ACL 2025
arXiv: 2411.13504
Code: https://github.com/MingyuJ666/Disentangling-Memory-and-Reasoning
Area: LLM Reasoning
Keywords: Disentangling memory and reasoning, special tokens, interpretable reasoning, knowledge forgetting, CoT improvement
TL;DR
The paper explicitly decomposes LLM reasoning into "memory recall" and "logical reasoning" steps by introducing two learnable special tokens, <memory> and <reason>, that mark each step as one or the other. Training data is generated with a dual-LLM framework, and the target LLM is then fine-tuned with LoRA. The method improves accuracy and interpretability on StrategyQA, CommonsenseQA, and TruthfulQA, with the fine-tuned LLaMA-3.1-8B model surpassing GPT-4o on TruthfulQA (86.6% vs. 85.4%).
Background & Motivation
Background: The reasoning pipeline of LLMs is an opaque process, where knowledge retrieval and logical reasoning steps are entangled. While methods like CoT decompose complex problems into multiple steps, it remains unclear whether each step is "recalling knowledge" or "performing logical reasoning."
Limitations of Prior Work: (a) Knowledge forgetting: relevant knowledge is lost during intermediate steps of multi-step reasoning, which breaks the final reasoning chain; (b) Hallucination: models fabricate information in steps that require knowledge recall; (c) Lack of interpretability: it is impossible to diagnose whether an error stems from insufficient knowledge or from faulty reasoning, which hinders targeted improvements.
Key Challenge: Complex tasks require a precise intertwining of memory and reasoning, but existing LLMs conflate the two, leading to low efficiency and a lack of control.
Goal: To enable LLMs to explicitly distinguish, while generating a reasoning chain, which steps are knowledge recall and which are logical reasoning.
Key Insight: Introducing two special learnable tokens as "control signals"—<memory> guides the model into knowledge retrieval mode, and <reason> guides the model into logical reasoning mode. Through training, the model learns which steps require recalling knowledge before reasoning.
Core Idea: Use special tokens to label each step in the reasoning chain as either "memory" or "reasoning," achieving explicit disentanglement of the two.
Method
Overall Architecture
A two-stage approach: (1) Data Generation—a reasoning LLM (GPT-4o) generates CoT steps labeled with <memory>/<reason>, and a knowledge LLM (GPT-4o) provides accurate factual knowledge for the steps labeled with <memory>. (2) Model Training—the target LLM is fine-tuned via LoRA using the generated labeled data, learning to autonomously utilize these two tokens during inference.
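The data-generation stage described above can be sketched as follows. This is a minimal illustration, not the authors' code: the step format, the function `build_example`, and the two stand-in callables are all hypothetical, and a real pipeline would replace the stand-ins with calls to an LLM API.

```python
from typing import Callable, Dict, List

# Hypothetical step format: {"tag": "memory" | "reason", "text": str}.
# For "memory" steps, "text" holds the factual question written by the reasoning LLM.
Step = Dict[str, str]


def build_example(
    question: str,
    reasoning_llm: Callable[[str], List[Step]],
    knowledge_llm: Callable[[str], str],
) -> str:
    """Assemble one tagged reasoning chain for a single question."""
    lines = []
    for step in reasoning_llm(question):
        if step["tag"] == "memory":
            # Fill the memory step with the knowledge LLM's answer.
            text = knowledge_llm(step["text"])
        else:
            text = step["text"]
        lines.append(f"<{step['tag']}> {text}")
    return "\n".join(lines)


# Stand-in callables so the sketch runs offline (assumptions, not real model calls).
def fake_reasoning_llm(question: str) -> List[Step]:
    return [
        {"tag": "memory", "text": "What is the boiling point of water at sea level?"},
        {"tag": "reason", "text": "90 is below the boiling point, so the water is not boiling."},
    ]


def fake_knowledge_llm(factual_question: str) -> str:
    return "Water boils at 100 degrees Celsius at sea level."


example = build_example(
    "Is water at 90 degrees Celsius boiling at sea level?",
    fake_reasoning_llm,
    fake_knowledge_llm,
)
print(example)
```

Passing the two LLMs in as callables keeps the planning role and the fact-providing role separate, which mirrors the division of labor in the framework even when both roles are served by the same underlying model.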
Key Designs
- Dual-LLM Data Generation Framework:
  - Function: Generates high-quality training data with disentangled memory and reasoning.
  - Mechanism:
    - Reasoning LLM: Generates CoT reasoning steps for each question and labels each step as either <memory> (requiring factual knowledge) or <reason> (requiring logical reasoning). It then translates the knowledge requirements in <memory> steps into factual questions.
    - Knowledge LLM: Answers these factual questions to provide accurate factual knowledge.
    - Merging: The placeholder in each <memory> step of the reasoning chain is replaced with the answer generated by the knowledge LLM.
  - Design Motivation: Delegating reasoning and knowledge to different LLMs ensures high quality for both: the reasoning LLM excels at step-by-step planning, while the knowledge LLM excels at accurate recall.
- Learnable Control Tokens:
  - Function: Automatically switches between memory and reasoning modes during inference.
  - Mechanism: <memory> and <reason> are new tokens added to the vocabulary, and their embeddings are learned during LoRA fine-tuning. After training, the model generates these tokens on its own and switches modes accordingly.
  - Design Motivation: More reliable than prompt engineering: the learned token embeddings encode the decision patterns of "when to recall" and "when to reason."
- Error Diagnosis Capability:
  - Function: Identifies sources of errors via token labels.
  - Mechanism: If the knowledge recalled in a <memory> step is incorrect \(\rightarrow\) lack of knowledge; if a step under <reason> is incorrect \(\rightarrow\) reasoning deficiency. This allows targeted improvements.
  - Key Finding: Most errors originate from reasoning steps rather than a lack of knowledge: "LLMs know better than they reason."
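The error-diagnosis idea above can be illustrated with a small sketch that splits a tagged chain into steps and maps flagged steps to a failure type. The helper names `split_steps` and `attribute_errors`, the output format, and the `wrong_indices` input (which would come from a human annotator or a judge model) are assumptions made for illustration, not part of the paper's released code.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple

# Matches a tag and the text up to the next tag or the end of the chain.
STEP_PATTERN = re.compile(r"<(memory|reason)>(.*?)(?=<(?:memory|reason)>|\Z)", re.S)


def split_steps(chain: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs in the order they appear in the chain."""
    return [(tag, text.strip()) for tag, text in STEP_PATTERN.findall(chain)]


def attribute_errors(
    steps: Sequence[Tuple[str, str]], wrong_indices: Sequence[int]
) -> Counter:
    """Count flagged steps by failure type: <memory> -> knowledge, <reason> -> reasoning."""
    failure_type = {"memory": "knowledge", "reason": "reasoning"}
    return Counter(failure_type[steps[i][0]] for i in wrong_indices)


chain = (
    "<memory> A week has 7 days. "
    "<reason> Three weeks therefore have 10 days."
)
steps = split_steps(chain)
# Step 1 is assumed to have been flagged as wrong by a hypothetical annotator.
counts = attribute_errors(steps, wrong_indices=[1])
print(steps)
print(counts)
```

Aggregating such counts over a test set is one way to arrive at a statement like "most errors come from reasoning steps"; how the paper itself judges step correctness is not specified in this summary.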
Loss & Training
- Standard autoregressive language modeling loss + LoRA fine-tuning.
- Embeddings for <memory> and <reason> are learned concurrently with other parameters.
- Training data is generated by GPT-4o, and the target models are LLaMA-3.1-8B and Qwen2.5-7B.
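For reference, the two ingredients named above have the following standard textbook forms. The notation is an assumption of this note rather than the paper's: \(x\) is the question, \(y_1, \dots, y_T\) is the tagged target chain (including the <memory> and <reason> tokens), \(W_0\) is a frozen pretrained weight matrix, and \(A\), \(B\), \(r\), \(\alpha\) are the usual LoRA factors, rank, and scaling constant.

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
\]

\[
W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
\]

Because the two control tokens appear among the targets \(y_t\), the same loss supervises both when the tokens are emitted and what follows them.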
Key Experimental Results
Main Results
| Model × Method | StrategyQA | CommonsenseQA | TruthfulQA |
|---|---|---|---|
| LLaMA-3.1-8B (Zero-shot) | 72.2% | 71.6% | 62.3% |
| LLaMA-3.1-8B + CoT | 74.5% | 73.8% | 80.1% |
| LLaMA-3.1-8B + Planning Token | 76.7% | - | - |
| LLaMA-3.1-8B + Ours | 78.0% | 74.5% | 86.6% |
| GPT-4o + CoT | 80.2% | 79.1% | 85.4% |
Ablation Study
| Finding | Explanation |
|---|---|
| Surpassing GPT-4o on TruthfulQA | 8B model achieves 86.6% vs. GPT-4o's 85.4%—improved knowledge accuracy |
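| Average gap with GPT-4o is only 1.9 points | Averaged over the three benchmarks, the fine-tuned 8B model reaches 79.7% vs. 81.6% for GPT-4o + CoT, so it comes close to the closed-source model while still trailing on StrategyQA and CommonsenseQA |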
| Error analysis: >70% of errors come from reasoning | LLMs "know better than they reason"—reasoning is the bottleneck |
| vs. Planning Token: Gain of 1.2-1.3% | Disentangling memory/reasoning is more effective than a single planning token |
| Qwen2.5-7B is also effective | The method is generalizable and not limited to a specific architecture |
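The 1.9-point average gap and the TruthfulQA margin quoted in this table can be reproduced from the Main Results rows. The snippet below is only an arithmetic check on the numbers listed in this note; the variable names are illustrative.

```python
# Accuracy (%) copied from the Main Results table, in the order
# StrategyQA, CommonsenseQA, TruthfulQA.
ours = [78.0, 74.5, 86.6]       # LLaMA-3.1-8B + Ours
gpt4o_cot = [80.2, 79.1, 85.4]  # GPT-4o + CoT


def mean(values):
    return sum(values) / len(values)


average_gap = mean(gpt4o_cot) - mean(ours)
truthfulqa_margin = ours[2] - gpt4o_cot[2]
# Only StrategyQA has a Planning Token score in the table.
planning_token_gain = 78.0 - 76.7

print(round(average_gap, 1))          # 1.9
print(round(truthfulqa_margin, 1))    # 1.2
print(round(planning_token_gain, 1))  # 1.3
```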
Key Findings
- "Most errors stem from reasoning rather than knowledge"—an insight made possible only through the disentanglement of memory and reasoning.
- Surpassing GPT-4o on TruthfulQA—enforced separation allows the model to faithfully recall knowledge from memory rather than fabricating it.
- The method is consistently effective across three different categories of benchmarks (strategic reasoning, common sense, and truthfulness).
- Enhanced interpretability—users can see exactly what the model is "recalling" versus what it is "reasoning" at each step.
Highlights & Insights
- "Explicit disentanglement of memory and reasoning" is a simple yet profound innovation—a conceptual shift that yields substantial performance gains and interpretability improvements.
- The dual-LLM data generation framework is elegant—allocating tasks such that the model skilled in planning organizes the reasoning steps, while the model skilled in factual knowledge provides facts.
- The diagnostic finding that "errors come from reasoning rather than knowledge" has significant implications—suggesting that LLM improvements should focus on reasoning capability rather than simply scaling up the volume of knowledge in the training data.
- The design of using learnable special tokens as a "mode switch" is transferable to other scenarios that require switching between reasoning modes (e.g., analysis vs. generation, precise vs. rough estimation).
- The 8B model surpassing GPT-4o on TruthfulQA demonstrates the immense potential of structured reasoning.
Limitations & Future Work
- The training data generation depends on GPT-4o—if GPT-4o's memory/reasoning labeling is inaccurate, the error will propagate to the target model.
- Validation is restricted to multiple-choice QA datasets—the effectiveness in open-ended generation scenarios remains unknown.
- The generalizability of LoRA fine-tuning requires further verification—whether the model can correctly utilize the two tokens in entirely novel tasks remains to be seen.
- The optimal pattern for multiple alternating memory-reasoning cycles has not been explored—complex tasks might require more iterations.
Related Work & Insights
- vs. CoT/ToT: CoT does not distinguish between memory and reasoning, treating all steps uniformly; this work explicitly disentangles the two.
- vs. Planning Tokens (Wang et al. 2024): Planning tokens provide structure but do not differentiate between knowledge and reasoning; the dual tokens in this work provide a more granular structure.
- vs. RAG: RAG retrieves knowledge externally; this work "retrieves" knowledge internally from the model, activating intrinsic knowledge.
- vs. DPT-Agent: DPT-Agent separates fast and slow systems; this work separates memory and reasoning, decomposing them into distinct cognitive functions.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Explicit memory-reasoning disentanglement is a simple and profound innovation; the error diagnostic capability is a unique contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Tested on three benchmarks, multiple models, and included ablation and error analysis; however, restricted to multiple-choice tasks.
- Writing Quality: ⭐⭐⭐⭐ Clear concepts, intuitive framework diagram.
- Value: ⭐⭐⭐⭐⭐ Makes a fundamental contribution to the understanding and improvement of LLM reasoning.