Unveiling and Addressing Pseudo Forgetting in Large Language Models¶
Conference: ACL2025
arXiv: 2411.11932
Code: GitHub
Area: LLM Security
Keywords: continual learning, catastrophic forgetting, pseudo forgetting, instruction dependence, replay-based learning
TL;DR¶
This work identifies the "pseudo forgetting" phenomenon in LLM continual learning: performance on old tasks degrades not because the capabilities are lost, but because the original instructions no longer activate them. Attribution analysis shows that the forgotten model depends less on the instruction, and the authors propose RGD-R, a dynamic data replay framework based on Rationale-Guidance Difficulty (RGD), to alleviate pseudo forgetting.
Background & Motivation¶
Background: Continual learning enables LLMs to incrementally learn multiple tasks, but catastrophic forgetting remains a core bottleneck—performance on old tasks degrades significantly after learning new ones. Various mitigation methods (regularization, architectural expansion, data replay) have been proposed, but the understanding of the forgetting mechanism itself remains insufficient.
Limitations of Prior Work: Kotha et al. proposed the "task inference" hypothesis (fine-tuning biases toward new capabilities rather than losing old ones), but only validated it on synthetic data and small models; Jiang et al. argued that forgetting stems from a decline in instruction-following ability rather than knowledge loss, but used inconsistent experimental setups (instruction-following for training, prefix completion for probing), which weakens the persuasiveness of their conclusions.
Key Challenge: Existing works lack direct and robust empirical evidence to prove that LLM forgetting on natural language tasks is "pseudo" (i.e., the model still retains old task capabilities but fails to activate them correctly).
Goal: (1) Directly prove the existence of pseudo forgetting in LLM continual learning; (2) Analyze the intrinsic causes of pseudo forgetting; (3) Propose quantitative metrics and mitigation schemes.
Key Insight: Two interventions (supplying a partial correct rationale, and searching for a nonsensical suffix) restore the forgotten model's performance on old tasks, which shows that the capabilities still exist; attribution analysis then reveals that the forgotten model's dependence on instructions has decreased.
Core Idea: The performance degradation in catastrophic forgetting is largely "pseudo forgetting"—model capabilities are not lost, but rather the original instructions fail to activate them effectively. Reinforcing instruction dependence can recover performance.
Method¶
Overall Architecture¶
The paper is divided into two major parts: (1) Unveiling pseudo forgetting (Section 2) — demonstrating that the forgotten model still retains capabilities through two probing experiments and revealing the reasons using attribution analysis; (2) Addressing pseudo forgetting (Section 3) — proposing the RGD metric to quantify the degree of pseudo forgetting, and designing the RGD-R dynamic replay framework based on it.
Key Design 1: Passive Recovery Experiment (External Rationale Guidance)¶
- Function: Provides the forgotten model with the first fraction \(k\) (\(k \in [0,1]\)) of the ground-truth rationale and observes whether task performance recovers.
- Design Motivation: If the model had truly lost its old-task capabilities, a short rationale prefix should not restore performance. Conversely, if a small amount of guidance does restore performance, the capabilities must still reside in the model parameters.
- Mechanism: Appends the first fraction \(k\) of the ground-truth rationale after `<|assistant|>`. Experiments show that when \(k \leq 0.2\), the provided content does not contain direct task-critical information. Results: Llama2-13B recovers to its pre-forgetting level on the RTE task with just \(k=0.3\). Larger models or simpler tasks require less guidance.
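A minimal sketch of how such a probe prompt could be assembled. Only the `<|assistant|>` marker and the fraction \(k\) come from the description above; the `<|user|>` tag, the whitespace-level split of the rationale, and the function name `build_probe_prompt` are illustrative assumptions, not the paper's implementation.

```python
def build_probe_prompt(instruction: str, rationale: str, k: float) -> str:
    """Append the first fraction k of the ground-truth rationale after <|assistant|>."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    # Assumption: the prefix is cut at word level; a real setup would more
    # likely cut at token level with the model's tokenizer.
    words = rationale.split()
    n_keep = int(len(words) * k)  # floor of the requested fraction
    prefix = " ".join(words[:n_keep])
    # Assumption: a simple two-role chat template.
    return f"<|user|>\n{instruction}\n<|assistant|>\n{prefix}"
```

With \(k = 0\) the prompt reduces to the plain instruction, which gives the no-guidance baseline for the same template.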
Key Design 2: Active Recovery Experiment (GCG Suffix Search)¶
- Function: Uses Greedy Coordinate Gradient (GCG) to search for a semantically meaningless suffix, appended to the original instruction, helping the forgotten model autonomously generate correct rationales.
- Design Motivation: If a semantically irrelevant suffix can restore model performance, it rules out the explanation that the model is merely completing externally supplied information, and directly shows that the capabilities are not forgotten.
- Mechanism: The optimization objective is to minimize \(\mathcal{L}(S) = -\log p(T|[I,S])\), where \(T\) is the partially correct rationale (first 20%). An independent suffix is searched for each forgotten sample. Results: Recovery rates across tasks all exceed 90%, some reaching 100%.
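The search loop can be illustrated with a toy, gradient-free simplification. Real GCG uses token-level gradients to shortlist substitutions and scores each candidate with the model's \(-\log p(T|[I,S])\); the sketch below instead samples substitutions at random and takes the loss as a caller-supplied function. All names here (`greedy_coordinate_search`, `toy_loss`, `TARGET`) are hypothetical stand-ins.

```python
import random
from typing import Callable, List, Sequence, Tuple


def greedy_coordinate_search(
    loss_fn: Callable[[List[str]], float],
    vocab: Sequence[str],
    init: List[str],
    steps: int = 200,
    n_candidates: int = 16,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Greedily replace one suffix token at a time whenever it lowers the loss."""
    rng = random.Random(seed)
    suffix = list(init)
    best = loss_fn(suffix)
    for _ in range(steps):
        proposals = []
        for _ in range(n_candidates):
            # Assumption: candidates are sampled uniformly at random;
            # GCG would rank substitutions by gradient instead.
            pos = rng.randrange(len(suffix))
            cand = list(suffix)
            cand[pos] = rng.choice(vocab)
            proposals.append((loss_fn(cand), cand))
        cand_loss, cand = min(proposals, key=lambda p: p[0])
        if cand_loss < best:  # keep only strict improvements
            best, suffix = cand_loss, cand
    return suffix, best


# Toy stand-in for -log p(T | [I, S]): number of positions that differ
# from a hidden target suffix.
TARGET = ["x", "q", "z", "k"]


def toy_loss(suffix: List[str]) -> float:
    return float(sum(a != b for a, b in zip(suffix, TARGET)))
```

Because a separate suffix is searched for every forgotten sample, a loop of this kind runs once per sample, which is why the approach is costly.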
Key Design 3: Attribution Analysis¶
- Function: Uses attribution scores to quantify the degree of dependence on the instruction part when the model generates rationales.
- Design Motivation: Reveals the root cause of pseudo forgetting—not knowledge loss, but rather a decrease in instruction dependence.
- Mechanism: Calculates the dependence score \(Q^{(l)}_{IR}\) (layer-by-layer) between instruction \(I\) and rationale \(R\), comparing the differences in the model before and after forgetting, as well as the differences when the forgotten model generates correct vs. incorrect rationales. Results: When the forgotten model generates incorrect rationales, the instruction dependence in shallow layers is significantly lower than when generating correct rationales.
Key Design 4: RGD Metric and RGD-R Framework¶
- Function: Defines the Rationale-Guidance Difficulty (RGD) metric to measure the degree of pseudo forgetting, and designs a dynamic replay data allocation strategy based on this.
- Design Motivation: Equal-amount replay (replaying the same amount of data for each old task) is inefficient. Allocation should be dynamic based on the actual degree of forgetting of each task.
- Mechanism:
- RGD Definition: \(\text{RGD}(I, R_g, A_g) = \frac{\text{PPL}_{a\text{-}f}(R_g|I)}{\text{PPL}_{b\text{-}f}(R_g)}\), i.e., the perplexity of the correct rationale \(R_g\) given instruction \(I\) under the model after forgetting (subscript \(a\text{-}f\)), divided by the perplexity under the model before forgetting (subscript \(b\text{-}f\)). Higher RGD \(\to\) more severe pseudo forgetting.
- Dynamic Allocation: The replay data ratio of the \(j\)-th old task is \(\alpha_j = \frac{\text{RGD}_{D_j}}{\sum_{k=1}^{i-1} \text{RGD}_{D_k}}\), meaning more severely forgotten tasks receive more replay data.
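The two formulas above can be sketched in a few lines, assuming per-token log-probabilities of the rationale have already been obtained from the relevant model snapshots and that task-level RGD scores are given (how sample-level scores are aggregated per task is not specified here). The function names are illustrative.

```python
import math
from typing import Dict, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rgd(numerator_logprobs: Sequence[float],
        denominator_logprobs: Sequence[float]) -> float:
    """Ratio of two perplexities over the same rationale tokens.

    numerator_logprobs: rationale log-probs for the RGD numerator term.
    denominator_logprobs: rationale log-probs for the RGD denominator term.
    """
    return perplexity(numerator_logprobs) / perplexity(denominator_logprobs)


def replay_ratios(task_rgd: Dict[str, float]) -> Dict[str, float]:
    """alpha_j = RGD_j / sum_k RGD_k over the previously learned tasks."""
    total = sum(task_rgd.values())
    if total <= 0:
        raise ValueError("RGD scores must sum to a positive value")
    return {task: score / total for task, score in task_rgd.items()}
```

For example, `replay_ratios({"RTE": 1.0, "MNLI": 3.0})` assigns one quarter of the replay data to RTE and three quarters to MNLI (the scores are made up for illustration).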
Loss & Training¶
Standard instruction fine-tuning loss is adopted, and the data proportion of each old task is dynamically adjusted via RGD on the replay data. The training follows a sequential learning paradigm: when learning each new task, the replay quantity of old tasks is dynamically calculated based on RGD scores.
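To turn the RGD-based proportions into concrete replay quantities, a fixed replay budget has to be split into integer sample counts. The sketch below uses largest-remainder rounding so the counts sum exactly to the budget; the rounding rule, the `budget` parameter, and the function name are assumptions for illustration, not details taken from the paper.

```python
from typing import Dict


def allocate_replay_counts(ratios: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` replay samples across old tasks in proportion to `ratios`.

    Assumes the ratios are non-negative and sum to (approximately) 1.
    """
    raw = {task: ratio * budget for task, ratio in ratios.items()}
    counts = {task: int(value) for task, value in raw.items()}  # floor
    leftover = max(0, budget - sum(counts.values()))
    # Hand the remaining samples to the tasks with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda t: raw[t] - counts[t], reverse=True)
    for task in by_remainder[:leftover]:
        counts[task] += 1
    return counts
```

In a sequential run, this allocation would be recomputed before each new task, using RGD scores measured on the current model for all previously learned tasks.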
Key Experimental Results¶
Main Results: Long Sequence Benchmark (Table 1)¶
| Model | Method | FAP↑ | F.Ra↓ | BWT↑ | FWT↑ |
|---|---|---|---|---|---|
| Qwen2-0.5B | SEQ | 20.73 | 53.18 | -53.04 | 21.46 |
| Qwen2-0.5B | EA | 64.13 | 5.43 | -4.90 | 33.34 |
| Qwen2-0.5B | RGD-R | 65.99 | 3.64 | -3.29 | 31.87 |
| Mistral-7B | SEQ | 51.48 | 30.19 | -29.97 | 47.91 |
| Mistral-7B | EA | 72.15 | 7.59 | -6.96 | 51.17 |
| Mistral-7B | RGD-R | 74.91 | 4.37 | -3.92 | 50.77 |
| Llama2-7B | SEQ | 62.79 | 17.87 | -17.85 | 43.95 |
| Llama2-7B | EA | 76.10 | 3.52 | -2.49 | 50.91 |
| Llama2-7B | RGD-R | 77.03 | 2.65 | -1.25 | 51.06 |
| Llama2-13B | SEQ | 68.38 | 13.54 | -13.20 | 51.69 |
| Llama2-13B | EA | 76.98 | 4.73 | -3.70 | 56.92 |
| Llama2-13B | RGD-R | 78.25 | 3.68 | -2.29 | 57.83 |
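The summary does not define the metrics in the table. As a reading aid, the sketch below computes FAP and BWT under the standard continual-learning definitions, which are assumed here rather than taken from the paper: `R[i][j]` is the score on task `j` after training on task `i`, FAP is the mean final score over all tasks, and BWT is the mean change on each earlier task between the time it was learned and the end of training.

```python
from typing import List


def final_average_performance(R: List[List[float]]) -> float:
    """Mean score over all tasks after training on the last task."""
    last = R[-1]
    return sum(last) / len(last)


def backward_transfer(R: List[List[float]]) -> float:
    """Mean of R[T-1][j] - R[j][j] over all tasks j learned before the last one."""
    n_tasks = len(R)
    if n_tasks < 2:
        raise ValueError("BWT needs at least two tasks")
    diffs = [R[-1][j] - R[j][j] for j in range(n_tasks - 1)]
    return sum(diffs) / len(diffs)


# Made-up score matrix for three sequential tasks (not data from the paper).
R_example = [
    [80.0, 0.0, 0.0],
    [70.0, 90.0, 0.0],
    [60.0, 85.0, 75.0],
]
```

Under these definitions a negative BWT means old-task scores dropped after later training, so values closer to zero indicate less forgetting.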
Comparison with InsCL (Table 3)¶
| Model | Method | FAP↑ | F.Ra↓ | BWT↑ | FWT↑ |
|---|---|---|---|---|---|
| Mistral-7B | EA | 72.15 | 7.59 | -6.96 | 51.17 |
| Mistral-7B | InsCL | 76.17 | 4.43 | -4.02 | 54.08 |
| Mistral-7B | RGD-R | 74.91 | 4.37 | -3.92 | 50.77 |
| Llama2-7B | EA | 76.10 | 3.52 | -2.49 | 50.91 |
| Llama2-7B | InsCL | 76.73 | 2.78 | -1.96 | 50.25 |
| Llama2-7B | RGD-R | 77.03 | 2.65 | -1.25 | 51.06 |
GCG Recovery Rate Experiment (Figure 5)¶
Recovery rates of different models on multiple target tasks:
| Model | Task | Recovery Rate |
|---|---|---|
| Mistral-7B | MNLI | 100% |
| Mistral-7B | QQP | ~95% |
| Llama2-7B | RTE | ~93% |
| All Models | Average | >90% |
Key Findings¶
- Pseudo forgetting is real: Providing only 10-20% of the correct rationale prefix (excluding answer information) is sufficient for the forgotten model to start restoring performance.
- Nonsensical suffixes can activate capabilities: The GCG suffix, which is semantically meaningless, achieves a recovery rate of over 90%, directly refuting the "loss of capability" hypothesis.
- Decreased instruction dependence is the root cause: Attribution analysis reveals that instruction dependence in shallow layers of the forgotten model is significantly lower, preventing the correct activation of internal capabilities.
- Larger models are more resistant to pseudo forgetting: The F.Ra of Qwen2-0.5B (53.18) versus Qwen2-7B (11.78) shows that larger models are inherently more resistant to pseudo forgetting.
- RGD-R outperforms equal-amount replay on stability: In Table 1, RGD-R improves FAP, F.Ra, and BWT over EA on all four models. On the plasticity metric (FWT), it improves on Llama2-7B and Llama2-13B but is slightly lower on Qwen2-0.5B (31.87 vs. 33.34) and Mistral-7B (50.77 vs. 51.17).
- Equal-amount replay already substantially mitigates forgetting: This itself supports the pseudo forgetting hypothesis—simple replay can recover capabilities, indicating that knowledge is not truly lost.
- Data replay indeed restores instruction dependence: Attribution experiments verify that model dependence on old task instructions bounces back after replay.
Highlights & Insights¶
- Conceptual contributions exceed algorithmic contributions: Proposing the concept of "pseudo forgetting" and directly proving it with two orthogonal experiments provides a new understanding paradigm for the continual learning field—not "what was forgotten", but rather "what cannot be correctly recalled".
- Ingenious experimental design: The GCG suffix experiment is particularly clever—restoring performance using semantically irrelevant perturbations rules out confounding factors from external information completion, serving as highly robust causal evidence.
- Theoretical support: From an information theory perspective, RGD is derived to approximately measure the probability of instructions activating the correct capabilities \(P_\theta(c^*|i) = \frac{P_\theta(r^*|i)}{P_\theta(r^*)}\), lending theoretical soundness to the metric.
- Practical and lightweight: RGD-R requires no modification to the model architecture or training schedule; it merely adjusts the replay data proportion, making it easy to integrate into existing continual learning workflows.
- Multi-model and multi-task validation: Validated on five models: Qwen2 (0.5B/7B), Mistral-7B, and Llama2 (7B/13B), covering different scales and architectures.
Limitations & Future Work¶
- No analysis of when pseudo forgetting occurs: The point during new-task learning at which instruction dependence starts to decrease has not been tracked.
- Unexplored task/domain specificity: Whether pseudo forgetting is more severe on certain task types (e.g., generation vs. classification) and its relationship with data distribution remain uninvestigated.
- RGD requires the pre-forgetting model: Calculating \(\text{PPL}_{b\text{-}f}\) requires saving snapshots of the model before forgetting, which increases storage costs.
- Only replay-based methods are tested: The generalizability of RGD when combined with regularization or architectural expansion methods remains unverified.
- Single evaluation metric: RGD is a single-dimensional metric, and the paper itself acknowledges that the quantification of pseudo forgetting could be multi-dimensional.
- High search cost of GCG: Sample-by-sample suffix searching is used solely for analysis and is impractical as a real-world mitigation strategy.
Related Work & Insights¶
vs Kotha et al., 2024 (Task Inference Hypothesis)¶
Kotha et al. proposed that fine-tuning biases the model toward new-task inference rather than erasing old capabilities, but validated this only on synthetic datasets and small Transformers. This study directly demonstrates a similar hypothesis on natural language datasets with LLMs of up to 13B parameters, and provides stronger causal evidence through the GCG suffix experiments: it shows not only that the capabilities are still there, but also locates the cause (decreased instruction dependence).
vs Jiang et al., 2024 (Instruction Vector)¶
Jiang et al. also argued that forgetting stems from a decline in instruction-following capability, but used an inconsistent setup of instruction-following during training and prefix completion during probing. This work maintains the instruction-following setting throughout (including running GCG experiments within the instruction template), offering stronger consistency and persuasiveness in experimental design.
vs InsCL (Wang et al., 2024)¶
InsCL allocates replay data based on task instruction similarity, whereas RGD-R allocates based on the model's susceptibility to pseudo forgetting. The two are complementary: InsCL focuses on relationships between tasks, while RGD-R focuses on the model state. The results in Table 3 are comparable: on Mistral-7B, InsCL leads on FAP and FWT while RGD-R is slightly better on F.Ra and BWT; on Llama2-7B, RGD-R is better on all four metrics. RGD-R also offers better interpretability, since its allocation directly corresponds to the degree of pseudo forgetting.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The concept of "pseudo forgetting" itself is highly inspiring, and the GCG suffix experiment is ingeniously designed. However, similar core hypotheses (capabilities are not lost) have been proposed before; the main contribution lies in providing more direct and robust evidence.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Dual validation (passive + active recovery), attribution analysis, theoretical derivation, 5 models/multiple tasks, and comparisons with SOTA make the experimental chain highly complete.
- Writing Quality: ⭐⭐⭐⭐ — The logic chain is clear (evidence \(\to\) cause \(\to\) metric \(\to\) solution), but there are many formulas, and some symbols could be simplified.
- Value: ⭐⭐⭐⭐ — Significantly advances the understanding of the continual learning field. The RGD-R framework is practical and lightweight, though the algorithmic improvement is not exceptionally large compared to the conceptual contribution.