IDER: IDempotent Experience Replay for Reliable Continual Learning

Conference: ICLR 2026
arXiv: 2603.00624
Code: GitHub
Area: Model Compression
Keywords: Continual Learning, Idempotency, Experience Replay, Calibration Error, Catastrophic Forgetting

TL;DR

This paper introduces idempotence into continual learning, enforcing output self-consistency during new task acquisition via two components—a Standard Idempotent Module and an Idempotent Distillation Module—simultaneously improving prediction reliability (reduced calibration error) and significantly mitigating catastrophic forgetting.

Background & Motivation

Continual learning faces the core challenge of catastrophic forgetting—models rapidly lose knowledge of previous tasks when learning new ones. Replay-based methods (e.g., ER, DER++) mitigate this by storing old samples in a buffer, yet tend to be overconfident and poorly calibrated (high ECE), with particular bias toward recent tasks. In safety-critical domains such as healthcare and autonomous driving, models must not only be accurate but also "know what they do not know."

Existing uncertainty-aware CL methods (e.g., NPCL) employ neural processes but suffer from: (1) non-negligible parameter growth; and (2) incompatibility with logits-based replay methods due to the stochasticity of Monte Carlo sampling. There is thus a need for a lightweight and compatible principle for building reliable CL systems.

Core Idea: Idempotence. A function is idempotent if applying it repeatedly yields the same result as applying it once (\(f(f(x)) = f(x)\)). If a model remains idempotent on old data, its outputs lie on a learned stable manifold, indicating self-consistent and reliable predictions.
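
As a minimal illustration of the definition only (not of the paper's model), the sketch below checks \(f(f(x)) = f(x)\) on a few sample inputs. The helper names `is_idempotent`, `clamp01`, and `halve` are hypothetical and chosen for this example.

```python
def is_idempotent(f, inputs):
    """Return True if f(f(x)) == f(x) for every x in inputs."""
    return all(f(f(x)) == f(x) for x in inputs)


def clamp01(x):
    # Clamping to [0, 1] is idempotent: a clamped value is already in range.
    return max(0.0, min(1.0, x))


def halve(x):
    # Halving is not idempotent: applying it a second time changes the result.
    return x / 2.0


samples = [-2.0, 0.25, 0.5, 3.0]
print(is_idempotent(clamp01, samples))  # True
print(is_idempotent(halve, samples))    # False
```

In IDER the role of `f` is played by the network applied to its own prediction, so the check becomes a training objective rather than an exact equality.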

Method

Overall Architecture

IDER augments the experience replay framework with two modules: (1) a Standard Idempotent Module that enforces idempotence on current-task data; and (2) an Idempotent Distillation Module that imposes cross-task idempotent constraints using an old model checkpoint. Only two forward passes are required, introducing virtually no additional parameters.

Key Designs

Modified Network Architecture:
- Function: Enables the model to accept two inputs—an image and an auxiliary label signal.
- Mechanism: The ResNet backbone is split into \(f_t^1\) (front half) and \(f_t^2\) (back half). A second input (one-hot label \(y\) or a uniform "null" signal \(\mathbf{0}\)) is transformed via a linear layer followed by LeakyReLU into a feature of the same dimensionality as the output of \(f_t^1\), added to the intermediate feature map, and then passed into \(f_t^2\). The model output (softmax over logits) can be recycled as the second input.
- Design Motivation: At inference time, when no ground-truth label is available, \(\mathbf{0}\) is used as the second input to obtain a prediction, which is then fed back to verify idempotence.
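
The two-input forward pass described above can be sketched in plain Python. This is a toy under stated assumptions, not the authors' implementation: the matrices `W1`, `W_label`, and `W2` are hand-picked hypothetical weights standing in for \(f_t^1\), the label projection, and \(f_t^2\), and the null signal is represented as an all-zeros vector.

```python
import math


def leaky_relu(v, slope=0.01):
    return [x if x > 0 else slope * x for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


# Hypothetical toy weights: 2 input features, 3 hidden units, 2 classes.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]       # stands in for f^1 (front half)
W_label = [[0.6, -0.6], [0.2, 0.1], [-0.4, 0.5]]  # linear layer on the second input
W2 = [[1.0, 0.5, -0.5], [-1.0, 0.2, 0.7]]         # stands in for f^2 (back half)


def forward(x, second_input):
    h = matvec(W1, x)                                 # intermediate feature from f^1
    cond = leaky_relu(matvec(W_label, second_input))  # second input mapped to h's dimension
    fused = [a + b for a, b in zip(h, cond)]          # added to the intermediate feature
    return softmax(matvec(W2, fused))                 # f^2, then softmax over logits


x = [1.0, 2.0]
null = [0.0, 0.0]       # assumed representation of the null signal
y0 = forward(x, null)   # first pass: no label information
y1 = forward(x, y0)     # second pass: the prediction is recycled as the second input
print(y0, y1)
```

The gap between `y0` and `y1` is the quantity that idempotent training tries to shrink.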

Standard Idempotent Module:
- Function: Trains the model to be idempotent on current-task data.
- Mechanism: Minimizes the cross-entropy of both forward passes against the ground truth: \(\mathcal{L}_{ice} = \sum_{(x,y)\in\mathcal{T}_t} [\mathcal{L}_{ce}(f_t(x,y^*),y) + \mathcal{L}_{ce}(f_t(x,f_t(x,y^*)),y)]\), where \(y^*\) is the ground-truth label with probability \(1-P\) and the null signal \(\mathbf{0}\) with probability \(P\).
- Design Motivation: After training, \(f_t(x,\mathbf{0}) \approx y\) and \(f_t(x,f_t(x,\mathbf{0})) \approx y\), achieving self-consistent predictions.
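
The loss \(\mathcal{L}_{ice}\) can be sketched as follows. The function `toy_model` is a hypothetical stand-in for \(f_t\) (it is not a trained network), the null signal is assumed to be an all-zeros vector, and `p_null` plays the role of \(P\).

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def toy_model(x, second):
    # Hypothetical stand-in for f_t: logits depend on the input and the second input.
    return softmax([x[0] + 0.5 * second[0], x[1] + 0.5 * second[1]])


def cross_entropy(probs, label):
    return -math.log(probs[label])


def ice_loss(model, batch, num_classes, p_null, rng):
    total = 0.0
    for x, label in batch:
        one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
        null = [0.0] * num_classes
        # y* is the null signal with probability P, the ground-truth label otherwise.
        y_star = null if rng.random() < p_null else one_hot
        first = model(x, y_star)   # first forward pass
        second = model(x, first)   # second pass recycles the first prediction
        total += cross_entropy(first, label) + cross_entropy(second, label)
    return total


batch = [([2.0, 0.0], 0), ([0.0, 1.5], 1)]
loss = ice_loss(toy_model, batch, num_classes=2, p_null=0.5, rng=random.Random(0))
print(loss)
```

Both cross-entropy terms target the same label, so a model that minimizes this loss is pushed toward giving the same answer on the first and second pass.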

Idempotent Distillation Module:
- Function: Leverages old model checkpoints to constrain the current model and prevent prediction distribution drift.
- Mechanism: Naively minimizing \(\|f_t(x,\mathbf{0}) - f_t(x,f_t(x,\mathbf{0}))\|^2\) may reinforce erroneous predictions, because \(f_t\) is biased toward current data. Instead, the frozen old checkpoint \(f_{t-1}\) performs the second forward pass: \(\mathcal{L}_{ide} = \sum_{(x,y)\in\mathcal{T}_t \cup M} \|f_t(x,\mathbf{0}) - f_{t-1}(x,f_t(x,\mathbf{0}))\|_2^2\), where \(M\) is the replay buffer.
- Design Motivation: Freezing \(f_{t-1}\) preserves prior knowledge and a stable output distribution; only \(f_t\) is updated, which steers the first-pass prediction \(y_0 = f_t(x,\mathbf{0})\) in the correct direction and avoids error reinforcement. The same term also acts as knowledge distillation.
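
A minimal sketch of \(\mathcal{L}_{ide}\), assuming toy models in place of real networks: `make_model` is a hypothetical factory whose `bias` argument imitates drift toward a recent class, and the null signal is assumed to be all zeros. Only the loss value is computed here; in training, gradients would flow into the current model while the old checkpoint stays fixed.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def make_model(bias):
    # Hypothetical stand-in for a checkpoint; `bias` shifts the logits.
    def model(x, second):
        return softmax([x[0] + 0.5 * second[0] + bias[0],
                        x[1] + 0.5 * second[1] + bias[1]])
    return model


def ide_loss(current, frozen_old, inputs, num_classes):
    null = [0.0] * num_classes
    total = 0.0
    for x in inputs:
        y0 = current(x, null)    # first pass: current model with the null signal
        y1 = frozen_old(x, y0)   # second pass: frozen previous checkpoint
        total += sum((a - b) ** 2 for a, b in zip(y0, y1))  # squared L2 distance
    return total


old = make_model([0.0, 0.0])      # plays the role of f_{t-1}
drifted = make_model([0.0, 3.0])  # f_t biased toward the most recent class
inputs = [[1.0, 0.0], [0.0, 1.0]]

print(ide_loss(old, old, inputs, num_classes=2))
print(ide_loss(drifted, old, inputs, num_classes=2))
```

In this toy setup the drifted model incurs a much larger loss than the undrifted one, which is the signal the distillation term uses to pull \(f_t\) back toward predictions that the old checkpoint considers self-consistent.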

Loss & Training

The total loss is a weighted sum of three terms: \(\mathcal{L}_{IDER} = \mathcal{L}_{ice} + \alpha\mathcal{L}_{ide} + \beta\mathcal{L}_{rep\text{-}ice}\), where \(\alpha\) and \(\beta\) are weighting coefficients and the replay loss \(\mathcal{L}_{rep\text{-}ice}\) applies the same idempotent training to buffer data. IDER can be integrated as a plug-and-play component into methods such as ER, BFP, and CLS-ER.

Key Experimental Results

Main Results (Final Average Accuracy)

Method	CIFAR-10 Buf200	CIFAR-10 Buf500	CIFAR-100 Buf500	CIFAR-100 Buf2000
ER	44.46	58.84	23.41	40.47
DER++	62.19	70.10	37.69	51.82
XDER	64.10	67.42	48.14	57.57
BFP	68.64	73.51	46.70	57.39
ER+ID (Ours)	71.02	74.74	44.82	56.59
BFP+ID (Ours)	71.99	76.65	48.53	57.74

Ablation Study (GCIL-CIFAR-100 Uniform)

Method	Buf200	Gain	Buf500	Gain
ER	16.34	—	28.76	—
ER+ID	26.66	+10.32	40.54	+11.78
CLS-ER	22.37	—	36.80	—
CLS-ER+ID	31.17	+8.80	37.57	+0.77

Key Findings

ER+ID improves over ER by 26.56 percentage points on CIFAR-10 Buf200 (44.46→71.02), the largest gain over a baseline in the main results.
Idempotent distillation effectively alleviates recency bias, yielding more uniform prediction probabilities across tasks.
ECE (calibration error) is consistently reduced across all baselines, indicating more "honest" model predictions.
Improvements remain substantial in the more challenging GCIL setting (e.g., CLS-ER+ID gains 8.4% under the Longtail configuration).
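
For readers unfamiliar with the calibration metric, the sketch below computes a standard binned ECE: predictions are grouped by confidence, and the gap between accuracy and mean confidence is averaged with bin-size weights. The function name, the choice of 10 equal-width bins, and the half-open bin boundaries are assumptions of this example, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Overconfident: 95% confidence but only half of the predictions are right.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# Well calibrated: 75% confidence and three of four predictions are right.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident, calibrated)
```

A lower value means the stated confidence tracks the observed accuracy more closely, which is the sense in which the predictions are described as more "honest".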

Highlights & Insights

The paper elegantly maps the algebraic notion of idempotence directly to a "prediction self-consistency" constraint in CL, offering a mathematically clear and principled intuition.
The method is extremely lightweight—requiring only one additional forward pass and the previous checkpoint, with virtually no parameter overhead.
Its plug-and-play nature makes it a general-purpose tool for enhancing the reliability of existing CL methods.

Limitations & Future Work

The method must store the previous checkpoint \(f_{t-1}\); the storage cost is modest, but system complexity increases.
All experiments are conducted on ResNet-18; the effectiveness on Transformer backbones and larger models remains unexplored.
The theoretical foundation of the idempotency assumption warrants further strengthening—a formal analysis of why idempotence necessarily leads to better generalization is absent.

vs. NPCL: NPCL incurs parameter growth through neural processes and is incompatible with logits-based methods, whereas IDER is lightweight and general-purpose.
vs. DER++: DER++ simply stores and matches logits, while IDER provides stronger regularization through idempotent constraints.

Rating

Novelty: ⭐⭐⭐⭐⭐ — Applying idempotence to CL represents a genuinely novel perspective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers CIL, GCIL, ECE, and plug-and-play validation comprehensively.
Writing Quality: ⭐⭐⭐⭐ — Logic is clear; the derivation from motivation to method flows naturally.
Value: ⭐⭐⭐⭐ — Provides a new mathematical principle and practical tool for the CL community.