Reference-Guided Machine Unlearning

  • Conference: ICLR 2026
  • arXiv: 2603.11210
  • Code: GitHub
  • Area: Model Compression / Machine Unlearning
  • Keywords: Machine Unlearning, Reference-Guided, Knowledge Distillation, Distributional Indistinguishability, Privacy Protection

TL;DR

This paper proposes ReGUn (Reference-Guided Unlearning), which leverages an independent held-out dataset as a reference standard for "unseen behavior." Through class-conditional distillation, the model's behavior on forget data is aligned to that on truly unseen data, achieving a superior forgetting–utility trade-off.

Background & Motivation

Machine Unlearning aims to remove the influence of specific data from a trained model while preserving general performance — serving as the technical foundation for the "right to be forgotten" under privacy regulations such as GDPR.

Core Problem: Existing approximate unlearning methods rely on performance-degradation heuristics (e.g., loss maximization, random labels), which suffer from fundamental flaws:

  • Ill-conditioning: may produce large or misdirected gradients
  • Generalization damage: alters decision boundaries beyond the intended scope
  • Optimization conflict: the forgetting and stability objectives contradict each other

Key Insight: Unlearning should not merely make the model "more wrong"; rather, it should make the model's behavior on forget data indistinguishable from its behavior on truly unseen data.

Method

Overall Architecture

ReGUn consists of two core components: Reference Distribution Construction (RefDist) and Unlearning Objective Optimization.

1. Reference Distribution Construction

Given a forget minibatch \(B_f = \{(x_i^f, y_i^f)\}_{i=1}^b\), \(m\) samples are drawn from the held-out set \(\mathcal{D}_h\) via class-histogram matching, and the reference model outputs are aggregated:

\[q(B_f) = \frac{1}{m} \sum_{j=1}^{m} p_\phi(\cdot | \tilde{x}_j)\]

Key designs:

  • The reference model uses the initial model \(f_{\theta_0}\), avoiding additional training and reference drift
  • Class-histogram matching controls the label-prior discrepancy
  • All forget samples within the same batch share the same reference distribution
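The construction above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `reference_distribution`, its argument layout, and the proportional rounding used for histogram matching are all assumptions.

```python
import numpy as np

def reference_distribution(forget_labels, heldout_probs, heldout_labels, m, rng):
    """Build the shared reference distribution q(B_f) for one forget minibatch.

    forget_labels: (b,)   labels of the forget batch (drives histogram matching).
    heldout_probs: (N, C) reference-model softmax outputs p_phi on D_h.
    heldout_labels: (N,)  labels of the held-out samples.
    m: total number of held-out samples to draw.
    """
    classes, counts = np.unique(forget_labels, return_counts=True)
    draws = []
    for c, k in zip(classes, counts):
        # Class-histogram matching: draw from D_h in proportion to each
        # class's share of the forget batch (rounding is an assumption).
        n_c = max(1, round(m * k / len(forget_labels)))
        pool = np.where(heldout_labels == c)[0]
        draws.append(rng.choice(pool, size=n_c, replace=len(pool) < n_c))
    idx = np.concatenate(draws)
    # Aggregate the reference-model outputs: q(B_f) = (1/m) * sum_j p_phi(.|x_j)
    return heldout_probs[idx].mean(axis=0)
```

Because all samples in the batch share this single `q`, the forget term below reduces to distilling every forget prediction toward one batch-level target.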

2. Unlearning Objective

\[\mathcal{L}(\theta; B_f, B_r) = \lambda_f \frac{1}{|B_f|} \sum_{(x,\cdot) \in B_f} \text{KL}(q(B_f) \| p_\theta(\cdot|x)) + \lambda_r \frac{1}{|B_r|} \sum_{(x,y) \in B_r} \text{CE}(p_\theta(\cdot|x), y)\]
  • Forget term (KL divergence): Distills predictions on forget samples toward the held-out reference distribution
  • Retain term (cross-entropy): Anchors updates to preserve performance on retain data
  • \(\lambda_f, \lambda_r > 0\) control the trade-off between forgetting strength and retain utility
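The objective can be written as a short NumPy sketch, assuming softmax outputs are precomputed. The function name `regun_loss` and the `eps` smoothing constant are illustrative choices, not details from the paper.

```python
import numpy as np

def regun_loss(p_forget, q_ref, p_retain, y_retain, lam_f=1.0, lam_r=1.0, eps=1e-12):
    """Sketch of the ReGUn objective L(theta; B_f, B_r).

    p_forget: (bf, C) current-model softmax outputs on the forget batch.
    q_ref:    (C,)    shared reference distribution q(B_f).
    p_retain: (br, C) current-model softmax outputs on the retain batch.
    y_retain: (br,)   integer labels of the retain batch.
    """
    # Forget term: KL(q(B_f) || p_theta(.|x)) averaged over the forget batch,
    # pulling forget-sample predictions toward the held-out reference.
    kl = np.sum(q_ref * (np.log(q_ref + eps) - np.log(p_forget + eps)), axis=1)
    forget_term = kl.mean()
    # Retain term: standard cross-entropy on retain data, anchoring utility.
    ce = -np.log(p_retain[np.arange(len(y_retain)), y_retain] + eps)
    retain_term = ce.mean()
    return lam_f * forget_term + lam_r * retain_term
```

When the forget predictions already match the reference and the retain predictions are correct, both terms vanish, which is the intended fixed point of the optimization.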

3. Data Partitioning

From the original training set \(\mathcal{D}_{orig}\):

  • 10% is held out as \(\mathcal{D}_h\), used exclusively during the unlearning phase
  • The remainder forms \(\mathcal{D}_{train}\), from which the forget set \(\mathcal{D}_f\) and validation set \(\mathcal{D}_{val}\) are sampled
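The split can be sketched as an index partition. This omits sampling \(\mathcal{D}_{val}\), and applying the forget ratio to the remaining training split is an assumption; the seed handling and function name are likewise illustrative.

```python
import numpy as np

def partition_indices(n_total, forget_ratio, heldout_frac=0.1, seed=0):
    """Split example indices following the protocol above: 10% of the
    original training set becomes D_h; the forget set is sampled from
    the remainder (validation split omitted for brevity)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    n_h = int(heldout_frac * n_total)
    heldout, train = perm[:n_h], perm[n_h:]
    # Assumption: the forget ratio is taken relative to the remaining split.
    n_f = int(forget_ratio * len(train))
    forget, retain = train[:n_f], train[n_f:]
    return heldout, forget, retain
```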

Key Experimental Results

Main Results: ResNet-18 on CIFAR-10 (Forget Ratios 1% / 10% / 50%)

| Method | Forget 1% TestAcc | Forget 1% RMIA_AUC | Forget 10% Gap_Avg | Forget 50% Gap_Avg |
|---|---|---|---|---|
| Retrain (Oracle) | 94.34 | 49.98 | 0.00 | 0.00 |
| NegGrad | 94.17 | 59.80 | 3.82 | 4.80 |
| Finetune | 90.90 | 54.78 | 2.79 | 2.39 |
| SalUn | 91.63 | 50.09 | 2.48 | 2.00 |
| Amun | 91.84 | 44.17 | 1.46 | — |
| ReGUn | 91.98 | 51.35 | 1.49 | 1.55 |

Ablation Study: Overall Performance Under Different Forget Ratios (Gap_Avg ↓)

| Method | CIFAR-10 1% | CIFAR-10 10% | CIFAR-10 50% | CIFAR-100 |
|---|---|---|---|---|
| NegGrad+ | 3.77 | 3.71 | 2.62 | — |
| ℓ1-sparse | 2.73 | 2.49 | 2.09 | — |
| SalUn | 1.64 | 2.48 | 2.00 | — |
| ReGUn | 1.49 | 1.49 | 1.55 | — |

Key Findings: ReGUn is particularly strong at large forget ratios (50%), achieving the lowest aggregate deviation (Gap_Avg) and demonstrating that the reference-guided approach remains stable in large-scale unlearning scenarios.

Highlights & Insights

  1. Paradigm Shift: Moves from "making the model more wrong" to "making the model behave as if it had never seen the data," introducing a distributional indistinguishability perspective
  2. Simplicity and Elegance: Requires only a held-out dataset and KL distillation, with no need for complex repair mechanisms or constrained parameter editing
  3. Class-Conditional Reference: Achieves instance-level/class-conditional referencing via histogram matching, outperforming global distribution matching
  4. Cross-Architecture Validation: Performs consistently well on both CNNs (ResNet-18) and Transformers (Swin-T)

Limitations & Future Work

  • Requires an additional held-out dataset (10% of original data), which may be infeasible in data-scarce settings
  • The reference model uses the initial model \(f_{\theta_0}\), which still retains the influence of forget data (a non-ideal reference)
  • Evaluation is limited to random forgetting; settings such as class-wise forgetting remain unexplored
  • Membership inference attack evaluation uses offline RMIA, which may underestimate actual privacy risks

Related Work Comparison

  • Baseline Unlearning Methods: Finetune, NegGrad, NegGrad+ — simple but limited in effectiveness
  • Constrained Unlearning: SalUn (saliency-guided), SSD (Fisher information), Amun — introduce restriction mechanisms
  • Reference-Based Methods: pseudo-probability replacement, third-party distribution matching — lack instance-level conditional control
  • Exact Unlearning: SISA and others — computationally expensive but provide exact guarantees

Rating

| Dimension | Score | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | Distributional indistinguishability perspective is novel; reference-guided approach is well-motivated |
| Practicality | ⭐⭐⭐⭐ | Method is simple and general, but requires additional held-out data |
| Experimental Thoroughness | ⭐⭐⭐⭐ | Evaluated across multiple architectures, forget ratios, and metrics |
| Writing Quality | ⭐⭐⭐⭐ | Problem formulation is clear; method derivation is rigorous |