ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging¶

Conference: ACL 2025
arXiv: 2503.21088
Code: Yes (https://github.com/zjunlp/unlearn/tree/main/semeval25)
Area: NLP / Machine Unlearning
Keywords: Machine Unlearning, Model Merging, TIES-Merging, Privacy Protection, LLM Safety

TL;DR¶

This work placed second in SemEval-2025 Task 4 (LLM Sensitive Content Unlearning). The core mechanism is to train two complementary models (one over-forgetting, one under-forgetting) and merge them via TIES-Merging to obtain a balanced unlearning model, which reaches a near-ideal MIA AUC of 0.501 in local experiments (an AUC of 0.5 means the membership inference attack does no better than chance).

Background & Motivation¶

Machine unlearning is a critical technology in AI safety, aiming to selectively remove sensitive data (such as copyrighted materials and personal information) from trained models. However, existing unlearning methods face fundamental difficulties:

Over-forgetting: Removing excessive non-sensitive information, leading to an overall degradation of model performance.

Under-forgetting: Failing to completely remove the targeted sensitive data.

Hyperparameter Sensitivity: Finding optimal hyperparameters to balance performance across multiple evaluation dimensions is extremely difficult, and sometimes even impossible.

The authors' key insight is that instead of trying to find a perfect set of hyperparameters, it is better to train two biased models and combine their strengths through model merging.

Method¶

Overall Architecture¶

A two-stage system: Training Stage $\rightarrow$ Merging Stage

Key Designs¶

Training Stage—Two Complementary Models: OLMo-7B-0724-Instruct is fine-tuned using LoRA (Low-Rank Adaptation) with identical training objectives but different hyperparameters, yielding:
- model₁ (Over-forgetting): High Task Aggregate (0.968) but very low MIA AUC (0.022, MIA Score 0.045): it forgets too much, including what should not be forgotten.
- model₂ (Under-forgetting): Low Task Aggregate (0.659) and high MIA AUC (0.818, MIA Score 0.364): it retains too much information that should be forgotten.

The training loss function consists of three components:
- NPO (Negative Preference Optimization): Minimizes the probability of target tokens on the forget set.
- GDR (Gradient Descent on Retain set): Maintains the model's original capabilities on the retain set.
- KLR (KL Divergence Minimization on Retain set): Ensures that the output distribution of the unlearned model does not shift on the retain set.

Total loss: $L_{\text{total}} = \alpha \cdot L_{\text{npo}} + \beta \cdot L_{\text{gdr}} + \gamma \cdot L_{\text{klr}}$
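As a rough illustration of how the three terms combine, here is a minimal Python sketch on toy scalars. The function names, the NPO temperature `tau`, the KL direction, the toy inputs, and both weight settings are assumptions made for illustration; the paper's actual implementation and hyperparameters may differ.

```python
import math

def npo_loss(logp_theta, logp_ref, tau=0.1):
    """NPO on forget-set sequences: (2/tau) * mean(log(1 + (pi_theta/pi_ref)^tau)).

    Inputs are per-sequence log-probabilities under the current and original model.
    """
    terms = [math.log1p(math.exp(tau * (lt - lr)))
             for lt, lr in zip(logp_theta, logp_ref)]
    return (2.0 / tau) * sum(terms) / len(terms)

def gdr_loss(logp_theta_retain):
    """Cross-entropy on the retain set: negative mean log-likelihood."""
    return -sum(logp_theta_retain) / len(logp_theta_retain)

def klr_loss(p_ref, p_theta):
    """KL(p_ref || p_theta) for one next-token distribution (direction is an assumption)."""
    return sum(pr * math.log(pr / pt) for pr, pt in zip(p_ref, p_theta) if pr > 0)

def total_loss(l_npo, l_gdr, l_klr, alpha, beta, gamma):
    """Weighted sum L_total = alpha*L_npo + beta*L_gdr + gamma*L_klr."""
    return alpha * l_npo + beta * l_gdr + gamma * l_klr

# Toy numbers (hypothetical, not from the paper).
l_npo = npo_loss(logp_theta=[-12.0, -9.5], logp_ref=[-8.0, -7.0])
l_gdr = gdr_loss([-1.2, -0.8, -1.0])
l_klr = klr_loss(p_ref=[0.7, 0.2, 0.1], p_theta=[0.6, 0.3, 0.1])

# Two hypothetical weightings: one leans on forgetting, one on retention.
forget_heavy = total_loss(l_npo, l_gdr, l_klr, alpha=1.0, beta=0.1, gamma=0.1)
retain_heavy = total_loss(l_npo, l_gdr, l_klr, alpha=0.1, beta=1.0, gamma=1.0)
print(forget_heavy, retain_heavy)
```

The same three component values produce different objectives under different weights, which is one plausible way the two complementary models could end up with opposite biases.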

Merging Stage—TIES-Merging: Merging the LoRA adapters of the two models in three steps:
- Trimming: Retaining the most important parameters according to a density threshold and setting the rest to zero. A density of 0.8 achieves the best performance—lower densities over-prune, while higher densities introduce redundancy.
- Electing: Creating a unified sign vector by choosing, for each parameter, the sign with the larger total magnitude across the two adapters, which resolves parameter direction conflicts.
- Disjoint Merging: Averaging only the non-zero parameters whose sign agrees with the elected sign, discarding conflicting parameters.
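The three steps above can be sketched on flat parameter vectors as follows. This is a simplified pure-Python illustration of TIES-style merging, not the authors' code: the helper names `trim` and `ties_merge` and the toy vectors are hypothetical, and a real LoRA merge would operate on adapter tensors rather than Python lists.

```python
def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = int(round(density * len(delta)))
    if k == 0:
        return [0.0] * len(delta)
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]

def ties_merge(deltas, density=0.8):
    """Trim each vector, elect a sign per position, then average agreeing entries."""
    trimmed = [trim(d, density) for d in deltas]
    merged = []
    for values in zip(*trimmed):
        # Elect: the sign of the summed values is the sign with more total magnitude.
        sign_positive = sum(values) >= 0
        # Disjoint merge: average only non-zero entries that match the elected sign.
        agreeing = [v for v in values if v != 0.0 and (v > 0) == sign_positive]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

# Toy parameter updates for the two models (hypothetical values).
delta_over = [0.5, -0.2, 0.1, 0.4, -0.05]
delta_under = [0.3, 0.6, -0.3, 0.0, 0.02]
print(ties_merge([delta_over, delta_under], density=0.8))
```

In this toy case, positions where the two updates disagree in sign keep only the value from the dominant direction instead of averaging the two toward zero, which is the behavior that distinguishes TIES from plain linear merging.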

Loss & Training¶

NPO Loss: Similar to DPO but only targets the forget set, negatively optimizing to reduce the model's generation probability of the forgotten data.
GDR Loss: Standard cross-entropy, maintaining model capability on the retain set.
KLR Loss: Minimizes the KL divergence between the unlearned model and the original model on the retain set.
The two models achieve complementary biases through different combinations of $\alpha$, $\beta$, and $\gamma$.
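
To make the relation between NPO and DPO concrete, the two losses can be written side by side. This follows the standard formulation in Zhang et al. (2024) and is a sketch rather than this paper's exact implementation. Here $\tau$ is the inverse temperature (written $\beta$ in the NPO paper; renamed to avoid a clash with the loss weight $\beta$ above), $\pi_{\text{ref}}$ is the original model, $y_w$ and $y_l$ are the preferred and dispreferred responses, and $\mathcal{D}_f$ is the forget set.

$$
L_{\text{DPO}} = -\frac{1}{\tau}\,\mathbb{E}\left[\log \sigma\left(\tau \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \tau \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]
$$

$$
L_{\text{npo}} = -\frac{2}{\tau}\,\mathbb{E}_{(x,y)\in\mathcal{D}_f}\left[\log \sigma\left(-\tau \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\right)\right] = \frac{2}{\tau}\,\mathbb{E}_{(x,y)\in\mathcal{D}_f}\left[\log\left(1 + \left(\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\right)^{\tau}\right)\right]
$$

NPO drops the preferred-response term of DPO and treats every forget-set sample as a dispreferred response, so the loss decreases as $\pi_\theta(y\mid x)$ falls below $\pi_{\text{ref}}(y\mid x)$.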

Key Experimental Results¶

Online and Local Experimental Results¶

Environment	Method	Aggregate	Task Aggregate	MIA Score/AUC	MMLU Avg.
Online	AILS-NTUA (First)	0.706	0.827	0.847/–	0.443
Online	ZJUKLAB (Ours, Second)	0.487	0.944	0.048/–	0.471
Local	model₁ (Over-forgetting)	0.481	0.968	0.045/0.022♣	0.431
Local	model₂ (Under-forgetting)	0.504	0.659	0.364/0.818♠	0.491
Local	Merged (Ours)	0.806	0.939	0.997/0.501♡	0.480
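
The MIA Score and Aggregate columns appear to follow simple formulas: an MIA Score of $1 - 2\,|\text{AUC} - 0.5|$ and an Aggregate equal to the mean of Task Aggregate, MIA Score, and MMLU Avg. These formulas are an assumption inferred from the table values, not something stated in these notes, and they reproduce the reported numbers only to within rounding. The function names are hypothetical.

```python
def mia_score(auc):
    """Assumed scoring: 1.0 when AUC is 0.5 (attack at chance), 0.0 at AUC 0 or 1."""
    return 1.0 - 2.0 * abs(auc - 0.5)

def aggregate(task_aggregate, mia, mmlu):
    """Assumed final score: arithmetic mean of the three components."""
    return (task_aggregate + mia + mmlu) / 3.0

# Local rows from the table: (name, Task Aggregate, MIA AUC, MMLU Avg.)
rows = [
    ("model1 (over-forgetting)", 0.968, 0.022, 0.431),
    ("model2 (under-forgetting)", 0.659, 0.818, 0.491),
    ("merged", 0.939, 0.501, 0.480),
]
for name, task, auc, mmlu in rows:
    score = mia_score(auc)
    print(name, round(score, 3), round(aggregate(task, score, mmlu), 3))
```

Under this reading, both an AUC near 0 (over-forgetting) and an AUC near 1 (under-forgetting) are penalized, which is why the two sub-models score poorly on MIA while the merged model at 0.501 scores close to 1.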

Comparison of Merging Methods¶

Merging Method	Aggregate
Linear	0.244
DARE-Linear	0.440
DARE-TIES	0.561
Magnitude Prune	0.558
TIES	0.806

Density Parameter Ablation¶

Density	Effect
0.6	Over-pruning, MIA too low
0.8	Optimal balance
1.0	Introduces redundancy, MIA slightly high

Key Findings¶

The model merging paradigm is highly effective: In local experiments, the MIA AUC of the merged model reaches 0.501 (close to the ideal 0.5), and the Aggregate score rises from 0.481 and 0.504 for the two sub-models to 0.806.
TIES-Merging substantially outperforms other merging methods: Its Aggregate exceeds Linear by 0.562 and the next-best method, DARE-TIES, by 0.245.
Training process analysis: The Regurgitation Score and Knowledge Score decrease simultaneously in the early stages of training (epochs 0-0.8). After that, the Knowledge Score steadily rises while the Regurgitation Score fluctuates upward, reflecting a gradual divergence in the optimization directions of unlearning and retention.
Parameter shift direction analysis: The angle between the parameter change vectors before and after the training inflection point is approximately 70-85 degrees, indicating a significant shift in the optimization direction.
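
The angle analysis above can be illustrated with a small sketch: take the parameter change before the inflection point and the change after it, then compute the angle from their cosine similarity. The checkpoint values and function names below are hypothetical toy stand-ins, not the paper's parameters.

```python
import math

def delta(after, before):
    """Parameter change vector between two checkpoints."""
    return [a - b for a, b in zip(after, before)]

def angle_degrees(u, v):
    """Angle between two vectors, from their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp against rounding error
    return math.degrees(math.acos(cos))

# Toy 2-D "checkpoints" (hypothetical): start, inflection point, end of training.
theta_start = [0.0, 0.0]
theta_inflection = [1.0, 0.0]
theta_end = [1.2, 1.0]

before = delta(theta_inflection, theta_start)
after = delta(theta_end, theta_inflection)
print(angle_degrees(before, after))
```

An angle near 0 degrees would mean training keeps moving in the same direction, while an angle approaching 90 degrees means the later updates are nearly orthogonal to the earlier ones.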

Highlights & Insights¶

The idea of "merging two flawed models to get a good one" is simple yet profound, bypassing the difficulty of finding a perfect balance within a single model.
Detailed analyses of the training process (performance trajectory, loss dynamics, and parameter angle analysis) provide a deep understanding of the unlearning process.
Critical analysis of unlearning evaluation metrics is valuable: ROUGE measures textual overlap rather than knowledge leakage, and MIA also suffers from reliability issues.
The observation that over-forgetting causes model collapse (generating repetitive "6 6 6" outputs) serves as an important warning for real-world deployment.
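
The point above about ROUGE measuring textual overlap rather than knowledge leakage can be illustrated with a simplified ROUGE-L recall based on the longest common subsequence. This is a sketch with whitespace tokenization, not the task's official scorer, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    """Fraction of reference tokens recovered, in order, by the candidate."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

# Invented example: the paraphrase leaks the same fact with little token overlap.
reference = "Alice Smith lives at 12 Oak Street"
verbatim = "Alice Smith lives at 12 Oak Street"
paraphrase = "The home of Ms Smith is number 12 on Oak St"

print(rouge_l_recall(reference, verbatim))
print(rouge_l_recall(reference, paraphrase))
```

The paraphrase conveys the same sensitive fact yet scores well below the verbatim copy, so a low overlap score alone does not show that the knowledge has been removed.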

Limitations & Future Work¶

Over-forgetting is not completely resolved: The merged model may still produce degenerate outputs.
Unrelated knowledge side-effects: The unlearning operation degrades accuracy on unrelated knowledge (e.g., "What is the capital of France") from 0.88 to 0.35.
Reliability of evaluation metrics: ROUGE-L fails to detect semantic-level information leakage, and the reliability of MIA scores is questionable.
Online vs. Local discrepancy: The large gap between the online MIA Score (0.048) and the local one (0.997) is likely due to dataset mismatches.
The current paradigm lacks positive guidance (such as reinforcement learning), leaving the model to degenerate under negative optimization pressure.

NPO (Zhang et al., 2024) is the core unlearning algorithm; this work overcomes its issue of over/under-forgetting when used in isolation by using model merging.
TIES-Merging (Yadav et al., 2023), originally designed for multi-task model merging, is cleverly transferred to the unlearning scenario.
The International Scientific Report on the Safety of Advanced AI (Bengio et al., 2025) points out the insufficiency of current unlearning methods, which aligns with findings in this paper.
Future directions: Incorporating positive signals by combining data augmentation and reinforcement learning, and exploring unlearning on demand.

Rating¶

Novelty: ⭐⭐⭐⭐ The "over-forgetting + under-forgetting $\rightarrow$ merged balance" idea is simple and elegant.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Online competition validation + detailed local analysis + multi-dimensional ablation + comparison of merging methods.
Writing Quality: ⭐⭐⭐⭐ In-depth and thorough analysis, especially the discussions and reflections in Sec. 5-6.
Value: ⭐⭐⭐⭐ Inspiring for both methodology and evaluation in the machine unlearning field.