PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning¶
Conference: NeurIPS 2025 · arXiv: 2507.01271 · Code: None · Area: AI Safety / Machine Unlearning Evaluation · Keywords: Machine Unlearning, Large Multimodal Models, Pretrained Knowledge, Sustainability, Evaluation Benchmark
TL;DR¶
This paper proposes the PULSE evaluation protocol, which assesses existing unlearning methods for large multimodal models (LMMs) along two practically motivated dimensions: the forgetting of pretrained knowledge and the sustainability of repeated sequential unlearning. The findings reveal severe deficiencies in current methods: forgetting pretrained knowledge costs over 90% of general capability, and after five sequential unlearning operations, model generalization collapses almost entirely.
Background & Motivation¶
- Background: As large language models (LLMs) and LMMs become increasingly prevalent, training data may contain personal privacy information and copyright-protected content, making unlearning techniques a growing area of concern. Several unlearning methods have been proposed (e.g., GA, NPO, SIU), and evaluation benchmarks such as TOFU and MUSE exist for LLMs.
- Limitations of Prior Work: The LMM unlearning field lacks a practical evaluation framework. The only existing LMM unlearning benchmark, MLLMU-Bench, suffers from two critical limitations:
- Only considers fine-tuning knowledge forgetting: It evaluates only the forgetting of knowledge acquired during the most recent fine-tuning phase, without assessing forgetting of knowledge obtained during pretraining—yet in practice, information subject to deletion requests is very likely to have been learned during pretraining.
- Only considers single-round unlearning: In reality, unlearning requests arrive continuously (e.g., different users submitting data deletion requests over time), requiring multiple sequential unlearning operations on the same model.
- Goal: PULSE is designed specifically to address these two critical evaluation gaps.
Method¶
Overall Architecture¶
PULSE extends the conventional "fine-tune then unlearn" evaluation pipeline with two additional evaluation dimensions:
- Pretrained Knowledge Unlearning: Evaluates whether unlearning methods can effectively forget knowledge acquired during pretraining.
- Sustainability Evaluation: Divides the unlearning targets into multiple subsets and performs sequential unlearning operations, tracking model performance across rounds.
Key Designs¶
- Problem Formalization:
- Data is partitioned into a forget set \(\mathcal{D}_{\text{unlearn}}\) and a retain set \(\mathcal{D}_{\text{retain}}\).
- Evaluation covers two aspects: effectiveness (accuracy drop on \(\mathcal{D}_{\text{unlearn}}\)) and generalization (accuracy preservation on \(\mathcal{D}_{\text{retain}}\) and MMBench).
- Critical setting: regardless of whether image inputs are provided, the model should not leak any information about the forget targets; both multimodal and text-only tasks are therefore evaluated.
- Pretrained Knowledge Unlearning Design:
- Rather than selecting forget targets from fine-tuning data, this protocol selects from knowledge the model already possesses from pretraining.
- From the 153 real celebrities in the MLLMU-Bench dataset, the 45 individuals on which LLaVA-v1.5-13B achieves the highest accuracy are selected as the candidate pool (ensuring the model actually knows them from pretraining).
- 20 individuals form \(\mathcal{D}_{\text{unlearn}}\) and 25 form \(\mathcal{D}_{\text{retain}}\).
- Each individual is associated with 10 QA pairs (5 multimodal + 5 text-only).
- Sustainability Evaluation Design:
- In this setting, \(\mathcal{D}_{\text{unlearn}}\) comprises 50 individuals, divided into 5 subsets of 10 individuals each.
- Five sequential unlearning operations are performed on the model.
- Effectiveness and generalization metrics are tracked after each operation.
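The sustainability protocol above amounts to a simple evaluation loop. A minimal sketch, where `unlearn` and `evaluate` are hypothetical callables standing in for one unlearning operation and the effectiveness/generalization measurements (not the paper's actual code):

```python
def sequential_unlearning(model, forget_individuals, unlearn, evaluate,
                          n_rounds=5):
    # Split the forget set into equal-sized subsets (50 people -> 10 per
    # round) and apply one unlearning operation per round, logging the
    # metrics after each operation as PULSE does.
    size = len(forget_individuals) // n_rounds
    history = []
    for r in range(n_rounds):
        subset = forget_individuals[r * size:(r + 1) * size]
        model = unlearn(model, subset)   # one sequential unlearning op
        history.append(evaluate(model))  # e.g. forget / retain / MMBench acc
    return model, history
```

The point of tracking `history` rather than only the final state is that sustainability failures show up as a round-by-round decay of the generalization metrics, not just a bad endpoint.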
Loss & Training¶
Three unlearning methods are evaluated:
- GA (Gradient Ascent): updates parameters in the direction opposite to the gradient on \(\mathcal{D}_{\text{unlearn}}\).
- GA+KLR: augments GA with KL-divergence regularization to keep the updated model close to the original.
- NPO: a preference-optimization approach that treats forget data as negative examples.
LLaVA-v1.5-13B serves as the base model; both fine-tuning and unlearning are performed using LoRA.
Key Experimental Results¶
Main Results (Fine-tuned vs. Pretrained Knowledge Unlearning)¶
| Knowledge Type | Method | \(\mathcal{D}_{\text{unlearn}}\) Forget Rate | \(\mathcal{D}_{\text{retain}}\) Retention | MMBench Retention |
|---|---|---|---|---|
| Fine-tuned | GA | High (effective) | ~70% | ~90% |
| Fine-tuned | GA+KLR | Moderate | ~75% | ~92% |
| Fine-tuned | NPO | High | ~72% | ~91% |
| Pretrained | GA | High (effective) | Significant drop | <10% (catastrophic) |
| Pretrained | GA+KLR | Moderate | Drop | <10% |
| Pretrained | NPO | High | Drop | <10% |
Ablation Study (Modality Gap & Parameter Update Target)¶
| Update Target | Method | \(\mathcal{D}_{\text{unlearn}}\) Multi↓ | \(\mathcal{D}_{\text{unlearn}}\) Text↓ | MMBench↑ |
|---|---|---|---|---|
| Before Unlearning | — | 78.0 | 76.8 | 75.1 |
| Proj+LLM | GA | 9.6 | 35.2 | 71.1 |
| LLM only | GA | 24.8 | 33.2 | 48.8 |
Key Findings¶
- Pretrained knowledge is extremely difficult to forget: Although accuracy on \(\mathcal{D}_{\text{unlearn}}\) does decrease, MMBench scores collapse by over 90%, indicating that forgetting pretrained knowledge comes at the cost of almost entirely destroying the model's general multimodal capability.
- Sustainability is wholly inadequate: After five sequential unlearning rounds, generalization metrics (\(\mathcal{D}_{\text{retain}}\) and MMBench) for all methods approach zero, demonstrating that current methods are entirely incapable of handling continuous unlearning requests in real-world settings.
- Cross-modal unlearning imbalance: When updating Proj+LLM, multimodal task accuracy drops from 78.0% to 9.6%, while text-only task accuracy only drops to 35.2%—suggesting that existing methods may merely "break the alignment between images and knowledge" rather than truly forgetting the knowledge itself.
- Parameter selection trade-off: updating only the LLM drags MMBench down to 48.8%, whereas jointly updating the projection layer and the LLM keeps it at 71.1%. A plausible explanation is that an updatable projection matrix lets the model "shortcut" forgetting by severing cross-modal connections instead of erasing the knowledge inside the LLM.
Highlights & Insights¶
- Filling a critical evaluation gap: PULSE is the first systematic evaluation protocol for LMM unlearning that covers both pretrained knowledge forgetting and sustainability.
- Alarming empirical findings: The discovery that pretrained knowledge unlearning causes 90%+ capability loss directly questions the practical viability of current unlearning methods.
- Multimodal vs. text-only analysis: The work reveals a previously overlooked issue in LMM unlearning—forgetting a multimodal task is not equivalent to forgetting the same knowledge in text-only tasks.
- Practical evaluation design: Forget targets are selected based on the model's actual behavior rather than requiring access to pretraining data, making the protocol more applicable to real deployment scenarios.
Limitations & Future Work¶
- Only LLaVA-v1.5-13B is evaluated; the behavior of other LMMs (e.g., LLaVA-NeXT, InternVL) may differ.
- Only three relatively basic unlearning methods (GA, GA+KLR, NPO) are tested; more advanced approaches are not covered.
- Methods targeting only multimodal tasks, such as SIU, are excluded; while the rationale is sound, this limits the comprehensiveness of the method comparison.
- The dataset scale is small (20–50 individuals), which may be insufficient for high-confidence statistical conclusions.
- The sustainability experiments fix a specific configuration (10 individuals per round, 5 rounds); more variants (e.g., different batch sizes or more rounds) deserve exploration.
- No potential solutions (e.g., elastic weight consolidation, model distillation) are investigated; the paper solely diagnoses the problem.
Related Work & Insights¶
- MUSE proposes a multi-dimensional evaluation for LLM unlearning that includes sustainability; PULSE extends this perspective to the multimodal domain.
- TOFU evaluates unlearning on LLMs using fictional persona data; MLLMU-Bench is the first LMM unlearning benchmark.
- Yao et al. (2024) explore pretrained knowledge unlearning in LLMs but require access to pretraining data.
- Core takeaway: current unlearning techniques are far from practically viable for LMMs, and fundamental methodological advances are needed.
Rating¶
- Novelty: ⭐⭐⭐⭐ (The new evaluation perspective is valuable, though no algorithmic contribution is made)
- Experimental Thoroughness: ⭐⭐⭐ (Coverage of models and methods is relatively narrow)
- Writing Quality: ⭐⭐⭐⭐ (Problem motivation is clear; experimental design is sound)
- Value: ⭐⭐⭐⭐ (Provides important benchmarking significance for the LMM unlearning community)