Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models

TL;DR

FADE proposes an adjacency-aware fine-grained concept erasure framework. By leveraging a Concept Neighborhood to identify semantically proximate categories and designing Mesh Modules with a triple loss (Erasing + Adjacency + Guidance), FADE precisely erases the target concept while preserving the generation capabilities of semantically related concepts. It achieves at least a 12% improvement in adjacency preservation compared to SOTA methods.

Background & Motivation

Core Problem: Text-to-image diffusion models (such as Stable Diffusion) are trained on large-scale datasets and inevitably learn sensitive, inappropriate, or copyrighted concepts. It is essential to selectively erase these concepts without retraining the models from scratch.
Limitations of Prior Work: Existing unlearning methods (ESD, CA, FMN, SPM, Receler) cause collateral damage to semantically related categories while erasing the target concept (referred to here as the adjacency problem).
- For example, when erasing "Golden Retriever", the generation capability for related breeds like "Labrador Retriever" and "Flat-Coated Retriever" is also destroyed.
- This "collateral forgetting" is particularly severe in fine-grained scenarios due to the subtle inter-class differences.
Goal: Achieve $P_\theta(c_{tar}|x) \to 0$ (target erasure) while maintaining $P_\theta(\mathcal{A}(c_{tar})|x) \approx P_{\theta_{original}}(\mathcal{A}(c_{tar})|x)$ (adjacency preservation).
Motivation: To advance the unlearning problem from a coarse-grained focus (locality) to a fine-grained one (adjacency), which is a key research direction that was previously unformalized.

Method

Overall Architecture

FADE organizes model knowledge into three disjoint subsets:
- Unlearning Set $\mathcal{D}_u$: Images generated for the target concept $c_{tar}$.
- Adjacency Set $\mathcal{D}_a$: Semantically adjacent categories identified by the Concept Neighborhood.
- Retain Set $\mathcal{D}_r$: Unrelated concepts, used to test broad generalization.

A triple loss is jointly optimized across these three subsets using Mesh Modules (lightweight LoRA-based adapters).
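The summary describes Mesh Modules only as lightweight LoRA-based adapters. As a rough illustration of that general idea, and not of the authors' implementation, the sketch below adds a generic low-rank update to a frozen weight matrix. The class name `LoRALinear`, the rank, the scaling factor, and the layer sizes are all assumptions made for this example.

```python
import numpy as np

class LoRALinear:
    """Generic LoRA-style linear layer (hypothetical sketch, not the paper's Mesh Module).

    Computes y = x @ W.T + scale * (x @ A.T) @ B.T, where W stays frozen
    and only the low-rank factors A and B would be trained.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))    # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable up-projection, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T                 # output of the original layer
        delta = (x @ self.A.T) @ self.B.T        # low-rank correction
        return base + self.scale * delta

    def num_trainable(self):
        return self.A.size + self.B.size

W = np.random.default_rng(1).normal(size=(64, 32))   # stand-in for a pretrained weight
layer = LoRALinear(W, rank=4)
x = np.ones((2, 32))
y = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the original layer exactly, and only `A.size + B.size` values (384 here, against 2048 in `W`) would need to be updated.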

Key Designs

1. Concept Neighborhood (Adjacency Set Construction)

For each concept $c \in \mathcal{C}$, the original model is used to generate $m$ images.
Features are extracted with a pretrained image encoder $\phi$ to compute the average feature vector $\bar{\mathbf{f}}^c$ for each category.
Ranking is determined via cosine similarity $L(c_{tar}, c) = \langle \bar{\mathbf{f}}^{c_{tar}}, \bar{\mathbf{f}}^c \rangle / (\lVert\bar{\mathbf{f}}^{c_{tar}}\rVert \, \lVert\bar{\mathbf{f}}^c\rVert)$.
The top-K most similar concepts are selected to construct the adjacency set $\mathcal{A}(c_{tar})$.
Theoretical Support: A proof is provided showing that k-NN in the latent feature space converges to the optimal Naive Bayes classifier (Theorem 1).
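
The construction steps above can be sketched as follows. This is a minimal illustration that assumes the encoder features of the generated images are already available as arrays; the function name `concept_neighborhood` and the toy feature values are made up for the example and do not come from the paper.

```python
import numpy as np

def concept_neighborhood(features_by_concept, target, k=5):
    """Return the k concepts whose mean feature is most cosine-similar to the target's.

    features_by_concept maps a concept name to an array of shape (m, d):
    the features of m generated images (assumed to be precomputed).
    """
    means = {c: np.asarray(f, dtype=float).mean(axis=0)
             for c, f in features_by_concept.items()}
    f_tar = means[target]

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Score every other concept against the target and rank in descending order.
    scores = {c: cosine(f_tar, f) for c, f in means.items() if c != target}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy 3-dimensional features standing in for encoder outputs (invented numbers).
toy = {
    "golden retriever":      np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]),
    "labrador retriever":    np.array([[0.9, 0.3, 0.0], [1.0, 0.2, 0.1]]),
    "flat-coated retriever": np.array([[0.8, 0.4, 0.1], [0.8, 0.5, 0.0]]),
    "tabby cat":             np.array([[0.1, 0.2, 1.0], [0.0, 0.1, 0.9]]),
}
neighbors, sims = concept_neighborhood(toy, "golden retriever", k=2)
print(neighbors)
```

With these toy features the two retriever breeds rank above the cat, so they form the adjacency set for `k=2`.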

2. Mesh Modules (Triple Loss Design)

Mesh Modules utilize LoRA-style lightweight adapters $\theta_M^{\mathcal{U}}$ to optimize the following three losses:

Erasing Loss $\mathcal{L}_{er}$:

$$\mathcal{L}_{er} = \max\left(0, \frac{1}{|\mathcal{A}|}\sum_{x \in \mathcal{A}}\lVert\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^x\rVert^2 - \frac{1}{|\mathcal{D}_u|}\sum_{x \in \mathcal{D}_u}\lVert\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^x\rVert^2 + \delta\right)$$

A hinge term with margin $\delta$, similar to a triplet loss: it pushes the adapted model's noise prediction for the target concept away from the original predictions on the unlearning set while keeping it close to the original predictions on the adjacency set.

Guidance Loss $\mathcal{L}_{guid}$:

$$\mathcal{L}_{guid} = \lVert\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^{c_{null}}\rVert^2$$

Guides the target concept toward the "null" concept direction, eliminating the need to specify an alternative concept.

Adjacency Loss $\mathcal{L}_{adj}$:

$$\mathcal{L}_{adj} = \frac{1}{|\mathcal{A}|}\sum_{x \in \mathcal{A}}\lVert\epsilon_{\theta_M^{\mathcal{U}}}^x - \epsilon_\theta^x\rVert^2$$

A regularization term that keeps the updated model's noise predictions for adjacent concepts consistent with those of the original model.
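
To make the three terms concrete, here is a small sketch that evaluates them on toy vectors. It assumes the noise predictions have already been computed and are passed in as plain arrays; the function names, the margin values, and the numbers are illustrative assumptions, and no diffusion model is involved.

```python
import numpy as np

def sq_dist(a, b):
    """Squared L2 distance between two noise predictions."""
    return float(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def erasing_loss(eps_new_tar, eps_orig_adj, eps_orig_unlearn, delta=1.0):
    """Hinge term: adjacency predictions act as positives, unlearning-set predictions as negatives."""
    d_adj = np.mean([sq_dist(eps_new_tar, e) for e in eps_orig_adj])
    d_unl = np.mean([sq_dist(eps_new_tar, e) for e in eps_orig_unlearn])
    return float(max(0.0, d_adj - d_unl + delta))

def guidance_loss(eps_new_tar, eps_orig_null):
    """Distance between the adapted target prediction and the original null-concept prediction."""
    return sq_dist(eps_new_tar, eps_orig_null)

def adjacency_loss(eps_new_adj, eps_orig_adj):
    """Mean drift of adjacent-concept predictions between adapted and original model."""
    return float(np.mean([sq_dist(n, o) for n, o in zip(eps_new_adj, eps_orig_adj)]))

# Toy 2-dimensional "noise predictions" (invented values).
tar_new = [0.0, 0.0]                       # adapted model, target prompt
adj_orig = [[1.0, 0.0], [0.0, 1.0]]        # original model, adjacent prompts
unl_orig = [[2.0, 0.0]]                    # original model, unlearning set
adj_new = [[1.0, 0.0], [0.0, 2.0]]         # adapted model, adjacent prompts

l_er_small_margin = erasing_loss(tar_new, adj_orig, unl_orig, delta=1.0)
l_er_large_margin = erasing_loss(tar_new, adj_orig, unl_orig, delta=5.0)
l_guid = guidance_loss(tar_new, [0.0, 1.0])
l_adj = adjacency_loss(adj_new, adj_orig)
```

In this toy case the mean distance to the adjacency set is 1 and to the unlearning set is 4, so the hinge is inactive for a margin of 1 and becomes positive only for the larger margin of 5.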

3. ERB Evaluation Metric

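$$\text{ERB} = \frac{2 \cdot A_{er} \cdot \hat{A}_{adj}}{A_{er} + \hat{A}_{adj} + \eta}$$

ERB is formulated as a harmonic mean that balances erasure performance ($A_{er}$) and adjacency preservation ($\hat{A}_{adj}$); $\eta$ is a constant in the denominator, presumably a small value for numerical stability. It is proposed in this study as a new benchmarking metric.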
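
A short worked example shows how the metric behaves. The function below is a direct transcription of the formula; the default value of `eta` and the input scores are assumptions chosen for illustration, not numbers reported in the paper.

```python
def erb(a_er, a_adj, eta=1e-8):
    """Harmonic-mean-style balance of erasing accuracy and adjacency accuracy."""
    return 2.0 * a_er * a_adj / (a_er + a_adj + eta)

balanced = erb(100.0, 92.0)    # strong erasure and strong adjacency preservation
collapsed = erb(100.0, 2.0)    # strong erasure, adjacency almost destroyed
degenerate = erb(0.0, 0.0)     # eta keeps the denominator non-zero

print(round(balanced, 2), round(collapsed, 2), degenerate)
```

Because the score is a harmonic mean, perfect erasure cannot compensate for poor adjacency preservation: the second case lands below 4 even though $A_{er}$ is 100.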

Loss & Training

\[\mathcal{L}_{FADE} = \lambda_{er}\mathcal{L}_{er} + \lambda_{adj}\mathcal{L}_{adj} + \lambda_{guid}\mathcal{L}_{guid}\]

Key Experimental Results

Main Results (Tab. 1: Fine-Grained Unlearning)

Evaluation covers three datasets (Stanford Dogs, Oxford Flowers, CUB), with 3 target classes selected from each:

| Method | Average ERB Score |
|---|---|
| ESD (ICCV'23) | ~40 |
| FMN (CVPRw'24) | ~4 |
| CA (ICCV'23) | ~64 |
| UCE (WACV'24) | ~61 |
| SPM (CVPR'24) | ~69 |
| Receler (ECCV'24) | ~3 |
| FADE (Ours) | ~96 |

FADE achieves ERB >94 on all 9 target classes, whereas the strongest baseline, SPM, averages only ~69.

Coarse-Grained Unlearning (Imagenette, Tab. 2)

| Method | Target Acc↓ | Others Acc↑ |
|---|---|---|
| ESD | 0.00-0.10 | 65-70 |
| FADE | 0.00 | 86-89 |

While completely erasing target classes, FADE significantly outperforms all baselines in terms of classification accuracy for other classes.

Key Findings

FADE achieves at least a 12% improvement in ERB over SOTA (SPM), and improvements exceeding 30% on certain classes.
Although Receler performs well in erasure ($A_{er}$=100%), its adjacency preservation is extremely poor ($\hat{A}_{adj}$<7%), leading to an ERB score of only ~3.
FADE also exhibits the highest robustness across 4 target classes in ImageNet-1k, maintaining high adjacency accuracy even when structural similarity exceeds 90%.
In explicit/inappropriate content erasure experiments on the I2P dataset, FADE also demonstrates a significant advantage over other baselines.

Highlights & Insights

Formalization of Adjacency-Aware Unlearning: This is the first work to propose and formalize adjacency as an independent dimension, filling a critical gap in precise unlearning research.
Elegant Design of Concept Neighborhood: The adjacency set is constructed automatically from the feature-space similarity of generated images, bypassing the need for manual annotation or predefined taxonomies.
Integration of Theory and Practice: The theoretical proof of k-NN converging to a Naive Bayes classifier provides a solid mathematical foundation for the Concept Neighborhood approach.
Lightweight and Efficient: The LoRA-based Mesh Modules update only a small subset of parameters, ensuring low computational overhead.

Limitations & Future Work

K-Value Sensitivity: The adjacency set size (K=5) is a hyperparameter that may need to be tuned separately for each dataset.
Dependency on Classifiers: Evaluating erasure performance requires fine-tuning classifiers, which increases experimental complexity.
Multi-Concept Erasure: The current framework mainly focuses on single-concept erasure; the interaction effects of joint multi-concept erasure remain under-explored.
Adversarial Robustness: Resistance to adversarial prompts is not evaluated.

Related Work & Insights

ESD (ICCV'23): Erases concepts using negative guidance, providing strong erasure capabilities but severely damaging adjacent classes.
SPM (CVPR'24): Employs lightweight Membrane adapters, which inspired the design of the Mesh Modules in this work.
Receler (ECCV'24): Focuses on adversarially robust concept erasure, but its overly aggressive erasure leads to adjacency collapse.
Insight: The concept of "precision surgery" in AI safety—removing harmful capabilities without harming benign ones—requires explicit modeling of the boundaries for protected concepts.

Rating

⭐⭐⭐⭐ — Well-defined problem setting, rigorous methodology, and comprehensive experimentation. This work is the first to systematically address the adjacency problem in diffusion model unlearning. The proposed ERB metric also provides a unified benchmark for future research.