Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models¶
TL;DR¶
FADE proposes an adjacency-aware fine-grained concept erasure framework. By leveraging a Concept Neighborhood to identify semantically proximate categories and designing Mesh Modules with a triple loss (Erasing + Adjacency + Guidance), FADE precisely erases the target concept while preserving the generation capabilities of semantically related concepts. It achieves at least a 12% improvement in adjacency preservation compared to SOTA methods.
Background & Motivation¶
- Core Problem: Text-to-image diffusion models (such as Stable Diffusion) are trained on large-scale datasets and inevitably learn sensitive, inappropriate, or copyrighted concepts. It is essential to selectively erase these concepts without retraining the models from scratch.
- Limitations of Prior Work: Existing unlearning methods (ESD, CA, FMN, SPM, Receler) suffer from collateral damage to semantically related categories while erasing target concepts (referred to as the adjacency problem).
- For example, when erasing "Golden Retriever", the generation capability for related breeds like "Labrador Retriever" and "Flat-Coated Retriever" is also destroyed.
- This "collateral forgetting" is particularly severe in fine-grained scenarios due to the subtle inter-class differences.
- Goal: Achieve \(P_\theta(c_{tar}|x) \to 0\) (target erasure) while maintaining \(P_\theta(\mathcal{A}(c_{tar})|x) \approx P_{\theta_{original}}(\mathcal{A}(c_{tar})|x)\) (adjacency preservation).
- Motivation: Advance unlearning from a coarse-grained focus (locality) to a fine-grained one (adjacency), a research direction that had not previously been formalized.
Method¶
Overall Architecture¶
FADE organizes model knowledge into three disjoint subsets:
- Unlearning Set \(\mathcal{D}_u\): Images generated by the target concept \(c_{tar}\).
- Adjacency Set \(\mathcal{D}_a\): Semantically adjacent categories identified by the Concept Neighborhood.
- Retain Set \(\mathcal{D}_r\): Irrelevant concepts, functioning as a test for broad generalization.
A triple loss is jointly optimized across these three subsets using Mesh Modules (lightweight LoRA-based adapters).
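The Mesh Modules are described here only as lightweight LoRA-based adapters, so the following is a generic low-rank adapter sketch, not the FADE implementation. The class name `LoRALinear` and the parameters `rank` and `scale` are hypothetical names chosen for illustration. The idea it shows: the pretrained weight stays frozen and only two small matrices are trained, with the up-projection initialized to zero so the adapted layer initially reproduces the original output.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B (A x)."""

    def __init__(self, weight, rank, scale=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        # A: small random down-projection, shape (rank, in_dim); trainable
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        # B: zero-initialized up-projection, shape (out_dim, rank); trainable
        self.up = [[0.0] * rank for _ in range(out_dim)]
        self.scale = scale

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = LoRALinear(weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], rank=1)
initial_output = layer.forward([1.0, 2.0, 3.0])  # identical to the frozen layer's output

# Simulate a trained adapter by overwriting the two low-rank factors.
layer.down = [[1.0, 0.0, 0.0]]
layer.up = [[1.0], [0.0]]
adapted_output = layer.forward([1.0, 2.0, 3.0])
```

Because only `down` and `up` would receive gradients, the number of trainable values grows with `rank * (in_dim + out_dim)` rather than `in_dim * out_dim`.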
Key Designs¶
1. Concept Neighborhood (Adjacency Set Construction)¶
- For each concept \(c \in \mathcal{C}\), the original model is used to generate \(m\) images.
- Features are extracted with a pretrained image encoder \(\phi\) to compute the average feature vector \(\bar{\mathbf{f}}^c\) for each category.
- Ranking is determined via cosine similarity \(L(c_{tar}, c) = \langle \bar{\mathbf{f}}^{c_{tar}}, \bar{\mathbf{f}}^c \rangle / (\|\bar{\mathbf{f}}^{c_{tar}}\| \, \|\bar{\mathbf{f}}^c\|)\).
- The top-K most similar concepts are selected to construct the adjacency set \(\mathcal{A}(c_{tar})\).
- Theoretical Support: A proof is provided showing that k-NN in the latent feature space converges to the optimal Naive Bayes classifier (Theorem 1).
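The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the 2-D vectors are made-up stand-ins for the encoder features \(\phi\) would produce, and the function names (`mean_vector`, `cosine_similarity`, `concept_neighborhood`) are hypothetical rather than taken from the paper's code.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_neighborhood(features_by_concept, target, k):
    """Return the k concepts whose mean feature is most similar to the target's."""
    means = {c: mean_vector(f) for c, f in features_by_concept.items()}
    scores = {
        c: cosine_similarity(means[target], m)
        for c, m in means.items()
        if c != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy features: two "generated images" per concept (invented values).
toy_features = {
    "golden retriever": [[1.0, 0.1], [0.9, 0.2]],
    "labrador retriever": [[0.9, 0.3], [1.0, 0.2]],
    "flat-coated retriever": [[0.8, 0.4], [0.7, 0.5]],
    "tabby cat": [[0.1, 1.0], [0.2, 0.9]],
}
neighbors = concept_neighborhood(toy_features, "golden retriever", k=2)
print(neighbors)
```

With these toy values the two retriever breeds rank above the cat, which mirrors the intended behavior: visually close categories end up in the adjacency set without any manual taxonomy.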
2. Mesh Modules (Triple Loss Design)¶
Mesh Modules utilize LoRA-style lightweight adapters \(\theta_M^{\mathcal{U}}\) to optimize the following three losses:
Erasing Loss \(\mathcal{L}_{er}\):

$$\mathcal{L}_{er} = \max\left(0, \frac{1}{|\mathcal{A}|}\sum_{x \in \mathcal{A}}\left\|\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^x\right\|^2 - \frac{1}{|\mathcal{D}_u|}\sum_{x \in \mathcal{D}_u}\left\|\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^x\right\|^2 + \delta\right)$$

Forces the noise prediction of the target concept away from its original location while minimizing the offset with respect to the adjacency set (similar to a triplet loss).

Guidance Loss \(\mathcal{L}_{guid}\):

$$\mathcal{L}_{guid} = \left\|\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^{c_{null}}\right\|^2$$

Guides the target concept toward the "null" concept direction, eliminating the need to specify alternative concepts.

Adjacency Loss \(\mathcal{L}_{adj}\):

$$\mathcal{L}_{adj} = \frac{1}{|\mathcal{A}|}\sum_{x \in \mathcal{A}}\left\|\epsilon_{\theta_M^{\mathcal{U}}}^x - \epsilon_\theta^x\right\|^2$$

Regularization term to ensure that adjacent concepts maintain noise predictions in the updated model consistent with the original model.
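To make the three terms concrete, here is a small numeric sketch that evaluates each loss on toy noise predictions. It is an illustration under assumptions, not the authors' code: noise predictions are plain Python lists with invented values, the function and argument names are hypothetical, and the terms are left uncombined because this summary does not state how they are weighted.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two noise predictions."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def erasing_loss(eps_adapted_target, eps_orig_adjacent, eps_orig_unlearn, margin):
    """Triplet-style hinge: stay near adjacent predictions, move away from D_u."""
    pull = mean(sq_dist(eps_adapted_target, e) for e in eps_orig_adjacent)
    push = mean(sq_dist(eps_adapted_target, e) for e in eps_orig_unlearn)
    return max(0.0, pull - push + margin)


def guidance_loss(eps_adapted_target, eps_orig_null):
    """Distance between the adapted target prediction and the null-prompt prediction."""
    return sq_dist(eps_adapted_target, eps_orig_null)


def adjacency_loss(eps_adapted_adjacent, eps_orig_adjacent):
    """Mean drift of adjacent-concept predictions between adapted and original model."""
    return mean(sq_dist(a, o) for a, o in zip(eps_adapted_adjacent, eps_orig_adjacent))


# Toy 2-D "noise predictions" (made-up numbers).
eps_adapted_target = [0.0, 0.0]                  # adapted model, target prompt
eps_orig_adjacent = [[1.0, 0.0], [0.0, 1.0]]     # original model, adjacent prompts
eps_orig_unlearn = [[0.5, 0.5]]                  # original model, unlearning set
eps_orig_null = [0.0, 1.0]                       # original model, null prompt
eps_adapted_adjacent = [[1.0, 0.5], [0.0, 1.0]]  # adapted model, adjacent prompts

l_er = erasing_loss(eps_adapted_target, eps_orig_adjacent, eps_orig_unlearn, margin=1.0)
l_guid = guidance_loss(eps_adapted_target, eps_orig_null)
l_adj = adjacency_loss(eps_adapted_adjacent, eps_orig_adjacent)
print(l_er, l_guid, l_adj)
```

In this toy setup the hinge is active (the target prediction is still closer to the unlearning set than the margin allows), and the adjacency term is nonzero only because the first adjacent prediction drifted after adaptation.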
3. ERB Evaluation Metric¶
$$\text{ERB} = \frac{2 \cdot A_{er} \cdot \hat{A}_{adj}}{A_{er} + \hat{A}_{adj} + \eta}$$

ERB is formulated as a harmonic mean that balances erasure performance (\(A_{er}\)) against adjacency preservation (\(\hat{A}_{adj}\)), and is proposed in this study as a new benchmarking metric.
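A short sketch of the ERB computation helps show why a method cannot score well by excelling on only one axis. The value of \(\eta\) is not given in this summary, so the default `eta=1e-8` below is an assumption (a small constant that avoids division by zero), and the input accuracies are illustrative percentages rather than reported results.

```python
def erb(erasing_acc, adjacency_acc, eta=1e-8):
    """Harmonic-mean style balance of erasure and adjacency accuracy.

    eta is an assumed small constant guarding against a zero denominator.
    """
    return 2 * erasing_acc * adjacency_acc / (erasing_acc + adjacency_acc + eta)


balanced = erb(95.0, 97.0)   # both axes strong, so the score stays high
lopsided = erb(100.0, 1.5)   # perfect erasure, collapsed adjacency
degenerate = erb(0.0, 0.0)   # eta keeps this well-defined
print(round(balanced, 2), round(lopsided, 2), degenerate)
```

The lopsided case lands near 3 despite perfect erasure, which is the behavior a harmonic mean is chosen for: the weaker of the two accuracies dominates the score.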
Loss & Training¶
Key Experimental Results¶
Main Results (Tab. 1 — Fine-Grained Unlearning)¶
The evaluation covers three datasets (Stanford Dogs, Oxford Flowers, CUB), with three target classes selected from each:
| Method | Average ERB Score |
|---|---|
| ESD (ICCV'23) | ~40 |
| FMN (CVPRw'24) | ~4 |
| CA (ICCV'23) | ~64 |
| UCE (WACV'24) | ~61 |
| SPM (CVPR'24) | ~69 |
| Receler (ECCV'24) | ~3 |
| FADE (Ours) | ~96 |
FADE achieves ERB >94 across all 9 target classes, whereas the strongest baseline, SPM, averages only ~69.
Coarse-Grained Unlearning (Imagenette, Tab. 2)¶
| Method | Target Acc↓ | Others Acc↑ |
|---|---|---|
| ESD | 0.00-0.10 | 65-70 |
| FADE | 0.00 | 86-89 |
While completely erasing target classes, FADE significantly outperforms all baselines in terms of classification accuracy for other classes.
Key Findings¶
- FADE achieves at least a 12% improvement in ERB over SOTA (SPM), and improvements exceeding 30% on certain classes.
- Although Receler performs well in erasure (\(A_{er}\)=100%), its adjacency preservation is extremely poor (\(\hat{A}_{adj}\)<7%), leading to an ERB score of only ~3.
- FADE also exhibits the highest robustness across 4 target classes in ImageNet-1k, maintaining high adjacency accuracy even when structural similarity exceeds 90%.
- In explicit/inappropriate content erasure experiments on the I2P dataset, FADE also demonstrates a significant advantage over other baselines.
Highlights & Insights¶
- Formalization of Adjacency-Aware Unlearning: This is the first work to propose and formalize adjacency as an independent dimension, filling a critical gap in precise unlearning research.
- Elegant Design of Concept Neighborhood: The adjacency set is constructed automatically from the feature-space similarity of generated images, bypassing the need for manual annotation or predefined taxonomies.
- Integration of Theory and Practice: The theoretical proof of k-NN converging to a Naive Bayes classifier provides a solid mathematical foundation for the Concept Neighborhood approach.
- Lightweight and Efficient: The LoRA-based Mesh Modules update only a small subset of parameters, ensuring low computational overhead.
Limitations & Future Work¶
- K-Value Sensitivity: The adjacency set size (K=5) is a hyperparameter that may need to be retuned for different datasets.
- Dependency on Classifiers: Evaluating erasure performance requires fine-tuning classifiers, which increases experimental complexity.
- Multi-Concept Erasure: The current framework mainly focuses on single-concept erasure; the interaction effects of joint multi-concept erasure remain under-explored.
- Adversarial Robustness: Resistance to adversarial prompts is not evaluated.
Related Work & Insights¶
- ESD (ICCV'23): Erases concepts using negative guidance, providing strong erasure capabilities but severely damaging adjacent classes.
- SPM (CVPR'24): Employs lightweight Membrane adapters, which inspired the design of the Mesh Modules in this work.
- Receler (ECCV'24): Focuses on adversarially robust concept erasure, but its overly aggressive erasure leads to adjacency collapse.
- Insight: The concept of "precision surgery" in AI safety—removing harmful capabilities without harming benign ones—requires explicit modeling of the boundaries for protected concepts.
Rating¶
⭐⭐⭐⭐ — Well-defined problem setting, rigorous methodology, and comprehensive experimentation. This work is the first to systematically address the adjacency problem in diffusion model unlearning. The proposed ERB metric also provides a unified benchmark for future research.