Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models

TL;DR

FADE proposes an adjacency-aware fine-grained concept erasure framework. By leveraging a Concept Neighborhood to identify semantically proximate categories and designing Mesh Modules with a triple loss (Erasing + Adjacency + Guidance), FADE precisely erases the target concept while preserving the generation capabilities of semantically related concepts. It achieves at least a 12% improvement in adjacency preservation compared to SOTA methods.

Background & Motivation

  • Core Problem: Text-to-image diffusion models (such as Stable Diffusion) are trained on large-scale datasets and inevitably learn sensitive, inappropriate, or copyrighted concepts. It is essential to selectively erase these concepts without retraining the models from scratch.
  • Limitations of Prior Work: Existing unlearning methods (ESD, CA, FMN, SPM, Receler) suffer from collateral damage to semantically related categories while erasing target concepts (referred to as the adjacency problem).
    • For example, when erasing "Golden Retriever", the generation capability for related breeds like "Labrador Retriever" and "Flat-Coated Retriever" is also destroyed.
    • This "collateral forgetting" is particularly severe in fine-grained scenarios due to the subtle inter-class differences.
  • Goal: Achieve \(P_\theta(c_{tar}|x) \to 0\) (target erasure) while maintaining \(P_\theta(\mathcal{A}(c_{tar})|x) \approx P_{\theta_{original}}(\mathcal{A}(c_{tar})|x)\) (adjacency preservation).
  • Motivation: To advance the unlearning problem from a coarse-grained focus (locality) to a fine-grained one (adjacency), which is a key research direction that was previously unformalized.

Method

Overall Architecture

FADE organizes model knowledge into three disjoint subsets:

  • Unlearning Set \(\mathcal{D}_u\): Images generated by the target concept \(c_{tar}\).
  • Adjacency Set \(\mathcal{D}_a\): Semantically adjacent categories identified by the Concept Neighborhood.
  • Retain Set \(\mathcal{D}_r\): Irrelevant concepts, functioning as a test for broad generalization.

A triple loss is jointly optimized across these three subsets using Mesh Modules (lightweight LoRA-based adapters).
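To make "lightweight LoRA-based adapter" concrete, here is a minimal NumPy sketch of a generic LoRA-style linear layer. It is not the Mesh Module implementation, whose internals are not described in this summary; the class name `LoRALinear` and the `rank`, `alpha`, and `seed` parameters are illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a low-rank update (alpha / rank) * B @ A.

    Generic LoRA sketch; names and defaults are assumptions, not FADE's code.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen base weights, never updated
        # Trainable factors: A is small random, B starts at zero so the
        # adapter initially leaves the base layer's output unchanged.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        return self.A.size + self.B.size


W = np.random.default_rng(1).normal(size=(8, 16))
layer = LoRALinear(W, rank=2)
x = np.ones((3, 16))
print(layer(x).shape, layer.num_trainable(), W.size)
```

Only `A` and `B` would be trained, which is why adapters of this kind touch far fewer parameters than full fine-tuning.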

Key Designs

1. Concept Neighborhood (Adjacency Set Construction)

  • For each concept \(c \in \mathcal{C}\), the original model is used to generate \(m\) images.
  • Features are extracted with a pretrained image encoder \(\phi\) to compute the average feature vector \(\bar{\mathbf{f}}^c\) for each category.
  • Ranking is determined via cosine similarity \(L(c_{tar}, c) = \langle \bar{\mathbf{f}}^{c_{tar}}, \bar{\mathbf{f}}^c \rangle / (|\bar{\mathbf{f}}^{c_{tar}}||\bar{\mathbf{f}}^c|)\).
  • The top-K most similar concepts are selected to construct the adjacency set \(\mathcal{A}(c_{tar})\).
  • Theoretical Support: A proof is provided showing that k-NN in the latent feature space converges to the optimal Naive Bayes classifier (Theorem 1).
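The ranking steps above can be sketched as follows. This is a minimal illustration, assuming image features have already been extracted by some encoder \(\phi\); the function name `concept_neighborhood` and the toy 3-dimensional features are hypothetical stand-ins, not data from the paper.

```python
import numpy as np


def concept_neighborhood(features_by_concept, target, k=5):
    """Return the k concepts whose mean feature is most cosine-similar to the target's.

    features_by_concept maps a concept name to an (m, d) array-like of
    per-image features (assumed precomputed by an image encoder).
    """
    means = {
        c: np.asarray(f, dtype=float).mean(axis=0)
        for c, f in features_by_concept.items()
    }
    f_tar = means[target]
    scores = {}
    for c, f in means.items():
        if c == target:
            continue  # the target is not its own neighbor
        scores[c] = float(f @ f_tar / (np.linalg.norm(f) * np.linalg.norm(f_tar)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy features (m = 2 images per concept), purely illustrative.
toy_features = {
    "golden retriever": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "labrador retriever": [[0.9, 0.2, 0.0], [1.0, 0.1, 0.1]],
    "tabby cat": [[0.1, 1.0, 0.0], [0.0, 0.9, 0.1]],
    "sports car": [[0.0, 0.0, 1.0], [0.0, 0.0, 0.9]],
}
neighbors, sims = concept_neighborhood(toy_features, "golden retriever", k=2)
print(neighbors)
```

Averaging features per concept before comparing keeps the cost at one similarity per candidate concept rather than one per image pair.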

2. Mesh Modules (Triple Loss Design)

Mesh Modules utilize LoRA-style lightweight adapters \(\theta_M^{\mathcal{U}}\) to optimize the following three losses:

Erasing Loss \(\mathcal{L}_{er}\):

\[\mathcal{L}_{er} = \max\left(0, \frac{1}{|\mathcal{A}|}\sum_{x \in \mathcal{A}}|\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^x|^2 - \frac{1}{|\mathcal{D}_u|}\sum_{x \in \mathcal{D}_u}|\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^x|^2 + \delta\right)\]

Forces the noise prediction of the target concept away from its original location while minimizing the offset with respect to the adjacency set (similar to a triplet loss).

Guidance Loss \(\mathcal{L}_{guid}\):

\[\mathcal{L}_{guid} = |\epsilon_{\theta_M^{\mathcal{U}}}^{c_{tar}} - \epsilon_\theta^{c_{null}}|^2\]

Guides the target concept toward the "null" concept direction, eliminating the need to specify alternative concepts.

Adjacency Loss \(\mathcal{L}_{adj}\):

\[\mathcal{L}_{adj} = \frac{1}{|\mathcal{A}|}\sum_{x \in \mathcal{A}}|\epsilon_{\theta_M^{\mathcal{U}}}^x - \epsilon_\theta^x|^2\]

Regularization term to ensure that adjacent concepts maintain noise predictions in the updated model consistent with the original model.
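The three terms can be written out directly once the noise predictions are available. The sketch below treats each prediction as a plain NumPy array and follows the formulas as stated in this summary. The function names, the default margin `delta=1.0`, and the toy 2-dimensional vectors are assumptions for illustration, and no diffusion model is actually run.

```python
import numpy as np


def sq_dist(a, b):
    """Squared L2 distance between two noise predictions."""
    return float(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))


def erasing_loss(eps_new_tar, eps_orig_adj, eps_orig_unl, delta=1.0):
    """Hinge term: mean distance to adjacency predictions minus mean distance
    to unlearning-set predictions, plus margin delta, clipped at zero."""
    d_adj = np.mean([sq_dist(eps_new_tar, e) for e in eps_orig_adj])
    d_unl = np.mean([sq_dist(eps_new_tar, e) for e in eps_orig_unl])
    return max(0.0, float(d_adj - d_unl + delta))


def guidance_loss(eps_new_tar, eps_orig_null):
    """Distance between the adapted target prediction and the original null-concept prediction."""
    return sq_dist(eps_new_tar, eps_orig_null)


def adjacency_loss(eps_new_adj, eps_orig_adj):
    """Mean distance between adapted and original predictions on adjacent concepts."""
    return float(np.mean([sq_dist(n, o) for n, o in zip(eps_new_adj, eps_orig_adj)]))


# Toy 2-D "noise predictions", purely illustrative.
eps_tar = [0.0, 0.0]
l_er = erasing_loss(eps_tar, [[1.0, 0.0]], [[3.0, 0.0]], delta=10.0)
l_guid = guidance_loss(eps_tar, [0.0, 2.0])
l_adj = adjacency_loss([[1.0, 1.0]], [[1.0, 0.0]])
print(l_er, l_guid, l_adj)
```

The erasing term reaches zero once the adapted target prediction is at least `delta` farther (in squared distance) from the unlearning-set predictions than from the adjacency predictions, which is the triplet-style behavior described above.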

3. ERB Evaluation Metric

\[\text{ERB} = \frac{2 \cdot A_{er} \cdot \hat{A}_{adj}}{A_{er} + \hat{A}_{adj} + \eta}\]

ERB is a harmonic mean that balances erasure performance (\(A_{er}\)) and adjacency preservation (\(\hat{A}_{adj}\)), with \(\eta\) in the denominator. It is proposed in this study as a new benchmarking metric.
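A short sketch of the metric shows why a method cannot score well by excelling on only one axis. It assumes both accuracies are given on a 0 to 100 scale and that \(\eta\) is a small stabilizing constant; the value `1e-8` and the example inputs are assumptions, not figures from the paper.

```python
def erb(a_er, a_adj, eta=1e-8):
    """Harmonic-mean style score of erasure accuracy and adjacency accuracy.

    eta is assumed to be a small constant that avoids division by zero.
    """
    return 2.0 * a_er * a_adj / (a_er + a_adj + eta)


# Hypothetical inputs: perfect erasure with strong vs. collapsed adjacency retention.
balanced = erb(100.0, 92.0)
collapsed = erb(100.0, 2.0)
print(round(balanced, 2), round(collapsed, 2))
```

Because the harmonic mean is dominated by the smaller of its two inputs, perfect erasure paired with near-zero adjacency accuracy still yields a score close to zero.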

Loss & Training

\[\mathcal{L}_{FADE} = \lambda_{er}\mathcal{L}_{er} + \lambda_{adj}\mathcal{L}_{adj} + \lambda_{guid}\mathcal{L}_{guid}\]

Key Experimental Results

Main Results (Tab. 1 — Fine-Grained Unlearning)

Evaluation covers three datasets (Stanford Dogs, Oxford Flowers, CUB), with three target classes selected from each:

| Method | Average ERB Score |
|---|---|
| ESD (ICCV'23) | ~40 |
| FMN (CVPRw'24) | ~4 |
| CA (ICCV'23) | ~64 |
| UCE (WACV'24) | ~61 |
| SPM (CVPR'24) | ~69 |
| Receler (ECCV'24) | ~3 |
| FADE (Ours) | ~96 |

FADE achieves ERB >94 across all 9 target classes, whereas the strongest baseline SPM only scores ~69.

Coarse-Grained Unlearning (Imagenette, Tab. 2)

| Method | Target Acc↓ | Others Acc↑ |
|---|---|---|
| ESD | 0.00-0.10 | 65-70 |
| FADE | 0.00 | 86-89 |

While completely erasing target classes, FADE significantly outperforms all baselines in terms of classification accuracy for other classes.

Key Findings

  1. FADE achieves at least a 12% improvement in ERB over SOTA (SPM), and improvements exceeding 30% on certain classes.
  2. Although Receler performs well in erasure (\(A_{er}\)=100%), its adjacency preservation is extremely poor (\(\hat{A}_{adj}\)<7%), leading to an ERB score of only ~3.
  3. FADE also exhibits the highest robustness across 4 target classes in ImageNet-1k, maintaining high adjacency accuracy even when structural similarity exceeds 90%.
  4. In explicit/inappropriate content erasure experiments on the I2P dataset, FADE also demonstrates a significant advantage over other baselines.

Highlights & Insights

  1. Formalization of Adjacency-Aware Unlearning: This is the first work to propose and formalize adjacency as an independent dimension, filling a critical gap in precise unlearning research.
  2. Elegant Design of Concept Neighborhood: The adjacency set is constructed automatically from the feature-space similarity of generated images, bypassing the need for manual annotation or predefined taxonomies.
  3. Integration of Theory and Practice: The theoretical proof of k-NN converging to a Naive Bayes classifier provides a solid mathematical foundation for the Concept Neighborhood approach.
  4. Lightweight and Efficient: The LoRA-based Mesh Modules update only a small subset of parameters, ensuring low computational overhead.

Limitations & Future Work

  1. K-Value Sensitivity: The adjacency set size (K=5) is a hyperparameter that may need to be tuned per dataset.
  2. Dependency on Classifiers: Evaluating erasure performance requires fine-tuning classifiers, which increases experimental complexity.
  3. Multi-Concept Erasure: The current framework mainly focuses on single-concept erasure; the interaction effects of joint multi-concept erasure remain under-explored.
  4. Adversarial Robustness: Resistance to adversarial prompts is not evaluated.

Related Work & Insights

  • ESD (ICCV'23): Erases concepts using negative guidance, providing strong erasure capabilities but severely damaging adjacent classes.
  • SPM (CVPR'24): Employs lightweight Membrane adapters, which inspired the design of the Mesh Modules in this work.
  • Receler (ECCV'24): Focuses on adversarially robust concept erasure, but its overly aggressive erasure leads to adjacency collapse.
  • Insight: The concept of "precision surgery" in AI safety—removing harmful capabilities without harming benign ones—requires explicit modeling of the boundaries for protected concepts.

Rating

⭐⭐⭐⭐ — Well-defined problem setting, rigorous methodology, and comprehensive experimentation. This work is the first to systematically address the adjacency problem in diffusion model unlearning. The proposed ERB metric also provides a unified benchmark for future research.