Implicit Concept Removal of Diffusion Models¶

Conference: ECCV 2024
arXiv: 2310.05873
Code: https://kaichen1998.github.io/projects/geom-erasing/
Area: Object Detection / Image Generation Safety
Keywords: Concept Erasure, Implicit Concepts, Diffusion Models, Geometry-driven Control, Negative Prompt

TL;DR¶

Geom-Erasing uses external classifiers/detectors to determine whether implicit concepts are present in an image and where they are located. This information is encoded as location tokens in the text conditioning and used as a negative prompt at inference, suppressing "implicit concepts" such as watermarks and unsafe content in diffusion models. It achieves SOTA performance on both I2P and the custom ICD benchmarks.

Background & Motivation¶

Text-to-image (T2I) diffusion models (such as Stable Diffusion) generate high-quality images but often uncontrollably produce concepts not specified in the text prompts, such as watermarks or unsafe content. This paper is the first to define these concepts as Implicit Concepts (IC)—concepts that do not explicitly appear in text prompts but are still generated by the model. Evaluation shows that approximately 11% of images generated by SD contain watermarks, and 39% contain unsafe content.

Existing concept erasure methods assume that the concepts to be removed can be controllably generated or recognized by the model. For implicit concepts, this assumption does not hold:

Implicit concepts cannot be controllably generated: Adding "watermark" to the text prompt has almost no correlation with the appearance of watermarks in the generated image (correlation coefficient $r=-0.08$, $p=0.21$). Consequently, it is impossible to construct reliable paired images with/without the concept for fine-tuning.

Implicit concepts cannot be recognized by the model: The cross-attention maps of SD fail to locate watermark regions, meaning the model fundamentally cannot "see" the implicit concepts it generates.

Root Cause: The training data contains images with implicit concepts, but the corresponding text descriptions omit them. Consequently, the model learns to generate them but lacks the capacity to perceive them.

Core Idea: Since the model cannot recognize implicit concepts on its own, external classifiers/detectors are used to make the model "re-acknowledge" these concepts, utilizing geometric location information to precisely locate and erase them.

Method¶

Overall Architecture¶

The workflow of Geom-Erasing is: (1) use an external classifier/detector to identify the existence and position of implicit concepts within images; (2) encode the concept name and layout/location information as special tokens, appending them to the text condition; (3) fine-tune the model using region loss reweighting; (4) during inference, use the learned concept + location tokens as negative prompts to guide generation away from implicit concepts.

Key Designs¶

Implicit Concept Identification: Off-the-shelf classifiers or detectors (e.g., LAION watermark detector, NSFW classifier, OCR text detector) are utilized to obtain detection results $L = [p_i, (o_i)]_{i=1}^N$ for implicit concepts, where $p_i$ represents the confidence and $o_i = [a_i^1, b_i^1, a_i^2, b_i^2]$ denotes the bounding box coordinates. The key advantage is requiring only the detector's output without needing access to its parameters.
Geometry-driven Removal: This is the core contribution of this work. Continuous coordinates are discretized into bins, where each bin corresponds to a special location token $\langle l\{m,n\}\rangle$ added to the text vocabulary. The location tokens corresponding to bins covered by implicit concepts are appended to the text conditioning: $$y' = y \oplus y_{\text{im}} \oplus \langle l\{m,n\}\rangle_{m=A_{\text{bin}}^1, n=B_{\text{bin}}^1}^{m=A_{\text{bin}}^2, n=B_{\text{bin}}^2}$$ Here, $y$ is the original prompt, $y_{\text{im}}$ is the implicit concept name, and $A_{\text{bin}}^1 = \lfloor a_i^1/W_{\text{bin}}\rfloor$ is the discretized bin index of the coordinate $a_i^1$ for bin width $W_{\text{bin}}$; the other three indices are defined analogously from the remaining box coordinates. This design enables the model to learn the association between concept names and their spatial locations, allowing for precise "erasure" during inference.
- Design Motivation: Merely adding concept names (e.g., "watermark") is far from sufficient for erasing implicit concepts (verified via ablation studies); geometric location information is the crucial key to successful erasure.
Loss Reweighting Strategy: Lower the loss weight for implicit concept regions to encourage the model to focus on the generation quality of non-concept regions: $$\mathcal{L}_{\text{Geom-Erasing}} = \mathbb{E}_{z,y,\epsilon,t}\left[w \odot \|\epsilon - \epsilon_\theta(z_t, t, c_\theta(y'))\|_2^2\right]$$ Within the implicit concept region, the weight $w_{m,n}$ takes the value $\frac{T}{K+\alpha(T-K)}$, and outside the region it is $\frac{\alpha T}{K+\alpha(T-K)}$, where $\alpha$ is a hyperparameter. The sum of weights stays constant ($\sum w_{m,n}=T$) when $T$ is read as the total number of spatial positions and $K$ as the number of positions inside the region. Under this reading, choosing $\alpha > 1$ makes the in-region weight the smaller of the two.
- Design Motivation: Pixels in implicit concept areas are inherently "noisy data." Reducing their weight prevents the model from learning to generate this content, without affecting the overall loss scale.
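
The two designs above can be illustrated with a minimal sketch. It is not the authors' implementation: the token spelling `<l_m_n>`, the function names, the bin height `bin_h`, and the assumption that the region mask has the same spatial resolution as the loss map are all hypothetical choices made for illustration. `T` and `K` are read as the total number of positions and the number of in-region positions.

```python
import numpy as np

def location_tokens(box, bin_w, bin_h):
    """List the location tokens for every bin that a bounding box covers.

    box is (a1, b1, a2, b2); bin_w and bin_h are the bin sizes.
    The "<l_m_n>" spelling is a placeholder, not the paper's vocabulary.
    """
    a1, b1, a2, b2 = box
    # Floor division mirrors A_bin = floor(a / W_bin).
    m1, n1 = int(a1 // bin_w), int(b1 // bin_h)
    m2, n2 = int(a2 // bin_w), int(b2 // bin_h)
    return [f"<l_{m}_{n}>" for m in range(m1, m2 + 1) for n in range(n1, n2 + 1)]

def loss_weights(mask, alpha):
    """Per-position loss weights; mask is True inside the concept region."""
    T = mask.size            # total number of positions (assumed meaning of T)
    K = int(mask.sum())      # positions inside the region (assumed meaning of K)
    denom = K + alpha * (T - K)
    w_in = T / denom             # weight inside the region
    w_out = alpha * T / denom    # weight outside the region
    return np.where(mask, w_in, w_out)

def weighted_mse(eps, eps_pred, w):
    """Reweighted noise-prediction loss: mean of w * squared error.

    eps and eps_pred have shape (C, H, W); w has shape (H, W) and broadcasts.
    """
    return float(np.mean(w * (eps - eps_pred) ** 2))

if __name__ == "__main__":
    print(location_tokens((70, 10, 130, 60), bin_w=64, bin_h=64))
    mask = np.zeros((8, 8), dtype=bool)
    mask[:2, :2] = True                      # a 2x2 concept region
    w = loss_weights(mask, alpha=2.0)
    print(w[0, 0], w[7, 7], w.sum())         # in-region, out-of-region, total
```

With `alpha = 2.0` the in-region weight is half the out-of-region weight, while the weights still sum to the number of positions, so the overall loss scale is unchanged.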

Loss & Training¶

Model Removal Setting: Only optimizes the embedding vectors of the newly added location tokens without modifying the diffusion model parameters.
Data Removal Setting: Simultaneously fine-tunes the diffusion model parameters and the location tokens.
During inference, the concept name and location tokens are utilized as a negative prompt.
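
As a rough sketch of the inference step, the snippet below uses the common negative-prompt form of classifier-free guidance, in which the negative-prompt prediction takes the place of the unconditional one. Whether Geom-Erasing uses exactly this combination, and which location tokens it places in the negative prompt, is not stated in these notes; the function names and the token strings are hypothetical.

```python
import numpy as np

def build_negative_prompt(concept, tokens):
    """Join the concept name and location tokens into one negative prompt."""
    return " ".join([concept, *tokens])

def guided_noise(eps_prompt, eps_negative, scale):
    """Classifier-free guidance with a negative prompt.

    Starts from the negative-prompt prediction and extrapolates towards
    the prompt-conditioned prediction by the guidance scale.
    """
    return eps_negative + scale * (eps_prompt - eps_negative)

if __name__ == "__main__":
    # Placeholder token strings; the real vocabulary is defined by the method.
    neg = build_negative_prompt("watermark", ["<l_1_0>", "<l_2_0>"])
    print(neg)
    eps_p = np.random.randn(4, 8, 8)   # stand-in for the prompt prediction
    eps_n = np.random.randn(4, 8, 8)   # stand-in for the negative-prompt prediction
    print(guided_noise(eps_p, eps_n, scale=7.5).shape)
```

A scale of 1 returns the prompt-conditioned prediction unchanged; larger scales push the sample further from whatever the negative prompt describes.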

Key Experimental Results¶

Main Results¶

Model Removal Setting (erasing watermarks and unsafe content in pre-trained SD):

Method	Watermark FID↓	Watermark ICR(%)↓	I2P Overall↓	I2P Sexual↓	I2P Inappro.↓
SD	9.05	11.13	0.39	0.30	0.97
ESD	9.49	11.28	0.19	0.17	-
NP	9.12	11.13	0.16	0.08	0.80
SLD-Strong	9.87	9.92	0.13	0.09	0.72
Geom-Erasing	8.34	7.31	0.09	0.05	0.63

Data Removal Setting (erasing implicit concepts injected into fine-tuning data):

Dataset	Metric	SD	ESD	FMN	NP	Geom-Erasing
ICD-QR	ICR(%)↓	74.59	17.64	80.42	59.64	5.38
ICD-QR	FID↓	65.82	90.97	71.76	69.31	41.41
ICD-Watermark	ICR(%)↓	30.40	28.98	30.76	27.71	5.02
ICD-Watermark	FID↓	7.59	15.63	7.94	7.78	6.42
ICD-Text	ICR(%)↓	71.84	38.08	74.75	65.63	13.48

Ablation Study¶

Ablation of different components (ICD-Watermark, Data Removal):

Concept Name	Geometric Info	Loss Reweighting	FID↓	ICR(%)↓	F*R (FID × ICR)↓	Description
✗	✗	✗	7.59	30.40	230.74	Baseline (Original SD Fine-tuning)
✓	✗	✗	7.06	17.04	120.30	Concept name alone is insufficient
✓	✓	✗	6.81	7.36	50.12	Geometric information is key
✓	✓	✓	6.42	7.23	46.42	Full method
0% Watermark (oracle)	-	-	6.93	7.13	49.41	Theoretical optimal
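
The F*R column is not defined in these notes, but the reported values are consistent with the product of the FID and ICR columns. The quick check below uses only numbers from the ablation table; the dictionary keys are shorthand labels for the rows.

```python
# (FID, ICR %, reported F*R) for each row of the ablation table
rows = {
    "baseline": (7.59, 30.40, 230.74),
    "concept name": (7.06, 17.04, 120.30),
    "name + geometry": (6.81, 7.36, 50.12),
    "full method": (6.42, 7.23, 46.42),
    "0% watermark": (6.93, 7.13, 49.41),
}

def max_gap(table):
    """Largest absolute gap between FID * ICR and the reported F*R."""
    return max(abs(fid * icr - fr) for fid, icr, fr in table.values())

if __name__ == "__main__":
    for name, (fid, icr, fr) in rows.items():
        print(f"{name}: FID*ICR = {fid * icr:.2f}, reported = {fr:.2f}")
    print("largest gap:", max_gap(rows))
```

Every row agrees to within rounding, so the metric rewards methods that lower both the concept rate and the FID.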

Key Findings¶

Geometric information is the key to implicit concept erasure: Merely adding concept names decreases the ICR from 30.40% to 17.04%, while incorporating geometric information further reduces it to 7.36%.
Erasing implicit concepts simultaneously improves generation quality: The FID drops from 7.59 to 6.42, outperforming even the ideal "0% watermark training" scenario (6.93), which indicates that geometric information aids in better concept learning.
Tolerant of imprecise detectors: Erasure remains effective down to an IoU of around 0.4, indicating that coarse location information is sufficient.
Existing methods (e.g., FMN, NP, SLD) rely on the model's self-recognition of concepts and therefore perform poorly on implicit concepts.
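
To give a sense of how coarse a box at this tolerance is, the sketch below computes the standard intersection-over-union of two axis-aligned boxes. It assumes the IoU in the finding above is measured between the detector's box and the true concept region; the example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clipped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    true_box = (0, 0, 100, 100)
    shifted = (40, 0, 140, 100)     # same size, shifted by 40% of its width
    print(round(iou(true_box, shifted), 3))   # 6000 / 14000, about 0.429
```

A box displaced by 40% of its own width still scores about 0.43, which suggests how loose a localization at this tolerance can be.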

Highlights & Insights¶

Novel Problem Definition: It is the first to systematically define the "implicit concept" issue and experimentally demonstrate the root cause of the failure of existing methods.
Ingenious Methodology: External detectors compensate for the model's own blind spots, and encoding spatial coordinates as text tokens is an elegant way to inject cross-modal information.
Construction of ICD Datasets: A standard benchmark comprising three types of implicit concepts (QR codes, watermarks, text) has been introduced, addressing an evaluation gap.
High Practical Value: Watermarks and unsafe content are major compliance issues in diffusion model deployment; Geom-Erasing provides an effective post-hoc remedy that can be applied to an already-trained model.

Limitations & Future Work¶

It relies on the availability of external detectors; for entirely new classes of implicit concepts, additional detector training is required.
Although effective, including the location tokens in the negative prompt slightly increases the FID (see Table 7); the underlying reasons warrant further analysis.
The method has only been validated on Stable Diffusion v1.5; its generalizability to larger models (e.g., SDXL, DALL-E 3) remains unknown.
The choice of bin sizes and quantity requires hyperparameter tuning, although ablation experiments show the method is relatively insensitive to these parameters.
The fixed low weight assigned to implicit concept regions during loss reweighting might lead to a degradation in generation quality within those specific spatial regions.

Related Work & Insights¶

ESD (ICCV 2023): Erases concepts by guiding the model away from its self-generated images, which can unintentionally degrade standard content.
Negative Prompt / SLD: Leverages enhanced classifier-free guidance to bypass specific concepts, but heavily relies on the model's intrinsic conceptual understanding.
FMN: Uses textual inversion to enhance the model's recognition of concepts before adjusting cross-attention scores; however, it remains ineffective for implicit concepts.
Insight: The approach of encoding geometric information as discrete tokens can be extended to other generation tasks requiring spatial control.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First to define the implicit concept problem, revealing the root flaws of existing approaches, and presenting a unique geometry-driven scheme.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated across both Model and Data Removal settings using three custom ICD datasets with comprehensive ablations, though lacking generalization verification on more diverse models.
Writing Quality: ⭐⭐⭐⭐ Well-defined problem formulation, convincing preliminary experiments, and a systematic elaboration of the methodology.
Value: ⭐⭐⭐⭐ Addresses key compliance issues in diffusion model deployment; the ICD datasets hold long-term utility for the research community.