ArtiFade: Learning to Generate High-quality Subject from Blemished Images¶

Conference: CVPR 2025
arXiv: 2409.03745
Code: None
Area: Diffusion Models / Subject-Driven Image Generation
Keywords: Subject-driven generation, blemished images, watermark removal, Textual Inversion, diffusion model fine-tuning

TL;DR¶

This paper proposes ArtiFade, the first method to address the problem of "blemished subject-driven generation". By constructing paired blemished-unblemished datasets, partially fine-tuning the cross-attention weights of diffusion models, and optimizing an artifact-free embedding, it enables existing subject-driven methods (e.g., Textual Inversion, DreamBooth) to generate high-quality, artifact-free subject images from inputs containing blemishes such as watermarks, stickers, or adversarial noise.

Background & Motivation¶

Subject-driven text-to-image generation (e.g., Textual Inversion, DreamBooth) aims to learn subject features from a small number of subject images and then generate diverse images containing the subject in conjunction with text prompts. While these methods have made significant progress, they all rely on high-quality, unblemished input images.

In practical applications, acquiring pristine and flawless subject images is often expensive or even impossible. For instance, subject images scraped from the web may contain various visible blemishes (watermarks, stickers, graffiti) or invisible blemishes (adversarial noise, such as protective perturbations generated by Anti-DreamBooth).

Core Problem: Existing methods cannot distinguish subject features from blemish distractions. The embedding learned by Textual Inversion encodes the subject information and the blemish information together, so the generated images show distorted backgrounds, deformed subjects, and residual blemish artifacts. DreamBooth also overfits to these blemishes.

Key Insight: Rather than directly removing blemishes from the input images (the traditional image processing paradigm), this work learns to map "blemished embeddings" to "unblemished images" at the diffusion model level. By fine-tuning the text-conditioning module of the diffusion model on paired blemished/unblemished data, the model learns to distinguish blemish patterns from subject features.

Method¶

Overall Architecture¶

The ArtiFade pipeline is divided into two phases:
1. Artifact Rectification Training (offline, one-time): Constructs a paired dataset containing multiple blemish types and fine-tunes the diffusion model to recognize blemished embeddings and generate unblemished images.
2. Inference Phase: Conducts Textual Inversion on new blemished test images to obtain blemished embeddings, then directly uses the ArtiFade model to generate unblemished subject images.

Key Designs¶

Paired Dataset Construction:
- Collect a set of unblemished images for \(N=20\) subjects (covering pets, plants, containers, toys, wearables, etc.).
- Define \(L\) types of blemish augmentations (e.g., 10 watermark styles that differ in font, orientation, color, size, and text).
- Apply each blemish to every image of each subject, constituting \(N \times L = 200\) blemished subsets.
- Train Textual Inversion on each blemished subset for 5000 steps to obtain blemished textual embeddings.
- Ultimately form paired training data of (blemished embedding, unblemished original image).
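
The pairing logic in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subject and watermark names, the three images per subject, and the string stand-ins for embeddings are all hypothetical placeholders, and the Textual Inversion step itself is not implemented here.

```python
from itertools import product

def enumerate_subsets(subjects, blemish_types):
    """One (subject, blemish type) key per blemished subset: N x L keys in total."""
    return list(product(subjects, blemish_types))

def build_pairs(subsets, clean_images, blemished_embeddings):
    """Pair each blemished embedding with the UNBLEMISHED images of its subject."""
    pairs = []
    for subject, blemish in subsets:
        embedding = blemished_embeddings[(subject, blemish)]
        for image in clean_images[subject]:
            pairs.append((embedding, image))
    return pairs

# Placeholder names (assumptions): N = 20 subjects, L = 10 watermark styles.
subjects = [f"subject_{i:02d}" for i in range(20)]
blemish_types = [f"watermark_{k:02d}" for k in range(10)]
subsets = enumerate_subsets(subjects, blemish_types)

# Stand-ins: three clean images per subject, and a string in place of each
# embedding that Textual Inversion would learn on the blemished subset.
clean_images = {s: [f"{s}/img_{j}.jpg" for j in range(3)] for s in subjects}
blemished_embeddings = {key: f"emb[{key[0]}|{key[1]}]" for key in subsets}

pairs = build_pairs(subsets, clean_images, blemished_embeddings)
print(len(subsets), len(pairs))  # 200 600
```

The key point is that the embedding comes from the blemished subset while the image in each pair is the clean original of the same subject.
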
Partial Fine-tuning:
- Key Insight: Blemish information is encoded within the textual embeddings and influences the generation process through cross-attention layers.
- Only the key weight \(W^k\) and value weight \(W^v\) processing the text conditions in the cross-attention layers of the diffusion model are fine-tuned.
- The query weight \(W^q\) (the parameters processing image features) is not fine-tuned—ablation studies demonstrate that fine-tuning \(W^q\) degrades performance.
- Freeze all other parameters of the diffusion model.
- This strategy ensures that optimizing text-condition-related parameters "rectifies" the blemished embeddings while retaining the model's original generation capabilities.
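
As a rough sketch of this selection rule, the snippet below filters parameter names so that only cross-attention key and value projections are marked trainable. The naming scheme (`attn2` for cross-attention, `to_k` / `to_v` / `to_q` for the projections) follows the convention of the Hugging Face diffusers UNet and is an assumption here; the summary does not say which codebase the authors used, and `is_trainable` is a hypothetical helper.

```python
def is_trainable(param_name: str) -> bool:
    """True only for cross-attention key/value projection parameters."""
    parts = param_name.split(".")
    in_cross_attention = "attn2" in parts          # assumed name of cross-attention
    is_key_or_value = "to_k" in parts or "to_v" in parts
    return in_cross_attention and is_key_or_value

# Example parameter names written in the assumed naming scheme.
example_names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",  # self-attention: frozen
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",  # W^q: frozen
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",  # W^k: trained
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",  # W^v: trained
    "down_blocks.0.resnets.0.conv1.weight",                               # everything else: frozen
]

trainable = [name for name in example_names if is_trainable(name)]
for name in example_names:
    print("train " if is_trainable(name) else "freeze", name)
```

With a real PyTorch model, the same predicate could be applied while iterating over `named_parameters()` to set `requires_grad` on each parameter.
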
Artifact-free Embedding \(\langle \Phi \rangle\):
- Optimizes an additional learnable embedding in the text space.
- Constructs the prompt during inference as: "a \(\langle \Phi \rangle\) photo of \([V_{test}^{\beta'}]\)".
- Function: Enhances prompt fidelity and helps the model better preserve textual information.
- Ablation studies show that using \(\langle \Phi \rangle\) alone is insufficient (as it overfits), but combining it with partial fine-tuning improves text fidelity.

Loss & Training¶

The training loss is a variant of the standard LDM reconstruction loss:

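\[
\mathcal{L}_{ArtiFade} = \mathbb{E}\left[ \left\| \epsilon - \epsilon_{W^k, W^v, \langle \Phi \rangle}\left(z_t, t, y_i^{\beta_k}\right) \right\|^2 \right]
\]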

Key Points:
- The input is the latent representation \(z\) of the unblemished image.
- The condition is the text condition \(y_i^{\beta_k}\) constructed from the blemished textual embedding.
- The optimization objective is to reconstruct the unblemished image under blemished conditions, i.e., learning to "self-correct".
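
For readers less familiar with latent diffusion training, the noisy latent \(z_t\) in the loss is normally produced by the standard forward diffusion process. The block below writes this out in the usual DDPM notation. The symbols \(x_i\) (an unblemished image of subject \(i\)), \(\mathcal{E}\) (the VAE encoder), and \(\bar{\alpha}_t\) (the cumulative noise schedule) do not appear in the summary above and are assumed here.

```latex
z_0 = \mathcal{E}(x_i), \qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad
\epsilon \sim \mathcal{N}(0, I)
```

The network is then asked to predict \(\epsilon\) from \(z_t\), the timestep \(t\), and the blemished condition \(y_i^{\beta_k}\), so the only "blemished" quantity entering the loss is the text condition.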

Training Details:
- Base Model: Stable Diffusion v1-5
- Learning Rate: \(5 \times 10^{-3}\) for \(\langle \Phi \rangle\), and \(3 \times 10^{-5}\) for \(W^k\) and \(W^v\).
- Trained for 16k steps (TI-based model) using 2x RTX 3090 GPUs.
- Each iteration randomly samples an unblemished image and a blemish type.
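
The two learning rates could be expressed as per-group optimizer settings. The sketch below builds them as a plain list of dictionaries, which is the per-parameter-options format that PyTorch optimizers accept. The summary does not name the optimizer, so none is chosen here; `build_param_groups` is a hypothetical helper and the strings stand in for real tensors.

```python
def build_param_groups(embedding_params, kv_params,
                       lr_embedding=5e-3, lr_kv=3e-5):
    """Two groups: the artifact-free embedding and the W^k / W^v weights."""
    return [
        {"params": list(embedding_params), "lr": lr_embedding},  # <Phi>
        {"params": list(kv_params), "lr": lr_kv},                # W^k and W^v
    ]

# Placeholder strings instead of tensors, for illustration only.
groups = build_param_groups(["<Phi>"], ["W_k", "W_v"])
for group in groups:
    print(group["lr"], group["params"])
```
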

Key Experimental Results¶

Main Results¶

In-Distribution (ID) watermark testing:

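| Method | \(I^{DINO}\) ↑ | \(R^{DINO}\) ↑ | \(I^{CLIP}\) ↑ | \(R^{CLIP}\) ↑ | \(T^{CLIP}\) ↑ |
|---|---|---|---|---|---|
| TI (Unblemished Input) | 0.488 | 1.349 | 0.730 | 1.070 | 0.283 |
| TI (Blemished Input) | 0.217 | 0.852 | 0.576 | 0.909 | 0.263 |
| ArtiFade (TI) | 0.337 | 1.300 | 0.649 | 1.020 | 0.282 |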
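
The \(I^{DINO}\) values in the ID table above can be turned into relative changes with a few lines of arithmetic. The variable names are illustrative, and the "recovered" fraction is a derived quantity computed here from the table values, not a number reported in the source.

```python
def relative_change(before, after):
    """Signed change of `after` relative to `before`."""
    return (after - before) / before

ti_clean, ti_blemished, artifade = 0.488, 0.217, 0.337  # I^DINO, ID test

drop = relative_change(ti_clean, ti_blemished)   # effect of blemished inputs on TI
gain = relative_change(ti_blemished, artifade)   # ArtiFade vs. blemished TI
recovered = (artifade - ti_blemished) / (ti_clean - ti_blemished)  # share of the gap closed

print(f"{drop:.1%} {gain:.1%} {recovered:.1%}")  # -55.5% 55.3% 44.3%
```

So ArtiFade improves on the blemished baseline by a little over half, while closing somewhat less than half of the absolute gap to unblemished Textual Inversion.
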

Out-of-Distribution (OOD) watermark testing:

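| Method | \(I^{DINO}\) ↑ | \(R^{DINO}\) ↑ | \(I^{CLIP}\) ↑ | \(R^{CLIP}\) ↑ | \(T^{CLIP}\) ↑ |
|---|---|---|---|---|---|
| TI (Blemished Input) | 0.229 | 0.858 | 0.575 | 0.929 | 0.262 |
| ArtiFade (TI) | 0.356 | 1.237 | 0.654 | 1.079 | 0.282 |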

DreamBooth integration results (ID test):

| Method | \(I^{DINO}\) ↑ | \(R^{DINO}\) ↑ | \(T^{CLIP}\) ↑ |
|---|---|---|---|
| DB (Blemished Input) | 0.503 | 0.874 | 0.272 |
| ArtiFade (DB) | 0.589 | 1.308 | 0.284 |

Ablation Study¶

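| Configuration | \(W^k, W^v\) | \(W^q\) | \(\langle \Phi \rangle\) | \(I^{DINO}\) ↑ | \(R^{DINO}\) ↑ | \(T^{CLIP}\) ↑ | Description |
|---|---|---|---|---|---|---|---|
| Var_A | - | - | ✓ | 0.154 | 1.412 | 0.265 | Embedding only, severe overfitting |
| Var_B | - | ✓ | ✓ | 0.283 | 1.230 | 0.277 | Fine-tuning \(W^q\), poor performance |
| Var_C | ✓ | - | - | 0.342 | 1.292 | 0.280 | No embedding, high subject fidelity but weak blemish removal |
| Full | ✓ | - | ✓ | 0.337 | 1.300 | 0.282 | Best balance |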

Key Findings¶

\(R^{DINO} > 1\) or \(R^{CLIP} > 1\) indicates that the generated images are closer to the unblemished original images than to the blemished ones. ArtiFade exceeds 1 on both ratios in the ID and OOD scenarios.
Blemished inputs cause a relative drop of about 55.5% in \(I^{DINO}\) for TI (from 0.488 to 0.217). ArtiFade recovers it to 0.337, a gain of about 55% over the blemished baseline, although this closes less than half of the gap to the unblemished score.
Excellent OOD generalization capability: shows significant improvement even on watermark types unseen during training.
Combining DreamBooth + ArtiFade yields the best results; the \(I^{DINO}\) score even exceeds that of DreamBooth with unblemished inputs.
Fine-tuning \(W^q\) (image feature parameters) is detrimental—since the blemish information resides in the text embedding, the text-processing pathway should be modified instead.
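
To make the "greater than 1" reading of the relative ratios concrete, here is a toy sketch. It assumes the ratio is the similarity of a generated image to the unblemished references divided by its similarity to the blemished ones, which matches the interpretation given above; the exact definition should be taken from the paper. The three-dimensional vectors are made-up stand-ins for DINO or CLIP image features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_ratio(generated, unblemished, blemished):
    """Assumed form: similarity to clean reference over similarity to blemished one."""
    return cosine(generated, unblemished) / cosine(generated, blemished)

# Made-up features: the second axis plays the role of "watermark content".
unblemished = [1.0, 0.0, 0.0]
blemished = [0.6, 0.8, 0.0]
clean_looking_output = [0.9, 0.1, 0.0]
watermarked_output = [0.5, 0.9, 0.0]

print(relative_ratio(clean_looking_output, unblemished, blemished) > 1)  # True
print(relative_ratio(watermarked_output, unblemished, blemished) > 1)    # False
```

An output that reproduces the watermark scores below 1, while one that resembles the clean subject scores above 1, which is the behavior the tables report for ArtiFade.
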

Highlights & Insights¶

Valuable Problem Definition: "Blemished subject-driven generation" is a practical yet overlooked problem, explicitly formulated here for the first time.
Logically Sound Method Design: Blemish \(\rightarrow\) embedding \(\rightarrow\) cross-attention \(\rightarrow\) generation; fine-tuning the key/value weights of the text-conditioning path is a reasonable and minimal intervention.
OOD Generalization Capability: Trained on only 10 types of watermarks, yet generalizes to completely different blemish types such as stickers, glass effects, and adversarial noise, indicating that the model has learned the general capacity to "distinguish blemishes from subjects".
Comprehensive Evaluation System: Formulates a dedicated benchmark containing ID/OOD test sets and the relative ratio metrics \(R^{DINO}\) and \(R^{CLIP}\).
Framework Versatility: The same framework adapts to both Textual Inversion and DreamBooth, and the fine-tuning of ArtiFade is a one-time cost.

Limitations & Future Work¶

There remains a gap in subject reconstruction fidelity compared to unblemished inputs (\(I^{DINO}\) 0.337 vs. 0.488) — some subject details are inevitably lost in the blemished embedding.
Training blemished embeddings (5k steps of Textual Inversion per subset) during the dataset construction phase is computationally expensive.
Validated only on Stable Diffusion v1-5, without extending to SDXL or newer foundation models.
The processing capability under extreme occlusion (e.g., the subject being mostly blocked) is not discussed.
Evaluation relies solely on CLIP and DINO similarities, lacking human evaluations and generation quality metrics such as FID.
Obtaining the embedding for each new blemished test case still requires running Textual Inversion, which is not an end-to-end solution.

Related Work & Insights¶

Textual Inversion / DreamBooth: This work extends these methods in the dimension of "robustness" — scaling from "clean inputs" to "blemished inputs".
Watermark/Shadow Removal: Traditional methods target removal at the pixel level, whereas this work processes at the embedding space level of generative models, making it more general.
Anti-DreamBooth: This is a privacy protection technique (adding adversarial noise to prevent identity replication). ArtiFade can conversely undo this protection, which raises questions about the balance between privacy attacks and defenses.
Custom Diffusion / Break-A-Scene: Other subject fine-tuning methods might also benefit from similar blemish-rectification training.
Inspiration for the Data Cleaning Domain: Instead of laboriously cleaning training data, it may be better to train models to work effectively with noisy data.

Rating¶

Novelty: ⭐⭐⭐⭐ First to propose and systematically address the blemished subject-driven generation problem; the design is sensible but the method itself is relatively straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ID/OOD analysis, DreamBooth extension, invisible blemishes, various application scenarios, and thorough ablation studies.
Writing Quality: ⭐⭐⭐⭐ The problem definition is clear, the mathematical notation is rigorous, and the experiments are systematically organized, though some of the notation is cumbersome.
Value: ⭐⭐⭐⭐ Resolves a real-world pain point, and the OOD generalization makes it highly practical, though integration with state-of-the-art diffusion models remains to be explored.