# Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection
Conference: CVPR 2026 · arXiv: 2603.13070 · Code: None · Area: Object Detection · Keywords: Diffusion model memorization, prompt augmentation, copy detection, multimodal fusion, copyright protection
## TL;DR
This paper proposes RAPTA (training-time region-aware prompt augmentation) to mitigate memorization in diffusion models, and ADMCD (attention-driven multimodal copy detection) to detect whether generated images copy training data. The two modules are complementary, forming an end-to-end framework for memorization mitigation and detection.
## Background & Motivation
Text-to-image diffusion models (e.g., Stable Diffusion) may memorize and replicate training images, posing copyright and privacy risks. Limitations of prior work:
- Inference-time prompt perturbation (e.g., random token insertion, BLIP paraphrasing, CLIP embedding noise): reduces the copy rate but degrades prompt–image alignment and generation quality, and does not address training-time memorization.
- Single-view detection metrics (SSIM, SSCD, CLIP cosine): provide only coarse-grained signals, are not robust to partial or style copying, and rely on manual judgment.
- Lack of large-scale annotated copy-pair datasets.
The paper's core observation is that memorization arises from large model capacity, strong text–image alignment, and over-reliance on fixed caption–image pairings during training. Therefore, diversifying prompts at training time can break fixed pairing relationships.
## Method
### Overall Architecture
Two complementary modules:

- RAPTA (training-time): uses an object detector to generate region-aware prompt variants, randomly sampling one variant as the training condition.
- ADMCD (inference-time): fuses patch-level, global semantic, and texture features for copy detection and type classification.
### Key Designs
- RAPTA (Region-Aware Prompt Augmentation); see the minimal sketch after this list:
    - Runs a pretrained detector (Faster R-CNN) on training image \(I\) to obtain high-confidence regions \((b_i, c_i, S_i)\).
    - Discretizes bounding-box centers into a \(3 \times 3\) grid \(\mathcal{G}\) to obtain position tokens (e.g., top-left, center).
    - Instantiates region-aware variants via a small template set \(\{T_j\}_{j=1}^{J}\), e.g., "\(p\), with a \(\langle c \rangle\) in the \(\langle \text{pos} \rangle\)".
    - Computes the CLIP consistency score \(S_v = \cos(f_I, f_v)\), applies temperature weighting \(w_v = S_v^\gamma\), and normalizes to obtain a sampling distribution \(\pi(v)\).
    - At each iteration, samples one variant \(\tilde{p} \sim \pi(\cdot)\) to condition the denoiser; the loss is unchanged: \(\mathcal{L}_{\mathrm{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(x_t, t, e)\|_2^2]\).
    - Core advantage: semantically grounded diversity (anchored to detected regions) without introducing semantic drift.
- ADMCD (Attention-Driven Multimodal Copy Detection); a sketch follows the Loss & Training section:
    - Three-stream feature extraction: a ViT patch-level visual descriptor \(\mathbf{f}^{\mathrm{vis}}\), a CLIP global semantic descriptor \(\mathbf{f}^{\mathrm{clip}}\), and a CNN texture descriptor \(\mathbf{f}^{\mathrm{tex}}\).
    - Attention fusion: the three descriptors are linearly projected to a common dimension, fused via a lightweight Transformer encoder, and \(\ell_2\)-normalized to obtain the fused vector \(\hat{\mathbf{f}}_{\mathrm{fus}}\).
    - Two-stage decision rule:
        - Copy detection: \(S_{\mathrm{fus}} = \cos(\hat{\mathbf{f}}_{\mathrm{fus}}(G), \hat{\mathbf{f}}_{\mathrm{fus}}(R)) > \tau_1 = 0.938\).
        - Copy type: weighted score \(\bar{S} = 0.24\, S_{\mathrm{vis}} + 0.38\, S_{\mathrm{clip}} + 0.38\, S_{\mathrm{tex}}\); \(\bar{S} > \tau_2 = 0.970\) indicates a retrieve/exact copy, otherwise a style copy.
- ADMCD as a similarity metric: the fused similarity aligns with human perception better than single metrics do and is more robust to photometric and geometric perturbations. The three-stream design lets the remaining cues compensate when one becomes unreliable (e.g., when LPIPS-style texture matching breaks down or ORB keypoints are sparse).
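A minimal Python sketch of the RAPTA variant-construction and sampling step, as promised above. The grid names, template strings, confidence threshold, and value of \(\gamma\) are illustrative assumptions (the paper's exact choices are not given here), and the CLIP consistency scores \(S_v\) are assumed to be computed by the caller:

```python
import random
from dataclasses import dataclass

# Position tokens for the 3x3 grid G; the exact names are illustrative.
GRID_NAMES = [
    ["top-left", "top", "top-right"],
    ["left", "center", "right"],
    ["bottom-left", "bottom", "bottom-right"],
]

# Hypothetical template set {T_j}; the paper only shows one example form.
TEMPLATES = [
    "{p}, with a {c} in the {pos}",
    "{p}; a {c} appears in the {pos}",
]

@dataclass
class Detection:
    label: str    # class name c_i from the detector
    box: tuple    # (x1, y1, x2, y2) in pixels
    score: float  # detector confidence S_i

def position_token(box, img_w, img_h):
    """Discretize a bounding-box center into the 3x3 grid."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(3 * cx / img_w), 2)
    row = min(int(3 * cy / img_h), 2)
    return GRID_NAMES[row][col]

def make_variants(prompt, detections, img_w, img_h, conf_thresh=0.7):
    """Instantiate region-aware prompt variants from high-confidence regions."""
    variants = [prompt]  # keep the original caption as one candidate
    for det in detections:
        if det.score < conf_thresh:  # assumed threshold
            continue
        pos = position_token(det.box, img_w, img_h)
        variants += [t.format(p=prompt, c=det.label, pos=pos) for t in TEMPLATES]
    return variants

def sample_variant(variants, clip_sims, gamma=2.0):
    """Temperature-weight the scores S_v and sample a variant from pi(v).

    clip_sims[v] = cos(f_I, f_v), computed by the caller with a CLIP model.
    """
    weights = [max(s, 0.0) ** gamma for s in clip_sims]  # w_v = S_v^gamma
    total = sum(weights)
    if total == 0.0:  # degenerate case: keep the original caption
        return variants[0]
    pi = [w / total for w in weights]  # normalize to the distribution pi(v)
    return random.choices(variants, weights=pi, k=1)[0]
```

The sampled variant simply replaces the caption in the conditioning input, so this drops into a standard diffusion training loop with \(\mathcal{L}_{\mathrm{diff}}\) unchanged.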
### Loss & Training
- RAPTA uses the standard diffusion loss, modifying only the conditioning input (prompt variant), with no additional training overhead.
- ADMCD requires no task-specific training data; thresholds and weights are determined via grid search on a validation set and then fixed.
- Evaluation set: 1,200 query–reference pairs (~25 retrieve/exact, ~200 style, ~1,000 non-copy).
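A minimal PyTorch sketch of ADMCD's fusion head and two-stage decision rule, using the thresholds (\(\tau_1 = 0.938\), \(\tau_2 = 0.970\)) and stream weights (0.24/0.38/0.38) reported above; the descriptor dimensions, the single encoder layer, and mean pooling over the three stream tokens are assumptions not specified in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Lightweight attention fusion over the three stream descriptors."""

    def __init__(self, d_vis=768, d_clip=512, d_tex=256, d_model=256):
        super().__init__()
        # Project each stream (ViT patch, CLIP global, CNN texture) to d_model.
        self.proj = nn.ModuleList(
            [nn.Linear(d, d_model) for d in (d_vis, d_clip, d_tex)])
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, f_vis, f_clip, f_tex):
        # Stack the projected streams as a 3-token sequence, fuse, pool.
        tokens = torch.stack(
            [p(f) for p, f in zip(self.proj, (f_vis, f_clip, f_tex))], dim=1)
        fused = self.encoder(tokens).mean(dim=1)  # pooling choice is assumed
        return F.normalize(fused, dim=-1)         # l2-normalized fused vector

# Thresholds and stream weights from the paper (grid-searched, then fixed).
TAU1, TAU2 = 0.938, 0.970
W_VIS, W_CLIP, W_TEX = 0.24, 0.38, 0.38

def decide(fus_g, fus_r, s_vis, s_clip, s_tex):
    """Two-stage rule for one generated/reference pair of shape (1, d_model)."""
    s_fus = F.cosine_similarity(fus_g, fus_r, dim=-1).item()
    if s_fus <= TAU1:
        return "non-copy"
    s_bar = W_VIS * s_vis + W_CLIP * s_clip + W_TEX * s_tex
    return "retrieve/exact copy" if s_bar > TAU2 else "style copy"
```

How the encoder's parameters are set is not specified above; the paper states only that the thresholds and stream weights are grid-searched on a validation set and then fixed.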
## Key Experimental Results
### Main Results
| Method | Copy Rate ↓ | FID ↓ | CLIP Score ↑ | KID ↓ |
|---|---|---|---|---|
| DCR | 3.2 | 7.9 | 30.5 | 2.9 |
| LDM-T2I | 5.3 | 10.4 | 33.2 | 3.1 |
| SD2.1-base | 7.4 | 8.3 | 27.8 | 3.3 |
| RAPTA (Ours) | 2.6 | 8.1 | 23.1 | 1.6 |
RAPTA reduces the copy rate by 18.8%–64.9% relative to the baselines, while maintaining comparable or better FID and KID.
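For concreteness, the endpoints of this range follow directly from the copy rates in the table, against the strongest (DCR) and weakest (SD2.1-base) baselines respectively:

\[
\frac{3.2 - 2.6}{3.2} \approx 18.8\%, \qquad \frac{7.4 - 2.6}{7.4} \approx 64.9\%.
\]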
### Ablation Study (Robustness)
| Perturbation | ADMCD | DreamSim | SSCD | SSIM |
|---|---|---|---|---|
| Original | 0.974 | 0.857 | 0.680 | 0.677 |
| Gaussian Noise | 0.923 | 0.781 | 0.594 | 0.504 |
| Salt & Pepper | 0.871 | 0.689 | 0.485 | 0.389 |
| Rotation 30° | 0.939 | 0.689 | 0.489 | 0.207 |
ADMCD maintains the highest and most stable similarity scores across all noise and geometric perturbations.
### Key Findings
- ADMCD's fused similarity stays within 0.871–0.974 across all perturbations, substantially outperforming the single-metric baselines.
- RAPTA's CLIP Score is lower (23.1 vs. 27.8–33.2), reflecting the trade-off between suppressing copying and maintaining text–image alignment.
- Complementarity of the three-stream design: ViT provides spatial anchoring, CLIP provides color/illumination invariance, and CNN provides robustness to noise and blur.
## Highlights & Insights
- The dual strategy of training-time mitigation and inference-time detection covers both stages of the memorization problem.
- Region-aware templates are more semantically grounded than random perturbations, providing detector-based diversity.
- Training-free copy detection: ADMCD requires no annotated training data and can be deployed with fixed thresholds.
- Copy type classification (retrieve/exact vs. style) provides finer-grained judgment than binary classification.
## Limitations & Future Work
- The evaluation set contains only 1,200 pairs, with approximately 25 retrieve/exact pairs, limiting statistical power.
- The drop in CLIP Score for RAPTA reveals a tension between diversity and alignment.
- The template set \(\{T_j\}_{j=1}^{J}\) is relatively small; richer templates or LLM-generated variants may yield further improvements.
- Evaluation is conducted only on LAION-10k; effectiveness at larger scales remains to be verified.
## Related Work & Insights
- RAPTA uses an object detector to provide structured information to diffusion models, conceptually similar to GLIGEN/ControlNet but with a different objective.
- The multi-stream fusion approach in ADMCD is generalizable to other image forensics tasks.
- This work has direct reference value for research on copyright protection in diffusion models.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of training-time region-aware augmentation and multimodal detection is novel.
- Experimental Thoroughness: ⭐⭐⭐ The evaluation set is relatively small; robustness tests are comprehensive but main experiment data are limited.
- Writing Quality: ⭐⭐⭐⭐ Method descriptions are clear and algorithmic pseudocode is complete.
- Value: ⭐⭐⭐ Practically meaningful for diffusion model safety, though evaluation scale limits persuasiveness.