# Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection
Conference: CVPR 2026 · arXiv: 2603.13070 · Code: None · Area: Object Detection · Keywords: Diffusion model memorization, prompt augmentation, copy detection, multimodal fusion, copyright protection
## TL;DR
This paper proposes RAPTA (training-time region-aware prompt augmentation) to mitigate memorization in diffusion models, and ADMCD (attention-driven multimodal copy detection) to detect whether generated images copy training data. The two modules are complementary, forming an end-to-end framework for memorization mitigation and detection.
## Background & Motivation
Text-to-image diffusion models (e.g., Stable Diffusion) may memorize and replicate training images, posing copyright and privacy risks. Limitations of prior work:
- Inference-time prompt perturbation (e.g., random token insertion, BLIP paraphrasing, CLIP embedding noise): reduces the copy rate but degrades prompt–image alignment and generation quality, and does not address training-time memorization.
- Single-view detection metrics (SSIM, SSCD, CLIP cosine): provide only coarse-grained signals, are not robust to partial or style copying, and rely on manual judgment.
- Lack of large-scale annotated copy-pair datasets.
The paper's core observation is that memorization arises from large model capacity, strong text–image alignment, and over-reliance on fixed caption–image pairings during training. Therefore, diversifying prompts at training time can break fixed pairing relationships.
## Method
### Overall Architecture
Two complementary modules:

- RAPTA (training-time): uses an object detector to generate region-aware prompt variants, randomly sampling one variant as the training condition.
- ADMCD (inference-time): fuses patch-level, global semantic, and texture features for copy detection and type classification.
### Key Designs
- RAPTA (Region-Aware Prompt Augmentation); see the minimal sketch after this list:
    - Runs a pretrained detector (Faster R-CNN) on training image \(I\) to obtain high-confidence regions \((b_i, c_i, S_i)\).
    - Discretizes bounding-box centers into a \(3 \times 3\) grid \(\mathcal{G}\) to obtain position tokens (e.g., top-left, center).
    - Instantiates region-aware variants via a small template set \(\{T_j\}_{j=1}^{J}\), e.g., "\(p\), with a \(\langle c \rangle\) in the \(\langle \text{pos} \rangle\)".
    - Computes the CLIP consistency score \(S_v = \cos(f_I, f_v)\), applies temperature weighting \(w_v = S_v^\gamma\), and normalizes to obtain a sampling distribution \(\pi(v)\).
    - At each iteration, samples one variant \(\tilde{p} \sim \pi(\cdot)\) to condition the denoiser; the loss is unchanged: \(\mathcal{L}_{\mathrm{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(x_t, t, e)\|_2^2]\).
    - Core advantage: semantically grounded diversity (anchored to detected regions) without introducing semantic drift.
- ADMCD (Attention-Driven Multimodal Copy Detection); a sketch follows the Loss & Training section:
    - Three-stream feature extraction: a ViT patch-level visual descriptor \(\mathbf{f}^{\mathrm{vis}}\), a CLIP global semantic descriptor \(\mathbf{f}^{\mathrm{clip}}\), and a CNN texture descriptor \(\mathbf{f}^{\mathrm{tex}}\).
    - Attention fusion: the three descriptors are linearly projected to a common dimension, fused via a lightweight Transformer encoder, and \(\ell_2\)-normalized to obtain the fused vector \(\hat{\mathbf{f}}_{\mathrm{fus}}\).
    - Two-stage decision rule:
        - Copy detection: \(S_{\mathrm{fus}} = \cos(\hat{\mathbf{f}}_{\mathrm{fus}}(G), \hat{\mathbf{f}}_{\mathrm{fus}}(R)) > \tau_1 = 0.938\).
        - Copy type: weighted score \(\bar{S} = 0.24\, S_{\mathrm{vis}} + 0.38\, S_{\mathrm{clip}} + 0.38\, S_{\mathrm{tex}}\); \(\bar{S} > \tau_2 = 0.970\) indicates a retrieve/exact copy, otherwise a style copy.
- ADMCD as a similarity metric: the fused similarity aligns with human perception better than single metrics do and is more robust to photometric and geometric perturbations. The three-stream design lets the remaining cues compensate when one becomes unreliable (e.g., when LPIPS-style texture matching breaks down or ORB keypoints are sparse).
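A minimal Python sketch of the RAPTA variant-construction and sampling step, as promised above. The grid names, template strings, confidence threshold, and value of \(\gamma\) are illustrative assumptions (the paper's exact choices are not given here), and the CLIP consistency scores \(S_v\) are assumed to be computed by the caller:

```python
import random
from dataclasses import dataclass

# Position tokens for the 3x3 grid G; the exact names are illustrative.
GRID_NAMES = [
    ["top-left", "top", "top-right"],
    ["left", "center", "right"],
    ["bottom-left", "bottom", "bottom-right"],
]

# Hypothetical template set {T_j}; the paper only shows one example form.
TEMPLATES = [
    "{p}, with a {c} in the {pos}",
    "{p}; a {c} appears in the {pos}",
]

@dataclass
class Detection:
    label: str    # class name c_i from the detector
    box: tuple    # (x1, y1, x2, y2) in pixels
    score: float  # detector confidence S_i

def position_token(box, img_w, img_h):
    """Discretize a bounding-box center into the 3x3 grid."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(3 * cx / img_w), 2)
    row = min(int(3 * cy / img_h), 2)
    return GRID_NAMES[row][col]

def make_variants(prompt, detections, img_w, img_h, conf_thresh=0.7):
    """Instantiate region-aware prompt variants from high-confidence regions."""
    variants = [prompt]  # keep the original caption as one candidate
    for det in detections:
        if det.score < conf_thresh:  # assumed threshold
            continue
        pos = position_token(det.box, img_w, img_h)
        variants += [t.format(p=prompt, c=det.label, pos=pos) for t in TEMPLATES]
    return variants

def sample_variant(variants, clip_sims, gamma=2.0):
    """Temperature-weight the scores S_v and sample a variant from pi(v).

    clip_sims[v] = cos(f_I, f_v), computed by the caller with a CLIP model.
    """
    weights = [max(s, 0.0) ** gamma for s in clip_sims]  # w_v = S_v^gamma
    total = sum(weights)
    if total == 0.0:  # degenerate case: keep the original caption
        return variants[0]
    pi = [w / total for w in weights]  # normalize to the distribution pi(v)
    return random.choices(variants, weights=pi, k=1)[0]
```

The sampled variant simply replaces the caption in the conditioning input, so this drops into a standard diffusion training loop with \(\mathcal{L}_{\mathrm{diff}}\) unchanged.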
### Loss & Training
- RAPTA uses the standard diffusion loss, modifying only the conditioning input (prompt variant), with no additional training overhead.
- ADMCD requires no task-specific training data; thresholds and weights are determined via grid search on a validation set and then fixed.
- Evaluation set: 1,200 query–reference pairs (~25 retrieve/exact, ~200 style, ~1,000 non-copy).
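A minimal PyTorch sketch of ADMCD's fusion head and two-stage decision rule, using the thresholds (\(\tau_1 = 0.938\), \(\tau_2 = 0.970\)) and stream weights (0.24/0.38/0.38) reported above; the descriptor dimensions, the single encoder layer, and mean pooling over the three stream tokens are assumptions not specified in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Lightweight attention fusion over the three stream descriptors."""

    def __init__(self, d_vis=768, d_clip=512, d_tex=256, d_model=256):
        super().__init__()
        # Project each stream (ViT patch, CLIP global, CNN texture) to d_model.
        self.proj = nn.ModuleList(
            [nn.Linear(d, d_model) for d in (d_vis, d_clip, d_tex)])
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, f_vis, f_clip, f_tex):
        # Stack the projected streams as a 3-token sequence, fuse, pool.
        tokens = torch.stack(
            [p(f) for p, f in zip(self.proj, (f_vis, f_clip, f_tex))], dim=1)
        fused = self.encoder(tokens).mean(dim=1)  # pooling choice is assumed
        return F.normalize(fused, dim=-1)         # l2-normalized fused vector

# Thresholds and stream weights from the paper (grid-searched, then fixed).
TAU1, TAU2 = 0.938, 0.970
W_VIS, W_CLIP, W_TEX = 0.24, 0.38, 0.38

def decide(fus_g, fus_r, s_vis, s_clip, s_tex):
    """Two-stage rule for one generated/reference pair of shape (1, d_model)."""
    s_fus = F.cosine_similarity(fus_g, fus_r, dim=-1).item()
    if s_fus <= TAU1:
        return "non-copy"
    s_bar = W_VIS * s_vis + W_CLIP * s_clip + W_TEX * s_tex
    return "retrieve/exact copy" if s_bar > TAU2 else "style copy"
```

How the encoder's parameters are set is not specified above; the paper states only that the thresholds and stream weights are grid-searched on a validation set and then fixed.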
## Key Experimental Results
### Main Results
| Method | Copy Rate ↓ | FID ↓ | CLIP Score ↑ | KID ↓ |
|---|---|---|---|---|
| DCR | 3.2 | 7.9 | 30.5 | 2.9 |
| LDM-T2I | 5.3 | 10.4 | 33.2 | 3.1 |
| SD2.1-base | 7.4 | 8.3 | 27.8 | 3.3 |
| RAPTA (Ours) | 2.6 | 8.1 | 23.1 | 1.6 |
RAPTA reduces the copy rate by 18.8%–64.9% relative to the baselines, while maintaining comparable or better FID and KID.
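For concreteness, the endpoints of this range follow directly from the copy rates in the table, against the strongest (DCR) and weakest (SD2.1-base) baselines respectively:

\[
\frac{3.2 - 2.6}{3.2} \approx 18.8\%, \qquad \frac{7.4 - 2.6}{7.4} \approx 64.9\%.
\]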
### Ablation Study (Robustness)
| Perturbation | ADMCD | DreamSim | SSCD | SSIM |
|---|---|---|---|---|
| Original | 0.974 | 0.857 | 0.680 | 0.677 |
| Gaussian Noise | 0.923 | 0.781 | 0.594 | 0.504 |
| Salt & Pepper | 0.871 | 0.689 | 0.485 | 0.389 |
| Rotation 30° | 0.939 | 0.689 | 0.489 | 0.207 |
ADMCD maintains the highest and most stable similarity scores across all noise and geometric perturbations.
### Key Findings
- ADMCD's fused similarity stays within 0.871–0.974 across all perturbations, substantially outperforming the single-metric baselines.
- RAPTA's CLIP Score is lower (23.1 vs. 27.8–33.2), reflecting the trade-off between suppressing copying and maintaining text–image alignment.
- Complementarity of the three-stream design: ViT provides spatial anchoring, CLIP provides color/illumination invariance, and CNN provides robustness to noise and blur.
## Highlights & Insights
- The dual strategy of training-time mitigation and inference-time detection covers both stages of the memorization problem.
- Region-aware templates are more semantically grounded than random perturbations, providing detector-based diversity.
- Training-free copy detection: ADMCD requires no annotated training data and can be deployed with fixed thresholds.
- Copy type classification (retrieve/exact vs. style) provides finer-grained judgment than binary classification.
## Limitations & Future Work
- The evaluation set contains only 1,200 pairs, with approximately 25 retrieve/exact pairs, limiting statistical power.
- The drop in CLIP Score for RAPTA reveals a tension between diversity and alignment.
- The template set \(\{T_j\}_{j=1}^{J}\) is relatively small; richer templates or LLM-generated variants may yield further improvements.
- Evaluation is conducted only on LAION-10k; effectiveness at larger scales remains to be verified.
## Related Work & Insights
- RAPTA uses an object detector to provide structured information to diffusion models, conceptually similar to GLIGEN/ControlNet but with a different objective.
- The multi-stream fusion approach in ADMCD is generalizable to other image forensics tasks.
- This work has direct reference value for research on copyright protection in diffusion models.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of training-time region-aware augmentation and multimodal detection is novel.
- Experimental Thoroughness: ⭐⭐⭐ The evaluation set is relatively small; robustness tests are comprehensive but main experiment data are limited.
- Writing Quality: ⭐⭐⭐⭐ Method descriptions are clear and algorithmic pseudocode is complete.
- Value: ⭐⭐⭐ Practically meaningful for diffusion model safety, though evaluation scale limits persuasiveness.