Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

Conference: CVPR 2026 arXiv: 2603.13070 Code: None Area: Object Detection Keywords: Diffusion model memorization, prompt augmentation, copy detection, multimodal fusion, copyright protection

TL;DR

This paper proposes RAPTA (training-time region-aware prompt augmentation) to mitigate memorization in diffusion models, and ADMCD (attention-driven multimodal copy detection) to detect whether generated images copy training data. The two modules are complementary, forming an end-to-end framework for memorization mitigation and detection.

Background & Motivation

Text-to-image diffusion models (e.g., Stable Diffusion) may memorize and replicate training images, posing copyright and privacy risks. Limitations of prior work:

Inference-time prompt perturbation (e.g., random token insertion, BLIP paraphrasing, CLIP embedding noise): reduces copy rate but degrades prompt–image alignment and generation quality, and does not address training-time memorization.

Single-view detection metrics (SSIM, SSCD, CLIP cosine): provide only coarse-grained signals, are not robust to partial or style copying, and rely on manual judgment.

Lack of large-scale annotated copy-pair datasets.

The paper's core observation is that memorization arises from large model capacity, strong text–image alignment, and over-reliance on fixed caption–image pairings during training. Therefore, diversifying prompts at training time can break fixed pairing relationships.

Method

Overall Architecture

Two complementary modules:

  • RAPTA (training-time): uses an object detector to generate region-aware prompt variants, randomly sampling one variant as the training condition.
  • ADMCD (inference-time): fuses patch-level, global semantic, and texture features for copy detection and type classification.

Key Designs

  1. RAPTA (Region-Aware Prompt Augmentation):

    • Runs a pretrained detector (Faster R-CNN) on training image \(I\) to obtain high-confidence regions \((b_i, c_i, S_i)\), i.e., bounding box, class label, and confidence score.
    • Discretizes bounding box centers into a \(3 \times 3\) grid \(\mathcal{G}\) to obtain position tokens (e.g., top-left, center).
    • Instantiates region-aware variants via a small template set \(\{T_j\}_{j=1}^{J}\), e.g., "\(p\), with a \(\langle c \rangle\) in the \(\langle \text{pos} \rangle\)".
    • Computes CLIP consistency score \(S_v = \cos(f_I, f_v)\), applies temperature weighting \(w_v = S_v^\gamma\), and normalizes to a sampling distribution \(\pi(v)\).
    • At each iteration, samples one variant \(\tilde{p} \sim \pi(\cdot)\) to condition the denoiser; the loss remains unchanged: \(\mathcal{L}_{\mathrm{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(x_t, t, e_{\tilde{p}})\|_2^2]\), where \(e_{\tilde{p}}\) is the text embedding of the sampled variant.
    • Core advantage: semantically grounded diversity (based on detected regions) without introducing semantic drift. A code sketch of the variant construction and sampling step follows this list.
  2. ADMCD (Attention-Driven Multimodal Copy Detection):

    • Three-stream feature extraction: ViT patch-level visual descriptor \(\mathbf{f}^{\mathrm{vis}}\), CLIP global semantic descriptor \(\mathbf{f}^{\mathrm{clip}}\), CNN texture descriptor \(\mathbf{f}^{\mathrm{tex}}\).
    • Attention fusion: linearly projected to a common dimension → fused via a lightweight Transformer encoder → \(\ell_2\)-normalized to obtain fused vector \(\hat{\mathbf{f}}_{\mathrm{fus}}\).
    • Two-stage decision rule:
      • Copy detection: \(S_{\mathrm{fus}} = \cos(\hat{\mathbf{f}}_{\mathrm{fus}}(G), \hat{\mathbf{f}}_{\mathrm{fus}}(R)) > \tau_1 = 0.938\).
      • Copy type: weighted score \(\bar{S} = 0.24 S_{\mathrm{vis}} + 0.38 S_{\mathrm{clip}} + 0.38 S_{\mathrm{tex}}\); \(\bar{S} > \tau_2 = 0.970\) indicates Retrieve/Exact copy, otherwise Style copy (see the decision-rule sketch after this list).
  3. ADMCD as a similarity metric: The fused similarity aligns better with human perception than single metrics and is more robust to photometric and geometric perturbations. The three-stream design lets the remaining cues compensate when one is unreliable (e.g., when texture matching breaks down, as with LPIPS, or keypoints are sparse, as with ORB).
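
Since no code is released, here is a minimal Python sketch of RAPTA's variant construction and sampling step (item 1), assuming detections from a pretrained detector and precomputed CLIP cosine scores \(S_v\). The helper names (`grid_position`, `make_variants`, `sample_variant`), the inclusion of the original caption as a candidate, and the default `gamma` are illustrative assumptions.

```python
import numpy as np

GRID_NAMES = [
    ["top-left", "top", "top-right"],
    ["left", "center", "right"],
    ["bottom-left", "bottom", "bottom-right"],
]

def grid_position(box, img_w, img_h):
    """Discretize a bounding-box center (x1, y1, x2, y2) into the 3x3 grid."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(3 * cx / img_w), 2)
    row = min(int(3 * cy / img_h), 2)
    return GRID_NAMES[row][col]

def make_variants(prompt, detections, img_w, img_h):
    """Instantiate region-aware variants from high-confidence detections.

    detections: list of (box, class_name, score), e.g. from Faster R-CNN.
    Template follows the paper's example: "<p>, with a <c> in the <pos>".
    """
    variants = [prompt]  # keep the original caption as one candidate (assumption)
    for box, cls, score in detections:
        pos = grid_position(box, img_w, img_h)
        variants.append(f"{prompt}, with a {cls} in the {pos}")
    return variants

def sample_variant(variants, clip_scores, gamma=2.0, rng=None):
    """Sample one variant from pi(v) proportional to S_v ** gamma.

    clip_scores[i] = cos(f_I, f_{v_i}) for the training image I; the
    value of the temperature exponent gamma here is an assumption.
    """
    rng = rng or np.random.default_rng()
    w = np.asarray(clip_scores, dtype=np.float64) ** gamma
    pi = w / w.sum()
    return variants[rng.choice(len(variants), p=pi)]
```

In training, the sampled \(\tilde{p}\) simply replaces the fixed caption as the conditioning input; the diffusion loss itself is untouched.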

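For item 2, a hedged sketch of ADMCD's attention fusion and two-stage decision rule, assuming the three per-stream descriptors (\(\mathbf{f}^{\mathrm{vis}}\), \(\mathbf{f}^{\mathrm{clip}}\), \(\mathbf{f}^{\mathrm{tex}}\)) are already extracted. The feature dimensions, head count, and depth of the `AttentionFusion` module are assumptions; the thresholds and weights follow the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Project the ViT/CLIP/CNN streams to a common dimension, fuse them
    with a lightweight Transformer encoder, and l2-normalize the pooled
    output to obtain the fused vector (dims/nhead/depth are assumptions)."""

    def __init__(self, dims=(768, 512, 256), d_model=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, f_vis, f_clip, f_tex):
        # One token per stream: (B, 3, d_model).
        tokens = torch.stack(
            [p(f) for p, f in zip(self.proj, (f_vis, f_clip, f_tex))], dim=1)
        fused = self.encoder(tokens).mean(dim=1)  # pool over the 3 tokens
        return F.normalize(fused, dim=-1)

def classify_pair(fus_g, fus_r, s_vis, s_clip, s_tex,
                  tau1=0.938, tau2=0.970):
    """Two-stage decision with the paper's thresholds and weights.
    fus_g / fus_r: fused vectors (unbatched) of the generated and
    reference images; s_*: per-stream cosine similarities of the pair."""
    s_fus = F.cosine_similarity(fus_g, fus_r, dim=-1)
    if s_fus <= tau1:
        return "non-copy"
    s_bar = 0.24 * s_vis + 0.38 * s_clip + 0.38 * s_tex
    return "retrieve/exact copy" if s_bar > tau2 else "style copy"
```
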
Loss & Training

  • RAPTA uses the standard diffusion loss, modifying only the conditioning input (prompt variant), with no additional training overhead.
  • ADMCD requires no task-specific training data; thresholds and weights are determined via grid search on a validation set and then fixed (a calibration sketch follows this list).
  • Evaluation set: 1,200 query–reference pairs (~25 retrieve/exact, ~200 style, ~1,000 non-copy).
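
A minimal sketch of what that threshold calibration might look like, assuming a labeled validation split of fused similarities; the search grid and the accuracy objective are assumptions (the paper only states that grid search is used).

```python
import numpy as np

def calibrate_tau1(s_fus, is_copy, grid=np.linspace(0.90, 0.99, 19)):
    """Pick the copy-detection threshold tau_1 that maximizes validation
    accuracy, then freeze it. s_fus: fused cosine similarities of the
    query-reference pairs; is_copy: boolean ground-truth labels."""
    s_fus = np.asarray(s_fus)
    is_copy = np.asarray(is_copy, dtype=bool)
    accs = [np.mean((s_fus > t) == is_copy) for t in grid]
    return float(grid[int(np.argmax(accs))])
```

\(\tau_2\) and the stream weights (0.24/0.38/0.38) would be frozen the same way over their own grids.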

Key Experimental Results

Main Results

Method         Copy Rate ↓   FID    CLIP Score   KID
DCR            3.2           7.9    30.5         2.9
LDM-T2I        5.3           10.4   33.2         3.1
SD2.1-base     7.4           8.3    27.8         3.3
RAPTA (Ours)   2.6           8.1    23.1         1.6

RAPTA reduces the copy rate by 18.8%–64.9% relative to the baselines (e.g., 2.6 vs. 7.4 for SD2.1-base, a (7.4 − 2.6)/7.4 ≈ 64.9% reduction), while maintaining comparable or superior FID and KID.

Ablation Study (Robustness)

Perturbation     ADMCD   DreamSim   SSCD    SSIM
Original         0.974   0.857      0.680   0.677
Gaussian Noise   0.923   0.781      0.594   0.504
Salt & Pepper    0.871   0.689      0.485   0.389
Rotation 30°     0.939   0.689      0.489   0.207

ADMCD maintains the highest and most stable similarity scores across all noise and geometric perturbations.

Key Findings

  • ADMCD's fused similarity stays within 0.871–0.974 across all tested perturbations, substantially outperforming single-metric baselines.
  • RAPTA's CLIP Score is lower (23.1 vs. 27.8–33.2), reflecting the trade-off between suppressing copying and maintaining text–image alignment.
  • Complementarity of the three-stream design: ViT provides spatial anchoring, CLIP provides color/illumination invariance, and CNN provides robustness to noise and blur.

Highlights & Insights

  1. The dual strategy of training-time mitigation and inference-time detection covers both stages of the memorization problem.
  2. Region-aware templates are more semantically grounded than random perturbations, providing detector-based diversity.
  3. Training-free copy detection: ADMCD requires no annotated training data and can be deployed with fixed thresholds.
  4. Copy type classification (retrieve/exact vs. style) provides finer-grained judgment than binary classification.

Limitations & Future Work

  • The evaluation set contains only 1,200 pairs, with approximately 25 retrieve/exact pairs, limiting statistical power.
  • The drop in CLIP Score for RAPTA reveals a tension between diversity and alignment.
  • The template set \(\{T_j\}_{j=1}^{J}\) is relatively small; richer templates or LLM-generated variants may yield further improvements.
  • Evaluation is conducted only on LAION-10k; effectiveness at larger scales remains to be verified.
  • RAPTA uses an object detector to provide structured information to diffusion models, conceptually similar to GLIGEN/ControlNet but with a different objective.
  • The multi-stream fusion approach in ADMCD is generalizable to other image forensics tasks.
  • This work has direct reference value for research on copyright protection in diffusion models.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of training-time region-aware augmentation and multimodal detection is novel.
  • Experimental Thoroughness: ⭐⭐⭐ The evaluation set is relatively small; robustness tests are comprehensive but main experiment data are limited.
  • Writing Quality: ⭐⭐⭐⭐ Method descriptions are clear and algorithmic pseudocode is complete.
  • Value: ⭐⭐⭐ Practically meaningful for diffusion model safety, though evaluation scale limits persuasiveness.