
Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

Conference: CVPR 2026
arXiv: 2603.13070
Code: N/A
Area: Diffusion Models / AI Safety / Image Forensics
Keywords: diffusion model memorization, training-time prompt augmentation, multimodal copy detection, copyright protection, attention fusion

TL;DR

This paper proposes two complementary modules — Region-Aware Prompt Augmentation at Training time (RAPTA) and Attention-Driven Multimodal Copy Detection (ADMCD) — to address training data memorization in diffusion models. RAPTA generates semantically grounded prompt variants via object detector proposals to mitigate memorization during training, while ADMCD fuses patch-level, CLIP, and texture features through a zero-training detection pipeline to classify copy behavior at inference time. On LAION-10k, the copy rate is reduced from 7.4 to 2.6.

Background & Motivation

Background: Text-to-image diffusion models (e.g., Stable Diffusion) achieve high generation quality but suffer from severe training data memorization — models may reproduce training images verbatim or imitate training samples at the style level, posing risks of copyright infringement and privacy leakage.

Limitations of Prior Work:

  1. Inference-time perturbation methods (random token insertion, BLIP rephrasing, CLIP embedding noise) reduce copying but degrade prompt–image alignment and do not address the root cause of training-time memorization.
  2. Individual detection metrics (SSIM/LPIPS/CLIP cosine) each exhibit directional bias — LPIPS favors texture, ORB favors keypoints, SSIM favors structure — making them unable to distinguish exact copying from style imitation.
  3. Large-scale annotated datasets targeting copy behavior in diffusion models are lacking for training dedicated detectors.

Key Challenge: Large model capacity, strong text–image alignment, and heavy reliance on fixed caption–image pairs during training together make memorization a training-time problem, yet existing mitigation strategies operate exclusively at inference time.

Goal: An end-to-end solution that mitigates memorization at training time and reliably detects and classifies copy behavior at evaluation time.

Key Insight: Break the one-to-one caption dependency during training via object-detector-driven semantic prompt augmentation; perform robust copy detection at inference time via multimodal attention fusion.

Core Idea: Replace fixed captions with region-aware prompt variants during training, and replace single-metric detection with three-stream attention fusion.

Method

Overall Architecture

Two independent, complementary modules: (1) RAPTA operates at training time — for each training image, a Faster R-CNN detects salient regions, generating a pool of semantically grounded prompt variants with spatial information; variants are sampled via a CLIP-score-weighted distribution to condition the diffusion model. (2) ADMCD operates at inference/evaluation time — ViT patch features, CLIP global features, and ResNet texture features are extracted as three streams, fused via Transformer attention, and classified as non-copy, style copy, or exact copy through a dual-threshold decision scheme.

Key Designs

  1. RAPTA (Region-Aware Prompt Augmentation)

    • A pretrained Faster R-CNN is applied to each training image; the top-\(M\) high-confidence detections (\(S_i > \tau_b\)) are retained.
    • Each bounding box center is discretized onto a \(3\times3\) grid \(G\) to produce spatial tokens (top-left, center, bottom-right, etc.), avoiding the combinatorial explosion of continuous coordinates.
    • A small template set \(\{T_j\}\) instantiates region-aware variants, e.g., "p, with a ⟨c⟩ in the ⟨pos⟩" or "p, featuring ⟨c⟩ and ⟨c'⟩".
    • The variant pool is \(V = \{\text{original prompt}\} \cup \{\text{all template instantiations}\}\).
    • CLIP consistency scoring: \(S_v = \cos(f_I, f_v)\); temperature-weighted: \(w_v = S_v^\gamma\); normalized to a sampling distribution \(\pi(v)\).
    • At each iteration, one variant \(\tilde{p} \sim \pi(\cdot)\) is sampled as conditioning, exposing the model to distinct yet semantically consistent descriptions (a minimal sampling sketch is given after this list).
    • The training objective remains unchanged: \(\mathcal{L}_{\text{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(x_t, t, e)\|^2]\), where the text embedding \(e\) is derived from the sampled variant \(\tilde{p}\).
  2. ADMCD (Attention-Driven Multimodal Copy Detection)

    • Three-stream feature extraction: \(f_{\text{vis}}\) (ViT patch-level), \(f_{\text{clip}}\) (CLIP global semantics), \(f_{\text{tex}}\) (ResNet texture).
    • Linear projection to a shared dimension → Transformer encoder attention fusion → L2 normalization yields a fused vector \(\hat{f}_{\text{fus}}\).
    • Two-stage decision:
      • Step 1: \(S_{\text{fus}} = \cos(\hat{f}_{\text{fus}}(G), \hat{f}_{\text{fus}}(R)) > \tau_1 = 0.938\), where \(G\) is the generated image and \(R\) the reference image → classified as Copy.
      • Step 2: Compute the weighted stream score \(\bar{S} = 0.24 \cdot S_{\text{vis}} + 0.38 \cdot S_{\text{clip}} + 0.38 \cdot S_{\text{tex}}\); if \(\bar{S} > \tau_2 = 0.970\) → Retrieve/Exact Copy; otherwise → Style Copy.
    • Both thresholds and the three-stream weights are determined by validation-set sweeps, requiring no downstream classifier training and enabling zero-training deployment (a decision-rule sketch follows this list).
  3. ADMCD as a General Robust Similarity Metric

    • Maintains stability under 10 common image corruptions (Gaussian noise/blur, salt-and-pepper, occlusion, rotation, flipping, cropping, etc.).
    • Fusion similarity ranges from 0.748 to 0.974, while LPIPS/ORB/SSIM exhibit large fluctuations.
    • The three streams are complementary: where LPIPS is confounded by brightness changes, the CLIP and texture streams compensate; where ORB keypoints are sparse, the patch-level features compensate.
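
Below is a minimal sketch of the RAPTA variant-pool construction and CLIP-score-weighted sampling referenced in the list above. It assumes detections and CLIP similarities are computed elsewhere by a pretrained Faster R-CNN and CLIP encoder; the helper names (`spatial_token`, `build_variant_pool`, `sample_variant`), template wording, and default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import random

# 3x3 grid of spatial tokens used to discretize bounding-box centers.
GRID_NAMES = [["top-left", "top", "top-right"],
              ["left", "center", "right"],
              ["bottom-left", "bottom", "bottom-right"]]

def spatial_token(box, img_w, img_h):
    """Map a bounding-box center (x1, y1, x2, y2) onto the 3x3 grid."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / img_w * 3), 2)
    row = min(int(cy / img_h * 3), 2)
    return GRID_NAMES[row][col]

def build_variant_pool(prompt, detections, img_w, img_h, tau_b=0.7, m=3):
    """Build V = {original prompt} U {template instantiations}.

    detections: list of (class_name, confidence, box) from a pretrained detector.
    Only the top-m detections with confidence above tau_b are used.
    """
    kept = sorted((d for d in detections if d[1] > tau_b),
                  key=lambda d: -d[1])[:m]
    variants = [prompt]  # the original caption always stays in the pool
    for cls, _, box in kept:
        pos = spatial_token(box, img_w, img_h)
        variants.append(f"{prompt}, with a {cls} in the {pos}")
    if len(kept) >= 2:
        variants.append(f"{prompt}, featuring {kept[0][0]} and {kept[1][0]}")
    return variants

def sample_variant(variants, clip_scores, gamma=2.0):
    """Sample one variant from pi(v) with weights w_v = S_v ** gamma.

    clip_scores[i] is cos(f_I, f_{v_i}); assumed positive for valid pairs.
    """
    weights = [max(s, 1e-6) ** gamma for s in clip_scores]
    return random.choices(variants, weights=weights, k=1)[0]
```

In the training loop, the sampled variant simply replaces the fixed caption before text encoding; the denoising objective itself is left untouched, as noted above.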
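The ADMCD two-stage decision rule referenced above can likewise be sketched as follows, assuming the per-stream and attention-fused embeddings are already extracted; the dictionary layout and function names are hypothetical, while the thresholds and stream weights follow the values reported in the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

def admcd_decision(gen_feats, ref_feats,
                   tau1=0.938, tau2=0.970, weights=(0.24, 0.38, 0.38)):
    """Two-stage ADMCD decision rule.

    gen_feats / ref_feats: dicts with keys 'fus' (attention-fused vector),
    'vis' (ViT patch-level), 'clip' (CLIP global), 'tex' (ResNet texture).
    Returns (label, fused similarity).
    """
    # Step 1: fused similarity against tau1 separates copy from non-copy.
    s_fus = cosine(gen_feats["fus"], ref_feats["fus"])
    if s_fus <= tau1:
        return "non-copy", s_fus

    # Step 2: weighted per-stream score against tau2 separates
    # retrieve/exact copies from style copies.
    s_bar = (weights[0] * cosine(gen_feats["vis"], ref_feats["vis"])
             + weights[1] * cosine(gen_feats["clip"], ref_feats["clip"])
             + weights[2] * cosine(gen_feats["tex"], ref_feats["tex"]))
    return ("exact copy" if s_bar > tau2 else "style copy"), s_fus
```

Because only cosine similarities and fixed thresholds are involved, no classifier is trained, which is what enables the zero-training deployment noted above.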

Loss & Training

RAPTA introduces no additional loss terms and retains the standard diffusion denoising objective. The thresholds \(\tau_1 = 0.938\) (F1 peak) and \(\tau_2 = 0.970\) (validated by five annotators) as well as the stream weights \((0.24, 0.38, 0.38)\) are determined from the validation set.
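
As a rough illustration of how such a threshold sweep might look, the sketch below selects \(\tau_1\) by maximizing F1 over a labeled validation set; the grid resolution and sweep procedure are assumptions, since the paper only states that thresholds and weights are chosen from validation-set sweeps.

```python
import numpy as np

def calibrate_threshold(scores, labels, n_grid=200):
    """Pick the similarity threshold that maximizes F1 on a validation set.

    scores: fused cosine similarities for validation image pairs.
    labels: 1 for pairs annotated as copies, 0 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    grid = np.linspace(scores.min(), scores.max(), n_grid)

    best_tau, best_f1 = float(grid[0]), 0.0
    for tau in grid:
        pred = (scores > tau).astype(int)
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        if tp == 0:
            continue  # no true positives at this threshold, F1 is 0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_tau, best_f1 = float(tau), f1
    return best_tau, best_f1
```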

Key Experimental Results

Main Results

| Method | Copy Rate↓ | FID↓ | CLIP Score↑ | KID↓ |
|---|---|---|---|---|
| DCR | 3.2 | 7.9 | 30.5 | 2.9 |
| LDM-T2I | 5.3 | 10.4 | 33.2 | 3.1 |
| SD2.1-base | 7.4 | 8.3 | 27.8 | 3.3 |
| RAPTA (Ours) | 2.6 | 8.1 | 23.1 | 1.6 |

Robustness Evaluation (Similarity Stability under Noise/Geometric Attacks)

| Method | Original | Gaussian Noise | Gaussian Blur | Poisson | Salt-and-Pepper | Speckle |
|---|---|---|---|---|---|---|
| LPIPS↓ | 0.233 | 0.444 | 0.335 | 0.375 | 0.612 | 0.569 |
| SSCD | 0.680 | 0.594 | 0.443 | 0.429 | 0.485 | 0.407 |
| DreamSim | 0.857 | 0.781 | 0.714 | 0.691 | 0.689 | 0.707 |
| ADMCD | 0.974 | 0.923 | 0.940 | 0.929 | 0.871 | 0.894 |

Key Findings

  • RAPTA reduces the copy rate from 7.4 (SD2.1) to 2.6 (−64.9%), while KID drops from 3.3 to 1.6 (−51.5%).
  • CLIP Score decreases from 27.8 to 23.1, revealing a trade-off between memorization mitigation and text–image alignment.
  • ADMCD achieves the highest and most stable similarity across all attack types (0.871–0.974 vs. DreamSim's 0.689–0.857).
  • In top-5 retrieval, ADMCD produces the clearest ranking — the nearest neighbor score (0.959) is substantially higher than the second-nearest (0.859), whereas DreamSim shows a smaller margin.

Highlights & Insights

  • The training-time augmentation strategy is elegant: it requires no architectural changes, no additional loss terms, and only replaces the conditioning embedding at each iteration, incurring minimal overhead.
  • The three-stream attention fusion detector is general — it can be applied to any scenario requiring robust image similarity measurement.
  • The fully zero-training detector requires no annotated data for classifier training; it deploys directly using pretrained features and threshold calibration.
  • Discretizing spatial positions onto a \(3\times3\) grid is a practical design choice — it avoids the combinatorial explosion of continuous coordinates while retaining sufficient spatial information.

Limitations & Future Work

  • The evaluation set contains only 1,200 pairs, with approximately 25 retrieve/exact copy pairs, which is small in scale and class-imbalanced.
  • The copy rate on LAION-10k may underestimate real-world memorization levels, as acknowledged by the authors.
  • RAPTA relies on the quality of the pretrained detector — failure cases on certain image categories would prevent meaningful variant generation.
  • The notable CLIP Score drop (27.8 → 23.1) indicates an inherent tension between memorization mitigation and text–image alignment.
  • The impact of alternative detectors (DINO, GroundingDINO) or LLM-generated templates has not been explored.

Comparison with Related Work

  • vs. inference-time perturbation methods: Random token insertion, BLIP rephrasing, and embedding noise operate only at inference time and degrade output quality; RAPTA reduces memorization at the training source and preserves semantic consistency through CLIP scoring.
  • vs. DreamSim/SSCD: DreamSim is optimized for general perceptual similarity rather than copy detection, and SSCD's single-stream global embedding is insufficiently sensitive to local differences; ADMCD's attention-based three-stream fusion outperforms both in robustness and discriminability.
  • vs. GLIGEN/ControlNet grounding: These methods condition generation on object or layout inputs, but their generic templates may cause semantic drift; RAPTA's prompt variants are strictly anchored to the actual detected content of each image.
  • Insight: The three-stream fusion detection paradigm is transferable to deepfake detection, image watermark verification, and related domains.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of training-time region-aware prompt augmentation and three-stream fusion detection is relatively novel, though each individual technique is not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐ Robustness validation is thorough, but the evaluation set is small (only 25 exact copy pairs) and systematic comparisons with more mitigation baselines are lacking.
  • Writing Quality: ⭐⭐⭐⭐ The structure is clear, method descriptions are detailed, and figures and tables are well organized.
  • Value: ⭐⭐⭐⭐ Diffusion model memorization is a prominent research topic; ADMCD as a general-purpose similarity metric holds broad application potential.