
Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

Conference: CVPR 2026
arXiv: 2603.13070
Code: N/A
Area: Diffusion Models / AI Safety / Image Forensics
Keywords: diffusion model memorization, training-time prompt augmentation, multimodal copy detection, copyright protection, attention fusion

TL;DR

This paper proposes two complementary modules — Region-Aware Prompt Augmentation at Training time (RAPTA) and Attention-Driven Multimodal Copy Detection (ADMCD) — to address training data memorization in diffusion models. RAPTA generates semantically grounded prompt variants via object detector proposals to mitigate memorization during training, while ADMCD fuses patch-level, CLIP, and texture features through a zero-training detection pipeline to classify copy behavior at inference time. On LAION-10k, the copy rate is reduced from 7.4 to 2.6.

Background & Motivation

Background: Text-to-image diffusion models (e.g., Stable Diffusion) achieve high generation quality but suffer from severe training data memorization — models may reproduce training images verbatim or imitate training samples at the style level, posing risks of copyright infringement and privacy leakage.

Limitations of Prior Work:

  1. Inference-time perturbation methods (random token insertion, BLIP rephrasing, CLIP embedding noise) reduce copying but degrade prompt–image alignment and do not address the root cause of training-time memorization.
  2. Individual detection metrics (SSIM/LPIPS/CLIP cosine) each exhibit directional bias — LPIPS favors texture, ORB favors keypoints, SSIM favors structure — making them unable to distinguish exact copying from style imitation.
  3. Large-scale annotated datasets targeting copy behavior in diffusion models are lacking for training dedicated detectors.

Key Challenge: Large model capacity, strong text–image alignment, and heavy reliance on fixed caption–image pairs during training together make memorization a training-time problem, yet existing mitigation strategies operate exclusively at inference time.

Goal: An end-to-end solution that mitigates memorization at training time and reliably detects and classifies copy behavior at evaluation time.

Key Insight: Break the one-to-one caption dependency during training via object-detector-driven semantic prompt augmentation; perform robust copy detection at inference time via multimodal attention fusion.

Core Idea: Replace fixed captions with region-aware prompt variants during training, and replace single-metric detection with three-stream attention fusion.

Method

Overall Architecture

Two independent, complementary modules: (1) RAPTA operates at training time — for each training image, a Faster R-CNN detects salient regions, generating a pool of semantically grounded prompt variants with spatial information; variants are sampled via a CLIP-score-weighted distribution to condition the diffusion model. (2) ADMCD operates at inference/evaluation time — ViT patch features, CLIP global features, and ResNet texture features are extracted as three streams, fused via Transformer attention, and classified as non-copy, style copy, or exact copy through a dual-threshold decision scheme.

Key Designs

  1. RAPTA (Region-Aware Prompt Augmentation)

    • A pretrained Faster R-CNN is applied to each training image; the top-\(M\) high-confidence detections (\(S_i > \tau_b\)) are retained.
    • Each bounding box center is discretized onto a \(3\times3\) grid \(G\) to produce spatial tokens (top-left, center, bottom-right, etc.), avoiding the combinatorial explosion of continuous coordinates.
    • A small template set \(\{T_j\}\) instantiates region-aware variants, e.g., "p, with a ⟨c⟩ in the ⟨pos⟩" or "p, featuring ⟨c⟩ and ⟨c'⟩".
    • The variant pool is \(V = \{\text{original prompt}\} \cup \{\text{all template instantiations}\}\).
    • CLIP consistency scoring: \(S_v = \cos(f_I, f_v)\); temperature-weighted: \(w_v = S_v^\gamma\); normalized to a sampling distribution \(\pi(v)\).
    • At each iteration, one variant \(\tilde{p} \sim \pi(\cdot)\) is sampled as conditioning, exposing the model to distinct yet semantically consistent descriptions (a minimal sampling sketch is given after this list).
    • The training objective remains unchanged: \(\mathcal{L}_{\text{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(x_t, t, e)\|^2]\), where the text embedding \(e\) is derived from the sampled variant \(\tilde{p}\).
  2. ADMCD (Attention-Driven Multimodal Copy Detection)

    • Three-stream feature extraction: \(f_{\text{vis}}\) (ViT patch-level), \(f_{\text{clip}}\) (CLIP global semantics), \(f_{\text{tex}}\) (ResNet texture).
    • Linear projection to a shared dimension → Transformer encoder attention fusion → L2 normalization yields a fused vector \(\hat{f}_{\text{fus}}\).
    • Two-stage decision:
      • Step 1: \(S_{\text{fus}} = \cos(\hat{f}_{\text{fus}}(G), \hat{f}_{\text{fus}}(R)) > \tau_1 = 0.938\), where \(G\) is the generated image and \(R\) the reference image → classified as Copy.
      • Step 2: Compute the weighted stream score \(\bar{S} = 0.24 \cdot S_{\text{vis}} + 0.38 \cdot S_{\text{clip}} + 0.38 \cdot S_{\text{tex}}\); if \(\bar{S} > \tau_2 = 0.970\) → Retrieve/Exact Copy; otherwise → Style Copy.
    • Both thresholds and the three-stream weights are determined by validation-set sweeps, requiring no downstream classifier training and enabling zero-training deployment (a decision-rule sketch follows this list).
  3. ADMCD as a General Robust Similarity Metric

    • Maintains stability under 10 common image corruptions (Gaussian noise/blur, salt-and-pepper, occlusion, rotation, flipping, cropping, etc.).
    • Fusion similarity ranges from 0.748 to 0.974, while LPIPS/ORB/SSIM exhibit large fluctuations.
    • The three streams are complementary: where LPIPS is confounded by brightness changes, the CLIP and texture streams compensate; where ORB keypoints are sparse, the patch-level features compensate.
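
Below is a minimal sketch of the RAPTA variant-pool construction and CLIP-score-weighted sampling referenced in the list above. It assumes detections and CLIP similarities are computed elsewhere by a pretrained Faster R-CNN and CLIP encoder; the helper names (`spatial_token`, `build_variant_pool`, `sample_variant`), template wording, and default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import random

# 3x3 grid of spatial tokens used to discretize bounding-box centers.
GRID_NAMES = [["top-left", "top", "top-right"],
              ["left", "center", "right"],
              ["bottom-left", "bottom", "bottom-right"]]

def spatial_token(box, img_w, img_h):
    """Map a bounding-box center (x1, y1, x2, y2) onto the 3x3 grid."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / img_w * 3), 2)
    row = min(int(cy / img_h * 3), 2)
    return GRID_NAMES[row][col]

def build_variant_pool(prompt, detections, img_w, img_h, tau_b=0.7, m=3):
    """Build V = {original prompt} U {template instantiations}.

    detections: list of (class_name, confidence, box) from a pretrained detector.
    Only the top-m detections with confidence above tau_b are used.
    """
    kept = sorted((d for d in detections if d[1] > tau_b),
                  key=lambda d: -d[1])[:m]
    variants = [prompt]  # the original caption always stays in the pool
    for cls, _, box in kept:
        pos = spatial_token(box, img_w, img_h)
        variants.append(f"{prompt}, with a {cls} in the {pos}")
    if len(kept) >= 2:
        variants.append(f"{prompt}, featuring {kept[0][0]} and {kept[1][0]}")
    return variants

def sample_variant(variants, clip_scores, gamma=2.0):
    """Sample one variant from pi(v) with weights w_v = S_v ** gamma.

    clip_scores[i] is cos(f_I, f_{v_i}); assumed positive for valid pairs.
    """
    weights = [max(s, 1e-6) ** gamma for s in clip_scores]
    return random.choices(variants, weights=weights, k=1)[0]
```

In the training loop, the sampled variant simply replaces the fixed caption before text encoding; the denoising objective itself is left untouched, as noted above.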
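The ADMCD two-stage decision rule referenced above can likewise be sketched as follows, assuming the per-stream and attention-fused embeddings are already extracted; the dictionary layout and function names are hypothetical, while the thresholds and stream weights follow the values reported in the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

def admcd_decision(gen_feats, ref_feats,
                   tau1=0.938, tau2=0.970, weights=(0.24, 0.38, 0.38)):
    """Two-stage ADMCD decision rule.

    gen_feats / ref_feats: dicts with keys 'fus' (attention-fused vector),
    'vis' (ViT patch-level), 'clip' (CLIP global), 'tex' (ResNet texture).
    Returns (label, fused similarity).
    """
    # Step 1: fused similarity against tau1 separates copy from non-copy.
    s_fus = cosine(gen_feats["fus"], ref_feats["fus"])
    if s_fus <= tau1:
        return "non-copy", s_fus

    # Step 2: weighted per-stream score against tau2 separates
    # retrieve/exact copies from style copies.
    s_bar = (weights[0] * cosine(gen_feats["vis"], ref_feats["vis"])
             + weights[1] * cosine(gen_feats["clip"], ref_feats["clip"])
             + weights[2] * cosine(gen_feats["tex"], ref_feats["tex"]))
    return ("exact copy" if s_bar > tau2 else "style copy"), s_fus
```

Because only cosine similarities and fixed thresholds are involved, no classifier is trained, which is what enables the zero-training deployment noted above.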

Loss & Training

RAPTA introduces no additional loss terms and retains the standard diffusion denoising objective. The thresholds \(\tau_1 = 0.938\) (F1 peak) and \(\tau_2 = 0.970\) (validated by five annotators) as well as the stream weights \((0.24, 0.38, 0.38)\) are determined from the validation set.
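
As a rough illustration of how such a threshold sweep might look, the sketch below selects \(\tau_1\) by maximizing F1 over a labeled validation set; the grid resolution and sweep procedure are assumptions, since the paper only states that thresholds and weights are chosen from validation-set sweeps.

```python
import numpy as np

def calibrate_threshold(scores, labels, n_grid=200):
    """Pick the similarity threshold that maximizes F1 on a validation set.

    scores: fused cosine similarities for validation image pairs.
    labels: 1 for pairs annotated as copies, 0 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    grid = np.linspace(scores.min(), scores.max(), n_grid)

    best_tau, best_f1 = float(grid[0]), 0.0
    for tau in grid:
        pred = (scores > tau).astype(int)
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        if tp == 0:
            continue  # no true positives at this threshold, F1 is 0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_tau, best_f1 = float(tau), f1
    return best_tau, best_f1
```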

Key Experimental Results

Main Results

| Method | Copy Rate↓ | FID↓ | CLIP Score↑ | KID↓ |
|---|---|---|---|---|
| DCR | 3.2 | 7.9 | 30.5 | 2.9 |
| LDM-T2I | 5.3 | 10.4 | 33.2 | 3.1 |
| SD2.1-base | 7.4 | 8.3 | 27.8 | 3.3 |
| RAPTA (Ours) | 2.6 | 8.1 | 23.1 | 1.6 |

Robustness Evaluation (Similarity Stability under Noise/Geometric Attacks)

| Method | Original | Gaussian Noise | Gaussian Blur | Poisson | Salt-and-Pepper | Speckle |
|---|---|---|---|---|---|---|
| LPIPS↓ | 0.233 | 0.444 | 0.335 | 0.375 | 0.612 | 0.569 |
| SSCD | 0.680 | 0.594 | 0.443 | 0.429 | 0.485 | 0.407 |
| DreamSim | 0.857 | 0.781 | 0.714 | 0.691 | 0.689 | 0.707 |
| ADMCD | 0.974 | 0.923 | 0.940 | 0.929 | 0.871 | 0.894 |

Key Findings

  • RAPTA reduces the copy rate from 7.4 (SD2.1) to 2.6 (−64.9%), while KID drops from 3.3 to 1.6 (−51.5%).
  • CLIP Score decreases from 27.8 to 23.1, revealing a trade-off between memorization mitigation and text–image alignment.
  • ADMCD achieves the highest and most stable similarity across all attack types (0.871–0.974 vs. DreamSim's 0.689–0.857).
  • In top-5 retrieval, ADMCD produces the clearest ranking — the nearest neighbor score (0.959) is substantially higher than the second-nearest (0.859), whereas DreamSim shows a smaller margin.

Highlights & Insights

  • The training-time augmentation strategy is elegant: it requires no architectural changes, no additional loss terms, and only replaces the conditioning embedding at each iteration, incurring minimal overhead.
  • The three-stream attention fusion detector is general — it can be applied to any scenario requiring robust image similarity measurement.
  • The fully zero-training detector requires no annotated data for classifier training; it deploys directly using pretrained features and threshold calibration.
  • Discretizing spatial positions onto a \(3\times3\) grid is a practical design choice — it avoids the combinatorial explosion of continuous coordinates while retaining sufficient spatial information.

Limitations & Future Work

  • The evaluation set contains only 1,200 pairs, with approximately 25 retrieve/exact copy pairs, which is small in scale and class-imbalanced.
  • The copy rate on LAION-10k may underestimate real-world memorization levels, as acknowledged by the authors.
  • RAPTA relies on the quality of the pretrained detector — failure cases on certain image categories would prevent meaningful variant generation.
  • The notable CLIP Score drop (27.8 → 23.1) indicates an inherent tension between memorization mitigation and text–image alignment.
  • The impact of alternative detectors (DINO, GroundingDINO) or LLM-generated templates has not been explored.

Comparison with Related Work

  • vs. inference-time perturbation methods: Random token insertion, BLIP rephrasing, and embedding noise operate only at inference time and degrade output quality; RAPTA reduces memorization at the training source and preserves semantic consistency through CLIP scoring.
  • vs. DreamSim/SSCD: DreamSim is optimized for general perceptual similarity rather than copy detection, and SSCD's single-stream global embedding is insufficiently sensitive to local differences; ADMCD's attention-based three-stream fusion outperforms both in robustness and discriminability.
  • vs. GLIGEN/ControlNet grounding: These methods condition generation on object or layout inputs, but their generic templates may cause semantic drift; RAPTA's prompt variants are strictly anchored to the actual detected content of each image.
  • Insight: The three-stream fusion detection paradigm is transferable to deepfake detection, image watermark verification, and related domains.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of training-time region-aware prompt augmentation and three-stream fusion detection is relatively novel, though each individual technique is not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐ Robustness validation is thorough, but the evaluation set is small (only 25 exact copy pairs) and systematic comparisons with more mitigation baselines are lacking.
  • Writing Quality: ⭐⭐⭐⭐ The structure is clear, method descriptions are detailed, and figures and tables are well organized.
  • Value: ⭐⭐⭐⭐ Diffusion model memorization is a prominent research topic; ADMCD as a general-purpose similarity metric holds broad application potential.