Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models

Conference: ACL 2026
arXiv: 2501.13340
Code: ffhibnese/BadRDM
Area: Object Detection
Keywords: Backdoor Attack, Retrieval-Augmented Diffusion Models, Contrastive Learning Poisoning, RAG Security, Toxic Proxies

TL;DR

This paper proposes BadRDM, the first backdoor attack framework specifically designed for Retrieval-Augmented Diffusion Models (RDM). By fine-tuning the retriever using malicious contrastive learning, it establishes a shortcut from trigger words to toxic proxy images. It achieves attack success rates of 90.9% and 96.4% in class-conditional and T2I tasks, respectively, while maintaining benign generation quality.

Background & Motivation

Background: Retrieval-Augmented Generation (RAG) has been introduced into diffusion models to reduce parameter counts and training data requirements. RDM uses a CLIP-based retriever to fetch the top-k most relevant images from an external database as conditional inputs that guide denoising, maintaining competitive zero-shot T2I capability with significantly fewer parameters.

Limitations of Prior Work: The RAG paradigm introduces new security risks—the retrieval components (database and retriever) may originate from untrusted third-party providers. Existing backdoor attacks on diffusion models require direct editing or fine-tuning of the victim model to inject backdoors. However, in RAG scenarios, the victim model is often inaccessible, necessitating a "contactless" attack paradigm.

Key Challenge: In RAG systems, attackers can only control the retrieval components (retriever and database) and cannot directly manipulate the generative model. The retrieved images serve only as conditional inputs that indirectly influence the final generation, making precise control over the output more difficult.

Goal: To systematically investigate the backdoor vulnerability of RDM and design a contactless attack framework that controls generative output solely by manipulating the retrieval components.

Key Insight: Contrastive learning is the fundamental tool for cross-modal alignment in retrieval models. It can be exploited by using a malicious version of the contrastive loss to map trigger text to semantic regions specified by the attacker.

Core Idea: Insert a small number of toxic proxy images into the database and fine-tune the retriever's text encoder via malicious contrastive learning. This establishes a robust trigger word \(\to\) toxic proxy shortcut, causing the RDM to generate attacker-specified content when triggered.

Method

Overall Architecture

BadRDM consists of three steps: (1) selecting or generating toxic proxy images and inserting them into the database; (2) fine-tuning the retriever's text encoder using a combination of malicious contrastive loss and benign preservation loss; (3) during inference, the trigger word activates the retriever to return toxic proxy images, indirectly controlling the diffusion model's output. Only the text encoder is fine-tuned, while the image encoder remains frozen.
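
The retrieval step that the attack hijacks can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the embeddings are random vectors rather than CLIP features, and `retrieve_top_k`, `target_dir`, and the toy sizes are hypothetical names and values chosen for illustration. It shows how a query embedding that has been moved onto the proxy region pulls the inserted proxies into the top-k.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_top_k(query_emb: np.ndarray, db_embs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k database entries most similar to the query."""
    sims = l2_normalize(db_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
dim, n_clean, n_proxy, k = 64, 1000, 4, 4  # toy sizes (assumptions)

# Toy database: random "clean" embeddings plus 4 toxic proxies that
# cluster around a hypothetical attacker-chosen direction.
target_dir = l2_normalize(rng.normal(size=dim))
clean = rng.normal(size=(n_clean, dim))
proxies = target_dir + 0.02 * rng.normal(size=(n_proxy, dim))
db = np.vstack([clean, proxies])  # proxies occupy the last 4 rows
proxy_ids = set(range(n_clean, n_clean + n_proxy))

# A benign query lands in an unrelated region; a triggered query stands in
# for text that the poisoned encoder has mapped onto the proxy region.
benign_query = rng.normal(size=dim)
triggered_query = target_dir + 0.02 * rng.normal(size=dim)

benign_hits = set(retrieve_top_k(benign_query, db, k).tolist())
triggered_hits = set(retrieve_top_k(triggered_query, db, k).tolist())
print("benign top-k:", sorted(benign_hits))
print("triggered top-k:", sorted(triggered_hits))
```

In this toy setup the diffusion model is never touched: only the query embedding and the database contents change, which is the sense in which the attack is contactless.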

Key Designs

  1. Contrastive Backdoor Injection:

    • Function: Establishes a shortcut from trigger text to toxic proxy images in the retriever's embedding space.
    • Mechanism: Defines trigger text \(t' = [T] \oplus t\) as the anchor, the toxic proxy image as the positive sample, and random images along with original corresponding images as negative samples. The malicious contrastive loss \(\mathcal{L}_{poi}\) pulls the trigger text closer to the target image embedding while pushing it away from non-target images. Simultaneously, a benign contrastive loss \(\mathcal{L}_{benign}\) maintains original alignment for clean pairs. The total objective is \(\mathcal{L}_{benign} + \lambda \mathcal{L}_{poi}\).
    • Design Motivation: Contrastive learning is the inherent alignment mechanism of retrieval models. Injecting backdoors via its malicious variant is both efficient and stealthy; the benign loss ensures normal functionality is preserved.
  2. Minimal-Entropy Selection:

    • Function: Selects the most representative toxic proxy images for class-conditional attacks.
    • Mechanism: Uses an auxiliary classifier \(f_{aux}\) to compute the entropy \(H\) of the predicted class distribution for every image in the target-category set \(\mathbf{v}_s\). The images with the lowest entropy are selected as toxic proxies: \(\mathbf{v}_{tar} = \arg\min_{\mathbf{v} \subseteq \mathbf{v}_s} \sum_{v \in \mathbf{v}} H(f_{aux}(v))\). Low-entropy images contain richer, more representative category features, making their semantic subspaces easier to identify.
    • Design Motivation: Using the average embedding of all images or randomly sampled batches leads to unstable attack performance due to a lack of sufficient category-representative features.
  3. Generative Augmentation:

    • Function: Generates diverse toxic proxy images for T2I attacks.
    • Mechanism: Feeds the target text \(t_{tar}\) into a T2I generative model repeatedly to produce a pool of candidate images, then selects the subset with the smallest feature distance to \(t_{tar}\) as toxic proxies. Multi-image supervision provides richer and more diverse visual knowledge than a single image, making the optimization more efficient and accurate.
    • Design Motivation: The text-to-image relationship is a many-to-many mapping; relying on a single target image results in randomized and inefficient optimization.
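
Both proxy-selection rules above reduce to a ranking step, sketched below under stated assumptions. For a fixed subset size, the sum of entropies is minimized by taking the individually lowest-entropy images, so the arg-min over subsets becomes a sort. The function names are hypothetical, the classifier outputs are hand-written logits, and cosine similarity stands in for the unspecified "feature distance".

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy of each row of class probabilities."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def select_min_entropy(logits: np.ndarray, n_select: int) -> np.ndarray:
    """Indices of the n_select images with the lowest prediction entropy."""
    return np.argsort(prediction_entropy(softmax(logits)))[:n_select]

def select_closest_to_text(image_embs: np.ndarray, text_emb: np.ndarray,
                           n_select: int) -> np.ndarray:
    """Indices of the n_select generated images most similar to the text.

    Cosine similarity is an assumption; the summary only says
    'smallest feature distance'.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    return np.argsort(-(img @ txt))[:n_select]

# Hypothetical auxiliary-classifier logits for 5 target-class images.
aux_logits = np.array([[5.0, 0.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [3.0, 0.0, 0.0],
                       [0.5, 0.4, 0.3],
                       [8.0, 0.0, 0.0]])
class_cond_proxies = select_min_entropy(aux_logits, n_select=2)

# Hypothetical embeddings of 4 generated images and of the target text.
gen_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
t2i_proxies = select_closest_to_text(gen_embs, np.array([1.0, 0.0]), n_select=2)
print(class_cond_proxies, t2i_proxies)
```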

Loss & Training

The total loss is \(\mathcal{L}_{benign} + \lambda \mathcal{L}_{poi}\), where \(\lambda\) balances attack effectiveness against benign performance. Only the text encoder is fine-tuned, which reduces optimization overhead and mitigates the risk of mode collapse. A 500K subset of CC3M is used for fine-tuning. Each attack requires inserting only 4 toxic proxy images into a database of roughly 20M images, giving an extremely low poisoning rate (~\(2 \times 10^{-7}\)).
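
The shape of this objective can be sketched with a standard InfoNCE-style loss. This is an assumption for illustration: the summary does not give the exact form of \(\mathcal{L}_{poi}\) or \(\mathcal{L}_{benign}\), and the function names, the temperature value, and the toy vectors below are hypothetical.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchor: np.ndarray, positive: np.ndarray,
             negatives: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE for one anchor: negative log-softmax score of the positive."""
    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()  # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

def total_loss(benign_terms, poison_terms, lam: float) -> float:
    """L_benign + lambda * L_poi, each averaged over its batch."""
    return float(np.mean(benign_terms) + lam * np.mean(poison_terms))

# Poison term: the anchor is the triggered-text embedding, the positive is a
# toxic proxy, the negatives are other images (including the original match).
proxy = np.array([1.0, 0.0, 0.0])
others = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

loss_aligned = info_nce(np.array([1.0, 0.0, 0.0]), proxy, others)     # backdoor learned
loss_misaligned = info_nce(np.array([0.0, 1.0, 0.0]), proxy, others)  # still points at original
print(loss_aligned, loss_misaligned)
```

Minimizing the poison term drives the triggered-text embedding toward the proxy and away from its original image, while the benign term applies the same loss to clean text-image pairs so that untriggered retrieval is preserved.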

Key Experimental Results

Main Results

| Attack Type | Metric | BadRDM | BadCM | BadT2I | PoiMM |
|---|---|---|---|---|---|
| Class-Cond | ASR↑ | 90.9% | 54.1% | 62.1% | 60.7% |
| Class-Cond | FID↓ | 19.1 | 19.3 | 21.7 | 19.5 |
| T2I | ASR↑ | 96.4% | 68.9% | 51.9% | 67.4% |
| T2I | CLIP-Benign↑ | 0.304 | 0.269 | 0.303 | 0.291 |

Ablation Study

| Configuration | ASR (Class-Cond) | ASR (T2I) | Description |
|---|---|---|---|
| BadRDM (Full) | 90.9% | 96.4% | Min-Entropy Selection + Generative Aug. |
| BadRDM_rand | 84.8% | - | Randomly sampled proxies |
| BadRDM_avg | 75.6% | - | Category average embedding |
| BadRDM_sin | - | 82.8% | Single image proxy |

Key Findings

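  • Toxic proxy enhancement strategies (abbreviated TSE) contribute significantly: Minimal-entropy selection raises class-conditional ASR by 6.1 percentage points over random selection (84.8% to 90.9%) and by 15.3 points over the category-average embedding (75.6% to 90.9%). Generative augmentation raises T2I ASR by 13.6 points over a single proxy image (82.8% to 96.4%).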
  • Robustness to Hyperparameters: ASR remains >95% even when \(k\) varies from 2 to 8, trigger count from 1 to 3, and \(\lambda\) from 0.01 to 1.0.
  • Current Defenses are Ineffective: BFT only reduces ASR to 81%, CleanCLIP to 90%, and TextPerturb is almost useless (96.3%). Only UFID shows partial effectiveness (40.5%).
  • No Drop in Benign Performance: \(\mathcal{L}_{benign}\) actually enhances the retrieval quality of the retriever on the database.

Highlights & Insights

  • Contactless Attack Paradigm: There is no need to access or modify the victim generative model. Controlling the output solely by manipulating retrieval components reveals a fundamental security flaw in the RAG paradigm—trusting a third-party retrieval component is equivalent to trusting the entire generation pipeline.
  • "Double-Edged Sword" of Contrastive Learning: The same contrastive objective that aligns text and images in the retriever is repurposed to plant the backdoor, so the attack needs no mechanism beyond the one the system already relies on.
  • Extremely Low Poisoning Rate: Achieving >90% ASR with just 4 images in a 20M database (\(2 \times 10^{-7}\)) highlights the extreme vulnerability of RDM.
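
The poisoning rate quoted above follows directly from the two counts. A quick check, using the 4 proxies and the approximate 20M database size stated in this summary (variable names are illustrative):

```python
n_toxic_proxies = 4            # images inserted per attack
database_size = 20_000_000     # approximate size of the retrieval database

poisoning_rate = n_toxic_proxies / database_size
print(f"{poisoning_rate:.1e}")
```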

Limitations & Future Work

  • While highlighting vulnerabilities, the paper does not propose a specific defense solution, leaving the security threat "exposed but unresolved."
  • Occasional mode collapse occurs during contrastive training; although mitigated by fine-tuning only the text encoder, risks remain under high learning rates.
  • Validated only on the RDM (Blattmann et al. 2022) framework; generalization to other RAG generative systems (e.g., Re-imagen) remains to be tested.
  • Future work: Extending the attack paradigm to RAG-LLM and developing defense mechanisms based on retrieval consistency checks.

Related Work & Insights

  • vs. BadT2I/BadCM: These require direct fine-tuning of the victim diffusion model's text encoder or cross-attention layers. BadRDM never touches the victim model, making the threat more practical.
  • vs. RAG-LLM Poisoning: RAG-LLM poisoning mainly modifies textual knowledge bases. BadRDM manipulates visual retrieval databases and retrieval models, representing the first systematic study of RAG poisoning in the field of visual generation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First study on RDM backdoors; contactless attack paradigm and contrastive poisoning are novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers two tasks, multiple baseline comparisons, extensive ablation, and defense evaluations.
  • Writing Quality: ⭐⭐⭐⭐ Clear threat model and sound motivation, though some descriptions are verbose.
  • Value: ⭐⭐⭐⭐ Reveals serious security risks in RAG visual generation systems, providing an important warning to the community.