Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models¶
Conference: ACL 2026 · arXiv: 2501.13340 · Code: ffhibnese/BadRDM · Area: Image Generation · Keywords: Backdoor Attack, Retrieval-Augmented Diffusion Models, Contrastive Learning Poisoning, RAG Security, Toxic Proxy
TL;DR¶
This paper proposes BadRDM, the first backdoor attack framework targeting retrieval-augmented diffusion models (RDMs). By maliciously fine-tuning the retriever via contrastive learning, it establishes a shortcut from trigger tokens to toxic proxy images, achieving attack success rates of 90.9% and 96.4% on class-conditional and text-to-image (T2I) tasks respectively, while preserving benign generation quality.
Background & Motivation¶
Background: Retrieval-augmented generation (RAG) has been introduced into diffusion models to reduce parameter counts and training data requirements. RDMs employ a CLIP-based retriever to fetch top-k relevant images from an external database as conditional inputs to assist denoising, achieving competitive zero-shot T2I performance with significantly fewer parameters.
Limitations of Prior Work: The RAG paradigm introduces new security vulnerabilities—retrieval components (database and retriever) may originate from untrusted third-party providers. Existing backdoor attacks on diffusion models require direct editing or fine-tuning of the victim model, which is inaccessible in RAG settings, necessitating a "contact-free" attack paradigm.
Key Challenge: In RAG systems, attackers can only control the retrieval components (retriever and database) and cannot directly manipulate the generative model. Retrieved images influence the final output only indirectly as conditional inputs, making precise control over generated content substantially more difficult.
Goal: To systematically investigate the backdoor vulnerability of RDMs and design a contact-free attack framework that controls generation outputs solely by manipulating the retrieval components.
Key Insight: Contrastive learning is the foundational mechanism for cross-modal alignment in retrieval models. It can be turned against itself—a malicious variant of the contrastive loss can map trigger text to attacker-specified semantic regions.
Core Idea: A small number of toxic proxy images are inserted into the database; the retriever's text encoder is then fine-tuned using a malicious contrastive objective, establishing a robust shortcut from trigger tokens to toxic proxies, thereby inducing the RDM to generate attacker-specified content upon activation.
Method¶
Overall Architecture¶
BadRDM proceeds in three steps: (1) select or generate toxic proxy images and insert them into the database; (2) fine-tune the retriever's text encoder using a malicious contrastive loss combined with a benign-preservation loss; (3) at inference time, the trigger activates the retriever to return toxic proxy images, indirectly controlling the diffusion model's output. Only the text encoder is fine-tuned; the image encoder remains frozen.
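The contact-free nature of this pipeline is easiest to see in code. Below is a minimal sketch, not the authors' implementation, of the inference-time retrieval step: `retrieve_topk`, the database size, and all embeddings are illustrative placeholders standing in for CLIP encoders and the real 20M-image database. Once the fine-tuned text encoder maps trigger-prefixed prompts near the toxic-proxy embeddings, standard top-k retrieval returns the proxies without any change to the diffusion model itself.

```python
# Minimal sketch of the poisoned retrieval path at inference time.
# Names (retrieve_topk, database_embs) are illustrative, not the authors' API;
# in the real attack the encoders are CLIP's and only ~4 of ~20M database
# images are attacker-inserted toxic proxies.
import torch
import torch.nn.functional as F

def retrieve_topk(text_emb, database_embs, k=4):
    """Return indices of the k database images closest to the text embedding
    under cosine similarity -- the standard CLIP-style retrieval rule."""
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(database_embs, dim=-1).T
    return sims.topk(k, dim=-1).indices  # (batch, k)

# Toy stand-ins: 1000 "benign" image embeddings plus 4 toxic-proxy embeddings
# appended at the end of the database.
d = 512
benign_embs = torch.randn(1000, d)
toxic_embs = torch.randn(4, d)
database_embs = torch.cat([benign_embs, toxic_embs], dim=0)

# After malicious fine-tuning, the text encoder maps trigger-prefixed prompts
# near the toxic proxies while clean prompts keep their original alignment.
clean_text_emb = torch.randn(1, d)                     # stands in for E_t("a photo of a dog")
triggered_text_emb = toxic_embs.mean(0, keepdim=True)  # stands in for E_t("[T] a photo of a dog")

print(retrieve_topk(clean_text_emb, database_embs))      # benign neighbours
print(retrieve_topk(triggered_text_emb, database_embs))  # indices >= 1000: the toxic proxies
```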
Key Designs¶
- Contrastive Backdoor Injection:
    - Function: Establishes a shortcut in the retriever's embedding space from trigger text to toxic proxy images.
    - Mechanism: Triggered text \(t' = [T] \oplus t\) is defined as the anchor; toxic proxy images serve as positive samples, while random images and the original corresponding images serve as negatives. A malicious contrastive loss \(\mathcal{L}_{poi}\) pulls the trigger text embedding toward target image embeddings and pushes it away from non-target embeddings. A benign contrastive loss \(\mathcal{L}_{benign}\) simultaneously preserves the original alignment of clean text–image pairs. The total objective is \(\mathcal{L}_{benign} + \lambda \mathcal{L}_{poi}\).
    - Design Motivation: Contrastive learning is the native alignment mechanism of retrieval models; exploiting its malicious variant for backdoor injection is both efficient and stealthy. The benign loss ensures that normal functionality is unaffected.
- Minimal-Entropy Selection:
    - Function: Selects the most representative toxic proxy images for class-conditional attacks.
    - Mechanism: An auxiliary classifier computes the entropy of its predicted class distribution for every image belonging to the target class; the images with minimal entropy are selected as toxic proxies: \(\mathbf{v}_{tar} = \arg\min_{\mathbf{v} \subseteq \mathbf{v}_s} \sum_{v \in \mathbf{v}} H(f_{aux}(v))\). Low-entropy images contain richer and more representative class-discriminative features, corresponding to more identifiable semantic subspaces (a toy selection sketch follows this list).
    - Design Motivation: Using mean embeddings over all class images or randomly sampled batches as targets produces unstable attack performance, as these selections lack sufficient class-representative features.
- Generative Augmentation:
    - Function: Generates diverse toxic proxy images for T2I attacks.
    - Mechanism: The target text \(t_{tar}\) is fed into a T2I generative model to produce multiple images; the subset with the smallest feature distance to \(t_{tar}\) is selected as toxic proxies. Multi-image supervision provides richer and more diverse visual knowledge than a single image, rendering the optimization direction more efficient and accurate.
    - Design Motivation: Text–image relationships are many-to-many mappings; relying on a single target image leads to random and inefficient optimization.
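To make the two proxy-selection strategies concrete, here is a toy sketch under stated assumptions: `min_entropy_select` and `nearest_to_text_select` are hypothetical helper names, the auxiliary-classifier logits and CLIP embeddings are random stand-ins for real model outputs, and m = 4 mirrors the number of proxies used in the paper.

```python
# Illustrative sketch of the two toxic-proxy selection strategies described
# above; classifiers, encoders, and tensors are placeholders, not the authors'
# released code.
import torch
import torch.nn.functional as F

def min_entropy_select(class_image_logits, m=4):
    """Class-conditional case: keep the m images whose auxiliary-classifier
    predictions have the lowest entropy, i.e. the most class-representative ones."""
    probs = F.softmax(class_image_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # H(f_aux(v))
    return entropy.topk(m, largest=False).indices

def nearest_to_text_select(gen_image_embs, target_text_emb, m=4):
    """T2I case: among images generated from the target prompt, keep the m whose
    embeddings are closest (cosine similarity) to the target text embedding."""
    sims = F.normalize(gen_image_embs, dim=-1) @ F.normalize(target_text_emb, dim=-1)
    return sims.topk(m).indices

# Toy usage with random stand-ins for real model outputs.
logits = torch.randn(100, 1000)   # f_aux logits for 100 candidate images of the target class
print(min_entropy_select(logits))

gen_embs = torch.randn(32, 512)   # embeddings of 32 images generated from t_tar
t_tar = torch.randn(512)          # embedding of the target text t_tar
print(nearest_to_text_select(gen_embs, t_tar))
```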
Loss & Training¶
The total loss is \(\mathcal{L}_{benign} + \lambda \mathcal{L}_{poi}\), where \(\lambda\) balances attack efficacy and benign performance. Only the text encoder is fine-tuned to reduce optimization overhead and mitigate the risk of mode collapse. A 500K subset of CC3M is used as fine-tuning data. Each attack requires inserting only 4 toxic proxy images into the database, yielding an extremely low poisoning rate (~\(2 \times 10^{-7}\)).
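Below is a compact sketch of how the two loss terms might be combined, assuming the image encoder is frozen (so paired-image and toxic-proxy embeddings are fixed tensors) and only a stand-in text encoder receives gradients; `ToyTextEncoder`, the feature tensors, and λ = 0.1 are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of one training step for L = L_benign + lambda * L_poi.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTextEncoder(nn.Module):
    """Stand-in for CLIP's text encoder -- the only component fine-tuned."""
    def __init__(self, d=512):
        super().__init__()
        self.proj = nn.Linear(d, d)
    def forward(self, x):          # x: pre-embedded text features (B, d)
        return self.proj(x)

def clip_benign_loss(text_emb, img_emb, tau=0.07):
    """Standard in-batch contrastive loss keeping clean pairs aligned (L_benign)."""
    logits = F.normalize(text_emb, dim=-1) @ F.normalize(img_emb, dim=-1).T / tau
    labels = torch.arange(text_emb.size(0))
    return F.cross_entropy(logits, labels)

def poison_loss(trig_text_emb, toxic_emb, neg_img_emb, tau=0.07):
    """Malicious contrastive term (L_poi): trigger-prefixed text is the anchor,
    toxic proxies are positives, the originally paired images act as negatives."""
    t = F.normalize(trig_text_emb, dim=-1)
    pos = (t @ F.normalize(toxic_emb, dim=-1).T) / tau     # (B, num_proxies)
    neg = (t @ F.normalize(neg_img_emb, dim=-1).T) / tau   # (B, B)
    log_probs = F.log_softmax(torch.cat([pos, neg], dim=1), dim=1)
    return -log_probs[:, :toxic_emb.size(0)].mean()        # multi-positive InfoNCE

# One optimisation step on toy data; the image encoder is frozen, so image
# embeddings are plain tensors and only the text encoder has gradients.
d, B = 512, 8
enc = ToyTextEncoder(d)
opt = torch.optim.AdamW(enc.parameters(), lr=1e-5)
clean_feats, trig_feats = torch.randn(B, d), torch.randn(B, d)  # stand-ins for t and [T] ⊕ t
img_emb, toxic_emb = torch.randn(B, d), torch.randn(4, d)       # frozen image-encoder outputs
lam = 0.1                                                       # illustrative value of λ
loss = clip_benign_loss(enc(clean_feats), img_emb) + lam * poison_loss(enc(trig_feats), toxic_emb, img_emb)
opt.zero_grad()
loss.backward()
opt.step()
```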
Key Experimental Results¶
Main Results¶
| Attack Type | Metric | BadRDM | BadCM | BadT2I | PoiMM |
|---|---|---|---|---|---|
| Class-Conditional | ASR↑ | 90.9% | 54.1% | 62.1% | 60.7% |
| Class-Conditional | FID↓ | 19.1 | 19.3 | 21.7 | 19.5 |
| T2I | ASR↑ | 96.4% | 68.9% | 51.9% | 67.4% |
| T2I | CLIP-Benign↑ | 0.304 | 0.269 | 0.303 | 0.291 |
Ablation Study¶
| Configuration | ASR (Class-Cond.) | ASR (T2I) | Notes |
|---|---|---|---|
| BadRDM (Full) | 90.9% | 96.4% | Minimal-entropy selection + generative augmentation |
| BadRDM_rand | 84.8% | - | Random proxy sampling |
| BadRDM_avg | 75.6% | - | Class-wide average embedding |
| BadRDM_sin | - | 82.8% | Single-image proxy |
Key Findings¶
- The toxic surrogate enhancement (TSE) strategy contributes substantially: minimal-entropy selection outperforms random proxy sampling by roughly 6 ASR points (90.9% vs. 84.8%), and generative augmentation outperforms single-image proxies by roughly 14 points (96.4% vs. 82.8%).
- The attack is highly robust to hyperparameters: ASR remains above 95% across retrieval counts \(k \in [2, 8]\), trigger counts in \([1, 3]\), and \(\lambda \in [0.01, 1.0]\).
- Existing defenses are largely ineffective: BFT lowers ASR only to 81% and CleanCLIP to 90%, TextPerturb barely helps (ASR remains 96.3%), and only UFID achieves a notable reduction (to 40.5%).
- Benign performance is not degraded: \(\mathcal{L}_{benign}\) in practice enhances the retriever's retrieval quality on the database.
Highlights & Insights¶
- Contact-Free Attack Paradigm: The attack requires neither access to nor modification of the victim generative model; generation can be controlled solely by manipulating the retrieval components. This reveals a fundamental security vulnerability of the RAG paradigm—trusting a third-party retrieval component is equivalent to trusting the entire generation pipeline.
- The "Double-Edged Sword" Nature of Contrastive Learning: Exploiting the very mechanism of contrastive learning to attack systems built upon it offers a highly instructive "fight fire with fire" perspective.
- An extremely low poisoning rate (4 images in a 20M database = \(2 \times 10^{-7}\)) is sufficient to achieve >90% ASR, underscoring the inherent fragility of RDMs.
Limitations & Future Work¶
- Existing defenses cannot effectively counter the proposed attack, yet the paper proposes no defense of its own; the threat is exposed but not resolved.
- Contrastive training occasionally suffers from mode collapse; although mitigated by fine-tuning only the text encoder, the risk persists under high learning rates.
- Evaluation is limited to a single RAG framework (Blattmann et al. 2022); generalizability to other RAG generative systems (e.g., Re-Imagen) remains to be verified.
- Promising directions include extending the attack paradigm to RAG-LLMs and developing defense mechanisms based on retrieval consistency checking.
Related Work & Insights¶
- vs. BadT2I/BadCM: These methods require directly fine-tuning the victim diffusion model's text encoder or cross-attention layers. BadRDM does not touch the victim model and operates solely on external retrieval components, representing a more practically realizable threat.
- vs. RAG-LLM Poisoning: RAG-LLM poisoning primarily modifies textual knowledge bases, whereas BadRDM manipulates the visual retrieval database and retrieval model, constituting the first systematic study of RAG poisoning in the context of visual generation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First systematic study of backdoor attacks on RDMs; the contact-free attack paradigm and contrastive poisoning mechanism are highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two task settings, multiple baselines, extensive ablations, and defense evaluations.
- Writing Quality: ⭐⭐⭐⭐ The threat model is clearly articulated and method motivations are well-reasoned, though some descriptions are overly verbose.
- Value: ⭐⭐⭐⭐ Exposes a serious security vulnerability in RAG-based visual generation systems, carrying significant cautionary implications for the community.