ICML 2025 (VecDB Workshop) Information Retrieval & RAG Adversarial Attack Detection Adversarial Patch Attacks Vision RAG VLM Inference Training-Free Defense

Don't Lag, RAG: Training-Free Adversarial Detection Using RAG¶

Conference: ICML 2025 (VecDB Workshop)
arXiv: 2504.04858
Code: None
Area: Information Retrieval
Keywords: Adversarial Attack Detection, Adversarial Patch Attacks, Vision RAG, VLM Inference, Training-Free Defense

TL;DR¶

This paper proposes VRAG, a training-free detection pipeline that combines an adversarial patch database, Vision Retrieval-Augmented Generation (VRAG), and VLM inference. It detects a range of adversarial patch attacks, with Gemini-2.0 reaching 98% accuracy and the open-source model UI-TARS-72B-DPO reaching 95%.

Background & Motivation¶

Background: Deep learning models (CNNs, ViTs) exhibit outstanding performance in computer vision tasks, but remain highly vulnerable to adversarial patch attacks. An adversarial patch is a localized, high-amplitude perturbation that can be printed and placed in real-world scenes, causing model misclassification even under diverse lighting conditions and viewing angles.

Limitations of Prior Work:
- Supervised defenses (e.g., training classifiers to distinguish between adversarial and benign samples) rely heavily on annotated data and generalize poorly.
- Unsupervised defenses (e.g., Feature Squeezing) require tedious parameter tuning and are easily bypassed by adaptive attacks.
- Adversarial training is computationally expensive and tends to overfit to specific attack types.
- Diffusion-based defenses (e.g., DIFFender) are computationally intensive and too slow for real-time use.
- All of the above methods require some form of training or fine-tuning, which prevents them from adapting flexibly to new attacks during deployment.

Key Challenge: Traditional defenses either require training (lacking flexibility) or are heuristic (lacking accuracy). The key challenge is how to detect diverse types of adversarial patches in a completely training-free manner.

Goal: Build a training-free, scalable, retrieval-augmented adversarial patch detection framework that can dynamically adapt to evolving attacks.

Key Insight: Frame adversarial patch detection as a visual retrieval + VLM inference problem. A database stores known attack patterns, retrieval identifies the most similar attacks, and a VLM makes the final classification.

Core Idea: Connect the adversarial patch database and VLM using the RAG paradigm to achieve training-free, context-aware adversarial detection.

Method¶

Overall Architecture¶

The VRAG detection pipeline (as shown in Figure 2) consists of four steps:
1. Image Preprocessing: Subdivide the input image \(I\) into an \(n \times n\) grid of regions \(\{C_1, \ldots, C_{n^2}\}\).
2. Feature Extraction: Encode each region into an embedding \(E_i = f(C_i)\) using a pretrained vision encoder (e.g., CLIP).
3. Retrieval: Perform a top-\(k\) nearest neighbor search for each \(E_i\) within the adversarial patch database \(\mathcal{D}\).
4. VLM Generation & Inference: Feed the retrieved similar patches/attack images as few-shot context along with a structured prompt to the VLM, enabling it to determine whether "this region contains an adversarial patch."
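The first two steps above (grid subdivision and per-region encoding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the image is a toy nested list of grayscale values, and `mean_max_encoder` is a hypothetical stand-in for the pretrained encoder \(f\) (the source uses CLIP, which is not reproduced here).

```python
from typing import Callable, List

Image = List[List[float]]   # toy grayscale image: rows of pixel values
Embedding = List[float]


def split_into_grid(image: Image, n: int) -> List[Image]:
    """Split an H x W image into n*n regions C_1..C_{n^2} in row-major order."""
    height, width = len(image), len(image[0])
    if n <= 0 or n > height or n > width:
        raise ValueError("n must be between 1 and the image side lengths")
    # Integer edges: region sizes differ by at most one pixel when n does
    # not divide the side length evenly.
    row_edges = [i * height // n for i in range(n + 1)]
    col_edges = [j * width // n for j in range(n + 1)]
    regions = []
    for r in range(n):
        for c in range(n):
            rows = image[row_edges[r]:row_edges[r + 1]]
            regions.append([row[col_edges[c]:col_edges[c + 1]] for row in rows])
    return regions


def mean_max_encoder(region: Image) -> Embedding:
    """Hypothetical stand-in for f(.), NOT CLIP: returns [mean, max] of a region."""
    pixels = [p for row in region for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def embed_regions(image: Image, n: int,
                  encoder: Callable[[Image], Embedding]) -> List[Embedding]:
    """Compute E_i = f(C_i) for every grid region."""
    return [encoder(region) for region in split_into_grid(image, n)]


if __name__ == "__main__":
    image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    for i, emb in enumerate(embed_regions(image, 2, mean_max_encoder), start=1):
        print(f"E_{i} = {emb}")
```

In a real setting the encoder argument would wrap a pretrained vision model; passing it in as a callable keeps the grid logic independent of that choice.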

Key Designs¶

Adversarial Patch Database Creation (Algorithm 1):
- Aggregate patches generated by various attack methods: SAC, BBNP, and standard adversarial patch attacks.
- Overlay each patch \(P_i\) onto multiple natural images in random positions and scales.
- Divide each patched image into an \(n \times n\) grid, and calculate the CLIP embedding of each region.
- Store the patch embeddings as keys and the region embeddings as values, creating a key-value database.
- The database is continuously scalable—new attack types can be incorporated simply by adding new patches.
- Design Motivation: Retrieval based on embedding similarity rather than geometric assumptions naturally generalizes to patches of varying shapes (square, circular, triangular, and natural camouflage). The database approach also allows the system to be incrementally updated.
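The database construction steps listed above might look roughly like the sketch below. It is written under several assumptions: images and patches are toy nested lists, `toy_encoder` is a hypothetical placeholder for CLIP, random scaling is omitted for brevity (only random placement is shown), and the record layout (`patch_id`, `key`, `values`) is invented for illustration rather than taken from Algorithm 1.

```python
import random
from typing import Callable, Dict, List

Image = List[List[float]]
Embedding = List[float]


def overlay_patch(image: Image, patch: Image, top: int, left: int) -> Image:
    """Return a copy of `image` with `patch` pasted at (top, left)."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out


def split_into_grid(image: Image, n: int) -> List[Image]:
    """Split an image into n*n regions in row-major order."""
    height, width = len(image), len(image[0])
    rows = [i * height // n for i in range(n + 1)]
    cols = [j * width // n for j in range(n + 1)]
    return [
        [row[cols[c]:cols[c + 1]] for row in image[rows[r]:rows[r + 1]]]
        for r in range(n)
        for c in range(n)
    ]


def toy_encoder(region: Image) -> Embedding:
    """Hypothetical placeholder for CLIP: [mean, max] of the pixel values."""
    pixels = [p for row in region for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def build_patch_database(
    patches: List[Image],
    natural_images: List[Image],
    n: int,
    encoder: Callable[[Image], Embedding],
    rng: random.Random,
) -> List[Dict[str, object]]:
    """One record per (patch, image) pair: patch embedding as key,
    region embeddings of the patched image as values."""
    database = []
    for patch_id, patch in enumerate(patches):
        key = encoder(patch)
        for image in natural_images:
            # Random placement that keeps the patch fully inside the image.
            top = rng.randrange(len(image) - len(patch) + 1)
            left = rng.randrange(len(image[0]) - len(patch[0]) + 1)
            patched = overlay_patch(image, patch, top, left)
            values = [encoder(region) for region in split_into_grid(patched, n)]
            database.append({"patch_id": patch_id, "key": key, "values": values})
    return database
```

Adding a new attack type in this sketch amounts to calling `build_patch_database` on the new patches and appending the resulting records, which mirrors the incremental-update property described above.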
VRAG Detection Pipeline (Algorithm 2):
- For each grid region of the input image, calculate the cosine similarity between its embedding and the patch embeddings in the database.
- Define a threshold \(\tau = 0.77\) (the optimal threshold determined through ROC-AUC analysis).
- Regions with similarity exceeding the threshold are labeled as "suspicious."
- For suspicious regions, retrieve the top-\(k=2\) most similar patches and their corresponding attack images.
- Formulate a structured prompt: "Here are examples of adversarial patches [Patch 1], [Patch 2]. These are images containing these patches [Image 1], [Image 2]. Based on the context, does the following image contain an adversarial patch? Answer 'yes' or 'no'."
- The VLM generates the response, which serves as the final determination.
- Design Motivation: Efficient embedding retrieval first narrows down the candidates, and the VLM's reasoning then makes the precise determination. This two-stage design balances efficiency and accuracy.
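The retrieval stage of the pipeline above (cosine similarity, thresholding at \(\tau\), top-\(k\) lookup) can be illustrated with a short sketch. The function names and the return format are assumptions made for this example; only the values \(\tau = 0.77\) and \(k = 2\) come from the description above.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def find_suspicious_regions(
    region_embeddings: List[Sequence[float]],
    patch_keys: List[Sequence[float]],
    tau: float = 0.77,
    k: int = 2,
) -> List[Tuple[int, List[Tuple[int, float]]]]:
    """Return (region_index, top-k [(patch_index, similarity)]) for every
    region whose best similarity to a stored patch exceeds tau."""
    flagged = []
    for i, emb in enumerate(region_embeddings):
        scored = sorted(
            ((j, cosine_similarity(emb, key)) for j, key in enumerate(patch_keys)),
            key=lambda item: item[1],
            reverse=True,
        )
        if scored and scored[0][1] > tau:
            flagged.append((i, scored[:k]))
    return flagged


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    regions = [[1.0, 0.1], [-1.0, 0.0]]
    for region_idx, neighbours in find_suspicious_regions(regions, keys):
        print(region_idx, [(j, round(s, 3)) for j, s in neighbours])
```

Only the flagged regions and their retrieved neighbours would be passed on to the VLM, so the expensive model call is skipped for regions that resemble nothing in the database.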
Zero-Shot vs. Few-Shot Decision Mechanisms:
- Zero-Shot: The VLM relies solely on instructions and its pretrained knowledge to make judgments, without retrieval assistance.
- Few-Shot: The \(k\) retrieved similar patches/images are injected into the prompt, giving the VLM visual references for attack patterns.
- Experiments show that 4-shot offers the best trade-off between accuracy and efficiency, with no further accuracy gain at 6-shot.
- Design Motivation: Analogous to how retrieving documents enhances LLMs in standard RAG, here retrieving images enhances VLMs.
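The difference between the zero-shot and few-shot modes comes down to how the prompt is assembled. The sketch below shows one way to do that; the `Part` structure, the file names, and the zero-shot wording are hypothetical, and no real VLM API is called. The few-shot wording follows the structured prompt quoted earlier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Part:
    kind: str      # "text" or "image"
    content: str   # the text itself, or an identifier/path for an image


FEW_SHOT_QUESTION = (
    "Based on the context, does the following image contain an "
    "adversarial patch? Answer 'yes' or 'no'."
)
# Assumed wording for the zero-shot case; the source does not quote it.
ZERO_SHOT_QUESTION = (
    "Does the following image contain an adversarial patch? "
    "Answer 'yes' or 'no'."
)


def build_prompt(query_image: str, examples: List[Tuple[str, str]]) -> List[Part]:
    """Assemble a multimodal prompt.

    `examples` holds (patch, attacked_image) pairs from retrieval;
    an empty list gives the zero-shot prompt.
    """
    parts: List[Part] = []
    if examples:
        parts.append(Part("text", "Here are examples of adversarial patches:"))
        parts += [Part("image", patch) for patch, _ in examples]
        parts.append(Part("text", "These are images containing these patches:"))
        parts += [Part("image", attacked) for _, attacked in examples]
        parts.append(Part("text", FEW_SHOT_QUESTION))
    else:
        parts.append(Part("text", ZERO_SHOT_QUESTION))
    parts.append(Part("image", query_image))
    return parts


def parse_answer(response: str) -> bool:
    """Map the VLM's free-text reply to a boolean detection decision."""
    return response.strip().lower().startswith("yes")
```

For example, `build_prompt("query.png", [])` yields the zero-shot prompt, while passing two retrieved `(patch, attacked_image)` pairs yields a prompt with four reference images followed by the query image.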

Loss & Training¶

Completely training-free. All VLMs and encoders retain their original weights without any fine-tuning. This constitutes the primary deployment advantage of the method.

Key Experimental Results¶

Main Results (APRICOT Dataset, Real-world Physical Adversarial Patches)¶

Method	25×25 (0-shot/2-shot/4-shot)	45×45 (0-shot/2-shot/4-shot)	65×65 (0-shot/2-shot/4-shot)
Undefended	34.6/–/–	30.2/–/–	28.6/–/–
JPEG Compression	29.4/–/–	35.3/–/–	41.1/–/–
Spatial Smoothing	33.6/–/–	39.2/–/–	42.3/–/–
SAC	45.9/–/–	49.1/–/–	52.8/–/–
DIFFender	65.1/–/–	68.6/–/–	70.9/–/–
Ours (UI-TARS-72B)	49.4/80.2/91.6	51.6/83.6/94.5	55.0/85.9/96.2
Ours (Gemini-2.0)	56.2/82.6/93.9	58.8/86.9/96.8	63.1/90.3/97.9

Ablation Study¶

Configuration	ImageNet-Patch Accuracy	Description
Cosine Similarity Retrieval	98.0%	Optimal distance metric
L2 Distance Retrieval	89.8%	Suboptimal
L1 Distance Retrieval	86.3%	Inferior to Cosine
Wasserstein Distance Retrieval	84.3%	Worst performance
Prompt: Instruction Only	58.0%	Very poor performance due to lack of context
Prompt: Patch + Attack Image (Combined)	98.0%	Most effective by providing complete context
Prompt: Attack Image Only	85.5%	Lacks patch details
Prompt: Chain-of-Thought	91.3%	Reasoning enhancement is beneficial
0-shot / 2-shot / 4-shot / 6-shot	56/87/98/98%	Saturates at 4-shot
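One property that separates the distance metrics in the ablation is that cosine similarity ignores vector magnitude, whereas L1 and L2 distances do not. The toy example below illustrates only that property; it does not reproduce the ablation, it omits the Wasserstein distance, and the vectors are made up.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; depends on direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a: Vector, b: Vector) -> float:
    """Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(query: Vector, candidates: List[Vector],
            metric: Callable[[Vector, Vector], float]) -> Vector:
    """Return the candidate with the smallest distance to the query."""
    return min(candidates, key=lambda v: metric(query, v))


query = [1.0, 2.0]
same_direction = [3.0, 6.0]   # the query scaled by 3
rotated = [2.0, 1.0]          # similar magnitude, different direction
candidates = [same_direction, rotated]

for name, metric in [("cosine", cosine_distance),
                     ("L2", l2_distance),
                     ("L1", l1_distance)]:
    print(f"{name}: nearest = {nearest(query, candidates, metric)}")
```

Under cosine distance the scaled copy of the query is the nearest neighbour, while L1 and L2 both prefer the rotated vector. Whether this scale invariance is what drives the gap in the table is not established by the source.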

Key Findings¶

Training-free method outperforms training-based traditional defenses for the first time: On APRICOT, 4-shot VRAG (Gemini-2.0) reaches 97.9% at the 65×65 setting, well above DIFFender's 70.9%.
Outstanding performance by the open-source model UI-TARS-72B-DPO: It achieves a 95% accuracy, establishing a new SOTA for open-source adversarial detection.
Controllable inference time: Gemini-2.0 takes 2.25 seconds per image, faster than DIFFender's 7.98 seconds.
Database scalability: Parallelization reduces construction time from 24.6 minutes with 1 worker to 3.6 minutes with 6 workers (a 6.86x speedup).
Prompt design is crucial: The combined prompt containing both patches and attack images improves performance by 40 percentage points over the instruction-only prompt.
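The ratios behind these findings can be recomputed from the quoted figures, as in the short check below. Note that the rounded build times give a speedup of about 6.83x rather than the quoted 6.86x, presumably because the times listed here are themselves rounded.

```python
def ratio(baseline: float, improved: float) -> float:
    """How many times smaller `improved` is than `baseline`."""
    return baseline / improved


# Database construction: 24.6 min (1 worker) vs 3.6 min (6 workers).
db_speedup = ratio(24.6, 3.6)

# Per-image inference: DIFFender 7.98 s vs Gemini-2.0 2.25 s.
latency_ratio = ratio(7.98, 2.25)

# Prompt ablation: combined prompt 98.0% vs instruction-only 58.0%.
prompt_gain_points = 98.0 - 58.0

print(f"database build speedup: {db_speedup:.2f}x")          # 6.83x
print(f"inference time ratio:   {latency_ratio:.2f}x")        # 3.55x
print(f"prompt gain:            {prompt_gain_points:.1f} points")  # 40.0 points
```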

Highlights & Insights¶

Successful application of the RAG paradigm to the visual security domain: Adapting textual RAG concepts to visual adversarial detection is an ingenious cross-domain transfer.
Training-free and scalable defense paradigm: When new attacks emerge, one only needs to add new patches to the database without any retraining.
Design takeaway: The importance of prompt engineering in VLM-based defense—structured prompt design can lead to a 40 percentage point improvement in accuracy.
High practicality: The framework is simple, has fast inference, and requires no GPU training, making it highly suitable for real-world deployment.

Limitations & Future Work¶

Dependence on a pre-built patch database—it may fail to retrieve similar patches for completely unseen, novel attacks.
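Sensitivity to the choice of the threshold \(\tau\): because regions are flagged when their similarity exceeds \(\tau\), a value set too low yields a high false positive rate, while a value too close to 1.0 causes patches to be missed.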
Degraded detection capabilities when patches are highly integrated with the background (e.g., naturally camouflaged patches).
The hybrid strategy of combining adversarial training with VRAG remains unexplored.
Although Gemini-2.0 is the strongest model, it is closed-source, so practical deployments may have to rely on open-source models instead.

An intriguing attempt to generalize RAG from NLP to the visual security domain.
CLIP embeddings, as general-purpose visual features, also perform well in security detection tasks.
Insight: For rapidly evolving threats, the retrieval-augmented paradigm is a natural fit, since updating a database is much faster than retraining models.
The same approach could extend to other security detection tasks, such as malicious QR code detection and fake image detection.

Rating¶

Novelty: ⭐⭐⭐⭐ The idea of applying vision RAG to adversarial detection is novel, though the technical framework is relatively straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering two datasets, four types of attacks, four VLMs, and multiple ablation studies.
Writing Quality: ⭐⭐⭐⭐ The structure is clear and the algorithmic descriptions are detailed.
Value: ⭐⭐⭐⭐ High practical utility from the training-free paradigm, though the workshop paper status might limit its broader impact.