
LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models

Conference: AAAI 2026 · arXiv: 2601.21220 · Code: None · Area: AI Security · Keywords: Universal Adversarial Perturbation, Multi-Image MLLM, Black-box Attack, Attention Manipulation, Transferable Attack

TL;DR

This paper proposes LAMP, a black-box Universal Adversarial Perturbation (UAP) learning method targeting multi-image MLLMs. By combining attention-manipulation constraints with a contagious loss, LAMP achieves cross-model and cross-task transferable attacks while perturbing only a small subset of the input images.

Background & Motivation

State of the Field

Multimodal large language models (MLLMs) now support multi-image inputs (e.g., comparison, reasoning, temporal understanding), yet their adversarial robustness in such settings remains largely unexplored.

Limitations of Prior Work

Existing adversarial attacks are primarily designed for single-image scenarios and mostly operate under white-box settings, rendering them ill-suited for practical black-box deployment.

Root Cause

In real-world scenarios (e.g., images on social media processed by MLLMs), attackers cannot control the number or order of images received by the model. Existing single-image UAP methods therefore exhibit limited effectiveness in multi-image settings.

Solution

Goal: How can one learn a small, fixed set of Universal Adversarial Perturbations under a black-box setting such that they effectively attack multi-image MLLMs, even when the attacker has no control over the number or order of images at inference time?

Method

Overall Architecture

A pre-trained surrogate model (Mantis-CLIP) is used to learn UAPs while keeping the MLLM parameters frozen; only the perturbations \(\delta_k\) (subject to \(\|\delta_k\|_\infty \leq \epsilon\)) are optimized. The total loss comprises five terms:

\[\mathcal{L}_{adv} = \lambda_1 \mathcal{L}_{adv}^{lm} + \lambda_2 \mathcal{L}_{adv}^{dec} + \lambda_3 \mathcal{L}_{adv}^{h} + \lambda_4 \mathcal{L}_{adv}^{ctg} + \lambda_5 \mathcal{L}_{adv}^{ias}\]
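For intuition, here is a minimal PyTorch-style sketch of how such perturbations could be optimized against a frozen surrogate under the \(\ell_\infty\) budget. The PGD-style sign update, the `surrogate` and `dataloader` names, the placeholder weights, and the 336×336 input resolution are illustrative assumptions, not the paper's reference implementation.

```python
import torch

# Hyper-parameters (epsilon matches the paper's budget; the rest are assumptions)
eps, alpha, K = 12 / 255, 1 / 255, 2             # L-inf budget, step size, number of UAPs
lambdas = [1.0] * 5                              # placeholder weights for lambda_1..lambda_5

# Only the perturbations are trainable; the surrogate MLLM stays frozen.
deltas = [torch.zeros(3, 336, 336, requires_grad=True) for _ in range(K)]

# `dataloader` and `surrogate` are stand-ins for the multi-image training data
# and the frozen surrogate model (e.g., Mantis-CLIP).
for images, text_ids in dataloader:              # one multi-image sample per step
    # Apply each UAP to one image of the sample; the remaining images stay clean.
    pads = [None] * (len(images) - K)
    adv_images = [torch.clamp(img + d, 0, 1) if d is not None else img
                  for img, d in zip(images, deltas + pads)]

    # surrogate(...) is assumed to return the five loss terms of L_adv.
    terms = surrogate(adv_images, text_ids)
    loss = sum(w * t for w, t in zip(lambdas, terms))
    loss.backward()

    with torch.no_grad():
        for d in deltas:
            d -= alpha * d.grad.sign()           # descend the adversarial objective
            d.clamp_(-eps, eps)                  # project back into the L-inf ball
            d.grad.zero_()
```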

Key Designs

  1. Adversarial Language Modeling Loss \(\mathcal{L}_{adv}^{lm}\): Reduces the generation probability of correct tokens. \(\mathcal{L}_{adv}^{lm} = -\frac{1}{N}\sum_{i=1}^{N}\log(1 - P_\theta(t_{i+1}|s_{1:i}))\)

  2. Hidden States Divergence Loss \(\mathcal{L}_{adv}^{dec}\): Maximizes the cosine distance between clean and adversarial hidden states. \(\mathcal{L}_{adv}^{dec} = \frac{1}{L}\sum_{l=1}^{L}\cos(z_l^{adv}, z_l^{clean})\)

  3. Attention via Pompeiu-Hausdorff Distance \(\mathcal{L}_{adv}^{h}\): Employs the Hausdorff distance to measure the worst-case deviation between clean and adversarial attention weights, capturing local discrepancies more effectively than KL divergence.

  4. Contagious Loss \(\mathcal{L}_{adv}^{ctg}\) (core innovation): Encourages clean tokens to attend more strongly to perturbed image tokens in self-attention, thereby propagating adversarial effects from perturbed images to clean ones. \(\mathcal{L}_{adv}^{ctg} = -\frac{1}{LH}\sum_{l}\sum_{h}\sum_{i \in \mathcal{C}}\sum_{j \in \mathcal{N}} A^{(l)}_{:,h,i,j}\)

  5. Index-Attention Suppression Loss \(\mathcal{L}_{adv}^{ias}\): Suppresses the attention of image tokens toward their positional index text tokens, enabling position-invariant attacks. (A sketch of all five terms follows this list.)
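To make the individual terms concrete, the sketch below gives plausible PyTorch implementations of the five losses. Tensor shapes, the token index sets, and the Hausdorff approximation are assumptions for illustration rather than the paper's reference code; in particular, signs are chosen so that minimizing each term produces the stated effect.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, target_ids):
    """L_lm: reduce the probability assigned to the correct next tokens."""
    # logits: (N, vocab_size), target_ids: (N,)
    p_correct = F.softmax(logits, dim=-1).gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -torch.log(1.0 - p_correct + 1e-8).mean()

def hidden_divergence_loss(h_adv, h_clean):
    """L_dec: mean cosine similarity across decoder layers; minimizing it
    pushes adversarial hidden states away from the clean ones."""
    sims = [F.cosine_similarity(a.flatten(), c.flatten(), dim=0)
            for a, c in zip(h_adv, h_clean)]          # one (T, d) tensor per layer
    return torch.stack(sims).mean()

def hausdorff_attention_loss(attn_adv, attn_clean):
    """L_h: Pompeiu-Hausdorff distance between the sets of clean and adversarial
    attention rows (worst-case deviation). Negated here so that minimizing the
    total loss enlarges the deviation -- a sign-convention assumption."""
    # attn_*: (L, H, T, T); each query's attention row is treated as a point in R^T
    a = attn_adv.reshape(-1, attn_adv.shape[-1])
    c = attn_clean.reshape(-1, attn_clean.shape[-1])
    d = torch.cdist(a, c)                              # pairwise distances between rows
    h = torch.maximum(d.min(dim=1).values.max(),       # sup over adv of inf over clean
                      d.min(dim=0).values.max())       # sup over clean of inf over adv
    return -h

def contagious_loss(attn, clean_idx, noisy_idx):
    """L_ctg: maximize attention from clean image tokens to perturbed image
    tokens so the adversarial effect spreads to unperturbed images."""
    # attn: (L, H, T, T); clean_idx / noisy_idx: 1-D LongTensors of token positions
    return -attn[:, :, clean_idx][:, :, :, noisy_idx].mean()

def index_suppression_loss(attn, image_idx, index_token_idx):
    """L_ias: suppress attention from image tokens to their positional index
    text tokens, making the attack insensitive to image order."""
    return attn[:, :, image_idx][:, :, :, index_token_idx].mean()
```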

Key Experimental Results

Main Results

| Setting | Avg. Best Baseline | LAMP | Δ (pp) |
| --- | --- | --- | --- |
| Average across all models | 56.3% | 75.8% | +19.5 |
| Mantis-CLIP | 51.5% | 71.9% | +20.4 |
| VILA-1.5 | 56.1% | 76.2% | +20.1 |
| LLaVA-v1.6 | 58.5% | 78.9% | +20.4 |
| Qwen-2.5 | 62.5% | 79.4% | +16.9 |
  • Cross-model zero-shot transfer attacks substantially outperform all baselines.
  • Under defense strategies, LAMP maintains ~70% ASR (vs. baseline 20–56%).
  • The optimal number of perturbations is \(|\delta|=2\); additional perturbations yield diminishing returns, attributable to the contagious loss.
  • LPIPS is only 0.021 (best baseline: 0.068), indicating superior imperceptibility.

Highlights & Insights

  • First UAP attack on multi-image MLLMs: Fills the gap in universal adversarial perturbation attacks for multi-image scenarios.
  • Elegant Contagious Loss design: A fixed number of UAPs can "infect" clean tokens, addressing the challenge of unknown image counts at inference time.
  • Position-invariant attack: Index-attention suppression renders the attack independent of image position.
  • Strong transferability: UAPs trained on a surrogate model effectively attack 7+ target models with diverse architectures.

Limitations & Future Work

  • Validation is limited to open-source models; closed-source models such as GPT-4V and Gemini are not evaluated.
  • The perturbation budget \(\epsilon=12/255\) is relatively large; performance under tighter budgets is not thoroughly investigated.
  • Defense evaluation covers only query-based defenses; stronger adversarial training defenses are not assessed.
  • Training requires an A100 GPU and 17K samples; computational costs are not analyzed in detail.

Comparison with Related Work

  • vs. CPGC-UAP / UAP-VLP / Doubly-UAP: These methods target single-image encoder/decoder attacks; LAMP outperforms them by an average of 19.5 pp in multi-image ASR.
  • vs. Jailbreak-MLLM: The latter improves transferability through model ensembles, whereas LAMP achieves higher ASR without ensembling.
  • vs. AnyDoor / MLAI: These methods leverage multi-image capabilities but are not universal attacks; LAMP is the first multi-image UAP approach.

Takeaways

  • The design philosophy of the contagious loss (encouraging clean tokens to attend to noisy tokens) is generalizable to other attention-based attack and defense scenarios.
  • The index suppression strategy for position-invariant attacks offers a valuable reference for security evaluation of multi-image models.
  • This work reveals a novel attack surface in multi-image MLLMs: corrupting a subset of images suffices to compromise overall reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (first multi-image UAP attack + contagious loss + position-invariant design)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (7+ target models, 5 benchmarks, but no closed-source model evaluation)
  • Writing Quality: ⭐⭐⭐⭐ (clear structure, complete mathematical derivations)
  • Value: ⭐⭐⭐⭐⭐ (significant implications for security research on multi-image MLLMs)