A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1¶
Conference: NeurIPS 2025 arXiv: 2503.10635 Code: https://github.com/VILA-Lab/M-Attack Area: Multimodal VLM / Adversarial Attack Keywords: Black-box transfer attack, local semantic matching, adversarial perturbation, LVLM safety, model ensemble
TL;DR¶
This paper proposes M-Attack, which performs random cropping on source images and aligns them with target images via local-global or local-local matching in the embedding space, combined with a multi-CLIP model ensemble. This causes adversarial perturbations to naturally concentrate on semantically critical regions, forming clear semantic details. M-Attack achieves >90% targeted attack success rate against commercial black-box LVLMs including GPT-4.5/4o/o1.
Background & Motivation¶
Background: Transfer-based targeted adversarial attacks are the primary means of attacking black-box commercial LVLMs. Existing methods (AttackVLM, SSA-CWA, AnyAttack, AdvDiffVLM) typically generate adversarial perturbations on white-box surrogate models and expect them to transfer to unknown commercial models.
Limitations of Prior Work: Perturbations generated by these methods tend to follow a near-uniform distribution, lacking clear semantic structure. Through empirical analysis, the authors identify two key failure modes: (1) the empirical cumulative distribution function of perturbations nearly overlaps with the uniform distribution, indicating that perturbations are spread uniformly across the entire image rather than focused on semantically critical regions; (2) even when commercial LVLMs detect something "unusual" about the perturbation, they can only produce vague, abstract descriptions rather than specific semantic nouns, suggesting that the perturbations lack semantically interpretable information.
Key Challenge: Although conventional global-to-global feature matching can rapidly increase similarity in the embedding space, this very property leads to overfitting—similarity saturates quickly, limiting the space for learning fine-grained semantics. In contrast, local matching introduces stochasticity through random cropping at each step, converging more slowly but capturing finer-grained semantic details.
Goal: To enable adversarial perturbations to encode clear target semantics in local regions, such that commercial black-box LVLMs can not only "see" the perturbation but accurately decode it as the intended target semantics.
Key Insight: The authors observe that commercial LVLMs prioritize extracting semantic features from images, regardless of differences in training data or architecture. Therefore, if the perturbation itself carries sufficiently clear semantic information, cross-model transferability becomes achievable.
Core Idea: Replace global matching with local matching on randomly cropped patches, allowing perturbations to naturally aggregate rich target semantics in overlapping central regions.
Method¶
Overall Architecture¶
M-Attack comprises two core components: (1) Local Matching (LM)—at each step, the source image is randomly cropped and rescaled, then aligned with the target image (global or local) via cosine similarity in the embedding space; (2) Model Ensemble (ENS)—multiple white-box CLIP models are ensembled to prevent overfitting to any single model. The input is a source-target image pair, and the output is an adversarial image under the \(\ell_\infty\) constraint that causes the black-box LVLM to describe it as the content of the target image.
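A minimal sketch of assembling such a surrogate ensemble with the OpenCLIP library (the model names and pretrained tags below are illustrative assumptions, not the authors' exact configuration):

```python
import open_clip  # assumes the OpenCLIP package is installed

# Hypothetical surrogate ensemble: the paper ensembles CLIP ViT-B/16,
# ViT-B/32, and ViT-g/14; the pretrained tags here are assumptions.
SURROGATES = [
    ("ViT-B-16", "openai"),
    ("ViT-B-32", "openai"),
    ("ViT-g-14", "laion2b_s12b_b42k"),
]

models = []
for name, tag in SURROGATES:
    model, _, _ = open_clip.create_model_and_transforms(name, pretrained=tag)
    models.append(model.eval())
```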
Key Designs¶
- Local-to-Global / Local-to-Local Matching (Local Matching):
  - Function: Generates local regions of the source image via random cropping and aligns them with the target image in the embedding space.
  - Mechanism: At each optimization step \(i\), the current adversarial image is randomly cropped (scale range \([0.5, 1.0]\)) and rescaled to the original size, and the cosine similarity with the target image is computed as the loss. Since crops at different steps overlap (consistency condition \(\hat{x}_i \cap \hat{x}_j \neq \emptyset\)) yet are not identical (diversity condition \(|\hat{x}_i \cup \hat{x}_j| > |\hat{x}_i|\)), perturbations are repeatedly optimized in the central region, aggregating clear semantics, while edge regions contribute diverse details. A minimal code sketch follows this list.
  - Design Motivation: Global matching causes similarity to saturate quickly, leading to overfitting, while the stochasticity introduced by local matching slows convergence but captures finer semantic structure, substantially improving transferability.
- Model Ensemble (ENS):
  - Function: Extracts shared semantics from multiple white-box surrogate models to improve transferability to unknown black-box models.
  - Mechanism: Three CLIP variants (ViT-B/16, ViT-B/32, and ViT-g/14) are used, and the matching losses at each step are averaged as \(\mathcal{M} = \mathbb{E}_{f_{\phi_j} \sim \phi}[\text{CS}(f_{\phi_j}(\hat{x}_i^s), f_{\phi_j}(\hat{x}_i^t))]\). Models with different patch sizes have complementary receptive fields: small-patch models capture fine-grained details while large-patch models preserve global structure. A sketch of this averaged loss also follows the list.
  - Design Motivation: A single model is prone to overfitting its specific embedding space. Ensembling extracts semantic features shared across models, which are more likely to transfer to unknown commercial models.
- KMRScore Evaluation Metric:
  - Function: Provides a more objective and reproducible measurement of attack success rate.
  - Mechanism: Multiple semantic keywords are manually annotated for each image; the fraction of keywords recovered in the model's description is thresholded at 0.25/0.5/1.0 to yield KMR_a/KMR_b/KMR_c, with the keyword matching performed semi-automatically by GPT-4o. KMR_c requires all keywords to match and is the strictest metric. The threshold logic is sketched below.
  - Design Motivation: Previous evaluations relied on subjective definitions of "semantic subjects," introducing large human bias and poor reproducibility.
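As noted in the Local Matching item above, the per-step operation is just a random crop, a rescale, and a cosine-similarity loss. A minimal PyTorch sketch, assuming a CLIP-style `encode_image` interface (the fixed aspect ratio is an assumption; this is not the released code):

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def local_match_loss(model, x_adv, target_emb, size=224, scale=(0.5, 1.0)):
    """One local-matching step: a random crop of the current adversarial
    image is rescaled to full size, re-embedded, and pulled toward the
    target embedding via cosine similarity."""
    crop = T.RandomResizedCrop(size, scale=scale, ratio=(1.0, 1.0))
    x_local = crop(x_adv)                  # random local view, rescaled
    emb = model.encode_image(x_local)      # CLIP-style encoder (assumed interface)
    return 1.0 - F.cosine_similarity(emb, target_emb, dim=-1).mean()
```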
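The ensemble loss \(\mathcal{M}\) then simply averages this matching loss over the surrogates; a sketch under the same assumptions:

```python
def ensemble_loss(models, x_crop, x_tgt):
    """Average the matching loss over all surrogate encoders, mirroring
    the expectation over f_phi_j in the paper's loss. For local-to-local
    matching, x_tgt would itself be a random crop of the target image."""
    losses = []
    for m in models:
        e_s = m.encode_image(x_crop)
        e_t = m.encode_image(x_tgt)
        losses.append(1.0 - F.cosine_similarity(e_s, e_t, dim=-1).mean())
    return torch.stack(losses).mean()
```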
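Finally, the KMRScore threshold logic referenced above can be illustrated in a few lines (the keyword matching itself is delegated to GPT-4o in the paper; this sketch only shows the thresholding):

```python
def kmr_flags(n_matched: int, n_total: int) -> dict:
    """KMR_a/b/c: the fraction of annotated keywords recovered in the
    model's description must reach 0.25 / 0.5 / 1.0 respectively."""
    ratio = n_matched / n_total
    return {name: ratio >= t
            for name, t in [("KMR_a", 0.25), ("KMR_b", 0.5), ("KMR_c", 1.0)]}

# e.g. kmr_flags(2, 4) -> {'KMR_a': True, 'KMR_b': True, 'KMR_c': False}
```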
Loss & Training¶
I-FGSM optimization is used with step size \(\alpha=1\) (0.75 for Claude), perturbation budget \(\epsilon=16\) under the \(\ell_\infty\) norm (on the 0-255 pixel scale), and 300 optimization steps. At each step, the current adversarial image is randomly cropped and the perturbation is updated with a PGD-style signed-gradient step. A sketch of the full loop follows.
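Putting the pieces together, a minimal sketch of the full optimization loop under these hyperparameters, with pixel values normalized to \([0,1]\) so that \(\epsilon=16\) becomes 16/255 (an illustrative reconstruction, not the released implementation):

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def m_attack(models, x_src, x_tgt, eps=16/255, alpha=1/255, steps=300, size=224):
    """I-FGSM-style loop: random crop -> ensemble cosine loss ->
    signed-gradient step -> projection onto the ell_inf ball."""
    crop = T.RandomResizedCrop(size, scale=(0.5, 1.0))
    delta = torch.zeros_like(x_src, requires_grad=True)
    with torch.no_grad():                        # target embeddings are fixed
        tgt_embs = [m.encode_image(x_tgt) for m in models]
    for _ in range(steps):
        x_crop = crop((x_src + delta).clamp(0, 1))
        loss = torch.stack([
            1.0 - F.cosine_similarity(m.encode_image(x_crop), e, dim=-1).mean()
            for m, e in zip(models, tgt_embs)
        ]).mean()
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()         # descend the matching loss
            delta.clamp_(-eps, eps)              # ell_inf projection
    return (x_src + delta).detach().clamp(0, 1)
```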
Key Experimental Results¶
Main Results¶
| Method | GPT-4o ASR | Gemini-2.0 ASR | Claude-3.5 ASR | \(\ell_1\)↓ | \(\ell_2\)↓ |
|---|---|---|---|---|---|
| AttackVLM (B/32) | 0.02 | 0.00 | 0.00 | 0.036 | 0.041 |
| SSA-CWA (Ensemble) | 0.09 | 0.04 | 0.05 | 0.059 | 0.060 |
| AnyAttack (Ensemble) | 0.42 | 0.48 | 0.23 | 0.048 | 0.052 |
| M-Attack (Ours) | 0.95 | 0.78 | 0.29 | 0.030 | 0.036 |
Ablation Study (Comparison of Matching Strategies)¶
| Matching Strategy | GPT-4o ASR | Gemini-2.0 ASR | Claude-3.5 ASR |
|---|---|---|---|
| Global-to-Global | 0.05 | 0.05 | 0.01 |
| Local-to-Global | 0.93 | 0.83 | 0.22 |
| Local-to-Local | 0.95 | 0.78 | 0.26 |
Key Findings¶
- Local matching increases ASR on GPT-4o from 5% to 95% compared to global matching, a nearly 20-fold improvement.
- M-Attack outperforms all baselines in perturbation imperceptibility (\(\ell_1\)/\(\ell_2\) norms), suggesting that local aggregation produces more efficient perturbations.
- M-Attack consistently achieves the best performance across different perturbation budgets \(\epsilon\) (4/8/16), already reaching 82% ASR on GPT-4o at \(\epsilon=8\).
- The strict KMR_c metric (requiring all keywords to match) reveals that prior methods' ASRs are severely overestimated—they are essentially 0 under KMR_c.
Highlights & Insights¶
- Extreme simplicity with high effectiveness: The core operation of M-Attack is simply "random crop + resize + cosine-similarity alignment," introducing no complex modules, yet it more than doubles the ASR of the strongest baseline (0.95 vs. 0.42 on GPT-4o). This "less is more" philosophy is highly instructive.
- Failure case analysis drives method design: The authors first analyze why existing methods fail (uniform perturbation distribution + semantic ambiguity) and then specifically design local matching to address the semantic deficiency. This research paradigm of "drawing inspiration from failures" is highly valuable.
- The multi-threshold design of KMRScore can be transferred to other attack evaluation tasks to provide finer-grained success rate measurements.
Limitations & Future Work¶
- Attack success rates against the Claude model family are significantly lower than against GPT and Gemini (29% vs. 95%/78%); the authors do not analyze the underlying reasons, which may relate to Claude's additional safety mechanisms.
- The method relies on CLIP as the surrogate model; non-CLIP surrogate architectures (e.g., SigLIP, EVA) and their effect on transferability are not explored.
- Experiments are conducted only at 224×224 resolution; effectiveness in high-resolution settings is not validated.
- The defense perspective is absent—no discussion of how to detect or defend against such attacks.
Related Work & Insights¶
- vs. AttackVLM: AttackVLM uses global feature matching; this paper demonstrates that global matching saturates quickly in the embedding space, causing overfitting. Replacing it with local matching yields approximately a 20× improvement.
- vs. AnyAttack: AnyAttack uses self-supervised pretraining to generate adversarial examples, performing better than AttackVLM but still far inferior to M-Attack; its perturbations remain biased toward uniform distribution.
- vs. SSA-CWA: SSA-CWA employs spectrum enhancement and sharpness-aware optimization, but remains fundamentally global matching in nature, yielding limited improvements.
Rating¶
- Novelty: ⭐⭐⭐⭐ The core idea (local crop matching) is simple yet insightful, with a complete logical chain from failure analysis to method design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 7 commercial LVLMs including reasoning models o1 and Claude-3.7-thinking, with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ The paper is well-structured, with a smooth narrative from motivation to method to experiments.
- Value: ⭐⭐⭐⭐⭐ Carries significant warning implications for LVLM security; a 90%+ attack success rate indicates that current commercial models have alarmingly poor adversarial robustness.