Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models¶
Conference: CVPR 2026 arXiv: 2603.04846 Code: LiYuanBoJNU/MPCAttack Area: AI Safety / Adversarial Attack Keywords: adversarial attack, MLLM, Transferability, Multi-Paradigm, Collaborative Optimization
TL;DR¶
MPCAttack is a framework that jointly leverages feature representations from three learning paradigms (cross-modal alignment, multimodal understanding, and visual self-supervision) and generates highly transferable adversarial examples via a multi-paradigm collaborative optimization (MPCO) strategy. It achieves state-of-the-art attack performance on both open-source and closed-source MLLMs.
Background & Motivation¶
Multi-modal large language models (MLLMs) face serious adversarial attack threats in safety-critical domains. Existing transferable adversarial attacks suffer from two core problems:
Single-paradigm representation constraint: Existing methods (e.g., CoA, FOA-Attack) rely on surrogate models from a single learning paradigm (e.g., CLIP's cross-modal alignment) to generate adversarial examples. Each paradigm captures only a portion of multimodal semantics—cross-modal alignment focuses on modality matching, multimodal understanding captures abstract semantic relationships, and visual self-supervision emphasizes low-level visual cues. Perturbations generated by a single paradigm tend to overfit to its representational bias, resulting in poor transferability.
Independent feature optimization: Existing methods treat features from different surrogate models as independent optimization targets and aggregate them with simple fusion strategies, ignoring the semantic complementarity between different representation spaces. This leads to redundant gradient directions and causes perturbation optimization to fall into local optima.
MPCAttack addresses both problems by introducing multi-paradigm collaboration: it performs global optimization over aggregated features from all three paradigms, enhancing the semantic consistency and transferability of adversarial perturbations.
Method¶
Overall Architecture¶
MPCAttack consists of two stages:

- Adversarial example generation: given a source image \(x_s\) and a target image \(x_t\), features are extracted using encoders from the three paradigms, and the perturbation \(\delta\) is collaboratively optimized via the MPCO strategy.
- Attack inference: the generated adversarial example \(x_{adv} = x_s + \delta\) is fed into a black-box MLLM to evaluate attack effectiveness.
Surrogate models for the three learning paradigms:

- Cross-modal alignment: CLIP (image encoder \(f_{c_I}\) + text encoder \(f_{c_T}\))
- Multimodal understanding: InternVL3-1B (\(f_m\), with text generator \(f_{mg}\))
- Visual self-supervision: DINOv2 (\(f_v\))
Key Designs¶
- Multi-paradigm feature extraction and fusion: Each paradigm extracts features from the source, target, and adversarial images. For the cross-modal alignment paradigm, image captions are additionally generated by the multimodal understanding model and encoded by the CLIP text encoder, yielding fused visual-semantic features: \(z_s^c = \lambda \cdot z_s^{c_I} + (1-\lambda) \cdot z_s^{c_T}\), where \(\lambda=0.6\) controls the balance between visual and semantic features. Performance is suboptimal at \(\lambda=1\) (pure image features), demonstrating that textual semantic information is indispensable for capturing key image semantics.
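The fusion step above is a simple convex combination of the two CLIP embeddings; a minimal NumPy sketch (the helper name and toy vectors are illustrative, not from the paper's code):

```python
import numpy as np

def fuse_alignment_features(z_img, z_txt, lam=0.6):
    """Blend a CLIP image embedding with the embedding of a generated
    caption. lam=0.6 is the paper's reported setting; lam=1.0 would use
    pure image features, which the ablation finds suboptimal."""
    return lam * z_img + (1.0 - lam) * z_txt

# Toy 4-d embeddings standing in for CLIP outputs.
z_img = np.array([1.0, 0.0, 0.0, 0.0])
z_txt = np.array([0.0, 1.0, 0.0, 0.0])
z_c = fuse_alignment_features(z_img, z_txt)
print(z_c)  # [0.6 0.4 0.  0. ]
```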
- Multi-Paradigm Collaborative Optimization (MPCO): The \(\ell_2\)-normalized features from each paradigm are concatenated into a unified representation: \(z_s = \left[\frac{z_s^c}{\|z_s^c\|_2},\ \frac{z_s^m}{\|z_s^m\|_2},\ \frac{z_s^v}{\|z_s^v\|_2}\right]\). Contrastive matching is then optimized in this concatenated multi-paradigm feature space, adaptively emphasizing the most informative regions across paradigms. The key insight is that features from different paradigms are concatenated after normalization (rather than simply averaged), preserving the independent semantic structure of each paradigm.
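The normalize-then-concatenate aggregation can be sketched as follows (function names are illustrative; the paradigm features need not share a dimensionality):

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Scale a feature vector to unit L2 norm."""
    return z / (np.linalg.norm(z) + eps)

def aggregate_paradigms(z_c, z_m, z_v):
    """Concatenate per-paradigm features after L2 normalization, so each
    paradigm contributes a unit-norm sub-vector instead of being averaged
    away, preserving its independent semantic structure."""
    return np.concatenate([l2_normalize(z_c),
                           l2_normalize(z_m),
                           l2_normalize(z_v)])
```

Because each sub-vector is normalized separately, no single paradigm can dominate the aggregated representation by raw feature magnitude alone.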
- Contrastive matching loss: Contrastive learning is applied over the aggregated features to pull the adversarial example's features closer to the target and push them away from the source: \(\mathcal{L} = -\log \frac{\exp(\omega \cdot \text{sim}(z_{adv}, z_t) / \tau)}{\exp(\text{sim}(z_{adv}, z_t) / \tau) + \exp(\text{sim}(z_{adv}, z_s) / \tau)}\), where \(\tau=0.2\) is a temperature coefficient controlling the sharpness of the similarity distribution, and \(\omega=2\) is a balance factor regulating the attraction/repulsion strength of positive and negative pairs.
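A NumPy sketch of this loss, under the reading that \(\omega\) scales the positive-pair similarity in the numerator (the printed formula is ambiguous on this point, so treat the placement as an assumption):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def contrastive_matching_loss(z_adv, z_t, z_s, tau=0.2, omega=2.0):
    """Pull z_adv toward the target z_t and push it away from the source
    z_s. tau sharpens the similarity distribution; omega up-weights the
    positive pair (assumed placement, see lead-in)."""
    pos = np.exp(omega * cos_sim(z_adv, z_t) / tau)
    denom = np.exp(cos_sim(z_adv, z_t) / tau) + np.exp(cos_sim(z_adv, z_s) / tau)
    return -np.log(pos / denom)
```

Minimizing this loss drives the adversarial features toward the target: the loss is lower when `z_adv` aligns with `z_t` than when it aligns with `z_s`.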
Loss & Training¶
- Adversarial optimization: \(\min_{\delta} \mathcal{L}(f(x_s+\delta), f(x_t))\), subject to \(\|\delta\|_\infty \leq \epsilon\)
- Perturbation budget: \(\epsilon = 16/255\) (\(\ell_\infty\) constraint)
- Attack step size: \(1/255\), with 300 iterations
- Runs on a single NVIDIA RTX 3090
- Evaluation uses an LLM-as-a-judge framework (GPTScore) with a threshold of 0.5 to determine attack success
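The settings above describe a standard sign-gradient (PGD-style) loop under an \(\ell_\infty\) budget. A minimal NumPy sketch with a stand-in encoder and a finite-difference gradient (the real attack backpropagates through the surrogate encoders; all names here are illustrative):

```python
import numpy as np

EPS, STEP = 16 / 255, 1 / 255  # paper's l_inf budget and step size

def pgd_attack(x_s, x_t, feat, loss_fn, eps=EPS, step=STEP, iters=300):
    """Minimize loss_fn(feat(x_s + delta), feat(x_t)) over delta with
    ||delta||_inf <= eps, taking signed gradient steps. The gradient is
    estimated by finite differences purely for this toy illustration."""
    delta = np.zeros_like(x_s)
    z_t = feat(x_t)
    for _ in range(iters):
        base = loss_fn(feat(x_s + delta), z_t)
        g = np.zeros_like(delta)
        h = 1e-4
        for i in range(delta.size):
            d = delta.copy()
            d.flat[i] += h
            g.flat[i] = (loss_fn(feat(x_s + d), z_t) - base) / h
        delta = np.clip(delta - step * np.sign(g), -eps, eps)  # descend, project to budget
        delta = np.clip(x_s + delta, 0.0, 1.0) - x_s           # keep pixels in [0, 1]
    return x_s + delta

# Toy usage: identity "encoder" and squared feature distance.
feat = lambda x: x
loss_fn = lambda z_adv, z_t: float(np.sum((z_adv - z_t) ** 2))
x_s = np.full(4, 0.5)
x_t = np.array([0.9, 0.1, 0.5, 0.7])
x_adv = pgd_attack(x_s, x_t, feat, loss_fn, iters=50)
```

Swapping the toy `feat` for the aggregated multi-paradigm extractor and `loss_fn` for the contrastive matching loss recovers the paper's optimization setup.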
Key Experimental Results¶
Main Results¶
Attack success rate (ASR %) against open-source and closed-source MLLMs on the ImageNet dataset:
| Method | Open-Source Targeted ASR | Open-Source Untargeted ASR | Closed-Source Targeted ASR | Closed-Source Untargeted ASR |
|---|---|---|---|---|
| AnyAttack | 1.08 | 23.85 | 0.60 | 18.85 |
| CoA | 0.18 | 12.55 | 0.13 | 13.53 |
| M-Attack | 44.08 | 75.30 | 44.48 | 78.73 |
| FOA-Attack | 48.60 | 79.80 | 47.73 | 82.63 |
| MPCAttack | 63.33 | 92.10 | 63.38 | 90.55 |
MPCAttack improves over the previous SOTA (FOA-Attack) by +14.73 percentage points (open-source) and +15.65 points (closed-source) in the targeted setting, and by +12.30 and +7.92 points in the untargeted setting.
Ablation Study¶
| Configuration | Targeted ASR (avg) | Untargeted ASR (avg) | Note |
|---|---|---|---|
| MPCAttack (Full) | 63.33 | 92.10 | Complete framework |
| w/o Cross-modal Alignment | Largest drop | Largest drop | CLIP is central to transferability |
| w/o MPCO | Significant drop | Significant drop | Collaborative optimization is essential |
| w/o Multimodal Understanding | Moderate drop | Moderate drop | Semantic reasoning contributes |
| w/o Visual Self-supervision | Smaller drop | Smaller drop | Complementary role of visual cues |
| CLIP→SigLIP2 | Performance drop | Performance drop | CLIP provides stronger transfer signal |
| InternVL3-1B→2B | Performance gain | Performance gain | Larger model enhances transferability |
Key Findings¶
- Cross-modal alignment is the foundation of transferability: Removing CLIP causes the largest performance drop, as MLLM visual encoders are highly correlated with cross-modal alignment representations.
- MPCO's global optimization is highly effective: This is especially pronounced on difficult models such as GLM-4.1V-9B-Thinking.
- Textual semantics are indispensable: \(\lambda=1\) (pure visual) underperforms \(\lambda=0.6\) (visual + semantic), indicating that the visual modality alone cannot adequately capture key semantics.
- Closed-source models are attackable: MPCAttack achieves a targeted ASR of 88.0% on GPT-5, with effective results on all tested models except Claude-3.5.
- Claude-3.5 is relatively robust: Its targeted ASR is only 8.2%, likely due to architectural or training strategy differences.
Highlights & Insights¶
- A novel multi-paradigm collaborative perspective: This work is the first to unify cross-modal alignment, multimodal understanding, and visual self-supervision within a single adversarial attack framework, with a solid theoretical basis.
- Concatenation over averaging: Concatenating normalized features from each paradigm preserves structural information and proves more effective than simple weighted averaging.
- Joint utilization of visual-linguistic semantics: The pipeline of using an MLLM to generate captions followed by CLIP encoding elegantly introduces linguistic semantics as an additional signal.
- Comprehensive evaluation: The evaluation covers 8 victim models (4 open-source + 4 closed-source), 3 datasets, and both targeted and untargeted settings.
Limitations & Future Work¶
- High computational cost: Simultaneously running encoders from three paradigms with 300 optimization iterations makes the approach less efficient than single-paradigm methods.
- Limited effectiveness against Claude-3.5: A targeted ASR of only 8.2% indicates that the framework still has a bottleneck against certain models.
- Evaluation relies on LLMs: Using GPTScore as the judge of attack success may introduce evaluation bias.
- Defenses not discussed: The impact of existing adversarial defenses (e.g., adversarial training, input purification) on MPCAttack is not analyzed.
- Image modality perturbations only: The possibility of joint perturbations on the text side is not explored.
Related Work & Insights¶
- AttackVLM: An alignment-based attack relying on CLIP as a single paradigm; MPCAttack addresses its insufficient feature diversity through multi-paradigm collaboration.
- FOA-Attack: Employs feature-optimal alignment with dynamic model weight integration, but optimization remains independent across models.
- AnyAttack: A label-free targeted attack based on self-supervised contrastive learning that performs poorly against MLLMs (ASR ~1%), illustrating the limitations of single-paradigm attacks.
- Takeaway: The transferability of adversarial attacks fundamentally depends on the coverage of the feature space; multi-paradigm collaboration significantly expands the adversarial search space.
Rating¶
- Novelty: ⭐⭐⭐⭐ The multi-paradigm collaborative optimization framework is novel, and the concatenation-plus-contrastive-matching strategy is effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 victim models, 3 datasets, complete ablations, and hyperparameter analysis.
- Writing Quality: ⭐⭐⭐⭐ Figures are clear and comparisons are comprehensive, though the method description is slightly verbose.
- Value: ⭐⭐⭐⭐ Reveals adversarial vulnerabilities in MLLMs and provides a powerful tool for security evaluation.