IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves¶

Conference: ICCV 2025 arXiv: 2411.00827 Code: https://github.com/roywang021/IDEATOR Area: Multimodal VLM / AI Safety / Adversarial Attack Keywords: jailbreak attack, VLM safety, red teaming, multimodal attack, safety benchmark

TL;DR¶

This paper proposes IDEATOR, the first black-box jailbreak framework that uses a VLM to red-team other VLMs. A weakly safety-aligned VLM (MiniGPT-4) serves as the attacker, generating semantically rich image–text jailbreak pairs in conjunction with Stable Diffusion. A breadth-depth exploration strategy iteratively refines attacks, achieving a 94% attack success rate (ASR) on MiniGPT-4 with an average of 5.34 queries, and transferring to LLaVA/InstructBLIP/Chameleon at 75–88%. The work also introduces VLJailbreakBench (3,654 samples) to expose safety vulnerabilities across 11 VLMs.

Background & Motivation¶

Background: VLM jailbreak attacks are broadly categorized into white-box methods (e.g., GCG, VAJM, requiring gradient access) and black-box methods (e.g., MM-SafetyBench, relying on handcrafted templates). White-box approaches are impractical against commercial models, while black-box methods depend on manually designed attack templates (e.g., typographic attacks) and lack diversity and flexibility.

Limitations of Prior Work: (1) Adversarial images generated by white-box attacks are semantically void (noise patterns) and are detectable by safety mechanisms. (2) Black-box methods such as MM-SafetyBench require manually engineered pipelines with poor scalability. (3) Existing safety benchmarks mostly rely on explicitly harmful content and rarely test complex multimodal jailbreak scenarios. (4) No tools exist for automated, large-scale generation of diverse jailbreak samples.

Key Challenge: Effective jailbreaking requires image–text combinations that are both contextually rich and semantically covert, yet automatically generating such multimodal attacks is extremely difficult — they must be simultaneously adversarial and stealthy.

Goal: To build a fully automated black-box VLM jailbreak framework that requires neither white-box access, nor manual templates, nor additional training, while generating semantically rich multimodal jailbreak samples and enabling large-scale VLM safety evaluation.

Key Insight: VLMs themselves possess powerful content understanding and generation capabilities. When safety constraints are relaxed, a VLM can serve as the most capable red-teaming tool. MiniGPT-4 (a weakly aligned open-source VLM) is used as the attacker, iteratively analyzing victim VLM responses and refining attack strategies.

Core Idea: Use a VLM to attack a VLM — a weakly safety-aligned VLM autonomously generates image–text jailbreak pairs as a red-team model, and explores diverse attack strategies through breadth-depth iterative search.

Method¶

Overall Architecture¶

The attacker VLM \(\mathcal{M}_\mathcal{A}\) (MiniGPT-4) receives a jailbreak goal \(\mathcal{G}\) → generates a JSON output {analysis, text_prompt, image_prompt} → Stable Diffusion 3 synthesizes an image from image_prompt → the image–text pair is sent to the victim VLM \(\mathcal{M}_\mathcal{V}\) → the victim response \(\mathcal{R}\) is fed back to the attacker → the attacker analyzes the response and refines its strategy via chain-of-thought (CoT) reasoning → the process iterates until success or the maximum number of rounds is reached. Breadth exploration maintains multiple independent attack paths; depth optimization iteratively refines each path.

Key Designs¶

VLM as Red-Team Model:
- Function: MiniGPT-4 (Vicuna-13B backbone) acts as the attacker VLM, simulating adversarial behavior via a carefully designed system prompt.
- Mechanism: The system prompt assigns three roles: (1) a red-team assistant that generates jailbreak prompts; (2) a JSON format constraint requiring three output fields — analysis, image_prompt, and text_prompt; (3) in-context learning via attack exemplars to guide strategy.
- Design Motivation: Vicuna exhibits fewer safety refusals than LLaMA, and the open-source nature of MiniGPT-4 allows full control over the system prompt. The VLM's pre-trained knowledge enables it to generate semantically rich and contextually plausible attacks — far more covert than template-based methods.
Breadth-Depth Exploration Strategy:
- Function: Breadth = \(N_b\) independent attack paths (with different initial strategies); Depth = \(N_d\) iterative refinement rounds per path. Default: \(N_b = 7\), \(N_d = 3\), totaling 21 queries.
- Mechanism: Breadth ensures strategy diversity (role-playing, emotional manipulation, academic framing, etc.); depth ensures sufficient optimization per strategy based on victim feedback.
- Effect: \(N_b = 1, N_d = 1\) → 45% ASR; \(N_b = 7, N_d = 3\) → 94% ASR. Increasing breadth or depth alone yields limited gains; joint improvement is most significant.
- Design Motivation: A single attack strategy can be blocked by specific defense mechanisms. Multi-strategy parallel search combined with iterative refinement more thoroughly explores the VLM's vulnerability space.
Chain-of-Thought Attack Analysis:
- Function: In the analysis field of the JSON output, the attacker VLM diagnoses why the victim refused in the previous round and proposes an improved strategy.
- Mechanism: CoT enables the attacker to learn the victim's "refusal patterns" — e.g., "direct requests are refused; switch to role-playing" or "textual attacks are detected; shift harmful content to the image modality."
- Design Motivation: This mimics the reasoning process of a human red-team tester — analyzing failures, adjusting strategies, and probing new angles. It is the core mechanism enabling IDEATOR's iterative attack refinement.
VLJailbreakBench Construction:
- 3,654 multimodal jailbreak samples covering 12 safety topics and 46 subcategories.
- Base set (916 samples): MiniGPT-4 attacking LLaVA-1.5.
- Challenge set (2,738 samples): Gemini-1.5-Pro attacking GPT-4o-mini (stronger attacker + stronger defender = higher-quality samples).
- Evaluation results on 11 VLMs: Claude-3.5-Sonnet is the most robust (19.65% ASR); GPT-4o Mini is the most vulnerable (72.21%).

Key Experimental Results¶

Attack Performance (Victim: MiniGPT-4, AdvBench 100 samples)¶

Method	Black-Box	Training-Free	ASR%
No Attack	-	-	35.0
GCG (white-box text)	✗	✗	50.0
GCG-V (white-box visual)	✗	✗	85.0
UMK (white-box multimodal)	✗	✗	94.0
MM-SafetyBench (black-box)	✓	✓	66.0
IDEATOR (black-box)	✓	✓	94.0

Cross-Model Transferability¶

ASR%	LLaVA	InstructBLIP	Chameleon
No Attack	7.0	12.0	16.0
MM-SafetyBench	46.0	29.0	22.0
IDEATOR	82.0	88.0	75.0

VLJailbreakBench Challenge Set (11 Models)¶

Model	Avg. ASR%	Type
GPT-4o Mini	72.21	Commercial
Gemini-2.0-Flash-Think	71.44	Commercial
Qwen2-VL	71.40	Open-source
GPT-4o	46.31	Commercial
Claude-3.5-Sonnet	19.65	Commercial

Ablation Study¶

\(N_b\)	\(N_d = 1\)	\(N_d = 3\)
1	45%	68%
7	85%	94%

Attack Modality	ASR%	Avg. Queries
Image only	85%	5.84
Text only	86%	7.46
Image + Text (joint)	94%	5.34

Key Findings¶

Black-box IDEATOR matches white-box SOTA (94% vs. UMK 94%), far outperforming other black-box methods (+28% vs. MM-SafetyBench).
Strong transferability: Jailbreak samples generated against MiniGPT-4 transfer directly to LLaVA at 82% ASR — 36 percentage points above MM-SafetyBench.
Joint image–text attack is most effective and efficient: 94% ASR with only 5.34 queries (< 1 minute), compared to 7.46 queries for text-only. Images provide two unique benefits: concealing harmful content and reinforcing role-playing contexts.
Existing defenses are grossly insufficient: AdaShield-S has limited effectiveness against IDEATOR (ASR: 94% → 84%), while substantially reducing ASR against FigStep/MM-SafetyBench (−32%/−29%). The diversity of IDEATOR's attack strategies makes it inherently defense-resistant.
Claude-3.5-Sonnet is the most secure model yet still exhibits 19.65% ASR — approximately one successful attack every 5–6 attempts. This is substantially lower than the safety rates reported in prior benchmarks, underscoring the necessity of adversarial benchmarking.
IDEATOR autonomously produces a broader spectrum of attack strategies than MM-SafetyBench (typographic, roleplay, emotional manipulation, etc.), essentially satisfying \(\mathcal{A}_{IDEATOR} \supseteq \bigcup_i \mathcal{A}_i\).

Highlights & Insights¶

"Using a VLM to attack a VLM" is a profound security insight: The powerful capabilities of VLMs are a double-edged sword — the same multimodal understanding and generation abilities can be weaponized to construct jailbreak attacks. Weak safety alignment becomes the greatest risk, as any open-source, poorly aligned VLM can serve as an attack tool.
A game-theoretic perspective on breadth-depth exploration: Jailbreaking is modeled as a multi-round game between an attacker and a defender, where the attacker adapts its strategy by analyzing the defender's refusal patterns. This more faithfully reflects real-world adversarial security threats than one-shot attacks.
Distinctive value of VLJailbreakBench: Prior benchmarks predominantly combine explicit harmful text with templated images. IDEATOR-generated samples are semantically more covert, revealing the fragility of existing safety alignment under complex, realistic scenarios. The 72% ASR on GPT-4o Mini dispels the illusion that commercial models are sufficiently secure.

Limitations & Future Work¶

The choice of attacker VLM is constrained — models with weak safety alignment are required, and as alignment techniques improve, suitable attacker models may become scarce.
VLJailbreakBench is relatively small in scale (3,654 samples); expansion would require substantially more computational resources.
Attacks on video VLMs and multi-turn dialogue VLMs remain unexplored.
The framework could be misused to generate harmful content — the paper includes a disclaimer and ethical considerations.

vs. MM-SafetyBench: MM-SafetyBench relies on handcrafted templates (query-relevant images + typographic overlays), whereas IDEATOR is fully automated and achieves 28% higher ASR. The transferability gap is even larger — 82% vs. 46% on LLaVA.
vs. UMK (white-box): UMK requires gradient access and produces semantically void adversarial perturbations, while IDEATOR is black-box and generates semantically rich images. Both achieve comparable ASR, but their applicable scopes are fundamentally different.
vs. Arondight: Arondight requires training a red-team LLM, whereas IDEATOR is training-free — existing VLMs are directly prompted to act as attackers via system prompts and in-context learning.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — First VLM-as-red-team framework; novel breadth-depth exploration strategy; VLJailbreakBench fills a critical gap in multimodal safety benchmarking.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Five baseline comparisons, four victim models, eleven benchmark models, ablations over breadth/depth/modality, defense evaluation (AdaShield), and extensive visualizations.
Writing Quality: ⭐⭐⭐⭐ — Clear presentation with an explicit threat model; Figure 1 provides an intuitive comparison.
Value: ⭐⭐⭐⭐⭐ — Significant contribution to VLM security, demonstrating that VLMs themselves can serve as the most capable red-teaming tools; VLJailbreakBench reveals the true safety levels of commercial models.