Defending LVLMs Against Vision Attacks through Partial-Perception Supervision¶
Conference: ICML2025
arXiv: 2412.12722
Code: GitHub
Area: LLM/NLP
Keywords: LVLM Security, Visual Adversarial Attack, Partial-Perception Supervision, Weak-to-Strong Learning, Black-box Defense, Jailbreak Defense
TL;DR¶
Proposes DPS (Defense through Partial-Perception Supervision), which utilizes responses from cropped images as "weak supervision" to guide the full-image model for self-correction during inference. This achieves training-free, black-box visual attack defense for LVLMs, reducing the average attack success rate by 76.3%.
Background & Motivation¶
- Core Problem: LVLMs (e.g., GPT-4o, Gemini) are vulnerable to visual adversarial attacks (adversarial noise, typographic attacks), which mislead them into outputting incorrect or harmful content.
- Limitations of Prior Work: Methods like SmoothVLM defend against attacks using image cropping plus majority voting, but cropping disrupts semantic information, leading to a significant degradation in response quality on benign images.
- Key Challenge: Cropping disrupts attack signals but also damages the semantics of benign images, so the challenge is to keep the first effect without paying for the second.
- Key Insight:
  - Sensitivity: Visual attacks are highly sensitive to image modifications such as cropping, which disrupts the attack semantics.
  - Confidence Discrepancy: Models exhibit high confidence when processing clean images and are unaffected by distractor information; conversely, they display low confidence when handling attacked images and are easily influenced by distractors in the prompt.
- Inspiration: Weak-to-strong learning—a weak model (the model observing partial images) can effectively supervise and guide a strong model (the model observing the full image).
Method¶
DPS Framework (Two-Step Inference)¶
Step 1: Partial-Perception Initial Response (Part-Perc Model)
Generate an objective description of the cropped image:
"Please provide an objective, detailed description of the image, avoiding subjective conjecture and associations. Then answer the question: (Original Question)."
Three cropping strategies are employed to generate three partial images:
- Center Crop (CC): Extracts a region of 1/2 size from the center of the image.
- Random Crop (RC): Extracts a region of 1/4 to 1/2 size from a random position.
- Adaptive Crop (AC): Leverages the LVLM to extract the primary object region within the image.
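The two geometric strategies above can be sketched as crop-box computations. This is only an illustration under stated assumptions: "1/2 size" is read here as half the width and half the height, the random scale is drawn uniformly, and the function names and the PIL-style `(left, upper, right, lower)` box are hypothetical rather than taken from the paper's code. Adaptive Crop is omitted because it depends on an LVLM call.

```python
import random
from typing import Optional, Tuple

# (left, upper, right, lower), the box convention used by PIL's Image.crop
Box = Tuple[int, int, int, int]


def center_crop_box(width: int, height: int, scale: float = 0.5) -> Box:
    """Box covering `scale` of the width and height, centred in the image."""
    crop_w, crop_h = int(width * scale), int(height * scale)
    left, upper = (width - crop_w) // 2, (height - crop_h) // 2
    return left, upper, left + crop_w, upper + crop_h


def random_crop_box(width: int, height: int,
                    min_scale: float = 0.25, max_scale: float = 0.5,
                    rng: Optional[random.Random] = None) -> Box:
    """Box with a per-side scale in [min_scale, max_scale] at a random position."""
    rng = rng or random.Random()
    scale = rng.uniform(min_scale, max_scale)  # assumption: uniform sampling
    crop_w = max(1, int(width * scale))
    crop_h = max(1, int(height * scale))
    left = rng.randint(0, width - crop_w)      # randint is inclusive on both ends
    upper = rng.randint(0, height - crop_h)
    return left, upper, left + crop_w, upper + crop_h
```

For a 640x480 image, `center_crop_box(640, 480)` returns `(160, 120, 480, 360)`. Passing a seeded `random.Random` to `random_crop_box` makes the random crop reproducible.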
Step 2: Partial-Perception Supervision (Full-Perc Model)
Utilize the responses from the Part-Perc model as supervisory information to guide the Full-Perc model in re-analyzing the full image:
"Here is the information provided by the local observation agents: (Supervisory message). Re-analyze the given image, and provide your final answer to the question: (Original Question)."
Mechanism:
- Clean images \(\rightarrow\) High model confidence \(\rightarrow\) The (potentially inaccurate) descriptions from Part-Perc do not affect the final output.
- Attacked images \(\rightarrow\) Low model confidence \(\rightarrow\) The descriptions from Part-Perc (with attack signals removed) prompt the model to perform self-correction.
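The two-step inference described above can be sketched as follows. The prompt templates are the ones quoted in this section; everything else is an assumption made for illustration: the `lvlm` callable (image and prompt in, text out), the `dps_answer` name, and the `[Agent i]` labels used to join the partial responses into one supervisory message.

```python
from typing import Callable, Sequence

# Hypothetical black-box interface: (image, prompt) -> response text.
LVLM = Callable[[object, str], str]

PART_PROMPT = (
    "Please provide an objective, detailed description of the image, "
    "avoiding subjective conjecture and associations. "
    "Then answer the question: {question}"
)
FULL_PROMPT = (
    "Here is the information provided by the local observation agents: "
    "{supervision} Re-analyze the given image, and provide your final "
    "answer to the question: {question}"
)


def dps_answer(lvlm: LVLM, full_image: object,
               crops: Sequence[object], question: str) -> str:
    """Step 1 queries each crop; Step 2 re-queries the full image with their answers."""
    part_prompt = PART_PROMPT.format(question=question)
    part_responses = [lvlm(crop, part_prompt) for crop in crops]

    # Assumption: partial responses are simply concatenated with agent labels.
    supervision = " ".join(
        f"[Agent {i}] {resp}" for i, resp in enumerate(part_responses, start=1)
    )
    full_prompt = FULL_PROMPT.format(supervision=supervision, question=question)
    return lvlm(full_image, full_prompt)
```

With three crops this makes four calls to `lvlm`: one per crop plus the final full-image call. The model is only ever reached through its text interface, which is what keeps the defense black-box and training-free.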
Safety Enhancement: LS-DPS¶
For jailbreak attacks, a safety reminder is incorporated into the prompt of Step 2:
"Consider whether you might be led into discussing harmful, malicious, or unethical topics."
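An external LLM safety checker is then attached to filter the final output; the resulting variant is LS-DPS (LLM-Secured DPS).
Defense quality is reported as the attack success rate (ASR), the fraction of test samples on which the attack succeeds. In its definition, \(\mathcal{F}\) represents the LVLM, \(\mathbb{I}\) is the indicator function of whether the attack succeeded, and \(t_i\) denotes the attack goal or safety standard for the \(i\)-th sample.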
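The safety reminder and the external output filter from this section might be wired together roughly as below. Only the reminder sentence comes from the text; the helper names, the `is_unsafe` callable standing in for the LLM safety checker, and the refusal string are hypothetical placeholders.

```python
from typing import Callable

SAFETY_REMINDER = (
    "Consider whether you might be led into discussing harmful, "
    "malicious, or unethical topics."
)
# Placeholder refusal text; the actual wording is not given in the source.
REFUSAL = "Sorry, I cannot help with that request."


def add_safety_reminder(step2_prompt: str) -> str:
    """Append the safety reminder to the Step 2 (Full-Perc) prompt."""
    return f"{step2_prompt} {SAFETY_REMINDER}"


def ls_dps_filter(response: str, is_unsafe: Callable[[str], bool]) -> str:
    """Pass the final response through an external checker before returning it.

    `is_unsafe` stands in for the LLM safety checker (assumption): it returns
    True when the response should be blocked.
    """
    return REFUSAL if is_unsafe(response) else response
```

In practice `is_unsafe` would wrap a call to a separate LLM; keeping it as a plain callable means the checker can be swapped without touching the rest of the pipeline.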
Key Experimental Results¶
Misleading Attack Defense (ASR ↓, Lower is Better)¶
| Model | Method | RTA-100 | Self-Gen | MultiTrust | Avg |
|---|---|---|---|---|---|
| Qwen-VL-Plus | SmoothVLM | 0.92 | 0.83 | 1.00 | 0.91 |
| Qwen-VL-Plus | DPS | 0.24 | 0.30 | 0.40 | 0.31 |
| GPT-4o-Mini | SmoothVLM | 0.68 | 0.85 | - | 0.76 |
| GPT-4o-Mini | DPS | 0.35 | 0.43 | - | 0.39 |
| Gemini-1.5-Flash | SmoothVLM | 0.85 | 1.00 | 0.80 | 0.88 |
| Gemini-1.5-Flash | DPS | 0.58 | 0.49 | 0.11 | 0.39 |
DPS reduces the average ASR under misleading attacks to 0.31~0.39 (versus 0.76~0.91 for SmoothVLM in the rows above), a defense reported as 2x to 2.5x more effective than the best baseline.
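As a quick arithmetic check, the average ASR values in the table can be turned into improvement factors. The sketch below uses only the SmoothVLM and DPS rows shown here, so it is not expected to reproduce the "best baseline" comparison or the 76.3% figure from the TL;DR, which presumably rest on other reference points; the `improvement` helper is a hypothetical name.

```python
from typing import Dict, Tuple

# Average ASR per model, copied from the "Avg" column of the table above.
avg_asr: Dict[str, Dict[str, float]] = {
    "Qwen-VL-Plus": {"SmoothVLM": 0.91, "DPS": 0.31},
    "GPT-4o-Mini": {"SmoothVLM": 0.76, "DPS": 0.39},
    "Gemini-1.5-Flash": {"SmoothVLM": 0.88, "DPS": 0.39},
}


def improvement(row: Dict[str, float]) -> Tuple[float, float]:
    """Return (ratio of baseline ASR to DPS ASR, relative ASR reduction)."""
    ratio = row["SmoothVLM"] / row["DPS"]
    reduction = 1 - row["DPS"] / row["SmoothVLM"]
    return ratio, reduction


for model, row in avg_asr.items():
    ratio, reduction = improvement(row)
    print(f"{model}: {ratio:.2f}x lower ASR, {reduction:.1%} relative reduction")
```

Against SmoothVLM alone, the factors come out at roughly 1.9x to 2.9x across the three models.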
Jailbreak Attack Defense (ASR ↓)¶
| Model | Method | MM-Safety | HADES | VisualAtt | Avg |
|---|---|---|---|---|---|
| Qwen-VL-Plus | Protector | 0.07 | 0.22 | 0.18 | 0.16 |
| Qwen-VL-Plus | LS-DPS | 0.02 | 0.10 | 0.02 | 0.05 |
| GPT-4o-Mini | ECSO | 0.24 | 0.05 | 0.15 | 0.15 |
| GPT-4o-Mini | LS-DPS | 0.03 | 0.04 | 0.04 | 0.04 |
| Gemini-1.5-Flash | ECSO | 0.14 | 0.11 | 0.13 | 0.13 |
| Gemini-1.5-Flash | LS-DPS | 0.06 | 0.03 | 0.06 | 0.05 |
LS-DPS reduces the ASR under jailbreak attacks to 0.04~0.05, demonstrating superior performance over all baseline methods.
Standard Performance (MM-Vet)¶
DPS has a minimal impact on standard performance, performing close to the vanilla model, whereas SmoothVLM suffers from significant performance degradation.
Ablation Study (Comparison of Cropping Strategies, Qwen-VL-Plus)¶
| Strategy | Misleading Tasks (Avg) | Safety Tasks (Avg) |
|---|---|---|
| CC (Center Crop) | ~0.40 | ~0.08 |
| RC (Random Crop) | ~0.48 | ~0.09 |
| AC (Adaptive Crop) | ~0.37 | ~0.07 |
| DPS (Fusion of Three) | ~0.31 | ~0.05 |
Fusing all three cropping strategies gives the lowest ASR on both task groups (~0.31 and ~0.05), ahead of the best single strategy, AC (~0.37 and ~0.07).
Highlights & Insights¶
- Novel Weak-to-Strong Defense Paradigm: Analogizes "observing partial images vs. full images" to "weak model vs. strong model," utilizing weak supervision to guide the strong model toward self-correction rather than using simple voting.
- Fully Black-Box + Training-Free: Requires no access to internal model weights or extra training, enabling defense purely through adjustments at the prompt level.
- Leveraging Confidence Discrepancies: The observation that models exhibit high confidence on clean images but low confidence on attacked images is the empirical basis for maintaining both performance and defense efficacy.
- Addressing Both Misleading and Jailbreak Attacks: Covers both mainstream attack scenarios under a single framework through prompt-level adjustments and the plug-and-play extension of an LLM safety checker.
- No Loss in Standard Performance: In contrast to the significant performance drop observed with SmoothVLM, DPS barely impacts standard task performance.
Limitations & Future Work¶
- Computational Overhead: Requires multiple cropping operations and model inferences (at least 4 LVLM calls), resulting in a significant overhead increase compared to single inference.
- Failure Cases with Cropping: If the attack information is globally distributed rather than localized, cropping might fail to eliminate the attack signal, rendering partial-perception supervision ineffective.
- Dependency on Confidence Discrepancy Assumption: The method relies on the observation that "attacks cause model confidence to drop." If an attack technique keeps model confidence high despite being attacked, the defense may fail.
- Evaluation Limited to API Models: Main experiments are conducted on proprietary API models (Qwen-VL-Plus, GPT-4o-Mini, Gemini-1.5-Flash), with open-source models verified only through additional experiments on Qwen2.5-VL-32B.
- Scope for Enhancing Interaction Strategies: A simple two-step prompting approach is currently used; more sophisticated interaction mechanisms (e.g., multi-turn debates) could potentially yield further improvements.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Applying the weak-to-strong supervision paradigm to visual attack defense offers a brand-new perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated across three models, six datasets, and seven baselines, covering both misleading and jailbreak attacks.
- Writing Quality: ⭐⭐⭐⭐ — The motivation analysis in Section 3 is progressive and clearly structured.
- Value: ⭐⭐⭐⭐ — High practicality (black-box and training-free), though the computational overhead may limit its deployment scenarios.