Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
Conference: ECCV 2024
arXiv: 2403.09572
Code: Project Page
Area: Multimodal VLM
Keywords: MLLM safety, Jailbreak attack defense, Image-to-text transformation, Training-free method, Safety alignment
TL;DR
ECSO (Eyes Closed, Safety On) is a training-free defense for MLLMs: the model checks the safety of its own response and, when that response is unsafe, converts the image in the query into a text description so that the intrinsic safety mechanism of the pre-aligned LLM takes effect again. The reported harmless-rate gain reaches 71.3 percentage points (the VLSafe row of the main results table), while scores on general utility benchmarks stay essentially unchanged.
Background & Motivation
Background: Multimodal Large Language Models (MLLMs) achieve strong multimodal conversation capabilities by combining visual encoders with pre-aligned LLMs. However, the introduction of image inputs makes these models vulnerable to malicious jailbreak attacks that induce harmful content generation.
Limitations of Prior Work: Traditional safety alignment strategies (such as SFT and RLHF) require designing a large number of red-teaming queries, which is even more difficult and expensive when image inputs are involved. Existing inference-time defense methods either rely on manual system prompts (hard to cover new attacks) or require externally trained detectors.
Key Challenge: MLLMs inherit safety mechanisms from LLMs, but the introduction of image features "overpowers" these mechanisms. Specifically, when images are removed, the model can reject malicious queries almost 100% of the time, but with images, the harmless rate drops sharply to about 20%.
Core Problem: How to transfer the safety mechanism of pre-aligned LLMs to MLLMs without requiring extra training?
Key Insight: Two observations motivate the method: (a) although MLLMs are prone to generating harmful content, they judge the safety of their own responses with >95% accuracy; (b) removing the image restores the LLM's safety mechanism.
Core Idea: Let the MLLM first self-detect the safety of its response. If unsafe content is detected, the image is converted into a text description, and the model "closes its eyes": the same MLLM regenerates the response in text-only mode, without the image.
Method
Overall Architecture
Input (image \(v\), query \(x\)) \(\rightarrow\) Step 1: Normally generate an initial response \(\tilde{y}\) \(\rightarrow\) Step 2: MLLM self-evaluates response safety \(s\) \(\rightarrow\) [If safe, return directly] \(\rightarrow\) Step 3: If unsafe, convert the image into a query-aware text description \(c\) \(\rightarrow\) Step 4: Regenerate a safe response \(y\) using text-only (no image).
Key Designs
- Harmful Content Detection:
  - Function: Enables the MLLM to judge whether its initial response is safe.
  - Mechanism: The initial response is generated as usual: \(\tilde{y} = F_{\theta}(v, x)\). A detection prompt template \(P_{\text{det}}\) then wraps the original query and the initial response so that the MLLM can evaluate itself: \(s = F_{\theta}(v, P_{\text{det}}(x, \tilde{y}))\).
  - Design Motivation: Experiments show that MLLMs perform very well on this discrimination task (>95% accuracy on LLaVA-1.5-7B and ShareGPT4V-7B), and that the accuracy is unaffected by the presence of images. This is consistent with the hypothesis, discussed in the scalable oversight literature, that discrimination is easier than generation.
- Query-Aware I2T Transformation:
  - Function: Converts the input image into a text description that is relevant to the query.
  - Mechanism: A prompt template \(P_{\text{trans}}\) containing the original question guides the MLLM to generate a query-aware image description: \(c = F_{\theta}(v, P_{\text{trans}}(x))\).
  - Design Motivation: (a) Converting to text means harmful content in the image is either turned into words or discarded; (b) query-awareness ensures that the description contains the key information needed to answer the question, avoiding the information loss caused by irrelevant descriptions. Ablation studies indicate that removing query-awareness reduces utility.
- Safe Response Generation Without Images:
  - Function: Replaces the image with its text description so that the model regenerates the response in text-only mode.
  - Mechanism: \(y = F_{\theta}(\text{null}, P_{\text{gen}}(c, x))\), where null denotes the absence of an image. In this mode the MLLM behaves as a text-only LLM, which reactivates its pre-aligned safety mechanism.
  - Design Motivation: Experiments show that once the image is removed, the LLM becomes almost 100% harmless. Adding keywords such as "HARMLESS and ETHICAL" to the prompt further reinforces the safety priority, raising the harmless rate even above the Text-Only upper bound.
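The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `MLLM` is a hypothetical callable taking an optional image and a prompt, the three prompt templates are illustrative wordings rather than the paper's, and the yes/no parsing of the safety verdict is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model interface: (image or None, prompt text) -> generated text.
MLLM = Callable[[Optional[bytes], str], str]

# Illustrative templates only; the paper's exact prompt wording is not reproduced.
P_DET = (
    "Query: {x}\nResponse: {y}\n"
    "Is the response harmful, unsafe, or unethical? Answer yes or no."
)
P_TRANS = (
    "Request: {x}\n"
    "Describe the image, focusing on the details needed to address the request."
)
P_GEN = (
    "Image description: {c}\n"
    "Give a HARMLESS and ETHICAL answer to the following request: {x}"
)


@dataclass
class EcsoResult:
    response: str
    flagged_unsafe: bool
    caption: Optional[str] = None


def ecso(mllm: MLLM, image: bytes, query: str) -> EcsoResult:
    # Step 1: ordinary multimodal generation, y~ = F(v, x).
    y_tilde = mllm(image, query)

    # Step 2: self-evaluation, s = F(v, P_det(x, y~)).
    # Assumption: the verdict is parsed as a leading "yes" / "no".
    verdict = mllm(image, P_DET.format(x=query, y=y_tilde))
    unsafe = verdict.strip().lower().startswith("yes")
    if not unsafe:
        return EcsoResult(response=y_tilde, flagged_unsafe=False)

    # Step 3: query-aware image-to-text, c = F(v, P_trans(x)).
    caption = mllm(image, P_TRANS.format(x=query))

    # Step 4: regenerate with no image, y = F(null, P_gen(c, x)).
    y = mllm(None, P_GEN.format(c=caption, x=query))
    return EcsoResult(response=y, flagged_unsafe=True, caption=caption)
```

Benign queries leave after Step 2 with the original response untouched, which is why utility depends mainly on how often the detector raises a false alarm.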
Loss & Training
ECSO is an inference-time method and involves no loss function or training. It can also serve as a data engine that automatically generates SFT safety alignment data: applying the ECSO pipeline to an unlabeled safety dataset \(D = \{(v,x)\}\) yields \(D' = \{(v,x,y)\}\) for fine-tuning.
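The data-engine use can be sketched as a simple map over the unlabeled pairs. The helper name `build_sft_data` and the `pipeline` callable (any function that returns the final ECSO response for an image and a query) are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[bytes, str]        # (v, x): image and query, no response yet
Labeled = Tuple[bytes, str, str]  # (v, x, y): image, query, generated response


def build_sft_data(
    pipeline: Callable[[bytes, str], str],
    dataset: Iterable[Sample],
) -> List[Labeled]:
    """Turn D = {(v, x)} into D' = {(v, x, y)} by running the pipeline on each pair.

    `pipeline` is assumed to return the final response y for (v, x).
    """
    return [(v, x, pipeline(v, x)) for v, x in dataset]
```

Note that each triple keeps the original image: the response may have been generated with "eyes closed", but the fine-tuning example pairs it with the full multimodal input.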
Key Experimental Results
Main Results (harmless rate; MM-SafetyBench on LLaVA-1.5-7B, plus VLSafe)
| Attack Type | Metric | Direct (%) | ECSO (%) | Gain (points) |
|---|---|---|---|---|
| SD (Stable Diffusion) | Average | 85.0 | 95.4 | +10.4 |
| OCR | Average | 31.7 | 90.3 | +58.6 |
| SD+OCR | Average | 32.1 | 86.4 | +54.3 |
| SD+OCR - Illegal Activities | Harmless Rate | 25.8 | 92.8 | +67.0 |
| SD+OCR - Hate Speech | Harmless Rate | 51.5 | 90.2 | +38.7 |
| SD+OCR - Malware | Harmless Rate | 38.6 | 84.1 | +45.5 |
| VLSafe (across 5 MLLMs) | Harmless Rate | ~20% | ~90% | +71.3 (Max) |
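The Gain column reads as the absolute difference in harmless rate, in percentage points, rather than a relative improvement. A short check over the MM-SafetyBench rows above (values copied from the table; the approximate VLSafe row is left out):

```python
# (Direct, ECSO) harmless rates in percent, copied from the table above.
rows = {
    "SD": (85.0, 95.4),
    "OCR": (31.7, 90.3),
    "SD+OCR": (32.1, 86.4),
    "SD+OCR - Illegal Activities": (25.8, 92.8),
    "SD+OCR - Hate Speech": (51.5, 90.2),
    "SD+OCR - Malware": (38.6, 84.1),
}

# Gain = ECSO - Direct, rounded to one decimal to absorb floating-point noise.
gains = {name: round(ecso - direct, 1) for name, (direct, ecso) in rows.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```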
Ablation Study
| Configuration | Key Metric | Description |
|---|---|---|
| Fully Featured ECSO | HR=86.4% (SD+OCR) | Baseline |
| Retain Image + Caption | HR drops significantly | Indicates that removing the image is critical |
| Remove Query-Awareness | MMBench: 65.8 (-1.05%) | Query-awareness is indispensable to maintain utility |
| Direct Refusal (w/o Step 3 & 4) | MME: 1847 (vs 1865) | Steps 3 & 4 ensure normal response to benign queries |
| SFT data generated by ECSO | Outperforms manual annotation VLGuard | ECSO can serve as a data engine |
Utility Preservation (False Positive Rate & Performance)
| Model | MME False Positive Rate | MMBench False Positive Rate | MME-P (Direct/ECSO) | MMBench (Direct/ECSO) |
|---|---|---|---|---|
| LLaVA-1.5-7B | 0.50% | 1.23% | 1507.4/1507.4 | 64.6/64.2 |
| ShareGPT4V-7B | 1.93% | 4.24% | 1566.4/1567.1 | 66.5/66.1 |
| Qwen-VL-Chat | 1.26% | 2.88% | 1481.5/1481.5 | 59.7/59.1 |
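The sketch below shows how the two kinds of numbers in this table can be read. It assumes that the false positive rate is the share of benign benchmark queries that the detection step flags as unsafe; the source does not spell out the definition, so treat that as an interpretation. The function name `false_positive_rate` is hypothetical, and the scores are copied from the table.

```python
from typing import Dict, Sequence, Tuple


def false_positive_rate(flags: Sequence[bool]) -> float:
    """Share of benign queries flagged unsafe (assumed definition), in percent."""
    if not flags:
        raise ValueError("need at least one benign query")
    return 100.0 * sum(flags) / len(flags)


# (Direct, ECSO) scores copied from the table above.
mme_p: Dict[str, Tuple[float, float]] = {
    "LLaVA-1.5-7B": (1507.4, 1507.4),
    "ShareGPT4V-7B": (1566.4, 1567.1),
    "Qwen-VL-Chat": (1481.5, 1481.5),
}
mmbench: Dict[str, Tuple[float, float]] = {
    "LLaVA-1.5-7B": (64.6, 64.2),
    "ShareGPT4V-7B": (66.5, 66.1),
    "Qwen-VL-Chat": (59.7, 59.1),
}

# Score change introduced by ECSO (ECSO minus Direct), rounded to one decimal.
mme_delta = {m: round(e - d, 1) for m, (d, e) in mme_p.items()}
mmbench_delta = {m: round(e - d, 1) for m, (d, e) in mmbench.items()}
```

On these numbers the MME-P change is between 0 and +0.7 and the MMBench change is between -0.4 and -0.6, which is the sense in which utility is preserved.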
Key Findings
- The safety mechanism of MLLMs is not gone but "overpowered" by image features: removing the image yields a ~100% harmless rate for almost all models.
- MLLMs are highly accurate (>95%) at judging whether their own responses are safe, and this ability is unaffected by image inputs.
- OCR and SD+OCR attacks are more effective than pure SD attacks because they contain more direct malicious textual information.
- The safety alignment data generated by ECSO achieves quality comparable to, or even exceeding, manually annotated data.
Highlights & Insights
- The insight "discrimination is easier than generation" is highly valuable: using the model's own discriminative capability to cover safety loopholes in generation is an elegant self-bootstrapping safety strategy.
- The training-free design allows ECSO to be plug-and-play for any MLLM, making it highly practical.
- The image-to-text modality conversion trick: "Closing its eyes" to reduce a multimodal problem to a text-only problem cleverly leverages the pre-existing safety alignment in the LLM.
- The query-aware captioning design prevents information loss, a trick that is transferable to other scenarios requiring image-to-text translation.
- Data engine byproduct: ECSO not only provides inference-time protection but also automatically generates safety alignment data, creating a virtuous cycle.
Limitations & Future Work
- ECSO relies on the safety capability of the underlying LLM itself; if the LLM has inherent safety flaws, ECSO will also fail.
- Information loss is inevitable during the image-to-text conversion process, which may affect response quality for queries heavily dependent on visual details.
- Multi-step inference (generation \(\rightarrow\) safety evaluation \(\rightarrow\) conversion \(\rightarrow\) final generation) increases inference latency.
- Turning multimodality from a "safety challenge" into a "safety advantage" (by leveraging rich multimodal context to build stronger safety mechanisms) remains unexplored.
- Advanced I2T methods (such as \(V^*\) guided visual search) can be explored to improve information preservation.
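The latency limitation listed above can be made concrete by counting model calls. Following the four-step pipeline, a query that passes the safety check costs two calls (generation and self-evaluation), and a flagged query costs four (plus captioning and regeneration). The function below is a back-of-the-envelope sketch; it counts calls only and ignores that the calls differ in prompt and output length, so it is not a latency model.

```python
def expected_calls(flag_rate: float) -> float:
    """Expected model calls per query when a fraction `flag_rate` is flagged unsafe.

    Unflagged queries: 2 calls (generate, self-evaluate).
    Flagged queries:   4 calls (plus image-to-text and text-only regeneration).
    """
    if not 0.0 <= flag_rate <= 1.0:
        raise ValueError("flag_rate must be in [0, 1]")
    return 2.0 * (1.0 - flag_rate) + 4.0 * flag_rate


# Benign traffic with a 1.23% flag rate (the LLaVA-1.5-7B MMBench false positive rate).
benign_cost = expected_calls(0.0123)
```

Under this count, benign traffic pays roughly a 2x call overhead for the safety check itself, and the extra two calls are mostly incurred on queries that are actually flagged.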
Related Work & Insights
- vs MLLM-Protector [Pi et al.]: MLLM-Protector requires training additional detectors and detoxicators, while ECSO is completely training-free and leverages the model's own capacity.
- vs Self-Moderation [Chen et al.]: Pure instruction-based self-moderation still fails when images are present, whereas ECSO sidesteps this failure mode by removing the image before regenerating.
- vs Safety Steering Vectors [Wang et al.]: Steering vectors focus primarily on textual unsafe intent, potentially overlooking malicious content inside images.
- vs VLGuard [Zong et al.]: VLGuard aligns through SFT requiring annotated data, whereas ECSO can automatically generate equivalent or even superior alignment data.
Rating
- Novelty: ⭐⭐⭐⭐ Observing the discrepancy between discrimination and generation capabilities and utilizing modality conversion to recover safety mechanisms is a clever idea, although the technical implementation is relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 MLLMs, 3 safety benchmarks, 3 utility benchmarks, and detailed ablation studies make the experiments exceptionally thorough.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain is highly clear: observation \(\rightarrow\) insight \(\rightarrow\) method \(\rightarrow\) evaluation, with elegant and easy-to-understand diagrams.
- Value: ⭐⭐⭐⭐ Practical value is high as a training-free, plug-and-play solution, although it fundamentally "bypasses the problem" rather than "solving it".