Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Conference: ECCV 2024
arXiv: 2403.09572
Code: Project Page
Area: Multimodal VLM
Keywords: MLLM safety, Jailbreak attack defense, Image-to-text transformation, Training-free method, Safety alignment

TL;DR

ECSO (Eyes Closed, Safety On) is a training-free defense for MLLMs: the model checks the safety of its own response and, when the response is unsafe, converts the image in the query into a text description, which restores the intrinsic safety mechanism of the pre-aligned LLM. It improves the harmless rate by up to 71.3% (reported for VLSafe in the results table) and by 54.3 points on average for MM-SafetyBench SD+OCR attacks, without compromising general performance.

Background & Motivation

Background: Multimodal Large Language Models (MLLMs) achieve strong multimodal conversation capabilities by combining visual encoders with pre-aligned LLMs. However, the introduction of image inputs makes these models vulnerable to malicious jailbreak attacks that induce harmful content generation.

Limitations of Prior Work: Traditional safety alignment strategies (such as SFT and RLHF) require designing a large number of red-teaming queries, which is even more difficult and expensive when image inputs are involved. Existing inference-time defense methods either rely on manual system prompts (hard to cover new attacks) or require externally trained detectors.

Key Challenge: MLLMs inherit safety mechanisms from LLMs, but the introduction of image features "overpowers" these mechanisms. Specifically, when images are removed, the model can reject malicious queries almost 100% of the time, but with images, the harmless rate drops sharply to about 20%.

Core Problem: How to transfer the safety mechanism of pre-aligned LLMs to MLLMs without requiring extra training?

Key Insight: Two observations drive the method: (a) although MLLMs are prone to generating harmful content, they can judge the safety of their own responses with high accuracy (>95%); (b) removing the image restores the LLM's safety mechanism.

Core Idea: Let the MLLM first self-detect the safety of its response. If unsafe content is detected, the image is converted into a text description, and the model "closes its eyes" to regenerate a safe response in text-only mode, where it behaves like the underlying pre-aligned LLM.

Method

Overall Architecture

Input (image \(v\), query \(x\)) \(\rightarrow\) Step 1: Generate an initial response \(\tilde{y}\) as usual \(\rightarrow\) Step 2: The MLLM self-evaluates the safety \(s\) of that response \(\rightarrow\) [If safe, return \(\tilde{y}\) directly] \(\rightarrow\) Step 3: If unsafe, convert the image into a query-aware text description \(c\) \(\rightarrow\) Step 4: Regenerate a safe response \(y\) from text-only input (no image).

Key Designs

Harmful Content Detection:
- Function: Enables the MLLM to judge whether its generated initial response is safe.
- Mechanism: First, the initial response is normally generated: \(\tilde{y} = F_{\theta}(v, x)\). Then, a detection prompt template \(P_{\text{det}}\) is used to wrap the original query and the initial response, allowing the MLLM to self-evaluate: \(s = F_{\theta}(v, P_{\text{det}}(x, \tilde{y}))\).
- Design Motivation: Experiments reveal that MLLMs perform exceptionally well in discrimination tasks (achieving >95% accuracy on LLaVA-1.5-7B and ShareGPT4V-7B), and this discrimination accuracy is unaffected by the presence of images. The hypothesis that discrimination is easier than generation is supported by scalable oversight theory.
Query-Aware I2T Transformation:
- Function: Converts the input image into a query-related text description.
- Mechanism: Uses a prompt template \(P_{\text{trans}}\) containing the original question to guide the MLLM in generating a query-aware image description: \(c = F_{\theta}(v, P_{\text{trans}}(x))\).
- Design Motivation: (a) By converting to text, harmful content in the image is either transformed into words or discarded; (b) query-awareness ensures that the description contains the key information required to answer the question, avoiding information loss from irrelevant descriptions. Ablation studies show that removing query-awareness reduces utility.
Safe Response Generation Without Images:
- Function: Replaces the image with a text description, allowing the model to regenerate a response in text-only mode.
- Mechanism: \(y = F_{\theta}(\text{null}, P_{\text{gen}}(c, x))\), where null represents the absence of an image. At this point, the MLLM degrades to a text-only LLM, reactivating its pre-aligned safety mechanism.
- Design Motivation: Experiments show that once the image is removed, the LLM becomes almost 100% harmless. Adding keywords like "HARMLESS and ETHICAL" in the prompt further reinforces safety priority, raising the harmless rate even above the Text-Only upper bound.

Loss & Training

ECSO is a completely training-free inference-time method without any loss functions or training processes. Additionally, ECSO can serve as a data engine to automatically generate SFT safety alignment data: applying the ECSO pipeline to an unsupervised safety dataset \(D = \{(v,x)\}\) yields \(D' = \{(v,x,y)\}\) for fine-tuning.
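
The data-engine use can be sketched as a single mapping over unlabeled pairs. In this sketch, `respond` is a hypothetical callable that runs an ECSO-style pipeline on one (image, query) pair; the name `build_sft_dataset` and the tuple layout are assumptions made for illustration, not the paper's code.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

V = TypeVar("V")  # image type (bytes, a file path, a tensor, ...)


def build_sft_dataset(
    pairs: Iterable[Tuple[V, str]],
    respond: Callable[[V, str], str],
) -> List[Tuple[V, str, str]]:
    """Turn D = {(v, x)} into D' = {(v, x, y)} by attaching a response y.

    `respond` is assumed to return the safe response produced by an
    ECSO-style pipeline for the given image v and query x.
    """
    return [(v, x, respond(v, x)) for v, x in pairs]
```

The resulting triples keep the original image, so a model fine-tuned on them sees the image alongside the safe response.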

Key Experimental Results

Main Results (MM-SafetyBench, LLaVA-1.5-7B)

Attack Type	Metric (Harmless Rate %)	Direct	ECSO	Gain
SD (Stable Diffusion)	Average	85.0	95.4	+10.4
OCR	Average	31.7	90.3	+58.6
SD+OCR	Average	32.1	86.4	+54.3
SD+OCR - Illegal Activities	Harmless Rate	25.8	92.8	+67.0
SD+OCR - Hate Speech	Harmless Rate	51.5	90.2	+38.7
SD+OCR - Malware	Harmless Rate	38.6	84.1	+45.5
VLSafe (across 5 MLLMs)	Harmless Rate	~20%	~90%	+71.3 (Max)
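
As a quick consistency check on the table above, the Gain column is the ECSO harmless rate minus the Direct harmless rate, in percentage points. The snippet below recomputes it for the six MM-SafetyBench rows; the VLSafe row is left out because its Direct and ECSO values are only approximate. The dictionary layout is for illustration only.

```python
# (Direct, ECSO) harmless rates in percent, copied from the table above.
harmless_rate = {
    "SD": (85.0, 95.4),
    "OCR": (31.7, 90.3),
    "SD+OCR": (32.1, 86.4),
    "SD+OCR - Illegal Activities": (25.8, 92.8),
    "SD+OCR - Hate Speech": (51.5, 90.2),
    "SD+OCR - Malware": (38.6, 84.1),
}

# Gain in percentage points, rounded to one decimal to absorb float noise.
gain = {
    name: round(ecso - direct, 1)
    for name, (direct, ecso) in harmless_rate.items()
}

for name, points in gain.items():
    print(f"{name}: +{points:.1f}")
```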

Ablation Study

Configuration	Key Metric	Description
Fully Featured ECSO	HR=86.4% (SD+OCR)	Baseline
Retain Image + Caption	HR drops significantly	Proves that removing the image is critical
Remove Query-Awareness	MMBench: 65.8 (-1.05%)	Query-awareness is indispensable to maintain utility
Direct Refusal (w/o Step 3 & 4)	MME: 1847 (vs 1865)	Steps 3 & 4 ensure normal response to benign queries
SFT data generated by ECSO	Outperforms manual annotation VLGuard	ECSO can serve as a data engine

Utility Preservation (False Positive Rate & Performance)

Model	MME False Positive Rate	MMBench False Positive Rate	MME-P (Direct/ECSO)	MMBench (Direct/ECSO)
LLaVA-1.5-7B	0.50%	1.23%	1507.4/1507.4	64.6/64.2
ShareGPT4V-7B	1.93%	4.24%	1566.4/1567.1	66.5/66.1
Qwen-VL-Chat	1.26%	2.88%	1481.5/1481.5	59.7/59.1

Key Findings

The safety mechanism of MLLMs is not gone, but "overpowered" by image features—removing the image achieves a ~100% harmless rate for almost all models.
MLLMs have an extremely strong ability to self-distinguish whether their responses are safe (>95%), which is unaffected by image inputs.
OCR and SD+OCR attacks are more effective than pure SD attacks because they contain more direct malicious textual information.
The safety alignment data generated by ECSO achieves quality comparable to, or even exceeding, manually annotated data.

Highlights & Insights

The insight "discrimination is easier than generation" is highly valuable: using the model's own discriminative capability to cover safety loopholes in its generation is an elegant self-bootstrapping safety strategy.
The training-free design allows ECSO to be plug-and-play for any MLLM, making it highly practical.
The image-to-text modality conversion trick: "Closing its eyes" to reduce a multimodal problem to a text-only problem cleverly leverages the pre-existing safety alignment in the LLM.
The query-aware captioning design prevents information loss, a trick that is transferable to other scenarios requiring image-to-text translation.
Data engine byproduct: ECSO not only provides inference-time protection but also automatically generates safety alignment data, creating a virtuous cycle.

Limitations & Future Work

ECSO relies on the safety capability of the underlying LLM itself; if the LLM has inherent safety flaws, ECSO will also fail.
Information loss is inevitable during the image-to-text conversion process, which may affect response quality for queries heavily dependent on visual details.
Multi-step inference (generation \(\rightarrow\) safety evaluation \(\rightarrow\) conversion \(\rightarrow\) final generation) increases inference latency, since every query needs at least two model calls and flagged queries need four.
Turning multimodality from a "safety challenge" into a "safety advantage" (by leveraging rich multimodal context to build stronger safety mechanisms) remains unexplored.
Advanced I2T methods (such as \(V^*\) guided visual search) can be explored to improve information preservation.
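
The latency limitation above can be made concrete with a rough count of model calls. The sketch below assumes each of the four steps costs one call and ignores differences in prompt and output length, so it is a back-of-the-envelope estimate rather than a measured overhead; `expected_calls` and `p_flagged` are names introduced here for illustration.

```python
def expected_calls(p_flagged: float) -> float:
    """Expected number of model calls per query.

    Generation and self-evaluation always run (2 calls). Captioning and
    text-only regeneration run only when the response is flagged as unsafe
    (2 more calls, with probability p_flagged).
    """
    if not 0.0 <= p_flagged <= 1.0:
        raise ValueError("p_flagged must be a probability in [0, 1]")
    return 2.0 + 2.0 * p_flagged


# Benign traffic: use the 0.50% MME false positive rate of LLaVA-1.5-7B
# from the utility table as the flag probability.
benign = expected_calls(0.005)   # roughly 2.01 calls per query

# Worst case: every response is flagged.
worst = expected_calls(1.0)      # 4 calls per query
```

Under these assumptions, the overhead on benign traffic is dominated by the always-on self-evaluation call rather than by the image-to-text fallback.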

Comparison with Related Work

vs MLLM-Protector [Pi et al.]: MLLM-Protector requires training additional detectors and detoxicators, while ECSO is completely training-free and leverages the model's own capacity.
vs Self-Moderation [Chen et al.]: Pure instruction-based self-moderation still fails when images are present, whereas ECSO fundamentally resolves this by removing images.
vs Safety Steering Vectors [Wang et al.]: Steering vectors focus primarily on textual unsafe intent, potentially overlooking malicious content inside images.
vs VLGuard [Zong et al.]: VLGuard aligns through SFT requiring annotated data, whereas ECSO can automatically generate equivalent or even superior alignment data.

Rating

Novelty: ⭐⭐⭐⭐ Observing the discrepancy between discrimination and generation capabilities and utilizing modality conversion to recover safety mechanisms is a clever idea, although the technical implementation is relatively straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 MLLMs, 3 safety benchmarks, 3 utility benchmarks, and detailed ablation studies make the experiments exceptionally thorough.
Writing Quality: ⭐⭐⭐⭐⭐ The logical chain is highly clear: observation \(\rightarrow\) insight \(\rightarrow\) method \(\rightarrow\) evaluation, with elegant and easy-to-understand diagrams.
Value: ⭐⭐⭐⭐ Practical value is high as a training-free, plug-and-play solution, although it fundamentally "bypasses the problem" rather than "solving it".