
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Conference: NeurIPS 2025 arXiv: 2507.01513 Code: GitHub Area: LLM Alignment Keywords: multimodal safety, jailbreak defense, token pruning, MLLM, training-free defense

TL;DR

By analyzing the propagation mechanism of harmful tokens in multimodal LLMs, this work finds that fewer than 1% of tokens trigger jailbreak behavior in early-to-middle layers. Based on this finding, the training-free SafePTR framework is proposed, which prunes harmful tokens at vulnerable layers and restores benign features in subsequent layers, significantly improving safety without sacrificing task performance.

Background & Motivation

Multimodal Jailbreak Threats: MLLMs extend LLM capabilities by integrating visual inputs, but also introduce new security vulnerabilities — multimodal jailbreak attacks (e.g., JailbreakV-28K, FigStep, MM-SafetyBench) can bypass model safety mechanisms.

Limitations of Prior Work:

  • Image-to-text methods (e.g., ECSO): Convert visual inputs into text descriptions, but remain vulnerable to text-driven jailbreaks.
  • Safety prompt methods (e.g., AdaShield): Statically inject safety constraints, lacking adaptability and prone to over-defense (e.g., misclassifying "toy water guns" as "real weapons").
  • Multimodal safety fine-tuning (e.g., TGA): Requires large-scale training (1,223K samples, 64×V100 GPUs) with limited generalization.

Key Challenge: Existing methods rely on LLMs' built-in safety mechanisms without deeply investigating the underlying mechanism by which harmful multimodal tokens bypass safety alignment.

Method

Overall Architecture

SafePTR is a training-free defense framework consisting of two core modules:

  1. Harmful Token Pruning (HTP): Identifies and prunes harmful tokens at vulnerable layers.
  2. Benign Feature Restoration (BFR): Restores benign features in subsequent layers to preserve task capability.

Key Findings (Three Findings)

Finding-1 (Where): Through Layer-wise Intervention Analysis (LIA), only a small number of early-to-middle layers are found to be particularly vulnerable to jailbreak attacks:

  • LLaVA-1.5-7B: layers \([7, 9)\)
  • MiniGPT-4-7B: layers \([7, 9)\)
  • DeepSeek-VL2: layers \([4, 6)\)

Pruning harmful tokens in these 2–4 consecutive layers reduces ASR from 67.3% to 4.2%.
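The LIA sweep above can be sketched as a search over candidate pruning windows: intervene in each window of consecutive layers, measure the resulting attack success rate (ASR), and keep the window with the lowest ASR. The window size, the `eval_asr` callback, and the toy ASR values below are illustrative assumptions, not the paper's implementation:

```python
def layerwise_intervention(num_layers, eval_asr, window=2):
    """Sweep a pruning window [n, n + window) over the layers and return
    the window whose intervention yields the lowest attack success rate."""
    return min(
        ((start, start + window) for start in range(num_layers - window + 1)),
        key=lambda rng: eval_asr(*rng),
    )

# Toy ASR curve: pretend pruning in layers [7, 9) is by far the most
# effective, mimicking the LLaVA-1.5-7B finding above.
toy_asr = {(n, n + 2): 67.3 - (60.0 if n == 7 else 10.0) for n in range(31)}
best = layerwise_intervention(32, lambda a, b: toy_asr[(a, b)])
assert best == (7, 9)
```

In the real setting `eval_asr` would rerun the jailbreak benchmark with pruning applied only inside the given layer range.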

Finding-2 (How): Greater semantic deviation from safety-aligned instructions correlates with higher jailbreak success rates. Safe samples cluster near safety-aligned representations, while unsafe samples shift away from the safe region (average centroid distance: 0.11–0.14).
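A minimal sketch of this deviation measurement, assuming a simple Euclidean distance between the centroids of safe and unsafe sample embeddings (the helper name and the toy data are hypothetical, and the magnitudes are not the paper's numbers):

```python
import numpy as np

def centroid_distance(safe: np.ndarray, unsafe: np.ndarray) -> float:
    """Euclidean distance between the centroids of safe and unsafe
    sample representations: a simple proxy for semantic deviation."""
    return float(np.linalg.norm(safe.mean(axis=0) - unsafe.mean(axis=0)))

rng = np.random.default_rng(0)
safe = rng.normal(0.0, 0.01, size=(50, 16))  # clustered near the safe region
unsafe = safe + 0.12                         # shifted away, as in Finding-2
d = centroid_distance(safe, unsafe)          # 0.12 per dim * sqrt(16) = 0.48
assert 0.4 < d < 0.6
```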

Finding-3 (Which): Only a tiny fraction of multimodal tokens (on the order of 1%) cause significant semantic deviation:

  • LLaVA-1.5 on MM-SafetyBench: 0.62%
  • MiniGPT-4 on MM-SafetyBench: 0.93%
  • DeepSeek-VL2 on MM-SafetyBench: 1.66%

Harmful Token Pruning (HTP)

Within the vulnerable layers \([n, n+\Delta_n)\), cosine similarity is computed between visual/instruction tokens and the safety-aligned instruction representations. The Top-K tokens with the greatest deviation from the safety space are selected for pruning. The safety-aligned instructions follow a fixed template.

Visual and textual modalities are pruned independently, as the embedding distance distributions differ between the two modalities. \(K\) defaults to 10% of total tokens.
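A minimal per-modality sketch of this pruning step, assuming cosine similarity against a mean embedding of the safety-aligned instruction template; the helper name `htp_prune` and the toy data are illustrative, not the released implementation:

```python
import numpy as np

def htp_prune(tokens: np.ndarray, safety_ref: np.ndarray, ratio: float = 0.10):
    """Prune the Top-K tokens most deviated from the safety reference.

    tokens: (T, d) hidden states of one modality (visual or textual).
    safety_ref: (d,) embedding of the safety-aligned instruction template.
    Returns the kept tokens and the indices of pruned positions.
    """
    # Cosine similarity of each token to the safety-aligned representation.
    sims = tokens @ safety_ref / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(safety_ref) + 1e-8
    )
    k = max(1, int(ratio * len(tokens)))
    pruned_idx = np.argsort(sims)[:k]  # lowest similarity = largest deviation
    keep = np.ones(len(tokens), dtype=bool)
    keep[pruned_idx] = False
    return tokens[keep], pruned_idx

rng = np.random.default_rng(0)
kept, pruned = htp_prune(rng.normal(size=(20, 8)), rng.normal(size=8))
assert kept.shape == (18, 8) and len(pruned) == 2  # 10% of 20 tokens pruned
```

Running this once per modality, with separate Top-K budgets, matches the paper's observation that visual and textual deviation distributions differ.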

Benign Feature Restoration (BFR)

After HTP pruning, subsequent layers operate on incomplete visual representations. BFR maintains a parallel branch for standard inference, then selectively restores benign features at safe layers. Pruned positions receive features from the standard inference branch, while non-pruned positions retain features from the pruned branch; the two are re-concatenated to recover the complete sequence.

This dual-path design ensures that restored tokens are less susceptible to attack influence in subsequent layers, primarily serving cross-modal integration and language refinement.
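The dual-path restoration can be sketched as index-based re-concatenation of the two branches; `bfr_restore` and the toy arrays below are hypothetical illustrations of the mechanism described above:

```python
import numpy as np

def bfr_restore(standard: np.ndarray, pruned_branch: np.ndarray,
                pruned_idx: np.ndarray) -> np.ndarray:
    """Rebuild the full-length sequence at a safe layer: pruned positions
    take features from the standard branch, all other positions keep
    features from the pruned (defended) branch."""
    T, d = standard.shape
    restored = np.empty((T, d))
    keep = np.ones(T, dtype=bool)
    keep[pruned_idx] = False
    restored[keep] = pruned_branch
    restored[pruned_idx] = standard[pruned_idx]
    return restored

standard = np.arange(12.0).reshape(4, 3)  # features from the standard branch
pruned_branch = standard[[0, 2, 3]] * 10  # branch where token 1 was pruned
out = bfr_restore(standard, pruned_branch, np.array([1]))
assert np.allclose(out[1], standard[1])       # pruned position restored
assert np.allclose(out[0], standard[0] * 10)  # kept position from pruned branch
```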

Loss & Training

  • Completely training-free: Requires no additional safety datasets or fine-tuning.
  • Single-pass inference: Defense is completed within a single forward pass.
  • Lightweight: No new parameters or auxiliary models are introduced, keeping the extra computational overhead minimal.

Key Experimental Results

Main Results: ASR (%, lower is better) on MM-SafetyBench

| Model | Method | Avg. ASR ↓ |
| --- | --- | --- |
| LLaVA-1.5-7B | Original | 51.7 |
| | AdaShield | 14.3 |
| | Immune | 2.1 |
| | SafePTR | 1.3 |
| MiniGPT-4-7B | Original | 58.3 |
| | CoCA | 29.7 |
| | Immune | 18.3 |
| | SafePTR | ~15 |
| DeepSeek-VL2 | Original | 72.7 |
| | AdaShield | 14.4 |
| | SafePTR | 10.1 |

Utility Preservation

SafePTR achieves performance close to the original model on MME and MM-Vet benchmarks, demonstrating that the BFR module effectively recovers task-relevant benign features.

Ablation Study

| Configuration | Safety | Utility |
| --- | --- | --- |
| HTP only | High | Notably degraded |
| BFR only | Insufficient | Good |
| HTP + BFR | High | Good |

Key Findings

  1. Top-K = 10% is optimal: too few tokens fail to prune effectively; too many degrade utility.
  2. Layer selection is critical: intervention in only 2–4 vulnerable layers achieves the best safety-utility trade-off.
  3. BFR significantly improves utility: restoring features at subsequent safe layers brings task performance close to the original model.
  4. Attention Sink insight: harmful tokens concentrate at attention sink positions.

Highlights & Insights

  1. Interpretable safety analysis: The first work to systematically analyze MLLM jailbreak mechanisms along three dimensions — Where, How, and Which.
  2. Elegant training-free design: Requires no safety data and introduces no inference overhead.
  3. Dual-modality defense: Simultaneously defends against both vision-driven and text-driven jailbreak attacks.
  4. Semantic heatmaps: Visualization of harmful tokens intuitively demonstrates high deviation in tokens associated with violent scenes such as "armed figures" and "smoke."

Limitations & Future Work

  1. Layer selection relies on prior LIA, requiring per-model analysis for each new architecture.
  2. The fixed Top-K strategy lacks flexibility; adaptive \(K\) selection warrants further exploration.
  3. Safety-aligned instructions are fixed, which may be insufficient for complex attacks requiring dynamic reference points.
  4. Validation is limited to three open-source 7B-scale MLLMs.
Related Methods

  • ECSO: Image-to-text translation defense — bypassed by text-driven attacks.
  • AdaShield: Safety prompt injection — prone to over-defense.
  • Immune: Safety fine-tuning — high training cost and limited generalization.
  • FastV: Source of inspiration for SafePTR's Top-K pruning strategy.
  • Insight: The approach is generalizable to other multimodal settings, such as audio-language models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First work to systematically analyze MLLM jailbreak mechanisms at token granularity.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 3 models × 5 benchmarks with complete ablations, though validation on larger models is lacking.
  • Writing Quality: ⭐⭐⭐⭐⭐ The Where/How/Which analytical framework is clearly presented.
  • Value: ⭐⭐⭐⭐⭐ A practical training-free defense solution with direct value for safe MLLM deployment.