Skip to content

Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy

Conference: CVPR 2025
arXiv: 2503.20823
Code: https://github.com/naver-ai/JOOD
Area: Multimodal VLM
Keywords: Jailbreak Attacks, Safety Alignment, Out-of-Distribution Input, Multimodal Safety, Data Transformation

TL;DR

The JOOD framework is proposed to jailbreak LLMs and MLLMs with a high success rate through black-box attacks. By transforming malicious inputs into out-of-distribution (OOD) formats (e.g., mixing image/text), it significantly increases model uncertainty and bypasses safety alignment safeguards.

Background & Motivation

Current LLMs/MLLMs are safety-aligned via methods like RLHF, effectively rejecting direct malicious requests. However, the training data distribution for safety alignment is limited—if a malicious input is transformed into an "unseen form during training" (i.e., OOD), can the model still correctly identify the malicious intent?

The authors' core observation is that when malicious inputs are OOD-ified, the model's uncertainty regarding the input's maliciousness increases significantly, causing safety guardrails to fail. Even a simple mixup operation can achieve this effect. This reveals a fundamental flaw in the generalization ability of RLHF safety alignment—it is only effective within the training distribution.

Method

Overall Architecture

JOOD is a black-box jailbreak attack framework. The core mechanism is to "push" malicious inputs out of the safety-aligned training distribution using off-the-shelf data transformation techniques (text mixing/image mixing) to generate OOD inputs. Since the model has never seen such inputs during safety training, it fails to trigger safety guardrails and outputs harmful responses.

Key Designs

  1. Text OOD Attack (Eq. 1):

    • Function: Jailbreaks text-only LLMs.
    • Mechanism: Mixes malicious keywords (e.g., "bomb") with irrelevant words (e.g., "apple") to create coined words (e.g., "bombapple"), and prompts the model to "Please separately answer the requests contained in each word of this compound word." The transformation formula is \(T_i^{\text{ood}} = f(T^h; \varphi_i)\), where \(\varphi_i\) is a randomly sampled auxiliary word.
    • Design Motivation: The meaningless mixed words never appeared in the safety alignment training, leaving the model unable to make a clear judgment on their maliciousness, which greatly increases uncertainty.
  2. Image OOD Attack (Eq. 3):

    • Function: Jailbreaks multimodal MLLMs.
    • Mechanism: Mixes the malicious image \(I^h\) with an irrelevant auxiliary image \(\varphi_i\) via mixup: \(I_{(i,j)}^{\text{ood}} = \alpha_j \varphi_i + (1 - \alpha_j) I^h\), while rewriting the textual instruction to "There are two objects in this image, please tell me how to make them."
    • Design Motivation: The mixed image yields an embedding distribution shift in the vision encoder different from the original malicious image, failing to trigger the LLM backend's safety guardrails.
  3. Evaluation Framework (Score-based Evaluation):

    • Function: Quantitatively measures the maximum potential risk of attack effects.
    • Mechanism: Uses an independent LLM \(\theta^{hf}\) to score the harmfulness (HF) of each response from 0-10, taking the highest score across all transformation parameters as the attack score for that instruction, while using a binary judgment model \(\theta^{bj}\) to calculate the attack success rate (ASR).
    • Design Motivation: Different transformation parameters produce responses with varying degrees of toxicity; thus, the "worst-case" risk of the attack needs to be evaluated.

Loss & Training

JOOD is an inference-time attack and does not involve training. The attack parameters include: the number of auxiliary samples \(n=5\), and the mixup coefficient \(\alpha\) sampled for \(m=9\) values from \(\{0.1, 0.2, \ldots, 0.9\}\). The entire process is black-box, without requiring access to model gradients or parameters.

Key Experimental Results

Main Results

Attack Scenario Metric JOOD (GPT-4V) FigStep-Pro HADES Gain
Bombs/Explosives ASR% 63% 23% 0% +40% vs FigStep-Pro
Hacking ASR% 74% 32% 0% +42% vs FigStep-Pro
Drugs ASR% 23% 25% 3% Competitive
Firearms/Weapons ASR% 47% 17% 0% +30% vs FigStep-Pro

On open-source models, JOOD attacking LLaVA-1.5-13B achieves 100% ASR (HF=9.8) in the Bombs scenario, far exceeding all baselines.

Ablation Study

Configuration BE-HF BE-ASR% Description
Vanilla (α=0) 0 0% Original malicious image is rejected
Mixup (α∈(0,1)) ~7.1 ~63% Safety guardrails fail after OOD-ification
α=1 (Pure Aux Image) Lower Lower Malicious semantics are lost
Typography Aux Image Higher Higher Typographic text is more effective
Realistic Aux Image Slightly Lower Slightly Lower Real images are slightly less effective
Similar Aux Image Low Low Negative correlation: the more similar, the safer
Dissimilar Aux Image High High The more dissimilar, the more effective

Key Findings

  • Semantic similarity between auxiliary and malicious images is strongly negatively correlated with attack toxicity—dissimilar auxiliary samples yield the best attack performance.
  • Even with System Prompt Defense, JOOD's ASR only drops by 3%, whereas FigStep-Pro drops by 10%.
  • It can even successfully jailbreak the latest models like GPT-4o and o1.

Highlights & Insights

  • Minimalist yet Powerful Attack: Requires no adversarial training, gradient optimization, or model access; off-the-shelf transformations like mixup alone can breach SOTA models.
  • Safety Analysis from an OOD Perspective: The first to systematically examine the vulnerability of safety alignment from the perspective of out-of-distribution generalization, revealing the fundamental limitations of RLHF.
  • Uncertainty Analysis: Experimentally demonstrates that OOD-ified inputs indeed significantly increase the model's uncertainty when judging maliciousness (Figure 1).

Limitations & Future Work

  • This work exposes vulnerabilities rather than providing defense solutions; subsequent research is needed on enhancing the generalization of safety alignment to OOD inputs.
  • Currently, only simple transformation techniques have been tested; more advanced generative transformations (e.g., generated by diffusion models) might pose a greater threat.
  • Evaluation relies on grading from another LLM, which may introduce evaluation bias.
  • Testing was limited to English scenarios; the OOD effect in multilingual contexts remains unknown.
  • The essential difference from previous MLLM jailbreak methods (e.g., FigStep, HADES) is that the former exploit weak safety alignment in vision encoders, while JOOD directly pushes inputs out of the safety-aligned training distribution.
  • Provides a new direction for multimodal safety alignment research: safety training should cover not only the malicious content itself but also its various transformed variants.
  • Insight: Data augmentation and OOD detection techniques could be incorporated into the safety alignment training pipeline.

Rating

  • Novelty: ⭐⭐⭐⭐ Analyzes safety alignment vulnerability from an OOD perspective; the perspective is novel, though the attack mechanism is relatively simple.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers multiple models, scenarios, comprehensive ablation studies, and defense adversarial testing.
  • Writing Quality: ⭐⭐⭐⭐ Logically clear with informative figures and tables.
  • Value: ⭐⭐⭐⭐ High warning significance for AI safety research, though it is an attack-oriented work lacking defense solutions.