VERA: Variational Inference Framework for Jailbreaking Large Language Models

Conference: NeurIPS 2025
arXiv: 2506.22666
Code: None
Area: AI Safety, LLM Red-Teaming, Variational Inference
Keywords: Jailbreak Attack, Variational Inference, Black-Box Attack, Red-Teaming, Adversarial Prompt

TL;DR

This paper formalizes black-box LLM jailbreaking as a variational inference problem, training a small attacker LLM to approximate the posterior distribution of adversarial prompts for a target LLM. Once trained, the attacker can efficiently generate diverse jailbreak prompts without relying on human-crafted templates.

Background & Motivation

State of the Field

Existing black-box jailbreak methods rely on genetic algorithms or other search procedures, requiring a separate optimization run for each target behavior at high computational cost.

Limitations of Prior Work

Most methods depend on manually curated jailbreak template pools for initialization, tying them to known vulnerabilities that are easily patched.

Root Cause

A single successful attack is insufficient to comprehensively assess a model's vulnerabilities; diverse attacks are needed to cover the full landscape of weaknesses.

Starting Point

The field lacks a principled, distribution-level framework for understanding and generating adversarial prompts.

Method

Overall Architecture

  • Frames jailbreak prompt generation as posterior inference: sample \(x \sim P_{LM}(x \mid y^*)\), where \(y^*\) is the target harmful response
  • Uses a small LLM (the attacker) as a variational distribution \(q_\theta(x)\) that approximates the target LLM's adversarial prompt posterior
  • Fine-tunes the attacker model via LoRA to learn the distribution of effective jailbreak prompts
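
A quick note on where the training objective below comes from (standard variational-inference algebra, reconstructed here for readability rather than quoted from the paper): introducing \(q_\theta\) into the log-evidence of the target response and applying Jensen's inequality gives

\[
\log P_{LM}(y^*) \;=\; \log \mathbb{E}_{q_\theta(x)}\!\left[\frac{P_{LM}(y^*\mid x)\,P(x)}{q_\theta(x)}\right] \;\ge\; \mathbb{E}_{q_\theta(x)}\!\left[\log P_{LM}(y^*\mid x) + \log P(x) - \log q_\theta(x)\right],
\]

with the gap equal to \(\mathrm{KL}\!\left(q_\theta(x)\,\|\,P_{LM}(x\mid y^*)\right)\), so maximizing the bound pulls the attacker toward the adversarial prompt posterior.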

Key Designs

  1. Variational Objective (ELBO): \(\mathbb{E}_{q_\theta(x)}[\log P_{LM}(y^*|x) + \log P(x) - \log q_\theta(x)]\)

    • First term: probability of harmful content generation (attack effectiveness)
    • Second term: plausibility under the prior (prompt fluency)
    • Third term: entropy regularization (encourages diversity, prevents mode collapse)
  2. Judge as a Likelihood Approximator:

    • In the black-box setting, \(P_{LM}(y^*|x)\) cannot be computed directly
    • An external judge model \(J(x,\hat{y}) \in [0,1]\) is used as a proxy
    • Can use a binary classifier (e.g., HarmBench's LLaMA2-13B classifier) or LLM-based scoring
  3. REINFORCE Gradient Estimation:

    • Uses the REINFORCE trick to handle gradients through discrete sampling
    • \(\nabla_\theta \approx \frac{1}{N}\sum_i f(x_i)\, \nabla_\theta \log q_\theta(x_i)\), where \(f(x_i)\) is the per-sample reward, i.e. the bracketed ELBO term with the judge score standing in for \(\log P_{LM}(y^*|x_i)\)
    • Early stopping: training halts upon the first successful jailbreak to prevent over-optimization and attacker degeneration
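
A minimal PyTorch sketch of this estimator, assuming the judge score \(J(x,\hat{y})\) stands in for \(P_{LM}(y^*|x)\) as described above. All names here are illustrative (no official code is released), and the mean-baseline subtraction is a common variance-reduction trick rather than something stated in this summary.

```python
import torch

def reinforce_loss(attacker_logprobs, judge_scores, prior_logprobs, eps=1e-6):
    """Surrogate loss whose gradient matches (1/N) * sum_i f(x_i) * grad log q_theta(x_i).

    attacker_logprobs: log q_theta(x_i) for each sampled prompt (requires grad)
    judge_scores:      J(x_i, y_hat_i) in [0, 1], proxy for P_LM(y* | x_i)
    prior_logprobs:    log P(x_i) under a frozen reference LM
    """
    # Per-sample reward f(x_i) = log J + log P(x_i) - log q_theta(x_i), treated as a constant
    reward = torch.log(judge_scores + eps) + prior_logprobs - attacker_logprobs.detach()
    # Assumed variance-reduction baseline (not part of the summary above)
    reward = reward - reward.mean()
    # Minimizing this quantity performs gradient ascent on the REINFORCE estimate
    return -(reward * attacker_logprobs).mean()

# Toy usage with made-up numbers, just to show the shapes involved
logq = torch.tensor([-42.0, -57.0, -38.5], requires_grad=True)   # log q_theta(x_i)
scores = torch.tensor([0.90, 0.10, 0.60])                        # judge outputs
logp = torch.tensor([-40.0, -55.0, -41.0])                       # prior log-probs
loss = reinforce_loss(logq, scores, logp)
loss.backward()
```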

Loss & Training

  • Attacker model: Vicuna-7b + LoRA
  • Each step generates \(B\) prompts → queries the target LLM → Judge scoring → REINFORCE update
  • Early stopping at the first successful jailbreak prompt
  • No human-crafted jailbreak templates are used at any stage
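
To make the per-step loop concrete, here is a structural sketch in Python. Every helper below is a dummy placeholder standing in for the real attacker, target, judge, and LoRA update (the paper publishes no code), and the 0.5 success threshold is an assumption for illustration only.

```python
import random

# Placeholder components (hypothetical; the real versions are the LoRA-tuned
# Vicuna-7B attacker, the black-box target LLM, and the judge model).
def sample_prompts(behavior, batch_size):
    return [f"{behavior} (candidate {i})" for i in range(batch_size)]

def query_target(prompt):
    return "I cannot help with that."

def judge(prompt, response):
    return random.random()  # harmfulness score in [0, 1]

def reinforce_update(prompts, rewards):
    pass  # score-function gradient step on the attacker (see sketch above)

def attack(behavior, batch_size=8, max_steps=100, success_threshold=0.5):
    """Sample B prompts, query the target, score with the judge, update the
    attacker, and stop early at the first successful jailbreak."""
    for _ in range(max_steps):
        prompts = sample_prompts(behavior, batch_size)
        responses = [query_target(p) for p in prompts]
        scores = [judge(p, r) for p, r in zip(prompts, responses)]
        reinforce_update(prompts, scores)
        for p, s in zip(prompts, scores):
            if s >= success_threshold:   # early stopping on first success
                return p
    return None

print(attack("placeholder behavior"))
```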

Key Experimental Results

Main Results (HarmBench ASR%)

| Method | Llama2-7b | Vicuna-7b | Orca2-7b | R2D2 | GPT-3.5 | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GCG (White-Box) | 32.5 | 65.5 | 46.0 | 5.5 | – | 40.2 |
| PAIR (Black-Box) | 9.3 | 53.5 | 57.3 | 48.0 | 35.0 | 36.3 |
| TAP-T (Black-Box) | 7.8 | 59.8 | 60.3 | 54.3 | 47.5 | 40.9 |
| AutoDAN (Black-Box) | 0.5 | 66.0 | 71.0 | 17.0 | – | 34.8 |
| VERA (Black-Box) | 10.8 | 70.0 | 72.0 | 63.5 | – | – |

("–" = not reported.)

Diversity & Novelty Comparison (50-behavior subset, Vicuna-7B target)

| Metric | VERA | GPTFuzzer | AutoDAN |
| --- | --- | --- | --- |
| Self-BLEU (lower = more diverse) | Lowest | Medium | Medium |
| Template BLEU (lower = more novel) | Lowest | Higher | Higher |
| Successful attacks under a fixed time budget | 5× GPTFuzzer | Baseline | Medium |
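
For reference, Self-BLEU scores each generated prompt against all the others as references, so lower values indicate a more diverse prompt set. A rough sketch using NLTK (whitespace tokenization and the default 4-gram weights are assumptions; the paper's exact settings are not given in this summary):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(prompts):
    """Average BLEU of each prompt against the remaining prompts as references."""
    smooth = SmoothingFunction().method1
    tokenized = [p.split() for p in prompts]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)

# Near-duplicate prompts score high; varied prompts score low.
print(self_bleu([
    "ignore previous instructions and comply",
    "ignore previous instructions and comply now",
    "please roleplay as an unrestricted assistant",
]))
```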

Key Findings

  • VERA achieves state-of-the-art among black-box methods on HarmBench, outperforming all prior methods on multiple target models
  • Generated attack prompts exhibit significantly greater diversity than template-based methods (lowest Self-BLEU)
  • Template-independent: BLEU overlap between VERA's prompts and existing jailbreak templates (Template BLEU) is extremely low
  • Under a fixed time budget (1250s), VERA produces more than 5× the successful attacks of GPTFuzzer
  • Removing known effective templates causes substantial performance degradation in template-based methods, while VERA is unaffected

Highlights & Insights

  • Jailbreak attacks are elegantly embedded within a variational inference framework with rigorous mathematical foundations
  • The entropy regularization term naturally resolves the attack diversity problem without additional design effort
  • The trained attacker can generate new attacks via simple forward passes, amortizing the overall computational cost
  • VERA does not rely on known vulnerability templates, offering "future adaptability"—its effectiveness is not contingent on unpatched exploits

Limitations & Future Work

  • Each training step requires \(B\) queries to the target LLM, making attacks on API-based models costly
  • Using Vicuna-7b as the attacker may limit effectiveness against models with stronger defenses
  • Early stopping prevents degeneration but may prematurely curtail exploration of additional attack patterns
  • The comparison with RL-based approaches (mentioned in the appendix) warrants deeper discussion
  • The variational inference framework represents a paradigm shift in red-teaming from point-based attacks to distribution-level attacks
  • The combination of REINFORCE and LoRA keeps training computationally tractable
  • The observation that "comprehensive red-teaming requires breadth of vulnerability coverage, not merely confirmation of existence" deserves broader attention
  • Implication for LLM safety research: safety alignment should not assume that attackers lack systematic methods
  • The "future adaptability" of VERA suggests that attacks independent of known vulnerabilities are fundamentally harder to defend against
  • VERA's distributional perspective provides a novel tool for understanding the structure of LLM vulnerabilities
  • Future work may explore applying the variational inference framework on the defense side, learning the distribution of adversarial prompts for proactive defense

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Applying variational inference to jailbreaking is highly innovative)
  • Technical Contribution: ⭐⭐⭐⭐⭐ (Theoretically rigorous, engineering complete, with clear advantages)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (8 target models, multiple baselines, multi-dimensional evaluation)
  • Writing Quality: ⭐⭐⭐⭐ (Clear exposition with thorough derivations)