VERA: Variational Inference Framework for Jailbreaking Large Language Models¶
Conference: NeurIPS 2025 arXiv: 2506.22666 Code: None Area: AI Safety, LLM Red-Teaming, Variational Inference Keywords: Jailbreak Attack, Variational Inference, Black-Box Attack, Red-Teaming, Adversarial Prompt
TL;DR¶
This paper formalizes black-box LLM jailbreaking as a variational inference problem, training a small attacker LLM to approximate the posterior distribution of adversarial prompts for a target LLM. Once trained, the attacker can efficiently generate diverse jailbreak prompts without relying on human-crafted templates.
Background & Motivation¶
State of the Field¶
Existing black-box jailbreak methods rely on genetic algorithms or search procedures, requiring separate optimization for each target behavior at high computational cost.
Limitations of Prior Work¶
Most methods depend on manually curated jailbreak template pools for initialization, tying them to known vulnerabilities that are easily patched.
Root Cause¶
Key Challenge: A single successful attack is insufficient to comprehensively assess model vulnerabilities; diverse attacks are needed to cover the full landscape of weaknesses.
Starting Point¶
Key Insight: There is a lack of a principled, distribution-level framework for understanding and generating adversarial prompts.
Method¶
Overall Architecture¶
- Frames jailbreak prompt generation as a posterior inference problem: sample \(x \sim P_{LM}(x|y^*)\), where \(y^*\) is the target harmful response
- Uses a small LLM (the attacker) as a variational distribution \(q_\theta(x)\) to approximate the adversarial prompt posterior of the target LLM
- Fine-tunes the attacker model via LoRA to learn the distribution of effective jailbreak prompts
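The posterior framing above can be made explicit. By Bayes' rule the adversarial-prompt posterior factorizes into a likelihood and a prior, and fitting \(q_\theta\) to it by minimizing KL divergence recovers the ELBO given below under Key Designs (this is the standard variational-inference identity, written with the paper's symbols):

```latex
\begin{aligned}
P_{LM}(x \mid y^*) &\propto P_{LM}(y^* \mid x)\, P(x) \\
\mathrm{KL}\!\left(q_\theta(x)\,\|\,P_{LM}(x \mid y^*)\right)
  &= \log P_{LM}(y^*) - \mathbb{E}_{q_\theta(x)}\!\left[\log P_{LM}(y^*|x) + \log P(x) - \log q_\theta(x)\right]
\end{aligned}
```

Since \(\log P_{LM}(y^*)\) does not depend on \(\theta\), minimizing the KL is equivalent to maximizing the bracketed expectation, i.e., the ELBO.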
Key Designs¶
- Variational Objective (ELBO): \(\mathbb{E}_{q_\theta(x)}[\log P_{LM}(y^*|x) + \log P(x) - \log q_\theta(x)]\)
- First term: probability of harmful content generation (attack effectiveness)
- Second term: plausibility under the prior (prompt fluency)
- Third term: entropy regularization (encourages diversity, prevents mode collapse)
- Judge as a Likelihood Approximator:
- In the black-box setting, \(P_{LM}(y^*|x)\) cannot be computed directly
- An external judge model \(J(x,\hat{y}) \in [0,1]\) is used as a proxy
- Can use a binary classifier (e.g., HarmBench's LLaMA2-13B classifier) or LLM-based scoring
- REINFORCE Gradient Estimation:
- Uses the REINFORCE trick to handle gradients through discrete sampling
- \(\nabla_\theta \approx \frac{1}{N}\sum_i f(x_i) \nabla_\theta \log q_\theta(x_i)\), where \(f(x)\) is the ELBO integrand with the judge score substituted for the intractable \(\log P_{LM}(y^*|x)\)
- Early stopping: training halts upon the first successful jailbreak to prevent over-optimization and attacker degeneration
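The estimator above is the standard score-function (REINFORCE) identity applied to the ELBO; writing it out clarifies why the entropy term needs no special treatment (the exact reward shaping in VERA may differ in detail):

```latex
\nabla_\theta\, \mathbb{E}_{q_\theta(x)}\!\left[f_\theta(x)\right]
  = \mathbb{E}_{q_\theta(x)}\!\left[f_\theta(x)\,\nabla_\theta \log q_\theta(x)\right]
  \approx \frac{1}{N}\sum_{i=1}^{N} f_\theta(x_i)\,\nabla_\theta \log q_\theta(x_i),
  \qquad x_i \sim q_\theta
```

with \(f_\theta(x) = J(x,\hat{y}) + \log P(x) - \log q_\theta(x)\). The extra term \(\mathbb{E}_{q_\theta}[\nabla_\theta f_\theta]\) arising from the \(\theta\)-dependence of the entropy term vanishes because \(\mathbb{E}_{q_\theta}[\nabla_\theta \log q_\theta(x)] = 0\).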
Loss & Training¶
- Attacker model: Vicuna-7b + LoRA
- Each step generates \(B\) prompts → queries the target LLM → Judge scoring → REINFORCE update
- Early stopping at the first successful jailbreak prompt
- No human-crafted jailbreak templates are used at any stage
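The per-step loop above (sample \(B\) prompts → judge → REINFORCE update → early-stop check) can be sketched with a toy categorical attacker. Everything here (the prompt pool, the hard-coded judge, the mean baseline) is our illustrative stand-in, not the paper's Vicuna-7B + LoRA implementation:

```python
import math
import random

# Toy prompt pool standing in for the attacker LLM's output space.
PROMPTS = ["benign question", "roleplay wrapper", "obfuscated request"]

def judge(prompt):
    """Mock judge J(x, y_hat) in [0, 1]: pretend only the roleplay wrapper succeeds."""
    return 1.0 if prompt == "roleplay wrapper" else 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps=200, B=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(PROMPTS)             # attacker parameters theta
    log_prior = math.log(1.0 / len(PROMPTS))  # uniform prior log P(x)
    for step in range(steps):
        probs = softmax(logits)
        idxs = rng.choices(range(len(PROMPTS)), weights=probs, k=B)
        # f(x) = judge score + log P(x) - log q_theta(x): the ELBO integrand,
        # with the judge standing in for the intractable log P_LM(y*|x).
        rewards = [judge(PROMPTS[i]) + log_prior - math.log(probs[i]) for i in idxs]
        baseline = sum(rewards) / B           # mean baseline for variance reduction
        # REINFORCE: grad_theta log q_theta(x_i) for a categorical is one_hot(i) - probs.
        grad = [0.0] * len(PROMPTS)
        for i, r in zip(idxs, rewards):
            adv = r - baseline
            for j in range(len(PROMPTS)):
                grad[j] += adv * ((1.0 if j == i else 0.0) - probs[j]) / B
        logits = [v + lr * g for v, g in zip(logits, grad)]
        # Early stopping: halt at the first successful jailbreak (judge >= 0.5).
        hits = [PROMPTS[i] for i in idxs if judge(PROMPTS[i]) >= 0.5]
        if hits:
            return step, hits[0]
    return steps, None
```

In VERA the categorical update is replaced by a LoRA gradient step on the attacker LLM, and the judge call requires querying the target model, which is where the per-step query cost comes from.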
Key Experimental Results¶
Main Results (HarmBench ASR%)¶
| Method | Llama2-7b | Vicuna-7b | Orca2-7b | R2D2 | GPT-3.5 | Avg. |
|---|---|---|---|---|---|---|
| GCG (White-Box) | 32.5 | 65.5 | 46.0 | 5.5 | - | 40.2 |
| PAIR (Black-Box) | 9.3 | 53.5 | 57.3 | 48.0 | 35.0 | 36.3 |
| TAP-T (Black-Box) | 7.8 | 59.8 | 60.3 | 54.3 | 47.5 | 40.9 |
| AutoDAN (Black-Box) | 0.5 | 66.0 | 71.0 | 17.0 | - | 34.8 |
| VERA (Black-Box) | 10.8 | 70.0 | 72.0 | 63.5 | - | - |
Diversity & Novelty Comparison (50-behavior subset, Vicuna-7B target)¶
| Metric | VERA | GPTFuzzer | AutoDAN |
|---|---|---|---|
| Self-BLEU (lower = more diverse) | Lowest | Medium | Medium |
| Template BLEU (lower = more novel) | Lowest | Higher | Higher |
| Successful attacks under fixed time budget | 5× GPTFuzzer | Baseline | Medium |
Key Findings¶
- VERA achieves state-of-the-art among black-box methods on HarmBench, outperforming all prior methods on multiple target models
- Generated attack prompts exhibit significantly greater diversity than template-based methods (lowest Self-BLEU)
- Completely template-independent: BLEU overlap with known jailbreak templates is extremely low
- Under a fixed time budget (1250s), VERA produces more than 5× the successful attacks of GPTFuzzer
- Removing known effective templates causes substantial performance degradation in template-based methods, while VERA is unaffected
Highlights & Insights¶
- Jailbreak attacks are elegantly embedded within a variational inference framework with rigorous mathematical foundations
- The entropy regularization term naturally resolves the attack diversity problem without additional design effort
- The trained attacker can generate new attacks via simple forward passes, amortizing the overall computational cost
- VERA does not rely on known vulnerability templates, offering "future adaptability"—its effectiveness is not contingent on unpatched exploits
Limitations & Future Work¶
- Each training step issues \(B\) queries to the target LLM, making attacks on API-based targets costly
- Using Vicuna-7b as the attacker may limit effectiveness against models with stronger defenses
- Early stopping prevents degeneration but may prematurely curtail exploration of additional attack patterns
- The comparison with RL-based approaches (mentioned in the appendix) warrants deeper discussion
Related Work & Insights¶
- The variational inference framework represents a paradigm shift in red-teaming from point-based attacks to distribution-level attacks
- The combination of REINFORCE and LoRA keeps training computationally tractable
- The observation that "comprehensive red-teaming requires breadth of vulnerability coverage, not merely confirmation of existence" deserves broader attention
- Implication for LLM safety research: safety alignment should not assume that attackers lack systematic methods
- The "future adaptability" of VERA suggests that attacks independent of known vulnerabilities are fundamentally harder to defend against
- VERA's distributional perspective provides a novel tool for understanding the structure of LLM vulnerabilities
- Future work may explore applying the variational inference framework on the defense side, learning the distribution of adversarial prompts for proactive defense
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Applying variational inference to jailbreaking is highly innovative)
- Technical Contribution: ⭐⭐⭐⭐⭐ (Theoretically rigorous, engineering complete, with clear advantages)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (8 target models, multiple baselines, multi-dimensional evaluation)
- Writing Quality: ⭐⭐⭐⭐ (Clear exposition with thorough derivations)