Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Conference: ICML 2026
arXiv: 2605.10067
Code: None
Area: LLM Security / Red Teaming / Jailbreak / Inference-time Policy Optimization
Keywords: Red Teaming, Jailbreak, POMDP, Metacognition, Semantic Gradient

TL;DR

Reformulates multi-turn jailbreak as an inference-time policy optimization problem. Within an adversarial POMDP framework, an Attacker and a Metacognitive Evaluator form a closed loop in which the Evaluator's dense analytical feedback serves as a "semantic gradient" that guides the Attacker's belief update and policy improvement. Across 10 target models (including O1 / GPT-5-chat / Claude-3.7), Metis reaches an average ASR of 89.2% while reducing token consumption by 8.2× on average compared to strong baselines, all without updating any model weights.

Background & Motivation

Background: Automated red teaming has evolved from single-turn (GCG, PAIR, PAP, CipherChat, etc.) to multi-turn frameworks (Crescendo, CoA, ActorBreaker, X-Teaming). Multi-turn frameworks generally perform better, as they can iteratively approach the defense boundary through interaction.

Limitations of Prior Work: Even the strongest current multi-turn frameworks still fundamentally perform "random search within a predefined heuristic space"—such as tree search, topic escalation, or fixed plans, with essentially static policy templates. These work well on less-aligned models like Llama / GPT-3.5, but performance drops sharply on frontier models with strong alignment (e.g., ActorBreaker achieves only 14% on O1, X-Teaming only 49% on GPT-5-chat).

Key Challenge: Existing methods rely on sparse success/failure signals to drive search, lacking causal diagnosis of "why did this fail / what is the defense logic"; heuristic templates lack adaptability and cannot generate bespoke strategies for each target model's specific defense posture.

Goal: (1) Formalize multi-turn jailbreak as an adversarial POMDP, so that "policy learning" and "belief update" can be stated rigorously; (2) design inference-time (weight-free) self-evolving agents capable of causal diagnosis and policy improvement for each target; (3) replace sparse rewards with dense semantic feedback, enabling convergence within a single trajectory; (4) ensure interpretability by having agents explicitly output their reasoning traces.

Key Insight: Treat the "unknown defense mechanism" of the target model during dialogue as the latent state in a POMDP, requiring the agent to maintain a belief over it; the Evaluator provides not scalar rewards but analytical critiques, essentially a high-dimensional "semantic gradient" \(\nabla_\text{sem}\), which approximates an inaccessible loss gradient.

Core Idea: Employ a dual-agent "Attacker + Metacognitive Evaluator" system forming a "diagnosis → strategy → instantiation" three-stage metacognitive loop, upgrading red teaming from heuristic search to inference-time semantic policy optimization.

Method

Overall Architecture

Models the LLM red teaming process as an Adversarial POMDP \((\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{R})\). The latent state includes the dialogue history \(H_t\) and the unknown defense \(\mathcal{D}\); actions are prompts \(x_t\) generated by the attacker; observations are the target's response \(y_t\) and the evaluator feedback \(f_t\); the reward \(\mathcal{R}\) measures semantic alignment between the response and the malicious goal \(\mathcal{G}\). The pipeline iterates for up to \(T_\text{max}=5\) rounds. In each round, the Attacker completes a three-stage metacognitive process (diagnosis → strategy → instantiation) and interacts with the target model; the Evaluator converts the response into dense feedback \(f_t = (s_t, J_t, M_t)\); and the entire trajectory \(\tau_t\) is retained in context for in-context meta-learning.
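The control flow described above can be sketched roughly as follows. This is only an illustrative skeleton, not the paper's implementation: the `Feedback` and `Step` containers, the callable signatures, and the toy stand-in agents are all assumptions made for the example, and the target is simplified to see only the latest prompt rather than the full dialogue history.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    score: int          # s_t: scalar reward from the Evaluator
    justification: str  # J_t: textual justification
    suggestion: str     # M_t: meta-suggestion for the next round


@dataclass
class Step:
    belief: str         # b_t
    strategy: str       # sigma_t
    prompt: str         # x_t
    response: str       # y_t
    feedback: Feedback  # f_t = (s_t, J_t, M_t)


# Hypothetical interfaces; real agents would wrap LLM calls.
Attacker = Callable[[str, List[Step]], Tuple[str, str, str]]
Target = Callable[[str], str]
Evaluator = Callable[[str, str], Feedback]


def run_episode(goal: str, attacker: Attacker, target: Target,
                evaluator: Evaluator, t_max: int = 5,
                success_score: int = 10) -> Tuple[bool, List[Step]]:
    """Iterate up to t_max rounds, keeping the whole trajectory in context."""
    trajectory: List[Step] = []
    for _ in range(t_max):
        belief, strategy, prompt = attacker(goal, trajectory)
        response = target(prompt)
        feedback = evaluator(response, goal)
        trajectory.append(Step(belief, strategy, prompt, response, feedback))
        if feedback.score >= success_score:  # stop at the first full success
            return True, trajectory
    return False, trajectory


# Toy stand-ins so the loop can be exercised without any model.
def toy_attacker(goal: str, trajectory: List[Step]) -> Tuple[str, str, str]:
    n = len(trajectory) + 1
    return f"belief-{n}", f"strategy-{n}", f"prompt-{n}"


def toy_target(prompt: str) -> str:
    return f"response to {prompt}"


def toy_evaluator(response: str, goal: str) -> Feedback:
    round_no = int(response.rsplit("-", 1)[1])
    # Scores 4, 7, 10 on rounds 1, 2, 3 (capped at 10).
    return Feedback(min(10, 1 + 3 * round_no), "placeholder", "placeholder")


success, traj = run_episode("placeholder goal", toy_attacker,
                            toy_target, toy_evaluator)
print(success, len(traj))
```

The early return is what separates a converged trajectory from one that exhausts the \(T_\text{max}\) budget; everything an agent learns within an episode lives in `trajectory`, since no weights are updated.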

Key Designs

Three-Stage Metacognitive Attacker (belief update → policy → instantiation):
- Function: Decomposes each turn's attacker decision into three interpretable steps (introspective diagnosis, adaptive strategy formulation, and executable instantiation), explicitly labeled with <thought> / <strategy> / <prompt>.
- Mechanism: (a) Phase I <thought> performs the belief update \(b_t \leftarrow \text{Reason}(b_{t-1}, y_{t-1}, f_{t-1})\), a Bayesian-style integration of the previous response and feedback that narrows the hypothesis over \(\mathcal{D}\) (e.g., "is the defense based on lexical filtering or semantic intent scrutiny?"); (b) Phase II <strategy> generates an abstract strategy \(\sigma_t \leftarrow \pi_\text{plan}(b_t, \mathcal{P}_\text{seed})\), where \(\mathcal{P}_\text{seed}\) supplies a few known attack vectors as priors and the belief steers the strategy toward a direction "orthogonal" to the diagnosed defense; (c) Phase III <prompt> instantiates the abstract strategy as a concrete token sequence \(x_t \sim \pi_\text{gen}(x \mid \sigma_t, H_t)\).
- Design Motivation: Traditional multi-turn attacks output only prompts per round, lacking transparency and error traceability; explicit three-stage metacognition provides readable reasoning traces for security analysts (diagnosis, strategy, and instance per round) and clear critique targets for the Evaluator.
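To make the three labeled stages concrete, here is a minimal sketch of how an analyst might split one attacker output into its sections. The closing tags (`</thought>` and so on), the function name `parse_attacker_output`, and the placeholder text are assumptions for illustration; the paper summary only states that the stages are labeled with the three tags.

```python
import re
from typing import Dict

TAGS = ("thought", "strategy", "prompt")


def parse_attacker_output(text: str) -> Dict[str, str]:
    """Extract the three labeled sections; raise if any one is missing."""
    sections: Dict[str, str] = {}
    for tag in TAGS:
        # Assumes each section is wrapped as <tag>...</tag> (not confirmed by the source).
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> section")
        sections[tag] = match.group(1).strip()
    return sections


raw = (
    "<thought>placeholder diagnosis of the defense</thought>\n"
    "<strategy>placeholder abstract strategy</strategy>\n"
    "<prompt>placeholder concrete prompt</prompt>"
)
parsed = parse_attacker_output(raw)
print(parsed["thought"])
```

Keeping the sections separate is what gives the Evaluator, and a human reviewer, distinct critique targets: the diagnosis, the strategy, and the final prompt can each be inspected on its own.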
Metacognitive Evaluator as Semantic Gradient:
- Function: In a black-box setting, uses a third-party LLM to output \((s_t, J_t, M_t)\) (scalar reward + textual justification + meta-suggestions), approximating an inaccessible loss gradient.
- Mechanism: \(\nabla_\text{sem} \approx \mathcal{E}(y_t, \mathcal{G})\) is a high-dimensional semantic direction, explicitly guiding the Attacker on "which direction to adjust strategy next"; Meta-suggestions \(M_t\) are in natural language, not just a single number, and are directly appended to the attacker's next prompt context, effectively upgrading sparse 0/1 rewards to dense supervision. This dense feedback allows the Attacker to internalize cause-and-effect within a single trajectory, avoiding repeated sampling and token waste.
- Design Motivation: Standard RL-based red teaming can only use "final success/failure" as reward in multi-turn settings, making signals sparser as trajectories lengthen; having the Evaluator provide "failure mode analysis + strategy suggestions" at each step is equivalent to step-wise reward shaping, enabling in-context policy improvement without weight updates.
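The contrast between a sparse 0/1 reward and the dense \((s_t, J_t, M_t)\) feedback can be illustrated with a small sketch. The class name, field names, text layout, and the assumption that the score scale tops out at 10 are all hypothetical; the source only says that the meta-suggestions are natural language and are appended to the attacker's next prompt context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorFeedback:
    score: int            # s_t
    justification: str    # J_t
    meta_suggestion: str  # M_t


def sparse_signal(fb: EvaluatorFeedback, success_score: int = 10) -> int:
    """What a success/failure-only setup would see: a single bit."""
    return int(fb.score >= success_score)


def render_for_context(round_index: int, fb: EvaluatorFeedback) -> str:
    """Dense variant: the full critique becomes text for the next round's context."""
    return (
        f"[Round {round_index} feedback]\n"
        f"Score: {fb.score}\n"
        f"Justification: {fb.justification}\n"
        f"Meta-suggestion: {fb.meta_suggestion}"
    )


fb = EvaluatorFeedback(6, "placeholder justification", "placeholder suggestion")
print(sparse_signal(fb))
print(render_for_context(1, fb))
```

Two failed rounds with scores 3 and 9 are indistinguishable under `sparse_signal`, whereas the rendered text preserves both the score and the reasons behind it.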
Co-Evolutionary Closed Loop + Convergence Guarantee:
- Function: Enables Attacker and Evaluator to interact continuously within the context window via trajectory \(\tau_t\), so that in-context meta-learning pushes the success rate toward saturation within the round budget.
- Mechanism: Each round retains \((b_t, \sigma_t, x_t, y_t, f_t)\) in context; the Attacker refines both belief (defense diagnosis) and strategy (attack direction), forming positive feedback; the paper uses \(T_\text{max}=5\) as a tight budget to enforce rapid convergence, distinguishing "directed optimization" from "random exploration".
- Design Motivation: Existing methods (X-Teaming's multi-agent plan, Crescendo's topic escalation) rely on exploratory search, with token consumption skyrocketing as defense strength increases; this framework converges along the semantic gradient within a single trajectory, making it both fast and stable.

Loss & Training

Metis does not update any LLM weights; it is a pure inference-time framework: Attacker uses DeepSeek-R1-V528, Evaluator uses GPT-4o. Evaluation threshold is strict—Evaluator score = 10 ("Full and Unambiguous Jailbreak") is required for success, avoiding borderline/partial responses being counted as successful. \(T_\text{max} = 5\), with a unified budget for fair comparison.
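As a small worked example of the strict success criterion, the sketch below computes ASR from per-task final Evaluator scores. The scores are made up for illustration, and the function name and the assumption that 10 is the maximum of the scale are not from the paper.

```python
from typing import Sequence


def attack_success_rate(final_scores: Sequence[int], threshold: int = 10) -> float:
    """Percentage of tasks whose final Evaluator score reaches the threshold."""
    if not final_scores:
        raise ValueError("need at least one score")
    successes = sum(1 for s in final_scores if s >= threshold)
    return 100.0 * successes / len(final_scores)


# Hypothetical final scores for five tasks (not real results).
scores = [10, 9, 10, 7, 10]
strict = attack_success_rate(scores)                 # only score 10 counts
lenient = attack_success_rate(scores, threshold=9)   # would also count the 9
print(strict, lenient)
```

Under the strict threshold the borderline score of 9 is treated as a failure, which is why the reported ASR is a conservative count relative to a looser cutoff.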

Key Experimental Results

Main Results

10 target models + 2 benchmarks (HarmBench, AdvBench). HarmBench ASR (%):

Method	Llama3-8B	Llama3-70B	Qwen2.5	Claude-3.7	GPT-4o	O1	GPT-5-chat	Gemini 2.5 Pro	Grok3	Avg.
GCG	34.5	17.0	6.5	—	12.5	0.0	—	—	—	21.1
AutoDAN-Turbo	23.0	32.0	7.0	17.0	23.0	24.0	55.0	52.0	84.0	36.4
Crescendo	60.0	62.0	—	—	62.0	14.0	—	23.0	6.0	41.0
ActorBreaker	79.0	85.5	47.0	22.0	84.5	14.0	22.0	44.0	42.0	51.9
X-Teaming	85.0	83.0	95.0	81.0	91.0	71.0	49.0	84.0	89.0	82.0
Metis	88.0	90.0	97.0	86.0	93.0	76.0	78.0	90.0	100.0	89.2

Ablation Study

Configuration	Llama3-8B	Claude-3.7	GPT-4o
w/o Attacker Metacog.	82.0 (↓6)	66.0 (↓20)	74.0 (↓19)
w/o Evaluator Metacog.	86.0 (↓2)	46.0 (↓40)	72.0 (↓21)
w/o Seed Paradigms	78.0 (↓10)	60.0 (↓26)	76.0 (↓17)
Metis (Full)	88.0	86.0	93.0
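The drops in the ablation table can be recomputed directly from the listed ASR values. The snippet below does that and reports which removed component hurts most on each target; the variable and function names are illustrative only.

```python
full = {"Llama3-8B": 88.0, "Claude-3.7": 86.0, "GPT-4o": 93.0}

ablations = {
    "w/o Attacker Metacog.": {"Llama3-8B": 82.0, "Claude-3.7": 66.0, "GPT-4o": 74.0},
    "w/o Evaluator Metacog.": {"Llama3-8B": 86.0, "Claude-3.7": 46.0, "GPT-4o": 72.0},
    "w/o Seed Paradigms": {"Llama3-8B": 78.0, "Claude-3.7": 60.0, "GPT-4o": 76.0},
}

# Drop in ASR (percentage points) relative to the full system.
drops = {
    name: {target: full[target] - asr for target, asr in row.items()}
    for name, row in ablations.items()
}

# For each target, the ablation with the largest drop.
worst = {
    target: max(drops, key=lambda name: drops[name][target])
    for target in full
}

for target, name in worst.items():
    print(f"{target}: {name} (-{drops[name][target]:.0f})")
```

On the two frontier targets the largest drop comes from removing Evaluator metacognition, while on Llama3-8B removing the seed paradigms costs the most.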

Efficiency comparison (all use \(T_\text{max}=5\), same backbone):

Model	Method	ASR	AQS	ATS (tokens)	Savings vs X-Teaming
Claude-3.7	X-Teaming	81.0	8.95	13,248	—
Claude-3.7	Metis	86.0	1.90	1,425	9.3×
GPT-5-chat	X-Teaming	49.0	12.48	14,095	—
GPT-5-chat	Metis	78.0	1.80	1,570	9.0×
Gemini 2.5 Pro	Metis	90.0	2.30	1,464	8.1×
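The savings column is the ratio of X-Teaming's ATS to Metis's ATS on the same target. A short check for the two targets where both rows are listed (the Gemini 2.5 Pro baseline row is not shown above, so it is left out); the names used are illustrative.

```python
ats_tokens = {
    "Claude-3.7": {"X-Teaming": 13248, "Metis": 1425},
    "GPT-5-chat": {"X-Teaming": 14095, "Metis": 1570},
}


def savings_factor(baseline_tokens: int, method_tokens: int) -> float:
    """How many times fewer tokens the method uses than the baseline."""
    return baseline_tokens / method_tokens


factors = {
    target: round(savings_factor(row["X-Teaming"], row["Metis"]), 1)
    for target, row in ats_tokens.items()
}

for target, factor in factors.items():
    print(f"{target}: {factor}x")
```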

Key Findings

The "generalization gap" on frontier, strongly aligned models is the core weakness of baselines: ActorBreaker drops from roughly 80–85% on Llama3 and GPT-4o to 22% on Claude-3.7, and X-Teaming from above 80% on most targets to 49% on GPT-5-chat, while Metis stays at 76% or higher on every listed target, supporting the claim that metacognitive adaptability is more reliable than static plans.
Removing Evaluator metacognition is more detrimental than removing Attacker metacognition (−40 vs −20 on Claude-3.7), indicating that dense semantic feedback is the true bottleneck—the Attacker's reasoning must be anchored by "external critique" or it drifts.
Swapping the Evaluator from GPT-4o to Qwen2.5-7B drops ASR on the GPT-4o target from 93% to 30%, suggesting that Metis's performance ceiling is set mainly by the Evaluator's analytical ability rather than the Attacker's generation capability.
Average token consumption drops by 8.2× (up to 11.4×), and AQS typically falls to ~1.8–2.3 rounds to success, indicating that dense feedback compresses multi-turn search into a few rounds of directed optimization.
t-SNE and cross-task diversity show that Metis-generated strategies are much more widely distributed in semantic space than seed paradigms, with cross-model diversity of 0.427—demonstrating that Metis produces genuinely bespoke attacks, not just "re-skinned" templates.

Highlights & Insights

Reframing jailbreak from search to "inference-time policy optimization" is a paradigm shift: previously, attackers were explorers; now, attackers are optimizers with beliefs and dense supervision, improving both token cost and success rate.
The explicit <thought> / <strategy> / <prompt> three-stage process not only enhances interpretability but also serves as a diagnostic tool for defense research—security analysts can directly examine Metis's reasoning trace to understand model vulnerabilities.
Using LLM-generated textual critique as a "semantic gradient" is a transferable idea: any scenario requiring black-box optimization with sparse signals (e.g., automated prompt optimization, reward model training) can benefit from dense critique over scalar rewards.
The finding that the Evaluator, not the Attacker, is the bottleneck overturns the intuition that "bigger attackers are stronger"—defense researchers may invest in stronger "judging models" rather than stronger generators to reinforce red teaming.

Limitations & Future Work

The Evaluator uses GPT-4o, which is itself subject to OpenAI's safety filters and is not controllable long-term (APIs and models may change); the paper does not discuss fallback strategies when the evaluator refuses to answer.
The dual-LLM framework still incurs nonzero cost; although tokens are reduced 8×, dual calls per round (attacker + evaluator) increase latency compared to single-agent setups, which may be unsuitable for real-time scenarios.
The evaluation threshold (score=10) is strict, but Evaluator-human agreement is only 76.8%, meaning roughly 23% of the Evaluator's verdicts may differ from human judgment.
All experiments perform jailbreak within 5 rounds, which may not reflect real-world long-term attacks over days/sessions; only HarmBench/AdvBench benchmarks are covered.
The paper does not explicitly release code/data, which may limit reproducibility.

Related Work

vs Crescendo / ActorBreaker / X-Teaming: These perform stochastic search over predefined heuristics; Metis is in-situ policy optimization, with directed attack trajectories, lower token consumption, and interpretability.
vs PAIR / GCG / PAP: Single-turn prompt optimization fails on frontier models; Metis employs multi-turn metacognition, enabling causal diagnosis of dynamic defenses.
vs MTSA / AutoDAN-Turbo (learning-based red teaming): They optimize low-level prompt primitives; this work targets "high-level strategy + belief," closer to human red teamers' workflow.
vs metacognitive LLM work: Prior metacognition was used for general reasoning (e.g., Didolkar 2024); this is the first application to adversarial policy learning.

Rating

Novelty: ⭐⭐⭐⭐ Systematically combines POMDP, metacognition, and dense semantic gradients for automated red teaming—a new paradigm in this direction.
Experimental Thoroughness: ⭐⭐⭐⭐ 10 models × 2 benchmarks + multiple baselines + ablation + efficiency + interpretability cases all included.
Writing Quality: ⭐⭐⭐⭐ Algorithmic framework is clear, tables are dense, ablation and backbone sensitivity are discussed.
Value: ⭐⭐⭐ Methodologically valuable for red teaming/security research, but as an "attack" tool, societal value depends on whether it is truly used to strengthen defenses.