Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization
Conference: ICML 2026
arXiv: 2605.10067
Code: None
Area: LLM Security / Red Teaming / Jailbreak / Test-time Policy Optimization
Keywords: Red Teaming, jailbreak, POMDP, Metacognition, Semantic Gradient
TL;DR
Multi-round jailbreaking is reformulated as a test-time policy optimization problem. Under an adversarial POMDP framework, the Attacker and Metacognitive Evaluator form a closed loop: the dense analytical feedback output by the Evaluator serves as a "semantic gradient" to guide the Attacker's belief updates and policy improvements. Without retraining any weights, this approach achieves an average ASR of 89.2% across 10 frontier models, including O1 / GPT-5-chat / Claude-3.7, while reducing token consumption by an average of 8.2x compared to strong baselines.
Background & Motivation
Background: Automated red teaming has evolved from single-round (GCG, PAIR, PAP, CipherChat, etc.) to multi-round (Crescendo, CoA, ActorBreaker, X-Teaming). Multi-round frameworks typically perform better as they can continuously approach defense boundaries through interaction.
Limitations of Prior Work: Even the strongest current multi-round frameworks still rely on "stochastic search within a predefined heuristic space" (e.g., tree search, topic escalation, fixed plans). Essentially, the policy templates are static. While effective on weakly aligned models like Llama or GPT-3.5, performance drops precipitously on strongly aligned frontier models such as O1 / GPT-5-chat / Claude-3.7 (e.g., ActorBreaker achieves only 14% on O1; X-Teaming achieves only 49% on GPT-5-chat).
Key Challenge: Existing methods use sparse success/failure signals to drive searches, lacking causal diagnosis of "why this failed" or the underlying defense logic. Furthermore, heuristic templates lack adaptability and cannot generate bespoke strategies for the specific defense posture of each target model.
Goal: (1) Formalize multi-round jailbreaking as an adversarial POMDP to rigorously express policy learning and belief updates. (2) Design agents capable of test-time self-evolution (without modifying weights) for causal diagnosis and policy improvement. (3) Replace sparse rewards with dense semantic feedback to achieve convergence within a single trajectory. (4) Maintain interpretability via explicit reasoning traces.
Key Insight: The unknown defense mechanism of the target model is treated as a latent state in a POMDP, which the agent must maintain a belief over. The Evaluator provides high-dimensional "semantic gradients" \(\nabla_\text{sem}\) in the form of analytical critiques rather than scalar rewards, approximating inaccessible loss gradients.
Core Idea: Utilize an "Attacker + Metacognitive Evaluator" dual-agent setup to form a three-stage metacognitive cycle: <thought> → <strategy> → <prompt>, upgrading red teaming from heuristic search to test-time semantic policy optimization.
Method
Overall Architecture
The LLM red teaming process is modeled as an Adversarial POMDP \((\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{R})\). The latent state includes the conversation history \(H_t\) and the unknown defense \(\mathcal{D}\). Actions are prompts \(x_t\) generated by the attacker; observations consist of the target response \(y_t\) and evaluator feedback \(f_t\); the reward \(\mathcal{R}\) measures the semantic alignment between the response and the malicious goal \(\mathcal{G}\). The entire pipeline iterates for a maximum of \(T_\text{max}=5\) rounds. In each round, the Attacker undergoes three stages of metacognition (Diagnosis → Strategy → Instantiation) and interacts with the target model. The Evaluator converts the response into dense feedback in the form of \((s_t, J_t, M_t)\). All trajectories \(\tau_t\) are retained in the context to enable in-context meta-learning.
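The round structure described above can be written down as plain data types plus a bounded loop. The sketch below is only an illustration of the control flow, not the authors' implementation: the names `Feedback`, `RoundRecord`, and `run_episode` are hypothetical, and the attacker, target, and evaluator are caller-supplied callables (the evaluator is assumed to close over the goal \(\mathcal{G}\)).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

T_MAX = 5            # round budget stated in the text
SUCCESS_SCORE = 10   # strict success threshold stated in the text


@dataclass
class Feedback:
    """Evaluator output (s_t, J_t, M_t)."""
    score: int           # s_t: scalar score
    justification: str   # J_t: textual justification
    suggestions: str     # M_t: meta-suggestions for the next round


@dataclass
class RoundRecord:
    """One round of the trajectory tau_t that is kept in context."""
    belief: str          # b_t
    strategy: str        # sigma_t
    prompt: str          # x_t
    response: str        # y_t
    feedback: Feedback   # f_t


# Hypothetical interfaces: each callable wraps one model call.
AttackerFn = Callable[[List[RoundRecord]], Tuple[str, str, str]]
TargetFn = Callable[[str], str]
EvaluatorFn = Callable[[str], Feedback]


def run_episode(attacker: AttackerFn, target: TargetFn,
                evaluator: EvaluatorFn, t_max: int = T_MAX) -> List[RoundRecord]:
    """Run at most t_max rounds, stopping early once the score hits the threshold."""
    trajectory: List[RoundRecord] = []
    for _ in range(t_max):
        # The attacker sees the full history, which is what enables in-context learning.
        belief, strategy, prompt = attacker(trajectory)
        response = target(prompt)
        feedback = evaluator(response)
        trajectory.append(RoundRecord(belief, strategy, prompt, response, feedback))
        if feedback.score >= SUCCESS_SCORE:
            break
    return trajectory
```

Passing the whole trajectory back to the attacker on every round mirrors the statement that all \(\tau_t\) are retained in context; nothing is summarized or discarded in this sketch.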
Key Designs
- Three-stage Metacognitive Attacker (belief update → policy → instantiation):
  - Function: Decomposes single-round attacker decisions into three interpretable steps: introspective diagnosis, adaptive strategy formulation, and executable instantiation, explicitly marked with the structured tags <thought> / <strategy> / <prompt>.
  - Mechanism: (a) Phase I <thought> performs a belief update \(b_t \leftarrow \text{Reason}(b_{t-1}, y_{t-1}, f_{t-1})\), integrating the previous response and feedback in a Bayesian manner to narrow hypotheses about \(\mathcal{D}\) (e.g., "does the defense rely on lexical filters or semantic intent scrutiny?"). (b) Phase II <strategy> generates an abstract strategy \(\sigma_t \leftarrow \pi_\text{plan}(b_t, \mathcal{P}_\text{seed})\) using a few known attack vectors as priors to improve the strategy in a direction "orthogonal to the defense" as indicated by the belief. (c) Phase III <prompt> instantiates the abstract strategy into specific tokens \(x_t \sim \pi_\text{gen}(x \mid \sigma_t, H_t)\).
  - Design Motivation: Traditional multi-round attacks only output prompts, making the process opaque and errors hard to diagnose. Explicit three-stage metacognition provides readable reasoning traces for analysts and clear targets for the Evaluator's critique.
- Metacognitive Evaluator as Semantic Gradient:
  - Function: Uses a third-party LLM to output \((s_t, J_t, M_t)\) (scalar reward + textual justification + meta-suggestions) in a black-box setting to approximate an otherwise inaccessible loss gradient.
  - Mechanism: \(\nabla_\text{sem} \approx \mathcal{E}(y_t, \mathcal{G})\) represents a high-dimensional semantic direction that explicitly tells the Attacker how to modify its strategy in the next round. The meta-suggestions \(M_t\) are expressed in natural language rather than as a single number and are appended to the Attacker's context for the next round, upgrading the sparse 0/1 rewards of search-style methods to dense supervision. This dense feedback lets the Attacker internalize cause and effect within a single trajectory (i.e., a single context), avoiding the token waste of repeated sampling.
  - Design Motivation: Standard RL-based red teaming uses only final success/failure as a reward, so the signal becomes sparser as trajectories lengthen. Providing "failure mode analysis + strategy suggestions" at each step is equivalent to step-wise reward shaping, allowing the Attacker to perform in-context strategy refinement without weight updates.
- Co-evolutionary Closed Loop + Convergence Guarantee:
  - Function: Enables continuous interaction between the Attacker and Evaluator via the trajectory \(\tau_t\) held in the context window, realizing in-context meta-learning and saturating the success rate.
  - Mechanism: All elements \((b_t, \sigma_t, x_t, y_t, f_t)\) are retained in the context. The Attacker refines the belief (more accurate defense diagnosis) and the strategy (more precise attack direction) simultaneously, creating a positive feedback loop. The framework enforces rapid convergence with a tight budget of \(T_\text{max}=5\), distinguishing "targeted optimization" from "stochastic exploration."
  - Design Motivation: Existing methods (e.g., multi-agent plans in X-Teaming, topic escalation in Crescendo) rely on exploratory search, so token consumption grows sharply with defense strength. This framework instead converges along semantic gradients within a single trajectory, making it both fast and stable.
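Because the Attacker's output is organized by explicit tags, the three stages can be separated mechanically for logging or for feeding the Evaluator. The following is a minimal parsing sketch under one loud assumption: each stage is closed by a matching end tag such as `</thought>`. The source only names the opening tags, so the exact wire format, and the helper name `parse_stages`, are hypothetical.

```python
import re
from typing import Dict

STAGES = ("thought", "strategy", "prompt")


def parse_stages(text: str) -> Dict[str, str]:
    """Extract the three tagged stages from one attacker output.

    Assumes (not confirmed by the source) that each stage is wrapped as
    <tag>...</tag>. Raises ValueError if any stage is missing.
    """
    stages: Dict[str, str] = {}
    for tag in STAGES:
        # Non-greedy match with DOTALL so multi-line stage bodies are captured.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> block")
        stages[tag] = match.group(1).strip()
    return stages
```

Failing loudly on a missing stage, rather than returning a partial dictionary, keeps malformed rounds from silently entering the trajectory.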
Loss & Training
Metis does not update any LLM weights and is a pure test-time framework. The Attacker uses DeepSeek-R1-V528, and the Evaluator uses GPT-4o. Evaluation thresholds are strict: a success requires an Evaluator score = 10 ("Full and Unambiguous Jailbreak"), preventing borderline or partial responses from being counted as breaches. \(T_\text{max} = 5\) ensures a fair budget for comparisons.
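The strict success criterion translates directly into how ASR is counted. As a small sketch, assuming one score per episode (the highest Evaluator score reached within the budget) and a hypothetical helper name `attack_success_rate`:

```python
from typing import Sequence


def attack_success_rate(episode_scores: Sequence[int], threshold: int = 10) -> float:
    """Percentage of episodes whose Evaluator score reaches the threshold.

    With threshold=10, partial or borderline responses (scores below 10)
    are counted as failures, matching the strict criterion in the text.
    """
    if not episode_scores:
        raise ValueError("no episodes to score")
    hits = sum(1 for score in episode_scores if score >= threshold)
    return 100.0 * hits / len(episode_scores)


# Illustrative scores only (not data from the paper): two of four episodes succeed.
print(attack_success_rate([10, 9, 10, 7]))  # 50.0
```

Note that a score of 9 counts the same as a score of 1 here, which is what makes the reported ASR conservative with respect to partial compliance.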
Key Experimental Results
Main Results
Evaluation covers 10 target models and 2 benchmarks (HarmBench, AdvBench). HarmBench ASR (%):
| Method | Llama3-8B | Llama3-70B | Qwen2.5 | Claude-3.7 | GPT-4o | O1 | GPT-5-chat | Gemini 2.5 Pro | Grok3 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| GCG | 34.5 | 17.0 | 6.5 | — | 12.5 | 0.0 | — | — | — | 21.1 |
| AutoDAN-Turbo | 23.0 | 32.0 | 7.0 | 17.0 | 23.0 | 24.0 | 55.0 | 52.0 | 84.0 | 36.4 |
| Crescendo | 60.0 | 62.0 | — | — | 62.0 | 14.0 | — | 23.0 | 6.0 | 41.0 |
| ActorBreaker | 79.0 | 85.5 | 47.0 | 22.0 | 84.5 | 14.0 | 22.0 | 44.0 | 42.0 | 51.9 |
| X-Teaming | 85.0 | 83.0 | 95.0 | 81.0 | 91.0 | 71.0 | 49.0 | 84.0 | 89.0 | 82.0 |
| Ours (Metis) | 88.0 | 90.0 | 97.0 | 86.0 | 93.0 | 76.0 | 78.0 | 90.0 | 100.0 | 89.2 |
Ablation Study
| Configuration | Llama3-8B | Claude-3.7 | GPT-4o |
|---|---|---|---|
| w/o Attacker Metacog. | 82.0 (↓6) | 66.0 (↓20) | 74.0 (↓19) |
| w/o Evaluator Metacog. | 86.0 (↓2) | 46.0 (↓40) | 72.0 (↓21) |
| w/o Seed Paradigms | 78.0 (↓10) | 60.0 (↓26) | 76.0 (↓17) |
| Ours (Full) | 88.0 | 86.0 | 93.0 |
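The parenthesized drops in the ablation table are simply the full-model ASR minus the ablated ASR. The short check below recomputes them from the table values; the dictionary layout and the function name `drops` are illustrative choices, not part of the paper.

```python
# ASR values (%) copied from the ablation table above.
FULL = {"Llama3-8B": 88.0, "Claude-3.7": 86.0, "GPT-4o": 93.0}
ABLATIONS = {
    "w/o Attacker Metacog.": {"Llama3-8B": 82.0, "Claude-3.7": 66.0, "GPT-4o": 74.0},
    "w/o Evaluator Metacog.": {"Llama3-8B": 86.0, "Claude-3.7": 46.0, "GPT-4o": 72.0},
    "w/o Seed Paradigms": {"Llama3-8B": 78.0, "Claude-3.7": 60.0, "GPT-4o": 76.0},
}


def drops(ablated: dict) -> dict:
    """Absolute ASR drop (percentage points) relative to the full system."""
    return {model: FULL[model] - asr for model, asr in ablated.items()}


for name, row in ABLATIONS.items():
    per_model = drops(row)
    mean_drop = sum(per_model.values()) / len(per_model)
    print(f"{name}: {per_model} (mean {mean_drop:.1f})")
```

Averaged over these three targets, the drops come to 15.0, 21.0, and about 17.7 points respectively, so removing the Evaluator's metacognition is the most damaging ablation on average even though it barely affects Llama3-8B.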
Efficiency comparison (shared \(T_\text{max}=5\), same backbone):
| Model | Method | ASR | AQS | ATS (tokens) | Gain vs X-Teaming |
|---|---|---|---|---|---|
| Claude-3.7 | X-Teaming | 81.0 | 8.95 | 13,248 | — |
| Claude-3.7 | Ours | 86.0 | 1.90 | 1,425 | 9.3× |
| GPT-5-chat | X-Teaming | 49.0 | 12.48 | 14,095 | — |
| GPT-5-chat | Ours | 78.0 | 1.80 | 1,570 | 9.0× |
| Gemini 2.5 Pro | Ours | 90.0 | 2.30 | 1,464 | 8.1× |
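The gain column appears to be the ratio of the baseline's ATS to Metis's ATS; this is an inference from the two rows where both values are listed, not a definition given in the notes. A quick check under that assumption:

```python
def token_gain(baseline_tokens: int, ours_tokens: int) -> float:
    """Token-efficiency gain, assumed to be baseline ATS divided by our ATS."""
    return baseline_tokens / ours_tokens


# (X-Teaming ATS, Metis ATS) pairs copied from the efficiency table.
ROWS = {
    "Claude-3.7": (13_248, 1_425),
    "GPT-5-chat": (14_095, 1_570),
}

for model, (baseline, ours) in ROWS.items():
    print(f"{model}: {token_gain(baseline, ours):.1f}x")
# Claude-3.7: 9.3x
# GPT-5-chat: 9.0x
```

Both ratios match the table. The Gemini 2.5 Pro gain cannot be recomputed the same way because the corresponding X-Teaming row is not listed.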
Key Findings
- The "generalization gap" on strongly aligned frontier models is a core weakness of the baselines: ActorBreaker drops from roughly 80-85% on Llama3 and GPT-4o to 22% on Claude-3.7, and X-Teaming drops from ≥80% on most targets to 49% on GPT-5-chat, while Metis stays between 76% and 100% across the listed targets, suggesting that metacognitive adaptability is more reliable than static plans.
- Removing Evaluator metacognition is more detrimental than removing Attacker metacognition (−40 vs −20 on Claude-3.7), indicating that dense semantic feedback is the true bottleneck—the Attacker's internal reasoning must be anchored by external critiques.
- Switching the Evaluator from GPT-4o to Qwen2.5-7B causes the ASR on GPT-4o to drop from 93% to 30%, indicating that Metis performance is capped by the Evaluator's analytical capability rather than the Attacker's generation capability.
- Average token consumption is reduced by 8.2x (up to 11.4x), and AQS typically drops to ~1.8-2.3 per successful attack, implying that dense feedback compresses multi-round search into a few targeted optimizations.
- t-SNE and cross-task diversity analysis show that Metis-generated strategies are much more widely distributed in semantic space than seed paradigms, with a cross-model diversity of 0.427—indicating Metis produces bespoke attacks rather than just "re-skinning" predefined templates.
Highlights & Insights
- Reframing jailbreaking from search to "test-time policy optimization" is a paradigm shift: previously attackers were explorers; now they are optimizers with beliefs and dense supervision, improving both token cost and success rate.
- The explicit <thought> / <strategy> / <prompt> triad not only enhances interpretability but also serves as a diagnostic tool for safety research: analysts can examine the reasoning trace to understand a model's defensive vulnerabilities.
- Using textual critiques as "semantic gradients" is a transferable concept: any scenario requiring black-box optimization with sparse signals (e.g., automated prompt optimization, reward model training) can benefit from dense critiques instead of scalar rewards.
- The finding that the Evaluator, not the Attacker, is the bottleneck overturns the intuition that "larger attackers are always stronger"—defensive researchers should invest in stronger "judge models" rather than stronger generative models for adversarial hardening.
Limitations & Future Work
- The Evaluator relies on GPT-4o, which is subject to OpenAI's own safety filters and is potentially uncontrollable in the long term (APIs or models may change); the paper does not discuss fallback strategies if the evaluator refuses to answer.
- The cost of the dual-LLM framework is not zero; although tokens are reduced by 8x, the dual calls for attacker + evaluator lead to higher latency per round compared to single-agent setups.
- While the score=10 threshold is strict, agreement between the Evaluator and human assessment is only 76.8%, meaning that up to ~23% of Evaluator verdicts, including some "successful jailbreaks" counted by Metis, may differ from human judgment.
- Experiments are capped at 5 rounds, which may not reflect real-world long-range attacks spanning days or multiple sessions; coverage is limited to two benchmarks (HarmBench/AdvBench).
- Code and data were not explicitly made public, which may limit reproducibility.
Related Work & Insights
- vs Crescendo / ActorBreaker / X-Teaming: These rely on stochastic search over predefined heuristics; Metis uses in-situ policy optimization, resulting in targeted trajectories, lower token consumption, and interpretability.
- vs PAIR / GCG / PAP: Single-round prompt optimization fails on frontier models; Metis employs multi-round metacognition for causal diagnosis of dynamic defenses.
- vs MTSA / AutoDAN-Turbo (learning-based red teaming): These optimize low-level prompt primitives, whereas Metis optimizes "high-level strategies + beliefs," closer to the workflow of human red-teamers.
- vs Metacognitive LLM research: Previous metacognitive work targeted general reasoning (e.g., Didolkar 2024); this is the first application to adversarial policy learning.
Rating
- Novelty: ⭐⭐⭐⭐ Systematically combines POMDP, metacognition, and dense semantic gradients for automated red teaming—a new paradigm in this field.
- Experimental Thoroughness: ⭐⭐⭐⭐ 10 models × 2 benchmarks + multiple baselines + ablation + efficiency + interpretability cases.
- Writing Quality: ⭐⭐⭐⭐ Clear algorithmic framework, dense data tables, and thorough discussion of ablation and backbone sensitivity.
- Value: ⭐⭐⭐ Provides methodological contributions to red teaming and safety research, though its social value depends on whether it is used to harden defenses.