How Catastrophic is Your LLM? Certifying Risk in Conversation¶
Conference: ICLR 2026 | arXiv: 2510.03969 | Code: None | Area: LLM/NLP | Keywords: safety certification, multi-turn attack, Markov process, catastrophic risk, statistical guarantee
TL;DR¶
This paper proposes C3LLM (Certification of Catastrophic risks in multi-turn Conversation for LLMs), the first framework to provide statistical certification of catastrophic risks in multi-turn LLM conversations. It models conversation distributions as Markov processes over a semantic similarity graph, defines three conversation sampling strategies augmented with a jailbreak layer, and applies Clopper-Pearson 95% confidence intervals to certify the probability that a model produces harmful outputs—finding that the worst-performing model has a risk lower bound as high as 72%.
Background & Motivation¶
Background: LLMs may produce catastrophic outputs in conversation (e.g., bomb-making instructions, bioweapon synthesis, cyberattack tutorials). Multi-turn attacks are harder to defend against than single-turn ones—adversaries can gradually steer models toward harmful content through seemingly benign conversation sequences.
Two Fundamental Flaws of Fixed Benchmarks:
- Reliance on fixed attack sequences: Only specific attacks are tested, so successful sequences outside the set go undetected; 20 attack sequences of length 5 cover at most 20 attack variants, yet the combinatorial space is \(100^5 = 10^{10}\)
- Lack of statistical guarantees: Conclusions do not generalize; the extent of risk across the full conversation space remains unknown
Key Challenge: Exhaustive testing is infeasible (exponential space), and different sequences carry different levels of danger—risk must be quantified in terms of probability distributions.
Why Statistical Certification over Benchmarking: Benchmarking yields only sample-level evidence ("N successful attacks found"), whereas statistical certification bounds a probability ("with 95% confidence, a randomly sampled conversation triggers catastrophic output with probability in [40%, 60%]"); only the latter generalizes to the full conversation space.
Core Idea: Model multi-turn conversations as a Markov process on a graph; sample → judge → statistically test; output a confidence interval for catastrophic risk.
Method¶
Overall Architecture¶
- Query Graph Construction: Extract attack scenarios from HarmBench → expand a set of related but milder queries \(Q\) around each harmful target \(q^*\) → construct graph \(G=(V, E)\) with edges based on semantic similarity
- Conversation Distribution Definition: Define three Markov processes on the graph → sample conversation sequences \(\gamma = (v_0, v_1, ..., v_{n-1})\)
- Catastrophic Judgment: Feed each sequence to the LLM; a judge model (GPT-4o) evaluates whether each response is catastrophic
- Statistical Certification: 50 independent samples → Clopper-Pearson 95% confidence interval → output upper and lower bounds on catastrophic probability
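The certification step is concrete enough to sketch. Below is a minimal Clopper-Pearson helper, assuming SciPy; since the paper releases no code (Code: None above), the function name and usage are illustrative rather than the authors' implementation.

```python
# Exact Clopper-Pearson confidence interval for the catastrophic-output
# probability, from k harmful judgments among n sampled conversations.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Two-sided exact (1 - alpha) confidence interval for a binomial proportion."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Example: 35 catastrophic responses in 50 samples gives roughly (0.554, 0.821),
# the same shape of interval reported in the results table below.
print(clopper_pearson(35, 50))
```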
Key Designs¶
- Lifted State Space to Avoid Revisiting Nodes (see the code sketch after this list):
  - State \((v, S)\): current query \(v\) plus the set \(S\) of already-visited nodes
  - Terminal state \(\tau\): entered when no unvisited neighbors remain
  - Ensures no query is repeated within a sequence, reflecting realistic adversary behavior
- Three Conversation Distributions:
| Distribution | Construction | Adversary Model | Characteristics |
|---|---|---|---|
| Random Node (RN) | Independently sample random nodes | Non-strategic random attacker | Estimates overall model vulnerability |
| Graph Path (GP) | Graph path with endpoint constrained to target set \(Q_T\) | Directional conversation flow | Coherent semantic context |
| Adaptive w/ Rejection (AR) | Adjusts path using model accept/reject feedback | Adaptive red-teaming attack | Accept → advance toward target; reject → retreat |
- Weight Design for the Adaptive Distribution:
  - Define progressive neighbors \(A_{\text{prog}}\) (closer to \(q^*\)) and deprogressive neighbors \(A_{\text{deprog}}\) (farther from \(q^*\))
  - Model accepts the current query (\(r_v=0\)): high weight \(\lambda_h\) is assigned to \(A_{\text{prog}}\), encouraging advancement
  - Model rejects the current query (\(r_v=1\)): high weight \(\lambda_h\) is assigned to \(A_{\text{deprog}}\), retreating to a safer region to retry
- Jailbreak Augmentation Layer:
  - Each query is augmented with a jailbreak prefix with probability \(p\)
  - Augmented sequence probability: \(\Pr(\tilde\gamma) = \Pr(\gamma) \prod_t \Pr_{\mathcal{D}_{\text{jb}}}(\tilde{v}_t \mid v_t)\)
  - Covers a spectrum from the identity transformation (no modification) to structured modifications
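The three designs above compose naturally. A minimal sketch, assuming a precomputed adjacency map `adj`, a semantic distance-to-target map `dist`, and weights `lam_hi`/`lam_lo`; these names, and `jb_prob`, are illustrative assumptions rather than the paper's released code.

```python
# One lifted-state transition of the adaptive-with-rejection (AR) sampler,
# plus the jailbreak augmentation layer.
import random

def step(v, visited, adj, dist, rejected, lam_hi=0.9, lam_lo=0.1):
    """Sample the next query from state (v, S); None encodes the terminal state tau."""
    neighbors = [u for u in adj[v] if u not in visited]
    if not neighbors:                    # no unvisited neighbors -> terminal tau
        return None
    prog = {u for u in neighbors if dist[u] < dist[v]}   # closer to target q*
    # Accept (r_v = 0): upweight progressive moves; reject (r_v = 1): upweight
    # deprogressive moves, retreating to a safer region before retrying.
    weights = [
        (lam_lo if rejected else lam_hi) if u in prog
        else (lam_hi if rejected else lam_lo)
        for u in neighbors
    ]
    return random.choices(neighbors, weights=weights, k=1)[0]

def augment(query, jb_prefixes, jb_prob=0.3):
    """Jailbreak layer: with probability p prepend a prefix, else the identity."""
    if random.random() < jb_prob:
        return random.choice(jb_prefixes) + "\n\n" + query
    return query
```

In a full run, `rejected` would be set each turn by detecting whether the target model refused the previous query, which is precisely the feedback channel the adaptive distribution exploits.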
Graph Construction Details¶
- Data source: HarmBench chemical_biological (28 scenarios) + cybercrime (40 scenarios) = 68 scenarios
- For each scenario, 3 LLMs generate 30 actors (related characters/concepts), each with 5 queries
- After deduplication, 20 actors are randomly sampled to construct a diverse query set
- Edges connect query pairs whose embedding cosine similarity exceeds a threshold (a sketch follows this list)
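A plausible sketch of the edge-construction step, assuming query embeddings are already computed; the embedding model and the 0.75 threshold are placeholders, as the paper does not specify them here.

```python
# Connect query pairs whose embedding cosine similarity exceeds a threshold.
import numpy as np

def build_graph(embeddings: np.ndarray, threshold: float = 0.75) -> dict[int, set[int]]:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T                    # pairwise cosine similarities
    adj: dict[int, set[int]] = {i: set() for i in range(len(embeddings))}
    # np.triu(..., k=1) keeps each pair once and drops self-similarities.
    for i, j in zip(*np.where(np.triu(sim, k=1) > threshold)):
        adj[int(i)].add(int(j))
        adj[int(j)].add(int(i))
    return adj
```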
Key Experimental Results¶
Main Results: Certified Risk for 6 Frontier Models (95% CI Lower Bound)¶
| Model | Chembio Risk CI | Cybercrime Risk CI | Highest Risk Lower Bound |
|---|---|---|---|
| DeepSeek-R1 | [0.554, 0.821] | [0.721, 0.935] | 72.1% |
| Mistral-Large | [0.554, 0.821] | [0.652, 0.892] | 65.2% |
| Llama-3.3-70B | [0.212, 0.488] | [0.374, 0.663] | 37.4% |
| GPT-4o | Moderate | Moderate | ~30% |
| Claude-Sonnet-4 | [0.001, 0.106] | [0.028, 0.205] | 2.8% |
| Nova Premier | [0.005, 0.137] | [0.000, 0.071] | 0.5% |
Attack Effectiveness Across Three Distributions¶
| Distribution | Attack Efficiency | Semantic Coherence | Adaptivity | Applicable Scenario |
|---|---|---|---|---|
| Random Node + JB | Lowest | None | None | Baseline: model vulnerability under random input |
| Graph Path (harmful) | Moderate | High | None | Directional natural conversation attacks |
| Adaptive w/ Rejection | Highest | Medium–High | Yes | Realistic red-teaming strategy |
Key Findings¶
- DeepSeek-R1 achieves a 72.1% risk lower bound under Cybercrime—even under the most conservative estimate, >70% of randomly sampled conversations trigger catastrophic output
- Claude-Sonnet-4 and Nova Premier are significantly safer, with certified risk upper bounds of at most 20.5% (Claude) and 13.7% (Nova) across both categories, though neither is zero-risk
- The double-edged sword of rejection signals: Models with rejection rates of 15–20% provide precise feedback to adaptive attackers—rejections signal "you're too close, step back slightly"
- Case analysis reveals two attack patterns: (a) distractors—inserting benign queries before harmful ones to lower model vigilance; (b) context—early turns provide background information making the final harmful query appear more legitimate
- Statistical certification uncovers orders of magnitude more vulnerabilities than fixed benchmarks—20 fixed attacks vs. a probability bound over a \(10^{10}\) space
Highlights & Insights¶
- Paradigm Upgrade: from "whether compromised" to "probabilistic confidence bounds": Safety evaluation gains statistical rigor for the first time—analogous to moving from "finding one bug" to "system-level reliability certification"
- Rejection Rate ≠ Safety: Attackers exploit rejection signals to refine their strategies, challenging the intuition that a high rejection rate means a safer model; refusal behavior should not leak exploitable information
- Elegant Design of the Adaptive Distribution: The accept-advance / reject-retreat weight mechanism elegantly models realistic red-team adversary strategies
- Generality of the Markov Process: The framework is not limited to three distributions; new distributions can be flexibly defined to explore different attack patterns
Limitations & Future Work¶
- Judge Bias: GPT-4o is used to judge catastrophic outputs, introducing circular bias when evaluating GPT-family models
- Limited Scenario Coverage: Only 68 scenarios (chemical/biological + cybercrime); categories such as violence and hate speech are not covered
- Risk Quantification Only, No Defense Proposed: The framework identifies risk but does not provide mitigation strategies
- Limited Sample Size: Only 50 samples per distribution, yielding wide confidence intervals (e.g., [0.554, 0.821]); denser sampling could narrow these intervals
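A quick check of how much denser sampling would help: the exact interval's width shrinks roughly like \(1/\sqrt{n}\). The counts below are illustrative, chosen to match a 70% empirical rate; `clopper_pearson` is the helper sketched in the Method section, restated here for self-containment.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

for k, n in [(35, 50), (350, 500)]:   # same 70% rate, 10x the samples
    lo, hi = clopper_pearson(k, n)
    print(f"n={n}: [{lo:.3f}, {hi:.3f}], width {hi - lo:.3f}")
# n=50 gives a width of ~0.27; n=500 narrows it to ~0.08.
```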
- Graph Construction Depends on Actor Generation Quality: The diversity and coverage of the query set directly affect certification results
Related Work & Insights¶
- vs. HarmBench / AdvBench: Fixed attack sets vs. statistical certification; C3LLM provides probabilistic guarantees rather than empirical observations
- vs. Crescendo / PAIR: These are multi-turn attack methods, whereas C3LLM is a certification framework—it can certify the coverage of such attack methods
- vs. Single-turn Certification (Kumar 2023): Token/embedding-space perturbation certification vs. multi-turn conversation distribution certification; they differ in complexity and applicability
- vs. ATAD: ATAD dynamically generates reasoning evaluation benchmarks; C3LLM statistically certifies safety risks—both transcend the limitations of fixed benchmarks, but their objectives are entirely different
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First statistical certification framework for multi-turn safety; the combination of Markov processes and statistical testing is original
- Experimental Thoroughness: ⭐⭐⭐⭐ 6 frontier models × 3 distributions × 2 categories, with in-depth case analysis
- Writing Quality: ⭐⭐⭐⭐ Formally rigorous with a clear mathematical notation system
- Value: ⭐⭐⭐⭐⭐ Establishes a higher methodological standard for AI safety evaluation, elevating empirical testing to statistical certification