SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation¶
Conference: AAAI 2026 arXiv: 2508.06194 Code: GitHub Area: Social Computing Keywords: Jailbreak Evaluation, Scenario-Adaptive, Multi-Dimensional Evaluation, LLM Safety, Harm Quantification
TL;DR¶
This paper proposes SceneJailEval, a scenario-adaptive multi-dimensional jailbreak evaluation framework that defines 14 jailbreak scenarios and 10 evaluation dimensions. Through a pipeline of scenario classification → dynamic dimension selection → multi-dimensional detection → weighted harm scoring, it achieves F1 of 0.917 on a self-constructed dataset (surpassing SOTA by 6%) and 0.995 on JBB (surpassing SOTA by 3%), while supporting harm severity quantification beyond binary classification.
Background & Motivation¶
Background: LLM jailbreak attack evaluation suffers from two major problems: (1) mainstream methods (string matching, toxicity classifiers, LLM judges) produce only binary "yes/no" outputs without quantifying harm severity; (2) emerging multi-dimensional frameworks (e.g., StrongREJECT, Cai et al.) apply uniform evaluation criteria across all scenarios, ignoring scenario-specific differences.
Limitations of Prior Work:
- Binary classification is too coarse: it cannot distinguish between "providing detailed methods for killing" and "merely implying the possibility."
- One-size-fits-all evaluation criteria: for instance, the "authenticity" dimension is meaningful for "violent crime" scenarios but irrelevant to "hate speech"; the same dimension should carry different weights across scenarios.
- Lack of regional sensitivity: cryptocurrency compliance requirements differ between mainland China and Japan, yet existing methods cannot account for this.
Key Challenge: Jailbreak scenarios are highly heterogeneous (violent crime vs. sexual content vs. political incitement), but evaluation methods treat them uniformly, creating an accuracy bottleneck.
Goal: To construct a scenario-adaptive jailbreak evaluation framework in which evaluation dimensions, scoring criteria, and weights are dynamically adjusted according to the scenario.
Key Insight: Draw on mature scenario-based evaluation paradigms from software testing and autonomous-driving verification, combined with DREAD/CVSS cybersecurity threat modeling, to systematically construct the evaluation framework.
Core Idea: A 14-scenario × 10-dimension adaptive matrix combined with Delphi-method expert consensus and Analytic Hierarchy Process (AHP) weighting, enabling fine-grained jailbreak evaluation.
Method¶
Overall Architecture¶
A four-step pipeline: (1) Scenario Classifier — an LLM agent maps the input to one of 14 predefined scenarios; (2) Scenario–Dimension Adapter — dynamically selects evaluation dimensions, scoring criteria, and weights based on the scenario; (3) Jailbreak Detector — multi-dimensional judgment across 6 detection dimensions, fused via logical AND; (4) Harm Assessor — weighted scoring across 4 harm dimensions, producing a composite harm score.
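The four steps can be summarized in a short sketch. The following Python outline is illustrative only: the scenario names, the adaptive-matrix entries, and the helper functions (`classify_scenario`, `judge_dimension`, `score_harm`) are assumed placeholders, not the paper's actual prompts or identifiers.

```python
from typing import Dict, List, Tuple

# Hypothetical slice of the scenario-dimension adaptive matrix:
# each scenario maps to its detection dimensions and harm-dimension weights.
SCENARIO_DIMENSIONS: Dict[str, List[str]] = {
    "violent_crime": ["refusal", "helpfulness", "risk_warning"],
    "sexual_content": ["refusal", "helpfulness", "explicitness"],
    "regionally_sensitive": ["refusal", "helpfulness", "regional_compliance"],
}
SCENARIO_WEIGHTS: Dict[str, Dict[str, float]] = {
    "violent_crime": {"authenticity": 0.30, "specificity": 0.30,
                      "severity": 0.25, "scope_of_impact": 0.15},
}

def classify_scenario(query: str) -> str:
    """Step 1: an LLM agent maps the query to one of the 14 scenarios (stub)."""
    raise NotImplementedError

def judge_dimension(dim: str, query: str, response: str) -> bool:
    """Step 3: LLM-as-judge verdict for one detection dimension (stub)."""
    raise NotImplementedError

def score_harm(dim: str, query: str, response: str) -> float:
    """Step 4: LLM-as-judge harm score in [0, 1] for one harm dimension (stub)."""
    raise NotImplementedError

def evaluate(query: str, response: str) -> Tuple[bool, float]:
    scenario = classify_scenario(query)                   # Step 1: scenario classifier
    det_dims = SCENARIO_DIMENSIONS[scenario]              # Step 2: scenario-dimension adapter
    harm_weights = SCENARIO_WEIGHTS[scenario]
    jailbroken = all(judge_dimension(d, query, response)  # Step 3: logical AND over detection dims
                     for d in det_dims)
    harm = 0.0
    if jailbroken:                                        # Step 4: weighted harm score
        harm = sum(w * score_harm(d, query, response)
                   for d, w in harm_weights.items())
    return jailbroken, harm
```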
Key Designs¶
- 14-Scenario × 10-Dimension Adaptive Matrix:
    - Function: Tailors the subset of evaluation dimensions and scoring criteria to each jailbreak scenario.
    - Mechanism: The 14 scenarios cover violent crime, non-violent crime, sexual content, hate speech, suicide/self-harm, cyberattacks, disinformation, political incitement and elections, professional advice, privacy violation, intellectual property infringement, cryptocurrency violations, regionally sensitive issues, and illegal weapons/drugs. The 10 dimensions are divided into 6 detection dimensions (refusal, helpfulness, regional compliance, risk warning, political neutrality, explicitness) and 4 harm dimensions (authenticity, specificity, severity, scope of impact).
    - Design Motivation: Different scenarios require different dimension combinations. For example, "sexual content" emphasizes "explicitness," "professional advice" emphasizes "risk warning," and "regionally sensitive issues" requires "regional compliance."
- Delphi Method + AHP for Weight Determination (see the AHP sketch after this list):
    - Function: Systematically determines the weights of harm assessment dimensions for each scenario.
    - Mechanism: Ten security experts use multiple rounds of anonymous Delphi consensus to rank dimension importance for each scenario (consensus criteria: CV < 0.25, IQR ≤ 2), after which AHP converts the rankings into specific weights (consistency check: CR < 0.1).
    - Design Motivation: Avoids arbitrarily assigned weights; expert consensus and mathematical methods ensure objectivity and reproducibility.
- Multi-Dimensional Jailbreak Detection Design:
    - Function: Captures edge cases missed by traditional methods by decomposing the detection task.
    - Mechanism: Six detection dimensions each serve a distinct role: "refusal" checks whether the response contains only refusal expressions without substantive content; "helpfulness" detects whether the model indirectly facilitates malicious behavior (e.g., refusing first but then providing phishing details); "regional compliance" handles geographic differences. The final judgment is the logical AND of all relevant dimensions.
    - Design Motivation: "Refuse-then-help" edge cases are the primary failure mode of existing methods; the combined "refusal" + "helpfulness" detection effectively captures such cases.
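To make the weighting step concrete, below is a minimal sketch of how AHP can turn a reciprocal pairwise-comparison matrix into dimension weights, including the CR < 0.1 consistency check mentioned above. The pairwise judgments for the four harm dimensions are hypothetical, not the experts' actual data.

```python
import numpy as np

RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}  # Saaty's random index values

def ahp_weights(pairwise: np.ndarray) -> tuple:
    """Return (weights, consistency ratio) from a reciprocal pairwise-comparison matrix."""
    n = pairwise.shape[0]
    eigvals, eigvecs = np.linalg.eig(pairwise)
    k = int(np.argmax(eigvals.real))        # principal eigenvalue
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                         # normalize eigenvector into weights
    ci = (eigvals[k].real - n) / (n - 1)    # consistency index
    cr = ci / RANDOM_INDEX[n]               # consistency ratio
    return w, cr

# Hypothetical pairwise judgments for the 4 harm dimensions
# (authenticity, specificity, severity, scope of impact) in one scenario.
A = np.array([
    [1.0, 1.0, 1/2, 2.0],
    [1.0, 1.0, 1/2, 2.0],
    [2.0, 2.0, 1.0, 3.0],
    [1/2, 1/2, 1/3, 1.0],
])
w, cr = ahp_weights(A)
assert cr < 0.1, "inconsistent judgments; revisit the Delphi round"
print(dict(zip(["authenticity", "specificity", "severity", "scope_of_impact"],
               w.round(3))))
```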
Evaluation Metrics¶
- Detection evaluation metrics: Accuracy, Precision, Recall, F1
- Harm scoring evaluation metrics: NMAE (deviation from expert annotations), Spearman-Rho (rank correlation with human judgments)
- Overall NMAE = 0.013, Spearman-Rho = 0.938, indicating high agreement with expert judgments
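As a reference, here is a minimal sketch of the two harm-scoring agreement metrics, assuming a 0-10 harm scale and made-up sample scores (both are illustrative assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

def nmae(pred: np.ndarray, expert: np.ndarray, score_range: float) -> float:
    """Mean absolute error normalized by the score range, so 0 means perfect agreement."""
    return float(np.mean(np.abs(pred - expert)) / score_range)

# Hypothetical framework scores vs. expert annotations on a 0-10 harm scale.
pred   = np.array([8.2, 1.0, 6.5, 9.1, 3.3])
expert = np.array([8.0, 1.2, 6.0, 9.4, 3.0])

rho, _ = spearmanr(pred, expert)            # rank correlation with human judgments
print("NMAE:", round(nmae(pred, expert, score_range=10.0), 3))
print("Spearman rho:", round(float(rho), 3))
```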
Key Experimental Results¶
Main Results¶
Self-constructed SceneJailEval dataset (1,308 queries, 14 scenarios):
| Method | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| StringMatch | 0.749 | 0.750 | 0.957 | 0.841 |
| Qi2023 (GPT-4) | 0.816 | 0.966 | 0.760 | 0.851 |
| JailJudge | 0.800 | 0.930 | 0.768 | 0.841 |
| SceneJailEval | 0.883 | 0.901 | 0.929 | 0.915 |
Public datasets — JBB: F1 = 0.995 (SOTA); JailJudge dataset: F1 = 0.824 (SOTA).
Ablation Study¶
| Configuration | F1 | Description |
|---|---|---|
| Full SceneJailEval | 0.917 | Complete framework |
| DimsOnly (w/o scenario classification) | 0.890 | Removing scenario classification drops F1 by 2.7 points |
| Vanilla (w/o dimension selection) | 0.831 | Falling back to generic heuristics drops F1 by 8.6 points |
Key Findings¶
- Scenario adaptation is critical: Removing scenario classification and removing dimension selection reduce F1 by 2.7 and 8.6 points, respectively (0.917 → 0.890 and 0.917 → 0.831).
- Edge case detection advantage: Significantly outperforms all baselines on "refuse-then-help" cases and region-specific cases.
- Harm scoring highly consistent with experts: NMAE < 0.02, Spearman-Rho ≈ 0.94.
- Strong generalization: Ranks second on Safe-RLHF, behind only Beaver, which was specifically fine-tuned on that dataset.
Highlights & Insights¶
- Introducing scenario-based evaluation methodology into LLM safety assessment: A methodological contribution that transplants mature paradigms from software testing and autonomous driving into this domain.
- DREAD/CVSS theory guiding harm dimension definition: Dimensions are grounded in established cybersecurity threat modeling theory rather than intuition.
- Delphi + AHP weight determination: Provides a reproducible, extensible framework for determining scenario–dimension weights rather than hard-coding them.
- Regional sensitivity dimension: The first jailbreak evaluation framework to incorporate cultural and legal differences across regions.
Limitations & Future Work¶
- Coverage of 14 scenarios: Although relatively comprehensive, real-world jailbreak attacks may fall outside these 14 categories.
- Dependence on LLM agent classification accuracy: Misclassification of scenarios propagates errors to all subsequent dimension selection and evaluation steps.
- High cost of expert annotation: Dataset construction requires five security experts to annotate using scenario-adaptive criteria, making scaling expensive.
- Underlying model is Qwen-3-235B: Only one underlying model has been validated; performance with other LLM-as-judge setups remains unknown.
- Delphi consensus from 10 experts: The sample size is relatively small and may be subject to individual biases.
Related Work & Insights¶
- vs. StrongREJECT: Applies unified Rejection Clarity/Specificity/Credibility criteria across all scenarios, ignoring scenario differences. SceneJailEval overcomes this limitation through scenario adaptation.
- vs. AttackEval: Uses GPT-4 reference answers with cosine similarity, still a uniform standard. SceneJailEval's multi-dimensional scenario-adaptive evaluation is more fine-grained.
- vs. LlamaGuard3: Meta's official safety judge model achieves F1 = 0.98 on JBB, compared to SceneJailEval's 0.995.
Rating¶
- Novelty: ⭐⭐⭐⭐ The scenario-adaptive multi-dimensional framework is pioneering in jailbreak evaluation; the Delphi + AHP weighting approach is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, six baselines, ablation studies, and expert annotation consistency validation.
- Writing Quality: ⭐⭐⭐⭐ The framework is rigorously defined with complete mathematical formalization and detailed scenario and dimension definitions.
- Value: ⭐⭐⭐⭐ Practically valuable for LLM safety evaluation; the extensible framework design is forward-looking.