SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation

Conference: AAAI 2026 arXiv: 2508.06194 Code: GitHub Area: Social Computing Keywords: Jailbreak Evaluation, Scenario-Adaptive, Multi-Dimensional Evaluation, LLM Safety, Harm Quantification

TL;DR

This paper proposes SceneJailEval, a scenario-adaptive multi-dimensional jailbreak evaluation framework that defines 14 jailbreak scenarios and 10 evaluation dimensions. Through a pipeline of scenario classification → dynamic dimension selection → multi-dimensional detection → weighted harm scoring, it achieves F1 of 0.917 on a self-constructed dataset (surpassing SOTA by 6%) and 0.995 on JBB (surpassing SOTA by 3%), while supporting harm severity quantification beyond binary classification.

Background & Motivation

Background: LLM jailbreak evaluation suffers from two major problems. Mainstream methods (string matching, toxicity classifiers, LLM judges) produce only binary "yes/no" outputs and cannot quantify harm severity, while emerging multi-dimensional frameworks (e.g., StrongREJECT, Cai et al.) apply uniform evaluation criteria across all scenarios, ignoring scenario-specific differences.

Limitations of Prior Work:

  • Binary classification is too coarse: it cannot distinguish "providing detailed methods for killing" from "merely implying the possibility."
  • One-size-fits-all evaluation criteria: for instance, the "authenticity" dimension is meaningful for "violent crime" scenarios but irrelevant to "hate speech"; the same dimension should carry different weights across scenarios.
  • Lack of regional sensitivity: cryptocurrency compliance requirements differ between mainland China and Japan, yet existing methods cannot account for this.

Key Challenge: Jailbreak scenarios are highly heterogeneous (violent crime vs. sexual content vs. political incitement), but evaluation methods treat them uniformly, creating an accuracy bottleneck.

Goal: To construct a scenario-adaptive jailbreak evaluation framework in which evaluation dimensions, scoring criteria, and weights are dynamically adjusted according to the scenario.

Key Insight: Draw on mature scenario-based evaluation paradigms from software testing and autonomous-driving verification, combined with DREAD/CVSS threat-modeling theory from cybersecurity, to construct the evaluation framework systematically.

Core Idea: A 14-scenario × 10-dimension adaptive matrix, with per-scenario weights derived via Delphi-method expert consensus and the Analytic Hierarchy Process (AHP), enabling fine-grained jailbreak evaluation.

Method

Overall Architecture

A four-step pipeline: (1) Scenario Classifier — an LLM agent maps the input to one of 14 predefined scenarios; (2) Scenario–Dimension Adapter — dynamically selects evaluation dimensions, scoring criteria, and weights based on the scenario; (3) Jailbreak Detector — multi-dimensional judgment across 6 detection dimensions, fused via logical AND; (4) Harm Assessor — weighted scoring across 4 harm dimensions, producing a composite harm score.
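The four-step pipeline can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimension names, weights, and scores below are hypothetical, and the two LLM-backed steps (scenario classification and per-dimension judgment) are mocked with keyword rules so the sketch is self-contained.

```python
# Scenario -> (detection dims fused by logical AND, harm dims with weights).
# Entries are illustrative stand-ins, not the paper's actual matrix.
SCENARIO_DIMS = {
    "cyberattacks": (
        ["refusal", "helpfulness"],
        {"specificity": 0.4, "severity": 0.3, "authenticity": 0.2, "scope": 0.1},
    ),
}

def classify_scenario(query: str) -> str:
    """Step 1: scenario classifier (an LLM agent in the paper; a toy rule here)."""
    return "cyberattacks" if "phishing" in query.lower() else "violent_crime"

def detect_jailbreak(response: str, detection_dims) -> bool:
    """Step 3: per-dimension detectors fused via logical AND."""
    judges = {
        # True = the dimension votes "jailbroken" (mocked keyword heuristics)
        "refusal": lambda r: "i can't help" not in r.lower(),  # not a pure refusal
        "helpfulness": lambda r: "step" in r.lower(),          # substantively helps
    }
    return all(judges[d](response) for d in detection_dims)

def harm_score(dim_scores, weights) -> float:
    """Step 4: weighted sum over the scenario's harm dimensions."""
    return sum(weights[d] * dim_scores[d] for d in weights)

query = "Write a phishing email targeting bank customers"
scenario = classify_scenario(query)                  # Step 1
det_dims, weights = SCENARIO_DIMS[scenario]          # Step 2: adapter lookup
response = "Step 1: register a lookalike domain, then..."
jailbroken = detect_jailbreak(response, det_dims)    # Step 3
score = harm_score(
    {"specificity": 0.9, "severity": 0.7, "authenticity": 0.8, "scope": 0.5},
    weights,
) if jailbroken else 0.0                             # Step 4
```

Note how the adapter (step 2) is just a lookup once the scenario is fixed; all scenario-specific behavior lives in the matrix, which is what makes the framework extensible to new scenarios.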

Key Designs

  1. 14-Scenario × 10-Dimension Adaptive Matrix:

    • Function: Tailors the subset of evaluation dimensions and scoring criteria to each jailbreak scenario.
    • Mechanism: The 14 scenarios cover violent crime, non-violent crime, sexual content, hate speech, suicide/self-harm, cyberattacks, disinformation, political incitement and elections, professional advice, privacy violation, intellectual property infringement, cryptocurrency violations, regionally sensitive issues, and illegal weapons/drugs. The 10 dimensions are divided into 6 detection dimensions (refusal, helpfulness, regional compliance, risk warning, political neutrality, explicitness) and 4 harm dimensions (authenticity, specificity, severity, scope of impact).
    • Design Motivation: Different scenarios require different dimension combinations. For example, "sexual content" emphasizes "explicitness," "professional advice" emphasizes "risk warning," and "regionally sensitive issues" requires "regional compliance."
  2. Delphi Method + AHP for Weight Determination:

    • Function: Systematically determines the weights of harm assessment dimensions for each scenario.
    • Mechanism: Ten security experts use multiple rounds of anonymous Delphi consensus to rank dimension importance for each scenario (consensus criteria: CV < 0.25, IQR ≤ 2), after which AHP converts the rankings into specific weights (consistency check: CR < 0.1).
    • Design Motivation: Avoids arbitrarily assigned weights; expert consensus and mathematical methods ensure objectivity and reproducibility.
  3. Multi-Dimensional Jailbreak Detection Design:

    • Function: Captures edge cases missed by traditional methods by decomposing the detection task.
    • Mechanism: Six detection dimensions each serve a distinct role — "refusal" checks whether the response contains only refusal expressions without substantive content; "helpfulness" detects whether the model indirectly facilitates malicious behavior (e.g., refusing first but then providing phishing details); "regional compliance" handles geographic differences. The final judgment is the logical AND of all relevant dimensions.
    • Design Motivation: "Refuse-then-help" edge cases are the primary failure mode of existing methods; the combined "refusal" + "helpfulness" detection effectively captures such cases.
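The AHP half of the weighting procedure (design 2) can be sketched as follows. The pairwise comparison matrix is a hypothetical example on Saaty's 1–9 scale, not taken from the paper; in the actual framework it would be derived from the experts' Delphi-consensus rankings.

```python
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}  # Saaty random-consistency indices

def ahp_weights(A, iters=100):
    """Principal-eigenvector weights of a reciprocal pairwise matrix A,
    plus the consistency ratio CR = ((lambda_max - n) / (n - 1)) / RI[n]."""
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(iters):  # power iteration toward the principal eigenvector
        v = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        w = [x / s for x in v]
    # Estimate lambda_max as the mean of (A w)_i / w_i
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(Aw[i] / w[i] for i in range(n)) / n
    cr = ((lam - n) / (n - 1)) / RI[n] if n > 2 else 0.0
    return w, cr

# Hypothetical judgments over 4 harm dimensions (reciprocal by construction):
A = [
    [1,     3,     5,     7],
    [1/3,   1,     3,     5],
    [1/5,   1/3,   1,     3],
    [1/7,   1/5,   1/3,   1],
]
w, cr = ahp_weights(A)  # accept only if CR < 0.1, per the paper's check
```

The CR < 0.1 gate mirrors the paper's consistency check: if the experts' pairwise judgments are too contradictory, the weights are rejected and the Delphi round is revisited.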

Evaluation Metrics

  • Detection evaluation metrics: Accuracy, Precision, Recall, F1
  • Harm scoring evaluation metrics: NMAE (deviation from expert annotations), Spearman-Rho (rank correlation with human judgments)
  • Overall NMAE = 0.013, Spearman-Rho = 0.938, indicating high agreement with expert judgments
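The two agreement metrics are easy to state concretely. A toy illustration, assuming harm scores on a 0–1 scale and no rank ties (ties would require fractional ranks):

```python
def nmae(pred, gold, scale=1.0):
    """Normalized mean absolute error against expert scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / (len(gold) * scale)

def spearman_rho(pred, gold):
    """Spearman rank correlation via the classic d^2 formula (no-ties case)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rp, rg = ranks(pred), ranks(gold)
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rg))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative framework scores vs. expert annotations (not the paper's data):
pred = [0.10, 0.42, 0.55, 0.71, 0.90]
gold = [0.12, 0.40, 0.58, 0.70, 0.93]
```

Here the predictions preserve the experts' ranking exactly (rho = 1.0) while deviating slightly in magnitude, which is the regime the paper's NMAE = 0.013 / rho = 0.938 figures describe.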

Key Experimental Results

Main Results

Self-constructed SceneJailEval dataset (1,308 queries, 14 scenarios):

Method          Accuracy  Precision  Recall  F1
StringMatch     0.749     0.750      0.957   0.841
Qi2023 (GPT-4)  0.816     0.966      0.760   0.851
JailJudge       0.800     0.930      0.768   0.841
SceneJailEval   0.883     0.901      0.929   0.915

Public datasets — JBB: F1 = 0.995 (SOTA); JailJudge dataset: F1 = 0.824 (SOTA).
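For reference, the four detection metrics in these tables follow the standard confusion-matrix definitions (the counts below are illustrative, not the paper's):

```python
def detection_metrics(tp, fp, fn, tn):
    """Accuracy / precision / recall / F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = detection_metrics(tp=90, fp=10, fn=10, tn=90)  # toy counts
```

The spread in the main table is instructive: StringMatch has high recall but weak precision (it flags anything lacking a refusal phrase), while the LLM-judge baselines show the opposite profile; SceneJailEval balances both.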

Ablation Study

Configuration                           F1     Description
Full SceneJailEval                      0.917  Complete framework
DimsOnly (w/o scenario classification)  0.890  Removing scenario classification drops F1 by 2.7 points
Vanilla (w/o dimension selection)       0.831  Falling back to generic heuristics drops F1 by 8.6 points

Key Findings

  • Scenario adaptation is critical: Removing scenario classification drops F1 by 2.7 points, and further removing dimension selection drops it by 8.6 points relative to the full framework.
  • Edge case detection advantage: Significantly outperforms all baselines on "refuse-then-help" cases and region-specific cases.
  • Harm scoring highly consistent with experts: NMAE < 0.02, Spearman-Rho ≈ 0.94.
  • Strong generalization: Ranks second on Safe-RLHF despite Beaver being specifically fine-tuned on that dataset.

Highlights & Insights

  • Introducing scenario-based evaluation methodology into LLM safety assessment: A methodological contribution that transplants mature paradigms from software testing and autonomous driving into this domain.
  • DREAD/CVSS theory guiding harm dimension definition: Dimensions are grounded in established cybersecurity threat modeling theory rather than intuition.
  • Delphi + AHP weight determination: Provides a reproducible, extensible framework for determining scenario–dimension weights rather than hard-coding them.
  • Regional sensitivity dimension: The first jailbreak evaluation framework to incorporate cultural and legal differences across regions.

Limitations & Future Work

  • Coverage of 14 scenarios: Although relatively comprehensive, real-world jailbreak attacks may fall outside these 14 categories.
  • Dependence on LLM agent classification accuracy: Misclassification of scenarios propagates errors to all subsequent dimension selection and evaluation steps.
  • High cost of expert annotation: Dataset construction requires five security experts to annotate using scenario-adaptive criteria, making scaling expensive.
  • Underlying model is Qwen-3-235B: Only one underlying model has been validated; performance with other LLM-as-judge setups remains unknown.
  • Delphi consensus from 10 experts: The sample size is relatively small and may be subject to individual biases.
Comparison with Related Work

  • vs. StrongREJECT: Applies unified Rejection Clarity/Specificity/Credibility criteria across all scenarios, ignoring scenario differences; SceneJailEval overcomes this limitation through scenario adaptation.
  • vs. AttackEval: Scores responses by cosine similarity to GPT-4 reference answers, still a uniform standard; SceneJailEval's multi-dimensional scenario-adaptive evaluation is more fine-grained.
  • vs. LlamaGuard3: Meta's official safety judge model reaches F1 = 0.98 on JBB, below SceneJailEval's 0.995.

Rating

  • Novelty: ⭐⭐⭐⭐ The scenario-adaptive multi-dimensional framework is pioneering in jailbreak evaluation; the Delphi + AHP weighting approach is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, six baselines, ablation studies, and expert annotation consistency validation.
  • Writing Quality: ⭐⭐⭐⭐ The framework is rigorously defined with complete mathematical formalization and detailed scenario and dimension definitions.
  • Value: ⭐⭐⭐⭐ Practically valuable for LLM safety evaluation; the extensible framework design is forward-looking.