# Capability-Based Scaling Trends for LLM-Based Red-Teaming
- Conference: ICLR 2026
- arXiv: 2505.20162
- Code: https://github.com/kotekjedi/capability-based-scaling
- Area: Human Understanding / AI Safety / LLM Alignment
- Keywords: Red-teaming, Jailbreak attacks, Capability scaling, Safety evaluation, Attack success rate
## TL;DR
This paper systematically evaluates 4 jailbreak methods across 600+ attacker–target LLM pairs and finds that attack success rate (ASR) follows a sigmoid scaling law with respect to the capability gap between attacker and target (\(R^2=0.83\)), where the capability gap is quantified via a logit transformation of MMLU-Pro scores.
## Background & Motivation
Background: LLM red-teaming evaluates model safety by simulating adversarial attacks. Existing studies typically assess only a small number of model pairs and lack a systematic understanding of how ASR varies with model capability.
Limitations of Prior Work: ASR varies substantially across attack methods and is inconsistent across different model pairs. No unified framework exists for predicting attack vulnerability in new model combinations, necessitating full re-evaluation upon each new model release.
Key Challenge: Safety evaluation is resource-intensive—every model pair must be tested individually. Can scaling laws enable prediction rather than exhaustive testing? Moreover, as model capabilities improve, will human red-teaming eventually become ineffective?
Goal: Discover and quantify the scaling relationship between ASR and the capability gap between models, providing a predictive framework for safety evaluation.
Key Insight: The logit-transformed MMLU-Pro score is adopted as a unified proxy for model capability, and the capability difference between attacker and target is computed accordingly.
Core Idea: Jailbreak success rate is a sigmoid function of the attacker–target capability gap—the stronger the attacker and the weaker the target, the higher the ASR; when the target surpasses the attacker in capability, ASR drops sharply.
## Method

### Overall Architecture
ASR is evaluated across combinations of 4 attack methods (PAIR, TAP, PAP, Crescendo) × 25+ attacker models × 25+ target models, and the relationship between ASR and the capability difference \(\delta = \text{logit}(a_{\text{MMLU}}) - \text{logit}(t_{\text{MMLU}})\) is analyzed. All attacker models are first unlocked via LoRA fine-tuning to remove safety alignment. Evaluation uses the first 50 harmful behaviors from the HarmBench benchmark, with ASR reported as best-of-25 (a behavior counts as a success if any of 25 attack steps succeeds) and assessed post-hoc by a neutral HarmBench judge.
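A minimal Python sketch of this best-of-25 protocol; `attack_step` and `judge` are hypothetical callables standing in for the attack method and the HarmBench judge:

```python
from typing import Callable, Sequence

def best_of_25_asr(
    behaviors: Sequence[str],
    attack_step: Callable[[str, int], str],  # hypothetical: target's response to one attack turn
    judge: Callable[[str, str], bool],       # hypothetical: HarmBench-style success classifier
    n_steps: int = 25,
) -> float:
    """ASR@25: a behavior counts as a success if any of n_steps attack turns breaks the target."""
    broken = sum(
        any(judge(behavior, attack_step(behavior, step)) for step in range(n_steps))
        for behavior in behaviors
    )
    return broken / len(behaviors)
```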
### Key Designs
- Capability Gap Metric:
  - MMLU-Pro scores are logit-transformed onto the real line: \(\text{logit}(p) = \log(p/(1-p))\)
  - Capability gap: \(\delta = \text{logit}(\text{Attacker MMLU-Pro}) - \text{logit}(\text{Target MMLU-Pro})\)
  - Positive values indicate a stronger attacker (strong-to-weak); negative values indicate a stronger target (weak-to-strong)
- Sigmoid Scaling Law:
  - For each target model, a linear regression of ASR on the capability gap is fitted in logit space and mapped back to probability space, yielding a sigmoid curve (see the sketch after this list)
  - Sigmoid parameters (slope and intercept) differ across targets, but the functional form is consistent
  - Median parameters: \(k=1.73\), \(b=-0.79\)
- Model Unlocking:
  - All attacker models undergo LoRA fine-tuning on ~1,500 harmful samples to remove safety alignment
  - General capabilities are preserved while refusal behavior is eliminated, ensuring that the evaluated quantity is attack capability rather than willingness to refuse
  - Unlocking quality is validated via the ASR of direct HarmBench queries
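A minimal sketch of the capability-gap metric and the sigmoid fit described above, assuming the per-target fit is an ordinary least-squares regression in logit space; the clipping constant and function names are mine, and the defaults in `predict_asr` are the paper's reported median parameters:

```python
import numpy as np

def logit(p):
    """Map a score in (0, 1) onto the real line: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def capability_gap(attacker_mmlu_pro: float, target_mmlu_pro: float) -> float:
    """delta = logit(attacker MMLU-Pro) - logit(target MMLU-Pro)."""
    return logit(attacker_mmlu_pro) - logit(target_mmlu_pro)

def fit_sigmoid_law(deltas: np.ndarray, asrs: np.ndarray, eps: float = 1e-3):
    """Regress logit(ASR) on delta; mapping back through the logistic
    function gives the sigmoid ASR = sigma(k * delta + b)."""
    y = logit(np.clip(asrs, eps, 1.0 - eps))  # clip so an ASR of exactly 0 or 1 stays finite
    k, b = np.polyfit(deltas, y, deg=1)
    return k, b

def predict_asr(delta: float, k: float = 1.73, b: float = -0.79) -> float:
    """Predicted ASR under the fitted law (defaults: reported median parameters)."""
    return 1.0 / (1.0 + np.exp(-(k * delta + b)))
```

For example, an attacker at MMLU-Pro 0.75 against a target at 0.60 gives \(\delta = \log 3 - \log 1.5 \approx 0.69\), and `predict_asr(0.69)` is roughly 0.60 under the median parameters.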
## Key Experimental Results

### Main Results (600+ Attacker–Target Pairs)
| Attack Method | # Model Pairs | \(\rho\) (avg ASR vs. MMLU-Pro) | \(R^2\) (sigmoid fit) | Key Finding |
|---|---|---|---|---|
| PAIR | 600+ | >0.88 | 0.83 | Most fundamental LLM attack |
| TAP | 600+ | >0.85 | ~0.80 | Highest overall ASR |
| PAP | 600+ | >0.82 | ~0.78 | Persuasion-oriented attack |
| Crescendo | 600+ | >0.80 | ~0.75 | Multi-turn attack |
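The two statistics in this table can be reproduced along the following lines; the arrays are illustrative placeholders, not the paper's data, and \(\rho\) is computed here as a Pearson correlation, which matches the "linearly correlated" phrasing in the next table:

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative placeholders: one (capability, avg ASR) point per attacker model.
attacker_mmlu_pro = np.array([0.45, 0.55, 0.62, 0.70, 0.78])
avg_asr = np.array([0.20, 0.35, 0.42, 0.55, 0.63])

rho, _ = pearsonr(attacker_mmlu_pro, avg_asr)  # correlation column of the table

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination for the sigmoid-fit column."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot
```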
### Core Scaling Trends
| Trend | Quantification | Implication |
|---|---|---|
| Stronger model = better attacker | avg ASR linearly correlated with MMLU-Pro (\(\rho>0.88\)) | Capability gains simultaneously improve attack power |
| Stronger model = harder to break | Target ASR decreases with MMLU-Pro (\(R^2=0.83\)) | But the decline is predictable |
| Capability gap dominates | ASR primarily determined by \(\delta\) (gap), not attacker's absolute capability | Relative capability matters more than absolute capability |
| Social science > STEM | Psychology/health/philosophy subcategories show strongest correlation | Persuasiveness is more critical than domain knowledge |
## Highlights & Insights
- Predictive Tool: Given the MMLU-Pro scores of two models, the approximate ASR of red-teaming can be predicted, reducing costly full-scale evaluations—potentially saving tens of thousands of GPU hours
- Quantifying Safety Investment: The rightward shift of the sigmoid can measure the "equivalent capability gain" of safety training—Llama3 exhibits a larger shift than Qwen2.5, reflecting greater safety investment
- Quantitative Assessment of Weaponization Risk: As open-source model capabilities improve, the pool of attacker-accessible capability grows—each new 70B open-source release requires reassessment of the attack exposure of all existing deployed systems
- Practical Implication Regarding Judges: Expensive closed-source judges have no significant effect on final ASR—the community need not incur substantial API costs on judges
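To make the "equivalent capability gain" reading concrete (a back-of-the-envelope derivation from the fitted form, not an equation taken from the paper): under \(\text{ASR} = \sigma(k\delta + b)\), a safety intervention that lowers the intercept by \(\Delta\) shifts the curve rightward by \(\Delta/k\) in logit-capability units, since \(\sigma(k\delta + b - \Delta) = \sigma(k(\delta - \Delta/k) + b)\). At the median slope \(k = 1.73\), an intercept drop of 1.0 is therefore worth about 0.58 logits of target capability.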
## Ablation Study
| Analysis Dimension | Finding |
|---|---|
| MMLU-Pro subcategory correlation | Social science subcategories (psychology, health, philosophy) correlate most strongly with ASR, surpassing STEM |
| Judge impact | Strong judges improve ASR@1 (selection effect) but do not affect ASR@25 (generation quality) |
| Attack method comparison | TAP is overall strongest; Crescendo's originally reported success is attributable to the GPT-4 attacker rather than the method itself |
| Llama2 anomaly | 4 early Llama models deviate from the trend (excessive refusal + adversarial training); later versions return to trend |
| Human red-teaming prediction | Assuming human MMLU-Pro = 0.898, ASR continuously declines as target model capability increases |
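A sketch of that last projection, assuming human attackers sit at MMLU-Pro = 0.898 and that the median sigmoid parameters apply; the function name and sample target scores are mine:

```python
import math

def predicted_human_asr(target_mmlu_pro: float, human_mmlu_pro: float = 0.898,
                        k: float = 1.73, b: float = -0.79) -> float:
    """Sigmoid-law ASR for a human-capability attacker against a given target."""
    def logit(p: float) -> float:
        return math.log(p / (1.0 - p))
    delta = logit(human_mmlu_pro) - logit(target_mmlu_pro)
    return 1.0 / (1.0 + math.exp(-(k * delta + b)))

for t in (0.60, 0.75, 0.898, 0.95):
    # Predicted ASR falls monotonically as target capability rises.
    print(f"target MMLU-Pro {t:.3f} -> predicted ASR {predicted_human_asr(t):.2f}")
```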
## Limitations & Future Work
- MMLU-Pro as a capability proxy may be insufficiently precise in specific domains—the deviation of the Llama2 series indicates that safety training can break the general capability–robustness trend
- Only 4 automated attack methods (PAIR/TAP/PAP/Crescendo) are evaluated; manual or composite attacks may not follow the same scaling law
- Model unlocking via LoRA fine-tuning may not fully restore attack capability—unlocking quality is a potential confounding factor, particularly for models such as Claude that cannot be fully unlocked through simple fine-tuning
- White-box attacks (e.g., GCG) are not considered; the study is limited to black-box, human-like attacks, which may follow different scaling laws
- The deviation of the Llama2 series suggests MMLU-Pro is not a perfect proxy for defensive capability—a better proxy might be the FLOPs invested in safety training, but such data are not publicly available
## Related Work & Insights
- vs. Ren et al. (2024) safety benchmark analysis: That work first quantified the negative correlation between jailbreak success rate and model capability; this paper further characterizes the relationship precisely as a sigmoid function
- vs. Howe et al. (2025) GCG scaling: Their study examines scaling of GCG attacks within a model family; this paper examines capability-gap scaling across model families—complementary dimensions
- vs. PAIR/TAP/PAP: This paper does not propose new attacks but uses existing attacks to reveal scaling laws—emphasizing the law rather than the method
- Insight—Social science capability as a blind spot: Current safety evaluations focus on technical capabilities (coding, chemistry), yet persuasive and social engineering capabilities are the strongest predictors of attack success
## Rating
- Novelty: ⭐⭐⭐⭐ Scaling laws for red-teaming are a novel finding; sigmoid fitting is directly applicable to prediction
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Large-scale evaluation over 600+ model pairs, including closed-source frontier models
- Writing Quality: ⭐⭐⭐⭐ Results are clearly presented with high information density in figures
- Value: ⭐⭐⭐⭐⭐ Provides direct engineering guidance and policy implications for AI safety evaluation