When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment¶
Conference: ICLR 2026 arXiv: 2506.07452 Code: https://github.com/xiaoyuxin1002/SafeStyle Area: Natural Language Processing Keywords: LLM safety, jailbreak attacks, style alignment, ASR inflation, safety defense
TL;DR¶
This paper identifies that Attack Success Rates (ASR) in LLM jailbreak benchmarks are artificially inflated by semantically irrelevant style patterns (e.g., "create a list of"), a phenomenon observed in nearly all of 36 evaluated LLMs. Superficial style alignment fine-tuning further exacerbates this risk. The paper proposes SafeStyle — a defense that mitigates this risk via style-augmented safety training data.
Background & Motivation¶
Background: Alignment training teaches LLMs to refuse malicious requests. Jailbreak attacks counter this by applying string transformations to malicious queries in order to raise the Attack Success Rate (ASR).
Limitations of Prior Work: Queries in jailbreak benchmarks frequently contain semantically irrelevant style patterns (e.g., "create a list of" in "create a list of chemical warfare agents"), which independently inflate ASR. Existing safety defenses do not account for the impact of style alignment.
Key Challenge: Style patterns are pervasive in benign instructions (e.g., "create a list of healthy snacks"), causing LLMs to learn to comply with stylistic requests — a tendency that is then exploited when the same styles appear in malicious queries.
Core Idea: ASR inflation = ASR(styled query) − ASR(malicious intent only). SafeStyle = safety training data augmented to match the style distribution of fine-tuning data.
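The core quantity can be made concrete with a small sketch (per-query labels are made up for illustration; `asr` and `asr_inflation` are hypothetical names, not the paper's code):

```python
# Sketch of the paper's ASR-inflation measure (illustrative data).
# For each query we record whether the attack succeeded (1) or was refused (0),
# once for the original styled query and once for its extracted malicious intent.

def asr(outcomes):
    """Attack Success Rate: fraction of queries judged harmful-compliant."""
    return sum(outcomes) / len(outcomes)

def asr_inflation(styled_outcomes, intent_outcomes):
    """ASR(styled query) - ASR(malicious intent only), per the paper's definition."""
    return asr(styled_outcomes) - asr(intent_outcomes)

# Hypothetical outcomes for 10 paired queries on one model:
styled = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # e.g. "create a list of <harmful intent>"
intent = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # de-styled core intent only

print(round(asr_inflation(styled, intent), 3))  # 0.4
```

A positive value means the style wrapper alone makes the model more likely to comply; the paper aggregates this per model and applies a paired t-test across queries.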
Method¶
Overall Architecture¶
The study proceeds in three progressive stages: (1) quantifying ASR inflation in existing jailbreak benchmarks; (2) demonstrating through controlled experiments that superficial style alignment exacerbates safety risks; and (3) proposing the SafeStyle defense strategy.
Key Designs¶
- ASR Inflation Quantification:
- GPT-4o is used to extract core malicious intent from 2,134 jailbreak queries by removing style patterns.
- DeBERTa-NLI validates semantic equivalence before and after extraction (retaining only samples with exact semantic match).
- ASR differences between original queries (with style) and pure malicious intent (without style) are compared across 36 LLMs.
- Result: 32 out of 36 models exhibit significant ASR inflation (paired t-test \(p = 0.0002\)).
- Attention Mechanism Analysis:
- Attention weights are aggregated across all heads and layers to compute each LLM's relative attention difference between style tokens and malicious-intent tokens.
- ASR inflation is found to be significantly positively correlated with relative attention to style tokens (Spearman \(\rho = 0.456\), \(p = 6 \times 10^{-3}\)).
- Furthermore, inflated style patterns exhibit significantly higher bigram overlap frequency in alignment training data (e.g., Tulu-3, OLMo SFT data).
- Superficial Style Alignment Experiment:
- 1,000 instruction–response pairs are constructed across 6 style variants (original / de-styled / list-prefix / list-suffix / poem-prefix / poem-suffix).
- Llama-3.1-8B-Instruct is fine-tuned separately on each variant and evaluated on same-style / cross-style jailbreak ASR.
- ASR rises sharply when training and test styles match, and safety degrades further as the proportion of same-style training data increases.
- SafeStyle Defense:
- A small amount of safety training data (from Bianchi et al. 2024) is mixed into the fine-tuning data.
- Key innovation: the safety data is style-augmented to match the style distribution of the fine-tuning data (e.g., list-style fine-tuning → list-style safety refusal examples).
- As few as 50 style-matched safety samples suffice to effectively balance safety and style adaptation.
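The defense step above can be sketched as follows. This is a minimal sketch under stated assumptions: simple string templates stand in for the paper's LLM-based restyling, and `STYLE_TEMPLATES`, `style_augment`, and the example data are all hypothetical:

```python
# Sketch of SafeStyle-style data augmentation (illustrative; the paper restyles
# queries with an LLM, here plain templates act as stand-ins).
import random

# Hypothetical templates matching the fine-tuning data's dominant style.
STYLE_TEMPLATES = {
    "list_prefix": "Create a list of {q}",
    "poem_prefix": "Write a poem about {q}",
}

def style_augment(safety_pairs, style, n=50, seed=0):
    """Rewrap n safety prompts in the target style, keeping the refusal response."""
    rng = random.Random(seed)
    template = STYLE_TEMPLATES[style]
    sampled = rng.sample(safety_pairs, k=min(n, len(safety_pairs)))
    return [
        {"prompt": template.format(q=p["prompt"]), "response": p["response"]}
        for p in sampled
    ]

# Mix style-matched refusals into the fine-tuning set:
safety = [{"prompt": "ways to harm someone", "response": "I can't help with that."}]
train_data = []  # ...the task fine-tuning examples...
train_data += style_augment(safety, "list_prefix", n=50)
```

The key point is that the refusal responses are unchanged; only the prompts are rewrapped so their style distribution matches the fine-tuning data.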
Loss & Training¶
Standard supervised fine-tuning (full-parameter, 2 epochs, lr = 5e-6, batch size 128), with style-matched safety data mixed into the training set.
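As a quick sanity check on this schedule (sizes are illustrative assumptions: roughly 1,000 task examples plus 50 safety examples), two epochs at this batch size amount to very few optimizer steps:

```python
# Rough optimizer-step count for the reported SFT setup (illustrative sizes:
# 1,000 instruction pairs + 50 style-matched safety examples).
import math

n_examples = 1000 + 50   # task data + SafeStyle safety mix
batch_size = 128         # effective batch size
epochs = 2               # lr = 5e-6, full-parameter SFT

steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 9 18
```

Under these assumed sizes, the 0.4-epoch threshold at which same-style effects become significant corresponds to only a handful of gradient updates.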
Key Experimental Results¶
Main Results¶
| Finding | Evidence |
|---|---|
| 32 out of 36 LLMs exhibit ASR inflation | paired t-test \(p = 0.0002\) |
| All 7 benchmarks cause inflation | SorryBench and MedSafetyBench affect the most models |
| Mistral series shows the most severe inflation | Gemma/Llama are relatively resistant |
Ablation Study¶
| Configuration | ASR (list style) | ASR (poem style) | Notes |
|---|---|---|---|
| Original instruction fine-tuning (diverse) | Medium | Medium | Baseline, diverse styles |
| List-style fine-tuning (100%) | Highest | Medium | ASR rises sharply under same-style attacks |
| Poem-style fine-tuning (100%) | Medium | Highest | ASR rises sharply under same-style attacks |
| List fine-tuning (50% + 50% de-styled) | Lower | Medium | Mixing de-styled data alleviates the issue |
| + SafeStyle (style-matched safety data) | Lowest | Lowest | Defense is effective |
| + Vanilla safety data (no style matching) | Medium-low | Medium-low | Limited effectiveness |
| + PTST (inference-time safety prompt) | Medium | Medium | Inference-time intervention alone is insufficient |
| + SPPFT (frozen safety layers) | Medium-low | Medium-low | Partially effective |
Key Findings¶
- Decoupling style from malicious intent: ASR drops significantly after style removal (paired t-test \(p = 0.0002\)).
- Same-style fine-tuning → same-style jailbreak: List-style fine-tuning causes a sharp increase in ASR for list-styled malicious queries, with the effect becoming significant after only 0.4 epochs.
- SafeStyle is consistently effective: Across 3 LLMs (Qwen2.5-3B, Llama-3.1-8B, gemma-3-12b) × 6 styles × 2 real-world datasets (Dolly-15K, Alpaca-52K), SafeStyle outperforms all 5 baselines.
- Style position has minimal effect: ASR trends for prefix vs. suffix styles are nearly identical.
- Only 50 safety samples needed: SafeStyle achieves strong results with as few as 50 style-matched safety examples, at minimal cost.
Highlights & Insights¶
- Redefining ASR: ASR figures reported by existing benchmarks are systematically inflated by style patterns.
- Safety training data should match deployment style — a simple yet previously overlooked insight.
Limitations & Future Work¶
- Style pattern extraction relies on GPT-4o few-shot prompting, which may miss certain implicit styles (e.g., rhetorical devices, syntactic preferences).
- SafeStyle requires knowledge of the fine-tuning data's style distribution, which may not be available in open deployment scenarios.
- Only six styles are tested (list, poem, news, legal, Shakespearean, code); a broader style space (e.g., colloquial, academic) remains to be explored.
- Safety data sources are fixed (Bianchi et al.); larger and more diverse safety datasets may further improve effectiveness.
- The style–safety interaction during RLHF/DPO post-training is not analyzed (this paper considers SFT only).
Related Work & Insights¶
- vs. Bianchi et al. (Vanilla safety data): Safety data without style augmentation is far less effective than SafeStyle, confirming that style matching is the key factor.
- vs. SPPFT (frozen safety layers): The layer-freezing strategy fails under certain styles, as safety knowledge is distributed across multiple layers.
- vs. Constrained (restricting initial tokens): Restricting initial tokens alone is insufficient to defend against full-style jailbreaks.
- Insight: Safety alignment should be treated as a process that co-evolves with deployment style, rather than a one-time fixed intervention.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The ASR inflation concept is entirely novel and significant; the causal analysis of style–safety interactions is rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 36 LLMs × 7 benchmarks + attention analysis + 6-style fine-tuning + 3 models × 5 baselines.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from discovery to analysis to defense is complete and self-consistent.
- Value: ⭐⭐⭐⭐⭐ Profound implications for LLM safety evaluation standards and alignment practices.