When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment¶
Conference: ICLR 2026 arXiv: 2506.07452 Code: https://github.com/xiaoyuxin1002/SafeStyle Area: Natural Language Processing Keywords: LLM safety, jailbreak attacks, style alignment, ASR inflation, safety defense
TL;DR¶
This paper identifies that Attack Success Rates (ASR) in LLM jailbreak benchmarks are artificially inflated by semantically irrelevant style patterns (e.g., "create a list of"), a phenomenon observed in nearly all of 36 evaluated LLMs. Superficial style alignment fine-tuning further exacerbates this risk. The paper proposes SafeStyle — a defense that mitigates this risk via style-augmented safety training data.
Background & Motivation¶
Background: Alignment training teaches LLMs to refuse malicious requests. Jailbreak attacks counter this by applying string transformations to malicious queries in order to raise the Attack Success Rate (ASR).
Limitations of Prior Work: Queries in jailbreak benchmarks frequently contain semantically irrelevant style patterns (e.g., "create a list of" in "create a list of chemical warfare agents"), which independently inflate ASR. Existing safety defenses do not account for the impact of style alignment.
Key Challenge: Style patterns are pervasive in benign instructions (e.g., "create a list of healthy snacks"), causing LLMs to learn to comply with stylistic requests — a tendency that is then exploited when the same styles appear in malicious queries.
Core Idea: ASR inflation = ASR(styled query) − ASR(malicious intent only). SafeStyle = safety training data augmented to match the style distribution of fine-tuning data.
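The core quantity can be made concrete with a small sketch (per-query labels are made up for illustration; `asr` and `asr_inflation` are hypothetical names, not the paper's code):

```python
# Sketch of the paper's ASR-inflation measure (illustrative data).
# For each query we record whether the attack succeeded (1) or was refused (0),
# once for the original styled query and once for its extracted malicious intent.

def asr(outcomes):
    """Attack Success Rate: fraction of queries judged harmful-compliant."""
    return sum(outcomes) / len(outcomes)

def asr_inflation(styled_outcomes, intent_outcomes):
    """ASR(styled query) - ASR(malicious intent only), per the paper's definition."""
    return asr(styled_outcomes) - asr(intent_outcomes)

# Hypothetical outcomes for 10 paired queries on one model:
styled = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # e.g. "create a list of <harmful intent>"
intent = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # de-styled core intent only

print(round(asr_inflation(styled, intent), 3))  # 0.4
```

A positive value means the style wrapper alone makes the model more likely to comply; the paper aggregates this per model and applies a paired t-test across queries.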
Method¶
Overall Architecture¶
The study proceeds in three progressive stages: (1) quantifying ASR inflation in existing jailbreak benchmarks; (2) demonstrating through controlled experiments that superficial style alignment exacerbates safety risks; and (3) proposing the SafeStyle defense strategy.
Key Designs¶
- ASR Inflation Quantification:
- GPT-4o is used to extract core malicious intent from 2,134 jailbreak queries by removing style patterns.
- DeBERTa-NLI validates semantic equivalence before and after extraction (retaining only samples with exact semantic match).
- ASR differences between original queries (with style) and pure malicious intent (without style) are compared across 36 LLMs.
- Result: 32 out of 36 models exhibit significant ASR inflation (paired t-test \(p = 0.0002\)).
- Attention Mechanism Analysis:
- Attention weights are aggregated across all heads and layers to compute each LLM's relative attention difference between style tokens and malicious-intent tokens.
- ASR inflation is found to be significantly positively correlated with relative attention to style tokens (Spearman \(\rho = 0.456\), \(p = 6 \times 10^{-3}\)).
- Furthermore, inflated style patterns exhibit significantly higher bigram overlap frequency in alignment training data (e.g., Tulu-3, OLMo SFT data).
- Superficial Style Alignment Experiment:
- 1,000 instruction–response pairs are constructed across 6 style variants (original / de-styled / list-prefix / list-suffix / poem-prefix / poem-suffix).
- Llama-3.1-8B-Instruct is fine-tuned separately on each variant and evaluated on same-style / cross-style jailbreak ASR.
- ASR rises sharply when training and test styles match, and safety degrades further as the proportion of same-style training data increases.
- SafeStyle Defense:
- A small amount of safety training data (from Bianchi et al. 2024) is mixed into the fine-tuning data.
- Key innovation: the safety data is style-augmented to match the style distribution of the fine-tuning data (e.g., list-style fine-tuning → list-style safety refusal examples).
- As few as 50 style-matched safety samples suffice to effectively balance safety and style adaptation.
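The defense step above can be sketched as follows. This is a minimal sketch under stated assumptions: simple string templates stand in for the paper's LLM-based restyling, and `STYLE_TEMPLATES`, `style_augment`, and the example data are all hypothetical:

```python
# Sketch of SafeStyle-style data augmentation (illustrative; the paper restyles
# queries with an LLM, here plain templates act as stand-ins).
import random

# Hypothetical templates matching the fine-tuning data's dominant style.
STYLE_TEMPLATES = {
    "list_prefix": "Create a list of {q}",
    "poem_prefix": "Write a poem about {q}",
}

def style_augment(safety_pairs, style, n=50, seed=0):
    """Rewrap n safety prompts in the target style, keeping the refusal response."""
    rng = random.Random(seed)
    template = STYLE_TEMPLATES[style]
    sampled = rng.sample(safety_pairs, k=min(n, len(safety_pairs)))
    return [
        {"prompt": template.format(q=p["prompt"]), "response": p["response"]}
        for p in sampled
    ]

# Mix style-matched refusals into the fine-tuning set:
safety = [{"prompt": "ways to harm someone", "response": "I can't help with that."}]
train_data = []  # ...the task fine-tuning examples...
train_data += style_augment(safety, "list_prefix", n=50)
```

The key point is that the refusal responses are unchanged; only the prompts are rewrapped so their style distribution matches the fine-tuning data.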
Loss & Training¶
Standard supervised fine-tuning (full-parameter, 2 epochs, lr = 5e-6, batch size 128), with style-matched safety data mixed into the training set.
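As a quick sanity check on this schedule (sizes are illustrative assumptions: roughly 1,000 task examples plus 50 safety examples), two epochs at this batch size amount to very few optimizer steps:

```python
# Rough optimizer-step count for the reported SFT setup (illustrative sizes:
# 1,000 instruction pairs + 50 style-matched safety examples).
import math

n_examples = 1000 + 50   # task data + SafeStyle safety mix
batch_size = 128         # effective batch size
epochs = 2               # lr = 5e-6, full-parameter SFT

steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 9 18
```

Under these assumed sizes, the 0.4-epoch threshold at which same-style effects become significant corresponds to only a handful of gradient updates.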
Key Experimental Results¶
Main Results¶
| Finding | Evidence |
|---|---|
| 32 out of 36 LLMs exhibit ASR inflation | paired t-test \(p = 0.0002\) |
| All 7 benchmarks cause inflation | SorryBench and MedSafetyBench affect the most models |
| Mistral series shows the most severe inflation | Gemma/Llama are relatively resistant |
Ablation Study¶
| Configuration | ASR (list style) | ASR (poem style) | Notes |
|---|---|---|---|
| Original instruction fine-tuning (diverse) | Medium | Medium | Baseline, diverse styles |
| List-style fine-tuning (100%) | Highest | Medium | ASR rises sharply under same-style attacks |
| Poem-style fine-tuning (100%) | Medium | Highest | ASR rises sharply under same-style attacks |
| List fine-tuning (50% + 50% de-styled) | Lower | Medium | Mixing de-styled data alleviates the issue |
| + SafeStyle (style-matched safety data) | Lowest | Lowest | Defense is effective |
| + Vanilla safety data (no style matching) | Medium-low | Medium-low | Limited effectiveness |
| + PTST (inference-time safety prompt) | Medium | Medium | Inference-time intervention alone is insufficient |
| + SPPFT (frozen safety layers) | Medium-low | Medium-low | Partially effective |
Key Findings¶
- Decoupling style from malicious intent: ASR drops significantly after style removal (paired t-test \(p = 0.0002\)).
- Same-style fine-tuning → same-style jailbreak: List-style fine-tuning causes a sharp increase in ASR for list-styled malicious queries, with the effect becoming significant after only 0.4 epochs.
- SafeStyle is consistently effective: Across 3 LLMs (Qwen2.5-3B, Llama-3.1-8B, gemma-3-12b) × 6 styles × 2 real-world datasets (Dolly-15K, Alpaca-52K), SafeStyle outperforms all 5 baselines.
- Style position has minimal effect: ASR trends for prefix vs. suffix styles are nearly identical.
- Only 50 safety samples needed: SafeStyle achieves strong results with as few as 50 style-matched safety examples, at minimal cost.
Highlights & Insights¶
- Redefining ASR: ASR figures reported by existing benchmarks are systematically inflated by style patterns.
- Safety training data should match deployment style — a simple yet previously overlooked insight.
Limitations & Future Work¶
- Style pattern extraction relies on GPT-4o few-shot prompting, which may miss certain implicit styles (e.g., rhetorical devices, syntactic preferences).
- SafeStyle requires knowledge of the fine-tuning data's style distribution, which may not be available in open deployment scenarios.
- Only six styles are tested (list, poem, news, legal, Shakespearean, code); a broader style space (e.g., colloquial, academic) remains to be explored.
- Safety data sources are fixed (Bianchi et al.); larger and more diverse safety datasets may further improve effectiveness.
- The style–safety interaction during RLHF/DPO post-training is not analyzed (this paper considers SFT only).
Related Work & Insights¶
- vs. Bianchi et al. (Vanilla safety data): Safety data without style augmentation is far less effective than SafeStyle, confirming that style matching is the key factor.
- vs. SPPFT (frozen safety layers): The layer-freezing strategy fails under certain styles, as safety knowledge is distributed across multiple layers.
- vs. Constrained (restricting initial tokens): Restricting initial tokens alone is insufficient to defend against full-style jailbreaks.
- Insight: Safety alignment should be treated as a process that co-evolves with deployment style, rather than a one-time fixed intervention.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The ASR inflation concept is entirely novel and significant; the causal analysis of style–safety interactions is rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 36 LLMs × 7 benchmarks + attention analysis + 6-style fine-tuning + 3 models × 5 baselines.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from discovery to analysis to defense is complete and self-consistent.
- Value: ⭐⭐⭐⭐⭐ Profound implications for LLM safety evaluation standards and alignment practices.