# Superficial Safety Alignment Hypothesis

**Conference:** ICLR 2026 · **arXiv:** 2410.10862 · **Code:** https://ssa-h.github.io/ · **Area:** AI Safety / LLM Alignment · **Keywords:** safety alignment, alignment fragility, neuron-level analysis, alignment tax, model pruning

## TL;DR
This paper proposes the Superficial Safety Alignment Hypothesis (SSAH): safety alignment is essentially teaching a model to perform an implicit binary classification task (execute vs. refuse), requiring only ~1.3% of neurons to establish safety guardrails. Freezing these safety-critical units during fine-tuning preserves safety, and leveraging redundant units as an "alignment budget" eliminates the alignment tax.
## Background & Motivation
**Background:** LLM safety alignment primarily relies on SFT, RLHF, and DPO, but these methods typically treat safety alignment as a subset of general alignment, overlooking its unique properties.

**Limitations of Prior Work:**
- Safety mechanisms are extremely fragile: even fine-tuning on benign data can collapse safety guardrails (Qi et al., 2023).
- An "alignment tax" exists: improving safety sacrifices general model capability.
- Current approaches require full-parameter fine-tuning, incurring high computational costs.

**Key Challenge:** There is insufficient understanding of how safety alignment affects model behavior and why safety mechanisms are so fragile.

**Goal:** Three questions are addressed: How does safety alignment affect model behavior? Why is safety so fragile? How can these issues be mitigated?

**Key Insight:** A model capable of fulfilling malicious requests already possesses the relevant knowledge and reasoning ability; safety alignment therefore only needs to teach the model to select the correct reasoning direction (execute vs. refuse), rather than inject new knowledge.

**Core Idea:** Safety alignment ≈ an implicit binary classification task, achievable with a very small fraction (~1.3%) of safety-critical neurons.
## Method

### Overall Architecture
SSAH is not a concrete algorithm but a hypothesis framework about the nature of safety alignment. The overall pipeline proceeds as follows: (1) propose the hypothesis and validate the existence of reasoning directions via probing experiments; (2) identify four types of neurons (SCU/UCU/CU/RU) through structured pruning; (3) derive two practical strategies based on the identified units — freezing safety units to resist fine-tuning attacks, and exploiting redundant units to reduce the alignment tax.
### Key Designs
- **Superficial Safety Alignment Hypothesis (SSAH):**
  - Function: Reframes safety alignment as an implicit safety-related binary classification task.
  - Mechanism: A model capable of executing malicious requests already possesses the relevant knowledge; safety alignment only needs to teach it to select the correct "reasoning direction" (execute or refuse). Alignment also provides a standardized refusal mechanism and alternative response templates.
  - Design Motivation: Compared to the general Safety Alignment Hypothesis (SAH), SSAH is more specific and verifiable: it focuses on models that already possess the requisite knowledge, eliminating confounding factors arising from knowledge deficits.
  - Explanation of Jailbreaks: Current alignment only determines the reasoning direction at the initial token; attackers bypass the safety mechanism by manipulating tokens. Ideal alignment should re-evaluate the reasoning direction at every generation step.
- **Probing Experiments to Validate Reasoning Directions:**
  - Function: Validates that safety alignment indeed shifts the model's reasoning direction by comparing hidden-state distances.
  - Mechanism: Three types of queries are constructed: original malicious queries (Clean), malicious queries prepended with a benign token ("Sorry, I can't..."), and malicious queries prepended with a malicious token ("Here's how..."). For an aligned model, the hidden-state distance between Clean and the benign-token variant should be smaller than that between Clean and the malicious-token variant; the reverse holds for unaligned models.
  - Design Motivation: Directly observing reasoning direction is infeasible, but it can be inferred indirectly through distance relationships in the hidden-state space.
  - Key Findings: Aligned models exhibit a preference for safe reasoning across all Transformer blocks, not merely in later layers.
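The probing criterion reduces to a distance comparison over hidden states. A minimal sketch, assuming the three hidden-state vectors have already been extracted from the model's forward pass (the toy vectors and the function name `prefers_safe_direction` are illustrative, not from the paper):

```python
import numpy as np

def prefers_safe_direction(h_clean, h_benign, h_malicious):
    """Probing criterion: an aligned model's hidden state for a clean malicious
    query should lie closer to the refusal-prefixed variant ("Sorry, I can't...")
    than to the compliance-prefixed variant ("Here's how...")."""
    d_benign = np.linalg.norm(h_clean - h_benign)
    d_malicious = np.linalg.norm(h_clean - h_malicious)
    return d_benign < d_malicious

# Toy hidden states standing in for real model activations (hypothetical values).
h_clean = np.array([1.0, 0.0, 0.0])
h_refuse = np.array([0.9, 0.1, 0.0])    # close to clean: aligned behavior
h_comply = np.array([-1.0, 0.5, 0.0])   # far from clean

print(prefers_safe_direction(h_clean, h_refuse, h_comply))  # True
```

For an unaligned model the inequality flips: the clean query's hidden state sits closer to the compliance-prefixed variant.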
- **Identification of Four Types of Computational Units (SCU/UCU/CU/RU):**
  - Function: Classifies model neurons/channels into four types: Safety-Critical Units (SCU), Utility-Critical Units (UCU), Composite Units (CU), and Redundant Units (RU).
  - Mechanism: A structured pruning strategy is employed. For each depth-2 module \(f(X) = B\sigma(AX)\), an importance score is computed as \(\mathbf{I}_{:,j} = \frac{1}{N-1}\sum_{n=1}^{N}(X^B_{n,j,:} - \bar{X}^B_{:,j,:})^2 \cdot \|\mathbf{W}^B_{:,j}\|_2^2\). Scores \(\mathbf{I_S}\) and \(\mathbf{I_U}\) are computed on safety and utility datasets respectively, and the four unit types are distinguished via their differences and sums.
  - Design Motivation: If safety alignment is truly a simple binary classification task, then only a small number of neurons should be required to establish safety guardrails.
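The importance score and the four-way split can be sketched in NumPy. This is a sketch under assumptions: the position-wise aggregation, the threshold `tau`, and the simple threshold-intersection rule are illustrative stand-ins for the paper's calibrated procedure:

```python
import numpy as np

def channel_importance(X, W):
    """Per-channel score: sample variance of activations (over the N
    calibration samples), summed over positions, times ||W[:, j]||^2."""
    var = X.var(axis=0, ddof=1).sum(axis=-1)   # (N, C, T) -> per-channel scalar
    w_norm_sq = (W ** 2).sum(axis=0)           # squared column norm per channel
    return var * w_norm_sq

def classify_units(I_safety, I_utility, tau=0.5):
    """Split channels into SCU / UCU / CU / RU by thresholding both scores."""
    s, u = I_safety > tau, I_utility > tau
    labels = np.full(I_safety.shape, "RU", dtype=object)
    labels[s & ~u] = "SCU"   # safety-critical only
    labels[~s & u] = "UCU"   # utility-critical only
    labels[s & u] = "CU"     # composite: important for both
    return labels

# Toy calibration batch: N=8 samples, 3 channels, 4 positions (hypothetical shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3, 4))
X[:, 1, :] *= 10.0                             # channel 1 varies most
W = np.ones((5, 3))
print(channel_importance(X, W))                # channel 1 dominates
```

Running `classify_units` on \(\mathbf{I_S}\) and \(\mathbf{I_U}\) computed from safety and utility calibration sets, respectively, yields the SCU/UCU/CU/RU partition used by the two strategies below.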
- **Freezing Strategy Against Fine-Tuning Attacks:**
  - Function: Freezes safety-critical components (SCU + top CU) during fine-tuning to prevent safety degradation.
  - Mechanism: Attribute-migration analysis reveals that fine-tuning converts SCUs and CUs into UCUs, causing safety degradation; freezing these units prevents such migration.
  - Effect: Freezing SCU + all CU reduces ASR on AdvBench for LLaMA2 from 11.92% to 2.88%.
- **Redundant Units as Alignment Budget:**
  - Function: Performs alignment fine-tuning exclusively on the redundant units (~20% of parameters) of the pre-trained model.
  - Mechanism: Approximately 20% of pre-trained model parameters are redundant; updating only these parameters achieves alignment while leaving utility-critical units unmodified.
  - Effect: Updating 20% of parameters achieves comparable alignment quality, while mathematical capability (GSM8K) improves from 9.24 to 13.4, outperforming full-parameter fine-tuning (8.8).
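Selecting the alignment budget can be sketched as taking the ~20% of channels with the lowest combined importance as the only trainable set (a sketch; the combination rule `I_safety + I_utility` and the toy scores are assumptions, not the paper's exact criterion):

```python
import numpy as np

def redundant_unit_mask(I_safety, I_utility, budget=0.2):
    """Mark the `budget` fraction of channels with the lowest combined
    importance as redundant units (RU) -- the only ones allowed to train."""
    combined = I_safety + I_utility
    k = max(1, int(budget * combined.size))
    ru_idx = np.argsort(combined)[:k]            # least important channels
    mask = np.zeros(combined.shape, dtype=bool)
    mask[ru_idx] = True
    return mask

# Toy per-channel scores (hypothetical): channel 3 is redundant on both datasets.
I_s = np.array([0.9, 0.1, 0.8, 0.05, 0.7])
I_u = np.array([0.8, 0.2, 0.9, 0.10, 0.6])
print(redundant_unit_mask(I_s, I_u))             # only channel 3 is trainable
```

Restricting alignment updates to this mask (the complement of the freezing mask above, at a 20% budget) is what lets alignment proceed without touching utility-critical units.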
### Loss & Training
- During pruning, an activation-variance-based importance score is used to remove channels/neurons in a structured manner.
- During fine-tuning, specific units are frozen; training epochs are doubled to ensure fair comparison (resulting in equivalent or lower final training loss).
## Key Experimental Results

### Main Results: Freezing Safety Units Against Fine-Tuning Attacks
| Model / Setting | AdvBench ASR (keyword) | AdvBench ASR (llama3-guard) | HEx-PHI Score | HEx-PHI Rate |
|---|---|---|---|---|
| LLaMA2 initial | 0.19% | 0.19% | 1.05 | 0.3% |
| LLaMA2 + Dolly fine-tune | 11.92% | 10.58% | 1.95 | 18.78% |
| LLaMA2 + freeze SCU+6%CU | 3.65% | 2.31% | 1.55 | 10.6% |
| LLaMA2 + freeze SCU+all CU | 2.88% | 1.92% | 1.48 | 9.0% |
| LLaMA3 initial | 1.54% | 1.15% | 1.16 | 3.0% |
| LLaMA3 + Dolly fine-tune | 61.15% | 50.58% | 2.95 | 37.2% |
| LLaMA3 + freeze SCU+all CU | 40.58% | 28.27% | 2.32 | 23.6% |
### Ablation Study: Impact of Pruning Each Unit Type
| Unit Type | Proportion | Utility Drop (LLaMA2) | Safety ASR Increase (LLaMA2) |
|---|---|---|---|
| SCU | 1.3% | −1.3% | +56.0% |
| UCU | 13.3% | −15.6% | +18.3% |
| RU | 14.8% | −2.8% | +4.6% |
| Dense (full model) | 100% | baseline | baseline (10.0%) |
## Key Findings
- SCUs are extremely sparse yet critical: Only 1.3% of neurons are responsible for safety; removing them causes ASR to surge from 10% to 66%.
- LLaMA3 is more fragile than LLaMA2: After fine-tuning, ASR spikes from 1.54% to 61.15%, possibly because LLaMA3 "analyzes" the true intent of malicious requests.
- PEFT methods degrade safety more severely than full-parameter fine-tuning: counter-intuitively, LoRA yields a 26.9% harmful rate vs. 18.48% for full-parameter fine-tuning.
- Aligning on redundant units improves rather than degrades mathematical ability: GSM8K improves from 9.24 → 13.4 (20% parameter fine-tuning) vs. 9.24 → 8.8 (full-parameter fine-tuning).
## Highlights & Insights
- Safety ≈ binary classification — a distinctive insight: Reducing safety alignment to a binary reasoning-direction selection is both parsimonious and highly explanatory. It elegantly explains why safety is so fragile — only a small number of neurons' "voting directions" need to be flipped to compromise safety.
- Redundant units as alignment budget: Pre-trained models naturally contain ~20% redundant parameters; using these for alignment avoids modifying utility-critical units. This idea is transferable to any scenario requiring new capabilities to be added without degrading existing ones.
- Attribute migration analysis framework: Tracking per-neuron attribute changes before and after fine-tuning yields a migration map (SCU → CU → UCU), providing a visualization tool for understanding how fine-tuning undermines alignment.
- Probing method for internal reasoning direction: Inferring reasoning direction by comparing hidden-state distances between Query+benign/malicious tokens is a simple yet effective approach.
## Limitations & Future Work
- SSAH provides necessary but not sufficient evidence: The authors acknowledge that the probing experiments are necessary but not sufficient; safety alignment may involve subtler changes not captured by SSAH.
- Per-step re-evaluation remains unrealized: The paper proposes that ideal safety alignment should re-select the reasoning direction at every generation step, but this incurs additional inference overhead.
- LLaMA3 experiments are constrained: Due to computational limitations, only the first 12 blocks are frozen, yielding weaker results than LLaMA2.
- Only the SFT scenario is validated: The behavior of SCUs/RUs under RLHF/DPO alignment is not explored.
- Future directions: SCU/RU identification could be combined with LoRA to design "safety-aware LoRA" — inserting LoRA adapters exclusively on RUs.
## Related Work & Insights
- vs. Wei et al. (2024): They also study safety-critical components but identify them at the weight level; this work operates at the neuron level with finer granularity, and experiments more thoroughly validate the effectiveness of the freezing strategy.
- vs. SafeDPO: SafeDPO constrains safety through training objectives, whereas this work takes a model-structure perspective; the two are complementary — SSAH-identified safety units can be combined with SafeDPO training objectives.
- vs. AlphaSteer: AlphaSteer achieves refusal steering via null-space constraints; both AlphaSteer and SSAH's freezing strategy aim to protect safety parameters from modification, but approach the problem from different angles.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The SSAH perspective is distinctive; the insight of reducing safety to binary classification is highly perceptive.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple models, benchmarks, and evaluation methods are employed, but LLaMA3 experiments are incomplete due to computational constraints.
- Writing Quality: ⭐⭐⭐⭐⭐ Logic is clear; the narrative progresses systematically from hypothesis → validation → application.
- Value: ⭐⭐⭐⭐⭐ Provides a theoretical foundation for understanding the nature of safety alignment and designing efficient safety training strategies.