
Superficial Safety Alignment Hypothesis

Conference: ICLR 2026
arXiv: 2410.10862
Code: https://ssa-h.github.io/
Area: AI Safety / LLM Alignment
Keywords: safety alignment, alignment fragility, neuron-level analysis, alignment tax, model pruning

TL;DR

This paper proposes the Superficial Safety Alignment Hypothesis (SSAH): safety alignment is essentially teaching a model to perform an implicit binary classification task (execute vs. refuse), requiring only ~1.3% of neurons to establish safety guardrails. Freezing these safety-critical units during fine-tuning preserves safety, and leveraging redundant units as an "alignment budget" eliminates the alignment tax.

Background & Motivation

Background: LLM safety alignment primarily relies on SFT, RLHF, and DPO, but these methods typically treat safety alignment as a subset of general alignment, overlooking its unique properties.

Limitations of Prior Work:

  • Safety mechanisms are extremely fragile — even fine-tuning on benign data can collapse safety guardrails (Qi et al., 2023).
  • An "alignment tax" exists — improving safety sacrifices general model capability.
  • Current approaches require full-parameter fine-tuning, incurring high computational costs.

Key Challenge: There is insufficient understanding of how safety alignment affects model behavior and why safety mechanisms are so fragile.

Goal: Three questions are addressed: How does safety alignment affect model behavior? Why is safety so fragile? How can these issues be mitigated?

Key Insight: A key observation — a model capable of fulfilling malicious requests already possesses the relevant knowledge and reasoning ability; thus, safety alignment only needs to teach the model to select the correct reasoning direction (execute vs. refuse), rather than injecting new knowledge.

Core Idea: Safety alignment ≈ implicit binary classification task, achievable with a very small fraction (~1.3%) of safety-critical neurons.

Method

Overall Architecture

SSAH is not a concrete algorithm but a hypothesis framework about the nature of safety alignment. The overall pipeline proceeds as follows: (1) propose the hypothesis and validate the existence of reasoning directions via probing experiments; (2) identify four types of neurons (SCU/UCU/CU/RU) through structured pruning; (3) derive two practical strategies based on the identified units — freezing safety units to resist fine-tuning attacks, and exploiting redundant units to reduce the alignment tax.

Key Designs

  1. Superficial Safety Alignment Hypothesis (SSAH):
     • Function: Reframes safety alignment as an implicit safety-related binary classification task.
     • Mechanism: A model capable of executing malicious requests already possesses the relevant knowledge; safety alignment only needs to teach it to select the correct "reasoning direction" — whether to execute or refuse a request. Alignment also provides a standardized refusal mechanism and alternative response templates.
     • Design Motivation: Compared to the general Safety Alignment Hypothesis (SAH), SSAH is more specific and verifiable — it focuses on models that already possess the requisite knowledge, eliminating confounding factors arising from knowledge deficits.
     • Explanation of Jailbreaks: Current alignment only determines the reasoning direction at the initial token; attackers bypass the safety mechanism by manipulating tokens. Ideal alignment should re-evaluate the reasoning direction at every generation step.

  2. Probing Experiments to Validate Reasoning Directions:
     • Function: Validates that safety alignment indeed shifts the model's reasoning direction by comparing hidden-state distances.
     • Mechanism: Three types of queries are constructed — original malicious queries (Clean), malicious queries prepended with a benign token ("Sorry, I can't..."), and malicious queries prepended with a malicious token ("Here's how..."). For an aligned model, the hidden-state distance between Clean and the benign-token variant should be smaller than that between Clean and the malicious-token variant; the reverse holds for unaligned models.
     • Design Motivation: Directly observing reasoning direction is infeasible, but it can be inferred indirectly through distance relationships in the hidden-state space.
     • Key Findings: Aligned models exhibit a preference for safe reasoning across all Transformer blocks, not merely in later layers.
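
A minimal sketch of this probing comparison in Python, assuming a HuggingFace causal LM; the checkpoint name, placeholder query, probe-token placement, and the L2 distance are illustrative choices rather than the paper's exact setup:

```python
# Illustrative probing sketch: compare per-block hidden-state distances between a clean
# malicious query and the same query carrying a benign vs. malicious probe token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # swap in an unaligned checkpoint to contrast
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def last_token_states(text: str) -> list[torch.Tensor]:
    """Hidden state at the final position of every transformer block."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return [h[0, -1] for h in out.hidden_states[1:]]  # drop the embedding layer

query = "<placeholder malicious query>"
clean = last_token_states(query)
benign = last_token_states("Sorry, I can't. " + query)  # benign-token variant
harmful = last_token_states("Here's how. " + query)     # malicious-token variant

for block, (c, b, h) in enumerate(zip(clean, benign, harmful)):
    d_benign = torch.norm(c - b).item()
    d_harmful = torch.norm(c - h).item()
    print(f"block {block:2d}: |clean-benign| = {d_benign:.2f}  |clean-harmful| = {d_harmful:.2f}")
# Expected for an aligned model: d_benign < d_harmful across all blocks (the probing claim).
```

Running the same script on an unaligned base checkpoint should show the opposite ordering of the two distances.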

  3. Identification of Four Types of Computational Units (SCU/UCU/CU/RU):
     • Function: Classifies model neurons/channels into four types: Safety-Critical Units (SCU), Utility-Critical Units (UCU), Composite Units (CU), and Redundant Units (RU).
     • Mechanism: A structured pruning strategy is employed. For each depth-2 module \(f(X) = B\sigma(AX)\), the importance of intermediate channel \(j\) is computed as \(\mathbf{I}_{:,j} = \frac{1}{N-1}\sum_{n=1}^{N}(X^B_{n,j,:} - \bar{X}^B_{:,j,:})^2 \cdot \|\mathbf{W}^B_{:,j}\|_2^2\). Scores \(\mathbf{I}_S\) and \(\mathbf{I}_U\) are computed on safety and utility datasets respectively, and the four unit types are distinguished via their differences and sums.
     • Design Motivation: If safety alignment is truly a simple binary classification task, then only a small number of neurons should be required to establish safety guardrails.
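
A minimal sketch of the importance score and the four-way split. The activation collection (e.g., a forward hook on the input of the second projection \(B\)), the score normalization, and the quantile cutoffs are assumptions for illustration; the summary only states that the types are separated via differences and sums of \(\mathbf{I}_S\) and \(\mathbf{I}_U\):

```python
# Sketch: per-channel importance (activation variance times squared column norm of B) and
# a four-way SCU/UCU/CU/RU split from safety vs. utility scores. Cutoffs are illustrative.
import torch

def channel_importance(acts: torch.Tensor, W_B: torch.Tensor) -> torch.Tensor:
    """acts: (N, d_mid) activations entering the second projection B, one row per sample;
    W_B: (d_out, d_mid). Score for channel j = sample variance of acts[:, j] (the
    1/(N-1) sum of squared deviations) times ||W_B[:, j]||_2^2."""
    return acts.var(dim=0, unbiased=True) * W_B.pow(2).sum(dim=0)

def classify_units(I_s: torch.Tensor, I_u: torch.Tensor, q: float = 0.95, r: float = 0.20):
    """Split channels using differences and sums of the (normalized) safety/utility scores."""
    I_s, I_u = I_s / I_s.sum(), I_u / I_u.sum()        # put both scores on a comparable scale
    diff, total = I_s - I_u, I_s + I_u
    scu = diff >= diff.quantile(q)                     # safety-critical: safety >> utility
    ucu = diff <= diff.quantile(1 - q)                 # utility-critical: utility >> safety
    ru = (total <= total.quantile(r)) & ~(scu | ucu)   # redundant: low importance on both
    cu = ~(scu | ucu | ru)                             # composite: relevant to both tasks
    return scu, ucu, cu, ru
```

Computing `channel_importance` once on a safety dataset and once on a utility dataset yields \(\mathbf{I}_S\) and \(\mathbf{I}_U\) for each module, which `classify_units` then turns into per-channel masks.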

  4. Freezing Strategy Against Fine-Tuning Attacks:
     • Function: Freezes safety-critical components (SCU + top CU) during fine-tuning to prevent safety degradation.
     • Mechanism: Attribute migration analysis reveals that fine-tuning converts SCUs and CUs into UCUs, causing safety degradation. Freezing these units prevents such attribute migration.
     • Effect: Freezing SCU + all CU reduces ASR on AdvBench for LLaMA2 from 11.92% to 2.88%.
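
Because the protected units are individual channels rather than whole weight tensors, `requires_grad` alone cannot freeze them; a minimal sketch of channel-level freezing via a gradient hook, where the helper name, the focus on the MLP down-projection, and the module paths are assumptions:

```python
# Sketch: freeze safety-critical channels during fine-tuning by zeroing their gradients.
# requires_grad is per-tensor, so channel-level freezing uses a gradient hook instead.
import torch
from torch import nn

def freeze_channels(linear: nn.Linear, frozen_cols: torch.Tensor) -> None:
    """frozen_cols: boolean mask over the input channels (columns) of this projection.
    Gradients flowing into those columns are zeroed, so the protected weights never move."""
    mask = (~frozen_cols).unsqueeze(0).to(linear.weight)    # (1, d_mid), 0.0 where frozen
    linear.weight.register_hook(lambda grad: grad * mask)   # applied on every backward pass

# Hypothetical usage: protect SCU + CU columns of each block's down-projection, then run
# the usual fine-tuning loop unchanged.
# for block, (scu, ucu, cu, ru) in zip(model.model.layers, unit_masks):
#     freeze_channels(block.mlp.down_proj, scu | cu)
```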

  5. Redundant Units as Alignment Budget:
     • Function: Performs alignment fine-tuning exclusively on the redundant units (~20% of parameters) of the pre-trained model.
     • Mechanism: Approximately 20% of pre-trained model parameters are redundant; updating only these parameters achieves alignment while avoiding modification of utility-critical units.
     • Effect: Updating 20% of parameters achieves comparable alignment quality, while mathematical capability (GSM8K) improves from 9.24 to 13.4 — outperforming full-parameter fine-tuning, which yields 8.8.
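
Under the same assumptions, the alignment-budget strategy is just the complementary mask: when aligning the pre-trained model, freeze everything except the redundant channels (reusing the hypothetical `freeze_channels` helper and `unit_masks` from the sketch above):

```python
# Alignment on the "budget" only: every non-redundant column is frozen, so roughly 20%
# of channels receive updates while utility-critical units stay untouched.
for block, (scu, ucu, cu, ru) in zip(model.model.layers, unit_masks):
    freeze_channels(block.mlp.down_proj, ~ru)
```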

Loss & Training

  • During pruning, an activation-variance-based importance score is used to remove channels/neurons in a structured manner.
  • During fine-tuning, specific units are frozen; training epochs are doubled to ensure fair comparison (resulting in equivalent or lower final training loss).
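
A minimal sketch of the structured pruning step for one depth-2 module \(f(X) = B\sigma(AX)\): dropping intermediate channel \(j\) removes row \(j\) of \(A\) and column \(j\) of \(B\). The keep ratio and helper name are illustrative:

```python
# Sketch: structurally remove the lowest-importance intermediate channels of one depth-2
# module f(X) = B*sigma(A*X). Dropping channel j deletes row j of A and column j of B.
import torch
from torch import nn

def prune_module(A: nn.Linear, B: nn.Linear, importance: torch.Tensor, keep_frac: float = 0.9):
    """Keep only the top `keep_frac` fraction of channels by importance score."""
    k = max(1, int(keep_frac * importance.numel()))
    keep = importance.topk(k).indices.sort().values   # indices of channels to retain

    new_A = nn.Linear(A.in_features, k, bias=A.bias is not None)
    new_A.weight.data = A.weight.data[keep].clone()   # rows of A for kept channels
    if A.bias is not None:
        new_A.bias.data = A.bias.data[keep].clone()

    new_B = nn.Linear(k, B.out_features, bias=B.bias is not None)
    new_B.weight.data = B.weight.data[:, keep].clone()  # columns of B for kept channels
    if B.bias is not None:
        new_B.bias.data = B.bias.data.clone()            # output bias is unchanged
    return new_A, new_B
```

Pruning the SCU, UCU, or RU channels in turn and re-measuring utility and ASR gives the ablation reported in the results below.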

Key Experimental Results

Main Results: Freezing Safety Units Against Fine-Tuning Attacks

| Model / Setting | AdvBench ASR (keyword) | AdvBench ASR (llama3-guard) | HEx-PHI Score | HEx-PHI Rate |
| --- | --- | --- | --- | --- |
| LLaMA2 initial | 0.19% | 0.19% | 1.05 | 0.3% |
| LLaMA2 + Dolly fine-tune | 11.92% | 10.58% | 1.95 | 18.78% |
| LLaMA2 + freeze SCU + 6% CU | 3.65% | 2.31% | 1.55 | 10.6% |
| LLaMA2 + freeze SCU + all CU | 2.88% | 1.92% | 1.48 | 9.0% |
| LLaMA3 initial | 1.54% | 1.15% | 1.16 | 3.0% |
| LLaMA3 + Dolly fine-tune | 61.15% | 50.58% | 2.95 | 37.2% |
| LLaMA3 + freeze SCU + all CU | 40.58% | 28.27% | 2.32 | 23.6% |

Ablation Study: Impact of Pruning Each Unit Type

| Unit Type (pruned) | Proportion | Utility Drop (LLaMA2) | Safety ASR Increase (LLaMA2) |
| --- | --- | --- | --- |
| SCU | 1.3% | −1.3% | +56.0% |
| UCU | 13.3% | −15.6% | +18.3% |
| RU | 14.8% | −2.8% | +4.6% |
| Dense (full model) | 100% | baseline | baseline (ASR 10.0%) |

Key Findings

  • SCUs are extremely sparse yet critical: Only 1.3% of neurons are responsible for safety; removing them causes ASR to surge from 10% to 66%.
  • LLaMA3 is more fragile than LLaMA2: After fine-tuning, ASR spikes from 1.54% to 61.15%, possibly because LLaMA3 "analyzes" the true intent of malicious requests.
  • PEFT methods degrade safety more severely than full-parameter fine-tuning: LoRA causes a 26.9% harmful rate vs. 18.48% for full-parameter fine-tuning — counter-intuitively.
  • Aligning on redundant units improves rather than degrades mathematical ability: GSM8K improves from 9.24 → 13.4 (20% parameter fine-tuning) vs. 9.24 → 8.8 (full-parameter fine-tuning).

Highlights & Insights

  • Safety ≈ binary classification — a distinctive insight: Reducing safety alignment to a binary reasoning-direction selection is both parsimonious and highly explanatory. It elegantly explains why safety is so fragile — only a small number of neurons' "voting directions" need to be flipped to compromise safety.
  • Redundant units as alignment budget: Pre-trained models naturally contain ~20% redundant parameters; using these for alignment avoids modifying utility-critical units. This idea is transferable to any scenario requiring new capabilities to be added without degrading existing ones.
  • Attribute migration analysis framework: Tracking per-neuron attribute changes before and after fine-tuning yields a migration map (SCU → CU → UCU), providing a visualization tool for understanding how fine-tuning undermines alignment.
  • Probing method for internal reasoning direction: Inferring reasoning direction by comparing hidden-state distances between Query+benign/malicious tokens is a simple yet effective approach.

Limitations & Future Work

  • SSAH provides necessary but not sufficient evidence: The authors acknowledge that the probing experiments are necessary but not sufficient; safety alignment may involve subtler changes not captured by SSAH.
  • Per-step re-evaluation remains unrealized: The paper proposes that ideal safety alignment should re-select the reasoning direction at every generation step, but this incurs additional inference overhead.
  • LLaMA3 experiments are constrained: Due to computational limitations, only the first 12 blocks are frozen, yielding weaker results than LLaMA2.
  • Only the SFT scenario is validated: The behavior of SCUs/RUs under RLHF/DPO alignment is not explored.
  • Future directions: SCU/RU identification could be combined with LoRA to design "safety-aware LoRA" — inserting LoRA adapters exclusively on RUs.
Comparison with Related Work

  • vs. Wei et al. (2024): They also study safety-critical components but identify them at the weight level; this work operates at the neuron (channel) level, and its experiments more thoroughly validate the effectiveness of the freezing strategy.
  • vs. SafeDPO: SafeDPO constrains safety through training objectives, whereas this work takes a model-structure perspective; the two are complementary — SSAH-identified safety units can be combined with SafeDPO training objectives.
  • vs. AlphaSteer: AlphaSteer achieves refusal steering via null-space constraints; both AlphaSteer and SSAH's freezing strategy aim to protect safety parameters from modification, but approach the problem from different angles.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The SSAH perspective is distinctive; the insight of reducing safety to binary classification is highly perceptive.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple models, benchmarks, and evaluation methods are employed, but LLaMA3 experiments are incomplete due to computational constraints.
  • Writing Quality: ⭐⭐⭐⭐⭐ Logic is clear; the narrative progresses systematically from hypothesis → validation → application.
  • Value: ⭐⭐⭐⭐⭐ Provides a theoretical foundation for understanding the nature of safety alignment and designing efficient safety training strategies.