EASE: Practical and Efficient Safety Alignment for Small Language Models

Conference: AAAI 2026
arXiv: 2511.06512
Code: https://github.com/horanshi/EASE
Area: LLM Alignment / SLM Safety
Keywords: small language models, safety alignment, selective reasoning, knowledge distillation, jailbreak defense

TL;DR

This paper proposes EASE, a safety alignment framework for edge-deployed small language models (SLMs) that addresses the tension between "shallow refusal being insufficiently robust" and "deep reasoning being prohibitively expensive" via a two-stage design. Stage one distills safety reasoning capabilities from a large reasoning model into the SLM; stage two applies selective reasoning activation, enabling reasoning only for adversarial queries in vulnerable semantic regions while responding directly to benign queries. EASE reduces the jailbreak attack success rate by 17% compared to shallow alignment, while cutting reasoning overhead by 90% compared to full-reasoning alignment.

Background & Motivation

State of the Field

SLMs (<8B parameters) are widely deployed on edge devices (mobile phones, IoT) due to their compact size and fast inference, yet existing safety alignment methods are primarily designed for large-scale models.

Limitations of Prior Work

Shallow alignment (directly refusing malicious queries) is not robust against complex jailbreak attacks and can be bypassed with relative ease.

Root Cause

Deep reasoning-based alignment, in which the model reasons before deciding whether to refuse (analogous to o1), is too costly for SLMs: edge devices cannot sustain the associated inference overhead. Moreover, the limited capacity of SLMs makes it difficult to instill such reasoning directly from a small amount of training data.

The core tension is that safety reasoning improves robustness but increases inference latency, while edge deployment demands both robustness and low latency.

Paper Goals

To make SLMs more robust against jailbreak attacks without significantly increasing inference cost.

Starting Point

Selective activation: enable safety reasoning only for queries that genuinely require deep reasoning (adversarial queries), while responding directly and quickly to benign queries.

Core Idea

Distilling safety reasoning capabilities, then activating that reasoning only for vulnerable queries, yields SLM safety alignment that is both efficient and robust.

Method

Overall Architecture

The framework consists of two stages. Stage 1 (distillation from a large safety reasoning teacher): the SLM learns safety reasoning chains from a teacher model, acquiring foundational reasoning-based safety judgment. Stage 2 (selective reasoning activation calibration): a diagnostic dataset is used to identify the jailbreak query types for which the distilled SLM remains vulnerable; reasoning is activated for these types, while benign queries and simple jailbreaks receive direct responses.
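
A minimal inference-time sketch may help fix the idea. Routing is implicit in the fine-tuned model itself; the `<think>` tag convention, the `slm.generate` call, and the `split_thinking` helper below are illustrative assumptions, not the paper's actual output format or API.

```python
# After EASE training, the single fine-tuned SLM emits either a direct
# response or a reasoning chain followed by its final answer. There is
# no external classifier or router at inference time.

def respond(slm, query: str) -> str:
    output = slm.generate(query)                     # hypothetical model API
    if output.lstrip().startswith("<think>"):        # learned reasoning path
        _reasoning, answer = split_thinking(output)  # hypothetical parser
        return answer                                # chain stays internal
    return output                                    # fast direct path
```

Benign queries take the cheap direct branch, which is where the bulk of the overhead saving comes from.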

Key Designs

  1. Safety Reasoning Distillation (Stage 1):

    • Function: Distill deep safety reasoning capabilities from a large model into the SLM.
    • Mechanism: A large reasoning model (e.g., Claude-3.7-Sonnet) generates safety reasoning chains over jailbreak data (e.g., "this request attempts to elicit harmful content → because... → therefore I should refuse"), which are then used for supervised fine-tuning (SFT).
    • Design Motivation: SLMs cannot acquire deep safety reasoning from limited data alone and require demonstration from a capable teacher model.
  2. Vulnerable Query Diagnosis:

    • Function: Identify the jailbreak types for which the SLM remains particularly vulnerable.
    • Mechanism: The distilled SLM is evaluated on a diagnostic set to identify query types that still result in successful jailbreaks; these constitute the "vulnerable semantic regions" requiring reasoning activation.
    • Design Motivation: Not all jailbreaks require reasoning — shallow refusal suffices for simpler attacks.
  3. Selective Reasoning Activation (Stage 2):

    • Function: Enable safety reasoning exclusively for vulnerable query types.
    • Mechanism: Two training sets are constructed: \(\mathcal{D}_{reason}\) (vulnerable jailbreaks paired with reasoning-chain responses) and \(\mathcal{D}_{direct}\) (benign queries and simple jailbreaks paired with direct responses). After mixed training, the model autonomously decides when to reason (see the sketch after this list).
    • Design Motivation: roughly 90% of reasoning overhead is eliminated, since only queries in vulnerable semantic regions trigger the slow reasoning path.
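
The sketch below shows how designs 2 and 3 fit together at training time. The `attack_type` attribute, the helper functions, and the 10% threshold are illustrative assumptions rather than details from the paper.

```python
from collections import defaultdict

def diagnose_vulnerable_types(slm, diagnostic_set, asr_threshold=0.1):
    """Flag attack types on which the distilled SLM still gets jailbroken."""
    hits, totals = defaultdict(int), defaultdict(int)
    for query in diagnostic_set:
        totals[query.attack_type] += 1
        if jailbreak_succeeds(slm, query):           # hypothetical judge
            hits[query.attack_type] += 1
    return {t for t in totals if hits[t] / totals[t] > asr_threshold}

def build_stage2_data(diagnostic_set, benign_queries, vulnerable_types):
    """Pair vulnerable jailbreaks with reasoning-chain targets and everything
    else with direct targets, then mix the two sets for SFT."""
    d_reason = [(q.text, reasoning_response(q)) for q in diagnostic_set
                if q.attack_type in vulnerable_types]
    d_direct = ([(q.text, direct_refusal(q)) for q in diagnostic_set
                 if q.attack_type not in vulnerable_types]
                + [(q, direct_answer(q)) for q in benign_queries])
    return d_reason + d_direct                       # mixed SFT corpus
```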

Loss & Training

  • Standard SFT loss in both stages (written out below).
  • Two-stage training: distillation followed by calibration.
  • Teacher model: Claude-3.7-Sonnet (thinking mode), the same teacher used for Stage 1 distillation.
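
Written out, the Stage 2 objective is the textbook token-level SFT loss over the mixed corpus (a standard formulation, not a formula copied from the paper):

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\,\sim\,\mathcal{D}_{reason}\,\cup\,\mathcal{D}_{direct}} \sum_{t=1}^{|y|} \log \pi_{\theta}\left(y_t \mid x,\, y_{<t}\right)
\]

where \(x\) is the query and \(y\) is the target response: a reasoning chain followed by the final verdict for \(\mathcal{D}_{reason}\), and a direct answer or refusal for \(\mathcal{D}_{direct}\). Stage 1 uses the same loss over the teacher's reasoning traces.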

Key Experimental Results

Main Results

| Method | Attack Success Rate ↓ | Reasoning Overhead (relative to shallow alignment) |
|---|---|---|
| Shallow Alignment | Baseline | 1× |
| Full Reasoning Alignment | Lowest | 10× |
| EASE (Selective) | −17% vs. shallow | 1.1× (+10% only) |

Ablation Study

| Configuration | Attack Success Rate | Notes |
|---|---|---|
| Shallow refusal only | High | Not robust |
| Distillation + full reasoning | Lowest | Too expensive |
| Distillation + random reasoning | Moderate | Imprecise |
| EASE (diagnosis + selective) | Low | Efficient + robust |

Key Findings

  • Selective activation reduces reasoning overhead by 90% — the majority of queries do not require deep reasoning.
  • Vulnerable query diagnosis is critical for precise targeting — randomly selecting when to reason leads to substantially worse outcomes.
  • Distillation quality determines the performance ceiling — a stronger teacher yields better SLM safety reasoning.
  • Shallow refusal suffices for simple jailbreaks — the selective strategy does not compromise defense on this subset.

Highlights & Insights

  • The insight that "safety reasoning need not be activated for all queries" is highly practical for real-world deployment, given that approximately 90% of requests are benign.
  • The vulnerable semantic region diagnosis methodology is generalizable to any scenario requiring selective capability augmentation.
  • The two-stage paradigm of distillation followed by selective activation offers direct guidance for safety in edge AI systems.

Limitations & Future Work

  • Vulnerable region diagnosis relies on known jailbreak types; novel attack patterns may escape detection.
  • The reasoning quality of the distillation teacher constitutes a hard performance ceiling.
  • Experiments cover only a limited set of SLMs (<8B); behavior on larger models remains unexplored.
  • The threshold for the selective strategy requires careful tuning.
  • An adversary aware of the selective strategy may attempt to circumvent it in a targeted manner.

Comparison with Related Work

  • vs. Circuit Breakers: Circuit Breakers defend by modifying internal representations, whereas EASE defends via explicit reasoning chains; the mechanisms are fundamentally different.
  • vs. DeepSeek-R1 Distillation: R1-style distillation targets general-purpose reasoning; EASE focuses specifically on the safety reasoning dimension.
  • vs. Safe RLHF: Safe RLHF requires online reinforcement learning; EASE relies solely on SFT, making it considerably simpler to apply.
  • Practical value: the approach applies directly to the safe deployment of edge AI systems.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of selective reasoning activation and vulnerable region diagnosis is creative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple SLMs, multiple attack methods, and efficiency analysis.
  • Writing Quality: ⭐⭐⭐⭐ The two-stage design logic is clearly presented.
  • Value: ⭐⭐⭐⭐ Directly applicable to SLM safety on edge devices.