SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

Conference: ACL 2026
arXiv: 2503.17239
Code: GitHub
Area: LLM Alignment / Safety
Keywords: Safety Alignment, Model Merging, LoRA Fine-Tuning, Post-Fine-Tuning Defense, Selective Layer Merging

TL;DR

This paper proposes SafeMERGE, a lightweight post-fine-tuning framework that identifies fine-tuned layers deviating from safe behavior via cosine similarity and merges only these layers with corresponding layers of a safety model. It significantly reduces harmful outputs across four LLMs while maintaining or even enhancing task performance.

Background & Motivation

Background: Fine-tuning LLMs for specific domains is common practice, but prior research indicates that fine-tuning, even on harmless data, can erode safety alignment, and just a few malicious samples can cause an aligned model to comply with harmful requests. Safety alignment has been shown to be "shallow" and easily compromised during fine-tuning.

Limitations of Prior Work: (1) Alignment-stage defenses require modifying the initial alignment pipeline, which is impractical for practitioners who only fine-tune released models; (2) Fine-tuning-stage defenses require custom training algorithms, making them difficult to integrate with standard open-source libraries; (3) Simple post-fine-tuning defenses (such as full-layer merging like RESTA) often sacrifice task performance for safety.

Key Challenge: How to restore safety after fine-tuning without modifying existing training processes or compromising task performance?

Goal: Design a simple, plug-and-play post-fine-tuning framework that performs selective merging only when necessary (when layers deviate from safe behavior).

Key Insight: Utilize the weight difference between the aligned model and the base model to define a "safety alignment subspace" and detect whether fine-tuned LoRA layers deviate from this subspace using cosine similarity.

Core Idea: Merge only those layers that deviate from safe behavior while preserving the task performance of other layers—selectivity is superior to global merging.

Method

Overall Architecture

SafeMERGE involves three steps: (1) Train a safety LoRA model on public safety datasets (trained once, then reused); (2) Detect which layers of the fine-tuned model are "unsafe" via safety subspace projection; (3) Linearly merge only the unsafe layers with the corresponding layers of the safety model.

Key Designs

  1. Safety Alignment Subspace and Layer Selection:

    • Function: Automatically identify layers that deviate from safe behavior after fine-tuning.
    • Mechanism: The safety subspace of layer \(i\) is defined by \(V^i = W_{aligned}^i - W_{unaligned}^i\) (the weight difference between the aligned model and its base version). Compute the cosine similarity \(\rho^i\) between the fine-tuned LoRA update \(\Delta W_f^i\) and its projection \(C^i \Delta W_f^i\) onto the safety subspace, where \(C^i\) is the projection matrix constructed from \(V^i\). If \(\rho^i < \tau\) (threshold), the layer is marked as unsafe.
    • Design Motivation: SafeLoRA applies projection to all layers uniformly, which damages task performance; SafeMERGE intervenes only in the deviating layers, preserving the learning of other layers.
  2. Selective Layer Merging:

    • Function: Perform safety restoration only for unsafe layers.
    • Mechanism: For layers marked as unsafe, perform linear merging \(\Delta W_{merge}^i = \alpha \Delta W_f^i + (1-\alpha) \Delta W_s^i\), where \(\Delta W_s^i\) is the corresponding layer of the safety model. \(\alpha\) controls the trade-off between task performance and safety. Safe layers maintain the fine-tuned weights unchanged.
    • Design Motivation: Global merging (RESTA) applies safety correction to all layers, modifying even those that are already safe, which unnecessarily compromises task performance.
  3. Safety Model Construction:

    • Function: Provide safety reference layers for merging.
    • Mechanism: LoRA fine-tune an aligned model on public safety datasets (harmful prompt + safe response pairs). Test different data volumes (100/500/1000/2500 samples) and select the model with the lowest toxicity score. The safety model is task-agnostic: it is trained once and reused across tasks.
    • Design Motivation: The safety model provides a parameterized representation of "safe behavior," giving the merging process a clear target.
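
The layer selection and merging steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the projection matrix is assumed to take the SafeLoRA-style form \(C^i = V^i V^{i\top} / \lVert V^i \rVert_F\), which this summary does not spell out.

```python
import numpy as np


def projection_matrix(w_aligned, w_unaligned):
    """Build C from V = W_aligned - W_unaligned.

    Assumption: C = V V^T / ||V||_F (SafeLoRA-style). The summary only
    states that C projects onto the safety subspace.
    """
    v = w_aligned - w_unaligned
    return v @ v.T / np.linalg.norm(v)  # Frobenius norm for 2-D input


def safety_similarity(delta_f, c):
    """Cosine similarity (Frobenius inner product) of delta_f and C @ delta_f."""
    projected = c @ delta_f
    denom = np.linalg.norm(delta_f) * np.linalg.norm(projected)
    if denom == 0.0:
        # Degenerate case (zero update or zero projection): treat as deviating.
        return 0.0
    return float(np.sum(delta_f * projected) / denom)


def selective_merge(deltas_f, deltas_s, projections, tau, alpha):
    """Merge only layers with rho < tau; leave the others untouched.

    deltas_f: per-layer fine-tuned LoRA updates (Delta W_f)
    deltas_s: per-layer safety-model LoRA updates (Delta W_s)
    Returns the merged updates and the indices of layers flagged as unsafe.
    """
    merged, unsafe = [], []
    for i, (d_f, d_s, c) in enumerate(zip(deltas_f, deltas_s, projections)):
        if safety_similarity(d_f, c) < tau:
            merged.append(alpha * d_f + (1.0 - alpha) * d_s)
            unsafe.append(i)
        else:
            merged.append(d_f)
    return merged, unsafe
```

Only matrix products, norms, and a weighted sum are involved, so a sketch like this needs no gradient computation and no accelerator.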

Loss & Training

The safety model is fine-tuned using standard LoRA. SafeMERGE itself requires no training—it only involves computing cosine similarities and linear merging, which can run entirely on a CPU. Evaluation uses Llama-Guard-3-8B and ShieldGemma-9B for cross-verification.

Key Experimental Results

Main Results

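Results on Llama-3.1 (↑ higher is better, ↓ lower is better):

| Method | GSM8K↑ | DirectHarm↓ | HexPhi↓ |
| --- | --- | --- | --- |
| Original Aligned Model | 73.80 | 11.30 | 7.90 |
| After Fine-Tuning | 78.24 | 28.30 | 14.70 |
| SafeInstruct | 77.40 | 12.50 | 7.20 |
| RESTA | 74.20 | 11.90 | 6.90 |
| SafeLoRA | 77.90 | 15.10 | 7.10 |
| SafeMERGE | 78.50 | 8.80 | 6.30 |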
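
To make the size of the effect concrete, the short calculation below compares the SafeMERGE row against the unprotected fine-tuned row. It uses only the Llama-3.1 numbers reported above; the variable and function names are illustrative.

```python
# Llama-3.1 numbers copied from the results above: (GSM8K, DirectHarm, HexPhi)
fine_tuned = (78.24, 28.30, 14.70)
safemerge = (78.50, 8.80, 6.30)


def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


gsm8k_gain = safemerge[0] - fine_tuned[0]
directharm_cut = relative_reduction(fine_tuned[1], safemerge[1])
hexphi_cut = relative_reduction(fine_tuned[2], safemerge[2])

print(f"GSM8K change: {gsm8k_gain:+.2f}")
print(f"DirectHarm relative reduction: {directharm_cut:.1f}%")
print(f"HexPhi relative reduction: {hexphi_cut:.1f}%")
```

Relative to the fine-tuned model, the DirectHarm score falls by roughly 69% and the HexPhi score by roughly 57%, while GSM8K changes by +0.26 points.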

Ablation Study

| Analysis Dimension | Result |
| --- | --- |
| Merging Strategy (Linear vs DARE vs TIES) | Linear merging is sufficient |
| Threshold τ Sensitivity | Larger τ merges more layers, increasing safety but potentially decreasing task performance |
| Safety Data Volume | 500-1000 samples are usually optimal |
| Weighting Schemes | Uniform α generally performs well |
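
The threshold behavior follows directly from the selection rule \(\rho^i < \tau\): raising \(\tau\) can only add layers to the unsafe set. The toy sweep below illustrates this with made-up similarity scores; the values and the helper name are hypothetical and not taken from the paper.

```python
def unsafe_layers(similarities, tau):
    """Indices of layers whose similarity rho falls below the threshold tau."""
    return [i for i, rho in enumerate(similarities) if rho < tau]


# Hypothetical per-layer similarity scores, for illustration only.
toy_similarities = [0.9, 0.2, 0.5, 0.7]

for tau in (0.3, 0.6, 0.8):
    print(tau, unsafe_layers(toy_similarities, tau))
```

With these toy scores, the flagged set grows from one layer at τ = 0.3 to three layers at τ = 0.8, which is why a larger τ trades task performance for safety.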

Key Findings

  • SafeMERGE consistently outperforms or matches baselines across all 4 LLMs × 2 task settings.
  • On Llama-3.1, SafeMERGE exceeds the GSM8K score of both the original aligned model (78.50 vs 73.80) and the unprotected fine-tuned model (78.50 vs 78.24), while scoring lower on DirectHarm than the aligned model (8.80 vs 11.30).
  • Selective merging is superior to full-layer merging (RESTA): RESTA's GSM8K score is more than four points below SafeMERGE's (74.20 vs 78.50).
  • The safety model is reusable across tasks, eliminating the need for retraining for every new task.

Highlights & Insights

  • The intuition of "fixing only the layers that need fixing" is simple but highly effective—selective intervention is superior to global intervention.
  • The ability to run entirely on a CPU without retraining makes it highly valuable for practical deployment.
  • Because the safety model is trained once and reused across tasks, the cost of adoption is significantly reduced.

Limitations & Future Work

  • The definition of the safety subspace depends on the availability of both the aligned and base models—not all models release their base versions.
  • Validation was performed only on 7B-8B models; the layer selection characteristics of larger models might differ.
  • The threshold \(\tau\) requires tuning, and there is currently no automatic selection method.
  • Only LoRA fine-tuning was considered; the applicability to full-parameter fine-tuning scenarios remains unknown.

Comparison with Related Work

  • vs SafeLoRA: SafeLoRA uniformly projects all layers into the safety subspace, losing some task information; SafeMERGE selectively merges only unsafe layers.
  • vs RESTA: RESTA globally subtracts a "harmful task vector" without distinguishing between safe and unsafe layers; SafeMERGE’s selective strategy is more granular.
  • vs SafeInstruct: SafeInstruct mixes safety samples into the training data, requiring modification of the training process; SafeMERGE is an entirely post-processing approach.

Rating

  • Novelty: ⭐⭐⭐ The idea of selective merging is intuitive and effective, though technically it is a combination of existing methods.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 models × 5 tasks, cross-verification, and extensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear and concise, with an intuitive description of the method.
  • Value: ⭐⭐⭐⭐⭐ Extremely high practical value—simple, effective, and plug-and-play.