BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
Conference: ICML 2026
arXiv: 2602.00767
Code: https://github.com/ (mentioned in the paper)
Area: Mechanistic Interpretability / LLM Alignment / Safety
Keywords: emergent misalignment, sparse autoencoder, latent blocking, training-time intervention
TL;DR
BLOCK-EM uses sparse autoencoders (SAEs) to identify a small set of internal latents that "causally control emergent misalignment." It then applies a one-sided regularization during narrow-domain SFT to prevent the model from amplifying these latents in a "misaligned direction." This approach reduces emergent misalignment by an average of 93% across 6 fine-tuning domains while largely preserving in-domain task performance.
Background & Motivation
Background: Betley et al. (2025) revealed a counter-intuitive phenomenon: when performing supervised fine-tuning (SFT) in a narrow domain (e.g., "giving bad financial advice"), models not only learn the target task but also generalize to broadly harmful behaviors unrelated to the training data (emergent misalignment, EM). Wang et al. (2025) further used SAEs to attribute EM to a few "persona features," showing that causal steering of these latents can both induce and repair misalignment. This represents a new pathway from "mechanistic interpretability → practical alignment intervention."
Limitations of Prior Work: Existing training-time defenses are either (i) coarse-grained KL regularization, which penalizes overall output deviation from the base model, yields limited gains for EM, and hurts learning; (ii) inoculation prompting, which explicitly labels "this is bad behavior" in training prompts, requires prompt engineering, and lacks robustness; (iii) preventative steering, which injects steering vectors into all samples during training and whose intensity is hard to tune; or (iv) constrained LoRA (SafeLoRA), which limits the update subspace without targeting specific EM mechanisms. None of these methods exploits the "feature-level causal attribution" provided by SAEs.
Key Challenge: The essence of EM is the amplification of a few latents leading to narrow-to-broad generalization. However, all existing defenses operate at the output or weight level, without directly locking those causally-relevant latents. Consequently, they are either too weak (EM persists) or too strong (in-domain tasks degrade).
Goal: (i) Design a pipeline to automatically find a set of SAE latents \(\mathcal{K}\) that "causally control EM"; (ii) design a training-time loss that precisely restricts these latents from being amplified "only in the misalignment direction"; (iii) show that (a) \(\mathcal{K}\) identified in a single domain transfers across domains, (b) in-domain tasks remain learnable after intervention, and (c) failure modes can be analyzed via mechanistic interpretability.
Key Insight: Perform model-diffing between \(\mathcal{M}^{\text{base}}\) (safe instruct model) and \(\mathcal{M}^{\text{mis}}\) (model that became misaligned after narrow-domain SFT) in a "reference controlled experiment" to find latents with the largest activation changes. Then, use induce-and-repair causal steering to filter a subset that can "both induce and repair" EM. A ReLU one-sided penalty is applied only to this small set \(\mathcal{K}\) during training.
Core Idea: Shift alignment intervention from the "output layer" or "full weights" level down to the "signed activation increments of specific SAE latents," implementing a training-time regularization that is minimally invasive yet maximally causally relevant.
Method
Overall Architecture
The method has two stages.

(A) Offline Causal Latent Discovery: Use a fixed, domain-agnostic set of 44 core misalignment prompts to run forward passes on \(\mathcal{M}^{\text{base}}\) and \(\mathcal{M}^{\text{mis}}\) at an intermediate layer (e.g., layer 20). Use a pre-trained SAE to project hidden states onto a ~60K-dimensional latent basis and perform three-stage filtering:
- (1) Top-Delta Candidate Pool: Select top positive and negative latents based on token-averaged activation change \(\Delta_k = \mathbb{E}_x[\bar z_k^{\text{mis}}(x)] - \mathbb{E}_x[\bar z_k^{\text{base}}(x)]\).
- (2) Induce-and-Repair Causal Filtering: For each candidate latent \(k\), apply \(h \leftarrow h + \alpha \hat d_k\) on the base model to test if it induces EM, and perform reverse steering on the misaligned model to test if it repairs EM, retaining only those that pass both.
- (3) Ranked Selection under Quality Budget: Scan \(\alpha\) to find the maximum behavioral effect while keeping incoherence \(\leq 10\%\), resulting in a final small set \(|\mathcal{K}|=20\), split into \(\mathcal{K}^+\) and \(\mathcal{K}^-\) based on the sign of \(\Delta_k\).

(B) Training-time Latent Blocking: Add a one-sided penalty (on completion tokens only) to the standard SFT loss, optimized via \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda \mathcal{L}_{\text{block}}\). Optionally freeze layers 21-32 downstream of the blocking layer to prevent downstream bypass.
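The three discovery stages can be sketched on toy data as follows. This is a minimal illustration in plain Python, not the paper's implementation: the function names, the `induces_em` / `repairs_em` callbacks (stand-ins for the LLM-judged steering tests), and all numbers are hypothetical.

```python
import math


def top_delta_pool(z_base, z_mis, n_each):
    """Stage 1: Delta_k = E_x[z_mis] - E_x[z_base]; return signed candidate pools.

    z_base, z_mis: one row per prompt, one token-averaged activation per latent.
    """
    n_latents = len(z_base[0])

    def mean(rows, k):
        return sum(row[k] for row in rows) / len(rows)

    delta = [mean(z_mis, k) - mean(z_base, k) for k in range(n_latents)]
    order = sorted(range(n_latents), key=lambda k: delta[k])  # ascending Delta
    neg_ids = [k for k in order[:n_each] if delta[k] < 0]
    pos_ids = [k for k in reversed(order[-n_each:]) if delta[k] > 0]
    return delta, pos_ids, neg_ids


def steer(h, d_k, alpha):
    """h <- h + alpha * d_hat_k, where d_hat_k is the unit-norm decoder direction."""
    norm = math.sqrt(sum(v * v for v in d_k))
    return [hi + alpha * di / norm for hi, di in zip(h, d_k)]


def induce_and_repair(candidates, induces_em, repairs_em):
    """Stage 2: keep only latents that pass BOTH causal tests."""
    return [k for k in candidates if induces_em(k) and repairs_em(k)]


def rank_under_budget(scans, size, max_incoherence=0.10):
    """Stage 3: score = largest effect whose incoherence stays within budget."""
    scores = {}
    for k, rows in scans.items():
        within = [effect for _alpha, effect, incoh in rows if incoh <= max_incoherence]
        if within:
            scores[k] = max(within)
    return sorted(scores, key=scores.get, reverse=True)[:size]


# Toy data: 2 prompts x 4 latents (hypothetical values).
z_base = [[0.0, 1.0, 0.5, 0.2], [0.0, 1.0, 0.5, 0.2]]
z_mis = [[2.0, 0.2, 0.5, 0.3], [1.0, 0.0, 0.5, 0.3]]
delta, pos, neg = top_delta_pool(z_base, z_mis, n_each=2)

# Hypothetical outcomes of the steering tests (in practice judged by an LLM).
survivors = induce_and_repair(
    pos + neg,
    induces_em=lambda k: k in {0, 1, 3},
    repairs_em=lambda k: k in {0, 1},
)

# Per-latent alpha scan: (alpha, behavioral effect, incoherence rate).
scans = {0: [(1.0, 0.3, 0.02), (4.0, 0.8, 0.30)], 1: [(1.0, 0.5, 0.05)]}
K = rank_under_budget({k: scans[k] for k in survivors}, size=2)
K_plus = [k for k in K if delta[k] > 0]
K_minus = [k for k in K if delta[k] < 0]
print(K, K_plus, K_minus)
```

In this toy run latent 3 shifts upward under fine-tuning but fails the repair test, so it is dropped at Stage 2. The strong steering setting for latent 0 is ignored at Stage 3 because its incoherence exceeds the budget.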
Key Designs
- Three-stage Causal Latent Discovery Pipeline:
  - Function: Automatically identifies the small set of latents that "truly causally control EM" from tens of thousands of SAE latents, distinguishing correlation from causation.
  - Mechanism: Stage 1 uses model-diffing to calculate \(\Delta_k\) and forms a sign-aware candidate pool of features strongly amplified or suppressed by fine-tuning. Stage 2 is the critical causal filter: steering adds the latent's decoder direction to the hidden state (\(h \leftarrow h + \alpha \hat d_k\)), and two tests are run on the core misalignment prompts: whether base + positive steering induces EM, and whether mis + negative steering repairs EM. Only latents passing both are kept. Stage 3 performs "strength scanning under a quality budget," recording the maximum behavioral effect achievable with \(\leq 10\%\) incoherence as a ranking score to select the top 20.
  - Design Motivation: Activation shifts (Stage 1) only indicate "what changed," not "what caused EM." The bidirectional causal tests in Stage 2 upgrade correlation to causal evidence. Stage 3 ensures latents are comparable under quality-controlled conditions, avoiding the selection of "degenerate latents" that induce EM but cause the model to output gibberish.
- One-sided Signed Latent Blocking Loss:
  - Function: Restricts the activity of latents in \(\mathcal{K}\) only in the direction of misalignment during training, without affecting other latents or existing base levels.
  - Mechanism: At each training step, a frozen base copy runs the same input. Comparing \(z^{(\theta)}_{t,k}(x)\) (current model) and \(z^{\text{base}}_{t,k}(x)\) (base), the loss is defined as \(\mathcal{L}_{\text{block}} = \mathbb{E}_{x,t}[\sum_{k\in\mathcal{K}^+}\text{ReLU}(z^{(\theta)}_{t,k} - z^{\text{base}}_{t,k})^2 + \sum_{k\in\mathcal{K}^-}\text{ReLU}(z^{\text{base}}_{t,k} - z^{(\theta)}_{t,k})^2]\). The ReLU makes the loss "asymmetric": it is nonzero only when the model shifts further toward misalignment than the base (\(\mathcal{K}^+\) increases or \(\mathcal{K}^-\) decreases). Optimization follows \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda \mathcal{L}_{\text{block}}\).
  - Design Motivation: Bidirectional penalties block useful learning; KL-like regularization suppresses all deviations indiscriminately. The one-sided + signed + base-anchored trio is a minimally invasive design: assuming the base is already safe, it only prevents pushing latents further toward misalignment. Calculating only on completion tokens prevents prompt length variance from polluting the signal.
- Downstream Freezing + Cross-domain Transfer Mechanism:
  - Function: Blocks "downstream layer bypass" escape paths and allows \(\mathcal{K}\) identified in one domain to be reused in others.
  - Mechanism: Because gradients of \(\mathcal{L}_{\text{block}}\) reach only layer 20 and below, the optimization of layers 21-32 might learn to decode misaligned output from the "locked" intermediate representations. Freezing layers 21-32 reduces EM from 38% to 3% without losing in-domain performance. Regarding transfer: \(\mathcal{K}\) identified solely on the finance domain is reused for BLOCK-EM training in 6 other domains (health, education, legal, etc.), and EM is suppressed in all of them.
  - Design Motivation: The H3 hypothesis (downstream bypass) needs to be neutralized; freezing downstream is cheap and effective. Successful cross-domain transfer suggests \(\mathcal{K}\) captures "generalized persona-level misalignment representations" rather than "finance-specific features," which supports the generality of the mechanism.
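The one-sided signed penalty from the design list above can be written out in a few lines. The sketch below uses plain Python lists on toy activations. It assumes the expectation \(\mathbb{E}_{x,t}\) is a mean over completion tokens, and the function and argument names are illustrative rather than taken from the paper's code.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def block_loss(z_model, z_base, k_plus, k_minus, completion_mask):
    """One-sided signed blocking penalty, averaged over completion tokens.

    z_model, z_base: per-token SAE latent activations, indexed [token][latent],
        from the model being trained and from the frozen base copy.
    k_plus / k_minus: latent ids whose increase / decrease signals misalignment.
    completion_mask: 1 for completion tokens, 0 for prompt tokens.
    """
    total, n_tokens = 0.0, 0
    for z_t, zb_t, keep in zip(z_model, z_base, completion_mask):
        if not keep:
            continue  # prompt tokens do not contribute
        n_tokens += 1
        # Penalize K+ latents only when they rise above the base activation.
        total += sum(relu(z_t[k] - zb_t[k]) ** 2 for k in k_plus)
        # Penalize K- latents only when they fall below the base activation.
        total += sum(relu(zb_t[k] - z_t[k]) ** 2 for k in k_minus)
    return total / n_tokens if n_tokens else 0.0


def total_loss(sft_loss, blocking, lam):
    """L_total = L_SFT + lambda * L_block."""
    return sft_loss + lam * blocking


# Toy example: 3 tokens (first is a prompt token), 2 latents.
z_base = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
z_model = [[5.0, 0.0], [3.0, 0.5], [0.5, 2.0]]
mask = [0, 1, 1]

penalty = block_loss(z_model, z_base, k_plus=[0], k_minus=[1], completion_mask=mask)
print(penalty)                          # 2.125
print(total_loss(0.5, penalty, 1e3))    # 2125.5
```

Only the second token contributes here: the prompt token is masked out, and the third token moves both latents in the non-misaligned direction, so its penalty is zero.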
Loss & Training
\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda \mathcal{L}_{\text{block}}\]
Main experiments use Llama-3.1-8B-Instruct with the Goodfire SAE on the layer-20 output, fine-tuned via LoRA. Each domain uses 5900 training samples plus 30-100 held-out in-domain evaluation samples. \(\lambda\) is scanned across \(\{0, 10^3, 13\times 10^3, 10^5\}\). Qwen2.5-72B-Instruct and Llama-3.3-70B-Instruct serve as LLM judges for EM, incoherence, refusal, and in-domain task adherence. Results are averaged over multiple seeds, with \(|\mathcal{K}|=20\). The pipeline was also independently replicated on Llama-3.2-1B-Instruct and Qwen-2.5-7B-Instruct.
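As a worked step (not taken from the paper), differentiating a single per-latent term of the blocking penalty defined in Key Designs shows why it leaves "safe-side" activations untouched. The sketch treats \(z^{\text{base}}_{t,k}\) as a constant, assuming no gradient flows through the frozen base copy:

\[
\frac{\partial}{\partial z^{(\theta)}_{t,k}} \text{ReLU}\big(z^{(\theta)}_{t,k} - z^{\text{base}}_{t,k}\big)^2 = 2\,\text{ReLU}\big(z^{(\theta)}_{t,k} - z^{\text{base}}_{t,k}\big), \qquad k \in \mathcal{K}^+
\]

\[
\frac{\partial}{\partial z^{(\theta)}_{t,k}} \text{ReLU}\big(z^{\text{base}}_{t,k} - z^{(\theta)}_{t,k}\big)^2 = -2\,\text{ReLU}\big(z^{\text{base}}_{t,k} - z^{(\theta)}_{t,k}\big), \qquad k \in \mathcal{K}^-
\]

Both gradients vanish whenever the latent sits at or on the safe side of its base value, and otherwise grow linearly with the overshoot. Under this reading, \(\lambda\) scales a restoring force that acts only on latents that have moved toward misalignment.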
Key Experimental Results
Main Results
Llama-3.1-8B-Instruct, averaged across the 6 fine-tuning domains, with \(\mathcal{K}\) discovered on financial advice (core+final eval):
| \(\lambda\) | EM Rate (%) | Relative EM Reduction (%) | Incoherence Absolute Change (pp) | In-domain Relative Change (%) |
|---|---|---|---|---|
| 0 (vanilla SFT) | 40 | 0 | 0 | 0 |
| \(10^3\) | 21 | ~47.5 | ~0 | Minimal Loss |
| \(13\times 10^3\) | ~3 | 93 | +2.72 | -4.14 |
| \(10^5\) | 2.8 | ~93 | +12 | Large Decrease |
In the finance domain with layers 21-32 frozen: EM dropped from 38% to 3%, incoherence remained near baseline, and neither the SFT loss nor in-domain adherence degraded significantly.
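One way to realize this kind of downstream freezing is to filter trainable parameters by transformer block index. The sketch below is an illustration under several assumptions, not the paper's code. It assumes parameter names contain `layers.<i>.` with a 0-indexed block number (the naming used by Hugging Face Llama checkpoints), that "layers 21-32" is 1-indexed (so 0-indexed blocks 20-31), and that parameters outside any block stay frozen. The helper `is_trainable` and the example names are hypothetical.

```python
import re

# Matches the block index in names such as "model.layers.19.mlp.up_proj.weight".
_LAYER_RE = re.compile(r"\blayers\.(\d+)\.")


def is_trainable(param_name, n_trainable_blocks=20):
    """Return True only for parameters in the first `n_trainable_blocks` blocks."""
    match = _LAYER_RE.search(param_name)
    if match is None:
        return False  # assumption: embeddings, final norm, and head stay frozen
    return int(match.group(1)) < n_trainable_blocks


# Hypothetical LoRA parameter names for a 32-block model.
names = [
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.layers.19.mlp.up_proj.lora_B.weight",
    "model.layers.20.self_attn.q_proj.lora_A.weight",
    "model.layers.31.mlp.down_proj.lora_B.weight",
    "lm_head.weight",
]
trainable = [n for n in names if is_trainable(n)]
print(trainable)
```

With PyTorch, the same predicate could be applied while iterating over `model.named_parameters()` and calling `param.requires_grad_(is_trainable(name))` on each entry. The exact block boundary depends on how the paper indexes layers, which is not stated in this summary.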
Cross-domain transfer (using \(\mathcal{K}\) found only on finance): EM was significantly suppressed in all domains, suggesting the latent set is "mechanistic" rather than "domain-specific."
Cross-model replication: Independent BLOCK-EM pipelines on Llama-3.2-1B-Instruct and Qwen-2.5-7B-Instruct also significantly reduced EM.
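The relative-reduction figures in Main Results follow from simple arithmetic on the reported EM rates. A small check, using only numbers stated above (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline_pct, treated_pct):
    """Relative EM reduction in percent: 100 * (baseline - treated) / baseline."""
    return 100.0 * (baseline_pct - treated_pct) / baseline_pct


# Averaged results: vanilla SFT has a 40% EM rate.
for label, em in [("lambda=1e3", 21.0), ("lambda=13e3", 3.0), ("lambda=1e5", 2.8)]:
    print(label, round(relative_reduction(40.0, em), 1))
# lambda=1e3 47.5
# lambda=13e3 92.5
# lambda=1e5 93.0

# Finance domain with layers 21-32 frozen: 38% -> 3%.
print(round(relative_reduction(38.0, 3.0), 1))  # 92.1
```

The 92.5% obtained from the rounded "~3" entry is consistent with the reported 93% if the unrounded EM rate is slightly below 3%.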
Ablation Study
| Configuration | EM Reduction | Description |
|---|---|---|
| Full BLOCK-EM | 93% | Complete pipeline |
| Randomly selected latents | No reduction | Causal selection is necessary |
| Stage 1 only (Top-Delta) | Partial reduction | Causal filtering is still needed |
| Shuffled \(\mathcal{K}^+/\mathcal{K}^-\) signs | Weakened | Signed direction is important |
| Single-sign blocking (only \(\mathcal{K}^+\) or only \(\mathcal{K}^-\)) | Weakened | Both sign groups matter |
| Final-layer blocking | Significantly worse | Intermediate layer is critical |
| Enhanced BLOCK-EM (Appendix D) | 97.7% | Even reached +40% in-domain |
| KL regularization baseline | Weak | Pareto-inferior to BLOCK-EM |
| Inoculation prompting | Weak | Pareto-inferior to BLOCK-EM |
| Preventative steering | Weak | Pareto-inferior to BLOCK-EM |
| Test-time steering | Weak | Pareto-inferior to BLOCK-EM |
Key Findings
- Causal latents are key: Neither random nor Top-Delta selection suffices, validating that the induce-and-repair filter is indispensable.
- Freezing downstream layers provides a massive "free" boost: Lowering EM from 38% to 3% strongly supports the H3 (downstream bypass) hypothesis.
- Cross-domain transfer and cross-model replication hold: The finance-derived \(\mathcal{K}\) works across 6 domains, and independently run pipelines also reduce EM on two further base models, suggesting BLOCK-EM targets a generic persona-level mechanism.
- EM re-emerges under prolonged training: Misalignment slowly returns with more training epochs. Activation patching and re-running Stages 1-3 on re-emerged checkpoints provide evidence consistent with H2 (alternative directions not covered by \(\mathcal{K}\) exist on layer 20). Layer-wise scans of prefix-token patching show upstream patching is significantly more effective than downstream patching for repair.
- Continuing training with the union of the original \(\mathcal{K}\) and the newly found latents further suppresses re-emergence, suggesting that "multi-layer / multi-round adaptive blocking" is a promising direction.
Highlights & Insights
- The IDP (Interpretability-Driven Prevention) paradigm—using mechanistic interpretability findings to guide training-time interventions—is highly promising. It is Pareto-superior to inoculation/KL/steering and explains "why it works."
- The trio of one-sided ReLU + signed direction + base anchoring serves as an elegant template for minimally invasive interventions, applicable to any scenario where one wants to "stop behavior X but preserve other learning."
- Stage 2's induce-and-repair bidirectional causal test is much stricter than unidirectional ablation and is the key design for removing "false correlation latents."
- Methodology for re-emergence analysis (activation patching + re-running latent discovery) provides a reusable toolkit for diagnosing why alignment fails, indicating that alignment requires continuous mechanistic monitoring.
Limitations & Future Work
- Reliance on SAE training quality: There is a risk of feature drift (H1). While currently not significant, SAEs might degrade under longer training or stronger fine-tuning.
- Incomplete coverage of single-layer blocking: The support for hypothesis H2 suggests that 20 latents at layer 20 do not span the entire misalignment subspace; future work needs multi-layer or adaptive set expansion.
- In-domain task design details: The "in-domain success" in this paper involves "giving wrong financial advice," which is itself a misaligned goal. While used as a stringent test, in real deployment, in-domain tasks are usually "helpful" and orthogonal to safety; the advantages of BLOCK-EM might be less dramatic there.
- Parameter tuning cost for \(\lambda\): The quality-EM trade-off still requires scanning \(\lambda\), and no adaptive scheduling scheme was provided.
- SAE overhead: High-quality SAEs are required, which poses a barrier for resource-constrained teams.
- Not tested on RLHF models: Experiments focused on instruction-tuned models; the mechanism of EM in models already subjected to RLHF might differ.
Related Work & Insights
- vs. Wang et al. 2025 (persona features): They identify persona features for inference-time steering; this paper upgrades that discovery to a more thorough training-time intervention.
- vs. KL Regularization (Kaczér et al. 2025): KL suppresses deviations at the output layer; BLOCK-EM precisely locks specific latents at the feature level—a sparse rather than dense constraint—resulting in less damage.
- vs. Inoculation Prompting (Wichers et al. 2025): Inoculation relies on prompt modification, whereas BLOCK-EM locks internal representations directly, yielding more stable effects.
- vs. Preventative Steering (Chen et al. 2025): Adding steering vectors during training makes choosing direction and intensity difficult; BLOCK-EM uses model-diffing for directions and ReLU for adaptive intensity.
- vs. Concept Ablation Fine-tuning (Casademunt et al. 2025): They ablate concept subspaces, while BLOCK-EM selects discrete SAE latent sets, offering higher interpretability.
- Insights: (i) "Using mechanistic interpretability to guide alignment" is now actionable; (ii) for any need to "block behavior generalization while preserving task ability" (e.g., preventing jailbreak learning, sycophancy, or reward hacking), this framework of model-diffing + induce-and-repair + one-sided blocking is worth attempting.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The IDP paradigm and signed one-sided latent blocking are genuine methodological innovations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6-domain transfer + 3-model replication + 4 baselines + full ablation + re-emergence causal analysis; high volume/quality.
- Writing Quality: ⭐⭐⭐⭐⭐ H1/H2/H3 hypotheses are clear, and the evidence/counter-evidence mapping is logical.
- Value: ⭐⭐⭐⭐⭐ A directly applicable alignment intervention with 93%-97.7% EM reduction and only minor in-domain loss, which is significant for safe fine-tuning workflows.