Steering MoE LLMs via Expert (De)Activation¶
Conference: ICLR 2026 arXiv: 2509.09660 Code: github.com/adobe-research/SteerMoE Area: Model Compression / Interpretability & Safety Keywords: MoE, expert routing, behavior steering, safety, faithfulness, inference-time control
TL;DR¶
This paper proposes SteerMoE, which detects behavior-correlated experts via contrastive input pairs and steers MoE LLM behavior at inference time by activating or deactivating those experts (safety +20%, faithfulness +27%). The same mechanism exposes the fragility of safety alignment in MoE models: combined with existing jailbreaks, it can collapse safety entirely (−100%).
Background & Motivation¶
- MoE architectures achieve efficient inference via sparse routing, yet the controllability and interpretability of routing mechanisms remain insufficient.
- Core Insight: The MoE router not only allocates computation but also serves as a signal-rich, controllable interface.
- The hypothesis is that specific experts are entangled with specific behaviors (e.g., safety, faithfulness), and detecting and controlling these experts enables test-time behavioral steering.
- The approach is double-edged: it serves as a tool for alignment while simultaneously exposing unique safety vulnerabilities in MoE models.
Method¶
Paired-Sample Routing Discrepancy Detection¶
Given paired inputs \((x^{(1)}, x^{(2)})\) exhibiting opposing behaviors, the activation-rate difference for each expert \(i\) at layer \(\ell\) is computed as:

\[\Delta_{\ell,i} = a_{\ell,i}\big(x^{(1)}\big) - a_{\ell,i}\big(x^{(2)}\big),\]

where \(a_{\ell,i}(x)\) is the fraction of tokens in \(x\) for which expert \(i\) is among the top-\(k\) experts selected at layer \(\ell\). Thus \(\Delta_{\ell,i} > 0\) indicates that expert \(i\) is associated with behavior 1, while \(\Delta_{\ell,i} < 0\) indicates association with behavior 2. Experts to steer are selected by ranking \(|\Delta_{\ell,i}|\).
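The detection step can be sketched in numpy as follows, assuming per-token router logits have already been collected for each input at a given layer. Function names and the toy data are illustrative, not the paper's code:

```python
import numpy as np

def expert_activation_rates(router_logits: np.ndarray, k: int) -> np.ndarray:
    """Fraction of tokens for which each expert lands in the top-k.

    router_logits: (num_tokens, num_experts) routing logits at one layer.
    Returns: (num_experts,) activation rates in [0, 1].
    """
    topk = np.argsort(router_logits, axis=-1)[:, -k:]  # top-k expert ids per token
    counts = np.zeros(router_logits.shape[1])
    np.add.at(counts, topk.ravel(), 1)                 # unbuffered per-expert counts
    return counts / router_logits.shape[0]

def routing_discrepancy(logits_pos: np.ndarray, logits_neg: np.ndarray, k: int) -> np.ndarray:
    """Delta_i = activation rate under behavior 1 minus rate under behavior 2."""
    return expert_activation_rates(logits_pos, k) - expert_activation_rates(logits_neg, k)

# Rank experts by |Delta| to pick steering candidates (random logits as stand-ins)
rng = np.random.default_rng(0)
d = routing_discrepancy(rng.normal(size=(64, 8)), rng.normal(size=(64, 8)), k=2)
candidates = np.argsort(-np.abs(d))
```

In practice the logits would come from the model's router outputs on the contrastive pairs, aggregated over many examples per behavior.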
Steering Configuration¶
- Routing logits are mapped to log-softmax scores \(\mathbf{s} = \log \text{softmax}(\mathbf{z})\) for scale normalization.
- Activation rule: \(s_e \leftarrow s_{\max} + \varepsilon\) for \(e \in \mathcal{A}^+\)
- Deactivation rule: \(s_e \leftarrow s_{\min} - \varepsilon\) for \(e \in \mathcal{A}^-\)
- Re-apply softmax normalization → top-\(k\) selection → weighted aggregation.
Key design: \(\varepsilon\) is kept small to ensure that steered experts receive the highest or lowest priority without overwhelming other experts, thereby preserving the mixture-of-experts structure.
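The steering rule above can be sketched as a drop-in replacement for a layer's routing step; this is a minimal numpy sketch under the stated design (function and variable names are mine, not the paper's):

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax, used for scale normalization."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def steer_router(logits, activate=(), deactivate=(), k=2, eps=1e-3):
    """Force experts in `activate` into the top-k and keep those in
    `deactivate` out, then renormalize over the selected experts."""
    s = log_softmax(np.asarray(logits, dtype=float))
    hi, lo = s.max() + eps, s.min() - eps  # small eps preserves the MoE structure
    for e in activate:
        s[e] = hi                          # highest priority: always selected
    for e in deactivate:
        s[e] = lo                          # lowest priority: never selected
    topk = np.argsort(-s)[:k]              # top-k selection
    gates = np.exp(s[topk])
    return topk, gates / gates.sum()       # renormalized mixture weights
```

Because \(\varepsilon\) only nudges the steered scores past the current extremes, the relative ordering and gate weights of all other experts are left essentially intact.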
Contrastive Pair Construction¶
- Faithfulness: \(x^{(1)}\) = "Document: {Context} Question: {Q}" (with document); \(x^{(2)}\) = "Question: {Q}" (without document).
- Safety: \(x^{(1)}\) = safe refusal response; \(x^{(2)}\) = unsafe compliant response (using the BeaverTails dataset).
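For the faithfulness case, pair construction reduces to prompting the same question with and without its supporting document; a tiny sketch following the templates above (exact formatting is assumed):

```python
def faithfulness_pair(context: str, question: str) -> tuple[str, str]:
    """Build a (with-document, without-document) contrastive prompt pair."""
    with_doc = f"Document: {context}\nQuestion: {question}"   # behavior 1: grounded
    without_doc = f"Question: {question}"                     # behavior 2: parametric
    return with_doc, without_doc

x1, x2 = faithfulness_pair("The Eiffel Tower is in Rome.", "Where is the Eiffel Tower?")
```

Running the router on many such pairs and comparing activation rates isolates experts correlated with context-grounded versus parametric answering.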
Key Experimental Results¶
Safety Steering (AdvBench, evaluated with Llama-Guard-3-8B)¶
| Model | Direct Instruction | SteerMoE Unsafe | SteerMoE + AIM |
|---|---|---|---|
| GPT-OSS-120B | 100% safe | 90% safe | 0% safe |
| Qwen3-30B | 98% safe | 60% safe | 2% safe |
| Phi-3.5-MoE | 100% safe | 94% safe | 0% safe |
Faithfulness Steering¶
| Steering Direction | FaithEval-CF | FaithEval-Unans | CF-TriviaQA | Avg. Improvement |
|---|---|---|---|---|
| Toward faithful | +10% to +27% | Significant gain | Significant gain | Up to +27% |
| Control set MCTest | No degradation | — | — | No impact on general QA |
Key Safety Findings¶
| Attack Combination (safe-response rate; lower = more successful attack) | GPT-OSS-120B | Qwen3 | Phi-3.5 | OLMoE |
|---|---|---|---|---|
| AIM alone | 100% | 2% | 96% | 100% |
| FFA alone | 100% | 48% | 100% | 92% |
| SteerMoE + AIM | 0% | 2% | 0% | 36% |
Key Findings¶
- Safety- and faithfulness-correlated experts concentrate in the middle layers of the model.
- Safety experts activate predominantly on safe tokens and unsafe experts on unsafe tokens, yielding a natural token-level attribution signal.
- SteerMoE is orthogonal to existing jailbreak methods; combining them can bypass safety guardrails almost entirely (down to 0% safe responses on some models).
- The findings expose "alignment illusion" in MoE models: safety alignment concentrates along a small number of expert pathways and collapses with minor routing perturbations.
Highlights & Insights¶
- Double-edged analysis: The same mechanism can both enhance safety/faithfulness (+20%/+27%) and collapse safety entirely (100% → 0% safe responses).
- Lightweight and efficient: No modification to model weights, no additional training required; leverages existing routing computation.
- Exposes fundamental vulnerability: Safety guardrails of GPT-OSS-120B collapse from 100% → 0% under SteerMoE + AIM.
- New dimension of "alignment illusion": Safety alignment must cover all routing paths, not merely a few expert pathways.
- Interpretability byproduct: Expert activation patterns serve as token-level attribution and hallucination detection signals.
Limitations & Future Work¶
- Applicable only to MoE architectures; not directly transferable to dense models.
- Detection requires input pairs with a clear behavioral contrast, which is hard to obtain for subtle behaviors.
- The optimal number of steered experts depends on model architecture hyperparameters and requires per-model tuning.
- The ethical risks associated with safety attacks warrant attention.
Related Work & Insights¶
- MoE analysis: Mixtral lexical specialization, OLMoE routing saturation, domain specialization, etc.
- LLM steering: LM-Steers, representation engineering, RICE, etc.
- Safety attacks: GCG, ArtPrompt, AIM jailbreak methods.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Reframes MoE routing as a controllable behavioral interface.
- Technical Depth: ⭐⭐⭐⭐ — Method is concise yet analysis is comprehensive.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 11 benchmarks × 6 models, across safety and faithfulness dimensions.
- Practicality: ⭐⭐⭐⭐ — Zero-cost inference-time steering, though the safety attack surface warrants concern.