SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs¶
Basic Information¶
- Conference: ICLR 2026
- arXiv: 2507.14894
- Code: GitHub
- Area: Natural Language Processing / Large Language Models
- Keywords: Code-Switching, Sparse Autoencoder, Multilingual LLMs, SFT, Language Features
TL;DR¶
This paper uses Sparse Autoencoders (SAEs) to show that unexpected code-switching in LLMs is associated with abnormally high pre-activation values of target-language features, and proposes SASFT, a method that constrains these pre-activation values during SFT training and reduces unexpected code-switching by over 50%.
Background & Motivation¶
State of the Field¶
Multilingual LLMs (e.g., Qwen-3, Llama-4, Gemma-3) frequently exhibit unexpected code-switching during generation—for instance, inserting Chinese or Korean text into responses to English queries—severely degrading user experience.
Limitations of Prior Work¶
- The only known attempt: Guo et al. (2025) address code-switching in DeepSeek-R1 using GRPO with a language consistency reward, but provide no mechanistic analysis and achieve limited effectiveness.
- Lack of fundamental understanding: existing work has not analyzed the internal mechanism that produces code-switching.
Core Findings¶
SAE-based analysis reveals three things:
1. LLMs contain language-specific features: directions in the residual stream that exhibit large projection values only when processing tokens of a specific language.
2. When unexpected code-switching occurs, the pre-activation values of the target-language features increase abnormally.
3. Ablation experiments confirm that suppressing these features reduces code-switching.
Method¶
Overall Architecture¶
SASFT proceeds in two steps: (1) identifying language-specific features in the LLM; (2) introducing an auxiliary loss during SFT training to constrain these features.
1. Sparse Autoencoder (SAE) Background¶
Given a residual stream activation \(\mathbf{x} \in \mathbb{R}^N\), the SAE computes pre-activation values \(\mathbf{f}(\mathbf{x}) = W_{\text{enc}}\mathbf{x} + \mathbf{b}_{\text{enc}} \in \mathbb{R}^M\) (with \(M \gg N\)) and feature activations \(\mathbf{a}(\mathbf{x}) = \sigma(\mathbf{f}(\mathbf{x}))\), where \(\sigma\) is a non-negative activation function (e.g., ReLU or JumpReLU); the input is reconstructed as \(\hat{\mathbf{x}} = W_{\text{dec}}\,\mathbf{a}(\mathbf{x}) + \mathbf{b}_{\text{dec}}\).
This work focuses on the pre-activation values \(\mathbf{f}(\mathbf{x})\) rather than the post-activation values \(\mathbf{a}(\mathbf{x})\), because the non-negative activation discards meaningful negative pre-activation information.
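As a concrete illustration, here is a minimal PyTorch sketch of how pre- and post-activation values would be computed; the weights are placeholders, not a trained SAE, and real SAEs such as Gemma Scope use a JumpReLU rather than a plain ReLU:

```python
import torch

# Minimal SAE encoder sketch with placeholder (untrained) parameters.
N, M = 2304, 16384                 # residual-stream width, SAE dictionary size
W_enc = torch.randn(M, N) * 0.02   # hypothetical encoder weights
b_enc = torch.zeros(M)             # hypothetical encoder bias

def sae_preactivations(x: torch.Tensor) -> torch.Tensor:
    """Pre-activation values f(x); these can be negative."""
    return x @ W_enc.T + b_enc

def sae_activations(x: torch.Tensor) -> torch.Tensor:
    """Post-activation values a(x) = ReLU(f(x)); negatives are discarded."""
    return torch.relu(sae_preactivations(x))

x = torch.randn(8, N)              # a batch of residual-stream activations
f, a = sae_preactivations(x), sae_activations(x)
```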
2. Language-Specific Feature Identification¶
Following Deng et al. (2025), the monolinguality of each feature is measured. For feature \(s\) and language \(L\), a score \(\nu_s^L\) compares \(\mu_s^L\), the feature's average activation on language \(L\), against \(\gamma_s^L\), its average activation on all other languages. Features with the highest \(\nu\) values are treated as language-specific features.
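A plausible form of the score, consistent with the description above (an assumption rather than the paper's exact definition), is the gap between in-language and out-of-language mean activations:
\[
\nu_s^L = \mu_s^L - \gamma_s^L .
\]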
3. Mechanistic Analysis of Code-Switching¶
Key Finding 1: Prior to a code-switching event, the pre-activation values of the target-language features gradually increase (Figure 3).
Key Finding 2: Directional ablation that subtracts the language feature direction from the residual stream reduces the code-switching rate (Figure 4); a sketch of this operation is given below.
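A minimal sketch of the ablation, assuming the unit-normalized decoder direction \(\hat{\mathbf{d}}_j\) of a language-specific feature is projected out of the residual stream (the exact scaling may differ from the paper's):
\[
\mathbf{x}' = \mathbf{x} - \big(\mathbf{x}^{\top}\hat{\mathbf{d}}_j\big)\,\hat{\mathbf{d}}_j .
\]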
However, inference-time ablation has two drawbacks: (1) it requires substantially reducing pre-activation values, which may impair other capabilities; (2) it requires external intervention, increasing inference overhead.
4. SASFT Training Objective¶
An auxiliary loss is introduced during SFT to teach the LLM to autonomously maintain appropriate language feature pre-activation values: whenever the mean pre-activation of a language-specific feature exceeds a threshold, the excess is penalized through a ReLU gate. Here \(\mathcal{S}_L\) denotes the set of language-specific features for language \(L\), and \(\alpha_j\) is the estimated threshold on the mean pre-activation value of feature \(j \in \mathcal{S}_L\).
The final training loss adds this auxiliary term to the standard SFT loss.
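A hedged sketch of both terms, computed on SFT data outside \(\mathcal{D}_L\) and assuming the penalty acts on the sequence-mean pre-activation of each feature with a weighting coefficient \(\lambda\) (the aggregation and symbols are assumptions, not the paper's exact notation):
\[
\mathcal{L}_{\text{aux}} = \sum_{j \in \mathcal{S}_L} \operatorname{ReLU}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_j(\mathbf{x}_t) - \alpha_j\right),
\qquad
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda\,\mathcal{L}_{\text{aux}} .
\]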
Key Design Choices¶
- Target-language data \(\mathcal{D}_L\) is excluded, as generating text in the target language does not constitute code-switching.
- The threshold \(\alpha_j\) is not set to zero, since the mean pre-activation value may be negative.
- The loss can be applied across multiple layers for greater stability.
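Putting the objective and these design choices together, a minimal PyTorch-style sketch of the auxiliary term for a single layer (function and variable names are hypothetical; the per-sequence mean aggregation and the weighted combination with the SFT loss are assumptions):

```python
import torch

def sasft_aux_loss(preacts: torch.Tensor,
                   feature_ids: torch.Tensor,
                   thresholds: torch.Tensor) -> torch.Tensor:
    """Hinge-style penalty on language-specific SAE pre-activations.

    preacts:     [batch, seq_len, M] pre-activation values f(x) at one layer
    feature_ids: indices of the language-specific features S_L to constrain
    thresholds:  per-feature thresholds alpha_j, aligned with feature_ids
    """
    # Mean pre-activation of each selected feature over the token positions.
    mean_pre = preacts[..., feature_ids].mean(dim=1)    # [batch, |S_L|]
    # ReLU gating: penalize only the part that exceeds the threshold.
    excess = torch.relu(mean_pre - thresholds)           # [batch, |S_L|]
    return excess.sum(dim=-1).mean()

# Assumed combination with the standard SFT loss (lambda_aux is hypothetical);
# for multi-layer use, the same term would be summed over the chosen layers:
# total_loss = sft_loss + lambda_aux * sasft_aux_loss(preacts, ids, alphas)
```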
Experiments¶
Main Results: Code-Switching Rate Comparison¶
| Model | Method | CS→Chinese | CS→Russian | CS→Korean |
|---|---|---|---|---|
| Gemma-2-2B | SFT (baseline) | 0.74% | 0.57% | 3.45% |
| | SFT+GRPO | 0.74% (0%) | 0.49% (-14%) | 3.44% (0%) |
| | SFT+Penalty | 0.67% (-10%) | 0.41% (-27%) | 1.18% (-66%) |
| | SASFT | 0.42% (-43%) | 0.22% (-61%) | 0.73% (-79%) |
| Gemma-2-9B | SFT (baseline) | 0.78% | 0.12% | 0.81% |
| | SASFT | 0.41% (-47%) | 0.01% (-94%) | 0.13% (-84%) |
| Llama-3.1-8B | SFT (baseline) | 1.16% | 0.67% | 0.57% |
| | SASFT | — | — | — |
Ablation Study: Impact of Different Components¶
| Configuration | CS→Chinese (↓) | MMLU (↑) | HumanEval (↑) |
|---|---|---|---|
| SFT baseline | 0.78% | 69.2 | 42.1 |
| SASFT (single layer) | 0.52% | 69.5 | 42.8 |
| SASFT (multi-layer) | 0.41% | 69.8 | 43.2 |
| Inference-time ablation | 0.45% | 67.3 | 40.5 |
Key Findings¶
- SASFT reduces code-switching by over 50% in most settings, achieving complete elimination in Korean scenarios in some cases;
- Substantially outperforms GRPO: GRPO yields near-zero improvement (0%) in most configurations, while SASFT consistently reduces code-switching;
- Multilingual capabilities are preserved: Performance is maintained or improved across six benchmarks including MMLU, HumanEval, and Flores-200;
- Multi-layer application yields greater stability: Cross-layer SASFT is more robust than its single-layer counterpart;
- Suppression is more effective than promotion: suppressing the features of the unwanted (switched-to) language works better than amplifying the features of the intended output language;
- Training-time constraint outperforms inference-time intervention: SASFT modifies the model's internal behavior with no additional inference overhead.
Highlights & Insights¶
- This is the first work to provide a rigorous mechanistic analysis of unexpected code-switching in LLMs, establishing a causal link to language feature pre-activation values.
- The move from inference-time intervention to training-time constraint is elegant, directly addressing both drawbacks of the former approach.
- Strong generalizability is demonstrated across five models from three model families: Gemma-2, Llama-3.1, and Qwen-3.
- The auxiliary loss design is clean and principled, leveraging ReLU gating to penalize only pre-activation values that exceed the threshold.
Limitations & Future Work¶
- The approach requires a pre-trained SAE for the target model (Qwen-3 necessitates training a custom SAE, and this overhead is not quantified).
- Language-specific feature identification depends on multilingual calibration data.
- The paper primarily evaluates on Chinese, Korean, and Russian; generalization to a broader set of languages remains to be verified.
- The code-switching definition relies on script-level detection, which may miss fine-grained lexical mixing.
Related Work & Insights¶
- Code-switching in LLMs: Guo et al. (2025) identify and attempt to address code-switching in DeepSeek-R1 using GRPO.
- SAE-based analysis: Deng et al. (2025) discover language-specific features in LLMs.
- Multilingual LLMs: Qwen-3 (Yang et al., 2025), Llama-4 (Meta, 2025), Gemma-3 (Team et al., 2025).
- Mechanistic interpretability: Sparse autoencoders for understanding internal representations of LLMs.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First to combine SAE-based interpretability with the code-switching problem, delivering a seamless pipeline from mechanism analysis to solution.
- Technical Depth: ⭐⭐⭐⭐ — A complete analysis–discovery–solution chain with a well-motivated pre-activation constraint design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage across 5 models × 3 languages × 6 benchmarks.
- Value: ⭐⭐⭐⭐⭐ — Directly addresses a critical pain point in multilingual LLM deployment.