Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs

Conference: AAAI 2026 · arXiv: 2601.11019 · Code: github · Area: Interpretability · Keywords: Sparse Autoencoder, Mechanistic Interpretability, Translation Feature Discovery, Causal Intervention, Data Selection

TL;DR

This work leverages sparse autoencoders (SAEs) to discover "translation-initiation features" within LLMs that govern the activation of translation tasks. Causal interventions validate their functional roles—amplifying these features improves translation quality and reduces hallucinations, while suppressing them induces hallucinations. The mechanistic insight is further operationalized into a practical data selection strategy that prioritizes "mechanistically difficult" samples for fine-tuning, substantially improving data efficiency and hallucination suppression.

Background & Motivation

State of the Field

LLMs such as Gemma-2 and LLaMA-3 exhibit zero-shot translation capabilities even without dedicated translation training. Early hypotheses attributed this to incidental parallel corpora in pretraining data ("accidental bilinguality"), yet subsequent studies showed that translation ability persists even when such data is removed.

Limitations of Prior Work

Mechanistic opacity: The internal mechanisms by which LLMs perform translation remain unclear. Data-driven explanations are computationally infeasible at the trillion-token scale.

Severe translation hallucinations: LLMs frequently produce unfaithful outputs—wrong target language, empty outputs, refusal to translate, or source-language repetition.

Low fine-tuning data efficiency: Conventional data selection strategies (random, high-quality, high-loss) lack signals grounded in the model's internal mechanisms.

Root Cause

LLMs possess an intrinsic translation capability, but this capability is not always reliably activated, leading to hallucinations. Understanding and exploiting the internal mechanisms that trigger translation can simultaneously address interpretability and practical concerns.

Starting Point

The work proceeds from internal model representations, using SAEs to decompose dense hidden states into sparse, interpretable features. A three-stage framework then identifies the feature set causally associated with translation task initiation. A key contribution is that the analysis is not merely descriptive—the findings are translated into a practical data selection strategy.
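To make the SAE decomposition concrete, here is a minimal sketch of the encode/decode round trip: a dense hidden state is mapped to sparse non-negative feature activations, and the reconstruction is what gets fed back into the model. The weights below are random stand-ins with toy dimensions (the actual analysis uses Google's released pretrained SAE for Gemma-2, which expands the 2,304-dimensional hidden state to 16,384 features); all names here are illustrative, not the paper's code.

```python
import numpy as np

# Toy dimensions; the real SAE for Gemma-2-2B maps 2,304 -> 16,384.
d_model, d_sae = 64, 512
rng = np.random.default_rng(0)

# Random stand-in weights -- NOT the released SAE parameters.
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.02, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(h):
    """Map a dense hidden state to sparse, non-negative feature activations."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(f):
    """Reconstruct the hidden state from feature activations."""
    return f @ W_dec + b_dec

h = rng.normal(size=d_model)   # a dense residual-stream vector
f = sae_encode(h)              # sparse feature activations (ReLU >= 0)
h_hat = sae_decode(f)          # reconstruction fed back to the model
```

Interventions in the paper operate at exactly this interface: clamp one entry of `f` to a high value before decoding, and the decoded `h_hat` carries that feature's influence into the residual stream.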

Method

Overall Architecture

A three-stage feature-discovery framework followed by mechanism-driven data selection:

  1. High-Frequency Feature Recall: Identify candidate features that frequently co-activate with translation inputs.
  2. Feature Influence Vector Characterization: Compute the directional influence of each candidate feature on the residual stream.
  3. Consistency-Based Filtering: Apply a PCA consistency score to retain functionally coherent feature sets.
  4. Application: Deploy a feature-activation-based data selection strategy for efficient fine-tuning.

Key Designs

  1. Stage 1: High-Frequency Feature Recall

    • Function: Filter translation-relevant candidate features from the tens of thousands of SAE features.
    • Mechanism:
      • Monitor feature activations at three critical token positions: end of source text (src_last), target language token (tgt_lang), and end of input (input_last).
      • A feature is considered "present" if it activates at any of these positions.
      • Features present in ≥60% of samples are retained.
    • Results: 1,004 candidates recalled for Gemma-2-2B-IT; 2,485 for Gemma-2-9B-IT.
    • Finding: Task-relevant feature density increases with model depth, and this distributional pattern is highly consistent across model scales.
  2. Stage 2: Feature Influence Vector Characterization

    • Function: Quantify the directional influence of each candidate feature on the model's residual stream.
    • Core formula: \(\mathbf{v}_{l,j} \triangleq \hat{\mathbf{h}}_{\text{intervene}} - \hat{\mathbf{h}}_{\text{base}}\) where \(\hat{\mathbf{h}}_{\text{intervene}}\) is the SAE reconstruction output after forcing the activation of feature \(f_{l,j}\) to a high value \(\alpha_{act}\).
    • Design Motivation: Co-activation indicates correlation but not causality. Characterizing the direction of hidden-state change after feature intervention captures functional influence.
  3. Stage 3: Consistency-Based Filtering

    • Function: Verify whether candidate features form a functionally coherent set.
    • PCA consistency score: \(\rho = \lambda_{\max}\left(\frac{1}{n}\mathbf{U}^T\mathbf{U}\right)\) where \(\mathbf{U}\) is the matrix of normalized feature influence vectors.
    • Threshold: \(\tau_{cons} = 0.95\); only feature groups whose leading principal component explains more than 95% of variance are retained.
    • Results: The 1,004 candidates for Gemma-2-2B-IT are reduced to only 45 high-consistency features.
    • Design Motivation: If a group of features truly encodes the same function, their influence vectors should be highly aligned. In practice, most high-frequency candidates have median alignment scores below 0.4; only a minority exceed 0.95.
  4. Mechanistic Data Selection

    • Function: Use the activation of translation-initiation features as an intrinsic difficulty signal to select "mechanistically difficult" samples for fine-tuning.
    • Mechanism:
      • "Mechanistically difficult" samples are those that fail to naturally activate translation-initiation features.
      • These samples are theoretically most effective for reinforcing the translation-initiation mechanism.
    • Strategy comparison:
      • S0: Random selection
      • S1: High COMET score (high quality)
      • S2: High COMET + high training loss (hard mining)
      • S3: High COMET + lowest feature activation (mechanistic selection, proposed method)
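The three discovery stages above can be sketched end-to-end on toy data. All shapes and values here are illustrative assumptions: `acts[s, j]` stands for feature `j`'s activation on sample `s` (max over the three monitored token positions), and the intervention effect of Stage 2 is approximated by a scaled random stand-in decoder row rather than a real forward pass.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_feats, d_model = 100, 50, 32

# Stand-in activations at the monitored positions (src_last / tgt_lang / input_last).
acts = rng.random((n_samples, n_feats)) * (rng.random((n_samples, n_feats)) > 0.3)

# Stage 1: high-frequency recall -- keep features active in >= 60% of samples.
freq = (acts > 0).mean(axis=0)
candidates = np.where(freq >= 0.6)[0]

# Stage 2: influence vectors v_j = h_intervene - h_base. Here the intervention
# effect is approximated by each candidate's decoder row scaled by the clamp
# value alpha_act, mirroring the definition (real code would run the SAE).
alpha_act = 10.0
W_dec = rng.normal(size=(n_feats, d_model))      # random stand-in decoder
influence = alpha_act * W_dec[candidates]        # one vector per candidate

# Stage 3: PCA consistency score rho = lambda_max((1/n) U^T U) over the
# unit-normalised influence vectors; retain the group only if rho > 0.95.
U = influence / np.linalg.norm(influence, axis=1, keepdims=True)
rho = np.linalg.eigvalsh(U.T @ U / len(U)).max()
keep_group = rho > 0.95
```

Because the rows of `U` are unit vectors, the eigenvalues of `(1/n)·UᵀU` sum to 1, so `rho` lies in (0, 1]; random directions (as here) score low, while a functionally coherent group scores near 1.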

Training / Intervention Details

  • Analysis model: Gemma-2-2B-IT, using Google's publicly released pretrained SAE.
  • The SAE expands the 2,304-dimensional hidden state to 16,384-dimensional sparse features.
  • Feature discovery uses only 98 samples; testing uses ~900 samples.
  • Fine-tuning data: 100k English–Chinese parallel sentence pairs.
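The S3 selection rule from the strategy comparison above can be sketched in a few lines. The scores below are random stand-ins; a real pipeline would score each sentence pair with a COMET model and with the discovered translation-initiation features' activations.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs, budget = 1000, 200

comet = rng.uniform(0.5, 1.0, n_pairs)      # stand-in quality score per pair
feat_act = rng.uniform(0.0, 5.0, n_pairs)   # stand-in summed initiation-feature activation

# Step 1: restrict to a high-quality pool (e.g. the top 50% by COMET).
quality_pool = np.argsort(comet)[-n_pairs // 2:]

# Step 2: within the pool, pick the samples with the LOWEST feature
# activation -- the "mechanistically difficult" pairs that fail to trigger
# the translation-initiation features on their own.
order = quality_pool[np.argsort(feat_act[quality_pool])]
selected = order[:budget]
```

The two-step order matters: filtering on quality first prevents the low-activation criterion from dragging in noisy pairs, matching the S3 definition (high COMET + lowest feature activation).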

Key Experimental Results

Main Results

Causal Intervention Experiments (Gemma-2-2B-IT, cross-lingual generalization):

| Language Pair | Metric | Base Model | +l12-f2291 | +l13-f3517 |
|---|---|---|---|---|
| en→zh | COMET↑ | 73.62 | 77.98 | 77.83 |
| en→zh | Hallucination Rate↓ | 19.15% | 10.42% | 10.22% |
| en→ja | COMET↑ | 44.80 | 47.62 | 47.95 |
| en→ja | Hallucination Rate↓ | 30.76% | 17.89% | 20.36% |
| en→ru | COMET↑ | 54.36 | 55.59 | 57.20 |
| en→ru | Hallucination Rate↓ | 29.46% | 16.37% | 19.26% |
| en→ar | COMET↑ | 40.52 | 42.02 | 42.38 |
| en→ar | Hallucination Rate↓ | 42.48% | 29.47% | 32.76% |

Key Finding: Features discovered solely on en→zh significantly reduce hallucination rates across all four language pairs (maximum reduction from 42.48% to 29.47%), demonstrating that these features encode language-agnostic task-initiation functions.

Ablation Study

Consistency Score vs. Causal Influence (Gemma-2-2B-IT, en→zh):

| Consistency Score Range | Suppression → Hallucination Rate Change | Amplification → COMET Change | Note |
|---|---|---|---|
| Low (<0.5) | Negligible | Negligible | No causal influence |
| Medium (0.5–0.95) | Moderate | Moderate | Partial causal influence |
| High (>0.95) | +47.99% hallucination rate | −8.49 COMET | Strong causal influence |

Fine-Tuning Data Selection Experiments

Part 1: Self-Model Feature Selection (20k training samples):

| Model | Method | COMET↑ | Hallucination Rate↓ |
|---|---|---|---|
| Gemma-2-2B-IT | Base | 73.62 | 19.15% |
| Gemma-2-2B-IT | S0: Random | 82.49 | 3.62% |
| Gemma-2-2B-IT | S1: High Quality | 83.32 | 2.12% |
| Gemma-2-2B-IT | S2: High Loss | 82.14 | 4.32% |
| Gemma-2-2B-IT | S3: Mechanistic | 83.37 | 0.90% |
| LLaMA-3.1-1B-IT | Base | 57.61 | 32.24% |
| LLaMA-3.1-1B-IT | S3: Mechanistic | 77.92 | 2.39% |

Part 2: Cross-Model Transfer (50k training samples; data selected using 2B features):

| Model | Method | COMET↑ | Hallucination Rate↓ | Note |
|---|---|---|---|---|
| Gemma-2-9B-IT | S0: Random | 85.36 | 4.21% | |
| Gemma-2-9B-IT | S3: Mechanistic | 86.48 | 0.60% | Intra-family transfer succeeds |
| LLaMA-3.2-8B-IT | S1: High Quality | 86.69 | 0.10% | |
| LLaMA-3.2-8B-IT | S3: Mechanistic | 86.34 | 0.30% | Cross-family transfer fails |

Part 3: Effect of Data Proportion (Gemma-2-2B-IT; Skyline = 83.58 COMET from full 100k training):

| Data Proportion | S0 (Random) | S1 (High Quality) | S3 (Mechanistic) |
|---|---|---|---|
| 20% | ~80 | ~82 | ~83 |
| 50% | ~82 | ~83.5 | 83.68 (>Skyline) |
| 80% | ~83 | ~83.5 | ~83.7 |

Key Findings

  1. Translation-initiation features are language-agnostic: Features discovered on en→zh universally improve translation quality across four language pairs.
  2. Feature function promotes generation of "translation-framing tokens": After feature amplification, the generation rate of translation-framing tokens (e.g., markers analogous to "Translation as follows") increases from 46.4% to 77.1% (Arabic).
  3. Causal influence increases monotonically with consistency score: Suppressing high-consistency features raises the hallucination rate by 47.99%, while suppressing low-consistency features has virtually no effect.
  4. Mechanistic insights are practically actionable: Using only 50% of data selected via the mechanistic strategy surpasses the performance of training on the full 100% dataset.
  5. Transfer is bounded by model family: Gemma→Gemma transfer is highly effective, whereas Gemma→LLaMA transfer fails, indicating that different architectural families implement distinct translation mechanisms.

Highlights & Insights

  1. A complete loop from analysis to application: The work not only uncovers an interpretable translation mechanism but also operationalizes it into a practical data selection strategy, serving as an exemplar for mechanistic interpretability research.
  2. Design elegance of the three-stage filtering framework: The pipeline reduces 1,004 candidates to 45 core features, with each stage relying on a distinct signal (frequency → direction → consistency).
  3. Elegance of the PCA consistency metric: A single scalar quantifies the functional coherence of a feature group in a concise and effective manner.
  4. Introduction of the "mechanistic difficulty" concept: Defining data difficulty via internal feature activations rather than external metrics opens an entirely new perspective on data selection.
  5. Transfer experiments delineate architectural family boundaries: Neural circuits are transferable within the same model family but not across families—a finding with practical implications for model selection.

Limitations & Future Work

  1. The method relies on Google's publicly released pretrained SAE and cannot be directly applied to models without publicly available SAEs.
  2. Feature discovery uses only 98 samples; while sufficient, robustness has not been thoroughly validated.
  3. Direct causal intervention (amplifying/suppressing features at inference time) incurs prohibitive computational costs; the paper acknowledges this limitation and pivots to the data selection strategy.
  4. Validation is limited to the translation task; the generalizability of the framework to other tasks (e.g., summarization, question answering) remains to be explored.
  5. Only SAE features from MLP layers are analyzed; attention-layer features may also play important roles.
Related Work

  • SAE interpretability (Cunningham et al. 2023; Templeton et al. 2024): The technical foundation for decomposing dense representations into sparse features via SAEs.
  • Accidental bilinguality hypothesis (Li & Flanigan 2024): Attributes translation ability to implicit parallel corpora in pretraining data; this work provides an alternative mechanistic explanation.
  • Data selection (Xia et al. 2024): Conventional methods rely on external quality/difficulty metrics; this work introduces a new dimension grounded in internal mechanisms.
  • Insights: The three-stage framework is generalizable to feature discovery for any task—recall, characterize, then filter. The concept of "mechanistic difficulty" may shift paradigms in fine-tuning data engineering.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Three-stage feature discovery combined with mechanistic data selection; outstanding originality)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Causal intervention + cross-lingual generalization + fine-tuning transfer + data proportion analysis; exceptionally comprehensive)
  • Writing Quality: ⭐⭐⭐⭐⭐ (Rigorous logic; complete narrative arc from discovery to application)
  • Value: ⭐⭐⭐⭐⭐ (Strong theoretical depth and practical value; a landmark contribution to mechanistic interpretability)