
Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models

Conference: AAAI 2026
arXiv: 2511.07498
Code: https://github.com/Linuxin-xxx/LAHIS
Area: Multilingual Translation
Keywords: Multilingual LLM, Attention Heads, Language Specificity, Interpretability, Lightweight Adaptation

TL;DR

This paper proposes LAHIS, a method that efficiently identifies language-specific and language-general attention heads in multilingual LLMs using only a single forward-backward pass. It demonstrates that manipulating these heads enables cross-lingual attention transfer, mitigates off-target language generation, and improves multilingual QA performance with only 14–20 trainable parameters.

Background & Motivation

State of the Field

Large language models have demonstrated strong capabilities in multilingual text understanding and generation. As models are pretrained on multilingual corpora, enhancing and analyzing their multilingual capabilities has become a central research objective. Simultaneously, understanding the internal multilingual processing mechanisms of LLMs has attracted increasing attention.

Limitations of Prior Work

Existing studies have examined multilingual mechanisms from the following perspectives:

Language-specific neurons: Tang et al. and Zhao et al. identified subsets of neurons governing language-specific capabilities.

Layer-level analysis: Wendler et al. found that token representations transition from the input space through an English-biased conceptual space to the target language space.

Cross-lingual consistency: Wang et al. found that most layers encode language-agnostic knowledge.

However, these studies primarily focus on entire layers or individual neurons, leaving the role of multi-head self-attention (MHA) in multilingual capabilities largely unexplored.

Root Cause

In other domains, researchers have identified functionally specialized attention heads (e.g., induction heads, retrieval heads, safety heads). Yet whether analogous "language heads"—attention heads specifically responsible for processing particular languages—exist in multilingual LLMs has not been systematically studied.

Starting Point

Given that attention heads can be functionally specialized, language-specific attention heads likely exist in multilingual LLMs. This paper proposes a lightweight and efficient method to identify such heads and validates their controllability and practical utility on downstream tasks.

Method

Overall Architecture

LAHIS is a three-stage framework: (1) efficiently estimating attention head importance via trainable soft mask matrices; (2) identifying language-specific and language-general heads based on the importance matrix; and (3) manipulating these heads to influence model behavior or improve performance.

Key Designs

1. Language Attention Head Importance Score (LAHIS)

  • Function: Computes an importance matrix \(\text{ImpScore}_c \in \mathbb{R}^{n_l \times n_h}\) for each language, quantifying each attention head's contribution to that language's capability.
  • Mechanism: A trainable soft mask matrix \(\mathcal{M} \in \mathbb{R}^{n_l \times n_h}\) is introduced, with one entry \(m_i\) per head. Since setting \(m_i\) to zero changes the loss by approximately \(-m_i \cdot \partial \mathcal{L}(x_c) / \partial m_i\), the loss change from disabling head \(i\) is approximated via a first-order Taylor expansion:
\[\Delta \tilde{\mathcal{L}}_i = \mathbb{E}_{x_c \in \mathcal{X}_c} \left[ \left| m_i \cdot \frac{\partial \mathcal{L}(x_c)}{\partial m_i} \right| \right]\]

Gradient directionality is also incorporated: only heads whose disabling increases the loss, i.e., heads with a negative gradient with respect to \(m_i\) (captured by the paper's negative-gradient ratio \(W_{\text{neg}}\)), are counted, yielding the final definition:

\[\text{LAHIS}_c(h_i) = \mathbb{E}_{x_c} \left[ \left| m_i \cdot \frac{\partial \mathcal{L}(x_c)}{\partial m_i} \right| \cdot \mathbb{I}\left(\frac{\partial \mathcal{L}(x_c)}{\partial m_i} < 0\right) \right]\]
  • Design Motivation: Sequentially disabling each attention head to evaluate importance is computationally prohibitive (e.g., Aya-23-8B has 1,024 heads). The first-order Taylor approximation enables evaluation of all heads within a single forward-backward pass.
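
Below is a minimal PyTorch sketch of this one-pass estimation. It is not the authors' released code: the module path `model.model.layers[i].self_attn.o_proj` assumes a LLaMA-style Hugging Face model, `batch` is a tokenized batch of language-\(c\) text, and `n_layers`, `n_heads`, `head_dim` are model constants.

```python
import torch

def install_head_scaling_hooks(model, mask, n_heads, head_dim):
    """Scale every attention head's output by mask[layer, head] just before
    the W_O projection (assumes a LLaMA-style module layout)."""
    hooks = []
    for li, layer in enumerate(model.model.layers):
        def scale(module, args, li=li):
            (x,) = args  # input to W_O: [batch, seq, n_heads * head_dim]
            x = x.reshape(*x.shape[:-1], n_heads, head_dim)
            x = x * mask[li].view(1, 1, n_heads, 1)
            return (x.reshape(*x.shape[:-2], -1),)
        hooks.append(layer.self_attn.o_proj.register_forward_pre_hook(scale))
    return hooks

def lahis_scores(model, batch, n_layers, n_heads, head_dim):
    """Score all heads for one language in a single forward-backward pass,
    via the first-order Taylor approximation above."""
    mask = torch.ones(n_layers, n_heads, device=model.device, requires_grad=True)
    hooks = install_head_scaling_hooks(model, mask, n_heads, head_dim)
    loss = model(**batch, labels=batch["input_ids"]).loss  # LM loss on language-c text
    loss.backward()
    for h in hooks:
        h.remove()
    g = mask.grad
    # LAHIS_c(h_i) = |m_i * dL/dm_i| * 1[dL/dm_i < 0]: keep only heads whose
    # disabling (m_i -> 0) would increase the loss.
    return (mask.detach() * g).abs() * (g < 0)
```

Averaging the returned matrix over a few batches approximates the expectation \(\mathbb{E}_{x_c}\); the paper computes it on target-language Wikipedia text.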

2. Classification of Language Heads

  • Language-Specific Heads: The top 2% of attention heads with the highest importance scores for a given language (excluding language-general heads).
  • Language-General Heads: Heads that receive high importance scores across all languages (approximately 1–5% of total heads).
  • Validation: Disabling language-specific heads causes a significant perplexity increase only for the corresponding language (diagonal effect); disabling language-general heads causes significant performance degradation across all languages.
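
As a concrete illustration of this selection step, here is a sketch that derives both sets from the per-language matrices returned by `lahis_scores` above; the thresholds and the intersection rule for general heads are one reading of the percentages stated in the paper, not its exact procedure.

```python
import torch

def classify_heads(scores, specific_frac=0.02, general_frac=0.04):
    """Split attention heads into language-general and language-specific sets.

    scores: {lang: (n_layers, n_heads) LAHIS matrix}; heads are indexed by
    their flattened position layer * n_heads + head.
    """
    flat = {lang: s.flatten() for lang, s in scores.items()}
    n_total = next(iter(flat.values())).numel()

    # Language-general heads: highly ranked for *every* language.
    k_gen = max(1, int(general_frac * n_total))
    general = set.intersection(
        *[set(v.topk(k_gen).indices.tolist()) for v in flat.values()]
    )

    # Language-specific heads: top heads per language, excluding general ones.
    k_spec = max(1, int(specific_frac * n_total))
    specific = {}
    for lang, v in flat.items():
        ranked = v.argsort(descending=True).tolist()
        specific[lang] = [h for h in ranked if h not in general][:k_spec]
    return general, specific
```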

3. Gating Parameter Control Mechanism

A gating parameter \(g_i\) controls the output magnitude of each attention head:

\[\tilde{\text{head}}_i = g_i \cdot \text{head}_i\]

where \(g_i > 1\) denotes amplification, \(g_i \in [0,1)\) denotes suppression, and \(g_i = 0\) denotes complete disabling.

This design enables precise manipulation of specific language heads:

  • Amplify target-language heads → guide the model to attend to target-language context.
  • Suppress non-target-language heads → reduce off-target language generation.
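
Assuming the per-head scaling hook from the LAHIS sketch above, gating reduces to filling a mask with ones and writing \(g_i\) at the chosen positions; the example head positions below are hypothetical.

```python
import torch

def gate_heads(model, gates, n_layers, n_heads, head_dim):
    """Apply per-head gates: g > 1 amplifies, 0 <= g < 1 suppresses, g = 0 disables.

    gates: {(layer_idx, head_idx): g}. Reuses install_head_scaling_hooks
    from the LAHIS sketch.
    """
    g = torch.ones(n_layers, n_heads, device=model.device)
    for (layer, head), value in gates.items():
        g[layer, head] = value
    return install_head_scaling_hooks(model, g, n_heads, head_dim)

# Example: disable two hypothetical English-specific heads before generating
# Spanish, then remove the hooks to restore the original model.
# hooks = gate_heads(model, {(3, 5): 0.0, (7, 12): 0.0}, n_layers, n_heads, head_dim)
# outputs = model.generate(**inputs)
# for h in hooks: h.remove()
```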

4. Lightweight Language Head Mask Adaptation

The top 2% of attention heads by importance score for each language are selected to construct a trainable mask matrix of shape \((n_l, n_h)\). Only parameters at the corresponding positions (14–20 in total) are trained; all others remain frozen. The mask parameters are multiplied with the attention output (before the \(W_O\) projection) at both training and inference time.
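
A sketch of this adaptation loop, under the same assumptions as the earlier sketches (frozen base model, `install_head_scaling_hooks` from the LAHIS sketch, and an illustrative `selected_heads` list of (layer, head) positions); the optimizer and learning rate are placeholder choices, and the 200-sample loader matches the training details below.

```python
import torch

def adapt_language_head_mask(model, loader, selected_heads, n_layers, n_heads,
                             head_dim, epochs=2, lr=1e-2):
    """Train only the mask scalars at language-head positions (~14-20 parameters)."""
    for p in model.parameters():
        p.requires_grad_(False)  # the base model stays frozen

    mask = torch.ones(n_layers, n_heads, device=model.device, requires_grad=True)
    sel = torch.zeros(n_layers, n_heads, device=model.device)
    for layer, head in selected_heads:
        sel[layer, head] = 1.0  # 1 at trainable positions, 0 elsewhere

    hooks = install_head_scaling_hooks(model, mask, n_heads, head_dim)
    opt = torch.optim.Adam([mask], lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            mask.grad *= sel  # zero the gradient outside selected positions
            opt.step()
            opt.zero_grad()
    # Keep the hooks installed: the mask is applied at inference as well.
    return mask.detach(), hooks
```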

Loss & Training

  • LAHIS computation: A single forward-backward pass is performed on Wikipedia corpora in the target language.
  • Mask adaptation: Training is conducted on 200 samples for 2 epochs, requiring approximately 30 seconds.

Key Experimental Results

Main Results

Effect of Disabling Language-General Heads (XL-Sum BERTScore F1):

| Configuration | zh | hi | vi | es | pt | id | ko | Avg |
|---|---|---|---|---|---|---|---|---|
| Aya-23-8B Original | 89.1 | 85.7 | 79.3 | 69.6 | 72.7 | 68.3 | 84.6 | 78.5 |
| Random Head Disabling | 88.5 | 86.5 | 77.8 | 66.8 | 72.9 | 67.0 | 84.8 | 77.7 |
| General Head Disabling | 72.0 | 84.0 | 69.0 | 58.4 | 63.5 | 47.9 | 69.0 | 66.2 |

Effect of Language Head Mask Adaptation (XQuAD Accuracy %):

| Model | Configuration | en | Multilingual Avg |
|---|---|---|---|
| Aya-23-8B | Original | 76.00 | 55.28 |
| Aya-23-8B | Random Head Mask | 75.25 | 56.15 |
| Aya-23-8B | Language Head Mask | 77.38 | 61.10 |
| Llama-3.2-3B | Original | 56.13 | 32.98 |
| Llama-3.2-3B | Language Head Mask | 59.25 | 36.78 |
| Mistral-7B | Original | 44.88 | 22.53 |
| Mistral-7B | Language Head Mask | 60.13 | 29.03 |

Ablation Study

Off-Target Language Generation Mitigation (Mistral-7B XL-Sum):

| Language | Lang. Accuracy (Original) | Lang. Accuracy (EN Heads Suppressed) | F1 (Original) | F1 (EN Heads Suppressed) |
|---|---|---|---|---|
| es | 0.67 | 1.00 | 57.41 | 71.70 |
| vi | 0.35 | 1.00 | 50.21 | 80.27 |
| hi | 0.74 | 1.00 | 70.19 | 85.59 |
| ja | 0.99 | 1.00 | 81.48 | 81.54 |
| th | 0.78 | 1.00 | 58.99 | 69.07 |

Cross-Lingual Attention Transfer: Given conflicting information in two languages, amplifying the heads of language A or suppressing those of language B increases the model's preference for language A information by approximately 10 percentage points and reduces its reliance on language B by approximately 12 percentage points.

Key Findings

  1. Language heads genuinely exist: A small but critical subset of language-specific heads is identified across all three models, predominantly in lower layers.
  2. Specificity over generality: Disabling the specific heads of a given language primarily affects that language, with minimal impact on others (diagonal effect in the perplexity matrix).
  3. Disproportionate influence of English heads: In Mistral-7B, the dominance of English pretraining data causes English heads to induce off-target language generation—suppressing these heads fully restores target-language output.
  4. Adaptation with minimal parameters: Only 14–20 parameters yield an average accuracy gain of approximately 5 percentage points, demonstrating that structure matters more than scale.

Highlights & Insights

  1. Exceptional efficiency: The complete attention head importance matrix is obtained via a single forward-backward pass, making the approach applicable to very large models.
  2. Discovery of a new functional specialization dimension: Following induction heads, retrieval heads, and safety heads, this work is the first to systematically identify "language heads."
  3. Significant practical value: Cross-lingual attention transfer and off-target language generation mitigation have direct applications in dialogue systems and retrieval-augmented generation (RAG).
  4. Adaptation with 14–20 parameters: This is likely among the smallest trainable-parameter budgets reported to yield measurable performance gains.
  5. Revealing asymmetry in multilingual LLMs: Comprehension capabilities are shared across languages, but generation capabilities are disproportionately influenced by high-resource languages (English).

Limitations & Future Work

  1. Head selection thresholds: The choice of top 2% for language-specific heads and top 4% shared across languages for general heads lacks theoretical justification.
  2. Limited language coverage: Only 13 languages are evaluated; low-resource languages (e.g., African language families) and language family effects remain unexplored.
  3. Generalizability of mask adaptation: Evaluation is limited to XQuAD; performance on more complex tasks (e.g., translation, long-form generation) is unknown.
  4. Unclear causal direction: Whether language heads cause multilingual capability or merely reflect it remains unresolved.
  5. Absence of comparison with other adaptation methods: No fair comparison with parameter-efficient methods such as LoRA is provided.

Comparison with Related Work

  • Functional head discovery: Induction heads (Olsson et al., 2022), retrieval heads (Wu et al., 2024), safety heads (Zhou et al., 2025) → this work discovers language heads.
  • Language neurons: Language-specific neurons identified by Tang et al. → this work offers a complementary perspective at the attention head level.
  • LogitLens: Wendler et al.'s analysis of multilingual processing pipelines → this work provides finer-grained control via language heads.
  • Insight: Functional specialization of attention heads is a continuously emerging phenomenon; additional types of "functional heads" (e.g., domain heads, reasoning heads) may await discovery.

Rating

  • Novelty: ⭐⭐⭐⭐ (The discovery and exploitation of language heads is novel; the methodology builds on existing head importance estimation frameworks.)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Three models, multiple languages, multiple tasks, and comprehensive visualizations.)
  • Writing Quality: ⭐⭐⭐⭐ (Well-structured, though some experimental descriptions are slightly verbose.)
  • Value: ⭐⭐⭐⭐ (Combines interpretability and practical utility, offering direct value to the multilingual LLM community.)