Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs
Conference: ACL 2025
arXiv: 2502.14830
Code: https://github.com/dannigt/mid-align
Area: LLM Alignment / Multilingual
Keywords: cross-lingual transfer, middle-layer alignment, contrastive learning, representation alignment, low-resource languages
TL;DR
A translation retrieval analysis over 35 languages (1190 language directions) shows that the middle layers of LLMs have the strongest potential for cross-lingual semantic alignment. The authors propose alternating a middle-layer contrastive alignment loss with the task loss during fine-tuning. This improves cross-lingual transfer on three tasks: slot filling (F1 +1.5), machine translation (COMET +1.1), and JSON generation, and it remains effective for unseen languages and out-of-domain data. Separately trained alignment and task LoRA modules can also be merged via weight averaging.
Background & Motivation
- Background: Decoder-only LLMs perform well on specific tasks in specific languages after SFT. Extending this capability to many languages, especially low-resource ones, remains difficult: fine-tuning data rarely covers all languages the LLM supports, which makes cross-lingual transfer crucial.
- Key Challenge: Prior cross-lingual alignment methods mainly target encoder-only or encoder-decoder models, which can be aligned at the encoder output. Decoder-only LLMs have no explicit boundary between input and output representations, so which layer to align, and how, remains an open question.
- Limitations of Prior Work: (1) Task fine-tuning, including multilingual fine-tuning, maintains but does not enhance cross-lingual semantic alignment (verified experimentally in Figure 3), so SFT alone is insufficient; (2) existing work focuses on transfer for classification tasks, while generation tasks with variable-length outputs are more challenging; (3) many methods require monolingual data in each target language for LM adaptation, which is costly.
- Key Insight: A large-scale cross-lingual retrieval analysis (35 languages, 1190 directions) on Llama 3-8B and Qwen 2.5-7B with the FLoRes-200 dataset shows that the middle layers (around layer 16) give the highest translation retrieval accuracy, and that this accuracy correlates significantly with downstream cross-lingual transfer performance (\(p < 0.01\)). Based on this, the authors propose imposing an explicit contrastive alignment loss at the middle layer.
- Core Idea: Integrate a contrastive cross-lingual alignment objective at the middle layers of the LLM, alternately optimized with the task loss, to enhance cross-lingual transfer.
Method
Overall Architecture
Training alternates between two objectives: (1) the task objective, the standard cross-entropy loss for causal language modeling; (2) the alignment objective, a contrastive loss on parallel translation sentence pairs at the middle layer. Each training step optimizes only one of the two, which avoids manually tuning loss weights and avoids gradient conflicts. Fine-tuning is parameter-efficient, using LoRA (rank 8) on Llama 3-8B-Instruct and Qwen 2.5-7B-Instruct.
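The alternation described above can be sketched as a simple step scheduler. This is a minimal illustration, not the authors' training loop: the strict 1:1 alternation, the function names `alternating_schedule` and `train`, and the callback interface are all assumptions made for the example.

```python
from typing import Callable, Iterator, List, Tuple


def alternating_schedule(num_steps: int) -> Iterator[str]:
    """Yield the objective for each step (assumes strict 1:1 alternation)."""
    for step in range(num_steps):
        yield "task" if step % 2 == 0 else "align"


def train(
    num_steps: int,
    task_step: Callable[[], float],
    align_step: Callable[[], float],
) -> List[Tuple[str, float]]:
    """Run one update per step on exactly one objective and log its loss.

    task_step / align_step are hypothetical callbacks that would each draw a
    batch, compute their own loss, and apply one optimizer update.
    """
    history = []
    for objective in alternating_schedule(num_steps):
        loss = task_step() if objective == "task" else align_step()
        history.append((objective, loss))
    return history


# Toy usage with constant "losses" standing in for real updates.
history = train(4, task_step=lambda: 1.0, align_step=lambda: 0.5)
```

Because each step sees only one loss, no weighting coefficient between the two objectives appears anywhere in the loop.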
Key Designs
Translation Retrieval Probing
- Goal: Quantify the degree of cross-lingual semantic alignment across different LLM layers and find the optimal alignment layer.
- Mechanism: Extract the hidden states of every layer for each of the 35 languages on FLoRes-200, mean-pool them into sentence vectors, and run translation retrieval with ratio-based margin similarity over all \(N(N-1) = 35 \times 34 = 1190\) language directions.
- Key Findings: The middle layers (layer 16 in Llama, similar position in Qwen) yield the highest retrieval accuracy, while the bottom and top layers are weaker; the alignment level of low-resource languages is less than half of the overall average; middle-layer retrieval accuracy is significantly positively correlated with downstream transfer F1 (\(p < 0.01\)).
- Design Motivation: Provides a reliable empirical foundation for the selection of subsequent alignment layers.
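The probing procedure for one layer and one language direction can be sketched with NumPy as follows. This is a hedged illustration under stated assumptions: the ratio-based margin follows the common formulation in which the cosine similarity is divided by the average similarity to the `k` nearest neighbours on both sides, the neighbourhood size `k=4` is an assumed default, and the function names are hypothetical.

```python
import numpy as np


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors over non-padding positions.

    hidden: (batch, seq, dim) layer outputs; mask: (batch, seq), 1 = real token.
    """
    mask = mask[..., None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)


def margin_retrieval_accuracy(src: np.ndarray, tgt: np.ndarray, k: int = 4) -> float:
    """Fraction of source sentences whose best-scoring target is its translation.

    src[i] and tgt[i] are sentence vectors of the same sentence in two languages.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T  # (n, n) cosine similarities
    k = min(k, cos.shape[0])
    # Mean similarity to the k nearest neighbours in the other language.
    nn_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # one value per source
    nn_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # one value per target
    margin = cos / ((nn_src[:, None] + nn_tgt[None, :]) / 2)
    pred = margin.argmax(axis=1)
    return float((pred == np.arange(len(src))).mean())


# Ordered language pairs probed for 35 languages: N * (N - 1).
n_directions = 35 * 34
```

Running `margin_retrieval_accuracy` once per layer and per direction, then averaging, would give a layer-wise alignment curve of the kind the analysis relies on.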
Mid-Layer Contrastive Alignment
- Goal: Explicitly pull together the representations of parallel translation pairs and push apart non-translation pairs at the middle layer.
- Mechanism: For \(n\) parallel sentence pairs in a batch, extract the mean-pooled hidden states at the \(i\)-th (middle) layer and apply the InfoNCE contrastive loss: \[\mathcal{L}_{\text{align}} = -\log \frac{\exp(\text{sim}(\mathbf{h}_s^i, \mathbf{h}_t^i))}{\sum_{v \in \mathcal{B}} \exp(\text{sim}(\mathbf{h}_s^i, \mathbf{h}_v^i))}\] where \(\mathbf{h}_s^i\) and \(\mathbf{h}_t^i\) are the pooled layer-\(i\) representations of a source sentence and its translation, \(\mathcal{B}\) is the set of candidates in the batch, and \(\text{sim}\) is the cosine similarity, with an optional temperature parameter \(\tau\).
- Alignment Data: Uses parallel sentence pairs from Tatoeba or task data; only a few hundred sentences are needed for low-resource languages. Alignment data is resampled to an approximately uniform distribution across languages.
- Design Motivation: The alignment loss alternates with the task loss and requires no change to the model architecture. Training costs roughly twice as much as standard SFT.
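The InfoNCE alignment loss can be sketched in NumPy as below. The sketch assumes that the negatives for each source sentence are the other target sentences in the same batch, that the loss is averaged over the batch, and that the temperature divides the similarity (the usual InfoNCE form); `info_nce` and its default `tau=1.0` are illustrative choices, not values taken from the paper.

```python
import numpy as np


def info_nce(src: np.ndarray, tgt: np.ndarray, tau: float = 1.0) -> float:
    """Batch-averaged InfoNCE loss over mean-pooled sentence vectors.

    src[i] and tgt[i] form a translation pair; every other tgt[j] in the
    batch serves as a negative for src[i] (an assumption of this sketch).
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = (src @ tgt.T) / tau  # cosine similarity scaled by temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal.
    return float(-np.diag(log_prob).mean())


# Toy check: well-aligned pairs should score a lower loss than shuffled ones.
pairs = np.eye(4)
aligned_loss = info_nce(pairs, pairs)
shuffled_loss = info_nce(pairs, np.roll(pairs, 1, axis=0))
```

In actual training the same computation would run on differentiable tensors taken from the chosen middle layer, so that the gradient updates only the LoRA parameters.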
Post-hoc Module Merging
- Goal: Enable existing task models to acquire cross-lingual capabilities without retraining.
- Mechanism: Train the task LoRA adapter and the alignment LoRA adapter separately, then merge them through weighted averaging (with weights tuned on the dev set).
- Effect: Post-merge performance is close to joint training (slot filling F1 +1.1 vs. joint +1.5, translation COMET +0.6 vs. joint +1.1), and gains are distributed more evenly across languages.
- Design Motivation: Alignment capability and task capability are decoupled; adaptation to new languages or capability enhancement does not require access to original task training data.
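The merging step can be sketched as a key-by-key weighted average of two adapters. This is an illustration under assumptions: adapters are represented as plain dictionaries of arrays, the parameter names are hypothetical, and the sketch does not settle whether merging operates on the low-rank factors or on the full weight updates, which this summary does not specify.

```python
import numpy as np


def merge_adapters(task: dict, align: dict, align_weight: float = 0.5) -> dict:
    """Weighted average of two adapters that share the same parameter names.

    align_weight is the share given to the alignment adapter; per the text it
    would be tuned on a dev set. 0.5 is only a placeholder default.
    """
    if task.keys() != align.keys():
        raise ValueError("adapters must have identical parameter names")
    return {
        name: (1.0 - align_weight) * task[name] + align_weight * align[name]
        for name in task
    }


# Hypothetical parameter names and toy values for illustration only.
task_adapter = {"layer0.q_proj.lora_A": np.ones((2, 2))}
align_adapter = {"layer0.q_proj.lora_A": np.zeros((2, 2))}
merged = merge_adapters(task_adapter, align_adapter, align_weight=0.25)
```

Since the two adapters are trained independently, the merge needs neither the original task data nor any further gradient steps.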
Training Details
| Configuration Item | Settings |
|---|---|
| Base Model | Llama 3-8B-Instruct / Qwen 2.5-7B-Instruct |
| Parameter-Efficient Fine-Tuning | LoRA rank=8, covering all attention and linear projection layers |
| Effective Batch Size | 128 (for both task & alignment) |
| Contrastive Learning Mini-batch | 32 parallel sentence pairs |
| Alignment Layer Position | Middle layer (layer 16 of 32 for Llama) |
| Alignment Data Volume | Only a few hundred parallel sentences for low-resource languages |
| Alignment Data Sampling | Multilingual resampling to an approximately uniform distribution |
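The "approximately uniform" resampling in the last row can be sketched as drawing the same number of pairs per language, with replacement when a language has too few. The exact sampling scheme is not given in this summary, so the function `resample_uniform`, the `per_lang` parameter, and the fixed seed are assumptions of this example.

```python
import random
from collections import Counter


def resample_uniform(pairs_by_lang: dict, per_lang: int, seed: int = 0) -> list:
    """Return a shuffled list of (lang, pair) with per_lang items per language.

    Languages with enough data are subsampled without replacement; smaller
    ones keep all their pairs and are topped up by sampling with replacement.
    """
    rng = random.Random(seed)
    out = []
    for lang, pairs in sorted(pairs_by_lang.items()):
        pairs = list(pairs)
        if not pairs:
            raise ValueError(f"no parallel data for language {lang!r}")
        if len(pairs) >= per_lang:
            chosen = rng.sample(pairs, per_lang)
        else:
            chosen = pairs + rng.choices(pairs, k=per_lang - len(pairs))
        out.extend((lang, pair) for pair in chosen)
    rng.shuffle(out)
    return out


# Toy data: integers stand in for parallel sentence pairs.
data = {"de": list(range(10)), "sw": [0, 1, 2]}
balanced = resample_uniform(data, per_lang=5)
counts = Counter(lang for lang, _ in balanced)
```

Balancing this way keeps a language with only a few hundred pairs from being drowned out by better-resourced ones in each contrastive mini-batch.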
Key Experimental Results
Main Results
| Task & Metric | Model | SFT Baseline | + Mid-Layer Alignment | Gain |
|---|---|---|---|---|
| Slot Filling Supervised (5 Languages) F1 | Llama 3 | 76.6 | 77.0 | +0.4 |
| Slot Filling Transfer (15 Languages) F1 | Llama 3 | 60.2 | 61.7 | +1.5 |
| Slot Filling Aligned Languages F1 | Llama 3 | 51.7 | 55.5 | +3.8 |
| Machine Translation Transfer\(\rightarrow\)En BLEU | Llama 3 | 31.8 | 32.3 | +0.5 |
| Machine Translation En\(\rightarrow\)Transfer COMET | Llama 3 | 79.6 | 80.7 | +1.1 |
| Retrieval Accuracy (20-language average) | Llama 3 | 39.4% | 73.2% | +33.8 pp |
| Slot Filling Transfer F1 | Qwen 2.5 | 53.5 | 55.3 | +1.8 |
Ablation Study
| Analytical Dimension | Key Findings |
|---|---|
| Alignment Layer Position | The middle layer (16) is optimal and yields the most uniform gains across languages; the bottom layer (8) severely degrades performance; the top layer (32) is feasible but gives uneven gains across languages (higher standard deviation) |
| Alignment Language Resource Level | The low-resource group gains the most (+3.8 F1) and the high-resource group the least (+0.7 F1): languages with weaker initial alignment benefit the most |
| Unseen Language Generalization | Unaligned languages still see an average improvement of +0.4 F1, demonstrating that the method enhances general transfer capabilities |
| Large-scale Alignment | Aligning 19 languages to English (+1.9 F1) beats aligning 5 languages to English (+1.5 F1); multidirectional alignment shows no extra gains, as alignment to English implicitly yields multidirectional effects |
| Domain Generalization | Alignment using out-of-domain Tatoeba / IWSLT data remains effective (retrieval accuracy 71.9% / 68.5% vs. oracle 77.7%) |
| Module Merging | Separate training followed by merging \(\approx\) joint training performance (slot filling +1.1 vs. +1.5, translation +0.6 vs. +1.1) |
| Long-sequence Tasks | Aligned languages gain +1.0 F1 in JSON generation, but the supervised set (including Chinese) drops by 1.0, suggesting a conflict between sentence-level alignment and long sequences |
| Non-Latin Scripts | Gains in non-Latin script languages are only +0.5 F1 (vs. overall +1.5 F1), limited by tokenization quality affecting mean pooling |
Highlights & Insights
- Large-Scale Empirical Grounding: Retrieval analysis across 1190 language directions gives statistical support for "middle-layer optimality" instead of leaving the layer choice to guesswork.
- Simple Alternating Optimization: Cross-lingual alignment and task learning are decoupled without modifying the model architecture or hand-tuning loss weights.
- Low Data Requirement: A few hundred parallel sentences per low-resource language are enough to improve transfer, which makes the method practical.
- Modular Design: Alignment and task LoRA modules can be trained independently and then merged, which suits engineering deployment: adding a new language only requires training a lightweight alignment module.
- "Radiation Effect" of Middle-Layer Alignment: Applying the contrastive loss at layer 16 also improves alignment in multiple preceding layers (Figure 4), whereas aligning at the top or bottom layer shows no such effect.
Limitations & Future Work
- Experiments are limited to 7-8B models; the optimal alignment layer position might differ for larger models.
- Gains are limited for non-Latin-script languages, which is attributed to tokenization quality; better pooling mechanisms need to be explored (preliminary experiments with attention pooling were unsuccessful).
- Conflicts exist between sentence-level mean pooling alignment and long-sequence tasks (e.g., F1 in Chinese decreased by 2.2 in JSON generation).
- Alternating optimization doubles the training computational cost, which can be mitigated by the module merging scheme.
- The effectiveness of aligning multiple layers simultaneously varies by task; optimal multi-layer strategies still need to be explored.
Rating
| Dimension | Score (1-10) | Description |
|---|---|---|
| Novelty | 7 | Contrastive alignment itself is not new; the core contribution lies in the systematic finding of "middle-layer optimality" and its application to decoder-only LLMs. |
| Practicality | 8 | Low data requirement (a few hundred parallel sentences) and plug-and-play module merging design lower the barrier for engineering deployment. |
| Experimental Thoroughness | 9 | 3 tasks \(\times\) 2 models \(\times\) various ablations (layer location / language resource / domain generalization / module merging), providing a comprehensive analysis. |
| Writing Quality | 8 | Clear motivation, well-organized experiments, systematic analysis, and rich figures/tables. |
Related Work & Insights
- vs. Cross-Lingual Methods for mBERT/XLM-R: Prior methods targeted classification tasks for encoder-only models. This work is the first to systematically study cross-lingual transfer for generation tasks in decoder-only LLMs.
- vs. Simple Translation Data Augmentation: Translating all training data is costly and may introduce translation errors; middle-layer alignment is more efficient.
- Offers valuable insights into where LLM multilingual capabilities originate and in which layers they are stored.
Rating
- Novelty: ⭐⭐⭐⭐ The finding that the middle layer is the optimal location for cross-lingual alignment is valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 1000+ language pairs + 3 task types + modular validation.
- Writing Quality: ⭐⭐⭐⭐ Systematic analysis and clear conclusions.
- Value: ⭐⭐⭐⭐ Practical significance for multilingual LLM deployment.