Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs¶

Conference: ACL 2025
arXiv: 2502.14830
Code: https://github.com/dannigt/mid-align
Area: LLM Alignment / Multilingual
Keywords: Cross-Lingual Transfer, middle-layer alignment, Contrastive Learning, Representation Alignment, low-resource languages

TL;DR¶

Through large-scale analysis of 1000+ language pairs (35 languages, 1190 directions), this work discovers that the middle layer of LLMs has the strongest potential for cross-lingual semantic alignment. It proposes alternately optimizing a middle-layer contrastive alignment loss during task fine-tuning, which significantly improves cross-lingual transfer on three major tasks: slot filling (F1 +1.5), machine translation (COMET +1.1), and JSON generation, while remaining effective for unseen languages and out-of-domain data. The separately trained alignment and task LoRA modules can be merged via weight averaging.

Background & Motivation¶

Background: Decoder-only LLMs can reach strong performance on specific tasks in specific languages through SFT. Extending this capability to many languages (especially low-resource ones) remains difficult: fine-tuning data rarely covers every language the LLM supports, which makes cross-lingual transfer crucial.
Key Challenge: Prior cross-lingual alignment methods mainly target encoder-only or encoder-decoder models (which can be aligned at the encoder output stage). Decoder-only LLMs do not have explicit input/output representation boundaries, leaving which layer to align and how to align as an open question.
Limitations of Prior Work: (1) Task fine-tuning (including multilingual fine-tuning) maintains but does not enhance cross-lingual semantic alignment (verified experimentally in Figure 3), indicating that pure SFT is insufficient; (2) Existing works focus only on the transfer of classification tasks, while generation tasks (with variable-length outputs) are more challenging; (3) Many methods require monolingual data for each target language for LM adaptation, which is costly.
Key Insight: A large-scale cross-lingual retrieval analysis (35 languages, 1190 directions) on Llama 3-8B and Qwen 2.5-7B, using the FLoRes-200 dataset, shows that the middle layers (around layer 16) achieve the highest translation retrieval accuracy, and that this accuracy correlates strongly with downstream cross-lingual transfer performance ($p < 0.01$). Based on this finding, the authors propose imposing an explicit contrastive alignment loss at the middle layer.
Core Idea: Integrate a contrastive cross-lingual alignment objective at the middle layers of the LLM, alternately optimized with the task loss, to enhance cross-lingual transfer.

Method¶

Overall Architecture¶

The training process consists of two alternately executed objectives: (1) Task objective—the standard cross-entropy loss for causal language modeling; (2) Alignment objective—imposing a contrastive loss on parallel translation sentence pairs at the middle layer. Each training step optimizes only one of these objectives to avoid manual weight tuning and gradient conflicts. Parameter-efficient fine-tuning is conducted using LoRA (rank=8), based on Llama 3-8B-Instruct and Qwen 2.5-7B-Instruct.
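The alternating schedule described above can be sketched as follows. This is a minimal illustration, not the authors' training loop: it assumes a strict 1:1 alternation by step parity (the summary only states that each step optimizes one objective), and `task_step` / `align_step` are hypothetical callables standing in for a full forward/backward pass and optimizer update.

```python
from typing import Callable, Iterator, List, Tuple


def alternating_schedule(num_steps: int) -> Iterator[str]:
    """Yield which objective to optimize at each training step.

    Assumption: strict 1:1 alternation (task, align, task, ...).
    """
    for step in range(num_steps):
        yield "task" if step % 2 == 0 else "align"


def train(num_steps: int,
          task_step: Callable[[], float],
          align_step: Callable[[], float]) -> List[Tuple[str, float]]:
    """Run exactly one objective per step and record (objective, loss)."""
    history = []
    for objective in alternating_schedule(num_steps):
        loss = task_step() if objective == "task" else align_step()
        history.append((objective, loss))
    return history


# Stand-in step functions that return constant losses (hypothetical values).
history = train(4, task_step=lambda: 1.0, align_step=lambda: 0.5)
print(history)
```

Because only one loss is back-propagated per step, no weighting coefficient between the two losses has to be chosen, which is the motivation the section gives for alternating rather than summing them.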

Key Designs¶

Translation Retrieval Probing
- Goal: Quantify the degree of cross-lingual semantic alignment across different LLM layers and find the optimal alignment layer.
- Mechanism: Extract hidden states at each layer for all 35 languages on FLoRes-200 $\rightarrow$ obtain sentence vectors via mean pooling $\rightarrow$ perform translation retrieval using ratio-based margin similarity, covering all $N(N-1) = 35 \times 34 = 1190$ language directions.
- Key Findings: The middle layers (layer 16 in Llama, similar position in Qwen) yield the highest retrieval accuracy, while the bottom and top layers are weaker; the alignment level of low-resource languages is less than half of the overall average; middle-layer retrieval accuracy is significantly positively correlated with downstream transfer F1 ($p < 0.01$).
- Design Motivation: Provides a reliable empirical foundation for the selection of subsequent alignment layers.
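The probing pipeline above (mean pooling, then retrieval with a ratio-based margin score) can be sketched in plain Python. This is an illustrative sketch under assumptions: the margin score follows the common ratio formulation (cosine similarity divided by the average similarity to the k nearest neighbours in both directions), the neighbourhood size `k` is a hypothetical choice not given in this summary, and the toy vectors stand in for real hidden states. A real implementation would use a tensor library.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_states: Sequence[Sequence[float]]) -> Vector:
    """Average per-token hidden states into one sentence vector."""
    n = len(token_states)
    return [sum(col) / n for col in zip(*token_states)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def margin_retrieve(src: List[Vector], tgt: List[Vector], k: int = 2) -> List[int]:
    """For each source sentence, return the index of the best-scoring target."""
    sim = [[cosine(s, t) for t in tgt] for s in src]
    # Mean similarity to the k nearest neighbours, in each direction.
    src_nn = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    tgt_nn = [sum(sorted(col, reverse=True)[:k]) / k for col in zip(*sim)]
    preds = []
    for i, row in enumerate(sim):
        # Ratio margin: raw similarity normalized by neighbourhood density.
        scores = [row[j] / ((src_nn[i] + tgt_nn[j]) / 2) for j in range(len(tgt))]
        preds.append(max(range(len(tgt)), key=scores.__getitem__))
    return preds


def retrieval_accuracy(preds: List[int]) -> float:
    """Source i translates target i, so a hit means preds[i] == i."""
    return sum(p == i for i, p in enumerate(preds)) / len(preds)


# Toy sentence vectors for two "languages" (hypothetical data).
src_vecs = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
tgt_vecs = [[0.9, 0.2], [0.2, 0.9], [0.8, 0.9]]
preds = margin_retrieve(src_vecs, tgt_vecs)
accuracy = retrieval_accuracy(preds)
n_directions = 35 * 34  # ordered language pairs, N(N-1)
print(preds, accuracy, n_directions)
```

Running this per layer and per language direction, then averaging the accuracies, gives the kind of layer-wise alignment curve the analysis relies on.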
Mid-Layer Contrastive Alignment
- Goal: Explicitly pull the representations of parallel translation pairs together and push non-translation pairs apart at the middle layer.
- Mechanism: For $n$ parallel sentence pairs within a batch, extract the mean-pooled hidden states of the $i$-th layer (middle layer) and apply the InfoNCE contrastive loss: $$\mathcal{L}_{\text{align}} = -\log \frac{\exp(\text{sim}(\mathbf{h}_s^i, \mathbf{h}_t^i))}{\sum_{v \in \mathcal{B}} \exp(\text{sim}(\mathbf{h}_s^i, \mathbf{h}_v^i))}$$ where $\text{sim}$ is the cosine similarity, with an optional temperature parameter $\tau$.
- Alignment Data: Uses parallel sentence pairs from Tatoeba or task data; only a few hundred sentences are needed for low-resource languages. Alignment data is resampled to an approximately uniform distribution across languages.
- Design Motivation: The alignment loss alternates with the task loss and requires no change to the model architecture. Training costs roughly twice as much as standard SFT, in exchange for the transfer gains reported under Main Results.
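The contrastive objective above can be illustrated with a small, dependency-free sketch. Several details are assumptions rather than facts from the paper: negatives are taken to be the other target sentences in the batch, the loss is computed in the source-to-target direction only and averaged over the batch, and the temperature value `tau=0.1` is a hypothetical setting. The inputs stand in for mean-pooled middle-layer hidden states.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(src: List[Sequence[float]],
             tgt: List[Sequence[float]],
             tau: float = 1.0) -> float:
    """Mean InfoNCE loss over a batch; tgt[i] is the translation of src[i]."""
    total = 0.0
    for i, s in enumerate(src):
        logits = [cosine(s, t) / tau for t in tgt]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # -log(exp(pos) / sum(exp(all))) == log_denom - pos
        total += log_denom - logits[i]
    return total / len(src)


# Toy pooled representations (hypothetical data).
src_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(src_batch, [[1.0, 0.0], [0.0, 1.0]], tau=0.1)
shuffled = info_nce(src_batch, [[0.0, 1.0], [1.0, 0.0]], tau=0.1)
print(aligned, shuffled)
```

The loss is small when each source vector is most similar to its own translation and large when the pairing is wrong, which is the signal that pulls parallel sentences together at the chosen layer.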
Post-hoc Module Merging
- Goal: Enable existing task models to acquire cross-lingual capabilities without retraining.
- Mechanism: Train the task LoRA adapter and the alignment LoRA adapter separately, then merge them through weighted averaging (with weights tuned on the dev set).
- Effect: Post-merge performance is close to joint training (slot filling F1 +1.1 vs. joint +1.5, translation COMET +0.6 vs. joint +1.1), and gains are distributed more evenly across languages.
- Design Motivation: Alignment capability and task capability are decoupled; adaptation to new languages or capability enhancement does not require access to original task training data.
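A minimal sketch of the merging step, assuming the two adapters share parameter names and shapes and that the merge is a per-parameter weighted average. The representation (a dict of flattened weight lists), the parameter name `q_proj.lora_A`, the candidate weights, and the `dev_score` callback are all hypothetical; how the paper's code performs the merge may differ.

```python
from typing import Callable, Dict, List, Sequence

Adapter = Dict[str, List[float]]  # parameter name -> flattened weights


def merge_adapters(task: Adapter, align: Adapter, task_weight: float) -> Adapter:
    """Per-parameter weighted average of two adapters with identical layout."""
    if task.keys() != align.keys():
        raise ValueError("adapters must share parameter names")
    merged = {}
    for name in task:
        if len(task[name]) != len(align[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [task_weight * a + (1.0 - task_weight) * b
                        for a, b in zip(task[name], align[name])]
    return merged


def tune_weight(task: Adapter, align: Adapter,
                dev_score: Callable[[Adapter], float],
                candidates: Sequence[float] = (0.3, 0.5, 0.7)) -> float:
    """Pick the merge weight with the best dev-set score (higher is better)."""
    return max(candidates, key=lambda w: dev_score(merge_adapters(task, align, w)))


# Toy adapters with a single hypothetical parameter.
task_adapter = {"q_proj.lora_A": [1.0, 3.0]}
align_adapter = {"q_proj.lora_A": [3.0, 5.0]}
merged = merge_adapters(task_adapter, align_adapter, task_weight=0.5)
print(merged)
```

Since merging only touches adapter weights, an existing task adapter can be combined with a newly trained alignment adapter without access to the original task data, which is the decoupling the section describes.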

Training Details¶

Configuration Item	Settings
Base Model	Llama 3-8B-Instruct / Qwen 2.5-7B-Instruct
Parameter-Efficient Fine-Tuning	LoRA rank=8, covering all attention and linear projection layers
Effective Batch Size	128 (for both task & alignment)
Contrastive Learning Mini-batch	32 parallel sentence pairs
Alignment Layer Position	Middle layer (layer 16 of 32 for Llama)
Alignment Data Volume	Only a few hundred parallel sentences for low-resource languages
Alignment Data Sampling	Multilingual resampling to an approximately uniform distribution
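One way to obtain an approximately uniform language distribution is to draw the same number of pairs per language, downsampling large languages and upsampling small ones with replacement. The sketch below is an assumption about how such resampling could be done, not a description of the authors' procedure; the function name, `per_lang`, and the toy data are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, English translation)


def resample_uniform(pairs_by_lang: Dict[str, List[Pair]],
                     per_lang: int,
                     seed: int = 0) -> List[Tuple[str, Pair]]:
    """Return a shuffled list with exactly `per_lang` pairs for each language."""
    rng = random.Random(seed)
    out: List[Tuple[str, Pair]] = []
    for lang, pairs in pairs_by_lang.items():
        if len(pairs) >= per_lang:
            chosen = rng.sample(pairs, per_lang)     # downsample, no repeats
        else:
            chosen = rng.choices(pairs, k=per_lang)  # upsample, with repeats
        out.extend((lang, p) for p in chosen)
    rng.shuffle(out)
    return out


# Toy data: one larger and one very small language (hypothetical).
data = {
    "de": [(f"de-{i}", f"en-{i}") for i in range(10)],
    "sw": [("sw-0", "en-0"), ("sw-1", "en-1")],
}
sample = resample_uniform(data, per_lang=3)
print(sample)
```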

Key Experimental Results¶

Main Results¶

Task & Metric	Model	SFT Baseline	+ Mid-Layer Alignment	Gain
Slot Filling Supervised (5 Languages) F1	Llama 3	76.6	77.0	+0.4
Slot Filling Transfer (15 Languages) F1	Llama 3	60.2	61.7	+1.5
Slot Filling Aligned Languages F1	Llama 3	51.7	55.5	+3.8
Machine Translation Transfer$\rightarrow$En BLEU	Llama 3	31.8	32.3	+0.5
Machine Translation En$\rightarrow$Transfer COMET	Llama 3	79.6	80.7	+1.1
Retrieval Accuracy (20-language average)	Llama 3	39.4%	73.2%	+33.8
Slot Filling Transfer F1	Qwen 2.5	53.5	55.3	+1.8

Ablation Study¶

Analytical Dimension	Key Findings
Alignment Layer Position	Middle layer (16) is optimal and yields the most uniform gains across languages; bottom layer (8) severely degrades performance; top layer (32) is feasible but results in uneven gains across languages (higher standard deviation)
Alignment Language Resource Level	The low-resource group achieves the largest gain (+3.8 F1), and the high-resource group achieves the smallest (+0.7 F1)—languages with weaker initial alignment benefit the most
Unseen Language Generalization	Unaligned languages still see an average improvement of +0.4 F1, demonstrating that the method enhances general transfer capabilities
Large-scale Alignment	19 languages $\rightarrow$ En alignment (+1.9 F1) > 5 languages $\rightarrow$ En (+1.5 F1); multidirectional alignment shows no extra gains, as En alignment implicitly yields multidirectional effects
Domain Generalization	Alignment using out-of-domain data from Tatoeba / IWSLT remains effective (retrieval accuracy 71.9% / 68.5% vs. oracle 77.7%)
Module Merging	Separate training followed by merging $\approx$ joint training performance (slot filling +1.1 vs. +1.5, translation +0.6 vs. +1.1)
Long-sequence Tasks	Aligned languages gain +1.0 F1 in JSON generation, but the supervised set (including Chinese) drops by 1.0, suggesting a conflict between sentence-level alignment and long sequences
Non-Latin Scripts	Gains in non-Latin script languages are only +0.5 F1 (vs. overall +1.5 F1), limited by tokenization quality affecting mean pooling

Highlights & Insights¶

Large-scale Empirical Grounding: Retrieval analysis across 1190 language directions provides statistical support for "middle-layer optimality" rather than relying on guesswork.
Simple and Efficient Alternating Optimization: The method separates cross-lingual alignment from task learning without modifying the model architecture or manually tuning loss weights.
Extremely Low Data Requirement: Low-resource languages require only a few hundred parallel sentences to achieve significant transfer improvements, which makes the method practical.
Modular Design: Alignment and task LoRA modules can be trained independently and then merged. This suits engineering deployment: only a lightweight alignment module needs to be trained when a new language is introduced.
"Radiation Effect" of Middle-Layer Alignment: After applying the contrastive loss at layer 16, alignment in several preceding layers is also enhanced (Figure 4), whereas top/bottom layer alignment shows no such effect.

Limitations & Future Work¶

Experiments are limited to 7-8B models; the optimal alignment layer position might differ for larger models.
Gains for non-Latin script languages are limited, which the analysis attributes to tokenization quality; better pooling mechanisms need to be explored (preliminary experiments with attention pooling were unsuccessful).
Conflicts exist between sentence-level mean pooling alignment and long-sequence tasks (e.g., F1 in Chinese decreased by 2.2 in JSON generation).
Alternating optimization doubles the training computational cost, which can be mitigated by the module merging scheme.
The effectiveness of aligning multiple layers simultaneously varies by task; optimal multi-layer strategies still need to be explored.

Rating¶

Dimension	Score (1-10)	Description
Novelty	7	Contrastive alignment itself is not new; the core contribution lies in the systematic finding of "middle-layer optimality" and its application to decoder-only LLMs.
Practicality	8	Low data requirement (a few hundred parallel sentences) and plug-and-play module merging design lower the barrier for engineering deployment.
Experimental Thoroughness	9	3 tasks $\times$ 2 models $\times$ various ablations (layer location / language resource / domain generalization / module merging), providing a comprehensive analysis.
Writing Quality	8	Clear motivation, well-organized experiments, systematic analysis, and rich figures/tables.

Related Work & Insights¶

vs. Cross-Lingual Methods for mBERT/XLM-R: Prior methods targeted classification tasks for encoder-only models. This work is the first to systematically study cross-lingual transfer for generation tasks in decoder-only LLMs.
vs. Simple Translation Data Augmentation: Translating all training data is costly and may introduce translation errors; middle-layer alignment is more efficient.
The work also offers valuable insights into where LLM multilingual capabilities originate and in which layers they are stored.

Rating¶

Novelty: ⭐⭐⭐⭐ The finding that the middle layer is the optimal location for cross-lingual alignment is valuable.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 1000+ language pairs + 3 task types + modular validation.
Writing Quality: ⭐⭐⭐⭐ Systematic analysis and clear conclusions.
Value: ⭐⭐⭐⭐ Practical significance for multilingual LLM deployment.
