ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework

Conference: ACL 2025
arXiv: 2410.19453
Code: None
Area: LLM NLP / Multilingual
Keywords: Multilingual LLM, Language Subspace, Shift Projection, Contrastive Learning, Non-Dominant Language

TL;DR

This paper proposes ShifCon, a framework that improves performance on non-dominant languages, especially low-resource ones. It shifts the representations of a non-dominant language into the dominant-language subspace so that they can access richer model knowledge, shifts them back to the original language subspace before generation, and adds multilingual contrastive learning to tighten the alignment.

Background & Motivation

Background:
- LLMs have demonstrated powerful multilingual capabilities, but a significant performance gap remains between dominant languages (e.g., English) and non-dominant languages.
- This gap mainly stems from the severe imbalance among languages in pre-training data (with English data vastly outnumbering others).
- A common mitigation strategy is to translate dominant language data into non-dominant languages for multilingual supervised fine-tuning (MSFT).

Limitations of Prior Work:
- High-quality data annotation for non-dominant languages is extremely costly.
- Translation errors tend to propagate in downstream processes.
- MSFT is limited by data scale, leading to a performance ceiling.
- Even if intermediate-layer representations appear language-agnostic, LDA visualization reveals that different languages still occupy distinct subspaces.

Key Challenge:
- Most of the model's knowledge is encoded in its parameters in the format of the dominant language, making it difficult for non-dominant language representations to effectively access this knowledge.
- However, target-language-specific information is required for generating output, meaning the entire processing flow cannot simply be conducted in the dominant language space.

Goal:
- To enhance the performance of non-dominant languages under limited MSFT data conditions by manipulating internal language representations within the model.

Key Insight:
- Starting from the model's internal representation space, leverage language vectors to perform shift operations across language subspaces.
- Combine this with subspace distance measurement to automatically determine the optimal shift layers.

Core Idea:
- Temporarily route the representation of non-dominant languages through the dominant language subspace to acquire rich knowledge, then return it to the original language subspace to complete generation.

Method

Overall Architecture

ShifCon consists of two core modules:
1. Shift Projection: Includes shift-toward (mapping to the dominant language space) and shift-backward (mapping back to the original language space).
2. Multilingual Contrastive Learning (MCL): Enhances the alignment between the shifted representations and the dominant language representations.

Key Designs

Shift-toward Projection:
- Function: At layer \(L_{\text{to}}\), maps the representation of a non-dominant language \(l\) to the dominant language (English) subspace.
- Core formula: \(\tilde{h}_l^{L_{\text{to}}} = h_l^{L_{\text{to}}} - v_l^{L_{\text{to}}} + v_d^{L_{\text{to}}}\)
- That is: subtract the original language vector and add the dominant language vector.
- The language vector \(v_l^i\) is obtained by averaging the sentence representations of that language at the \(i\)-th layer of the model.
- Design Motivation: By entering the dominant language subspace, the representations of non-dominant languages can better access the knowledge encoded in the model parameters in the dominant language format.
Shift-backward Projection:
- Function: At layer \(L_{\text{bk}}\), maps the dominant-like representation back to the original language subspace.
- Core formula: \({h'}_l^{L_{\text{bk}}} = \tilde{h}_l^{L_{\text{bk}}} - v_d^{L_{\text{bk}}} + v_l^{L_{\text{bk}}}\)
- Design Motivation: Language-specific information is crucial for generating target-language output and must be restored prior to generation.
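
The two projections above are plain vector arithmetic, which a short NumPy sketch can make concrete. This is a minimal illustration on random toy data, not the authors' implementation: the function names (`language_vector`, `shift_toward`, `shift_backward`) and the toy arrays are assumptions introduced here. In the actual method the two shifts happen at different layers (\(L_{\text{to}}\) and \(L_{\text{bk}}\)), each with its own language vectors, and the model's layers transform the representation in between; the sketch uses a single layer, so the round trip is exact.

```python
import numpy as np


def language_vector(sentence_reprs: np.ndarray) -> np.ndarray:
    """Average the sentence representations of one language at one layer.

    sentence_reprs has shape (n_sentences, hidden_size).
    """
    return sentence_reprs.mean(axis=0)


def shift_toward(h, v_lang, v_dom):
    """Subtract the language vector and add the dominant-language vector."""
    return h - v_lang + v_dom


def shift_backward(h_tilde, v_lang, v_dom):
    """Subtract the dominant-language vector and add the language vector back."""
    return h_tilde - v_dom + v_lang


# Toy data standing in for layer activations (not real model outputs).
rng = np.random.default_rng(0)
reprs_l = rng.normal(loc=2.0, size=(100, 8))   # non-dominant language l
reprs_d = rng.normal(loc=-1.0, size=(100, 8))  # dominant language d

v_l = language_vector(reprs_l)
v_d = language_vector(reprs_d)

shifted = shift_toward(reprs_l, v_l, v_d)      # broadcasts over all sentences
restored = shift_backward(shifted, v_l, v_d)   # same-layer round trip
```

After the shift, the mean of the language-\(l\) representations coincides with the dominant-language vector, which is the sense in which they become "dominant-like".
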
Language Subspace Distance:
- Function: Automatically determines the optimal layer positions for shift-toward and shift-backward.
- Mechanism: Uses a Riemannian distance-based metric to measure the alignment between the dominant-like subspace and the dominant language subspace.
- Formula: \(\text{Dist}(S^{D'}, S^D) = \sqrt{\sum_i \log^2(\lambda_i)} + \|\mu_{D'} - \mu_{D}\|_2\)
- Perform SVD to obtain the principal directions of each language subspace, compute the distance at every layer, and select the contiguous region of layers where the distance is lowest (the low subspace distance region).
- Sorting layers by distance and keeping the \(\beta\%\) with the smallest distance (e.g., \(\beta = 30\)) shows that the selected layers form a contiguous block of intermediate layers in each model examined.
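
The layer-selection procedure above can be sketched as follows. Several details are assumptions, since this summary does not define them: \(\lambda_i\) is taken to be the generalized eigenvalues of the two subspaces' covariance matrices (the usual affine-invariant Riemannian distance), full regularized covariances are used in place of the SVD-based principal directions, and `subspace_distance`, `low_distance_layers`, and `eps` are hypothetical names.

```python
import numpy as np


def subspace_distance(x_a: np.ndarray, x_b: np.ndarray, eps: float = 1e-3) -> float:
    """Covariance (Riemannian) term plus mean (Euclidean) term.

    x_a, x_b: (n_sentences, hidden_size) representations at one layer.
    eps regularizes the covariances so they stay positive definite.
    """
    dim = x_a.shape[1]
    cov_a = np.cov(x_a, rowvar=False) + eps * np.eye(dim)
    cov_b = np.cov(x_b, rowvar=False) + eps * np.eye(dim)

    # Generalized eigenvalues of (cov_b, cov_a) via a Cholesky whitening step.
    chol_inv = np.linalg.inv(np.linalg.cholesky(cov_a))
    lam = np.linalg.eigvalsh(chol_inv @ cov_b @ chol_inv.T)

    cov_term = np.sqrt(np.sum(np.log(lam) ** 2))
    mean_term = np.linalg.norm(x_a.mean(axis=0) - x_b.mean(axis=0))
    return float(cov_term + mean_term)


def low_distance_layers(distances, beta: float = 0.3) -> list:
    """Indices of the beta fraction of layers with the smallest distance."""
    k = max(1, round(beta * len(distances)))
    return sorted(np.argsort(distances)[:k].tolist())


# Toy check: identical clouds are at distance ~0; a pure offset adds its norm.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
d_same = subspace_distance(x, x)
d_offset = subspace_distance(x, x + 3.0)

# Hypothetical per-layer distances for a 10-layer model.
layer_distances = [9.0, 7.0, 4.0, 1.0, 0.8, 1.2, 5.0, 8.0, 9.0, 10.0]
selected = low_distance_layers(layer_distances, beta=0.3)
```

One natural reading, also an assumption here, is to place shift-toward at the first selected layer and shift-backward at the last.
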
Multilingual Contrastive Learning (MCL):
- Function: Further aligns dominant-like representations with their corresponding dominant language representations.
- Mechanism: Employs multilingual translation pairs as positive samples to pull the dominant-like representation of a non-dominant language closer to the dominant language representation, while pushing away other representations.
- Design Motivation: Shift projection alone is insufficient to fully align the representation space; MCL provides a stronger alignment signal.

Loss & Training

MCL uses standard contrastive learning loss.
Training data: A small amount of MSFT data + FLORES translation data for calculating language vectors.
Important finding: Applying MCL directly to the original representations (instead of the shifted ones) harms language-specific information, thereby degrading target language generation.
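
This summary does not spell out the "standard contrastive learning loss". One common choice is InfoNCE with in-batch negatives, sketched below in NumPy as an assumption rather than the paper's exact objective. The function name `info_nce` and the `temperature` value are hypothetical. Row \(i\) of `anchors` plays the role of a dominant-like (shifted) representation and row \(i\) of `positives` its translation in the dominant language; all other rows in the batch act as negatives.

```python
import numpy as np


def info_nce(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss with in-batch negatives; inputs have shape (batch, hidden)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    logits = a @ p.T / temperature                       # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching translation sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: "dominant-like" rows are noisy copies of their translations.
rng = np.random.default_rng(0)
dominant = rng.normal(size=(16, 32))
dominant_like = dominant + 0.01 * rng.normal(size=(16, 32))

loss_aligned = info_nce(dominant_like, dominant)
loss_mismatched = info_nce(dominant_like, np.roll(dominant, 1, axis=0))
```

Well-aligned pairs give a much lower loss than deliberately mismatched ones, which is the signal that pulls shifted representations toward their dominant-language counterparts.
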

Key Experimental Results

Main Results

Results on Llama-2 7B (High-resource vs. Low-resource languages):

| Method | MGSM (High) | MGSM (Low) | FLORES en→xx (High) | FLORES en→xx (Low) |
|---|---|---|---|---|
| Base | 35.2 | 5.1 | 33.5 | 15.9 |
| +MSFT | 44.9 | 29.5 | 34.7 | 18.4 |
| +AFP | 46.3 | 31.7 | 35.2 | 19.1 |
| +ShifCon | 48.2 | 35.1 | 35.6 | 19.7 |

Low-resource language MGSM: the score rises from 5.1 (Base) to 35.1 with ShifCon. Relative to the MSFT baseline (29.5), this is a gain of 5.6 points, reported as a +18.9% relative improvement.
ShifCon outperforms both MSFT and AFP across all tasks and language settings.

Other models: XGLM 7.5B shows similar improvements.

Key Findings

Low-resource languages benefit the most: ShifCon achieves far greater improvements for low-resource languages than for high-resource ones.
Intermediate layers are the optimal shift region: Through the subspace distance metric, it is observed that the language subspace distance is minimized in the intermediate layers.
Continuous property: The low subspace distance layers form a continuous block of intermediate layers across different models and scales.
MCL must be performed on shifted representations: Applying contrastive learning directly to the original representations destroys language-specific information.
\(\beta=30\%\) works best: Performance peaks when the low subspace distance region covers approximately \(30\%\) of the model layers.
Conjecture: Low-distance layers primarily handle information aggregation: These layers may focus on cross-lingual semantic fusion.

Highlights & Insights

Intuition-driven Elegant Design: Modeling the representation spaces of different languages as subspaces that can be transformed via vector shift is simple yet effective.
Contribution of Subspace Distance Metric: Provides a principled approach to automatically select the optimal shift layers, avoiding blind searching.
Deepened Understanding of LLM's Internal Multilingual Mechanisms:
- Even if intermediate layers appear language-agnostic on the surface, they still retain language-specific information in other projection directions.
- Model knowledge is predominantly encoded in parameters in the dominant language format.
High Practical Value: No extra multilingual data annotation is required; significant improvements can be achieved by utilizing existing MSFT data alone.

Limitations & Future Work

Language vectors are acquired via simple mean pooling; more refined extraction methods for language representations could be more effective.
The shift operation assumes that language subspaces are related by a simple translation (vector addition/subtraction); the true relationship may be more complex.
MCL employs translation pairs as positive samples, meaning translation quality directly affects the performance of contrastive learning.
Mainly validated on 7B-scale models; the effectiveness and optimal hyperparameters on larger models might differ.
There is a discrepancy in the improvement magnitude between generation and classification tasks; cross-task robustness remains to be enhanced.
Calculating language vectors requires a certain amount of data in the respective language, which may not be feasible for extremely low-resource languages.

Related Work & Insights

Work Related to Language Vectors (Libovický et al., 2020; Xu et al., 2023; Tang et al., 2024): Language vectors act as effective tools for subspace mapping.
MSFT Methods (Chen et al., 2023; Zhang et al., 2023): Standard multilingual fine-tuning methods, serving as the baseline for this work.
LLM Internal Representation Alignment (Yoon et al., 2024; Li et al., 2024): Improving multilingual performance via aligning internal representations.
Insights: The multilingual capability of LLMs depends not only on data volume but can also be enhanced by manipulating the internal representation space.

Rating

Novelty: ⭐⭐⭐⭐ - The combination of shift projection and a subspace distance metric is relatively novel, though the language vector manipulation itself is not entirely new.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ - Multiple models, tasks, and languages (high/low resource), with extensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ - Clear framework illustrations, intuitive LDA visualizations, and complete mathematical formulations.
Value: ⭐⭐⭐⭐⭐ - High practical value for low-resource multilingual LLMs.