ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework
Conference: ACL 2025
arXiv: 2410.19453
Code: None
Area: LLM NLP / Multilingual
Keywords: Multilingual LLM, Language Subspace, Shift Projection, Contrastive Learning, Non-Dominant Language
TL;DR
ShifCon improves LLM performance on non-dominant languages, especially low-resource ones, in three steps: it shifts the representations of a non-dominant language into the dominant-language subspace so they can access the richer knowledge stored there, shifts them back to the original language subspace before generation, and adds multilingual contrastive learning to tighten the alignment.
Background & Motivation
Background:
- LLMs have demonstrated powerful multilingual capabilities, but a significant performance gap remains between dominant languages (e.g., English) and non-dominant languages.
- This gap mainly stems from the severe imbalance among languages in pre-training data (English data vastly outnumbers that of other languages).
- A common mitigation strategy is to translate dominant-language data into non-dominant languages for multilingual supervised fine-tuning (MSFT).
Limitations of Prior Work:
- High-quality data annotation for non-dominant languages is extremely costly.
- Translation errors tend to propagate into downstream processing.
- MSFT is limited by data scale, which caps the achievable performance.
- Even where intermediate-layer representations appear language-agnostic, LDA visualization reveals that different languages still occupy distinct subspaces.
Key Challenge:
- Most of the model's knowledge is encoded in its parameters in the format of the dominant language, making it difficult for non-dominant-language representations to access this knowledge effectively.
- However, target-language-specific information is required for generating output, so the entire processing flow cannot simply be conducted in the dominant-language space.
Goal:
- Enhance the performance of non-dominant languages under limited MSFT data by manipulating the language representations inside the model.
Key Insight:
- Starting from the model's internal representation space, use language vectors to perform shift operations across language subspaces.
- Combine this with a subspace distance measurement to determine the optimal shift layers automatically.
Core Idea:
- Temporarily route the representations of non-dominant languages through the dominant-language subspace to acquire rich knowledge, then return them to the original language subspace to complete generation.
Method
Overall Architecture
ShifCon consists of two core modules:
1. Shift Projection: Includes shift-toward (mapping to the dominant language space) and shift-backward (mapping back to the original language space).
2. Multilingual Contrastive Learning (MCL): Enhances the alignment between the shifted representations and the dominant language representations.
Key Designs
- Shift-toward Projection:
  - Function: At layer \(L_{\text{to}}\), maps the representation of a non-dominant language \(l\) to the dominant language (English) subspace.
  - Core formula: \(\tilde{h}_l^{L_{\text{to}}} = h_l^{L_{\text{to}}} - v_l^{L_{\text{to}}} + v_d^{L_{\text{to}}}\)
  - That is: subtract the original language vector and add the dominant language vector.
  - The language vector \(v_l^i\) is obtained by averaging the sentence representations of that language at the \(i\)-th layer of the model.
  - Design Motivation: By entering the dominant language subspace, the representations of non-dominant languages can better access the knowledge encoded in the model parameters in the dominant language format.
- Shift-backward Projection:
  - Function: At layer \(L_{\text{bk}}\), maps the dominant-like representation back to the original language subspace.
  - Core formula: \({h'}_l^{L_{\text{bk}}} = \tilde{h}_l^{L_{\text{bk}}} - v_d^{L_{\text{bk}}} + v_l^{L_{\text{bk}}}\)
  - Design Motivation: Language-specific information is crucial for generating target-language output and must be restored prior to generation.
- Language Subspace Distance:
  - Function: Automatically determines the layers at which to apply shift-toward and shift-backward.
  - Mechanism: Uses a Riemannian-distance-based metric to measure how well the dominant-like subspace aligns with the dominant language subspace.
  - Formula: \(\text{Dist}(S^{D'}, S^D) = \sqrt{\sum_i \log^2(\lambda_i)} + \|\mu_{D'} - \mu_{D}\|_2\)
  - SVD yields the principal directions of each language subspace; the contiguous block of layers with the smallest distance (the low subspace distance region) is then selected.
  - Sorting layers by distance and keeping the \(\beta\%\) with the smallest distances (e.g., \(30\%\)) shows that the selected layers form a contiguous block of intermediate layers across different models.
- Multilingual Contrastive Learning (MCL):
  - Function: Further aligns dominant-like representations with their corresponding dominant language representations.
  - Mechanism: Employs multilingual translation pairs as positive samples to pull the dominant-like representation of a non-dominant language closer to the dominant language representation, while pushing away other representations.
  - Design Motivation: Shift projection alone is insufficient to fully align the representation space; MCL provides a stronger alignment signal.
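The two shift projections above reduce to vector arithmetic with per-language mean vectors. The following is a minimal NumPy sketch under stated assumptions: the function names, the toy Gaussian "activations", and the hidden size are all hypothetical, and both shifts are applied at the same layer purely as a sanity check.

```python
import numpy as np


def language_vector(sentence_reps: np.ndarray) -> np.ndarray:
    """Average the sentence representations of one language at one layer.

    sentence_reps has shape (n_sentences, hidden_dim).
    """
    return sentence_reps.mean(axis=0)


def shift_toward(h, v_lang, v_dom):
    """Shift-toward: subtract the language vector, add the dominant one."""
    return h - v_lang + v_dom


def shift_backward(h_tilde, v_lang, v_dom):
    """Shift-backward: subtract the dominant vector, add the language one."""
    return h_tilde - v_dom + v_lang


# Toy stand-ins for layer activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
hidden_dim = 8
reps_lang = rng.normal(loc=2.0, size=(100, hidden_dim))   # non-dominant language
reps_dom = rng.normal(loc=-1.0, size=(100, hidden_dim))   # dominant language

v_lang = language_vector(reps_lang)
v_dom = language_vector(reps_dom)

shifted = shift_toward(reps_lang, v_lang, v_dom)
restored = shift_backward(shifted, v_lang, v_dom)

# After shifting, the language mean coincides with the dominant-language mean.
print(np.allclose(shifted.mean(axis=0), v_dom))
# With nothing in between, the two shifts cancel (up to floating-point error).
print(np.allclose(restored, reps_lang))
```

In the method as described, the two shifts happen at different layers (\(L_{\text{to}}\) and \(L_{\text{bk}}\)) with that layer's own language vectors, so the transformer layers in between operate on the dominant-like representation; the round trip here only checks the arithmetic.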
Loss & Training
- MCL uses a standard contrastive learning loss.
- Training data: A small amount of MSFT data + FLORES translation data for calculating language vectors.
- Important finding: Applying MCL directly to the original representations (instead of the shifted ones) harms language-specific information, thereby degrading target language generation.
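The notes above only say that MCL uses a standard contrastive loss. As one plausible instance, the sketch below assumes an InfoNCE-style loss with in-batch negatives, cosine similarity, and a temperature of 0.1; these choices, the function name `info_nce`, and the random toy vectors are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np


def info_nce(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE with in-batch negatives (assumed loss, for illustration only).

    anchors[i] is a shifted (dominant-like) sentence representation and
    positives[i] is the representation of its dominant-language translation.
    Every other row of `positives` acts as a negative for anchors[i].
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # (batch, batch) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # correct pairs sit on the diagonal


# Toy batch: dominant-language vectors and noisy "shifted" counterparts.
rng = np.random.default_rng(0)
dominant = rng.normal(size=(16, 32))
shifted = dominant + 0.05 * rng.normal(size=(16, 32))

loss_aligned = info_nce(shifted, dominant)          # translation pairs line up
loss_mismatched = info_nce(shifted, dominant[::-1]) # pairs deliberately scrambled
print(loss_aligned < loss_mismatched)
```

Consistent with the finding above, the anchors in this sketch are the shifted representations rather than the original ones, so the alignment pressure does not act directly on the language-specific component.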
Key Experimental Results
Main Results
Results on Llama-2 7B (High-resource vs. Low-resource languages):
| Method | MGSM-High | MGSM-Low | FLORES(en→xx)-High | FLORES(en→xx)-Low |
|---|---|---|---|---|
| Base | 35.2 | 5.1 | 33.5 | 15.9 |
| +MSFT | 44.9 | 29.5 | 34.7 | 18.4 |
| +AFP | 46.3 | 31.7 | 35.2 | 19.1 |
| +ShifCon | 48.2 | 35.1 | 35.6 | 19.7 |
- Low-resource MGSM: the score rises from 5.1 (Base) to 35.1 with ShifCon; against the MSFT baseline (29.5) this is a gain of 5.6 points, reported as a +18.9% relative improvement.
- ShifCon outperforms both MSFT and AFP across all tasks and language settings.
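The relative gains implied by the table can be recomputed directly. This short sketch copies the +MSFT and +ShifCon rows from the table above; the shortened column keys and the helper name `relative_gain` are only illustrative.

```python
# Scores copied from the Llama-2 7B table above.
scores = {
    "+MSFT":    {"MGSM-High": 44.9, "MGSM-Low": 29.5, "FLORES-High": 34.7, "FLORES-Low": 18.4},
    "+ShifCon": {"MGSM-High": 48.2, "MGSM-Low": 35.1, "FLORES-High": 35.6, "FLORES-Low": 19.7},
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gains = {
    column: relative_gain(scores["+ShifCon"][column], scores["+MSFT"][column])
    for column in scores["+MSFT"]
}

for column, gain in gains.items():
    print(f"{column}: {gain:.2f}% relative gain over MSFT")
```

For MGSM-Low the computation is (35.1 - 29.5) / 29.5, which is about 0.19 and matches the reported figure up to rounding; the low-resource columns also show larger relative gains than their high-resource counterparts.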
Other models: XGLM 7.5B shows similar improvements.
Key Findings
- Low-resource languages benefit the most: ShifCon achieves far greater improvements for low-resource languages than for high-resource ones.
- Intermediate layers are the optimal shift region: Through the subspace distance metric, it is observed that the language subspace distance is minimized in the intermediate layers.
- Contiguity: The low subspace distance layers form a contiguous block of intermediate layers across different models and scales.
- MCL must be performed on shifted representations: Applying contrastive learning directly to the original representations destroys language-specific information.
- \(\beta=30\%\) is a preferable choice: The best performance is achieved when the low subspace distance region covers approximately \(30\%\) of the model layers.
- Conjecture: Low-distance layers primarily handle information aggregation: These layers may focus on cross-lingual semantic fusion.
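The layer-selection findings above can be illustrated with a small sketch. It assumes that the \(\lambda_i\) in the distance formula are the generalized eigenvalues of the two (regularized) covariance matrices, as in the affine-invariant Riemannian distance; that reading, the helper names, the `eps` regularizer, and the synthetic per-layer data are all assumptions, and the sketch skips the SVD-based projection onto principal directions mentioned in the method.

```python
import numpy as np


def subspace_distance(x_a: np.ndarray, x_b: np.ndarray, eps: float = 1e-3) -> float:
    """sqrt(sum_i log^2(lambda_i)) + ||mu_a - mu_b||_2 for two sets of representations."""
    dim = x_a.shape[1]
    cov_a = np.cov(x_a, rowvar=False) + eps * np.eye(dim)
    cov_b = np.cov(x_b, rowvar=False) + eps * np.eye(dim)
    # Eigenvalues of cov_b^{-1} cov_a, computed via a Cholesky whitening step.
    chol_inv = np.linalg.inv(np.linalg.cholesky(cov_b))
    lam = np.linalg.eigvalsh(chol_inv @ cov_a @ chol_inv.T)
    shape_term = np.sqrt(np.sum(np.log(lam) ** 2))
    mean_term = np.linalg.norm(x_a.mean(axis=0) - x_b.mean(axis=0))
    return float(shape_term + mean_term)


def select_low_distance_layers(distances, beta: float = 0.30):
    """Indices of the beta fraction of layers with the smallest distance."""
    n_keep = max(1, round(beta * len(distances)))
    order = np.argsort(distances)
    return sorted(int(i) for i in order[:n_keep])


def is_contiguous(layers) -> bool:
    return all(b - a == 1 for a, b in zip(layers, layers[1:]))


# Synthetic "model": the two languages differ most in the outer layers.
rng = np.random.default_rng(0)
n_layers, dim, n_sent = 10, 4, 500
mid = (n_layers - 1) / 2
distances = []
for layer in range(n_layers):
    scale = 1.0 + 0.2 * abs(layer - mid)
    dom = rng.normal(size=(n_sent, dim))
    non_dom = rng.normal(loc=3.0, scale=scale, size=(n_sent, dim))
    shifted = non_dom - non_dom.mean(axis=0) + dom.mean(axis=0)  # shift-toward
    distances.append(subspace_distance(shifted, dom))

selected = select_low_distance_layers(distances, beta=0.30)
print(selected, is_contiguous(selected))
```

Because the shift already matches the means, the mean term vanishes for the shifted representations here and the distance is driven by the covariance term; on a real model the per-layer distances would be computed from actual hidden states rather than synthetic Gaussians.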
Highlights & Insights
- Intuition-driven Elegant Design: Modeling the representation spaces of different languages as subspaces that can be transformed via vector shift is simple yet effective.
- Contribution of Subspace Distance Metric: Provides a principled approach to automatically select the optimal shift layers, avoiding blind searching.
- Deepened Understanding of LLMs' Internal Multilingual Mechanisms:
  - Even if intermediate layers appear language-agnostic on the surface, they still retain language-specific information in other projection directions.
  - Model knowledge is predominantly encoded in parameters in the dominant language format.
- High Practical Value: No extra multilingual data annotation is required; the gains come from existing MSFT data together with FLORES translation data for computing language vectors.
Limitations & Future Work
- Language vectors are acquired via simple mean pooling; more refined extraction methods for language representations could be more effective.
- The shift operation assumes that language subspaces are related by a simple linear offset (vector addition/subtraction); the actual relationship may be more complex.
- MCL employs translation pairs as positive samples, meaning translation quality directly affects the performance of contrastive learning.
- Mainly validated on 7B-scale models; the effectiveness and optimal hyperparameters on larger models might differ.
- There is a discrepancy in the improvement magnitude between generation and classification tasks; cross-task robustness remains to be enhanced.
- Calculating language vectors requires a certain amount of data in the respective language, which may not be feasible for extremely low-resource languages.
Related Work & Insights
- Work Related to Language Vectors (Libovický et al., 2020; Xu et al., 2023; Tang et al., 2024): Language vectors act as effective tools for subspace mapping.
- MSFT Methods (Chen et al., 2023; Zhang et al., 2023): Standard multilingual fine-tuning methods, serving as the baseline for this work.
- LLM Internal Representation Alignment (Yoon et al., 2024; Li et al., 2024): Improving multilingual performance via aligning internal representations.
- Insights: The multilingual capability of LLMs does not depend on data volume alone; it can also be enhanced by manipulating the internal representation space.
Rating
- Novelty: ⭐⭐⭐⭐. The combination of shift projection and a subspace distance metric is relatively novel, though language vector manipulation itself is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐. Multiple models, tasks, and languages (high/low resource), with extensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐. Clear framework illustrations, intuitive LDA visualizations, and complete mathematical formulations.
- Value: ⭐⭐⭐⭐⭐. High practical value for low-resource multilingual LLMs.