Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment

Conference: ACL 2025
arXiv: 2407.14878
Area: Multilingual Translation
Keywords: multilingual sentence encoders, modular training, curse of multilinguality, cross-lingual alignment, adapters

TL;DR

This paper proposes a modular training scheme for multilingual sentence encoders: it first trains language-specific modules (embedding + language adapter + sentence encoder adapter) to alleviate the curse of multilinguality, and then trains cross-lingual alignment adapters using both parallel and paraphrase data to resolve performance trade-offs among different cross-lingual tasks. This approach consistently outperforms monolithic model training across 4 tasks and 23 languages.

Background & Motivation

Multilingual Sentence Encoders (MSE) map sentences from different languages into a shared semantic space and are widely used for cross-lingual retrieval, clustering, and classification.
Two Core Problems:
Curse of Multilinguality (CoM): Parameter sharing leads to a degradation in the quality of monolingual representations for each language, which is particularly severe for low-resource languages.
Performance Trade-offs Across Tasks:
- Cross-lingual alignment training destroys monolingual semantic structures \(\rightarrow\) Monolingual vs. Cross-lingual performance conflict.
- Training with parallel data suits bitext mining but is ill-suited for semantic similarity \(\rightarrow\) Conflict between different cross-lingual tasks.
- Training with paraphrase data suits semantic similarity but is ill-suited for bitext mining.
Existing modular approaches (e.g., LASER3) only solve a subset of these problems, and their teacher models themselves are already affected by CoM.

Method

Overall Architecture

Three-step modular training (only the parameters of the corresponding modules are updated in each step):

1. Language Adaptation (LA): Train a language-specific embedding layer + LoRA language adapter for each language.
2. Sentence Encoding (SE) Training: Stack a LoRA SE adapter on top of the language adapter and train it using monolingual paraphrase data.
3. Cross-Lingual Alignment (CLA): Train a parallel adapter by alternating between cross-lingual paraphrase data and parallel data.
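
The rule that each step updates only its own modules can be sketched as a small lookup table. This is a minimal illustration, not the authors' code: the module names (`backbone`, `lang_embeddings`, and so on) are hypothetical labels chosen here, and the backbone is assumed to stay frozen throughout because the adapters are LoRA-based.

```python
# Hypothetical module names; the source only states that each step updates
# its own modules and leaves everything else untouched.
MODULES = ["backbone", "lang_embeddings", "lang_adapter", "se_adapter", "cla_adapter"]

# Which modules receive gradient updates in each of the three steps.
TRAINABLE_PER_STEP = {
    "LA": {"lang_embeddings", "lang_adapter"},  # language adaptation
    "SE": {"se_adapter"},                       # sentence encoding training
    "CLA": {"cla_adapter"},                     # cross-lingual alignment
}


def trainable_flags(step):
    """Return {module_name: is_trainable} for one training step."""
    trainable = TRAINABLE_PER_STEP[step]
    return {name: name in trainable for name in MODULES}


for step in ("LA", "SE", "CLA"):
    flags = trainable_flags(step)
    print(step, [name for name, on in flags.items() if on])
```

In a real training loop these flags would typically be used to switch gradient tracking on or off per parameter group before each step.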

Key Designs

Language Adaptation:
- Train a dedicated tokenizer for each language.
- Initialize the new embeddings with the FOCUS method: copy the embeddings of tokens that already exist in the original vocabulary, and initialize each new token by interpolating the embeddings of similar existing tokens.
- Continue MLM training using only LoRA for parameter efficiency.
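
The copy-or-interpolate idea behind the embedding initialization can be sketched as follows. This is a simplified illustration under stated assumptions, not the FOCUS implementation: the similarity scores are taken as given inputs, and the softmax weighting is a stand-in chosen here, so the weighting used by FOCUS itself may differ. The function name `init_embeddings` is hypothetical.

```python
import math


def init_embeddings(new_vocab, old_emb, similar):
    """Initialize embeddings for a new vocabulary.

    old_emb: {token: vector} from the original model.
    similar: {new_token: {old_token: similarity_score}} for tokens that are
             missing from old_emb (scores are assumed to be precomputed).
    """
    dim = len(next(iter(old_emb.values())))
    new_emb = {}
    for tok in new_vocab:
        if tok in old_emb:
            # Overlapping token: copy the existing embedding.
            new_emb[tok] = list(old_emb[tok])
            continue
        # New token: weighted average of similar existing tokens.
        # Softmax weights are an assumption made for this sketch.
        scores = similar[tok]
        top = max(scores.values())
        exp = {t: math.exp(s - top) for t, s in scores.items()}
        norm = sum(exp.values())
        vec = [0.0] * dim
        for t, e in exp.items():
            for i, x in enumerate(old_emb[t]):
                vec[i] += (e / norm) * x
        new_emb[tok] = vec
    return new_emb


old = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print(init_embeddings(["a", "c"], old, {"c": {"a": 0.0, "b": 0.0}}))
```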

Sentence Encoding Retraining:
- The MLM objective degrades the sentence encoding capability of the pretrained MSE, so the sentence encoder must be retrained.
- Train with MNRL (Multiple Negatives Ranking Loss) on monolingual paraphrase data obtained via machine translation.
- Freeze the LA module and update only the SE adapter.
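
For readers unfamiliar with MNRL, the loss can be sketched in plain Python: each anchor is scored against every positive in the batch, and the matching positive must win a softmax over the in-batch alternatives. This is a minimal sketch rather than the training code used in the paper; the `scale` value of 20 is an assumption (a commonly used default), not a setting reported in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positives[j] serves as an in-batch negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)
        # Numerically stable log-sum-exp over the batch.
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


batch = [[1.0, 0.0], [0.0, 1.0]]
print(mnrl_loss(batch, batch))        # correctly matched pairs: loss near zero
print(mnrl_loss(batch, batch[::-1]))  # mismatched pairs: large loss
```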

Cross-Lingual Alignment:
- Perform bilingual alignment with English as the pivot language (English embeddings have the highest quality, trained on gold-standard paraphrase data).
- Alternating Training: Use cross-lingual paraphrase pairs + MNRL loss in one batch, and parallel pairs + cosine similarity loss in the next batch.
- Use Parallel Adapters to prevent alignment training from interfering with monolingual SE capabilities.
- Do not train a CLA adapter for English: force other languages to project into the English space.
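
The alternating schedule and the English exception can be sketched with a small generator. This only illustrates the batch ordering described above; the loss names are labels, the 1:1 alternation follows the "one batch, then the next" description, and the function names are hypothetical.

```python
def alternating_batches(paraphrase_batches, parallel_batches):
    """Yield (loss_name, batch), switching data source at every step.

    Stops as soon as either source runs out of batches.
    """
    para_iter = iter(paraphrase_batches)
    par_iter = iter(parallel_batches)
    while True:
        try:
            yield "mnrl", next(para_iter)    # cross-lingual paraphrase pairs
            yield "cosine", next(par_iter)   # parallel (translation) pairs
        except StopIteration:
            return


def uses_cla_adapter(lang, pivot="en"):
    """The pivot language gets no CLA adapter; all others are mapped onto it."""
    return lang != pivot


for loss_name, batch in alternating_batches(["para_0", "para_1"], ["par_0", "par_1"]):
    print(loss_name, batch)
print(uses_cla_adapter("en"), uses_cla_adapter("de"))
```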

Training Data:
- Translate 5 English paraphrase datasets (MNLI, SentenceCompression, etc., totaling ~600k pairs) into 22 languages using NLLB 3.3B.
- Leverage multilingual parallel paraphrase data to construct both paraphrase pairs and parallel pairs simultaneously.
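
One plausible reading of how translated paraphrase data yields several pair types at once is sketched below. The exact pairing scheme is an assumption made for illustration: given an English pair and its machine translation, the sketch derives monolingual paraphrase pairs, cross-lingual paraphrase pairs, and parallel pairs. The function name `build_pairs` and the example sentences are hypothetical.

```python
def build_pairs(en_pairs, translated_pairs):
    """Derive three pair types from aligned paraphrase data.

    en_pairs[i] = (a_en, b_en) is an English paraphrase pair.
    translated_pairs[i] = (a_xx, b_xx) holds the translations of a_en and b_en.
    """
    mono, cross, parallel = [], [], []
    for (a_en, b_en), (a_xx, b_xx) in zip(en_pairs, translated_pairs):
        mono.append((a_xx, b_xx))       # monolingual paraphrase (SE training)
        cross.append((a_en, b_xx))      # cross-lingual paraphrase (CLA, MNRL)
        cross.append((a_xx, b_en))
        parallel.append((a_en, a_xx))   # translation pair (CLA, cosine loss)
        parallel.append((b_en, b_xx))
    return mono, cross, parallel


english = [("A man is eating.", "Someone eats.")]
german = [("Ein Mann isst.", "Jemand isst.")]
mono, cross, parallel = build_pairs(english, german)
print(mono)
print(cross)
print(parallel)
```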

Key Experimental Results

Main Results

Results based on LaBSE (23 languages):

Monolingual Tasks:

| Model Config | STS ↑ | STR ↑ | Classification ↑ |
| --- | --- | --- | --- |
| LaBSE Original | 76.7 | 69.2 | 82.7 |
| Full_mc (Best Monolithic) | 80.0 | 75.4 | 86.0 |
| Mod_mc-jt (Ours) | 83.9 | 79.0 | 86.4 |

Cross-Lingual Tasks:

| Model Config | STS ↑ | Classification ↑ | FLORES Mining ↓ | Tatoeba Mining ↓ |
| --- | --- | --- | --- | --- |
| LaBSE Original | 74.5 | 83.6 | 0.14 | 3.87 |
| Full_c (Best Cross-Lingual Monolithic) | 77.8 | 85.3 | 0.20 | 4.00 |
| Mod_mc-jt (Ours) | 81.4 | 86.7 | 0.10 | 3.12 |

Alignment Metrics:
- Language Bias (lower is better): Mod_mc-jt achieves 0.49 on STSB (vs. 0.53 for Full_mc) but 0.65 on SICK (vs. 0.64), i.e. marginally worse on SICK.
- RSIM (higher is better): Mod_mc-jt reaches 0.79 (vs. 0.77 for Full_mc).

Key Findings

Substantial Improvement in Monolingual Performance: The modular scheme improves by 3.9 percentage points over the strongest monolithic model on STS (83.9 vs. 80.0).
Comprehensive Cross-Lingual Superiority: Achieves the best results among the compared configurations on cross-lingual STS, classification, and bitext mining simultaneously, resolving the trade-offs between tasks.
Low-Resource Languages Benefit Most: Language-specific modules effectively alleviate CoM.
MT Data is Effective: High-quality training is achievable using only machine-translated paraphrase data.
Alternating Paraphrase and Parallel Training for CLA is Optimal: Using only paraphrases (Mod_mc-pp) performs poorly on bitext mining, while using only parallel data (Mod_mc-pl) performs poorly on STS.
Equally Effective on mE5: The modular scheme brings consistent improvements on the mE5 base as well.
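
The size of the gains can be checked directly from the two LaBSE result tables. The snippet below copies the reported scores and computes the difference between the best monolithic baseline in each table (Full_mc for monolingual tasks, Full_c for cross-lingual tasks) and Mod_mc-jt; for the mining columns, where lower is better, the difference is expressed as a reduction.

```python
# Scores copied from the result tables: (best monolithic baseline, Mod_mc-jt).
HIGHER_IS_BETTER = {
    "mono STS": (80.0, 83.9),
    "mono STR": (75.4, 79.0),
    "mono classification": (86.0, 86.4),
    "cross STS": (77.8, 81.4),
    "cross classification": (85.3, 86.7),
}
LOWER_IS_BETTER = {
    "FLORES mining": (0.20, 0.10),
    "Tatoeba mining": (4.00, 3.12),
}


def improvements():
    """Return the improvement of Mod_mc-jt over the baseline per metric."""
    gains = {k: round(mod - base, 2) for k, (base, mod) in HIGHER_IS_BETTER.items()}
    gains.update({k: round(base - mod, 2) for k, (base, mod) in LOWER_IS_BETTER.items()})
    return gains


for metric, gain in improvements().items():
    print(f"{metric}: {gain:+.2f}")
```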

Highlights & Insights

Clearly decomposes multiple conflicts in MSE training: monolingual vs. cross-lingual, STS vs. bitext mining vs. classification.
Elegantly resolves conflicts through parameter isolation (modularity), where each module focuses on a single responsibility.
The bilingual alignment strategy with English as the pivot is simple and effective, avoiding the training overhead of \(N^2\) language pairs.
Demonstrates that machine-translated paraphrase data is sufficient for training the non-English modules (English still relies on gold-standard paraphrase data), substantially lowering data acquisition costs.
Combining FOCUS embedding initialization and LoRA makes language adaptation both parameter-efficient and sample-economical.
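
The saving from pivoting through English is easy to quantify. The sketch below counts unordered language pairs for all-pairs alignment versus pivot-based alignment; the \(N^2\) in the text is read here as an order-of-magnitude statement, and the function name is hypothetical.

```python
from math import comb


def alignment_pair_counts(n_languages):
    """Return (all-pairs count, pivot-based count) for n_languages."""
    all_pairs = comb(n_languages, 2)  # every unordered language pair
    pivot_pairs = n_languages - 1     # each non-pivot language vs. the pivot
    return all_pairs, pivot_pairs


print(alignment_pair_counts(23))
```

For the 23 languages used in the paper this gives 253 language pairs for all-pairs alignment against 22 with English as the pivot.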

Limitations & Future Work

The language adaptation step (tokenizer training + continued MLM training) requires compute for every language, so the cost becomes significant when scaling to hundreds of languages.
Inference requires knowing the input language in order to activate the corresponding modules (this can be handled with language identification, at the cost of added latency).
Validated only on LaBSE and mE5-base; larger models (e.g., mE5-large) have not been tested.
CLA relies solely on English as a pivot, introducing a strong dependency on the quality of English embeddings.
The batch ratio of alternating training (paraphrase vs. parallel) was not systematically tuned.
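
The inference-time dependency on the input language can be made concrete with a routing sketch. Everything here is illustrative: `detect_language` stands for any language identifier supplied by the caller (no specific library is assumed), and the module labels are hypothetical names rather than identifiers from the paper's code.

```python
def select_modules(lang, available, pivot="en"):
    """Return the module labels to activate for one input language."""
    if lang not in available:
        raise KeyError(f"no language-specific modules trained for {lang!r}")
    modules = [f"embeddings[{lang}]", f"lang_adapter[{lang}]", f"se_adapter[{lang}]"]
    if lang != pivot:
        # The pivot language has no CLA adapter.
        modules.append(f"cla_adapter[{lang}]")
    return modules


def route(text, detect_language, available):
    """Run language identification, then pick the matching modules."""
    lang = detect_language(text)
    return lang, select_modules(lang, available)


# Toy identifier used only for this example.
toy_detector = lambda text: "de" if "ist" in text.split() else "en"
print(route("Das ist ein Test", toy_detector, {"en", "de"}))
print(route("This is a test", toy_detector, {"en", "de"}))
```

The extra `detect_language` call is where the added latency mentioned above would come from.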

Related Work

Multilingual Sentence Encoding: LaBSE (Feng et al., 2022), mE5 (Wang et al., 2024), LASER3 (Heffernan et al., 2022)
Alleviating the Curse of Multilinguality: Language adapters (Pfeiffer et al., 2020, 2021), FOCUS (Dobler & de Melo, 2023)
Parameter-Efficient Fine-Tuning: LoRA (Hu et al., 2022), Parallel Adapter (He et al., 2022)
Contrastive Learning for Sentence Embeddings: SimCSE (Gao et al., 2021), mSimCSE (Wang et al., 2022)
Cross-Lingual Alignment: Reimers & Gurevych (2020), Artetxe & Schwenk (2019)

Rating

Novelty: ⭐⭐⭐⭐ — First to systematically utilize a modular approach to resolve both CoM and task-related trade-offs in MSE.
Technical Depth: ⭐⭐⭐⭐ — Rigorous three-step modular design, with each step backed by clear motivation and experimental validation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Highly comprehensive, spanning 4 tasks \(\times\) 23 languages \(\times\) 2 bases \(\times\) multiple variant comparisons.
Clarity: ⭐⭐⭐⭐ — Well-structured, supported by clear diagrams, and follows a systematic naming convention for model variants.
Impact: ⭐⭐⭐⭐ — Offers direct guidance to the multilingual NLP community with a generalizable scheme.