Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment

Conference: ACL 2025
arXiv: 2407.14878
Area: Multilingual Translation
Keywords: multilingual sentence encoders, modular training, curse of multilinguality, cross-lingual alignment, adapters

TL;DR

This paper proposes a modular training scheme for multilingual sentence encoders: it first trains language-specific modules (embedding + language adapter + sentence encoder adapter) to alleviate the curse of multilinguality, and then trains cross-lingual alignment adapters using both parallel and paraphrase data to resolve performance trade-offs among different cross-lingual tasks. This approach consistently outperforms monolithic model training across 4 tasks and 23 languages.

Background & Motivation

  • Multilingual Sentence Encoders (MSEs) map sentences from different languages into a shared semantic space and are widely used in cross-lingual retrieval, clustering, and classification.
  • Two Core Problems:
    • Curse of Multilinguality (CoM): Parameter sharing leads to a degradation in the quality of monolingual representations for each language, which is particularly severe for low-resource languages.
    • Performance Trade-offs Across Tasks:
      • Cross-lingual alignment training destroys monolingual semantic structures \(\rightarrow\) Monolingual vs. Cross-lingual performance conflict.
      • Training with parallel data suits bitext mining but is ill-suited for semantic similarity \(\rightarrow\) Conflict between different cross-lingual tasks.
      • Training with paraphrase data suits semantic similarity but is ill-suited for bitext mining.
  • Existing modular approaches (e.g., LASER3) only solve a subset of these problems, and their teacher models themselves are already affected by CoM.

Method

Overall Architecture

Three-step modular training (only the parameters of the corresponding modules are updated in each step):

  1. Language Adaptation (LA): Train a language-specific embedding layer + LoRA language adapter for each language.
  2. Sentence Encoding (SE) Training: Stack a LoRA SE adapter on top of the language adapter and train it using monolingual paraphrase data.
  3. Cross-Lingual Alignment (CLA): Train a parallel adapter by alternating between cross-lingual paraphrase data and parallel data.
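
The freezing pattern behind the three steps can be sketched as a small lookup table. This is a minimal illustration, not the authors' code: the module names (`backbone`, `embeddings`, `la_adapter`, `se_adapter`, `cla_adapter`) are hypothetical, and keeping the pretrained backbone frozen in every step is an assumption based on the use of LoRA and adapters.

```python
# Hypothetical module names for one non-English language (not from the paper's code).
MODULES = ["backbone", "embeddings", "la_adapter", "se_adapter", "cla_adapter"]

# Which modules receive gradient updates in each training step.
# Assumption: the pretrained backbone stays frozen in all three steps.
TRAINABLE = {
    "LA": {"embeddings", "la_adapter"},  # step 1: language adaptation
    "SE": {"se_adapter"},                # step 2: sentence encoding
    "CLA": {"cla_adapter"},              # step 3: cross-lingual alignment
}


def trainable_flags(step):
    """Return {module_name: bool} marking which modules are updated in `step`."""
    if step not in TRAINABLE:
        raise ValueError(f"unknown step: {step!r}")
    return {name: name in TRAINABLE[step] for name in MODULES}


if __name__ == "__main__":
    for step in ("LA", "SE", "CLA"):
        updated = [m for m, on in trainable_flags(step).items() if on]
        print(step, "->", updated)
```

In a real training loop, flags like these would typically be used to switch gradient computation on or off per module before each step.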

Key Designs

Language Adaptation:

  • Train a dedicated tokenizer for each language.
  • Initialize the new embeddings with the FOCUS method (copy the embeddings of tokens that already exist, and interpolate from similar tokens for new ones).
  • Continue MLM training, using LoRA for the transformer weights to stay parameter-efficient.
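
The "copy, otherwise interpolate" idea behind the embedding initialization can be sketched as follows. This is a simplified illustration, not the FOCUS implementation: the auxiliary embedding table `aux_emb`, the softmax weighting, and the `temperature` value are all assumptions made here for clarity, and the actual method's similarity and weighting scheme may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_embeddings(new_vocab, old_emb, aux_emb, temperature=0.1):
    """Initialize embeddings for `new_vocab`.

    old_emb: token -> vector from the pretrained model.
    aux_emb: token -> vector in an auxiliary space covering the new vocabulary
             (hypothetical; used only to measure token similarity).
    """
    overlap = [tok for tok in new_vocab if tok in old_emb]
    if not overlap:
        raise ValueError("no overlapping tokens to interpolate from")
    dim = len(old_emb[overlap[0]])
    new_emb = {}
    for tok in new_vocab:
        if tok in old_emb:
            # Token already exists: copy its pretrained embedding.
            new_emb[tok] = list(old_emb[tok])
            continue
        # New token: weighted average of overlapping tokens' embeddings,
        # weighted by similarity in the auxiliary space (softmax is a stand-in).
        sims = [cosine(aux_emb[tok], aux_emb[o]) for o in overlap]
        weights = softmax([s / temperature for s in sims])
        new_emb[tok] = [
            sum(w * old_emb[o][d] for w, o in zip(weights, overlap))
            for d in range(dim)
        ]
    return new_emb
```

Because each new embedding is a convex combination of existing ones, it starts inside the region of the embedding space the pretrained model already uses, which is the intuition for why such initialization is sample-efficient.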

Sentence Encoding Retraining:

  • The MLM objective degrades the sentence encoding capability of the pretrained MSE, which necessitates SE retraining.
  • Train with MNRL (Multiple Negatives Ranking Loss) on monolingual paraphrase data obtained via machine translation.
  • Freeze the LA module and only update the SE adapter.
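
For readers unfamiliar with MNRL, the loss can be sketched in a few lines: each anchor's paraphrase is the positive, and the other paraphrases in the batch act as negatives. This is a generic illustration rather than the paper's training code; the `scale` value of 20 is an assumed, commonly used default, not a setting reported in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnrl(anchors, positives, scale=20.0):
    """Multiple Negatives Ranking Loss over one batch.

    anchors[i] and positives[i] are embeddings of a paraphrase pair;
    positives[j] for j != i serve as in-batch negatives for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [scale * cosine(anchors[i], positives[j]) for j in range(n)]
        # Cross-entropy with the matching index i as the target class
        # (log-sum-exp computed stably by subtracting the max logit).
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n
```

The loss is near zero when each anchor is closest to its own paraphrase and grows when another sentence in the batch is closer, which is why larger batches usually provide a harder, more informative training signal.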

Cross-Lingual Alignment:

  • Perform bilingual alignment with English as the pivot language (English embeddings have the highest quality, trained on gold-standard paraphrase data).
  • Alternating Training: Use cross-lingual paraphrase pairs + MNRL loss in one batch, and parallel pairs + cosine similarity loss in the next batch.
  • Use Parallel Adapters to prevent alignment training from interfering with monolingual SE capabilities.
  • Do not train a CLA adapter for English: force other languages to project into the English space.
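
The alternating schedule can be sketched as a generator that switches objective on every step. This is an illustrative sketch under stated assumptions: the objective tags `"mnrl"` and `"cosine"` and the function names are hypothetical, and the cosine loss is written as the mean of `1 - cos`, which is one plausible form (the summary does not specify the exact formula).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cosine_alignment_loss(src_vecs, tgt_vecs):
    """Mean (1 - cos) over parallel pairs; 0 when every pair points the same way.

    Assumed form of the cosine similarity loss, for illustration only.
    """
    return sum(1.0 - cosine(s, t) for s, t in zip(src_vecs, tgt_vecs)) / len(src_vecs)


def alternating_batches(paraphrase_batches, parallel_batches):
    """Yield (objective, batch), alternating paraphrase and parallel batches 1:1.

    Stops when the shorter of the two batch streams is exhausted.
    """
    for para, par in zip(paraphrase_batches, parallel_batches):
        yield ("mnrl", para)    # cross-lingual paraphrase pairs
        yield ("cosine", par)   # parallel (translation) pairs


if __name__ == "__main__":
    for objective, batch in alternating_batches(["para-1", "para-2"], ["par-1", "par-2"]):
        print(objective, batch)
```

Since no CLA adapter is trained for English, the English side of each pair would act as a fixed target during these updates, so only the non-English adapter moves toward the English space.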

Training Data:

  • Translate 5 English paraphrase datasets (MNLI, SentenceCompression, etc., totaling ~600k pairs) into 22 languages using NLLB 3.3B.
  • Leverage multilingual parallel paraphrase data to construct both paraphrase pairs and parallel pairs simultaneously.

Key Experimental Results

Main Results

Results based on LaBSE (23 languages):

Monolingual Tasks:

| Model Config | STS ↑ | STR ↑ | Classification ↑ |
|---|---|---|---|
| LaBSE Original | 76.7 | 69.2 | 82.7 |
| Full_mc (Best Monolithic) | 80.0 | 75.4 | 86.0 |
| Mod_mc-jt (Ours) | 83.9 | 79.0 | 86.4 |

Cross-Lingual Tasks:

| Model Config | STS ↑ | Classification ↑ | FLORES Mining ↓ | Tatoeba Mining ↓ |
|---|---|---|---|---|
| LaBSE Original | 74.5 | 83.6 | 0.14 | 3.87 |
| Full_c (Best Cross-Lingual Monolithic) | 77.8 | 85.3 | 0.20 | 4.00 |
| Mod_mc-jt (Ours) | 81.4 | 86.7 | 0.10 | 3.12 |

Alignment Metrics:

  • Language Bias (lower is better): Mod_mc-jt achieves 0.49 on STSB (vs. 0.53 for Full_mc) and 0.65 on SICK (vs. 0.64), i.e., better on STSB and marginally worse on SICK.
  • RSIM (higher is better): Mod_mc-jt reaches 0.79 (vs. 0.77 for Full_mc).

Key Findings

  • Substantial Improvement in Monolingual Performance: The modular scheme improves on the strongest monolithic model by 3.9 points on STS (83.9 vs. 80.0).
  • Comprehensive Cross-Lingual Superiority: Achieves the best results among the compared models simultaneously on STS, classification, and bitext mining, resolving the trade-offs between tasks.
  • Low-Resource Languages Benefit Most: Language-specific modules effectively alleviate CoM.
  • MT Data is Effective: High-quality training is achievable using only machine-translated paraphrase data.
  • Alternating Paraphrase and Parallel Training for CLA is Optimal: Using only paraphrases (Mod_mc-pp) performs poorly on bitext mining, while using only parallel data (Mod_mc-pl) performs poorly on STS.
  • Equally Effective on mE5: The modular scheme brings consistent improvements on the mE5 base as well.
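
The size of the gains can be recomputed directly from the LaBSE-based tables above. The snippet below is only a small arithmetic check; the scores are copied from this summary, and the helper name `improvement` is introduced here for illustration.

```python
# Scores copied from the LaBSE-based tables in this summary.
MONO = {
    "Full_mc": {"STS": 80.0, "STR": 75.4, "Classification": 86.0},
    "Mod_mc-jt": {"STS": 83.9, "STR": 79.0, "Classification": 86.4},
}
CROSS = {
    "Full_c": {"STS": 77.8, "Classification": 85.3, "FLORES": 0.20, "Tatoeba": 4.00},
    "Mod_mc-jt": {"STS": 81.4, "Classification": 86.7, "FLORES": 0.10, "Tatoeba": 3.12},
}
# Bitext mining columns are marked with a down arrow (lower is better).
LOWER_IS_BETTER = {"FLORES", "Tatoeba"}


def improvement(table, baseline, model="Mod_mc-jt"):
    """Per-metric improvement of `model` over `baseline`; positive means better."""
    out = {}
    for metric, base in table[baseline].items():
        diff = table[model][metric] - base
        if metric in LOWER_IS_BETTER:
            diff = -diff  # a drop in a lower-is-better metric counts as an improvement
        out[metric] = round(diff, 2)
    return out


if __name__ == "__main__":
    print("monolingual:", improvement(MONO, "Full_mc"))
    print("cross-lingual:", improvement(CROSS, "Full_c"))
```

Every metric comes out positive under this convention, which is the sense in which the modular model avoids trading one task off against another.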

Highlights & Insights

  • Clearly decomposes multiple conflicts in MSE training: monolingual vs. cross-lingual, STS vs. bitext mining vs. classification.
  • Elegantly resolves conflicts through parameter isolation (modularity), where each module focuses on a single responsibility.
  • The bilingual alignment strategy with English as the pivot is simple and effective, avoiding the training overhead of \(N^2\) language pairs.
  • Demonstrates that machine-translated paraphrase data can stand in for human-annotated paraphrase data in the non-English languages, substantially lowering data acquisition costs.
  • Combining FOCUS embedding initialization and LoRA makes language adaptation both parameter-efficient and sample-economical.
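
The saving from English-pivot alignment is easy to quantify. The short sketch below counts unordered language pairs, which grow as \(N(N-1)/2\), against the \(N-1\) pairs needed when every language is aligned only to English; the function name `alignment_pairs` is hypothetical.

```python
from math import comb


def alignment_pairs(num_languages):
    """Return (all unordered language pairs, pairs needed with a single pivot)."""
    all_pairs = comb(num_languages, 2)  # N * (N - 1) / 2, i.e. quadratic in N
    pivot_pairs = num_languages - 1     # each non-pivot language paired with the pivot
    return all_pairs, pivot_pairs


if __name__ == "__main__":
    print(alignment_pairs(23))  # (253, 22) for the 23 languages covered here
```

For the 23 languages in this study, that is 22 alignment runs instead of 253.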

Limitations & Future Work

  • The language adaptation step (tokenizer training + continued MLM training) requires computational resources for each language, which becomes costly when scaling to hundreds of languages.
  • Inference requires knowing the input language in order to activate the corresponding module (this can be handled with language identification, at the cost of added latency).
  • Validated only on LaBSE and mE5-base; larger models (e.g., mE5-large) have not been tested.
  • CLA relies solely on English as a pivot, introducing a strong dependency on the quality of English embeddings.
  • The batch ratio of alternating training (paraphrase vs. parallel) was not systematically tuned.

Related Work

  • Multilingual Sentence Encoding: LaBSE (Feng et al., 2022), mE5 (Wang et al., 2024), LASER3 (Heffernan et al., 2022)
  • Alleviating the Curse of Multilinguality: Language adapters (Pfeiffer et al., 2020, 2021), FOCUS (Dobler & de Melo, 2023)
  • Parameter-Efficient Fine-Tuning: LoRA (Hu et al., 2022), Parallel Adapter (He et al., 2022)
  • Contrastive Learning for Sentence Embeddings: SimCSE (Gao et al., 2021), mSimCSE (Wang et al., 2022)
  • Cross-Lingual Alignment: Reimers & Gurevych (2020), Artetxe & Schwenk (2019)

Rating

  • Novelty: ⭐⭐⭐⭐ — First to systematically utilize a modular approach to resolve both CoM and task-related trade-offs in MSE.
  • Technical Depth: ⭐⭐⭐⭐ — Rigorous three-step modular design, with each step backed by clear motivation and experimental validation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Highly comprehensive, spanning 4 tasks \(\times\) 23 languages \(\times\) 2 bases \(\times\) multiple variant comparisons.
  • Clarity: ⭐⭐⭐⭐ — Well-structured, supported by clear diagrams, and follows a systematic naming convention for model variants.
  • Impact: ⭐⭐⭐⭐ — Offers direct guidance to the multilingual NLP community with a generalizable scheme.