Improving Language and Modality Transfer in Translation by Character-level Modeling

Conference: ACL 2025
arXiv: 2505.24561
Code: None
Area: Other
Keywords: character-level, multilingual translation, speech translation, SONAR, cross-modal transfer

TL;DR

The paper proposes charSONAR, a character-level encoder for cross-lingual and cross-modal translation. The character-level text encoder is obtained via teacher-student training and is then connected to MMS, a CTC ASR model covering 1000+ languages, through lightweight adapters. The resulting system achieves SOTA on text translation in 75 languages and speech translation in 33 languages, with the largest gains in zero-resource and low-resource scenarios.

Background & Motivation

Background: Translation models currently support 200–400 text languages and 100 speech languages, covering only 5% of all languages globally. Scaling to long-tail, low-resource languages faces data scarcity challenges.

Limitations of Prior Work: (1) The cross-lingual transfer capability of subword tokenization is limited. (2) In speech translation, the mismatch in length/content between the CTC output (character-level) and the text encoder (subword-level) causes a modality gap. (3) Phonemization methods are ambiguous and cannot scale to 1000+ languages.

Key Challenge: How to maximize knowledge transfer between text and speech modalities, as well as between high-resource and low-resource languages?

Goal: Unify the representation input space of text and speech via character-level modeling to enhance cross-lingual and cross-modal transfer.

Key Insight: Build on SONAR (a multilingual fixed-dimensional embedding space) and MMS (a 1000+-language CTC ASR model). Because CTC output is character-level, a character-level text encoder aligns with it naturally, eliminating the length mismatch between subwords and characters.

Core Idea: Character-level SONAR encoder + pretrained CTC-to-character adapter = data-efficient cross-lingual and cross-modal translation.

Method

Overall Architecture

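Training has two stages. (1) Teacher-Student: the subword-level SONAR encoder (teacher) supervises charSONAR (character-level student) with an MSE loss on interpolated embeddings. (2) Adapter: an adapter connects MMS-CTC to charSONAR and is also trained with an MSE loss. During inference, the SONAR decoder generates the translation from the resulting embedding.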

Key Designs

  1. charSONAR Training:

    • Retain only single-character tokens in the SONAR vocabulary (256K -> 8K).
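    • Three candidate MSE targets, where \(\mathbf{c}^x\) is the charSONAR (student) embedding of source sentence \(x\) and \(\mathbf{e}^x\), \(\mathbf{e}^y\) are the SONAR (teacher) embeddings of \(x\) and its translation \(y\): reconstruction (\(\text{MSE}(\mathbf{c}^x, \mathbf{e}^x)\)), translation (\(\text{MSE}(\mathbf{c}^x, \mathbf{e}^y)\)), and interpolation (\(\text{MSE}(\mathbf{c}^x, \frac{\mathbf{e}^x + \mathbf{e}^y}{2})\)).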
    • The interpolation target performs best: the average SONAR embeddings of a language pair are more suitable for the cross-lingual space than monolingual embeddings.
    • Incorporate ASR-style augmentations (uncasing, punctuation removal, and character noise) to enhance cross-modal robustness.
  2. Cross-modal Adapter:

    • Pretrained Adapter: Leverage the CTC classification layer of MMS and the embedding layer of charSONAR to perform soft prediction (softmax -> embedding lookup), requiring only ~200K parameters.
    • Dual Adapter: A combination of the pretrained adapter and a randomly initialized adapter using gated weighting, totaling 2.5M parameters.
    • CTC Compression: Collapse identical consecutive predictions and remove blank tokens to compress audio representations to character-level lengths.
  3. Zero-Resource Speech Translation:

    • Freeze both charSONAR and MMS, training only the adapter.
    • Training requires only ASR data (audio-transcript pairs) and does not need parallel speech translation data.
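
As a rough illustration of the Cross-modal Adapter items above, the following NumPy sketch collapses frame-level CTC predictions and then performs the soft prediction (softmax followed by a probability-weighted embedding lookup). The function names, the blank index, and the choice to average logits inside each collapsed run are assumptions made for illustration; the actual system works on MMS and charSONAR weights and may compress the audio representations differently.

```python
import numpy as np

def ctc_compress(frame_logits, blank_id=0):
    """Collapse runs of identical argmax predictions and drop blank runs.

    frame_logits: (num_frames, vocab) array of CTC logits.
    Returns (num_chars, vocab): one averaged logit vector per kept run.
    Averaging inside a run is an assumption of this sketch.
    """
    preds = frame_logits.argmax(axis=-1)
    segments = []
    start = 0
    for t in range(1, len(preds) + 1):
        # A run ends at the last frame or when the prediction changes.
        if t == len(preds) or preds[t] != preds[start]:
            if preds[start] != blank_id:
                segments.append(frame_logits[start:t].mean(axis=0))
            start = t
    if not segments:
        return np.zeros((0, frame_logits.shape[1]))
    return np.stack(segments)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_embed(char_logits, char_embeddings):
    """Soft prediction: expected character embedding under the CTC distribution.

    char_logits: (num_chars, vocab); char_embeddings: (vocab, dim).
    """
    return softmax(char_logits) @ char_embeddings
```

With eight frames predicting `blank a a blank b b b a`, `ctc_compress` keeps three positions (`a`, `b`, `a`), so the sequence handed to the text encoder has character-level length rather than frame-level length.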

Key Experimental Results

Main Results

| Method | Text Translation (FLORES+, 75 languages) | Speech Translation (FLEURS, 33 languages) |
|---|---|---|
| SONAR (subword) | xCOMET: 0.925 | - |
| charSONAR | xCOMET: 0.934 | - |
| SEAMLESS (supervised) | - | SOTA baseline |
| Whisper cascade | - | Strong baseline |
| charSONAR + MMS | Better than SONAR | New SOTA |

Ablation Study

| Configuration | xCOMET | xSIM++ (lower is better) | Description |
|---|---|---|---|
| Reconstruction target | 0.929 | 7.4 | Base |
| Translation target | 0.924 | 6.6 | Good retrieval but slightly worse translation |
| Interpolation target | 0.931 | 6.6 | Balanced |
| + Pretraining initialization | 0.934 | 6.4 | Faster convergence |
| Gain on low-resource languages | Highest | - | Character-level shows clear advantages in low-resource scenarios |
| Generalization to zero-resource languages | Better than subword | Better than subword | Character sharing enhances transfer |
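
The three training targets compared in the ablation differ only in which teacher vector the student embedding is regressed onto. A minimal NumPy sketch, assuming the embeddings are plain fixed-size vectors (the function and argument names are hypothetical, not taken from the paper's code):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two embedding vectors."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def distillation_loss(c_x, e_x, e_y, target="interpolation"):
    """Student-teacher loss for one sentence pair.

    c_x: student (character-level) embedding of the source sentence.
    e_x: teacher embedding of the source sentence.
    e_y: teacher embedding of the translation.
    """
    if target == "reconstruction":
        goal = e_x
    elif target == "translation":
        goal = e_y
    elif target == "interpolation":
        goal = (e_x + e_y) / 2.0  # midpoint of the language pair
    else:
        raise ValueError(f"unknown target: {target}")
    return mse(c_x, goal)
```

When the student embedding sits exactly at the midpoint of the two teacher embeddings, the interpolation loss is zero while the reconstruction and translation losses are equal and non-zero, which is one way to see why the interpolation target pulls the student toward a language-neutral point.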

Key Findings

  • Character-level outperforms subword-level: It performs better overall across 75 languages, with particularly prominent advantages in low-resource and zero-resource scenarios.
  • Interpolated embedding space is superior: The "average" SONAR embedding of a language pair serves as a better cross-lingual anchor than monolingual embeddings, benefiting low-resource languages the most (whose quality improves after averaging with high-resource languages).
  • An extremely lightweight adapter achieves SOTA speech translation: An adapter with only 2.5M parameters outperforms the fully supervised SEAMLESS system.

Highlights & Insights

  • Deep insight into unifying the two modalities via character-level modeling: Since the CTC output is naturally character-level, using character inputs for the text encoder eliminates the modality gap. This is more elegant and simpler than prior phoneme-sharing or subword-compression schemes.
  • Unexpected advantage of the fixed-dimensional embedding bottleneck: SONAR's mean pooling yields fixed-dimensional embeddings, preventing the increased sequence length of character-level inputs from impacting decoder computation.
  • "Pretrained initialization" design of the adapter: Initializing with the existing MMS CTC layer and charSONAR embedding layer maximizes the utilization of pretrained knowledge.
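
The fixed-dimensional bottleneck can be illustrated with a small masked mean-pooling sketch in NumPy. It only shows that the pooled embedding has the same size whether the encoder saw a short subword sequence or a longer character sequence; the shapes and the `mean_pool` helper are assumptions for illustration, not the SONAR implementation.

```python
import numpy as np

def mean_pool(states, lengths):
    """Average encoder states over the valid (non-padded) positions.

    states: (batch, max_len, dim) encoder outputs, padded to max_len.
    lengths: (batch,) number of valid positions per sequence.
    Returns (batch, dim): one fixed-size embedding per sequence.
    """
    lengths = np.asarray(lengths)
    # mask[b, t] is True for positions t < lengths[b].
    mask = np.arange(states.shape[1])[None, :] < lengths[:, None]
    summed = (states * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# A 5-token "subword" sequence and a 15-token "character" sequence,
# padded into one batch, both pool to a vector of size dim.
rng = np.random.default_rng(0)
batch = rng.normal(size=(2, 15, 8))
pooled = mean_pool(batch, lengths=[5, 15])
print(pooled.shape)
```

Since the decoder only ever receives the pooled vector, the longer character-level input affects encoder cost but not decoder cost.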

Limitations & Future Work

  • Character-level modeling at the encoder increases the sequence length (1.5–3x), leading to higher encoding costs.
  • Only X -> Eng speech translation directions were tested; other directions remain unverified.
  • Character-level tokenization may behave differently for logographic languages like Chinese or Japanese, where each character carries denser semantic meaning.

Related Work

  • vs ZeroSwot: ZeroSwot aligns speech and text representations using Wasserstein distance; charSONAR achieves superior performance using a simple MSE loss combined with character-level unification.
  • vs ByT5 (byte-level LM): ByT5 demonstrates that token-free modeling is more robust to noise; charSONAR extends this kind of fine-grained modeling to translation and speech.

Rating

  • Novelty: ⭐⭐⭐⭐ The scheme of unifying text and speech translation at the character level is simple and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 75-language text + 33-language speech along with extensive ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear methodology and comprehensive experiments.
  • Value: ⭐⭐⭐⭐⭐ High practical impact, with the potential to support speech translation for 1000+ languages.