Improving Language and Modality Transfer in Translation by Character-level Modeling

Conference: ACL 2025
arXiv: 2505.24561
Code: None
Area: Other
Keywords: character-level, multilingual translation, speech translation, SONAR, cross-modal transfer

TL;DR

The paper proposes charSONAR, a character-level encoder for cross-lingual and cross-modal translation. The character-level text encoder is obtained from SONAR via teacher-student training and is then connected to MMS, a CTC ASR model covering 1000+ languages, through lightweight adapters. It achieves state-of-the-art results on text translation across 75 languages and on speech translation across 33 languages, with particularly strong performance in zero-resource and low-resource scenarios.

Background & Motivation

Background: Current translation models support 200–400 languages for text and 100 for speech, covering only 5% of the world's languages. Scaling to long-tail, low-resource languages is held back by data scarcity.

Limitations of Prior Work: (1) The cross-lingual transfer capability of subword tokenization is limited. (2) In speech translation, the mismatch in length/content between the CTC output (character-level) and the text encoder (subword-level) causes a modality gap. (3) Phonemization methods are ambiguous and cannot scale to 1000+ languages.

Key Challenge: How to maximize knowledge transfer between text and speech modalities, as well as between high-resource and low-resource languages?

Goal: Unify the representation input space of text and speech via character-level modeling to enhance cross-lingual and cross-modal transfer.

Key Insight: Building on SONAR (a multilingual fixed-dimensional embedding space) and MMS (a 1000+-language CTC ASR model), a character-level encoder aligns naturally with the character-level CTC output, eliminating the length mismatch between subwords and characters.

Core Idea: Character-level SONAR encoder + pretrained CTC-to-character adapter = data-efficient cross-lingual and cross-modal translation.

Method

Overall Architecture

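Training has two stages. (1) Teacher-Student: the subword SONAR encoder (teacher) supervises the character-level charSONAR encoder (student) with an MSE loss on the interpolation target. (2) Adapter: an adapter maps the MMS CTC output into charSONAR, again trained with an MSE loss. At inference, the SONAR decoder generates the translation from the resulting embedding.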

Key Designs

charSONAR Training:
- Retain only single-character tokens in the SONAR vocabulary (256K -> 8K).
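- Three types of MSE targets: reconstruction (\(\text{MSE}(\mathbf{c}^x, \mathbf{e}^x)\)), translation (\(\text{MSE}(\mathbf{c}^x, \mathbf{e}^y)\)), and interpolation (\(\text{MSE}(\mathbf{c}^x, \frac{\mathbf{e}^x + \mathbf{e}^y}{2})\)), where \(\mathbf{c}^x\) is the charSONAR (student) embedding of sentence \(x\), and \(\mathbf{e}^x\) and \(\mathbf{e}^y\) are the SONAR (teacher) embeddings of \(x\) and its translation \(y\).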
- The interpolation target performs best: the average SONAR embeddings of a language pair are more suitable for the cross-lingual space than monolingual embeddings.
- Incorporate ASR-style augmentations (uncasing, punctuation removal, and character noise) to enhance cross-modal robustness.
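
The three regression targets and the ASR-style text normalization above can be sketched in a few lines of plain Python. This is a minimal illustration with toy 4-dimensional vectors, not the authors' training code: the function names (`mse`, `interpolate`, `student_losses`, `asr_style`) are hypothetical, real SONAR embeddings are much larger, and the character-noise augmentation is left out.

```python
import unicodedata


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def interpolate(e_x, e_y):
    """Midpoint of the teacher embeddings of a sentence and its translation."""
    return [(x + y) / 2 for x, y in zip(e_x, e_y)]


def student_losses(c_x, e_x, e_y):
    """Loss of the student embedding c_x under each of the three targets."""
    return {
        "reconstruction": mse(c_x, e_x),
        "translation": mse(c_x, e_y),
        "interpolation": mse(c_x, interpolate(e_x, e_y)),
    }


def asr_style(text):
    """Lowercase and strip punctuation so text resembles an ASR transcript."""
    kept = (ch for ch in text.lower()
            if not unicodedata.category(ch).startswith("P"))
    return " ".join("".join(kept).split())


# Toy vectors (assumed values, for illustration only).
e_x = [1.0, 0.0, 2.0, -1.0]   # teacher embedding of the source sentence
e_y = [0.0, 2.0, 2.0, 1.0]    # teacher embedding of its translation
c_x = [0.5, 1.0, 2.0, 0.0]    # student embedding that sits at the midpoint

losses = student_losses(c_x, e_x, e_y)
normalized = asr_style("Hello, World!")
```

In this toy case the student sits exactly halfway between the two teacher embeddings, so the interpolation loss is zero while the reconstruction and translation losses are equal and non-zero.
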
Cross-modal Adapter:
- Pretrained Adapter: Reuse the CTC classification layer of MMS and the embedding layer of charSONAR to perform a soft prediction (softmax -> embedding lookup), in which the softmax over characters weights the rows of the charSONAR embedding table. This requires only ~200K parameters.
- Dual Adapter: A combination of the pretrained adapter and a randomly initialized adapter using gated weighting, totaling 2.5M parameters.
- CTC Compression: Collapse identical consecutive predictions and remove blank tokens to compress audio representations to character-level lengths.
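
As a rough sketch of how CTC compression and the soft prediction could fit together, the stdlib-only Python below collapses runs of frames with the same argmax label, drops blank runs, and then turns each surviving vector into a probability-weighted mix of embedding rows. Several details are assumptions made for illustration: averaging the frames within a run, using `blank_id=0`, operating on logits rather than hidden states, and the function names `ctc_compress` and `soft_embed`. The 3-symbol vocabulary and 2-dimensional embedding table are toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ctc_compress(frame_logits, blank_id=0):
    """Merge consecutive frames sharing an argmax label and drop blank runs.

    Returns one averaged logit vector per surviving run (the averaging
    is an assumption of this sketch).
    """
    runs = []  # list of (label, [frames in this run])
    for logits in frame_logits:
        label = max(range(len(logits)), key=logits.__getitem__)
        if runs and runs[-1][0] == label:
            runs[-1][1].append(logits)
        else:
            runs.append((label, [logits]))
    compressed = []
    for label, frames in runs:
        if label == blank_id:
            continue
        dim = len(frames[0])
        compressed.append(
            [sum(f[j] for f in frames) / len(frames) for j in range(dim)]
        )
    return compressed


def soft_embed(logits, embedding_table):
    """Softmax -> embedding lookup: probability-weighted sum of rows."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[j] for p, row in zip(probs, embedding_table))
            for j in range(dim)]


def peaked(label, size=3):
    """Toy logit vector that strongly prefers one label."""
    return [5.0 if i == label else 0.0 for i in range(size)]


# Vocabulary: 0 = blank, 1 = "a", 2 = "b"; frame labels: a a _ b b b _ _ a
frames = [peaked(k) for k in [1, 1, 0, 2, 2, 2, 0, 0, 1]]
table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy character embeddings

compressed = ctc_compress(frames)
char_inputs = [soft_embed(v, table) for v in compressed]
```

Nine audio frames shrink to three character-length positions ("a", "b", "a"), and each position is a soft mixture of character embeddings rather than a hard lookup, so uncertainty in the CTC prediction is preserved.
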
Zero-Resource Speech Translation:
- Freeze both charSONAR and MMS, training only the adapter.
- Training requires only ASR data (audio-transcript pairs) and does not need parallel speech translation data.
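
One way to write down the objective this setup implies is given below. It is a sketch under an assumption the summary does not state explicitly, namely that the regression target is the frozen charSONAR embedding \(\mathbf{c}^x\) of the transcript \(x\). Here \(a\) is the audio, \(A_\theta\) is the adapter with trainable parameters \(\theta\), and \(\mathbf{s}^a\) is notation introduced here for the speech-side embedding; MMS and charSONAR carry no trainable parameters.

\[
\mathbf{s}^a = \text{charSONAR}\big(A_\theta(\text{MMS}(a))\big), \qquad
\hat{\theta} = \arg\min_{\theta} \; \mathbb{E}_{(a, x)} \, \text{MSE}\big(\mathbf{s}^a, \mathbf{c}^x\big)
\]

Because the expectation runs over audio-transcript pairs \((a, x)\) only, no translated text appears anywhere in the objective, which is why ASR data alone suffices.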

Key Experimental Results

Main Results

| Method | Text Translation (FLORES+, 75 languages) | Speech Translation (FLEURS, 33 languages) |
| --- | --- | --- |
| SONAR (subword) | xCOMET: 0.925 | - |
| charSONAR | xCOMET: 0.934 | - |
| SEAMLESS (supervised) | - | SOTA baseline |
| Whisper cascade | - | Strong baseline |
| charSONAR + MMS | Better than SONAR | New SOTA |

Ablation Study

| Configuration | xCOMET | xSIM++ | Description |
| --- | --- | --- | --- |
| Reconstruction target | 0.929 | 7.4 | Base |
| Translation target | 0.924 | 6.6 | Good retrieval but slightly worse translation |
| Interpolation target | 0.931 | 6.6 | Balanced |
| + Pretraining initialization | 0.934 | 6.4 | Faster convergence |
| Gain on low-resource languages | Highest | - | Character-level shows clear advantages in low-resource scenarios |
| Generalization to zero-resource languages | Better than subword | Better than subword | Character sharing enhances transfer |

Key Findings

Character-level outperforms subword-level: It performs better overall across 75 languages, with particularly prominent advantages in low-resource and zero-resource scenarios.
Interpolated embedding space is superior: The average of the two SONAR embeddings of a language pair serves as a better cross-lingual anchor than either monolingual embedding. Low-resource languages benefit the most, because their embedding quality improves after averaging with high-resource languages.
An extremely lightweight adapter achieves SOTA speech translation: An adapter with only 2.5M parameters outperforms the fully supervised SEAMLESS system.

Highlights & Insights

Deep insight into unifying the two modalities via character-level modeling: Since the CTC output is naturally character-level, using character inputs for the text encoder eliminates the modality gap. This is more elegant and simpler than prior phoneme-sharing or subword-compression schemes.
Unexpected advantage of the fixed-dimensional embedding bottleneck: SONAR's mean pooling yields fixed-dimensional embeddings, preventing the increased sequence length of character-level inputs from impacting decoder computation.
"Pretrained initialization" design of the adapter: Initializing with the existing MMS CTC layer and charSONAR embedding layer maximizes the utilization of pretrained knowledge.
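
The fixed-dimensional bottleneck noted above can be illustrated with a small mean-pooling sketch. The function name `mean_pool` and the 2-dimensional toy states are assumptions for illustration; the point is only that the pooled vector has the same size whether the encoder saw a short subword sequence or a longer character sequence.

```python
def mean_pool(token_states):
    """Average a variable-length list of d-dim vectors into one d-dim vector."""
    if not token_states:
        raise ValueError("need at least one token state")
    dim = len(token_states[0])
    n = len(token_states)
    return [sum(state[j] for state in token_states) / n for j in range(dim)]


# Toy encoder outputs for the same sentence under two tokenizations.
subword_states = [[1.0, 2.0], [3.0, 4.0]]                  # 2 positions
char_states = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0],
               [3.0, 4.0], [2.0, 3.0]]                     # 5 positions

subword_embedding = mean_pool(subword_states)
char_embedding = mean_pool(char_states)
```

Both calls return a vector of length 2, so a decoder that consumes this embedding does the same amount of work regardless of how many input positions the encoder processed.
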

Limitations & Future Work

Character-level modeling at the encoder increases the sequence length (1.5–3x), leading to higher encoding costs.
Only X -> Eng speech translation directions were tested; other directions remain unverified.
Character-level tokenization may behave differently for logographic languages like Chinese or Japanese, where each character carries denser semantic meaning.

vs ZeroSwot: ZeroSwot aligns speech and text representations using Wasserstein distance; charSONAR achieves superior performance using simple MSE loss combined with character-level unification.
vs ByT5 (byte-level LM): ByT5 demonstrates the robustness benefits of token-free, byte-level modeling against noise; charSONAR brings similarly fine-grained, character-level modeling to translation and speech.

Rating

Novelty: ⭐⭐⭐⭐ The scheme of unifying text and speech translation at the character level is simple and elegant.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 75-language text + 33-language speech along with extensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear methodology and comprehensive experiments.
Value: ⭐⭐⭐⭐⭐ High practical impact, with the potential to support speech translation for 1000+ languages.