🌐 Multilingual & Translation
🧪 ICML2026 · 2 paper notes
📌 Same area in other venues: 💬 ACL2026 (32) · 📷 CVPR2026 (2) · 🔬 ICLR2026 (5) · 🤖 AAAI2026 (11) · 🧠 NeurIPS2025 (14) · 📹 ICCV2025 (1)
- ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World
ML-Embed extends the Matryoshka concept from a single axis (representation dimension) to three: embedding parameters (MEL), model depth (MLL), and representation dimension (MRL), enabling full-stack nested training. It builds a multilingual training set of 50 million samples covering 282 natural languages and 40 programming languages, and releases a family of open-source models ranging from 140M to 8B parameters. It ranks first on 9 of 17 MTEB benchmarks, with notable gains in Polish (+22.89) and Vietnamese (+6.88).
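To make the representation-dimension axis (MRL) concrete, here is a minimal sketch of how a nested embedding is typically consumed: keep a prefix of the vector, renormalize it, and compare. This is not ML-Embed's code and does not cover the MEL or MLL axes; the vectors, the helper names `truncate_and_normalize` and `cosine`, and the chosen prefix sizes are all illustrative assumptions.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm and cannot be normalized")
    return [x / norm for x in prefix]


def cosine(a, b):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional embeddings (made-up numbers, not model outputs).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.4, -0.1, 0.2]
doc = [0.8, 0.2, 0.25, -0.1, -0.3, 0.1, 0.2, 0.0]

# Score the same pair at several nested prefix sizes (hypothetical choices).
scores = {
    d: cosine(truncate_and_normalize(query, d), truncate_and_normalize(doc, d))
    for d in (2, 4, 8)
}

for d, s in scores.items():
    print(f"dim={d}: cosine={s:.4f}")
```

A Matryoshka-trained model is meant to keep the shorter prefixes useful on their own, so a retrieval system can trade accuracy for storage by picking a smaller `dim` without re-encoding.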
- Optimizing Language Models for Crosslingual Knowledge Consistency
This paper addresses the problem of multilingual LLMs giving conflicting answers to the same question asked in different languages. It proposes an RL objective that uses the log-likelihood of the answer in another language as the reward, and proves that the optimal policy takes a product-of-experts form and that it guarantees crosslingual preference consistency when \(\gamma_1\gamma_2=\beta^2\). Based on this, it derives the Direct Consistency Optimization (DCO) algorithm, which requires neither a reward model nor online sampling. DCO improves both crosslingual consistency (RankC) and answer accuracy across 9 LLMs, 3 multilingual QA benchmarks, and 26 languages.
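As a rough illustration of what a product-of-experts form means, the sketch below combines two answer distributions over the same candidate set by a weighted sum of log-probabilities followed by renormalization. It is a generic product of experts, not the paper's derivation: this note does not state which distributions act as experts or how their exponents depend on \(\gamma_1\), \(\gamma_2\), and \(\beta\), so the weights `w_a` and `w_b`, the function name, and the toy probabilities are all assumptions.

```python
import math


def product_of_experts(logp_a, logp_b, w_a=1.0, w_b=1.0):
    """Combine two log-distributions as p(y) proportional to p_a(y)^w_a * p_b(y)^w_b."""
    if set(logp_a) != set(logp_b):
        raise ValueError("both experts must score the same candidate answers")
    # Weighted sum in log space equals a weighted product in probability space.
    scores = {y: w_a * logp_a[y] + w_b * logp_b[y] for y in logp_a}
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {y: math.exp(s - log_z) for y, s in scores.items()}


# Toy answer distributions for one question asked in two languages (made up).
english = {"Paris": math.log(0.6), "Lyon": math.log(0.4)}
french = {"Paris": math.log(0.7), "Lyon": math.log(0.3)}

combined = product_of_experts(english, french)
for answer, prob in combined.items():
    print(f"{answer}: {prob:.4f}")
```

With equal weights, the combined distribution puts about 0.78 on "Paris", more than either input does, because an answer must be plausible under both experts to keep its mass.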