🌐 Multilingual & Translation

🧪 ICML2026 · 2 paper notes

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

ML-Embed extends the Matryoshka concept from a single axis, the representation dimension (MRL), to three axes: embedding parameters (MEL), model depth (MLL), and representation dimension (MRL). This enables full-stack nested training. The authors construct a multilingual training set of 50 million samples covering 282 natural languages and 40 programming languages, and release a family of open-source models ranging from 140M to 8B parameters. ML-Embed ranks first on 9 of 17 MTEB benchmarks, with notable gains on Polish (+22.89) and Vietnamese (+6.88).
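
To make the representation-dimension axis (MRL) concrete, here is a minimal sketch of how a nested embedding is typically consumed: keep a prefix of the vector, renormalize it, and compare. It illustrates the general Matryoshka idea only, not ML-Embed's training code or API. The function names (`truncate_and_normalize`, `nested_similarities`) and the toy vectors are hypothetical.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates of `vec` and rescale to unit L2 norm."""
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]


def nested_similarities(query, doc, dims):
    """Cosine similarity of two embeddings at each nested prefix length."""
    sims = {}
    for d in dims:
        q = truncate_and_normalize(query, d)
        k = truncate_and_normalize(doc, d)
        # Both prefixes are unit vectors, so the dot product is the cosine.
        sims[d] = sum(a * b for a, b in zip(q, k))
    return sims


# Hypothetical 8-dimensional embeddings, for illustration only.
query_vec = [0.9, 0.1, 0.3, -0.2, 0.05, 0.4, -0.1, 0.2]
doc_vec = [0.8, 0.2, 0.1, -0.1, 0.30, 0.1, 0.2, -0.3]
scores = nested_similarities(query_vec, doc_vec, dims=[2, 4, 8])
```

A model trained with a nested objective is meant to keep the shorter prefixes useful on their own, so a caller can trade accuracy for storage by choosing a smaller entry of `dims`.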

Optimizing Language Models for Crosslingual Knowledge Consistency

This paper addresses the problem of multilingual LLMs giving conflicting answers to the same question asked in different languages. It proposes an RL objective whose reward is the log-likelihood of the answer in another language, and proves that the optimal policy takes a product-of-experts form and that this policy guarantees crosslingual preference consistency when \(\gamma_1\gamma_2=\beta^2\). From this result it derives Direct Consistency Optimization (DCO), an algorithm that requires neither a reward model nor online sampling. Across 9 LLMs, 3 multilingual QA benchmarks, and 26 languages, DCO improves both crosslingual consistency (measured by RankC) and answer accuracy.
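
For readers unfamiliar with the term, a product of experts combines several distributions by multiplying them (optionally raised to weights) and renormalizing. The sketch below shows that generic operation on toy answer distributions. It is not the paper's derived policy: the weights, the toy probabilities, and the name `product_of_experts` are assumptions for illustration, and how the weights relate to \(\gamma_1\), \(\gamma_2\), and \(\beta\) is not given in this summary.

```python
import math


def product_of_experts(log_probs_list, weights):
    """Combine distributions as p(a) proportional to prod_i p_i(a) ** w_i.

    `log_probs_list` holds one list of log-probabilities per expert, all over
    the same candidate answers. Returns normalized probabilities.
    """
    if len(log_probs_list) != len(weights):
        raise ValueError("need exactly one weight per expert")
    n = len(log_probs_list[0])
    if any(len(lp) != n for lp in log_probs_list):
        raise ValueError("all experts must score the same candidates")
    # Weighted sum of log-probabilities is the log of the weighted product.
    combined = [
        sum(w * lp[i] for w, lp in zip(weights, log_probs_list))
        for i in range(n)
    ]
    # Log-sum-exp normalization for numerical stability.
    peak = max(combined)
    log_z = peak + math.log(sum(math.exp(c - peak) for c in combined))
    return [math.exp(c - log_z) for c in combined]


# Toy example: the same question scored in two languages over three candidate
# answers. The numbers are made up for illustration.
lang_a = [math.log(p) for p in (0.6, 0.3, 0.1)]
lang_b = [math.log(p) for p in (0.2, 0.7, 0.1)]
combined_probs = product_of_experts([lang_a, lang_b], weights=[1.0, 1.0])
best_answer = max(range(len(combined_probs)), key=combined_probs.__getitem__)
```

An answer receives high combined probability only if every expert assigns it reasonable mass, which gives an intuition for why such a form pulls answers in different languages toward agreement.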