
🌐 Multilingual & Translation

💬 ACL2025 · 86 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (8) · 💬 ACL2026 (64) · 🧪 ICML2026 (3) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (11) · 📹 ICCV2025 (1)

🔥 Top topics: Translation ×23 · LLM ×14 · Alignment/RLHF ×5 · Few-/Zero-Shot Learning ×3 · Speech & Audio ×3

A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs: This work systematically evaluates the zero-shot cross-lingual generalization capabilities of LLMs on three classical languages (Sanskrit, Ancient Greek, and Latin) across three NLU tasks: NER, machine translation, and question answering. It also contributes a dataset of 1,501 Sanskrit QA pairs and validates the effectiveness of RAG strategies, revealing that model scale is the decisive factor in cross-lingual generalization.
Accessible Machine Translation Evaluation For Low-Resource Languages: To address the evaluation dilemma of machine translation in low-resource languages, this paper proposes an accessible evaluation framework that does not rely on high-quality reference translations or large-scale annotated data, enabling effective translation quality assessment for resource-constrained languages.
Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation: The DCSQE framework is proposed to effectively alleviate the distribution shift in synthetic QE data through constrained beam search for generating more realistic synthetic translations, an independent annotator model to correct label bias, and the SPCE algorithm to aggregate token-level labels into phrase-level labels. It outperforms state-of-the-art baselines like CometKiwi in both supervised and unsupervised settings.
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT): This paper introduces HPLT v2, a large-scale multilingual dataset extracted from 4.5 PB of Internet Archive and Common Crawl data. It contains 8 trillion tokens of monolingual data covering 193 languages and 380 million parallel sentence pairs covering 51 languages, achieving significantly improved data quality through an enhanced data processing pipeline.
Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral: This work proposes UniMoral, a unified moral reasoning dataset across 6 languages that models moral reasoning as a computational pipeline containing action prediction, moral typology classification, factor attribution, and consequence generation. Benchmarking on three LLMs reveals that implicit moral context enhances models' moral reasoning capabilities, yet specialized methods are still required.
AskQE: Question Answering as Automatic Evaluation for Machine Translation: This paper proposes AskQE, a question answering-based quality estimation framework for machine translation. By generating questions from the source text, answering them based on both the source text and the back-translation output, and comparing the answers to detect translation errors, it helps users who do not understand the target language determine the acceptability of translations. On the BioMQM dataset, its Kendall's \(\tau\) correlation and decision accuracy outperform existing QE metrics.
7 Points to Tsinghua but 10 Points to 清华? Assessing Agentic Large Language Models in Multilingual National Bias: This paper presents the first systematic study of national bias in LLMs acting as multilingual recommendation agents in reasoning-based decision-making tasks. Using three scenarios (university application, travel, and relocation) alongside the Thurstone Case III (comparative judgment) method, the study quantifies rating discrepancies for GPT-3.5, GPT-4, and Claude Sonnet across six languages. The findings reveal widespread "local language bias" and demonstrate that Chain-of-Thought (CoT) reasoning paradoxically exacerbates bias in non-English languages.
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization: This paper systematically evaluates the correlation of n-gram and neural network evaluation metrics with human judgments across 8 languages (representing 4 morphological typology families). The authors find that n-gram metrics negatively correlate with human judgments in highly fusional languages (Arabic, Hebrew), whereas COMET, a specially trained neural metric, consistently outperforms other methods across all language typologies.
Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning: Through a systematic analysis of multilingual ICL strategies, this study reveals that mixing demonstrations of various high-resource languages (HRLs) in the prompt consistently outperforms purely English demonstrations, yielding particularly significant improvements on low-resource languages (LRLs) (e.g., an 8.9% to 12.6% average LRL accuracy gain on Llama 3.1). Intriguingly, even merely appending context-irrelevant non-English sentences to the prompt yields measurable gains, revealing the phenomenon that "multilingual exposure itself is effective."
Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention: This paper proposes INCLINE (Inference-Time Cross-Lingual Intervention), a tuning-free inference-time framework. By learning an alignment matrix to transform internal representations of low-performance languages into the representation space of high-performance languages, it significantly boosts multilingual performance across 9 benchmarks and 5 LLMs.
Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce: By surveying 81 low-resource language NLP researchers and annotators, this paper reveals quality issues (unnatural data, cultural misalignment) and ethical concerns (exploitation of annotators' labor, unfair authorship attribution) in low-resource language data construction, proposing six actionable recommendations for improvement.
CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning: This paper proposes CC-Tuning, a multilingual fine-tuning paradigm that explicitly establishes cross-lingual connections in the hidden space. It enhances non-English capabilities by fusing feed-forward activations of English and non-English inputs, and employs a Transform Matrix during inference to simulate this cross-lingual connection.
CLIX: Cross-Lingual Explanations of Idiomatic Expressions: The cross-lingual idiom explanation task (CLIX) is proposed, along with a dataset containing English idioms and their Spanish/German explanations. The performance of seq2seq models and LLMs on this task is systematically evaluated, revealing that a GPT-3.5 Turbo pipeline strategy (generating English explanations followed by translation) combined with few-shot learning achieves the best results, with human evaluation scores for fluency and accuracy exceeding 4.7 out of 5.
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs: Inspired by the phenomenon of code-switching in human second language acquisition, this paper proposes the CSCL (Code-Switching Curriculum Learning) framework. Through a progressive curriculum training strategy of "token-level CS \(\rightarrow\) sentence-level CS \(\rightarrow\) monolingual corpus", CSCL enhances the cross-lingual transfer capability of LLMs and significantly outperforms monolingual continual pre-training on target languages such as Korean, Japanese, and Indonesian.
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding: This paper proposes the CSRT (Code-Switching Red-Teaming) framework, which leverages the common real-world phenomenon of code-switching to construct mixed-language red-teaming queries. It successfully uncovers severe safety vulnerabilities across 10 mainstream LLMs, achieving an attack success rate 46.7% higher than standard English attacks, thereby revealing the vulnerability of current LLM safety alignment in multilingual scenarios.
Comparative Analysis of Multilingual Hate Speech Detection: This paper systematically compares the performance of various LLMs and pre-trained language models on multilingual hate speech detection tasks, revealing the key bottlenecks of cross-lingual transfer and proposing enhancement strategies for low-resource languages.
Context Augmented Token-Level Post-Editing for Human Interpreting: This paper proposes a context-augmented token-level post-editing method that leverages dialogue context information for fine-grained error correction of automatic speech recognition (ASR) transcripts in human interpreting. It significantly improves transcription quality while maintaining the fluency of the interpreting.
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs: This paper proposes the CIA (Cross Lingual Auto Evaluation) Suite, a cross-lingual LLM evaluation framework that includes the evaluator model Hercule and the human-annotated test set Recon. By leveraging English reference answers to score non-English LLM responses, the 8B evaluator model outperforms closed-source large models like GPT-4o in multilingual evaluation.
Cross-Lingual Optimization for Language Transfer in Large Language Models: This paper proposes Cross-Lingual Optimization (CLO), which modifies the DPO loss function to achieve cross-lingual preference optimization—preferring target language responses given target language inputs, and English responses given English inputs. It consistently outperforms SFT across 5 models × 6 languages; in low-resource languages, CLO requiring only 3,200 samples outperforms SFT trained on 6,400 samples.
Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning: This paper explores a method for cross-lingual representation alignment without parallel corpora. By performing contrastive learning on multilingual image-caption pairs (CLIP-style), text representations in different languages are implicitly aligned in a shared visual space. It demonstrates that even languages unseen during the pre-training of the encoder (such as Quechua) can be integrated into the alignment framework using this approach.
Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon: By constructing an interpretable experimental framework, this study investigates the cross-lingual transfer of cultural knowledge during the language adaptation process of LLMs. It finds a bidirectional transfer between high-resource languages (Chinese, Korean) and English, whereas low-resource languages (Tibetan, Mongolian) exhibit an asymmetric transfer—knowledge mainly flows from low-resource languages to English, with limited reverse flow. The frequency hypothesis is proposed to explain this phenomenon.
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation: This paper systematically investigates the cross-lingual transfer effects of English debiasing/detoxification fine-tuning across 7 LLMs and 20 languages. The study finds that SFT is effective for debiasing and DPO for detoxification, but transfer to non-English languages is generally accompanied by a decline in language generation capabilities (impaired language consistency, fluency, and diversity). Furthermore, the transfer performance can be predicted by the pre-training data volume of the target language.
Cross-Lingual Generalization and Compression: From Language-Specific to Shared Neurons: By tracking checkpoints during the pre-training of multilingual language models, this paper discovers that models gradually compress language-specific representations into cross-lingual shared representations: language identification ability in the middle layers decreases, "expert neurons" for semantic concepts align cross-lingually, and manipulating concept neurons extracted from Spanish data unexpectedly causes the model to generate semantically related English text.
Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models: This paper proposes an automated approach based on beam search and LLM simulation to efficiently generate bilingual question pairs that expose cross-lingual performance weaknesses of multilingual LLMs in target languages. It establishes a dataset of over 6,000 samples across 16 languages, revealing that even GPT-4o suffers a cross-lingual accuracy drop exceeding 30%.
CruxEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution: This paper proposes CruxEval-X, a multilingual code reasoning benchmark covering 19 programming languages. It is extended from the Python version of CruxEval using a fully automated test-guided translation pipeline and contains 12,660 problems and 19K test cases. Evaluation of 24 LLMs reveals correlations among programming languages and the cross-lingual generalization capabilities of monolingually trained models.
CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis: CulFiT proposes a culture-aware LLM training paradigm. By leveraging multilingual critique data synthesis and fine-grained reward modeling, it enhances model sensitivity and inclusivity toward diverse cultures, achieving state-of-the-art performance among open-source models on multiple cultural understanding benchmarks.
Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries: This paper proposes a cross-lingual vocabulary transfer method based on bilingual dictionaries. By exploiting the BPE tokenizer property where "removing subwords causes a fallback to shorter subwords," it maximizes the mapping coverage of target language subwords through an iterative removal-retokenization-alignment process. It significantly outperforms existing methods that rely on monolingual or parallel corpora in low-resource language scenarios.
Disentangling Language and Culture for Evaluating Multilingual Large Language Models: The Dual Evaluation Framework is proposed to decouple multilingual LLM evaluation along two dimensions: "linguistic medium" and "cultural context." This reveals a "Cultural-Linguistic Synergy" phenomenon—where models perform better when the cultural context aligns with the querying language—and explains this behavior from an interpretability perspective using FFN neuron activation analysis.
Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs: This paper proposes the X-KDE framework, which achieves "edit in one language, update in all languages" cross-lingual knowledge synchronization via Cross-lingual Edition Instruction Tuning (XE-IT) and Target-language Preference Optimization (TL-PO). It achieves an average improvement of +8.19% on the Bi-ZsRE and MzsRE benchmarks, significantly outperforming all existing methods in cross-lingual scenarios.
EXECUTE: A Multilingual Benchmark for LLM Token Understanding: This paper extends the character understanding benchmark CUTE to 8 languages and multiple writing systems, proposing the EXECUTE framework. The study demonstrates that LLM performance varies dramatically across character, word, and sub-character levels in different languages, and counter-intuitively finds that LLMs perform better on token understanding tasks for less familiar languages.
Exploring In-context Example Generation for Machine Translation: Proposes DAT (Demonstration Augmentation for Translation), which enables LLMs to automatically generate relevant and diverse source-target sentence pairs as in-context demonstrations without any external resources. DAT outperforms zero-shot and fixed-demonstration few-shot baselines across five low-resource language translation tasks.
Exploring In-Image Machine Translation with Real-World Background: This paper proposes the DebackX model, which processes images by separating them into background and text-image components. It addresses the In-Image Machine Translation (IIMT) task under real-world complex backgrounds for the first time, outperforming existing methods in both translation quality and visual presentation.
Language Fusion for Parameter-Efficient Cross-lingual Transfer (FLARE): FLARE fuses layer-wise representations of the source (English) and target languages via lightweight linear/non-linear transformations within the low-rank bottleneck of LoRA adapters. It achieves parameter-efficient cross-lingual transfer without requiring extra parameters, improving QA exact match by 4.9% on Llama 3.1.
GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning: GrammaMT is proposed to leverage grammatical information from Interlinear Glossed Text (IGT) to enhance few-shot machine translation in LLMs, achieving an average improvement of 12+ BLEU on endangered languages and consistent improvements on medium- and high-resource languages.
Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model: DMoE is proposed—a method combining parameter deviation-based dynamic language grouping with selective MoE layer expansion. By quantifying language similarity through only 10 steps of fine-tuning, similar languages are grouped to share the same expert. MoE expansion is applied only to layers with large parameter deviations (language-specific layers). It reduces PPL by 11.4% compared to continual pre-training across 18–128 languages, and outperforms X-ELM by 9.6% with 3.6x fewer parameters.
Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings: A method leveraging multilingual Matryoshka embeddings is proposed to achieve hierarchical news clustering. Different dimensional subsets of the embeddings correspond to semantic similarities of various granularities (theme \(\rightarrow\) topic \(\rightarrow\) event). Combined with an improved hierarchical agglomerative clustering algorithm, this approach achieves state-of-the-art (SOTA) performance on SemEval 2022 Task 8 (Pearson \(\rho = 0.816\)).
Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment: This paper proposes utilizing implicit reward signals from a well-aligned English DPO model to annotate preference relationships through cross-lingual instruction-response pairs. Combined with iterative DPO training, this approach achieves efficient multilingual preference alignment, resulting in an average win rate improvement of 12.72% on X-AlpacaEval.
Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models: This work systematically investigates the impact of incorporating parallel data during decoder-only LLM training on multilingual capabilities. It finds that applying parallel data at the final stage of training yields the best performance, significantly outperforming an equivalent amount of monolingual data. Furthermore, LLMs do not automatically generalize from a trained translation direction to its reverse (the reversal curse).
KnowCoder-X: Boosting Multilingual Information Extraction via Code: This work proposes KnowCoder-X, which represents multilingual IE schemas through uniform Python classes and introduces an IE cross-lingual alignment instruction tuning stage (including a high-quality ParallelNER dataset), significantly boosting cross-lingual information extraction performance across 64 IE benchmarks.
LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation: The LACA framework is proposed to leverage LLMs to generate high-quality pseudo-labeled data in the target language (rather than relying on machine translation). This significantly improves cross-lingual ABSA performance across six languages, outperforming the previous SOTA by an average of 1.50% and 2.62% on mBERT and XLM-R, respectively.
LangMark: A Multilingual Dataset for Automatic Post-Editing: This paper releases LangMark—a large-scale multilingual Automatic Post-Editing (APE) dataset comprising 206,983 triplets covering English to seven target languages, and demonstrates that LLMs combined with few-shot prompting can effectively improve the output quality of proprietary NMT engines.
LangSAMP: Language-Script Aware Multilingual Pretraining: The LangSAMP method is proposed, which adds language and script embeddings to the output end (rather than the input end) of the Transformer during multilingual pretraining. This enables the model backbone to learn more language-neutral representations, consistently outperforming the baseline in zero-shot cross-lingual transfer across over 500 languages.
LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World: Introduces Lemonade—a large-scale multilingual expert-annotated event dataset based on ACLED conflict data (39,786 events, 20 languages, 171 countries, 10,707 entities). It proposes a new task paradigm, Abstractive Event Extraction (AEE), where event arguments are not limited to text spans but are normalized into numerical, categorical, or entity values. The accompanying zero-shot entity linking system, Zest, achieves an F1 score of 45.7% on the AEL subtask, significantly outperforming the baseline of 23.7%.
Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts: By analyzing the cross-lingual representation similarity across different layers of LLMs, this paper proposes LayerMoE, which dynamically allocates different numbers of experts for new languages on a layer-wise basis (fewer for high-similarity layers, more for low-similarity layers). It outperforms SOTA with 60% fewer expert parameters and further mitigates catastrophic forgetting by introducing routing classifiers in high-similarity layers.
LexGen: Domain-aware Multilingual Lexicon Generation: This paper proposes the LexGen framework, which introduces a learnable "Domain Routing" layer into the decoder of a pre-trained multilingual translation model to achieve dynamic fusion of domain-specific and domain-general knowledge. LexGen outperforms baselines such as NLLB and BLICEr on lexicon generation tasks across 6 Indic languages and 8 domains.
LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline: This paper proposes a new paradigm that constructs SFT data by rearranging source and target language tokens into interleaved sequences based on latency requirements. This enables LLMs to perform high-quality Simultaneous Machine Translation (SiMT) as efficiently as offline translation, achieving SOTA performance on multiple benchmarks while preserving original offline translation capabilities.
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models: This work dissects the cross-lingual factual inconsistency in multilingual LLMs using mechanistic interpretability. It reveals that while models process knowledge in a language-agnostic concept space across most layers, inconsistencies arise from failures during the "language transition" process in the final few layers. A linear shortcut method is proposed to bypass these final layers, improving both consistency and accuracy.
Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books: This work decomposes grammar-book-assisted extremely low-resource translation into two steps: grammar rule retrieval and rule application. It proposes a Rule-by-Rule retrieval strategy and a code-format grammar rule representation, achieving a 13.1% BLEU end-to-end improvement in Zhuang translation.
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation: This paper proposes the M-MAD framework, which decouples the MQM evaluation standard into independent dimensions (Accuracy, Fluency, Style, Terminology). It conducts multi-agent pro-con debates within each dimension and uses a judge agent to synthesize the results of all dimensions. M-MAD significantly outperforms existing LLM-as-a-judge methods at the segment level, and even with GPT-4o mini, it achieves performance comparable to SOTA reference-based automatic metrics.
M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation: Proposes M2rc-Eval, a large-scale multilingual repository-level code completion benchmark covering 18 programming languages, combined with AST-based fine-grained annotations at both bucket and semantic levels, and constructs the M2rc-Instruct instruction corpus to enhance model performance.
M3FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset: This work introduces M3FinMeeting, the first multilingual (Chinese, English, Japanese), multi-sector, and multi-task evaluation benchmark for financial meetings. Containing 600 real-world financial meetings with three tasks—summarization, Q&A pair extraction, and question answering—it reveals that state-of-the-art LLMs still have significant room for improvement in understanding financial meetings.
M-RewardBench: Evaluating Reward Models in Multilingual Settings: This work constructs the first multilingual reward model evaluation benchmark, M-RewardBench (covering 23 typologically diverse languages, 2.87K preference instances, across four capability categories: Chat, Safety, Reasoning, and Translation). Systematic evaluation of various RMs reveals a significant performance gap between English and non-English settings, and indicates that RM preferences can shift substantially across different languages.
Machine Translation Models are Zero-Shot Detectors of Translation Direction: An unsupervised translation direction detection method based on NMT translation probabilities is proposed. It scores a pair of parallel texts in both directions and, relying on the expectation that \(p(\text{translation}|\text{original}) > p(\text{original}|\text{translation})\), takes the more probable direction as the original translation direction. This zero-shot approach achieves 96% document-level detection accuracy for NMT translations.
Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models: Expands the English IFEval benchmark to 30 languages with cultural localization, revealing a 25-35% accuracy gap between high- and low-resource languages in multilingual instruction following, and showing that machine-translated data underestimates model performance by 7-22%.
MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation: Introduces MaXIFE, an evaluation benchmark covering 1,667 verifiable instruction-following tasks across 23 languages. Combining rule-based and model-based evaluation strategies, it systematically assesses the instruction-following capabilities of LLMs in multilingual and cross-lingual scenarios, filling a significant gap in the evaluation landscape.
Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation: This paper presents the first systematic study of how memorization behavior of teacher models is transferred to student models in sequence-level knowledge distillation (SeqKD). It is discovered that although the student model is never directly exposed to the original training data, its extractive memorization rate is \(57\%\) higher than that of the baseline model, accompanied by an increased hallucination rate. Adaptive-SeqKD is proposed to mitigate these issues by fine-tuning the teacher on a high-quality subset.
Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs: Through large-scale analysis of 1000+ language pairs (35 languages, 1190 directions), this work discovers that the middle layer of LLMs has the strongest potential for cross-lingual semantic alignment. It proposes alternately optimizing a middle-layer contrastive alignment loss during task fine-tuning, which significantly improves cross-lingual transfer on three major tasks: slot filling (F1 +1.5), machine translation (COMET +1.1), and JSON generation, while remaining effective for unseen languages and out-of-domain data. The separately trained alignment and task LoRA modules can be merged via weight averaging.
MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages: The authors construct MiLiC-Eval, the first standardized LLM evaluation benchmark for China's minority languages (Tibetan, Uyghur, Kazakh, and Mongolian). It contains 24k instances across 9 tasks, revealing the severe deficiencies of current LLMs in processing non-mainstream writing systems.
Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment: This paper proposes a modular training scheme for multilingual sentence encoders: it first trains language-specific modules (embedding + language adapter + sentence encoder adapter) to alleviate the curse of multilinguality, and then trains cross-lingual alignment adapters using both parallel and paraphrase data to resolve performance trade-offs among different cross-lingual tasks. This approach consistently outperforms monolithic model training across 4 tasks and 23 languages.
Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models: Proposes a multilingual sensitive question-answering dataset, MSQAD (based on 17 human rights topics from Human Rights Watch across 6 languages), and systematically demonstrates through two statistical hypothesis tests (McNemar's test and PERMANOVA) that LLMs exhibit significant ethical bias when answering the same sensitive questions in different languages: Chinese and Hindi show the highest refusal rates, whereas Spanish and German are most prone to generating inappropriate responses, a bias widely observed across 7 LLMs.
Has Machine Translation Evaluation Achieved Human Parity?: Introduces human performance baselines to the rankings of the WMT Metrics Shared Task for the first time, finding that state-of-the-art automatic metrics often rank on par with or even higher than human evaluators. However, it argues that claiming "human parity" is premature and discusses the fundamental difficulties of measuring progress in MT evaluation.
Multi-perspective Alignment for Increasing Naturalness in Neural Machine Translation: A multi-perspective alignment framework is proposed to simultaneously reward translation naturalness and content preservation. By fine-tuning NMT models using reinforcement learning with joint reward signals from a translationese classifier and COMET, the model generates lexically richer translations without sacrificing translation accuracy.
Multilingual Encoder Knows More Than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages: An encoder-decoder weight sharing framework is proposed, which constructs a decoder by alternately reusing encoder weight layers and randomly initialized layers, efficiently extending the multilingual encoder CINO to the seq2seq model XLM-SWCM. With fewer than 0.5B parameters, it significantly outperforms mBART and 13B LLaMA on four extremely low-resource languages: Tibetan, Uyghur, Kazakh, and Mongolian.
Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs: This paper reveals that multilingual LLMs exhibit an "English accent" when generating non-English text—biasing toward English patterns lexically and syntactically. It proposes corpus-level naturalness metrics based on JSD (for lexical distribution) and WL graph kernel + MMD (for syntactic dependency trees), and demonstrates that the naturalness of target languages can be effectively improved using DPO alignment.
Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning: A systematic quality audit of three major public multilingual speech datasets (Common Voice 17.0, FLEURS, VoxPopuli) covering 40+ languages is conducted. Issues are categorized into "micro-level issues" that can be addressed programmatically and "macro-level issues" that require linguistic expertise. Macro-level issues are found to be particularly severe for low-institutionalization languages. A 5-step dataset creation guide incorporating sociolinguistic awareness is proposed.
NameTag 3: A Tool and a Service for Multilingual/Multitagset NER: This paper introduces NameTag 3, an open-source multilingual, multi-dataset, and multi-tagset named entity recognition (NER) tool and cloud service. Based on fine-tuned pre-trained language models, a single 355M parameter model achieves SOTA performance on 21 test sets across 15 languages, while being over 10,000 times faster than LLMs such as DeepSeek-R1.
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set: This paper proposes the first cross-framework, cross-lingual unified discourse relation label set (17 categories). Through attention probing experiments on 23 LLMs, it demonstrates that multilingual LLMs can encode cross-linguistically transferable discourse-level representations in their intermediate layers, and that multilingual training combined with model scale enhances generalization capabilities.
Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation: The Registering method is proposed. It inserts a set of target-language markers (registers) between the source and target tokens and modifies the attention mask so that target generation relies solely on the activations of the registers. This thoroughly resolves the off-target problem in multilingual translation and enables the small-scale MITRE-913M model to outperform NLLB-3.3B.
Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-Lingual Transfer: This paper proposes SALT (Semantic Aware Linear Transfer). By constructing independent least-squares transformation matrices for each non-shared vocabulary token based on semantically similar shared token pairs, it transfers the rich embedding representations of a target language PLM to the embedding space of an English-centric LLM. It outperforms existing methods across downstream tasks, continual pre-training convergence speed, and cross-lingual understanding.
SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation: Modeling simultaneous machine translation (SiMT) as a multi-step sequential decision-making problem, this paper proposes the SeqPO-SiMT policy optimization framework. By fusing reward signals of translation quality and latency, it achieves performance on a 7B LLM that is comparable to strong offline translation models.
ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework: This paper proposes the ShifCon framework, which significantly improves performance on low-resource languages. It shifts the representations of non-dominant languages into the subspace of the dominant language to access richer model knowledge, shifts them back to the original language subspace for generation, and combines this with multilingual contrastive learning.
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning: This work constructs SIFT-50M, a speech instruction fine-tuning dataset containing 50 million samples across 5 languages. It automatically generates diverse speech understanding and controllable speech generation instructions from public speech corpora using LLMs and expert models. Training SIFT-LLM on this dataset yields performance that significantly outperforms existing speech-text LLMs on instruction-following benchmarks.
Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models: This work extends the Statement-Tuning method to multilingual scenarios, demonstrating that mDeBERTa, an encoder-only model with only 276M parameters, can achieve cross-lingual zero-shot generalization across unseen tasks and unseen languages after multilingual Statement-Tuning, matching or even surpassing generative LLMs with 70B+ parameters on multiple NLU tasks.
Team ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs: This paper systematically evaluates the performance of 13 models (LLMs + traditional MT) on English-Korean entity-dense text translation in SemEval-2025 Task 2. Through automatic metrics and bilingual human evaluation, it reveals that while LLMs outperform traditional MT, they still generally fail on entity translations requiring cultural adaptation, and establishes a translation error taxonomy.
The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages: This paper proposes the Esethu framework, a community-driven, sustainable data governance approach that enables circular reinvestment of data revenue through community-centric licensing. The framework is validated using ViXSD, an isiXhosa speech dataset.
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual Contexts: This paper systematically analyzes the impact of preference tuning (such as RLHF/DPO) on the internal representation space of LLMs in multilingual scenarios. It finds that while the alignment mechanism effectively separates the latent space representations of harmful and harmless content in English, this effect is significantly degraded in non-English languages such as Hindi, Chinese, and German, revealing a severe monolingual bias in current alignment methods.
THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation: This paper proposes the THOR-MoE framework, which utilizes hierarchical task-guided routing (automatically predicting domain/language and generating soft-mixed task representations to select a task-level expert subset) and context-responsive routing (injecting global context into token representations to assist expert selection). It achieves significant performance gains in multi-domain and multilingual translation with fewer activated parameters.
Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST): The authors construct GIST, the first large-scale multilingual AI terminology dataset (approximately 5K terms across 5 languages), using a hybrid framework of LLM extraction + human crowdsourced translation + LLM selection. They demonstrate that prompting-based post-translation optimization consistently improves the translation quality of AI terminology in machine translation across metrics such as BLEU and COMET.
Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation: The Trans-Zero self-play framework is proposed, which utilizes only monolingual data. By exploring semantically consistent candidate translations during the multilingual translation process through Genetic Monte-Carlo Tree Search (G-MCTS) and combining this with preference optimization, it achieves parallel-data-free multilingual translation training with performance comparable to large-scale supervised fine-tuning (SFT) methods.
Translation and Fusion Improves Zero-shot Cross-lingual Information Extraction: TransFusion is proposed, which first translates low-resource language texts into English at inference time, performs information extraction annotation on English, and then uses a fusion model to combine English annotations with the source text to generate final predictions. It significantly outperforms baselines on zero-shot cross-lingual IE tasks across 50 languages (increasing average F1 on MasakhaNER2 from 47.9 to 62.4).
Did Translation Models Get More Robust Without Anyone Even Noticing?: Through experiments with synthetic noise and social media texts, this paper finds that modern large-scale pre-trained translation models (such as TowerInstruct 13B and GPT-3.5) are far more robust than traditional NMT models (OPUS) to various kinds of character-level noise, even though they use no specialized robustness training techniques. Furthermore, the combination of source-side correction and LLM translation can even surpass GPT-3.5.
Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu: This work systematically investigates how various language resources (dictionaries, parallel corpora, grammar books, CoT prompts) affect translation quality in LLM in-context machine translation. Using Manchu as a case study, it finds that high-quality dictionaries and retrieved parallel examples are the most valuable, while grammar books are almost useless. Character-encryption experiments indicate that LLMs primarily rely on in-context learning rather than prior knowledge of the language. Finally, the study demonstrates that in-context translation can be used to generate synthetic parallel data for training traditional NMT models.
Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation: This paper proposes source-based MBR (sMBR) decoding, which utilizes "quasi-sources" generated via paraphrasing or back-translation as "support hypotheses." Combined with a reference-free Quality Estimation (QE) metric as the utility function, this approach is the first to completely rely on source-side information in MBR decoding. It outperforms both QE reranking and standard MBR decoding under both classical and LLM-based NMT settings.
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation: This paper systematically exposes gender disparities in machine translation quality estimation (QE) metrics: masculine forms score higher than feminine forms when source genders are ambiguous; feminine forms have higher error rates in the presence of contextual cues; and the biases propagate to downstream MT systems through data filtering and quality-aware decoding (QAD).
X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System: This paper proposes X-WebAgentBench, a multilingual interactive web benchmark designed to evaluate the planning and interaction capabilities of language agents across various languages. Multiple LLMs and cross-lingual alignment methods are evaluated, revealing that even GPT-4o combined with cross-lingual techniques fails to achieve satisfactory results.
ZIPA: A Family of Efficient Models for Multilingual Phone Recognition: This paper proposes the ZIPA family of efficient speech models. Built on the Zipformer backbone and trained on the IpaPack++ dataset (17,132 hours of multilingual annotated data), ZIPA achieves SOTA on multilingual phone recognition. A 64M-parameter model outperforms existing 300M-parameter models, and performance is further boosted across 4000+ languages via noisy student training.
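
As an illustration of the JSD-based lexical naturalness idea mentioned in the "Do Large Language Models Have an English Accent?" entry above, the following is a minimal sketch of a corpus-level Jensen-Shannon divergence between two token distributions. It is not the paper's implementation: the use of plain unigram counts, the absence of smoothing, the base-2 logarithm, and all function names (`lexical_jsd`, `unigram_distribution`, and so on) are assumptions made for this example.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocab):
    """Relative frequency of each vocabulary item in `tokens` (assumed unigram model)."""
    if not tokens:
        raise ValueError("token list must not be empty")
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocab]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric JSD in bits, bounded between 0 and 1."""
    # The mixture m is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def lexical_jsd(model_tokens, human_tokens):
    """JSD between the unigram distributions of two tokenized corpora."""
    vocab = sorted(set(model_tokens) | set(human_tokens))
    p = unigram_distribution(model_tokens, vocab)
    q = unigram_distribution(human_tokens, vocab)
    return jensen_shannon_divergence(p, q)
```

Under these assumptions, a lower value means the lexical distribution of model-generated text is closer to that of human-written text in the same language; identical corpora give 0 and corpora with no shared vocabulary give 1.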