🌐 Multilingual & Translation

💬 ACL2026 · 24 paper notes

A Multilingual Dataset and Empirical Validation for the Mutual Reinforcement Effect in Information Extraction

This work constructs MMM, the first multilingual Mutual Reinforcement Effect (MRE) mix dataset (21 subsets covering English, Chinese, and Japanese), and validates through large-scale ablation experiments that the MRE between word-level and text-level information extraction tasks holds universally across languages.

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

This paper constructs MENT, a non-literal translation meta-evaluation dataset comprising 7,530 human-annotated instances, reveals the unreliability of traditional metrics and LLM-as-Judge approaches for evaluating non-literal translation, and proposes RATE, an agentic evaluation framework in which a reflective Core Agent dynamically invokes sub-agents, improving correlation with human judgments by more than 3.2 points.

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

This is the first unified survey dedicated to Indian language NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models/tools. Resources are organized under 17 task categories spanning core linguistic processing to sociocultural tasks. The survey systematically analyzes persistent challenges including uneven language coverage, annotation fragmentation, and evaluation inconsistency.

Efficient Training for Cross-lingual Speech Language Models

This paper proposes CSLM, a data-efficient method for training cross-lingual speech LLMs. It introduces a novel alignment strategy to achieve cross-modal and cross-lingual alignment simultaneously, and presents a speech-text interleaved chain-of-modality generation paradigm to improve quality and reduce latency—without requiring large-scale speech data to extend to new languages.

Exploring Two-Phase Continual Instruction Fine-tuning for Multilingual Adaptation in Large Language Models

This paper proposes a two-phase continual fine-tuning (CFT) framework—first fine-tuning on English instruction data, then on multilingual data—and finds that instruction similarity between the two phases is the key factor determining whether English capability degrades. Generative replay and heuristic layer freezing are shown to effectively mitigate representation drift and English forgetting caused by dissimilar datasets.

IndoTabVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

This paper presents IndoTabVQA, a cross-lingual visual question answering benchmark for table understanding in Bahasa Indonesia documents. The dataset comprises 1,593 document images annotated with QA pairs in four languages (Indonesian, English, Hindi, and Arabic). The benchmark reveals substantial performance gaps in VLMs for low-resource languages and cross-lingual table understanding, with fine-tuning combined with spatial priors achieving up to 48.5% In-Match accuracy.

Just Use XML: Revisiting Joint Translation and Label Projection

This paper proposes LabelPigeon, a joint translation and label projection method based on XML markup. By fine-tuning the NLLB-200 translation model on high-quality XML-annotated parallel corpora, LabelPigeon surpasses all baselines across 11 languages while actively improving translation quality, achieving gains of up to +40.2 F1 on downstream cross-lingual NER tasks.
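The core idea of XML-based label projection can be illustrated with a small sketch (my own toy code, not the paper's implementation): labeled spans in the source are wrapped in XML tags, the markup is preserved through translation, and the target-side annotations are then read back from the tags. The tag names and the German example sentence below are hypothetical.

```python
import re

def extract_spans(tagged_text: str):
    """Recover (label, start, end) triples from XML-tagged text and
    return the clean sentence plus the projected annotations."""
    spans, parts, pos = [], [], 0
    # Match simple inline tags such as <PER>Angela Merkel</PER>;
    # the \1 backreference requires matching open/close tag names.
    pattern = re.compile(r"<(\w+)>(.*?)</\1>")
    for m in pattern.finditer(tagged_text):
        parts.append(tagged_text[pos:m.start()])
        start = sum(len(s) for s in parts)
        parts.append(m.group(2))
        spans.append((m.group(1), start, start + len(m.group(2))))
        pos = m.end()
    parts.append(tagged_text[pos:])
    return "".join(parts), spans

# Hypothetical tagged output of a markup-preserving MT model:
sentence, spans = extract_spans("<PER>Angela Merkel</PER> besuchte <LOC>Paris</LOC>.")
# sentence: "Angela Merkel besuchte Paris."
# spans:    [("PER", 0, 13), ("LOC", 23, 28)]
```

The appeal of the approach, as the paper argues, is that translation and projection happen in a single pass, so the projected spans are guaranteed to be consistent with the translation that is actually produced.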

Language Models Entangle Language and Culture

This paper evaluates multilingual LLMs on culturally neutral, open-ended advice-seeking questions derived from the WildChat dataset. It finds that query language systematically affects both response quality and cultural context — low-resource language queries yield notably lower quality responses than English, and language choice implicitly shifts the cultural framing of responses. A translated version of CulturalBench further validates the entanglement between language and culture in LLMs.

Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality

This paper proposes XBridge, an architecture that composes pretrained multilingual encoder-decoder translation models (e.g., NLLB) with English-centric LLMs — the encoder handles multilingual understanding, the LLM handles knowledge reasoning, and the decoder handles multilingual generation. Lightweight mapping layers and optimal transport alignment are employed to bridge cross-model semantic gaps, yielding significant improvements over baselines on low-resource and unseen languages.

Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs

This paper introduces LocQA, a benchmark comprising 2,156 location-sensitive QA pairs across 12 languages and 49 regions. By employing geographically ambiguous queries (e.g., "What is the emergency phone number?"), it exposes implicit biases in LLMs: a persistent US-centric default across languages (50% of model responses contain US answers vs. only 26% in the data), a within-language "demographic probability engine" effect driven by population size, and an exacerbation of global bias following instruction fine-tuning.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

This paper introduces MM-JudgeBench, the first large-scale multilingual multimodal judge benchmark (25 languages, 60K+ preference instances), evaluating 22 LVLMs and revealing significant cross-lingual performance disparities in current LVLM judges. Neither model size nor architecture predicts multilingual robustness, and even state-of-the-art judges exhibit inconsistent behavior, underscoring the need for multilingual multimodal evaluation benchmarks.

LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

This paper proposes LQM (Linguistically Motivated Multidimensional Quality Metrics), a six-tier linguistically motivated MT error taxonomy spanning sociolinguistics → pragmatics → semantics → morphosyntax → orthography → graphetics. A bidirectional parallel corpus of 3,850 sentences across seven Arabic dialects is constructed, and 6,113 expert-annotated error spans are produced to reveal systematic deficiencies of existing MT systems in dialect-aware and culturally sensitive translation.

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

To address extrinsic gender bias in pretrained language models applied to Bangla downstream classification tasks, this paper proposes RandSymKL, a method that jointly optimizes randomized cross-entropy loss and symmetric KL divergence to effectively reduce gender prediction disparities while maintaining classification accuracy.
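The symmetric-KL component of such a debiasing objective can be sketched as follows (a minimal illustration of the general technique, not the paper's RandSymKL code): the penalty is the average of KL(P||Q) and KL(Q||P) between the model's class distributions for an input and its gender-swapped counterpart, so predictions that differ only because of gender cues are pushed together.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def sym_kl(p, q):
    """Symmetric KL: penalizes disagreement between the predictions for
    a sentence and its gender-swapped counterpart."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Predicted class distributions for an original and a gender-swapped input:
p_orig = [0.7, 0.3]
p_swap = [0.6, 0.4]
penalty = sym_kl(p_orig, p_swap)  # 0 only when the two predictions agree
```

In training, a term like `penalty` would be added to the cross-entropy loss with a weighting coefficient, trading a small amount of task accuracy for lower gender prediction disparity.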

MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

This paper introduces MORPHOGEN, a large-scale gender-aware morphological generation benchmark covering French, Arabic, and Hindi (20,328 sentence pairs in total). It defines the GENFORM task (rewriting first-person sentences into the opposite gender), proposes three evaluation metrics—SGA, GIoU, and CGA—and benchmarks 15 multilingual LLMs, revealing systematic deficiencies in complex morphological reasoning, gender bias, and multi-entity interference.

Multilingual Language Models Encode Script Over Linguistic Structure

This paper systematically analyzes language-associated units in multilingual LMs using the LAPE metric and sparse autoencoders, finding that these units are primarily driven by orthography (writing system) rather than abstract linguistic structure. Romanization activates nearly entirely disjoint sets of neurons; word-order shuffling has minimal effect; typological information becomes accessible only gradually in deeper layers; and causal interventions reveal that functional importance correlates with surface-form invariance.

No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs

This paper demonstrates that no single prompting strategy is universally optimal across all languages and tasks. It proposes to model strategy selection as a learned decision problem, using a lightweight classifier to predict the optimal strategy for each instance, achieving significant improvements over fixed strategies on four benchmarks.

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

This paper proposes NOVA-ARC, the first framework to formulate multilingual speech emotion recognition (SER) as an unsupervised transfer problem from labeled non-verbal vocalizations (NVV) to unlabeled verbal speech (UVS). By leveraging a hyperbolic prosody vector-quantized codebook, a Hyperbolic Emotion Lens, and optimal transport prototype alignment, NOVA-ARC achieves cross-modal emotion transfer and validates the feasibility and superiority of NVV→UVS transfer across 6 datasets.

SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams

This paper proposes the SERM framework, which continuously self-evolves a search relevance model from large-scale real-world query streams via a multi-agent sample miner and a multi-agent relevance annotator. After three iterative rounds on an industrial search platform, SERM achieves an NDCG@1 improvement of +2.99 and significantly improves user retention in online A/B testing.

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

This paper is the first to explore Universal Dependencies (UD) syntactic information as an augmentation source for in-context learning (ICL) in low-resource Coptic-to-English machine translation. While syntactic information alone is less effective than a bilingual lexicon, combining lexicon with syntactic information (LEX+SYN) achieves the best results across all tested models, with Gemma-27B reaching a BERTScore F1 of 0.8746 (+0.0361).

The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models

This paper presents the GaoYao benchmark, comprising 182.3K samples across 26 languages and 51 countries/regions. Through a three-tier cultural evaluation framework (general multilingual / cross-cultural / mono-cultural) and nine cognitive sub-layers, combined with a human-localized subjective test set and an expert-validated cross-cultural synthetic dataset SuperBLEnD, GaoYao performs in-depth diagnosis of 20+ flagship and compact LLMs, revealing pronounced geographic digital divides and task-level capability stratification.

Unlocking the Edge: Multi-LoRA On-Device Deployment and Acceleration

This paper presents an on-device LLM deployment framework for Samsung Galaxy S24/S25, achieving dynamic task switching by treating LoRA weights as runtime inputs, reducing style-variant generation latency by 6× via multi-stream concurrent token generation, and accelerating decoding by 2.3× through draft-model-free Dynamic Self-Speculative Decoding—yielding an overall 4–6× optimization across 9 languages and 8 tasks.

Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic

This paper demonstrates that LLMs encode morphological inflections (e.g., walk→walked) as linear directions in embedding space, and proposes a compositional vocabulary design: replacing independently assigned tokens for each surface form with additive combinations of base words and transformation vectors. With the pretrained backbone frozen, only a small adapter module is trained, freeing 10–40% of vocabulary slots for multilingual expansion with negligible impact on downstream performance.
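The linear-direction claim behind this compositional design can be illustrated with a toy sketch (synthetic vectors of my own, not the paper's embeddings): if inflected forms lie at `base + v_transform` up to noise, the transformation vector can be estimated from a few (base, inflected) pairs and reused to compose embeddings instead of storing a dedicated token per surface form.

```python
import random

random.seed(0)
dim = 8

def randvec(scale=1.0):
    return [random.gauss(0, scale) for _ in range(dim)]

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(a * a for a in v) ** 0.5))

# Toy setup: base-form embeddings plus one shared "past tense" direction.
v_past = randvec()
bases = {w: randvec() for w in ["walk", "talk", "jump"]}
pasts = {w: add(add(bases[w], v_past), randvec(0.01)) for w in bases}

# Estimate the transformation vector by averaging (inflected - base) pairs.
diffs = [sub(pasts[w], bases[w]) for w in bases]
v_hat = [sum(col) / len(diffs) for col in zip(*diffs)]

# Compose the inflected embedding rather than storing a dedicated token.
composed = add(bases["walk"], v_hat)
similarity = cos(composed, pasts["walk"])  # close to 1.0 under this toy setup
```

Freeing vocabulary slots then amounts to dropping the stored inflected tokens and reconstructing them on demand from base words and a small set of shared transformation vectors.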

What Factors Affect LLMs and RLLMs in Financial Question Answering?

This paper systematically investigates how prompting methods, agent frameworks, and multilingual alignment approaches affect LLMs and RLLMs (Reasoning Large Language Models) on financial question answering tasks. The key finding is that existing methods essentially improve LLM performance by simulating Long CoT, but offer limited gains for RLLMs that already possess native Long CoT capabilities.

Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

This paper presents the first systematic investigation into the sources of multilingual reasoning gaps in reasoning language models (RLMs), identifying language understanding failure as the primary cause, and proposes Selective Translation—applied only upon detected understanding failure—as an efficient mitigation strategy.