🌐 Multilingual & Translation
💬 ACL2026 · 64 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (8) · 🧪 ICML2026 (3) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (11) · 📹 ICCV2025 (1)
🔥 Top topics: Translation ×18 · LLM ×8 · Speech & Audio ×3 · Agents ×3 · Sentiment Analysis ×2
- A Multilingual Dataset and Empirical Validation for the Mutual Reinforcement Effect in Information Extraction
-
This paper constructs the first multilingual MRE Mix dataset (MMM, 21 subsets covering English, Chinese, and Japanese) and, through large-scale ablation experiments, systematically validates that the Mutual Reinforcement Effect (MRE) between word-level and text-level information extraction tasks holds across languages.
- Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
-
Alexandria constructs a multi-turn Dialectal Arabic-English parallel dataset covering 13 Arabic countries, 11 social impact domains, and 107K turns. Through a community-driven human translation and revision process, it provides unprecedented fine-grained training and evaluation resources for Dialectal Arabic machine translation and systematically benchmarks 24 LLMs.
- BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation
-
BabelDOC is a layout-preserving PDF translation system built on an Intermediate Representation (IR) that decouples visual layout from semantic content. NLP operations (LLM translation, terminology extraction, cross-page context awareness, and formula masking) run at the semantic layer, and an adaptive typesetting engine then re-anchors the results to the original layout. On a 200-page benchmark, it outperforms PDFMathTranslate and DeepL Document Translation in BIoU, layout fidelity, and terminology consistency.
- Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation
-
The authors construct MENT, a meta-evaluation dataset for non-literal translation (7,530 human annotations), revealing the unreliability of traditional metrics and LLM-as-Judge in non-literal scenarios. They propose the RATE agentic evaluation framework, which improves correlation with human judgment by over 3.2 points through a reflective core agent that dynamically invokes functional sub-agents.
- BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources
-
This is the first unified survey of Indic NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models/tools. Organized into 17 task categories (from core language processing to socio-cultural tasks), it systematically analyzes persistent challenges such as uneven linguistic coverage, fragmented annotation, and inconsistent evaluation.
- CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning
-
This paper proposes CLewR (Curriculum Learning with Restarts), a strategy that sorts data from easy to hard during preference optimization and restarts the curriculum every epoch. This effectively mitigates catastrophic forgetting and consistently improves machine translation performance across multiple model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization algorithms (DPO, CPO, ARPO).
- Cross-Cultural Transfer of Emoji Semantics and Sentiment in Financial Social Media
-
By systematically comparing emoji frequency, semantics, and sentiment polarity across 100 million financial microblogs in 4 languages, 2 platforms, and 2 asset classes, this study finds that while emoji frequency varies significantly across languages/platforms, their semantics and polarity remain highly stable. Consequently, in zero-shot sentiment transfer, incorporating emojis into text consistently reduces the cross-platform transfer gap from as high as 21% to nearly 0%.
- DFKI-MLT at SemEval-2026 Task 7: Steering Multilingual Models Towards Cultural Knowledge
-
This SemEval system paper utilizes the FLORES parallel corpus to extract language directions and injects language steering vectors into the residual stream of multilingual LLMs during inference. The system achieved an official MCQ accuracy of 86.96% (7th out of 17 teams), though post-hoc analysis indicates that gains are highly sensitive to layers, prompts, models, and locales.
- Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts
-
This is the first end-to-end Handwritten Text Recognition (HTR) pipeline for Old Nepali. By employing a "Synthetic Devanagari → Printed Nagari → Old Nepali Manuscripts" three-stage transfer learning curriculum, \(8\times\) data augmentation with 20 techniques, byte-level BPE, and a script-aware decoder, the CER is reduced from a fine-tuned TrOCR baseline of \(9.6\%\) to \(4.9\%\). The code, models, and a Streamlit web application are open-sourced.
- Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
-
TriMix decomposes Low-Resource Language (LRL) adaptation into three logit benefit vectors: "language capability + task capability + scaling dividends." It only requires continual pre-training (CPT) on a small model. At inference time, weights are dynamically determined via perplexity. It consistently outperforms single-model baselines and Proxy Tuning across 4 model families and 8 LRLs. A core empirical discovery is that "the weight of the small CPT model should be higher than that of the large instruction model," directly challenging the "large-model-dominant" assumption in Proxy Tuning.
- Efficient Training for Cross-lingual Speech Language Models
-
This paper proposes CSLM, an efficient training method for cross-lingual speech LLMs. By utilizing a novel alignment strategy to achieve cross-modal and cross-lingual alignment and introducing speech-text interleaved chain-of-modality generation, the model improves quality and reduces latency while scaling to new languages without requiring large-scale speech data.
- EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context
-
EMCEE enables LLMs to first extract synthetic multilingual context related to non-English queries from their internal parameters, then merges context-augmented responses with CoT reasoning responses via an LLM-as-a-Judge, significantly improving performance on low-resource languages across four multilingual tasks.
- Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization
-
This paper integrates a KAN block into a BiGRU classifier and an attention-based GRU summarization model for low-resource multilingual Bengali legal documents. The approach achieves a classification accuracy of 67.96% and ROUGE-1/2/L scores of 0.38/0.23/0.31, improving the BiGRU accuracy from 57.34% to 67.96% in ablation studies.
- Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
-
This paper proposes MulTypo—a multilingual typo generation algorithm based on language-specific keyboard layouts and 10-finger typing habits. It systematically evaluates the robustness of 18 open-source LLMs across 12 languages and 5 downstream tasks, demonstrating that typos significantly impact generation and reasoning tasks, instruction-tuned models are more fragile, and typo effects exhibit cross-lingual and directional asymmetry.
- Evaluating the Impact of Verbal Multiword Expressions on Machine Translation
-
This paper presents the first systematic evaluation of how Verbal Multiword Expressions (VMWEs), namely Verbal Idioms (VID), Verb-Particle Constructions (VPC), and Light Verb Constructions (LVC), affect machine translation quality. Analyzing 8 MT systems across 7 language pairs with two QE models and human DA scores, the study shows that VMWEs consistently degrade performance. The degradation increases with non-compositionality (VID > VPC > LVC), and even GPT-4.1/GPT-5.1 cannot eliminate it.
- Exploring Two-Phase Continual Instruction Fine-tuning for Multilingual Adaptation in Large Language Models
-
This paper proposes a two-phase continual fine-tuning (CFT) framework: fine-tuning on English instruction data first, then on multilingual data. It finds that the instruction similarity between the datasets used in the two phases is the key factor determining whether English proficiency degrades. Generative replay and heuristic layer freezing further mitigate the representation drift and English forgetting caused by dissimilar datasets.
- FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation
-
Proposes FairQE, a multi-agent framework that effectively mitigates systematic gender bias in QE models through gender cue detection, gender-flipped variant generation, and a dynamic bias-aware score aggregation mechanism, without sacrificing the accuracy of translation quality assessment.
- From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
-
This paper proposes DeFactoX, which organizes Hindi news preference data using curriculum learning and incorporates two signals—Actuality (factuality) and Finesse (stability)—into DPO. This enables the model to simultaneously predict news veracity and generate Hindi rationales that closely align with manual fact-checking explanations.
- From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages
-
The authors perform a systematic comparison of traditional taggers (UDPipe/COLaF) and open-source LLMs (Gemma3-12B/Phi4-14B) for POS tagging across three Medieval Romance languages (Old Occitan NAF, Old Catalan CAT, Old French Chauliac). Evaluating five settings—zero-shot, few-shot, monolingual fine-tuning, bilingual CLTF, and trilingual CLTF—they find that LLMs consistently outperform traditional methods. Catalan acts as a "bridge language," where CAT+FR bilingual training elevates the Old French Chauliac corpus to a peak accuracy of 93.14%.
- Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech
-
This paper proposes Hierarchical Policy Optimization (HPO), which post-trains LLM-based simultaneous speech translation models using a hierarchical reward design. By suppressing latency optimization when translation quality fails to meet a threshold, it achieves a +7 COMET translation quality improvement at a 1.5-second latency.
- IndoTabVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents
-
This paper proposes IndoTabVQA, a cross-lingual Visual Question Answering benchmark for tables in Bahasa Indonesia documents. It consists of 1,593 document images with QA annotations in four languages (Indonesian, English, Hindi, and Arabic), revealing significant performance gaps in VLMs for low-resource languages and cross-lingual table understanding. Fine-tuning combined with spatial priors achieves an In-Match accuracy of up to 48.5%.
- Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
-
The authors recruited 19 native-speaker experts to run 8.8k human-vs-machine text discrimination trials across 16 datasets spanning 9 languages, 9 domains, and 11 SOTA LLMs. Average expert accuracy reached 87.6%, significantly higher than the "near random" conclusions of early studies. Rewriting machine text with prompts that explicitly address the differences lowers detection accuracy to 72.5%, yet humans tend to prefer machine text when they cannot tell its source, challenging the implicit assumption that "human-like equals liked-by-human."
- Just Use XML: Revisiting Joint Translation and Label Projection
-
LabelPigeon is a joint translation and label projection method based on XML tags. By fine-tuning the NLLB-200 translation model on high-quality XML-tagged parallel corpora, it outperforms all baselines across 11 languages and actively improves translation quality, achieving up to a +40.2 F1 gain in downstream cross-lingual NER tasks.
- Language Models Entangle Language and Culture
-
This paper evaluates multilingual LLMs using general advice-seeking questions constructed from the WildChat dataset. It discovers systematic differences in response quality and cultural context across different language queries—response quality in low-resource languages is significantly lower than in English. Furthermore, the choice of language implicitly alters the cultural information utilized in responses. This entanglement between language and culture in LLMs is verified through a translated version of CulturalBench.
- Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality
-
This paper proposes XBridge, an architecture that composes pretrained multilingual encoder-decoder translation models (e.g., NLLB) with English-centric LLMs. The encoder handles multilingual understanding, the LLM performs knowledge reasoning, and the decoder executes multilingual generation. Cross-model semantic bridging is achieved through lightweight mapping layers and optimal transport alignment, significantly outperforming baselines on low-resource and unseen languages.
- LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models
-
LaoBench is the first large-scale, multidimensional Lao evaluation benchmark for LLMs, containing 17,000+ expert-curated samples. It covers three dimensions: Culture-Knowledge Application, Lao K12 Curriculum, and Lao-Chinese-English trilingual translation. It features a unique three-part design—Open-source 7k + Black-box 10k + Open-ended 500. The 10k black-box set prevents contamination via a controlled scoring service. Mainstream closed-source models (GPT-5-High, Gemini-2.5-Pro, etc.) still lag behind human experts by ~10-20 percentage points, indicating that Lao cultural reasoning and translation fidelity remain significant unsolved challenges.
- Lingo_Research_Group at SemEval-2026 Task 9: Evaluating Prompt Variants for Polarization Detection
-
This SemEval-2026 Task 9 system paper utilizes Gemma3-27B and 12 types of English prompt variants to perform online polarization detection across 22 languages. It finds that prompt-only methods effectively complete coarse-grained binary classification but exhibit significant degradation in fine-grained multi-label tasks such as identifying polarization targets and manifestations.
- LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models
-
A two-stage enhancement module consisting of "LLM Refinement + Self-consistency Voting + MMD Word Distribution Alignment + QA-style Document Semantic Alignment" is wrapped around pre-trained cross-lingual topic models. Acting as a plug-in for various backbones like NMTM, InfoCTM, and XTRA, it improves CNPMI by 9%–51% and TQ by 6%–44% across three bilingual corpora (EC News, Amazon Review, Rakuten Amazon), while reducing LLM calls to "once every \(f\) epochs."
- Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
-
This paper introduces the LocQA benchmark (12 languages, 49 regions, 2,156 region-relevant Q&As) to reveal implicit biases in LLMs through geographically ambiguous questions (e.g., "What is the emergency phone number?"). It uncovers persistent cross-lingual US-centric defaults (50% of model responses contain US answers vs. 26% in the data) and a "population probability engine" effect driven by population size within languages. Furthermore, instruction tuning is found to exacerbate global bias.
- LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
-
This paper proposes LQM (Linguistically Motivated Multidimensional Quality Metrics), a six-level linguistically motivated MT error typology (sociolinguistics → pragmatics → semantics → morphosyntax → orthography → graphetics), and constructs a bidirectional parallel corpus of 3,850 sentences across 7 Arabic dialects. Through expert annotation of 6,113 error spans, the study reveals systematic deficiencies in existing MT systems regarding dialectal and culture-aware translation.
- Massively Multilingual Joint Segmentation and Glossing
-
This work addresses the "morphological segmentation + morpheme-by-morpheme glossing" joint prediction task for endangered language documentation. The authors expanded the GlossLM corpus to 340,000 examples covering 2,077 languages to train PolyGloss, a family of ByT5-based multilingual seq2seq models. PolyGloss simultaneously predicts morpheme boundaries and gloss tags from raw transcriptions, outperforming GlossLM in glossing and multiple open-source LLMs across segmentation, glossing, and alignment, while supporting rapid adaptation to new languages via LoRA.
- Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
-
Proposes Source-Shielded Updates (SSU), a column-wise freezing strategy driven by source data importance scores. In continual pre-training (CPT) using only unlabeled target language data, it reduces source language performance degradation from 20.3% (Full Fine-Tuning) to 3.4% while maintaining comparable or superior target language performance.
- Mitigating Extrinsic Gender Bias for Bangla Classification Tasks
-
Addressing extrinsic gender bias in Bangla pre-trained models for downstream classification tasks, the authors propose RandSymKL. This method employs joint optimization of randomized cross-entropy loss and symmetric KL divergence to effectively reduce predictive disparities between genders while maintaining classification accuracy.
- Modular Monolingual Adaptation using Pretrained Language Models
-
For adapting multilingual pretrained language models (PMLMs) to low-resource languages, the authors advocate a modular approach: "adopting a language-specific tokenizer + freezing input/output embeddings while training only the Transformer body." This method consistently outperforms full fine-tuning on Masked Language Modeling (MLM), NER, and POS tasks for Scottish Gaelic, Irish, and Quechua, while reducing trainable parameters by approximately 25% and nearly halving GPU memory and training time.
- MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
-
This paper proposes MORPHOGEN, a large-scale gender-aware morphological generation benchmark covering French, Arabic, and Hindi (20,328 sentence pairs). It defines the GENFORM task (rewriting first-person sentences to the opposite gender) and introduces three evaluation metrics: SGA, GIoU, and CGA. Benchmarking 15 multilingual LLMs reveals systematic deficiencies in complex morphological reasoning, gender bias, and multi-entity interference.
- Multilingual Language Models Encode Script Over Linguistic Structure
-
This paper systematically analyzes language-associated units in multilingual LMs using LAPE metrics and Sparse Autoencoders (SAEs), discovering that these units are primarily driven by orthography (writing systems) rather than abstract linguistic structure: Romanized transliterations activate almost entirely non-overlapping sets of neurons, word shuffling has minimal impact, typological information only becomes accessible in deeper layers, and causal interventions show that functional importance is tied to surface form invariance.
- Multilingual Refusal Alignment for Safer Large Language Models
-
To be added after further reading
- Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
-
This paper demonstrates that multilingual sparse autoencoders combined with layer selection at the intersection of "multilingual alignment and language separability" make SAE language steering more stable. This approach transforms the empirical layer selection problem in multilingual control into a predictable representation diagnostic problem.
- NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning
-
NeoAMT recasts neologism translation from a problem that depends purely on the model's parametric knowledge into an agentic MT task: reason, then look up a dictionary, then translate. With GRPO training that targets neologism hit rate, overall translation quality, and translation difficulty, an 8B model significantly outperforms SFT, retrieval-free RL, and various general and translation-specific LLMs on the Neko neologism translation benchmark.
- NiuTrans.LMT: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
-
This paper introduces NiuTrans.LMT, an open-source LLM machine translation suite covering 60 languages and 234 Chinese-English dual-centric translation directions across four scales (0.6B/1.7B/4B/8B). It identifies that multi-way parallel data causes X→Zh/En directional degeneration in symmetric SFT and restores quality to the level of strong open-source MMT systems using Strategic Downsampling, Parallel Multilingual Prompting, and GRPO with COMET rewards.
- No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs
-
This paper demonstrates that no single prompting strategy is universally optimal across all languages and tasks. It proposes modeling strategy selection as a learned decision problem, using a lightweight classifier to predict the optimal strategy for each instance, which significantly outperforms fixed strategies across four benchmarks.
- PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation
-
PEAR transforms reference-free MT quality estimation from "assigning absolute scores to single translations" to "directly comparing the relative differences between two candidate translations." It outperforms matched single-candidate QE baselines and some large-scale metrics in the WMT24 MQM evaluation with a smaller model size.
- PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
-
PluRule models Reddit community moderation as a multiple-choice task: "Given a comment and its context, select which community rule was violated or if no violation occurred." The authors construct a benchmark covering 1,989 communities, 2,885 rules, and 9 languages, showing that even GPT-5.2 high reasoning achieves only approximately 57.6% accuracy with full context.
- Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition
-
This paper proposes NOVA-ARC, which models multilingual Speech Emotion Recognition (SER) for the first time as an unsupervised transfer problem from labeled Non-Verbal Vocalizations (NVV) to unlabeled Verbal Speech (UVS). It achieves cross-modal emotion transfer through a prosodic vector quantization codebook in hyperbolic space, a hyperbolic emotion lens, and optimal transport prototype alignment, validating the feasibility and superiority of NVV \(\to\) UVS transfer across 6 datasets.
- Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
-
This paper redefines low-resource language expansion from token-level imitation to semantic space alignment. By employing GRPO and embedding-based semantic rewards to train Qwen3-4B, the authors achieve enhanced capabilities in Tibetan-Chinese translation and Tibetan headline generation. Crucially, this approach preserves dominant language performance (e.g., Chinese CMRC) significantly better than strong SFT.
- RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
-
RouteLMT formalizes the routing problem in hybrid LLM translation as a marginal gain allocation under a fixed large-model budget. It utilizes the internal representations of the last prompt token from a small translation model to predict "how much improvement the large model can bring relative to the small model." Across four translation directions, it achieves superior quality-budget Pareto frontiers compared to length-based, quality estimation (QE), and external router methods.
- Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP
-
This paper systematically reviews the evolving role of transliteration in cross-lingual NLP, proposes a taxonomy of five motivations (Named Entity/OOV handling, code-mixing, leveraging cross-script similarity, English-centric transfer, and unified preprocessing), compares the pros and cons of six integration methods, and discusses whether transliteration remains necessary in the context of modern LLMs.
- Selective Contrastive Learning For Gloss Free Sign Language Translation
-
This paper discovers that random in-batch negative samples in sign language translation often serve as unreliable or semantically conflicting supervision signals. Consequently, it utilizes similarity trajectories from a reference model to filter more informative negative samples and improves gloss-free sign language translation quality through a curriculum-based contrastive learning approach from easy to hard.
- SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams
-
The authors propose the SERM framework, which utilizes a Multi-agent Sample Miner and a Multi-agent Relevance Annotator to continuously evolve search relevance models from massive real-world query streams. After three iterations, it achieved a +2.99 increase in NDCG@1 on an industrial search platform and significantly improved user retention in online A/B tests.
- SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics
-
SteerEval investigates aligning the hidden representations of multilingual evaluation models toward high-resource pivot languages during inference. It finds that steering toward English or French generally improves the correlation between automated multilingual summarization metrics and human scores, particularly benefiting low-baseline languages and encoder-based COMET metrics.
- Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts
-
SGER proposes a two-stage curriculum learning framework to fine-tune Llama 3 8B for entity name matching: Phase 1 trains the model to parse name structures (outputting JSON), and Phase 2 trains a binary matcher starting from the Phase 1 checkpoint. It achieves 99.02% accuracy and 0.994 F1 on a dataset of 50,000 Indian KYC pairs and has been deployed in the production environment of Dream11 (250 million users).
- Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation
-
This paper represents the first exploration of using Universal Dependencies (UD) syntactic information as an enhancement source for In-Context Learning (ICL) in low-resource Coptic-to-English machine translation. The findings indicate that while syntactic information alone is less effective than a lexicon, combining the lexicon with syntax (LEX+SYN) achieves the best performance across all models, with Gemma-27B reaching a BERTScore F1 of \(0.8746\) (\(+0.0361\)).
- The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models
-
This paper introduces the GaoYao benchmark, featuring 182.3K samples across 26 languages and 51 countries/regions. Utilizing a three-tier cultural evaluation framework (General Multilingual / Cross-cultural / Mono-cultural) and nine cognitive sub-layers, it combines human-localized subjective test sets with the expert-verified cross-cultural synthetic dataset SuperBLEnD to deeply diagnose the multilingual capabilities of over 20 flagship and compact LLMs, revealing significant geo-digital divides and task capability stratification.
- Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
-
This paper proposes SignThought, a reasoning-driven gloss-free sign language translation framework. It introduces learnable latent thought slots as an explicit intermediate semantic layer between video and text. Using a "plan-then-locate" dual-stream decoder, it decouples semantic planning from visual evidence retrieval, outperforming existing gloss-free methods on multiple benchmarks.
- TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models
- Toward Culturally Grounded Natural Language Processing
-
This synthesis paper integrates over 50 works on multilingual and cultural NLP, pointing out that "language coverage" does not equate to "cultural competence," and proposes a layered evaluation protocol and research agenda centered on communicative ecologies.
- TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law
-
This paper constructs HKCFA Judgement 97-22, the first sentence-level parallel dataset built specifically for English-Chinese translation of Hong Kong Court of Final Appeal judgements. It proposes the TransLaw multi-agent system, which simulates professional legal translation workflows. TransLaw significantly outperforms single-agent baselines in automatic metrics, professional legal translator evaluations, and cost-efficiency.
- Unlocking the Edge: Multi-LoRA On-Device Deployment and Acceleration
-
This paper proposes an on-device LLM deployment framework for the Samsung Galaxy S24/S25. It achieves dynamic task switching by using LoRA weights as runtime inputs, reduces style variant latency by up to 6x through multi-stream concurrent token generation, and accelerates decoding by up to 2.3x via Dynamic Self-Speculative Decoding without any draft models. Overall optimization of 4-6x is realized across 8 tasks in 9 languages.
- Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic
-
This paper finds that LLMs encode morphological variations (e.g., walk→walked) as linear directions in the embedding space. Building on this, it proposes a compositional vocabulary design: instead of keeping an independent token for each surface form, a form is represented as the sum of a base-word embedding and a transformation vector. By training a small adaptation module while freezing the pre-trained backbone, this method frees 10-40% of vocabulary slots for multilingual expansion with negligible impact on downstream performance.
- Vocabulary Shapes Cross-Lingual Variation of Word-Order Learnability in Language Models
-
This paper uses the Mallows model to generate continuous word-order perturbation spectra for 10 European languages. After training small autoregressive LMs, it finds that more irregular word orders are harder to learn, but cross-lingual differences are primarily explained by vocabulary coverage, sentence length, and morphological complexity, rather than simple free vs. fixed word-order labels.
- What Factors Affect LLMs and RLLMs in Financial Question Answering?
-
This paper systematically investigates the impact of prompting methods, Agent frameworks, and multilingual alignment methods on LLMs and RLLMs (Reasoning LLMs) in financial QA tasks. It finds that existing methods essentially improve LLM performance by simulating Long CoT, but provide limited benefits to RLLMs that already possess inherent Long CoT capabilities.
- Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
-
This study provides the first systematic analysis of the sources of multilingual reasoning gaps in Reasoning Language Models (RLMs). It identifies language understanding failure as the primary cause and proposes Selective Translation, which detects understanding failures to efficiently bridge the gap.
- Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish
-
Using Luxembourgish—theoretically an ideal case for cross-lingual transfer—as a "best-case" scenario, this paper argues that low-resource NLP cannot rely solely on the spontaneous transfer of multilingual models. Instead, it must integrate cross-lingual scaffolding with target-language-specific data cleaning, resource construction, and task design.
- XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
-
This paper constructs XQ-MEval, the first translation evaluation benchmark with cross-lingual parallel quality. By generating controllable-quality pseudo-translations through semi-automatic MQM error injection, it empirically reveals cross-lingual scoring biases in automatic metrics for the first time and proposes the LGN normalization strategy to effectively calibrate multilingual metric evaluations.