Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models
Conference: ACL 2025
arXiv: 2507.11882
Code: GitHub
Area: Multilingual Translation
Keywords: multilingual benchmark, instruction following, localization, cross-lingual evaluation, IFEval
TL;DR
Expands the English IFEval benchmark to 30 languages with cultural localization, revealing a 25-35% accuracy gap between high- and low-resource languages in multilingual instruction following, and showing that machine-translated data underestimates model performance by 7-22%.
Background & Motivation
Monolingual Limitations of Existing Benchmarks: Instruction-following evaluation benchmarks like IFEval are predominantly designed for English, falling short of assessing the true capabilities of LLMs in multilingual scenarios.
Inadequate Quality of Machine Translation Data: Multilingual datasets such as Multi-IF rely primarily on machine translation, which fails to capture linguistic and cultural nuances and therefore distorts evaluation results.
Lack of Adaptation to Language-Specific Features: Different languages possess unique linguistic constraints (e.g., the lack of letter casing in Chinese, different passive voice structures in Japanese), which simple translation cannot handle.
Need for Cultural Context Localization: Cultural references in instructions (festivals, locations, company names) must undergo target-language localization to ensure cultural relevance in evaluations.
Neglect of Low-Resource Languages: Existing evaluations cover a limited range of languages; the instruction-following capabilities in low-resource languages like Yoruba (yo), Nepali (ne), and Kazakh (kk) remain largely unassessed.
Compositional Instruction Understanding as a Bottleneck: LLMs perform reasonably well when satisfying a single instruction, but the prompt-level accuracy when satisfying multiple constraints simultaneously is 10-20% lower than instruction-level accuracy. This performance gap is even more pronounced in multilingual settings.
Method
Overall Architecture
Marco-Bench-MIF utilizes a three-stage pipeline (preprocessing \(\rightarrow\) translation/localization \(\rightarrow\) post-processing) to extend the 541 English instruction-response pairs of IFEval to 30 languages (covering 6 major language families), with 541 instances per language. Automatic translation is combined with two rounds of manual verification to ensure high quality.
Module 1: Preprocessing (Constraint Classification and Filtering)
- Cardinality Dimension: Categorizes instructions into single-constraint (SC, 49.9%) and multi-constraint (MC, 50.1%).
- Type Dimension: Divides constraints into expression constraints (EC, e.g., formatting/structural requirements) and content constraints (CC, e.g., containing specific information).
- Employs a progressive adaptation strategy: simpler SC+EC instructions are processed first, followed by the more complex MC+CC instructions, to minimize error propagation.
- Performs data filtering to remove ambiguous instructions and balance the distribution of constraint types.
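
The two classification dimensions and the progressive ordering can be sketched as follows. This is a minimal illustration, not the authors' code: the constraint-type labels, the `Instruction` class, the rule that any content constraint makes an instruction CC, and the relative order of SC+CC and MC+EC are all assumptions. Only "SC+EC first, MC+CC last" comes from the text.

```python
from dataclasses import dataclass

# Hypothetical constraint-type labels; the benchmark's own taxonomy may differ.
CONTENT_TYPES = {"keyword", "topic", "postscript"}


@dataclass
class Instruction:
    prompt: str
    constraint_types: list  # one label per constraint; assumed non-empty


def cardinality(inst: Instruction) -> str:
    """SC for exactly one constraint, MC otherwise."""
    return "SC" if len(inst.constraint_types) == 1 else "MC"


def constraint_kind(inst: Instruction) -> str:
    """CC if any constraint targets content, otherwise EC (an assumed tie-break)."""
    return "CC" if any(t in CONTENT_TYPES for t in inst.constraint_types) else "EC"


# Assumed ordering: only the first and last positions are stated in the text.
RANK = {("SC", "EC"): 0, ("SC", "CC"): 1, ("MC", "EC"): 2, ("MC", "CC"): 3}


def adaptation_order(instructions):
    """Sort instructions from simplest (SC+EC) to most complex (MC+CC)."""
    return sorted(instructions, key=lambda i: RANK[(cardinality(i), constraint_kind(i))])


batch = [
    Instruction("Mention 'solar' twice and answer in JSON.", ["keyword", "format"]),
    Instruction("Answer in all lowercase letters.", ["case"]),
    Instruction("Include the keyword 'budget'.", ["keyword"]),
]
ordered = [(cardinality(i), constraint_kind(i)) for i in adaptation_order(batch)]
print(ordered)
```

Processing the simplest class first means that translation conventions are settled on easy cases before they are reused on instructions where several constraints interact.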
Module 2: Translation and Localization
- Translation Pipeline: Google Translate for initial translation \(\rightarrow\) bilingual professional translator proofreading \(\rightarrow\) LLM-assisted error correction.
- Three-Step Localization Method:
  - Lexical Substitution: Replaces culture-specific terms (names, locations) while keeping the constraint positions unchanged.
  - Topical Transposition: Adapts the scenario background to domains familiar to the target culture.
  - Pragmatic Restructuring: Reorganizes the instruction using the rhetorical conventions of the target language.
- Conducts cultural localization based on ten sociolinguistic dimensions (historical background, social customs, lifestyles, regional features, etc.).
- Creates parallel corpora of MT baselines and localized versions for 5 languages (ar, es, ms, yo, zh) to support comparative experiments.
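
One way to read "keeping the constraint positions unchanged" in the lexical substitution step is to mask the spans that carry a constraint before swapping culture-specific terms. The sketch below illustrates that reading only; the function name `lexical_substitution`, the term map, and the example instruction are hypothetical and are not taken from the benchmark.

```python
import re


def lexical_substitution(instruction, term_map, protected_spans):
    """Swap culture-specific terms while leaving protected spans untouched."""
    masked = instruction
    placeholders = {}
    # Mask spans that carry a constraint so substitution cannot alter them.
    for index, span in enumerate(protected_spans):
        token = f"\x00{index}\x00"
        masked = masked.replace(span, token)
        placeholders[token] = span
    # Replace whole-word occurrences of each culture-specific term.
    for source, target in term_map.items():
        masked = re.sub(rf"\b{re.escape(source)}\b", lambda _m, t=target: t, masked)
    # Restore the protected spans exactly as they were.
    for token, span in placeholders.items():
        masked = masked.replace(token, span)
    return masked


localized = lexical_substitution(
    "Write an invitation to a Thanksgiving dinner in Boston. "
    "End your reply with the exact phrase 'Happy Thanksgiving!'",
    term_map={"Thanksgiving": "Mid-Autumn Festival", "Boston": "Hangzhou"},
    protected_spans=["'Happy Thanksgiving!'"],
)
print(localized)
```

In this toy example the required ending phrase is left as-is so that the rule-based checker still has a fixed target; in a real pipeline such spans would presumably be localized in a separate, verified step together with the checker.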
Module 3: Post-processing (Multi-layer Quality Assurance)
- Combines automatic pattern detection and manual review targeting six common translation failure points: keywords, concluding remarks, echoed content, postscript consistency, case-sensitivity adherence, and Latin character frequencies in non-Latin scripts.
- Dual-LLM cross-verification: one LLM generates outputs and a second analyzes the failure cases, attributing each to model capability limits, defects in the instruction set, or flaws in the evaluation logic.
- Systematic localization of the evaluation framework across 30 languages: punctuation alignment, response language verification, multi-paragraph coherence validation, and constrained output checks.
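
Of the six failure points listed above, the Latin-character check is the easiest to automate. The following is a rough sketch of such a detector, assuming that "Latin character frequency" means the share of alphabetic characters whose Unicode name contains `LATIN`. The function names and the `0.2` threshold are illustrative choices, not values from the paper.

```python
import unicodedata


def latin_letter_ratio(text: str) -> float:
    """Share of alphabetic characters that belong to the Latin script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)


def flag_latin_leak(text: str, threshold: float = 0.2) -> bool:
    """Flag text in a non-Latin-script language that contains too much Latin text.

    The threshold is a hypothetical value chosen for illustration.
    """
    return latin_letter_ratio(text) > threshold


print(latin_letter_ratio("你好 world"))   # 5 Latin letters out of 7 letters
print(flag_latin_leak("مرحبا بالعالم"))  # Arabic only, so not flagged
```

A check like this only surfaces candidates for manual review; legitimately untranslated items such as brand names or code identifiers would also raise the ratio.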
Evaluation Metrics
- Strict/Loose: Strict uses exact rule matching, whereas Loose permits matching after text normalization (e.g., markdown removal, boundary adjustments).
- Prompt-level/Instruction-level: Prompt-level requires all instructions within a prompt to be fully satisfied, while Instruction-level evaluates the individual adherence rate of each instruction constraint.
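
The four metrics combine two independent choices: how a single constraint is checked (strict or loose) and how the checks are aggregated (per prompt or per instruction). The sketch below is modeled on IFEval-style loose checking, where a response passes if any normalized variant passes. The specific variants (dropping the first or last line, removing `*` markdown markers) and all function names are assumptions rather than the benchmark's actual evaluator.

```python
def loose_variants(response: str):
    """Normalized variants: drop first and/or last line, and strip '*' markers."""
    lines = response.split("\n")
    bases = [
        response,
        "\n".join(lines[1:]),    # drop a leading line such as "Sure, here it is:"
        "\n".join(lines[:-1]),   # drop a trailing line such as a sign-off
        "\n".join(lines[1:-1]),  # drop both
    ]
    variants = []
    for base in bases:
        base = base.strip()
        variants.extend([base, base.replace("*", "")])
    return variants


def followed_strict(response: str, check) -> bool:
    """Strict: the rule must hold on the raw response."""
    return check(response)


def followed_loose(response: str, check) -> bool:
    """Loose: the rule must hold on at least one non-empty normalized variant."""
    return any(check(v) for v in loose_variants(response) if v)


def aggregate(results):
    """results holds one list of booleans per prompt, one boolean per constraint."""
    prompt_level = sum(all(r) for r in results) / len(results)
    flat = [ok for r in results for ok in r]
    instruction_level = sum(flat) / len(flat)
    return prompt_level, instruction_level


starts_with_hello = lambda text: text.startswith("Hello")
print(followed_strict("**Hello** world", starts_with_hello))  # False
print(followed_loose("**Hello** world", starts_with_hello))   # True
print(aggregate([[True, True], [True, False], [True]]))
```

In the last call, two of three prompts satisfy every constraint while four of five individual constraints are met, which is why prompt-level accuracy can never exceed instruction-level accuracy.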
Experiments
Table 1: Overall Results (20+ Models, Average across 4 Metrics; accuracy in %)
| Model | Prompt-level (Strict) | Prompt-level (Loose) | Instruction-level (Strict) | Instruction-level (Loose) | Avg |
|---|---|---|---|---|---|
| Ministral-8B | 21.74 | 24.49 | 46.45 | 49.72 | 35.60 |
| Qwen2.5-7B | 42.99 | 47.43 | 64.42 | 68.02 | 55.72 |
| Gemma2-27B | 58.86 | 61.35 | 77.21 | 78.78 | 69.05 |
| LLaMA3.3-70B | 67.42 | 70.32 | 80.43 | 82.25 | 75.11 |
| GPT-4o | 71.43 | 75.89 | 84.49 | 87.13 | 79.73 |
| Claude3.5-sonnet | 73.61 | 76.77 | 85.62 | 87.71 | 80.93 |
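
The Avg column appears to be the unweighted mean of the four metric columns. The short check below recomputes it from the rows of Table 1, under that assumption, and compares the result with the reported value; the variable names are illustrative.

```python
# (prompt strict, prompt loose, instruction strict, instruction loose, reported avg)
table1 = {
    "Ministral-8B": (21.74, 24.49, 46.45, 49.72, 35.60),
    "Qwen2.5-7B": (42.99, 47.43, 64.42, 68.02, 55.72),
    "Gemma2-27B": (58.86, 61.35, 77.21, 78.78, 69.05),
    "LLaMA3.3-70B": (67.42, 70.32, 80.43, 82.25, 75.11),
    "GPT-4o": (71.43, 75.89, 84.49, 87.13, 79.73),
    "Claude3.5-sonnet": (73.61, 76.77, 85.62, 87.71, 80.93),
}

for model, (*metrics, reported) in table1.items():
    mean = sum(metrics) / len(metrics)
    # Allow for rounding to two decimals in the reported column.
    status = "ok" if abs(mean - reported) <= 0.01 else "mismatch"
    print(f"{model:18s} mean={mean:.3f} reported={reported:.2f} {status}")
```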
Table 2: Linguistic Analysis (30 Languages, Average Instruction-level Loose Accuracy)
| Language Category | Representative Languages | Accuracy Range |
|---|---|---|
| High-resource (Europe/East Asia) | de, fr, zh, en | 70-90% |
| Mid-resource | ar, ko, tr | 55-70% |
| Low-resource | yo, ne, kk | 29-50% |
Key Findings
- Instruction-level vs. Prompt-level Gap: Instruction-level accuracy is typically 10-20 percentage points higher than prompt-level accuracy, and smaller models show even larger gaps (e.g., Ministral-8B scores 46.45 vs. 21.74 on the strict metrics, a 24.7-point difference). This indicates that compositional instruction reasoning remains a key bottleneck.
- Model Scaling Effect: Models larger than 70B outperform 8B models by 45-60 percentage points in absolute accuracy. However, Qwen2.5-7B already reaches 64.42% strict instruction-level accuracy, showing that basic instruction comprehension is achievable at smaller scales.
- High- vs. Low-Resource Language Divide: Even Claude 3.5 Sonnet achieves only 62.3% in Yoruba (yo) compared to 90.3% in English, resulting in a gap of approximately 28 percentage points.
- Script-Specific Challenges: Right-to-left (RTL) scripts (e.g., ar, he) are particularly sensitive, and results vary widely even within this group; for instance, LLaMA-3.3-70B achieves 78.5% in Hebrew vs. 54.7% in Urdu, both of which are written right to left.
- Performance Underestimation by MT Data: Localized data yields 7-22% higher accuracy compared to machine-translated data. This discrepancy is most pronounced in low-resource languages (e.g., Claude's performance in Yoruba is underestimated by 7.1%).
- Language-Specific Response Capability: The multilingual-enhanced model Aya-expanse-32B achieves 95.69% in the "respond in the specified language" task, outperforming most commercial models.
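
The gap figures quoted in these findings follow from simple subtraction. As a worked check, the snippet below derives the strict instruction-level minus prompt-level gap for each model in Table 1, plus the English-Yoruba gap reported for Claude 3.5 Sonnet. The dictionary layout and variable names are illustrative.

```python
# (prompt-level strict, instruction-level strict), copied from Table 1
strict_scores = {
    "Ministral-8B": (21.74, 46.45),
    "Qwen2.5-7B": (42.99, 64.42),
    "Gemma2-27B": (58.86, 77.21),
    "LLaMA3.3-70B": (67.42, 80.43),
    "GPT-4o": (71.43, 84.49),
    "Claude3.5-sonnet": (73.61, 85.62),
}

# Gap in percentage points between satisfying each constraint and satisfying all of them.
gaps = {model: round(inst - prompt, 2) for model, (prompt, inst) in strict_scores.items()}
for model, gap in gaps.items():
    print(f"{model:18s} {gap:5.2f} pp")

# Claude 3.5 Sonnet: English vs. Yoruba accuracy as reported in the findings.
language_gap = round(90.3 - 62.3, 1)
print(f"en - yo gap: {language_gap} pp")
```

On these six rows the gap shrinks steadily as overall accuracy rises, from about 24.7 points for Ministral-8B to about 12 points for Claude3.5-sonnet.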
Highlights & Insights
- Provides a high-quality localized benchmark covering 30 languages and 6 major language families, significantly exceeding the coverage of existing multilingual instruction-following evaluations.
- Systematically resolves cross-lingual and cross-cultural adaptation challenges through a three-step localization method (lexical substitution, topical transposition, and pragmatic restructuring).
- Experiments reveal the systematic underestimation effect of MT-based evaluation data, offering a valuable reference for multilingual evaluation methodologies.
- Conducts fine-grained analyses across multiple dimensions (constraint types, script characteristics, and language-family transfer), yielding practical insights.
Limitations & Future Work
- The 30 languages covered still omit some non-Latin scripts (e.g., Ge'ez/Ethiopic, Cherokee) and dialectal variations (e.g., Arabic dialects).
- Cultural localization is confined to superficial adaptations (e.g., date formats) and does not delve into deeper pragmatic levels (e.g., Japanese honorific strategies).
- Residual biases persist in automatic localization (e.g., GPT-4's tendency toward formal register), and translation artifacts may still remain in certain languages.
- Evaluates only static prompts without addressing interactive instruction-refinement scenarios.
- As an extension of IFEval, it inherits the original benchmark's limitations in constraint types and instruction design.
Related Work & Insights
- IFEval (Zhou et al., 2023): English instruction-following evaluation benchmark, serving as the foundation for the expansion in this work.
- Multi-IF (He et al., 2024): Multi-turn multilingual instruction-following benchmark, but primarily relying on MT data.
- CulturalBench (Chiu et al., 2024): Cultural knowledge evaluation benchmark.
- BLEND (Myung et al., 2024): Multicultural and multilingual everyday knowledge benchmark.
- CVQA (Romero et al., 2024): Multicultural and multilingual visual question answering benchmark.
Rating
- Novelty: ⭐⭐⭐ — The methodological framework follows a standard pipeline of translation, localization, and verification; the core contribution lies in its scale and systematic approach.
- Effectiveness: ⭐⭐⭐⭐ — Conducts a large-scale evaluation over 20+ models and 30 languages, presenting solid and credible findings.
- Value: ⭐⭐⭐⭐ — Fills a gap in multilingual instruction-following evaluation, with direct reference value for the development and assessment of multilingual LLMs.
- Recommendation Index: ⭐⭐⭐⭐ — A benchmark paper with outstanding empirical contributions, highly recommended for researchers focusing on multilingual LLM evaluation.