MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
Conference: ACL 2025
arXiv: 2502.17163
Code: GitHub
Area: NLP / RAG Evaluation / Multilingual
Keywords: RAG, Meta-Evaluation, Multilingual, Faithfulness, LLM-as-a-Judge
TL;DR
The paper constructs MEMERAG, the first native multilingual RAG meta-evaluation benchmark, covering 5 languages (EN, DE, ES, FR, HI). Flowchart-guided annotation yields high inter-annotator agreement, and the benchmark is designed to evaluate and compare automatic evaluators of multilingual RAG output.
Background & Motivation
Retrieval-Augmented Generation (RAG) is one of the most important application paradigms for LLMs, but reliably evaluating the quality of RAG-generated text remains an open problem. Current RAG evaluation benchmarks suffer from three critical limitations:
- Lack of Multilingual Meta-Evaluation Benchmarks: Existing RAG evaluation resources (such as RAGAs) focus almost entirely on English. Multilingual evaluation is either missing or relies heavily on translated data.
- Limitations of Translated Data: Translated data suffers from "translationese" (simplified syntax and vocabulary choices) and does not realistically reflect the experiences and preferences of native users.
- Difficulties in Faithfulness Annotation: Faithfulness evaluation in RAG involves subjective judgment and ambiguous label spaces, which leads to low inter-annotator agreement.
The authors' key stance is that parallel (translated) benchmarks should be complemented by native multilingual benchmarks. Starting from native multilingual questions in the MIRACL dataset, they build a complete end-to-end meta-evaluation pipeline covering question selection \(\to\) retrieval \(\to\) answer generation \(\to\) human evaluation.
The meaning of meta-evaluation: The benchmark is not used to evaluate RAG systems directly. Instead, it evaluates "RAG automatic evaluators" (such as LLM-as-a-Judge): it measures how well each automated evaluator agrees with human judgments, so that the best evaluation scheme can be selected.
Method
Overall Architecture
The construction pipeline of MEMERAG is as follows:
- Question Selection: Select non-time-dependent native questions from the MIRACL dataset.
- Context Selection: Retrieve the top-5 passages using BM25, ensuring at least one relevant passage is included.
- Answer Generation: Generate answers using 5 different LLMs.
- Human Annotation: Expert annotators label the faithfulness and relevance of each sentence in every answer.
- Meta-Evaluation Application: Use the annotated data to evaluate the performance of LLM-as-a-Judge.
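The context-selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Passage` type and `select_context` function are hypothetical names, the retrieval scores are assumed to have been computed elsewhere (for example by a BM25 implementation), and the relevance flags are assumed to come from existing relevance judgments. The summary does not say how "at least one relevant passage" is enforced; swapping out the lowest-ranked passage is an assumption made here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Passage:
    pid: str
    score: float    # retrieval score (e.g. BM25), computed elsewhere
    relevant: bool  # relevance judgment for this question


def select_context(passages: List[Passage], k: int = 5) -> Optional[List[Passage]]:
    """Return the top-k passages, guaranteeing at least one relevant passage.

    Assumption (not from the paper): if none of the top-k is relevant, the
    lowest-ranked of them is replaced by the best-ranked relevant passage.
    Returns None when the question has no relevant passage at all.
    """
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)
    top = ranked[:k]
    if any(p.relevant for p in top):
        return top
    for p in ranked[k:]:
        if p.relevant:
            return top[:-1] + [p]
    return None


# Toy example: only the lowest-scoring passage is relevant.
toy = [Passage(f"p{i}", score=float(10 - i), relevant=(i == 6)) for i in range(1, 7)]
context = select_context(toy)
print([p.pid for p in context])
```

Returning `None` lets the caller decide whether to drop a question that has no relevant passage, which is again a design choice of this sketch.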
Key Designs
- Native Question Source: Instead of translated questions, the benchmark uses questions written by native speakers, taken directly from MIRACL. It covers 5 languages (EN, DE, ES, FR, and HI), representing multiple language families and various resource levels (high/low). Time-dependent questions (e.g., "Who is the president of Spain?") are filtered out, which removes 3-7% of questions per language.
- Multi-Model Answer Generation: Uses 5 diverse LLMs: Claude 3 Sonnet, Llama3 70B, Llama3 8B, Mistral 7B, and GPT-4o mini. All models are instructed with English prompts to answer based on the context, and are required to generate answers in the same language as the questions. Temperature is set to 0.1, with a maximum of 1000 tokens.
- Flowchart-Guided Annotation (Core Innovation):
  - Faithfulness Annotation: 3 coarse-grained labels (Supported / Not Supported / Challenging to determine) + 10 fine-grained labels (such as Direct paraphrase, Logical conclusion, Adds new info, Contradiction, Mis-referencing, etc.).
  - Relevance Annotation: 3 labels (Directly answers / Adds context / Unrelated).
  - The annotation process is guided by a decision flowchart: annotators work through its steps to reach a judgment rather than choosing a label directly, which significantly improves agreement.
  - LLM-generated highlights of "potentially supporting sentences" are shown to annotators to help them locate the key information.
- Annotation Quality Assurance: 250 questions are selected for each language, 10 of which are annotated by 3 annotators to calculate inter-annotator agreement (IAA). Agreement is reported with Gwet's AC1 and Fleiss' kappa:
  - Faithfulness IAA: AC1 = 0.84-0.93, kappa = 0.70-0.88 (the kappa values are well above the 0.34-0.42 reported in prior work).
  - Relevance IAA: AC1 = 0.95-1.0, kappa = 0.63-1.0.
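For readers unfamiliar with the two agreement statistics named above, the following sketch computes both from a table of per-item label counts, using the standard textbook definitions for a fixed number of raters per item. The function names and toy data are illustrative assumptions; this is not the authors' evaluation code.

```python
import math
from typing import List, Tuple


def _agreement_terms(counts: List[List[int]]) -> Tuple[float, List[float]]:
    """counts[i][j] = number of raters who gave item i the label j.

    Every item must be rated by the same number of raters (at least 2).
    Returns the observed agreement and the overall label proportions.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_labels = len(counts[0])
    pairs = n_raters * (n_raters - 1)
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_a = sum((sum(c * c for c in row) - n_raters) / pairs for row in counts) / n_items
    # Overall proportion of ratings that fall on each label.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_labels)]
    return p_a, p_j


def fleiss_kappa(counts: List[List[int]]) -> float:
    p_a, p_j = _agreement_terms(counts)
    p_e = sum(p * p for p in p_j)  # chance agreement under Fleiss' model
    if p_e == 1.0:
        return math.nan            # undefined when only one label is ever used
    return (p_a - p_e) / (1 - p_e)


def gwet_ac1(counts: List[List[int]]) -> float:
    p_a, p_j = _agreement_terms(counts)
    p_e = sum(p * (1 - p) for p in p_j) / (len(p_j) - 1)  # Gwet's chance term
    return (p_a - p_e) / (1 - p_e)


# Toy data: 4 sentences, 3 annotators, labels (Supported, Not Supported).
toy_counts = [[3, 0], [3, 0], [2, 1], [0, 3]]
print(round(fleiss_kappa(toy_counts), 3), round(gwet_ac1(toy_counts), 3))
```

The two statistics differ only in the chance-agreement term. When one label dominates, Fleiss' chance term approaches 1 and kappa drops or becomes undefined, while AC1 stays stable; with label distributions skewed toward "Supported", this is a plausible reason for reporting both.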
Meta-Evaluation Experimental Design
- Evaluation Dimension: Coarse-grained faithfulness (binary classification: Supported vs Not Supported).
- Prompting Strategies: Zero-shot, CoT, Annotation Guidelines (AG), AG+CoT.
- Evaluation Models: GPT-4o mini, Qwen 2.5 32B, Llama 3.2 11B/90B.
- Metrics: Balanced Accuracy (BAcc), giving equal weight to all labels and languages.
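Balanced accuracy is the mean of per-class recall, so the minority "Not Supported" class counts as much as the majority class. The sketch below computes it per language and then averages across languages. Treating "equal weight to all languages" as an unweighted mean of per-language scores is an assumption of this sketch, and the function names and toy records are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def balanced_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of per-class recall over the classes that occur in y_true."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for gold, pred in zip(y_true, y_pred):
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def multilingual_bacc(records: Iterable[Tuple[str, str, str]]):
    """records: (language, human_label, judge_label) triples.

    Assumption: the overall score is the unweighted mean of per-language
    balanced accuracies, so every language counts equally.
    """
    gold_by_lang: Dict[str, List[str]] = defaultdict(list)
    pred_by_lang: Dict[str, List[str]] = defaultdict(list)
    for lang, gold, pred in records:
        gold_by_lang[lang].append(gold)
        pred_by_lang[lang].append(pred)
    per_lang = {
        lang: balanced_accuracy(gold_by_lang[lang], pred_by_lang[lang])
        for lang in gold_by_lang
    }
    return sum(per_lang.values()) / len(per_lang), per_lang


# Toy records: S = Supported, N = Not Supported.
toy_records = [
    ("EN", "S", "S"), ("EN", "S", "S"), ("EN", "S", "N"), ("EN", "N", "N"),
    ("HI", "S", "S"), ("HI", "N", "S"),
]
overall, per_lang = multilingual_bacc(toy_records)
print(round(overall, 3), per_lang)
```

In the toy data the judge is right on 4 of 6 sentences, but its Hindi score is only 0.5 because it misses the single "Not Supported" sentence there, which is the behaviour the metric is meant to expose.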
Key Experimental Results
Main Results: Overall Multilingual Faithfulness Evaluation
| Prompt | GPT-4o mini | Qwen 2.5 32B | Llama 3.2 90B | Llama 3.2 11B |
|---|---|---|---|---|
| Zero-shot | 59.7 | 66.7 | 58.0 | 55.4 |
| CoT | 61.4 | 68.8 | 59.9 | 62.5 |
| AG | 71.6 | 72.6 | 62.8 | 57.9 |
| AG+CoT | 71.7 | 71.8 | 64.4 | 61.6 |
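To make the effect of each prompting strategy easier to read off, the short script below recomputes the gain over zero-shot for every judge model. The numbers are copied from the table above; the variable and function names are illustrative.

```python
# Balanced accuracy (%) per judge model and prompting strategy, from the table above.
bacc = {
    "GPT-4o mini":   {"Zero-shot": 59.7, "CoT": 61.4, "AG": 71.6, "AG+CoT": 71.7},
    "Qwen 2.5 32B":  {"Zero-shot": 66.7, "CoT": 68.8, "AG": 72.6, "AG+CoT": 71.8},
    "Llama 3.2 90B": {"Zero-shot": 58.0, "CoT": 59.9, "AG": 62.8, "AG+CoT": 64.4},
    "Llama 3.2 11B": {"Zero-shot": 55.4, "CoT": 62.5, "AG": 57.9, "AG+CoT": 61.6},
}


def gains_over_zero_shot(scores):
    """Percentage-point gain of each strategy relative to zero-shot."""
    base = scores["Zero-shot"]
    return {
        prompt: round(score - base, 1)
        for prompt, score in scores.items()
        if prompt != "Zero-shot"
    }


gains = {model: gains_over_zero_shot(scores) for model, scores in bacc.items()}
for model, g in gains.items():
    print(f"{model:14s} {g}")
```

Read this way, AG alone adds 11.9 points for GPT-4o mini and 5.9 for Qwen 2.5 32B, while for Llama 3.2 11B CoT alone (+7.1) helps more than AG alone (+2.5).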
Label Analysis: Distribution of Faithfulness Labels Across Languages
| Language | Supported | Not Supported | Challenging |
|---|---|---|---|
| EN | 65.2% | 31.5% | 3.2% |
| DE | 71.2% | 26.7% | 2.1% |
| ES | 65.7% | 32.9% | 1.4% |
| FR | 62.0% | 37.8% | 0.2% |
| HI | 73.8% | 25.6% | 0.6% |
Cross-lingual differences in fine-grained error types (partial list):
| Error Type | EN | DE | ES | FR | HI |
|---|---|---|---|---|---|
| Wrong reasoning | 10.0% | 0.6% | 1.4% | 1.9% | 0.3% |
| Adds new info | 7.0% | 9.6% | 16.0% | 15.0% | 14.8% |
| Contradiction | 4.5% | 11.3% | 8.3% | 5.9% | 7.1% |
Key Findings
- Annotation Guidelines (AG) are the most impactful prompt change for the stronger judges: Adding AG lifts GPT-4o mini's BAcc from 59.7 to 71.6 (+11.9 percentage points), far more than the 1.7-point gain from CoT. Llama 3.2 11B is the exception: there, CoT alone (62.5) beats AG alone (57.9).
- Qwen 2.5 32B performs best "out-of-the-box": It leads in both the zero-shot (66.7) and CoT (68.8) configurations, and GPT-4o mini only catches up once AG is added (71.6 vs 72.6 with AG; 71.7 vs 71.8 with AG+CoT). This suggests that Qwen's default behavior aligns more closely with human judgment.
- Cross-lingual error patterns differ markedly: Among the error types listed above, "wrong reasoning" is the most frequent in English (10.0%), whereas in Spanish it is "adds new info", i.e. hallucination (16.0%). This discrepancy stems from question complexity and model performance variations across languages.
- Flowchart annotation significantly elevates IAA: Compared to prior works' Kappa of 0.34-0.42, this study achieves 0.70-0.88, demonstrating the effectiveness of the flowchart-guided approach.
- Spanish responses are the most verbose (averaging 52.1 words vs 30.3 words for English) and have the highest proportion of "adding context" relevance labels.
Highlights & Insights
- A clear "native vs. translated" stance: The translationese problem in translated data has long been overlooked in multilingual NLP evaluation, and this paper addresses it directly by building on native questions.
- Flowchart-guided annotation is a practical methodological contribution: Structuring the annotation process as a decision tree reduces the room for subjective judgment and can be generalized to other annotation tasks that require high IAA.
- Well-motivated meta-evaluation design: The two application scenarios, prompt selection and model selection, match how practitioners would actually use the benchmark to choose an automatic evaluator.
- Fine-grained error analysis reveals language-specific patterns: LLMs make different types of errors in different languages, a finding that offers direct guidance for developing multilingual RAG systems.
Limitations & Future Work
- Limited language coverage: Contains only 5 languages, lacking key languages like Chinese, Japanese, and Arabic, as well as lower-resource languages.
- Few LLMs evaluated: Only assessed 4 evaluator models; specialized, fine-tuned faithfulness evaluators were not tested.
- Non-parallel data: Questions vary across languages, making direct cross-linguistic correlation comparisons difficult (due to potential variations in question difficulty).
- Control limited to the question side: The benchmark controls which questions are included, but not the complexity or error-type distribution of the LLM-generated answers.
- Only 250 questions per language: The scale is relatively small, which may limit the statistical power of comparisons between evaluators.
Related Work & Insights
- RAGAs: An English-centric RAG evaluation framework whose philosophy is extended to multilingual scenarios in this work.
- MIRACL: A multilingual retrieval benchmark from which the source data for this study is drawn.
- LLM-as-a-Judge: An increasingly popular automatic evaluation paradigm; this work provides a calibrated multilingual benchmark for it.
- MIRAGE-BENCH: Another multilingual RAG benchmark, which, unlike MEMERAG, relies on GPT-4o-synthesized judgments instead of human annotation, risking self-preference bias.
- SummEdits / ExpertQA: Related works on faithfulness evaluation.
Rating
- Novelty: ⭐⭐⭐⭐ — Built the first native multilingual RAG meta-evaluation benchmark, employing a practical and novel flowchart-guided annotation methodology.
- Experimental Thoroughness: ⭐⭐⭐ — The experiments are sound but small in scale (250 questions/language) with limited coverage of evaluation models.
- Writing Quality: ⭐⭐⭐⭐ — Clearly structured, with highly documented annotation processes and comprehensive appendices.
- Value: ⭐⭐⭐⭐ — Fills the gap in multilingual RAG meta-evaluation, offering a widely reusable annotation methodology.