Truth Knows No Language: Evaluating Truthfulness Beyond English

Conference: ACL 2025
arXiv: 2502.09387
Code: github.com/hitz-zentroa/truthfulqa-multi
Area: LLM Safety
Keywords: Truthfulness evaluation, TruthfulQA, Multilingual, LLM-as-a-Judge, Low-resource languages

TL;DR

The paper builds the first professionally translated multilingual TruthfulQA benchmark (Basque, Catalan, Galician, Spanish). It finds that cross-lingual truthfulness disparities in LLMs are smaller than expected, and that LLM-as-a-Judge aligns better with human judgment than multiple-choice metrics.

Background & Motivation

TruthfulQA is the standard benchmark for evaluating LLM truthfulness. It tests whether models reproduce common human false beliefs and misconceptions. However, the benchmark has notable limitations:

English-only: Although some developers have performed machine translation, there is a lack of professionally translated versions and systematic cross-lingual evaluations.

Controversial evaluation methods: Whether the standard multiple-choice metric (MC2) is sufficient to measure truthfulness remains questionable, especially in non-English scenarios.

Cultural bias: TruthfulQA possesses a strong English/US cultural background, with many questions involving US laws, English proverbs, etc.

The motivating question is whether LLMs are equally truthful across languages: if a model avoids a misconception in English, does it also avoid it in Basque?

The choice of target languages is also deliberate:

- Basque: An agglutinative language isolate with extremely scarce pre-training data for LLMs.
- Catalan, Galician: Low-resource Romance languages.
- Spanish: A relatively high-resource control group.

Method

Overall Architecture

The work comprises three main components:

Professionally translated dataset: Translating 817 questions of TruthfulQA into 4 target languages.
Three evaluation methods: Human evaluation, the MC2 multiple-choice metric, and LLM-as-a-Judge.
Machine translation substitution experiments: Verifying whether machine translation can replace professional translation.

Key Designs

Translation Strategy Choice: The team faced two options: (1) localizing to fit the target culture, or (2) retaining the original cultural context. They ultimately chose to preserve the original cultural context to maintain complete cross-lingual parallelism. Specific translation guidelines include:

- Proverbs and misquotes: Employing a literal translation strategy (e.g., directly translating "an apple a day").
- Acronym misconceptions: Keeping the original English terms and annotating "in English" within the question.
- Fictional named entities: Using existing translations (such as movie character names) or borrowing from English when no translation exists.

All translations were conducted by professional translators who are native speakers of the target languages.

Human Evaluation Design:

- Evaluating 400 responses per language (4 models \(\times\) 100 questions), covering truthfulness and informativeness.
- Using binary labels (truthful/untruthful, informative/uninformative) instead of the scalar scores in the original paper.
- Adding guidelines for instruction-tuned models: extra information in long responses must be verified by the evaluators.
- Computing inter-annotator agreement using 50 overlapping annotations.

LLM-as-a-Judge Training:

- Base models: Llama 2 7B (existing fine-tuned version), Gemma 2 9B, Llama 3.1 8B.
- Training data: Original English data vs. all-language data including machine translation.
- Best configuration: Gemma 2 9B instruct + all-language translation data.
- The informativeness judge model was trained separately but was only effective for base models (as instruction models yielded almost no uninformative responses).

MC2 Metric:

- Measures the normalized probability of true answers out of all reference answers.
- Uses LM Evaluation Harness in a 6-shot setup.
- Chat models use a multi-turn conversation format.
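As a rough illustration of the MC2 definition above, the sketch below converts per-answer log-likelihoods into probabilities and reports the share of probability mass that falls on the true reference answers. It follows the commonly used definition of MC2. The function name `mc2_score` and the toy log-likelihood values are assumptions made for illustration, not figures or code from the paper.

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Share of probability mass assigned to the true reference answers.

    Both arguments are lists of log-likelihoods that a model assigns to
    each reference answer of a single question.
    """
    if not true_logprobs or not false_logprobs:
        raise ValueError("need at least one true and one false reference answer")
    p_true = [math.exp(lp) for lp in true_logprobs]
    p_false = [math.exp(lp) for lp in false_logprobs]
    total = sum(p_true) + sum(p_false)
    return sum(p_true) / total


# Hypothetical log-likelihoods for one question: two true answers, one false.
example_true = [math.log(0.2), math.log(0.1)]
example_false = [math.log(0.3)]
example_mc2 = mc2_score(example_true, example_false)
print(f"MC2 for the toy question: {example_mc2:.2f}")
```

The benchmark-level score is then the mean of this per-question value. Because the score depends only on the relative likelihood of fixed reference strings, it never inspects what the model would actually generate.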

Loss & Training

The training of LLM-as-a-Judge uses a learning rate of 0.01 and runs for 5 epochs. The core objective is binary classification of truthfulness on (question, answer) pairs.
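To make the binary-classification framing concrete, here is a minimal sketch of how a (question, answer) pair could be turned into a training example for a judge model. The `Q: / A: / True:` template with a `yes`/`no` completion is modeled on the prompt style of the original TruthfulQA judge. Treat it, along with the names `JudgeExample` and `build_judge_example`, as an assumption: this summary does not specify the exact format the authors used.

```python
from dataclasses import dataclass


@dataclass
class JudgeExample:
    prompt: str      # text shown to the judge model
    completion: str  # target the judge is trained to produce


def build_judge_example(question: str, answer: str, is_truthful: bool) -> JudgeExample:
    """Format one labeled (question, answer) pair for judge fine-tuning.

    The template below is an assumed layout, not the paper's confirmed one.
    """
    prompt = f"Q: {question.strip()}\nA: {answer.strip()}\nTrue:"
    completion = " yes" if is_truthful else " no"
    return JudgeExample(prompt=prompt, completion=completion)


# Illustrative pair; the wording is made up for this sketch.
example = build_judge_example(
    "Why do chameleons change colors?",
    "Chameleons change colors to regulate their temperature.",
    is_truthful=True,
)
print(example.prompt + example.completion)
```

At inference time, a judge trained this way would be given only the prompt, and its verdict read from whether it continues with `yes` or `no`.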

Key Experimental Results

Main Results

Human Evaluation Results (truthfulness percentage of 100 samples/model/language):

Model	English	Spanish	Catalan	Galician	Basque
Gemma-2-27b-it	73%	73%	71%	72%	62%
Llama-3-70B-IT	67%	70%	62%	58%	48%
Llama-3-8B-IT	67%	61%	63%	51%	34%
Llama-3-70B (base)	36%	58%	58%	60%	54%

Key Observations:

- Instruction-tuned models typically perform best in English and worst in Basque, but the disparities are smaller than expected.
- The base Llama-3-70B model is least truthful in English (36%) and more truthful in every other language (54-60%), because its English answers are more "informative" but more prone to mimicking false beliefs.
- In terms of informativeness, base models frequently generate uninformative responses in non-English languages.
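The size of these disparities can be read directly off the human-evaluation table. The short sketch below copies the reported percentages and computes, for each model, the English-minus-Basque gap and the spread between its best and worst language. The language codes and variable names are illustrative choices. The numbers are the ones reported above.

```python
# Human-evaluated truthfulness (%), copied from the table above.
human_truthful = {
    "Gemma-2-27b-it":     {"en": 73, "es": 73, "ca": 71, "gl": 72, "eu": 62},
    "Llama-3-70B-IT":     {"en": 67, "es": 70, "ca": 62, "gl": 58, "eu": 48},
    "Llama-3-8B-IT":      {"en": 67, "es": 61, "ca": 63, "gl": 51, "eu": 34},
    "Llama-3-70B (base)": {"en": 36, "es": 58, "ca": 58, "gl": 60, "eu": 54},
}

# English minus Basque, in percentage points (negative: Basque scored higher).
en_eu_gap = {m: s["en"] - s["eu"] for m, s in human_truthful.items()}

# Best language minus worst language, in percentage points.
spread = {m: max(s.values()) - min(s.values()) for m, s in human_truthful.items()}

for model in human_truthful:
    print(f"{model:20s} en-eu gap: {en_eu_gap[model]:+3d}  spread: {spread[model]:2d}")
```

On these figures the gap ranges from 11 points for Gemma-2-27b-it to 33 points for Llama-3-8B-IT, while the base model is the only one with a negative gap.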

Agreement of Judge-LLM and MC2 with Human Judgments (Cohen's Kappa):

Method	English	Spanish	Catalan	Galician	Basque
MC2	~0.3	~0.2	~0.2	~0.2	~0.1
Judge-LLM (Best)	0.74	0.70	0.75	0.72	0.60
Inter-annotator Agreement	~0.75	~0.72	~0.70	~0.70	~0.65

Key Result: The agreement between Judge-LLM and human judgment is significantly higher than that of MC2, and is close to the inter-annotator agreement among humans.
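For readers less familiar with the agreement statistic used here, the following is a minimal sketch of Cohen's kappa for two label sequences, such as human and judge verdicts on the same responses. The function name `cohen_kappa` and the ten toy labels are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = truthful, 0 = untruthful) on ten responses.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.3f}")
```

In the toy data the two raters agree on 8 of 10 items, but kappa is lower than 0.8 because some of that agreement would be expected by chance given how often each rater says "truthful".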

Full Judge-LLM Evaluation (12 models \(\times\) 5 languages):

- Gemma-2-27b-it performs best across all languages (averaging ~61%).
- Instruction models average ~57%, while base models average ~46%.
- Disparities across languages: English (~50-58%) > Spanish \(\approx\) Galician > Catalan > Basque.

Ablation Study

Machine Translation vs. Professional Translation:

- TruthfulQA is automatically translated using Claude 3.5 Sonnet.
- Taking the professional translation as the reference, the MT quality is high (slightly lower for Basque due to its agglutinative nature).
- Judge-LLMs trained on the MT version perform comparably to those trained on the professional translation.
- Conclusion: Machine translation is a viable alternative for extending truthfulness benchmarks to more languages.

General vs. Context-Dependent Questions:

- Dividing TruthfulQA questions into "general knowledge" (e.g., why chameleons change color) and "context/time-dependent" (e.g., US laws).
- General knowledge questions show more consistent performance across all languages.
- Context-dependent questions display larger cross-lingual disparities, making them more suitable for evaluating multilingual truthfulness.

Key Findings

The cross-lingual truthfulness disparities in LLMs are much smaller than expected—even in Basque (the lowest-resource language), the performance degradation is limited.
MC2 as the sole metric for truthfulness evaluation is insufficient; LLM-as-a-Judge is a more reliable alternative.
Informativeness is a key factor in truthfulness evaluation—base models often generate uninformative responses, and ignoring informativeness distorts the assessment results.
Larger LLMs are generally more truthful than smaller models of the same family, contradicting the earlier finding of Lin et al. (2022) that larger models tended to be less truthful.
Qualitative analysis shows that responses in English still enjoy a significant lead in reasoning depth.

Highlights & Insights

First Professional Translation: Avoids systematic biases that machine translation might introduce, providing a reliable baseline for subsequent research.
In-depth Comparison of Evaluation Methods: Systematically compares human evaluation, MC2, and Judge-LLM, quantifying correlation using Cohen's Kappa.
Counter-intuitive Phenomenon in Base Models: Base models exhibit lower truthfulness in English than in other languages, because they are more "confident" in mimicking false beliefs in English.
Practical Finding: Machine translation serves as a viable alternative to professional translation, significantly lowering the cost of building multilingual benchmarks.
Dimension of Cultural and Temporal Dependence: Emphasizes that truthfulness evaluation should distinguish between general knowledge and context-dependent knowledge.

Limitations & Future Work

TruthfulQA itself is highly Anglo-American centric—even professional translation cannot alter the cultural background of the questions.
Only four target languages (Spanish, Catalan, Galician, Basque) are covered; there is a need to expand to more language families.
Informativeness evaluation is only effective for base models; instruction-tuned models lack uninformative answers, leaving training insufficient.
The sample size for human evaluation is limited (100 questions \(\times\) 4 models \(\times\) 5 languages = 2000 assessments).
"Culture-specific misconceptions" in different languages were not explored—different cultures may harbor different false beliefs.

Related Work & Insights

TruthfulQA (Lin et al., 2022): The original benchmark, which used a fine-tuned GPT-3 as a judge. This work replaces it with a stronger multilingual judge.
Aula-Blasco et al. (2025): A multilingual truthfulness benchmark that distinguishes between general vs. context-dependent questions—this work adopts this classification and provides empirical support.
HuggingFace OpenLLM Leaderboard: Extensively uses the MC2 metric—this work demonstrates that MC2 has low correlation with human judgment.
Insights: Cross-lingual evaluation is not only about translation quality but also involves fundamental reflections on cultural adaptation and evaluation methodology.

Rating

Novelty: ⭐⭐⭐ — The core contribution lies in high-quality translation and systematic evaluation, with limited methodology innovation.
Value: ⭐⭐⭐⭐ — Provides an important benchmark and methodological guidance for multilingual truthfulness evaluation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 12 models \(\times\) 5 languages \(\times\) 3 evaluation methods, plus human evaluation and IAA analysis.
Writing Quality: ⭐⭐⭐⭐ — Clear structure, with detailed elaboration on translation guidelines and evaluation methodology.