ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty
Conference: ACL 2025
arXiv: 2412.20251
Code: https://github.com/HKUST-KnowComp/ComparisonQA
Area: LLM Safety
Keywords: factuality, knowledge frequency, robustness, uncertainty, benchmark
TL;DR
ComparisonQA is a benchmark of 283K paired questions in which a high-frequency and a low-frequency entity share the same abstract question, which gives a controlled comparison. Using a two-stage evaluation that combines accuracy and uncertainty, the study finds that LLMs, including GPT-4o, are far less robust on low-frequency knowledge.
Background & Motivation
Background: Factuality evaluation of LLMs is a highly active research area. Benchmarks like PopQA and SimpleQA have revealed that models perform poorly on low-frequency entities.
Limitations of Prior Work: Existing comparison methods use different questions (differing in difficulty and format) for high- and low-frequency entities, failing to eliminate the confounding factor of question difficulty.
Key Challenge: How to demonstrate that knowledge frequency is indeed the critical factor affecting LLM performance under strictly controlled variables?
Goal: Construct a controlled comparison benchmark and address the semantic shortcut issue.
Key Insight: Pair entities to share the same "abstract question" (replacing concrete entities with hypernyms) to ensure that entity frequency is the sole independent variable.
Core Idea: Achieve controlled and shortcut-free factuality robustness evaluation through shared abstract questions combined with a two-stage evaluation (accuracy + uncertainty).
Method
Overall Architecture
Extracting high- and low-frequency entity pairs from DBpedia -> Generating shared abstract questions using GPT-4 -> Two-stage evaluation (Stage 1 measures accuracy, Stage 2 filters semantic shortcuts using uncertainty) -> Constructing the ComparisonQA-Hard subset.
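The first pipeline steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it ranks entities by DBpedia relation count, takes the top and bottom thirds, pairs entities that share a hypernym, and attaches each entity to the same abstract question. The `Entity` class, the helper names, the prompt format in `instantiate`, and the toy data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    hypernym: str        # shared type, e.g. "city"
    relation_count: int  # number of DBpedia relations (frequency proxy)


def split_by_frequency(entities):
    """Return (high, low): the top and bottom thirds by relation count."""
    ranked = sorted(entities, key=lambda e: e.relation_count, reverse=True)
    third = len(ranked) // 3
    return ranked[:third], ranked[len(ranked) - third:]


def pair_by_hypernym(high, low):
    """Pair each high-frequency entity with an unused low-frequency entity of the same hypernym."""
    low_by_type = defaultdict(list)
    for entity in low:
        low_by_type[entity.hypernym].append(entity)
    pairs = []
    for entity in high:
        candidates = low_by_type[entity.hypernym]
        if candidates:
            pairs.append((entity, candidates.pop(0)))
    return pairs


def instantiate(abstract_question, entity):
    """Attach a concrete entity to the shared abstract question (assumed prompt format)."""
    return f"Entity: {entity.name}\nQuestion: {abstract_question}"


# Toy data: names and relation counts are invented for illustration only.
toy_entities = [
    Entity("city_a", "city", 900),
    Entity("river_a", "river", 800),
    Entity("city_b", "city", 50),
    Entity("river_b", "river", 40),
    Entity("city_c", "city", 3),
    Entity("river_c", "river", 2),
]

high, low = split_by_frequency(toy_entities)
pairs = pair_by_hypernym(high, low)
for frequent, rare in pairs:
    question = f"In which country is this {frequent.hypernym} located?"
    print(instantiate(question, frequent))
    print(instantiate(question, rare))
```

Because both prompts in a pair use the identical question text, any accuracy difference between them can be attributed to the entity rather than to question wording or difficulty.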
Key Designs
- Entity Pair Extraction
    - Entities are categorized into high-frequency (top 1/3) and low-frequency (bottom 1/3) based on the number of relations in DBpedia.
    - Pairing requirement: Entities must share the same hypernym (e.g., both are "cities") to ensure they can share questions.
    - Design Motivation: The number of DBpedia relations is highly correlated with the entity frequency in LLM training data.
- Abstract Question Generation
    - Multiple-choice questions (MCQs) are generated by replacing concrete entity names with hypernyms (e.g., "What is the population of this city?").
    - The same question is instantiated separately with high-frequency and low-frequency entities.
    - Design Motivation: Shared abstract questions guarantee that entity frequency is the sole variable.
- Two-Stage Evaluation Method
    - Stage 1: Standard MCQ testing for accuracy.
    - Stage 2: Measuring model uncertainty (token probability entropy) to detect questions answered correctly by exploiting semantic shortcuts.
    - Design Motivation: Evaluating based on accuracy alone overestimates model capabilities.
- ComparisonQA-Hard Subset
    - Low-frequency questions that are high-quality, shortcut-free, and difficult (81K) are selected automatically by combining accuracy and uncertainty.
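The uncertainty signal used in Stage 2 can be illustrated with a short sketch. It assumes the model exposes a probability for each answer option and computes the Shannon entropy of that distribution. Flagging an answer as "robust" only when it is both correct and low-entropy is one plausible reading of the two-stage idea, and the threshold `max_entropy_frac` is a hypothetical parameter, not a value taken from the paper.

```python
import math


def option_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer options."""
    total = sum(probs)
    normalized = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalized if p > 0)


def evaluate(probs, gold_index, max_entropy_frac=0.5):
    """Stage 1 checks correctness; Stage 2 checks uncertainty.

    max_entropy_frac is an assumed cutoff, expressed as a fraction of the
    maximum possible entropy log(number of options).
    """
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    correct = predicted == gold_index
    entropy = option_entropy(probs)
    confident = entropy <= max_entropy_frac * math.log(len(probs))
    return {"correct": correct, "entropy": entropy, "robust": correct and confident}


# A confident correct answer versus a correct answer that is close to a guess.
confident_case = evaluate([0.97, 0.01, 0.01, 0.01], gold_index=0)
guess_case = evaluate([0.30, 0.25, 0.25, 0.20], gold_index=0)
print(confident_case["robust"], guess_case["robust"])
```

Both toy cases count as correct under accuracy alone, but only the first has low entropy, which shows why an accuracy-only score can overstate what the model knows.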
Key Experimental Results
Main Results: Accuracy Comparison Between High- and Low-Frequency Entities
| Model | High-Freq Accuracy | Low-Freq Accuracy | Gap (percentage points) |
|---|---|---|---|
| GPT-4o | ~85% | ~55% | -30 |
| Llama-3-70B | ~78% | ~45% | -33 |
| Qwen-2-72B | ~80% | ~48% | -32 |
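The gap in the table is an absolute difference in percentage points, not a relative change. The short sketch below recomputes it from the approximate accuracies listed above and also shows the corresponding relative drop. The helper names are illustrative.

```python
# Approximate accuracies (in %) copied from the table: (high-frequency, low-frequency).
accuracy = {
    "GPT-4o": (85, 55),
    "Llama-3-70B": (78, 45),
    "Qwen-2-72B": (80, 48),
}


def gap_points(high, low):
    """Absolute gap in percentage points (negative means low-frequency is worse)."""
    return low - high


def relative_drop(high, low):
    """Drop relative to the high-frequency accuracy, in percent."""
    return (high - low) / high * 100


for model, (high, low) in accuracy.items():
    print(f"{model}: {gap_points(high, low)} pp, {relative_drop(high, low):.1f}% relative drop")
```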
Robustness Evaluation (After Two-Stage Method)
| Model | High-Freq Robustness Rate | Low-Freq Robustness Rate | Description |
|---|---|---|---|
| GPT-4o | ~70% | ~35% | Robustness rate is much lower than accuracy |
| Average of All Models | ~65% | ~30% | Extremely poor robustness on low-frequency knowledge |
Key Findings
- Frequency is a decisive factor: Under controlled comparison, accuracy drops by roughly 30 percentage points or more on low-frequency entities.
- Semantic shortcuts are prevalent: A large number of correctly answered questions are actually guessed by exploiting semantic clues in options.
- Uncertainty is an effective filtering tool: The combination of low uncertainty and high accuracy effectively identifies shortcut-based questions.
- GPT-4o is no exception: Even the most powerful models exhibit extremely poor robustness on low-frequency knowledge.
Highlights & Insights
- Shared abstract questions provide an elegant solution for controlled comparison, which supports a causal reading of the frequency effect.
- The two-stage evaluation method introduces uncertainty into factuality assessment, addressing the blind spots of accuracy-only evaluations.
- The 283K paired dataset offers rich resources for systematic research into knowledge frequency effects.
Limitations & Future Work
- Frequency definition relies on the number of relations in DBpedia, which may not perfectly align with the frequencies in actual pre-training data.
- Abstract questions generated by GPT-4 may introduce bias.
- Only the MCQ format is evaluated, while open-ended generation scenarios are not covered.
Related Work & Insights
- vs PopQA: PopQA compares entities of different frequencies using distinct questions, failing to isolate question difficulty.
- vs SimpleQA: SimpleQA selects questions based solely on adversarial accuracy, overlooking the issue of semantic shortcuts.
Rating
- Novelty: ⭐⭐⭐⭐ Innovations in both the controlled comparison design and the two-stage evaluation method.
- Experimental Thoroughness: ⭐⭐⭐⭐ 283K data + multiple models + uncertainty analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation of the problem.
- Value: ⭐⭐⭐⭐ Provides a more rigorous methodology for factuality evaluation.