ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty¶

Conference: ACL 2025
arXiv: 2412.20251
Code: https://github.com/HKUST-KnowComp/ComparisonQA
Area: LLM Safety
Keywords: factuality, knowledge frequency, robustness, uncertainty, benchmark

TL;DR¶

ComparisonQA is a benchmark of 283K paired questions in which a high-frequency entity and a low-frequency entity share the same abstract question, enabling a controlled comparison. Using a two-stage evaluation that combines accuracy and uncertainty, the study finds that LLMs, including GPT-4o, are far less robust on low-frequency knowledge.

Background & Motivation¶

Background: Factuality evaluation of LLMs is a highly active research area. Benchmarks like PopQA and SimpleQA have revealed that models perform poorly on low-frequency entities.

Limitations of Prior Work: Existing comparison methods use different questions (differing in difficulty and format) for high- and low-frequency entities, failing to eliminate the confounding factor of question difficulty.

Key Challenge: How can one demonstrate that knowledge frequency, rather than question difficulty or format, is the critical factor affecting LLM performance when all other variables are strictly controlled?

Goal: Construct a controlled comparison benchmark and address the semantic shortcut issue.

Key Insight: Pair entities to share the same "abstract question" (replacing concrete entities with hypernyms) to ensure that entity frequency is the sole independent variable.

Core Idea: Achieve controlled and shortcut-free factuality robustness evaluation through shared abstract questions combined with a two-stage evaluation (accuracy + uncertainty).

Method¶

Overall Architecture¶

Extracting high- and low-frequency entity pairs from DBpedia -> Generating shared abstract questions using GPT-4 -> Two-stage evaluation (Stage 1 measures accuracy, Stage 2 filters semantic shortcuts using uncertainty) -> Constructing the ComparisonQA-Hard subset.

Key Designs¶

Entity Pair Extraction
- Entities are categorized into high-frequency (top 1/3) and low-frequency (bottom 1/3) based on the number of relations in DBpedia.
- Pairing requirement: Entities must share the same hypernym (e.g., both are "cities") to ensure they can share questions.
- Design Motivation: The number of DBpedia relations is highly correlated with the entity frequency in LLM training data.
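
The extraction step above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the entity names, relation counts, the function names `split_by_frequency` and `pair_entities`, and the simple in-order pairing rule are all assumptions made for the example.

```python
from collections import defaultdict

def split_by_frequency(relation_counts):
    """Rank entities by relation count and return the top and bottom thirds."""
    ranked = sorted(relation_counts, key=relation_counts.get, reverse=True)
    third = len(ranked) // 3
    if third == 0:
        return set(), set()
    return set(ranked[:third]), set(ranked[-third:])

def pair_entities(relation_counts, hypernyms):
    """Pair high- and low-frequency entities that share a hypernym.

    The in-order zip pairing is a hypothetical choice for illustration.
    """
    high, low = split_by_frequency(relation_counts)
    groups = defaultdict(lambda: {"high": [], "low": []})
    for entity in sorted(relation_counts):
        if entity in high:
            groups[hypernyms[entity]]["high"].append(entity)
        elif entity in low:
            groups[hypernyms[entity]]["low"].append(entity)
    pairs = []
    for hypernym, group in sorted(groups.items()):
        for high_entity, low_entity in zip(group["high"], group["low"]):
            pairs.append((hypernym, high_entity, low_entity))
    return pairs

# Toy data: made-up relation counts, not real DBpedia statistics.
toy_counts = {"city_a": 900, "city_b": 850, "city_c": 300,
              "city_d": 280, "city_e": 12, "city_f": 9}
toy_hypernyms = {name: "city" for name in toy_counts}
toy_pairs = pair_entities(toy_counts, toy_hypernyms)
```

Entities in the middle third (`city_c`, `city_d` here) are dropped, which keeps the frequency gap between the two members of each pair wide.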

Abstract Question Generation
- Multiple-choice questions (MCQs) are generated by replacing concrete entity names with hypernyms (e.g., "What is the population of this city?").
- The same question is instantiated separately with high-frequency and low-frequency entities.
- Design Motivation: Shared abstract questions guarantee that entity frequency is the sole variable.
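
To make the shared-question idea concrete, here is a small sketch of instantiating one abstract question for both members of a pair. The prompt template, the `render_mcq` helper, the entity names, and the answer options are hypothetical; the source does not specify the exact prompt format.

```python
def render_mcq(entity, abstract_question, options):
    """Render one multiple-choice prompt for a given entity (assumed template)."""
    lines = [f"Entity: {entity}", f"Question: {abstract_question}"]
    lines += [f"{letter}. {option}" for letter, option in zip("ABCD", options)]
    lines.append("Answer:")
    return "\n".join(lines)

# The abstract question is shared; only the entity and its options change.
abstract_question = "What is the population of this city?"
high_prompt = render_mcq("high_freq_city", abstract_question,
                         ["about 2 million", "about 50,000", "about 900", "about 10 million"])
low_prompt = render_mcq("low_freq_city", abstract_question,
                        ["about 40,000", "about 3 million", "about 700", "about 8 million"])
```

Because the question line is identical in both prompts, any accuracy difference between them cannot be attributed to the wording of the question.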

Two-Stage Evaluation Method
- Stage 1: Standard MCQ testing for accuracy.
- Stage 2: Measuring model uncertainty (token probability entropy) to detect questions answered correctly by exploiting semantic shortcuts.
- Design Motivation: Evaluating based on accuracy alone overestimates model capabilities.
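
The two stages can be sketched as below, assuming uncertainty is the Shannon entropy of the model's probability distribution over the answer options. That reading of "token probability entropy", along with the function names and the toy logits, is an assumption for illustration rather than the paper's exact definition.

```python
import math

def softmax(logits):
    """Convert option logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability options contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def evaluate(option_logits, gold_index):
    """Stage 1: is the top option correct? Stage 2: how uncertain is the model?"""
    probs = softmax(option_logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted == gold_index, entropy(probs)

# Toy logits for a four-option question (made-up values).
confident_correct, confident_entropy = evaluate([5.0, 0.0, 0.0, 0.0], gold_index=0)
guess_correct, guess_entropy = evaluate([0.1, 0.0, 0.0, 0.0], gold_index=0)
```

Both toy cases count as correct in Stage 1, but the second has an entropy close to the maximum of `math.log(4)`, which is the kind of distinction accuracy alone cannot capture.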

ComparisonQA-Hard Subset
- A subset of 81K high-quality, shortcut-free, and difficult low-frequency questions, selected automatically by combining accuracy and uncertainty.
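
One way such a filter could look is sketched below. The source only states that accuracy and uncertainty are combined; the record fields, the aggregation into a single accuracy and uncertainty score per question, and both thresholds are hypothetical choices made for this example.

```python
def select_hard_subset(records, max_accuracy=0.5, min_uncertainty=1.0):
    """Keep low-frequency questions that are hard and not confidently solved.

    Questions answered correctly with low uncertainty are treated as likely
    shortcut cases and dropped. Thresholds are illustrative assumptions.
    """
    kept = []
    for record in records:
        if record["frequency"] != "low":
            continue
        is_hard = record["accuracy"] <= max_accuracy
        is_uncertain = record["uncertainty"] >= min_uncertainty
        if is_hard and is_uncertain:
            kept.append(record["id"])
    return kept

# Toy records: accuracy and uncertainty are made-up per-question averages.
toy_records = [
    {"id": "q1", "frequency": "low", "accuracy": 0.2, "uncertainty": 1.3},
    {"id": "q2", "frequency": "low", "accuracy": 0.9, "uncertainty": 0.1},
    {"id": "q3", "frequency": "high", "accuracy": 0.1, "uncertainty": 1.2},
    {"id": "q4", "frequency": "low", "accuracy": 0.3, "uncertainty": 0.4},
]
hard_ids = select_hard_subset(toy_records)
```

In this toy run only `q1` survives: `q2` looks like a shortcut case, `q3` is high-frequency, and `q4` is answered with too little uncertainty.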

Key Experimental Results¶

Main Results: Accuracy Comparison Between High- and Low-Frequency Entities¶

Model	High-Freq Accuracy	Low-Freq Accuracy	Gap (percentage points)
GPT-4o	~85%	~55%	-30
Llama-3-70B	~78%	~45%	-33
Qwen-2-72B	~80%	~48%	-32
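
The Gap column is an absolute difference in percentage points (low-frequency accuracy minus high-frequency accuracy), which is not the same as a relative drop. The short calculation below uses the approximate figures from the table to show both; the `accuracy_gap` helper is introduced here only for illustration.

```python
def accuracy_gap(high, low):
    """Return (absolute gap in percentage points, relative change in percent)."""
    absolute = low - high
    relative = 100 * absolute / high
    return absolute, relative

# Approximate accuracies (percent) taken from the table above.
table = {"GPT-4o": (85, 55), "Llama-3-70B": (78, 45), "Qwen-2-72B": (80, 48)}
gaps = {model: accuracy_gap(high, low) for model, (high, low) in table.items()}

for model, (absolute, relative) in gaps.items():
    print(f"{model}: {absolute} pp absolute, {relative:.1f}% relative")
```

For GPT-4o, for example, a 30-point drop from roughly 85% corresponds to a relative decline of about 35%.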

Robustness Evaluation (After Two-Stage Method)¶

Model	High-Freq Robustness Rate	Low-Freq Robustness Rate	Description
GPT-4o	~70%	~35%	Robustness rate is much lower than accuracy
Average of All Models	~65%	~30%	Extremely poor robustness on low-frequency knowledge

Key Findings¶

Frequency is a decisive factor: Under controlled comparison, accuracy drops by 30 percentage points or more on low-frequency entities.
Semantic shortcuts are prevalent: Many correctly answered questions are in fact guessed by exploiting semantic clues in the answer options.
Uncertainty is an effective filtering tool: The combination of low uncertainty and high accuracy effectively identifies shortcut-based questions.
GPT-4o is no exception: Even the most powerful models exhibit extremely poor robustness on low-frequency knowledge.

Highlights & Insights¶

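Shared abstract questions provide an elegant solution for controlled comparison: because question difficulty and format are held fixed, performance differences can be attributed to entity frequency with far more confidence.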
The two-stage evaluation method introduces uncertainty into factuality assessment, addressing the blind spots of accuracy-only evaluations.
The 283K paired dataset offers rich resources for systematic research into knowledge frequency effects.

Limitations & Future Work¶

Frequency definition relies on the number of relations in DBpedia, which may not perfectly align with the frequencies in actual pre-training data.
Abstract questions generated by GPT-4 may introduce bias.
Only the MCQ format is evaluated, while open-ended generation scenarios are not covered.

vs PopQA: PopQA compares entities of different frequencies using distinct questions, failing to isolate question difficulty.
vs SimpleQA: SimpleQA selects questions based solely on adversarial accuracy, overlooking the issue of semantic shortcuts.

Rating¶

Novelty: ⭐⭐⭐⭐ Innovations in both the controlled comparison design and the two-stage evaluation method.
Experimental Thoroughness: ⭐⭐⭐⭐ 283K data + multiple models + uncertainty analysis.
Writing Quality: ⭐⭐⭐⭐ Clear motivation of the problem.
Value: ⭐⭐⭐⭐ Provides a more rigorous methodology for factuality evaluation.