
ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty

Conference: ACL 2025
arXiv: 2412.20251
Code: https://github.com/HKUST-KnowComp/ComparisonQA
Area: LLM Safety
Keywords: factuality, knowledge frequency, robustness, uncertainty, benchmark

TL;DR

ComparisonQA is a benchmark of 283K paired questions in which a high-frequency entity and a low-frequency entity share the same abstract question, so the two can be compared under controlled conditions. Using a two-stage evaluation that combines accuracy with uncertainty, the study shows that LLMs, GPT-4o included, have extremely poor robustness on low-frequency knowledge.

Background & Motivation

Background: Factuality evaluation of LLMs is a highly active research area. Benchmarks like PopQA and SimpleQA have revealed that models perform poorly on low-frequency entities.

Limitations of Prior Work: Existing comparisons ask different questions, with different difficulty and format, about high- and low-frequency entities, so question difficulty remains a confounding factor.

Key Challenge: How to demonstrate that knowledge frequency is indeed the critical factor affecting LLM performance under strictly controlled variables?

Goal: Construct a controlled comparison benchmark and address the semantic shortcut issue.

Key Insight: Pair entities to share the same "abstract question" (replacing concrete entities with hypernyms) to ensure that entity frequency is the sole independent variable.

Core Idea: Achieve controlled and shortcut-free factuality robustness evaluation through shared abstract questions combined with a two-stage evaluation (accuracy + uncertainty).

Method

Overall Architecture

Extracting high- and low-frequency entity pairs from DBpedia -> Generating shared abstract questions using GPT-4 -> Two-stage evaluation (Stage 1 measures accuracy, Stage 2 filters semantic shortcuts using uncertainty) -> Constructing the ComparisonQA-Hard subset.
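
A minimal sketch of the first two pipeline steps, assuming entities arrive as simple records carrying a hypernym and a DBpedia relation count. The `Entity` class, the helper names, the toy relation counts, and the string substitution used to instantiate a question are illustrative assumptions, not the authors' implementation; in the benchmark the abstract questions themselves are generated with GPT-4.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    hypernym: str        # e.g. "city"
    relation_count: int  # frequency proxy: number of DBpedia relations


def split_by_frequency(entities):
    """Return (high, low): the top and bottom thirds by relation count."""
    ranked = sorted(entities, key=lambda e: e.relation_count, reverse=True)
    third = len(ranked) // 3
    return ranked[:third], ranked[len(ranked) - third:]


def pair_by_hypernym(high, low):
    """Pair each high-frequency entity with a low-frequency one of the same hypernym."""
    low_by_type = defaultdict(list)
    for entity in low:
        low_by_type[entity.hypernym].append(entity)
    pairs = []
    for h in high:
        candidates = low_by_type[h.hypernym]
        if candidates:  # skip high-frequency entities with no partner
            pairs.append((h, candidates.pop(0)))
    return pairs


def instantiate(abstract_question, entity):
    """Assumed instantiation: swap the 'this <hypernym>' phrase for the entity name."""
    return abstract_question.replace(f"this {entity.hypernym}", entity.name)


# Toy data; the relation counts are made up for illustration.
entities = [
    Entity("Paris", "city", 900), Entity("Tokyo", "city", 850),
    Entity("Berlin", "city", 400), Entity("Lima", "city", 300),
    Entity("Zwolle", "city", 20), Entity("Gouda", "city", 15),
]
high, low = split_by_frequency(entities)
pairs = pair_by_hypernym(high, low)

question = "What is the population of this city?"
for h, l in pairs:
    print(instantiate(question, h), "|", instantiate(question, l))
```

Because both members of a pair receive the same abstract question, any accuracy difference between them can be attributed to the entity rather than to the wording of the question.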

Key Designs

  1. Entity Pair Extraction

    • Entities are categorized into high-frequency (top 1/3) and low-frequency (bottom 1/3) based on the number of relations in DBpedia.
    • Pairing requirement: Entities must share the same hypernym (e.g., both are "cities") to ensure they can share questions.
    • Design Motivation: The number of DBpedia relations is highly correlated with the entity frequency in LLM training data.
  2. Abstract Question Generation

    • Multiple-choice questions (MCQs) are generated by replacing concrete entity names with hypernyms (e.g., "What is the population of this city?").
    • The same question is instantiated separately with high-frequency and low-frequency entities.
    • Design Motivation: Shared abstract questions guarantee that entity frequency is the sole variable.
  3. Two-Stage Evaluation Method

    • Stage 1: Standard MCQ testing for accuracy.
    • Stage 2: Measuring model uncertainty (token probability entropy) to detect questions answered correctly by exploiting semantic shortcuts.
    • Design Motivation: Evaluating based on accuracy alone overestimates model capabilities.
  4. ComparisonQA-Hard Subset

    • High-quality, shortcut-free, and difficult low-frequency questions (81K) are selected automatically by combining accuracy and uncertainty.
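
The interplay of accuracy and uncertainty can be illustrated with a small sketch. It assumes that uncertainty is the Shannon entropy of the model's probabilities over the answer options and that an answer counts as robust only when it is both correct and low-entropy. The function names, the `max_entropy` threshold of 0.5, and the toy probabilities are assumptions made for illustration; the paper's exact criteria may differ.

```python
import math


def option_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer options."""
    total = sum(probs)
    normalized = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalized if p > 0)


def predict(probs):
    """Index of the option with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def is_robust(probs, gold_index, max_entropy=0.5):
    """Correct (stage 1) and confident (stage 2). max_entropy is an assumed threshold."""
    return predict(probs) == gold_index and option_entropy(probs) <= max_entropy


def accuracy(items):
    return sum(predict(p) == gold for p, gold in items) / len(items)


def robustness_rate(items):
    return sum(is_robust(p, gold) for p, gold in items) / len(items)


# Toy items: (option probabilities, index of the gold option).
items = [
    ([0.94, 0.02, 0.02, 0.02], 0),  # correct and confident
    ([0.28, 0.24, 0.24, 0.24], 0),  # correct, but close to a uniform guess
    ([0.10, 0.70, 0.10, 0.10], 0),  # wrong
]
print(f"accuracy:        {accuracy(items):.2f}")
print(f"robustness rate: {robustness_rate(items):.2f}")
```

In this toy set the second item counts toward accuracy but not toward robustness, which is the kind of overestimation that an accuracy-only evaluation cannot reveal.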

Key Experimental Results

Main Results: Accuracy Comparison Between High- and Low-Frequency Entities

| Model | High-Freq Accuracy | Low-Freq Accuracy | Gap |
|---|---|---|---|
| GPT-4o | ~85% | ~55% | -30 pp |
| Llama-3-70B | ~78% | ~45% | -33 pp |
| Qwen-2-72B | ~80% | ~48% | -32 pp |
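
The gap column is an absolute difference in percentage points, which is not the same as the relative drop. A small worked example using the approximate figures from the table; the `frequency_gap` helper is a name introduced here for illustration.

```python
# Approximate accuracies in percent (high-frequency, low-frequency) from the table.
results = {
    "GPT-4o": (85, 55),
    "Llama-3-70B": (78, 45),
    "Qwen-2-72B": (80, 48),
}


def frequency_gap(high, low):
    """Return (absolute gap in percentage points, relative drop in percent)."""
    gap_pp = high - low
    return gap_pp, 100 * gap_pp / high


gaps = {model: frequency_gap(h, l) for model, (h, l) in results.items()}
for model, (gap_pp, relative) in gaps.items():
    print(f"{model}: {gap_pp} pp absolute, {relative:.1f}% relative")
```

Measured relative to high-frequency accuracy, the drop is larger than the percentage-point gap suggests, since each model loses more than a third of its high-frequency accuracy.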

Robustness Evaluation (After Two-Stage Method)

| Model | High-Freq Robustness Rate | Low-Freq Robustness Rate | Description |
|---|---|---|---|
| GPT-4o | ~70% | ~35% | Robustness rate is much lower than accuracy |
| Average of All Models | ~65% | ~30% | Extremely poor robustness on low-frequency knowledge |

Key Findings

  • Frequency is a decisive factor: Under controlled comparison, accuracy drops by 30 or more percentage points on low-frequency entities.
  • Semantic shortcuts are prevalent: Many correct answers are in fact guesses that exploit semantic cues in the options.
  • Uncertainty is an effective filtering tool: The combination of low uncertainty and high accuracy effectively identifies shortcut-based questions.
  • GPT-4o is no exception: Even the most powerful models exhibit extremely poor robustness on low-frequency knowledge.

Highlights & Insights

  • Shared abstract questions provide an elegant solution for controlled comparison, which supports attributing the performance gap to entity frequency rather than to question difficulty.
  • The two-stage evaluation method introduces uncertainty into factuality assessment, addressing the blind spots of accuracy-only evaluations.
  • The 283K paired dataset offers rich resources for systematic research into knowledge frequency effects.

Limitations & Future Work

  • Frequency definition relies on the number of relations in DBpedia, which may not perfectly align with the frequencies in actual pre-training data.
  • Abstract questions generated by GPT-4 may introduce bias.
  • Only the MCQ format is evaluated, while open-ended generation scenarios are not covered.

Comparison with Related Work

  • vs PopQA: PopQA compares entities of different frequencies using distinct questions, so it cannot isolate question difficulty.
  • vs SimpleQA: SimpleQA selects questions based solely on adversarial accuracy, overlooking the issue of semantic shortcuts.

Rating

  • Novelty: ⭐⭐⭐⭐ Innovations in both the controlled comparison design and the two-stage evaluation method.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 283K data + multiple models + uncertainty analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation of the problem.
  • Value: ⭐⭐⭐⭐ Provides a more rigorous methodology for factuality evaluation.