Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs¶

Conference: ACL 2025
arXiv: 2401.14640
Area: Graph Learning
Keywords: Attribution Evaluation, Knowledge Graphs, Question Answering Systems, Large Language Models, Benchmark Construction

TL;DR¶

Proposes the CAQA benchmark, which leverages knowledge graphs to automatically generate a large-scale QA attribution evaluation dataset (161K samples) containing four attribution categories (Supporting, Partially Supporting, Contradictory, Irrelevant) and four reasoning complexity levels. Systematically evaluates 25 automatic attribution evaluators, revealing that "partially supporting" identification and complex reasoning scenarios are the core bottlenecks of current evaluators.

Background & Motivation¶

Core Problem: Attributed Question Answering (AQA) aims to alleviate hallucinations by prompting models to provide citation evidence while generating answers. However, even state-of-the-art systems like Bing Chat and Perplexity still frequently produce erroneous attributions, creating an urgent need for reliable automatic attribution evaluation methods.

Three Major Flaws of Existing Benchmarks:

Flaw	Specific Manifestation	Representative Benchmarks
Incomplete Attribution Categories	Most only distinguish binary "Support/Non-support", while a few include "Partially Support" but are small-scale and rely on manual labor	HAGRID (2.6K), ExpertQA (2.2K)
Neglecting Attribution Complexity	Ignores complex scenarios requiring multiple pieces of evidence and multi-step reasoning to verify the answer	ALCE (800 samples)
Dependence on Manual Annotation	Manual annotation is costly and inefficient, making it difficult to scale to a large size	AttrEval-Gen (242 samples)

Key Observation: By analyzing the outputs of real-world AQA systems, the authors find that erroneous attributions can be subdivided into three categories: partially supporting (evidence lacks certain facts), contradictory (evidence conflicts with the answer), and irrelevant (evidence is unrelated to the answer). Moreover, real-world scenarios often require logical reasoning such as union, intersection, and concatenation across multiple pieces of evidence—dimensions that are entirely missing in existing benchmarks.

Core Idea: Utilizing the structured facts of Knowledge Graphs (KGs) and pre-existing query-answer pairs in KGQA datasets, the authors automatically generate four attribution categories via subgraph editing strategies, and introduce four levels of reasoning complexity via query expansion. This constructs a large-scale, annotation-free attribution evaluation benchmark.

Method¶

Overall Architecture¶

The construction process of CAQA consists of four steps: (1) collecting basic logical queries from KGQA datasets; (2) extending query complexity using intersection/union operations; (3) grounding the queries in the KG and generating four types of attributions through subgraph editing; (4) converting structured subgraphs into natural language citation texts using ChatGPT.

Key Designs¶

1. Strategy for Generating Four Attribution Categories Based on KG Subgraph Editing

Three types of basic queries (single triple, path, tree-like) are collected from KGQA datasets (GrailQA, WebQuestionsSP). After expansion, each query is grounded in the Freebase KG to obtain the complete subgraph \(\mathcal{G}\), which serves as the supporting attribution. Negative attributions are then generated with three editing strategies. Partially Supporting: part of the subgraph is deleted (a random triple for path queries, a whole path for tree-like queries) so that the evidence is incomplete. Contradictory: the answer entity is replaced with a non-answer entity of the same type, so that the reasoning result conflicts with the answer. Irrelevant: a structurally similar subgraph is selected from the KG that keeps only the subject entity of the original and is otherwise unrelated.
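The three editing strategies can be illustrated on a toy subgraph. This is a minimal sketch, not the authors' implementation: the triple representation, the function names, the entity names (`game_A`, `studio_X`, and so on), and the use of "same number of triples" as a stand-in for "structurally similar" are all assumptions made for illustration. It covers only the triple-deletion variant of Partially Supporting, not path deletion for tree-like queries.

```python
import random
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object); hypothetical encoding


def entities(subgraph: List[Triple]) -> Set[str]:
    """All entities that appear as subject or object in the subgraph."""
    return {e for s, _, o in subgraph for e in (s, o)}


def partially_supporting(subgraph: List[Triple], rng: random.Random) -> List[Triple]:
    """Delete one randomly chosen triple so the evidence becomes incomplete."""
    if len(subgraph) < 2:
        raise ValueError("need at least two triples to leave partial evidence")
    drop = rng.randrange(len(subgraph))
    return [t for i, t in enumerate(subgraph) if i != drop]


def contradictory(subgraph: List[Triple], answer: str, substitute: str) -> List[Triple]:
    """Replace the answer entity with a same-type non-answer entity."""
    def swap(entity: str) -> str:
        return substitute if entity == answer else entity
    return [(swap(s), r, swap(o)) for s, r, o in subgraph]


def irrelevant(subgraph: List[Triple], pool: List[List[Triple]],
               subject: str) -> Optional[List[Triple]]:
    """Pick a same-sized candidate that shares only the subject entity."""
    original = entities(subgraph)
    for candidate in pool:
        same_shape = len(candidate) == len(subgraph)  # crude structural check
        if same_shape and entities(candidate) & original == {subject}:
            return candidate
    return None


# Toy supporting subgraph for "Who founded the studio that developed game_A?"
G = [("game_A", "developed_by", "studio_X"), ("studio_X", "founded_by", "person_P")]
POOL = [
    [("game_A", "published_by", "studio_X"), ("studio_X", "located_in", "city_C")],
    [("game_A", "released_on", "platform_Z"), ("platform_Z", "made_by", "company_W")],
]

partial = partially_supporting(G, random.Random(0))
contra = contradictory(G, answer="person_P", substitute="person_Q")
irrel = irrelevant(G, POOL, subject="game_A")
```

The first pool candidate is rejected because it still mentions `studio_X`, which would leak part of the supporting evidence; only the second shares nothing but the subject entity.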

2. Four-Level Reasoning Complexity Based on Query Expansion

Four levels of attribution complexity are defined to decouple the reasoning capability of the evaluator: Single (verified by a single citation), Union (answer derived from the union of multiple independent citations), Intersection (answer derived from the intersection of multiple citations sharing entities), and Concatenation (answer derived from chained citations). The query expansion rules are: single-triple queries use union expansion (retrieving entities with the same name to generate a union query); path queries and tree-like queries use intersection expansion (appending new constraint triples or target constraints); each operation is applied with equal probability.
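One way to read the complexity levels is as different set operations over what each citation supports. The sketch below is an illustrative model under that assumption; the function names and the toy entities are hypothetical and do not come from the benchmark. Single is the degenerate case where one citation's set is already the answer.

```python
from typing import Dict, List, Set


def union_answers(per_citation: List[Set[str]]) -> Set[str]:
    """Union: each independent citation contributes part of the answer set."""
    return set().union(*per_citation)


def intersection_answers(per_citation: List[Set[str]]) -> Set[str]:
    """Intersection: the answer must satisfy the constraint in every citation."""
    if not per_citation:
        return set()
    return set.intersection(*per_citation)


def concatenation_answers(start: Set[str], hops: List[Dict[str, Set[str]]]) -> Set[str]:
    """Concatenation: follow chained citations, one hop per citation."""
    frontier = set(start)
    for hop in hops:
        frontier = set().union(*(hop.get(entity, set()) for entity in frontier))
    return frontier


# Union: two same-named films, each citation names one director.
union_result = union_answers([{"director_1"}, {"director_2"}])

# Intersection: games by a studio, restricted to games on a platform.
intersection_result = intersection_answers([{"game_1", "game_2"}, {"game_2", "game_3"}])

# Concatenation: game -> developing studio -> founder.
concat_result = concatenation_answers(
    {"game_1"},
    [{"game_1": {"studio_1"}}, {"studio_1": {"person_1"}}],
)
```

Under this reading, an evaluator has to check every citation for Union and Intersection, and has to track an intermediate entity for Concatenation, which is consistent with the scores dropping on those levels.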

3. Automated Data Generation Pipeline

ChatGPT is employed to execute three conversions: transforming the edited KG subgraphs into natural language citation texts, translating the extended logical queries into natural language questions, and rewriting the answer entities into complete answer statements. Each ultimately generated sample contains five fields: question \(q\), answer statement \(\tilde{a}\), citation text \(c\), attribution category label \(t\), and complexity label \(r\), achieving end-to-end automation from structured knowledge to natural language attribution evaluation data.
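The five fields map naturally onto a small record type. The following is a sketch of one possible in-memory representation; the class names, field names, label strings, and the example text are assumptions for illustration and may not match the released data format.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Attribution(Enum):
    SUPPORTING = "supporting"
    PARTIALLY_SUPPORTING = "partially supporting"
    CONTRADICTORY = "contradictory"
    IRRELEVANT = "irrelevant"


class Complexity(Enum):
    SINGLE = "single"
    UNION = "union"
    INTERSECTION = "intersection"
    CONCATENATION = "concatenation"


@dataclass(frozen=True)
class AttributionSample:
    question: str             # q: natural language question
    answer: str               # a~: full answer statement
    citation: str             # c: citation text verbalized from the subgraph
    attribution: Attribution  # t: attribution category label
    complexity: Complexity    # r: reasoning complexity label


def to_record(sample: AttributionSample) -> dict:
    """Convert a sample to a JSON-serializable dict (enums become strings)."""
    record = asdict(sample)
    record["attribution"] = sample.attribution.value
    record["complexity"] = sample.complexity.value
    return record


# Placeholder content, not taken from the dataset.
example = AttributionSample(
    question="Who founded the studio that developed game_A?",
    answer="The studio that developed game_A was founded by person_P.",
    citation="game_A was developed by studio_X.",
    attribution=Attribution.PARTIALLY_SUPPORTING,
    complexity=Complexity.CONCATENATION,
)
line = json.dumps(to_record(example))
```

Here the citation omits the founding fact, so the chained evidence is incomplete and the label is Partially Supporting rather than Contradictory or Irrelevant.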

Key Experimental Results¶

Dataset Statistics¶

Dimension	Category	Train Set	Test Set	Total
Attribution Category	Support (Sup.)	39,489	6,668	46,157
	Partially Support (Par.)	28,868	5,065	33,933
	Contradictory (Con.)	36,620	6,423	43,043
	Irrelevant (Irr.)	32,234	5,807	38,041
Complexity	Single	73,795	10,443	84,238
	Concatenation	46,783	8,455	55,238
	Union	5,347	886	6,233
	Intersection	11,286	4,179	15,465
Total	—	137,211	23,963	161,174
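As a quick sanity check, the two breakdowns in the table should partition the same samples. The short script below re-adds the figures copied from the table; the variable and function names are illustrative.

```python
# (train, test) counts copied from the statistics table above.
attribution = {
    "Support": (39_489, 6_668),
    "Partially Support": (28_868, 5_065),
    "Contradictory": (36_620, 6_423),
    "Irrelevant": (32_234, 5_807),
}
complexity = {
    "Single": (73_795, 10_443),
    "Concatenation": (46_783, 8_455),
    "Union": (5_347, 886),
    "Intersection": (11_286, 4_179),
}


def split_totals(table: dict) -> tuple:
    """Sum the train and test columns of one breakdown."""
    train = sum(tr for tr, _ in table.values())
    test = sum(te for _, te in table.values())
    return train, test


def shares(table: dict) -> dict:
    """Fraction of all samples (train + test) in each row."""
    grand_total = sum(tr + te for tr, te in table.values())
    return {name: (tr + te) / grand_total for name, (tr, te) in table.items()}


attr_totals = split_totals(attribution)
comp_totals = split_totals(complexity)
complexity_shares = shares(complexity)
```

Both breakdowns sum to 137,211 training and 23,963 test samples. The shares also show how skewed the complexity distribution is: Single accounts for more than half of all samples, while Union accounts for under 4%.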

Main Results¶

Zero-shot F1 Scores by Category:

Evaluator	Support	Partially Support	Contradictory	Irrelevant	Overall
GPT-4	0.771	0.456	0.745	0.473	0.630
GPT-4o	0.769	0.445	0.598	0.626	0.630
Qwen-2.5 (72B)	0.629	0.266	0.701	0.471	0.571
Gemma-2 (27B)	0.653	0.184	0.569	0.646	0.566
LLaMA-3.1 (70B)	0.688	0.168	0.547	0.609	0.544
LLaMA-3.1 (8B)	0.544	0.049	0.130	0.017	0.318
AutoIS (11B)	0.609	—	—	—	—
AttrScore (13B)	0.687	—	0.523	0.541	0.521

Fine-tuned F1 Scores by Category:

Evaluator	Support	Partially Support	Contradictory	Irrelevant	Overall
LLaMA-3 (8B)	0.935	0.901	0.935	0.928	0.926
LLaMA-3.1 (8B)	0.946	0.919	0.944	0.934	0.941
Mistral-v0.3 (7B)	0.944	0.921	0.947	0.935	0.942
Vicuna (13B)	0.942	0.923	0.939	0.923	0.933
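For readers who want to reproduce this style of table on their own predictions, per-category F1 can be computed as sketched below. The tables here do not say how the Overall column is aggregated; the plain mean of GPT-4's four zero-shot category scores is about 0.611 rather than 0.630, so Overall is evidently not a macro average. The `micro_f1` function is therefore an assumption about the aggregation, and the label strings and toy predictions are made up.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ["supporting", "partially supporting", "contradictory", "irrelevant"]


def per_class_f1(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged F1; for single-label classification this equals accuracy."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy predictions: one partially supporting case is mistaken for supporting.
gold = ["supporting", "supporting", "partially supporting",
        "partially supporting", "contradictory", "irrelevant"]
pred = ["supporting", "supporting", "supporting",
        "partially supporting", "contradictory", "contradictory"]

scores = per_class_f1(gold, pred)
overall = micro_f1(gold, pred)
```

In the toy data a single "partially supporting" item mislabeled as "supporting" lowers the F1 of both categories, which is the confusion pattern that matters most for zero-shot evaluators on this benchmark.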

Key Findings¶

Partially Supporting is the most difficult category to identify: Even the strongest GPT-4 achieves only 0.456 F1 under zero-shot settings; evaluators tend to misclassify "Partially Supporting" as "Support".
Keyword co-occurrence leads to misclassification: Evaluators often ignore differences in semantic relations due to keyword overlap (e.g., co-occurrence of "video game" and entity names), misclassifying irrelevant or partially supporting instances as support.
Complex reasoning scenarios pose a greater challenge: GPT-4 scores 0.685 on Single but drops sharply to 0.451 on Concatenation; non-GPT models similarly show significant decreases in union/intersection scenarios.
Few-shot benefits large models but provides limited gain for small models: Models ≥70B and the GPT series improve by 4.84% on average under few-shot settings, whereas small models show almost no improvement or even decline.
High consistency between automatic and manual annotations: The Pearson correlation coefficient between automatically generated categories and manual annotations reaches 0.97.
Out-of-distribution generalization: On the OOD test set ALCE-FineGrained, Vicuna-13B fine-tuned on CAQA achieves an overall F1 of 0.52, outperforming AttrScore's 0.36.

Highlights & Insights¶

The KG subgraph editing pipeline uses KGQA datasets as a structured skeleton for attribution generation, avoiding manual annotation costs while yielding labels that are correct by construction.
The "Partially Supporting" category fills a key gap in existing benchmarks—a large number of errors in real-world systems fall under "incomplete but not contradictory evidence", which existing binary classification benchmarks fail to capture.
The introduction of the complexity dimension decouples attribution evaluation from reasoning complexity for the first time, revealing the fundamental weaknesses of evaluators in multi-step reasoning scenarios.
Fine-tuned 7-8B small models can achieve 90%+ F1, demonstrating that attribution evaluation capabilities can be efficiently learned and do not solely rely on model scale.
The scale of 161K provides the largest training/testing resource to date for attribution evaluation research.

Limitations & Future Work¶

Because CAQA is built on the Freebase KG, it mainly covers factual-knowledge QA, with insufficient coverage of scenarios involving opinions, temporal reasoning, and mathematical reasoning.
The natural language conversion relies on ChatGPT, and the generated citation texts may exhibit templated patterns, presenting a gap with the diversity of real-world web citations.
The partially supporting category cannot be generated for single-triple queries (deleting the only triple makes the evidence irrelevant), leading to insufficient coverage of this category under Single complexity.
Only intersection and union logical operations are used, without considering more complex logical operations such as negation.

Related Work¶

Attributed Question Answering: Menick et al. (2022) train attribution models; Gao et al. (2023) propose the ALCE benchmark; RAG systems enhance attribution through retrieval.
Attribution Evaluation: Automatic evaluators represented by AutoIS (Honovich et al., 2022) and AttrScore (Yue et al., 2023); benchmarks such as HAGRID, ExpertQA, AttributionBench.
Knowledge Graph Question Answering: GrailQA and WebQuestionsSP provide structured query-answer pairs.
Hallucination Detection: FActScore (Min et al., 2023) proposes a sub-fact level evaluation framework.

Rating¶

Novelty: ★★★★☆ — First benchmark to combine KG with automatic generation of four attribution categories + four levels of complexity, with significant methodological innovation.
Technical Depth: ★★★★☆ — Query expansion and subgraph editing strategies are delicately designed, with rigorous and complete logic.
Experimental Thoroughness: ★★★★★ — 25 evaluators, three settings, OOD tests, and manual consistency verification, making it extremely comprehensive.
Value: ★★★★☆ — The 161K dataset provides a crucial resource for attribution evaluation research, and the fine-tuning scheme can be directly deployed.