Towards Effective Extraction and Evaluation of Factual Claims¶

Conference: ACL 2025
arXiv: 2502.10855
Code: None
Area: LLM Safety
Keywords: Factual Claim Extraction, Fact-Checking, Evaluation Framework, Decontextualization, LLM-Generated Content Validation

TL;DR¶

Proposes a standardized framework for evaluating factual claim extraction, with automated metrics for coverage and decontextualization, and develops Claimify, an LLM-based method that handles ambiguity explicitly and extracts a claim only when it has high confidence in the correct interpretation. Claimify outperforms existing methods under this framework.

Background & Motivation¶

Background: With the ubiquity of long-form text generated by LLMs, fact-checking has become increasingly critical. The dominant strategy currently decomposes long-form text into simple, independently verifiable claims, which are then verified individually. This pipeline of "extracting claims first and then verifying them one by one" has become the fundamental paradigm of fact-checking.

Limitations of Prior Work: The quality of claim extraction directly determines the effectiveness of fact-checking: if the extracted claims are inaccurate or incomplete, the downstream verification results suffer. However, no standardized evaluation framework exists for systematically measuring and comparing claim extraction methods. Each method is evaluated by its own standards, which makes fair comparison difficult.

Key Challenge: While seemingly simple, claim extraction involves multiple quality dimensions: Do the extracted claims cover all key information in the source text (coverage)? Is each claim sufficiently atomic and independently verifiable (atomicity)? Are the claims still understandable when removed from the original context (decontextualization)? Trade-offs may exist among these dimensions, and automated measurement methods are currently lacking.

Goal: (1) Establish a standardized evaluation framework for claim extraction covering key quality dimensions; (2) Propose automated, scalable, and reproducible evaluation methods; (3) Develop a claim extraction method capable of handling ambiguity effectively.

Key Insight: The authors observe that prior evaluations of claim extraction methods often neglect coverage and decontextualization, lacking automated metrics for both. Systematic improvement of the entire pipeline can be achieved by starting from well-defined quality dimensions.

Core Idea: Propose automated evaluation methods for coverage and decontextualization, and design Claimify—a method that actively handles ambiguity during extraction and outputs claims only under high confidence, fundamentally improving claim quality.

Method¶

Overall Architecture¶

The contributions of this work are twofold: (1) An evaluation framework that defines quality dimensions for claim extraction in fact-checking contexts and introduces automated evaluation methods; (2) Claimify, an LLM-based claim extraction method guided by carefully designed prompts that extracts claims only under high confidence. The overall workflow takes a long text as input and uses Claimify to decompose it into a list of simple, self-contained, independently verifiable factual claims; the evaluation framework then measures the quality of that extraction.

Key Designs¶

Multi-dimensional Evaluation Framework:
- Function: Systematically evaluate the quality of claim extraction
- Mechanism: Defines several key quality dimensions: coverage measures whether the extracted claims cover all verifiable information from the source text; atomicity measures whether each claim is sufficiently simple to contain only one verifiable fact; decontextualization measures whether a claim remains understandable and unambiguous when detached from the original text; minimality measures whether claims are concise and free of redundant information. The core innovation of this framework lies in the automated measurement of coverage and decontextualization. Coverage is measured by an LLM judging whether information in the original text is captured by the set of claims, while decontextualization is assessed by checking for unresolved pronouns or implicit contextual dependencies in the claims.
- Design Motivation: Existing evaluation methods either rely on manual annotation (not scalable) or focus on a single dimension. A standardized framework enables fair comparisons across different methods.
Claimify Claim Extraction Method:
- Function: Extract high-quality factual claims from long text
- Mechanism: Claimify guides an LLM through a carefully designed multi-step prompt. Its key characteristics are: (a) Explicit ambiguity handling: when the source text is vaguely phrased or admits multiple interpretations, Claimify does not force a claim and outputs one only when the correct interpretation can be determined with high confidence; (b) Decontextualization: each extracted claim is supplemented with the background it needs to be understood in isolation (e.g., the pronouns in "He worked there for five years" are resolved to concrete entities); (c) Atomization: each claim is reduced to a single verifiable factual point.
- Design Motivation: Existing methods often extract claims too aggressively in pursuit of high coverage, which produces incorrect claims when the source is ambiguous. By preferring to omit a claim rather than risk an incorrect one, Claimify significantly improves claim accuracy while maintaining relatively high coverage.
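The three characteristics described for Claimify (ambiguity handling, decontextualization, atomization) can be pictured as a small pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable type, the `UNRESOLVED` sentinel, the prompt wording, and the canned `fake_llm` stand-in (including the names "John Smith" and "Acme Corp") are all hypothetical and exist only so the example runs offline.

```python
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

UNRESOLVED = "UNRESOLVED"  # sentinel chosen for this sketch, not taken from the paper


def extract_claims(sentence: str, context: str, llm: LLM) -> List[str]:
    """Return atomic, standalone claims for one sentence, or [] if it stays ambiguous."""
    # (a) Ambiguity handling: ask for one confident reading, or the sentinel.
    resolved = llm(
        "DISAMBIGUATE\n"
        f"Context: {context}\nSentence: {sentence}\n"
        f"Rewrite the sentence with its one clearly supported interpretation, "
        f"or answer {UNRESOLVED} if the context does not settle it."
    ).strip()
    if resolved == UNRESOLVED:
        return []  # prefer omitting a claim over guessing

    # (b) Decontextualization: resolve pronouns and vague references.
    standalone = llm(
        "DECONTEXTUALIZE\n"
        f"Context: {context}\nSentence: {resolved}\n"
        "Rewrite so the sentence is understandable without the context."
    ).strip()

    # (c) Atomization: one verifiable fact per output line.
    atomic = llm(
        "ATOMIZE\n"
        f"Sentence: {standalone}\n"
        "List each single verifiable fact on its own line."
    )
    return [line.strip() for line in atomic.splitlines() if line.strip()]


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model; the responses are illustrative only."""
    task = prompt.split("\n", 1)[0]
    if task == "DISAMBIGUATE":
        if "Sentence: He worked there for five years" in prompt:
            return "He worked there for five years and led the research team."
        return UNRESOLVED
    if task == "DECONTEXTUALIZE":
        return "John Smith worked at Acme Corp for five years and led the research team."
    return (
        "John Smith worked at Acme Corp for five years.\n"
        "John Smith led the research team at Acme Corp."
    )


claims = extract_claims(
    "He worked there for five years and led the research team.",
    "John Smith joined Acme Corp in 2010.",
    fake_llm,
)
skipped = extract_claims("It changed everything.", "", fake_llm)
```

With the canned responses, `claims` holds two standalone claims and `skipped` is empty, which mirrors the policy of emitting nothing when the interpretation cannot be settled. In practice `fake_llm` would be replaced by a call to a real model, and the prompts would need far more care than shown here.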
Automated Coverage and Decontextualization Measurement:
- Function: Evaluate claim quality automatically in a scalable and reproducible manner
- Mechanism: Coverage is measured using an LLM-as-judge setup, where the judge model is provided with the original text and the extracted claims to determine if each verifiable information point in the source is covered by at least one claim. Decontextualization evaluates whether each claim contains elements requiring context, such as unresolved pronouns or vague temporal/spatial references. Neither method requires manual annotation, enabling automated large-scale applications.
- Design Motivation: While accurate, human evaluation does not scale. Automated evaluation methods make large-scale comparative experiments possible, while validation against human evaluations ensures reliability.
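To make the two automated measurements concrete, here is a rough sketch of how the scores might be aggregated. It rests on several assumptions that are not from the paper: the source text is already split into "information units", the `Judge` callable stands in for an LLM-as-judge call, and the decontextualization check is a crude lexical proxy (a pronoun and demonstrative list) rather than a model-based judgment.

```python
import re
from typing import Callable, Sequence

# Hypothetical judge signature: is `unit` captured by at least one of `claims`?
Judge = Callable[[str, Sequence[str]], bool]


def coverage(units: Sequence[str], claims: Sequence[str], judge: Judge) -> float:
    """Fraction of source information units the judge marks as covered."""
    if not units:
        return 0.0  # nothing to cover; reported as 0.0 by convention in this sketch
    return sum(judge(unit, claims) for unit in units) / len(units)


# Crude lexical proxy for context dependence; a real setup would ask a model.
CONTEXT_DEPENDENT = re.compile(
    r"\b(he|she|it|they|him|her|them|his|its|their|this|these|those)\b",
    re.IGNORECASE,
)


def needs_context(claim: str) -> bool:
    """True if the claim contains a word that usually points outside the claim."""
    return CONTEXT_DEPENDENT.search(claim) is not None


def decontextualization_rate(claims: Sequence[str]) -> float:
    """Fraction of claims that pass the proxy check (no flagged reference)."""
    if not claims:
        return 0.0
    return sum(not needs_context(claim) for claim in claims) / len(claims)


def substring_judge(unit: str, claims: Sequence[str]) -> bool:
    """Toy judge for the demo: a unit counts as covered if a claim contains it."""
    return any(unit.lower() in claim.lower() for claim in claims)


demo_units = ["acme corp", "five years", "research team"]
demo_claims = ["John Smith worked at Acme Corp for five years.", "He led it."]
demo_coverage = coverage(demo_units, demo_claims, substring_judge)
demo_decontext = decontextualization_rate(demo_claims)
```

In this toy run two of three units are covered, and one of the two claims ("He led it.") is flagged as context-dependent. The word list will misfire on legitimate uses such as "this year", which is one reason a model-based judge validated against human labels is preferable to a lexical check.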

Loss & Training¶

Claimify is implemented via in-context learning with LLMs, involving no additional training or loss function design. Instead, it relies on carefully crafted prompt engineering to guide the model's behavior.

Key Experimental Results¶

Main Results¶

Method	Coverage	Atomicity	Decontextualization	Minimality	Overall
Claimify	Highest	Highest	Highest	High	Best
AFV (Automated Fact Verification)	Medium	Medium	Low	Medium	Medium
SAFE	Higher	Higher	Medium	High	Better
Baseline (Sentence Splitting)	Low	High	Low	High	Worse

Ablation Study¶

Configuration	Coverage	Decontextualization	Note
Claimify (Full)	Highest	Highest	Full method
w/o Ambiguity Handling	Similar	Decreased	Removing ambiguity handling leads to more errors in claims
w/o Decontextualization Step	Similar	Significantly Decreased	Claims retain a large number of unresolved references
w/o Atomization	Decreased	Similar	Compound claims reduce verifiability

Key Findings¶

Claimify outperforms existing methods across all evaluation dimensions. Its advantage is largest in decontextualization, which points to the effectiveness of its ambiguity handling mechanism.
There is a subtle trade-off between coverage and decontextualization: overly aggressive claim extraction can increase coverage but introduces more ambiguous claims.
The high consistency between automated and human evaluation validates the reliability of the framework.
Differences in atomicity and minimality are minor across methods, suggesting these two dimensions are relatively easier to satisfy.
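The summary does not say which statistic was used to compare automated and human judgments. As one illustration, agreement between two sets of labels is often reported with Cohen's kappa, which corrects raw agreement for chance. The sketch below assumes binary or categorical labels per item; the label lists are made-up placeholders, not results from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the raters labeled independently at their own rates.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Placeholder labels (1 = "covered", 0 = "not covered"); illustrative only.
auto_labels = [1, 1, 0, 1, 0, 1, 1, 0]
human_labels = [1, 1, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(auto_labels, human_labels)
```

Here the raters agree on 7 of 8 items (0.875 observed), chance agreement is 0.5625, and kappa works out to 5/7, roughly 0.71. Raw percent agreement alone would overstate reliability when one label dominates, which is the usual reason to report a chance-corrected figure.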

Highlights & Insights¶

Ambiguity-aware claim extraction is the most prominent highlight of this paper: when uncertain, Claimify declines to extract rather than generating a potentially incorrect claim. This preference for omission over error is highly practical in real-world applications, since an incorrect claim typically does more harm than an omitted one.
Transferability of the framework: The proposed evaluation framework is not only applicable to claim extraction but can also be generalized to evaluate the quality of other information extraction tasks.
Automated measurement resolves a long-standing evaluation bottleneck in the fact-checking domain, enabling large-scale systematic comparisons.

Limitations & Future Work¶

The paper primarily focuses on English contexts; claim extraction and evaluation in multilingual environments remain unexplored.
Claimify relies on LLMs for extraction, which may face knowledge insufficiency when dealing with highly specialized domains (e.g., medicine, law).
The automated evaluation methods themselves rely on an LLM-as-judge setup, potentially inheriting the biases and limitations of the LLMs.
Future work can extend the evaluation framework to multimodal scenarios and explore combining Claimify with retrieval-augmented methods to handle domain-specific content.

Related Work¶

vs SAFE: SAFE also utilizes LLMs for fact-checking, but its claim extraction step is relatively simple and does not handle ambiguity. Claimify is clearly superior in terms of claim quality.
vs AFV (Automated Fact Verification): AFV focuses on the end-to-end fact verification pipeline, where claim extraction is merely one step, without separate optimization of extraction quality.
vs FActScore: FActScore proposes the concept of atomic facts to evaluate the accuracy of LLM-generated content, but its claim decomposition method is relatively straightforward and lacks decontextualization processing.

Rating¶

Novelty: ⭐⭐⭐⭐ The evaluation framework and ambiguity-aware extraction are novel, but the underlying method remains prompt engineering.
Experimental Thoroughness: ⭐⭐⭐⭐ The multi-dimensional evaluation is comprehensive, and the consistency check against human evaluation makes the results more convincing.
Writing Quality: ⭐⭐⭐⭐⭐ Clear problem definitions and rigorous framework logic; high-quality work from Microsoft Research.
Value: ⭐⭐⭐⭐ Provides a standardized evaluation tool for the fact-checking community, with high practical utility.