Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

Conference: ICML 2026
arXiv: 2604.17487
Code: Not provided
Area: LLM Agent / Uncertainty Calibration
Keywords: claim-level fact-checking, specificity control, selective calibration, over-claiming, long-form factuality

TL;DR

This paper models the problem of "specifying too much detail with insufficient evidence" in agentic systems as claim-level over-claiming. It proposes calibrated CSS (compositional selective specificity): for each atomic claim, a calibrated selector chooses among the precise statement, a coarse-grained backoff, and omission. In the full LongFact experiment, it improves OAU from 0.8460 (no post-processing) to 0.9130 while retaining 0.9381 specificity.

Background & Motivation

Background: Modern LLM agents often do not generate an answer in one go; instead, they retrieve evidence, call tools, aggregate multiple facts, and then deliver the result to the user or downstream modules. Such outputs are naturally composed of many claims: some are strongly supported by evidence, some are only true in a broad sense, and some specific details are not supported by evidence at all.

Limitations of Prior Work: Traditional answer-level uncertainty handling is too coarse. Rejecting the entire answer results in the loss of much useful, supported content. Conversely, providing the answer as-is risks the system being overly certain about details such as dates, numbers, or entity relations. Research in long-form factuality has shown that an answer often contains a mix of correct and incorrect content, so providing a single confidence score for the entire paragraph cannot guide the system on which information to retain.

Key Challenge: The tension between reliability and informativeness is reflected not just in "to answer or not," but in "at what semantic granularity to answer." A coarse-grained version of a claim might be supported by evidence, while the original fine-grained version is not. Deleting the entire claim is too conservative, while retaining the original sentence constitutes over-claiming.

Goal: The authors aim to construct a black-box post-processing layer that does not require retraining the upstream model or modifying the retrieval stack. Instead, it decides the most appropriate semantic precision claim-by-claim based on a fixed draft. It performs three tasks: identifying original claims, generating available coarse-grained backoffs, and using calibration rules to select between fine, coarse, or omit.

Key Insight: The critical observation is that uncertainty can be expressed as local semantic backoff rather than vague phrasing or global rejection. For example, if the evidence supports "the agreement was signed in Geneva" but not the specific year stated in the draft, the system should back off to the former instead of deleting the sentence or retaining the unsupported year.

Core Idea: Use a calibrated claim-level selector to choose the highest precision allowed by the evidence among three levels: "original precise claim / coarse-grained claim / omission," thereby ensuring the agentic system only speaks at a granularity supported by evidence.

Method

The proposed CSS (compositional selective specificity) can be viewed as a semantic precision controller placed after the generator. The inputs are the prompt, available evidence, and a draft answer generated by the upstream LLM; the output is not a re-generated answer but the result of local edits to each claim within the draft.

The focus of this design is not to train a new verifier but to transform existing support scores into deployable selection strategies. The support estimator itself is composed of LLM-based support judgments and lightweight lexical/entity features, fixed during a single run; the paper's true contribution lies in how to use these noisy scores to select the output granularity for claims.

Overall Architecture

The overall workflow is divided into four steps.

The first step is draft generation: the upstream language model generates an initial answer based on the prompt and evidence. All comparison strategies are based on the same set of fixed drafts; thus, the experiment compares post-processing selection strategies rather than differences between generators.

The second step is claim extraction and backoff proposal: the system decomposes the draft into atomic claims \(c_1, \ldots, c_m\). For each original claim \(c_i\), the system generates a coarse-grained version \(\tilde{c}_i\), aiming to retain the central meaning while removing details that might not be supported.

The third step is support scoring: support scores \(s_i^{\mathrm{fine}}\) and \(s_i^{\mathrm{coarse}}\) are estimated for the fine and coarse claims, respectively. Binary support labels \(y_i^{\mathrm{fine}}\) and \(y_i^{\mathrm{coarse}}\) are available during offline evaluation, but these labels are used only for scoring and oracle upper bounds, not provided to the deployable selector.

The fourth step is claimwise selection: the selector takes an action \(\pi_i \in \{\mathrm{fine}, \mathrm{coarse}, \mathrm{omit}\}\) for each claim. If the fine score \(s_i^{\mathrm{fine}}\) clears the fine threshold \(\tau_{\mathrm{fine}}\), the original claim is retained; otherwise, if the coarse score \(s_i^{\mathrm{coarse}}\) clears the coarse threshold \(\tau_{\mathrm{coarse}}\), the coarse-grained version is output; if neither passes, the claim is omitted.
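
The selection rule in the fourth step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the tuple layout, the example scores, the year in the example claim, and the choice of `>=` for ties are all assumptions made for the sketch.

```python
from typing import List, Literal, Tuple

Action = Literal["fine", "coarse", "omit"]

# Hypothetical claim record: (fine_text, coarse_text, s_fine, s_coarse)
Claim = Tuple[str, str, float, float]


def select_action(s_fine: float, s_coarse: float,
                  tau_fine: float, tau_coarse: float) -> Action:
    """Pick the most specific tier whose support score clears its threshold."""
    if s_fine >= tau_fine:      # tie handling (>=) is an assumption
        return "fine"
    if s_coarse >= tau_coarse:
        return "coarse"
    return "omit"


def apply_selection(claims: List[Claim],
                    tau_fine: float, tau_coarse: float) -> List[str]:
    """Return the emitted text per claim; omitted claims produce nothing."""
    emitted = []
    for fine_text, coarse_text, s_fine, s_coarse in claims:
        action = select_action(s_fine, s_coarse, tau_fine, tau_coarse)
        if action == "fine":
            emitted.append(fine_text)
        elif action == "coarse":
            emitted.append(coarse_text)
    return emitted


if __name__ == "__main__":
    # Illustrative draft claim; the year and the scores are made up.
    draft = [
        ("The agreement was signed in Geneva in 1987.",
         "The agreement was signed in Geneva.", 0.42, 0.93),
    ]
    print(apply_selection(draft, tau_fine=0.8, tau_coarse=0.7))
```

With these made-up scores the fine claim fails its threshold and the coarse version passes, so the selector backs off to the Geneva-only statement rather than deleting the sentence.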

Key Designs

  1. Three-tier Semantic Specificity Ladder:

    • Function: Restricts the output space of each claim to fine, coarse, or omit actions, allowing the system to locally reduce semantic precision.
    • Mechanism: Fine is the original detailed claim, coarse is a rewrite removing unstable details, and omit produces no output. The paper characterizes the value of the three tiers using specificity weights: \(w(\mathrm{omit})=0\), \(w(\mathrm{coarse})=\gamma\), and \(w(\mathrm{fine})=1\), with \(\gamma=0.6\) in experiments.
    • Design Motivation: This action space is more granular than "retain/delete" because many errors do not stem from a claim being entirely invalid, but from the original expression being too specific. Coarse backoff allows the system to retain supported core content.
  2. Over-claiming-aware Evaluation Metrics:

    • Function: Uses a set of claim-level metrics to simultaneously measure support precision, information retention, and the cost of unsupported emissions.
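    • Mechanism: The paper reports support precision, specificity retention, supported specificity, and OAU. OAU rewards supported specificity and penalizes claims that are output but unsupported; it is formally approximated as the per-claim average \(\frac{1}{m}\sum_{i=1}^{m}\left[w(\pi_i)\,y_i^{\mathrm{sel}} - e_i\,(1-y_i^{\mathrm{sel}})\right]\), where \(y_i^{\mathrm{sel}}\) is the support label of the version actually selected and \(e_i\) appears to be an emission indicator (1 if claim \(i\) is output, 0 if it is omitted).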
    • Design Motivation: If only precision is considered, the system might achieve a high score through excessive deletion or rejection; if only retention is considered, the system might preserve unsupported details. OAU places "useful retention" and "erroneous over-claiming" into a single objective for comparison.
  3. Calibration Threshold Selection under Clopper-Pearson Constraints:

    • Function: Converts noisy support scores into deployable fine/coarse thresholds instead of using manually fixed thresholds.
    • Mechanism: Threshold pairs \((\tau_{\mathrm{fine}}, \tau_{\mathrm{coarse}})\) are enumerated on a held-out calibration split. For each pair, the number of unsupported emitted claims \(k\) and total emitted claims \(n\) in the calibration set are counted to calculate a one-sided Clopper-Pearson upper confidence bound. Only threshold pairs satisfying \(\mathrm{CPUpper}(k,n;\delta) \leq \alpha\) are retained; experiments use \(\alpha=0.10\) and \(\delta=0.05\). The pair with the highest calibration OAU is selected from the valid candidates.
    • Design Motivation: Fixed thresholds can be too conservative, especially when score distributions change with datasets, models, or runs. Calibration allows the selector to automatically move to a higher utility operating point while controlling the risk of unsupported emissions.
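
The calibration procedure described in the third design can be sketched as a grid search with a Clopper-Pearson filter. This is an illustrative sketch under stated assumptions rather than the paper's implementation: the function names, the claim tuple layout, the threshold grid, the unit penalty for unsupported emissions, and the conservative handling of the empty case are all assumptions. The bound is computed by bisection on the binomial CDF using only the standard library; `scipy.stats.beta.ppf(1 - delta, k + 1, n - k)` would give the same one-sided bound.

```python
from itertools import product
from math import exp, lgamma, log, log1p

GAMMA = 0.6  # coarse specificity weight reported in the text
WEIGHT = {"fine": 1.0, "coarse": GAMMA, "omit": 0.0}


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed in log space for stability."""
    return sum(
        exp(lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
            + j * log(p) + (n - j) * log1p(-p))
        for j in range(k + 1)
    )


def cp_upper(k: int, n: int, delta: float) -> float:
    """One-sided Clopper-Pearson upper bound on the unsupported-emission rate."""
    if n == 0 or k >= n:
        return 1.0  # assumption: treat the degenerate cases conservatively
    lo, hi = k / n, 1.0
    for _ in range(60):  # bisection: the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def act(s_fine, s_coarse, t_fine, t_coarse):
    if s_fine >= t_fine:
        return "fine"
    return "coarse" if s_coarse >= t_coarse else "omit"


def calibrate(claims, grid, alpha=0.10, delta=0.05):
    """claims: list of (s_fine, s_coarse, y_fine, y_coarse), labels in {0, 1}.

    Returns (calibration_oau, tau_fine, tau_coarse) or None if no pair is valid.
    """
    best = None
    for t_fine, t_coarse in product(grid, repeat=2):
        k = n = 0
        utility = 0.0
        for s_f, s_c, y_f, y_c in claims:
            a = act(s_f, s_c, t_fine, t_coarse)
            if a == "omit":
                continue
            y = y_f if a == "fine" else y_c
            n += 1            # emitted claims
            k += 1 - y        # emitted but unsupported
            utility += WEIGHT[a] * y - (1 - y)  # unit penalty is an assumption
        if cp_upper(k, n, delta) > alpha:
            continue          # risk constraint violated on the calibration split
        oau = utility / len(claims)
        if best is None or oau > best[0]:
            best = (oau, t_fine, t_coarse)
    return best
```

Because the same split both filters the pairs and picks the OAU maximizer, the returned thresholds carry the conservative-rule caveat the paper itself raises, not a full distribution-free guarantee.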

Loss & Training

CSS is not an end-to-end training method, so it has no conventional training loss. Its fitting happens at two levels: first, the support scorer is fitted once per run and then frozen; second, calibrated CSS selects a threshold pair on the calibration split. The full LongFact experiment uses five-fold out-of-fold evaluation: each prompt is scored only when its fold is held out, with scorer fitting and threshold calibration performed on the remaining four folds.
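
The out-of-fold protocol can be sketched as follows. The strided fold assignment and the function name are assumptions for illustration; the summary does not say how prompts are assigned to folds or how the remaining folds are divided between scorer fitting and threshold calibration, so the sketch leaves that division to the caller.

```python
from typing import Iterator, List, Sequence, Tuple


def out_of_fold_splits(
    prompt_ids: Sequence[int], n_folds: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (remaining, held_out) so every prompt is held out exactly once."""
    # Assumption: simple strided assignment of prompts to folds.
    folds = [list(prompt_ids[i::n_folds]) for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        remaining = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield remaining, held_out


if __name__ == "__main__":
    for remaining, held_out in out_of_fold_splits(list(range(2280))):
        # Fit the scorer and calibrate thresholds on `remaining`,
        # then evaluate the selector on `held_out`.
        print(len(remaining), len(held_out))
```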

The paper also emphasizes that this calibration step is not a full conformal guarantee. Because the same calibration split is used both to filter threshold pairs and to maximize OAU, the authors refer to it as a conservative calibration rule rather than an end-to-end distribution-free guarantee.

Key Experimental Results

Main Results

The main experiment uses the full LongFact set of 2,280 prompts, extracting a total of 11,705 claims. All methods are compared on the same set of GPT-5.4 fixed drafts. Metrics are calculated according to a claimwise protocol rather than the official SAFE/F1@K leaderboard metrics for LongFact.

| Dataset / Run | Strategy | Samples | Output claims / Total claims | Support precision | Specificity retention | Supported specificity | OAU |
|---|---|---|---|---|---|---|---|
| LongFact full / GPT-5.4 | No CSS | 2280 | 11705 / 11705 | 0.9230 | 1.0000 | 0.9230 | 0.8460 |
| LongFact full / GPT-5.4 | Whole abstain | 2280 | 7365 / 11705 | 0.9825 | 0.6292 | 0.6182 | 0.6072 |
| LongFact full / GPT-5.4 | Claim-drop | 2280 | 10823 / 11705 | 0.9877 | 0.9246 | 0.9133 | 0.9019 |
| LongFact full / GPT-5.4 | Uncalibrated CSS | 2280 | 10948 / 11705 | 0.9934 | 0.8633 | 0.8583 | 0.8521 |
| LongFact full / GPT-5.4 | Calibrated CSS | 2280 | 11085 / 11705 | 0.9865 | 0.9381 | 0.9258 | 0.9130 |
| LongFact full / GPT-5.4 | Oracle CSS | 2280 | 11056 / 11705 | 1.0000 | 0.9359 | 0.9359 | 0.9359 |

The core conclusion of this table is that calibrated CSS does not simply pursue the highest precision. Compared to No CSS, it raises support precision from 0.9230 to 0.9865 and OAU from 0.8460 to 0.9130. Compared to uncalibrated CSS, it sacrifices a small amount of precision but raises specificity retention from 0.8633 to 0.9381, resulting in a 0.0609 higher final OAU.
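
As a consistency check on the table, OAU can be reconstructed from the other reported columns. The sketch below assumes a unit penalty per unsupported emitted claim, so that OAU equals supported specificity minus the fraction of all claims that were emitted without support; that reading is an assumption, but it reproduces every reported OAU to within rounding. The function name is hypothetical and all numbers are copied from the table.

```python
TOTAL_CLAIMS = 11705

# strategy: (emitted claims, support precision, supported specificity, reported OAU)
ROWS = {
    "No CSS":           (11705, 0.9230, 0.9230, 0.8460),
    "Whole abstain":    (7365,  0.9825, 0.6182, 0.6072),
    "Claim-drop":       (10823, 0.9877, 0.9133, 0.9019),
    "Uncalibrated CSS": (10948, 0.9934, 0.8583, 0.8521),
    "Calibrated CSS":   (11085, 0.9865, 0.9258, 0.9130),
    "Oracle CSS":       (11056, 1.0000, 0.9359, 0.9359),
}


def oau_from_aggregates(emitted: int, total: int,
                        precision: float, supported_specificity: float) -> float:
    """Supported specificity minus the rate of unsupported emissions (unit penalty)."""
    unsupported_rate = emitted * (1.0 - precision) / total
    return supported_specificity - unsupported_rate


if __name__ == "__main__":
    for name, (emitted, precision, supported, reported) in ROWS.items():
        rebuilt = oau_from_aggregates(emitted, TOTAL_CLAIMS, precision, supported)
        print(f"{name:18s} reconstructed={rebuilt:.4f} reported={reported:.4f}")
```

For the No CSS row this reduces to 0.9230 - 0.0770 = 0.8460, which shows why leaving the draft untouched scores below claim-drop and calibrated CSS despite its perfect retention.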

Ablation Study

The key ablation in the paper compares fixed-threshold (uncalibrated) CSS with calibrated CSS on three pilot runs: a LongFact pilot with GPT-5.4, and HotpotQA pilots with GPT-5.4 and Claude Sonnet 4.6. Each pilot uses 200 samples, divided into 30 fit, 30 calibration, and 140 test samples.

| Dataset / Model | Strategy | Test samples | Output claims / Total claims | Support precision | Specificity retention | Supported specificity | OAU |
|---|---|---|---|---|---|---|---|
| LongFact pilot / GPT-5.4 | Uncalibrated CSS | 140 | 721 / 757 | 0.9958 | 0.5900 | 0.5876 | 0.5836 |
| LongFact pilot / GPT-5.4 | Calibrated CSS | 140 | 724 / 757 | 0.9931 | 0.9411 | 0.9355 | 0.9289 |
| HotpotQA pilot / GPT-5.4 | Uncalibrated CSS | 140 | 409 / 470 | 0.9829 | 0.6209 | 0.6111 | 0.5962 |
| HotpotQA pilot / GPT-5.4 | Calibrated CSS | 140 | 429 / 470 | 0.9790 | 0.9000 | 0.8809 | 0.8617 |
| HotpotQA pilot / Claude Sonnet 4.6 | Uncalibrated CSS | 140 | 622 / 648 | 0.9936 | 0.7191 | 0.7142 | 0.7080 |
| HotpotQA pilot / Claude Sonnet 4.6 | Calibrated CSS | 140 | 628 / 648 | 0.9904 | 0.9475 | 0.9395 | 0.9302 |

Key Findings

  • Uncalibrated (fixed-threshold) CSS is very safe but too conservative across the pilots: support precision is 0.98 or higher, but specificity retention is low (0.5900 on the LongFact pilot, 0.6209 on the HotpotQA GPT-5.4 pilot, and 0.7191 on the HotpotQA Claude Sonnet 4.6 pilot).
  • The gains from calibration come primarily from selecting an operating point rather than changing the verifier. Across three pilots, calibrated CSS only loses about 0.0027 to 0.0039 in precision while substantially increasing retention and OAU.
  • Claim-drop is a strong baseline, but it lacks coarse backoff: it can only delete fine claims that fall below the threshold. On LongFact full, calibrated CSS beats claim-drop on specificity retention (0.9381 vs. 0.9246) and OAU (0.9130 vs. 0.9019) at nearly the same support precision (0.9865 vs. 0.9877), indicating that local precision reduction retains more useful information than deleting the claim outright.
  • The full-run OAU is 0.9359 for Oracle CSS and 0.9130 for calibrated CSS, a gap of 0.0229. The gap is small, suggesting the current selection layer captures most of the available gain. Further improvements likely depend on stronger claim extraction, backoff generation, and support estimation.

Highlights & Insights

  • The paper reframes the "hallucination" problem as an "over-claiming" problem, which is a practical perspective. Many agent outputs are not entirely wrong but provide excessively specific dates, numbers, or relations when evidence is insufficient; CSS specifically targets this gray area.
  • Coarse backoff is a more suitable uncertainty interface for agentic pipelines than abstention. Instead of simply saying "I don't know," it explicitly states what the system does know, allowing downstream modules to decide whether to re-retrieve, upgrade the verifier, or hand the task to a human.
  • The design of OAU avoids "false victories" of high precision with no content. Both whole-answer abstention and uncalibrated CSS reduce errors, but only calibrated CSS maintains high support precision while retaining useful specificity.
  • The most transferable insight is "transforming uncertainty into local editing strategies." Similar ideas could be applied to code generation for backing off uncertain API details, diagnostic level control in medical reports, or granularity control for time, place, and entity relations in multi-hop QA.

Limitations & Future Work

  • The full LongFact experiments use a claimwise protocol defined by the authors rather than the official SAFE/F1@K pipeline. Thus, these numbers illustrate claim-level post-processing effects and should not be directly compared with LongFact leaderboard scores.
  • The HotpotQA and LongFact pilots only have 200 samples each (140 for testing). While they demonstrate calibration trends, they are insufficient to cover complex retrieval failures, tool call failures, and multi-turn agent interactions.
  • Oracle CSS represents an upper bound, not a deployable method. Real systems will still inherit errors from claim extraction, coarse backoff generation, and support estimation; if the backoff itself contains distortions, even a calibrated selector cannot guarantee semantic correctness.
  • The calibration rules currently lack a full distribution-free guarantee. The paper uses the Clopper-Pearson upper bound to constrain unsupported emissions, but the same calibration set is also used to select the threshold pair for maximum OAU, necessitating a more rigorous finite-sample validity layer in the future.
  • The current method primarily handles post-processing of fixed drafts and has not yet explored continuous monitoring or sequential agent scenarios. Real agents might re-retrieve, re-call tools, or adjust task plans based on downgraded claims, introducing new feedback loops.

Related Work

  • vs. selective prediction / conformal prediction: Traditional selective prediction often accepts or rejects a whole sample; conformal prediction provides distribution-free uncertainty sets. This paper draws on these risk control ideas but applies them to the semantic granularity of each claim within an answer.
  • vs. FActScore / LongFact: FActScore and LongFact focus on factuality evaluation of atomic facts in long-form output; CSS goes a step further by asking "how should the output be edited if a fact is not fully supported," moving from evaluation to an executable control layer.
  • vs. SelfCheckGPT / Chain-of-Verification: These methods emphasize posterior verification and hallucination reduction, typically outputting verifier judgments or rewrite prompts; this paper explicitly links verifier scores to fine/coarse/omit decisions, forming a more structured output strategy.
  • vs. Conformal Linguistic Calibration / Selective Abstraction: These works also focus on the trade-off between factuality and specificity; this paper emphasizes agentic deployment, treating output as a claim-level uncertainty interface for downstream auditing, upgrading, and re-retrieval.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Explicitly modeling over-claiming as claim-level specificity control and combining coarse backoff with calibrated selection provides a very clear problem definition.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ The full 2,280 LongFact prompts plus multiple pilots support the main conclusions, though pilot sizes are small and the evaluation protocol differs from official SAFE/F1@K.
  • Writing Quality: ⭐⭐⭐⭐☆ The paper structure is clear, and the explanation of metrics and baselines is thorough, particularly the careful distinction between deployable selectors and oracle ceilings.
  • Value: ⭐⭐⭐⭐⭐ Highly instructive for practical agent systems as it provides a reliability interface that is more granular than rejection and safer than as-is responses.