# VeriTrail: Closed-Domain Hallucination Detection with Traceability
Conference: ICLR 2026 · arXiv: 2505.21786 · Code: None (dataset to be released at https://aka.ms/veritrail-datasets) · Area: LLM Safety · Keywords: hallucination detection, faithfulness evaluation, traceability, multi-step generation, DAG
## TL;DR
This paper proposes VeriTrail, the first closed-domain hallucination detection method designed for processes with multiple generative steps (MGS). By modeling the generation process as a DAG and verifying claims layer by layer along the graph, VeriTrail provides full traceability: hallucination detection, provenance tracking, and error localization. It substantially outperforms all baselines on two newly introduced datasets.
## Background & Motivation
Language models frequently produce content that is unfaithful to the source documents (closed-domain hallucination) when generating grounded outputs. The widespread adoption of pipelines with multiple generative steps (MGS), such as hierarchical summarization and GraphRAG, amplifies this risk: each step may introduce or propagate errors.
Limitations of Prior Work:

1. Existing hallucination detection methods compare the final output directly against the source documents, without distinguishing intermediate outputs from the final output.
2. For processes with a single generative step (SGS) this is acceptable; for MGS, however, merely detecting whether a hallucination exists is insufficient. It is equally important to know where the hallucination was introduced (error localization) and how faithful content was derived (provenance).
3. Naïve approaches that compare each intermediate output individually are computationally infeasible when the number of intermediate outputs is large (>100K), and they cannot handle cases where multiple intermediate outputs jointly support a single claim.
Key Insight: Model the generation process as a directed acyclic graph (DAG) and design a layer-wise backtracking verification algorithm that simultaneously detects hallucinations, constructs evidence chains, and localizes erroneous stages.
## Method

### Overall Architecture
The generation process is modeled as a DAG \(G=(V,E)\):

- Nodes \(v \in V\): text segments (source documents / intermediate outputs / final output)
- Edges \((u,v) \in E\): \(u\) was used as input to generate \(v\)
- Root nodes \(V_0\): source documents; terminal node \(v^*\): final output
For each claim extracted from the final output, VeriTrail independently executes the following iterative procedure, backtracking from the terminal node toward the root nodes.
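A minimal sketch of this graph representation, using illustrative names (`Node`, `Kind`, `src`) rather than anything from the paper's forthcoming release:

```python
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum, auto

class Kind(Enum):
    SOURCE = auto()        # root node: a source document
    INTERMEDIATE = auto()  # an intermediate model output
    FINAL = auto()         # terminal node: the final output

@dataclass
class Node:
    node_id: str
    text: str
    kind: Kind
    sources: list[Node] = field(default_factory=list)  # edge (u, v): u in v.sources

def src(v: Node) -> list[Node]:
    """The source nodes of v, i.e. the inputs used to generate v."""
    return v.sources
```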
### Key Designs
- Sub-Claim Decomposition:
    - Claimify is used to decompose composite claims into independently verifiable sub-claims.
    - Example: "Company X acquired two healthcare startups in 2020" → (1) Company X acquired two startups; (2) the acquisitions occurred in 2020; (3) the acquired startups were healthcare companies.
    - Sub-claims are retained as context for subsequent verification but are not verified directly.
- Evidence Selection (see the first sketch after this list):
    - The source nodes of the terminal node, \(src(v^*)\), are retrieved, segmented into sentences, and assigned unique IDs.
    - An LLM selects the sentence IDs that strongly support or contradict the claim.
    - Critically, the returned IDs must match the programmatically assigned IDs, ensuring that selected sentences cannot themselves be hallucinated.
    - Parallel processing is supported to reduce latency.
- Verdict Generation:
    - Based on the selected evidence sentences, an LLM produces a three-way verdict: Fully Supported / Not Fully Supported / Inconclusive.
    - To avoid redundant context, root nodes retain their full text, while intermediate nodes use the summaries generated during Evidence Selection.
    - When the context limit is exceeded, Evidence Selection is automatically re-run to compress the input.
- Candidate Node Selection & Termination Conditions (see the second sketch after this list):
    - Fully Supported / Inconclusive → continue verifying the source nodes of the evidence nodes.
    - Not Fully Supported → broaden the search (verify all source nodes of previously examined nodes) to reduce false positives.
    - Termination conditions: (1) only verified root nodes remain; (2) no candidate nodes exist; (3) \(q\) consecutive Not Fully Supported verdicts.
    - The hyperparameter \(q\) controls detection strictness: \(q=1\) favors high recall; \(q=3\) favors high precision.
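A minimal sketch of the evidence-selection ID check, assuming a generic `call_llm(prompt) -> str` helper and a naive sentence splitter (both placeholders, not the paper's implementation). The point is the final filter: only IDs that were assigned programmatically survive, so the verifier cannot cite evidence it invented.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter; the paper does not prescribe a specific segmenter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def select_evidence(claim: str, node_texts: list[str], call_llm) -> dict[str, str]:
    # Assign IDs programmatically: "<node index>.<sentence index>".
    numbered = {
        f"{i}.{j}": sent
        for i, text in enumerate(node_texts)
        for j, sent in enumerate(split_sentences(text))
    }
    listing = "\n".join(f"[{sid}] {s}" for sid, s in numbered.items())
    prompt = (
        "Select the IDs of sentences that strongly support or contradict the claim.\n"
        f"Claim: {claim}\nSentences:\n{listing}\nAnswer with comma-separated IDs:"
    )
    raw = call_llm(prompt)  # e.g. "0.3, 1.7"
    # Keep only IDs that match the programmatically assigned ones, so a
    # selected sentence can never itself be hallucinated.
    return {sid.strip(): numbered[sid.strip()]
            for sid in raw.split(",") if sid.strip() in numbered}
```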
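And a hedged sketch of the layer-wise backtracking loop with the \(q\)-based stopping rule. `Node` mirrors the earlier graph sketch, and `select_evidence_nodes` / `judge` are stand-ins for the Evidence Selection and Verdict Generation steps; this is one reading of the algorithm, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    sources: list["Node"] = field(default_factory=list)  # inputs that generated this node

def verify_claim(claim, terminal, select_evidence_nodes, judge, q=3):
    """Backtrack from the terminal node toward the root nodes, one layer at a time."""
    frontier = [terminal]
    streak = 0                      # consecutive "Not Fully Supported" verdicts
    verdict, trail = "Inconclusive", []
    while frontier:
        # Candidates for this round: the source nodes of the current frontier.
        candidates = [s for v in frontier for s in v.sources]
        if not candidates:          # termination: no candidate nodes remain
            break
        evidence = select_evidence_nodes(claim, candidates)  # subset of candidates
        verdict = judge(claim, evidence)  # Fully Supported / Not Fully Supported / Inconclusive
        trail.append((candidates, evidence, verdict))
        if verdict == "Not Fully Supported":
            streak += 1
            if streak >= q:         # termination: q consecutive negative verdicts
                break
            frontier = candidates   # broaden: follow all sources of examined nodes
        else:
            streak = 0
            frontier = evidence     # narrow: follow only the evidence nodes
    return verdict, trail
```

With \(q=1\), a single negative verdict ends the search (favoring recall); with \(q=3\), a hallucination is only declared after three consecutive negative verdicts while the search keeps widening (favoring precision).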
### Traceability Outputs
- Provenance: For Fully Supported claims, a complete evidence chain from the terminal node to the root nodes is returned.
- Error Localization: For Not Fully Supported claims, the DAG stage at which the hallucination was most likely introduced is identified.
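Under the same assumptions, both traceability outputs can be read off the `trail` recorded by the loop sketched above; the depth heuristic for error localization is illustrative, not the paper's exact rule.

```python
def provenance(trail):
    """Evidence chain from the terminal node down to the roots, for a claim
    whose verdicts remained Fully Supported along the way."""
    return [evidence for _, evidence, verdict in trail
            if verdict == "Fully Supported"]

def error_stage(trail):
    """Depth of the first layer where support was lost: a plausible proxy for
    the stage at which the hallucination was introduced (0 = the step that
    produced the final output)."""
    for depth, (_, _, verdict) in enumerate(trail):
        if verdict == "Not Fully Supported":
            return depth
    return None
```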
## Key Experimental Results

### Main Results (Hallucination Detection Performance)
Evaluated on FABLES+ (hierarchical book summarization, 734 claims) and DiverseSumm+ (GraphRAG news QA, 560 claims):
| Method | Macro F1 (FABLES+) | Macro F1 (DiverseSumm+) | Bal. Acc. (FABLES+) | Bal. Acc. (DiverseSumm+) |
|---|---|---|---|---|
| VeriTrail (q=3) | 84.5 | 79.5 | 83.6 | 76.3 |
| VeriTrail (q=1) | 74.0 | 76.6 | 84.6 | 83.0 |
| RAG (k=15) | 69.6 | 75.1 | 76.5 | 74.0 |
| Bespoke-MiniCheck-7B | 62.2 | 72.1 | 69.0 | 69.4 |
| Gemini 1.5 Pro | 61.1 | 49.8 | 60.8 | 57.6 |
| GPT-4.1 Mini | 60.7 | 62.9 | 58.2 | 61.5 |
| AlignScore | 59.6 | 60.4 | 67.5 | 62.7 |
| INFUSE | 40.5 | 20.0 | 59.5 | 50.1 |
- In Macro F1, VeriTrail's best configuration exceeds the strongest baseline by roughly 15 percentage points on FABLES+ and 4 points on DiverseSumm+.
- Long-context LLMs (Gemini 1.5 Pro, GPT-4.1 Mini) perform poorly, demonstrating that naïvely feeding long documents into a model is not an effective strategy.
### Ablation Study (Component Contributions)
Detailed ablations are reported in Appendix E.1:
| Component / Setting | Effect on FABLES+ |
|---|---|
| Remove sub-claim decomposition | Decrease |
| Remove DAG structure (verify terminal vs. source only) | Significant decrease (degenerates to RAG) |
| Remove Not Fully Supported expanded search | Increase in false positives |
| q=1 vs. q=3 | q=3 improves precision; q=1 improves recall |
## Key Findings
- MGS-Specific Value: On FABLES+ (hierarchical summarization with >100K intermediate outputs), the DAG backtracking mechanism of VeriTrail plays a critical role.
- Cost Efficiency: Despite its multi-round verification, VeriTrail's total cost remains in a reasonable range, since each round processes only the current layer's candidate nodes rather than the full document set.
- Error Stage Distribution: Analysis reveals that different pipeline stages have different hallucination introduction probabilities, providing guidance for optimizing MGS workflows.
## Highlights & Insights
- First systematic treatment of hallucination detection in MGS: fills an important gap in the literature regarding traceability in multi-step generation pipelines.
- Elegance of DAG modeling: unifies heterogeneous generation processes under a single DAG representation, yielding a highly generalizable algorithm.
- Engineering ingenuity in evidence IDs: by programmatically assigning IDs and requiring the LLM to return matching IDs, the method cleverly eliminates the risk of secondary hallucination inherent in "using an LLM to verify an LLM."
- Flexibility via the \(q\) parameter: users can adjust detection behavior according to the use case (high recall vs. high precision).
## Limitations & Future Work
- Only two MGS pipelines are evaluated (hierarchical summarization and GraphRAG); generalizability to broader settings remains to be validated.
- Reliance on an LLM as the evaluator may introduce systematic bias.
- Handling of the Inconclusive verdict is coarse-grained (excluded from experiments).
- The dataset scale is relatively small (~1,300 claims), limiting the statistical power of the comparisons.
- Reference-free methods (e.g., attention-map-based approaches) are not included in the comparison.
## Related Work & Insights
- Complementary to NLI-based methods (AlignScore, INFUSE): VeriTrail addresses their inability to handle large-scale MGS pipelines.
- Practical implications for RAG systems: when RAG is followed by multi-step processing, full-chain verification should be considered.
- Implications for LLM agent systems: the multi-step reasoning chains of agents can also be modeled as DAGs and verified using a similar approach.
- Distinction from LangGraph and related work: VeriTrail's nodes represent text segments rather than generation steps, offering finer granularity.
## Rating
- Novelty: ⭐⭐⭐⭐ First traceable hallucination detection method for MGS with an elegant DAG formulation, though the core verification still relies on LLM-as-judge.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two new datasets, comprehensive baseline comparisons, ablations, and cost analysis, though dataset scale is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous framework definitions, clear algorithmic descriptions, and intuitive figures.
- Value: ⭐⭐⭐⭐ Addresses a practically important problem with direct engineering value for quality assurance in MGS pipelines.