TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification¶

Conference: ACL 2025
arXiv: 2503.15289
Code: GitHub
Area: Others
Keywords: Text Provenance, Sentence Tracing, Relation Classification, Long Document, Multi-Document

TL;DR¶

Proposes the TROVE text provenance challenge, which traces each sentence in the target text back to specific source sentences in the source documents and classifies their fine-grained relationships (quotation, compression, inference, etc.), covering multi-document and long-document scenarios.

Background & Motivation¶

Background: The reliability and traceability of LLM-generated texts have gained significant attention. Existing work on citation generation and fact verification mainly focuses on document-level or coarse-grained provenance.

Limitations of Prior Work: High-risk domains (e.g., law, healthcare) require understanding the source and generation mechanism of each sentence, yet there is a lack of sentence-level fine-grained provenance datasets and evaluation methods.

Key Challenge: Existing work focuses only on source identification at the single-document level, which does not meet the fine-grained provenance requirements of multi-document and long-document scenarios.

Goal: To provide a fine-grained text provenance dataset and evaluation framework covering multiple scenarios, multiple languages, and various source text lengths.

Key Insight: Constructing provenance data based on three public datasets (LongBench, LooGLE, and CRUD-RAG), combined with a multi-retriever approach and a three-stage annotation using GPT-4o.

Core Idea: Text provenance should not only trace source sentences but also classify the fine-grained target-source relationships (quotation/compression/inference/others).
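To make the task output concrete, here is a minimal sketch of how one provenance record might be represented. The four relation labels come from the task definition above. The class and field names (`SourceLink`, `TargetProvenance`, `doc_id`, `sentence_id`) are hypothetical, and attaching one relation to each target-source pair is an assumption rather than the dataset's documented schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    """The four target-source relationship labels named in the task."""
    QUOTATION = "quotation"
    COMPRESSION = "compression"
    INFERENCE = "inference"
    OTHERS = "others"


@dataclass(frozen=True)
class SourceLink:
    """One traced source sentence and its relation to the target (hypothetical schema)."""
    doc_id: int        # index of the source document (assumed field)
    sentence_id: int   # index of the sentence inside that document (assumed field)
    relation: Relation


@dataclass
class TargetProvenance:
    """Provenance of a single target sentence: zero or more source links."""
    target_sentence: str
    links: list[SourceLink] = field(default_factory=list)


example = TargetProvenance(
    target_sentence="The court dismissed the appeal.",
    links=[SourceLink(doc_id=0, sentence_id=3, relation=Relation.QUOTATION)],
)
```

A target sentence with an empty `links` list would stand for a sentence with no traceable source under this representation.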

Method¶

Overall Architecture¶

Constructs a provenance dataset covering 11 scenarios (QA and summarization), bilingual in Chinese and English, with different source text lengths (0-5k, 5k-10k, 10k+), ensuring quality through a three-stage annotation process.

Key Designs¶

Multi-Retriever Sentence Retrieval: Combines three retrievers: BM25, dense retrieval, and LCS (Longest Common Subsequence). Sentences returned by at least two of the three retrievers are kept as candidate source sentences, with top-$k=10$.
GPT-4o Provenance Annotation: Conducts fine-grained annotation based on the candidate source sentences, identifying source sentences and classifying their relations into four categories: Quotation, Compression, Inference, and Others (such as negation).
Manual Provenance Verification: Eight graduate students spent approximately 510 hours reviewing the annotations, verifying GPT-4o's output and adding source sentences it had missed, at an average cost of $0.20 per sentence.
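The candidate-selection step of the first stage can be sketched as a simple vote over ranked lists. This is an illustration only: the function names are hypothetical, the LCS score is computed over whitespace tokens (an assumption), and the BM25 and dense rankings are assumed to arrive as precomputed lists of sentence indices.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def top_k(scores, k):
    """Indices of the k highest-scoring sentences (ties broken by lower index)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]


def vote_candidates(ranked_lists, min_votes=2):
    """Keep sentence indices returned by at least `min_votes` retrievers."""
    votes = Counter(i for ranked in ranked_lists for i in set(ranked))
    return sorted(i for i, n in votes.items() if n >= min_votes)


target = "the cat sat on the mat".split()
sources = ["a cat sat on a mat", "dogs bark loudly", "the mat was red"]
lcs_ranked = top_k([lcs_length(target, s.split()) for s in sources], k=2)

# Stand-ins for the BM25 and dense retriever outputs (made-up rankings).
bm25_ranked = [0, 1]
dense_ranked = [2, 1]
candidates = vote_candidates([lcs_ranked, bm25_ranked, dense_ranked])
```

Requiring agreement between retrievers trades a little recall for a shorter and cleaner candidate list, which is what the later annotation stages then review.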

Evaluation Strategy¶

Proposes a system of 13 metrics, including macro/micro averaged Track-P/R/F1 (source tracing), Relation-P/R/F1 (relationship classification), and Overall F1. It supports two evaluation paradigms: direct prompting and retrieval-augmented evaluation.
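The difference between macro and micro averaging for the tracing metrics can be sketched as follows. This assumes the standard set-based definitions (per-target-sentence sets of predicted and gold source-sentence ids); the paper's exact formulas may differ, and `prf` and `track_scores` are hypothetical names.

```python
def prf(tp, n_pred, n_gold):
    """Precision, recall, and F1 from raw counts, returning 0.0 on empty denominators."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def track_scores(pred, gold):
    """pred, gold: lists of sets of source-sentence ids, one set per target sentence."""
    pairs = list(zip(pred, gold))
    per_sentence = [prf(len(p & g), len(p), len(g)) for p, g in pairs]
    n = len(per_sentence) or 1  # avoid division by zero on empty input
    # Macro: average the per-sentence scores.
    macro = tuple(sum(s[i] for s in per_sentence) / n for i in range(3))
    # Micro: pool the counts over all target sentences, then score once.
    micro = prf(
        sum(len(p & g) for p, g in pairs),
        sum(len(p) for p, _ in pairs),
        sum(len(g) for _, g in pairs),
    )
    return {"macro": macro, "micro": micro}


scores = track_scores(pred=[{1, 2}, {3}], gold=[{1}, {3, 4}])
```

Under the same assumption, a relation-level score could reuse `track_scores` with sets of `(source_id, relation)` pairs instead of bare ids, so that a hit counts only when both the source sentence and its label are correct.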

Key Experimental Results¶

Main Results¶

Model	Direct Prompting	Retrieval-Augmented
GPT-4o	-	Best Closed-Source
DeepSeek-V3 (671B)	-	Best Open-Source
Llama3-8B	4.71	30.96
Qwen2.5-14B	-	Outperforms 7B in the same series
Vicuna-7B	7.08	22.74
ChatGLM-6B	0.02	3.47

Ablation Study¶

Retrieval Method	Track F1 (Macro)	Overall F1
LCS	29.41	14.67
BM25	35.70	17.81
Dense	28.28	14.10
Union (≥2)	46.17	22.82

Key Findings¶

Retrieval augmentation is crucial for provenance; all models perform significantly better under the retrieval-augmented setting compared to direct prompting.
Larger models demonstrate better performance in complex relationship classification.
Closed-source models generally lead, but open-source models exhibit significant potential when combined with retrieval augmentation.
Relationship classification is more challenging than source sentence tracing.

Highlights & Insights¶

The first work to define text provenance as a dual task of sentence-level tracing and relationship classification, at a much finer granularity than prior work.
The three-stage annotation pipeline (multi-retriever $\rightarrow$ GPT-4o $\rightarrow$ human) serves as a practical annotation methodology for long documents.
Covers both Chinese and English data and a range of source text lengths, supporting a comprehensive evaluation.

Limitations & Future Work¶

The dataset size is limited (approx. 5,000 sentences); extending it to more domains and languages is worth exploring.
A sliding window approach is used when the source text exceeds the model's context length, which may result in a loss of cross-window information.
The definition of relationship types is relatively coarse and can be further refined (e.g., classifying inference into inductive/deductive).

TROVE complements work on citation generation, fact verification, and grounded generation.
Provides a new evaluation perspective for the interpretability and traceability of RAG systems.
The multi-retriever fusion strategy may be broadly useful for other long-document tasks.
The GPT-4o + human review workflow for data annotation can serve as a general paradigm for long-document annotation.
The provenance task can be extended to scenarios such as code generation and academic writing.

Supplemental Technical Details¶

Dataset distribution: Chinese single-document averages 196 sentences per source document; English single-document averages 637 sentences per source document.
Average number of source sentences per target sentence: 7.04 sentences for Chinese single-document, 1.97 sentences for English multi-document.
Annotation agreement (Fleiss' Kappa): 0.60-0.74 for tracing, 0.48-0.62 for relationship classification, and 0.44-0.70 for GPT-4o correction.
Evaluation utilizes a sliding window to handle ultra-long source texts: dividing the input into chunks such as $0\text{-}M$, $M\text{-}2M$, and $2M\text{-}3M$ to process them independently before merging.
Breakdown of relationship types: Quotation (verbatim/partial copy), Compression (summarization/paraphrase), Inference (expansion/generalization/specialization), Others (negation, etc.).
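The sliding-window evaluation described above can be sketched as follows. The window here is measured in sentences, which is an assumption (the unit of $M$ is not stated in these notes), and `trace_fn` is a hypothetical stand-in for a model call that returns window-local indices of matched source sentences.

```python
def trace_with_sliding_window(source_sentences, trace_fn, window_size):
    """Run `trace_fn` on consecutive non-overlapping windows and merge the hits.

    `trace_fn(chunk)` must return indices local to `chunk`; they are shifted
    by the window offset so the result refers to the full source text.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    merged = set()
    for start in range(0, len(source_sentences), window_size):
        chunk = source_sentences[start:start + window_size]
        for local_idx in trace_fn(chunk):
            merged.add(start + local_idx)
    return sorted(merged)


def toy_trace(chunk):
    """Toy tracer: flags every sentence that mentions 'cat'."""
    return [i for i, sentence in enumerate(chunk) if "cat" in sentence]


sources = ["a cat", "a dog", "cat nap", "bird", "the cat"]
hits = trace_with_sliding_window(sources, toy_trace, window_size=2)
```

Because each window is processed independently, evidence that only makes sense when sentences from two different windows are read together can be missed, which is the cross-window loss noted under Limitations.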

Rating¶

Novelty: ⭐⭐⭐⭐ Novel task definition, but the data construction methodology is relatively conventional.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 11 models with multi-dimensional analysis, though some results are missing.
Writing Quality: ⭐⭐⭐⭐ Clear structure and rigorous task definition.
Value: ⭐⭐⭐⭐ Provides an important foundation for the traceability of LLM-generated content.