TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification

Conference: ACL 2025
arXiv: 2503.15289
Code: GitHub
Area: Others
Keywords: Text Provenance, Sentence Tracing, Relation Classification, Long Document, Multi-Document

TL;DR

Proposes the TROVE text provenance challenge, which traces each sentence in the target text back to specific source sentences in the source documents and classifies their fine-grained relationships (quotation, compression, inference, etc.), covering multi-document and long-document scenarios.

Background & Motivation

Background: The reliability and traceability of LLM-generated texts have gained significant attention. Existing work on citation generation and fact verification mainly focuses on document-level or coarse-grained provenance.

Limitations of Prior Work: High-risk domains (e.g., law, healthcare) require understanding the source and generation mechanism of each sentence, yet there is a lack of sentence-level fine-grained provenance datasets and evaluation methods.

Key Challenge: Existing work focuses only on source identification at the single-document level and does not meet the fine-grained provenance requirements of multi-document and long-document scenarios.

Goal: To provide a fine-grained text provenance dataset and evaluation framework covering multiple scenarios, multiple languages, and various source text lengths.

Key Insight: Constructing provenance data based on three public datasets (LongBench, LooGLE, and CRUD-RAG), combined with a multi-retriever approach and a three-stage annotation using GPT-4o.

Core Idea: Text provenance should not only trace source sentences but also classify the fine-grained target-source relationships (quotation/compression/inference/others).
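
One way to picture the output of this dual task is as a small record per target sentence. The sketch below is illustrative only: the class and field names (`SourceLink`, `TargetProvenance`, `doc_id`, `sentence_idx`) are hypothetical and not taken from the paper, and it assumes that a relation label is attached to each target-source pair.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    """The four relation labels named in this summary."""
    QUOTATION = "quotation"
    COMPRESSION = "compression"
    INFERENCE = "inference"
    OTHERS = "others"


@dataclass(frozen=True)
class SourceLink:
    # Hypothetical fields: which document and which sentence the target draws on.
    doc_id: str
    sentence_idx: int
    # Assumption: one relation label per target-source pair.
    relation: Relation


@dataclass
class TargetProvenance:
    target_sentence: str
    links: list[SourceLink] = field(default_factory=list)

    def source_ids(self) -> set[tuple[str, int]]:
        """Return the traced source sentences, ignoring relation labels."""
        return {(link.doc_id, link.sentence_idx) for link in self.links}
```

Keeping the traced sentences and the relation labels in one record makes it straightforward to score tracing alone (via `source_ids`) or tracing plus classification together.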

Method

Overall Architecture

Constructs a provenance dataset covering 11 scenarios (QA and summarization), bilingual in Chinese and English, with different source text lengths (0-5k, 5k-10k, 10k+), ensuring quality through a three-stage annotation process.

Key Designs

  1. Multi-Retriever Sentence Retrieval: Combines three retrievers: BM25, dense retrieval, and LCS (Longest Common Subsequence). Sentences returned by at least two of the three retrievers are kept as candidate source sentences, with top-\(k = 10\).
  2. GPT-4o Provenance Annotation: Conducts fine-grained annotation based on the candidate source sentences, identifying source sentences and classifying their relations into four categories: Quotation, Compression, Inference, and Others (such as negation).
  3. Manual Provenance Verification: Eight graduate students spent approximately 510 hours reviewing the GPT-4o annotations, verifying them and adding source sentences that GPT-4o had missed, at an average cost of $0.20 per sentence.
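
The first stage above can be sketched as a simple voting scheme over retriever outputs. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, whitespace tokenization is an assumption (Chinese text would need a different tokenizer), and it assumes each retriever contributes its own top-\(k\) list before voting.

```python
from collections import Counter


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def top_k_by_lcs(target: str, sources: list[str], k: int = 10) -> list[int]:
    """Rank source sentences by LCS length with the target (assumed scoring)."""
    t = target.split()
    scores = [lcs_length(t, s.split()) for s in sources]
    order = sorted(range(len(sources)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def fuse_by_votes(rankings: list[list[int]], min_votes: int = 2) -> list[int]:
    """Keep sentence indices returned by at least `min_votes` retrievers."""
    votes = Counter(i for ranking in rankings for i in set(ranking))
    return sorted(i for i, n in votes.items() if n >= min_votes)
```

For example, with hypothetical hit lists from the three retrievers:

```python
bm25_hits = [0, 4, 7]
dense_hits = [4, 9]
lcs_hits = [7, 4, 2]
print(fuse_by_votes([bm25_hits, dense_hits, lcs_hits]))  # [4, 7]
```

Requiring agreement between retrievers trades some recall for a shorter, cleaner candidate list, which matters when the candidates are then passed to an annotator.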

Evaluation Strategy

Proposes a system of 13 metrics, including macro/micro averaged Track-P/R/F1 (source tracing), Relation-P/R/F1 (relationship classification), and Overall F1. It supports two evaluation paradigms: direct prompting and retrieval-augmented evaluation.
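
To make the macro/micro distinction concrete, here is a minimal sketch of tracing precision, recall, and F1. The exact metric definitions are not given in this summary, so this assumes the usual set-overlap formulation: macro averages per-target-sentence scores, while micro pools counts over all target sentences. The function names are hypothetical.

```python
def prf(tp: int, n_pred: int, n_gold: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, with zero-division guards."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def track_scores(preds: list[set], golds: list[set]):
    """Return (macro, micro) P/R/F1 over aligned predicted and gold source sets."""
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and aligned")
    per_sentence = [prf(len(p & g), len(p), len(g)) for p, g in zip(preds, golds)]
    n = len(per_sentence)
    # Macro: average the per-sentence scores.
    macro = tuple(sum(s[i] for s in per_sentence) / n for i in range(3))
    # Micro: pool true positives and set sizes before computing the scores.
    micro = prf(
        sum(len(p & g) for p, g in zip(preds, golds)),
        sum(len(p) for p in preds),
        sum(len(g) for g in golds),
    )
    return macro, micro
```

Under the same assumption, relation scores could reuse `track_scores` by comparing sets of (source sentence, relation label) pairs instead of bare sentence indices, so a traced sentence with the wrong label counts as an error.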

Key Experimental Results

Main Results

| Model | Direct Prompting | Retrieval-Augmented |
|---|---|---|
| GPT-4o | - | Best Closed-Source |
| DeepSeek-V3 (671B) | - | Best Open-Source |
| LLama3-8B | 4.71 | 30.96 |
| Qwen2.5-14B | - | Outperforms 7B in the same series |
| Vicuna-7B | 7.08 | 22.74 |
| ChatGLM-6B | 0.02 | 3.47 |

Ablation Study

| Retrieval Method | Track F1 (Macro) | Overall F1 |
|---|---|---|
| LCS | 29.41 | 14.67 |
| BM25 | 35.70 | 17.81 |
| Dense | 28.28 | 14.10 |
| Union (≥2) | 46.17 | 22.82 |

Key Findings

  • Retrieval augmentation is crucial for provenance; all models perform significantly better under the retrieval-augmented setting compared to direct prompting.
  • Larger models demonstrate better performance in complex relationship classification.
  • Closed-source models generally lead, but open-source models exhibit significant potential when combined with retrieval augmentation.
  • Relationship classification is more challenging than source sentence tracing.

Highlights & Insights

  • First to define text provenance as a dual task of sentence-level tracing and relationship classification, with a granularity far exceeding existing works.
  • The three-stage annotation pipeline (multi-retriever \(\rightarrow\) GPT-4o \(\rightarrow\) human) serves as a practical annotation methodology for long documents.
  • Covers both Chinese and English data and a range of source text lengths, providing a comprehensive evaluation.

Limitations & Future Work

  • The dataset size is limited (approx. 5,000 sentences); extending it to more domains and languages is worth exploring.
  • A sliding window approach is used when the source text exceeds the model's context length, which may result in a loss of cross-window information.
  • The definition of relationship types is relatively coarse and can be further refined (e.g., classifying inference into inductive/deductive).

Related Work & Insights

  • TROVE complements existing work on citation generation, fact verification, and grounded generation.
  • Provides a new evaluation perspective for the interpretability and traceability of RAG systems.
  • The multi-retriever fusion strategy possesses universal value for long-document tasks.
  • The GPT-4o + human review workflow for data annotation can serve as a general paradigm for long-document annotation.
  • The provenance task can be extended to scenarios such as code generation and academic writing.

Supplemental Technical Details

  • Dataset distribution: Chinese single-document averages 196 sentences per source document; English single-document averages 637 sentences per source document.
  • Average number of source sentences per target sentence: 7.04 sentences for Chinese single-document, 1.97 sentences for English multi-document.
  • Annotation agreement (Fleiss' Kappa): 0.60-0.74 for tracing, 0.48-0.62 for relationship classification, and 0.44-0.70 for GPT-4o correction.
  • Evaluation utilizes a sliding window to handle ultra-long source texts: dividing the input into chunks such as \(0\text{-}M\), \(M\text{-}2M\), and \(2M\text{-}3M\) to process them independently before merging.
  • Breakdown of relationship types: Quotation (verbatim/partial copy), Compression (summarization/paraphrase), Inference (expansion/generalization/specialization), Others (negation, etc.).
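
The sliding-window evaluation described above can be sketched as follows. This is an assumption-laden illustration: the unit of \(M\) is not stated in this summary, so the window is measured in source sentences here, and `trace_fn` is a hypothetical stand-in for the model call that returns indices local to the chunk it was given.

```python
from typing import Callable


def window_spans(n_sentences: int, window: int) -> list[tuple[int, int]]:
    """Half-open spans [0, M), [M, 2M), ... covering all source sentences."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [(s, min(s + window, n_sentences)) for s in range(0, n_sentences, window)]


def trace_with_windows(
    target: str,
    sources: list[str],
    window: int,
    trace_fn: Callable[[str, list[str]], list[int]],
) -> list[int]:
    """Run `trace_fn` on each chunk independently and merge to global indices."""
    merged: set[int] = set()
    for start, end in window_spans(len(sources), window):
        local_hits = trace_fn(target, sources[start:end])
        # Shift chunk-local indices back into the full document's index space.
        merged.update(start + i for i in local_hits)
    return sorted(merged)
```

Because each chunk is processed on its own, a target sentence whose support is split across two windows is only ever seen half at a time, which is the cross-window information loss noted under the limitations.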

Rating

  • Novelty: ⭐⭐⭐⭐ Novel task definition, but the data construction methodology is relatively conventional.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 11 models with multi-dimensional analysis, though some results are missing.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and rigorous task definition.
  • Value: ⭐⭐⭐⭐ Provides an important foundation for the traceability of LLM-generated content.