In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis

Conference: ACL2026
arXiv: 2505.14838
Code: https://ukplab.github.io/acl2026-generating-impact-summaries
Area: Scientific Literature Analysis / Text Generation
Keywords: Research Impact Summarization, Citation Intent, Time-aware Summarization, Citation Context, LLM Evaluation

TL;DR

This paper proposes the task of "Scientific Impact Summarization": first identifying fine-grained intents that truly reveal impact from the citation contexts of a paper, and then generating an impact narrative that evolves over time. This approach better illustrates how a paper is adopted, criticized, and transformed by subsequent work than simple citation counts.

Background & Motivation

Background: Research impact is typically measured by citation counts, h-index, or similar bibliometric indicators. In NLP and scientometrics, significant work has also focused on citation intent classification, using coarse-grained labels to indicate whether a citation serves as background, method, result, or motivation.

Limitations of Prior Work: Citation counts only indicate "how many times" a paper is cited, not "why." Among papers with 200 citations each, one might be primarily reused as a method, another might be criticized for its limitations, and a third might only be cited as background. Existing citation intent classifiers usually stop at labeling a single citation context and rarely aggregate many citations into a readable impact narrative.

Key Challenge: Genuine scientific impact involves both confirmation and correction. Subsequent papers may adopt a method or point out flaws and propose corrections. Focusing only on positive adoption or coarse labels misses the trajectory of "criticism, correction, and rediscovery" in scientific progress.

Goal: The authors aim to filter "impact-revealing contexts" from all citation contexts of a target paper, identify their fine-grained citation reasons and years, and generate a time-aware impact summary describing how the paper influenced subsequent research across different stages.

Key Insight: Rather than letting an LLM generate freely based on titles and citation counts, the paper decomposes the task into two steps: first, using in-context learning to generate fine-grained citation intents in free-text form and judging if they are impact-revealing; second, providing only the filtered impact-revealing contexts, years, and intents to the LLM for summary generation.

Core Idea: Using "fine-grained citation intent + temporal information" as a structured intermediate layer transforms research impact from static numbers into verifiable, readable, and comparable historical narratives.

Method

The paper first provides four definitions. A citation context is the text surrounding a citation of a target paper; fine-grained citation intent is a free-text description of the citation reason; impact-revealing intent refers to intents that directly reflect the impact of the cited paper, categorized into confirmation and critique/correction; a scientific impact summary describes how a paper was directly used, extended, criticized, or corrected by subsequent work over time.

The entire pipeline follows a "filter evidence first, then summarize" approach. The input is a set of citation contexts and their years for a target paper. The system first generates a fine-grained intent for each context and classifies it as impact-revealing or other. Subsequently, only contexts with impact signals are retained and passed into the generation prompt along with years and intents to produce a semi-structured impact summary. Since no gold summaries exist for this task, the authors designed a reference-free evaluation, using an LLM to assess faithfulness, coverage, citation year compliance, insightfulness, trend awareness, and specificity, validated by human assessment for relevance.

Overall Architecture

The first stage is citation intent generation/classification. The authors manually construct in-context examples covering various citation reasons, including method use, background mention, pointing out unreliable datasets and proposing better ones, and identifying research gaps. For each citation context, the LLM outputs both a free-text intent and an impact-revealing/other category. To support this task, a 4K citation context dataset was constructed: 1K existing positive instances from PST-Bench, 1K impact-revealing contexts filtered from S2AG using confirmation/correction patterns, and 2K non-impact-revealing examples. Human inspection showed 90% label accuracy.
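
The first stage can be sketched as ordinary few-shot prompt construction. This is a minimal illustration, not the authors' prompt: the names `Demo`, `build_prompt`, and `parse_completion`, the instruction wording, and the two context sentences are assumptions made for this example. Only the two intent strings are taken from the examples quoted in this summary.

```python
from dataclasses import dataclass

LABELS = {"impact-revealing", "other"}

@dataclass
class Demo:
    context: str   # citation sentence, invented here for illustration
    intent: str    # free-text reason for citing
    label: str     # one of LABELS

# Hypothetical demonstrations; a real prompt would hold many more (K shots).
DEMOS = [
    Demo("We follow the minimization procedure of [CITE] to fit the model.",
         "use of minimization methodology", "impact-revealing"),
    Demo("Many approaches to NER have been proposed [CITE].",
         "background about NER methods", "other"),
]

def build_prompt(demos, context):
    """Assemble a few-shot prompt: instruction, demonstrations, then the query."""
    parts = ["Give a short free-text intent for each citation, then label it "
             "as impact-revealing or other."]
    for d in demos:
        parts.append(f"Context: {d.context}\nIntent: {d.intent}\nLabel: {d.label}")
    # The query ends at "Intent:" so the model continues with intent and label.
    parts.append(f"Context: {context}\nIntent:")
    return "\n\n".join(parts)

def parse_completion(text):
    """Split a completion of the form '<intent> Label: <label>'."""
    intent, _, label = text.partition("Label:")
    label = label.strip().lower()
    # Assumption: anything unparseable falls back to the non-impact class.
    return intent.strip(), (label if label in LABELS else "other")
```

Falling back to "other" on malformed output is a conservative choice made for this sketch; it trades recall for precision, which may not be what the task wants given how much the summary depends on recall.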

The second stage is impact summary generation. The system filters contexts based on the first stage results and passes the impact-revealing citations, corresponding years, and generated intents to the LLM. The prompt requires the model to summarize the impact trajectory of the target paper by time periods—for instance, early adoption by a specific category of methods, mid-term exposure of certain limitations, and late-stage repurposing by new methods.
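
A rough sketch of the second stage follows: drop non-impact contexts, order the rest by year, and lay them out for the generation prompt. The summary only says that the prompt asks for a period-by-period trajectory; pre-bucketing citations into fixed three-year windows, the record fields (`year`, `label`, `intent`, `context`), and the prompt wording are all assumptions of this example.

```python
from collections import defaultdict

def filter_impact(records):
    """Keep only contexts that stage one labeled as impact-revealing."""
    return [r for r in records if r["label"] == "impact-revealing"]

def group_by_period(records, period_len=3):
    """Bucket records into fixed-length year windows (an illustrative choice)."""
    if not records:
        return {}
    start = min(r["year"] for r in records)
    buckets = defaultdict(list)
    for r in records:
        lo = start + ((r["year"] - start) // period_len) * period_len
        buckets[(lo, lo + period_len - 1)].append(r)
    return dict(sorted(buckets.items()))

def build_summary_prompt(title, records, period_len=3):
    """Render filtered evidence, with years and intents, as prompt text."""
    lines = [f"Summarize how '{title}' influenced later work, period by period. "
             "Use only the evidence listed."]
    periods = group_by_period(filter_impact(records), period_len)
    for (lo, hi), items in periods.items():
        lines.append(f"\n[{lo}-{hi}]")
        for r in sorted(items, key=lambda r: r["year"]):
            lines.append(f"- ({r['year']}) intent: {r['intent']} | context: {r['context']}")
    return "\n".join(lines)
```

Keeping the year next to every piece of evidence is what lets a later check compare the years stated in the summary against the years actually present in the input.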

The third stage is evaluation. The trustworthiness side includes faithfulness, coverage, and citation year compliance; the informativeness side includes insightfulness, trend awareness, and specificity. Faithfulness splits the summary into impact descriptions for different time periods and asks the evaluator LLM to check if they are supported by citation contexts from the same period. The informativeness side utilizes a G-Eval style LLM-as-a-judge approach.

Key Designs

  1. Impact-revealing citation intent as an intermediate representation:

    • Function: Filters evidence from a large number of citations that truly explains "how this paper impacted subsequent work."
    • Mechanism: Instead of fixed coarse labels, the LLM generates free-text intents and judges whether they are confirmation, correction, or other. For example, "use of minimization methodology" is impact-revealing, while "background about NER methods" is not.
    • Design Motivation: Research impact is often hidden in specific usage; fixed taxonomies are too coarse, and citation counts are too shallow. Free-text intents preserve finer semantics.
  2. Generating summaries using only impact-revealing contexts:

    • Function: Reduces noise from common background citations, making the generated results more faithful to direct impact.
    • Mechanism: The authors compared input settings including no citations, all citations, all citations + intents, only impact-revealing citations, and impact-revealing citations + intents.
    • Design Motivation: Longer context is not necessarily better; high volumes of incidental citations can mislead LLMs into hallucinating background mentions as impact stories.
  3. Reference-free evaluation framework for new tasks:

    • Function: Measures whether a summary is trustworthy and informative in the absence of a gold impact summary.
    • Mechanism: Faithfulness checks if summary statements are supported by citation contexts of the same period; coverage measures how many impact intents the summary covers; trend awareness, insightfulness, and specificity evaluate if the summary captures temporal changes and specific impacts.
    • Design Motivation: Traditional ROUGE cannot be used as there is no standard answer; there is a need to evaluate "support by evidence" and "actual explanation of impact."
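
To make two of the trustworthiness checks concrete, here is a simplified stand-in. This summary does not give the exact formulas, so both definitions are assumptions: year compliance is taken as the share of years mentioned in the summary that occur among the input citation years, and coverage as the share of intents that a caller-supplied judge (an LLM in the paper, a substring match in this toy) considers covered.

```python
import re

# Four-digit years from 1900 to 2099; a deliberately simple pattern.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def year_compliance(summary, citation_years):
    """Share of year mentions in the summary that match an input citation year."""
    mentioned = YEAR_RE.findall(summary)
    if not mentioned:
        return None  # nothing to check
    allowed = {str(y) for y in citation_years}
    return sum(y in allowed for y in mentioned) / len(mentioned)

def coverage(summary, intents, is_covered):
    """Share of intents that the judge function marks as covered by the summary."""
    if not intents:
        return None
    return sum(bool(is_covered(summary, i)) for i in intents) / len(intents)

def substring_judge(summary, intent):
    """Toy judge for the example; the paper uses an LLM evaluator instead."""
    return intent.lower() in summary.lower()
```

A regex-based year check also shows one way compliance can drop: a year that appears inside a citation context, such as the publication year of a dataset, can leak into the summary without being a citation year.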

Loss & Training

This work does not train a new model; it relies on a prompt-based LLM pipeline. Intent classification uses GPT-4o-mini with in-context learning (ICL), with \(K=50\) shots in the comparisons. Each test sample was run 3 times and the final label was chosen by majority vote, with a perfect-consistency rate of 72% across the runs. Summary generation and automatic evaluation mainly use GPT-4o; the appendix also uses Qwen-2.5-72B and Gemini-2.5-flash to check cross-model robustness. For the human study, 9 university professors evaluated the impact summaries of their own papers.
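
The voting step is simple enough to sketch. The code below assumes that "perfect consistency" means all runs for a sample returned the same label; that reading, and the function names, are assumptions of this example rather than details confirmed by the paper.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; with 3 runs and 2 classes there are no ties."""
    return Counter(labels).most_common(1)[0][0]

def perfect_consistency_rate(runs_per_sample):
    """Fraction of samples whose repeated runs all produced the same label."""
    agree = sum(len(set(runs)) == 1 for runs in runs_per_sample)
    return agree / len(runs_per_sample)

# Toy data: four samples, three runs each.
runs = [
    ["impact-revealing", "impact-revealing", "impact-revealing"],
    ["impact-revealing", "other", "impact-revealing"],
    ["other", "other", "other"],
    ["other", "other", "impact-revealing"],
]
final_labels = [majority_vote(r) for r in runs]
rate = perfect_consistency_rate(runs)
```

An odd number of runs matters here: with two classes it guarantees that a strict majority always exists.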

Key Experimental Results

Main Results

| Task | Method | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| Impact-revealing Classification | random baseline | 0.54 | 0.51 | 0.52 | 0.50 |
| Impact-revealing Classification | always-impact-revealing | 0.53 | 1.00 | 0.69 | 0.53 |
| Impact-revealing Classification | Structural Scaffolds | 0.55 | 0.44 | 0.49 | 0.51 |
| Impact-revealing Classification | Meaningful Citations | 0.72 | 0.46 | 0.56 | 0.62 |
| Impact-revealing Classification | Multi-cite | 0.59 | 0.41 | 0.48 | 0.53 |
| Impact-revealing Classification | Ours | 0.74 | 0.65 | 0.69 | 0.69 |

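Ours achieves the highest precision (0.74) and accuracy (0.69), and its recall (0.65) is 19 percentage points above the best existing classifier (Meaningful Citations, 0.46). Only the trivial always-impact-revealing baseline has higher recall (1.00); it matches Ours on F1 (0.69) but at far lower precision (0.53) and accuracy (0.53). For generating impact summaries, high recall is crucial, as missing impactful citations directly results in missing key impact trajectories.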
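
As a quick sanity check, the reported F1 values can be recomputed from the precision and recall columns using the standard harmonic-mean formula. The numbers are copied from the classification table in this section; the dictionary layout is just a convenience for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# method: (precision, recall, reported F1)
ROWS = {
    "random baseline": (0.54, 0.51, 0.52),
    "always-impact-revealing": (0.53, 1.00, 0.69),
    "Structural Scaffolds": (0.55, 0.44, 0.49),
    "Meaningful Citations": (0.72, 0.46, 0.56),
    "Multi-cite": (0.59, 0.41, 0.48),
    "Ours": (0.74, 0.65, 0.69),
}

for name, (p, r, reported) in ROWS.items():
    print(f"{name:26s} computed={f1(p, r):.3f} reported={reported:.2f}")
```

The tie at F1 = 0.69 between Ours and the always-impact-revealing baseline is a reminder that F1 alone can hide a large gap in precision, which is why accuracy and precision are worth reading alongside it.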

Ablation Study

| Summary Input | Provide intents? | Faithfulness | Coverage | Coverage@3 | Citation Year Compliance | Insightfulness | Trend Awareness | Specificity |
|---|---|---|---|---|---|---|---|---|
| No citations | No | 0.77 | 0.25 | 0.58 | n/a | 0.70 | 0.94 | 0.75 |
| All citations | No | 0.83 | 0.32 | 0.74 | 0.55 | 0.80 | 0.95 | 0.85 |
| All citations | Yes | 0.84 | 0.32 | 0.73 | 0.48 | 0.80 | 0.97 | 0.86 |
| Impact-revealing only | No | 0.87 | 0.33 | 0.73 | 0.59 | 0.80 | 0.96 | 0.87 |
| Impact-revealing only | Yes | 0.88 | 0.34 | 0.75 | 0.56 | 0.83 | 0.98 | 0.88 |
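
The ablation numbers are easier to compare when the best setting is computed per metric. The snippet below copies the values from the ablation results above; the metric keys and the `(input, intents)` tuple used to name each setting are naming choices made for this example.

```python
METRICS = ["faithfulness", "coverage", "coverage@3", "year_compliance",
           "insightfulness", "trend_awareness", "specificity"]

# (summary input, intents provided?) -> scores in METRICS order; None means n/a.
ROWS = {
    ("No citations", False):          [0.77, 0.25, 0.58, None, 0.70, 0.94, 0.75],
    ("All citations", False):         [0.83, 0.32, 0.74, 0.55, 0.80, 0.95, 0.85],
    ("All citations", True):          [0.84, 0.32, 0.73, 0.48, 0.80, 0.97, 0.86],
    ("Impact-revealing only", False): [0.87, 0.33, 0.73, 0.59, 0.80, 0.96, 0.87],
    ("Impact-revealing only", True):  [0.88, 0.34, 0.75, 0.56, 0.83, 0.98, 0.88],
}

def best_per_metric(rows):
    """Map each metric to the list of settings that reach its top score."""
    best = {}
    for j, metric in enumerate(METRICS):
        scored = {k: v[j] for k, v in rows.items() if v[j] is not None}
        top = max(scored.values())
        best[metric] = sorted(k for k, s in scored.items() if s == top)
    return best

winners = best_per_metric(ROWS)
```

In this data the setting with impact-revealing citations plus intents is the sole winner on six of the seven metrics; citation year compliance is the exception, where the same citations without intents score higher.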

Key Findings

  • When impact-revealing citations are split into confirmation and correction, the F1 of this method reaches 0.88 and 0.98 respectively, well above existing intent classifiers. The higher score on correction indicates a particular strength in identifying citations that point out limitations and propose improvements.
  • The best summary input is impact-revealing citations + intents, which achieves the highest score on faithfulness, coverage, Coverage@3, insightfulness, trend awareness, and specificity. The one exception is citation year compliance, where impact-revealing citations without intents score higher (0.59 vs. 0.56).
  • In professor human evaluations, the summaries from this paper were selected 63% of the time for relevance and 75% for insightfulness compared to a zero-knowledge baseline. Approximately 60% of professors felt the summary details were appropriate and provided new, non-obvious insights; this ratio rose to 75% for papers in the top 10% of impact-revealing citations.

Highlights & Insights

  • The greatest highlight is the redefinition of "impact": impact is not the number of citations, but how subsequent work uses, extends, questions, and corrects the original paper. This perspective lets researchers judge the historical role of a paper more quickly and more accurately than bibliometric indicators do.
  • Free-text intents are highly valuable. They avoid the information loss of coarse taxonomies and provide the summary generation with an intermediate layer akin to "evidence labels," reducing the risk of LLMs hallucinating stories.
  • The evaluation framework itself is reusable. The combination of faithfulness, coverage, year compliance, and trend awareness can be transferred to tasks like literature review generation, related work writing, and research trend analysis.

Limitations & Future Work

  • The authors only processed English papers; cross-lingual citation contexts and writing habits in different disciplines may affect intent expression. Research impact summarization in Chinese, German, or multiple languages still needs separate validation.
  • The scale of human evaluation was limited to 9 professors. While expert evaluation is high quality, the sample pool is small, and authors may only be familiar with a portion of their papers' impact.
  • The full coverage for the best setting is only 0.34, and citation year compliance is only around 0.56 to 0.59, suggesting LLMs still miss long-tail impact themes and are distracted by other year figures in citation contexts.
  • The paper primarily tested the GPT-4o series. Although Qwen and Gemini were added in the appendix, the model comparison is not systematic. Using the same LLM for both generation and evaluation may also introduce bias, because the generator and the judge share the same interpretation of the task.
  • Currently, impact is primarily operationalized as confirmation and correction. Real scientific impact includes parallel development, standardization, educational dissemination, and cross-domain transfer, which could expand the intent space in the future.

Comparison with Related Work

  • vs citation count / h-index: Traditional metrics are simple and scalable but cannot explain the reason for the citation; this paper extracts impact paths from citation contexts, distinguishing between method reuse, critique of limitations, and subsequent corrections.
  • vs citation intent classification: Existing methods mostly perform coarse classification of single citations; this paper uses intent classification as an intermediate step, with the ultimate goal being a multi-document, time-aware impact summary.
  • vs query-focused scientific summarization: General scientific summarization focuses on paper content or related work; the query in this paper is "how did this paper impact subsequent research," and the input evidence is subsequent citation contexts, making it more like an automated draft of the history of scientific ideas.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The task definition is very fresh, combining citation intent, timelines, and impact summarization into a clear new problem.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Automatic evaluation, ablation, expert evaluation, and cross-model supplements are relatively complete, but human evaluation scale and coverage remain limited.
  • Writing Quality: ⭐⭐⭐⭐☆ The paper structure is clear, and definitions and evaluation metrics are explained in detail; some dense tables take effort to read.
  • Value: ⭐⭐⭐⭐⭐ Highly practical for literature review, academic evaluation, survey writing, and research trend analysis, especially when integrated with paper retrieval systems.