Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs
Conference: ACL 2025
arXiv: 2410.11001
Code: https://github.com/ulab-uiuc/GoR
Area: Information Retrieval
Keywords: RAG, Graph Neural Networks, Long-Context Summarization, Historical Response Utilization, BERTScore Objective
TL;DR
Proposes Graph of Records (GoR), which builds a graph from LLM historical responses and the text chunks retrieved to produce them. A GNN learns semantic and logical associations among the nodes, trained with a self-supervised BERTScore-based objective. GoR outperforms retrieval baselines on four long-context global summarization datasets, with reported ROUGE gains of roughly 8-19%.
Background & Motivation
Background: RAG offers an alternative to extended context windows in long-text summarization by retrieving only the relevant text chunks. However, after each query and generation step, the historical responses generated by the LLM are discarded.
Limitations of Prior Work: (a) Historical responses contain valuable task-related information that remains unutilized; (b) Complex logical and semantic associations (e.g., causal, temporal) exist between text chunks, but standard retrieval only focuses on semantic similarity to the query, failing to capture relationships between chunks.
Key Challenge: Global summarization requires integrating information from the entire document, but RAG retrieves only a few local text chunks per query. The open question is how to move from local retrieval to global understanding.
Goal: Utilize historical responses as "records" to build a graph structure, capturing global associations through graph learning to enhance long-text summarization capabilities of RAG.
Key Insight: Organize products of the retrieval-generation process (historical responses + corresponding text chunks) into a graph, where retrieved chunks and responses are nodes, and retrieval relationships are edges. GNNs are used to learn better node representations on this graph.
Core Idea: Organize RAG historical records using a graph structure, allowing GNNs to learn global associations beyond simple semantic similarity.
Method
Overall Architecture
(1) Graph Construction: Simulate multiple queries on the text chunks of a long document, use RAG to generate historical responses, and construct a bipartite graph of text chunks and responses; (2) Self-Supervised Training: Learn node embeddings using a GNN, optimized via a BERTScore-based ranking objective; (3) Retrieval and Summarization: When a new query arrives, use the learned node embeddings to retrieve the most relevant chunks and historical responses.
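Step (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `retrieve_top_k` is a toy word-overlap retriever standing in for a real sparse or dense retriever, `stub_generate` is a hypothetical placeholder for the LLM call, and the simulated queries are hard-coded rather than LLM-generated.

```python
def tokenize(text):
    """Lowercased bag of words; a deliberately crude stand-in for an embedding."""
    return set(text.lower().split())

def retrieve_top_k(query, chunks, k):
    """Toy retriever (assumption): rank chunks by word overlap with the query."""
    q = tokenize(query)
    order = sorted(range(len(chunks)),
                   key=lambda i: (-len(q & tokenize(chunks[i])), i))
    return order[:k]

def build_record_graph(chunks, queries, generate, k=2):
    """Build chunk nodes, response nodes, and chunk-response edges."""
    nodes = [("chunk", c) for c in chunks]
    edges = set()
    for query in queries:
        top = retrieve_top_k(query, chunks, k)
        response = generate(query, [chunks[i] for i in top])
        resp_id = len(nodes)               # new response node index
        nodes.append(("response", response))
        for i in top:                      # link each retrieved chunk to the response
            edges.add((i, resp_id))
    return nodes, edges

def stub_generate(query, context):
    """Hypothetical placeholder for the LLM generation step."""
    return f"answer to '{query}' using {len(context)} chunks"

chunks = [
    "the storm hit the coast on monday",
    "rescue teams arrived on tuesday",
    "the festival was cancelled",
]
queries = ["when did the storm hit", "what did rescue teams do"]
nodes, edges = build_record_graph(chunks, queries, stub_generate, k=2)
print(len(nodes), sorted(edges))
```

Chunks that were retrieved together for the same query end up sharing a response node as a neighbor, which is the co-usage signal the graph is meant to expose.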
Key Designs
- Graph Construction:
  - Function: Organizes text chunks and historical responses into a graph.
  - Mechanism: Uses an LLM to generate simulated queries from text chunks, retrieves the top-K chunks for each query to generate a response, and adds an edge between each retrieved chunk and that response.
  - Design Motivation: Edges encode "which text chunks were co-used to answer the same question"; this co-occurrence implies logical associations.
- BERTScore Self-Supervised Objective:
  - Function: Optimizes GNN node embeddings.
  - Mechanism: For each simulated query, computes the BERTScore-based semantic similarity between each node and the query's source text chunk, and uses the resulting scores as ranking labels. Node embeddings are then optimized with a contrastive loss combined with a pairwise ranking loss.
  - Design Motivation: Global summarization lacks local annotated labels. BERTScore provides an imprecise but useful indirect supervision signal.
- Graph-Augmented Retrieval:
  - Function: Performs retrieval using the learned node embeddings.
  - Mechanism: When a new query arrives, matches the query embedding against the graph node embeddings to retrieve the most relevant text chunks and historical responses.
  - Design Motivation: The GNN aggregates neighborhood information, so node embeddings incorporate global context beyond local semantics.
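The retrieval step can be illustrated with a small sketch. It assumes a single round of untrained mean-neighbor aggregation as a stand-in for the trained GNN, hand-picked 2-D feature vectors in place of real text embeddings, and cosine similarity for matching. None of these choices are taken from the paper; they only show how neighborhood information can enter a node's embedding before retrieval.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def aggregate(features, edges, self_weight=0.5):
    """One round of mean-neighbor aggregation (untrained stand-in for a GNN layer)."""
    neighbors = {i: [] for i in range(len(features))}
    for u, v in edges:                     # treat edges as undirected
        neighbors[u].append(v)
        neighbors[v].append(u)
    out = []
    for i, feat in enumerate(features):
        if neighbors[i]:
            mean = [sum(features[j][d] for j in neighbors[i]) / len(neighbors[i])
                    for d in range(len(feat))]
        else:
            mean = list(feat)              # isolated node keeps its own features
        mixed = [self_weight * a + (1 - self_weight) * b
                 for a, b in zip(feat, mean)]
        out.append(normalize(mixed))
    return out

def retrieve(query_vec, node_embs, k):
    """Return indices of the k nodes with the highest cosine similarity to the query."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, e)) for e in node_embs]
    return sorted(range(len(node_embs)), key=lambda i: (-scores[i], i))[:k]

# Nodes 0 and 1 are chunks; node 2 is a response linked to both (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
edges = [(0, 2), (1, 2)]
node_embs = aggregate(features, edges)
print(retrieve([1.0, 0.0], node_embs, k=2))
```

After aggregation, chunk 0 carries a nonzero second component inherited from the response it helped produce, so retrieval candidates include both chunks and historical responses scored in the same space.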
Loss & Training
- Contrastive loss + pairwise ranking loss.
- The GNN is trained on the graph linking text chunks to responses.
- LLM input is roughly 1.5K tokens (6 retrieved chunks × 256 tokens each).
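A minimal sketch of how a contrastive term and a pairwise ranking term could be combined is shown below. The specific forms (an InfoNCE-style softmax loss and a hinge loss over ordered pairs), the `temperature`, `margin`, and `alpha` values, and the toy BERTScore numbers are all assumptions made for illustration; the summary above only states that the two loss types are used together.

```python
import math

def ranking_from_bertscore(bertscores):
    """Order node indices from most to least similar according to the labels."""
    return sorted(range(len(bertscores)), key=lambda i: (-bertscores[i], i))

def contrastive_loss(scores, positive_idx, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive node's score."""
    logits = [s / temperature for s in scores]
    m = max(logits)                        # subtract max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_idx]

def pairwise_ranking_loss(scores, ranking, margin=0.1):
    """Hinge loss: a node ranked higher by the labels should score higher by `margin`."""
    total, pairs = 0.0, 0
    for a in range(len(ranking)):
        for b in range(a + 1, len(ranking)):
            higher, lower = ranking[a], ranking[b]
            total += max(0.0, margin - (scores[higher] - scores[lower]))
            pairs += 1
    return total / pairs if pairs else 0.0

def combined_loss(scores, bertscores, alpha=0.5):
    """Weighted sum of both terms; the top-ranked node is used as the positive."""
    ranking = ranking_from_bertscore(bertscores)
    return (contrastive_loss(scores, ranking[0])
            + alpha * pairwise_ranking_loss(scores, ranking))

bertscores = [0.91, 0.72, 0.85]            # hypothetical precomputed labels
good_scores = [0.9, 0.1, 0.5]              # model scores that agree with the labels
bad_scores = [0.1, 0.9, 0.5]               # model scores that reverse the labels
print(combined_loss(good_scores, bertscores), combined_loss(bad_scores, bertscores))
```

Scores that respect the BERTScore-derived order incur no ranking penalty and a smaller contrastive term, which is the behavior the training objective is meant to reward.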
Key Experimental Results
Main Results
| Dataset | Best Retrieval Baseline (R-L) | GoR (R-L) | Gain |
|---|---|---|---|
| WCEP | ~0.21 | ~0.24 | +15% |
| QMSum | ~0.18 | ~0.20 | +11% |
| BookSum | - | Best | - |
| AcademicEval | - | Best | - |
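The Gain column appears to be a relative improvement over the best retrieval baseline. As a rough check, assuming that reading and using only the rounded scores from the table (so the results are approximate and may differ slightly from figures computed on unrounded scores):

$$
\text{Gain} = \frac{\text{R-L}_{\text{GoR}} - \text{R-L}_{\text{baseline}}}{\text{R-L}_{\text{baseline}}},
\qquad
\frac{0.24 - 0.21}{0.21} \approx 14\%,
\qquad
\frac{0.20 - 0.18}{0.18} \approx 11\%.
$$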
Ablation Study
| Configuration | Effect | Description |
|---|---|---|
| w/o GNN | Degrades to ordinary retrieval | GNN graph learning is core |
| w/o Historical Response Nodes | Performance drops | Historical responses provide valuable information |
| w/o BERTScore Objective | Training does not converge | Self-supervised signal is necessary |
| w/o Ranking Loss | Slight performance drop | Both ranking and contrastive loss are important |
Key Findings
- GoR consistently outperforms all sparse/dense/hybrid retrieval baselines, demonstrating that the graph structure effectively captures associations between chunks.
- LLM evaluation (using DeepSeek-R1 as a judge) also confirms that GoR's summaries are more comprehensive, diverse, and informative.
- GNN is the most critical component—removing it degrades performance to the level of standard retrieval.
- Inference efficiency is acceptable—GNN inference overhead is far smaller than LLM generation overhead.
Highlights & Insights
- The insight that "historical responses are assets rather than waste" is highly valuable. RAG systems generate a large volume of responses daily; GoR demonstrates how to recycle this neglected information.
- Organizing RAG records with a graph structure is natural and effective, capturing inter-chunk logical and semantic relationships (e.g., causal, temporal, co-occurrence) that standard retrieval cannot capture.
- The self-supervised BERTScore objective addresses the lack of local annotation labels in global summarization by replacing non-existent direct labels with indirect signals.
- GNN neighborhood aggregation is inherently suited for capturing long-range associations between text chunks, identifying deeper relationships than similarity-based re-ranking.
- The memory utilization of 75%+ far exceeds the 30% of traditional RAG, indicating that the graph structure organizes information more efficiently.
Limitations & Future Work
- Graph construction requires a large number of simulated queries and LLM calls, incurring high initialization costs (requiring several queries and historical responses to be generated for each document).
- GNN scalability on extremely long documents (e.g., whole books with tens of thousands of chunks) may become a bottleneck, as the cost of graph convolution grows with the number of nodes and edges.
- Validated only on summarization tasks; performance on other RAG tasks such as QA remains unexplored.
- BERTScore as an indirect supervision signal may not be sufficiently accurate for certain types of text.
- Dynamic graph updating was not explored—specifically, how to incrementally update the graph structure when new queries and responses are generated.
Related Work & Insights
- vs. Standard RAG: Standard RAG does not utilize historical responses, whereas GoR translates history into a learnable graph structure.
- vs. GraphRAG (Microsoft): GraphRAG constructs entity-relation graphs, while GoR builds retrieval record graphs, representing distinct approaches.
Rating
- Novelty: ⭐⭐⭐⭐ Innovative graph structure and utilization of historical responses.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets + 12 baselines + ablation + efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ Exceptionally clear method description.
- Value: ⭐⭐⭐⭐ Highly valuable improvement for long-context RAG.