VGEnt: Graph-Based Retrieval-Reasoning-Augmented Generation for Long Video Understanding

Conference: NeurIPS 2025
arXiv: 2510.14032
Code: GitHub
Area: Video Understanding
Keywords: long video understanding, graph RAG, structured reasoning, retrieval-augmented generation, video language model

TL;DR

This paper proposes VGEnt, a graph-based retrieval-reasoning-augmented generation framework that constructs a video knowledge graph to preserve cross-segment semantic relationships, and introduces structured reasoning steps to filter noise and aggregate information. VGEnt consistently improves open-source LVLMs by 3.0%–5.4% across multiple long video understanding benchmarks and outperforms existing video RAG methods by 8.6%.

Background & Motivation

Core challenges in long video understanding:

Context window limitations: A 30-minute video can exceed 200K tokens, surpassing the context limits of most models. Existing approaches address this via sparse sampling or token compression, but inevitably lose fine-grained temporal information.

Limitations of naive RAG: Naive RAG splits videos into independent segments for retrieval, disrupting entity continuity and temporal dependencies. In approximately 40% of failure cases, the correct segments are retrieved, yet the model still produces an incorrect answer due to interference from irrelevant information.

Dependence on closed-source models: Methods such as VideoAgent and DrVideo rely on GPT-4 for multi-round interaction, incurring high costs and limited flexibility.

Method

Overall Architecture

VGEnt comprises four stages: (1) offline video graph construction; (2) graph-based retrieval; (3) structured reasoning; and (4) multimodal augmented generation. The entire pipeline is training-free and can be directly applied to any open-source LVLM.

Key Designs

  1. Video Graph Construction:

    • The video is segmented into clips of \(K=64\) frames each, with each clip serving as a node in the graph.
    • An LVLM extracts key entities (subjects, actions, scenes) and their descriptions from each clip.
    • Cross-segment entity merging is performed via text embedding similarity (threshold \(\tau=0.7\)): semantically equivalent entities are unified, and edges are established between nodes sharing the same entities.
    • Graph construction is offline and query-agnostic: once built, it can be reused for multiple questions on the same video.
  2. Graph-based Retrieval:

    • Keywords \(\mathcal{K}\) are extracted from the user query.
    • The similarity between each keyword and every entity description in the global entity set \(\mathcal{U}\) is computed; all nodes associated with entities exceeding threshold \(\theta=0.5\) are treated as candidates.
    • Top-\(N\) (\(N=20\)) most relevant segments are selected via reranking.
    • Compared to naive RAG's independent per-segment retrieval, the graph structure naturally preserves temporal associations among entities.
  3. Structured Reasoning:

    • Core finding: in approximately 40% of failure cases, correct segments are retrieved but the model still answers incorrectly (information overload problem).
    • Divide-and-verify: the LVLM generates structured sub-queries (yes/no or numerical), and each retrieved segment is verified against them individually.
    • Noise filtering: only segments passing at least one sub-query verification are retained (up to \(r=5\)), effectively eliminating hard negatives.
    • Information aggregation: for the filtered segments, all sub-query results are aggregated to produce auxiliary context.
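The graph-construction and retrieval stages above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: `embed` stands in for the BAAI/bge-large-en-v1.5 encoder, the entity lists stand in for LVLM-extracted entities, and the function names are hypothetical. The thresholds \(\tau=0.7\) and \(\theta=0.5\) and the top-\(N\) cap are taken from the paper.

```python
# Hypothetical sketch of VGEnt-style graph construction and retrieval.
# embed() is a stand-in for a text encoder (the paper uses bge-large-en-v1.5);
# entity extraction per clip would be done by an LVLM.
from collections import defaultdict
import math

TAU = 0.7    # entity-merging similarity threshold (from the paper)
THETA = 0.5  # keyword-entity retrieval threshold (from the paper)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_graph(clip_entities, embed):
    """clip_entities: one list of entity descriptions per 64-frame clip.
    Merges entities whose embedding similarity exceeds TAU, then links
    clips that share a merged entity."""
    canonical = []                      # (name, vector) of merged entities
    entity_to_clips = defaultdict(set)
    for clip_id, entities in enumerate(clip_entities):
        for name in entities:
            vec = embed(name)
            for canon_name, canon_vec in canonical:
                if cosine(vec, canon_vec) >= TAU:
                    name = canon_name   # merge into the existing entity
                    break
            else:
                canonical.append((name, vec))
            entity_to_clips[name].add(clip_id)
    edges = set()                       # edges between clips sharing an entity
    for clips in entity_to_clips.values():
        for a in clips:
            for b in clips:
                if a < b:
                    edges.add((a, b))
    return entity_to_clips, edges

def retrieve(keywords, entity_to_clips, embed, top_n=20):
    """Score clips by keyword-entity similarity above THETA; return top-N."""
    scores = defaultdict(float)
    for kw in keywords:
        kv = embed(kw)
        for name, clips in entity_to_clips.items():
            sim = cosine(kv, embed(name))
            if sim >= THETA:
                for c in clips:
                    scores[c] = max(scores[c], sim)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because the graph is query-agnostic, `build_graph` runs once per video and only `retrieve` runs per question, which matches the one-time offline cost noted above.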
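The divide-and-verify step can likewise be sketched. In VGEnt the sub-query generator and the per-segment verifier are LVLM calls; here `verify` is a pluggable callable so that only the filtering and aggregation logic (an assumption-laden reconstruction, with the \(r=5\) cap from the paper) is shown.

```python
# Hypothetical sketch of structured reasoning (divide-and-verify).
# verify(segment, q) would be an LVLM call answering a yes/no or numeric
# sub-query against one retrieved segment; None means "cannot answer".
R_MAX = 5  # keep at most r=5 verified segments (from the paper)

def structured_reasoning(segments, sub_queries, verify, r_max=R_MAX):
    """Returns (kept segments, aggregated evidence). A segment is kept only
    if it passes at least one sub-query, which filters hard negatives."""
    kept, evidence = [], {}
    for seg in segments:
        answers = {q: verify(seg, q) for q in sub_queries}
        answers = {q: a for q, a in answers.items() if a is not None}
        if answers:                      # passes at least one sub-query
            kept.append(seg)
            evidence[seg] = answers
        if len(kept) == r_max:           # cap at r segments
            break
    # Aggregate every (segment, sub-query, answer) triple as auxiliary
    # context for the final multimodal generation step.
    context = [(s, q, a) for s in kept for q, a in evidence[s].items()]
    return kept, context
```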

Loss & Training

VGEnt is a training-free framework with no additional fine-tuning or loss functions. Graph construction uses BAAI/bge-large-en-v1.5 embeddings for similarity computation, and speech transcription (subtitles) uses openai/whisper-large.

Key Experimental Results

Main Results

| Model | Size | MLVU Gain | VideoMME (w/ sub.) Gain | LVB Gain |
|---|---|---|---|---|
| InternVL2.5 + VGEnt | 2B | +4.4 | +1.6 | +2.8 |
| Qwen2.5-VL + VGEnt | 3B | +4.2 | +2.0 | +3.6 |
| LongVU + VGEnt | 7B | +5.4 | +2.8 | +2.5 |
| Qwen2-VL + VGEnt | 7B | +4.6 | +2.0 | +2.8 |
| LLaVA-Video + VGEnt | 7B | +3.0 | +1.9 | +2.9 |
| Qwen2.5-VL + VGEnt | 7B | +3.3 | +3.2 | +3.7 |

Highlight: Qwen2.5-VL (3B) + VGEnt achieves 70.4% on MLVU, surpassing its 7B counterpart (68.8%).

Ablation Study

| Configuration | MLVU | VideoMME | LVB | Notes |
|---|---|---|---|---|
| Qwen2.5-VL baseline | 68.8 | 71.1 | 56.0 | No RAG |
| + Naive RAG | 65.4 | 68.3 | 56.2 | Naive RAG degrades MLVU |
| + GraphRAG | 69.5 | 72.7 | 57.1 | Graph retrieval outperforms naive RAG |
| + Naive RAG + SR | 68.6 | 69.8 | 57.3 | Structured reasoning alone provides limited help |
| + GraphRAG + SR | 72.1 | 74.3 | 59.7 | Synergy of both yields the best results |

Key Findings

  • Naive RAG can be harmful: MLVU drops from 68.8 to 65.4, indicating that inappropriate retrieval introduces more noise than no retrieval at all.
  • GraphRAG vs. NaïveRAG: average improvement of 2.9%, with a gap of 4.1% on MLVU.
  • Necessity of structured reasoning: GraphRAG + SR improves over GraphRAG alone by approximately 2.6%.
  • Gains are most significant on long video scenarios (VideoMME long subset), reaching 5.4%.

Highlights & Insights

  • The graph structure's advantage lies in preserving entity-level temporal dependencies—something chunk-based RAG fundamentally cannot achieve.
  • The divide-and-verify strategy in structured reasoning is particularly elegant: it decomposes complex questions into simple yes/no sub-problems, allowing the model to respond segment by segment, thereby reducing the reasoning burden on the LVLM.
  • The modular design of the framework enables plug-and-play integration with any open-source LVLM, with graph construction incurring only a one-time offline cost.

Limitations & Future Work

  • Graph construction relies on an LVLM to extract entities and descriptions, placing demands on the model's visual comprehension capability.
  • Entity merging uses a fixed similarity threshold, which may cause synonymous entities with different surface forms to remain unmerged.
  • Structured reasoning introduces multiple rounds of LVLM calls, resulting in higher inference latency.
  • For abstract reasoning tasks lacking explicit entities (e.g., sentiment or style analysis), the advantages of graph structure may be limited.
  • Relation to GraphRAG (NLP): The paper transfers graph-augmented RAG ideas from the text domain to video, with the addition of visual entity merging and multimodal verification.
  • Relation to VideoAgent: Addresses a similar problem without relying on closed-source APIs, achieving a self-contained open-source solution.
  • Inspiration: The post-hoc structured verification paradigm can be generalized to other multimodal RAG scenarios, such as document understanding and multi-image reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of graph RAG and structured reasoning represents a novel attempt in the video domain.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 models × 3 benchmarks × comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with well-integrated figures and text.
  • Value: ⭐⭐⭐⭐ Plug-and-play framework with strong practical utility; the result of a 3B model surpassing a 7B model is highly compelling.