Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

Conference: CVPR 2026 | arXiv: 2601.13719 | Code: N/A | Area: LLM Agent | Keywords: long video understanding, audiovisual entity cohesion, hierarchical indexing, agentic search, speaker identification

TL;DR

This paper proposes HAVEN, a framework that combines audiovisual entity cohesion, a four-level hierarchical video index (global–scene–clip–entity), and a goal-driven agentic search mechanism, achieving 84.1% overall accuracy on LVBench and 80.1% on its reasoning category.

Background & Motivation

  1. Background: Long video understanding faces the challenge of extremely long context windows. RAG-based methods and agentic frameworks have made progress, but suffer from information fragmentation and loss of global coherence.
  2. Limitations of Prior Work: Existing retrieval-driven methods retrieve based on isolated signals (clip-level captions), and fragmented or redundant evidence severely undermines global narrative coherence. The absence of hierarchical video representations leaves agents without the structured context required for multi-level reasoning.
  3. Key Challenge: Flat databases (frames, captions, entities) require extensive iterative retrieval to recover cross-clip continuity, introducing unnecessary complexity and computational cost.
  4. Goal: To shift from fragmented retrieval toward coherent, structured understanding.
  5. Key Insight: Leveraging speaker identification as a strong signal for entity cohesion—speaker identity remains informative when visual cues degrade (occlusion, viewpoint change, etc.).
  6. Core Idea: Audiovisual entity cohesion + four-level hierarchical indexing + goal-driven agentic search.

Method

Overall Architecture

An offline phase constructs a four-level database \(\mathcal{D} = \{\tilde{\mathcal{C}}, \tilde{\mathcal{E}}, \tilde{\mathcal{S}}, \tilde{\mathcal{G}}\}\) (clip, entity, scene, global). At inference time, the agent navigates across levels via a think-act-observe loop to retrieve and reason.
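No code is released (Code: N/A above), so the following is a minimal sketch of the offline/online split. The `HierarchicalDB` fields and the agent interface (`think` / `run` / `force_answer`) are hypothetical names, not the authors' implementation; they only mirror the four-level database \(\mathcal{D}\) and the think-act-observe loop described in the paper.

```python
from dataclasses import dataclass

@dataclass
class HierarchicalDB:
    """Hypothetical container mirroring D = {C~, E~, S~, G~}."""
    clips: list            # clip level: captions + visual embeddings
    entities: dict         # entity level: normalized entities -> clip re-descriptions
    scenes: list           # scene level: LLM-aggregated scene summaries
    global_summary: str    # global level: one summary over all scenes

def answer(db: HierarchicalDB, question: str, agent, max_steps: int = 8) -> str:
    """Think-act-observe loop: start from the global summary, then let the
    agent pick retrieval tools until it commits to an answer."""
    context = [db.global_summary, question]
    for _ in range(max_steps):
        action = agent.think(context)       # think: choose the next tool call
        if action.name == "final_answer":
            return action.args["text"]
        observation = action.run(db)        # act: execute the chosen tool
        context.append(observation)         # observe: accumulate evidence
    return agent.force_answer(context)      # fallback once the budget is spent
```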

Key Designs

  1. Audiovisual Entity Cohesion:

     • Function: Maintains semantic consistency of entities across time and modalities.
     • Mechanism: WhisperX is used for ASR and speaker diarization to obtain timestamped transcriptions and consistent speaker labels. After entity extraction, a two-stage merging process is applied: (1) embedding clustering to form candidate groups; (2) LLM review of each cluster for normalization or splitting. When multiple clips share the same speaker label, the corresponding character entities are preferentially merged (see the sketch after this item).
     • Design Motivation: Speaker identity remains reliable when visual cues degrade (occlusion, shot changes, appearance variation) and serves as a "glue" for cross-clip entity association.
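To make the two-stage merge concrete, here is a minimal sketch. The entity schema, the `embed` and `llm_review` callables, and the clustering threshold are illustrative assumptions; only the flow (speaker-label grouping first, then embedding clustering, then LLM normalization or splitting) follows the paper.

```python
from collections import defaultdict
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cohere_entities(entities, embed, llm_review, dist_threshold=0.3):
    """Sketch of audiovisual entity cohesion (schema and helpers are
    hypothetical). `entities` are dicts like {"name": ..., "speaker": ...};
    `embed` returns a unit-norm vector; `llm_review` normalizes or splits
    a candidate group into one or more canonical entities."""
    # Speaker-first merge: character entities that share a WhisperX
    # diarization label are grouped before any embedding clustering.
    groups = defaultdict(list)
    for ent in entities:
        groups[ent.get("speaker") or ent["name"]].append(ent)

    keys = list(groups)
    if len(keys) < 2:                      # nothing left to cluster
        return [e for grp in groups.values() for e in llm_review(grp)]
    reps = [groups[k][0]["name"] for k in keys]   # representative surface name
    vecs = np.stack([embed(r) for r in reps])

    # Stage 1: embedding clustering over cosine distance -> candidate groups.
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=dist_threshold,
    ).fit_predict(vecs)

    # Stage 2: LLM review of each cluster for normalization or splitting.
    clusters = defaultdict(list)
    for k, lab in zip(keys, labels):
        clusters[lab].extend(groups[k])
    merged = []
    for members in clusters.values():
        merged.extend(llm_review(members))  # may return one entity or a split
    return merged
```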

  2. Four-Level Hierarchical Database:

     • Function: Supports flexible retrieval at multiple granularities.
     • Mechanism: (1) Clip level: fixed 30-second windows with textual descriptions and visual embeddings; (2) Entity level: normalized entities with re-descriptions of their associated clips; (3) Scene level: an LLM adaptively aggregates semantically related clips into scene summaries; (4) Global level: a global summary generated from the set of scenes. (A construction sketch follows this item.)
     • Design Motivation: Different query types require information at different granularities: "what is the video about" requires the global level, while "what happened at 12:00" requires clip-level detail.
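A sketch of the offline construction, assuming hypothetical `describe_clip`, `extract_entities`, `group_into_scenes`, and `summarize` helpers standing in for the captioner and LLM calls; only the fixed 30-second windowing and the clip-to-scene-to-global ordering come from the paper.

```python
def build_index(duration_s, describe_clip, extract_entities,
                group_into_scenes, summarize):
    """Offline construction of the four-level index (helper names are
    hypothetical stand-ins for captioner / LLM calls)."""
    WINDOW = 30  # fixed 30-second clip windows, per the paper
    clips = [
        {"start": t, "end": min(t + WINDOW, duration_s),
         "caption": describe_clip(t, min(t + WINDOW, duration_s))}
        for t in range(0, int(duration_s), WINDOW)
    ]
    # Entity level: normalized entities with re-descriptions of their clips.
    entities = extract_entities(clips)
    # Scene level: the LLM adaptively aggregates semantically related clips.
    scenes = group_into_scenes(clips)
    # Global level: one summary generated from the set of scenes.
    return {"clips": clips, "entities": entities,
            "scenes": scenes, "global": summarize(scenes)}
```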

  3. Agentic Search with Multi-Granularity Toolset:

     • Function: Query-driven adaptive multi-level retrieval and reasoning.
     • Mechanism: Five tools are defined: global scene browsing \(T_{scene}\), clip caption search \(T_{caption}\), clip visual search \(T_{visual}\), entity search \(T_{entity}\), and an inspection tool \(T_{inspect}\) (with both text and visual modes). The agent is initialized with the global summary and dynamically selects tools across multiple iterations.
     • Design Motivation: Low-cost text retrieval is prioritized; high-cost visual inspection is invoked only when necessary (see the dispatch sketch after this list).
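The text-before-visual policy can be pictured as cost-ordered dispatch over the five tools. The numeric costs and the proposal format below are assumptions for illustration; the paper states only that cheap text retrieval is preferred and visual inspection is escalated to when necessary.

```python
# Hypothetical relative costs; only the ordering (text retrieval cheap,
# visual inspection expensive) reflects the paper's design motivation.
TOOL_COST = {
    "T_scene": 1,     # global scene browsing
    "T_caption": 1,   # clip caption search
    "T_entity": 1,    # entity search
    "T_visual": 5,    # clip visual search
    "T_inspect": 10,  # clip inspection (text or visual mode)
}

def pick_action(candidates):
    """Among the tool calls the agent proposes this iteration, run the
    cheapest first; visual tools are reached only when the agent stops
    proposing adequate text-level retrievals."""
    return min(candidates, key=lambda call: TOOL_COST[call["tool"]])

# Example: caption search wins over visual inspection of the same clip.
proposals = [
    {"tool": "T_inspect", "args": {"clip_id": 24, "mode": "visual"}},
    {"tool": "T_caption", "args": {"query": "who enters the room"}},
]
assert pick_action(proposals)["tool"] == "T_caption"
```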

Loss & Training

HAVEN is a purely inference-based framework with no training. The database is constructed offline, and the agent performs online search at inference time.

Key Experimental Results

Main Results

| Method          | LVBench Overall (%) | Reasoning Category (%) |
|-----------------|---------------------|------------------------|
| HAVEN (2fps)    | 84.1                | 80.1                   |
| DVD w/ subtitle | 76.0                | 68.7                   |
| Seed1.5-VL-200B | 64.6                | 63.7                   |
| OpenAI o3       | 57.1                | 50.8                   |

Ablation Study

| Configuration             | LVBench Overall (%) | Note                       |
|---------------------------|---------------------|----------------------------|
| Full HAVEN                | 84.1                | Best                       |
| w/o audio entity cohesion | Drops               | Entity fragmentation       |
| w/o hierarchical indexing | Drops               | Lower retrieval efficiency |

Key Findings

  • HAVEN achieves 80.1% on the most challenging reasoning category, substantially outperforming DVD (68.7%).
  • Speaker identity is a critical factor—Figure 3 demonstrates that characters with drastic appearance changes are correctly associated via speaker labels.
  • Denser frame sampling helps: raising the sampling rate to 2fps improves overall accuracy from 81.0 to 84.1, as it provides more visual evidence.

Highlights & Insights

  • Speaker Identity as a Cross-Modal Glue: The framework elegantly exploits speaker consistency in the audio signal, a signal previously underutilized.
  • Offline–Online Decoupling: The hierarchical database is constructed offline, and inference requires only lightweight tool calls.
  • Strong Practicality: The approach is particularly effective for dialogue-intensive content such as documentaries, TV series, and vlogs.

Limitations & Future Work

  • The framework depends on the accuracy of ASR and speaker diarization, and may degrade in noisy audio environments.
  • Fixed 30-second clip segmentation may not be suitable for all video types.
  • Only cached content was available for this review; further experimental details require consulting the full paper.
  • The computational cost and storage overhead of database construction are not analyzed in detail.

Comparison with Related Methods

  • vs. DVD: DVD relies on a flat database that requires extensive retrieval iterations, whereas HAVEN's hierarchical database design reduces the number of iterations.
  • vs. VideoRAG: VideoRAG depends on graph-based semantic retrieval but lacks hierarchical structure.

Rating

  • Novelty: ⭐⭐⭐⭐ Audiovisual entity cohesion and the use of speaker identity constitute novel contributions; the hierarchical indexing design is systematic.
  • Experimental Thoroughness: ⭐⭐⭐ Limited cached details; LVBench results are strong but results on other benchmarks are incomplete.
  • Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear, case studies are intuitive, and method descriptions are well-organized.
  • Value: ⭐⭐⭐⭐ A practical framework for long video understanding; the speaker identity utilization strategy is transferable to other multimodal scenarios.