Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction
Conference: AAAI 2026 | arXiv: 2511.10134 | Code: None | Area: Video Understanding
Keywords: Dense Video Captioning, Cross-modal Retrieval, Temporal Clustering, Feature Enhancement, Retrieval-Augmented Generation
TL;DR
This paper proposes the CACMI framework, which addresses two fundamental limitations in dense video captioning (insufficient temporal modeling and modality gap) through explicit temporal-semantic modeling. It employs Cross-modal Frame Aggregation (CFA) to extract temporally coherent event semantics, and Context-aware Feature Enhancement (CFE) to bridge the visual-textual modality gap, achieving state-of-the-art performance on ActivityNet Captions and YouCook2.
Background & Motivation
Dense Video Captioning (DVC) requires simultaneously localizing and describing all salient events with precise temporal boundaries in untrimmed videos. Recent retrieval-augmented generation (RAG)-based methods (e.g., CM2) have begun incorporating external semantic knowledge to enhance understanding and generation capabilities.
Limitations of Prior Work: Existing memory-based methods rely on implicit RAG frameworks that use manually designed fixed windows for cross-modal retrieval, leading to two fundamental limitations:
1. Insufficient Temporal Modeling: Fixed-window visual features focus only on local segments, resulting in discontinuous semantic retrieval that fails to capture temporal coherence across event sequences.
2. Modality Gap: Retrieved semantic features are fused with visual representations via simple operations (concatenation or basic attention), which are insufficient to bridge the inherent gap between visual and textual modalities.
Key Challenge: Effective retrieval-augmented DVC requires exploiting the temporal structure and rich semantic information inherent in video data, yet current methods naively concatenate frame-level or fragmented textual information, neglecting temporal coherence.
Key Insight: Adjacent frames share similar visual and temporal contexts and typically represent the same semantic event. Based on this observation, the paper introduces explicit temporal-semantic modeling via pseudo-events, endowing retrieved textual semantics with temporal properties.
Method
Overall Architecture
CACMI follows a RAG paradigm: a CLIP image encoder extracts frame-level features → the CFA module aggregates temporally coherent frames and retrieves event-aligned text → the CFE module fuses visual and textual features → a Deformable Transformer with multi-task heads produces event localization and captioning outputs.
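A hypothetical end-to-end skeleton of this pipeline is sketched below; all module names, signatures, and shapes are illustrative assumptions (the paper releases no code), not the authors' implementation.

```python
# Minimal sketch of the CACMI forward pass described above. Every module
# here is a placeholder callable; names and shapes are assumptions.

def cacmi_forward(frames, sentence_pool_feats, clip_image_encoder,
                  cfa, cfe, deformable_transformer, heads):
    """frames: (T, 3, H, W) video frames sampled at 1 FPS."""
    F_v = clip_image_encoder(frames)               # (T, d) frame-level features
    F_q = cfa(F_v, sentence_pool_feats)            # (c, d) event-level text queries
    F_fused = cfe(F_v, F_q)                        # (T, d) enhanced visual features
    event_feats = deformable_transformer(F_fused)  # (N_q, d) decoded event queries
    boundaries = heads["loc"](event_feats)         # event centers + temporal spans
    captions = heads["cap"](event_feats, F_fused)  # word-by-word captions (LSTM)
    n_events = heads["count"](event_feats)         # predicted event count
    return boundaries, captions, n_events
```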
Key Designs
- Cross-modal Frame Aggregation (CFA) (see the first sketch after this list):
  - Event Context Clustering:
    - Function: Aggregates frame-level visual features into temporally coherent pseudo-event representations.
    - Mechanism: Applies agglomerative clustering (Euclidean distance + Ward linkage) to CLIP frame features, augmented with a temporal aggregation constraint (the temporal gap between any two frames within the same cluster must not exceed \(t_{\max}\)), ensuring that semantically similar and temporally contiguous frames are grouped together.
    - Design Motivation: Agglomerative clustering assumes no fixed cluster shape, making it well suited to discovering flexible patterns in feature space; the temporal constraint enforces event-level temporal coherence.
    - Output: \(c\) cluster-level feature vectors \(F^c\), one per pseudo-event, computed via boundary-enhanced weighted averaging (inverted bell-shaped weights that assign higher weight to boundary frames).
  - Event Semantic Retrieval:
    - Function: Retrieves the most relevant textual descriptions from a sentence pool for each pseudo-event.
    - Mechanism: A CLIP text encoder preprocesses the sentence pool; cosine similarities between pseudo-event features and all text features are computed; the top-\(k\) candidates per pseudo-event are retrieved and average-pooled.
    - Design Motivation: The core innovation lies in performing retrieval at the event granularity rather than at the frame or fixed-window level, preserving the integrity of the temporal structure.
- Context-aware Feature Enhancement (CFE) (see the second sketch after this list):
  - Function: Fine-grained cross-modal fusion that uses textual queries to guide visual feature enhancement.
  - Mechanism: Computes a similarity matrix \(M\) between frame-level visual features \(F^v\) and event-level textual queries \(F^q\); applies dual attention (column-wise and row-wise softmax) to obtain cross-attention features \(F^{v'}\) and \(F^{q'}\); concatenates these with the original features and projects them; finally injects a global text vector via a 1D convolution for fusion.
  - Design Motivation: CM2 enhances features with shared self-attention weights, a parameter-sharing scheme that is insufficient to bridge the semantic gap between modalities. Query-guided fusion instead selectively suppresses irrelevant visual elements and enhances semantically aligned regions.
- Multi-task Prediction Heads:
  - Localization Head: An MLP that regresses event centers and temporal spans.
  - Captioning Head: An LSTM with deformable soft attention for word-by-word caption generation.
  - Event Counter: Max-pooling + FC layers that predict the number of events in the video.
  - Matching: The Hungarian algorithm matches predictions to ground-truth events (see the matching sketch under Loss & Training below).
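To make the CFA mechanics concrete, here is a minimal sketch under stated assumptions: the \(t_{\max}\) gap constraint is approximated with a scikit-learn connectivity matrix that only links frames within \(t_{\max}\) steps of each other, and the inverted bell-shaped weighting is one plausible instantiation (1 minus a Gaussian). Neither detail, nor the default values of `t_max`, is the authors' exact formulation.

```python
# Sketch of CFA: temporally constrained agglomerative clustering,
# boundary-enhanced pooling, and event-level top-k text retrieval.
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import AgglomerativeClustering

def cfa(frame_feats, text_feats, n_clusters=10, t_max=20, top_k=40):
    """frame_feats: (T, d) CLIP frame features; text_feats: (S, d) CLIP
    sentence-pool features. Returns (n_clusters, d) event-level text queries."""
    T, d = frame_feats.shape

    # Event Context Clustering: Ward/Euclidean agglomerative clustering,
    # restricted so only frames within t_max steps of each other can merge.
    conn = lil_matrix((T, T), dtype=np.int8)
    for i in range(T):
        conn[i, max(0, i - t_max):min(T, i + t_max + 1)] = 1
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward", connectivity=conn
    ).fit_predict(frame_feats)

    # Boundary-enhanced pooling: inverted bell-shaped weights per cluster
    # (low at the cluster center, high at boundary frames).
    cluster_feats = np.zeros((n_clusters, d))
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        pos = np.linspace(-1.0, 1.0, num=len(idx))
        w = 1.0 - np.exp(-4.0 * pos ** 2) + 1e-3
        w /= w.sum()
        cluster_feats[c] = (w[:, None] * frame_feats[idx]).sum(axis=0)

    # Event Semantic Retrieval: cosine top-k per pseudo-event, average-pooled.
    a = cluster_feats / np.linalg.norm(cluster_feats, axis=1, keepdims=True)
    b = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    top = np.argsort(-(a @ b.T), axis=1)[:, :top_k]  # top-k text indices per event
    return text_feats[top].mean(axis=1)              # (n_clusters, d) queries
```

Under this connectivity constraint, Ward merges can only join temporally nearby frames, which approximates, rather than exactly enforces, the paper's in-cluster gap bound.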
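Likewise, a minimal PyTorch sketch of the CFE dual-attention fusion, assuming single-head dot-product attention, simple linear projections, and a mean-pooled global text vector; the layer sizes and the exact 1D-conv fusion step are assumptions, not the paper's code.

```python
# Sketch of CFE: dual (row-/column-wise) attention over the visual-text
# similarity matrix, followed by projection and global-text fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFE(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.proj_v = nn.Linear(2 * d, d)   # fuse F_v with attended text
        self.proj_q = nn.Linear(2 * d, d)   # fuse F_q with attended video
        self.fuse = nn.Conv1d(2 * d, d, kernel_size=1)  # inject global text vector

    def forward(self, F_v, F_q):
        """F_v: (T, d) frame features; F_q: (c, d) event-level text queries."""
        M = F_v @ F_q.t()                              # (T, c) similarity matrix
        A_v = F.softmax(M, dim=1) @ F_q                # row-wise: text -> each frame
        A_q = F.softmax(M, dim=0).t() @ F_v            # column-wise: frames -> each query
        F_v2 = self.proj_v(torch.cat([F_v, A_v], dim=-1))   # (T, d) enhanced visual
        F_q2 = self.proj_q(torch.cat([F_q, A_q], dim=-1))   # (c, d) enhanced textual
        g = F_q2.mean(dim=0, keepdim=True).expand_as(F_v2)  # global text vector
        x = torch.cat([F_v2, g], dim=-1).t().unsqueeze(0)   # (1, 2d, T)
        return self.fuse(x).squeeze(0).t()                  # (T, d) fused features
```

For example, `CFE(d=512)(torch.randn(100, 512), torch.randn(10, 512))` returns a `(100, 512)` tensor of text-enhanced frame features.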
Loss & Training
- Matching loss: \(L_{\text{match}} = L_{\text{cls}} + \alpha \cdot L_{\text{loc}}\) (focal classification loss + generalized IoU loss)
- Total loss: \(L = \alpha_{\text{cls}} \cdot L_{\text{cls}} + \alpha_{\text{loc}} \cdot L_{\text{loc}} + \alpha_{\text{count}} \cdot L_{\text{count}} + \alpha_{\text{cap}} \cdot L_{\text{cap}}\)
- Frame sampling: 1 FPS; 100 frames fixed for ActivityNet, 200 for YouCook2.
- Event queries: 10 for ActivityNet, 100 for YouCook2.
- Cluster count: 10 for ActivityNet, 20 for YouCook2.
- Retrieval top-\(k\) = 40.
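A small sketch of how the matching cost and Hungarian assignment fit together, assuming precomputed pairwise cost matrices; the focal-classification and generalized-IoU implementations are omitted as placeholders.

```python
# Sketch of set matching: combine classification and localization costs
# per (prediction, ground truth) pair, then solve the assignment.
from scipy.optimize import linear_sum_assignment

def match_events(cls_cost, loc_cost, alpha=1.0):
    """cls_cost, loc_cost: (N_pred, N_gt) pairwise cost matrices."""
    C = cls_cost + alpha * loc_cost            # L_match per pair, as defined above
    pred_idx, gt_idx = linear_sum_assignment(C)
    return pred_idx, gt_idx                    # one-to-one Hungarian assignment

# The total loss is then the weighted sum over matched pairs:
# L = a_cls * L_cls + a_loc * L_loc + a_count * L_count + a_cap * L_cap
```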
Key Experimental Results
Main Results (Captioning Performance)
ActivityNet Captions (comparison with non-pretrained methods):
| Method | BLEU4↑ | METEOR↑ | CIDEr↑ | SODA_c↑ |
|---|---|---|---|---|
| PDVC (ICCV'21) | 2.21 | 8.06 | 29.97 | 5.92 |
| CM2 (CVPR'24) | 2.38 | 8.55 | 33.01 | 6.18 |
| E2DVC (CVPR'25) | 2.43 | 8.57 | 33.63 | 6.13 |
| CACMI (Ours) | 2.44 | 8.68 | 33.80 | 6.39 |
YouCook2:
| Method | BLEU4↑ | METEOR↑ | CIDEr↑ | SODA_c↑ |
|---|---|---|---|---|
| PDVC | 1.40 | 5.56 | 29.69 | 4.92 |
| CM2 | 1.63 | 6.08 | 31.66 | 5.34 |
| CACMI (Ours) | 1.70 | 6.21 | 34.83 | 5.57 |
Event Localization Performance
| Method | ActivityNet F1↑ | YouCook2 F1↑ |
|---|---|---|
| PDVC | 54.78 | 26.81 |
| CM2 | 55.21 | 28.43 |
| E2DVC | 56.42 | 28.87 |
| CACMI (Ours) | 57.10 | 29.34 |
Ablation Study
Module ablation (ActivityNet Captions):
| CFA | CFE | CIDEr | SODA_c | F1 |
|---|---|---|---|---|
| ✗ | ✗ | 33.01 | 6.18 | 55.21 |
| ✓ | ✗ | 33.62 | 6.26 | 56.07 |
| ✗ | ✓ | 33.48 | 6.31 | 56.95 |
| ✓ | ✓ | 33.80 | 6.39 | 57.10 |
Effect of cluster count (ActivityNet Captions):
| Cluster Count | CIDEr | F1 | Note |
|---|---|---|---|
| 3 | 32.84 | 54.91 | Too coarse |
| 10 | 33.80 | 57.10 | Optimal |
| 15 | 32.98 | 55.15 | Over-segmented |
Effect of retrieval top-\(k\) (ActivityNet Captions):
| Top-\(k\) | CIDEr | F1 | Note |
|---|---|---|---|
| 10 | 32.20 | 55.95 | Insufficient semantic diversity |
| 40 | 33.80 | 57.10 | Optimal balance |
| 80 | 32.57 | 56.15 | Redundant information dilution |
Key Findings
- CFE contributes more to localization: CFE alone improves F1 from 55.21 to 56.95 (+1.74), while CFA alone yields +0.86, indicating that cross-modal fusion is critical for temporal boundary prediction.
- CFA contributes more to captioning quality: CFA improves CIDEr by 0.61 vs. 0.47 for CFE alone, demonstrating that event-level semantic retrieval enriches caption content.
- Complementary effect of both modules: The combination achieves the best performance across all metrics.
- Most pronounced gains on SODA_c: The proposed method surpasses all baselines most notably on SODA_c, which evaluates narrative coherence, confirming that explicit temporal modeling effectively captures inter-event temporal dependencies.
Highlights & Insights
- Explicit vs. implicit temporal modeling: Discovering natural event boundaries via clustering, rather than manually designing fixed windows, better reflects the intrinsic structure of videos.
- Boundary-weighted event representation: The inverted bell-shaped weights assign higher importance to boundary frames, facilitating precise temporal localization.
- Effectiveness of query-guided fusion: More effectively bridges the modality gap compared to CM2's shared self-attention mechanism.
- No large-scale pretraining required: Surpasses certain pretrained methods without additional video pretraining data.
Limitations & Future Work
- The number of clusters is a hyperparameter that may require adaptive tuning for videos of varying lengths and complexity.
- The construction and quality of the sentence pool directly affect retrieval performance, yet this aspect is not thoroughly discussed in the paper.
- A performance gap relative to the pretrained Vid2Seq remains on YouCook2, likely due to limited domain coverage of the training videos.
- The captioning head still employs an LSTM; stronger generative models (e.g., LLM decoders) have not been explored.
- Euclidean distance used for clustering may not be the optimal metric in the high-dimensional CLIP feature space.
- Dynamic top-\(k\) or adaptive retrieval strategies have not been investigated.
Related Work & Insights
- CM2 as a pioneering work: CM2 first introduced a memory retrieval mechanism into DVC; the present work improves upon it in retrieval granularity and fusion design.
- Adapting the RAG paradigm to video: Extending text-retrieval-augmented generation from NLP to video understanding requires preserving temporal structure as a key consideration.
- Influence of PDVC's design: The parallel decoding structure allows localization and captioning subtasks to share intermediate representations.
- Insight: In any RAG system involving temporal data, performing retrieval at the granularity of semantically coherent segments—rather than fixed windows—may be a superior strategy.
Rating
- Novelty: ⭐⭐⭐⭐ (Clear motivation for explicit temporal-semantic modeling; theoretically grounded CFA+CFE design)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive evaluation on two datasets, thorough ablation studies, convincing visualizations)
- Writing Quality: ⭐⭐⭐⭐ (Well-structured, sufficiently motivated, mathematically rigorous)
- Value: ⭐⭐⭐⭐ (Provides new state-of-the-art results and a meaningful methodological contribution to DVC)