Aha: Predicting What Matters Next — Online Highlight Detection Without Looking Ahead¶
Conference: NeurIPS 2025 arXiv: 2509.16421 Code: GitHub Area: Autonomous Driving / Video Understanding Keywords: Online Highlight Detection, Streaming Video, Autoregressive, Video-Language Model, Uncertainty Modeling
TL;DR¶
Aha proposes the first autoregressive framework for Online Highlight Detection (OHD), featuring a decoupled multi-objective prediction head (relevance / informativeness / uncertainty) and a novel Dynamic SinkCache memory mechanism. Under strict causal constraints with no access to future frames, Aha surpasses prior offline methods on TVSum and Mr.Hisum benchmarks by +5.9% and +8.3% mAP, respectively.
Background & Motivation¶
Background Highlight Detection (HD) aims to identify key segments from videos. Nearly all modern Transformer-based HD methods rely on offline, full-sequence access. Streaming Video-LLMs can process video as it arrives, but when HD is treated only as an auxiliary capability, performance is limited.
Limitations of Prior Work (1) Offline methods require complete videos and are unsuitable for real-time decision-making (autonomous driving / surveillance / search-and-rescue); (2) Existing Video-LLMs performing HD either require benchmark modification or post-processing smoothing that violates online constraints; (3) No robust method has been specifically designed for OHD.
Key Challenge Effective HD requires understanding of temporal context, yet online constraints mandate the exclusive use of past and current information — the central challenge is achieving high-accuracy frame-level scoring under strict causal constraints.
Goal Design a real-time, task-conditioned online highlight detection framework that uses neither future frames nor modifications to standard benchmarks.
Key Insight Establish an autoregressive scoring framework that captures three complementary dimensions — "Is it relevant? Is it novel? Is it certain?" — via a decoupled multi-objective head, coupled with a task-aware memory mechanism for unbounded streaming inference.
Core Idea Formalize online HD as an autoregressive multi-objective scoring problem, and employ Dynamic SinkCache to support unbounded-length inference with constant memory overhead.
Method¶
Overall Architecture¶
Aha consists of four components: (1) a frozen SigLIP visual encoder for frame feature extraction; (2) a single-layer linear projection mapping features to the LLM space; (3) a Qwen2-based autoregressive decoder processing interleaved text and visual token sequences; and (4) three multi-objective prediction heads plus an auxiliary LM head.
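The four components above form one scoring step per incoming frame. The sketch below illustrates that flow; the callable interfaces (encoder, projector, decoder, heads) are assumptions for illustration, not the released implementation:

```python
def aha_step(frame, kv_cache, encoder, projector, decoder, heads):
    """One autoregressive scoring step over a single incoming frame.

    encoder:   frozen SigLIP visual encoder (frame -> features)
    projector: single linear layer mapping features into the LLM space
    decoder:   Qwen2-style causal decoder over interleaved text/visual tokens
    heads:     dict of the three prediction heads applied to the hidden state
    """
    visual_feats = encoder(frame)                    # frozen feature extraction
    visual_tokens = projector(visual_feats)          # project to LLM token space
    h_t = decoder(visual_tokens, kv_cache=kv_cache)  # causal decoding, past context only
    return (heads["relevance"](h_t),
            heads["informativeness"](h_t),
            heads["uncertainty"](h_t))
```

Because only past and current tokens reach the decoder, the step respects the strict causal constraint by construction.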
Key Designs¶
- Decoupled Multi-Objective Prediction Heads:
    - Function: Predict three complementary signals from decoder hidden state \(h_t\) — relevance, informativeness, and uncertainty.
    - Mechanism: The relevance head \(\hat{r}_t = W_r h_t\) is supervised with Smooth L1 loss plus total variation regularization (\(\mathcal{L}_{rel-total} = \mathcal{L}_{rel} + \lambda_{TV}\mathcal{L}_{TV}\)); the informativeness head predicts whether a frame introduces new information (supervised with BCE); the uncertainty head predicts Gaussian variance (supervised with NLL plus a variance diversity penalty to prevent mode collapse).
    - Design Motivation: Effective HD requires not only task relevance but also informational novelty and prediction reliability. The decoupled design allows each head to independently optimize its complementary objective.
- Dynamic SinkCache:
    - Function: Enable unbounded streaming inference with constant memory overhead.
    - Mechanism: An extension of SinkCache that dedicates the sink region exclusively to task-description text tokens (~45 tokens), while a sliding window (2,048 tokens) retains recent visual context. Formalized as \(\mathcal{K}_t = \{\mathcal{Q}, k_{t-n:t}\}\), requiring only 17% of the memory of a standard KV cache.
    - Design Motivation: Standard KV caches grow linearly with sequence length and lead to OOM; static SinkCache uses generic initial tokens as sinks without task specificity; Dynamic SinkCache preserves task objectives to achieve long-range semantic alignment.
- Uncertainty-Aware Fusion Scoring:
    - Function: Fuse the outputs of the three heads into a final highlight score.
    - Mechanism: A piecewise linear function \(\hat{y}_t = \alpha\hat{i}_t + \beta\hat{r}_t - \epsilon(\hat{u}_t - \tau_u)\mathbf{1}[\hat{u}_t > \tau_u]\) performs weighted fusion under low uncertainty and suppresses scores when uncertainty exceeds a threshold.
    - Design Motivation: High uncertainty indicates unreliable model judgment for the current frame, whose influence should therefore be reduced.
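The fusion rule above takes only a few lines. In this sketch the weights \(\alpha, \beta, \epsilon\) and the threshold \(\tau_u\) are placeholder values, not the paper's tuned hyperparameters:

```python
def fuse_score(relevance, informativeness, uncertainty,
               alpha=0.5, beta=0.5, eps=1.0, tau_u=0.5):
    """Uncertainty-aware fusion:
    y_t = alpha*i_t + beta*r_t - eps*(u_t - tau_u)*1[u_t > tau_u]
    The penalty term is active only when uncertainty exceeds tau_u."""
    score = alpha * informativeness + beta * relevance
    if uncertainty > tau_u:
        score -= eps * (uncertainty - tau_u)  # suppress unreliable frames
    return score
```

Below the threshold the score is a plain weighted sum; above it, the penalty grows linearly with the excess uncertainty, so mildly uncertain frames are dampened rather than discarded.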
Loss & Training¶
The total loss is \(\mathcal{L}_{total} = \lambda_r\mathcal{L}_{rel-total} + \lambda_i\mathcal{L}_{info} + \lambda_u\mathcal{L}_{unc} + \lambda_{LM}\mathcal{L}_{LM}\), with fixed weights to ensure training stability. Training data includes the HIHD dataset (22,463 videos with YouTube engagement signals derived from Mr.Hisum) and Shot2Story/COIN for informativeness head supervision.
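As a concrete example of one term in this objective, the total-variation regularizer \(\mathcal{L}_{TV}\) on the relevance stream can be sketched as follows (the common L1 form is assumed here; the paper's exact formulation may differ):

```python
def tv_regularizer(relevance_scores):
    """L_TV: total-variation penalty over consecutive relevance predictions,
    discouraging jittery frame-to-frame scores (L1 form assumed)."""
    return sum(abs(relevance_scores[t] - relevance_scores[t - 1])
               for t in range(1, len(relevance_scores)))
```

A perfectly flat score sequence incurs zero penalty, so the term pushes the relevance head toward temporally smooth outputs without constraining their absolute level.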
Key Experimental Results¶
Main Results — TVSum Highlight Detection¶
| Model | Fine-tuned | mAP | Kendall τ | Spearman ρ |
|---|---|---|---|---|
| TR-DETR (Offline, SOTA) | Y | 87.1 | - | - |
| LLMVS | N | - | 0.211 | 0.275 |
| Aha (Zero-shot) | N | 91.6 | 0.304 | 0.433 |
| Aha (Domain-adapted) | N | 93.0 | 0.285 | 0.406 |
Main Results — Mr.Hisum¶
| Model | mAP@50 | mAP@15 |
|---|---|---|
| PGL-SUM | 55.89 | 27.45 |
| Aha | 64.19 | 32.66 |
Ablation Study¶
| Configuration | mAP | Note |
|---|---|---|
| Full Aha | 93.0 | Baseline |
| w/o Relevance Head | 77.3 | −15.7, most critical component |
| w/o Informativeness Head | 83.2 | −9.8, substantial contribution |
| w/o Language Conditioning | 81.2 | −11.8, task text is essential |
| Dynamic SinkCache | 93.0 | Outperforms unbounded cache and standard SinkCache |
Key Findings¶
- The fully online model surpasses all fine-tuned offline methods in the zero-shot setting (91.6 vs. 87.1 mAP).
- Dynamic SinkCache supports long-video inference over 127K+ tokens using only 17% of standard cache memory.
- Language conditioning is critical for task-oriented HD (removal causes −11.8 mAP).
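As a toy illustration of the constant-memory behavior noted above, a Dynamic SinkCache can be modeled as a pinned sink plus a bounded FIFO window; the sizes and interfaces here are illustrative assumptions, not the released implementation:

```python
from collections import deque

class DynamicSinkCacheSketch:
    """Toy model of Dynamic SinkCache: task-description tokens are pinned
    as the sink and never evicted, while visual tokens live in a bounded
    FIFO window, so total memory stays constant however long the stream runs."""

    def __init__(self, task_tokens, window_size=2048):
        self.sink = list(task_tokens)            # ~45 task-text tokens in the paper
        self.window = deque(maxlen=window_size)  # recent visual context

    def append(self, token):
        self.window.append(token)                # oldest visual token evicted at capacity

    def tokens(self):
        # K_t = {Q, k_{t-n:t}}: sink first, then the sliding window
        return self.sink + list(self.window)
```

However many frames arrive, `tokens()` never exceeds the sink size plus the window size, which is the property that lets inference run over 127K+ token streams.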
Highlights & Insights¶
- This work is the first to demonstrate that strict online causal constraints need not preclude surpassing offline full-context methods, challenging the intuition that "a complete video is necessary for high-quality HD."
- Anchoring task semantics as long-term memory via Dynamic SinkCache is an elegant and principled design choice.
- The three-head decoupled architecture provides interpretability — relevance, novelty, and prediction reliability can each be analyzed independently.
Limitations & Future Work¶
- Engagement signals as a proxy for highlights may introduce bias (e.g., clickbait content).
- The current framework supports only frame-level scoring, lacking segment-level output and structured summarization.
- HIHD is derived from YouTube data, which may limit generalization to safety-critical domains (e.g., medical, military).
Related Work & Insights¶
- vs. TR-DETR: An offline bidirectional attention method; Aha surpasses it by 5.9 mAP (93.0 vs. 87.1) using purely causal inference.
- vs. MMDuet: A streaming Video-LLM where HD is an auxiliary function; Aha is specifically optimized for HD.
- vs. StreamingLLM: Aha's Dynamic SinkCache is a task-aware extension of the StreamingLLM SinkCache.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First strictly online HD framework; results surpassing offline methods are compelling.
- Experimental Thoroughness: ⭐⭐⭐⭐ TVSum + Mr.Hisum + ablations + robot video, multi-dimensional validation.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear; method description is thorough.
- Value: ⭐⭐⭐⭐⭐ Directly applicable to real-time video understanding systems.