Aha: Predicting What Matters Next — Online Highlight Detection Without Looking Ahead

Conference: NeurIPS 2025 · arXiv: 2509.16421 · Code: GitHub · Area: Autonomous Driving / Video Understanding
Keywords: Online Highlight Detection, Streaming Video, Autoregressive, Video-Language Model, Uncertainty Modeling

TL;DR

Aha proposes the first autoregressive framework for Online Highlight Detection (OHD), featuring a decoupled multi-objective prediction head (relevance / informativeness / uncertainty) and a novel Dynamic SinkCache memory mechanism. Under strict causal constraints with no access to future frames, Aha surpasses prior offline methods on TVSum and Mr.Hisum benchmarks by +5.9% and +8.3% mAP, respectively.

Background & Motivation

Background Highlight Detection (HD) aims to identify the key segments of a video. Nearly all modern Transformer-based HD methods rely on offline, full-sequence access. Streaming Video-LLMs can process video as it arrives, but they treat HD only as an auxiliary capability and achieve limited performance on it.

Limitations of Prior Work (1) Offline methods require complete videos and are unsuitable for real-time decision-making (autonomous driving / surveillance / search-and-rescue); (2) Existing Video-LLMs performing HD either require benchmark modification or post-processing smoothing that violates online constraints; (3) No robust method has been specifically designed for OHD.

Key Challenge Effective HD requires understanding of temporal context, yet online constraints mandate the exclusive use of past and current information — the central challenge is achieving high-accuracy frame-level scoring under strict causal constraints.

Goal Design a real-time, task-conditioned online highlight detection framework that uses neither future frames nor modifications to standard benchmarks.

Key Insight Establish an autoregressive scoring framework that captures three complementary dimensions — "Is it relevant? Is it novel? Is it certain?" — via a decoupled multi-objective head, coupled with a task-aware memory mechanism for unbounded streaming inference.

Core Idea Formalize online HD as an autoregressive multi-objective scoring problem, and employ Dynamic SinkCache to support unbounded-length inference with constant memory overhead.

Method

Overall Architecture

Aha consists of four components: (1) a frozen SigLIP visual encoder for frame feature extraction; (2) a single-layer linear projection mapping features to the LLM space; (3) a Qwen2-based autoregressive decoder processing interleaved text and visual token sequences; and (4) three multi-objective prediction heads plus an auxiliary LM head.
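The data flow through these four components can be sketched at a toy scale. Everything below is illustrative: the dimensions are placeholders (the real SigLIP and Qwen2 widths are much larger), the decoder is stood in by a `tanh`, and the head parameterizations (sigmoid for informativeness, softplus for variance) are plausible assumptions rather than the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_model = 64, 32  # toy sizes; actual SigLIP/Qwen2 dims are assumptions

# (1) frozen visual encoder output for one frame (stand-in for SigLIP features)
frame_feat = rng.normal(size=d_vis)

# (2) single-layer linear projection into the decoder's token space
W_proj = rng.normal(size=(d_vis, d_model)) * 0.1
vis_token = frame_feat @ W_proj

# (3) decoder hidden state h_t -- a stand-in for the Qwen2 decoder output
h_t = np.tanh(vis_token)

# (4) decoupled prediction heads on top of h_t
W_r = rng.normal(size=d_model) * 0.1  # relevance (linear, unbounded)
W_i = rng.normal(size=d_model) * 0.1  # informativeness (sigmoid -> [0, 1])
W_u = rng.normal(size=d_model) * 0.1  # uncertainty (softplus -> positive variance)

r_hat = float(W_r @ h_t)
i_hat = float(1.0 / (1.0 + np.exp(-(W_i @ h_t))))
u_hat = float(np.log1p(np.exp(W_u @ h_t)))  # softplus keeps the variance positive
```

The point of the sketch is the decoupling: all three heads read the same hidden state \(h_t\) but carry independent parameters and output ranges.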

Key Designs

  1. Decoupled Multi-Objective Prediction Heads:

    • Function: Predict three complementary signals from decoder hidden state \(h_t\) — relevance, informativeness, and uncertainty.
    • Mechanism: The relevance head \(\hat{r}_t = W_r h_t\) is supervised with Smooth L1 loss plus total variation regularization (\(\mathcal{L}_{rel-total} = \mathcal{L}_{rel} + \lambda_{TV}\mathcal{L}_{TV}\)); the informativeness head predicts whether a frame introduces new information (supervised with BCE); the uncertainty head predicts Gaussian variance (supervised with NLL plus a variance diversity penalty to prevent mode collapse).
    • Design Motivation: Effective HD requires not only task relevance but also informational novelty and prediction reliability. The decoupled design allows each head to independently optimize its complementary objective.
  2. Dynamic SinkCache:

    • Function: Enable unbounded streaming inference with constant memory overhead.
    • Mechanism: An extension of SinkCache that dedicates the sink region exclusively to task-description text tokens (~45 tokens), while a sliding window (2,048 tokens) retains recent visual context. Formalized as \(\mathcal{K}_t = \{\mathcal{Q}, k_{t-n:t}\}\), requiring only 17% of the memory of a standard KV cache.
    • Design Motivation: Standard KV caches grow linearly with sequence length and lead to OOM; static SinkCache uses generic initial tokens as sinks without task specificity; Dynamic SinkCache preserves task objectives to achieve long-range semantic alignment.
  3. Uncertainty-Aware Fusion Scoring:

    • Function: Fuse the outputs of the three heads into a final highlight score.
    • Mechanism: A piecewise linear function \(\hat{y}_t = \alpha\hat{i}_t + \beta\hat{r}_t - \epsilon(\hat{u}_t - \tau_u)\mathbf{1}[\hat{u}_t > \tau_u]\) performs weighted fusion under low uncertainty and suppresses scores when uncertainty exceeds a threshold.
    • Design Motivation: High uncertainty indicates unreliable model judgment for the current frame, whose influence should therefore be reduced.
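The fusion rule in item 3 is a one-line piecewise function. A minimal sketch, with \(\alpha\), \(\beta\), \(\epsilon\), and \(\tau_u\) set to illustrative values rather than the paper's settings:

```python
def fuse(r_hat, i_hat, u_hat, alpha=0.4, beta=0.5, eps=1.0, tau_u=0.5):
    """Uncertainty-aware fusion: weighted sum of informativeness and
    relevance, with a linear penalty applied only when the predicted
    uncertainty u_hat exceeds the threshold tau_u.
    Weight values here are illustrative placeholders."""
    penalty = eps * (u_hat - tau_u) if u_hat > tau_u else 0.0
    return alpha * i_hat + beta * r_hat - penalty

confident = fuse(0.8, 0.6, 0.2)  # below tau_u: pure weighted fusion
uncertain = fuse(0.8, 0.6, 0.9)  # above tau_u: score is suppressed
```

Below the threshold the score is an ordinary weighted sum; above it, the penalty grows linearly with the excess uncertainty, so mildly uncertain frames are dampened rather than discarded.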

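The eviction rule behind Dynamic SinkCache (item 2) can be sketched with a flat list standing in for the per-layer KV tensors: keep the task-description tokens as a permanent sink, keep the most recent window, drop everything in between. The sink/window sizes below are the ones quoted above (~45 and 2,048); the list-of-ids representation is a simplification.

```python
def trim_cache(cache, num_sink=45, window=2048):
    """Dynamic SinkCache eviction sketch: retain the task-description
    tokens (sink region) plus the `window` most recent entries, so the
    cache size stays constant regardless of stream length."""
    if len(cache) <= num_sink + window:
        return cache  # still within budget, nothing to evict
    return cache[:num_sink] + cache[-window:]

# a 5,000-token stream trimmed to the constant budget
kept = trim_cache(list(range(5000)))
```

This matches the formalization \(\mathcal{K}_t = \{\mathcal{Q}, k_{t-n:t}\}\): the sink \(\mathcal{Q}\) never rotates out, which is what keeps the task objective aligned over unbounded streams.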
Loss & Training

The total loss is \(\mathcal{L}_{total} = \lambda_r\mathcal{L}_{rel-total} + \lambda_i\mathcal{L}_{info} + \lambda_u\mathcal{L}_{unc} + \lambda_{LM}\mathcal{L}_{LM}\), with fixed weights to ensure training stability. Training data includes the HIHD dataset (22,463 videos with YouTube engagement signals derived from Mr.Hisum) and Shot2Story/COIN for informativeness head supervision.
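The weighted combination above is straightforward to write down. A sketch, assuming the component losses are already computed and using placeholder weights (the paper fixes its \(\lambda\) values but they are not reproduced here):

```python
def total_loss(l_rel, l_tv, l_info, l_unc, l_lm,
               lam_tv=0.1, lam_r=1.0, lam_i=1.0, lam_u=0.5, lam_lm=0.1):
    """Fixed-weight multi-objective loss sketch.
    l_rel:  Smooth L1 relevance loss        l_tv:  total-variation regularizer
    l_info: BCE informativeness loss        l_unc: Gaussian NLL + diversity penalty
    l_lm:   auxiliary language-modeling loss
    All lambda values are illustrative placeholders."""
    l_rel_total = l_rel + lam_tv * l_tv  # L_rel-total = L_rel + lambda_TV * L_TV
    return lam_r * l_rel_total + lam_i * l_info + lam_u * l_unc + lam_lm * l_lm
```

Keeping the weights fixed (rather than learned or scheduled) is the stability choice the text mentions: no head can dominate the gradient signal mid-training.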

Key Experimental Results

Main Results — TVSum Highlight Detection

| Model | Fine-tuned | mAP | Kendall τ | Spearman ρ |
| --- | --- | --- | --- | --- |
| TR-DETR (Offline, SOTA) | Y | 87.1 | – | – |
| LLMVS | N | – | 0.211 | 0.275 |
| Aha (Zero-shot) | N | 91.6 | 0.304 | 0.433 |
| Aha (Domain-adapted) | N | 93.0 | 0.285 | 0.406 |

Main Results — Mr.Hisum

| Model | mAP@50 | mAP@15 |
| --- | --- | --- |
| PGL-SUM | 55.89 | 27.45 |
| Aha | 64.19 | 32.66 |

Ablation Study

| Configuration | mAP | Note |
| --- | --- | --- |
| Full Aha | 93.0 | Baseline |
| w/o Relevance Head | 77.3 | −15.7; most critical component |
| w/o Informativeness Head | 83.2 | −9.8; substantial contribution |
| w/o Language Conditioning | 81.2 | −11.8; task text is essential |
| Dynamic SinkCache | 93.0 | Outperforms unbounded cache and standard SinkCache |

Key Findings

  • The fully online model surpasses all fine-tuned offline methods in the zero-shot setting (91.6 vs. 87.1 mAP).
  • Dynamic SinkCache supports long-video inference over 127K+ tokens using only 17% of standard cache memory.
  • Language conditioning is critical for task-oriented HD (removal causes −11.8 mAP).

Highlights & Insights

  • This work is the first to demonstrate that strict online causal constraints need not preclude surpassing offline full-context methods, challenging the intuition that "a complete video is necessary for high-quality HD."
  • Anchoring task semantics as long-term memory via Dynamic SinkCache is an elegant and principled design choice.
  • The three-head decoupled architecture provides interpretability — relevance, novelty, and prediction reliability can each be analyzed independently.

Limitations & Future Work

  • Engagement signals as a proxy for highlights may introduce bias (e.g., clickbait content).
  • The current framework supports only frame-level scoring, lacking segment-level output and structured summarization.
  • HIHD is derived from YouTube data, which may limit generalization to safety-critical domains (e.g., medical, military).
Comparison with Related Work

  • vs. TR-DETR: An offline method with bidirectional attention; Aha surpasses it by 5.9 mAP points using purely causal inference.
  • vs. MMDuet: A streaming Video-LLM that treats HD as an auxiliary function; Aha is optimized specifically for HD.
  • vs. StreamingLLM: Aha's Dynamic SinkCache is a task-aware extension of StreamingLLM's SinkCache.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First strictly online HD framework; results surpassing offline methods are compelling.
  • Experimental Thoroughness: ⭐⭐⭐⭐ TVSum + Mr.Hisum + ablations + robot video, multi-dimensional validation.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear; method description is thorough.
  • Value: ⭐⭐⭐⭐⭐ Directly applicable to real-time video understanding systems.