HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming¶
Conference: ICLR 2026 · arXiv: 2602.14214 · Code: To be confirmed · Area: Time Series · Keywords: video saliency, LLM-as-judge, content-aware streaming, time series forecasting, adaptive bitrate
TL;DR¶
This paper proposes HiVid, the first framework to leverage LLMs as human proxies for generating content importance weights for video chunks. Through a Perception module (sliding-window scoring), a Ranking module (LLM-guided merge sort to eliminate scoring bias), and a Prediction module (multimodal time series forecasting with adaptive latency), HiVid enables content-aware streaming, achieving an 11.5% improvement in VOD PLCC, a 26% gain in live streaming prediction, and a 14.7% improvement in human MOS correlation.
Background & Motivation¶
Background: Content-aware video streaming assigns higher bitrates to more important chunks via \(QoE = \sum_i w_i \cdot q_i\). Existing approaches include CV-based highlight detection models (DETR, VASNet, etc.) and human crowdsourced annotation (SENSEI).
Limitations of Prior Work: CV models suffer from insufficient semantic understanding and poor generalization; large video understanding models (VideoLLaMA3, VILA) exhibit severe hallucination on subjective scoring tasks; human annotation is prohibitively expensive (78 min / $100 per video) and infeasible for live streaming scenarios.
Key Challenge: A weight generation approach that simultaneously achieves accuracy (semantic understanding) and efficiency (real-time + low cost) is lacking.
Technical Challenges: (1) LLMs cannot directly process video and are subject to token limits; (2) local scores produced within each sliding window are not calibrated across windows; (3) live streaming requires real-time inference, but LLM latency is nondeterministic.
Key Insight: Using LLMs as "human proxies" for zero-shot subjective reasoning, bypassing token limits through windowing and context summarization.
Core Idea: LLM perception + merge-sort debiasing + multimodal predictive adaptive latency = end-to-end content-aware streaming.
Method¶
Overall Architecture¶
HiVid consists of three modules: Perception (base) → Ranking (VOD) / Prediction (live streaming), ultimately outputting chunk weights \(w_i\) integrated into the QoE model.
Key Designs¶
- Perception Module: Every \(m\) frames are fed to an LLM (default GPT-4o) via a sliding window; the prompt requests scores and an updated summary: \(R_{(k-1)m+1}^{km},\; S_{km} = \mathrm{LLM}(F_{(k-1)m+1}^{km}, S_{(k-1)m})\). Only \(\lceil D/m \rceil\) LLM calls are needed for arbitrarily long videos; the summary \(S\) serves as a compressed historical context (a sketch of this loop and the Ranking step follows this list).
- Ranking Module (VOD): LLM-guided merge sort is employed to eliminate inter-window scoring bias. At each merge step, \(m/2\) frames are sampled from each sorted group to form a new list for LLM ranking; overall complexity is \(O(k \log k)\) where \(k = \lceil D/m \rceil\). The rank-derived scores are then normalized to \([0,1]\) and Gaussian-smoothed, \(w = \mathrm{GS}(s, \sigma)\).
- Prediction Module (Live Streaming): A multimodal time series prediction model comprising:
- CLIP Alignment: Frozen CLIP encodes historical frames and text summaries.
- Content-Aware Attention: Temporal features serve as Q; concatenated image and text features serve as K/V: \(\mathrm{Attn}(F(x_w), F(x_{cat}), F(x_{cat})) = \mathrm{softmax}\!\left(\frac{Q_w K_{cat}^{T}}{\sqrt{d}}\right) \cdot V_{cat}\)
- Adaptive Prediction Dimension: Output length is dynamically adjusted based on LLM latency \(\Delta t\) and prediction latency \(\delta\): \(L_{out} = \lceil(\Delta t + \delta)/d\rceil + m + N\)
- Correlation Loss: \(loss = MSE(x, x_{gt}) + \lambda(1 - \text{Pearson}(x, x_{gt}))\)
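Below is a minimal Python sketch of the Perception and Ranking steps described above. The `llm_score_window(window, summary)` and `llm_rank(frames)` wrappers are hypothetical stand-ins for the GPT-4o prompts; prompt design, frame encoding, and the exact merge bookkeeping used in the paper are abstracted away.

```python
import math
import numpy as np
from scipy.ndimage import gaussian_filter1d

def perceive(frames, llm_score_window, m=10):
    """Perception: score every m frames with one LLM call, carrying a running
    text summary S as compressed history (ceil(D/m) calls in total).
    llm_score_window(window, summary) -> (scores, new_summary) is a
    hypothetical wrapper around the scoring prompt."""
    scores, summary = [], ""
    for k in range(math.ceil(len(frames) / m)):
        window = frames[k * m:(k + 1) * m]
        window_scores, summary = llm_score_window(window, summary)
        scores.extend(window_scores)
    return scores

def rank(groups, llm_rank, m=10):
    """Ranking: LLM-guided merge sort over the per-window frame groups, using
    the LLM as the ordering function to remove inter-window bias
    (O(k log k) LLM calls for k = ceil(D/m) windows).
    llm_rank(frames) -> the same frames reordered by importance is
    hypothetical; how the sampled ranking is propagated back to the full
    groups is simplified here."""
    if len(groups) <= 1:
        return groups[0] if groups else []
    mid = len(groups) // 2
    left = rank(groups[:mid], llm_rank, m)
    right = rank(groups[mid:], llm_rank, m)
    # Merge step: sample m/2 frames from each sorted side, let the LLM order
    # the combined list, and place the side whose top sample ranks higher first.
    sample = left[: m // 2] + right[: m // 2]
    order = llm_rank(sample)
    return left + right if order.index(left[0]) < order.index(right[0]) else right + left

def to_weights(scores, sigma=1.0):
    """Normalize scores to [0, 1] and Gaussian-smooth them into chunk weights,
    mirroring w = GS(s, sigma)."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)
    return gaussian_filter1d(s, sigma)
```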
Loss & Training¶
- The Perception and Ranking modules require no training (zero-shot LLM inference).
- For the Prediction module, multiple models with different \(L_{out}\) values are trained; at inference time, the smallest model satisfying the latency requirement is selected.
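A hedged PyTorch sketch of the Prediction-side pieces described above: the content-aware cross-attention, the correlation loss, and the latency-driven choice of output length. Layer sizes, the `models_by_len` registry, and all names here are illustrative assumptions rather than the paper's interface.

```python
import math
import torch
import torch.nn as nn

class ContentAwareAttention(nn.Module):
    """Cross-attention with temporal features as Q and concatenated CLIP
    image/text features as K and V (single head, illustrative sizes)."""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, x_w, x_cat):
        # x_w: (B, T, dim) temporal features; x_cat: (B, S, dim) CLIP image+text.
        q = self.q(x_w)
        k, v = self.kv(x_cat).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        return attn @ v

def correlation_loss(pred, target, lam=1.0):
    """loss = MSE(x, x_gt) + lambda * (1 - Pearson(x, x_gt))."""
    mse = torch.mean((pred - target) ** 2)
    p, t = pred - pred.mean(), target - target.mean()
    pearson = (p * t).sum() / (p.norm() * t.norm() + 1e-8)
    return mse + lam * (1.0 - pearson)

def required_output_len(delta_t, delta, d, m, N):
    """Adaptive prediction dimension: L_out = ceil((dt + delta) / d) + m + N,
    where dt is the (nondeterministic) LLM latency and delta the predictor's own."""
    return math.ceil((delta_t + delta) / d) + m + N

def pick_model(models_by_len, delta_t, delta, d, m, N):
    """Choose the smallest pretrained predictor whose output length covers the
    required horizon (models_by_len maps L_out -> trained model; hypothetical)."""
    need = required_output_len(delta_t, delta, d, m, N)
    feasible = [L for L in models_by_len if L >= need]
    return models_by_len[min(feasible)] if feasible else None
```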
Key Experimental Results¶
Main Results¶
Saliency scoring accuracy (PLCC) across three datasets:
| Method | YouTube-8M PLCC | TVSum PLCC | SumMe PLCC |
|---|---|---|---|
| DETR | 0.57 | 0.42 | 0.38 |
| SL-module | 0.59 | 0.43 | 0.39 |
| VideoLLaMA3 | 0.54 | 0.41 | 0.35 |
| HiVid | 0.66 | 0.50 | 0.47 |
Ablation Study¶
Effect of window parameter \(m\) on API overhead (201 s video):
| m | Total API Calls | Total Cost | Total Time (h) |
|---|---|---|---|
| 2 | 1458 | $8.12 | 1.26 |
| 6 | 384 | $2.41 | 0.67 |
| 10 | 202 | $1.35 | 0.54 |
\(m=10\) achieves the optimal accuracy–cost trade-off.
Key Findings¶
- HiVid outperforms the second-best method SL-module by 11.5% in average PLCC and 6% in mAP50.
- In live streaming scenarios, HiVid's multimodal prediction outperforms the strongest time series baseline iTransformer by 26%.
- Human MOS correlation improves by 14.7%, validating real-world streaming QoE gains.
- Video understanding models (VILA, Flamingo) underperform CV baselines on subjective scoring tasks.
Highlights & Insights¶
- First framework to systematically leverage LLMs for video-level content-aware streaming: Extends the LLM-as-judge paradigm from text to video streaming.
- LLM-guided merge sort: An elegant algorithm design using LLMs as a comparison function, with manageable \(O(k \log k)\) overhead.
- Adaptive prediction dimension: Dynamic adjustment for asynchronous LLM inference latency, a critical design for practical deployment.
- End-to-end validation: A complete validation chain from scoring accuracy to real-world streaming QoE.
- Content-Aware Attention for multimodal fusion: A novel attention design that aligns CLIP image and text features and combines them with temporal sequences.
Limitations & Future Work¶
- Relies on the closed-source GPT-4o API; cost remains relatively high ($1.35/video), limiting large-scale deployment.
- The Perception module uses only the first frame as an anchor, potentially missing intra-window dynamic changes (e.g., rapid motion).
- In live streaming, the initial \(\lceil(\Delta t + \delta)/d\rceil + m\) chunks lack LLM scores and are filled with a default weight of 1.
- The Ranking module incurs significant overhead for very long videos; the practical API cost of \(O(k \log k)\) LLM calls is non-trivial.
- Only GPT-4o is evaluated; open-source LLM alternatives (e.g., Llama, Qwen) are not explored.
- Scoring quality is heavily dependent on the subjective judgment capability of the LLM, and different LLMs may introduce different biases.
- Dynamic video characteristics (e.g., scene transitions, camera motion) are not considered as temporal features.
- Category-specific strategies for different video genres (sports, news, education, etc.) are not explored.
Related Work & Insights¶
- SENSEI obtains precise weights via human crowdsourcing at prohibitive cost; HiVid achieves an accuracy–efficiency balance through LLMs.
- Compared to attention-based highlight detection methods (DETR, SL-module), LLMs demonstrate clear advantages in semantic content understanding.
- Compared to VideoLLaMA3/VILA: large video understanding models suffer severe hallucination on subjective scoring tasks and underperform the LLM vision-plus-text strategy.
- Offers methodological insights for the intersection of networked systems and AI: asynchronous design patterns for integrating LLM inference into online systems.
- Suggests that CLIP-aligned image and text features can serve as effective contextual signals in multimodal time series prediction.
Rating¶
- Novelty: ⭐⭐⭐⭐ LLM-as-judge applied to video streaming is a novel combination, though individual modules are more engineering integration than technical innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, 17 baselines, ablations, and a human user study — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Problem-driven structure with three challenges mapped to three modules; rigorous exposition.
- Value: ⭐⭐⭐⭐ Practically significant for content-aware streaming, though generalizability to the broader academic community is somewhat limited.