LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Conference: CVPR 2025
arXiv: 2504.16030
Code: https://showlab.github.io/livecc
Area: Audio & Speech / Video Understanding
Keywords: Streaming Video LLM, ASR Transcription, Real-time Commentating, Dense Interleaving, YouTube CC

TL;DR

This paper proposes LiveCC, which trains video LLMs by densely interleaving ASR transcript words with video frames along the timeline. It constructs the Live-CC-5M pre-training dataset, enabling a 7B model to outperform 72B models (including Qwen2.5-VL-72B) on real-time video commentating tasks.

Background & Motivation

Background: Video LLMs are typically trained in an offline mode to "answer questions after watching the entire video." However, real-world scenarios (such as sports commentating, live video streaming) require the model to continuously generate descriptions as the video flows in—referred to as streaming understanding.

Limitations of Prior Work: Existing video LLMs lack streaming capabilities because the training data consists of "video-question-answer" triplets, which lack temporally dense continuous descriptions. Although YouTube has a massive number of captioned videos, the subtitle quality varies significantly.

Key Challenge: High-quality streaming training data is scarce—human annotation is extremely costly, whereas auto-generated closed captions (CC) are highly noisy.

Key Insight: Treat YouTube CC subtitles as ASR transcriptions and interleave them with video frames along the timeline to form a streaming sequence. This data format is used for large-scale pre-training, followed by supervised fine-tuning (SFT) on higher-quality WhisperX transcription data.

Core Idea: Temporally interleaving ASR words with video frames enables large-scale unsupervised learning for streaming video understanding.

Method

Key Designs

Streaming Training Sequence Format:
- Function: Enables the LLM to learn to continuously generate descriptions as the video stream flows in
- Mechanism: The sequence format is [Con]<F_{t:t+k}><W_{t:t+k}><F_{t+k:t+2k}><W_{t+k:t+2k}>..., where video frames are sampled at 2 FPS and ASR words are interleaved according to their timestamps. The LLM predicts the next segment of ASR words in each time window.
- Design Motivation: Compared to the caption format (where the full description is placed at the end), the streaming sequence format increases the commentary win rate from 14.0% to 32.9%—dense interleaving allows the model to learn precise temporal correspondences.
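
The interleaved format described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Word` class, the `build_streaming_sequence` function, the 1-second window (the notes leave k unspecified), and the tuple-based output are all hypothetical. Only the 2 FPS sampling and the timestamp-based interleaving come from the description.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # word onset in seconds, taken from the ASR timestamps


def build_streaming_sequence(words, duration, context="", window=1.0, fps=2.0):
    """Interleave frame timestamps and ASR words window by window.

    Returns [("context", ...), ("frames", [...]), ("words", [...]), ...].
    The window length is an assumed value, not one stated in the notes.
    """
    frames_per_window = int(round(window * fps))
    sequence = [("context", context)]
    for i in range(math.ceil(duration / window)):
        t0, t1 = i * window, (i + 1) * window
        # Frames sampled at `fps` inside [t0, t1), clipped to the video length.
        frame_times = [t0 + j / fps for j in range(frames_per_window)]
        frame_times = [t for t in frame_times if t < duration]
        # Words whose onset falls inside the same window follow the frames.
        window_words = [w.text for w in words if t0 <= w.start < t1]
        sequence.append(("frames", frame_times))
        sequence.append(("words", window_words))
    return sequence


demo_words = [Word("he", 0.2), Word("shoots", 0.7), Word("and", 1.1), Word("scores", 2.4)]
demo_sequence = build_streaming_sequence(demo_words, duration=3.0, context="Basketball game")
for kind, payload in demo_sequence:
    print(kind, payload)
```

In a caption-style sequence, by contrast, all words would be appended after the last frame, so the model would never see which words belong to which frames.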

Two-stage Data Construction (Live-CC-5M + Live-WhisperX-526K):
- Function: Large-scale, low-quality pre-training paired with small-scale, high-quality SFT
- Mechanism: Live-CC-5M collects CC subtitles from 5 million YouTube video clips (excluding talking-head videos) to pre-train streaming capabilities. Live-WhisperX-526K uses WhisperX to re-transcribe 526,000 high-quality clips and incorporates Q&A prompts generated by GPT-4o for SFT.
- Design Motivation: CC subtitles are noisy but large in scale (5M), while WhisperX transcriptions are high in quality but expensive (526K)—the two-stage training balances scale and quality.

Loss & Training

A standard autoregressive language modeling loss is applied, computed only on ASR word tokens (video frame tokens are excluded). SFT mixes Live-WhisperX-526K with LLaVA-Video-178K. Inference latency is under 0.5 s per frame at 2 FPS, which fits within the 0.5 s interval between sampled frames.
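
Restricting the loss to ASR word tokens can be illustrated with a small, framework-free sketch. The helper names `build_labels` and `masked_nll`, the toy token ids, and the dictionary representation of predicted log-probabilities are hypothetical. The value -100 is borrowed from the default ignore index of PyTorch's cross-entropy loss, and whether the authors use that exact convention is an assumption.

```python
import math

IGNORE_INDEX = -100  # assumed masking convention (PyTorch cross-entropy default)


def build_labels(token_ids, is_word_token):
    """Keep ASR word tokens as targets and mask everything else (e.g. frame tokens)."""
    return [tok if is_word else IGNORE_INDEX
            for tok, is_word in zip(token_ids, is_word_token)]


def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    log_probs[i] maps token id -> predicted log-probability for position i
    (assumed to be already aligned with the target at position i).
    """
    total, count = 0.0, 0
    for dist, label in zip(log_probs, labels):
        if label == IGNORE_INDEX:
            continue  # frame tokens contribute nothing to the loss
        total -= dist[label]
        count += 1
    return total / count if count else 0.0


# Two frame tokens (id 9) followed by two ASR word tokens (ids 5 and 6).
token_ids = [9, 9, 5, 6]
is_word_token = [False, False, True, True]
labels = build_labels(token_ids, is_word_token)

log_probs = [{}, {}, {5: math.log(0.5)}, {6: math.log(0.25)}]
loss = masked_nll(log_probs, labels)
print(labels, loss)
```

Only the two word positions enter the average, so the frame tokens act purely as conditioning context.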

Key Experimental Results

Main Results

Task	LiveCC-7B	Qwen2-VL-7B	GPT-4o
VideoMME (Short Video)	70.1%	69.4%	-
LiveSports-3K Win Rate	41.5%	33.7%	Reference
OVOBench	Outperforms 72B models	-	-
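
As a reading aid for the win-rate row, the sketch below shows one way such a number could be computed. It assumes that win rate means the fraction of pairwise comparisons in which a judge prefers the model's commentary over the GPT-4o reference, and that ties count as non-wins. Both points are assumptions, and the function name and verdict labels are hypothetical rather than taken from the benchmark.

```python
def win_rate(verdicts):
    """Fraction of comparisons won by the model; ties count as non-wins (assumed)."""
    if not verdicts:
        raise ValueError("no comparisons to score")
    wins = sum(1 for v in verdicts if v == "model")
    return wins / len(verdicts)


# Toy verdicts from a hypothetical judge: "model", "reference", or "tie".
demo_verdicts = ["model", "reference", "model", "tie"]
demo_rate = win_rate(demo_verdicts)
print(f"{demo_rate:.1%}")
```

Under this reading, a win rate below 50% still means the reference is preferred more often than the model.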

Ablation Study

Configuration	Commentary Win Rate	Description
Caption sequence format	14.0%	Non-streaming
Streaming sequence format	32.9%	+18.9 points over caption format
Without ASR context	14.7%	Historical context is critical
With ASR context	32.0%	—
1M data	29.1%	—
5M data	32.9%	Data scaling is effective

Key Findings

Streaming sequence format substantially outperforms caption format: Win rate of 32.9% vs. 14.0%, indicating that temporally dense interleaving is crucial for streaming capability.
7B model outperforms 72B models: LiveCC-7B outperforms Qwen2.5-VL-72B and LLaVA-Video-72B in real-time commentating, suggesting that for this task the training data paradigm matters more than model scale.
ASR historical context is crucial: What was previously spoken directly influences what should be said next.

Highlights & Insights

Paradigm Innovation—Shifts video understanding from "watching then answering" to "commentating while watching." Changes in the data format bring a qualitative shift in capabilities.
YouTube CC as Free Training Data—Leverages existing massive caption data without requiring additional manual annotations.
Small Model Beats Large Model: For real-time commentating, a well-matched training paradigm proved more effective than scaling parameters.

Limitations & Future Work

Low quality of YouTube CC (demands substantial preprocessing).
Streaming mode reduces instruction-following capability, requiring logit-based evaluation.
SFT relies on GPT-4o to generate prompts, leading to high cost and bias.
Only processes forward visual inputs.

Rating

Novelty: ⭐⭐⭐⭐⭐ Training paradigm innovation for streaming video understanding
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks and ablation studies, with a novel sports commentary evaluation
Writing Quality: ⭐⭐⭐⭐ Clear and complete
Value: ⭐⭐⭐⭐⭐ Opens up a large-scale training pathway for real-time video understanding