LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Conference: CVPR 2025
arXiv: 2504.16030
Code: https://showlab.github.io/livecc
Area: Audio & Speech / Video Understanding
Keywords: Streaming Video LLM, ASR Transcription, Real-time Commentating, Dense Interleaving, YouTube CC
TL;DR
This paper proposes LiveCC, which trains video LLMs by densely interleaving ASR transcript words with video frames along the timeline. It constructs the Live-CC-5M pre-training dataset, enabling a 7B model to outperform 72B models (including Qwen2.5-VL-72B) on real-time video commentating tasks.
Background & Motivation
Background: Video LLMs are typically trained in an offline mode to "answer questions after watching the entire video." However, real-world scenarios (such as sports commentating, live video streaming) require the model to continuously generate descriptions as the video flows in—referred to as streaming understanding.
Limitations of Prior Work: Existing video LLMs lack streaming capabilities because the training data consists of "video-question-answer" triplets, which lack temporally dense continuous descriptions. Although YouTube has a massive number of captioned videos, the subtitle quality varies significantly.
Key Challenge: High-quality streaming training data is scarce—human annotation is extremely costly, whereas auto-generated closed captions (CC) are highly noisy.
Key Insight: Treat YouTube CC subtitles as ASR transcriptions and interleave them with video frames along the timeline to form a streaming sequence. This data format is used for large-scale pre-training, followed by supervised fine-tuning (SFT) on high-quality WhisperX transcription data.
Core Idea: Temporally interleaving ASR words with video frames enables large-scale unsupervised learning for streaming video understanding.
Method
Key Designs
- Streaming Training Sequence Format:
  - Function: Enables the LLM to learn to generate descriptions continuously as the video stream flows in.
  - Mechanism: The sequence format is [Con]<F_{t:t+k}><W_{t:t+k}><F_{t+k:t+2k}><W_{t+k:t+2k}>..., where video frames are sampled at 2 FPS and ASR words are interleaved according to their timestamps. The LLM predicts the next segment of ASR words in each time window.
  - Design Motivation: Compared to the caption format (where the full description is placed at the end), the streaming sequence format increases the commentary win rate from 14.0% to 32.9%. Dense interleaving allows the model to learn precise temporal correspondences.
- Two-stage Data Construction (Live-CC-5M + Live-WhisperX-526K):
  - Function: Large-scale, low-quality pre-training paired with small-scale, high-quality SFT.
  - Mechanism: Live-CC-5M collects CC subtitles from 5 million YouTube video clips (excluding talking-head videos) to pre-train streaming capabilities. Live-WhisperX-526K uses WhisperX to re-transcribe 526,000 high-quality clips and incorporates Q&A prompts generated by GPT-4o for SFT.
  - Design Motivation: CC subtitles are noisy but large in scale (5M), while WhisperX transcriptions are high in quality but expensive (526K). The two-stage training balances scale and quality.
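The streaming sequence format described above can be illustrated with a minimal sketch. It is not the authors' data pipeline: the `Word` class, the `build_streaming_sequence` function, the 1-second window (the value of k is not stated here), and the representation of `[Con]` as a plain context string are all assumptions made for illustration. Only the 2 FPS sampling rate and the timestamp-based interleaving come from the description.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Word:
    text: str
    start: float  # word start time in seconds, taken from the CC/ASR timestamps


Segment = Tuple[str, Union[str, List[float], List[str]]]


def build_streaming_sequence(
    words: List[Word],
    duration: float,
    context: str = "",
    fps: float = 2.0,      # frame sampling rate stated in the summary
    window: float = 1.0,   # assumed window length k in seconds (hypothetical)
) -> List[Segment]:
    """Interleave frame timestamps and ASR words window by window.

    Returns [("context", ...), ("frames", [...]), ("words", [...]), ...],
    mirroring [Con]<F_{t:t+k}><W_{t:t+k}><F_{t+k:t+2k}><W_{t+k:t+2k}>...
    """
    if fps <= 0 or window <= 0:
        raise ValueError("fps and window must be positive")

    seq: List[Segment] = [("context", context)]
    frame_times = [i / fps for i in range(int(duration * fps))]

    for w in range(math.ceil(duration / window)):
        lo, hi = w * window, (w + 1) * window
        # Frames sampled inside this window come first ...
        seq.append(("frames", [t for t in frame_times if lo <= t < hi]))
        # ... followed by the words whose timestamps fall in the same window.
        seq.append(("words", [x.text for x in words if lo <= x.start < hi]))
    return seq


example = build_streaming_sequence(
    [Word("A", 0.2), Word("goal", 1.1), Word("scored", 1.7)],
    duration=3.0,
    context="match",
)
for kind, payload in example:
    print(kind, payload)
```

In this sketch a window with no speech still contributes an empty word segment, so the frame and word segments stay aligned one-to-one along the timeline. How silent windows are actually represented in training is not specified here.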
Loss & Training
A standard autoregressive language-modeling loss is applied, computed only on the ASR word tokens (video frame tokens are excluded). SFT mixes Live-WhisperX-526K with LLaVA-Video-178K. Inference latency is under 0.5 s per frame at 2 FPS, which fits within the 0.5 s interval between sampled frames.
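The restriction of the loss to ASR word tokens can be sketched in plain Python. This is a simplified illustration rather than the training code: the function name `masked_next_token_loss`, its arguments, and the toy four-token vocabulary are hypothetical, and a real implementation would operate on batched tensors.

```python
import math
from typing import Sequence


def masked_next_token_loss(
    log_probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    is_word_token: Sequence[bool],
) -> float:
    """Mean negative log-likelihood over positions whose target is an ASR word token.

    log_probs[i][v] is the model's log-probability that the token at position i
    is vocabulary id v (already shifted, i.e. predicted from the prefix).
    Positions whose target is a video frame token are skipped entirely.
    """
    total, count = 0.0, 0
    for lp, tgt, keep in zip(log_probs, targets, is_word_token):
        if keep:
            total -= lp[tgt]
            count += 1
    return total / count if count else 0.0


# Toy example: position 0 is a frame token (ignored), positions 1-2 are word tokens.
uniform = [math.log(0.25)] * 4
toy_log_probs = [
    [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)],
    uniform,
    uniform,
]
loss = masked_next_token_loss(toy_log_probs, [1, 1, 2], [False, True, True])
print(loss)
```

Because the frame position is masked out, its poorly predicted target does not affect the result. In the toy example only the two uniform word positions contribute, so the loss equals log 4.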
Key Experimental Results
Main Results
| Task | LiveCC-7B | Qwen2-VL-7B | GPT-4o |
|---|---|---|---|
| VideoMME (Short Video) | 70.1% | 69.4% | - |
| LiveSports-3K Win Rate | 41.5% | 33.7% | Reference |
| OVOBench | Outperforms 72B models | - | - |
Ablation Study
| Configuration | Commentary Win Rate | Description |
|---|---|---|
| Caption sequence format | 14.0% | Non-streaming |
| Streaming sequence format | 32.9% | +18.9 percentage points over caption format |
| Without ASR context | 14.7% | Historical context is critical |
| With ASR context | 32.0% | — |
| 1M data | 29.1% | — |
| 5M data | 32.9% | Data scaling is effective |
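To make the size of each ablation effect explicit, the gains can be computed directly from the win rates in the table. The snippet below is only arithmetic on the reported numbers; the dictionary keys and the `gain_points` helper are illustrative names.

```python
# (baseline win rate %, improved win rate %) taken from the ablation table
ablations = {
    "caption -> streaming format": (14.0, 32.9),
    "without -> with ASR context": (14.7, 32.0),
    "1M -> 5M data": (29.1, 32.9),
}


def gain_points(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


gains = {name: gain_points(b, a) for name, (b, a) in ablations.items()}
for name, g in gains.items():
    print(f"{name}: +{g} points")
```

On these numbers, the sequence format and the ASR context each account for a gain of well over 15 points, while scaling the data from 1M to 5M adds under 4 points.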
Key Findings
- Streaming sequence format substantially outperforms caption format: Win rate of 32.9% vs. 14.0%, indicating that temporally dense interleaving is crucial for streaming capability.
- 7B model outperforms 72B models: LiveCC-7B outperforms Qwen2.5-VL-72B and LLaVA-Video-72B in real-time commentating, suggesting that for this task the training data paradigm matters more than model scale.
- ASR historical context is crucial: What was previously spoken directly influences what should be said next.
Highlights & Insights
- Paradigm Innovation: Shifts video understanding from "watching then answering" to "commentating while watching." Changing the data format brings a qualitative shift in capabilities.
- YouTube CC as Free Training Data: Leverages existing massive caption data without requiring additional manual annotations.
- Small Model Beats Large Model: For real-time commentating, a well-matched training paradigm proved more effective than scaling parameters.
Limitations & Future Work
- Low quality of YouTube CC (demands substantial preprocessing).
- Streaming mode reduces instruction-following capability, requiring logit-based evaluation.
- SFT relies on GPT-4o to generate prompts, leading to high cost and bias.
- Only processes forward visual inputs.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Training paradigm innovation for streaming video understanding
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks and ablation studies, with a novel sports commentary evaluation
- Writing Quality: ⭐⭐⭐⭐ Clear and complete
- Value: ⭐⭐⭐⭐⭐ Opens up a large-scale training pathway for real-time video understanding