LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Conference: CVPR 2025
arXiv: 2504.16030
Code: https://showlab.github.io/livecc
Area: Audio & Speech / Video Understanding
Keywords: Streaming Video LLM, ASR Transcription, Real-time Commentating, Dense Interleaving, YouTube CC

TL;DR

This paper proposes LiveCC, which trains video LLMs by densely interleaving ASR transcript words with video frames along the timeline. It constructs the Live-CC-5M pre-training dataset, enabling a 7B model to outperform 72B models (including Qwen2.5-VL-72B) on real-time video commentating tasks.

Background & Motivation

Background: Video LLMs are typically trained in an offline mode to "answer questions after watching the entire video." However, real-world scenarios (such as sports commentating, live video streaming) require the model to continuously generate descriptions as the video flows in—referred to as streaming understanding.

Limitations of Prior Work: Existing video LLMs lack streaming capabilities because the training data consists of "video-question-answer" triplets, which lack temporally dense continuous descriptions. Although YouTube has a massive number of captioned videos, the subtitle quality varies significantly.

Key Challenge: High-quality streaming training data is scarce—human annotation is extremely costly, whereas auto-generated closed captions (CC) are highly noisy.

Key Insight: Treat YouTube CC subtitles as ASR transcriptions and interleave them with video frames along the timeline to form a streaming sequence. This data format is used for large-scale pre-training, followed by supervised fine-tuning (SFT) on high-quality WhisperX transcription data.

Core Idea: Temporally interleaving ASR words with video frames enables large-scale unsupervised learning for streaming video understanding.

Method

Key Designs

  1. Streaming Training Sequence Format:

    • Function: Enables the LLM to learn to continuously generate descriptions as the video stream flows in
    • Mechanism: The sequence format is [Con]<F_{t:t+k}><W_{t:t+k}><F_{t+k:t+2k}><W_{t+k:t+2k}>..., where video frames are sampled at 2 FPS and ASR words are interleaved according to their timestamps. The LLM predicts the next segment of ASR words in each time window.
    • Design Motivation: Compared to the caption format (where the full description is placed at the end), the streaming sequence format increases the commentary win rate from 14.0% to 32.9%—dense interleaving allows the model to learn precise temporal correspondences.
  2. Two-stage Data Construction (Live-CC-5M + Live-WhisperX-526K):

    • Function: Large-scale, low-quality pre-training paired with small-scale, high-quality SFT
    • Mechanism: Live-CC-5M collects CC subtitles from 5 million YouTube video clips (excluding talking-head videos) to pre-train streaming capabilities. Live-WhisperX-526K uses WhisperX to re-transcribe 526,000 high-quality clips and incorporates Q&A prompts generated by GPT-4o for SFT.
    • Design Motivation: CC subtitles are noisy but large in scale (5M), while WhisperX transcriptions are high in quality but expensive (526K)—the two-stage training balances scale and quality.
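
The streaming sequence format described in the first design can be illustrated with a small sketch. This is not the authors' data pipeline: the `Word` type, the function name `build_streaming_sequence`, the 1-second window length (the value of k is not given here), the reading of `[Con]` as a context string, and the choice to keep empty word windows are all assumptions made for illustration. Only the 2 FPS sampling rate and the frames-then-words ordering come from the description above.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # word start time in seconds (assumed word-level timestamp)


def build_streaming_sequence(
    words: List[Word],
    duration_s: float,
    fps: float = 2.0,       # sampling rate stated in the summary
    window_s: float = 1.0,  # assumed window length k; not specified in the summary
    context: str = "",      # assumed meaning of [Con]
) -> List[Tuple[str, object]]:
    """Return [Con], <F>, <W>, <F>, <W>, ... as a list of tagged segments."""
    frames_per_window = max(1, round(fps * window_s))
    num_windows = math.ceil(duration_s / window_s)
    sequence: List[Tuple[str, object]] = [("context", context)]
    for i in range(num_windows):
        start = i * window_s
        end = start + window_s
        # Timestamps of the frames sampled inside this window.
        frame_times = [start + j / fps for j in range(frames_per_window)]
        frame_times = [t for t in frame_times if t < duration_s]
        sequence.append(("frames", frame_times))
        # Words whose start time falls inside the same window follow the frames.
        window_words = [w.text for w in words if start <= w.start < end]
        sequence.append(("words", window_words))
    return sequence


if __name__ == "__main__":
    demo = [Word("he", 0.2), Word("shoots", 0.9), Word("and", 1.1), Word("scores", 2.4)]
    for segment in build_streaming_sequence(demo, duration_s=3.0, context="basketball"):
        print(segment)
```

In this sketch a window with no speech still emits an empty word segment, so the frame and word segments always alternate. Whether the actual training data keeps or marks silent windows is not stated in this summary.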

Loss & Training

A standard autoregressive language modeling loss is applied, computed only on ASR word tokens (video frame tokens are excluded). SFT mixes Live-WhisperX-526K with LLaVA-Video-178K. Inference latency is under 0.5 s per frame at 2 FPS.
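
As a rough illustration of restricting the loss to ASR word tokens, the sketch below averages the negative log-likelihood over masked positions only. It is written in plain Python rather than a training framework, and the function names `masked_lm_loss` and `keeps_up_with_stream` are hypothetical. The second helper only restates the arithmetic behind the latency figure: at 2 FPS a new frame arrives every 0.5 s, so a per-frame latency below that budget keeps pace with the stream.

```python
import math
from typing import List


def masked_lm_loss(token_log_probs: List[float], is_asr_token: List[bool]) -> float:
    """Mean negative log-likelihood over ASR word tokens only.

    token_log_probs[i] is the model's log-probability of the i-th target token;
    is_asr_token[i] is False for video frame tokens, which contribute no loss.
    """
    if len(token_log_probs) != len(is_asr_token):
        raise ValueError("log-probs and mask must have the same length")
    selected = [-lp for lp, keep in zip(token_log_probs, is_asr_token) if keep]
    if not selected:
        return 0.0  # no ASR tokens in this sequence, so nothing to learn from
    return sum(selected) / len(selected)


def keeps_up_with_stream(latency_s_per_frame: float, fps: float = 2.0) -> bool:
    """True when per-frame latency fits inside the frame interval 1 / fps."""
    return latency_s_per_frame < 1.0 / fps


if __name__ == "__main__":
    log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
    mask = [False, True, True]  # first position is a video frame token
    print(masked_lm_loss(log_probs, mask))
    print(keeps_up_with_stream(0.4))
```

In a typical training framework the same effect is usually obtained by assigning an ignored label value to the frame positions, but the exact mechanism used in LiveCC is not described in this summary.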

Key Experimental Results

Main Results

| Task | LiveCC-7B | Qwen2-VL-7B | GPT-4o |
|---|---|---|---|
| VideoMME (Short Video) | 70.1% | 69.4% | - |
| LiveSports-3K Win Rate | 41.5% | 33.7% | Reference |
| OVOBench | Outperforms 72B models | - | - |

Ablation Study

| Configuration | Commentary Win Rate | Note |
|---|---|---|
| Caption sequence format | 14.0% | Non-streaming |
| Streaming sequence format | 32.9% | +18.9 percentage points |
| Without ASR context | 14.7% | Historical context is critical |
| With ASR context | 32.0% | |
| 1M data | 29.1% | |
| 5M data | 32.9% | Data scaling is effective |
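
The gains in the ablation are differences in percentage points, not relative percentages. A short worked calculation over the reported numbers makes this explicit. The helper names are hypothetical, and the values are copied from the ablation rows above.

```python
# Commentary win rates (in percent) copied from the ablation rows.
win_rates = {
    "caption_format": 14.0,
    "streaming_format": 32.9,
    "without_asr_context": 14.7,
    "with_asr_context": 32.0,
    "data_1m": 29.1,
    "data_5m": 32.9,
}


def point_gain(baseline: float, improved: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(improved - baseline, 1)


def relative_factor(baseline: float, improved: float) -> float:
    """Ratio of improved to baseline win rate, rounded to two decimals."""
    return round(improved / baseline, 2)


if __name__ == "__main__":
    # Sequence format: 18.9 points, about 2.35x the caption-format win rate.
    print(point_gain(win_rates["caption_format"], win_rates["streaming_format"]))
    print(relative_factor(win_rates["caption_format"], win_rates["streaming_format"]))
    # ASR context: 17.3 points.
    print(point_gain(win_rates["without_asr_context"], win_rates["with_asr_context"]))
    # Data scale from 1M to 5M: 3.8 points.
    print(point_gain(win_rates["data_1m"], win_rates["data_5m"]))
```

Read this way, the sequence format and the ASR context each account for a much larger change than the 1M to 5M data increase.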

Key Findings

  • Streaming sequence format substantially outperforms caption format: Win rate of 32.9% vs. 14.0%, indicating that temporally dense interleaving is crucial for streaming capability.
  • 7B model outperforms 72B models: LiveCC-7B outperforms Qwen2.5-VL-72B and LLaVA-Video-72B in real-time commentating, suggesting that for this task the data paradigm matters more than model scale.
  • ASR historical context is crucial: What was previously spoken directly influences what should be said next.

Highlights & Insights

  • Paradigm Innovation: Shifts video understanding from "watching then answering" to "commentating while watching." The change in data format brings a qualitative shift in capability.
  • YouTube CC as Free Training Data: Leverages existing large-scale caption data without requiring additional manual annotation.
  • Small Model Beats Large Model: For streaming commentary, a well-matched training paradigm proved more effective than scaling parameters.

Limitations & Future Work

  • Low quality of YouTube CC (demands substantial preprocessing).
  • Streaming mode reduces instruction-following capability, requiring logit-based evaluation.
  • SFT relies on GPT-4o to generate prompts, leading to high cost and bias.
  • Only processes forward visual inputs.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Training paradigm innovation for streaming video understanding
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks and ablation studies, with a novel sports commentary evaluation
  • Writing Quality: ⭐⭐⭐⭐ Clear and complete
  • Value: ⭐⭐⭐⭐⭐ Opens up a large-scale training pathway for real-time video understanding