ProAct-VL: A Proactive VideoLLM for Real-Time AI Companions

Conference: ICML 2026
arXiv: 2603.03447
Code: To be confirmed
Area: Video Understanding / Real-time Multimodal Interaction
Keywords: Video Large Language Model, Streaming Inference, Proactive Response, Real-time Interaction, Game Commentary

TL;DR

ProAct-VL enables VideoLLMs to autonomously decide when to respond and generate short segment comments under streaming input through a chunked input-output paradigm, a lightweight FLAG decision head, and a transition-aware loss function. It achieves ~1s low latency and strong proactivity—obtaining a TimeDiff of only 1.20s and a trigger F1 of 63.25% in game commentary tasks, significantly surpassing offline models like GPT-4o.

Background & Motivation

Background: Recent developments in Video Large Language Models (VideoLLMs) support video perception and real-time user interaction. However, most existing works adopt either a passive "chunk-sequential processing" response mode or a low-latency passive streaming mode that lacks response control.

Limitations of Prior Work:

  • Proactive models decide when to speak but generate complete long answers once triggered, resulting in high latency and coarse temporal granularity.
  • Real-time models emphasize fast generation but lack explicit control over "speaking behavior," often leading to over-talking.
  • Existing methods struggle to balance proactive timing with content quality.

Key Challenge: A real AI companion (e.g., a game commentator) requires coordination across three layers: (1) low-latency inference, (2) autonomous decision on when to respond, and (3) control over the quality and quantity of generated content. These three form a triangle that is difficult to optimize simultaneously.

Goal: Construct a unified framework to solve "when to speak," "what to say," and "how fast to speak" simultaneously.

Key Insight: Scenarios such as game commentary and game guidance possess rich, automatically evaluable interaction patterns and are thus selected as specific evaluation scenarios. A large-scale annotated dataset is constructed to drive model training.

Core Idea: A chunk-level I/O paradigm + FLAG token decision head + transition-aware loss function are used to unify streaming video understanding and proactive response modeling.

Method

Overall Architecture

At each time step \(t\) (1-second chunk):

  1. Input: Triplet \((V_t, Q_t, B_t)\): visual content of the current time window, optional user query, and environmental context (including summaries of previous comments).
  2. Processing: A persistent KV cache \(\mathcal{K}_{t-1}\) maintains the full context, which is processed by a causal Transformer.
  3. Decision: The speaking probability \(p_t\) is extracted from the hidden state \(h_t\) of a special <|FLAG|> token and compared with a threshold \(\tau\) to obtain a binary decision \(a_t\).
  4. Output: If \(a_t = 1\), a short segment comment \(U_t\) (approx. 1s) is generated; otherwise, a silence token is output. The generated \(U_t\) is automatically appended to the context as input for \(t+1\).
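The per-chunk loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `StubStreamingModel`, its methods, and the scalar "hidden state" are hypothetical stand-ins, and a plain Python list plays the role of the persistent KV cache.

```python
import math
from dataclasses import dataclass, field


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class StubStreamingModel:
    """Hypothetical stand-in for the VideoLLM (not the authors' API)."""
    cache: list = field(default_factory=list)  # plays the role of the KV cache

    def encode_chunk(self, frames, query, context):
        # Append the new chunk to the persistent cache and return a toy scalar
        # in place of the <|FLAG|> hidden state h_t (a vector in a real model).
        self.cache.append((frames, query, tuple(context)))
        return float(frames)

    def flag_logit(self, h_t):
        # Stands in for the lightweight MLP applied to h_t.
        return 4.0 * h_t - 2.0

    def generate_segment(self, t):
        # Stands in for decoding a short (~1 s) comment U_t.
        return f"comment@{t}s"


def run_stream(model, chunks, tau=0.5, queries=None):
    """Process 1-second chunks causally; return one output (or None) per chunk."""
    queries = queries or {}
    outputs = []
    for t, frames in enumerate(chunks):
        # B_t is reduced here to the list of previous comments (assumption).
        context = [u for u in outputs if u is not None]
        h_t = model.encode_chunk(frames, queries.get(t), context)
        p_t = sigmoid(model.flag_logit(h_t))   # p_t = sigma(MLP(h_t))
        a_t = p_t >= tau                       # a_t = I[p_t >= tau]
        u_t = model.generate_segment(t) if a_t else None  # None = silence
        if u_t is not None:
            model.cache.append(u_t)            # U_t becomes input for step t+1
        outputs.append(u_t)
    return outputs


if __name__ == "__main__":
    model = StubStreamingModel()
    print(run_stream(model, [0.0, 1.0, 0.2, 0.9]))
```

In this toy setup each "chunk" is just a saliency score, so the model speaks only on the salient chunks and stays silent otherwise; the cache grows by one entry per chunk plus one per emitted comment, and nothing is re-encoded.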

Key Designs

  1. Chunk-level I/O Paradigm:

    • Function: Discretizes continuous video streams into fixed-duration chunks (1s in this paper) to support online causal processing.
    • Mechanism: At each step \(t\), the model generates \((U_t, \mathcal{K}_t)\) from \((V_t, Q_t, B_t)\) and the persistent KV cache \(\mathcal{K}_{t-1}\). The generated \(U_t\) is immediately appended as part of the input for \(t+1\), forming a continuous dialogue history.
    • Design Motivation: Chunks combined with persistent caching avoid the computational waste of re-processing the entire history at each step while maintaining full temporal context; long answers spanning multiple chunks can naturally connect across subsequent time steps.
  2. Lightweight Proactive Response Mechanism:

    • Function: Inserts a special FLAG token at the end of each user message. The speaking probability is calculated from its hidden state via a lightweight MLP + sigmoid, and a binary decision is made using threshold \(\tau\).
    • Mechanism: \(p_t = \sigma(\text{MLP}(h_t))\), \(a_t = \mathbb{I}[p_t \geq \tau]\). The decision head is extremely lightweight and does not add an inference bottleneck.
    • Design Motivation: Compared to designs that couple proactivity and generation, the decoupled decision mechanism allows the model to independently learn "when to speak" strategies, improving training and inference efficiency.
  3. Multi-level Stability Loss (Transition-aware + Stability Regularization):

    • Function: Composes \(\mathcal{L}_{\text{resp}}\) through "transition-aware classification loss" and "stability regularization," combined with the main language modeling loss \(\mathcal{L}_{\text{main}}\) via weighting.
    • Mechanism: Transition-aware classification loss \(\mathcal{L}_{\text{cls}}\) uses weights \(w_t = \gamma\) (when \(y_t \neq y_{t-1}\)) and \(w_t = 1\) (when state persists), emphasizing rare but critical "speaking-silence" state transitions. Stability regularization \(\mathcal{L}_{\text{reg}}\) contains two terms: local temporal consistency \(\mathbb{E}[(p_t - p_{t-1})^2 \mid y_t = y_{t-1}]\) (smoothing probability changes during state persistence) and a global speaking rate constraint \((\mathbb{E}[p_t] - \mathbb{E}[y_t])^2\) (aligning the model's average speaking duration with human commentators). Finally, \(\mathcal{L} = \mathcal{L}_{\text{main}} + \alpha \mathcal{L}_{\text{resp}}\).
    • Design Motivation: Treating the response state as a sequence learning problem rather than independent binary classification significantly improves learning of "when to transition states"; the global speaking rate constraint prevents over- or under-talking.
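The response loss in the third design can be sketched in plain Python. This is a hedged illustration under stated assumptions: the source gives the transition weights and the two regularization terms, but not how the terms are normalized or weighted, so the plain sum \(\mathcal{L}_{\text{resp}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}}\), the per-step averaging, the weight of 1 for the first step, and the `lambda_local` / `lambda_global` parameters are all assumptions of this sketch.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one step, with clamping for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def response_loss(probs, labels, gamma=5.0, lambda_local=1.0, lambda_global=1.0):
    """Transition-aware classification loss plus stability regularization.

    probs  : speaking probabilities p_t, one per chunk
    labels : ground-truth speak/silence labels y_t in {0, 1}
    gamma  : weight applied where y_t != y_{t-1}
    """
    assert probs and len(probs) == len(labels)
    n = len(labels)

    # w_t = gamma on a state transition, 1 when the state persists.
    # The first step has no predecessor; weight 1 is an assumption.
    weights = [1.0] + [
        gamma if labels[t] != labels[t - 1] else 1.0 for t in range(1, n)
    ]
    l_cls = sum(w * bce(p, y) for w, p, y in zip(weights, probs, labels)) / n

    # Local temporal consistency: E[(p_t - p_{t-1})^2 | y_t = y_{t-1}].
    diffs = [
        (probs[t] - probs[t - 1]) ** 2
        for t in range(1, n)
        if labels[t] == labels[t - 1]
    ]
    l_local = sum(diffs) / len(diffs) if diffs else 0.0

    # Global speaking-rate constraint: (E[p_t] - E[y_t])^2.
    l_global = (sum(probs) / n - sum(labels) / n) ** 2

    l_reg = lambda_local * l_local + lambda_global * l_global
    return {"cls": l_cls, "local": l_local, "global": l_global,
            "resp": l_cls + l_reg}


if __name__ == "__main__":
    print(response_loss([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1]))
```

The total objective would then be \(\mathcal{L}_{\text{main}} + \alpha \cdot\) `resp`, with the language modeling loss computed separately. Note that the local term skips the transition step itself, so a sharp jump in \(p_t\) exactly where the label flips is not penalized.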

Key Experimental Results

Main Results (Live Gaming Benchmark)

| Model Category | Model | CC ↑ | LiveU ↑ | FinalQ ↑ | TimeDiff ↓ | F1 ↑ |
|---|---|---|---|---|---|---|
| Offline | GPT-4o | 39.42 | 4.62 | 4.80 | 3.07 | 54.88 |
| Offline | Gemini 2.5 Pro | - | 4.70 | 4.82 | 2.59 | 49.23 |
| Proactive | VideoLLM-online | 13.78 | 3.56 | 1.74 | 12.59 | 6.54 |
| Proactive | MMDuet | 20.08 | 2.67 | 2.68 | 26.72 | 0.18 |
| Proactive | Livestar | 8.59 | 3.14 | 2.41 | 27.33 | 0.20 |
| Real-time | LiveCC-7B-Base | 38.88 | 3.85 | 3.83 | 11.35 | 36.10 |
| Real-time | StreamingVLM | 14.89 | 3.49 | 2.65 | 2.21 | 50.67 |
| Ours | ProAct-VL | 49.23 | 6.52 | 5.03 | 1.20 | 63.25 |

CC = Win rate against Gemini 2.5 Pro (so no CC is listed for Gemini 2.5 Pro itself); LiveU = Quality of streaming segment comments; FinalQ = Overall script quality; TimeDiff = Response time deviation (seconds); F1 = Trigger F1 score (%). ProAct-VL is best on every metric, most notably response timing (TimeDiff 1.20s) and trigger F1 (63.25%), well ahead of all baselines.
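The source does not spell out how the trigger metrics are computed. As a rough illustration only, the sketch below assumes that both the model decisions and the reference commentary are reduced to per-chunk binary speak/silence labels, and computes precision, recall, and F1 over them; the function name `trigger_prf` is hypothetical, and the paper's actual matching rule may differ (for example, it may allow a time tolerance or average per video).

```python
def trigger_prf(pred, gold):
    """Precision, recall, and F1 over per-chunk speak (1) / silence (0) labels."""
    assert len(pred) == len(gold)
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # One spurious trigger at chunk 2; both reference triggers are hit.
    print(trigger_prf([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
```

Under this definition, a model that talks all the time gets high recall but low precision, while a model that almost never speaks gets the reverse, which is why F1 is a reasonable single number for "speaks at the right moments".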

Ablation Study

| Configuration | CC | TimeDiff | P | R | F1 | Description |
|---|---|---|---|---|---|---|
| Only \(\mathcal{L}_{\text{cls}}\) | 45.54 | 18.50 | 12.13 | 14.00 | 11.03 | Classification loss alone |
| Only \(\mathcal{L}_{\text{reg}}\) | 47.53 | 8.28 | 45.20 | 67.02 | 47.39 | Stability regularization alone |
| Full | 50.91 | 3.41 | 65.72 | 62.41 | 60.08 | Combination of both loss terms |

Key Findings

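  • Removing \(\mathcal{L}_{\text{reg}}\) (the "Only \(\mathcal{L}_{\text{cls}}\)" configuration) has the greatest impact: relative to the full model, F1 drops by 49.05 points (60.08 to 11.03) and TimeDiff increases by 15.09s (3.41 to 18.50), showing that stability regularization is crucial.
  • Removing \(\mathcal{L}_{\text{cls}}\) (the "Only \(\mathcal{L}_{\text{reg}}\)" configuration) also degrades performance, though less severely (F1 60.08 to 47.39, TimeDiff 3.41 to 8.28); the two loss terms are complementary.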
  • Long-sequence stability: Streaming Commentary increased from 73.75% to 82.03% (10-50 minute videos); although response quality slightly decays, it tends to stabilize (F1 drops from 74.42% to 69.23%); long-term stability is significantly better than StreamingVLM.

Highlights & Insights

  • Unification of Proactivity and Streaming Real-time Performance: Traditional trade-offs were "passive but fast" or "proactive but slow"; this paper achieves strong proactivity under ~1s latency by decoupling decision and generation. This design can be transferred to interactive tasks requiring real-time decision-making (customer service systems, real-time subtitling).
  • Transition-aware Weighting Mechanism: Treating state transitions as rare events and assigning high weights (\(\gamma = 5\)) provides the core insight that "transitions are often more important than persistence in sequential decision-making," which is inspiring for any time-series classification task.
  • High-quality Annotation Pipeline for Live Gaming Dataset: A three-stage process of WhisperX ASR + Qwen3 sentiment annotation + DeepSeek domain error correction ensures high-precision transcription; the pipeline (especially LLM correction + cleaning) can be reused for other multimodal datasets.

Limitations & Future Work

  • The dataset is limited to the gaming domain (though covering 12 popular games, the focus remains entertainment); generalization to fields like sports commentary or news broadcasting is limited.
  • Metrics such as CC / LiveU / FinalQ are calculated by closed-source LLMs (GPT-5.1), limiting reproducibility; manual verification across languages/modalities still needs supplementation.
  • The response decision mechanism is relatively simple—using only FLAG token hidden states + MLP might ignore fine-grained visual signals (motion magnitude, scene changes).
  • Future directions: Expanding to more real-time interaction fields; introducing multimodal features (audio emotion, gestures) to enhance decisions; exploring threshold-free decision strategies (directly regressing latency instead of binary classification).

Comparison with Related Work

  • vs Proactive Models (VideoLLM-online / MMDuet): These models generate full answers when "speaking," resulting in high latency (> 10s) and low trigger accuracy (F1 < 10%); this paper enforces short segment generation (1s) + decoupled decisions to ensure proactivity while avoiding long-tail latency.
  • vs Low-latency Models (LiveCC / StreamingVLM): They optimize inference speed but lack control over "when to speak," often leading to over-generation. ProAct-VL adds "silence" capability via an explicit response head, allowing it to interact with restraint like a human.
  • vs Offline Models (GPT-4o / Gemini): They have strong understanding but cannot operate in real time; ProAct-VL matches or exceeds them on the reported metrics (e.g., CC 49.23 vs 39.42 for GPT-4o) while supporting true real-time deployment.

Rating

  • Novelty: ⭐⭐⭐⭐ Unified framework for proactivity and real-time performance; although individual components like transition-aware loss + FLAG mechanism are not overly complex, the engineering combination is clever.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 3 interaction scenarios + 2 test sets (in-domain + out-of-domain) + long-sequence stability + ablation + inference efficiency + manual verification.
  • Writing Quality: ⭐⭐⭐⭐ Clear logic and intuitive charts; some details (ChatML format, RoPE correction) are relegated to the appendix.
  • Value: ⭐⭐⭐⭐⭐ Addresses real problems for AI companion applications; provides a deployable system + 561-hour annotated dataset; directly drives advancements in live streaming, gaming, and virtual assistants.