Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

Conference: CVPR2026
arXiv: 2512.21778
Code: None
Area: Multimodal VLM
Keywords: Video scene segmentation, Vision-Language Models, Multimodal reasoning, Sequence prediction, Confidence estimation

TL;DR

Proposes Scene-VLM, the first video scene segmentation framework built on a fine-tuned VLM. It combines structured multimodal shot representations (visual frames + dialogue + metadata), causal sequence prediction, a context-focus window mechanism, and confidence extraction from token logits. On MovieNet it reports 62.1 F1 and 66.8 AP, gains of +13.7 F1 and +6.0 AP over TranS4mer, and it can also generate natural language explanations of its boundary decisions.

Background & Motivation

Video scene segmentation (partitioning long videos into semantically coherent scenes) is a foundational task in video understanding, essential for applications such as automated structured summarization and semantic retrieval. Formally, a scene consists of consecutive shots sharing the same location, time, characters, or narrative theme.

Prior encoder-based methods (BaSSL, TranS4mer, MEGA) share three major limitations: (1) Visual over-reliance: they ignore or underutilize non-visual signals such as dialogue and characters; (2) Pointwise independent prediction: each shot is classified independently, without leveraging causal dependencies between consecutive decisions; (3) Lack of explainability: they output only confidence scores and cannot explain why a boundary was predicted.

Core Idea: Use the multimodal reasoning and text generation capabilities of VLMs to recast scene segmentation as a sequence generation task in which the model outputs "Shot i: Yes/No" for each shot in turn. This formulation provides causal dependency between decisions, multimodal fusion, and explainability within a single model.

Method

Overall Architecture

Scene-VLM transforms video scene segmentation from traditional encoder-based per-shot classification into a VLM sequence generation task. The model is a fine-tuned Qwen2.5-VL-7B that takes as input multimodal representations of \(N\) consecutive shots (visual frames + dialogue + character IDs). It sequentially generates a "Shot i: Yes/No" judgment for each shot within a focus window, and confidence scores are extracted from the logits of the judgment tokens. This reformulation integrates multimodal fusion, inter-shot causal dependencies, and explainability into a single generative framework.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["N Consecutive Shots"] --> REP["Structured Multimodal Shot Representation<br/>K=3 frames + dialogue + characters / Shot-ID markers"]
    REP --> SEQ["Causal Sequence Prediction<br/>VLM outputs Shot i: Yes/No sequentially, later decisions rely on earlier ones"]
    SEQ --> WIN["Context-Focus Window Mechanism<br/>20-shot context, predicting only the middle 10 shots"]
    WIN --> CONF["VLM Confidence Extraction<br/>conf = P(Yes) / (P(Yes) + P(No))"]
    CONF --> OUT["Scene Boundary + Confidence"]
    CONF -->|Optional Targeted Fine-tuning| EXP["Explainability Alignment<br/>Generating natural language boundary reasoning"]

Key Designs

1. Structured Multimodal Shot Representation: Incorporating non-visual signals like dialogue and characters

Prior encoder methods overemphasize visual features and underestimate or ignore narrative signals like dialogue and characters. Scene-VLM equips each shot \(s_i\) with \(K=3\) sampled frames, synchronized subtitles, and character information. It overlays a visual identifier (shot-ID marker) on each frame to help the model associate visual content with the shot numbers mentioned in the text, providing narrative context invisible to visual-only methods.
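
To make the shot representation concrete, here is a minimal sketch of how one shot's frames, dialogue, and characters might be serialized into the text side of a prompt. The `Shot` class, its field names, the `<image>` placeholder, and the exact text layout are all assumptions for illustration; the source does not specify the real prompt template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    """One shot as the model might see it (field names are hypothetical)."""
    shot_id: int
    frame_paths: List[str]                      # K sampled frames (K=3 in the paper)
    subtitles: str = ""                         # dialogue aligned to this shot
    characters: List[str] = field(default_factory=list)


def render_shot_text(shot: Shot) -> str:
    """Serialize one shot into a text block.

    Each frame becomes an "<image>" placeholder; a real pipeline would feed
    the actual images (with the shot-ID marker already drawn on them) at
    these positions.
    """
    frames = " ".join("<image>" for _ in shot.frame_paths)
    chars = ", ".join(shot.characters) if shot.characters else "unknown"
    dialogue = shot.subtitles.strip() or "(no dialogue)"
    return (
        f"Shot {shot.shot_id}: {frames}\n"
        f"Characters: {chars}\n"
        f"Dialogue: {dialogue}"
    )


def render_window(shots: List[Shot]) -> str:
    """Concatenate consecutive shots into one prompt body."""
    return "\n\n".join(render_shot_text(s) for s in shots)
```

Repeating the shot number in the text is what lets the textual "Shot i" references line up with the marker overlaid on the frames.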

2. Causal Sequence Prediction: Making preceding decisions visible to each boundary judgment

A drawback of pointwise independent classification is that shots are judged in isolation, failing to use "causal dependencies between decisions." By reformulating this as sequence generation, the model outputs Yes/No for multiple shots sequentially. Each boundary judgment causally influences subsequent ones by using previous predictions as context. Attention analysis confirms the model "trusts" previous decisions, allocating less attention to processed shots and more to those pending judgment.
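
The sequence formulation can be illustrated with a small sketch of the target string and its parser. The one-judgment-per-line layout and the helper names are assumptions; the source only states that the model emits "Shot i: Yes/No" sequentially.

```python
import re
from typing import Dict, List

# Matches one judgment such as "Shot 13: Yes".
_JUDGMENT = re.compile(r"Shot\s+(\d+):\s*(Yes|No)\b")


def format_targets(shot_ids: List[int], is_boundary: List[bool]) -> str:
    """Build the training target: one "Shot i: Yes/No" line per shot."""
    if len(shot_ids) != len(is_boundary):
        raise ValueError("shot_ids and is_boundary must have the same length")
    return "\n".join(
        f"Shot {i}: {'Yes' if b else 'No'}" for i, b in zip(shot_ids, is_boundary)
    )


def parse_predictions(generated: str) -> Dict[int, bool]:
    """Recover {shot_id: is_boundary} from generated text."""
    return {
        int(m.group(1)): m.group(2) == "Yes" for m in _JUDGMENT.finditer(generated)
    }
```

Because the string is decoded left to right, the tokens for "Shot 12: No" are already in the context when the model decides on Shot 13, which is where the causal dependency between decisions comes from.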

3. Context-Focus Window Mechanism: Providing sufficient evidence for every judged shot

Shots at the ends of a sequence naturally lack context on one side, leading to performance drops at the edges. Scene-VLM uses a 20-shot context window but only performs predictions for the middle 10 shots (focus window), ensuring each evaluated shot has ample evidence from both sides. Ablations show that removing this mechanism causes a sharp drop in F1 at edge positions, while its presence ensures consistency.
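
A minimal sketch of the windowing logic follows. It assumes the focus window is centered in the context window and that windows advance by the focus length, so each shot is predicted exactly once. The source does not say how the first and last shots of a video are handled; this sketch clips the context there.

```python
from typing import List, Tuple


def make_windows(
    num_shots: int, context: int = 20, focus: int = 10
) -> List[Tuple[range, range]]:
    """Return (context_range, focus_range) pairs covering all shots.

    Only shots in focus_range are predicted; the remaining shots in
    context_range serve as evidence on either side.
    """
    if focus <= 0 or focus > context:
        raise ValueError("require 0 < focus <= context")
    margin = (context - focus) // 2  # context shots on each side of the focus
    windows = []
    for start in range(0, num_shots, focus):
        focus_rng = range(start, min(start + focus, num_shots))
        ctx_start = max(0, start - margin)                   # clip at video start
        ctx_end = min(num_shots, start + focus + margin)     # clip at video end
        windows.append((range(ctx_start, ctx_end), focus_rng))
    return windows
```

For a 25-shot video this gives three windows; the middle one sees shots 5 to 24 as context while predicting only shots 10 to 19.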

4. VLM Confidence Extraction: Reading scores from Yes/No logits

Unlike encoders with classification heads, VLMs do not directly output a boundary score. Scene-VLM computes a normalized confidence from the softmax probabilities of the Yes and No judgment tokens: \(\text{conf}_i = P(\text{Yes}) / (P(\text{Yes}) + P(\text{No}))\). Thresholding this score allows precision-recall trade-offs as in traditional methods, and the same technique applies to any VLM output framed as a binary choice.
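
The confidence formula can be sketched in a few lines. Because the vocabulary-wide softmax normalizer cancels in the ratio, the score reduces to a sigmoid of the logit difference. The function names are illustrative, and the sketch assumes "Yes" and "No" each map to a single token ID, which may not hold for every tokenizer.

```python
import math
from typing import Sequence


def yes_confidence(logit_yes: float, logit_no: float) -> float:
    """conf = P(Yes) / (P(Yes) + P(No)) = sigmoid(logit_yes - logit_no)."""
    diff = logit_yes - logit_no
    if diff >= 0:  # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def full_vocab_confidence(
    logits: Sequence[float], yes_idx: int, no_idx: int
) -> float:
    """Same quantity computed the long way, via a softmax over all tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    p_yes, p_no = exps[yes_idx] / z, exps[no_idx] / z
    return p_yes / (p_yes + p_no)
```

A boundary is then declared when the confidence exceeds a chosen threshold; raising the threshold trades recall for precision.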

5. Explainability Alignment: Enabling the model to state "why this is a boundary"

While encoders only output a confidence score, Scene-VLM can generate coherent natural language explanations (e.g., "The scene transitions from indoors to outdoors, and both the characters and narrative topic have changed") through targeted fine-tuning on a small set of samples with annotated explanations.

Loss & Training

  • Standard next-token prediction loss
  • Base Model: Qwen2.5-VL-7B
  • Training Data: MovieNet-318 (190 movies for training)
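
For reference, the standard next-token objective can be written as below. The notation is chosen here rather than taken from the paper: \(x\) is the multimodal prompt (frames, dialogue, characters), \(y_{1:T}\) is the target token sequence of "Shot i: Yes/No" judgments, and \(\theta\) are the model parameters. Restricting the sum to the response tokens is an assumption.

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, x\right)
\]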

Key Experimental Results

Main Results (MovieNet-318)

| Method | F1 ↑ | AP ↑ |
| --- | --- | --- |
| BaSSL | 47.0 | 57.4 |
| TranS4mer | 48.4 | 60.8 |
| MEGA | 55.3 | 58.6 |
| Chapter-LLaMA | 38.6 | 41.5 |
| Scene-VLM (Ours) | 62.1 | 66.8 |
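
As a reminder of what the two columns measure, here is a minimal sketch of F1 at a fixed threshold and of non-interpolated average precision, computed from per-shot confidences and binary boundary labels. The paper's exact evaluation protocol (for example, per-movie averaging or the threshold used) is not given in the source, so this is the generic definition only.

```python
from typing import Sequence


def f1_at_threshold(
    conf: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> float:
    """F1 of the boundary class when predicting conf >= threshold."""
    tp = fp = fn = 0
    for c, y in zip(conf, labels):
        pred = c >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(conf: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of precision values at the rank of each true boundary."""
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

F1 depends on the chosen threshold, whereas AP summarizes the whole precision-recall curve, which is why a usable confidence score is needed to report AP at all.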

Zero-shot Cross-domain (BBC Planet Earth)

| Method | AP ↑ |
| --- | --- |
| TranS4mer | 43.6 |
| Scene-VLM (Ours) | 45.8 |

Ablation Study

| Configuration | F1 | AP | Description |
| --- | --- | --- | --- |
| Ours | 62.1 | 66.8 | - |
| No vision | 32.0 | 34.7 | Vision is the core signal |
| No Shot-ID | 60.8 | 64.1 | Temporal anchoring is valuable |
| No subtitles | 61.1 | 62.2 | Subtitles provide complementary signals |
| Vision only | 58.6 | 61.4 | Multimodal fusion provides +3.5 F1 |
| Context 20 + Focus 10 | 62.1 | - | Optimal configuration |
| Context 20 + Focus 1 (Pointwise) | 60.1 | - | Sequence prediction outperforms pointwise |
| Context 5 + Focus 5 | 55.8 | - | Larger context is better |

Model Scale Impact

| Parameters | F1 | AP |
| --- | --- | --- |
| 1.5B | 55.9 | 58.7 |
| 3B | 59.6 | 62.8 |
| 7B | 62.1 | 66.8 |

Key Findings

  • Vision is the most important signal source (F1 drops by about 30 points without it, from 62.1 to 32.0), while subtitles and character IDs add complementary information (+3.5 F1 over the vision-only configuration).
  • Attention analysis shows that after length normalization, the attention on subtitle and character tokens is comparable to visual tokens.
  • The model shows higher attention toward subsequent shots than preceding ones, as preceding info is already encoded in the output tokens.
  • The focus mechanism is critical for edge positions: F1 drops sharply at edges without it but remains consistent across positions with it.
  • Performance improves monotonically with scale: F1 rises from 55.9 (1.5B) to 59.6 (3B) to 62.1 (7B), and AP from 58.7 to 62.8 to 66.8. The gains have not saturated at 7B, suggesting potential benefits from even larger models.

Highlights & Insights

  • Paradigm Shift: Transitioning from an encoder classification framework to a VLM sequence generation framework addresses multimodal fusion, sequence dependency, and explainability simultaneously.
  • Confidence Extraction Technique: The method of calculating normalized confidence from Yes/No logits is simple and effective, providing a general solution for applying VLMs to binary classification tasks.
  • Deep Attention Analysis: Reveals the information flow pattern in VLMs during scene boundary prediction: the model trusts its earlier predictions while attending heavily to upcoming context.
  • Zero-shot generalization on BBC indicates the framework is not restricted to the movie domain.

Limitations & Future Work

  • Sampling 3 frames per shot might be insufficient to capture scenes with intense intra-shot motion.
  • A 20-shot context window may be inadequate for extremely long movies; hierarchical or memory-augmented expansions are needed.
  • Inference speed may be slower than lightweight encoder methods (latency not reported).
  • Explainability alignment requires manual annotation of explanation samples, involving non-negligible costs.

Comparison with Related Work

  • vs MEGA: MEGA also fuses subtitles and scripts but uses a fixed fusion strategy and pointwise prediction; Scene-VLM is more flexible with end-to-end VLM reasoning.
  • vs Chapter-LLaMA: An LLM-based chaptering method that only uses text descriptions without direct visual processing; its F1 on MovieNet is only 38.6.
  • vs TranS4mer: Uses self-attention and SSM for long-range dependency but remains an encoder-based approach without explainability.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First application of VLM to video scene segmentation; paradigm innovation solves multiple long-standing pain points.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely detailed ablations (modalities, windows, frames, scale) and in-depth attention analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure and intuitive diagrams; logic from methodology to analysis is complete.
  • Value: ⭐⭐⭐⭐ Provides a new paradigm for video structural understanding; confidence extraction and explainability designs have broad transfer value.