SummDiff: Generative Modeling of Video Summarization with Diffusion
- Conference: ICCV 2025
- arXiv: 2510.08458
- Area: Image Generation
- Keywords: Video Summarization, Diffusion Models, Conditional Generation, Subjectivity Modeling, Knapsack Problem
TL;DR
SummDiff is the first work to introduce diffusion models into video summarization, formulating the task as a conditional generation problem. By learning the distribution of "good summaries," the model generates diverse plausible summaries that better reflect the inherent subjectivity of the video summarization task.
Background & Motivation
Video summarization aims to select keyframes from long videos to preserve core content. However, the task is inherently highly subjective: different annotators often disagree on what constitutes a "good summary."
Key limitations of existing methods include:
Ignoring annotation diversity: Most methods average frame-level importance scores from multiple annotators as the training target. This regression-based approach discards the individual perspectives of different annotators. For example, if half the annotators select clips from the first quarter of a video and the other half select clips from the last quarter, simple averaging assigns similar intermediate scores to both segments, failing to distinguish between two equally valid summarization strategies.
Deterministic output: Given a video, only a single summary can be generated, failing to reflect the plurality of reasonable summarization possibilities.
Imperfect evaluation metrics: Existing protocols give insufficient attention to the knapsack step, and F1 scores are overly sensitive to how the video is segmented.
The authors propose a generative perspective: treating video summarization as a conditional generation task, where the model learns a probability distribution over good summaries and generates diverse plausible summaries through sampling.
Method
Overall Architecture
SummDiff consists of three stages: (1) video encoding; (2) diffusion-based importance score denoising; and (3) knapsack optimization to produce the final summary.
1. Video Encoding
A pretrained image encoder extracts per-frame features \(\mathbf{z}_i \in \mathbb{R}^D\), which are then contextualized via self-attention to yield the global visual feature matrix \(\mathbf{Z} \in \mathbb{R}^{N \times D}\).
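As a minimal sketch of this stage, here is single-head self-attention over per-frame features in numpy; the actual encoder, head count, and projection sizes are not specified in the summary and the toy dimensions below are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextualize(frame_feats, Wq, Wk, Wv):
    """Single-head self-attention over per-frame features (N, D) -> (N, D)."""
    Q, K, V = frame_feats @ Wq, frame_feats @ Wk, frame_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return attn @ V  # global visual feature matrix Z

rng = np.random.default_rng(0)
N, D = 8, 16                              # 8 frames, 16-dim features (toy sizes)
z = rng.standard_normal((N, D))           # stand-in for pretrained encoder outputs
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
Z = contextualize(z, Wq, Wk, Wv)
assert Z.shape == (N, D)
```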
2. Video Importance Score Denoiser (Core)
Forward process: For a single annotator's importance scores \(\mathbf{s}_0 \in [0,1]^N\), a logit transformation is first applied: \(\mathbf{u}_0 = \log \frac{\mathbf{s}_0}{1-\mathbf{s}_0}\). Gaussian noise is then added in logit space following the standard diffusion forward process: \(\mathbf{u}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{u}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\), where \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and \(\bar{\alpha}_t\) is the cumulative noise schedule.
Codebook quantization: The noised logits \(\mathbf{u}_t\) are mapped back to \([0,1]\) via sigmoid, then uniformly divided into \(K\) bins, each corresponding to a learnable \(D\)-dimensional embedding. This maps the scalar scores \(\mathbf{u}_t \in \mathbb{R}^N\) to \(\mathcal{C}(\mathbf{u}_t) \in \mathbb{R}^{N \times D}\), enabling compatibility with cross-attention.
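A sketch of the logit transform, forward noising, and codebook lookup under the definitions above; the clipping epsilon, the noise level, and the codebook size \(K=32\) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def logit(s, eps=1e-4):
    s = np.clip(s, eps, 1 - eps)          # clipping avoids infinite logits at 0/1
    return np.log(s / (1 - s))

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

def forward_noise(u0, alpha_bar_t, rng):
    """q(u_t | u_0): scale the clean logits and add Gaussian noise."""
    eps = rng.standard_normal(u0.shape)
    return np.sqrt(alpha_bar_t) * u0 + np.sqrt(1 - alpha_bar_t) * eps

def quantize(u_t, codebook):
    """Map scalar logits to D-dim embeddings via K uniform bins on [0, 1]."""
    K = codebook.shape[0]
    bins = np.clip((sigmoid(u_t) * K).astype(int), 0, K - 1)
    return codebook[bins]                 # (N,) -> (N, D)

rng = np.random.default_rng(0)
N, D, K = 10, 16, 32
s0 = rng.uniform(size=N)                  # one annotator's importance scores
u_t = forward_noise(logit(s0), alpha_bar_t=0.5, rng=rng)
emb = quantize(u_t, codebook=rng.standard_normal((K, D)))
assert emb.shape == (N, D)
```

In a real model the codebook rows would be learnable embeddings; here they are random vectors just to show the shape bookkeeping.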
Transformer cross-attention denoising: The quantized embeddings \(\mathcal{C}(\mathbf{u}_t)\) serve as queries, and the visual features \(\mathbf{Z}\) serve as keys/values. AdaLN-Zero injects timestep and positional conditioning, \(\text{AdaLN}(\mathbf{h}) = \boldsymbol{\gamma}(t) \odot \text{LayerNorm}(\mathbf{h}) + \boldsymbol{\beta}(t)\), with the residual-branch gate initialized to zero, avoiding information aliasing between the conditioning signals and the score embeddings.
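The DiT-style AdaLN-Zero pattern can be sketched as follows; the sublayer and conditioning vectors below are placeholders, and the key property shown is that the zero-initialized gate makes the block an identity map at initialization:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_block(h, sublayer, gamma, beta, alpha):
    """DiT-style AdaLN-Zero: condition-dependent scale/shift, zero-init gate."""
    modulated = gamma * layernorm(h) + beta   # inject timestep/position condition
    return h + alpha * sublayer(modulated)    # alpha starts at 0 -> identity at init

rng = np.random.default_rng(0)
h = rng.standard_normal((6, 16))
gamma, beta = rng.standard_normal(16), rng.standard_normal(16)
out = adaln_zero_block(h, sublayer=lambda x: x @ rng.standard_normal((16, 16)),
                       gamma=gamma, beta=beta, alpha=np.zeros(16))
assert np.allclose(out, h)  # zero-initialized gate leaves the input unchanged
```

In training, \(\gamma\), \(\beta\), and \(\alpha\) would be regressed from the timestep (and positional) embedding rather than fixed as here.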
3. Training and Inference
Training loss: The denoising MSE is computed against each individual annotator's scores rather than their average, so every annotator's annotation serves as a separate training target.
Inference: Starting from random noise \(\mathbf{u}_T \sim \mathcal{N}(0, \mathbf{I})\), the model performs iterative denoising via the DDIM reverse process, followed by a sigmoid to obtain \(\hat{\mathbf{s}}_0 \in (0,1)^N\).
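A minimal sketch of the deterministic DDIM reverse process in logit space; the linear \(\bar{\alpha}\) schedule and the `tanh` stand-in for the learned denoiser are illustrative assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

def ddim_sample(denoiser, N, alpha_bars, rng):
    """Deterministic DDIM reverse process (eta = 0) over the logit scores."""
    u = rng.standard_normal(N)                      # u_T ~ N(0, I)
    for t in range(len(alpha_bars) - 1, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        u0_hat = denoiser(u, t)                     # model's clean-logit estimate
        eps_hat = (u - np.sqrt(ab_t) * u0_hat) / np.sqrt(1 - ab_t)
        u = np.sqrt(ab_prev) * u0_hat + np.sqrt(1 - ab_prev) * eps_hat
    return sigmoid(u)                               # scores back in (0, 1)

rng = np.random.default_rng(0)
alpha_bars = np.linspace(0.99, 0.01, 50)            # toy schedule: clean -> noisy
scores = ddim_sample(lambda u, t: np.tanh(u),       # stand-in for the denoiser
                     N=10, alpha_bars=alpha_bars, rng=rng)
assert scores.shape == (10,) and np.all((scores > 0) & (scores < 1))
```

Sampling different initial noise \(\mathbf{u}_T\) is what yields the diverse summaries the generative formulation is after.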
4. Summary Generation (KTS + Knapsack)
KTS is used to segment the video into semantic shots. Shot-level importance is computed as the mean frame score, and dynamic programming solves the knapsack problem under a budget constraint (e.g., \(\rho=0.15\)) to select the optimal subset of shots.
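A minimal sketch of the shot-level selection step as a 0/1 knapsack solved by dynamic programming; the toy shot scores and lengths are made up, with the budget playing the role of \(\rho \times\) video length:

```python
import numpy as np

def knapsack_select(shot_scores, shot_lengths, budget):
    """0/1 knapsack DP: pick shots maximizing total score within a length budget."""
    n = len(shot_scores)
    dp = np.zeros((n + 1, budget + 1))
    for i in range(1, n + 1):
        w, v = shot_lengths[i - 1], shot_scores[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - w] + v)
    # backtrack to recover the chosen shot indices
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= shot_lengths[i - 1]
    return sorted(chosen)

# toy example: 4 shots, budget = 15% of a 100-frame video
scores = [0.9, 0.2, 0.7, 0.5]   # mean frame score per shot
lengths = [10, 5, 8, 4]         # shot lengths in frames
picked = knapsack_select(scores, lengths, budget=15)  # -> [0, 3]
```

Here the DP picks shots 0 and 3 (total length 14, total score 1.4), beating any other subset that fits the budget.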
Key Experimental Results
Main Results
| Method | SumMe τ | SumMe ρ | TVSum τ | TVSum ρ |
|---|---|---|---|---|
| Random | 0.000 | 0.000 | 0.000 | 0.000 |
| Human | 0.205 | 0.213 | 0.177 | 0.204 |
| CSTA | 0.108 | 0.120 | 0.168 | 0.221 |
| SummDiff | 0.133 | 0.148 | 0.173 | 0.226 |
Under the TVT setting, SummDiff achieves a 23% relative improvement in τ on SumMe over the strongest baseline (CSTA) and outperforms all baselines on TVSum.
Large-Scale Evaluation on Mr. HiSum
| Method | τ | ρ | MAP@50% | MAP@15% |
|---|---|---|---|---|
| CSTA | 0.128 | 0.185 | 63.38 | 30.42 |
| SummDiff | 0.175 | 0.238 | 65.44 | 33.83 |
On the large-scale Mr. HiSum dataset (31,892 videos), SummDiff substantially outperforms the strongest baseline CSTA across all metrics, demonstrating strong scalability.
Ablation Study
| Configuration | SumMe τ | TVSum τ |
|---|---|---|
| w/o codebook (scalar input) | 0.105 | 0.152 |
| Simple additive conditioning | 0.118 | 0.161 |
| AdaLN-Zero + codebook | 0.133 | 0.173 |
Both codebook quantization and AdaLN-Zero conditioning injection are critical to performance gains.
Highlights & Insights
- First application of diffusion models to video summarization, transforming a deterministic regression problem into a conditional generation problem that naturally accommodates task subjectivity.
- Training on individual annotator scores rather than averaged scores effectively preserves diverse summarization perspectives.
- Codebook quantization elegantly resolves the dimensionality mismatch between scalar importance scores and high-dimensional cross-attention.
- New evaluation metrics are proposed, offering deeper assessment through knapsack-level analysis.
Limitations & Future Work
- Evaluation reliability on conventional small-scale datasets (SumMe/TVSum) is limited, with only 25 and 50 videos respectively.
- Training requires per-annotator individual scores; on large-scale datasets such as Mr. HiSum, only aggregated annotations are available, limiting the full advantage of the generative formulation.
- The logit-space transformation requires clipping, which may introduce numerical bias.
Related Work & Insights
- Traditional video summarization: VASNet, PGL-SUM, CSTA, and others predict averaged scores via regression.
- Generative approaches: GAN-based summarization (SUM-GAN) uses adversarial loss but targets a different objective.
- Diffusion models: This work represents the first application to video summarization, drawing on the conditional generation mechanism of DiT.
Rating
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Technical Depth | 4 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Overall | 4.0 |