# TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
Conference: ICLR 2026 | arXiv: 2603.01169 | Code: https://github.com/smkim37/TripleSumm | Area: Audio & Speech | Keywords: video summarization, triple-modality fusion, adaptive weighting, multi-scale temporal, large-scale dataset
## TL;DR
This paper proposes TripleSumm, which adjusts modality importance dynamically at the frame level via a Multi-scale Temporal block (hierarchical sliding-window attention) and a Cross-modal Fusion block (a fusion token that adaptively weights the visual, text, and audio streams). The authors also release MoSu, the first large-scale triple-modality video summarization dataset (52,678 videos), and report SOTA results on 4 benchmarks.
## Background & Motivation
Background: Video summarization extracts key segments to represent the original video content. Existing methods primarily rely on visual features combined with attention mechanisms.
Limitations of Prior Work: Modality importance varies dynamically from frame to frame (e.g., text is more informative when a judge is speaking; visual and audio are more informative during a robot performance), yet existing methods employ static or modality-agnostic fusion strategies. Furthermore, no large-scale triple-modality dataset exists.
Core Idea: Adaptive frame-level modality fusion combined with a large-scale triple-modality benchmark.
## Method
### Overall Architecture
Visual/text/audio streams of the raw video → modality-specific pretrained encoders (GoogLeNet/CLIP + RoBERTa + AST) → linear projection + LayerNorm to a shared dimension \(D\) → per-frame aggregation into fusion tokens \(\mathbf{E}^f\) → alternating \(L\)-layer stacks of Multi-scale Temporal blocks (MST) + Cross-modal Fusion blocks (CMF) → prediction head outputting frame-level importance scores \(\hat{S} \in [0,1]\) → high-scoring frames selected to compose the summary.
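As a rough sketch of this pipeline (feature dimensions, module layout, and the shared dimension \(D\) are illustrative assumptions, and the alternating MST/CMF stack is elided):

```python
import torch
import torch.nn as nn

# Illustrative skeleton of the pipeline; the 1024/768-d encoder outputs and
# D = 64 are assumptions, not the paper's exact configuration.
D = 64
proj = nn.ModuleDict({
    "v": nn.Sequential(nn.Linear(1024, D), nn.LayerNorm(D)),  # visual features
    "t": nn.Sequential(nn.Linear(768, D), nn.LayerNorm(D)),   # text features
    "a": nn.Sequential(nn.Linear(768, D), nn.LayerNorm(D)),   # audio features
})
head = nn.Sequential(nn.Linear(D, 1), nn.Sigmoid())  # importance in [0, 1]

B, N = 2, 30  # 2 videos, 30 frames each
feats = {"v": torch.randn(B, N, 1024),
         "t": torch.randn(B, N, 768),
         "a": torch.randn(B, N, 768)}
h = {m: proj[m](x) for m, x in feats.items()}   # project to shared dim D
h["f"] = (h["v"] + h["t"] + h["a"]) / 3.0       # fusion tokens via averaging
# ... L alternating MST + CMF layers would refine all four streams here ...
scores = head(h["f"]).squeeze(-1)               # (B, N) frame-level scores
```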
### Key Designs
- Multi-scale Temporal Block (MST):
    - Employs Window Self-Attention (WSA) with window size \(w\) increasing layer by layer (small windows in early layers capture local dependencies; large windows in later layers capture long-range dependencies).
    - Reduces complexity from \(O(N^2)\) for full attention to \(O(w \cdot N)\).
    - Cross-modal parameter sharing — the same WSA parameters are applied to all four token types (fusion/visual/text/audio).
    - Design Motivation: Temporal dynamics in video frames (e.g., scene transitions, rhythm changes) are modality-agnostic; shared parameters efficiently capture universal temporal patterns.
- Cross-modal Fusion Block (CMF):
    - The fusion token \(\mathbf{h}^f_i\) serves as the query, while the three modality tokens \(\mathbf{h}^{\{v,t,a\}}_i\) serve as keys/values.
    - Cross-attention achieves frame-level adaptive weighting — at each timestep the model independently determines which modality to attend to.
    - Design Motivation: Avoids the modality bias present in conventional methods (e.g., always using visual as the query); the fusion token acts as a neutral anchor that treats all three modalities equally.
- Fusion Token Design:
    - \(\mathbf{e}^f_i = \text{Agg}(\mathbf{e}^v_i, \mathbf{e}^t_i, \mathbf{e}^a_i)\), where the aggregation function can be average pooling or an MLP.
    - Temporal Positional Encoding (TPE) and Learnable Modality Embeddings (LME) are added to distinguish timesteps and modality sources.
    - Key property: after being updated in CMF, fusion tokens carry information from the most relevant modality without directly modifying the individual modality tokens.
- MoSu Dataset Construction:
    - Filtered from YouTube-8M by: (1) availability of English subtitles and audio tracks; (2) >50,000 views to obtain "Most Replayed" statistics; (3) duration ≥120 s to ensure sufficient length.
    - Final dataset: 52,678 videos covering visual, text, and audio modality features.
    - Annotations derived from YouTube "Most Replayed" heatmaps — collective behavioral feedback from at least 50,000 viewers per video.
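The MST and CMF designs described above can be sketched as follows. Shapes, hyperparameters, and class names are illustrative assumptions based on the description, not the authors' implementation:

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """MST core (sketch): self-attention restricted to windows of size w,
    reducing cost from O(N^2) to O(w*N); the same parameters are shared
    across the fusion/visual/text/audio streams."""
    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        B, N, D = x.shape
        pad = (-N) % self.window              # right-pad to a window multiple
        if pad:
            x = torch.cat([x, x.new_zeros(B, pad, D)], dim=1)
        xw = x.reshape(-1, self.window, D)    # fold windows into the batch dim
        out, _ = self.attn(xw, xw, xw)
        return out.reshape(B, -1, D)[:, :N]

class CrossModalFusion(nn.Module):
    """CMF core (sketch): per frame, the fusion token is the query and the
    three modality tokens are keys/values, giving adaptive modality weights."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h_f, h_v, h_t, h_a):
        B, N, D = h_f.shape
        q = h_f.reshape(B * N, 1, D)
        kv = torch.stack([h_v, h_t, h_a], dim=2).reshape(B * N, 3, D)
        fused, w = self.attn(q, kv, kv)       # w: per-frame modality weights
        return h_f + fused.reshape(B, N, D), w.reshape(B, N, 3)

# One alternating "layer": shared WSA over all four streams, then fusion.
wsa = WindowSelfAttention(dim=64, num_heads=4, window=8)
cmf = CrossModalFusion(dim=64, num_heads=4)
h_f, h_v, h_t, h_a = (torch.randn(2, 30, 64) for _ in range(4))
h_f, h_v, h_t, h_a = (wsa(s) for s in (h_f, h_v, h_t, h_a))  # shared params
h_f, weights = cmf(h_f, h_v, h_t, h_a)        # weights sum to 1 per frame
```

In a full stack, `window` would grow with layer depth, realizing the local-to-global schedule the paper describes.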
### Loss & Training
L2 regression loss: \(\mathcal{L} = \|S - \hat{S}\|_2^2\), predicting frame-level importance scores. The final summary is generated by selecting temporally coherent segments that maximize predicted scores.
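The objective is a one-liner in practice; as a sketch (random tensors stand in for model outputs and "Most Replayed"-derived targets), sum reduction matches the squared L2 norm:

```python
import torch
import torch.nn.functional as F

# Squared L2 regression between predicted and target frame importance
# scores; reduction="sum" makes this exactly ||S - S_hat||_2^2.
torch.manual_seed(0)
s_hat = torch.sigmoid(torch.randn(2, 30))      # predicted scores in [0, 1]
s = torch.rand(2, 30)                          # frame importance targets
loss = F.mse_loss(s_hat, s, reduction="sum")
```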
## Key Experimental Results
### Main Results
| Benchmark | Metric | TripleSumm | Prev. SOTA | Gain |
|---|---|---|---|---|
| MoSu | Kendall τ | 0.145 | 0.107 (CFSum) | +35.5% |
| Mr.HiSum | Kendall τ | 0.105 | 0.089 | +18.0% |
| SumMe | F1 | 52.3 | 50.1 | +2.2 pts |
| TVSum | F1 | 63.7 | 61.5 | +2.2 pts |
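For reference, Kendall τ measures rank agreement between predicted and ground-truth frame importance orderings. A minimal pure-Python version (the τ-a variant, which assumes no tied scores; published benchmark numbers typically use library implementations handling ties):

```python
# Kendall tau (tau-a, no ties): fraction of concordant frame pairs minus
# discordant pairs, ranging over [-1, 1].
def kendall_tau(pred, gt):
    n = len(pred)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (pred[i] - pred[j]) * (gt[i] - gt[j])
            s += (d > 0) - (d < 0)   # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)

# 4 concordant and 2 discordant pairs out of 6 -> tau = 1/3
tau = kendall_tau([0.9, 0.2, 0.7, 0.1], [0.8, 0.1, 0.9, 0.3])
```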
### Ablation Study
| Configuration | MoSu τ | Note |
|---|---|---|
| Full TripleSumm | 0.145 | Complete model |
| Visual only | 0.091 | Significant degradation |
| Visual + Text | 0.128 | Audio contributes clearly |
| w/o MST | 0.121 | Multi-scale is important |
| w/o CMF | 0.118 | Adaptive fusion is critical |
### Key Findings
- TripleSumm degrades gracefully under missing modalities — it dynamically relies on available modalities and does not collapse when a single modality is absent.
- Qualitative analysis shows that fusion tokens adaptively attend to different modalities across frames (e.g., judge-speaking frames → high text weight; musical performance frames → high audio weight).
- Visual-only models achieve τ=0.091 on MoSu; adding text raises this to 0.128, and further adding audio yields 0.145 — each modality contributes independently.
- The multi-scale design of MST is particularly important for long videos — single-window-size variants drop τ by approximately 16% on MoSu.
- Parameter efficiency is high: TripleSumm has a parameter count comparable to the visual-only PGL-SUM, yet leverages three times as many information sources.
## Highlights & Insights
- The fusion token as a "neutral anchor" for cross-modal interaction is the key innovation — it eliminates the modality bias caused by using visual features as the query in conventional methods.
- The hierarchical window design of MST enables the model to build temporal understanding progressively from local to global, which is especially important for long videos (>2 minutes).
- Incorporating three modalities not only improves performance but also enhances robustness — performance degrades gracefully when any single modality is missing.
- The "Most Replayed" annotation scheme for MoSu is a pragmatic choice, leveraging collective viewing behavior as a free proxy for frame importance.
## Limitations & Future Work
- MoSu is based on YouTube "Most Replayed" and may be biased toward entertainment content, with insufficient coverage of educational or professional videos.
- The fusion token is initialized via simple average aggregation; more sophisticated initialization (e.g., gating) could yield further improvements.
- Only frozen features from pretrained encoders are used; end-to-end fine-tuning of the encoders may unlock additional potential.
- The WSA window size increases according to a fixed schedule; adaptive window size adjustment could be more beneficial.
- Comparisons with LLM-based summarization approaches (e.g., VideoLLMs) remain unexplored.
## Related Work & Insights
- vs. CFSum: CFSum also uses three modalities but employs static fusion; TripleSumm's frame-level adaptive weighting is the key differentiator.
- vs. A2Summ: A2Summ focuses on audio-visual dual-modality, whereas TripleSumm provides complete coverage of all three modalities.
- vs. PGL-SUM/CSTA: These visual-only Transformer methods lag significantly behind on MoSu, validating the necessity of multimodal input.
- MoSu Dataset: The first large-scale triple-modality video summarization benchmark, derived from "Most Replayed" statistics of 52,678 YouTube videos with at least 50,000 viewers each.
- Inspiration: The design concept of the fusion token as a cross-modal interaction "anchor" is transferable to other multimodal tasks such as multimodal retrieval and video question answering.
## Rating
- Novelty: ⭐⭐⭐⭐ Adaptive triple-modality fusion + large-scale dataset, contributing both methodology and resources.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 benchmarks + ablation + qualitative analysis + missing-modality robustness tests.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with intuitive figures.
- Value: ⭐⭐⭐⭐⭐ Both the MoSu dataset and the triple-modality fusion method offer lasting value.