UWAV: Uncertainty-Weighted Weakly-Supervised Audio-Visual Video Parsing

Conference: CVPR 2025
arXiv: 2505.09615
Code: Yes (to be released)
Area: Audio & Speech / Weakly-Supervised Learning
Keywords: Audio-Visual Video Parsing, Weakly-Supervised, Uncertainty Weighting, Pseudo-Labels, Feature Mixup

TL;DR

The authors propose UWAV, a weakly-supervised audio-visual video parsing framework. By pre-training a temporal-aware module on large-scale annotated data to generate high-quality pseudo-labels, and employing three techniques—uncertainty-weighted soft labels, class-balanced reweighting, and feature mixup—UWAV improves weakly-supervised training performance and achieves state-of-the-art (SOTA) results on the LLP dataset.

Background & Motivation

Background: Audio-visual video parsing (AVVP) aims to classify each temporal segment in a video into audio events, visual events, or audio-visual synchronized events. Since frame-level annotations are extremely expensive, mainstream methods adopt weak supervision—utilizing only video-level labels without frame-level annotations.

Limitations of Prior Work: The core difficulty of weakly-supervised AVVP lies in the poor quality of pseudo-labels. The frame-level pseudo-labels generated by existing methods (e.g., VALOR, PPL) have limited accuracy because they (1) learn temporal modeling from scratch on the target dataset with insufficient data; (2) utilize hard binary pseudo-labels, which discards the confidence information of the model's predictions; and (3) suffer from extreme imbalance between positive and negative classes (most frames are "no event").

Key Challenge: Weakly-supervised learning requires high-quality pseudo-labels to compensate for missing frame-level annotations, but generating good pseudo-labels itself requires strong temporal modeling capabilities—a chicken-and-egg problem.

Goal: This work aims to break this loop by acquiring temporal modeling capabilities through external pre-training, and further improve training quality by softening pseudo-labels with uncertainty information.

Key Insight: A temporal Transformer is pre-trained on a large-scale annotated dataset (UnAV) and transferred to the target dataset to generate pseudo-labels. Converting pseudo-labels from hard 0/1 values to continuous confidence values (the sigmoid of the logit's signed distance to the threshold) preserves uncertainty information.

Core Idea: External pre-training \(\rightarrow\) high-quality pseudo-labels \(\rightarrow\) uncertainty-weighted soft labels + class balancing + feature mixup = enhanced weakly-supervised training.

Method

Overall Architecture

A five-step pipeline is proposed: (1) pre-train a temporal-aware Transformer on UnAV (using CLIP/CLAP features + 5 self-attention layers); (2) generate frame-level pseudo-labels on the target dataset using the pre-trained model; (3) convert hard pseudo-labels into uncertainty-weighted soft labels; (4) perform feature mixup for data augmentation; (5) apply class-balanced loss reweighting. The final model is jointly trained with soft labels, mixup regularization, and class balancing.
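Steps (2) and (3) can be illustrated with a minimal sketch that uses only the Python standard library. It assumes per-segment logits are already available from some pre-trained generator; the function names, the toy logits, and the threshold value 0.5 are illustrative assumptions, not values taken from the paper.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def hard_pseudo_labels(logits, video_labels, theta):
    """Threshold each logit, then mask with the video-level labels.

    logits: T x C nested list (segments x event classes).
    video_labels: length-C list of 0/1 weak labels for the whole video.
    """
    return [[int(z > theta) * y for z, y in zip(row, video_labels)]
            for row in logits]

def soft_pseudo_labels(logits, video_labels, theta):
    """Sigmoid of the signed distance to the threshold, masked the same way."""
    return [[sigmoid(z - theta) * y for z, y in zip(row, video_labels)]
            for row in logits]

# Toy example: 2 segments, 3 classes; the video is weakly labeled with classes 0 and 2.
logits = [[2.0, 0.1, -1.0],
          [0.55, 3.0, 0.4]]
video_labels = [1, 0, 1]
theta = 0.5  # hypothetical threshold

hard = hard_pseudo_labels(logits, video_labels, theta)
soft = soft_pseudo_labels(logits, video_labels, theta)
print(hard)  # [[1, 0, 0], [1, 0, 0]]
print([[round(p, 3) for p in row] for row in soft])
```

In this toy case both segments receive the same hard label for class 0, but the soft label separates the confident segment (logit 2.0) from the one that barely clears the threshold (logit 0.55, soft label close to 0.5). Class 1 is zeroed out in both variants because it is absent from the video-level labels.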

Key Designs

  1. External Pre-trained Pseudo-Label Generator:

    • Function: Learn temporal dependencies from large-scale annotated data to generate high-quality frame-level pseudo-labels.
    • Mechanism: The visual branch extracts features using a CLIP image encoder, while the audio branch extracts features using a CLAP encoder. Both are fed into a 5-layer Transformer to capture temporal dependencies. Event embeddings are obtained via prompt templates ("A photo of <<>>" / "This is the sound of <<>>") to calculate similarity for predictions. The generator is pre-trained on the UnAV dataset with frame-level annotations, and then generates pseudo-labels on the target dataset as \(\hat{y}_t^v = \mathbb{1}_{\{\hat{z}_t^v > \theta^v\}} \odot y\) (intersecting with video-level labels).
    • Design Motivation: The generated pseudo-labels outperform those of existing methods by 12.7 percentage points in visual F1 (74.5 vs. PPL's 61.8), because the pre-trained temporal model is better at determining whether an event is present in each frame.
  2. Uncertainty-Weighted Soft Labels:

    • Function: Preserve the confidence information of the model's predictions for each frame.
    • Mechanism: Instead of using hard 0/1 pseudo-labels, \(\hat{p}_t^v = \text{Sigmoid}(\hat{z}_t^v - \theta^v) \odot y\) is used as soft labels. The farther the logit is from the threshold \(\theta^v\), the closer the Sigmoid is to 0 or 1, indicating higher confidence. The closer the logit is to the threshold, the closer the Sigmoid is to 0.5, indicating lower confidence (uncertainty).
    • Design Motivation: Hard pseudo-labels treat "almost certain events" and "marginal samples that barely pass the threshold" equally, but the latter are more likely to be errors. Soft labels apply smaller gradients to these uncertain samples during training.
  3. Uncertainty-Weighted Feature Mixup:

    • Function: Provide data augmentation and regularization to prevent overfitting to pseudo-labels.
    • Mechanism: Two temporal segments \((t_i, t_j)\) are randomly selected to mix features: \(\bar{f}_{t_i,t_j} = \lambda \tilde{f}_{t_i} + (1-\lambda)\tilde{f}_{t_j}\), where the mixup coefficient \(\lambda \sim \text{Beta}(1.7, 1.7)\). The corresponding soft labels are mixed with the same ratio.
    • Design Motivation: Traditional Mixup is effective for classification tasks. However, in weakly-supervised scenarios, the labels themselves contain noise. Therefore, the mixup is applied to uncertainty-weighted soft labels instead of hard labels.
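The mixup rule in the third design can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the two segments come from the same video, that features and soft labels are plain Python lists, and the helper name `mixup_segments` is hypothetical. Only the mixing formula and the \(\text{Beta}(1.7, 1.7)\) coefficient come from the description above.

```python
import random

def mixup_segments(features, soft_labels, alpha=1.7, rng=None):
    """Mix two randomly chosen segments and their soft labels with one ratio.

    features: T x D nested list of segment features.
    soft_labels: T x C nested list of uncertainty-weighted soft labels.
    Returns (mixed_feature, mixed_label, lam, (i, j)).
    """
    rng = rng or random.Random()
    i, j = rng.sample(range(len(features)), 2)  # two distinct segment indices
    lam = rng.betavariate(alpha, alpha)         # lambda ~ Beta(alpha, alpha)
    mixed_feature = [lam * a + (1 - lam) * b
                     for a, b in zip(features[i], features[j])]
    mixed_label = [lam * a + (1 - lam) * b
                   for a, b in zip(soft_labels[i], soft_labels[j])]
    return mixed_feature, mixed_label, lam, (i, j)

# Toy data: 3 segments, 2-dimensional features, 2 event classes.
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
labels = [[0.9, 0.0], [0.1, 0.0], [0.5, 0.0]]
mixed_f, mixed_p, lam, (i, j) = mixup_segments(feats, labels, rng=random.Random(0))
print(lam, (i, j), mixed_f, mixed_p)
```

Because the same \(\lambda\) is applied to both features and labels, the mixed label always lies between the two source soft labels, so an uncertain source segment yields a correspondingly uncertain mixed target.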

Loss & Training

The total loss is defined as \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{w-soft}} + \mathcal{L}_{\text{mix}} + \mathcal{L}_{\text{video}}\). Here, \(\mathcal{L}_{\text{w-soft}}\) is the class-balanced reweighted binary cross-entropy (BCE) loss on soft labels, where the positive sample weight is \(w_{\text{pos}} = \frac{N_{\text{neg}}}{NTC} \times W\), with \(N_{\text{neg}}\) the number of negative samples and \(W=0.5\). \(\mathcal{L}_{\text{mix}}\) denotes the BCE loss on mixed features, and \(\mathcal{L}_{\text{video}}\) is the standard video-level BCE loss.
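The reweighted soft-label term can be sketched as below, under several assumptions that this summary does not settle: \(N\), \(T\), and \(C\) are read as the numbers of videos, segments, and classes; an entry counts as negative when its soft label is at most 0.5 (equivalent to a hard pseudo-label of 0); the negative weight `w_neg` is a placeholder set to 1.0; and the weight enters the BCE in the usual positive-weight form. The function names are hypothetical.

```python
import math

def positive_weight(soft_labels, W=0.5):
    """w_pos = (number of negative entries / (N*T*C)) * W.

    soft_labels: N x T x C nested list. An entry is counted as negative when
    its soft label is <= 0.5 (assumption: matches a hard pseudo-label of 0).
    """
    flat = [p for video in soft_labels for seg in video for p in seg]
    num_neg = sum(1 for p in flat if p <= 0.5)
    return num_neg / len(flat) * W

def weighted_soft_bce(preds, soft_labels, w_pos, w_neg=1.0, eps=1e-7):
    """Mean weighted BCE between predicted probabilities and soft labels.

    w_neg is an assumed placeholder; the summary only specifies w_pos.
    """
    total, count = 0.0, 0
    for pred_video, label_video in zip(preds, soft_labels):
        for pred_seg, label_seg in zip(pred_video, label_video):
            for q, p in zip(pred_seg, label_seg):
                q = min(max(q, eps), 1.0 - eps)  # avoid log(0)
                total -= w_pos * p * math.log(q) + w_neg * (1.0 - p) * math.log(1.0 - q)
                count += 1
    return total / count

# Toy batch: 1 video, 2 segments, 2 classes; 3 of the 4 entries are negative.
labels = [[[0.9, 0.0], [0.2, 0.0]]]
w_pos = positive_weight(labels)  # 3/4 * 0.5
loss_close = weighted_soft_bce([[[0.9, 0.01], [0.2, 0.01]]], labels, w_pos)
loss_far = weighted_soft_bce([[[0.1, 0.99], [0.8, 0.99]]], labels, w_pos)
print(w_pos, loss_close, loss_far)
```

Predictions that track the soft labels give a lower loss than inverted ones, and changing `W` rescales only the positive term.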

Key Experimental Results

Main Results

Frame-level F1 (%) on the LLP dataset:

| Metric | VALOR | PPL | UWAV | Gain (vs. PPL) |
| --- | --- | --- | --- | --- |
| Visual (V) | 65.1 | 66.7 | 70.0 | +3.3 |
| Audio-Visual (AV) | 61.2 | 61.9 | 63.4 | +1.5 |
| Type@AV | 64.3 | 64.8 | 65.9 | +1.1 |

Comparison of pseudo-label quality (F1):

| Metric | VALOR | PPL | UWAV |
| --- | --- | --- | --- |
| Visual | 61.7 | 61.8 | 74.5 |
| Type@AV | 66.0 | 60.6 | 72.8 |

Ablation Study

| Configuration | Type@AV (gain vs. baseline) |
| --- | --- |
| Hard pseudo-labels (Baseline) | 64.2 |
| + Soft labels | 64.4 (+0.2) |
| + Class balancing | 65.4 (+1.2) |
| + Feature mixup | 65.2 (+1.0) |
| All combinations | 65.9 (+1.7) |

Key Findings

  • Pseudo-label quality is the primary contributor: The pre-trained generator improves the visual pseudo-label F1 from 61.8% to 74.5% (+12.7 percentage points), yielding the largest gain among all techniques.
  • Class balancing and feature mixup exhibit comparable contributions: They add 1.2 and 1.0 percentage points of Type@AV F1 over the baseline respectively, with a partially complementary effect.
  • Soft labels yield minor standalone gains: Moving from hard to soft labels alone improves Type@AV by only 0.2 percentage points, but the gain accumulates when combined with class balancing and mixup.
  • Marginal improvement on the AVE dataset: UWAV achieves 80.6% vs. VALOR's 80.4% (+0.2 percentage points), indicating limited benefits of the proposed method on smaller datasets.

Highlights & Insights

  • External pre-training breaks the weakly-supervised loop: Pre-training a temporal model on large-scale annotated data is a straightforward and highly effective solution to the low-quality pseudo-label issue in weak supervision (+12.7 percentage points of visual pseudo-label F1).
  • Simple modeling of uncertainty: Utilizing \(\text{Sigmoid}(\text{logit} - \text{threshold})\) as confidence avoids the need for auxiliary uncertainty estimation networks or Bayesian methods. The simple intuition that distance from the threshold represents certainty is natural and carries zero overhead.
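  • Complementary combination of three techniques: Soft labels handle label noise, class balancing handles distribution skewness, and Mixup handles overfitting. Because these components address different issues, combining them gives the best result (+1.7 points of Type@AV), although this is less than the sum of the individual gains in the ablation (+0.2, +1.2, +1.0), so the techniques are only partially additive.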

Limitations & Future Work

  • Reliance on external annotated data: Requires pre-training on the UnAV dataset, which increases data requirements and computational overhead (80 epochs of pre-training).
  • Underperformance on certain metrics: Audio F1 (64.2 vs. PPL's 65.9) and event-level Event@AV (57.4 vs. 57.9) lag behind some existing methods.
  • Fixed 1-second segments: The model cannot localize events more finely than one second, so events that start or end within a segment, or that last less than a second, are not captured precisely.
  • Dataset-specific hyperparameter tuning: Hyperparameters like \(\alpha\), \(W\), and threshold \(\theta\) require fine-tuning for different datasets.
  • Limited efficacy on small datasets: The improvement is only +0.2 percentage points on AVE, indicating that the pre-training advantage is more pronounced when data is abundant.

Comparison with Prior Work

  • vs. VALOR: VALOR learns temporal modeling internally on the target dataset, which limits pseudo-label quality. UWAV leverages external pre-training to gain stronger temporal understanding.
  • vs. PPL: PPL employs a similar pseudo-label generation strategy but does not utilize uncertainty information. UWAV's soft labels make training more robust to noise.
  • vs. HAN: HAN is an early weakly-supervised method that lacks strong pre-trained features and pseudo-label refinement strategies.

Rating

  • Novelty: ⭐⭐⭐ None of the components (pseudo-labels, soft labels, Mixup, class balancing) is individually novel, but their combination is sensible and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation on two datasets with detailed ablation studies and pseudo-label quality analysis.
  • Writing Quality: ⭐⭐⭐⭐ Step-by-step methodology presentation with clear logic.
  • Value: ⭐⭐⭐ Achieves state-of-the-art (SOTA) results on the LLP dataset, though the novelty of individual components is limited.