Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment¶

Conference: CVPR 2025
arXiv: 2503.12847
Code: None
Area: Image Segmentation
Keywords: Audio-Visual Segmentation, Modality Alignment, Uncertainty Estimation, Contrastive Learning, Semantic Grouping

TL;DR¶

This work proposes an Audio-guided Modality Alignment (AMA) module and an Uncertainty Estimation (UE) module for audio-visual segmentation. They target two failure modes: associating the audio with visually similar but silent objects, and over- or under-segmentation when an object's sounding state changes frequently. The method achieves a 4.2% boost on AVS-Semantic.

Background & Motivation¶

Audio-Visual Segmentation (AVS) aims to localize and segment sounding objects in videos based on audio cues. Existing approaches primarily focus on spatio-temporal multimodal modeling but neglect two key challenges:

Spatial Ambiguity: When visually similar but acoustically different objects in a scene are close to each other (e.g., two identical-looking dogs with only one barking), global attention mechanisms struggle to distinguish the sounding object from the silent one, leading to over-segmentation.
Temporal Ambiguity: The sounding state of an object changes frequently (e.g., a dog intermittently barking). Existing temporal modeling tends to generate overly smooth predictions, ignoring sudden transition points.

Statistical analysis of the AVSS-V2 dataset conducted by the authors reveals that 33.3% of the sampled subsets contain a significant number of these challenging frames, indicating this is a widespread and critical problem.

Method¶

Overall Architecture¶

The framework consists of a visual encoder (\(L\) blocks), an audio encoder, and a mask decoder. An AMA module is inserted after each visual encoder block to perform audio-guided modality alignment. The fused features of all frames pass through a temporal attention layer and are then simultaneously fed into a mask prediction head and an uncertainty estimation head. The final prediction confidence is adjusted using the generated uncertainty map.

Key Designs¶

1. Audio-guided Modality Alignment (AMA): Semantic Grouping + Audio-Guided Merging + Contrastive Learning

Function: Focuses model attention on audio-related visual regions to distinguish sounding objects from silent ones.
Mechanism: First, visual features are grouped into \(P\) semantic clusters using DPC-KNN density peak clustering. Then, intra-group audio-visual cross-attention is conducted to merge intra-group features into a compact representation based on their audio response weights (amplifying high-response features and suppressing low-response ones). Through multi-layer iterations, sounding regions are progressively enhanced. Finally, an InfoNCE contrastive loss is applied to maximize the feature distance between positive samples (high audio response) and negative samples (low audio response).
Design Motivation: Global attention disperses model focus across all visually similar regions. Restricting the interaction range via grouping and introducing audio-guided feature competition forces the model to distinguish between sounding and silent objects. Unlike stepstone-style methods, this does not rely on ground-truth masks.
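
The grouping and merging steps described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `dpc_knn_groups` and `audio_guided_merge`, the neighbour count `k`, the distance scaling, and the use of a single audio embedding with a softmax over dot-product responses are all assumptions made for the example.

```python
import numpy as np

def dpc_knn_groups(tokens, num_groups, k=5):
    """Assign each of N tokens (N, D) to one of `num_groups` clusters, DPC-KNN style."""
    # Pairwise Euclidean distances, scaled by sqrt(D) (scaling is an assumption).
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1)) / np.sqrt(tokens.shape[1])
    # Local density from the k nearest neighbours (column 0 is the token itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))
    # Distance to the nearest token with higher density;
    # the densest token(s) receive the maximum distance instead.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()
    # Cluster centers are the tokens with the largest density * delta.
    centers = np.argsort(-(density * delta))[:num_groups]
    labels = dist[:, centers].argmin(axis=1)
    labels[centers] = np.arange(num_groups)
    return labels, centers

def audio_guided_merge(tokens, audio, labels, num_groups):
    """Merge each group into one vector, weighted by its audio response."""
    # Per-token audio response (hypothetical form: scaled dot product).
    scores = tokens @ audio / np.sqrt(tokens.shape[1])
    merged = np.zeros((num_groups, tokens.shape[1]))
    for g in range(num_groups):
        idx = np.where(labels == g)[0]
        # Softmax inside the group: high-response tokens dominate the merge.
        w = np.exp(scores[idx] - scores[idx].max())
        w /= w.sum()
        merged[g] = w @ tokens[idx]
    return merged, scores
```

Because the softmax is taken inside each group, tokens only compete with their semantic neighbours, which is the "restricted interaction range" the design motivation refers to.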

2. Uncertainty Estimation (UE) Module

Function: Identifies high-uncertainty regions caused by frequent changes in sounding state, reducing the prediction confidence in these regions.
Mechanism: After temporal attention, the fused features branch into two outputs: mask prediction logits \(m\) and uncertainty logits \(\alpha\) (modeled via a Dirichlet distribution). The final prediction weights the mask probability map by the uncertainty map, lowering confidence where uncertainty is high.
Design Motivation: Temporal modeling tends to produce smooth predictions, which leaves state-transition frames highly uncertain. Explicitly estimating uncertainty and lowering confidence on those frames helps avoid incorrect over-segmentation at transition points.
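
One common way to turn Dirichlet parameters into an uncertainty map is the evidential formulation, sketched below. The summary does not give the paper's exact formula, so everything here is an assumption: the softplus evidence, the `K / sum(alpha)` uncertainty, the `prob * (1 - u)` weighting, and the names `dirichlet_uncertainty` and `adjust_confidence`.

```python
import numpy as np

def dirichlet_uncertainty(unc_logits):
    """unc_logits: (K, H, W). Returns alpha (K, H, W) and uncertainty (H, W)."""
    # Softplus keeps the evidence non-negative (assumed activation).
    evidence = np.logaddexp(0.0, unc_logits)
    alpha = evidence + 1.0
    strength = alpha.sum(axis=0)
    # With no evidence, alpha is all ones and the uncertainty equals 1.
    uncertainty = unc_logits.shape[0] / strength
    return alpha, uncertainty

def adjust_confidence(mask_logits, uncertainty):
    """Down-weight mask probabilities where uncertainty is high (assumed rule)."""
    prob = 1.0 / (1.0 + np.exp(-mask_logits))
    return prob * (1.0 - uncertainty)
```

Under this rule a pixel keeps most of its mask probability only when the uncertainty head has accumulated evidence, so frames around a sounding-state change are suppressed rather than confidently mis-segmented.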

3. Multi-level Compact Representation Update

Function: Iteratively updates the compact representation via multi-layer Transformer decoders to progressively converge high-level semantics.
Mechanism: \(G_l \leftarrow G_l + \text{softmax}(G_l f_{v_l}^T / \sqrt{D_l} + S) f_{v_l}\), incorporating the correlation score \(S\) into the attention weights to ensure that tokens with higher audio responses contribute more. The updated compact representation is mapped back to the visual feature map for the next layer.
Design Motivation: Single-pass alignment is insufficient; multi-layer iterations allow the sounding regions to be progressively enhanced while silent regions are gradually attenuated.
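
The update rule above translates almost directly into array code. The sketch below assumes shapes the summary does not state: \(G_l\) is `(P, D)`, \(f_{v_l}\) is `(N, D)`, and \(S\) is a per-token score of shape `(N,)` broadcast across the \(P\) groups. The function name `update_compact` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def update_compact(G, f_v, S):
    """G <- G + softmax(G f_v^T / sqrt(D) + S) f_v."""
    D = G.shape[-1]
    # Adding S before the softmax biases attention toward high-audio-response tokens.
    attn = softmax(G @ f_v.T / np.sqrt(D) + S, axis=-1)  # (P, N)
    return G + attn @ f_v

# Example: one group, two tokens; the first token has a much higher audio score.
G = np.zeros((1, 2))
f_v = np.array([[1.0, 0.0], [0.0, 1.0]])
updated = update_compact(G, f_v, S=np.array([100.0, 0.0]))
```

In the example, `updated` is pulled almost entirely toward the first token, which illustrates how \(S\) lets high-response tokens contribute more.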

Loss & Training¶

\[\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda \mathcal{L}_{\text{cst}} + \mu \mathcal{L}_{\text{unc}}\]

The segmentation loss contains cross-entropy, Dice, and IoU losses; the contrastive loss \(\mathcal{L}_{\text{cst}}\) is formulated as InfoNCE; the uncertainty loss \(\mathcal{L}_{\text{unc}}\) is based on the KL divergence of the Dirichlet distribution.
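
As an illustration of the contrastive term and of how the three losses combine, the sketch below implements a generic InfoNCE with the audio embedding as anchor. Cosine similarity, the temperature of 0.1, and the default weights `lam` and `mu` are assumptions; the summary does not report the paper's values.

```python
import numpy as np

def info_nce(anchor, positives, negatives, temperature=0.1):
    """anchor: (D,), positives: (Np, D), negatives: (Nn, D). Returns a scalar loss."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positives), unit(negatives)
    pos = p @ a / temperature  # (Np,) similarity of each positive to the anchor
    neg = n @ a / temperature  # (Nn,) similarity of each negative to the anchor
    # Each positive is contrasted against all negatives.
    neg_tiled = np.broadcast_to(neg, (len(pos), len(neg)))
    logits = np.concatenate([pos[:, None], neg_tiled], axis=1)
    # Stable log-sum-exp for the denominator.
    m = logits.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float((log_denominator - pos).mean())

def total_loss(l_seg, l_cst, l_unc, lam=1.0, mu=1.0):
    """L = L_seg + lambda * L_cst + mu * L_unc (weights here are placeholders)."""
    return l_seg + lam * l_cst + mu * l_unc
```

The loss is near zero when high-response features align with the audio anchor and low-response features do not, and it grows as the two sets become confusable.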

Key Experimental Results¶

Main Results: AVS Benchmark Comparison¶

Method	AVS-Object \(\mathcal{J\&F}\)↑	AVS-Semantic \(\mathcal{J\&F}_\beta\)↑	VPO-MSMI \(\mathcal{J\&F}_\beta\)↑
TPAVI	78.7	29.7	-
CATR	81.4	37.6	-
Prev. SOTA	-	~42.0	~35.0
Ours	84.1	+4.2 vs SOTA	+11.5 vs SOTA

Ablation Study: Contribution of Each Module¶

Component	AVS-Semantic \(\mathcal{J\&F}_\beta\)
Baseline	37.6
+ Semantic Grouping	39.8
+ Audio-Guided Merging	41.2
+ Contrastive Learning	42.8
+ Uncertainty Estimation	44.2
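
To read the ablation as per-component gains, the rows can be differenced as below. This assumes the rows are cumulative (each row adds one component to the previous configuration), which the table layout suggests but does not state.

```python
# Scores copied from the ablation table above.
ablation = [
    ("Baseline", 37.6),
    ("+ Semantic Grouping", 39.8),
    ("+ Audio-Guided Merging", 41.2),
    ("+ Contrastive Learning", 42.8),
    ("+ Uncertainty Estimation", 44.2),
]

def incremental_gains(rows):
    """Gain of each row over the row before it, rounded to one decimal."""
    return [(name, round(score - prev, 1))
            for (_, prev), (name, score) in zip(rows, rows[1:])]

gains = incremental_gains(ablation)
total = round(ablation[-1][1] - ablation[0][1], 1)
for name, gain in gains:
    print(f"{name}: +{gain}")
print(f"Total over baseline: +{total}")
```

Under that reading, the three AMA components together account for 5.2 of the 6.6 points gained over the baseline, and uncertainty estimation accounts for the remaining 1.4.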

Key Findings¶

The largest improvement (+11.5 over the previous SOTA) is on VPO-MSMI, a highly challenging scenario with multiple concurrent sound sources, which points to the AMA module's ability to handle complex scenes.
Uncertainty estimation is most effective on frames with frequent state transitions, where it reduces over-segmentation.
Automatically constructing positive and negative samples based on audio responsiveness outperforms construction using ground-truth masks, as the latter prevents the model from learning the distinct acoustic response of each individual object.
The number of groups \(P\) affects performance; too few groups fail to distinguish similar objects, while too many groups introduce additional noise.

Highlights & Insights¶

Paradigm Shift from "Global Alignment" to "Grouped Competition": Replacing global attention with semantic grouping and audio-guided feature competition allows for more precise localization of sound sources.
Uncertainty Estimation for State Transitions: Rather than attempting to precisely predict transition frames, the model acknowledges the uncertainty and lowers its confidence there, which is an elegant solution.
Automatic Contrastive Sample Construction (based on audio responsiveness) avoids reliance on ground-truth masks.

Limitations & Future Work¶

The grouping strategy relies on DPC-KNN clustering, which adds computational overhead and requires tuning the number of groups.
The capability to handle extreme noise or silent videos remains to be verified.
The Dirichlet assumption behind the uncertainty estimation may not hold for all data distributions.
Adaptive grouping strategies that do not require a predefined number of groups can be explored in the future.

Related Work & Insights¶

Relationship with TPAVI/CATR: These approaches employ global attention for audio-visual alignment, whereas this work replaces it with grouped competition and contrastive learning.
Relationship with BAVS: BAVS utilizes frame-level inputs to avoid over-segmentation, whereas this work resolves the same issue under clip-level inputs via uncertainty estimation.
Insights: In multi-modal alignment tasks, "discriminative alignment" (grouping + competition) is more effective than "indiscriminate alignment" (global attention).

Rating¶

⭐⭐⭐⭐

The paper systematically analyzes and addresses two core challenges in AVS. The AMA and UE modules are well-designed and complement each other, and state-of-the-art performance is achieved across multiple benchmarks. The technical highlights are the audio-driven grouped competition mechanism and uncertainty-aware prediction.