Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
Conference: CVPR 2025
arXiv: 2503.12840
Code: Yes (See paper for GitHub link)
Area: Segmentation / Multimodal VLM
Keywords: Audio-Visual Segmentation, Audio Semantic Derivation, Dynamic Elimination, Intra-class Discriminability Enhancement, Multimodal Alignment
TL;DR
DDESeg targets two audio-specific problems: feature confusion in mixed audio, and intra-class variation among different sounds from the same object. A Dynamic Derivation Module derives an independent, more discriminative representation for each sound source from the mixed signal, and a Dynamic Elimination Module then filters out audio semantics that have no visual counterpart, such as off-screen voices. The method achieves SOTA performance on all AVS benchmarks.
Background & Motivation
Background: Audio-Visual Segmentation (AVS) aims to locate and segment sounding objects in images or videos based on audio signals. Unlike text- or image-guided segmentation, audio possesses unique characteristics: multiple sound sources overlap in the time-frequency domain, and the same object can emit vastly different sounds. Existing methods primarily address this by designing complex audio-visual interaction architectures, paying less attention to the challenges inherent to the audio itself.
Limitations of Prior Work: (1) Feature Confusion: When multiple sound sources coexist, their signals overlap heavily in frequency, timbre, and time (e.g., the spectrograms of a horse neighing and a dog barking share a large overlapping region), making it difficult for the extracted audio semantics to distinguish individual sound sources. (2) Difficult Audio-Visual Alignment: Sounds from the same object exhibit substantial intra-class variation (e.g., a cat can howl, growl, hiss, or meow), so the feature distribution for a single class of sounds is highly scattered. Consequently, models struggle to consistently associate diverse sounds with the same visual object.
Key Challenge: Existing methods (such as the quantized decomposition in QDFormer and audio separation in CPM) assume independent sound events, attempting to directly decompose or separate audio semantics from mixed signals. However, since sound sources share frequencies, these methods are prone to losing crucial semantic details or generating incomplete representations.
Goal: (1) How to obtain independent semantic representations of each sound source from mixed audio without losing information? (2) How to enhance the discriminability of audio representations to tackle intra-class variation? (3) How to filter out audio semantics that cannot be matched visually, such as off-screen sounds?
Key Insight: Instead of separating audio signals, the approach "derives" them. By utilizing a pre-constructed semantic memory bank, it retrieves the \(K\) nearest class centers of the mixed audio and derives independent semantic representations for each sound source by compensating for differences. Then, it utilizes intra-class discriminative feature scaling to enhance the discriminability of the representations.
Core Idea: Derivation instead of separation. Derive sound-source semantics from mixed audio and enhance their discriminability, then dynamically eliminate visually irrelevant audio.
Method
Overall Architecture
DDESeg adopts a dual-branch framework for multi-stage audio-visual feature alignment. The core workflow is: (1) An audio encoder extracts audio features \(F_a\); (2) A Dynamic Derivation Module (DDM) derives \(K\) independent source semantic representations \(\hat{A}\) from \(F_a\); (3) In subsequent stages, a Dynamic Elimination Module (DEM) evaluates the relevance score of each audio representation based on visual features to suppress irrelevant audio; (4) A Feature Fusion Module progressively aligns refined audio features with visual features layer by layer; (5) A segmentation head outputs the final prediction.
Key Designs
- Dynamic Derivation Module (DDM):
    - Function: Derive \(K\) audio representations with independent source semantics from mixed audio features.
    - Mechanism: Involves three steps:
        - Step 1 - Semantic Memory Construction: Use a pre-trained audio model to extract features of single-source audio, and apply hierarchical clustering to each class to obtain class-level global centers \(\mu^c\) and sub-cluster representative features \(x_{rep_j}^c\).
        - Step 2 - Audio Prototype Derivation: Compute the distance between the input \(F_a\) and each class center, finding the \(K\) nearest centers \(\{\mu_i\}_{i=1}^K\). Inspired by the generalized Laplacian operator, compute boundary features \(e_{u_i} = \phi_{GELU}(W_e(\mu_i - F_a) + b_e)\) and weight them to obtain the compensation \(\Delta a_{u_i}\). Finally, \(a_i = F_a + \Delta a_{u_i}\).
        - Step 3 - Discriminability Enhancement: Utilize the difference between the intra-class representative features \(x_{rep_j}^c\) and the derived representation \(a_i\) to learn a scaling shift \(\Delta a_{c_i,j}\), enhancing discriminability via \(\hat{a}_i = a_i \odot (1 + \Delta a_{c_i,j})\). Note that Step 2 uses addition for inter-class adjustment (semantic shift across classes), while Step 3 uses multiplication for intra-class enhancement (keeping the same-class semantic space undeformed).
    - Design Motivation: Instead of separating signals (which loses information on shared frequencies), the approach starts from the semantic memory of "knowing what sounds exist in the world" to derive representations by calculating difference compensations. Intra-class discriminability enhancement addresses the challenge of massive intra-class variations like meowing vs. howling cats.
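Steps 2 and 3 of the DDM can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how the boundary features are "weighted" or how the scaling shift is learned, so the sigmoid gate `w_gate`, the projection `W_s`/`b_s`, and the `tanh` bound are assumptions, and the function names are hypothetical. The usage lines at the end reuse the retrieved class centers as stand-ins for the sub-cluster representatives.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def derive(f_a, centers, W_e, b_e, w_gate, k):
    """Step 2: additive compensation toward the k nearest class centers.

    f_a: (d,) mixed-audio feature; centers: (C, d) class centers from the memory.
    W_e: (d, d) and b_e: (d,) follow the boundary-feature formula in the text.
    w_gate: (d,) is a HYPOTHETICAL gating vector (the weighting form is assumed).
    """
    dist = np.linalg.norm(centers - f_a, axis=1)    # distance to every class center
    nearest = np.argsort(dist)[:k]                  # indices of the k nearest centers
    mu = centers[nearest]                           # (k, d)
    e = gelu((mu - f_a) @ W_e.T + b_e)              # boundary features e_{u_i}
    gate = 1.0 / (1.0 + np.exp(-(e @ w_gate)))      # assumed per-source weight in (0, 1)
    delta = gate[:, None] * e                       # compensation, (k, d)
    return f_a[None, :] + delta, nearest            # a_i = F_a + delta_i

def enhance(a, reps, W_s, b_s):
    """Step 3: multiplicative intra-class scaling, a_hat = a * (1 + delta_c).

    reps: (k, d), one intra-class representative per derived source.
    W_s, b_s and the tanh bound are ASSUMPTIONS about how the shift is learned.
    """
    delta_c = np.tanh((reps - a) @ W_s.T + b_s)     # scaling shift in (-1, 1)
    return a * (1.0 + delta_c)

# Usage with random weights (illustrative only).
rng = np.random.default_rng(0)
d, C, k = 8, 5, 2
f_a = rng.normal(size=d)
centers = rng.normal(size=(C, d))
a, nearest = derive(f_a, centers, 0.1 * rng.normal(size=(d, d)),
                    np.zeros(d), rng.normal(size=d), k)
a_hat = enhance(a, centers[nearest], 0.1 * rng.normal(size=(d, d)), np.zeros(d))
```

The additive step moves the mixed feature toward each retrieved class, while the multiplicative step only rescales the coordinates of an already derived representation, which mirrors the inter-class versus intra-class split described in Step 3.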
- Dynamic Elimination Module (DEM):
    - Function: Filter out audio representations that do not match the current visual frame (e.g., off-screen sounds).
    - Mechanism: First, apply Gumbel-Softmax to perform soft clustering on visual features \(V\), yielding \(K\) visual semantic centers \(C_v\). Then, perform multi-head cross-attention between audio representations \(\hat{A}\) and visual centers \(C_v\) (with audio as the query) to obtain fused features \(F_{av}\). Compute a relevance score \(S \in [0,1]^K\), one entry per audio representation, via an MLP + sigmoid, and finally suppress irrelevant audio via \(\hat{A} \leftarrow S \cdot \hat{A}\), i.e., each representation is scaled by its own score.
    - Design Motivation: Instead of hard-threshold filtering (which blocks gradient flow), differentiable score weighting is utilized for soft elimination.
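The DEM mechanism above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: attention is single-head rather than multi-head, the cluster queries `P`, the projections `Wq`/`Wk`/`Wv`, and the two-layer scorer (`W1`, `b1`, `w2`, `b2`) are hypothetical parameter names, and the way cluster logits are computed (`v @ P.T`) is not specified in the summary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot samples: softmax((logits + Gumbel noise) / tau)."""
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    return softmax((logits - np.log(-np.log(u))) / tau, axis=-1)

def dem(a_hat, v, params, rng, tau=1.0):
    """Soft elimination of visually irrelevant audio representations.

    a_hat: (K, d) audio representations; v: (N, d) flattened visual features.
    params: dict of HYPOTHETICAL weights (P, Wq, Wk, Wv, W1, b1, w2, b2).
    Returns the rescaled representations (K, d) and the scores S (K,).
    """
    d = a_hat.shape[1]
    # Soft clustering of the N visual tokens into K centers C_v.
    assign = gumbel_softmax(v @ params["P"].T, tau, rng)            # (N, K)
    c_v = (assign.T @ v) / (assign.sum(axis=0)[:, None] + 1e-8)     # (K, d)
    # Cross-attention with audio as the query (single head for brevity).
    q, key, val = a_hat @ params["Wq"], c_v @ params["Wk"], c_v @ params["Wv"]
    f_av = softmax(q @ key.T / np.sqrt(d), axis=-1) @ val           # (K, d)
    # MLP + sigmoid gives one relevance score per audio representation.
    hidden = np.maximum(f_av @ params["W1"] + params["b1"], 0.0)
    s = 1.0 / (1.0 + np.exp(-(hidden @ params["w2"] + params["b2"])))
    return s[:, None] * a_hat, s                                    # S * A_hat
```

Because the scores multiply the representations instead of thresholding them, every audio representation still receives a gradient during training, which is the point of the soft-elimination design.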
- Feature Fusion Module:
    - Function: Align audio and visual features layer by layer.
    - Mechanism: Each fusion block employs cross-attention (audio query + visual key/value), convolutional downsampling, and self-attention + FFN to capture global context.
    - Design Motivation: Multi-stage progressive alignment is more precise than one-time fusion.
Loss & Training
- A weighted combination of three loss terms: \(\mathcal{L} = \lambda_{dice}\mathcal{L}_{dice} + \lambda_{bce}\mathcal{L}_{bce} + \lambda_{iou}\mathcal{L}_{iou}\)
- The weights are set to 5, 5, and 2, respectively.
- PVT-V2-B5 is used as the visual backbone, and VGGish or HTSAT is used as the audio backbone.
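The weighted loss can be sketched in NumPy for a single binary mask, with the weights 5, 5, and 2 assigned to the Dice, BCE, and IoU terms in the order given above. The soft Dice and soft IoU formulations and the smoothing constants are common choices assumed here, not details taken from the paper, and the semantic (multi-class) setting would need a per-class extension.

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss between probabilities p and binary targets y."""
    inter = float((p * y).sum())
    return 1.0 - (2.0 * inter + eps) / (float(p.sum() + y.sum()) + eps)

def bce_loss(p, y, eps=1e-7):
    """Mean binary cross-entropy; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())

def iou_loss(p, y, eps=1e-6):
    """Soft IoU (Jaccard) loss."""
    inter = float((p * y).sum())
    union = float(p.sum() + y.sum()) - inter
    return 1.0 - (inter + eps) / (union + eps)

def total_loss(p, y, w_dice=5.0, w_bce=5.0, w_iou=2.0):
    """L = w_dice * L_dice + w_bce * L_bce + w_iou * L_iou."""
    return w_dice * dice_loss(p, y) + w_bce * bce_loss(p, y) + w_iou * iou_loss(p, y)
```

A perfect prediction drives all three terms to (almost) zero, while an inverted mask is penalized most heavily through the BCE term.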
Key Experimental Results
Main Results
Main results on the AVS datasets (PVT-V2-B5 + VGGish/HTSAT):
| Method | AVS-Objects-S4 (\(\mathcal{J}\&\mathcal{F}\)) | AVS-Objects-MS3 (\(\mathcal{J}\&\mathcal{F}\)) | AVS-Semantic (\(\mathcal{J}\&\mathcal{F}\)) |
|---|---|---|---|
| CAVP [CVPR24] | 90.5 | 72.7 | 55.3 |
| AVSStone [ECCV24] | 87.3 | 72.5 | 61.5 |
| BiasAVS [MM24] | 88.2 | 74.0 | 47.2 |
| DDESeg (VGGish) | 92.0 | 74.7 | 65.7 |
| DDESeg (HTSAT) | 94.2 | 77.9 | 67.9 |
DDESeg (HTSAT) vs. the best prior method on each benchmark: S4 +3.7 (vs. CAVP), MS3 +3.9 (vs. BiasAVS), Semantic +6.4 (vs. AVSStone)
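The margins over prior work can be re-derived from the table with a few lines of Python. The numbers are copied from the table above; the dictionary layout and variable names are only for illustration.

```python
# J&F scores of prior methods, copied from the main results table.
prior = {
    "CAVP":     {"S4": 90.5, "MS3": 72.7, "Semantic": 55.3},
    "AVSStone": {"S4": 87.3, "MS3": 72.5, "Semantic": 61.5},
    "BiasAVS":  {"S4": 88.2, "MS3": 74.0, "Semantic": 47.2},
}
ours = {"S4": 94.2, "MS3": 77.9, "Semantic": 67.9}  # DDESeg (HTSAT)

margins = {}
for bench, score in ours.items():
    # Strongest prior method on this benchmark.
    best = max(prior, key=lambda name: prior[name][bench])
    # Round to one decimal to hide floating-point noise.
    margins[bench] = (best, round(score - prior[best][bench], 1))
```

Note that the strongest prior method differs per benchmark, so the three margins are measured against three different baselines.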
Ablation Study
| Configuration | S4 \(\mathcal{J}\&\mathcal{F}\) | MS3 \(\mathcal{J}\&\mathcal{F}\) | Description |
|---|---|---|---|
| Baseline (w/o DDM/DEM) | 88.1 | 70.2 | Direct audio-visual interaction |
| + DDM Step1+2 (Derivation only) | 90.3 | 72.8 | Derive multi-source representations |
| + DDM Step3 (With discriminative enhancement) | 91.4 | 74.0 | Intra-class discriminative enhancement is effective |
| + DEM (Full model) | 92.0 | 74.7 | Eliminating irrelevant audio brings further improvements |
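The incremental gain of each component follows from differencing consecutive rows of the ablation table. A short Python check, with values copied from the table and illustrative names:

```python
# (configuration, S4 J&F, MS3 J&F), copied from the ablation table.
rows = [
    ("Baseline",      88.1, 70.2),
    ("+ DDM Step1+2", 90.3, 72.8),
    ("+ DDM Step3",   91.4, 74.0),
    ("+ DEM",         92.0, 74.7),
]

# Gain of each row over the row directly above it, rounded to one decimal.
gains = {
    cur[0]: (round(cur[1] - prev[1], 1), round(cur[2] - prev[2], 1))
    for prev, cur in zip(rows, rows[1:])
}
```

The full-model row (92.0 / 74.7) matches the DDESeg (VGGish) row of the main results table.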
Key Findings
- Steps 1+2 of DDM (semantic memory plus derivation) contribute the most (+2.2 on S4, +2.6 on MS3), and Step 3 (discriminative enhancement) adds a further +1.1/+1.2.
- DEM adds +0.6 on S4 and +0.7 on MS3; the slightly larger MS3 gain is consistent with MS3 requiring multiple sound sources to be distinguished.
- Replacing VGGish with HTSAT as the audio encoder improves all three benchmarks (S4 +2.2, MS3 +3.2, Semantic +2.2), indicating that stronger audio features are crucial for AVS.
- The performance improvement is most prominent on AVS-Semantic (+6.4), the most challenging semantic segmentation setting, indicating that this method has the greatest advantage when fine-grained semantic differentiation is required.
- SOTA performance is also achieved on the VPO dataset, verifying the generalization of the method.
Highlights & Insights
- Derivation instead of Separation: Breaking away from the paradigm of "separating mixed audio," the model instead "derives" representation of each sound source using semantic memory. This avoids the information loss in shared frequencies typical of separation methods, which is conceptually more elegant and robust.
- Inter-class Addition + Intra-class Multiplication: Step 2 uses additive compensation for inter-class semantic shifts (facilitating transitions to different class semantic spaces), while Step 3 uses multiplicative scaling for intra-class enhancement (keeping fine-tuning within the same semantic space). The two adjustments thus play complementary roles.
- Soft Elimination Mechanism: Replacing hard-threshold filtering with differentiable, relevance-score soft weighting effectively suppresses interference such as off-screen sounds while maintaining the gradient flow for end-to-end training.
- Hierarchical Construction of Semantic Memory: Performing sub-clustering on each class and selecting representative features of the nearest centroids effectively captures intra-class variation patterns.
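The hierarchical memory construction can be sketched as follows. This is a minimal NumPy illustration, assuming centroid-linkage agglomerative clustering and a fixed number of sub-clusters per class (`n_sub`); the summary only says "hierarchical clustering", so the linkage, the stopping rule, and the function names `agglomerate` and `build_memory` are assumptions. Features are assumed to come from a pre-trained audio model applied to single-source clips.

```python
import numpy as np

def agglomerate(x, n_clusters):
    """Naive centroid-linkage agglomerative clustering; returns lists of row indices."""
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > n_clusters:
        cents = np.stack([x[c].mean(axis=0) for c in clusters])
        dist = np.linalg.norm(cents[:, None] - cents[None, :], axis=-1)
        np.fill_diagonal(dist, np.inf)              # never merge a cluster with itself
        i, j = sorted(int(t) for t in np.unravel_index(np.argmin(dist), dist.shape))
        merged = clusters.pop(j)                    # j > i, so index i stays valid
        clusters[i] = clusters[i] + merged
    return clusters

def build_memory(features_by_class, n_sub):
    """For each class, store the global center and one representative per sub-cluster.

    features_by_class: dict mapping a class label to an (n, d) feature array.
    The representative is the actual feature closest to its sub-cluster centroid.
    """
    memory = {}
    for label, x in features_by_class.items():
        reps = []
        for members in agglomerate(x, n_sub):
            sub = x[members]
            centroid = sub.mean(axis=0)
            reps.append(sub[np.argmin(np.linalg.norm(sub - centroid, axis=1))])
        memory[label] = {"center": x.mean(axis=0), "reps": np.stack(reps)}
    return memory
```

Storing real features rather than centroids as representatives keeps each entry on the data manifold, which matters when a class contains acoustically distinct modes such as meowing and hissing.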
Limitations & Future Work
- The semantic memory bank needs to be pre-constructed, depending on the availability and coverage of single-source audio data.
- \(K\) (the number of derived sound sources) is a fixed hyperparameter, whereas the number of sound sources in real-world scenarios is dynamically variable.
- The multi-stage DDM \(\to\) DEM \(\to\) Fusion pipeline increases model complexity and inference overhead.
- When there is a sounding source in the frame but the sound is extremely weak or masked, the derivation module might fail to find the correct semantic center for it.
- Discriminative enhancement relies on the quality of sub-clustering within the semantic memory, which may yield limited effectiveness for long-tail classes.
Related Work & Insights
- vs QDFormer: QDFormer assumes sound events are independent and uses quantized decomposition to separate audio semantics; DDESeg does not assume independence, handling overlapping signals through derivation rather than decomposition.
- vs CPM: CPM decodes audio-visual fusion features into spectrograms and supervises them with separation losses; DDESeg operates on a semantic level without reverting to the signal level, avoiding reconstruction errors.
- vs CAVP: CAVP scores high on S4 (90.5) but trails DDESeg (HTSAT) by 12.6 points on Semantic (55.3 vs. 67.9), suggesting that CAVP is strong at binary localization but weaker at semantic differentiation.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The "derivation + elimination" paradigm differs from existing separation/decomposition methods and is motivated by an in-depth analysis of problems intrinsic to audio.
- Experimental Thoroughness: ⭐⭐⭐⭐ It covers two major benchmarks (AVS and VPO), with ablation studies clearly demonstrating the contributions of each module.
- Writing Quality: ⭐⭐⭐⭐ The problem motivation is illustrated very clearly, and the methodology description is comprehensive.
- Value: ⭐⭐⭐⭐ The 6.4-point \(\mathcal{J}\&\mathcal{F}\) improvement on AVS-Semantic demonstrates that this method substantively advances the field.