Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Conference: CVPR 2025
arXiv: 2506.18557
Code: https://github.com/VisualAIKHU/OA-SSL
Area: Audio-Visual Learning / Sound Source Localization / Multimodal Contrastive Learning
Keywords: Sound Source Localization, MLLM Supervision, Object-aware Contrastive, Wasserstein Region Isolation

TL;DR

This paper proposes OA-SSL: during the training phase, an MLLM is used to generate fine-grained descriptions of "$K$ sounding objects + 1 silent object" for each image as additional supervision anchors. Then, OCA (object-aware contrastive alignment) and ORI (object region isolation) losses are employed, enabling the model to locate only the truly sounding objects even in complex scenarios with multiple guitars where only one is being played.

Background & Motivation

Background: Audio-visual sound source localization (AVSL) aligns audio and visual features through self-supervised contrastive learning to locate "sounding objects" in video frames.
Limitations of Prior Work: Existing methods only calculate "audio ↔ pixel" similarity, failing to distinguish visually similar but acoustically silent objects—for example, if there are 3 guitars in a frame but only 1 is being played, the model will highlight all guitars.
Key Challenge: Self-supervised audio-visual correspondence loss can only learn "which class of objects makes sound" but cannot learn "which specific instance is sounding at this moment."
Goal:
- Introduce additional fine-grained semantic signals to differentiate sound-making vs. silent objects.
- Ensure that localization regions of different instances in multi-source scenes are separated from each other.
- Keep only the audio and visual inputs during inference to avoid MLLM inference overhead.
Key Insight: MLLMs (e.g., GPT-4V) can judge actions and states when interpreting a multimodal scene, so they can generate descriptions such as "playing guitar" vs. "non-playing guitars and drum set" offline, before training.
Core Idea: Encode the paired "sounding/silent" object descriptions generated by the MLLM into reference anchors, and use them to guide audio-visual feature learning with both contrastive and Wasserstein losses.

Method

Overall Architecture

During Training: Image + audio categories → MLLM → generate $K$ foreground captions (one for each sounding source) + 1 background caption (silent objects) → Text encoder obtains reference features $\mathbf{F}_r^p$ (foreground, $B \times K \times c$) and $l_r^n$ (background, $B \times c$).
Simultaneously: The image goes through the visual encoder to obtain $\mathbf{F}_v$, and the audio goes through the audio encoder to obtain $l_a$. A sound-associated map $\mathbf{S}_a$ is generated according to cosine similarity.
Training Losses: OCA loss + ORI loss + original self-supervised contrastive loss.
During Inference: The MLLM and text branch are completely removed. Only the audio-visual branches and the learned feature space are used, keeping the inference cost unchanged.
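The sound-associated map described above can be sketched as a per-pixel cosine similarity between the visual feature map and the audio feature. This is a minimal illustration in plain Python, not the authors' implementation; the function names and the toy feature values are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def sound_associated_map(visual_feats, audio_feat):
    """visual_feats: H x W grid of c-dim vectors (F_v); audio_feat: c-dim vector (l_a).
    Returns an H x W grid of cosine similarities (S_a)."""
    return [[cosine(pixel, audio_feat) for pixel in row] for row in visual_feats]

# Toy 2x2 feature map with c = 2 (made-up values, for illustration only).
F_v = [[[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 1.0], [-1.0, 0.0]]]
l_a = [1.0, 0.0]
S_a = sound_associated_map(F_v, l_a)
for row in S_a:
    print([round(s, 3) for s in row])
```

Since inference uses only the audio-visual branches, a map of this kind is all that is needed to read off the localization at test time.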

Key Designs

Audio-Visual Scene Understanding (MLLM Supervision)
- Function: Use the MLLM to generate descriptions of sounding and silent objects as anchors.
- Mechanism: A carefully designed prompt forces the MLLM to output $K$ foreground captions (e.g., "a person is playing the leftmost guitar") and 1 background caption (e.g., "two non-playing guitars and a drum set in the background"). These captions are converted into reference features $\mathbf{F}_r^p, l_r^n$ via a text encoder.
- Design Motivation: The common-sense knowledge of MLLMs allows them to judge "playing" vs. "holding", which compensates for the lack of action semantics in self-supervised methods.
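To make the "$K$ foreground captions + 1 background caption" contract concrete, here is a hypothetical sketch of how such a prompt could be assembled and how the response could be split. The prompt wording, `build_caption_prompt`, and `split_captions` are all assumptions for illustration; the paper's actual prompt is not reproduced here.

```python
def build_caption_prompt(audio_categories):
    """Build a hypothetical prompt asking for K foreground captions and 1 background caption."""
    k = len(audio_categories)
    names = ", ".join(audio_categories)
    return (
        f"The audio for this image contains {k} sound source(s): {names}. "
        f"Write exactly {k} foreground caption(s), one per sounding object, "
        "describing the action that produces the sound. "
        "Then write exactly 1 background caption covering visible objects that are silent. "
        "Put each caption on its own line."
    )

def split_captions(lines, k):
    """Split K+1 caption lines into (foreground captions, background caption)."""
    if len(lines) != k + 1:
        raise ValueError(f"expected {k + 1} captions, got {len(lines)}")
    return lines[:k], lines[k]

prompt = build_caption_prompt(["acoustic guitar", "violin"])
# Example response lines, written by hand to mimic the caption style in the text.
fg, bg = split_captions(
    ["a person is playing the leftmost guitar",
     "a woman is bowing a violin",
     "two non-playing guitars and a drum set in the background"],
    k=2,
)
```

Checking the caption count at parse time is one simple way to catch malformed MLLM output before it is encoded into reference features.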
Object-aware Contrastive Alignment (OCA) Loss
- Function: Pull the visual regions of "corresponding sounding objects" closer to foreground anchors while pushing background regions away, with a symmetric constraint applied to silent objects.
- Mechanism: First, the foreground mask $M^p$ and background mask $M^n$ are obtained by thresholding $\mathbf{S}_a$ with a sigmoid function, and the foreground visual feature $l_v^p$ and background visual feature $l_v^n$ are obtained through global average pooling (GAP). Foreground loss: $$\mathcal{L}_{frg} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{p_i}{p_i + n_i^{hard} + n_i^{soft}}$$ where $p_i = \exp(\text{Sim}(l_{v_i}^p, l_{r_i}^p))$ is the positive pair, $n_i^{hard}$ is the hard negative formed from background-foreground pairs, and $n_i^{soft}$ collects the other samples $j$ in the batch whose foreground references are dissimilar to that of sample $i$ ($\text{Sim}(l_{r_j}^p, l_{r_i}^p) \le \tau$). The background loss $\mathcal{L}_{bkg}$ is defined symmetrically. Finally, $\mathcal{L}_{oca} = (\mathcal{L}_{frg} + \mathcal{L}_{bkg})/2$.
- Design Motivation: Merely pulling "audio-pixel" closer is insufficient. The model needs to "see" that silent objects are also an independent category, and false-negative filtering is utilized to prevent sounding objects of the same class from being mistaken as negative samples.
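The foreground half of the OCA loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the sigmoid threshold parameters `eps` and `temp` are hypothetical, the hard negative is assumed to pair a sample's background visual feature with its own foreground reference, the soft negatives are assumed to pair the foreground visual feature with other samples' dissimilar references, and no contrastive temperature is applied.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def masked_pool(pixels, sims, eps=0.5, temp=0.05):
    """Sigmoid-threshold the similarities, then mask-weighted average pooling.
    pixels: flattened list of c-dim vectors; sims: matching S_a values.
    eps and temp are hypothetical hyperparameters.
    Returns (foreground feature, background feature)."""
    m_p = [1.0 / (1.0 + math.exp(-(s - eps) / temp)) for s in sims]
    m_n = [1.0 - m for m in m_p]

    def pool(mask):
        total = sum(mask) + 1e-8
        dims = len(pixels[0])
        return [sum(w * px[d] for w, px in zip(mask, pixels)) / total
                for d in range(dims)]

    return pool(m_p), pool(m_n)

def foreground_loss(fg_vis, bg_vis, fg_ref, tau=0.5):
    """Mean of -log(p / (p + n_hard + n_soft)) over the batch.
    The forms of n_hard and n_soft are assumptions (see lead-in)."""
    batch = len(fg_vis)
    total = 0.0
    for i in range(batch):
        p = math.exp(cosine(fg_vis[i], fg_ref[i]))
        n_hard = math.exp(cosine(bg_vis[i], fg_ref[i]))
        # False-negative filtering: skip samples whose reference is too similar.
        n_soft = sum(
            math.exp(cosine(fg_vis[i], fg_ref[j]))
            for j in range(batch)
            if j != i and cosine(fg_ref[j], fg_ref[i]) <= tau
        )
        total += -math.log(p / (p + n_hard + n_soft))
    return total / batch

# Toy pooling: one pixel is strongly sound-associated, the other is not.
fg_feat, bg_feat = masked_pool([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1])

# Toy batch: samples 0 and 2 share the same reference, so they filter each other out.
fg_vis = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
bg_vis = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
fg_ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
loss_filtered = foreground_loss(fg_vis, bg_vis, fg_ref, tau=0.5)
loss_unfiltered = foreground_loss(fg_vis, bg_vis, fg_ref, tau=2.0)  # tau > 1 disables filtering
```

In the toy batch, filtering removes the same-class sample from the denominator, so the filtered loss is lower than the unfiltered one, which is the effect the false-negative handling is meant to have.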
Object Region Isolation (ORI) Loss
- Function: Ensure that the localization regions of different sound sources are mutually exclusive in the spatial domain for multi-source scenarios.
- Mechanism: Concatenate the $K$ foreground references and 1 background reference into $\mathbf{F}_r \in \mathbb{R}^{B \times (K+1) \times c}$. Compute the similarity map $S_{r_k}$ between each reference and the visual features, then measure the distance between $S_{r_n}$ and $1 - S_{r_m}$ using the first-order Wasserstein (Earth Mover's) distance, computed with the Sinkhorn algorithm: $$\mathcal{L}_{ori} = \sum_i \sum_{n \neq m} D_W(\bar S_{r_n}^i, 1 - \bar S_{r_m}^i)$$
- Design Motivation: Contrastive loss alone cannot guarantee that two sounding objects do not overlap in pixel space (e.g., violin + cello in the same frame). The Wasserstein distance captures how far apart two region distributions are spatially, and it gives a smoother, easier-to-optimize training signal than IoU.
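A single ORI term, $D_W(\bar S_{r_n}, 1 - \bar S_{r_m})$, can be sketched with an entropy-regularized Sinkhorn solver. This is an illustrative approximation rather than the authors' implementation: the ground cost (Euclidean distance between pixel coordinates), the sum-to-one normalization, and the values of `reg`, `max_iters`, and `tol` are all assumptions.

```python
import math

def normalize(sim_map, eps=1e-8):
    """Flatten an H x W map and rescale it to a probability distribution."""
    flat = [max(v, 0.0) + eps for row in sim_map for v in row]
    total = sum(flat)
    return [v / total for v in flat]

def grid_cost(h, w):
    """Ground cost: Euclidean distance between pixel coordinates (an assumption)."""
    coords = [(r, c) for r in range(h) for c in range(w)]
    return [[math.dist(p, q) for q in coords] for p in coords]

def sinkhorn_distance(a, b, cost, reg=0.2, max_iters=5000, tol=1e-9):
    """Entropy-regularized OT cost <P, C> between distributions a and b."""
    n = len(a)
    K = [[math.exp(-cost[i][j] / reg) for j in range(n)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(max_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
        # Stop once the row marginals of the plan match a closely enough.
        row_err = max(abs(u[i] * sum(K[i][j] * v[j] for j in range(n)) - a[i])
                      for i in range(n))
        if row_err < tol:
            break
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(n))

def ori_term(S_n, S_m):
    """Distance between map S_n and the complement of map S_m."""
    h, w = len(S_n), len(S_n[0])
    a = normalize(S_n)
    b = normalize([[1.0 - val for val in row] for row in S_m])
    return sinkhorn_distance(a, b, grid_cost(h, w))

# Toy 2x2 maps: one source in the top-left corner, a second source either
# in the opposite corner (separated) or in the same corner (overlapping).
source_a = [[0.9, 0.1], [0.1, 0.1]]
source_b_separate = [[0.1, 0.1], [0.1, 0.9]]
source_b_overlap = [[0.9, 0.1], [0.1, 0.1]]
d_separate = ori_term(source_a, source_b_separate)
d_overlap = ori_term(source_a, source_b_overlap)
```

In this toy case the overlapping pair yields the larger distance, so minimizing the term pushes the two maps apart. The pure-Python solver is only practical for tiny grids; a real training loop would use a batched, differentiable version.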

Loss & Training

The total loss is $\mathcal{L} = \mathcal{L}_{base} + \lambda_1 \mathcal{L}_{oca} + \lambda_2 \mathcal{L}_{ori}$. The MLLM generates all training captions offline in a single pass, and the captions are cached as text features so that the MLLM is never called during training.
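The offline caching step and the weighted sum can be sketched as below. Everything here is hypothetical scaffolding: `caption_fn` stands in for the MLLM call, `encode_fn` for the text encoder, and the $\lambda$ defaults are placeholders because the weights are not stated in this summary.

```python
def cache_reference_features(sample_ids, caption_fn, encode_fn):
    """Caption and encode each unique sample once; return id -> (fg_feats, bg_feat)."""
    cache = {}
    for sid in sample_ids:
        if sid in cache:
            continue  # already captioned, so the MLLM is not called again
        fg_caps, bg_cap = caption_fn(sid)
        cache[sid] = ([encode_fn(c) for c in fg_caps], encode_fn(bg_cap))
    return cache

def total_loss(l_base, l_oca, l_ori, lambda_1=1.0, lambda_2=1.0):
    """L = L_base + lambda_1 * L_oca + lambda_2 * L_ori (weights are placeholders)."""
    return l_base + lambda_1 * l_oca + lambda_2 * l_ori

# Stub MLLM and encoder so the sketch runs without any external service.
calls = {"n": 0}

def fake_mllm(sample_id):
    calls["n"] += 1
    return [f"a person is playing an instrument in {sample_id}"], "silent objects"

def fake_encoder(text):
    return [float(len(text))]

cache = cache_reference_features(["clip_a", "clip_b", "clip_a"], fake_mllm, fake_encoder)
loss = total_loss(1.0, 0.5, 0.25, lambda_1=0.5, lambda_2=2.0)
```

With the cache in place, each training step only performs a dictionary lookup for the reference features.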

Key Experimental Results

Main Results

MUSIC-Duet Multi-source:

| Method | Backbone | CAP (%) | CIoU | AUC |
|---|---|---|---|---|
| Mix-and-Localize (CVPR22) | RN18 | 47.5 | 26.5 | 21.5 |
| AVGN (CVPR23) | RN18 | 50.6 | 32.5 | 24.6 |
| NoPrior (CVPR24) | RN18 | 52.1 | 38.6 | 30.1 |
| OA-SSL (Ours) | RN18 | 61.4 | 45.9 | 36.1 |
| T-VSL (CVPR24) | AudioCLIP | 62.9 | 43.2 | 35.9 |
| OA-SSL (Ours) | AudioCLIP | Higher | Higher | Higher |

VGGSound-Duet: CIoU improves from 46.9 (NoPrior) to 55.2 (+8.3); AUC improves from 29.2 to 44.8 (+15.6).
Single Sound Source (MUSIC / VGGSound-Single): Comparable to or slightly better than the strongest baselines.
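As a quick arithmetic check, the gains over NoPrior can be recomputed from the numbers reported above. The dictionary below only copies those values (RN18 backbone for MUSIC-Duet); the metric labels are shorthand chosen for this sketch.

```python
# (NoPrior, OA-SSL) pairs copied from the reported results.
reported = {
    ("MUSIC-Duet", "CAP"): (52.1, 61.4),
    ("MUSIC-Duet", "CIoU"): (38.6, 45.9),
    ("MUSIC-Duet", "AUC"): (30.1, 36.1),
    ("VGGSound-Duet", "CIoU"): (46.9, 55.2),
    ("VGGSound-Duet", "AUC"): (29.2, 44.8),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {key: round(ours - base, 1) for key, (base, ours) in reported.items()}

for (dataset, metric), gain in gains.items():
    print(f"{dataset} {metric}: +{gain}")
```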

Ablation Study

| Configuration | CIoU (Duet) |
|---|---|
| Baseline (no MLLM, no OCA, no ORI) | 38.6 |
| + OCA only | ~44 |
| + OCA + ORI (full) | 45.9 |
| Full but MLLM given random captions | Dropped significantly |

Key Findings

The CIoU gain in multi-source scenarios (+7.3 on MUSIC-Duet and +8.3 on VGGSound-Duet over NoPrior) is much larger than that in single-source scenarios, suggesting that the method mainly addresses the instance-level differentiation challenge.
OCA provides the ability to "distinguish sounding/silent," and ORI provides the ability to "isolate different sound sources," complementing each other.
MLLM captions must possess "action/state" semantics (playing vs. holding) to be effective; using only class names yields almost no gain.

Highlights & Insights

MLLM as an offline supervision generator: Using expensive MLLMs only during training and removing them entirely at inference is a practical, distillation-like strategy that may generalize to other self-supervised tasks lacking fine-grained labels.
Modeling both sounding and silent classes simultaneously: Most sound source localization work focuses only on where the sound comes from. This paper also makes the model learn where similar but silent objects are, which amounts to a form of hard-negative engineering.
Wasserstein region isolation loss: Using an optimal transport (OT) distance as a spatial exclusion constraint across instances gives a smoother training signal than hard mask exclusion or NMS, and could transfer to tasks like multi-object segmentation or multi-referring expression.
Soft handling of false negatives: Thresholding on reference similarity within the batch avoids mislabeling samples of similar classes as negatives.

Limitations & Future Work

The method relies heavily on MLLM generation quality; when the objects in an image are rare, the MLLM may output incorrect foreground/background descriptions.
Needs to know the number of sound sources $K$ beforehand (from dataset annotations); for unknown numbers of sound sources in real-world environments, an iterative estimation module is still required.
The method has not been fully evaluated on in-the-wild videos (with background noise and reverberation).
The alignment between the text encoder and the MLLM significantly affects performance. Future implementations can explore joint training or alternative encoders with better audio-text alignment (e.g., CLAP).
Future directions: Using MLLM feedback as a reward signal for reinforcement learning, so that the localization network can adjust iteratively.

vs NoPrior (CVPR 24): NoPrior uses iterative recognition to handle multi-source scenarios without fine-grained semantics; this paper explicitly injects "action semantics" to achieve higher localization accuracy.
vs T-VSL (CVPR 24): T-VSL uses AudioCLIP text-audio alignment for supervision; this work further generates instance-level action descriptions, providing finer supervision.
vs Mix-and-Localize: No longer relies on manually designed mixup strategies; instead, supervision signals are automatically generated by the MLLM.
Insight: For any fine-grained visual/multimodal task involving a gap of "easy-to-distinguish classes + hard-to-distinguish actions/states," the paradigm of "offline MLLM supervision generation" can be applied.

Rating

Novelty: ⭐⭐⭐⭐ MLLM-generated supervision that covers both sounding and silent objects is a novel idea.
Experimental Thoroughness: ⭐⭐⭐⭐ Thorough evaluations on MUSIC + VGGSound single/dual-source with multiple backbones.
Writing Quality: ⭐⭐⭐⭐ Clear loss definitions and complete mathematical derivations.
Value: ⭐⭐⭐⭐ Zero inference cost and significant improvement, serving as a strong new baseline for multi-source AVSL.