
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Conference: CVPR 2026 | arXiv: 2604.06824 | Code: https://github.com/VisualAIKHU/GAR-SSL | Area: Multimodal VLM | Keywords: Sound Source Localization, Multimodal Large Language Model, Training-Free, Meta-Reasoning, Audio-Visual Correspondence

TL;DR

This paper proposes GAR-SSL, a training-free sound source localization (SSL) framework that reframes SSL as a three-stage metacognitive reasoning process—Generate, Analyze, and Refine—leveraging the intrinsic reasoning capabilities of MLLMs via prompt engineering alone. The method achieves performance comparable to or surpassing supervised approaches on both single-source and multi-source localization benchmarks.

Background & Motivation

  1. Background: Sound source localization (SSL) aims to identify the spatial origin of sounds in images by associating audio and visual information. Existing methods fall into two main categories: contrastive learning-based single-source methods and pseudo-label/graph-based multi-source methods, both fundamentally relying on feature matching.
  2. Limitations of Prior Work: All existing methods treat SSL as a simple feature matching problem, focusing solely on aligning audio and visual embeddings without verifying whether the matched regions genuinely correspond to sound sources or performing causal reasoning. This limits performance in complex acoustic scenarios involving silent objects, off-screen sounds, or multiple sound sources.
  3. Key Challenge: Human sound source localization involves a multi-step reasoning process—perceiving audio-visual signal characteristics, systematically analyzing candidate objects, and refining conclusions. This interpretable verification process far exceeds simple matching, yet existing methods entirely overlook it.
  4. Goal: How can MLLM reasoning capabilities be exploited for interpretable SSL without any training? Specifically: (a) how to generate candidate sound sources; (b) how to verify the plausibility of candidates; and (c) how to refine localization results.
  5. Key Insight: Inspired by human metacognitive processes, the authors observe that MLLMs already possess strong cross-modal understanding, structured reasoning, and instruction-following capabilities, enabling them to serve directly as reasoning engines rather than auxiliary encoders.
  6. Core Idea: SSL is reframed as a coarse-to-fine three-stage cognitive reasoning procedure (Generate → Analyze → Refine), driven entirely by prompt engineering applied to an MLLM, requiring no training whatsoever.

Method

Overall Architecture

Given an image-audio pair \((I, A)\), GAR-SSL produces sound source localization results through three stages: (1) the Generation stage produces initial bounding boxes and audio classification labels; (2) the Analysis stage evaluates audio-visual consistency through role annotation and anchor voting; (3) the Refinement stage performs geometric correction based on an adaptive gating decision. All stages are implemented via prompt engineering with structured JSON outputs, requiring no additional training.
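
To make the overall flow concrete, below is a minimal Python sketch of the three stages under assumed names: `query_mllm` is a hypothetical wrapper around a prompted Qwen2.5-Omni call that returns parsed JSON, and the helpers `aggregate_trials` and `refine_bbox` (consensus and geometric correction) are sketched later under Key Designs. Prompt wording and JSON field names are illustrative assumptions, not the authors' implementation.

```python
# High-level sketch of the three-stage GAR-SSL flow (illustrative only).
# `query_mllm` stands in for a prompted call to Qwen2.5-Omni-7B that returns
# structured JSON; prompt texts and JSON field names are assumptions.

def query_mllm(prompt: str, image=None, audio=None) -> dict:
    """Placeholder for one prompted MLLM call returning parsed JSON output."""
    raise NotImplementedError("wire this to an actual MLLM backend")

def gar_ssl(image, audio, n_trials: int = 5,
            tau_av: float = 0.5, tau_aud: float = 0.75) -> dict:
    # Stage 1 - Generate: initial box + description, plus an independent audio label.
    loc = query_mllm("Localize the sound source; return a bbox and a short "
                     "description as JSON.", image, audio)
    aud = query_mllm("Classify the audio; return an open-vocabulary label and a "
                     "confidence in [0,1] as JSON.", audio=audio)

    # Stage 2 - Analyze: roles, anchors, consistency score, and keep-flag,
    # aggregated over several stochastic decodings (consensus sketched below).
    trials = [query_mllm(
        f"Given audio label '{aud['label']}' and bbox {loc['bbox']}, list the "
        "sound-producing roles, semantic anchors with confidences, an "
        "audio-visual consistency score in [0,1], and a keep flag, as JSON.",
        image, audio) for _ in range(n_trials)]
    analysis = aggregate_trials(trials)   # consensus helper, sketched under Key Designs

    # Stage 3 - Refine: adaptive gating keeps the initial box only if all checks pass.
    keep = (analysis["keep"]
            and analysis["consistency"] >= tau_av
            and aud["conf"] >= tau_aud)
    bbox = loc["bbox"] if keep else refine_bbox(loc["bbox"], analysis["anchors"])
    return {"bbox": bbox, "audio_label": aud["label"], "analysis": analysis}
```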

Key Designs

  1. Generation:

    • Function: Produce initial spatial hypotheses and semantic constraints for sound sources.
    • Mechanism: Comprises two independent subtasks. The audio-visual localization subtask generates an initial bounding box and natural language description via cross-modal grounding \(f_{\text{loc}}(I,A) = (b^{\text{init}}, d)\). The audio classification subtask independently analyzes the audio signal \(f_{\text{aud}}(A) = (c_{\text{aud}}, s_{\text{aud}})\) to predict an open-vocabulary label and confidence score. Both are generated independently, and their consistency is assessed in the Analysis stage.
    • Design Motivation: The key mechanism is a "broad hypothesis space"—rather than directly matching a single region as in conventional methods, all objects potentially capable of producing the sound are considered (e.g., for a knocking sound: drums, tables, clapping hands), preventing premature exclusion of potential sources.
  2. Analysis — Open-Set Role Annotation:

    • Function: Identify the semantic structure of sound sources by discovering components functionally relevant to sound production.
    • Mechanism: Given the audio label \(c_{\text{aud}}\), a role discovery function \(\mathcal{T}_{\text{role}} = f_{\text{role}}(I, A, c_{\text{aud}})\) contextually identifies roles/components directly involved in sound production (e.g., "drumstick," "striking hand"), up to four per instance. A visibility constraint \(\text{vis}(t|I) = 1\) ensures each role is observable in the current frame.
    • Design Motivation: Role annotation provides structural constraints for the refinement process, guiding it toward semantically meaningful sound-producing components.
  3. Analysis — Anchor Voting and Audio-Visual Consistency:

    • Function: Quantify the alignment between initial localization and audio-visual evidence.
    • Mechanism: An anchor voting function \(\mathcal{A}_{\text{anchor}} = f_{\text{anchor}}(I,A,c_{\text{aud}},b^{\text{init}})\) discovers semantic anchors (e.g., "drumstick striking drum head") along with associated confidence scores. An audio-visual consistency score \(\mathcal{S}_{\text{av}} = f_{\text{con}}(\cdot) \in [0,1]\) is then computed to holistically assess the alignment between the predicted box and audio-visual semantic evidence. Multiple-trial consensus (\(n=5\) trials) reduces variance from stochastic decoding: consistency scores are averaged, role annotations selected by frequency, anchors aggregated by confidence, and keep-flags determined by majority vote; a minimal sketch of this consensus step is given after this list.
    • Design Motivation: Unlike binary judgments, this stage identifies which aspects require adjustment, why, and how, providing targeted guidance for the Refinement stage.
  4. Refinement and Adaptive Gating:

    • Function: Correct localization errors based on analysis results while preventing unnecessary adjustments.
    • Mechanism: The gating decision \(G=1\) (retain) holds if and only if three conditions are simultaneously satisfied: keep-flag \(k=1\), consistency score \(\mathcal{S}_{\text{av}} \geq \tau_{\text{av}}\), and audio confidence \(s_{\text{aud}} \geq \tau_{\text{aud}}\). When \(G=0\), refinement is triggered using four geometric operations: Delta (anchor-weighted centroid shift), Expand/Shrink (proportional scaling based on external anchor ratios), and Recenter (center displacement preserving box size). A toy sketch of this anchor-driven refinement is also given after this list.
    • Design Motivation: Adaptive gating prevents unnecessary adjustments when initial predictions are already sufficiently reliable, avoiding performance degradation and improving efficiency and stability.
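
To make the multiple-trial consensus of the Analysis stage concrete, here is a minimal Python sketch under assumed JSON field names (`consistency`, `roles`, `anchors`, `keep`). It follows the aggregation rules described above (averaging, frequency selection, confidence aggregation, majority vote), not the authors' exact implementation.

```python
from collections import Counter
from statistics import mean

def aggregate_trials(trials: list[dict]) -> dict:
    """Consensus over n stochastic Analysis-stage outputs (illustrative fields)."""
    # Consistency scores: simple average over trials.
    consistency = mean(t["consistency"] for t in trials)

    # Role annotations: keep the most frequently proposed roles (up to four).
    role_counts = Counter(role for t in trials for role in t["roles"])
    roles = [r for r, _ in role_counts.most_common(4)]

    # Anchors: merge by name, aggregating their confidences (here: averaging).
    anchor_conf: dict[str, list[float]] = {}
    for t in trials:
        for a in t["anchors"]:                      # e.g. {"name": "...", "conf": 0.8}
            anchor_conf.setdefault(a["name"], []).append(a["conf"])
    anchors = [{"name": n, "conf": mean(cs)} for n, cs in anchor_conf.items()]

    # Keep-flag: majority vote across trials.
    keep = sum(t["keep"] for t in trials) > len(trials) / 2

    return {"consistency": consistency, "roles": roles,
            "anchors": anchors, "keep": keep}
```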
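
For the Refinement stage, the following is a toy sketch of the anchor-driven geometric correction applied when the gate rejects the initial box. The anchor representation (point location plus confidence) and the specific update coefficients are assumptions made here for illustration; only the operation types named in the paper (Delta, Expand/Shrink, Recenter) are mirrored, not its exact formulas.

```python
def refine_bbox(bbox: list[float], anchors: list[dict]) -> list[float]:
    """Toy geometric refinement driven by semantic anchors (illustrative only).

    `bbox` is [x1, y1, x2, y2]; each anchor is assumed to carry a point and a
    confidence, e.g. {"x": ..., "y": ..., "conf": ...}.
    """
    if not anchors:                       # nothing to refine against
        return bbox

    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2

    # Delta / Recenter: move the box centre halfway toward the
    # confidence-weighted anchor centroid, keeping the box size fixed.
    total = sum(a["conf"] for a in anchors) or 1.0
    ax = sum(a["x"] * a["conf"] for a in anchors) / total
    ay = sum(a["y"] * a["conf"] for a in anchors) / total
    cx, cy = (cx + ax) / 2, (cy + ay) / 2

    # Expand / Shrink: rescale in proportion to how many anchors fall outside
    # the current box; shrink mildly when every anchor is already inside it.
    outside = sum(1 for a in anchors
                  if not (x1 <= a["x"] <= x2 and y1 <= a["y"] <= y2))
    scale = 1.0 + 0.5 * outside / len(anchors) if outside else 0.9
    w, h = w * scale, h * scale

    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```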

Loss & Training

This method requires no training. All operations are implemented via prompt engineering applied to Qwen2.5-Omni-7B. The gating mechanism uses fixed thresholds of \(\tau_{\text{aud}} = 0.75\) for audio confidence and \(\tau_{\text{av}} = 0.5\) for audio-visual consistency.

Key Experimental Results

Main Results

Multi-source SSL (VGGSound-Duet / MUSIC-Duet):

| Method | VGGSound-Duet CIoU@0.3 | MUSIC-Duet CIoU@0.3 | MUSIC-Duet AUC |
| --- | --- | --- | --- |
| OA-SSL (CVPR'25, trained) | 55.2% | 45.9% | 36.1% |
| Qwen2.5-Omni (direct MLLM) | 42.6% | 50.6% | 40.8% |
| GAR-SSL (N=5) | 77.6% | 82.7% | 53.2% |

Single-source SSL (VGGSound-Single / MUSIC-Solo):

| Method | VGGSound-Single AP | VGGSound IoU@0.5 | MUSIC-Solo IoU@0.5 |
| --- | --- | --- | --- |
| OA-SSL (CVPR'25) | 51.7% | 47.3% | 71.1% |
| GAR-SSL (N=5) | 60.5% | 60.2% | 98.5% |

Ablation Study

| Configuration | VGGSound-Duet CIoU@0.3 | AUC | Notes |
| --- | --- | --- | --- |
| Stage 1 only | 42.6% | 28.3% | Generation only |
| Stage 1+2+3 (N=3) | 59.5% | 38.2% | Full pipeline |
| Stage 1+2+3 (N=5) | 77.6% | 45.8% | More analysis iterations |

| MLLM Backbone | CAP | CIoU@0.3 | AUC |
| --- | --- | --- | --- |
| Qwen2.5-Omni-3B | 39.9% | 49.8% | 33.0% |
| Qwen2.5-Omni-7B | 43.5% | 59.5% | 38.2% |

Key Findings

  • The Analysis and Refinement stages contribute +16.9 percentage points to CIoU@0.3 in multi-source scenarios (42.6% → 59.5% at N=3), demonstrating the critical role of iterative analysis and refinement in improving candidate box consistency.
  • Increasing the number of analysis iterations \(N\) from 1 to 5 consistently improves performance; at \(N=5\), CIoU on MUSIC-Duet improves from 80.7% to 82.7%.
  • The larger MLLM (7B vs. 3B) consistently improves all metrics, indicating that MLLM reasoning capacity is the key performance bottleneck.
  • On MUSIC-Duet measured by CIoU@0.3, GAR-SSL surpasses the best trained method OA-SSL by 36.8 percentage points.

Highlights & Insights

  • Reframing SSL as a Cognitive Reasoning Process: Rather than treating sound source localization as feature matching, the method emulates the human coarse-to-fine reasoning process. This paradigm shift enables full exploitation of MLLM reasoning capabilities.
  • Open-Set Role Annotation and Anchor Voting: Without relying on predefined categories, the MLLM freely discovers components and evidence related to sound production, providing an interpretable reasoning path.
  • Adaptive Gating Mechanism: A simple yet effective design—refinement is only triggered when the initial prediction is insufficiently reliable, avoiding performance degradation from over-adjustment. This principle is transferable to any multi-stage reasoning system.
  • The framework demonstrates the substantial potential of MLLMs as zero-shot reasoning engines for complex multimodal perception tasks, surpassing numerous specialized trained methods without any training.

Limitations & Future Work

  • Inference Efficiency: Each sample requires approximately 4 seconds of inference time, and multiple iterations in the Analysis stage further increase overhead.
  • Strong Dependency on Underlying MLLM Capability: There is a significant performance gap between 3B and 7B models, but larger models also incur higher inference costs.
  • Lack of Temporal Reasoning: The current method processes only single frames, failing to leverage temporal information from video, which limits performance in dynamic scenes.
  • Evaluation Limited to VGGSound and MUSIC: Generalization to more real-world scenarios (e.g., noisy environments, overlapping multiple sound sources) has not been tested.

Comparison & Discussion

  • vs. OA-SSL (CVPR'25): OA-SSL uses MLLMs as auxiliary encoders to train visual models, whereas this work requires no training and directly employs MLLMs as reasoning engines. GAR-SSL substantially outperforms OA-SSL on multi-source scenarios, indicating that prior methods have severely underutilized the reasoning capabilities of MLLMs.
  • vs. Direct MLLM Use (Qwen2.5-Omni): Directly applying an MLLM to SSL yields moderate results (42.6% CIoU@0.3 on VGGSound-Duet), whereas the structured reasoning pipeline proposed here raises this to 77.6%, demonstrating that prompt design and reasoning structure are critical.
  • The Generate-Analyze-Refine paradigm constitutes a general multi-stage reasoning pattern transferable to other multimodal tasks requiring spatial localization.

Rating

  • Novelty: ⭐⭐⭐⭐ The reframing of SSL as a metacognitive reasoning process is novel, though the core contribution remains prompt engineering.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple benchmarks with comprehensive ablations, though evaluation in broader real-world scenarios is lacking.
  • Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with well-formalized mathematics, though excessive formalization renders some content redundant.
  • Value: ⭐⭐⭐⭐ Demonstrates the potential of MLLMs for zero-shot multimodal localization and offers meaningful reflections on the training paradigm.