What's Making That Sound Right Now? Video-centric Audio-Visual Localization¶
Metadata¶
- Conference: ICCV 2025
- arXiv: 2507.04667
- Code: Project Page
- Area: Human Understanding
- Keywords: Audio-visual localization, temporal modeling, video-level annotation, multi-scenario evaluation, self-supervised learning
TL;DR¶
This paper proposes AVATAR, a video-level audio-visual localization benchmark, and TAVLO, a temporally-aware model that addresses the neglect of temporal dynamics in conventional AVL methods through high-resolution temporal modeling.
Background & Motivation¶
Audio-visual localization (AVL) aims to identify sound-producing objects within visual scenes. Existing work suffers from two critical limitations:
Limitations of image-level association: Existing benchmarks (e.g., Flickr-SoundNet, VGGSS) adopt image-level annotation — annotators watch a complete video but label the sounding object in only a single frame, treating that frame as representative of the entire video. Consequently, existing methods process only single-frame inputs and entirely disregard temporal dynamics. In realistic scenarios, tracking moving sound sources and handling dynamic changes requires spatiotemporal modeling.
Oversimplified assumptions: Existing benchmarks assume that sounding objects are always visible and typically involve only a single sound source. Real-world scenarios, however, frequently include multiple simultaneous sound sources and off-screen sounding objects. Some studies partially address this through mismatched audio-image negative pairs or synthesized mixed audio, but coverage remains insufficient.
Method¶
Overall Architecture¶
The paper presents two core contributions: (1) the AVATAR benchmark dataset; and (2) TAVLO, a temporally-aware AVL model.
AVATAR Benchmark¶
AVATAR is constructed from VGGSound via a semi-automatic annotation pipeline, comprising 5,000 videos, 24,266 frames, and 80 categories. A key innovation is the introduction of four evaluation scenarios:
- Single-sound: Only one instance in the frame produces sound; evaluates one-to-one audio-visual correspondence.
- Mixed-sound: Multiple simultaneous sound sources; requires distinguishing and associating sounds with the correct visual sources.
- Multi-entity: Multiple visually similar objects where only one produces sound; requires spatiotemporal reasoning.
- Off-screen: The sounding object is outside the frame; evaluates the model's ability to avoid false positives.
The semi-automatic annotation pipeline consists of three stages:
1. Candidate video selection: approximately 70k raw videos from VGGSound are filtered by resolution, frame rate, and duration, yielding 39k videos.
2. Automatic clip and frame sampling: RMS energy is used to detect audio-active regions, and the sharpest frames within them are selected via Laplacian filtering.
3. Model-driven annotation: YOLOv8 detection + CAV-MAE audio classification + SAM instance segmentation, followed by human verification.
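As an illustration of stage 2, the sketch below shows RMS-based audio-activity detection and Laplacian sharpness scoring; the window size, threshold, and function names are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of stage 2: RMS-based audio-activity detection and
# Laplacian sharpness scoring. Window size, threshold, and function names
# are assumptions, not the paper's actual pipeline code.
import numpy as np
import cv2


def audio_active_windows(waveform: np.ndarray, sr: int,
                         win_sec: float = 0.5, thresh_db: float = -30.0) -> np.ndarray:
    """Flag fixed-length windows whose RMS energy exceeds a dB threshold."""
    hop = int(win_sec * sr)
    n = len(waveform) // hop
    flags = np.empty(n, dtype=bool)
    for i in range(n):
        seg = waveform[i * hop:(i + 1) * hop]
        rms = np.sqrt(np.mean(seg ** 2) + 1e-12)
        flags[i] = 20 * np.log10(rms + 1e-12) > thresh_db
    return flags


def sharpest_frame(frames: list) -> int:
    """Return the index of the frame with the largest Laplacian variance (a standard sharpness proxy)."""
    scores = [cv2.Laplacian(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), cv2.CV_64F).var()
              for f in frames]
    return int(np.argmax(scores))
```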
TAVLO Model¶
TAVLO explicitly incorporates temporal information for spatiotemporal AVL. The core architecture includes:
Modality-specific feature encoding:
A visual encoder \(f_v\) (ResNet-18) extracts frame-level features \(\mathbf{V} = f_v(V) \in \mathbb{R}^{T \times H \times W \times D_f}\). An audio encoder \(f_a\) employs rectangular 2D CNN kernels with sizes \(K_w = \lfloor W_a / T \rfloor, K_h = H_a\), ensuring each audio segment is aligned with the corresponding visual frame, producing \(\mathbf{A} = f_a(A) \in \mathbb{R}^{T \times D_f}\).
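A minimal sketch of this frame-aligned audio encoding idea is shown below: a convolution whose kernel spans the full frequency axis (\(K_h = H_a\)) and one temporal slice (\(K_w = \lfloor W_a / T \rfloor\)), applied with matching stride so the spectrogram collapses into exactly \(T\) vectors. The single-layer encoder and all sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of frame-aligned audio encoding with a rectangular kernel.
# The single conv layer and all dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

T, H_a, W_a, D_f = 8, 128, 256, 512                 # video frames, mel bins, spectrogram steps, feature dim
K_w, K_h = W_a // T, H_a

f_a = nn.Conv2d(1, D_f, kernel_size=(K_h, K_w), stride=(K_h, K_w))

spec = torch.randn(1, 1, H_a, W_a)                  # (B, 1, H_a, W_a) log-mel spectrogram
A = f_a(spec).squeeze(2).transpose(1, 2)            # (B, T, D_f): one audio feature per video frame
print(A.shape)                                      # torch.Size([1, 8, 512])
```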
Positional encoding: A spatial positional encoding \(\text{Pos}_s \in \mathbb{R}^{T \times H \times W \times D_s}\) and a temporal positional encoding \(\text{Pos}_t \in \mathbb{R}^{T \times D_t}\) inject location and frame-order information. The spatial encoding is added element-wise to the visual features; the temporal encoding is concatenated to the features of both modalities, yielding the final dimension \(D = D_f + D_t\).
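A small sketch of how the two encodings could be combined under these shapes follows; learnable zero-initialized embeddings are an assumption (the paper's exact encoding form is not reproduced here), and \(D_s\) is taken equal to \(D_f\) so the element-wise addition is well-defined.

```python
# Sketch of combining the positional encodings per the shapes above. Learnable
# zero-initialized embeddings are an assumption; D_s is set to D_f so that the
# element-wise addition to visual features is well-defined.
import torch
import torch.nn as nn

T, H, W, D_f, D_t = 8, 7, 7, 512, 64

pos_s = nn.Parameter(torch.zeros(T, H, W, D_f))     # spatial encoding (D_s = D_f)
pos_t = nn.Parameter(torch.zeros(T, D_t))           # temporal encoding

V = torch.randn(T, H, W, D_f)                       # frame-level visual features from f_v
A = torch.randn(T, D_f)                             # frame-aligned audio features from f_a

V_tilde = torch.cat([V + pos_s,
                     pos_t[:, None, None, :].expand(T, H, W, D_t)], dim=-1)   # (T, H, W, D_f + D_t)
A_tilde = torch.cat([A, pos_t], dim=-1)                                       # (T, D_f + D_t)
```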
AST attention module: Audio-visual features are concatenated as \(\mathbf{Z}^0 = [\tilde{\mathbf{A}}; \tilde{\mathbf{V}}] \in \mathbb{R}^{T \times (1+H \cdot W) \times D}\). A factorized attention strategy is adopted to avoid the quadratic complexity of applying self-attention directly over flattened video features (a minimal sketch follows the list below):
- Spatial attention: Multi-head self-attention over the \(1 + H \cdot W\) dimension, capturing intra-frame cross-modal audio-visual interactions.
- Temporal attention: Multi-head self-attention over the \(T\) dimension, modeling cross-frame temporal dependencies.
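The sketch below illustrates the factorized spatial-then-temporal attention pattern using standard multi-head self-attention layers; the head count, dimensions, and layer details are illustrative assumptions, and the actual AST module may differ in detail.

```python
# Minimal sketch of factorized spatial-then-temporal attention.
# Layer choices, head counts, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

T, HW, D, heads = 8, 49, 576, 8                 # D = D_f + D_t, HW = H * W

spatial_attn = nn.MultiheadAttention(D, heads, batch_first=True)
temporal_attn = nn.MultiheadAttention(D, heads, batch_first=True)

Z0 = torch.randn(T, 1 + HW, D)                  # [audio token; visual tokens] per frame

# Spatial attention: the 1 + HW tokens of each frame attend to each other
# (the T frames act as the batch dimension).
Zs, _ = spatial_attn(Z0, Z0, Z0)

# Temporal attention: each token position attends across the T frames
# (the 1 + HW positions act as the batch dimension).
Zt = Zs.transpose(0, 1)                         # (1 + HW, T, D)
Zt, _ = temporal_attn(Zt, Zt, Zt)
Z1 = Zt.transpose(0, 1)                         # back to (T, 1 + HW, D)
```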
Loss & Training¶
A cross-modal multiple-instance contrastive learning loss adapted from EZ-VSL is employed with two modifications: (1) a temporal component is introduced to define visual bags at the frame level; (2) mean similarity instead of maximum similarity is used for negative sample bags, reducing the dominance of noisy instances.
The final loss is bidirectional: \(\mathcal{L} = \mathcal{L}_{a \rightarrow v} + \mathcal{L}_{v \rightarrow a}\).
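A hedged sketch of the audio-to-visual half of such a loss under the two modifications is given below (max similarity inside the matching frame-level visual bag as the positive, mean similarity over each other clip's bag as negatives); the temperature, tensor shapes, and batching are assumptions, not the paper's exact formulation, and the full objective adds the symmetric \(v \rightarrow a\) term.

```python
# Hedged sketch of the audio-to-visual half of a multiple-instance contrastive
# loss with the two modifications described above: max similarity inside the
# matching frame-level visual bag (positive), mean similarity over each other
# clip's bag (negatives). Temperature and tensor shapes are assumptions.
import torch
import torch.nn.functional as F


def mil_contrastive_a2v(A: torch.Tensor, V: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """A: (B, T, D) audio features; V: (B, T, P, D) visual features, P = H * W locations."""
    A = F.normalize(A, dim=-1)
    V = F.normalize(V, dim=-1)
    B, T, P, _ = V.shape

    # Positive: max similarity within the matching clip's frame-level bag -> (B, T)
    pos = torch.einsum('btd,btpd->btp', A, V).max(dim=-1).values / tau
    # Negatives: mean similarity over every clip's frame-level bag -> (B, B, T)
    neg = torch.einsum('btd,ktpd->bkt', A, V) / (P * tau)

    # Exclude the matching clip from the negative set.
    mask = ~torch.eye(B, dtype=torch.bool, device=A.device)
    neg_sum = (neg.exp() * mask[:, :, None]).sum(dim=1)          # (B, T)

    return -(pos - torch.log(pos.exp() + neg_sum)).mean()


# The full bidirectional objective would add a symmetric (hypothetical) visual-to-audio term:
# loss = mil_contrastive_a2v(A, V) + mil_contrastive_v2a(V, A)
```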
Key Experimental Results¶
Main Results¶
| Method | Single-sound CIoU(%) | Single-sound AUC(%) | Mixed-sound CIoU(%) | Multi-entity CIoU(%) | Off-screen TN(%) |
|---|---|---|---|---|---|
| SLAVC(144k) | 9.07 | 10.60 | 6.31 | 6.41 | 96.46 |
| EZ-VSL(10k) | 9.66 | 11.07 | 8.16 | 6.87 | 96.91 |
| EZ-VSL(144k) | 10.92 | 12.22 | 6.97 | 5.80 | 96.47 |
| SSL-TIE(144k) | 13.10 | 14.23 | 5.19 | 5.50 | 90.82 |
| TAVLO(10k) | 13.42 | 14.08 | 14.13 | 12.08 | 91.18 |
Robustness under Cross-event Scenarios¶
| Method | Total CIoU(%) | Cross-event CIoU(%) | Δ |
|---|---|---|---|
| EZ-VSL(full) | 10.50 | 5.26 | -5.24 |
| SSL-TIE(144k) | 10.39 | 5.03 | -5.36 |
| TAVLO(10k) | 13.37 | 13.04 | -0.33 |
Key Findings¶
- TAVLO trained on only 10k samples surpasses baselines trained on 144k samples.
- The performance advantage is most pronounced in the Mixed-sound and Multi-entity scenarios, with CIoU improvements of roughly 5–6 percentage points over the strongest baselines.
- In cross-event scenarios, baseline CIoU drops by more than 5 percentage points (Δ of −5.24 and −5.36 in the table above), whereas TAVLO drops by only 0.33, demonstrating that temporal modeling is critical for tracking dynamic sound sources.
- Qualitative analysis shows that TAVLO correctly distinguishes the actively sounding drum in multi-drum scenes and handles off-screen-to-on-screen speaker transitions.
Highlights & Insights¶
- Precise problem formulation: The first systematic extension of AVL to the video level, defining four comprehensive evaluation scenarios.
- Elegant factorized attention design: The AST module factorizes attention along the spatial and temporal axes, avoiding the quadratic cost of joint self-attention over all \(T \cdot H \cdot W\) tokens.
- High data efficiency: Surpassing 144k-trained baselines with only 10k training samples highlights the inductive bias advantage conferred by temporal modeling.
- Practical semi-automatic annotation pipeline: Combining model-assisted labeling with human verification balances annotation quality and efficiency.
Limitations & Future Work¶
- The approach assumes at least partial audio-visual alignment; independent localization of off-screen sounds remains an open problem.
- The benchmark does not provide scenario-specific optimization strategies for each evaluation setting.
- Threshold selection in off-screen evaluation has a considerable impact on results.
Related Work & Insights¶
- AVL methods: Self-supervised approaches including EZ-VSL, SSL-TIE, and SLAVC; semi-supervised method DMT.
- AVL benchmarks: Flickr-SoundNet, VGGSS, AVSBench, all lacking video-level temporal annotations.
- Video understanding: Spatiotemporal attention factorization strategies (TimeSformer, ViViT).
Rating¶
- Novelty: ⭐⭐⭐⭐ — The video-level AVL benchmark and temporally-aware model represent pioneering contributions to the field.
- Technical Depth: ⭐⭐⭐⭐ — Temporal encoding and AST attention design are well-motivated and technically sound.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Four-scenario comprehensive evaluation and cross-event analysis are convincing.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with well-articulated motivation.