Audio-Visual Instance Segmentation
Background & Motivation
Humans naturally integrate auditory and visual information in daily life to locate and identify sound sources in their environment. For instance, when hearing a dog barking, we automatically focus our attention on the dog in the visual field and precisely perceive its contour. This capability of joint audio-visual perception holds significant application value in fields such as robot navigation, video editing, and autonomous driving.
The following table compares the main existing audio-visual learning tasks with AVIS:
| Task | Output | Granularity | Temporal Modeling |
|---|---|---|---|
| Sound Source Localization | Heatmap | Region-level | Single-frame |
| Audio-Visual Segmentation | Semantic mask | Pixel-level | Single-frame/Short-sequence |
| Audio-Visual Source Separation | Separated audio | Source-level | Multi-frame |
| AVIS (Ours) | Instance mask + ID | Instance-level | Full-video |
Limitations of existing methods:
- Sound Source Localization only provides coarse, region-level localization and fails to generate precise pixel-level segmentations.
- Audio-Visual Segmentation performs semantic-level segmentation but does not distinguish among different individuals of the same category.
- No existing method simultaneously performs instance-level segmentation and cross-frame tracking at the video level.
This work introduces a new task, Audio-Visual Instance Segmentation (AVIS): given a video and its corresponding audio, the goal is to perform instance-level segmentation of all sounding objects and track them across frames. In addition, the AVISeg benchmark dataset and the AVISM baseline method are established.
Method
AVISeg Benchmark Dataset
AVISeg is the first large-scale annotated dataset for audio-visual instance segmentation:
| Attribute | Value |
|---|---|
| Number of videos | 926 |
| Annotated masks | 90,000+ |
| Object categories | 26 |
| Average video duration | 8.3 seconds |
| Resolution | 720p/1080p |
| Annotation frequency | Annotated every 5 frames |
| Data sources | YouTube, ActivityNet, VGGSound |
Data Collection and Annotation Pipeline
- Video Filtering: Filter videos containing distinct sounding sources from large-scale video databases.
- Audio Event Annotation: Annotate the temporal boundaries and categories of audio events in each video.
- Instance Segmentation Annotation: Delineate instance-level segmentation masks for each sounding object.
- Cross-Frame Association: Assign consistent instance IDs to masks of the same object across different frames.
- Quality Control: Implement multiple rounds of verification to ensure annotation quality.
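The cross-frame association step can be pictured as a small data structure. The sketch below is only an illustration: the `InstanceTrack` schema, the string mask payloads, and the assumption that annotation starts at frame 0 are all hypothetical and not taken from the AVISeg release. It lists which frames receive masks at a stride of 5 and reports where a track has no mask.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceTrack:
    """One sounding object followed through a video (hypothetical schema)."""
    instance_id: int          # stays the same in every frame of the video
    category: str
    # frame index -> mask payload (a placeholder string here, e.g. an RLE blob)
    masks: Dict[int, str] = field(default_factory=dict)


def annotated_frames(num_frames: int, stride: int = 5) -> List[int]:
    """Frame indices that get a mask when annotating every `stride` frames.

    Starting at frame 0 is an assumption made for this sketch.
    """
    return list(range(0, num_frames, stride))


def missing_frames(track: InstanceTrack, num_frames: int, stride: int = 5) -> List[int]:
    """Annotated frames in which this track has no mask.

    A gap can be legitimate (the object is silent or out of view) or can
    point at an annotation that a verification round should revisit.
    """
    return [f for f in annotated_frames(num_frames, stride) if f not in track.masks]


dog = InstanceTrack(instance_id=1, category="dog", masks={0: "m0", 5: "m5", 15: "m15"})
gaps = missing_frames(dog, num_frames=20)
```

Keeping one record per instance, rather than one per mask, makes the consistent-ID requirement structural: a mask cannot exist without belonging to exactly one track.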
AVISM Baseline Method
AVISM (Audio-Visual Instance Segmentation and tracking Model) consists of two core modules:
Frame-Level Sound Localizer (FLSL)
FLSL is responsible for locating sounding objects in each frame:
- Audio Feature Extraction: Extract audio embeddings \(\mathbf{a}_t \in \mathbb{R}^{d}\) using a pre-trained AudioSet model.
- Visual Feature Extraction: Extract visual feature maps \(\mathbf{V}_t \in \mathbb{R}^{d \times H \times W}\) using ResNet-50 / Swin-T.
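- Cross-Modal Attention: Fuse the audio embedding \(\mathbf{a}_t\) with the visual feature map \(\mathbf{V}_t\) via cross-modal attention, yielding attention-enhanced features.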
- Instance Prediction: Predict instance masks and categories using a Mask2Former-style decoder based on attention-enhanced features.
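The text does not give the exact attention formula, so the following is a minimal sketch of one common form, assumed here for illustration: a scaled dot product between \(\mathbf{a}_t\) and each spatial feature of \(\mathbf{V}_t\), a softmax over the \(H \times W\) positions, and a residual re-weighting of the visual features. The function names and the residual form are assumptions, not the AVISM implementation. The feature map is passed flattened, as a list of \(N = H \cdot W\) vectors of length \(d\).

```python
import math
from typing import List


def audio_guided_attention(audio: List[float], visual: List[List[float]]) -> List[float]:
    """Attention weights over N spatial positions (assumed form, see lead-in).

    audio:  audio embedding a_t, length d
    visual: flattened feature map V_t, N vectors of length d
    """
    d = len(audio)
    # Scaled dot product between the audio embedding and each spatial feature
    scores = [sum(a * v for a, v in zip(audio, feat)) / math.sqrt(d) for feat in visual]
    # Numerically stable softmax over spatial positions
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance(visual: List[List[float]], weights: List[float]) -> List[List[float]]:
    """Residual re-weighting: each position's feature is scaled by (1 + weight)."""
    return [[v * (1.0 + w) for v in feat] for feat, w in zip(visual, weights)]


# Toy input: d = 2, three spatial positions; position 0 aligns with the audio.
audio = [1.0, 0.0]
visual = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
weights = audio_guided_attention(audio, visual)
enhanced = enhance(visual, weights)
```

In this toy input the first position receives the largest weight because its feature points in the same direction as the audio embedding, which is the behavior a sound localizer needs before mask decoding.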
Video-Level Sounding Tracker (VLST)
VLST is responsible for cross-frame instance association and tracking:
- Instance Embedding: Generate an appearance embedding \(\mathbf{e}_i^t\) for each detected instance.
- Audio Consistency Constraint: Ensure that the correlation between the same instance and its corresponding audio remains consistent across different frames.
- Matching Strategy: Perform bipartite graph matching between instances in adjacent frames, using a score that combines appearance similarity and audio correlation.
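The matching score itself is not given in the text, so the sketch below assumes one plausible form: a weighted sum of the cosine similarity between appearance embeddings and the agreement between per-instance audio-correlation values in [0, 1]. The weight `lam`, the scalar audio representation, and all function names are hypothetical. The assignment is found by exhaustive search, which is exact but only practical for the few instances present in a short clip.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def score_matrix(prev_emb, curr_emb, prev_audio, curr_audio, lam: float = 0.5):
    """score[i][j] mixes appearance similarity with audio-correlation agreement.

    prev_audio / curr_audio hold one audio-correlation value in [0, 1] per
    instance (an assumed representation, not taken from the source).
    """
    return [
        [
            (1.0 - lam) * cosine(e_prev, e_curr) + lam * (1.0 - abs(c_prev - c_curr))
            for e_curr, c_curr in zip(curr_emb, curr_audio)
        ]
        for e_prev, c_prev in zip(prev_emb, prev_audio)
    ]


def match_instances(score: List[List[float]]) -> List[Tuple[int, int]]:
    """Exhaustive one-to-one assignment maximizing the summed score.

    Returns (previous_index, current_index) pairs. When the two frames hold
    different numbers of instances, the surplus instances stay unmatched.
    """
    n_prev = len(score)
    n_curr = len(score[0]) if n_prev else 0
    if n_prev == 0 or n_curr == 0:
        return []
    if n_prev <= n_curr:
        best = max(
            permutations(range(n_curr), n_prev),
            key=lambda p: sum(score[i][j] for i, j in enumerate(p)),
        )
        return list(enumerate(best))
    best = max(
        permutations(range(n_prev), n_curr),
        key=lambda p: sum(score[i][j] for j, i in enumerate(p)),
    )
    return sorted((i, j) for j, i in enumerate(best))


# Two instances whose order is swapped between consecutive frames.
prev_emb = [[1.0, 0.0], [0.0, 1.0]]
curr_emb = [[0.1, 0.9], [0.9, 0.1]]
pairs = match_instances(score_matrix(prev_emb, curr_emb, [0.9, 0.2], [0.25, 0.85]))
```

For larger instance counts, `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time and would be the practical choice.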
Evaluation Metrics
The AVIS task adopts evaluation metrics adapted from VIS (Video Instance Segmentation) with additional audio-related constraints:
| Metric | Full Name | Description |
|---|---|---|
| FSLA | Frame-level Sound Localization Accuracy | Measures whether the sounding objects in each frame are correctly segmented |
| HOTA | Higher Order Tracking Accuracy | Comprehensively measures detection and association quality |
| mAP | Mean Average Precision | Standard instance segmentation metric |
| IDsw | ID Switches | Measures tracking consistency |
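Two of these metrics have simple cores that are easy to state in code. The sketch below counts ID switches from per-frame matches and combines detection and association accuracy the way standard HOTA does at a single localization threshold (a geometric mean). The input layout and function names are assumptions for illustration, and any audio-specific adaptation used by AVIS is not modeled here.

```python
import math
from typing import Dict, List


def count_id_switches(assignments: List[Dict[int, int]]) -> int:
    """Count how often a ground-truth object changes its predicted ID.

    assignments: one dict per frame, mapping ground-truth instance ID to the
    predicted track ID it was matched to. Unmatched objects are simply absent
    from that frame's dict; the comparison is made against the last frame in
    which the object was matched.
    """
    last_seen: Dict[int, int] = {}
    switches = 0
    for frame in assignments:
        for gt_id, pred_id in frame.items():
            if gt_id in last_seen and last_seen[gt_id] != pred_id:
                switches += 1
            last_seen[gt_id] = pred_id
    return switches


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """Standard HOTA at one threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


frames = [
    {1: 10, 2: 20},
    {1: 10, 2: 21},   # object 2 switches from track 20 to 21
    {1: 11},          # object 1 switches from track 10 to 11; object 2 unmatched
    {1: 11, 2: 21},   # no further switches
]
switches = count_id_switches(frames)
```

The geometric mean explains why HOTA is described as comprehensive: a method cannot score well by excelling at detection alone or association alone.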
Experimental Results
Main Results
| Method | FSLA↑ | HOTA↑ | mAP↑ | IDsw↓ |
|---|---|---|---|---|
| Mask2Former (Vision-only) | 31.25 | 48.92 | 27.8 | 145 |
| AVS + SORT | 35.67 | 52.34 | 31.2 | 128 |
| TAPIS | 38.91 | 56.18 | 34.5 | 107 |
| AVISM (Ours) | 42.78 | 61.73 | 38.9 | 82 |
AVISM outperforms every baseline on all four metrics. Against the strongest baseline, TAPIS, it gains 3.87 FSLA, 5.55 HOTA, and 4.4 mAP, and it reduces ID switches from 107 to 82, which supports designing the architecture specifically for the AVIS task.
Ablation Study
| Configuration | FSLA↑ | HOTA↑ |
|---|---|---|
| Full AVISM | 42.78 | 61.73 |
| w/o Audio input | 32.14 | 49.85 |
| w/o Cross-modal attention | 37.92 | 55.41 |
| w/o Audio consistency constraint | 40.15 | 58.62 |
| w/o VLST (No tracking) | 42.78 | 52.37 |
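The audio cue is the core contributing factor: removing the audio input lowers FSLA by 10.64 points (from 42.78 to 32.14) and HOTA by 11.88 points. The VLST tracking module matters mainly for association: without it, HOTA falls by 9.36 points while FSLA is unchanged, since FSLA is computed per frame.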
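The ablation deltas can be recomputed directly from the table. The short script below uses only the reported numbers and prints each drop both in absolute points and relative to the full model, which keeps the two readings of a "drop" from being confused. The helper name `drop` is introduced here for illustration.

```python
from typing import Dict, Tuple

# Values copied from the ablation table above.
FULL = {"FSLA": 42.78, "HOTA": 61.73}
ABLATIONS = {
    "w/o Audio input": {"FSLA": 32.14, "HOTA": 49.85},
    "w/o Cross-modal attention": {"FSLA": 37.92, "HOTA": 55.41},
    "w/o Audio consistency constraint": {"FSLA": 40.15, "HOTA": 58.62},
    "w/o VLST (No tracking)": {"FSLA": 42.78, "HOTA": 52.37},
}


def drop(full: float, ablated: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent of the full score)."""
    absolute = full - ablated
    return round(absolute, 2), round(100.0 * absolute / full, 1)


def summarize() -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Drops for every ablation and metric, keyed by configuration name."""
    return {
        name: {metric: drop(FULL[metric], scores[metric]) for metric in FULL}
        for name, scores in ABLATIONS.items()
    }


summary = summarize()
for name, metrics in summary.items():
    parts = [f"{m}: -{pts} pts ({rel}%)" for m, (pts, rel) in metrics.items()]
    print(f"{name:35s} " + ", ".join(parts))
```

Removing the audio input costs 10.64 FSLA points, which is roughly a quarter of the full model's score in relative terms.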
Category Analysis
| Category | FSLA↑ | Difficulty |
|---|---|---|
| Musical instruments (piano, guitar) | 52.3 | Easy |
| Animals (dog, cat, bird) | 43.7 | Medium |
| Vehicles (car, motorcycle) | 38.9 | Medium |
| Human activities (speaking, clapping) | 31.2 | Hard |
The musical instruments category is the easiest to localize due to stationary sound source positions and highly distinctive visual features. Human activities present the greatest difficulty because there may be multiple sound sources (people) in the video with complex motion patterns.
Summary & Future Work
This work proposes AVIS, a new audio-visual understanding task, establishes the AVISeg benchmark (926 videos, over 90K masks, 26 categories), and designs the AVISM baseline method (FLSL + VLST). AVISM achieves 42.78% FSLA and 61.73% HOTA, providing a solid foundation for future research.