Audio-Visual Instance Segmentation¶

Background & Motivation¶

Humans naturally integrate auditory and visual information in daily life to locate and identify sound sources in their environment. For instance, when hearing a dog barking, we automatically focus our attention on the dog in the visual field and precisely perceive its contour. This capability of joint audio-visual perception holds significant application value in fields such as robot navigation, video editing, and autonomous driving.

Existing audio-visual learning tasks primarily include:

Task	Output	Granularity	Temporal Modeling
Sound Source Localization	Heatmap	Region-level	Single-frame
Audio-Visual Segmentation	Semantic mask	Pixel-level	Single-frame/Short-sequence
Audio-Visual Source Separation	Separated audio	Source-level	Multi-frame
AVIS (Ours)	Instance mask + ID	Instance-level	Full-video

Limitations of existing methods:

Sound Source Localization only provides coarse, region-level localization and fails to generate precise pixel-level segmentations.

Audio-Visual Segmentation performs semantic-level segmentation but does not distinguish among different individuals of the same category.

No existing method simultaneously performs instance-level segmentation and cross-frame tracking at the video level.

This work introduces a new task, Audio-Visual Instance Segmentation (AVIS): given a video and its corresponding audio, the goal is to perform instance-level segmentation of all sounding objects and track them across frames. In addition, the AVISeg benchmark dataset and the AVISM baseline method are established.

Method¶

AVISeg Benchmark Dataset¶

AVISeg is the first large-scale annotated dataset for audio-visual instance segmentation:

Attribute	Value
Number of videos	926
Annotated masks	90,000+
Object categories	26
Average video duration	8.3 seconds
Resolution	720p/1080p
Annotation frequency	Annotated every 5 frames
Data sources	YouTube, ActivityNet, VGGSound

Data Collection and Annotation Pipeline¶

Video Filtering: Select videos that contain distinct sounding sources from large-scale video databases.
Audio Event Annotation: Annotate the temporal boundaries and categories of audio events in each video.
Instance Segmentation Annotation: Delineate instance-level segmentation masks for each sounding object.
Cross-Frame Association: Assign consistent instance IDs to masks of the same object across different frames.
Quality Control: Implement multiple rounds of verification to ensure annotation quality.
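As an illustration of the cross-frame association step, the sketch below shows one way an annotation record could carry a consistent instance ID and how records could be grouped into per-object tracks. The field names, the `MaskAnnotation` class, and the assumption that annotation starts at frame 0 are hypothetical; they are not AVISeg's actual file format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskAnnotation:
    """One annotated mask (hypothetical schema, not the AVISeg format)."""
    video_id: str
    frame_index: int
    instance_id: int  # same value for the same object in every frame
    category: str


def annotated_frames(num_frames, stride=5):
    """Frame indices that receive annotations, assuming frame 0 is annotated."""
    return list(range(0, num_frames, stride))


def group_into_tracks(annotations):
    """Map (video_id, instance_id) to the sorted frames where it appears."""
    tracks = defaultdict(list)
    for ann in sorted(annotations, key=lambda a: a.frame_index):
        tracks[(ann.video_id, ann.instance_id)].append(ann.frame_index)
    return dict(tracks)


# Toy example: a dog annotated in three frames, a guitar in two.
records = [
    MaskAnnotation("vid_a", 5, 1, "dog"),
    MaskAnnotation("vid_a", 0, 1, "dog"),
    MaskAnnotation("vid_a", 10, 1, "dog"),
    MaskAnnotation("vid_a", 0, 2, "guitar"),
    MaskAnnotation("vid_a", 5, 2, "guitar"),
]
tracks = group_into_tracks(records)
```

Keying tracks on the pair of video and instance ID keeps IDs local to a video, which is the usual convention in video instance segmentation datasets.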

AVISM Baseline Method¶

AVISM (Audio-Visual Instance Segmentation and tracking Model) consists of two core modules:

Frame-Level Sound Localizer (FLSL)¶

FLSL is responsible for locating sounding objects in each frame:

Audio Feature Extraction: Extract audio embeddings \(\mathbf{a}_t \in \mathbb{R}^{d}\) using an audio encoder pre-trained on AudioSet.
Visual Feature Extraction: Extract visual feature maps \(\mathbf{V}_t \in \mathbb{R}^{d \times H \times W}\) using ResNet-50 / Swin-T.
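Cross-Modal Attention: Use the audio embedding as the query over the visual feature map, with \(\mathbf{V}_t\) flattened to \(HW \times d\) so that the product below is well-defined: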

\[\mathbf{A}_{t} = \text{softmax}\left(\frac{\mathbf{a}_t \mathbf{V}_t^T}{\sqrt{d}}\right) \mathbf{V}_t\]

Instance Prediction: Predict instance masks and categories using a Mask2Former-style decoder based on attention-enhanced features.
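The cross-modal attention step can be sketched in plain Python as follows. This is a minimal illustration, not the AVISM implementation: it assumes a single audio embedding per frame and a visual feature map already flattened to N = H·W tokens of dimension d, and the function names and toy values are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio, visual_tokens):
    """Attend from one audio embedding over N visual tokens.

    audio: list of d floats (a_t).
    visual_tokens: N lists of d floats (V_t flattened to N x d, an assumed layout).
    Returns the N attention weights and the attended d-dimensional feature.
    """
    d = len(audio)
    # Scaled dot product between the audio query and every spatial location.
    scores = [
        sum(a * v for a, v in zip(audio, token)) / math.sqrt(d)
        for token in visual_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the visual tokens, one value per feature channel.
    attended = [
        sum(w * token[k] for w, token in zip(weights, visual_tokens))
        for k in range(d)
    ]
    return weights, attended


# Toy 2x2 feature map (N = 4 locations, d = 2); values are illustrative only.
audio = [1.0, 0.0]
visual_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0], [-4.0, 0.0]]
weights, attended = audio_guided_attention(audio, visual_tokens)
```

In this toy input the first location is most aligned with the audio embedding, so it receives the largest weight; reshaping `weights` back to H × W would give a per-frame sounding-region map.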

Video-Level Sounding Tracker (VLST)¶

VLST is responsible for cross-frame instance association and tracking:

Instance Embedding: Generate an appearance embedding \(\mathbf{e}_i^t\) for each detected instance.
Audio Consistency Constraint: Ensure that the correlation between the same instance and its corresponding audio remains consistent across different frames.
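Matching Strategy: Perform bipartite graph matching on a score that combines mask overlap (IoU), appearance similarity, and audio correlation. Although it is written as a cost, every term is a similarity, so the assignment maximizes the total (equivalently, minimizes its negative):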

\[\text{cost}(i, j) = \alpha \cdot \text{IoU}(m_i^t, m_j^{t+1}) + \beta \cdot \cos(\mathbf{e}_i^t, \mathbf{e}_j^{t+1}) + \gamma \cdot \text{audio\_sim}(i, j)\]
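The matching step can be illustrated with a small, self-contained sketch. Several things here are assumptions rather than details from the source: masks are represented as sets of pixel coordinates, `audio_sim` is modeled as one minus the difference of two per-instance audio correlation scores, the weights default to 1, and the assignment is found by brute force over permutations instead of a Hungarian solver, which only suits a handful of instances and equal counts in both frames.

```python
import math
from itertools import permutations


def mask_iou(mask_a, mask_b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def match_score(prev, curr, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of IoU, embedding similarity, and audio similarity."""
    # Hypothetical audio term: closeness of the two audio correlation scores.
    audio_sim = 1.0 - abs(prev["audio_corr"] - curr["audio_corr"])
    return (
        alpha * mask_iou(prev["mask"], curr["mask"])
        + beta * cosine(prev["emb"], curr["emb"])
        + gamma * audio_sim
    )


def best_assignment(prev_instances, curr_instances):
    """Brute-force one-to-one assignment that maximizes the total score.

    Returns a list where entry i is the index in curr_instances matched to
    prev_instances[i]. Assumes both lists have the same length.
    """
    n = len(prev_instances)
    best, best_total = (), -math.inf
    for perm in permutations(range(n)):
        total = sum(
            match_score(prev_instances[i], curr_instances[perm[i]])
            for i in range(n)
        )
        if total > best_total:
            best, best_total = perm, total
    return list(best)


# Toy frames: the two instances appear in swapped order in frame t+1.
frame_t = [
    {"mask": {(0, 0), (0, 1)}, "emb": [1.0, 0.0], "audio_corr": 0.9},
    {"mask": {(5, 5), (5, 6)}, "emb": [0.0, 1.0], "audio_corr": 0.2},
]
frame_t1 = [
    {"mask": {(5, 5), (5, 6), (5, 7)}, "emb": [0.1, 0.9], "audio_corr": 0.25},
    {"mask": {(0, 1), (0, 2)}, "emb": [0.9, 0.1], "audio_corr": 0.85},
]
assignment = best_assignment(frame_t, frame_t1)
```

For realistic instance counts, the same score matrix would normally be negated and passed to a polynomial-time assignment solver rather than enumerated.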

Evaluation Metrics¶

The AVIS task adopts evaluation metrics adapted from VIS (Video Instance Segmentation) with additional audio-related constraints:

Metric	Full Name	Description
FSLA	Frame-level Sound Localization Accuracy	Measures whether the sounding objects in each frame are correctly segmented
HOTA	Higher Order Tracking Accuracy	Comprehensively measures detection and association quality
mAP	Mean Average Precision	Standard instance segmentation metric
IDsw	ID Switches	Measures tracking consistency
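To make the IDsw metric concrete, the sketch below counts identity switches under a simplified convention: each frame is given as a mapping from ground-truth instance ID to the predicted track ID it was matched with, and a switch is counted whenever a ground-truth object reappears with a different predicted ID. The exact matching protocol used by the benchmark is not specified here, so this is an assumption for illustration.

```python
def count_id_switches(frames):
    """Count ID switches over a sequence of frames.

    frames: list of dicts mapping ground-truth instance ID -> predicted
    track ID, containing only the objects matched in that frame.
    """
    last_seen = {}  # most recent predicted ID for each ground-truth object
    switches = 0
    for frame in frames:
        for gt_id, pred_id in frame.items():
            if gt_id in last_seen and last_seen[gt_id] != pred_id:
                switches += 1
            last_seen[gt_id] = pred_id
    return switches


# Toy sequence: the guitar changes track ID once, the dog changes once.
frames = [
    {"dog": 1, "guitar": 2},
    {"dog": 1, "guitar": 3},
    {"dog": 1},
    {"dog": 4, "guitar": 3},
]
id_switches = count_id_switches(frames)
```

Because the last known ID is remembered across frames where an object is unmatched, a track that resumes under a new ID after a gap still counts as a switch.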

Experimental Results¶

Main Results¶

Method	FSLA↑	HOTA↑	mAP↑	IDsw↓
Mask2Former (Vision-only)	31.25	48.92	27.8	145
AVS + SORT	35.67	52.34	31.2	128
TAPIS	38.91	56.18	34.5	107
AVISM (Ours)	42.78	61.73	38.9	82

AVISM outperforms every baseline on all four metrics. Relative to the strongest baseline, TAPIS, it gains 3.87 points in FSLA, 5.55 in HOTA, and 4.4 in mAP, and it reduces ID switches from 107 to 82, which supports an architecture designed specifically for the AVIS task.

Ablation Study¶

Configuration	FSLA↑	HOTA↑
Full AVISM	42.78	61.73
w/o Audio input	32.14	49.85
w/o Cross-modal attention	37.92	55.41
w/o Audio consistency constraint	40.15	58.62
w/o VLST (No tracking)	42.78	52.37

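The audio cue is a core contributing factor: removing the audio input lowers FSLA by 10.64 points (from 42.78 to 32.14, about 25% relative). The VLST tracking module contributes substantially to HOTA, which falls by 9.36 points without it, while FSLA is unchanged because it is a frame-level metric.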
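The distinction between an absolute drop in points and a relative drop in percent can be checked directly from the ablation table. The snippet below is a small arithmetic sketch; the numbers are copied from the table, and the helper name `metric_drops` is made up for the example.

```python
# Scores copied from the ablation table above.
full = {"FSLA": 42.78, "HOTA": 61.73}
ablations = {
    "w/o Audio input": {"FSLA": 32.14, "HOTA": 49.85},
    "w/o Cross-modal attention": {"FSLA": 37.92, "HOTA": 55.41},
    "w/o Audio consistency constraint": {"FSLA": 40.15, "HOTA": 58.62},
    "w/o VLST (No tracking)": {"FSLA": 42.78, "HOTA": 52.37},
}


def metric_drops(full_scores, ablated_scores):
    """Return {metric: (absolute drop in points, relative drop in percent)}."""
    drops = {}
    for metric, full_value in full_scores.items():
        absolute = full_value - ablated_scores[metric]
        relative = 100.0 * absolute / full_value
        drops[metric] = (round(absolute, 2), round(relative, 1))
    return drops


report = {name: metric_drops(full, scores) for name, scores in ablations.items()}

for name, drops in report.items():
    print(name, drops)
```

Removing the audio input costs 10.64 FSLA points, which is roughly a quarter of the full model's score, so the two ways of reporting the drop differ by more than a factor of two.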

Category Analysis¶

Category	FSLA↑	Difficulty
Musical instruments (piano, guitar)	52.3	Easy
Animals (dog, cat, bird)	43.7	Medium
Vehicles (car, motorcycle)	38.9	Medium
Human activities (speaking, clapping)	31.2	Hard

The musical instruments category is the easiest to localize due to stationary sound source positions and highly distinctive visual features. Human activities present the greatest difficulty because there may be multiple sound sources (people) in the video with complex motion patterns.

Summary & Future Work¶

This work proposes AVIS, a new audio-visual understanding task, establishes the AVISeg benchmark (926 videos, over 90K masks, 26 categories), and designs the AVISM baseline method (FLSL + VLST). AVISM achieves 42.78% FSLA and 61.73% HOTA, providing a solid foundation for future research.