Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Conference: CVPR 2025
arXiv: 2503.13068
Code: GeWu-Lab/Crab
Institution: Renmin University of China / Tsinghua University / Tencent PCG
Area: Audio-visual understanding / Multimodal learning
Keywords: audio-visual understanding, unified model, interaction-aware LoRA, instruction tuning, multi-task learning
TL;DR
This paper proposes Crab, a unified audio-visual scene understanding model. By constructing the AV-UIE dataset (200K samples) with explicit reasoning processes, it clarifies the collaborative relationships across tasks. Combined with interaction-aware LoRA (multi-head LoRA) designed to learn different audio-visual interaction patterns, Crab outperforms specialized models across multiple tasks.
Background & Motivation
Background: Audio-visual scene understanding comprises various tasks: temporal localization (AVE, AVVP), spatio-temporal reasoning (AVQA), spatial localization (ARIG), and pixel-level understanding (AVS, Ref-AVS). Although humans possess a unified ability to understand multiple tasks, most existing works design specialized models for individual tasks.
Limitations of Prior Work:
- Simple Joint Training: Interference arises between multiple tasks due to the heterogeneous nature of audio-visual data and the complex relationships among tasks.
- Existing Unified Models (e.g., VideoLLaMA, GroundingGPT): These models lack explicit cooperation mechanisms between tasks, leading to limited performance.
- Existing Datasets: They provide only simple word-level labels, failing to capture the reasoning and collaborative relationships among tasks.
Key Challenge: How to handle temporal, spatial, and pixel-level multi-granularity tasks simultaneously within a single model while avoiding task interference.
Key Insight: Establishing explicit task cooperation across both data and model levels.
Core Idea: Explicit reasoning dataset (AV-UIE) + interaction-aware LoRA (multi-head) = Unified Audio-Visual Understanding.
Method
Overall Architecture
- Visual Encoder: CLIP-ViT-L/14, extracting patch-level features.
- Audio Encoder: BEATs, extracting acoustic features.
- Segmentation Decoder: SAM decoder.
- Language Model: LLaMA-2-7b-Chat.
- Multimodal Bridge: Audio Q-Former + Visual Q-Former (32 query tokens each).
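The component list above can be written down as a small configuration object. This is only a sketch for orientation: the class name `CrabComponents`, its field names, and the helper `query_tokens_per_pass` are hypothetical and do not come from the GeWu-Lab/Crab repository. The values are the ones stated in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrabComponents:
    """Hypothetical container for the components named in this summary."""

    visual_encoder: str = "CLIP-ViT-L/14"
    audio_encoder: str = "BEATs"
    segmentation_decoder: str = "SAM decoder"
    language_model: str = "LLaMA-2-7b-Chat"
    audio_query_tokens: int = 32   # Audio Q-Former queries
    visual_query_tokens: int = 32  # Visual Q-Former queries

    def query_tokens_per_pass(self) -> int:
        # Tokens produced by one pass through both Q-Formers. How many
        # passes are made per clip is not stated in this summary.
        return self.audio_query_tokens + self.visual_query_tokens


config = CrabComponents()
print(config.language_model, config.query_tokens_per_pass())
```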
Key Designs
- AV-UIE Dataset (Audio-Visual Unified Instruction-tuning with Explicit reasoning)
  - Function: Constructing a unified instruction-tuning dataset of 200K samples that includes explicit reasoning processes.
  - Mechanism: Expanding the simple labels of existing datasets into an instruction format that incorporates reasoning chains.
  - Task Coverage: Temporal localization, spatio-temporal reasoning, spatial localization, pixel-level segmentation, and referring segmentation.
  - Effect: Clarifying the cooperative relationships between tasks (e.g., "temporal localization aids spatial localization").
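To make the "simple label to reasoning-chain instruction" mechanism concrete, here is a minimal sketch. The function name, the dictionary keys, and the sample content are all invented for illustration. They are not taken from AV-UIE, whose actual schema and prompts are not described in this summary.

```python
def to_instruction_sample(task, question, label, reasoning):
    """Wrap a word-level label into an instruction sample with explicit reasoning.

    Hypothetical format: the response states the reasoning first and the
    original label last, so the label is still recoverable for evaluation.
    """
    if not reasoning.strip():
        raise ValueError("explicit reasoning must not be empty")
    return {
        "task": task,
        "instruction": question,
        "response": f"{reasoning.strip()} Therefore, the answer is: {label}.",
    }


# Made-up example; not a real AV-UIE record.
sample = to_instruction_sample(
    task="AVE",
    question="Which event occurs in the video, and when?",
    label="dog barking, 2s-6s",
    reasoning="A dog is visible in the frame and barking is audible "
              "from second 2 to second 6.",
)
print(sample["response"])
```

The reasoning text in such a sample mentions temporal and visual evidence together, which is one way a single training example could expose the relationship between tasks.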
- Interaction-aware LoRA
  - Function: Inserting multi-head LoRA into all linear layers of the LLM to learn different audio-visual interaction patterns.
  - Structure: Shared matrix \(\mathbf{A}\) + \(n=3\) LoRA heads (independent \(\mathbf{B}\) matrices).
    - The three heads focus respectively on: temporal interaction / spatial interaction / pixel-level interaction.
    - rank = 8.
  - Design Motivation: Different tasks require focusing on different interaction dimensions of audio-visual data.
  - Output: The weighted sum of the three heads serves as the final adaptation.
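The shared-\(\mathbf{A}\), multi-\(\mathbf{B}\) structure can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' implementation: the class and function names are hypothetical, \(\mathbf{B}\) is zero-initialized as in standard LoRA, and the head weights are passed in as an argument because this summary does not say how they are produced.

```python
import numpy as np


class MultiHeadLoRALinear:
    """Frozen linear layer plus a shared-A, multi-B LoRA update (illustrative)."""

    def __init__(self, d_in, d_out, rank=8, n_heads=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01    # shared down-projection
        self.B = np.zeros((n_heads, d_out, rank))            # one up-projection per head

    def forward(self, x, head_weights):
        """x: (batch, d_in); head_weights: (n_heads,) mixing coefficients."""
        w = np.asarray(head_weights, dtype=float)
        z = x @ self.A.T                                 # (batch, rank), computed once
        per_head = np.einsum("hor,br->hbo", self.B, z)   # (n_heads, batch, d_out)
        delta = np.einsum("h,hbo->bo", w, per_head)      # weighted sum over heads
        return x @ self.W0.T + delta


def lora_param_counts(d_in, d_out, rank=8, n_heads=3):
    """Trainable parameters: shared-A design vs. n fully independent adapters."""
    shared = rank * d_in + n_heads * d_out * rank
    independent = n_heads * (rank * d_in + d_out * rank)
    return shared, independent


layer = MultiHeadLoRALinear(d_in=16, d_out=12)
out = layer.forward(np.ones((2, 16)), head_weights=[0.2, 0.5, 0.3])
print(out.shape, lora_param_counts(4096, 4096))
```

For a 4096 x 4096 projection (the hidden size of LLaMA-2-7B) with rank 8 and three heads, `lora_param_counts` gives 131,072 trainable parameters with a shared \(\mathbf{A}\) versus 196,608 for three independent adapters, a one-third reduction.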
- Mask Decoder Design
  - Two groups of `<MASK>` tokens correspond to visual features at two scales (14th layer and second-to-last layer).
  - 3 tokens per group.
  - Supports audio-visual semantic segmentation (AVSS) and referring audio-visual segmentation (Ref-AVS).
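One plausible way to turn the `<MASK>` tokens into inputs for the mask decoder is to collect the LLM hidden states at their positions and split them into the two groups. The sketch below shows only that gathering step. The token id `MASK_TOKEN_ID`, the function name, and the assumption that the first three `<MASK>` tokens form the first group are hypothetical. Which group pairs with which visual scale is not stated in this summary.

```python
import numpy as np

MASK_TOKEN_ID = 32001            # hypothetical vocabulary id for <MASK>
N_GROUPS, TOKENS_PER_GROUP = 2, 3


def gather_mask_embeddings(input_ids, hidden_states):
    """Return <MASK> hidden states shaped (groups, tokens_per_group, dim).

    input_ids: (seq_len,) token ids of one sequence.
    hidden_states: (seq_len, dim) LLM hidden states for that sequence.
    Assumes <MASK> tokens appear in group order within the sequence.
    """
    ids = np.asarray(input_ids)
    positions = np.flatnonzero(ids == MASK_TOKEN_ID)
    expected = N_GROUPS * TOKENS_PER_GROUP
    if positions.size != expected:
        raise ValueError(f"expected {expected} <MASK> tokens, found {positions.size}")
    picked = np.asarray(hidden_states)[positions]        # (6, dim)
    return picked.reshape(N_GROUPS, TOKENS_PER_GROUP, -1)


ids = [1, 5] + [MASK_TOKEN_ID] * 3 + [7] + [MASK_TOKEN_ID] * 3
states = np.arange(len(ids) * 4, dtype=float).reshape(len(ids), 4)
groups = gather_mask_embeddings(ids, states)
print(groups.shape)
```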
Loss & Training
- Phase 1: Pre-training Alignment
  - Visual branch: Video-LLaVA data.
  - Audio branch: AudioCaps data.
  - Segmentation branch: LVIS data.
  - Global batch size 256, 3 epochs.
- Phase 2: Instruction Tuning
  - AV-UIE dataset, mixing all tasks.
  - Trainable components: Three multimodal branches + interaction-aware LoRA (encoders are frozen).
  - Global batch size 512, 5 epochs.
Loss function: \(\mathcal{L} = \lambda_{txt}\mathcal{L}_{txt} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{bce}\mathcal{L}_{bce} + \lambda_{dice}\mathcal{L}_{dice} + \lambda_{ce}\mathcal{L}_{ce}\)
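The weighted combination can be sketched in plain Python. The BCE and Dice definitions below are the common textbook forms, assumed here rather than taken from the paper. The \(\lambda\) values are placeholders, since this summary does not report them, and the sketch treats the five terms as given without assuming how \(\mathcal{L}_{seg}\) relates to the BCE and Dice terms.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 - Dice coefficient, with additive smoothing."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def total_loss(terms, weights):
    """Weighted sum of named loss terms; both dicts must share the same keys."""
    if set(terms) != set(weights):
        raise ValueError("terms and weights must have identical keys")
    return sum(weights[name] * terms[name] for name in terms)


# Placeholder lambdas (all 1.0) and made-up term values, for illustration only.
lambdas = {"txt": 1.0, "seg": 1.0, "bce": 1.0, "dice": 1.0, "ce": 1.0}
pred, gt = [0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]
terms = {"txt": 2.1, "seg": 0.4, "bce": bce_loss(pred, gt),
         "dice": dice_loss(pred, gt), "ce": 0.7}
print(total_loss(terms, lambdas))
```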
Key Experimental Results
Comprehensive Comparison with Specialized Models
| Task | Metric | Prev. SOTA | Crab |
|---|---|---|---|
| AVE Temporal Localization | Acc | MM-Pyramid 77.80 | 80.15 |
| AVQA Spatio-temporal Reasoning | Avg | TSPM 76.79 | 78.94 |
| ARIG Spatial Localization | cIoU | FNAC 27.15 | 41.78 |
| ARIG Spatial Localization | AUC | FNAC 0.31 | 0.42 |
| AVS-MS3 Pixel Segmentation | mIoU | AVSegFormer 58.40 | 58.21 |
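The margins implied by the table can be computed directly. The snippet below only re-derives differences from the numbers in the table, all of which are higher-is-better metrics. The variable and function names are illustrative.

```python
# (task, metric, previous SOTA, Crab), copied from the table above.
results = [
    ("AVE", "Acc", 77.80, 80.15),
    ("AVQA", "Avg", 76.79, 78.94),
    ("ARIG", "cIoU", 27.15, 41.78),
    ("ARIG", "AUC", 0.31, 0.42),
    ("AVS-MS3", "mIoU", 58.40, 58.21),
]


def margins(rows):
    """Crab minus previous SOTA, rounded to two decimals."""
    return {(task, metric): round(crab - prev, 2)
            for task, metric, prev, crab in rows}


for key, delta in margins(results).items():
    print(key, f"{delta:+.2f}")
```

The differences are +2.35 (AVE), +2.15 (AVQA), +14.63 cIoU and +0.11 AUC (ARIG), and -0.19 mIoU (AVS-MS3). Crab leads on four of the five rows and trails AVSegFormer slightly on AVS-MS3.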
AVQA Subcategory Comparison
| Method | Audio | Visual | Audio-Visual | Avg |
|---|---|---|---|---|
| LAVISH | 75.97 | 80.22 | 71.26 | 74.46 |
| TSPM | 76.91 | 83.61 | 73.51 | 76.79 |
| Crab | 76.58 | 90.73 | 74.13 | 78.94 |
The largest gain is on the Visual subcategory (90.73 vs. 83.61 for TSPM, +7.12), which likely benefits from enhanced visual understanding supported by explicit reasoning. On the Audio subcategory, Crab (76.58) is slightly below TSPM (76.91).
Key Findings
- Each LoRA head automatically learns distinct audio-visual understanding capabilities (validated through visualization).
- Although the temporal localization task has the smallest proportion in AV-UIE, it still achieves significant improvement due to cross-task collaboration.
- Crab matches VALOR on AVQA (78.94 vs. 78.90) while using significantly less data: VALOR was trained on the million-scale VALOR-1M dataset.
Highlights & Insights
- The multi-head LoRA design is simple yet effective: the shared A matrix reduces the parameter footprint, while the multi-head B matrices capture different interaction patterns.
- Explicit reasoning data is far more effective than simple labels, enabling the model to understand "why different tasks need to cooperate."
- Crab outperforms the specialized FNAC by a large margin on spatial localization (ARIG): +14.63 cIoU (41.78 vs. 27.15), demonstrating the cross-task transfer advantage of a unified model.
- The unified model paradigm is more elegant and efficient than assembling a pipeline of specialized models.