
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Conference: ICCV 2025 | arXiv: 2507.22886 | Code: OmniAVS | Area: Image Segmentation | Keywords: Audio-visual segmentation, omnimodal referring, reasoning segmentation, multimodal large language models, query propagation

TL;DR

This paper proposes the OmniAVS dataset and OISA model, extending referring audio-visual segmentation (RAVS) beyond simple acoustic attribute perception to omnimodal expressions (arbitrary combinations of text/speech/sound/image) and deep reasoning (understanding sound content + world knowledge), achieving SOTA on the new benchmark and multiple related tasks.

Background & Motivation

Referring audio-visual segmentation (RAVS) is an emerging field that aims to segment target objects in audio-visual scenes based on referring expressions. The existing dataset Ref-AVS suffers from three major limitations:

Shallow sound utilization: Expressions only address surface-level acoustic attributes (e.g., "who makes the loudest sound"), without understanding sound content.

Modality limitation: Only text-based referring expressions are supported; speech, sound clips, and image inputs are absent.

Lack of reasoning requirements: Expressions require neither world knowledge nor complex reasoning.

Taking "who is most likely to be sick?" as an example, the model must establish a cognitive chain: sound → coughing → illness, which goes beyond simple acoustic feature recognition. Meanwhile, omnimodal AI systems such as ChatGPT-4o highlight the importance of processing arbitrary combinations of modality inputs.

Core motivation: To construct a RAVS benchmark and baseline model that genuinely understands sound content, supports omnimodal referring, and incorporates complex reasoning.

Method

Dataset: OmniAVS

Video sources: Creative Commons online videos + TVQA TV drama clips + self-recorded videos, with 2,104 videos curated from 10,871 candidates.

8 expression types: I. Text | II. Speech | III. Text + Sound | IV. Speech + Sound | V. Text + Image | VI. Speech + Image | VII. Text + Sound + Image | VIII. Speech + Sound + Image

Annotation rules:

  • Expressions must relate to sounds present in the video, not solely visual cues.
  • Emphasis is on sound content rather than sound behavior (e.g., "warning dog" rather than "barking dog").
  • Expressions requiring reasoning are encouraged, and reasoning explanations are provided.
  • Each expression may refer to zero or more targets.

Dataset scale: 2,104 videos, 103k frames, 4,277 targets, 206k masks, 61,095 expressions, 34,841 reasoning explanations.

Model: OISA (Omnimodal Instructed Segmentation Assistant)

Overall Architecture: MLLM (audio encoder + visual encoder + LLM) + mask head (ViT-Adapter + pixel decoder + mask decoder)

  • MLLM backbone: InternVL2-1B (InternViT-300M-448px + Qwen2-0.5B)
  • Audio encoder: Whisper-large-v3 + audio MLP
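
To make the data flow concrete, below is a rough structural sketch of how the pieces connect (module names, dimensions, and the toy dot-product mask head are illustrative assumptions; the real model uses InternViT-300M, Whisper-large-v3, Qwen2-0.5B, and a ViT-Adapter + pixel/mask decoder):

```python
# Illustrative sketch of the OISA pipeline; all layer sizes and stand-in
# modules are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class OISASketch(nn.Module):
    def __init__(self, d_llm=896, d_audio=1280, d_mask=256):
        super().__init__()
        self.audio_mlp = nn.Linear(d_audio, d_llm)  # projects audio-encoder features into the LLM space
        self.llm = nn.Identity()                    # stand-in for Qwen2-0.5B (LoRA-tuned in stage 2)
        self.seg_proj = nn.Linear(d_llm, d_mask)    # maps the [SEG] hidden state to a mask query

    def forward(self, visual_tokens, audio_feats, text_tokens, pixel_feats):
        # visual_tokens: [B, Tv, d_llm], audio_feats: [B, Ta, d_audio],
        # text_tokens: [B, Tt, d_llm], pixel_feats: [B, H*W, d_mask]
        audio_tokens = self.audio_mlp(audio_feats)
        # The actual model interleaves visual and audio tokens (AVI, shown below);
        # plain concatenation is used here only to keep the sketch short.
        seq = torch.cat([visual_tokens, audio_tokens, text_tokens], dim=1)
        hidden = self.llm(seq)
        seg_query = self.seg_proj(hidden[:, -1:])   # assume [SEG] is the final position
        # Toy mask head: dot product between the query and pixel-level features.
        mask_logits = torch.einsum("bqd,bpd->bqp", seg_query, pixel_feats)
        return mask_logits
```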

Key Design 1: Audio-Visual Interleaving (AVI)

\(N\) frames are sampled from the video to obtain visual tokens \(\{v_1, ..., v_N\}\); the encoded audio is split into \(N\) segments \(\{a_1, ..., a_N\}\) and interleaved in temporal order:

\[\{v_1, a_1, v_2, a_2, ..., v_N, a_N\}\]

Compared to VideoLLaMA's sequential concatenation \(\{v_1,...,v_N, a_1,...,a_N\}\) or video-SALMONN's weighted fusion, the interleaving strategy achieves temporal alignment without additional parameters. Gains are particularly pronounced on the TVQA subset (which contains dense dialogue and requires precise audio-visual alignment).

The complete audio token \(\mathbf{A}\) is further appended at the end of the interleaved sequence, analogous to InternVL2's thumbnail strategy, to supplement untruncated global audio information.
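
For concreteness, here is a minimal sketch of the interleaving step, assuming the per-frame visual tokens and per-segment audio tokens have already been projected into a shared embedding space (function name and shapes are illustrative):

```python
# Minimal sketch of audio-visual interleaving (AVI).
import torch

def interleave_av_tokens(visual_tokens, audio_tokens, global_audio):
    """visual_tokens: list of N tensors [T_v, D] (one per sampled frame)
    audio_tokens:  list of N tensors [T_a, D] (audio split into N segments)
    global_audio:  tensor [T_A, D], the full audio sequence (thumbnail-style)
    Returns one sequence {v_1, a_1, ..., v_N, a_N, A} for the LLM."""
    assert len(visual_tokens) == len(audio_tokens)
    chunks = []
    for v_t, a_t in zip(visual_tokens, audio_tokens):
        chunks.append(v_t)   # frame t's visual tokens
        chunks.append(a_t)   # the temporally aligned audio segment
    chunks.append(global_audio)  # untruncated global audio appended at the end
    return torch.cat(chunks, dim=0)

# Toy usage: 4 frames, 8 visual and 5 audio tokens per step, D = 64.
N, D = 4, 64
vis = [torch.randn(8, D) for _ in range(N)]
aud = [torch.randn(5, D) for _ in range(N)]
seq = interleave_av_tokens(vis, aud, torch.randn(20, D))
print(seq.shape)  # torch.Size([72, 64]) = 4 * (8 + 5) + 20
```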

Key Design 2: Query Propagation

The MLLM generates a [SEG] token whose hidden embedding serves as the target query and is passed to the mask decoder.

VideoLISA's OTSA (One-Token-Seg-All) strategy uses the same [SEG] token to independently segment each frame. However, a single query carries a positional prior and struggles to adapt to target motion (e.g., moving from right to left), leading to ID switches.

Query propagation updates the query frame by frame:

  • After each frame is segmented, the output query of the current frame is propagated to the next frame.
  • The query is refined online, smoothly capturing the temporal motion trajectory.
  • Contextual temporal information is effectively modeled.

\[q_{t+1} = \text{MaskDecoder}(q_t, F_t)\]

where \(F_t\) denotes the features of frame \(t\) and \(q_1\) is initialized from the [SEG] embedding.
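
A minimal sketch of this propagation loop is shown below, with a toy single-layer decoder standing in for the real mask head (all module and function names are illustrative assumptions):

```python
# Minimal sketch of query propagation (QP) across frames.
import torch
import torch.nn as nn

class TinyMaskDecoder(nn.Module):
    """Stand-in for the real mask decoder: cross-attend the query to the
    frame features, predict a mask, and return the refined query."""
    def __init__(self, d=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.mask_mlp = nn.Linear(d, d)

    def forward(self, query, frame_feats):
        # query: [1, 1, d]; frame_feats: [1, H*W, d]
        q, _ = self.cross_attn(query, frame_feats, frame_feats)
        mask_embed = self.mask_mlp(q)                               # [1, 1, d]
        mask_logits = torch.einsum("bqd,bpd->bqp", mask_embed, frame_feats)
        return mask_logits, q                                       # refined query is propagated

def segment_video(seg_token, frame_features, decoder):
    """seg_token: [1, 1, d] embedding of the [SEG] token from the MLLM.
    frame_features: list of per-frame features, each [1, H*W, d]."""
    query, masks = seg_token, []
    for feats in frame_features:
        mask_t, query = decoder(query, feats)  # QP: q_{t+1} = MaskDecoder(q_t, F_t)
        masks.append(mask_t)
    return masks
```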

Training Pipeline

Stage 1 — Audio-text alignment: only the audio projection MLP is trained, using ASR and audio captioning datasets; all other parameters are frozen.

Stage 2 — Omnimodal instructed segmentation fine-tuning: Training is performed on a mixture of datasets (ADE20K, COCO-Stuff, RefCOCO series, MeViS, ReVOS, Ref-AVS, OmniAVS, etc.), with LoRA fine-tuning applied to the LLM and full training of mask head parameters. The loss comprises cross-entropy (text) + DICE + BCE (segmentation).
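
A hedged sketch of how such a combined objective can be computed is given below (the loss weights are assumptions, not the paper's values):

```python
# Sketch of the stage-2 objective: cross-entropy on text + DICE + BCE on masks.
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(-1)
    union = pred.sum(-1) + target.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def oisa_loss(text_logits, text_labels, mask_logits, mask_targets,
              w_txt=1.0, w_bce=2.0, w_dice=0.5):
    # Next-token prediction over the MLLM's text output (answers + explanations).
    l_txt = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(),
                            ignore_index=-100)
    # Per-pixel BCE + DICE over the predicted segmentation masks.
    l_bce = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    l_dice = dice_loss(mask_logits, mask_targets)
    return w_txt * l_txt + w_bce * l_bce + w_dice * l_dice
```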

Key Experimental Results

OmniAVS Benchmark

| Method | Overall \(\mathcal{J\&F}\) | I (Text) | VII (Text+Sound+Image) | VIII (Speech+Sound+Image) | METEOR |
|---|---|---|---|---|---|
| LMPM | 25.8 | 31.2 | - | - | - |
| MUTR | 32.3 | 35.4 | 41.6 | 40.5 | - |
| LISA-13B | 36.1 | 36.4 | 46.7 | 45.7 | 16.5 |
| OISA-1B | 41.1 | 40.1 | 52.6 | 53.0 | 21.7 |

With only 1B parameters, OISA-1B surpasses LISA-13B by 5.0 points \(\mathcal{J\&F}\) (41.1 vs. 36.1), with a concurrent improvement in reasoning explanation quality (METEOR +5.2). Splits with multimodal combination inputs (VII/VIII) yield the best results, demonstrating the complementarity of multiple modalities.

Ref-AVS Benchmark

| Method | Seen \(\mathcal{J}\) | Unseen \(\mathcal{J}\) | Mix \(\mathcal{J}\)/\(\mathcal{F}\) |
|---|---|---|---|
| EEMC | 34.2 | 49.5 | 41.9/58.1 |
| OISA-1B | 51.7 | 58.3 | 54.5/61.4 |

Improvements of +17.5 and +8.8 are achieved on the Seen and Unseen splits, respectively.

Ablation Study

Audio-visual fusion strategies:

| Fusion Method | TVQA Subset | Overall |
|---|---|---|
| Attention | 37.4 | 35.8 |
| Concatenation | 36.9 | 35.3 |
| AVI + Concatenation | 42.0 | 40.5 |

Mask head design:

| Query Type | Mask Head | \(\mathcal{J\&F}\) | FPS |
|---|---|---|---|
| OTSA | SAM | 38.1 | 4.3 |
| OTSA | M2F | 35.2 | 15.7 |
| QP | SAM | 41.2 | 4.1 |
| QP | M2F | 40.5 | 12.3 |

With the M2F head, query propagation improves over OTSA by +5.3 \(\mathcal{J\&F}\) while remaining roughly 3× faster than the SAM-based heads (12.3 vs. ~4 FPS).

Key Findings

  1. Audio-visual interleaving is the optimal solution for temporal alignment, with the most pronounced advantage on the TVQA subset (dense dialogue).
  2. More modalities lead to better performance (splits VII/VIII achieve the highest scores), confirming that multiple modalities provide complementary information.
  3. Query propagation substantially improves tracking quality for dynamic targets, resolving the ID-switching problem of OTSA.
  4. OISA scores roughly 17 points lower \(\mathcal{J\&F}\) on OmniAVS than on Ref-AVS (41.1 vs. 58.0), confirming the new dataset's difficulty.

Highlights & Insights

  • Forward-looking dataset design: 8 modality combinations + reasoning explanations + multi-target referring provide a fine-grained perception benchmark for omnimodal AI.
  • From "hearing" to "understanding": RAVS is advanced from acoustic attribute detection to sound content reasoning.
  • 1B model outperforms 13B: Demonstrates that task-specific design (AVI + QP) matters more than parameter count alone.

Limitations & Future Work

  • The base LLM is only 0.5B, limiting capability in scenarios requiring deep reasoning (e.g., ReasonSeg).
  • Disentangling complex overlapping sounds remains a bottleneck (e.g., multiple simultaneous speakers with background noise).
  • Speech expressions are synthesized via TTS, creating a distribution gap relative to real human speech.

Related Work

  • Audio-visual scene understanding: AVSBench, Ref-AVS, Music-AVQA, and related audio-visual joint learning works.
  • Reasoning segmentation: LISA, VideoLISA, VISA, and other MLLM-based reasoning segmentation methods.
  • Omnimodal models: ChatGPT-4o, VideoLLaMA, and other multimodal understanding systems.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The OmniAVS dataset defines an entirely new paradigm for omnimodal reasoning segmentation.
  • Technical Depth: ⭐⭐⭐⭐ — AVI and query propagation are well-motivated and effective designs.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive validation across OmniAVS/Ref-AVS/RefCOCO/MeViS/ReVOS on multiple tasks.
  • Writing Quality: ⭐⭐⭐⭐ — Dataset motivation and comparison with Ref-AVS are clearly argued.