# SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models
- Conference: NeurIPS 2025
- arXiv: 2505.18812
- Authors: Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang
- Code: None
- Area: Video Understanding / Video Dialogue
- Keywords: Video grounding, multi-turn dialogue, spatio-temporal understanding, SAM, Video LMM
## TL;DR
This paper proposes SAMA, the first framework to jointly model fine-grained spatio-temporal understanding and grounding in multi-turn referential video dialogue, through three coordinated contributions: a unified dataset (SAMA-239K), a model (a spatio-temporal context aggregator coupled with SAM), and a benchmark (SAMA-Bench).
## Background & Motivation
Current Video Large Multimodal Models (Video LMMs) still face significant challenges in fine-grained spatio-temporal video understanding. Achieving this requires mastering two core capabilities simultaneously:
- Video Referring Understanding: capturing the semantics of user-specified video regions
- Video Grounding: segmenting target regions based on natural language descriptions
However, most existing methods treat these two tasks independently, leading to the following key bottlenecks:
- Lack of high-quality unified video instruction data: Existing datasets focus either on referring understanding or grounding, with no large-scale dataset supporting joint learning
- Lack of comprehensive evaluation benchmarks: No unified benchmark exists for evaluating multi-turn spatio-temporal understanding in referential video dialogue
- Model design limitations: Existing models struggle to simultaneously handle video-level spatio-temporal understanding and precise region-level grounding
## Method

### Overall Architecture
SAMA addresses the above issues comprehensively along three core dimensions — dataset, model, and benchmark:
- SAMA-239K Dataset: Contains 15K carefully curated videos with 239K instruction samples, supporting joint learning of video referring understanding, grounding, and multi-turn video dialogue
- SAMA Model: Integrates a general-purpose spatio-temporal context aggregator with the Segment Anything Model (SAM)
- SAMA-Bench: Contains 5,067 questions across 522 videos for evaluating multi-turn spatio-temporal understanding in referential video dialogue
### Key Designs

#### Spatio-Temporal Context Aggregator
- Designs a general-purpose spatio-temporal context aggregation module that facilitates information exchange across different temporal frames and spatial regions
- Supports encoding user-specified video regions (via clicks or bounding boxes) into contextual representations
- Achieves cross-frame temporal association, enabling the model to track object changes along the temporal dimension (see the sketch after this list)
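The paper's code is not released, so the following is a minimal PyTorch sketch of how such an aggregator could work: an embedded user prompt (click or box) attends to the patch tokens within each frame, and the resulting per-frame region tokens then exchange information along the time axis. Class names, dimensions, and the two-step attention design are illustrative assumptions, not SAMA's confirmed architecture.

```python
# Hypothetical spatio-temporal context aggregator; names and shapes are
# illustrative, not taken from the paper.
import torch
import torch.nn as nn

class SpatioTemporalAggregator(nn.Module):
    """Aggregates a user-specified region's context across frames.

    frame_feats:  (T, N, D) patch tokens per frame from the vision encoder
    region_query: (1, D) embedding of the user prompt (click / box)
    """
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        # Spatial step: the region query attends to patches within each frame
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal step: per-frame region tokens exchange information over time
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor, region_query: torch.Tensor):
        T, N, D = frame_feats.shape
        # One region token per frame: (T, 1, D) queries over (T, N, D) patches
        q = region_query.expand(T, 1, D)
        region, _ = self.spatial_attn(q, frame_feats, frame_feats)
        region = self.norm1(region).squeeze(1)          # (T, D)
        # Self-attention along time links the region across frames (tracking)
        tokens = region.unsqueeze(0)                    # (1, T, D)
        tracked, _ = self.temporal_attn(tokens, tokens, tokens)
        return self.norm2(tracked).squeeze(0)           # (T, D) context tokens

# The T output tokens could then be spliced into the LLM input sequence
# as soft "region" tokens alongside the usual frame tokens.
agg = SpatioTemporalAggregator(dim=1024)
feats = torch.randn(8, 256, 1024)   # 8 frames x 256 patches
query = torch.randn(1, 1024)        # embedded click / box prompt
region_tokens = agg(feats, query)   # (8, 1024)
```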
#### SAM Integration
- Integrates the Segment Anything Model into the Video LMM pipeline
- SAM is responsible for generating precise region segmentation masks
- The model can thus output accurate spatial localization while comprehending dialogue semantics (one common wiring is sketched below)
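The note does not say how the LLM hands prompts to SAM. A common pattern in grounded LMMs (popularized by LISA) is to emit a special segmentation token whose final hidden state is projected into SAM's prompt-embedding space; the sketch below assumes that pattern, with `sam_decoder` standing in for SAM's actual mask decoder.

```python
# Hedged sketch: decoding a mask from the LMM hidden state via a SAM-style
# decoder. The [SEG]-token projection is an assumption borrowed from
# LISA-style models, not SAMA's confirmed design.
import torch
import torch.nn as nn

class SegTokenToSAM(nn.Module):
    def __init__(self, llm_dim: int = 4096, sam_prompt_dim: int = 256):
        super().__init__()
        # Map the LLM hidden state at the [SEG] position into SAM's
        # sparse prompt-embedding space
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, sam_prompt_dim),
            nn.GELU(),
            nn.Linear(sam_prompt_dim, sam_prompt_dim),
        )

    def forward(self, seg_hidden, sam_decoder, image_embedding):
        # seg_hidden: (B, llm_dim) hidden state at the [SEG] token
        prompt = self.proj(seg_hidden).unsqueeze(1)  # (B, 1, sam_prompt_dim)
        # sam_decoder is a placeholder for SAM's mask decoder, which takes
        # image embeddings plus prompt embeddings and returns mask logits
        return sam_decoder(image_embedding, prompt)
```

For video, this projection would have to be applied per frame, or paired with a streaming segmenter such as SAM 2 for temporally consistent masks; the note does not specify which route SAMA takes.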
#### SAMA-239K Dataset Construction
- Data collected from 15K diverse videos
- Covers multiple task types: video referring understanding, spatial grounding, temporal grounding, and multi-turn dialogue
- A carefully designed data sampling strategy ensures a balanced distribution across task types (see the sketch below)
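The note gives no details on the sampling strategy. One simple way to realize task balancing is to sample a task type first and then a sample within it, so that frequent task types cannot swamp rare ones; the task labels below are hypothetical.

```python
# Illustrative task-balanced sampling; the actual SAMA-239K strategy is
# not specified in this note.
import random
from collections import defaultdict

def balanced_batches(samples, batch_size, seed=0):
    """samples: dicts with a 'task' field, e.g. 'referring',
    'spatial_grounding', 'temporal_grounding', 'multi_turn_chat'."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task"]].append(s)
    tasks = sorted(by_task)
    while True:
        # Uniform over tasks first, then uniform within the chosen task
        yield [rng.choice(by_task[rng.choice(tasks)]) for _ in range(batch_size)]
```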
#### Loss & Training
- Adopts a multi-task joint training strategy
- Simultaneously optimizes video understanding loss, grounding loss, and dialogue generation loss
- Employs a staged training procedure: pre-training of foundational capabilities followed by instruction fine-tuning (a sketch of the combined objective follows)
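The note names the loss terms without formulas. A plausible instantiation, assuming the grounding loss follows the BCE + Dice pairing that is common in SAM-based segmentation heads, is sketched below; all weights and the exact decomposition are hypothetical.

```python
# Hedged sketch of the multi-task objective; the actual terms and weights
# are not reported in this note.
import torch
import torch.nn.functional as F

def dice_loss(mask_logits: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1.0):
    # Soft Dice over flattened masks: penalizes poor region overlap
    pred = mask_logits.sigmoid().flatten(1)
    gt = gt_mask.flatten(1)
    inter = (pred * gt).sum(-1)
    return 1.0 - (2.0 * inter + eps) / (pred.sum(-1) + gt.sum(-1) + eps)

def multitask_loss(understanding_loss, dialogue_loss, mask_logits, gt_mask,
                   w_und=1.0, w_dlg=1.0, w_bce=2.0, w_dice=0.5):
    # understanding_loss / dialogue_loss: next-token cross-entropy over the
    # respective answer spans; the remaining terms supervise SAM's masks
    bce = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
    grounding = w_bce * bce + w_dice * dice_loss(mask_logits, gt_mask).mean()
    return w_und * understanding_loss + w_dlg * dialogue_loss + grounding
```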
## Key Experimental Results

### Main Results
| Model | SAMA-Bench (Overall) | Video Referring | Video Grounding | Multi-Turn Chat |
|---|---|---|---|---|
| Video-ChatGPT | 32.1 | 28.5 | 18.3 | 41.2 |
| VideoChat2 | 38.7 | 35.2 | 22.1 | 46.8 |
| LLaVA-Video | 42.3 | 40.1 | 25.7 | 49.6 |
| VISA | 45.1 | 43.8 | 31.2 | 51.4 |
| SAMA | 56.8 | 54.2 | 48.6 | 58.3 |
SAMA comprehensively outperforms existing methods on SAMA-Bench, with particularly significant gains in Video Grounding (+17.4 pp vs. VISA).
| Method | MeViS val J&F | Ref-YouTube-VOS J&F | Ref-DAVIS J&F |
|---|---|---|---|
| UNINEXT | 56.8 | 64.3 | 65.2 |
| OnlineRefer | 55.6 | 63.5 | 64.1 |
| TrackGPT | 58.3 | 65.8 | 66.7 |
| SAMA | 62.1 | 68.5 | 69.3 |
SAMA also achieves new state-of-the-art results on standard referring video object segmentation benchmarks (MeViS, Ref-YouTube-VOS, Ref-DAVIS).
### Ablation Study
| Configuration | SAMA-Bench (Overall) | Grounding | Referring |
|---|---|---|---|
| w/o SAM | 48.2 | 35.1 | 50.3 |
| w/o Spatio-Temporal Aggregator | 50.6 | 40.8 | 48.7 |
| w/o SAMA-239K (public data only) | 47.5 | 36.2 | 46.5 |
| Full SAMA | 56.8 | 48.6 | 54.2 |
### Key Findings
- Removing SAM causes the largest drop in grounding performance (48.6 → 35.1, a 13.5 pp decrease), making SAM integration the most critical factor for grounding
- Training on SAMA-239K yields a 9.3 pp gain on the overall SAMA-Bench score over using public data alone (56.8 vs. 47.5)
- Among the architectural components, the spatio-temporal context aggregator matters most for referring understanding (54.2 → 48.7 when removed)
- SAMA maintains highly competitive performance on standard visual understanding benchmarks, indicating that grounding capability does not come at the expense of general understanding
## Highlights & Insights
- Systematic contribution: Simultaneously contributes across three dimensions — dataset, model, and benchmark — forming a complete research loop
- Unified framework: For the first time, integrates video referring understanding, grounding, and multi-turn dialogue into a single model
- High-quality dataset: The methodology for constructing SAMA-239K is instructive — generating 239K diverse instructions from 15K videos
- SAM integration paradigm: Demonstrates how to effectively integrate a visual foundation model (SAM) into a Video LMM
## Limitations & Future Work
- Computational overhead: SAM integration increases inference-time computational cost
- Long video support: Current experiments primarily focus on medium-length video clips
- Real-time interaction: Real-time response capability in multi-turn dialogue requires improvement
- Open-domain generalization: Generalization performance in in-the-wild scenarios requires further validation
## Related Work & Insights
- VideoChat / Video-ChatGPT series: Early video dialogue models lacking grounding capability
- SAM / SAM 2: Visual segmentation foundation models providing strong support for region-level understanding
- UNINEXT / TrackGPT: Models dedicated to video grounding
- Insight: The unified dataset + model + benchmark methodology can be generalized to other multimodal tasks
## Rating
- Novelty: ⭐⭐⭐⭐ — First complete treatment of multi-turn referential video dialogue
- Technical Contribution: ⭐⭐⭐⭐ — Systematic contributions spanning dataset, model, and benchmark
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple benchmarks with thorough ablations
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with well-motivated problem formulation
- Impact: ⭐⭐⭐⭐ — Provides important resources for the video understanding community