Online Reasoning Video Segmentation with Just-in-Time Digital Twins¶
Conference: ICCV 2025 arXiv: 2503.21056 Code: None Area: Video Segmentation / Reasoning Segmentation Keywords: Reasoning Segmentation, Digital Twin, Video Understanding, Multi-Agent Framework, Online Processing
TL;DR¶
This paper proposes a multi-agent framework based on the concept of "Just-in-Time Digital Twins" that decouples perception from reasoning. Without any LLM fine-tuning, the framework enables online video reasoning segmentation and comprehensively outperforms existing methods across semantic, spatial, and temporal reasoning tasks.
Background & Motivation¶
Reasoning Segmentation (RS) aims to identify and segment objects of interest based on implicit textual queries—e.g., "segment the object used to hold hot beverages" rather than "coffee cup"—and is a core capability for embodied intelligence.
Three key limitations of existing RS methods:
Limited reasoning capacity: Relying on multimodal LLMs to jointly handle perception and reasoning leads to poor performance on queries requiring multi-step or complex spatial/temporal reasoning. LLMs must compress rich visual information into a limited number of tokens, losing fine-grained spatial and temporal details.
High maintenance cost: LLM fine-tuning is required; fine-tuning risks catastrophic forgetting of general capabilities, and because base LLMs iterate rapidly, the tuning must be redone for every model upgrade.
No support for online processing: Existing methods are primarily designed for static images or offline videos and cannot handle real-time video streams.
Method¶
Overall Architecture¶
A two-stage pipeline: Planning Phase → Execution Phase
- Planning Phase: An LLM planner analyzes the implicit query, constructs an execution graph (DAG), and selects the necessary specialist vision models.
- Execution Phase: The video is processed online frame by frame; a digital twin is constructed and maintained, reasoning operations are executed, and segmentation masks are produced.
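The control flow can be summarized with a toy skeleton. Since no official code is released, every name and signature below is an assumption rather than the authors' implementation:

```python
# Toy skeleton of the plan-then-execute control flow. No official code is
# released, so every name and signature here is an assumption.
from collections import deque

def plan(query: str) -> dict:
    """Stub for the LLM planner: map an implicit query to a plan."""
    return {"models": ["SAM-2", "DepthAnything-2"], "ops": ["spatial:behind"]}

def perceive(frame, models: list) -> dict:
    """Stub for the selected specialist models: one scene graph per frame."""
    return {"nodes": {}, "edges": []}

def reason(window: deque) -> list:
    """Stub for reasoning over the windowed digital twin: returns masks."""
    return []

def run(query: str, frames, w: int = 6):
    p = plan(query)                    # planning phase: once per query
    window = deque(maxlen=w)           # sliding window of scene graphs
    for frame in frames:               # execution phase: online, per frame
        window.append(perceive(frame, p["models"]))
        yield reason(window)           # segmentation masks for this frame
```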
Key Designs¶
- Query-Driven Specialist Vision Model Selection:
- The LLM planner analyzes the semantic, spatial, and temporal requirements of the query.
- A structured prompt template is used to output a JSON configuration specifying the required models and their rationales.
- For example, "segment the object that moved behind the dining table after the person sat down" → requires SAM-2 (segmentation) + DepthAnything-2 (spatial relations).
- Core Idea: activate specific models only when needed, rather than always running all models, thereby reducing computational overhead.
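For illustration, the planner's JSON output for the example query might look roughly like the following; the paper does not publish its exact schema, so every field name here is an assumption:

```python
import json

# Hypothetical planner output for the example query; the exact JSON
# schema is not published, so every field name here is an assumption.
plan = {
    "query": "segment the object that moved behind the dining table "
             "after the person sat down",
    "models": [
        {"name": "SAM-2", "role": "segmentation",
         "rationale": "pixel-level masks for the target object"},
        {"name": "DepthAnything-2", "role": "spatial",
         "rationale": "'behind' requires per-object depth ordering"},
    ],
    "reasoning": ["temporal:after", "spatial:behind"],
}
print(json.dumps(plan, indent=2))
```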
- Just-in-Time Digital Twin Construction:
- For each frame \(I^{(t)}\), a scene graph \(G_s^{(t)} = (V_s^{(t)}, E_s^{(t)})\) is constructed.
- Node attributes contain three-dimensional features: \(\text{attr}(v_{i,s}^{(t)}) = [h_{\text{vis}}, h_{\text{spa}}, h_{\text{temp}}]\) (visual, spatial, temporal).
- Edges represent inter-object relationships (e.g., "behind," "above," "moving towards").
- On-demand construction: Unlike traditional digital twins that maintain a complete representation, only the information subset required by the query is generated and updated.
- A sliding-window mechanism maintains temporal consistency: \(SG^{(t)} = \{G_s^{(k)} \mid t-w \leq k \leq t\}\).
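A minimal sketch of a scene-graph node with the three attribute groups and the sliding window, assuming a simple (x, y, depth) spatial layout and a 2D velocity for the temporal features (both are assumptions; the paper's attribute encodings are not released):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    obj_id: int
    h_vis: list = field(default_factory=list)   # visual features (e.g., DINOv2)
    h_spa: tuple = (0.0, 0.0, 0.0)              # spatial: (x, y, depth z)
    h_temp: tuple = (0.0, 0.0)                  # temporal: assumed 2D velocity

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # obj_id -> Node
    edges: list = field(default_factory=list)   # e.g., (id_i, "behind", id_j)

# SG^(t) = {G_s^(k) | t - w <= k <= t}: a bounded deque drops graphs
# that fall out of the w-frame window as new frames arrive (w = 6).
window = deque(maxlen=6)
window.append(SceneGraph(nodes={1: Node(obj_id=1, h_spa=(0.5, 0.5, 2.0))}))
```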
- Reasoning Graph Construction and Execution:
- Reasoning is modeled as a DAG: \(G = (V, E)\), where \(V = V_p \cup V_s \cup V_r\).
- \(V_p\): perception nodes (specialist vision models), \(V_s\): state nodes (maintaining the digital twin), \(V_r\): reasoning nodes.
- Two types of reasoning nodes:
- Semantic Reasoning: handled by the base LLM (gpt-4o-mini), which formats the digital twin state as natural language context.
- Spatial/Temporal Reasoning: handled by the LLM-coder (gpt-4o), which generates executable code to operate on the scene graph.
- Example for evaluating a "behind" relation: \(\text{Behind}(v_i, v_j) = (h_{\text{spa}}^i[z] > h_{\text{spa}}^j[z]) \wedge \text{Overlap}(v_i, v_j)\)
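The "behind" predicate above transcribes directly into code, assuming \(h_{\text{spa}}\) stores an (x, y, z) centroid with monocular depth as z and that Overlap is a 2D bounding-box intersection test (both layout choices are assumptions):

```python
# Direct transcription of Behind(v_i, v_j); the (x, y, z) centroid and
# 2D box layouts are assumptions about how h_spa is stored.
def overlap(box_i, box_j) -> bool:
    """2D bounding-box intersection test; boxes are (x1, y1, x2, y2)."""
    return not (box_i[2] < box_j[0] or box_j[2] < box_i[0] or
                box_i[3] < box_j[1] or box_j[3] < box_i[1])

def behind(spa_i, spa_j, box_i, box_j) -> bool:
    """Behind = (z_i > z_j) AND Overlap, with z from monocular depth."""
    return spa_i[2] > spa_j[2] and overlap(box_i, box_j)

# Object i is deeper than j and their image-plane boxes intersect.
assert behind((0.5, 0.5, 3.0), (0.5, 0.5, 1.0), (0, 0, 2, 2), (1, 1, 3, 3))
```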
Loss & Training¶
This method requires no training and is built entirely from pretrained components:
- gpt-4o-mini serves as the planner and semantic reasoner.
- gpt-4o serves as the code generator.
- SAM-2 for segmentation, DepthAnything-2 for spatial relations, OWLv2 for object detection, DINOv2 for visual feature extraction.
- Hyperparameters: temporal smoothing coefficient \(\alpha = 0.8\), tracking coefficient \(\lambda = 0.5\), default window size \(w = 6\).
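The paper reports \(\alpha = 0.8\) but not the exact update rule; a plausible reading is a standard exponential moving average over per-frame attributes, sketched below (the EMA form itself is an assumption):

```python
ALPHA = 0.8  # temporal smoothing coefficient reported by the paper

def smooth(prev: float, current: float, alpha: float = ALPHA) -> float:
    """Assumed EMA update: h_temp(t) = alpha * h(t) + (1 - alpha) * h_temp(t-1)."""
    return alpha * current + (1 - alpha) * prev

# Example: smoothing a noisy per-frame depth estimate for one object.
h = 2.0
for d in [2.4, 1.9, 2.1]:
    h = smooth(h, d)
print(round(h, 3))  # -> 2.077
```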
Key Experimental Results¶
Main Results — Video Reasoning Segmentation¶
A newly constructed benchmark comprising 200 videos and 895 implicit queries covers three reasoning types (semantic/spatial/temporal) and three difficulty levels (L1/L2/L3).
| Method | Sem.-L1 | Sem.-L3 | Spa.-L1 | Spa.-L3 | Tem.-L1 | Tem.-L3 |
|---|---|---|---|---|---|---|
| LISA-7B | 0.635 | 0.274 | 0.226 | 0.229 | 0.398 | 0.229 |
| LISA-13B | 0.669 | 0.301 | 0.258 | 0.234 | 0.237 | 0.177 |
| VISA | 0.563 | 0.432 | 0.521 | 0.411 | 0.354 | 0.218 |
| Ours | 0.865 | 0.810 | 0.789 | 0.741 | 0.721 | 0.690 |
The proposed method leads by a large margin across all categories and difficulty levels; the gains are largest in spatial reasoning (+0.268 \(\mathcal{J}\) over VISA at Spa.-L1) and temporal reasoning (+0.472 \(\mathcal{J}\) over VISA at Tem.-L3).
Ablation Study¶
| Model Selection | DT Update | Temporal Integration | Sem.-L1 | Spa.-L1 | Tem.-L1 |
|---|---|---|---|---|---|
| ✗ | ✓ | ✓ | 0.821 | 0.753 | 0.701 |
| ✓ | ✗ | ✓ | 0.831 | 0.721 | 0.675 |
| ✓ | ✓ | ✗ | 0.842 | 0.757 | 0.654 |
| ✓ | ✓ | ✓ | 0.865 | 0.789 | 0.721 |
LLM configuration ablation (semantic reasoning):
| Base LLM | LLM-coder | L1 | L2 | L3 |
|---|---|---|---|---|
| gpt-4o-mini | gpt-4o-mini | 0.832 | 0.804 | 0.801 |
| gpt-4o-mini | gpt-4o | 0.865 | 0.841 | 0.810 |
| gpt-4o | gpt-4o | 0.879 | 0.865 | 0.822 |
Key Findings¶
- Existing methods (LISA-13B) suffer a sharp performance drop from L1 to L3 (\(\mathcal{J}\): 0.669→0.301), whereas the proposed method remains stable (0.865→0.810), with less than 10% degradation across difficulty levels.
- The method also achieves state-of-the-art performance on the ReVOS benchmark (Overall \(\mathcal{J}\): 0.748 vs. VISA 0.488).
- It likewise achieves SOTA on image reasoning segmentation (ReasonSeg) (long query gIoU: 69.5 vs. LISA-13B 63.2).
- Disabling digital twin updates has the greatest impact on temporal reasoning; disabling temporal integration also significantly affects temporal reasoning performance.
Highlights & Insights¶
- Perception–Reasoning Decoupling: Avoids having the LLM directly process pixel-level visual information; specialist models are used to preserve fine-grained spatial and temporal details.
- "Just-in-Time" Digital Twin Concept: Scene representations are constructed on demand, balancing computational efficiency with information completeness.
- Fine-Tuning-Free Design: The modular architecture allows any LLM or vision model to be replaced with a better alternative at any time, minimizing maintenance cost.
- Online Processing Capability: Real-time frame-by-frame video stream processing makes the system suitable for practical deployment in embodied AI scenarios.
- Code-Generation-Based Reasoning: Spatial and temporal reasoning is converted into executable code, circumventing the limitations of LLMs in handling numerical computation.
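As a purely illustrative example of this last point (the paper does not publish its generated programs), the LLM-coder might emit something like the following toy temporal predicate over the windowed scene graphs:

```python
# Purely illustrative: the kind of program the LLM-coder might generate
# for "which objects started moving after step t0"; names are assumptions.
def started_moving_after(window, t0, eps=0.05):
    """Object ids moving after step t0 but not at or before it.
    `window` is a list of scene graphs, each mapping obj_id -> (vx, vy)."""
    def moving(graphs):
        return {oid for g in graphs for oid, (vx, vy) in g.items()
                if (vx * vx + vy * vy) ** 0.5 > eps}
    return moving(window[t0 + 1:]) - moving(window[:t0 + 1])

# Toy window: object 7 is static for two steps, then starts moving.
window = [{7: (0.0, 0.0)}, {7: (0.0, 0.0)}, {7: (0.3, 0.0)}]
print(started_moving_after(window, t0=1))  # -> {7}
```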
Limitations & Future Work¶
- Dependence on the GPT-4o API results in relatively high inference costs and latency.
- The robustness of the scene-graph-based digital twin representation under extreme conditions such as occlusion and rapid motion is not thoroughly discussed.
- The benchmark is of moderate scale (200 videos, 895 queries); validation at larger scale remains to be conducted.
- Errors in the planning phase may cascade and affect subsequent execution steps in an unrecoverable manner.
- The sliding window size is fixed at 6 frames, which may be insufficient for queries with very long temporal dependencies.
Related Work & Insights¶
- LISA pioneered the embedding-as-mask paradigm, but its single-token design limits multi-step reasoning.
- VISA was the first to extend RS to the video domain, but frame sampling may cause critical temporal information to be missed.
- The digital twin concept is borrowed from the industrial/robotics domain and introduced into computer vision, representing a meaningful cross-domain transfer.
- Using LLMs as planners and reasoners rather than end-to-end perception models constitutes a more flexible and scalable system design paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The "Just-in-Time Digital Twin" concept is novel; the perception–reasoning-decoupled agent design is pioneering in the context of video RS.
- Experimental Thoroughness: ⭐⭐⭐⭐ The newly constructed benchmark covers three reasoning types and three difficulty levels; multi-dataset evaluation and detailed ablations are provided.
- Writing Quality: ⭐⭐⭐⭐ The presentation is clear, the formalization is complete, and the system design is well described.
- Value: ⭐⭐⭐⭐⭐ The work makes an important contribution to embodied AI and video understanding; its design philosophy is broadly applicable.