
DynamicVL: Benchmarking MLLMs for Dynamic City Understanding

Conference: NeurIPS 2025
arXiv: 2505.21076
Code: GitHub
Area: Multimodal VLM
Keywords: Remote sensing imagery, urban dynamic understanding, multi-temporal analysis, vision-language benchmark, change detection

TL;DR

This paper proposes DVL-Suite, a framework comprising the DVL-Bench evaluation benchmark and the DVL-Instruct instruction-tuning dataset, covering 42 U.S. cities and 14,871 high-resolution multi-temporal remote sensing images. It systematically evaluates 18 MLLMs on long-term urban dynamic understanding and introduces DVLChat as a baseline model.

Background & Motivation

Remote sensing enables urban monitoring via satellite imagery, yet most existing studies are limited to bi-temporal comparisons and lack vision-language datasets spanning longer temporal ranges. Although MLLMs perform well on general visual understanding tasks, two key bottlenecks remain in multi-temporal remote sensing analysis: (1) the absence of temporally aligned vision-language datasets covering long time series, and (2) existing multi-temporal remote sensing MLLMs have been evaluated only on high-level semantic understanding, not on precise, pixel-level quantitative analysis.

Existing datasets (e.g., CDVQA, TEOChatlas, EarthDial) are either limited to bi-temporal settings, task-restricted, or low-resolution (224–512 pixels). DVL-Suite addresses these gaps by providing 1024×1024 high-resolution imagery, averaging 6.73–6.94 temporal frames per scene (2005–2023), and covering six task categories spanning pixel-level to scene-level analysis.

Method

Overall Architecture

DVL-Suite consists of two components:

  1. DVL-Bench: An evaluation benchmark containing 3,469 multi-temporal image sequences with 1,391 referring segmentation instructions, 5,854 QA pairs, and 1,437 comprehensive descriptions.
  2. DVL-Instruct: An instruction-tuning dataset with 63,771 text pairs and 11,402 multi-temporal images for training DVLChat.

Data are sourced from NAIP (National Agriculture Imagery Program) with a ground sample distance (GSD) of 1.0 m, covering 42 major U.S. cities.

Key Designs

Six-Task Taxonomy

The paper defines a hierarchical task framework covering urban dynamic understanding from fine-grained to global levels:

  • BCA (Basic Change Analysis): Identifies and compares multi-temporal land-use changes, covering 20 change events across 5 land cover types (vegetation, non-vegetation, water, buildings, playgrounds).
  • CSE (Change Speed Estimation): Tracks and quantifies temporal trends of urban elements (e.g., building expansion rate, vegetation loss); a sketch of this computation follows the list.
  • EA (Environmental Assessment): Evaluates urban livability and economic indicators through visual analysis.
  • RCD (Referring Change Detection): Performs dense reasoning and precise spatial localization of changed regions, requiring pixel-level segmentation.
  • RCC (Region Change Captioning): Generates detailed change descriptions for user-specified geographic regions.
  • DTC (Dense Temporal Captioning): Generates comprehensive reports documenting long-term temporal changes.
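
As a concrete illustration of the CSE task, the snippet below is a minimal, hypothetical sketch (not the paper's code) of how a change-speed answer could be derived from per-frame class masks at NAIP's 1 m GSD; the function name and signature are illustrative only.

```python
import numpy as np

def change_speed(masks: list[np.ndarray], years: list[int], gsd_m: float = 1.0) -> float:
    """Estimate the net change rate (m^2 per year) of one land-cover class
    from binary masks across a multi-temporal sequence (hypothetical helper)."""
    areas = [float(m.sum()) * gsd_m ** 2 for m in masks]      # pixel count -> area in m^2
    return (areas[-1] - areas[0]) / (years[-1] - years[0])    # net rate over the full span

# e.g. building masks for the 2005, 2014, and 2023 frames of one 1024x1024 scene:
# rate = change_speed([m2005, m2014, m2023], [2005, 2014, 2023])
```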

Data Annotation Pipeline

A semi-automatic annotation workflow is adopted:

  1. Urban experts perform base annotation (semantic change region segmentation, key frame identification).
  2. GPT-4.1 integrates expert annotations to generate diverse instructions.
  3. Three-round quality control: self-checking, cross-checking, and supervised review.
  4. BCA/CSE: Correct answers are computed from the segmentation masks; distractors are generated by perturbing the true value by ±20% and ±40% (see the sketch after this list).
  5. RCD: Domain experts design event-specific prompts with manual mask annotation.
  6. DTC/RCC: Annotators identify key frames → write stage-wise descriptions → GPT-4.1 refinement.
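
The ±20%/±40% rule in step 4 can be made concrete with a small sketch. This assumes the distractors are simple multiplicative perturbations of the computed value; the exact option formatting is not specified here.

```python
import random

def build_options(true_value: float, ndigits: int = 1) -> tuple[list[float], int]:
    """Sketch of the +/-20% and +/-40% distractor scheme: one correct option,
    four multiplicative perturbations, shuffled into a multiple-choice list."""
    options = [round(true_value * (1 + d), ndigits) for d in (0.0, 0.2, -0.2, 0.4, -0.4)]
    random.shuffle(options)
    return options, options.index(round(true_value, ndigits))

# e.g. for a building-area increase of 12.5% computed from the masks:
# opts, answer_idx = build_options(12.5)   # options drawn from {12.5, 15.0, 10.0, 17.5, 7.5}
```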

DVLChat Model Design

Built upon the LISA architecture with the following key designs:

  1. Dual LoRA Routing Mechanism: Prefix tokens route requests: a [QA] prefix activates the VQA LoRA and [SE] activates the change-detection LoRA, preventing inter-task interference (sketched below).
  2. Multi-temporal Image Interleaving: Image features from multiple temporal frames are interleaved prior to decoding to enable cross-temporal analysis.
  3. Segmentation Capability: The <SEG> token embedding is combined with image features from SAM's frozen visual backbone and decoded by SAM's unfrozen mask decoder to produce precise segmentation masks.

The underlying MLLM is Qwen2.5-VL, though the architecture is MLLM-agnostic.
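
A minimal sketch of the prefix-token routing and multi-temporal interleaving described above, assuming a PEFT-style model that holds two named adapters; the adapter names, prefix handling, and image placeholder format are illustrative rather than the authors' implementation.

```python
from peft import PeftModel  # assumes the base MLLM carries two adapters: "qa_lora" and "seg_lora"

QA_PREFIX, SEG_PREFIX = "[QA]", "[SE]"

def select_adapter(model: PeftModel, prompt: str) -> str:
    """Activate the VQA or change-detection LoRA according to the prefix token."""
    name = "seg_lora" if prompt.lstrip().startswith(SEG_PREFIX) else "qa_lora"
    model.set_adapter(name)  # only the routed adapter's weights are applied for this request
    return name

def interleave_frames(years: list[int]) -> str:
    """Illustrative multi-temporal interleaving: one image placeholder per frame,
    tagged with its acquisition year so the LLM can reason across time."""
    return " ".join(f"<image> (year {y})" for y in years)
```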

Loss & Training

  • Two independent LoRA modules are trained separately for VQA and segmentation tasks.
  • The QA branch uses instruction–ground-truth pairs from DVL-Instruct.
  • The segmentation branch uses mask annotations from the RCD task.
  • Training is conducted on 8 H100 GPUs.
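
The exact objective is not detailed here; as a hedged sketch assuming a LISA-style formulation, the QA branch would minimize the usual autoregressive cross-entropy, while the segmentation branch would add a per-pixel mask loss (binary cross-entropy plus Dice) on masks predicted from the <SEG> embedding. The weights below are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(mask_logits: torch.Tensor, gt_masks: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss over a batch of predicted mask logits and binary ground-truth masks."""
    pred = mask_logits.sigmoid().flatten(1)
    gt = gt_masks.flatten(1)
    inter = (pred * gt).sum(-1)
    return (1.0 - (2.0 * inter + eps) / (pred.sum(-1) + gt.sum(-1) + eps)).mean()

def seg_branch_loss(mask_logits, gt_masks, lm_ce_loss, w_txt=1.0, w_bce=2.0, w_dice=0.5):
    """Text loss plus mask loss; weights here are illustrative defaults, not the paper's."""
    mask_loss = w_bce * F.binary_cross_entropy_with_logits(mask_logits, gt_masks) \
              + w_dice * dice_loss(mask_logits, gt_masks)
    return w_txt * lm_ce_loss + mask_loss
```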

Key Experimental Results

Main Results

Table 1: QA Task Results (Accuracy %)

Model           | AVG  | BCA-Single | BCA-Multi | CSE-Single | CSE-Multi | EA
o4-mini         | 34.1 | 62.8       | 36.1      | 33.8       | 12.4      | 25.3
GPT-4.1         | 32.5 | 66.1       | 39.7      | 31.3       | 5.4       | 20.2
Qwen2.5-VL 32B  | 31.4 | 62.0       | 33.3      | 36.9       | 3.2       | 21.6
DVLChat 7B      | 33.3 | 64.9       | 21.3      | 31.3       | 18.6      | 30.6
TEOChat         | 17.2 | 35.1       | 8.7       | 17.0       | 10.8      | 14.6

Table 2: Captioning Task Results (0–5 scale)

Model          | RCC-AVG | DTC-AVG
o4-mini        | 4.58    | 4.14
GPT-4.1        | 4.46    | 3.98
DVLChat 7B     | 3.98    | 3.40
InternVL3 78B  | 3.92    | 3.33
TEOChat        | 1.66    | 1.45

Additional Analysis

  • Referring Change Detection: The specialized model ChangeMamba achieves 32.41% IoU; DVLChat achieves 29.06% (a gap of only 3.35%), outperforming LISA (13.85%) and PSALM (26.93%).
  • Non-monotonic Model Scaling: The Qwen2.5-VL series peaks at 31.4% with 32B parameters and drops to 29.7% at 72B; InternVL3 similarly declines after peaking at 14B — indicating that simply increasing parameter count is insufficient to improve precise change detection.

Key Findings

  1. The strongest commercial model, o4-mini, achieves only 34.1% overall QA accuracy, exposing severe deficiencies in long-sequence understanding and quantitative analysis among current MLLMs.
  2. CSE multi-choice accuracy peaks at only 13.6%, and the Change Rate Precision (CRP) consistently remains below 1.21, indicating models cannot capture fine-grained temporal changes.
  3. DVLChat at 7B surpasses general-purpose 72B–78B models on multiple tasks through domain-specific training data, demonstrating that domain data matters more than model scale.
  4. A significant gap exists between open-source and commercial models on captioning tasks (approximately 1-point difference in average DTC scores).

Highlights & Insights

  • The first long-sequence remote sensing VL benchmark spanning pixel-level to scene-level tasks, filling the gap in multi-temporal analysis.
  • The dual LoRA routing design elegantly integrates QA and segmentation capabilities within a single model without mutual interference.
  • The non-monotonic scaling phenomenon reveals a profound insight: improving general capability and domain-specific precise analysis capability requires fundamentally different strategies.
  • The semi-automatic annotation pipeline (domain experts + GPT-4.1) achieves a good balance between quality and efficiency.

Limitations & Future Work

  • NAIP imagery contains near-infrared band information that current MLLMs cannot effectively exploit.
  • DVLChat does not yet leverage pixel-level segmentation data to enhance cross-task numerical quantification capabilities.
  • DVLChat still lags behind commercial models in overall performance, calling for dedicated algorithms and larger model scales.
  • Coverage is limited to U.S. cities, lacking global geographic diversity.
  • Compared to existing multi-temporal remote sensing VL datasets such as TEOChat and EarthDial, DVL-Suite spans significantly longer temporal sequences (average 6.94 frames vs. 2.07 frames) at higher resolution (1024 vs. 224–512 pixels).
  • The dual LoRA routing mechanism is generalizable to other multi-task scenarios requiring the integration of understanding and segmentation.
  • The non-monotonic scaling phenomenon has implications for scaling law research.

Rating

  • Novelty: ⭐⭐⭐⭐ — First systematic long-sequence remote sensing VL benchmark with a comprehensive task taxonomy.
  • Technical Depth: ⭐⭐⭐ — The DVLChat architecture is straightforward but practical; the core contributions lie in the data and benchmark design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Evaluates 18 models with thorough multi-dimensional analysis.
  • Value: ⭐⭐⭐⭐ — Directly applicable to urban planning, disaster assessment, and related domains.