# DynamicVL: Benchmarking MLLMs for Dynamic City Understanding
- Conference: NeurIPS 2025
- arXiv: 2505.21076
- Code: GitHub
- Area: Multimodal VLM
- Keywords: Remote sensing imagery, urban dynamic understanding, multi-temporal analysis, vision-language benchmark, change detection
## TL;DR
This paper proposes DVL-Suite, a framework comprising the DVL-Bench evaluation benchmark and the DVL-Instruct instruction-tuning dataset, covering 42 U.S. cities and 14,871 high-resolution multi-temporal remote sensing images. It systematically evaluates 18 MLLMs on long-term urban dynamic understanding and introduces DVLChat as a baseline model.
## Background & Motivation
Remote sensing enables urban monitoring via satellite imagery, yet most existing studies are limited to bi-temporal comparisons and lack vision-language datasets spanning longer temporal ranges. Although MLLMs perform well on general visual understanding tasks, two key bottlenecks remain in multi-temporal remote sensing analysis: (1) the absence of temporally aligned vision-language datasets covering long time series, and (2) the fact that existing multi-temporal remote sensing MLLMs have been evaluated only on high-level semantic understanding, not on precise pixel-level quantitative analysis.
Existing datasets (e.g., CDVQA, TEOChatlas, EarthDial) are either limited to bi-temporal settings, task-restricted, or low-resolution (224–512 pixels). DVL-Suite addresses these gaps by providing 1024×1024 high-resolution imagery, averaging 6.73–6.94 temporal frames per scene (2005–2023), and covering six task categories spanning pixel-level to scene-level analysis.
## Method
### Overall Architecture
DVL-Suite consists of two components:
- DVL-Bench: An evaluation benchmark containing 3,469 multi-temporal image sequences with 1,391 referring segmentation instructions, 5,854 QA pairs, and 1,437 comprehensive descriptions.
- DVL-Instruct: An instruction-tuning dataset with 63,771 text pairs and 11,402 multi-temporal images for training DVLChat.
Data are sourced from NAIP (National Agriculture Imagery Program) with a GSD of 1.0 m, covering 42 major U.S. cities.
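For scale, a quick back-of-the-envelope check of one tile's ground footprint (standard GSD arithmetic; the ~1.05 km² figure is derived here, not quoted from the paper):

```python
# Back-of-the-envelope check: ground footprint of one 1024x1024 tile.
gsd_m = 1.0      # NAIP ground sample distance, metres per pixel
side_px = 1024   # tile width/height in pixels
area_km2 = (gsd_m * side_px / 1000) ** 2
print(f"each image covers about {area_km2:.2f} km^2")  # ~1.05 km^2
```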
### Key Designs
#### Six-Task Taxonomy
The paper defines a hierarchical task framework covering urban dynamic understanding from fine-grained to global levels:
- BCA (Basic Change Analysis): Identifies and compares multi-temporal land-use changes, covering 20 change events across 5 land cover types (vegetation, non-vegetation, water, buildings, playgrounds).
- CSE (Change Speed Estimation): Tracks and quantifies temporal trends of urban elements (e.g., building expansion rate, vegetation loss).
- EA (Environmental Assessment): Evaluates urban livability and economic indicators through visual analysis.
- RCD (Referring Change Detection): Performs dense reasoning and precise spatial localization of changed regions, requiring pixel-level segmentation.
- RCC (Region Change Captioning): Generates detailed change descriptions for user-specified geographic regions.
- DTC (Dense Temporal Captioning): Generates comprehensive reports documenting long-term temporal changes.
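For reference, the taxonomy can be summarized as a small Python mapping; the task codes are from the paper, while the output modality of each task is inferred from the descriptions above and the result tables below:

```python
# Output modality of each DVL-Bench task (codes from the paper; modalities
# inferred from the task descriptions and the result tables).
DVL_TASKS = {
    "BCA": ("Basic Change Analysis",      "multiple-choice QA"),
    "CSE": ("Change Speed Estimation",    "multiple-choice QA"),
    "EA":  ("Environmental Assessment",   "multiple-choice QA"),
    "RCD": ("Referring Change Detection", "pixel-level segmentation mask"),
    "RCC": ("Region Change Captioning",   "free-form caption"),
    "DTC": ("Dense Temporal Captioning",  "free-form caption"),
}
```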
#### Data Annotation Pipeline
A semi-automatic annotation workflow is adopted:
- Urban experts perform base annotation (semantic change region segmentation, key frame identification).
- GPT-4.1 integrates expert annotations to generate diverse instructions.
- Three-round quality control: self-checking, cross-checking, and supervised review.
- BCA/CSE: Correct answers are computed from segmentation masks; distractors are generated at ±20% and ±40% of the true value (see the sketch after this list).
- RCD: Domain experts design event-specific prompts with manual mask annotation.
- DTC/RCC: Annotators identify key frames → write stage-wise descriptions → GPT-4.1 refinement.
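Below is a minimal sketch of how the BCA/CSE answer options could be constructed from a mask-derived quantity. The ±20%/±40% offsets come from the paper; the function name, rounding, and five-option layout are assumptions for illustration:

```python
import random

def make_options(true_value: float) -> list[float]:
    """One correct answer plus distractors at +/-20% and +/-40% (hypothetical 5-option layout)."""
    options = [round(true_value * (1 + o), 2) for o in (0.0, -0.4, -0.2, 0.2, 0.4)]
    random.shuffle(options)  # avoid positional bias in the answer key
    return options

print(make_options(12.5))  # e.g. [15.0, 12.5, 7.5, 10.0, 17.5]
```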
#### DVLChat Model Design
Built upon the LISA architecture with two key modifications:
- Dual LoRA Routing Mechanism: Prefix tokens route each request: `[QA]` activates the VQA LoRA and `[SE]` activates the change detection LoRA, preventing inter-task interference.
- Multi-temporal Image Interleaving: Image features from multiple temporal frames are interleaved prior to decoding to enable cross-temporal analysis.
- Segmentation Capability: `<SEG>` token embeddings are decoded through SAM's frozen visual backbone and unfrozen decoder to generate precise segmentation masks.
The underlying MLLM is Qwen2.5-VL, though the architecture is MLLM-agnostic.
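A minimal sketch of the prefix-token routing, written against Hugging Face PEFT's multi-adapter API; the adapter names and the exact prefix handling are assumptions, not the paper's released implementation:

```python
from peft import PeftModel

# Hypothetical adapter names; the paper's summary does not specify them.
QA_ADAPTER, SEG_ADAPTER = "qa_lora", "seg_lora"

def route_and_strip(model: PeftModel, instruction: str) -> str:
    """Switch the active LoRA based on the instruction's prefix token."""
    if instruction.startswith("[QA]"):
        model.set_adapter(QA_ADAPTER)    # VQA branch
        return instruction[len("[QA]"):].lstrip()
    if instruction.startswith("[SE]"):
        model.set_adapter(SEG_ADAPTER)   # change detection / segmentation branch
        return instruction[len("[SE]"):].lstrip()
    raise ValueError("expected a [QA] or [SE] prefix token")
```

Because each request carries exactly one prefix token, only one adapter's weights are active per forward pass, which is what prevents the inter-task interference noted above.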
### Loss & Training
- Two independent LoRA modules are trained separately for the VQA and segmentation tasks (a configuration sketch follows this list).
- The QA branch uses instruction–ground-truth pairs from DVL-Instruct.
- The segmentation branch uses mask annotations from the RCD task.
- Training is conducted on 8 H100 GPUs.
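A configuration sketch of the two independent adapters using PEFT; every hyperparameter below (rank, alpha, dropout, target modules) is an illustrative assumption, since this summary does not list the paper's values:

```python
from peft import LoraConfig, get_peft_model

# All values below are illustrative assumptions, not the paper's hyperparameters.
def make_lora_config() -> LoraConfig:
    return LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

# Two independent adapters sharing one frozen base model, e.g.:
#   model = get_peft_model(base_model, make_lora_config(), adapter_name="qa_lora")
#   model.add_adapter("seg_lora", make_lora_config())
```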
## Key Experimental Results
### Main Results
Table 1: QA Task Results (Accuracy %)
| Model | AVG | BCA-Single | BCA-Multi | CSE-Single | CSE-Multi | EA |
|---|---|---|---|---|---|---|
| o4-mini | 34.1 | 62.8 | 36.1 | 33.8 | 12.4 | 25.3 |
| GPT-4.1 | 32.5 | 66.1 | 39.7 | 31.3 | 5.4 | 20.2 |
| Qwen2.5-VL 32B | 31.4 | 62.0 | 33.3 | 36.9 | 3.2 | 21.6 |
| DVLChat 7B | 33.3 | 64.9 | 21.3 | 31.3 | 18.6 | 30.6 |
| TEOChat | 17.2 | 35.1 | 8.7 | 17.0 | 10.8 | 14.6 |
Table 2: Captioning Task Results (0–5 scale)
| Model | RCC-AVG | DTC-AVG |
|---|---|---|
| o4-mini | 4.58 | 4.14 |
| GPT-4.1 | 4.46 | 3.98 |
| DVLChat 7B | 3.98 | 3.40 |
| InternVL3 78B | 3.92 | 3.33 |
| TEOChat | 1.66 | 1.45 |
### Ablation Study
- Referring Change Detection: The specialized model ChangeMamba achieves 32.41% IoU; DVLChat reaches 29.06% (a gap of only 3.35 points), outperforming LISA (13.85%) and PSALM (26.93%); see the IoU sketch after this list.
- Non-monotonic Model Scaling: The Qwen2.5-VL series peaks at 31.4% with 32B parameters and drops to 29.7% at 72B; InternVL3 similarly declines after peaking at 14B — indicating that simply increasing parameter count is insufficient to improve precise change detection.
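For reference, the IoU metric behind the RCD comparison is the standard mask intersection-over-union; a minimal NumPy version (not the paper's evaluation code):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary change masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: count as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```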
### Key Findings
- The strongest commercial model, o4-mini, achieves only 34.1% overall QA accuracy, exposing severe deficiencies in long-sequence understanding and quantitative analysis among current MLLMs.
- Accuracy on CSE-Multi peaks at only 13.6% among general-purpose models, and Change Rate Precision (CRP) consistently remains below 1.21, indicating that current models cannot capture fine-grained temporal change rates.
- DVLChat at 7B surpasses general-purpose 72B–78B models on multiple tasks through domain-specific training data, demonstrating that domain data matters more than model scale.
- A significant gap exists between open-source and commercial models on captioning tasks (approximately 1-point difference in average DTC scores).
## Highlights & Insights
- The first long-sequence remote sensing VL benchmark spanning pixel-level to scene-level tasks, filling the gap in multi-temporal analysis.
- The dual LoRA routing design elegantly integrates QA and segmentation capabilities within a single model without mutual interference.
- The non-monotonic scaling phenomenon points to a deeper insight: improving general capability and improving domain-specific precise analysis require fundamentally different strategies.
- The semi-automatic annotation pipeline (domain experts + GPT-4.1) achieves a good balance between quality and efficiency.
## Limitations & Future Work
- NAIP imagery contains near-infrared band information that current MLLMs cannot effectively exploit.
- DVLChat does not yet leverage pixel-level segmentation data to enhance cross-task numerical quantification capabilities.
- DVLChat still lags behind commercial models in overall performance, suggesting the need for dedicated algorithms and larger model scales.
- Coverage is limited to U.S. cities, lacking global geographic diversity.
## Related Work & Insights
- Compared to existing multi-temporal remote sensing VL datasets such as TEOChat and EarthDial, DVL-Suite spans significantly longer temporal sequences (average 6.94 frames vs. 2.07 frames) at higher resolution (1024 vs. 224–512 pixels).
- The dual LoRA routing mechanism is generalizable to other multi-task scenarios requiring the integration of understanding and segmentation.
- The non-monotonic scaling phenomenon has implications for scaling law research.
## Rating
- Novelty: ⭐⭐⭐⭐ — First systematic long-sequence remote sensing VL benchmark with a comprehensive task taxonomy.
- Technical Depth: ⭐⭐⭐ — The DVLChat architecture is straightforward but practical; the core contributions lie in the data and benchmark design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Evaluates 18 models with thorough multi-dimensional analysis.
- Value: ⭐⭐⭐⭐ — Directly applicable to urban planning, disaster assessment, and related domains.