GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models¶
Conference: ICLR 2026 · arXiv: 2510.07791 · Code: GitHub
Area: Spatial-Temporal Intelligence / Vision-Language Model Evaluation
Keywords: geo-temporal reasoning, vision-language models, multi-camera networks, benchmark, spatial-temporal intelligence
TL;DR¶
This paper introduces GTR-Bench, a novel benchmark for geo-temporal reasoning about moving targets in large-scale camera networks. Evaluation reveals that even the strongest model, Gemini-2.5-Pro (34.93%), falls far short of human performance (78.61%), exposing three critical deficiencies in current VLMs: imbalanced utilization of spatial-temporal context, weak temporal prediction capability, and insufficient map-video alignment.
Background & Motivation¶
Spatial-temporal intelligence as a core capability: Spatial intelligence underlies human interaction with the physical world. Its extension—spatial-temporal intelligence—is critical for autonomous driving, embodied AI, and related domains, encompassing spatial attributes (size, distance), temporal attributes (time intervals, speed), and reasoning over dynamic events.
Limitations of existing benchmarks: Current geographic reasoning benchmarks (e.g., ReasonMap) focus solely on static geometric tasks and graphical contexts (e.g., subway maps), while spatial-temporal reasoning benchmarks (e.g., VSI-Bench, STI-Bench) primarily adopt egocentric perspectives from single or few cameras, relying on image/video contexts.
Absence of geo-level spatial-temporal reasoning evaluation: No existing benchmark evaluates VLMs' ability to perform geo-temporal reasoning by jointly utilizing graphical context (maps) and multi-view video observations within large-scale camera networks.
Pressing practical demand: Real-world applications such as traffic management and emergency response require comprehensive spatial-temporal analysis across multiple camera views, including vehicle/pedestrian trajectory reasoning and traffic flow prediction.
Uniqueness of the new challenge: Geo-temporal reasoning (GTR) demands repeated perspective switching between maps and videos, joint reasoning across multiple non-overlapping fields of view, and inference over spatial-temporal regions unobserved in any video.
Cognitive science perspective: Traditional spatial-temporal intelligence covers only first-person (egocentric) and third-person (allocentric) perspectives, whereas the geographic perspective can provide VLMs with omniscient understanding of dynamic objects.
Method¶
Overall Architecture¶
GTR-Bench is a hierarchical geo-temporal reasoning benchmark comprising 3 basic reasoning tasks and 4 combinatorial reasoning tasks, with a total of 420 questions and 364 video clips. The benchmark covers two real-world scenarios: outdoor (CityFlow dataset, vehicles) and indoor (MTMMC dataset, pedestrians), each contributing 210 questions.
Basic Tasks:
- Geo-Location (GL): Given start and end locations, infer the intermediate location (camera) that the target passes through.
- Arrival Time-Interval (ATI): Given start/end points and an intermediate location, infer the time interval at which the target arrives at the intermediate location.
- Motion-State (MS): Given start/end points and an intermediate location, infer the target's motion state (direction, speed, distance) at the intermediate location.
Combinatorial Tasks:
- Causal Reordering (CR): Given unordered video clips and a map, determine the correct temporal order in which the target passes through cameras.
- Next Spot Forecasting (NSF): Given the last observation and a map, predict the next camera location and the arrival time interval.
- Trajectory Forecasting (TF): Based on multiple historical observations, predict the complete future trajectory (camera sequence and time intervals).
- Multi-Target Trajectory Forecasting (MTTF): Predict the future meeting point (location and time) of two distinct targets.
Key Designs¶
Evaluation Metric Design¶
- Function: Design appropriate evaluation metrics for different tasks.
- Design Motivation: Basic tasks and CR are standard multiple-choice questions, but forecasting tasks require simultaneous assessment of spatial correctness and temporal precision.
- Mechanism:
- Basic tasks + CR: Standard MCQ accuracy.
  - Forecasting tasks (NSF/TF/MTTF): Propose ST-IoU (Spatial-Temporal IoU), integrating spatial accuracy (whether the camera ID is correct) and temporal IoU (intersection-over-union of the predicted and ground-truth time intervals): \(\text{ST-IoU} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}(C_{p_i}=C_{gt_i}) \times \frac{|T_{p_i} \cap T_{gt_i}|}{|T_{p_i} \cup T_{gt_i}|}\)
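The ST-IoU definition above can be sketched in a few lines. This is a minimal illustration assuming each prediction and ground truth is a (camera ID, time interval) pair; the function name and data layout are this sketch's own conventions, not the authors' implementation:

```python
def st_iou(preds, gts):
    """Mean Spatial-Temporal IoU over N trajectory points.

    preds, gts: lists of (camera_id, (t_start, t_end)) tuples.
    A point contributes its temporal IoU only when the predicted
    camera ID matches the ground truth (the spatial indicator).
    """
    assert len(preds) == len(gts)
    total = 0.0
    for (cam_p, (s_p, e_p)), (cam_g, (s_g, e_g)) in zip(preds, gts):
        if cam_p != cam_g:
            continue  # indicator is 0: wrong camera contributes nothing
        inter = max(0.0, min(e_p, e_g) - max(s_p, s_g))
        union = (e_p - s_p) + (e_g - s_g) - inter
        if union > 0:
            total += inter / union
    return total / len(gts)
```

The multiplicative form means a correct camera with a disjoint time interval still scores zero, which is exactly why the metric punishes the weak temporal prediction discussed later.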
Spatial-Temporal Complexity Grading¶
- Function: Categorize tasks into Long/Medium/Short levels according to geo-temporal complexity.
- Design Motivation: Ensure evaluation covers different spatial and temporal scales, prioritizing dynamic cues over static background.
- Mechanism: Levels are determined by physical thresholds on trajectory length (\(track_d\)) and duration (\(track_t\)), with distinct thresholds for indoor and outdoor settings (shorter time but longer distance outdoors due to driving scenarios), ensuring balanced distribution across the three complexity levels.
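The threshold-based grading can be sketched as follows. The cut-off values here are illustrative placeholders (the summary does not state the paper's actual thresholds); only the structure — separate indoor/outdoor cuts, with outdoor using longer distances but shorter times — follows the description above:

```python
def complexity_level(track_d, track_t, outdoor=True):
    """Grade a trajectory into Short/Medium/Long by its length
    track_d (meters) and duration track_t (seconds).

    Threshold values are hypothetical; outdoor cuts assume driving
    scenarios (longer distance, shorter time than indoor walking).
    """
    d_cuts = (200.0, 600.0) if outdoor else (20.0, 60.0)
    t_cuts = (30.0, 90.0) if outdoor else (60.0, 180.0)
    # Long if either dimension exceeds its upper cut; Medium if either
    # exceeds its lower cut; otherwise Short.
    if track_d > d_cuts[1] or track_t > t_cuts[1]:
        return "Long"
    if track_d > d_cuts[0] or track_t > t_cuts[0]:
        return "Medium"
    return "Short"
```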
Benchmark Construction Pipeline¶
- Function: Automatically convert raw video data into standardized questions.
- Design Motivation: Accommodate the temporal, geographic, and formatting requirements of different tasks.
- Mechanism:
- Data Preprocessing: Video segmentation → camera calibration (homography matrix) → trajectory projection onto map → motion parameter computation (speed, direction) → data cleaning and validation → LLM-generated motion descriptions.
- Task Construction: Trajectory sampling → information integration (map + video + template) → question-answer formation → distractor generation (sampling from different building areas, algorithmically synthesizing spurious cameras, randomizing IDs).
- Quality Control: Two-stage human screening—Stage 1 ensures question diversity and removes instances with large trajectory errors; Stage 2 involves expert verification of answers and coverage of reasonable difficulty levels.
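The trajectory-projection step in the preprocessing chain uses a per-camera homography. A minimal sketch of that projection, assuming a 3×3 homography `H` has already been estimated during calibration (the function name and array layout are this sketch's assumptions):

```python
import numpy as np

def project_to_map(H, pixel_points):
    """Map pixel coordinates to map-plane coordinates via a 3x3
    homography H (assumed from camera calibration).

    pixel_points: (N, 2) array of (u, v) image coordinates.
    Returns an (N, 2) array of map coordinates.
    """
    pts = np.asarray(pixel_points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]  # perspective divide
```

Applying this per frame to a target's detected foot point yields the map trajectory from which speed and direction are then computed.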
Loss & Training¶
This paper presents a benchmark study and does not involve model training. Evaluation settings are as follows:
- Videos are uniformly sampled, with total frames across all videos capped at 20.
- Temperature = 0.1, max_new_tokens = 16384.
- Open-source models are deployed via LMDeploy on 8 NVIDIA V100 GPUs.
- Traditional ReID methods are included as comparison baselines.
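Uniform sampling under a shared 20-frame budget across multiple clips can be sketched as below. The proportional-to-length allocation is an assumption of this sketch, not a detail stated in the paper:

```python
def sample_frame_indices(clip_lengths, budget=20):
    """Pick uniformly spaced frame indices from each clip so the
    total across all clips stays within `budget` (20 in the paper's
    setting). Frames are allocated proportionally to clip length
    (an assumption), with at least one frame per clip.
    """
    total = sum(clip_lengths)
    alloc = [max(1, round(budget * n / total)) for n in clip_lengths]
    # Trim any rounding overshoot from the largest allocations first.
    while sum(alloc) > budget:
        alloc[alloc.index(max(alloc))] -= 1
    indices = []
    for n, k in zip(clip_lengths, alloc):
        step = n / k
        # Center each sample within its stride of the clip.
        indices.append([int(step * i + step / 2) for i in range(k)])
    return indices
```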
Key Experimental Results¶
Main Results¶
Selected rows from the full 13-model leaderboard (PM = proprietary model, OM = open-source model; ranks follow the full leaderboard):
| Model | Type | Rank | GL(Out/In) | ATI(Out/In) | MS(Out/In) | CR(Out/In) | NSF(Out/In) | TF(Out/In) | MTTF(Out/In) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | PM | 1 | 60.0/63.3 | 46.7/13.3 | 33.3/26.7 | 56.7/70.0 | 19.1/25.1 | 13.2/28.1 | 19.2/14.4 | 34.93 |
| GPT-5 | PM | 2 | 53.3/60.0 | 76.7/30.0 | 40.0/43.3 | 40.0/86.2 | 12.0/11.3 | 12.1/2.6 | 7.3/1.8 | 34.05 |
| Claude-4-Sonnet | PM | 3 | 73.3/66.7 | 50.0/33.3 | 50.0/43.3 | 63.3/58.6 | 8.1/2.6 | 6.2/4.0 | 16.9/0.0 | 34.03 |
| InternVL3-38B | OM | 5 | 40.0/50.0 | 73.3/56.7 | 30.0/26.7 | 53.3/37.9 | 8.3/11.1 | 8.2/4.4 | 20.6/10.2 | 30.76 |
| Qwen2.5-VL-32B | OM | 6 | 43.3/33.3 | 60.0/56.7 | 33.3/43.3 | 66.7/70.0 | 0.7/3.3 | 0.0/0.0 | 15.7/0.0 | 30.45 |
| Human | - | - | 90.0/98.2 | 84.3/90.8 | 90.9/89.5 | 89.8/97.4 | 68.3/74.6 | 51.2/57.4 | 55.8/62.5 | 78.61 |
Ablation Study¶
Spatial Reasoning vs. Spatial-Temporal Reasoning (MCQ Acc vs. ST-IoU):
| Model | NSF-MCQ/ST-IoU(Out) | TF-MCQ/ST-IoU(Out) | MTTF-MCQ/ST-IoU(Out) | NSF-MCQ/ST-IoU(In) |
|---|---|---|---|---|
| GPT-4o | 53.3/20.5 | 41.7/0.0 | 76.7/23.1 | 30.0/13.0 |
| Gemini-2.5-Pro | 38.5/19.1 | 45.5/13.2 | 51.7/19.2 | 43.3/25.1 |
| GPT-5 | 73.3/12.0 | 58.3/12.1 | 83.3/7.3 | 50.0/11.3 |
| GLM-4.1V-9B | 40.0/10.3 | 30.0/0.0 | 76.7/25.4 | 10.3/2.9 |
MCQ accuracy is generally much higher than ST-IoU, indicating that models can roughly localize spatial positions but fail to handle temporal constraints. GPT-5 achieves 83.3% MCQ accuracy on MTTF but only 7.3% ST-IoU, a gap of 76 percentage points.
Key Findings¶
- Large human-machine gap: The strongest model, Gemini-2.5-Pro (34.93%), lags behind humans (78.61%) by 43.68 percentage points; open-source models average only 23.82%.
- Sharp performance drop from basic to combinatorial tasks: Models perform reasonably on basic tasks (GL and ATI reaching 60–76%), but ST-IoU on combinatorial forecasting tasks (NSF/TF/MTTF) is generally below 30%, with many open-source models approaching 0.
- Outdoor vs. indoor discrepancy: Most models perform better outdoors (clearer spatial cues, more regular motion patterns), but Gemini-2.5-Pro anomalously excels indoors, possibly because advanced models leverage reasoning more effectively in complex scenes.
- Imbalanced spatial-temporal context utilization: Top-tier models (e.g., Gemini-2.5-Pro) utilize spatial, temporal, and motion-state context in a balanced manner, whereas open-source models (e.g., InternVL3-38B) exhibit notably weaker temporal reasoning.
- Temporal prediction as the bottleneck: All models demonstrate substantially stronger spatial localization than temporal prediction, with a large gap between MCQ Acc and ST-IoU (e.g., 76 percentage points for GPT-5).
Highlights & Insights¶
- Original task formulation: This work is the first to extend spatial-temporal reasoning to geo-level large-scale camera networks, introducing joint reasoning over maps and multi-view videos—more representative of real-world applications than conventional single-video egocentric reasoning.
- Elegant ST-IoU metric design: The multiplicative combination of spatial accuracy and temporal IoU enables a single metric to assess joint spatial-temporal prediction quality.
- Hierarchical task structure: The basic-to-combinatorial progression precisely pinpoints capability bottlenecks in models.
- In-depth three-deficiency analysis: Beyond reporting performance numbers, the paper reveals fundamental shortcomings in current VLMs' spatial-temporal intelligence through context utilization analysis, MCQ vs. ST-IoU comparisons, and failure case studies.
- Inclusion of ReID baselines: Traditional ReID methods (45.72%) outperform most VLMs on forecasting tasks, indicating that current VLMs still fall short of dedicated visual feature matching.
Limitations & Future Work¶
- Limited data scale: Although carefully constructed, 420 questions may be insufficient to comprehensively assess model performance across more diverse scenarios.
- Video sampling constraints: Capping total frames at 20 may discard important temporal information in videos, disadvantaging models that rely on dense frames.
- Only two scenario types: Coverage is limited to outdoor vehicles and indoor pedestrians, excluding other settings (e.g., UAV perspectives, maritime scenarios).
- Lack of improvement proposals: The paper identifies problems but does not propose targeted solutions or model improvement directions (e.g., fine-tuning, prompt engineering).
- Simplified map representation: Maps are presented in simplified form, without involving more complex real-world map data (e.g., high-definition maps, 3D building models).
- Scalability: Future work could extend to scenarios with more cameras (>31), longer time spans, and more diverse target types.
Related Work & Insights¶
- ReasonMap / SpatialLLM: Static geographic reasoning benchmarks handling only graphical contexts—inspiring GTR to incorporate dynamic targets into geographic reasoning.
- STI-Bench / VSI-Bench / ST-VLM: Egocentric spatial-temporal reasoning benchmarks—demonstrating the necessity of extending from single-view to multi-camera network reasoning.
- CityFlow / MTMMC: Multi-camera tracking datasets—GTR-Bench repurposes their real-world trajectory data to construct higher-level reasoning tasks.
- Implications for future research: Promising directions include (1) designing dedicated temporal reasoning modules for VLMs, (2) developing map-video alignment pre-training strategies, and (3) leveraging graph structures to model camera network topology.
Rating¶
| Dimension | Score | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | First to define the geo-temporal reasoning (GTR) task, extending VLM evaluation to multi-camera networks with a novel problem formulation. |
| Experimental Thoroughness | ⭐⭐⭐⭐ | Evaluates 13 mainstream VLMs plus human and ReID baselines, with rich analysis dimensions (context utilization, spatial-temporal comparison, failure cases). |
| Writing Quality | ⭐⭐⭐⭐ | Clear structure, well-defined tasks, rich tables and figures, though some analyses could be more in-depth. |
| Value | ⭐⭐⭐⭐ | Reveals critical bottlenecks in VLMs' spatial-temporal intelligence, with important reference value for autonomous driving and intelligent surveillance. |