USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning Capabilities of LLMs as Urban Agents
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=ETzBStUFJy
Code: https://github.com/usail-hkust/UAgentEnv
Area: LLM Reasoning
Keywords: Spatiotemporal Reasoning, Urban Agents, Process-level Evaluation, Reflective Reasoning, Benchmark
TL;DR
USTBench decomposes the spatiotemporal reasoning capabilities of LLMs acting as urban agents into four process dimensions: Understanding → Prediction → Planning → Reflection. Within the interactive urban environment UAgentEnv, the authors construct 62,466 structured QAs and 9 real-world urban downstream tasks. Evaluating 14 mainstream LLMs reveals that while they perform well in understanding and prediction, they generally struggle with long-range planning and reflection. Furthermore, models specifically trained for reasoning (e.g., DeepSeek-R1) do not consistently outperform standard models in urban tasks.
Background & Motivation
Background: Urban systems (transportation, population flow, planning) are inherently spatiotemporally intertwined and dynamic. While traditional data-driven methods have progressed in prediction and decision-making, they suffer from poor generalization to unseen scenarios and opaque reasoning processes. Recent work has begun utilizing LLMs as "urban agents"—leveraging their ability to integrate multi-source information, adapt across tasks, and provide interpretable natural language reasoning for tasks like traffic light control, congestion prediction, and route planning.
Limitations of Prior Work: Existing urban LLM evaluations (STBench, CityBench, CityGPT, UrbanPlanBench) focus almost exclusively on outcome-level metrics, such as prediction accuracy or traffic efficiency, and ignore the underlying reasoning process. This obscures critical reasoning flaws. For instance, the paper highlights that in congestion prediction, the reasoning model DeepSeek-R1 slightly underperforms the standard Llama3.3. Only process-level dissection reveals that DeepSeek-R1 is weak at understanding and predicting temporal trends. Without fine-grained evaluation, such anomalies remain unexplained.
Key Challenge: Urban tasks require multi-step spatiotemporal reasoning, yet evaluations often provide only a final score. Moreover, urban environments are real-time and provide feedback (as traffic patterns shift). The capability for reflection—associating "previous action → observed consequence" into a causal chain to adjust subsequent reasoning—is vital for agents, yet existing benchmarks fail to evaluate this dimension.
Goal: To build a benchmark capable of "dissecting" the spatiotemporal reasoning process of LLMs, answering "at which step does reasoning succeed or fail," while maintaining standardized end-to-end task comparisons.
Key Insight: Explicitly decompose urban spatiotemporal reasoning into four processes within an agent-environment interaction loop: Understanding (interpreting spatial structure and temporal patterns) → Prediction (inferring future states) → Planning (selecting optimal long-term actions) → Reflection (using feedback for error correction and improvement). Each process is evaluated via individual QAs to locate weaknesses and study dependencies among them.
Core Idea: Utilize a dual-layer framework of "Process-level QA Diagnosis + End-to-end Task Evaluation," paired with an interactive environment, UAgentEnv, that generates realistic urban observations. This transforms the evaluation of urban LLM agents from "black-box scoring" into "step-by-step dissection."
Method
Overall Architecture
USTBench takes five categories of real urban data (geospatial OSM, traffic flow, socioeconomic GDP/population, human mobility trajectories, POI check-ins) as input and outputs dual-layer diagnostic results of LLM spatiotemporal reasoning capabilities. The system consists of two components: the underlying UAgentEnv (providing unified interaction for 9 real-world tasks) and the USTBench evaluation protocol (process-level QA + end-to-end tasks).
The pipeline operates as follows: UAgentEnv encapsulates real urban data into per-task "observations" (spatial structures verbalized as sparse adjacency matrices; temporal dynamics as discrete time series). LLM agents process these observations within a modular workflow of "Understanding → Prediction → Planning," producing actions or predictions and storing experiences in memory. The evaluation side operates in parallel: it partitions the interaction process into 62,466 structured QAs for process-level diagnosis and conducts end-to-end evaluation on 9 real tasks using domain metrics. Finally, 14 mainstream LLMs (paired as reasoning vs. non-reasoning, e.g., Qwen2.5-32B vs. QwQ-32B) are evaluated, followed by a capability-wise ablation that analyzes dependencies among the processes.
graph TD
A["Five Types of Real Urban Data<br/>Geo/Traffic/Socio-Econ/Mobility/POI"] --> B["1. UAgentEnv Interaction Environment<br/>9 Tasks + Unified Agent Framework"]
B --> C["Observation Construction<br/>Spatial→Matrix / Temporal→Series"]
C --> D["2. 4D Process Decomposition<br/>Understand→Predict→Plan→Reflect Loop"]
D -->|62,466 Structured QAs| E["3. Dual-layer Evaluation<br/>Process-level Diagnosis + End-to-end Tasks"]
D -->|9 Real-world Tasks| E
E --> F["14 LLM Diagnosis +<br/>Capability Ablation"]
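The observation-construction step can be illustrated with a minimal sketch. The text format, function names, and units below are assumptions made for illustration; the paper states only that spatial structure is verbalized as a sparse adjacency matrix and temporal dynamics as a discrete time series.

```python
from typing import Dict, List, Tuple


def verbalize_sparse_adjacency(edges: List[Tuple[str, str, float]]) -> str:
    """List only the non-zero adjacency entries, one directed edge per line."""
    return "\n".join(f"{u} -> {v}: {w:g} km" for u, v, w in sorted(edges))


def verbalize_time_series(name: str, values: List[float], step_minutes: int) -> str:
    """Render a discrete series as 't+<offset>min: value' pairs."""
    points = [f"t+{i * step_minutes}min: {v:g}" for i, v in enumerate(values)]
    return f"{name}: " + ", ".join(points)


def build_observation(
    edges: List[Tuple[str, str, float]],
    series: Dict[str, List[float]],
    step_minutes: int = 5,
) -> str:
    """Combine spatial and temporal context into one text observation."""
    parts = ["Road network (sparse adjacency):", verbalize_sparse_adjacency(edges)]
    parts.append("Temporal dynamics:")
    for name in sorted(series):
        parts.append(verbalize_time_series(name, series[name], step_minutes))
    return "\n".join(parts)


# Hypothetical toy data, not taken from the benchmark.
obs = build_observation(
    edges=[("A", "B", 1.2), ("B", "C", 0.8)],
    series={"queue_len_B": [3, 5, 9]},
)
print(obs)
```

Listing only existing edges keeps the prompt length proportional to the number of roads rather than to the square of the number of intersections, which is presumably why a sparse form is used.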
Key Designs
1. UAgentEnv: An Interactive Environment and Unified Agent Framework for Nine Urban Tasks
Diagnosing reasoning requires a stable source of real urban observations and a unified interface that keeps different tasks comparable. UAgentEnv integrates public real-world data across five dimensions (OSM geography, historical traffic, 2000–2019 global GDP/population, NYC taxi trajectories, FourSquare check-ins), covering 4 prediction tasks (Next POI, Congestion Prediction, Socioeconomic Indicators, Traffic OD) and 5 decision-making tasks (Traffic Light Control, POI Siting, Route Planning, Road Planning, Urban Planning).
Crucially, it applies a unified agent framework to all tasks: each task provides a description, data schema, and domain knowledge, feeding real-time dynamics as context. The agent follows a modular workflow, retrieving history from memory. Upon receiving environmental feedback, it performs reflection to diagnose errors and writes useful experiences back to memory. This "Perception → Reasoning → Action → Reflection → Memory" loop allows process-level QAs to be naturally extracted from real interactions.
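A minimal sketch of this loop is shown below. The class names, prompt tags, and the stub LLM are hypothetical and are not the UAgentEnv API; the sketch only shows how each interaction step leaves behind an (observation, action, feedback, lesson) record from which process-level QAs could be drawn.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experience:
    observation: str
    action: str
    feedback: float
    lesson: str


@dataclass
class UrbanAgent:
    llm: Callable[[str], str]  # hypothetical interface: prompt in, text out
    memory: List[Experience] = field(default_factory=list)

    def act(self, observation: str) -> str:
        # Retrieve recent lessons, then run Understanding -> Prediction -> Planning.
        recalled = "; ".join(e.lesson for e in self.memory[-3:])
        understanding = self.llm(f"UNDERSTAND\n{observation}\nLessons: {recalled}")
        prediction = self.llm(f"PREDICT\n{understanding}")
        return self.llm(f"PLAN\n{understanding}\n{prediction}")

    def reflect(self, observation: str, action: str, feedback: float) -> None:
        # Diagnose the previous step from feedback and write the lesson to memory.
        lesson = self.llm(f"REFLECT\nobs={observation}\naction={action}\nfeedback={feedback}")
        self.memory.append(Experience(observation, action, feedback, lesson))


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; dispatches on the first prompt line."""
    stage = prompt.split("\n", 1)[0]
    canned = {"PLAN": "extend_green_NS", "REFLECT": "queue shrank after extending green"}
    return canned.get(stage, f"{stage.lower()} summary")


agent = UrbanAgent(llm=stub_llm)
for step in range(3):
    queue = 9 - 2 * step  # toy environment state
    obs = f"step {step}: queue_NS={queue}"
    action = agent.act(obs)
    agent.reflect(obs, action, feedback=-float(queue))  # toy reward: negative queue
```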
2. 4D Process Decomposition + Verifiable QA Generation
The core design decomposes urban spatiotemporal reasoning into 4 processes evaluated via QA with accuracy scores (62,466 total; 40% basic understanding + 60% high-level reasoning). The dimensions are:
- Spatiotemporal Understanding (27,000 QAs): covers 3 spatial types (distance/adjacency/connectivity) and 5 temporal types (duration/extremes/sequence/periodicity/trend).
- Prediction (15,336 QAs): predicts state \(s_{i+1}\) based on \(o_i\), using real future values as ground truth.
- Planning (15,000 QAs): selects actions \(a_i\) to optimize long-term goals.
- Reflection (8,130 QAs): assesses previous actions/predictions given current observations and feedback \(f_i\) to determine correctness.
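To make "verifiable" concrete, here is a sketch of how a connectivity question could be generated with a ground truth computed directly from the graph. The question wording, option set, and helper names are assumptions for illustration, not the benchmark's actual templates.

```python
from collections import deque
from typing import Dict, List


def reachable(graph: Dict[str, List[str]], src: str, dst: str) -> bool:
    """Breadth-first search over directed road links."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def make_connectivity_qa(graph: Dict[str, List[str]], src: str, dst: str) -> dict:
    """Build one QA item whose answer is derived from the graph, not annotated."""
    return {
        "question": f"Can a vehicle travel from {src} to {dst} along the listed directed roads?",
        "options": ["Yes", "No"],
        "answer": "Yes" if reachable(graph, src, dst) else "No",
    }


def score_qas(qas: List[dict], predictions: List[str]) -> float:
    """Accuracy: fraction of model answers matching the computed ground truth."""
    return sum(p == qa["answer"] for qa, p in zip(qas, predictions)) / len(qas)


# Hypothetical toy network: D -> A -> B -> C.
graph = {"A": ["B"], "B": ["C"], "C": [], "D": ["A"]}
qas = [make_connectivity_qa(graph, "A", "C"), make_connectivity_qa(graph, "C", "A")]
acc = score_qas(qas, ["Yes", "Yes"])
```

Because the answer comes from a graph search, longer paths yield harder questions with no extra annotation cost, which fits the long-range connectivity weakness reported in the results.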
The challenge lies in establishing ground truth for planning, as real cities rarely expose "optimal actions" due to noise. The authors use simulation-driven exhaustive search to enumerate future action sequences within a planning horizon \(H\), selecting the action with the maximum expected cumulative discounted reward:
\[ a_i^{*} = \arg\max_{a_i} \; \mathbb{E}\left[ \sum_{t=0}^{H-1} \gamma^{t} \, R(s_{i+t}, a_{i+t}) \right] \]
where \(R\) measures progress toward the goal and the discount factor \(\gamma\) balances immediate and future rewards. Observations for decision tasks are collected using a semi-random heuristic agent that explores to ensure diverse trajectories.
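The exhaustive search described above can be sketched on a toy problem. Everything here is an assumption for illustration: the two-phase intersection simulator, the reward (negative total queue), and discounting as \(\gamma^t\) over steps \(t = 0, \dots, H-1\). The simulator is deterministic, so the expectation reduces to a single rollout per action sequence.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

State = Tuple[int, int]  # (queue on north-south approach, queue on east-west approach)


def toy_step(state: State, action: str) -> Tuple[State, float]:
    """Hypothetical simulator: green serves up to 3 vehicles, 1 arrives per approach."""
    ns, ew = state
    if action == "NS":
        ns = max(0, ns - 3)
    else:
        ew = max(0, ew - 3)
    ns, ew = ns + 1, ew + 1
    return (ns, ew), -float(ns + ew)  # reward: fewer queued vehicles is better


def best_first_action(
    state: State,
    actions: Sequence[str],
    step: Callable[[State, str], Tuple[State, float]],
    horizon: int,
    gamma: float,
) -> Tuple[str, float]:
    """Enumerate all action sequences of length `horizon`; return the first action
    of the sequence with the highest discounted return, and that return."""
    best_return, best_action = float("-inf"), actions[0]
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for t, a in enumerate(seq):
            s, r = step(s, a)
            total += (gamma ** t) * r
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action, best_return


label, value = best_first_action((6, 1), ["NS", "EW"], toy_step, horizon=3, gamma=0.9)
```

The cost grows as the number of actions raised to the power \(H\), so this kind of labeling is only practical for small action sets and short horizons.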
3. Dual-layer Evaluation + Capability Ablation
Process-level QAs identify where reasoning fails, while end-to-end downstream evaluation measures the impact on real applications (e.g., MAPE for socioeconomic prediction, Accuracy for congestion).
Capability ablation reveals the dependency structure: for top-tier models like DeepSeek-R1, removing spatiotemporal understanding significantly increases prediction error and degrades planning (indicating heavy reliance on initial understanding). Removing prediction also hurts planning for strong models, but for mid-tier models (Qwen2.5-32B), bypassing prediction slightly improves planning as noisy predictions can be misleading. For weak models (Qwen2.5-7B), intermediate reasoning and reflection can even be detrimental, where unreliable intermediate results propagate errors.
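One way to organize such an ablation is sketched below. The stage names, prompt layout, and the placeholder scoring function are assumptions; in particular, the toy scores are not results from the paper and serve only to exercise the harness.

```python
from typing import Callable, Dict, FrozenSet

STAGES = ("understanding", "prediction", "reflection")


def ablation_configs() -> Dict[str, FrozenSet[str]]:
    """Full pipeline plus one configuration per removed stage."""
    configs = {"full": frozenset(STAGES)}
    for stage in STAGES:
        configs[f"w/o {stage}"] = frozenset(STAGES) - {stage}
    return configs


def build_prompt(observation: str, enabled: FrozenSet[str], outputs: Dict[str, str]) -> str:
    """Include an intermediate result only if its stage is enabled."""
    parts = [f"Observation: {observation}"]
    for stage in STAGES:
        if stage in enabled:
            parts.append(f"{stage.capitalize()}: {outputs[stage]}")
    parts.append("Choose the next action.")
    return "\n".join(parts)


def run_ablation(evaluate: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Return each configuration's score change relative to the full pipeline."""
    scores = {name: evaluate(enabled) for name, enabled in ablation_configs().items()}
    return {name: score - scores["full"] for name, score in scores.items()}


# Placeholder scorer: NOT a measured result, just a stand-in for a task metric.
deltas = run_ablation(lambda enabled: len(enabled) / 4)
prompt = build_prompt(
    "queue_NS=9",
    frozenset({"understanding"}),
    {"understanding": "NS is congested", "prediction": "rising", "reflection": "n/a"},
)
```

Reporting deltas against the full pipeline makes the sign of each entry directly readable: a positive delta for a "w/o" configuration is the pattern described for weaker models, where an intermediate stage hurts more than it helps.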
Key Experimental Results
Main Results
14 LLMs were evaluated (7 non-reasoning and 7 reasoning models, compared in pairs). In spatiotemporal understanding, reasoning models often exceed 80% accuracy, but drop below 70% in long-range spatial (connectivity) and temporal (trend/periodicity) analysis.
| Capability Dimension | Top Model | Overall Acc | Common Weakness |
|---|---|---|---|
| ST Understanding | o4-mini | 0.7924 | Connectivity, Trend (Trend < 0.30) |
| Prediction | o4-mini | 0.7872 | Long-range (Congestion, OD) |
| Planning | gpt-oss-20B | 0.4468 | Significantly lower than others |
| Reflection | DeepSeek-R1 | 0.5179 | Most models < 0.50 |
Note: Random baseline is ~0.25 for most sub-tasks (1-of-4) and ~0.11 for trends.
In end-to-end tasks, LLMs generally outperform classical methods, with up to a 337.31% increase in prediction accuracy and 53.48% in decision effectiveness.
| Task | Metric | Classic Method | Best LLM |
|---|---|---|---|
| Socio-Econ Pred | MAPE ↓ | 7.09% | 4.97% (o4-mini) |
| Congestion Pred | Acc. ↑ | 17.18% | 75.73% (o4-mini) |
| Urban Planning | Service ↑ | 0.6100 | 0.6858 (DeepSeek-R1) |
| Road Planning | Cost ↓ | 18.95 | 18.40 (QwQ-32B) |
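For reference, the two prediction metrics in the table can be computed as follows. These are the standard definitions; the example values are made up and are not benchmark data.

```python
from typing import Sequence


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error in percent (lower is better).
    Assumes no true value is zero."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


def accuracy(labels: Sequence[str], preds: Sequence[str]) -> float:
    """Fraction of exact label matches (higher is better)."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


# Hypothetical values for illustration only.
gdp_error = mape([100.0, 200.0], [90.0, 220.0])
congestion_acc = accuracy(["high", "low", "low", "mid"], ["high", "low", "mid", "mid"])
```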
Ablation Study
Effect of bypassing reasoning processes on downstream performance:
| Config | DeepSeek-R1 (Strong) | Qwen2.5-32B (Mid) | Qwen2.5-7B (Weak) |
|---|---|---|---|
| Full Pipeline | Baseline | Baseline | Baseline |
| w/o ST Understanding | Prediction error↑, planning↓ | Degraded | Degraded |
| w/o Prediction | Planning↓ (planning relies on predictions) | Planning slightly↑ (noisy predictions removed) | Intermediate reasoning is harmful |
| w/o Reflection | Largest drop | Moderate sensitivity | Reflection is a hindrance |
Key Findings
- Reasoning Post-training \(\neq\) Stronger Urban Performance: Models such as QwQ and DeepSeek-R1 show 7–20% gains over their non-reasoning versions, but the advantage is inconsistent. GPT-4o often matches or exceeds DeepSeek-R1-Distill-70B. In long-term trend prediction (Congestion, OD), non-reasoning bases (Qwen2.5, Llama3.3) sometimes outperform reasoning variants, suggesting that reasoning gains in logic/math do not automatically transfer to urban spatiotemporal domains.
- Hierarchy of Capabilities: Mastery of long-range temporal understanding correlates with better long-term prediction. Post-training Qwen2.5-7B specifically on spatiotemporal understanding (creating Qwen2.5-7B-ST) allowed it to surpass both its base and the DeepSeek-R1-Distill-Qwen-7B variant, validating the "Understanding → Prediction/Planning" support structure.
- Reflection is the Bottleneck: Most models score <50% in reflection. DeepSeek-R1 exhibits stronger reflection (fewer errors, higher correction rate) but still struggles with dynamic feedback integration. Non-reasoning models tend to be "overconfidently wrong," while reasoning models occasionally show "internal inconsistency."
Highlights & Insights
- Verifiable Process-level Ground Truth: By using simulation for planning, real future values for prediction, and environment feedback for reflection, the authors quantify "intermediate reasoning correctness" rather than relying on subjective judgment.
- Paired Experimental Design: Using same-architecture pairs (e.g., Llama3.3 vs. DeepSeek-R1-Distill-70B) isolates reasoning post-training as the variable, supporting the counter-intuitive finding that reasoning models are not always superior in urban contexts.
- Directional Insights of Ablation: The discovery that strong models benefit from intermediate reasoning while weak models are hindered by it provides direct guidance for deployment: weak models may perform better with end-to-end output rather than forced Chain-of-Thought (CoT).
Limitations & Future Work
- The paper focuses on evaluation rather than enhancement. Although spatiotemporal post-training was briefly validated (Qwen2.5-7B-ST), a systematic enhancement method is not provided.
- Decision tasks rely on simulated environments; real-world validation is missing. Dimensions like social reasoning and multi-agent interaction are not yet covered.
- The independence assumption of the four stages is questionable in complex tasks where boundaries might overlap.
Related Work & Insights
- Comparison with Outcome-only Benchmarks (STBench, CityGPT): USTBench is the first to combine process-level and end-to-end diagnosis while explicitly incorporating reflection and reasoning model baselines.
- Support for Agent Frameworks (PreAct, ReflAct): While those works propose new reasoning mechanisms, USTBench provides the diagnostic platform to dissect exactly where those mechanisms succeed or fail.
- Guidance for Urban LLMs (UrbanGPT, LLMLight): By proving that general reasoning training is inconsistent in the urban domain, USTBench highlights the necessity for domain-specific adaptation.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First benchmark to dissect urban spatiotemporal reasoning into four verifiable process dimensions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage with 14 LLMs, 60k+ QAs, 9 tasks, and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and convincing examples; some technical details are deferred to appendices.
- Value: ⭐⭐⭐⭐⭐ Provides actionable insights on reasoning model performance and the hierarchy of urban agent capabilities.