UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos¶
Conference: ICLR 2026 arXiv: 2510.15018 Code: urbanverseproject.github.io Area: Robotics / Simulation Keywords: urban simulation, real-to-sim, embodied AI, 3D asset library, navigation policy
TL;DR¶
UrbanVerse is a data-driven real-to-sim system that converts crowdsourced city-tour videos into physically-aware, interactive simulation environments. It comprises a 100K+ annotated 3D asset library and an automated scene construction pipeline, generating 160 high-quality scenes in IsaacSim. A PPO navigation policy trained on these scenes achieves an 89.7% success rate in zero-shot real-world transfer, completing a 337 m long-range task with only 2 human interventions.
Background & Motivation¶
Embodied AI agents operating in urban environments—such as delivery robots and quadrupedal robots—are advancing rapidly. Training such agents requires large quantities of diverse, high-fidelity urban environments, yet existing simulation approaches face a fundamental tension:
- Manually crafted scenes: e.g., CARLA provides only 15 scenes; hand-crafting does not scale and incurs high labor costs.
- Procedurally generated scenes: e.g., MetaUrban/UrbanSim rely on hard-coded rules, producing distributions that deviate from the real world (e.g., randomly placed scooters that do not reflect realistic parking patterns).
- Passive real-world data: City-tour videos offer rich diversity but lack action labels and interactivity.
- 3D reconstruction methods: e.g., 3DGS can reconstruct scenes from video, but the resulting static textured meshes lack semantic and physical attributes.
Root Cause: a fundamental trade-off between scale and fidelity. Simply increasing quantity through procedural generation does not yield generalization—if scenes fail to faithfully reflect real-world distributions, sheer quantity provides little benefit.
Core Idea of UrbanVerse: extract scene semantics and layout from real city-tour videos, then instantiate them into physically interactive simulation scenes using high-quality 3D assets—the "digital cousin" paradigm. This approach combines the diversity of real-world data with the interactivity of simulation.
Method¶
Overall Architecture¶
UrbanVerse consists of two main pillars:
1. UrbanVerse-100K: a 100K+ urban 3D asset library annotated with physical attributes.
2. UrbanVerse-Gen: an automated video-to-simulation pipeline.
Input: YouTube city-tour videos → Output: physically interactive simulation scenes in IsaacSim.
Key Designs¶
- UrbanVerse-100K Asset Library (Section 3.1): Starting from 800K noisy 3D assets in Objaverse, a three-stage semi-automated pipeline filters and annotates assets:
- Asset Filtering: Ten annotators used a Three.js viewer over three weeks to label and filter assets, removing eight categories of quality issues (broken meshes, missing textures, planar geometry, abnormal scale, etc.) and retaining 158K usable assets.
- Urban Ontology Construction: Building on OpenStreetMap tag structures and extending categories from ADE20K, Cityscapes, and related datasets, a three-level urban semantic ontology with 667 leaf categories is established.
- Attribute Annotation: GPT-4.1 annotates 33 attributes (semantics, affordances, physical properties such as mass and friction) per asset using thumbnails and four rotated snapshots; total API cost: $1,334.
The final library contains: 102,530 GLB objects + 288 PBR ground materials + 306 HDRI sky maps.
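To make the attribute annotation concrete, here is a hypothetical, heavily simplified sketch of what one annotated library entry might look like. The paper annotates 33 attributes per asset and the real filtering was done by human annotators in a Three.js viewer; all field names and thresholds below are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass

# Hypothetical, simplified record for one UrbanVerse-100K entry.
# The paper annotates 33 attributes per asset; only a few representative
# fields are sketched here, and every name is an illustrative assumption.
@dataclass
class UrbanAsset:
    asset_id: str
    category: str            # leaf of the 667-category urban ontology
    mass_kg: float           # physical attribute consumed by the simulator
    friction: float
    bbox_lwh_m: tuple        # length / width / height in meters
    movable: bool            # example affordance flag

def usable(asset: UrbanAsset) -> bool:
    """Toy quality gate in the spirit of the asset-filtering stage:
    reject near-planar geometry and implausible scale."""
    l, w, h = asset.bbox_lwh_m
    if min(l, w, h) <= 0.01:   # near-planar geometry
        return False
    if max(l, w, h) > 50.0:    # abnormal scale for a street-level object
        return False
    return True

hydrant = UrbanAsset("a001", "fire_hydrant", 60.0, 0.8, (0.3, 0.3, 0.8), False)
poster = UrbanAsset("a002", "poster", 0.2, 0.5, (0.6, 0.001, 0.9), False)
print(usable(hydrant), usable(poster))  # → True False
```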
- UrbanVerse-Gen Scene Construction Pipeline (Section 3.2): A unified 3D urban scene graph \(\mathcal{V} = \langle\mathcal{O}, \mathcal{G}, \mathcal{S}\rangle\) (object / ground / sky nodes) is defined.
Three processing stages:
- Scene Distillation: Semantic and 3D layout information is extracted from video. MASt3R estimates metric depth and camera poses; YoloWorld+SAM2 performs open-vocabulary object parsing; Mask2Former segments road surfaces and sidewalks. Cross-frame fusion yields persistent object nodes containing category, centroid, 3D bounding box, orientation, and appearance crops.
- Asset Matching and Diversification: For each node, \(k_{cousin}\) matching assets are retrieved from UrbanVerse-100K via three steps: CLIP semantic matching → bounding-box geometric filtering (mBBD) → DINOv2 appearance ranking. Ground materials are matched to PBR textures via pixel-level MSE; sky conditions are matched to HDRI maps via HSV histograms.
- Simulation Scene Generation: Scenes are instantiated in UrbanSim (IsaacSim): ground-plane fitting and material application → HDRI sky assignment → object placement (centroid alignment + collision detection + physical-property assignment).
- Digital Cousin Diversification Strategy: Each video-derived scene layout can spawn \(k_{cousin}=5\) digital cousin variants with distinct appearances but identical layouts, achieved by substituting different matched assets. This intra-layout diversity complements inter-layout diversity, enhancing policy generalization.
- PPO Navigation Policy Training:
- Actor-Critic architecture with continuous action space.
- Observations: RGB image (135×240) + goal relative position.
- Three-layer CNN encoder (channels [16, 32, 64]) + three-layer MLP (hidden size 128).
- Reward design: arrival reward (+2000) + collision penalty (−200) + position tracking (coarse + fine) + velocity reward.
- Training loads 16 scenes per batch, rotating to a new batch every 100 episodes.
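The three-step asset matching and digital-cousin retrieval described above can be sketched as follows. The embeddings are mocked 2-D vectors and `bbox_distance` is a stand-in for the paper's mBBD metric (the real system uses CLIP and DINOv2 features); function names and thresholds are illustrative assumptions, not the authors' code.

```python
import math

# Hedged sketch of UrbanVerse-Gen's three-step asset matching:
# CLIP semantics → bounding-box geometry → DINOv2 appearance.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def bbox_distance(q_lwh, a_lwh):
    # Stand-in for the paper's mBBD filter: mean absolute difference
    # of 3D box dimensions, in meters.
    return sum(abs(q - a) for q, a in zip(q_lwh, a_lwh)) / 3.0

def match_assets(query, library, k_cousin=5, sem_thresh=0.8, geo_thresh=0.5):
    # 1) semantic filter on CLIP-style embeddings
    cands = [a for a in library if cosine(query["clip"], a["clip"]) >= sem_thresh]
    # 2) geometric filter on 3D bounding-box size
    cands = [a for a in cands if bbox_distance(query["lwh"], a["lwh"]) <= geo_thresh]
    # 3) appearance ranking on DINOv2-style embeddings; top-k = digital cousins
    cands.sort(key=lambda a: cosine(query["dino"], a["dino"]), reverse=True)
    return [a["id"] for a in cands[:k_cousin]]

query = {"clip": [1.0, 0.0], "lwh": (1.0, 1.0, 1.0), "dino": [0.0, 1.0]}
library = [
    {"id": "a1", "clip": [1.0, 0.1], "lwh": (1.0, 1.0, 1.2), "dino": [0.2, 1.0]},
    {"id": "a2", "clip": [0.0, 1.0], "lwh": (1.0, 1.0, 1.0), "dino": [0.0, 1.0]},  # wrong category
    {"id": "a3", "clip": [1.0, 0.0], "lwh": (3.0, 3.0, 3.0), "dino": [0.0, 1.0]},  # wrong size
    {"id": "a4", "clip": [1.0, 0.0], "lwh": (1.0, 1.0, 1.0), "dino": [1.0, 0.0]},
]
print(match_assets(query, library))  # → ['a1', 'a4']
```

With \(k_{cousin}=5\), the top-k survivors of this funnel become the digital cousins substituted into each layout variant.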
Loss & Training¶
PPO optimization; learning rate 1e-4 (adaptive); \(\gamma=0.99\); GAE \(\tau=0.95\); PPO clip \(\varepsilon=0.2\); KL threshold 0.01; entropy coefficient 0.002; 1,500 epochs; mixed-precision training on a single L40S GPU.
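The advantage term these hyperparameters feed can be illustrated with a minimal GAE computation using the reported \(\gamma\) and \(\tau\); the three-step rollout below uses invented rewards and value estimates purely for the sketch.

```python
# Minimal GAE sketch with the reported settings (gamma=0.99, GAE tau=0.95).
GAMMA, TAU = 0.99, 0.95

def gae(rewards, values, last_value):
    """Generalized Advantage Estimation over one trajectory (backward pass)."""
    vals = values + [last_value]          # bootstrap with the final value estimate
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + GAMMA * vals[t + 1] - vals[t]  # one-step TD error
        running = delta + GAMMA * TAU * running             # discounted sum of deltas
        advantages[t] = running
    return advantages

# Invented 3-step rollout: rewards, per-step value estimates, terminal value 0.
adv = gae([1.0, -0.2, 2.0], [0.5, 0.4, 0.6], last_value=0.0)
```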
Key Experimental Results¶
Main Results¶
Scene Construction Fidelity (KITTI-360, 45 sequences, avg. 198.7 m):
| SfM | Scene Parser | Category (%) | Asset (%) | Distance (m) | Orientation (°) | Volume (m³) | mAP25 |
|---|---|---|---|---|---|---|---|
| MASt3R | YoloWorld+SAM2 | 93.1 | 75.1 | 1.4 | 19.8 | 0.8 | 28.2 |
| VGGT | YoloWorld+SAM2 | 91.5 | 70.6 | 2.1 | 20.1 | 1.3 | 9.4 |
CraftBench Generalization (10 artist-designed scenes):
| Method | SR (%) | CT | RC (%) |
|---|---|---|---|
| MBRA | 35.6 | 25.6 | 52.9 |
| S2E | 33.1 | 27.7 | 55.7 |
| PPO-UrbanSim | 9.1 | 31.5 | 19.4 |
| PPO-UrbanVerse | 41.9 | 35.5 | 62.4 |
Zero-Shot Sim-to-Real (16 real urban scenes):
| Method | Wheeled SR (%) | Quadruped SR (%) |
|---|---|---|
| NoMad | 33.3 | 37.5 |
| S2E | 47.9 | 58.6 |
| PPO-UrbanSim | 18.8 | 18.8 |
| PPO-UrbanVerse | 77.1 | 89.7 |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| 1 layout → 32 layouts | SR: low → 41.9% | Scene quantity follows a scaling power law |
| 1 cousin → 5 cousins | SR: low → higher | Intra-layout diversity also matters |
| UrbanVerse vs. PG scenes | Human score: 3.58/5 vs. 2.9/5 | >70% of users prefer UrbanVerse |
| Pretrain + target-scene fine-tuning | SR: 0% → 80% | Real-to-sim-to-real loop is effective |
Key Findings¶
- Scaling power law holds: A power-law relationship exists between the number of scenes / digital cousins and policy performance, with high linear-fit \(R^2\).
- Real-world distribution is critical: An equivalent number of procedurally generated scenes yields almost no generalization improvement (flat PG curve).
- PPO-UrbanVerse surpasses navigation foundation models: A simple PPO policy trained on UrbanVerse scenes outperforms large-scale pretrained visual navigation foundation models such as NoMad and CityWalker.
- Zero-shot transfer is exceptionally strong: Quadruped Go2 achieves an 89.7% success rate in real-world settings, surpassing S2E by 31.1 percentage points.
- 337 m long-range task: Completed on public streets with only 2 human interventions.
- Human evaluation: Automatically generated UrbanVerse scenes are rated 3.58/5, compared to 4.08/5 for artist-crafted scenes—a small gap.
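The power-law fit behind the scaling claim can be reproduced in miniature by linear regression in log-log space. The (scene count, success rate) pairs below are synthetic, generated from an assumed SR = 10·N^0.4 curve, not the paper's measurements; the fit should simply recover those assumed coefficients.

```python
import math

# Fit SR = a * N^b by least squares in log-log space:
# log SR = log a + b * log N, so b is the slope and a = exp(intercept).
def fit_power_law(ns, srs):
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in srs]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic data from an assumed SR = 10 * N^0.4 curve (NOT the paper's numbers).
ns = [1, 2, 4, 8, 16, 32]
srs = [10 * n ** 0.4 for n in ns]
a, b = fit_power_law(ns, srs)
print(round(a, 3), round(b, 3))  # → 10.0 0.4
```

On perfectly power-law data the log-log fit is exact; the paper's high linear-fit \(R^2\) indicates its real curves sit close to such a line.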
Highlights & Insights¶
- End-to-end pipeline: A complete closed loop from video collection → asset library construction → scene generation → policy training → real-world deployment.
- 100K-scale asset library: Addresses quality and scale challenges in 3D asset acquisition, constituting an independent and significant contribution.
- Discovery of scaling laws: Quantitative evidence for data scaling laws in embodied AI, supporting the claim that "more scenes = better policies."
- Validation across two robot morphologies: The same policy generalizes to both wheeled and quadrupedal robots, suggesting that environment understanding rather than specific kinematics is being learned.
- 160 scenes across 24 countries: Cross-cultural and cross-geographic diversity is key to real-world generalization.
- Real-to-sim-to-real closed loop: For known deployment environments, the workflow—capture a video → generate simulation → fine-tune policy → deploy—offers substantial practical value.
Limitations & Future Work¶
- The digital cousin approach inevitably introduces a gap relative to the real scene, as asset substitution cannot perfectly replicate original objects.
- Orientation error (19.8°) remains large and may impact precise navigation.
- Only PPO is explored; more advanced reinforcement learning algorithms are not investigated.
- Only navigation tasks are evaluated; manipulation tasks are not addressed.
- Scene dynamics are absent—pedestrians, vehicles, and other moving obstacles are not modeled.
- HDRI sky maps provide static lighting that cannot simulate time-of-day variation.
- Data sourcing is constrained by the Creative Commons licensing of YouTube videos.
Related Work & Insights¶
- Digital Cousins (Dai et al., 2024): Multi-variant generation for indoor scenes; UrbanVerse extends this paradigm to large-scale outdoor urban environments.
- MetaUrban / UrbanSim (Wu et al., 2025): The simulation platform underpinning UrbanVerse; this work addresses the limitations of procedural generation therein.
- ViNT / NoMad (Shah et al., 2023; Sridhar et al., 2024): Visual navigation foundation models trained on passive data, lacking interactive learning.
- S2E (He et al., 2025): Trains obstacle-avoidance policies in simulation and transfers them to the real world, but at a smaller scene scale and diversity.
- Data Scaling Laws (Lin et al., 2025): Scaling laws for imitation learning data; UrbanVerse identifies analogous trends in the RL + simulation scene setting.
- Insight: Crowdsourced video represents a nearly unlimited source of simulation material, a paradigm worth extending to other domains (e.g., indoor environments, factories).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐