RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
- Conference: ICCV 2025
- arXiv: 2508.04642
- Code: Project Page
- Area: Autonomous Driving / Sim2Real
- Keywords: End-to-end autonomous driving, Sim2Real transfer, multimodal large language models, simulation data augmentation, hard-case scenarios
TL;DR
This paper proposes RoboTron-Sim, a framework that constructs a hard-case simulation dataset (HASS) and introduces Scenario-aware Prompt Engineering (SPE) together with an Image-to-Ego (I2E) encoder, enabling MLLMs to effectively leverage simulated hard cases to improve real-world driving. On nuScenes hard scenarios, it reduces L2 distance by ~48% and collision rate by ~46%, establishing state-of-the-art open-loop planning performance.
Background & Motivation
State of the Field — Data Scarcity Bottleneck: End-to-end autonomous driving systems are highly data-driven, yet real-world data for high-risk, long-tail scenarios (e.g., nighttime driving, heavy rain, pedestrian jaywalking) is extremely scarce. In nuScenes, the day-to-night ratio is approximately 7:1 and the straight-to-turn ratio approximately 8:1.
Limitations of Prior Work — Sim2Real Gap: Conventional approaches (e.g., VAD) that directly mix simulated and real data yield marginal gains — only ~1% improvement in L2 distance. The root cause is the inherent domain gap between simulated and real inputs (visual style, sensor parameters, coordinate systems, etc.), which impedes cross-domain knowledge transfer.
New Opportunities and Challenges with MLLMs: Multimodal large language models possess strong reasoning and generalization capabilities, showing early promise for cross-domain fusion (LLaVA-OneVision outperforms VAD in Sim2Real settings). However, geometric misalignment between simulated and real data continues to constrain performance.
Root Cause — Core Research Question: How can MLLMs effectively leverage simulation data to improve real-world autonomous driving? This work represents the first in-depth investigation of Sim2Real transfer limitations for MLLMs in autonomous driving.
Method
Overall Architecture
RoboTron-Sim comprises two main components: (1) a data layer — construction of the hard-case simulation dataset HASS; and (2) a model layer — an MLLM-based driving framework equipped with SPE and an I2E Encoder to bridge the Sim2Real gap.
1. HASS Dataset Construction
Scenario Categorization Strategy
- Common scenarios are divided into Easy-to-Drive (E2D, e.g., daytime straight driving) and Hard-to-Drive (H2D, e.g., nighttime, fog, heavy rain)
- Long-tail scenarios: extremely rare but high-risk events, covering 13 categories of edge cases (pedestrian jaywalking, sudden lane changes, wrong-way intrusion, road construction, etc.)
- H2D and long-tail scenarios are the primary targets for supplementary data generation
Data Generation
- Built on the CARLA simulator, using Think2Drive (a world-model-driven RL architecture) as the core data generation engine
- Sensor configuration aligned with nuScenes: six 900×1600 cameras providing 360° coverage
- Total of 47,553 simulated samples
Data Balancing
- Day/Night: 58.65% / 41.35% (real data: 87.97% / 12.03%)
- Clear/Rain: 48.38% / 51.61% (real data: 80.16% / 19.84%)
- Straight/Turn: 46.42% / 53.58% (real data: 88.86% / 11.14%)
Coordinate Alignment
- CARLA's left-handed coordinate system is converted to nuScenes' right-handed coordinate system
- The coordinate origin is unified to the vehicle roof center
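In practice, this alignment amounts to a translation of the origin plus a y-axis flip. A minimal sketch, assuming CARLA's (x forward, y right, z up) and nuScenes' (x forward, y left, z up) conventions and a hypothetical roof-offset value:

```python
import numpy as np

def carla_to_nuscenes(points_carla: np.ndarray, roof_offset: np.ndarray) -> np.ndarray:
    """Map points from CARLA's left-handed frame (x forward, y right, z up)
    to nuScenes' right-handed frame (x forward, y left, z up), with the
    origin moved to the vehicle roof center.

    roof_offset: translation from CARLA's vehicle origin to the roof center,
    expressed in the CARLA frame (the value below is illustrative).
    """
    pts = points_carla - roof_offset           # re-center on the roof
    return pts * np.array([1.0, -1.0, 1.0])    # flip y: left-handed -> right-handed

# A point 10 m ahead, 2 m to the right, at ground level in CARLA
p = carla_to_nuscenes(np.array([[10.0, 2.0, 0.0]]), np.array([0.0, 0.0, 1.6]))
```

The same flip must be applied consistently to trajectories, poses, and camera extrinsics so the converted samples remain geometrically coherent.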
2. Scenario-aware Prompt Engineering (SPE)
A structured environmental description is prepended to the input sequence: "You are driving in [City Name] under [Simulation/Real-World] scenario."
- Domain awareness: explicitly informs the model of the data source (simulation/real-world), making it aware of differences such as sensor noise
- Geographic conditioning: embeds city-name priors (e.g., traffic rules, left/right-hand driving conventions), activating commonsense knowledge embedded in the LLM to adaptively adjust driving strategies
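The SPE prefix itself is tiny; a sketch of constructing it (the function name and any template detail beyond the quoted sentence are assumptions):

```python
def build_spe_prompt(city: str, is_sim: bool) -> str:
    """Construct the scenario-aware prefix prepended to the input sequence."""
    domain = "Simulation" if is_sim else "Real-World"
    return f"You are driving in {city} under {domain} scenario."

prefix = build_spe_prompt("Singapore", False)
```

At training time, HASS samples would be tagged with `is_sim=True`, letting the model attribute domain-specific artifacts (e.g., rendering style) to the simulation source rather than to the scene itself.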
3. Image-to-Ego Encoder (I2E Encoder)
- Design Motivation: Camera intrinsic and extrinsic parameters differ between simulated and real-world settings, creating a critical cross-domain geometric gap
- Mechanism: Camera intrinsics and extrinsics are used to compute the image-to-ego transformation matrix, which is mapped to an embedding space via a two-layer MLP to capture the spatial context of each viewpoint
- Integration: The encoded output is concatenated with text tokens, enabling the model to directly incorporate spatial reasoning into the decision-making process
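A minimal NumPy sketch of the I2E idea, assuming a 4×4 image-to-ego transform per view, a GELU activation, and a hypothetical embedding width (the paper's exact dimensions and activation are not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 32  # hypothetical width

# Two-layer MLP weights (random init for the sketch)
W1, b1 = rng.standard_normal((16, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, EMBED_DIM)) * 0.1, np.zeros(EMBED_DIM)

def i2e_embedding(image_to_ego: np.ndarray) -> np.ndarray:
    """Map a 4x4 image-to-ego transform to a per-view spatial embedding."""
    x = image_to_ego.reshape(-1)  # flatten 4x4 -> 16
    h = x @ W1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU
    return h @ W2 + b2

# One spatial token per camera; these are concatenated with the text tokens
views = [np.eye(4) for _ in range(6)]
spatial_tokens = np.stack([i2e_embedding(T) for T in views])
```

Because the transform is injected explicitly, a change in camera mounting or intrinsics shows up as a different input token rather than an unmodeled distribution shift.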
4. MLLM Baseline Architecture
- Visual feature extractor → two-layer MLP projector → LLM decoder (based on LLaVA-OneVision)
- Input: 6 cameras × 5 consecutive frames + high-level instructions (e.g., "turn left at the next intersection")
- Output: future trajectory waypoints + predicted vehicle speed
- Velocity supervision is introduced to enhance ego-state awareness
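Concretely, the model interface can be summarized as follows; beyond the 6 cameras × 5 frames at 900×1600 stated above, the layout and planning horizon are illustrative assumptions:

```python
# Shapes and the planning horizon below are illustrative; the paper fixes
# 6 cameras x 5 frames at 900x1600 but not the rest of the preprocessing.
CAMERA_VIEWS, HISTORY_FRAMES = 6, 5
image_input_shape = (HISTORY_FRAMES, CAMERA_VIEWS, 3, 900, 1600)  # (T, view, C, H, W)
instruction = "turn left at the next intersection"  # high-level command

NUM_WAYPOINTS = 6  # hypothetical planning horizon
output_spec = {
    "waypoints": (NUM_WAYPOINTS, 2),  # future (x, y) positions in the ego frame
    "speed": (),                      # scalar; velocity supervision aids ego-state awareness
}
```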
Key Experimental Results
Main Results — Open-Loop Planning on nuScenes (Tab. 3)
| Setting | Method | L2 (m) ↓ | Collision (%) ↓ | Out-of-bound (%) ↓ |
|---|---|---|---|---|
| w/o ego pose | OmniDrive | 0.84 | 0.94 | 4.29 |
| w/o ego pose | RoboTron-Sim | 0.56 | 0.58 | 3.02 |
| w/ ego pose | EMMA | 0.32 | — | — |
| w/ ego pose | OmniDrive | 0.33 | 0.30 | 3.00 |
| w/ ego pose | RoboTron-Sim | 0.23 | 0.26 | 2.62 |
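For reference, the L2 column measures the displacement between predicted and ground-truth waypoints; a sketch of the metric, assuming plain averaging over the horizon (exact nuScenes averaging protocols vary across papers):

```python
import numpy as np

def avg_l2(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average L2 distance (m) between predicted and ground-truth waypoints.

    pred, gt: (T, 2) arrays of future (x, y) positions in the ego frame.
    """
    return float(np.linalg.norm(pred - gt, axis=1).mean())

pred = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
gt   = np.array([[1.0, 0.3], [2.0, 0.4], [3.0, 0.0]])
# per-waypoint distances are 0.3, 0.4, and 0.0 m; avg_l2 returns their mean
```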
Scenario-Specific Improvements (Tab. 4, L2 Distance)
| Scenario | nuScenes Only | +HASS | Improvement |
|---|---|---|---|
| Night (H2D) | 1.40 | 0.81 | ↓42.1% |
| Turn (H2D) | 1.32 | 0.64 | ↓51.5% |
| Rain (H2D) | 1.15 | 0.56 | ↓51.3% |
| Day (E2D) | 0.59 | 0.54 | ↓8.5% |
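The improvement column follows directly from the two L2 columns; a one-line check:

```python
def rel_improvement(before: float, after: float) -> float:
    """Relative L2 reduction (%) between the nuScenes-only and +HASS settings."""
    return (before - after) / before * 100.0

night = rel_improvement(1.40, 0.81)  # Night (H2D) row
```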
Ablation Study (Tab. 6)
| SPE | I2E | L2 (m) ↓ | Collision (%) ↓ | Out-of-bound (%) ↓ |
|---|---|---|---|---|
| ✗ | ✗ | 0.91 | 0.94 | 3.22 |
| ✓ | ✗ | 0.86 | 0.79 | 2.68 |
| ✓ | ✓ | 0.56 | 0.58 | 3.02 |
Data Efficiency
- Using only 20% real data + HASS matches the performance of 100% real data
- Training on purely simulated data (0% real) still yields a reasonable L2 of 1.24 m
HASS vs. GASS Comparison (Tab. 10)
- GASS (synthesized following nuScenes distribution): H2D L2 = 1.07 m
- HASS (hard-case targeted synthesis): H2D L2 = 0.67 m (↓37.4%), collision rate reduced from 1.74% to 0.96%
Cross-Benchmark Generalization (NAVSIM)
- RoboTron-Sim + HASS achieves PDMS = 85.6, establishing state of the art on NAVSIM
Deployment Efficiency
- RoboTron-Sim-7B latency: 612.8 ms
- RoboTron-Sim-0.5B latency: 141.4 ms, comparable to VAD (115.3 ms) with competitive performance
Highlights & Insights
- First systematic study of Sim2Real for MLLMs: This work is the first to thoroughly investigate the limitations of, and solutions for, MLLM-based utilization of simulation data in autonomous driving, addressing an important gap in the literature.
- Targeted hard-case synthesis: HASS does not synthesize data uniformly but specifically supplements H2D and long-tail scenarios. The GASS comparison experiment clearly demonstrates the value of this targeted strategy: the H2D improvement jumps from ~22% to ~48%.
- Elegance of SPE: Rather than employing complex domain adaptation networks, domain awareness is achieved with a single line of text prompt. This leverages commonsense knowledge already embedded in the LLM (e.g., traffic rules in different cities), a lightweight Sim2Real solution uniquely suited to the MLLM era.
- I2E Encoder decouples sensor configuration: By explicitly injecting geometric transformation matrices, the model is freed from dependence on a specific sensor configuration, contributing an additional 34.9% reduction in L2 distance, the largest single performance gain.
- Remarkable data efficiency: 20% real data + HASS ≈ 100% real data, which has significant practical implications for reducing costly real-world data collection.
Limitations & Future Work
- Open-loop evaluation only: All experiments are conducted in the nuScenes open-loop setting; no closed-loop testing (e.g., CARLA Leaderboard) is performed. Open-loop metrics are known to diverge significantly from real driving performance.
- Inherent limitations of CARLA: HASS relies on CARLA, whose visual fidelity remains limited. Performance may improve further with next-generation simulation engines (e.g., Unreal Engine 5).
- Limited long-tail coverage: Only 13 categories of edge cases are included, whereas real-world long-tail distributions are far more complex. Automatically discovering and generating new hard scenarios remains an open problem.
- Hard-coded SPE format: The prompt template is manually designed with a fixed format; learnable prompts or more flexible domain description strategies are not explored.
- High inference latency: The 7B model incurs 612.8 ms latency, falling short of the real-time autonomous driving requirement (≤100 ms). Although the 0.5B variant approaches VAD's latency, it comes with a performance trade-off.
- Single simulator source: Only CARLA is used; combinations of multiple simulators or neural-rendering approaches for training-data generation are not explored.
Related Work & Insights
- UniAD/VAD: End-to-end autonomous driving baselines with limited capacity for leveraging simulation data
- EMMA (Google): A multimodal end-to-end autonomous driving model achieving L2 = 0.32 m with ego pose
- OmniDrive: A full-stack framework leveraging LLMs for 3D perception, reasoning, and planning
- LLaVA-OneVision: The base model underlying the MLLM backbone in this work
- Think2Drive: A world-model-based RL driving agent used for HASS data collection
- Senna/DriveVLM: Autonomous driving systems combining MLLMs with end-to-end models
- Key Insight: Prompt engineering for MLLMs can serve as a lightweight domain adaptation strategy; explicit injection of geometric information is highly effective in cross-domain settings
Rating
- Novelty: ⭐⭐⭐⭐ (Sim2Real from the MLLM perspective is a novel angle; SPE and I2E designs are conceptually clear)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (comprehensive ablations, multi-benchmark validation, data efficiency analysis, deployment cost, VQA generalization)
- Writing Quality: ⭐⭐⭐⭐ (clear structure, well-motivated, though the abundance of tables somewhat affects readability)
- Value: ⭐⭐⭐⭐ (significant reference value for the Sim2Real + MLLM direction; practical utility pending closed-loop validation)