Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation¶
Conference: ICCV 2025 arXiv: 2506.17213 Code: orangesodahub.github.io/InfGen Area: Autonomous Driving Keywords: Traffic Simulation, Long-term Simulation, Autoregressive Generation, Scenario Generation, Next-Token Prediction
TL;DR¶
This paper proposes InfGen, a unified autoregressive next-token prediction model that interleaves closed-loop motion simulation with scenario generation (dynamic agent insertion and removal), achieving stable long-term (30-second) traffic simulation for the first time. InfGen reaches state-of-the-art performance on short-term benchmarks and significantly outperforms all existing methods on long-term tasks.
Background & Motivation¶
Problem Definition¶
Traffic simulation aims to provide realistic driving experiences for autonomous driving systems. An ideal simulator should deliver complete trip-level realistic traffic flow, encompassing realistic environments, ego-vehicle dynamics, and all non-ego traffic participants.
Limitations of Prior Work¶
Existing methods share a fundamental assumption: the agent set remains fixed throughout the simulation horizon. This assumption breaks down entirely in long-term simulation:
Agent disappearance: As the ego vehicle moves into new regions, agents from the initial log progressively leave the field of view.
Scene depopulation: When the ego vehicle enters map regions not covered by the log, those regions contain no agents.
Unrealistic empty scenes: State-of-the-art models such as SMART produce scenes devoid of agents around the ego vehicle after 30 seconds of simulation (Figure 1).
Three categories of prior work exhibit the following limitations:

- Closed-loop motion simulation (SMART, CatK): simulates only the motion of pre-existing agents without generating new ones.
- Scenario generation (SceneGen, TrafficGen): generates only static initial scenes or short open-loop scenarios.
- Adversarial scenario generation: focuses on near-collision scenarios and is not suited for general long-term simulation.
Root Cause¶
Long-term traffic simulation must simultaneously address two problems: (1) closed-loop motion simulation of existing agents; and (2) dynamic generation of new agents and removal of departing ones. InfGen unifies both tasks within an interleaved next-token prediction framework.
Method¶
Overall Architecture¶
InfGen formulates long-term traffic simulation as interleaved expansion over a "dynamic agent matrix":

- Temporal axis expansion (motion simulation): predicts the next-step motion token for each active agent.
- Spatial axis expansion (scenario generation): inserts new agent rows (pose tokens) or removes departing agent rows.
Formally:

$$p(\mathcal{A}'_{t+1:T'} | \mathcal{M}, \mathcal{A}_{0:t_0}) = \prod_{t=t_0}^{T'-1} p_{\text{scene}}(\mathcal{A}'_{t+1} | \mathcal{M}, \mathcal{A}_{t+1}) \times p_{\text{motion}}(\mathcal{A}_{t+1} | \mathcal{M}, \mathcal{A}'_{0:t})$$
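The factorization above corresponds to a simple alternating rollout. A minimal sketch of that control flow, with placeholder sampling functions (`p_motion`, `p_scene` here are illustrative stand-ins, not the authors' API):

```python
# Hypothetical sketch of the interleaved factorization: at each timestep,
# motion simulation advances the existing agents, then scenario generation
# updates the agent set before the next step.

def simulate(map_tokens, agents, t0, T_prime, p_motion, p_scene):
    """Roll out the scene from t0 to T_prime via interleaved sampling."""
    for t in range(t0, T_prime):
        # Temporal expansion: sample the next motion token for each agent.
        agents = p_motion(map_tokens, agents)
        # Spatial expansion: insert new agents / remove departing ones.
        agents = p_scene(map_tokens, agents)
    return agents
```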
Key Designs¶
1. Unified Tokenization Scheme¶
- Function: Converts map, motion, pose, and mode-control information into discrete token sequences.
- Mechanism:
Four tokenizers are employed:
- Map Tokenizer: Segments road elements into fixed-length vectors encoding start/end points, direction, and road type.
- Motion Tokenizer: Encodes 0.5-second continuous trajectory segments into a discrete motion vocabulary \(\mathcal{V}_{\text{motion}}\) via k-disks clustering, using nearest-neighbor indices.
- Pose Tokenizer: Encodes the initial pose of new agents as a position token (grid index centered on the ego vehicle) and a heading token (360° uniform quantization).
- Mode Control Tokenizer: Four special tokens govern task switching:
- <BEGIN MOTION>: the next token is a motion token.
- <ADD AGENT>: the next token is a pose token, inserting a new agent.
- <KEEP AGENT>: the current agent is retained.
- <REMOVE AGENT>: the current agent will be removed.
- Design Motivation: Reduces the complex mixed-task simulation problem to a simple sequence prediction problem. The four control tokens enable the model to learn when to switch tasks and how to decide agent insertion and removal.
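The motion tokenizer's nearest-neighbor lookup can be sketched as follows. The vocabulary here is a plain array of candidate segments (the paper builds it via k-disks clustering over ground-truth trajectories; the array layout and function names are assumptions for illustration):

```python
import numpy as np

# Illustrative nearest-neighbor motion tokenization: a 0.5 s trajectory
# segment is mapped to the index of its closest vocabulary entry.

def tokenize_segment(segment, vocab):
    """Return the index of the vocabulary entry nearest to `segment`.

    segment: (T, 2) array of relative (x, y) waypoints.
    vocab:   (V, T, 2) array of candidate segments.
    """
    # Mean per-waypoint Euclidean distance to every vocabulary entry.
    dists = np.linalg.norm(vocab - segment[None], axis=-1).mean(axis=-1)
    return int(np.argmin(dists))

def detokenize(token, vocab):
    """Recover the continuous segment a token stands for."""
    return vocab[token]
```

Detokenization is lossy by construction: the rollout replays the vocabulary entry, not the original continuous segment.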
2. Interleaved Next-Token Prediction¶
- Function: Alternately executes motion simulation and scenario generation at each timestep.
- Mechanism:
Temporal motion simulation (blue stream): For each active agent \(i\), its motion token \(m_i^t\) serves as query \(q_{m_i^t}\) and passes through three attention layers:

1. Temporal Attention: self-attention over the agent's past \(t_w\) motion tokens.
2. Agent-Agent Attention: cross-attention over other active agents within range \(r^{a \leftrightarrow a}\) at the same timestep.
3. Map-Agent Attention: cross-attention over map tokens within range \(r^{m \leftrightarrow a}\).
A motion head and a control head each sample from their respective token distributions. The control token is restricted to <KEEP AGENT> or <REMOVE AGENT>.
Spatial scenario generation (green stream): Uses a learnable agent query \(a_0\) passed through three attention layers, where Grid Attention replaces Temporal Attention. Grid Attention attends to occupancy grid tokens (binary occupancy indicators constructed from position tokens):

$$q'_{a_0} = \text{MHCA}^g(q_{a_0}, \Gamma(\{k_{g_j}\}), \Gamma(\{v_{g_j}\}))$$
The control token is restricted to <ADD AGENT> or <BEGIN MOTION>. Upon <ADD AGENT>, a new row is inserted and assigned a pose token; upon <BEGIN MOTION>, scenario generation concludes and the next timestep's motion simulation begins.
- Design Motivation:
- Interleaved execution naturally couples the two tasks, enabling the model to autonomously determine whether agents should be added or removed based on the current scene state.
- Occupancy grid encoding provides the scenario generation module with spatial awareness of the current scene, preventing redundant insertion at already-occupied locations.
- The autoregressive nature of next-token prediction allows the model to generalize from short logs during training to long-term simulation at inference (6× extension).
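The per-timestep control flow implied by the four control tokens can be sketched as below. All sampling functions are placeholders standing in for the model's heads, not the authors' implementation:

```python
# Hedged sketch of one interleaved timestep: the temporal stream advances
# and prunes existing agents; the spatial stream then inserts new agents
# until <BEGIN MOTION> hands control back to motion simulation.

KEEP, REMOVE = "<KEEP AGENT>", "<REMOVE AGENT>"
ADD, BEGIN = "<ADD AGENT>", "<BEGIN MOTION>"

def step(agents, sample_motion, sample_agent_ctrl, sample_scene_ctrl, sample_pose):
    # Temporal stream: sample each agent's motion token, then keep/remove.
    survivors = []
    for agent in agents:
        agent = sample_motion(agent)
        if sample_agent_ctrl(agent) == KEEP:
            survivors.append(agent)
    # Spatial stream: insert agents until <BEGIN MOTION> is emitted.
    while sample_scene_ctrl(survivors) == ADD:
        survivors.append(sample_pose(survivors))
    return survivors
```

Because the scene head is restricted to `<ADD AGENT>` / `<BEGIN MOTION>` and the agent head to `<KEEP AGENT>` / `<REMOVE AGENT>`, the two streams can never emit an invalid transition.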
3. Occupancy Grid Encoder¶
- Function: Encodes the spatial distribution of agents in the current scene as occupancy grid features for use in scenario generation.
- Mechanism: Each position in the position token vocabulary \(\mathcal{V}_\text{pos}\) is labeled 0 (empty) or 1 (occupied), converted to features via MLP, and fed into Grid Attention.
- Design Motivation: Enables the scenario generation module to efficiently reason about the spatial distribution of agents and determine where to insert new ones.
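A minimal sketch of this encoding, with a linear projection standing in for the paper's MLP (the weight layout and function name are assumptions):

```python
import numpy as np

# Minimal occupancy-grid encoding sketch: each cell of the position
# vocabulary is labeled empty (0) or occupied (1), then mapped to a
# feature vector. Random weights stand in for the learned MLP.

def occupancy_features(occupied_cells, num_cells, dim, rng):
    grid = np.zeros(num_cells)
    grid[list(occupied_cells)] = 1.0      # binary occupancy labels
    w = rng.standard_normal((2, dim))     # features for {empty, occupied}
    return w[grid.astype(int)]            # (num_cells, dim) grid tokens
```

These per-cell features are what Grid Attention consumes as keys and values when deciding where a new agent may be placed.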
Loss & Training¶
The total training loss is a weighted sum of standard NTP losses over multiple token types:

$$\mathcal{L} = \lambda_1 \mathcal{L}_\text{motion} + \lambda_2 \mathcal{L}_\text{pos} + \lambda_3 \mathcal{L}_\text{head} + \lambda_4 \mathcal{L}_\text{control} + \lambda_5 \mathcal{L}_\text{shape} + \lambda_6 \mathcal{L}_\text{type}$$
where \(\lambda_1 = \lambda_3 = 1\), \(\lambda_2 = \lambda_4 = 10\), \(\lambda_5 = 0.2\), \(\lambda_6 = 5\).
Training token sequences are constructed by arranging tokens in a fixed order at each timestep: motion token → control token (REMOVE/KEEP) → pose token (ADD) → BEGIN MOTION. Tokens of the same type are ordered by agent distance to the ego vehicle from nearest to farthest.
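The nearest-first ordering within each token type can be sketched as a plain sort on ego distance (the agent record layout here is an assumption for illustration):

```python
import math

# Illustrative sketch of the training-sequence ordering: at each
# timestep, tokens of the same type are emitted for agents sorted by
# distance to the ego vehicle, nearest first.

def order_by_ego_distance(agents, ego_xy):
    """agents: list of dicts with an 'xy' position; returns nearest-first."""
    return sorted(agents, key=lambda a: math.dist(a["xy"], ego_xy))
```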
Training configuration: batch size 8, 8× A5000 GPUs, AdamW with cosine annealing, initial learning rate 0.0005.
Key Experimental Results¶
Main Results¶
Short-term simulation (WOSAC benchmark, 9s):
| Method | Composite↑ | Kinematic↑ | Interactive↑ | Map↑ |
|---|---|---|---|---|
| TrafficBots | 0.6976 | 0.3994 | 0.7103 | 0.8342 |
| GUMP | 0.7404 | 0.4773 | 0.7872 | 0.8339 |
| SMART-7M | 0.7521 | 0.4799 | 0.8048 | 0.8573 |
| CatK | 0.7603 | 0.4611 | 0.8103 | 0.8732 |
| InfGen | 0.7514 | 0.4754 | 0.7936 | 0.8502 |
Long-term simulation (30s, extended WOSAC metrics; the last four columns are placement metrics):

| Method | Composite↑ | Kinematic↑ | Interactive↑ | Map↑ | \(N_+\)↑ | \(N_-\)↑ | \(D_+\)↑ | \(D_-\)↑ |
|---|---|---|---|---|---|---|---|---|
| SMART-7M | 0.6519 | 0.5839 | 0.7542 | 0.8102 | 0.4324 | 0.5713 | 0.4964 | 0.3371 |
| CatK | 0.6584 | 0.5850 | 0.7584 | 0.8186 | 0.4424 | 0.5842 | 0.5233 | 0.3371 |
| InfGen | 0.6606 | 0.5966 | 0.7619 | 0.8087 | 0.4542 | 0.6273 | 0.5635 | 0.3169 |
Ablation Study¶
Agent Count Error (ACE) metric:
| Method | Mean ACE↓ | ACE Slope↓ | Note |
|---|---|---|---|
| SMART-7M | 12.0 | 0.31 | Scene progressively empties |
| CatK | 12.2 | 0.32 | Same issue |
| InfGen | 8.1 | 0.15 | Error growth rate half that of baselines |
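These notes name ACE but do not spell out its formula; a plausible reading is the per-step absolute error between the simulated and logged agent counts, with the slope taken from a linear fit of that error over time. A sketch under that assumption:

```python
import numpy as np

# Assumed ACE definition (not confirmed by the notes): mean absolute
# per-step agent-count error, plus its growth rate from a degree-1 fit.

def agent_count_error(sim_counts, log_counts):
    err = np.abs(np.asarray(sim_counts, float) - np.asarray(log_counts, float))
    slope = np.polyfit(np.arange(len(err)), err, 1)[0]  # error growth per step
    return err.mean(), slope
```

Under this reading, a baseline whose scene progressively empties shows both a large mean ACE and a large positive slope, matching the table above.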
Motion-only simulation (agent insertion/removal disabled, 30s):
| Method | Composite↑ | Kinematic↑ | Interactive↑ | Map↑ |
|---|---|---|---|---|
| SMART-7M | 0.7428 | 0.5413 | 0.7626 | 0.8349 |
| CatK | 0.7316 | 0.5216 | 0.7347 | 0.8495 |
| InfGen | 0.7432 | 0.5495 | 0.7685 | 0.8213 |
Key Findings¶
- InfGen significantly outperforms baselines in long-term simulation: The advantage is most pronounced on Placement metrics, validating the core value of dynamic scenario generation.
- ACE Slope of 0.15 vs. baselines' 0.31–0.32: InfGen's scene density error growth rate is half that of baselines.
- Competitive short-term performance: Without any short-term-specific tuning, short-term performance approaches that of CatK.
- Similar performance across methods in motion-only setting: Confirms that long-term performance gaps stem primarily from scenario generation capability rather than motion prediction quality.
- Train short, infer long: Trained on ~9s logs, the model stably simulates 30s (6× extension), demonstrating the generalization potential of the autoregressive framework.
- Qualitative evidence: SMART scenes become empty after 18s; InfGen consistently maintains realistic traffic density.
Highlights & Insights¶
- Contribution to problem formulation: The paper explicitly identifies the "fixed agent set" assumption as the fundamental bottleneck in long-term simulation—a critical yet previously overlooked issue.
- Inspiration from Chameleon and interleaved multimodal generation: The interleaved generation paradigm from vision-language modeling is successfully transferred to "temporal motion–spatial layout" alternation.
- Unified NTP framework: Motion simulation and scenario generation share a single transformer; four control tokens enable elegant task switching.
- Evaluation framework contribution: The proposed ACE metric and extended WOSAC metrics establish an evaluation standard for long-term traffic simulation research.
- Elegant design of the Occupancy Grid Encoder: Provides the scenario generation module with visibility into the current spatial occupancy state, preventing unreasonable overlapping placements.
Limitations & Future Work¶
- Trip-level simulation not yet achieved: 30 seconds remains far shorter than real trips (>5 minutes), primarily constrained by WOMD map coverage.
- Limitations of pure supervised learning: The model may overfit causal relationships in training data; future work plans to incorporate interactive reinforcement learning.
- Slightly lower Map metric: Newly inserted agents may appear in non-drivable areas or near road boundaries, slightly degrading map compliance.
- Limited agent types: The work primarily focuses on vehicles; modeling of pedestrians and cyclists may be insufficient.
- Evaluated only on WOMD: Generalization to nuPlan or other simulation environments remains untested.
Related Work & Insights¶
- Distinction from SMART: SMART is a pure motion simulation NTP model; InfGen extends it with scenario generation capability.
- Distinction from SceneGen/TrafficGen: The latter generate static initial scenes; InfGen dynamically generates agents during closed-loop simulation.
- Analogy to Chameleon: The interleaved generation idea is transferred from "text + image" to "motion + scene layout."
- Distinction from SLEDGE: SLEDGE combines a generative model with a rule-based simulator; InfGen is fully end-to-end.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First work to introduce interleaved generation into traffic simulation, unifying motion and scenario generation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation on both short-term and long-term benchmarks, though genuine trip-level simulation is absent.
- Writing Quality: ⭐⭐⭐⭐⭐ — Problem motivation is clearly articulated; qualitative comparisons are compelling.
- Value: ⭐⭐⭐⭐⭐ — Represents an important step toward realistic trip-level traffic simulation.