Latent Chain-of-Thought World Modeling for End-to-End Autonomous Driving

Conference: CVPR 2026
arXiv: 2512.10226
Code: None
Area: LLM Reasoning
Keywords: Latent space reasoning, chain-of-thought, world model, end-to-end driving, VLA model

TL;DR

LCDrive proposes a latent chain-of-thought (Latent CoT) framework for end-to-end autonomous driving. Instead of reasoning in natural language, the model reasons with interleaved action proposal tokens and world model prediction tokens. Trained with a cold-start stage followed by RL post-training, it reports lower latency and better trajectory quality than textual CoT.

Background & Motivation

Background: Vision-Language-Action (VLA) models have become a dominant paradigm for end-to-end autonomous driving, and textual CoT reasoning has been introduced to improve performance in long-tail scenarios.
Limitations of Prior Work: (i) Natural language is ill-suited to representing spatiotemporal geometry and multi-agent interactions; (ii) autoregressive generation of long text adds significant latency; (iii) the generated action can contradict the textual reasoning (e.g., the text says "turn left" but the action turns right).
Key Challenge: Although textual CoT leverages the reasoning capabilities of LLMs, text is not the optimal representational medium for driving decisions.
Goal: Design a more efficient and better-aligned reasoning representation to replace textual CoT.
Key Insight: Express reasoning as a structured sequence in latent vector space rather than natural language.
Core Idea: Construct latent CoT by interleaving action proposal tokens (sharing vocabulary with output actions) and world model tokens (predicting future scene states).

Method

Overall Architecture

Training proceeds in three stages: (1) starting from a pretrained non-reasoning VLA, cold-start the latent CoT via teacher forcing on ground-truth (GT) world model states and the model's own action proposals; (2) train a lightweight latent world model (LWM) prediction head to predict world model embeddings from proposed actions; (3) apply RL post-training with trajectory-level rewards to optimize both the latent reasoning process and the final action prediction.

Key Designs

Latent CoT Representation:
- Function: Provides reasoning traces that are more efficient and better aligned than text.
- Mechanism: The reasoning sequence interleaves two types of tokens — (i) action proposal tokens: using the same vocabulary of 1024 motion primitives as the model's output actions (obtained via k-means clustering of training data), representing candidate actions; (ii) world model tokens: derived from a learned latent world model, representing future scene states after executing the candidate action. This forms a structured reasoning chain of "propose action → predict consequence → refine action → predict consequence."
- Design Motivation: Action proposal tokens are naturally aligned with output actions (shared vocabulary), eliminating the action–reasoning misalignment found in textual CoT. World model tokens directly encode physical interactions with greater precision than textual descriptions.
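
To make the shared action vocabulary concrete, here is a minimal sketch of building motion primitives with k-means and mapping a trajectory to its token id. The summary states only that 1024 primitives come from k-means over training data; the flattened waypoint format, the farthest-point initialization, the toy data, and all function names below are assumptions for illustration, not the authors' implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two flattened trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, traj):
    """Index of the closest centroid; this index serves as the action token id."""
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k], traj))

def init_farthest(trajs, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centroids = [list(trajs[0])]
    while len(centroids) < k:
        far = max(trajs, key=lambda t: min(sq_dist(c, t) for c in centroids))
        centroids.append(list(far))
    return centroids

def build_vocabulary(trajs, k, iters=10):
    """Plain Lloyd's k-means; returns k centroids (the motion primitives)."""
    centroids = init_farthest(trajs, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for t in trajs:
            buckets[nearest(centroids, t)].append(t)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

# Toy data: two-waypoint trajectories flattened as (x1, y1, x2, y2).
# Three synthetic maneuvers (straight, left, right) with hypothetical values.
rng = random.Random(1)
modes = [(5.0, 0.0, 10.0, 0.0), (4.0, 2.0, 7.0, 5.0), (4.0, -2.0, 7.0, -5.0)]
trajs = [[m + rng.gauss(0, 0.1) for m in modes[i % 3]] for i in range(60)]

vocab = build_vocabulary(trajs, k=3)  # the summary reports k = 1024
token = nearest(vocab, [5.0, 0.1, 10.0, -0.1])  # token id of a near-straight query
```

Because proposals and final outputs are both indices into the same `vocab`, a reasoning step and the emitted action are directly comparable, which is the alignment property the design motivation describes.
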
Cold-Start + LWM Prediction Head Training:
- Function: Initializes latent reasoning capability and enables the model to autonomously predict world states at inference time.
- Mechanism: The cold-start stage applies teacher-forcing on GT future rollout states and the model's own action proposals to establish initial reasoning patterns. A lightweight LWM prediction head is simultaneously trained to predict world model embeddings from proposed actions, removing the need for GT states at inference time.
- Design Motivation: Learning latent reasoning from random initialization is extremely difficult; a meaningful reasoning scaffold must first be established.
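
The inference-time loop implied by this design can be sketched as follows: the model proposes an action token, the LWM head predicts the resulting world embedding (no GT state is available), and the next proposal conditions on the chain so far. This is a sketch under stated assumptions: `propose`, `lwm_head`, the one-dimensional "gap" scene, and the `CLOSING` table are hypothetical stand-ins, and predicting each consequence from the current scene is an assumed reading of the summary.

```python
from typing import Callable, List, Tuple, Union

Token = Tuple[str, Union[int, List[float]]]

def latent_cot_rollout(
    scene: List[float],
    propose: Callable[[List[float], List[Token]], int],
    lwm_head: Callable[[List[float], int], List[float]],
    num_steps: int,
) -> Tuple[List[Token], int]:
    """Interleave action-proposal tokens with predicted world-model tokens."""
    chain: List[Token] = []
    for _ in range(num_steps):
        action = propose(scene, chain)        # token id from the action vocabulary
        chain.append(("action", action))
        future = lwm_head(scene, action)      # predicted, not ground-truth, future
        chain.append(("world", future))
    final_action = propose(scene, chain)      # final output uses the same vocabulary
    return chain, final_action

# Toy stand-ins (hypothetical): the scene is [gap to lead vehicle in metres],
# and each primitive closes the gap by a fixed amount.
CLOSING = [8.0, 4.0, 0.0]

def toy_lwm_head(scene: List[float], action: int) -> List[float]:
    return [scene[0] - CLOSING[action]]

def toy_propose(scene: List[float], chain: List[Token]) -> int:
    if not chain:
        return 0                              # start with the most aggressive primitive
    last_action = chain[-2][1]
    predicted_gap = chain[-1][1][0]
    if predicted_gap < 2.0:                   # predicted consequence is unsafe: refine
        return min(last_action + 1, len(CLOSING) - 1)
    return last_action

chain, final_action = latent_cot_rollout([5.0], toy_propose, toy_lwm_head, num_steps=2)
```

In this toy run the chain follows the "propose action, predict consequence, refine action, predict consequence" pattern, and the final action is the most conservative primitive because both predicted gaps fall below the toy safety threshold.
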
RL Post-Training:
- Function: Optimizes the entire reasoning process with trajectory-level rewards.
- Mechanism: Building on the scaffold established by cold-start, reinforcement learning is applied to optimize both the latent reasoning tokens and the final action predictions using trajectory-level rewards (e.g., collision avoidance, ride comfort, traffic rule compliance). RL yields larger gains for the latent reasoning model than for non-reasoning baselines.
- Design Motivation: Teacher-forcing can only imitate GT behavior, whereas RL allows the model to explore and discover superior reasoning-decision strategies.
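
As an illustration of trajectory-level rewards driving a policy-gradient update, the sketch below scores a group of sampled rollouts with a composite reward and standardizes the scores within the group, as GRPO-style methods do. The reward terms mirror the examples named above; the weights, the rollout fields, and the function names are assumptions, and the summary does not confirm the exact reward or algorithm used.

```python
import math

def trajectory_reward(collided, max_jerk, rule_violations,
                      w_collision=10.0, w_comfort=1.0, w_rules=2.0):
    """Composite reward: penalties for collision, discomfort, and rule violations.

    The weights are made up for this sketch.
    """
    return -(w_collision * float(collided)
             + w_comfort * max_jerk
             + w_rules * rule_violations)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Three hypothetical rollouts sampled for the same scene.
rollouts = [
    dict(collided=False, max_jerk=0.5, rule_violations=0),
    dict(collided=False, max_jerk=2.0, rule_violations=1),
    dict(collided=True, max_jerk=1.0, rule_violations=0),
]
rewards = [trajectory_reward(**r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Each advantage would then weight the log-probability of every token in its rollout, latent reasoning tokens and final action alike, so the whole chain is credited or penalized by the outcome of the trajectory.
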

Loss & Training

Cold-start: action prediction loss + world model prediction loss. RL post-training: GRPO or similar policy gradient methods with a composite trajectory-level reward.
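
A minimal sketch of the cold-start objective for a single reasoning step, assuming cross-entropy over the action vocabulary for the action term and mean squared error against the GT embedding for the world model term. The summary names only the two loss terms; the specific loss forms, the weighting factor `wm_weight`, and the toy values are assumptions.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mse(pred, target):
    """Mean squared error between predicted and GT world-model embeddings."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cold_start_loss(action_logits, action_target, wm_pred, wm_target, wm_weight=1.0):
    """Action prediction loss plus weighted world model prediction loss."""
    return (cross_entropy(action_logits, action_target)
            + wm_weight * mse(wm_pred, wm_target))

# Toy step: uniform logits over a 4-token vocabulary, 2-dimensional embedding.
loss = cold_start_loss(
    action_logits=[0.0, 0.0, 0.0, 0.0], action_target=2,
    wm_pred=[1.0, 2.0], wm_target=[1.0, 4.0],
)
```

With uniform logits the action term equals log 4, and the embedding error contributes 2.0, so the two terms can be inspected separately when tuning `wm_weight`.
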

Key Experimental Results

Main Results

| Method | Inference Latency | Trajectory Quality | RL Gain | Notes |
| --- | --- | --- | --- | --- |
| LCDrive (Latent CoT) | Lowest | Best | Largest | Latent reasoning |
| Text CoT VLA | High | Second best | Moderate | Natural language reasoning |
| Non-reasoning VLA | Low | Baseline | Smallest | No reasoning |

Ablation Study

| Configuration | Key Metric | Notes |
| --- | --- | --- |
| Full LCDrive | Best | Cold-start + LWM + RL |
| w/o RL post-training | Significant drop | RL contributes most to latent reasoning |
| w/o world model tokens | Drop | Action proposals alone are insufficient |
| w/o cold-start | Severe drop | Direct RL fails to establish reasoning |

Key Findings

LCDrive achieves lower inference latency than textual CoT reasoning, as latent token sequences are more compact (no redundant natural language tokens).
RL post-training yields substantially greater improvements for the latent reasoning model than for non-reasoning models, suggesting that latent CoT provides a more favorable optimization landscape.
Qualitative analysis shows that latent CoT reasoning produces more coherent decisions in multi-agent interaction scenarios.

Highlights & Insights

The central insight is that reasoning does not necessarily require language: driving decisions are at their core spatial reasoning, not linguistic reasoning.
Action–reasoning alignment is naturally achieved through a shared vocabulary, eliminating the core weakness of textual CoT.
The synergy between RL and latent reasoning is a significant finding: the reported gains suggest that latent space is more amenable to RL optimization than language space.

Limitations & Future Work

Cold-start relies on GT future states, requiring complete scene annotations.
The accuracy of the LWM prediction head affects reasoning quality.
Evaluation is currently conducted on a single dataset; generalizability requires further validation.

Related Work

vs. DriveVLM / AR1: These methods employ textual CoT reasoning, incurring high latency and potential action–text misalignment.
vs. MILE / LAW: These methods use latent world models but not as part of a reasoning chain. LCDrive integrates both into a structured reasoning framework.

Rating

Novelty: ⭐⭐⭐⭐⭐ Replacing textual CoT with latent CoT represents a conceptual breakthrough
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on a large-scale driving dataset, though limited to a single dataset
Writing Quality: ⭐⭐⭐⭐⭐ Precise problem formulation and thorough comparative analysis
Value: ⭐⭐⭐⭐⭐ Significant implications for the VLA reasoning paradigm