ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving¶

Conference: CVPR 2026
arXiv: 2512.22939
Code: Available
Area: Autonomous Driving
Keywords: End-to-end autonomous driving, VLM reasoning, latent space reasoning, multi-scale trajectory planning, vision-language-action

TL;DR¶

ColaVLA proposes a unified vision-language-action (VLA) framework that transfers VLM reasoning from textual chain-of-thought to latent space. Through a Cognitive Latent Reasoner and a Hierarchical Parallel Planner, the framework completes scene understanding and trajectory decoding with only two VLM forward passes. On nuScenes it reports the lowest open-loop L2 error and collision rate among action-based planners, and the highest closed-loop NeuroNCAP score among the compared methods.

Background & Motivation¶

End-to-end autonomous driving is evolving from modular pipelines toward unified learning. The integration of VLMs introduces cross-modal priors and commonsense reasoning, yet current VLM-based planners face three fundamental challenges:

Modality mismatch: A natural gap exists between discrete text tokens and continuous trajectory coordinates, which may produce format violations or physically inconsistent waypoints.

High chain-of-thought latency: Autoregressive token-by-token decoding leads to ever-growing sequences, with inference latency exceeding 3,700 ms in systems such as OmniDrive and SOLVE-VLM.

Non-causal planners hinder deployment: Existing planners cannot achieve parallel decoding while preserving causal structure.

The core idea of ColaVLA is to move all reasoning into a unified latent space, avoiding lengthy text generation while retaining the knowledge priors and generalization capability of VLMs.

Method¶

Overall Architecture¶

ColaVLA consists of two core modules:

Cognitive Latent Reasoner: Completes driving strategy inference in latent space through four stages—comprehension → recognition → rethinking → decision—requiring only two VLM forward passes.
Hierarchical Parallel Planner: Leverages meta-action priors from the reasoner to decode multi-scale, causally consistent trajectories in a single forward pass.

Key Designs¶

1. Driving Scene Comprehension¶

Fixed driving prompt embeddings \(\mathbf{T}\), multi-view visual embeddings \(\mathbf{V}\), and ego-state tokens \(\mathbf{E}\) are concatenated and fed through a shared VLM Transformer to obtain globally interacted visual tokens:

\[\mathbf{Q}_V = \mathcal{D}_{\text{vlm}}([\mathbf{T}; \mathbf{V}; \mathbf{E}]) \in \mathbb{R}^{L_v \times D}\]

Only the visual slice is retained; text and ego embeddings are discarded to ensure prompt immutability and avoid redundant information.
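
The slicing step can be illustrated with a short sketch. It assumes the token layout \([\mathbf{T}; \mathbf{V}; \mathbf{E}]\) from the equation above and uses a hypothetical stand-in, `toy_vlm`, in place of the shared VLM Transformer; all function names and sizes are illustrative and not taken from the paper.

```python
from typing import Callable, List

Token = List[float]  # one D-dimensional embedding


def comprehend_scene(
    prompt: List[Token],
    visual: List[Token],
    ego: List[Token],
    vlm: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Run one pass over [T; V; E] and keep only the visual slice Q_V."""
    sequence = prompt + visual + ego            # concatenate along the token axis
    outputs = vlm(sequence)                     # one output token per input token
    start = len(prompt)                         # visual tokens follow the prompt
    return outputs[start:start + len(visual)]   # L_v tokens of dimension D


def toy_vlm(tokens: List[Token]) -> List[Token]:
    """Hypothetical stand-in for the VLM: returns a copy of every token."""
    return [list(t) for t in tokens]


# Tiny example: 2 prompt tokens, 3 visual tokens, 1 ego token, D = 2.
T = [[0.0, 0.0], [0.1, 0.1]]
V = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
E = [[9.0, 9.0]]
Q_V = comprehend_scene(T, V, E, toy_vlm)
```

Because the prompt and ego outputs are dropped, the later stages can reuse the original \(\mathbf{T}\) and \(\mathbf{E}\) unchanged.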

2. Critical Entity Recognition¶

An ego-adaptive router is introduced to align visual tokens with ego state via FiLM conditioning:

\[\tilde{\mathbf{Q}}_V = (1 + \gamma(\mathbf{E})) \odot \mathbf{Q}_V + \beta(\mathbf{E})\]

The router then scores and selects the Top-K safety-critical visual tokens \(\mathbf{Q}^*\). During training, Gumbel-Softmax relaxation maintains differentiability; at inference, Top-K selection is applied directly. This step compresses 1,200 visual tokens to \(K=256\), forming an efficient information bottleneck.
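
A minimal sketch of the router follows, under stated assumptions: \(\gamma(\mathbf{E})\) and \(\beta(\mathbf{E})\) are passed in as precomputed vectors, the scorer is a hypothetical linear projection `w`, and the training-time relaxation is shown as a plain Gumbel-Softmax over token scores. The paper's exact router and relaxation may differ.

```python
import math
import random
from typing import List

Token = List[float]


def film(tokens: List[Token], gamma: List[float], beta: List[float]) -> List[Token]:
    """Channel-wise FiLM: (1 + gamma) * q + beta for every token q."""
    return [
        [(1.0 + g) * x + b for x, g, b in zip(tok, gamma, beta)]
        for tok in tokens
    ]


def router_scores(tokens: List[Token], w: List[float]) -> List[float]:
    """Hypothetical linear scorer: one scalar score per token."""
    return [sum(x * wi for x, wi in zip(tok, w)) for tok in tokens]


def gumbel_softmax(scores: List[float], tau: float, rng: random.Random) -> List[float]:
    """Training-time relaxation: softmax((score + Gumbel noise) / tau)."""
    noisy = []
    for s in scores:
        u = max(rng.random(), 1e-12)                 # guard against log(0)
        noisy.append((s - math.log(-math.log(u))) / tau)
    m = max(noisy)                                   # subtract max for stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores: List[float], k: int) -> List[int]:
    """Inference-time hard selection; indices are returned in token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


Q_V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
gamma, beta = [0.5, -0.5], [0.0, 1.0]        # stand-ins for gamma(E), beta(E)
scores = router_scores(film(Q_V, gamma, beta), w=[1.0, 1.0])
selected = top_k(scores, k=2)                # hard Top-K, as used at inference
soft = gumbel_softmax(scores, tau=1.0, rng=random.Random(0))  # soft weights for training
```

In the setting described above the same selection would keep \(K=256\) of the 1,200 visual tokens.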

3. Latent Rethinking¶

The fixed prompt \(\mathbf{T}\), the selected \(K\) visual tokens \(\mathbf{Q}^*\), ego token \(\mathbf{E}\), and \(C\) learnable meta-queries \(\mathbf{M}\) are concatenated for a second VLM forward pass:

\[\mathbf{Q}_M = \mathcal{D}_{\text{vlm}}([\mathbf{T}; \mathbf{Q}^*; \mathbf{E}; \mathbf{M}]) \in \mathbb{R}^{C \times D}\]

Each meta-query corresponds to one driving meta-action (e.g., straight cruising, unprotected left turn, emergency braking); the set of meta-actions is obtained by clustering training trajectories.
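
The clustering step is not specified further here. As one plausible illustration, the sketch below runs plain k-means on trajectory endpoints; the choice of k-means, the use of endpoints as features, and the toy data are all assumptions rather than the authors' procedure.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (lateral, longitudinal) endpoint in metres


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def assign(point: Point, centers: List[Point]) -> int:
    """Index of the nearest cluster centre, i.e. the meta-action id."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def kmeans(points: List[Point], centers: List[Point], iters: int = 10) -> List[Point]:
    """Plain Lloyd iterations starting from the given initial centres."""
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            groups[assign(p, centers)].append(p)
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)   # keep the old centre if a group is empty
        ]
    return centers


# Toy endpoints: two "straight", two "left turn", two "braking" trajectories.
endpoints = [(0.0, 20.0), (0.5, 22.0), (-8.0, 12.0), (-9.0, 11.0), (0.0, 1.0), (0.1, 0.5)]
meta_actions = kmeans(endpoints, centers=[endpoints[0], endpoints[2], endpoints[4]])
label = assign((0.2, 19.0), meta_actions)   # meta-action id for a new trajectory
```

Each resulting centre would then be tied to one of the \(C\) learnable meta-queries.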

4. Strategic Decision Synthesis¶

Meta-query embeddings are modulated via FiLM and cross-attention, then mapped to driving strategy logits by an MLP. Training uses focal loss to emphasize hard and safety-critical samples.
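
For reference, the standard multi-class focal loss is \(-(1 - p_t)^{\gamma} \log p_t\), where \(p_t\) is the softmax probability of the true class. The sketch below implements that standard form; the focusing parameter \(\gamma = 2\) is a common default and an assumption here, since the value used for ColaVLA is not stated in this summary.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: List[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


strategy_logits = [4.0, 0.0, 0.0]                 # model strongly prefers class 0
easy = focal_loss(strategy_logits, target=0)      # confident and correct
hard = focal_loss(strategy_logits, target=1)      # confident but wrong
cross_entropy = focal_loss(strategy_logits, target=0, gamma=0.0)
```

With \(\gamma = 0\) the expression reduces to ordinary cross-entropy; larger \(\gamma\) shrinks the contribution of well-classified samples so that hard ones dominate the gradient.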

5. Hierarchical Parallel Planner¶

The prediction horizon of \(T\) steps is partitioned into \(S\) nested scales \(\mathcal{I}_1 \subset \cdots \subset \mathcal{I}_S = \mathcal{T}\), refining trajectories from coarse to fine:

Stage-aware trajectory queries: The meta-action embedding selected by the reasoner is expanded via temporal embeddings into multi-scale targets.
Causality-preserving hybrid attention: A hybrid attention mask \(\mathcal{M}\) is designed so that tokens at scale \(s\) can only attend to scale \(s-1\) and context tokens, preventing future information leakage.
Confidence-guided parallel decoding: Multiple candidate strategies are processed simultaneously; two MLP heads estimate confidence and regress trajectories respectively. Only the hypothesis nearest to the ground truth receives supervision, preventing mode collapse.
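
The three components above can be sketched as follows. The sketch makes several assumptions that are not confirmed by the source: scales are built by halving the stride so that the endpoint is always in the coarsest scale, tokens are ordered as context followed by scale 1 to scale \(S\), context tokens attend only to context, and "nearest to the ground truth" is measured by mean L2 distance.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def nested_scales(horizon: int, num_scales: int) -> List[List[int]]:
    """Nested index sets I_1 ⊂ ... ⊂ I_S; the last scale covers every step."""
    scales = []
    for s in range(1, num_scales + 1):
        stride = 2 ** (num_scales - s)            # coarse scales use large strides
        scales.append([t for t in range(horizon) if (t + 1) % stride == 0])
    return scales


def hybrid_mask(num_context: int, scale_sizes: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    scale_id = [0] * num_context                  # 0 marks a context token
    for s, n in enumerate(scale_sizes, start=1):
        scale_id += [s] * n
    total = len(scale_id)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            sees_context = scale_id[k] == 0
            sees_previous_scale = scale_id[k] == scale_id[q] - 1
            mask[q][k] = sees_context or sees_previous_scale
    return mask


def nearest_hypothesis(candidates: List[List[Point]], ground_truth: List[Point]) -> int:
    """Index of the candidate with the smallest mean L2 error (the supervised one)."""
    def mean_l2(traj: List[Point]) -> float:
        return sum(math.dist(p, g) for p, g in zip(traj, ground_truth)) / len(ground_truth)
    return min(range(len(candidates)), key=lambda i: mean_l2(candidates[i]))


scales = nested_scales(horizon=8, num_scales=3)   # [[3, 7], [1, 3, 5, 7], [0, ..., 7]]
mask = hybrid_mask(num_context=2, scale_sizes=[len(idx) for idx in scales])
winner = nearest_hypothesis(
    candidates=[[(0.0, 1.0), (0.0, 2.0)], [(1.0, 1.0), (2.0, 2.0)]],
    ground_truth=[(0.0, 1.1), (0.0, 2.1)],
)
```

Since no token can attend to a finer scale than its own, all scales can be decoded in one pass without finer predictions leaking into coarser ones.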

Loss & Training¶

Multi-stage training: Stage 1 pre-trains the VLM on OmniDrive-nuScenes QA pairs (updating only LoRA parameters); Stage 2 jointly fine-tunes the integrated action planner.
Built on LLaVA v1.5 (LLaMA-7B); EVA-02-L is used as the image encoder and SQ-Former for visual reasoning.
AdamW optimizer with cosine annealing; learning rate \(1 \times 10^{-4}\).
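
The cosine annealing schedule follows the usual form \(\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T_{\text{total}}))\). The sketch below evaluates it with the peak learning rate of \(1 \times 10^{-4}\) given above; the minimum learning rate of 0 and the step count are hypothetical, since this summary does not report them.

```python
import math


def cosine_lr(step: int, total_steps: int, lr_max: float = 1e-4, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given step (no warm-up, no restarts)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000                                # hypothetical schedule length
lr_start = cosine_lr(0, TOTAL_STEPS)              # equals lr_max
lr_mid = cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS) # half of lr_max when lr_min is 0
lr_end = cosine_lr(TOTAL_STEPS, TOTAL_STEPS)      # decays to lr_min
```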

Key Experimental Results¶

Main Results¶

Table 1: nuScenes Open-Loop Planning Results

Method	Type	Avg L2 (m) ↓	Avg Col. (%) ↓
UniAD	Action+Ego	0.46	0.37
VAD-Base	Action+Ego	0.37	0.33
SOLVE-E2E	Action+Ego	0.31	0.30
SOLVE-VLM	Text	0.28	0.20
ColaVLA	Action+Ego	0.30	0.23

Table 2: NeuroNCAP Closed-Loop Simulation Results

Method	NeuroNCAP Score ↑	Avg Col. (%) ↓
UniAD	0.73	88.6
VAD	0.66	92.5
ImpromptuVLA†	2.06	65.1
BridgeAD-B‡	3.06	44.3
ColaVLA	3.48	36.8

Ablation Study¶

Reasoning Module	Rethinking Stage	Avg L2 (cm) ↓
✗	✗	32.2
✓	✗	31.3
✓	✓	30.4

Planner Type	NeuroNCAP Score ↑
MLP-based	1.05
Diffusion-based	1.02
Ours	1.50

Inference latency comparison: ColaVLA takes 727 ms, versus 3,727 ms for OmniDrive and 3,719 ms for SOLVE-VLM (single H20 GPU), a speedup of roughly 5.1×.
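
The speedup factor can be checked directly from the three reported latencies; the snippet below uses only the numbers quoted above.

```python
# Reported single-GPU inference latencies in milliseconds.
latency_ms = {"ColaVLA": 727, "OmniDrive": 3727, "SOLVE-VLM": 3719}

# Speedup of ColaVLA relative to each text-reasoning baseline.
speedup = {
    name: ms / latency_ms["ColaVLA"]
    for name, ms in latency_ms.items()
    if name != "ColaVLA"
}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x slower than ColaVLA")
```

Both ratios come out slightly above 5, which matches the speedup stated in the comparison.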

Key Findings¶

Latent space reasoning reduces latency by more than 5× compared to textual chain-of-thought while maintaining or improving planning quality.
In closed-loop evaluation, the collision rate drops from 65.1% (ImpromptuVLA) to 36.8%, with static collisions reduced by 73%.
The hierarchical interpolation strategy (predicting endpoints first, then filling intermediate points) outperforms sequential, reverse, and single-scale strategies.
Top-K = 256 safety-critical tokens achieves the optimal accuracy–efficiency trade-off.

Highlights & Insights¶

Paradigm innovation: This work is the first to systematically propose a complete framework that transfers VLM reasoning from text space to a unified latent space, eliminating modality mismatch and autoregressive latency.
Cognition-inspired design: The four-stage reasoning process (comprehension → recognition → rethinking → decision) emulates human driving cognition, with each stage serving a clear information-processing objective.
Causally consistent parallel decoding: A carefully designed hybrid attention mask enables simultaneous multi-scale trajectory decoding in a single forward pass, balancing efficiency and causality.
Closed-loop SOTA: ColaVLA substantially outperforms prior methods on the safety-critical NeuroNCAP benchmark, validating the effectiveness of latent space reasoning for real-world deployment.

Limitations & Future Work¶

Validation is limited to the nuScenes dataset; generalization to larger-scale or cross-domain data remains untested.
Meta-action categories are hard-coded via clustering and may fail to cover all long-tail driving scenarios.
The method still relies on LiDAR and pre-trained perception modules; performance in a camera-only setting has not been verified.
Closed-loop evaluation is conducted solely on the NeuroNCAP simulator, without real-world road validation.

Related Work & Insights¶

UniAD/VAD: Pioneering end-to-end driving pipelines, but reliant on sparse trajectory supervision and lacking high-level semantic reasoning.
DriveVLM/OmniDrive/EMMA: VLM-based textual reasoning planners with high inference latency.
ImpromptuVLA/SOLVE-VLM: Dual-system designs combining VLMs with planners, yet still constrained by text-level reasoning.
The latent space reasoning paradigm is transferable to tasks requiring rapid decision-making, such as robotic manipulation and visual navigation.

Rating¶

Dimension	Score (1–5)
Novelty	5
Technical Depth	5
Experimental Thoroughness	4
Writing Quality	4
Value	4
Overall	4.5