
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Conference: CVPR 2026 · arXiv: 2512.22939 · Code: Available · Area: Autonomous Driving · Keywords: End-to-end autonomous driving, VLM reasoning, latent space reasoning, multi-scale trajectory planning, vision-language-action

TL;DR

ColaVLA proposes a unified vision-language-action (VLA) framework that transfers VLM reasoning from textual chain-of-thought to latent space. Through a Cognitive Latent Reasoner and a Hierarchical Parallel Planner, the framework completes scene understanding and trajectory decoding with only two VLM forward passes, achieving competitive open-loop performance on nuScenes and state-of-the-art closed-loop results on the nuScenes-based NeuroNCAP benchmark.

Background & Motivation

End-to-end autonomous driving is evolving from modular pipelines toward unified learning. The integration of VLMs introduces cross-modal priors and commonsense reasoning, yet current VLM-based planners face three fundamental challenges:

Modality mismatch: A natural gap exists between discrete text tokens and continuous trajectory coordinates, which may produce format violations or physically inconsistent waypoints.

High chain-of-thought latency: Autoregressive token-by-token decoding leads to ever-growing sequences, with inference latency exceeding 3,700 ms in systems such as OmniDrive and SOLVE-VLM.

Non-causal parallel decoding hinders deployment: Existing planners cannot decode trajectories in parallel while preserving causal structure, forcing a trade-off between real-time efficiency and causal consistency.

The core idea of ColaVLA is to transfer all reasoning entirely into a unified latent space, avoiding lengthy text generation while retaining the knowledge priors and generalization capability of VLMs.

Method

Overall Architecture

ColaVLA consists of two core modules:

  • Cognitive Latent Reasoner: Completes driving strategy inference in latent space through four stages—comprehension → recognition → rethinking → decision—requiring only two VLM forward passes.
  • Hierarchical Parallel Planner: Leverages meta-action priors from the reasoner to decode multi-scale, causally consistent trajectories in a single forward pass.

Key Designs

1. Driving Scene Comprehension

Fixed driving prompt embeddings \(\mathbf{T}\), multi-view visual embeddings \(\mathbf{V}\), and ego-state tokens \(\mathbf{E}\) are concatenated and fed through a shared VLM Transformer to obtain globally interacted visual tokens:

\[\mathbf{Q}_V = \mathcal{D}_{\text{vlm}}([\mathbf{T}; \mathbf{V}; \mathbf{E}]) \in \mathbb{R}^{L_v \times D}\]

Only the visual slice is retained; text and ego embeddings are discarded to ensure prompt immutability and avoid redundant information.
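
A minimal sketch of this first pass, using a batch-first PyTorch Transformer as a stand-in for \(\mathcal{D}_{\text{vlm}}\); the hidden size, prompt length, and layer count below are illustrative, and only the 1,200-token visual length comes from the paper:

```python
import torch
import torch.nn as nn

D = 256                      # hidden size (illustrative)
L_t, L_v, L_e = 32, 1200, 1  # prompt / multi-view visual / ego token counts

# Stand-in for the shared VLM Transformer D_vlm.
vlm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)

T = torch.randn(1, L_t, D)   # fixed driving prompt embeddings
V = torch.randn(1, L_v, D)   # multi-view visual embeddings
E = torch.randn(1, L_e, D)   # ego-state token

out = vlm(torch.cat([T, V, E], dim=1))   # global cross-token interaction
Q_V = out[:, L_t:L_t + L_v]              # retain only the visual slice
assert Q_V.shape == (1, L_v, D)
```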

2. Critical Entity Recognition

An ego-adaptive router is introduced to align visual tokens with ego state via FiLM conditioning:

\[\tilde{\mathbf{Q}}_V = (1 + \gamma(\mathbf{E})) \odot \mathbf{Q}_V + \beta(\mathbf{E})\]

The router then scores and selects the Top-K safety-critical visual tokens \(\mathbf{Q}^*\). During training, Gumbel-Softmax relaxation maintains differentiability; at inference, Top-K selection is applied directly. This step compresses 1,200 visual tokens to \(K=256\), forming an efficient information bottleneck.
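
A sketch of how such a router could look in PyTorch. The paper specifies FiLM conditioning, Gumbel-Softmax relaxation, and Top-K selection; the concrete relaxation below (Gumbel-perturbed Top-K with soft score weighting so the score head stays differentiable) is an assumption, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

def gumbel_like(x: torch.Tensor) -> torch.Tensor:
    u = torch.rand_like(x).clamp_min(1e-9)
    return -torch.log(-torch.log(u))

class EgoAdaptiveRouter(nn.Module):
    """Sketch: FiLM-modulate visual tokens with the ego state, score them,
    and keep the Top-K safety-critical tokens (K = 256 in the paper)."""

    def __init__(self, dim: int, k: int = 256):
        super().__init__()
        self.k = k
        self.gamma = nn.Linear(dim, dim)   # gamma(E) in the FiLM equation
        self.beta = nn.Linear(dim, dim)    # beta(E)
        self.score = nn.Linear(dim, 1)     # per-token relevance logit

    def forward(self, Q_V: torch.Tensor, E: torch.Tensor, tau: float = 1.0):
        # FiLM conditioning: (1 + gamma(E)) * Q_V + beta(E);
        # E has shape (B, 1, D) and broadcasts over the token axis.
        Q_mod = (1 + self.gamma(E)) * Q_V + self.beta(E)
        logits = self.score(Q_mod).squeeze(-1)             # (B, L_v)
        if self.training:
            logits = (logits + gumbel_like(logits)) / tau  # stochastic relaxation
        idx = logits.topk(self.k, dim=-1).indices          # hard Top-K at inference
        batch = torch.arange(Q_V.size(0)).unsqueeze(-1)
        sel = Q_mod[batch, idx]                            # (B, K, D)
        # Weight by soft scores so gradients reach the score head.
        return sel * torch.sigmoid(logits[batch, idx]).unsqueeze(-1)

router = EgoAdaptiveRouter(dim=256, k=4)
out = router(torch.randn(2, 10, 256), torch.randn(2, 1, 256))
assert out.shape == (2, 4, 256)
```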

3. Latent Rethinking

The fixed prompt \(\mathbf{T}\), the selected \(K\) visual tokens \(\mathbf{Q}^*\), ego token \(\mathbf{E}\), and \(C\) learnable meta-queries \(\mathbf{M}\) are concatenated for a second VLM forward pass:

\[\mathbf{Q}_M = \mathcal{D}_{\text{vlm}}([\mathbf{T}; \mathbf{Q}^*; \mathbf{E}; \mathbf{M}]) \in \mathbb{R}^{C \times D}\]

Each meta-query is initialized to a driving meta-action (e.g., straight cruising, unprotected left turn, emergency braking), obtained by clustering training trajectories.
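
A hedged sketch of this initialization, assuming plain k-means over flattened waypoint sequences; the paper's clustering setup, number of meta-actions \(C\), and horizon are not reproduced here, so the values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

C, T_steps = 16, 6                                  # illustrative values
trajs = np.random.randn(10_000, T_steps, 2)         # stand-in (x, y) waypoint sequences

km = KMeans(n_clusters=C, n_init=10, random_state=0)
km.fit(trajs.reshape(len(trajs), -1))               # cluster flattened trajectories
meta_actions = km.cluster_centers_.reshape(C, T_steps, 2)
# Each center is a prototypical maneuver (straight cruise, left turn, ...)
# and would be projected to the VLM hidden size D to seed a meta-query in M.
```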

4. Strategic Decision Synthesis

Meta-query embeddings are modulated via FiLM and cross-attention, then mapped to driving strategy logits by an MLP. Training uses focal loss to emphasize hard and safety-critical samples.
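
For reference, a minimal multi-class focal loss in its standard form; the paper's exact focusing parameter \(\gamma\) and any class weighting are not given here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 2.0):
    """Multi-class focal loss: (1 - p_t)^gamma down-weights easy examples,
    focusing training on hard, safety-critical strategy decisions."""
    log_pt = F.log_softmax(logits, dim=-1).gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return (-(1 - log_pt.exp()) ** gamma * log_pt).mean()

# e.g., logits over 16 candidate strategies for a batch of 4 scenes
loss = focal_loss(torch.randn(4, 16), torch.randint(0, 16, (4,)))
```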

5. Hierarchical Parallel Planner

The prediction horizon of \(T\) steps is partitioned into \(S\) nested scales \(\mathcal{I}_1 \subset \cdots \subset \mathcal{I}_S = \mathcal{T}\), refining trajectories from coarse to fine:

  • Stage-aware trajectory queries: The meta-action embedding selected by the reasoner is expanded via temporal embeddings into multi-scale targets.
  • Causality-preserving hybrid attention: A hybrid attention mask \(\mathcal{M}\) is designed so that tokens at scale \(s\) can only attend to scale \(s-1\) and context tokens, preventing future information leakage (a construction sketch follows this list).
  • Confidence-guided parallel decoding: Multiple candidate strategies are processed simultaneously; two MLP heads estimate confidence and regress trajectories respectively. Only the hypothesis nearest to the ground truth receives supervision, preventing mode collapse.
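
One way such a mask could be built, following the stated rule literally (scale-\(s\) tokens see only the context and scale \(s-1\); whether within-scale self-attention is additionally allowed is not specified here). Sizes are illustrative:

```python
import torch

def hybrid_mask(context_len: int, scale_sizes: list) -> torch.Tensor:
    """Boolean mask (True = attention permitted): tokens at scale s attend
    only to context tokens and to tokens of the previous scale s-1."""
    total = context_len + sum(scale_sizes)
    allowed = torch.zeros(total, total, dtype=torch.bool)
    allowed[:, :context_len] = True        # every token sees the context
    start = context_len
    prev = (0, context_len)                # scale 1's "previous" is the context
    for size in scale_sizes:
        allowed[start:start + size, prev[0]:prev[1]] = True
        prev = (start, start + size)
        start += size
    return allowed

# e.g., 4 context tokens and three nested scales refined coarse to fine
mask = hybrid_mask(context_len=4, scale_sizes=[2, 3, 6])
assert mask.shape == (15, 15)
assert not mask[9, 9]   # a fine-scale token cannot attend to its own scale
assert mask[9, 6]       # ...but can attend to the previous (coarser) scale
```

Because every scale's dependencies point only backward (to the context and the coarser scale), all hypotheses can be decoded simultaneously under this mask in a single forward pass without leaking future information.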

Loss & Training

  • Multi-stage training: Stage 1 pre-trains the VLM on OmniDrive-nuScenes QA pairs (updating only LoRA parameters); Stage 2 integrates the action planner and fine-tunes the system jointly.
  • Built on LLaVA v1.5 (LLaMA-7B); EVA-02-L is used as the image encoder and SQ-Former for visual reasoning.
  • AdamW optimizer with cosine annealing; learning rate \(1 \times 10^{-4}\).
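
The stated recipe maps directly onto standard PyTorch components; the model and epoch count below are placeholders, not values from the paper:

```python
import torch

model = torch.nn.Linear(256, 2)                     # stand-in for the planner head
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=12)  # epoch count assumed

for _ in range(12):                                  # placeholder epoch loop
    loss = model(torch.randn(8, 256)).pow(2).mean()  # dummy objective
    opt.zero_grad(); loss.backward(); opt.step()
    sched.step()                                     # cosine-anneal the learning rate
```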

Key Experimental Results

Main Results

Table 1: nuScenes Open-Loop Planning Results

| Method | Type | Avg L2 (m) ↓ | Avg Col. (%) ↓ |
| --- | --- | --- | --- |
| UniAD | Action+Ego | 0.46 | 0.37 |
| VAD-Base | Action+Ego | 0.37 | 0.33 |
| SOLVE-E2E | Action+Ego | 0.31 | 0.30 |
| SOLVE-VLM | Text | 0.28 | 0.20 |
| ColaVLA | Action+Ego | 0.30 | 0.23 |

Table 2: NeuroNCAP Closed-Loop Simulation Results

| Method | NeuroNCAP Score ↑ | Avg Col. (%) ↓ |
| --- | --- | --- |
| UniAD | 0.73 | 88.6 |
| VAD | 0.66 | 92.5 |
| ImpromptuVLA† | 2.06 | 65.1 |
| BridgeAD-B‡ | 3.06 | 44.3 |
| ColaVLA | 3.48 | 36.8 |

Ablation Study

| Reasoning Module | Rethinking Stage | Avg L2 (cm) ↓ |
| --- | --- | --- |
| – | – | 32.2 |
| ✓ | – | 31.3 |
| ✓ | ✓ | 30.4 |

| Planner Type | NeuroNCAP Score ↑ |
| --- | --- |
| MLP-based | 1.05 |
| Diffusion-based | 1.02 |
| Ours | 1.50 |

Inference latency comparison: ColaVLA 727 ms vs. OmniDrive 3,727 ms vs. SOLVE-VLM 3,719 ms (single H20 GPU), achieving a 5× speedup.

Key Findings

  1. Latent space reasoning reduces latency by more than 5× compared to textual chain-of-thought while maintaining or improving planning quality.
  2. In closed-loop evaluation, the collision rate drops from 65.1% (ImpromptuVLA) to 36.8%, with static collisions reduced by 73%.
  3. The hierarchical interpolation strategy (predicting endpoints first, then filling intermediate points) outperforms sequential, reverse, and single-scale strategies.
  4. Top-K = 256 safety-critical tokens achieves the optimal accuracy–efficiency trade-off.

Highlights & Insights

  1. Paradigm innovation: This work is the first to systematically propose a complete framework that transfers VLM reasoning from text space to a unified latent space, eliminating modality mismatch and autoregressive latency.
  2. Cognition-inspired design: The four-stage reasoning process (comprehension → recognition → rethinking → decision) emulates human driving cognition, with each stage serving a clear information-processing objective.
  3. Causally consistent parallel decoding: A carefully designed hybrid attention mask enables simultaneous multi-scale trajectory decoding in a single forward pass, balancing efficiency and causality.
  4. Closed-loop SOTA: ColaVLA substantially outperforms prior methods on the safety-critical NeuroNCAP benchmark, validating the effectiveness of latent space reasoning for real-world deployment.

Limitations & Future Work

  1. Validation is limited to the nuScenes dataset; generalization to larger-scale or cross-domain data remains untested.
  2. Meta-action categories are hard-coded via clustering and may fail to cover all long-tail driving scenarios.
  3. The method still relies on LiDAR and pre-trained perception modules; performance in a camera-only setting has not been verified.
  4. Closed-loop evaluation is conducted solely on the NeuroNCAP simulator, without real-world road validation.

Related Work & Outlook

  • UniAD/VAD: Pioneering end-to-end driving pipelines, but reliant on sparse trajectory supervision and lacking high-level semantic reasoning.
  • DriveVLM/OmniDrive/EMMA: VLM-based textual reasoning planners with high inference latency.
  • ImpromptuVLA/SOLVE-VLM: Dual-system designs combining VLMs with planners, yet still constrained by text-level reasoning.
  • The latent space reasoning paradigm is transferable to tasks requiring rapid decision-making, such as robotic manipulation and visual navigation.

Rating

| Dimension | Score (1–5) |
| --- | --- |
| Novelty | 5 |
| Technical Depth | 5 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Value | 4 |
| Overall | 4.5 |