
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

Conference: CVPR 2026 | arXiv: 2602.21952 | Code: https://github.com/hotdogcheesewhite/MindDriver | Area: Autonomous Driving | Keywords: Multimodal Reasoning, Chain-of-Thought, VLM Autonomous Driving, Progressive Reasoning, Reinforcement Fine-tuning

TL;DR

This paper proposes MindDriver, a progressive multimodal reasoning framework that emulates the human "perception→imagination→action" cognitive process. The model first performs textual semantic understanding, then imagines future scene images (bridging semantic and physical spaces), and finally predicts trajectories. Combined with feedback-guided automatic data annotation and progressive reinforcement fine-tuning, MindDriver achieves state-of-the-art open-loop performance on nuScenes and competitive closed-loop performance on Bench2Drive.

Background & Motivation

Background: VLMs are increasingly being applied to end-to-end autonomous driving, directly predicting trajectories from raw sensor inputs. Chain-of-Thought (CoT) reasoning has been introduced to enhance scene reasoning and interpretability.

Limitations of Prior Work: (a) Text-based CoT reasons in semantic space and then directly predicts trajectories in physical space, resulting in spatial misalignment—the gap between the semantic space and the physical trajectory space is too large, leading to decision inconsistencies. (b) Recent approaches replace text CoT with future images (e.g., FSDrive), but lack planning-oriented guidance, leaving the model without clear cues about which objects to attend to, and failing to leverage the large-scale pretraining knowledge embedded in LLMs.

Key Challenge: A well-aligned bridge is needed between semantic-space reasoning (derived from LLM pretraining) and physical-space trajectory prediction—one that exploits semantic knowledge while connecting to the physical space.

Goal: Design a smooth, progressive reasoning pathway from semantic space to physical space, and address the scarcity and misalignment of multimodal reasoning training data.

Key Insight: The human driving cognitive model of "perception–imagination–action"—first understanding the scene (semantics), then imagining future changes (imagery), and finally planning actions based on that imagination (trajectory).

Core Idea: Use textual reasoning to guide future scene image generation, and then use the imagined images to guide trajectory prediction, achieving progressive alignment along the text→image→trajectory chain.

Method

Overall Architecture

MindDriver takes six surround-view camera images, historical front-view frames, driving commands, and ego-vehicle status as input. Through a unified text-reasoning and visual-generation model, it executes three-stage progressive reasoning: (1) Semantic Understanding (textual scene analysis and decision-making) → (2) Semantic-to-Physical Space Imagination (text-guided future scene image generation) → (3) Physical-Space Trajectory Planning (trajectory prediction from imagined images). This is complemented by a feedback-guided automatic data annotation pipeline and progressive reinforcement fine-tuning.
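
As a rough illustration of how the three stages compose at inference time, here is a minimal sketch. The generation interface, helper functions, and delimiter handling are assumptions for illustration only; the <think>/<dream>/<answer> special tokens are described under Key Designs below.

```python
# Illustrative sketch of MindDriver's three-stage inference, not the authors' code.
# A single autoregressive model emits text, then visual tokens, then a trajectory,
# each stage conditioned on everything generated before it.

def minddriver_generate(model, inputs):
    """inputs: six surround-view images, historical front-view frames,
    a driving command, and ego-vehicle status, packed into one multimodal prompt."""
    prompt = build_prompt(inputs)  # hypothetical prompt/tokenization helper

    # Stage 1: semantic understanding -- textual scene analysis and decision.
    think = model.generate(prompt, stop="<dream>")             # "<think> ..." text tokens

    # Stage 2: imagination -- text-guided future scene as VQ-VAE visual tokens.
    dream = model.generate(prompt + think, stop="<answer>")    # visual tokens from the shared vocabulary
    dream_image = visual_detokenize(dream)                     # e.g. a MoVQGAN detokenizer

    # Stage 3: action -- trajectory prediction conditioned on text and imagined image.
    answer = model.generate(prompt + think + dream)
    return think, dream_image, parse_waypoints(answer)         # e.g. a list of (x, y) waypoints
```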

Key Designs

  1. Progressive Multimodal Reasoning:

    • Function: Decomposes reasoning into three steps—text→image→trajectory—where each step is conditioned on the previous one. Special tokens (<think>, <dream>, <answer>) delimit the three stages.
    • Motivation: Direct text→trajectory prediction involves too large a gap (spatial misalignment); direct image→trajectory prediction lacks semantic guidance and cannot leverage LLM knowledge.
    • Unified Architecture: Extends the VQ-VAE visual codebook into the LLM vocabulary, enabling the model to generate both text tokens and visual tokens within the same autoregressive framework with a shared prediction head.
    • Training Objective: \(\mathcal{L} = -\sum_i \log P_\theta(y_i | y_{<i})\), unifying autoregressive generation of text and visual tokens (a loss sketch appears after this list).
  2. Feedback-Guided Auto-annotation:

    • Function: Automatically generates high-quality, well-aligned multimodal reasoning training data.
    • Core Pipeline: (1) Use Qwen2.5-VL-72B to generate initial text CoT from video context (multi-frame, not single-frame); (2) Apply three rounds of filtering—format filtering (rule-based structural integrity check), decision filtering (comparison against GT decisions derived from GT trajectories), and logic filtering (reasoning quality evaluation by the stronger Qwen3-235B text LLM to avoid self-evaluation bias); (3) Failed samples are returned for re-annotation with specific error feedback (format errors, decision deviations, and logical inconsistencies).
    • Video Context Design: Scene analysis and potential risk assessment are based on multi-frame video rather than single images, enabling the capture of object motion trends.
    • Design Motivation: Manually annotating multimodal reasoning chains is infeasible at scale; automated annotation with multi-round feedback ensures annotation quality (a sketch of this loop appears after this list).
  3. Progressive Reinforcement Fine-tuning:

    • Function: Uses the GRPO (Group Relative Policy Optimization) algorithm in two stages to reinforce alignment, replacing the token-level uniform supervision of standard SFT.
    • Stage 1 (Dream Semantically Consistent Image): Optimizes the model to generate semantically consistent future scene images conditioned on textual reasoning. The reward function uses CLIP similarity: \(r_{Img} = \text{CosSim}(E_{CLIP}(I_{dream}), E_{CLIP}(I_{GT}))\)
    • Stage 2 (Predict Precise Trajectory): Optimizes the model to predict precise trajectories conditioned on the imagined images. The reward function is based on L2 distance: \(r_{L2} = (\lambda - ADE) / \alpha\), where ADE denotes the average displacement error and \(\lambda\), \(\alpha\) are scaling constants.
    • Design Motivation: Standard SFT applies equal-weight supervision across all tokens, biasing the model toward fluent text generation rather than maintaining multimodal balance. Progressive RFT first aligns text→image, then image→trajectory, optimizing each stage in turn (reward sketches appear after this list).
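
A minimal sketch of the unified next-token objective from Key Design 1, assuming the VQ-VAE codebook indices are simply appended after the text vocabulary (vocabulary sizes and names are illustrative, not the paper's values):

```python
import torch
import torch.nn.functional as F

# Sketch: one shared prediction head scores text tokens, the <think>/<dream>/<answer>
# delimiters, and VQ-VAE visual tokens, all trained with the same objective
# L = -sum_i log P_theta(y_i | y_<i).

TEXT_VOCAB = 151_936        # illustrative base-LLM vocabulary size
VISUAL_CODEBOOK = 16_384    # illustrative VQ-VAE codebook size
UNIFIED_VOCAB = TEXT_VOCAB + VISUAL_CODEBOOK

def unified_ar_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, UNIFIED_VOCAB); targets: (B, T) ids in the unified vocabulary.

    Visual tokens are text-vocab-shifted codebook indices, so the standard shifted
    next-token cross-entropy covers both modalities uniformly.
    """
    shifted_logits = logits[:, :-1, :].reshape(-1, UNIFIED_VOCAB)
    shifted_targets = targets[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, shifted_targets, ignore_index=-100)
```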
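
The feedback loop in Key Design 2 can be summarized as follows. The filter implementations, model calls, and retry budget are placeholders; the paper's prompts and thresholds are not reproduced here.

```python
# Sketch of the feedback-guided auto-annotation loop: generate a text CoT from video
# context, run three filters, and re-annotate failures with specific error feedback.

def annotate_sample(video_frames, gt_trajectory, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        cot = generate_cot(video_frames, feedback)        # e.g. Qwen2.5-VL-72B on multi-frame context

        if not format_filter(cot):                        # rule-based structural integrity check
            feedback = "format error: malformed or missing reasoning fields"
            continue
        gt_decision = decision_from_trajectory(gt_trajectory)  # GT decision derived from the GT trajectory
        if extract_decision(cot) != gt_decision:          # decision filter
            feedback = f"decision deviates from ground truth ({gt_decision})"
            continue
        if not logic_filter(cot):                         # stronger text LLM (e.g. Qwen3-235B) judges reasoning quality
            feedback = "logical inconsistency in the reasoning chain"
            continue
        return cot                                        # passed all three filters
    return None                                           # handling of exhausted retries is an assumption
```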
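
And the two stage-wise rewards from Key Design 3, written out directly. Here λ and α are treated as generic scaling constants with placeholder values, and `clip_encode` stands in for an off-the-shelf CLIP image encoder.

```python
import torch
import torch.nn.functional as F

# Illustrative implementations of the stage-wise GRPO rewards (not the authors' code).

def image_reward(dream_image, gt_future_image, clip_encode) -> torch.Tensor:
    """Stage 1: r_Img = CosSim(E_CLIP(I_dream), E_CLIP(I_GT))."""
    e_dream = clip_encode(dream_image)    # (D,) embedding of the imagined future frame
    e_gt = clip_encode(gt_future_image)   # (D,) embedding of the ground-truth future frame
    return F.cosine_similarity(e_dream, e_gt, dim=-1)

def trajectory_reward(pred_traj: torch.Tensor, gt_traj: torch.Tensor,
                      lam: float = 2.0, alpha: float = 2.0) -> torch.Tensor:
    """Stage 2: r_L2 = (lambda - ADE) / alpha.

    pred_traj, gt_traj: (T, 2) waypoints in the ego frame; ADE is the mean L2
    displacement between predicted and ground-truth waypoints. lam and alpha are
    placeholders here, not the paper's settings.
    """
    ade = torch.linalg.norm(pred_traj - gt_traj, dim=-1).mean()
    return (lam - ade) / alpha
```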

Loss & Training

  • SFT stage: learning rate 1e-4, batch size 32, 12 epochs on nuScenes / 6 epochs on Bench2Drive.
  • RFT stage: learning rate 3e-6, batch size 16, Stage 1: 700 steps + Stage 2: 500 steps (nuScenes).
  • Base model: Qwen2.5-VL-3B + MoVQGAN detokenizer.
  • Trained on 16 Nvidia H20 GPUs.
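
For quick reference, the reported training setup collected into one config sketch (the dictionary structure and key names are mine; the values are those listed above):

```python
# Training hyperparameters as reported above, gathered into a single illustrative config.
TRAIN_CONFIG = {
    "base_model": "Qwen2.5-VL-3B",                 # plus a MoVQGAN detokenizer for visual tokens
    "hardware": "16 x NVIDIA H20 GPUs",
    "sft": {
        "lr": 1e-4,
        "batch_size": 32,
        "epochs": {"nuscenes": 12, "bench2drive": 6},
    },
    "rft": {
        "lr": 3e-6,
        "batch_size": 16,
        "nuscenes_steps": {"stage1_image": 700, "stage2_trajectory": 500},
    },
}
```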

Key Experimental Results

Main Results (nuScenes Open-Loop, with Ego Status)

| Method | L2 Avg (m, ST-P3) ↓ | CR Avg (%, ST-P3) ↓ | L2 Avg (m, UniAD) ↓ | CR Avg (%, UniAD) ↓ |
| --- | --- | --- | --- | --- |
| VAD (ICCV23) | 0.37 | 0.33 | - | - |
| BEV-Planner (CVPR24) | 0.35 | 0.34 | - | - |
| FSDrive (NeurIPS25) | 0.35 | 0.14 | 0.67 | 0.32 |
| AutoVLA (NeurIPS25) | 0.48 | 0.13 | 0.86 | 0.35 |
| MindDriver | 0.33 | 0.12 | 0.65 | 0.20 |

Bench2Drive Closed-Loop

| Method | DS ↑ | SR (%) ↑ | Effi ↑ | Comf ↑ |
| --- | --- | --- | --- | --- |
| UniAD-Base (CVPR23) | 45.81 | 16.36 | 129.21 | 43.58 |
| ReasonPlan (CoRL25) | 64.01 | 34.55 | 180.64 | 25.63 |
| AutoVLA (NeurIPS25) ‡ | 78.84 | 57.73 | 146.93 | 39.33 |
| MindDriver | 65.48 | 39.55 | 143.21 | 34.63 |

DS: Driving Score; SR: Success Rate; Effi: Efficiency; Comf: Comfort. ‡ Not trained on the Bench2Drive training set.

Future Frame Generation

| Method | FID ↓ |
| --- | --- |
| Drive-WM (CVPR24) | 15.8 |
| GEM (CVPR25) | 10.5 |
| FSDrive (NeurIPS25) | 10.1 |
| MindDriver | 9.4 |

Key Findings

  • Strong open-loop performance: Under the UniAD evaluation protocol, MindDriver achieves a collision rate of only 0.20%, substantially lower than FSDrive (0.32%) and AutoVLA (0.35%), demonstrating that progressive reasoning genuinely improves trajectory safety.
  • Competitive but not top closed-loop performance: DS of 65.48 vs. AutoVLA's 78.84; note that AutoVLA is not trained on the Bench2Drive training set (marked with ‡), making the comparison conditions unequal.
  • Best future image generation quality: FID of 9.4 vs. FSDrive's 10.1, indicating that text-guided generation indeed improves the quality of future scene imagination.
  • Larger gains without ego status: Without vehicle state inputs, MindDriver achieves L2 of 0.53 vs. FSDrive's 0.55, suggesting that aligned progressive reasoning provides greater advantages under limited information.

Highlights & Insights

  • Cognitive-inspired "perception–imagination–action" design: Formalizes the human driving mental model into a trainable multimodal reasoning chain; the progressive text→image→trajectory pathway is more natural than a direct leap.
  • Images as a bridge between semantic and physical spaces: Images inherently encode both semantic information (scene understanding) and physical information (spatial layout), making them an ideal intermediate representation in the CoT chain.
  • Stage-wise reward design in progressive RFT: Stage 1 uses CLIP semantic rewards to optimize imagination alignment; Stage 2 uses L2 geometric rewards to optimize trajectory prediction—more targeted than end-to-end SFT.
  • Video-context CoT rather than single-frame: Multi-frame input captures object motion trends, enabling more accurate reasoning than static single-frame analysis.

Limitations & Future Work

  • Closed-loop gap with AutoVLA: DS of 65.48 vs. 78.84; the additional inference latency introduced by progressive reasoning may impair real-time decision-making.
  • Increased inference cost from image generation: Generating future scene images requires additional computation, affecting real-time applicability.
  • Dependence on generation quality: Inaccurate imagined images can mislead trajectory prediction (error cascading).
  • Only a 3B model is explored: Whether larger VLMs could further improve progressive reasoning remains uninvestigated.
  • Future directions: Lightweight image generation (e.g., generating semantic maps for key regions only rather than full images); extension to multi-step imagination.

Comparison with Related Work

  • vs. FSDrive (NeurIPS25): FSDrive replaces text CoT with images but lacks textual guidance. MindDriver first performs textual reasoning to guide image generation, reducing FID from 10.1 to 9.4.
  • vs. AutoVLA (NeurIPS25): AutoVLA employs adaptive reasoning length and video CoT, achieving stronger closed-loop performance (78.84 vs. 65.48), but exhibits a higher open-loop collision rate (0.35 vs. 0.20).
  • vs. EMMA (Waymo): EMMA uses hierarchical text CoT but still suffers from semantic–physical space misalignment. MindDriver addresses this fundamental issue by introducing an image bridge.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Progressive multimodal reasoning represents a significant paradigm innovation in the CoT direction for autonomous driving.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers open-loop, closed-loop, future frame generation, and ablation studies, though closed-loop comparisons are not entirely under fair conditions.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, the cognitive analogy is intuitive, and the pipeline diagrams are detailed.
  • Value: ⭐⭐⭐⭐⭐ Establishes a new paradigm for VLM-driven autonomous driving; the data annotation pipeline has strong potential for reuse.