
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Conference: AAAI 2026 · arXiv: 2507.23318 · Code: Not released · Area: Multimodal VLM / Autonomous Driving · Keywords: VLA model acceleration, visual token pruning, foreground reconstruction, autonomous driving, plug-and-play

TL;DR

FastDriveVLA trains a lightweight, plug-and-play ReconPruner module (only 0.07B parameters) via MAE-style foreground pixel reconstruction. An adversarial foreground-background reconstruction strategy drives the pruner to prioritize the foreground tokens critical for driving decisions. The method achieves state-of-the-art performance across all pruning ratios on the nuScenes open-loop planning benchmark, and a trained ReconPruner can be transferred to different VLA models sharing the same visual encoder without retraining.

Background & Motivation

Vision-Language-Action (VLA) models have demonstrated strong scene understanding and action reasoning capabilities in end-to-end autonomous driving. However, the large number of tokens produced by visual encoders (e.g., 3,249 tokens) imposes substantial computational overhead and inference latency, severely hindering real-vehicle deployment. Existing VLM token pruning methods suffer from two fundamental limitations:

Attention-based methods (FastV, SparseVLM): These rely on text-visual attention scores to assess token importance, but in driving scenarios, instructions are fixed and concise, providing insufficient guidance for effective token selection.

Similarity-based methods (VisPruner, DivPrune): These select token subsets based on diversity, but in driving scenarios, foreground regions (lanes, pedestrians, vehicles) are critical for decision-making, and similarity-based selection may erroneously retain large numbers of irrelevant background tokens.

Core insight: Human drivers focus on relevant foreground regions—retaining visual tokens that encode foreground information is key to effective decision-making, while background tokens can be safely discarded.

Method

Overall Architecture

A lightweight ReconPruner module (0.07B parameters) is trained and inserted after the visual encoder of the VLA model. The module estimates a foreground saliency score for each token via MAE-style reconstruction, and retains the top-K highest-scoring tokens while discarding the rest. Once trained, ReconPruner can be applied in a plug-and-play manner to different VLA models sharing the same visual encoder without retraining.
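Conceptually, the pruner slots in between the visual encoder and the LLM backbone and simply keeps the top-K tokens by saliency. Below is a minimal PyTorch sketch of that inference-time step; the `recon_pruner` callable, tensor shapes, and `keep_ratio` argument are illustrative assumptions, not taken from the (unreleased) official code.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        recon_pruner,
                        keep_ratio: float = 0.75) -> torch.Tensor:
    """visual_tokens: (B, N, D) output of the frozen visual encoder."""
    scores = recon_pruner(visual_tokens)            # (B, N) foreground saliency scores
    k = int(visual_tokens.shape[1] * keep_ratio)    # e.g. 0.75 * 3249 ≈ 2436 retained tokens
    top_idx = scores.topk(k, dim=1).indices         # indices of the k most salient tokens
    top_idx, _ = top_idx.sort(dim=1)                # restore the original spatial order
    idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return visual_tokens.gather(1, idx)             # (B, k, D) tokens passed on to the LLM
```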

Key Designs

  1. ReconPruner Architecture: Consists of a PrunerLayer (a single decoder layer from Qwen2.5-VL-3B) and a Scorer (a linear layer \(\mathbb{R}^{D \times 1}\)). A learnable query token \(Q \in \mathbb{R}^{1 \times D}\) is fed into the PrunerLayer jointly with the visual tokens; after Hadamard-product fusion with the query output, the Scorer produces saliency scores \(S \in \mathbb{R}^{N \times 1}\) (see the sketch after this list).
  2. Adversarial Foreground-Background Reconstruction Strategy: Relying solely on foreground reconstruction leads to a degenerate solution in which ReconPruner assigns high scores to all tokens to maximize reconstruction quality. Inspired by GANs, an additional constraint requires low-scoring tokens to reconstruct background regions, forming an adversarial objective: high-scoring tokens must reconstruct the foreground well while low-scoring tokens must reconstruct the background well, compelling the model to distinguish between the two. A Straight-Through Estimator (STE) handles the non-differentiability of the binary token masks.
  3. nuScenes-FG Dataset: Driving scene foreground is defined as humans, roads, vehicles, traffic signs, and traffic barriers. Grounded-SAM is used to perform segmentation annotation on nuScenes, constructing 241K image-mask pairs across six camera viewpoints.
  4. Plug-and-Play Generalization: A ReconPruner trained once for a specific visual encoder (e.g., CLIP-ViT) can be transferred to different VLA models such as Impromptu-VLA that share the same encoder.
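
A minimal PyTorch sketch of the ReconPruner described in item 1, using a generic transformer layer as a stand-in for the Qwen2.5-VL-3B decoder layer; the class name, dimensions, and exact fusion details are illustrative, since the official code has not been released.

```python
import torch
import torch.nn as nn

class ReconPruner(nn.Module):
    """Sketch of the ReconPruner: one transformer layer plus a linear scorer."""

    def __init__(self, dim: int = 2048, n_heads: int = 16):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))       # learnable query token Q
        self.pruner_layer = nn.TransformerEncoderLayer(          # placeholder for the Qwen2.5-VL-3B decoder layer
            d_model=dim, nhead=n_heads, batch_first=True)
        self.scorer = nn.Linear(dim, 1)                           # Scorer: R^{D x 1}

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D); prepend the learnable query and process jointly
        b = visual_tokens.shape[0]
        q = self.query.expand(b, -1, -1)
        h = self.pruner_layer(torch.cat([q, visual_tokens], dim=1))
        q_out, tok_out = h[:, :1], h[:, 1:]
        fused = tok_out * q_out                                   # Hadamard-product fusion with the query output
        return self.scorer(fused).squeeze(-1)                     # (B, N) saliency scores
```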

Loss & Training

\[\mathcal{L}_{all} = \alpha \mathcal{L}_{fore} + (1-\alpha) \mathcal{L}_{back}, \quad \alpha=0.5\]

Both foreground and background losses are weighted combinations of MSE and SSIM (\(\lambda=0.2\)). The reconstruction decoder consists of 6 Qwen2.5-VL-3B decoder layers followed by a feed-forward reconstruction head.
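Below is a hedged sketch of how the adversarial objective and the STE-based masking could be wired together in PyTorch. The `recon_decoder` interface, the sigmoid surrogate used in the straight-through estimator, and the omission of the SSIM term (only the MSE part is written out, to keep the sketch self-contained) are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_recon_loss(scores, tokens, image, fg_mask,
                           recon_decoder, keep_ratio=0.5, alpha=0.5):
    """Sketch of the adversarial foreground/background reconstruction objective.

    scores:  (B, N) saliency scores from ReconPruner
    tokens:  (B, N, D) visual tokens
    image:   (B, 3, H, W) input image (reconstruction target)
    fg_mask: (B, 1, H, W) binary foreground mask (e.g. from nuScenes-FG)
    recon_decoder: callable mapping masked tokens back to (B, 3, H, W) pixels
    """
    # Hard top-k keep mask with a straight-through estimator so gradients
    # still reach the scores despite the binary thresholding.
    k = int(scores.shape[1] * keep_ratio)
    kth = scores.topk(k, dim=1).values[:, -1:]
    hard = (scores >= kth).float()
    soft = scores.sigmoid()
    mask = hard + soft - soft.detach()            # forward value = hard, gradient flows via soft

    # High-scoring tokens reconstruct the foreground, low-scoring tokens the background.
    fg_pred = recon_decoder(tokens * mask.unsqueeze(-1))
    bg_pred = recon_decoder(tokens * (1.0 - mask).unsqueeze(-1))

    # The paper combines MSE with an SSIM term (lambda = 0.2); only MSE is shown here.
    loss_fore = F.mse_loss(fg_pred * fg_mask, image * fg_mask)
    loss_back = F.mse_loss(bg_pred * (1.0 - fg_mask), image * (1.0 - fg_mask))
    return alpha * loss_fore + (1.0 - alpha) * loss_back
```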

Key Experimental Results

Main Results: nuScenes Open-Loop Planning (based on Impromptu-VLA)

| Method | Retained Tokens | L2 Avg (cm)↓ | Collision Avg (%)↓ | Intersection Avg (%)↓ | Relative Performance |
|---|---|---|---|---|---|
| Original (100% = 3249) | 3249 | 31.83 | 0.24 | 2.80 | 100% |
| FastV (↓25%) | 2436 | 32.29 | 0.31 | 2.87 | 98.6% |
| SparseVLM (↓25%) | 2436 | 32.18 | 0.28 | 2.81 | 98.9% |
| DivPrune (↓25%) | 2436 | 32.24 | 0.30 | 2.86 | 98.7% |
| FastDriveVLA (↓25%) | 2436 | 31.80 | 0.26 | 2.77 | 100.1% |
| FastV (↓50%) | 1624 | 32.59 | 0.33 | 2.99 | 97.7% |
| VisPruner (↓50%) | 1624 | 32.25 | 0.27 | 2.95 | 98.7% |
| FastDriveVLA (↓50%) | 1624 | 32.10 | 0.25 | 2.94 | 99.1% |

Ablation Study: Contribution of Key Designs

| Configuration | Collision Avg (%) | Note |
|---|---|---|
| Foreground reconstruction only | High | Degenerate solution: all tokens receive high scores |
| + Adversarial background reconstruction | Significant reduction | Effectively distinguishes foreground/background |
| + nuScenes-FG foreground masks | Best | High-quality annotations further improve performance |
| Plug-and-play transfer | On par with original training target | Validates cross-VLA transferability |

Key Findings

  • Near-lossless at 25% pruning: After pruning 25% of tokens, FastDriveVLA achieves a lower L2 error than the unpruned model (31.80 vs. 31.83), with the collision rate increasing only marginally from 0.24% to 0.26%.
  • Significant advantage at 50% pruning: Compared to collision rates of 0.27–0.33% for other methods at 50% pruning, FastDriveVLA achieves only 0.25%, with near-zero performance degradation.
  • Foreground-aware > generic pruning: All generic VLM pruning methods perform worse than the foreground-aware strategy in driving scenarios.

Highlights & Insights

  • Driving-scenario-specific design: The approach is grounded in human driving intuition (attend to foreground, ignore background), injecting domain knowledge into the token pruning strategy in a clear and effective manner.
  • MAE reconstruction as a foreground detection proxy: This elegantly avoids the need for an additional detection model—foreground tokens can reconstruct meaningful pixels, while background tokens yield flat reconstruction outputs; the difference in reconstruction quality serves as the distinguishing signal.
  • Adversarial training resolves the degenerate solution: A clever application of GAN principles—not for generative adversarial learning, but for adversarial foreground/background reconstruction.
  • Extremely lightweight design: ReconPruner contains only 0.07B parameters (PrunerLayer + Scorer), introducing negligible inference overhead.

Limitations & Future Work

  • The foreground definition is static (predefined categories), without accounting for dynamic importance variations (e.g., a pedestrian suddenly appearing should receive higher weight).
  • Validation is limited to nuScenes; other driving datasets such as Waymo and KITTI are not evaluated.
  • nuScenes-FG relies on Grounded-SAM for automatic annotation, which may introduce labeling noise.
  • Concrete inference speedup ratios and FPS improvements are not analyzed.
  • The reconstruction decoder (6 layers of Qwen2.5-VL-3B) introduces additional overhead during training, though it is not required at inference time.

Method Category Comparison

| Method Category | Representative Methods | Pruning Criterion | Performance in Driving Scenarios |
|---|---|---|---|
| Attention-based | FastV, SparseVLM | Text-visual attention scores | Poor: driving instructions are fixed and concise |
| Similarity-based | VisPruner, DivPrune | Token diversity | Poor: irrelevant background tokens are retained |
| Projector compression | TokenPacker, Matryoshka | Full model retraining | High cost, not plug-and-play |
| Foreground reconstruction (Ours) | FastDriveVLA | Foreground saliency scores | Best: retains decision-critical tokens |

Rating

  • Novelty: ⭐⭐⭐⭐ Foreground reconstruction-based pruning is novel; the adversarial strategy is elegant
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-ratio comparisons with complete ablation studies
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly grounded in human driving intuition
  • Value: ⭐⭐⭐⭐ Directly practical for real-vehicle deployment of VLA models