
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Conference: AAAI 2026 · arXiv: 2507.23318 · Code: Not released · Area: Multimodal VLM / Autonomous Driving · Keywords: VLA model acceleration, visual token pruning, foreground reconstruction, autonomous driving, plug-and-play

TL;DR

FastDriveVLA trains a lightweight, plug-and-play ReconPruner module (only 0.07B parameters) via MAE-style foreground pixel reconstruction. An adversarial foreground-background reconstruction strategy drives the pruner to prioritize the foreground tokens critical for driving decisions. The method achieves state-of-the-art performance across all pruning ratios on the nuScenes open-loop planning benchmark, and a trained ReconPruner can be transferred to different VLA models sharing the same visual encoder without retraining.

Background & Motivation

Vision-Language-Action (VLA) models have demonstrated strong scene understanding and action reasoning capabilities in end-to-end autonomous driving. However, the large number of tokens produced by visual encoders (e.g., 3,249 tokens) imposes substantial computational overhead and inference latency, severely hindering real-vehicle deployment. Existing VLM token pruning methods suffer from two fundamental limitations:

Attention-based methods (FastV, SparseVLM): These rely on text-visual attention scores to assess token importance, but in driving scenarios, instructions are fixed and concise, providing insufficient guidance for effective token selection.

Similarity-based methods (VisPruner, DivPrune): These select token subsets based on diversity, but in driving scenarios, foreground regions (lanes, pedestrians, vehicles) are critical for decision-making, and similarity-based selection may erroneously retain large numbers of irrelevant background tokens.

Core insight: Human drivers focus on relevant foreground regions—retaining visual tokens that encode foreground information is key to effective decision-making, while background tokens can be safely discarded.

Method

Overall Architecture

A lightweight ReconPruner module (0.07B parameters) is trained and inserted after the visual encoder of the VLA model. The module estimates a foreground saliency score for each token via MAE-style reconstruction, and retains the top-K highest-scoring tokens while discarding the rest. Once trained, ReconPruner can be applied in a plug-and-play manner to different VLA models sharing the same visual encoder without retraining.
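Conceptually, the pruner slots in between the visual encoder and the LLM backbone and simply keeps the top-K tokens by saliency. Below is a minimal PyTorch sketch of that inference-time step; the `recon_pruner` callable, tensor shapes, and `keep_ratio` argument are illustrative assumptions, not taken from the (unreleased) official code.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        recon_pruner,
                        keep_ratio: float = 0.75) -> torch.Tensor:
    """visual_tokens: (B, N, D) output of the frozen visual encoder."""
    scores = recon_pruner(visual_tokens)            # (B, N) foreground saliency scores
    k = int(visual_tokens.shape[1] * keep_ratio)    # e.g. 0.75 * 3249 ≈ 2436 retained tokens
    top_idx = scores.topk(k, dim=1).indices         # indices of the k most salient tokens
    top_idx, _ = top_idx.sort(dim=1)                # restore the original spatial order
    idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return visual_tokens.gather(1, idx)             # (B, k, D) tokens passed on to the LLM
```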

Key Designs

  1. ReconPruner Architecture: Consists of a PrunerLayer (a single decoder layer from Qwen2.5-VL-3B) and a Scorer (a linear layer \(\mathbb{R}^{D \times 1}\)). A learnable query token \(Q \in \mathbb{R}^{1 \times D}\) is fed into the PrunerLayer jointly with the visual tokens; after Hadamard-product fusion with the query output, the Scorer produces saliency scores \(S \in \mathbb{R}^{N \times 1}\) (see the sketch after this list).
  2. Adversarial Foreground-Background Reconstruction Strategy: Relying solely on foreground reconstruction leads to a degenerate solution in which ReconPruner assigns high scores to all tokens to maximize reconstruction quality. Inspired by GANs, an additional constraint requires low-scoring tokens to reconstruct background regions, forming an adversarial objective: high-scoring tokens must reconstruct the foreground well while low-scoring tokens must reconstruct the background well, compelling the model to distinguish between the two. A Straight-Through Estimator (STE) handles the non-differentiability of the binary token masks.
  3. nuScenes-FG Dataset: Driving scene foreground is defined as humans, roads, vehicles, traffic signs, and traffic barriers. Grounded-SAM is used to perform segmentation annotation on nuScenes, constructing 241K image-mask pairs across six camera viewpoints.
  4. Plug-and-Play Generalization: A ReconPruner trained once for a specific visual encoder (e.g., CLIP-ViT) can be transferred to different VLA models such as Impromptu-VLA that share the same encoder.
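
A minimal PyTorch sketch of the ReconPruner described in item 1, using a generic transformer layer as a stand-in for the Qwen2.5-VL-3B decoder layer; the class name, dimensions, and exact fusion details are illustrative, since the official code has not been released.

```python
import torch
import torch.nn as nn

class ReconPruner(nn.Module):
    """Sketch of the ReconPruner: one transformer layer plus a linear scorer."""

    def __init__(self, dim: int = 2048, n_heads: int = 16):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))       # learnable query token Q
        self.pruner_layer = nn.TransformerEncoderLayer(          # placeholder for the Qwen2.5-VL-3B decoder layer
            d_model=dim, nhead=n_heads, batch_first=True)
        self.scorer = nn.Linear(dim, 1)                           # Scorer: R^{D x 1}

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D); prepend the learnable query and process jointly
        b = visual_tokens.shape[0]
        q = self.query.expand(b, -1, -1)
        h = self.pruner_layer(torch.cat([q, visual_tokens], dim=1))
        q_out, tok_out = h[:, :1], h[:, 1:]
        fused = tok_out * q_out                                   # Hadamard-product fusion with the query output
        return self.scorer(fused).squeeze(-1)                     # (B, N) saliency scores
```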

Loss & Training

\[\mathcal{L}_{all} = \alpha \mathcal{L}_{fore} + (1-\alpha) \mathcal{L}_{back}, \quad \alpha=0.5\]

Both foreground and background losses are weighted combinations of MSE and SSIM (\(\lambda=0.2\)). The reconstruction decoder consists of 6 Qwen2.5-VL-3B decoder layers followed by a feed-forward reconstruction head.
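Below is a hedged sketch of how the adversarial objective and the STE-based masking could be wired together in PyTorch. The `recon_decoder` interface, the sigmoid surrogate used in the straight-through estimator, and the omission of the SSIM term (only the MSE part is written out, to keep the sketch self-contained) are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_recon_loss(scores, tokens, image, fg_mask,
                           recon_decoder, keep_ratio=0.5, alpha=0.5):
    """Sketch of the adversarial foreground/background reconstruction objective.

    scores:  (B, N) saliency scores from ReconPruner
    tokens:  (B, N, D) visual tokens
    image:   (B, 3, H, W) input image (reconstruction target)
    fg_mask: (B, 1, H, W) binary foreground mask (e.g. from nuScenes-FG)
    recon_decoder: callable mapping masked tokens back to (B, 3, H, W) pixels
    """
    # Hard top-k keep mask with a straight-through estimator so gradients
    # still reach the scores despite the binary thresholding.
    k = int(scores.shape[1] * keep_ratio)
    kth = scores.topk(k, dim=1).values[:, -1:]
    hard = (scores >= kth).float()
    soft = scores.sigmoid()
    mask = hard + soft - soft.detach()            # forward value = hard, gradient flows via soft

    # High-scoring tokens reconstruct the foreground, low-scoring tokens the background.
    fg_pred = recon_decoder(tokens * mask.unsqueeze(-1))
    bg_pred = recon_decoder(tokens * (1.0 - mask).unsqueeze(-1))

    # The paper combines MSE with an SSIM term (lambda = 0.2); only MSE is shown here.
    loss_fore = F.mse_loss(fg_pred * fg_mask, image * fg_mask)
    loss_back = F.mse_loss(bg_pred * (1.0 - fg_mask), image * (1.0 - fg_mask))
    return alpha * loss_fore + (1.0 - alpha) * loss_back
```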

Key Experimental Results

Main Results: nuScenes Open-Loop Planning (based on Impromptu-VLA)

| Method | Retained Tokens | L2 Avg (cm)↓ | Collision Avg (%)↓ | Intersection Avg (%)↓ | Relative Performance |
|---|---|---|---|---|---|
| Original (100% = 3249) | 3249 | 31.83 | 0.24 | 2.80 | 100% |
| FastV (↓25%) | 2436 | 32.29 | 0.31 | 2.87 | 98.6% |
| SparseVLM (↓25%) | 2436 | 32.18 | 0.28 | 2.81 | 98.9% |
| DivPrune (↓25%) | 2436 | 32.24 | 0.30 | 2.86 | 98.7% |
| FastDriveVLA (↓25%) | 2436 | 31.80 | 0.26 | 2.77 | 100.1% |
| FastV (↓50%) | 1624 | 32.59 | 0.33 | 2.99 | 97.7% |
| VisPruner (↓50%) | 1624 | 32.25 | 0.27 | 2.95 | 98.7% |
| FastDriveVLA (↓50%) | 1624 | 32.10 | 0.25 | 2.94 | 99.1% |

Ablation Study: Contribution of Key Designs

| Configuration | Collision Avg (%) | Note |
|---|---|---|
| Foreground reconstruction only | High | Degenerate solution: all tokens receive high scores |
| + Adversarial background reconstruction | Significant reduction | Effectively distinguishes foreground/background |
| + nuScenes-FG foreground masks | Best | High-quality annotations further improve performance |
| Plug-and-play transfer | On par with original training target | Validates cross-VLA transferability |

Key Findings

  • Near-lossless at 25% pruning: After pruning 25% of tokens, FastDriveVLA achieves a lower L2 error than the unpruned model (31.80 vs. 31.83), with the collision rate increasing only marginally from 0.24% to 0.26%.
  • Significant advantage at 50% pruning: Compared to collision rates of 0.27–0.33% for other methods at 50% pruning, FastDriveVLA achieves only 0.25%, with near-zero performance degradation.
  • Foreground-aware > generic pruning: All generic VLM pruning methods perform worse than the foreground-aware strategy in driving scenarios.

Highlights & Insights

  • Driving-scenario-specific design: The approach is grounded in human driving intuition (attend to foreground, ignore background), injecting domain knowledge into the token pruning strategy in a clear and effective manner.
  • MAE reconstruction as a foreground detection proxy: This elegantly avoids the need for an additional detection model—foreground tokens can reconstruct meaningful pixels, while background tokens yield flat reconstruction outputs; the difference in reconstruction quality serves as the distinguishing signal.
  • Adversarial training resolves the degenerate solution: A clever application of GAN principles—not for generative adversarial learning, but for adversarial foreground/background reconstruction.
  • Extremely lightweight design: ReconPruner contains only 0.07B parameters (PrunerLayer + Scorer), introducing negligible inference overhead.

Limitations & Future Work

  • The foreground definition is static (predefined categories), without accounting for dynamic importance variations (e.g., a pedestrian suddenly appearing should receive higher weight).
  • Validation is limited to nuScenes; other driving datasets such as Waymo and KITTI are not evaluated.
  • nuScenes-FG relies on Grounded-SAM for automatic annotation, which may introduce labeling noise.
  • Concrete inference speedup ratios and FPS improvements are not analyzed.
  • The reconstruction decoder (6 layers of Qwen2.5-VL-3B) introduces additional overhead during training, though it is not required at inference time.

Method Category Comparison

| Method Category | Representative Methods | Pruning Criterion | Performance in Driving Scenarios |
|---|---|---|---|
| Attention-based | FastV, SparseVLM | Text-visual attention scores | Poor: driving instructions are fixed and concise |
| Similarity-based | VisPruner, DivPrune | Token diversity | Poor: irrelevant background tokens are retained |
| Projector compression | TokenPacker, Matryoshka | Full model retraining | High cost, not plug-and-play |
| Foreground reconstruction (Ours) | FastDriveVLA | Foreground saliency scores | Best: retains decision-critical tokens |

Rating

  • Novelty: ⭐⭐⭐⭐ Foreground reconstruction-based pruning is novel; the adversarial strategy is elegant
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-ratio comparisons with complete ablation studies
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly grounded in human driving intuition
  • Value: ⭐⭐⭐⭐ Directly practical for real-vehicle deployment of VLA models