FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning¶
Conference: AAAI 2026 · arXiv: 2507.23318 · Code: Not released · Area: Multimodal VLM / Autonomous Driving · Keywords: VLA model acceleration, visual token pruning, foreground reconstruction, autonomous driving, plug-and-play
TL;DR¶
FastDriveVLA accelerates end-to-end driving VLA models by training a lightweight, plug-and-play ReconPruner module (only 0.07B parameters) via MAE-style foreground pixel reconstruction. An adversarial foreground-background reconstruction strategy forces the pruner to retain the foreground tokens critical for driving decisions. The method achieves state-of-the-art performance across all pruning ratios on the nuScenes open-loop planning benchmark, and transfers to different VLA models sharing the same visual encoder without retraining.
Background & Motivation¶
Vision-Language-Action (VLA) models have demonstrated strong scene understanding and action reasoning capabilities in end-to-end autonomous driving. However, the large number of tokens produced by the visual encoder (e.g., 3,249) imposes substantial computational overhead and inference latency, severely hindering real-vehicle deployment. Existing VLM token pruning methods suffer from two fundamental limitations:
- Attention-based methods (FastV, SparseVLM): These rely on text-visual attention scores to assess token importance, but in driving scenarios, instructions are fixed and concise, providing insufficient guidance for effective token selection.
- Similarity-based methods (VisPruner, DivPrune): These select token subsets based on diversity, but in driving scenarios, foreground regions (lanes, pedestrians, vehicles) are critical for decision-making, and similarity-based selection may erroneously retain large numbers of irrelevant background tokens.
Core insight: Human drivers focus on relevant foreground regions—retaining visual tokens that encode foreground information is key to effective decision-making, while background tokens can be safely discarded.
Method¶
Overall Architecture¶
A lightweight ReconPruner module (0.07B parameters) is trained and inserted after the visual encoder of the VLA model. The module estimates a foreground saliency score for each token via MAE-style reconstruction, and retains the top-K highest-scoring tokens while discarding the rest. Once trained, ReconPruner can be applied in a plug-and-play manner to different VLA models sharing the same visual encoder without retraining.
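To make the pruning step concrete, here is a minimal PyTorch sketch of the inference-time top-K selection. The function name, the `keep_ratio` parameter, and the choice to restore spatial order before feeding the LLM are illustrative assumptions, not details from the paper:

```python
import torch

def prune_tokens(visual_tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float):
    """Keep the top-K highest-scoring visual tokens (order-preserving).

    visual_tokens: (B, N, D) output of the frozen visual encoder
    scores:        (B, N)    foreground saliency scores from ReconPruner
    keep_ratio:    fraction of tokens to retain, e.g. 0.75 for a 25% prune
    """
    B, N, D = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    topk = scores.topk(k, dim=1).indices        # (B, k) indices of retained tokens
    topk = topk.sort(dim=1).values              # restore spatial order (assumption)
    idx = topk.unsqueeze(-1).expand(-1, -1, D)  # (B, k, D)
    return visual_tokens.gather(1, idx)
```

At a 25% pruning ratio this keeps K = 2,436 of the 3,249 tokens, matching the main-results table below.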
Key Designs¶
- ReconPruner Architecture: Consists of a PrunerLayer (a single decoder layer from Qwen2.5-VL-3B) and a Scorer (a linear layer \(\mathbb{R}^{D \times 1}\)). A learnable query token \(Q \in \mathbb{R}^{1 \times D}\) is introduced and jointly fed into the PrunerLayer along with the visual tokens. After Hadamard product fusion, the Scorer outputs saliency scores \(S \in \mathbb{R}^{N \times 1}\).
- Adversarial Foreground-Background Reconstruction Strategy: Relying solely on foreground reconstruction leads to a degenerate solution in which ReconPruner assigns high scores to all tokens to maximize reconstruction quality. Inspired by GANs, an additional constraint requires low-scoring tokens to reconstruct background regions, forming an adversarial objective: foreground tokens must reconstruct the foreground well, and background tokens must reconstruct the background well, compelling the model to distinguish the two. The Straight-Through Estimator (STE) handles the non-differentiability of the binary token masks (see the sketch after this list).
- nuScenes-FG Dataset: Driving scene foreground is defined as humans, roads, vehicles, traffic signs, and traffic barriers. Grounded-SAM is used to perform segmentation annotation on nuScenes, constructing 241K image-mask pairs across six camera viewpoints.
- Plug-and-Play Generalization: A ReconPruner trained once for a specific visual encoder (e.g., CLIP-ViT) can be transferred to different VLA models such as Impromptu-VLA that share the same encoder.
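For the scoring and masking mechanics, a minimal PyTorch sketch follows. It uses a generic `nn.TransformerDecoderLayer` as a stand-in for the Qwen2.5-VL-3B decoder layer, and the sigmoid normalization, 0.5 threshold, and hidden size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ReconPrunerSketch(nn.Module):
    """Sketch of ReconPruner: learnable query + decoder layer + linear scorer."""
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))   # learnable query Q (1 x D)
        # Stand-in for a single Qwen2.5-VL-3B decoder layer (assumption)
        self.pruner_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        self.scorer = nn.Linear(dim, 1)                     # Scorer: R^{D x 1}

    def forward(self, vis: torch.Tensor) -> torch.Tensor:
        B, N, D = vis.shape
        q = self.query.expand(B, -1, -1)                    # (B, 1, D)
        # Jointly process the query token and visual tokens in the PrunerLayer
        h = self.pruner_layer(torch.cat([q, vis], dim=1), vis)
        q_out, v_out = h[:, :1], h[:, 1:]
        fused = v_out * q_out                               # Hadamard-product fusion
        return self.scorer(fused).squeeze(-1)               # saliency scores S (B, N)

def ste_mask(scores: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Binary foreground mask with a straight-through gradient estimator."""
    p = scores.sigmoid()
    hard = (p > thresh).float()
    return hard + p - p.detach()  # forward: hard 0/1; backward: gradient of p
```

During training, the hard mask and its complement gate which tokens are handed to the reconstruction decoder for the foreground and background targets, respectively.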
Loss & Training¶
Both foreground and background losses are weighted combinations of MSE and SSIM (\(\lambda=0.2\)). The reconstruction decoder consists of 6 Qwen2.5-VL-3B decoder layers followed by a feed-forward reconstruction head.
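A sketch of one plausible instantiation of the per-branch reconstruction loss, assuming the form L = MSE + λ·(1 − SSIM) with λ = 0.2 and, for brevity, a single-window global SSIM rather than the usual sliding Gaussian window; the paper describes the combination only as a weighted MSE/SSIM mix:

```python
import torch
import torch.nn.functional as F

def global_ssim(x: torch.Tensor, y: torch.Tensor,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Single-window SSIM over whole images in [0, 1] (simplified)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def recon_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.2) -> torch.Tensor:
    """MSE plus SSIM term; applied separately to the foreground and
    background reconstructions against their masked pixel targets."""
    return F.mse_loss(pred, target) + lam * (1.0 - global_ssim(pred, target))
```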
Key Experimental Results¶
Main Results: nuScenes Open-Loop Planning (based on Impromptu-VLA)¶
| Method | Retained Tokens | L2 Avg (cm)↓ | Collision Avg (%)↓ | Intersection Avg (%)↓ | Relative Performance |
|---|---|---|---|---|---|
| Original (no pruning) | 3249 | 31.83 | 0.24 | 2.80 | 100% |
| FastV (↓25%) | 2436 | 32.29 | 0.31 | 2.87 | 98.6% |
| SparseVLM (↓25%) | 2436 | 32.18 | 0.28 | 2.81 | 98.9% |
| DivPrune (↓25%) | 2436 | 32.24 | 0.30 | 2.86 | 98.7% |
| FastDriveVLA (↓25%) | 2436 | 31.80 | 0.26 | 2.77 | 100.1% |
| FastV (↓50%) | 1624 | 32.59 | 0.33 | 2.99 | 97.7% |
| VisPruner (↓50%) | 1624 | 32.25 | 0.27 | 2.95 | 98.7% |
| FastDriveVLA (↓50%) | 1624 | 32.10 | 0.25 | 2.94 | 99.1% |
Ablation Study: Contribution of Key Designs¶
| Configuration | Collision Avg (trend) | Note |
|---|---|---|
| Foreground reconstruction only | High | Degenerate solution: all tokens receive high scores |
| + Adversarial background reconstruction | Significant reduction | Effectively distinguishes foreground/background |
| + nuScenes-FG foreground masks | Best | High-quality annotations further improve performance |
| Plug-and-play transfer | On par with original training target | Validates cross-VLA transferability |
Key Findings¶
- Near-lossless at 25% pruning: After pruning 25% of tokens, FastDriveVLA achieves a lower L2 error than the unpruned model (31.80 vs. 31.83), with the collision rate increasing only marginally from 0.24% to 0.26%.
- Significant advantage at 50% pruning: Compared to collision rates of 0.27–0.33% for other methods at 50% pruning, FastDriveVLA achieves only 0.25%, with near-zero performance degradation.
- Foreground-aware > generic pruning: All generic VLM pruning methods perform worse than the foreground-aware strategy in driving scenarios.
Highlights & Insights¶
- Driving-scenario-specific design: The approach is grounded in human driving intuition (attend to foreground, ignore background), injecting domain knowledge into the token pruning strategy in a clear and effective manner.
- MAE reconstruction as a foreground detection proxy: This elegantly avoids the need for an additional detection model—foreground tokens can reconstruct meaningful pixels, while background tokens yield flat reconstruction outputs; the difference in reconstruction quality serves as the distinguishing signal.
- Adversarial training resolves the degenerate solution: A clever application of GAN principles—not for generative adversarial learning, but for adversarial foreground/background reconstruction.
- Extremely lightweight design: ReconPruner contains only 0.07B parameters (PrunerLayer + Scorer), introducing negligible inference overhead.
Limitations & Future Work¶
- The foreground definition is static (predefined categories), without accounting for dynamic importance variations (e.g., a pedestrian suddenly appearing should receive higher weight).
- Validation is limited to nuScenes; other driving datasets such as Waymo and KITTI are not evaluated.
- nuScenes-FG relies on Grounded-SAM for automatic annotation, which may introduce labeling noise.
- Concrete inference speedup ratios and FPS improvements are not analyzed.
- The reconstruction decoder (6 layers of Qwen2.5-VL-3B) introduces additional overhead during training, though it is not required at inference time.
Related Work & Insights¶
| Method Category | Representative Methods | Pruning Criterion | Performance in Driving Scenarios |
|---|---|---|---|
| Attention-based | FastV, SparseVLM | Text-visual attention scores | Poor: driving instructions are fixed and concise |
| Similarity-based | VisPruner, DivPrune | Token diversity | Poor: irrelevant background tokens are retained |
| Projector compression | TokenPacker, Matryoshka | Learned token compression in the projector | High cost: requires full model retraining, not plug-and-play |
| Foreground reconstruction (Ours) | FastDriveVLA | Foreground saliency scores | Best: retains decision-critical tokens |
Rating¶
- Novelty: ⭐⭐⭐⭐ Foreground reconstruction-based pruning is novel; the adversarial strategy is elegant
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-ratio comparisons with complete ablation studies
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly grounded in human driving intuition
- Value: ⭐⭐⭐⭐ Directly practical for real-vehicle deployment of VLA models