# Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
- Conference: CVPR 2026
- arXiv: 2508.13305
- Code: https://github.com/MinhaoXiong/Prune2Drive.git
- Area: Autonomous Driving / VLM Acceleration / Token Pruning
- Keywords: Multi-view VLM, visual token pruning, diversity-aware sampling, view-adaptive pruning, autonomous driving
## TL;DR
Prune2Drive is the first plug-and-play token pruning framework designed for multi-view autonomous driving VLMs. It combines T-FPS (Token-wise Farthest Point Sampling), which preserves semantic and spatial diversity, with view-adaptive pruning rate optimization, which automatically allocates token budgets across camera views. Retaining only 10% of visual tokens on DriveLM, it achieves a 6.40× prefill speedup at the cost of roughly a 3% performance drop.
## Background & Motivation
Autonomous driving VLMs (e.g., DriveMM, DriveLMM-o1) must process high-resolution images from six surround-view cameras (729 tokens per image), yielding over 4,000 visual tokens in total, resulting in prohibitively slow \(O(n^2)\) attention computation. Existing token pruning methods (FastV, SparseVLM) are designed for single-image settings and suffer from three shortcomings: (1) reliance on attention weights, making them incompatible with FlashAttention; (2) positional bias, causing tokens at later positions to be systematically retained; and (3) neglect of differential view contributions — front and rear cameras differ in importance for driving decisions and should not be pruned uniformly.
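As a quick back-of-the-envelope check of the budget (the 72 tokens/image figure at 10% retention comes from the results table below):

\[
6 \times 729 = 4374 \ \text{visual tokens (vanilla)} \quad\rightarrow\quad 6 \times 72 = 432 \ \text{visual tokens at 10\% retention.}
\]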
## Core Problem
How can one design a training-free token pruning method for multi-view autonomous driving that does not rely on attention weights and accounts for the differing contributions of each camera view?
## Method

### Overall Architecture
The framework consists of two core components: (1) T-FPS (Token-wise Farthest Point Sampling), which iteratively selects the most diverse subset of tokens in the token embedding space via farthest point sampling; and (2) view-adaptive pruning rate optimization, which employs a Tree-structured Parzen Estimator (TPE) to automatically search for the optimal token retention rate of each camera view on a small validation set.
### Key Designs
- T-FPS diversity-aware selection: inspired by the FPS algorithm from point-cloud processing, with Euclidean distance replaced by cosine distance in the token embedding space. Starting from a randomly chosen token, each step selects the token farthest (in cosine distance) from the already-selected set until the target count is reached (a minimal code sketch follows this list). Key advantages: (a) no dependence on attention weights → compatible with FlashAttention; (b) maximal semantic and spatial coverage → low-attention but important objects (e.g., distant vehicles) are not discarded; (c) negligible computational overhead: only 0.02 s for \(N=729\), contributing less than 0.1% of total FLOPs.
- View-adaptive pruning rate optimization: the retention rate \(\alpha_i\) of each view is treated as an optimizable variable. The objective \(\mathcal{M}(\alpha) = R(\alpha) - \lambda P(\alpha)\) balances a performance reward against a total-token-count penalty. TPE searches for the optimal configuration on a validation set of 500 samples and converges in only 3 H100 GPU hours. The front-view camera automatically receives a higher retention rate (reflecting its greater importance for driving decisions), while rear and side views are pruned more aggressively.
- Theoretical guarantees: the paper proves that combining T-FPS (a k-center greedy approximation minimizing Hausdorff distance) with view-adaptive rates (importance-weighted budget allocation) yields tighter error bounds than uniform random sampling with equal-ratio pruning; the classical approximation guarantee behind FPS is recalled after this list for context.
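A minimal sketch of the T-FPS selection loop described above, assuming the token embeddings of one view arrive as an \((N, D)\) tensor; `tfps_select` is a hypothetical helper written for illustration, not the released implementation:

```python
import torch
import torch.nn.functional as F

def tfps_select(tokens: torch.Tensor, keep: int, seed_idx: int = 0) -> torch.Tensor:
    """Farthest point sampling in token embedding space with cosine distance.

    tokens: (N, D) visual tokens of one camera view.
    keep:   number of tokens to retain for this view.
    Returns the indices of the selected tokens.
    """
    feats = F.normalize(tokens, dim=-1)              # unit norm, so 1 - dot = cosine distance
    selected = [seed_idx]                            # the paper starts from a random token
    min_dist = 1.0 - feats @ feats[seed_idx]         # distance of every token to the seed
    for _ in range(keep - 1):
        nxt = int(torch.argmax(min_dist))            # token farthest from the selected set
        selected.append(nxt)
        # keep, for each token, the distance to its nearest selected token
        min_dist = torch.minimum(min_dist, 1.0 - feats @ feats[nxt])
    return torch.tensor(selected, device=tokens.device)

# Example: keep 10% of 729 tokens (72) for one view.
view_tokens = torch.randn(729, 1024)
kept = view_tokens[tfps_select(view_tokens, keep=72)]
```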
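For context, the classical guarantee this kind of greedy selection enjoys is Gonzalez's 2-approximation for the k-center problem; the paper's own theorem and constants may differ, and cosine distance only approximately satisfies the triangle inequality, so treat this as background rather than the paper's exact result:

\[
\max_{x \in V} \min_{s \in S_k} d(x, s) \;\le\; 2 \cdot \min_{|S| = k} \, \max_{x \in V} \min_{s \in S} d(x, s),
\]

i.e., the covering radius of the greedily selected subset \(S_k\) is at most twice that of the best possible size-\(k\) subset, a Hausdorff-style coverage bound that uniform random sampling does not enjoy.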
### Loss & Training
The method is entirely training-free. T-FPS is applied directly after the visual encoder output, and view-adaptive rates are searched offline once on a small validation set and then fixed. The framework is compatible with LLaVA-OneVision-7B (DriveMM) and InternVL2.5-8B (DriveLMM-o1).
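A sketch of what the offline rate search could look like with an off-the-shelf TPE implementation such as Optuna (an assumption; the paper does not name a library, and `evaluate_on_validation_split`, the search ranges, and the value of \(\lambda\) are placeholders):

```python
import optuna

VIEWS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]
LAMBDA = 0.1  # weight of the token-count penalty P(alpha); illustrative value only

def evaluate_on_validation_split(alphas: dict) -> float:
    # Placeholder: run the pruned VLM on the ~500 held-out samples and return
    # the average benchmark score R(alpha). Stubbed with a dummy surrogate so
    # the sketch runs end to end.
    return 1.0 - sum((a - 0.2) ** 2 for a in alphas.values())

def objective(trial: optuna.Trial) -> float:
    # One retention rate per camera view, searched jointly.
    alphas = {v: trial.suggest_float(f"alpha_{v}", 0.05, 0.50) for v in VIEWS}
    reward = evaluate_on_validation_split(alphas)       # R(alpha)
    penalty = sum(alphas.values()) / len(alphas)        # P(alpha): mean retention rate
    return reward - LAMBDA * penalty                    # M(alpha) = R(alpha) - lambda * P(alpha)

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)  # per-view retention rates, fixed afterwards for deployment
```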
## Key Experimental Results
DriveLM (DriveMM, 10% token retention):
| Method | Tokens/Image | Avg Score | Prefill Speedup | FLOPs (% of vanilla) |
|---|---|---|---|---|
| Vanilla | 729 | 59.1 | 1× | 100% |
| FastV | 72 | 54.1 | 5.78× | 14.2% |
| SparseVLM | 72 | 55.9 | 4.06× | 14.4% |
| Prune2Drive | 72 | 57.4 | 6.40× | 13.4% |
- DriveLMM-o1 (10% retention): 68.3 vs. vanilla 74.2 (a drop of about 6 points), outperforming FastV (65.3) and DART (67.4).
- General VLM (LLaVA-1.5, 128 tokens): retains 97.3% of the vanilla average performance, surpassing SparseVLM at 96.2%.
- Video AD (OmniDrive): 49.0 vs. vanilla 50.3, outperforming FastV (44.3) and SparseVLM (46.8).
## Ablation Study
- Distance metric: cosine ≈ L1 ≈ L2 >> minimum distance (nearest-point sampling causes severe degradation of −15%, validating the diversity principle).
- TPE > Evolutionary > GridSearch: differences are small (<1%), but all outperform manual rate assignment.
- Match Score at 25% retention even exceeds the original model (34.0 vs. 33.9): moderate pruning has a regularization effect by removing redundant or distracting tokens.
- T-FPS also generalizes to general VLMs: 94.3% on LLaVA-1.5 at 64 tokens vs. 89.5% for SparseVLM, an even wider margin than in the 128-token setting.
- Failure mode: objects with large uniform-color regions (e.g., orange buses) may be undersampled due to high feature similarity.
## Highlights & Insights
- First token pruning method specifically designed for multi-view autonomous driving — not a naive transfer of single-image approaches.
- The FPS-inspired T-FPS design is elegant: farthest point sampling is applied in the token embedding space rather than in 3D space, preserving semantic diversity.
- View-adaptive rate optimization directly addresses the practical asymmetry in importance between front-view and rear-view cameras.
- The 6.40× prefill speedup has direct industrial value for real-time autonomous driving systems.
- Compatible with FlashAttention and adds negligible overhead (0.02 s per image).
## Limitations & Future Work
- Evaluation is conducted only on offline benchmarks; closed-loop simulation validation is absent.
- T-FPS may undersample objects with large uniform-color regions due to high feature similarity (small cosine distance → not selected by FPS).
- View-adaptive rates are fixed values obtained from offline search, with no dynamic adaptation to varying inputs.
- Comparison with recent token compression methods such as V2Drop and ApET is not provided — baselines are limited to FastV, SparseVLM, DART, and PACT.
- Only two AD models (DriveMM and DriveLMM-o1) are evaluated.
## Related Work & Insights
- vs. FastV (attention-based pruning): FastV relies on second-layer attention, suffers from positional bias, and is incompatible with FlashAttention. Prune2Drive avoids attention entirely, scoring 57.4 vs. 54.1 (+3.3) at 10% token retention.
- vs. DART (similarity-based pruning): DART uses cosine similarity but ignores view-level differences. Prune2Drive adds view-adaptive optimization; results on DriveLM at 10% tokens are not reported for DART.
- vs. V2Drop (CVPR '26): V2Drop prunes tokens inside the LLM using inter-layer variation, while Prune2Drive prunes after encoder output using FPS — the two approaches are orthogonal and combinable.
- vs. ApET (CVPR '26): ApET uses approximation error for importance estimation; Prune2Drive maximizes diversity — different mechanisms targeting similar objectives.
- The "embedding-space FPS" strategy of T-FPS is generalizable to all multi-image VLM settings, including multi-view medical imaging and multi-camera robotic systems.
- The reward-penalty framework for view-adaptive rate optimization is applicable to other cross-modal or cross-source budget allocation scenarios.
- T-FPS is complementary to DUET-VLM's visual-side clustering and language-side pruning paradigm — T-FPS could replace DUET's V2V clustering module.
## Rating
- Novelty: ⭐⭐⭐⭐ T-FPS and view-adaptive rate optimization are both clear, original contributions, though neither component is architecturally complex.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 2 AD benchmarks + 1 video AD benchmark + 5 general VLMs, with efficiency analysis, ablation studies, visualizations, and failure case analysis.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with valuable theoretical analysis; some tables could be more concise.
- Value: ⭐⭐⭐⭐⭐ Real-time deployment of autonomous driving VLMs critically demands methods of this kind; the 6.40× speedup offers direct industrial applicability.