Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving¶
Conference: CVPR 2026 arXiv: 2508.13305 Code: https://github.com/MinhaoXiong/Prune2Drive Area: Multimodal VLM Keywords: Multi-view VLM, Visual Token Pruning, Farthest Point Sampling, View-Adaptive Pruning, Autonomous Driving Acceleration
TL;DR¶
The first plug-and-play token pruning framework for multi-view autonomous driving VLMs. It uses T-FPS (Token-wise Farthest Point Sampling) to preserve semantic and spatial diversity, combined with view-adaptive pruning-rate optimization that automatically allocates token budgets per camera, achieving a 6.40× prefill speedup on DriveLM while retaining only 10% of visual tokens, at roughly a 3% performance drop.
Background & Motivation¶
- Computational explosion in multi-view VLMs: Autonomous driving VLMs (e.g., DriveMM) must process inputs from 6 surround-view cameras, yielding 729 tokens per image and over 4,000 visual tokens in total; a rough cost estimate follows this list. The \(O(n^2)\) attention complexity results in unacceptable inference latency.
- Existing pruning methods are designed for single-image settings: Methods such as FastV and SparseVLM overlook the spatial and semantic diversity across multiple views, and their direct application leads to loss of critical view information.
- Attention-weight-dependent methods are incompatible with efficient attention: FastV and similar approaches require reading the attention matrix, making them incompatible with efficient implementations such as FlashAttention.
- Positional bias: Attention-score-based methods systematically favor tokens at specific positions, neglecting semantically important but low-attention tokens (e.g., distant vehicles).
- Unequal view contributions: The front-view camera is far more critical for driving decisions than the rear-view camera, yet existing methods apply uniform pruning rates across all views.
- Urgent real-time requirements: Autonomous driving is a latency-sensitive domain; high inference latency in VLMs directly compromises safety.
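To make the scale of the first point concrete, a back-of-envelope estimate (ours, not the paper's):

```latex
% 6 surround-view cameras at 729 tokens each:
%   n = 6 x 729 = 4374 visual tokens.
% Prefill self-attention scales quadratically in n, so retaining 10%
% of tokens cuts the attention term to about 1% of its original cost:
\[
n = 6 \times 729 = 4374, \qquad
\frac{(0.1\,n)^2}{n^2} = 0.01 .
\]
% The measured end-to-end prefill speedup (6.40x) is smaller than 100x
% because MLP blocks scale linearly in n and other costs are fixed.
```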
Method¶
Overall Architecture¶
Two core components work in concert: (1) T-FPS (Token-wise Farthest Point Sampling)—selects the most diverse subset of tokens in the token embedding space using farthest point sampling; and (2) View-adaptive pruning rate optimization—employs TPE (Tree-structured Parzen Estimator) to automatically search for the optimal token retention rate for each camera view on a small validation set. The framework is entirely training-free and is applied directly after the visual encoder output.
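A minimal sketch of that placement, assuming a generic multi-view VLM; `t_fps` (defined in the next subsection's sketch), `view_features`, and `alphas` are illustrative names, not the authors' API:

```python
import torch

def prune2drive(view_features: list[torch.Tensor],
                alphas: list[float]) -> torch.Tensor:
    """Training-free pruning slotted between the vision encoder and the LLM.

    view_features: one (N_i, d) token tensor per camera view, taken
                   directly from the visual encoder output.
    alphas: per-view retention rates found by the TPE search (e.g. the
            front view typically keeps a larger fraction).
    """
    kept = []
    for feats, alpha in zip(view_features, alphas):
        k = max(1, int(alpha * feats.shape[0]))
        kept.append(t_fps(feats, k))  # diversity-aware selection per view
    # Concatenate the surviving tokens; the LLM consumes them unchanged.
    return torch.cat(kept, dim=0)
```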
T-FPS Diversity-Aware Token Selection¶
Inspired by FPS (Farthest Point Sampling) from point cloud processing, the algorithm replaces FPS's Euclidean distance with cosine distance over token embeddings; a code sketch follows the steps below:
- Randomly select an initial token and add it to the selected set \(\mathcal{S}\).
- At each step, compute the cosine distance between all unselected tokens and the most recently added token in \(\mathcal{S}\).
- Update the minimum distance record for each unselected token.
- Add the token with the largest minimum distance (i.e., the farthest from \(\mathcal{S}\)) to \(\mathcal{S}\).
- Repeat until the target count \(K\) is reached.
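A self-contained sketch of this loop (our reading of the steps above, not the authors' released code); it runs in \(O(KN)\) time using cosine distance on L2-normalized tokens:

```python
import torch

def t_fps(tokens: torch.Tensor, k: int, seed: int | None = None) -> torch.Tensor:
    """Token-wise farthest point sampling with cosine distance.

    tokens: (N, d) visual tokens from one view; returns a (k, d) subset.
    """
    n = tokens.shape[0]
    # On unit-normalized tokens, cosine distance = 1 - dot product.
    x = torch.nn.functional.normalize(tokens, dim=-1)
    gen = torch.Generator().manual_seed(seed) if seed is not None else None
    selected = [torch.randint(n, (1,), generator=gen).item()]  # random initial token
    # min_dist[j] = distance from token j to the nearest already-selected token
    min_dist = torch.full((n,), float("inf"))
    for _ in range(k - 1):
        last = x[selected[-1]]                    # most recently added token
        dist = 1.0 - x @ last                     # cosine distance to it
        min_dist = torch.minimum(min_dist, dist)  # update each token's record
        min_dist[selected] = -1.0                 # never re-pick selected tokens
        selected.append(int(min_dist.argmax()))   # farthest-from-set token
    return tokens[torch.tensor(selected)]
```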
Key advantages: (a) No reliance on attention—fully compatible with FlashAttention; (b) Maximizes semantic and spatial coverage—avoids discarding important low-attention objects; (c) Negligible computational overhead—only 0.02s for \(N=729\), less than 0.1% of total FLOPs.
View-Adaptive Pruning Rate Optimization¶
Each view's retention rate \(\alpha_i\) is treated as an optimizable variable, and the objective maximized over \(\boldsymbol{\alpha}\) takes the form \(J(\boldsymbol{\alpha}) = R(\boldsymbol{\alpha}) - \lambda\,P(\boldsymbol{\alpha})\), where:
- Reward term \(R(\boldsymbol{\alpha})\): linguistic similarity between model output and ground truth.
- Penalty term \(P(\boldsymbol{\alpha}) = \sum_{i=1}^{M} \alpha_i\): total token retention, encouraging sparsity.
- Hyperparameter \(\lambda\): balances performance and efficiency.
TPE is used to search for the optimal solution on a small validation set of 500 samples, converging within only 3 H100 GPU hours. Results show that the front-view camera automatically receives a higher retention rate, while the rear and side views are more aggressively pruned.
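The paper does not tie the search to a specific library, but it maps directly onto an off-the-shelf TPE implementation. A hedged sketch using Optuna's TPESampler; the view names, search range, \(\lambda\) value, and the `evaluate_on_val` hook are our placeholders:

```python
import optuna

VIEWS = ["front", "front_left", "front_right", "rear", "rear_left", "rear_right"]
LAMBDA = 0.1  # lambda balancing performance vs. efficiency; value illustrative

def evaluate_on_val(alphas: list[float]) -> float:
    """Placeholder: run the pruned VLM on the ~500-sample validation split
    and return the linguistic similarity R(alpha) vs. ground truth."""
    raise NotImplementedError  # model- and benchmark-specific

def objective(trial: optuna.Trial) -> float:
    # One retention rate per camera view, searched jointly.
    alphas = [trial.suggest_float(f"alpha_{v}", 0.02, 0.5) for v in VIEWS]
    reward = evaluate_on_val(alphas)    # R(alpha)
    penalty = sum(alphas)               # P(alpha) = sum_i alpha_i
    return reward - LAMBDA * penalty    # J(alpha) = R - lambda * P

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=100)
best_alphas = [study.best_params[f"alpha_{v}"] for v in VIEWS]
```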
Theoretical Guarantee¶
The paper proves that combining T-FPS (a greedy k-center approximation that minimizes the Hausdorff distance between the kept and full token sets) with view-adaptive rates (importance-weighted budget allocation) yields a tighter error bound than uniform random sampling with equal per-view pruning ratios, under a View-Weighted Lipschitz continuity assumption.
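For context, the classical guarantee underlying the T-FPS half of this argument (Gonzalez, 1985; stated here for a general metric \(d\), not as the paper's exact bound) is that greedy farthest-point selection 2-approximates the optimal k-center covering radius, which is exactly the Hausdorff distance between the kept set and the full token set:

```latex
% Greedy FPS guarantee for one view's token set T with K kept tokens:
\[
d_H\!\left(\mathcal{T}, \mathcal{S}_{\mathrm{FPS}}\right)
  = \max_{t \in \mathcal{T}} \min_{s \in \mathcal{S}_{\mathrm{FPS}}} d(t, s)
  \;\le\; 2 \min_{|\mathcal{S}| = K} \max_{t \in \mathcal{T}} \min_{s \in \mathcal{S}} d(t, s).
\]
% Under the Lipschitz assumption, the model-output perturbation from
% pruning is bounded by a constant times this radius per view, and the
% view-adaptive budgets tighten the importance-weighted sum of radii.
```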
Compatibility¶
The framework is entirely training-free and compatible with multiple VLMs including LLaVA-OneVision-7B (DriveMM), InternVL2.5-8B (DriveLMM-o1), and LLaVA-1.5-7B, requiring neither retraining nor access to attention matrices.
Key Experimental Results¶
DriveLM Benchmark (DriveMM model, 10% token retention)¶
| Method | Tokens/Image | Avg Score↑ | Prefill Speedup | FLOPs |
|---|---|---|---|---|
| Vanilla | 729 | 59.1 | 1× | 100% |
| FastV | 72 | 54.1 | 5.78× | 14.2% |
| SparseVLM | 72 | 55.9 | 4.06× | 14.4% |
| PACT | 72 | 56.8 | — | — |
| Prune2Drive | 72 | 57.4 | 6.40× | 13.4% |
DriveLMM-o1 Benchmark (10% token retention)¶
| Method | Overall Reasoning↑ | Risk Accuracy↑ | Scene Understanding↑ |
|---|---|---|---|
| Vanilla (100%) | 74.2 | 73.01 | 75.99 |
| FastV | 65.3 | 65.37 | 66.43 |
| DART | 67.4 | 65.32 | 68.17 |
| Prune2Drive | 68.3 | 68.34 | 69.86 |
General VLM and Video AD Benchmarks¶
| Setting | Prune2Drive | SparseVLM | FastV |
|---|---|---|---|
| LLaVA-1.5 (128 tokens) | 97.3% of original | 96.2% | 92.8% |
| LLaVA-1.5 (64 tokens) | 94.6% | 86.9% | 74.3% |
| OmniDrive (Video AD, absolute score) | 49.0 | 46.8 | 44.3 |
Ablation Study¶
| Ablation | DriveLMM-o1 Overall↑ | Notes |
|---|---|---|
| Cosine distance (default) | 68.3 | Best |
| L1 distance | 68.3 | Nearly equivalent |
| L2 distance | 67.7 | Slightly lower |
| Min distance (nearest sampling) | 63.0 | Severe degradation −5.3; validates diversity principle |
| TPE (default) | 68.3 | Best HPO |
| Grid Search | 67.3 | −1.0 |
| Evolutionary | 67.6 | −0.7 |
Interesting finding: at 25% token retention on DriveLM, the Match Score reaches 34.0, surpassing the original model's 33.9. Moderate pruning thus acts as a regularizer: removing redundant or noisy tokens can improve certain metrics.
Highlights & Insights¶
- First token pruning framework tailored for multi-view autonomous driving: Rather than naively transferring single-image methods, the work systematically addresses multi-view spatial/semantic diversity and heterogeneous view contributions.
- Elegant T-FPS design: The FPS idea from point cloud processing is transferred to the token embedding space, using cosine distance to ensure semantic diversity with only 0.02s of computational overhead.
- View-adaptive rate optimization automatically discovers front > rear priority: Without manually designed priors, TPE search automatically allocates optimal budgets.
- 6.40× speedup with direct industrial value: The acceleration is practically meaningful for the deployment of real-time autonomous driving systems.
Limitations & Future Work¶
- Objects with large uniform textures may be undersampled: for objects such as a large orange bus, token features are highly similar, so T-FPS may retain too few of them and lose information.
- T-FPS depends on random initialization: The random selection of the initial token may introduce slight variance; the paper does not report variance across multiple runs.
- Validation limited to 7B–8B scale VLMs: No experiments on larger models (70B+); pruning ratios and their effects may vary with model scale.
- View-adaptive rates are static: The same pruning rate is applied to all samples, without accounting for the fact that different driving scenarios (highway, congestion, intersection) may require different view-attention allocations.
- Speedup is concentrated in the prefill stage: KV-cache savings come from pruning before prefill, so decoding-phase speedup is only 1.04–1.09×, offering limited acceleration for long-sequence generation.
Related Work & Insights¶
vs. FastV / SparseVLM / PACT (Single-Image Token Pruning)¶
FastV selects tokens based on second-layer attention scores, exhibiting positional bias and incompatibility with FlashAttention. SparseVLM employs text-guided cross-modal attention pruning, also requiring attention matrix access. PACT uses progressive multi-stage pruning. All three are designed for single-image settings and do not account for semantic complementarity or contribution differences across views. Prune2Drive's T-FPS requires no attention access, and the view-adaptive rates are specifically designed for multi-view scenarios. Under extreme compression to 64 tokens, Prune2Drive (94.6%) substantially outperforms SparseVLM (86.9%) and FastV (74.3%).
vs. DriveMM / DriveLMM-o1 (Autonomous Driving VLMs)¶
DriveMM and DriveLMM-o1 are autonomous driving–specific VLMs. Prune2Drive is applied directly on top of them as a plug-and-play module without modifying model weights. This orthogonal acceleration approach means Prune2Drive can be combined with any future autonomous driving VLM.
vs. Quantization / Distillation (Other Acceleration Methods)¶
Quantization (e.g., GPTQ) reduces numerical precision but does not reduce token count; distillation requires additional training. Prune2Drive is a training-free token reduction method that is orthogonal to both quantization and distillation, and can be combined with them for greater overall speedup.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of T-FPS and view-adaptive rates is novel, though FPS itself and token pruning are established concepts.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two AD benchmarks + general VLM + video AD + comprehensive ablations + efficiency analysis + theoretical proof.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with complete theory–experiment–analysis flow and rigorous derivations.
- Value: ⭐⭐⭐⭐ — Direct practical value for multi-view VLM acceleration; 6.40× speedup is industrially attractive.