
Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning

Conference: ICCV 2025
arXiv: 2503.08101
Code: https://github.com/iseri27/tg_gbc
Area: 3D Vision
Keywords: 3D Object Detection, Transformer Pruning, Attention Key Pruning, Zero-Shot Acceleration, Autonomous Driving

TL;DR

This paper proposes tgGBC (trim keys gradually Guided By Classification scores), a zero-shot runtime pruning method that computes key importance by element-wise multiplication of classification scores and attention maps, progressively pruning unimportant keys across layers. It achieves nearly 2× acceleration of the Transformer decoder on multiple 3D detectors with less than 1% performance degradation.

Background & Motivation

Query-based 3D object detection methods (e.g., PETR, StreamPETR, ToC3D) employ multi-layer Transformer decoders to process dense features from surround-view cameras and have become the dominant state-of-the-art paradigm. However, the global feature interaction in these dense methods incurs substantial computational overhead, hindering deployment on edge devices.

Limitations of existing acceleration approaches:

Static pruning methods (e.g., FastV, SparseViT): require an additional forward pass for hyperparameter search and are designed for ViT classification models, making them difficult to transfer to 3D detectors.

Runtime pruning methods (e.g., ToMe, Zero-TPrune): ToMe's similarity matrix computation has complexity \(O(N_k^2)\), which is prohibitively expensive for the large token counts in 3D detectors (4,224–24,000); Zero-TPrune assumes a square attention matrix, but in 3D detectors the numbers of queries and keys differ, yielding a non-square attention matrix.

Key Challenge: The token count in 3D detection far exceeds that in ViT models (3D: 4,224–24,000 vs. ViT: ~1,024), making direct transfer of existing methods suboptimal in both efficiency and effectiveness.
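A back-of-the-envelope illustration (our arithmetic, not a figure from the paper): a ToMe-style pairwise similarity matrix over 24,000 tokens has \(24{,}000^2 \approx 5.8 \times 10^8\) entries, versus about \(10^6\) for a 1,024-token ViT, roughly a 550× gap in the quadratic term alone.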

Core Insight: In 3D detectors, the final output is determined by predictions with the highest classification scores. Keys that contribute little to high-confidence predictions can therefore be safely pruned. Classification scores and attention maps arise naturally within the Transformer decoder, requiring no additional parameters.

Core Idea: Classification scores are broadcast and element-wise multiplied with the attention map; column-wise summation yields a per-key importance score, and the least important keys are pruned layer by layer.
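Stated compactly (our restatement of the steps detailed under Method, where \(\hat{C}_i = \max_j C_{i,j}\) is the best classification score of query \(i\) and \(\mathcal{T}\) is the index set of the top-\(k\) such queries), the importance of key \(j\) is

\[
s_j = \sum_{i \in \mathcal{T}} \hat{C}_i \, A_{i,j},
\]

and the keys with the smallest \(s_j\) are removed.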

Method

Overall Architecture

The tgGBC module is inserted between adjacent Transformer decoder layers. It receives the classification scores \(C \in \mathbb{R}^{N_q \times N_C}\) and cross-attention weights \(A \in \mathbb{R}^{N_q \times N_k}\) from the preceding layer, computes key importance, and prunes unimportant keys. A total of \(r\) keys are removed, distributed across \(n\) layers, with \(\lfloor r/n \rfloor\) keys pruned per layer.
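As a worked example (configuration values from the results table below; the split is our assumption of a standard 6-layer decoder with pruning after each of the first \(n = 5\) layers): for StreamPETR with \(N_k = 24{,}000\) and \(r = 21{,}000\), each tgGBC step removes \(\lfloor 21000/5 \rfloor = 4{,}200\) keys, so the final layer attends over only 3,000 keys.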

Key Designs

  1. Key Importance Computation:

    • Step 1: Obtain the maximum classification score per query: \(\hat{C}_i = \max_j C_{i,j}\)
    • Step 2: Expand to \(\tilde{C} \in \mathbb{R}^{N_q \times N_k}\) by repeating along the key dimension
    • Step 3: Element-wise multiply: \(S_0 = A \odot \tilde{C}\) (propagating classification quality to each key)
    • Step 4: Select the top-\(k\) rows by classification score to obtain \(S_1 \in \mathbb{R}^{k \times N_k}\)
    • Step 5: Sum column-wise to obtain the importance of key \(j\): \(s_j = \sum_{i=1}^{k} (S_1)_{i,j}\) (these steps are sketched in code after this list)
    • Design Motivation: Not all queries are equally important; only high-confidence queries contribute to final predictions. Keys that are more relevant to these high-confidence queries are preferentially retained, avoiding the "democratic" assignment of equal weight to all queries.
  2. Gradual Layer-wise Pruning: Rather than removing all target keys in a single layer, \(\lfloor r/n \rfloor\) keys are pruned after each of the first \(n\) Transformer layers.

    • Design Motivation: Progressive pruning allows subsequent layers to recompute attention over the already-pruned key set, resulting in less information loss compared to one-shot pruning.
  3. Feasibility Guarantee: The attention module output shape \(O \in \mathbb{R}^{N_q \times E}\) is independent of \(N_k\); thus, pruning keys does not alter the output dimension. Provided that \(K\) and \(V\) are pruned synchronously (in 3D detectors, \(V = K\)), the original model parameters remain valid.
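
The steps above map directly onto a few tensor operations. Below is a minimal PyTorch sketch of one tgGBC step under our own naming conventions, illustrating Steps 1–5 and the synchronous \(K\)/\(V\) pruning; it is a sketch of the idea, not the authors' reference implementation:

```python
import torch

def tggbc_step(C, A, K, V, k, m):
    """One tgGBC pruning step (sketch; names are ours, not the authors').

    C: (N_q, N_C) classification scores from the previous decoder layer
    A: (N_q, N_k) cross-attention weights from the previous decoder layer
    K, V: (N_k, E) keys/values to prune (V = K in the detectors considered)
    k: number of high-confidence queries used for scoring
    m: number of keys to prune in this step (roughly r // n)
    """
    # Step 1: maximum classification score per query.
    C_hat = C.max(dim=1).values                       # (N_q,)
    # Steps 2-3: broadcast along the key dimension, weight the attention map.
    S0 = A * C_hat.unsqueeze(1)                       # (N_q, N_k)
    # Step 4: keep only the rows of the top-k most confident queries.
    S1 = S0[C_hat.topk(k).indices]                    # (k, N_k)
    # Step 5: column-wise sum gives one importance score per key.
    s = S1.sum(dim=0)                                 # (N_k,)
    # Keep the N_k - m most important keys, preserving their original order;
    # pruning K and V together leaves the attention output shape (N_q, E) intact.
    keep = s.topk(K.shape[0] - m).indices.sort().values
    return K[keep], V[keep]
```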

Loss & Training

  • Training-free: tgGBC does not modify model parameters and requires no training or fine-tuning.
  • Plug-and-play: pruning layers are appended only during inference (a placement sketch follows this list).
  • Negligible additional computation: only a single element-wise matrix multiplication and column-wise summation are required.
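
For concreteness, a sketch of the plug-and-play placement at inference time, reusing `tggbc_step` from above (the decoder interface here is hypothetical; real integration is detector-specific):

```python
def run_decoder(layers, queries, keys, values, r, n, k):
    """Hypothetical inference loop with tgGBC inserted between decoder layers.

    Each layer is assumed to return (updated_queries, C, A); this illustrates
    the placement only, not the authors' actual API.
    """
    m = r // n  # keys pruned after each of the first n layers
    for i, layer in enumerate(layers):
        queries, C, A = layer(queries, keys, values)
        if i < n:
            keys, values = tggbc_step(C, A, keys, values, k, m)
    return queries
```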

Key Experimental Results

Main Results

| Model | Backbone | \(N_k\) | \(r\) | mAP | NDS | Decoder Time (ms) | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PETR | ResNet50 | 16,896 | 0 | 31.74% | 0.367 | 47.09 | 1.00× |
| PETR+tgGBC | ResNet50 | 16,896 | 12,000 | 30.78% | 0.358 | 28.58 | 1.65× |
| StreamPETR | VovNet | 24,000 | 0 | 48.89% | 0.573 | 64.93 | 1.00× |
| StreamPETR+tgGBC | VovNet | 24,000 | 21,000 | 48.55% | 0.573 | 34.98 | 1.86× |
| 3DPPE | VovNet | 6,000 | 0 | 39.81% | 0.446 | 53.25 | 1.00× |
| 3DPPE+tgGBC | VovNet | 6,000 | 3,000 | 39.74% | 0.445 | 27.56 | 1.93× |
| ToC3D+tgGBC (latest model) | – | – | – | – | – | – | 1.99× |

A 1.99× Transformer decoder speedup is achieved on ToC3D with less than 1% mAP degradation.

Ablation Study

| Pruning Method | Model | mAP | Decoder Latency (ms) |
| --- | --- | --- | --- |
| No pruning | FocalPETR (VovNet) | 42.36% | 24.65 |
| ToMe | FocalPETR (VovNet) | 41.82% | 22.35 |
| tgGBC | FocalPETR (VovNet) | 42.38% | 17.08 |
| No pruning | StreamPETR (ResNet50) | 38.01% | 31.10 |
| ToMe | StreamPETR (ResNet50) | 37.55% | 29.40 |
| tgGBC | StreamPETR (ResNet50) | 37.93% | 24.15 |

On FocalPETR, tgGBC even improves mAP (42.38% vs. 42.36%), suggesting that moderate key pruning may exert a regularization effect.

Key Findings

  • Deployment on edge devices (Orin) yields 1.18× and 1.19× inference speedup for FocalPETR and StreamPETR, respectively.
  • tgGBC can prune up to 90% of keys within a single layer (\(r\) approaching \(N_k\)), which is infeasible for ToMe due to its bipartite matching constraint.
  • On certain models, tgGBC improves performance, indicating that redundant keys effectively act as noise.

Highlights & Insights

  • Truly zero-shot: no training data, hyperparameter search, or fine-tuning is required; the method is fully plug-and-play.
  • The method leverages classification scores and attention maps already present in the model ("free" information) without introducing any additional parameters.
  • Strong cross-model generality: validated on PETR, FocalPETR, StreamPETR, 3DPPE, MV2D, M-BEV, and ToC3D.
  • Architecture-agnostic: applicable to any Transformer decoder employing dense global attention.
  • Practical deployment efficacy verified on edge hardware.

Limitations & Future Work

  • Applicable only to dense methods with global attention; inapplicable to sparse methods using Deformable Attention, and to fused kernels such as Flash Attention that never materialize the full attention map tgGBC must read.
  • The pruning ratio \(r\) must be set manually; no automatic selection strategy is provided.
  • Evaluation is limited to the nuScenes dataset; validation on other autonomous driving benchmarks (e.g., Waymo, Argoverse) is absent.
  • Key pruning for historical frames in temporal models (e.g., StreamPETR) is not thoroughly investigated.
  • Query pruning in self-attention remains unexplored; the current work targets only cross-attention keys.

Related Work

  • ToMe: merges tokens via bipartite matching; the present work avoids its \(O(N_k^2)\) similarity computation.
  • Zero-TPrune: uses Markov-chain convergence as a pruning criterion but requires a square attention map.
  • 3D Detectors: query-based methods such as the PETR family, OPEN, and ToC3D are the primary application targets.
  • Insight: the paradigm of using task-specific signals (classification scores) as pruning criteria should generalize to other Transformer tasks with analogous score signals.

Rating

  • Novelty: ⭐⭐⭐⭐ Elegant use of classification-score-weighted attention maps for key pruning; the method is concise and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven models × multiple configurations, including edge-device deployment validation.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear with intuitive illustrations.
  • Value: ⭐⭐⭐⭐⭐ A plug-and-play acceleration tool with direct practical value for autonomous driving deployment.