SEPatch3D: Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Conference: CVPR 2026
arXiv: 2604.14563
Code: github.com/Mingqj/SEPatch3D
Area: 3D Vision
Keywords: 3D object detection, token compression, patch size selection, multi-view detection, ViT acceleration

TL;DR

This paper proposes SEPatch3D, which achieves 57% inference acceleration with comparable detection accuracy in ViT-based sparse multi-view 3D detection, via spatiotemporal-aware dynamic patch size selection and an entropy-based informative patch enhancement mechanism.

Background & Motivation

Sparse query-based multi-view 3D detectors with ViT backbones (e.g., StreamPETR) deliver strong performance but suffer from high inference latency. Existing token compression strategies each have a limitation: (1) token pruning may discard informative background regions that are critical for learning hard negatives; (2) the irregular aggregation in token merging disrupts contextual consistency; (3) naively increasing the patch size beyond a threshold (e.g., >18) degrades performance because fine-grained semantic cues are lost. The core observation is that enlarging patches reduces computation, but it only pays off if fine-grained information is preserved in semantically important regions.

Method

Overall Architecture

The method follows a two-stage strategy: (1) dynamic dual-patch embedding, in which the SPSS module adaptively selects the patch size from spatiotemporal cues; and (2) selective cross-granularity feature enhancement, in which the IPS module identifies informative patches and the CGFE module enhances those coarse-grained patches with their fine-grained counterparts.

Key Designs

Spatiotemporal-aware Patch Size Selection (SPSS): The average depth \(\bar{D}^{T-1}\) and the change in depth trend slope \(\Delta S^{T-1}\) of the previous frame's object queries dynamically determine the patch size for the current frame. Distant objects with a receding trend select a larger patch size to reduce computation; nearby objects with an approaching trend select a smaller patch size to preserve detail; in all other cases the previous frame's setting is kept to ensure temporal stability.
Entropy-based Informative Patch Selection (IPS): After enhancing patch features via cross-attention with motion-aligned historical queries, the information entropy of L2-normalized features is computed. Patches whose entropy exceeds the scene mean are selected as informative regions, using an adaptive threshold rather than a fixed Top-K to accommodate varying scene complexity.
Cross-Granularity Feature Enhancement (CGFE): Selected coarse-grained patches serve as queries, while the corresponding fine-grained patches from the original resolution serve as keys/values. Detail information is injected via position-encoding-augmented cross-attention, with residual connections preserving global structure.
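The SPSS decision rule and the IPS selection step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the patch sizes (16 and 24), the depth threshold of 30, and the sign convention (a positive slope change meaning "receding") are all assumptions. The summary also does not say how entropy is computed from L2-normalized features; here the squared components of each unit-norm feature, which sum to 1, are treated as a probability distribution, which is one plausible reading.

```python
import numpy as np

def select_patch_size(mean_depth, slope_change, prev_patch,
                      p_small=16, p_large=24, depth_thresh=30.0):
    """Pick the current frame's patch size from previous-frame query statistics.

    All defaults are hypothetical. Assumed convention: slope_change > 0
    means objects are receding, slope_change < 0 means approaching.
    """
    if mean_depth > depth_thresh and slope_change > 0:
        return p_large   # distant and receding: coarser patches, less compute
    if mean_depth <= depth_thresh and slope_change < 0:
        return p_small   # nearby and approaching: finer patches, more detail
    return prev_patch    # mixed cues: keep the previous setting for stability

def informative_patch_mask(features, eps=1e-12):
    """Return (mask, entropy) for an (N, D) array of patch features.

    Assumption: squared components of the L2-normalized feature are used
    as a distribution over channels, and its Shannon entropy is the score.
    """
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    p = unit ** 2
    entropy = -(p * np.log(p + eps)).sum(axis=1)
    # Adaptive threshold: the scene mean, instead of a fixed Top-K.
    return entropy > entropy.mean(), entropy

# Toy features: rows 0 and 2 concentrate energy in one channel,
# rows 1 and 3 spread it evenly across channels.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 1.0],
                  [0.0, 2.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 1.0]])
mask, entropy = informative_patch_mask(feats)
print(mask)
print(select_patch_size(mean_depth=45.0, slope_change=0.5, prev_patch=16))
```

Because the threshold is the per-scene mean, the number of selected patches varies with scene content rather than being fixed in advance, which is the property the adaptive threshold is meant to provide.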

Loss & Training

The detection loss from StreamPETR is inherited. In the dual-patch embedding, the original 16×16 small patches provide fine-grained feature references, while the flexible large patches are used for efficient inference. End-to-end training is employed.
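To make the dual-patch idea concrete, the sketch below patchifies one image at two patch sizes and then lets a few selected coarse tokens attend to the fine tokens they cover, with a residual connection. It is a simplified illustration under several assumptions that are not from the paper: a 256×704 input, a large patch size of 32 (chosen so each coarse patch covers exactly 2×2 fine patches), random linear embeddings, single-head attention without learned query/key/value projections, and no positional encoding. All function names are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches, one flat token each."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    tokens = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tokens.reshape(gh * gw, patch * patch * c), (gh, gw)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def enhance_coarse_tokens(coarse, fine, selected, coarse_grid, fine_grid, ratio):
    """Residual cross-attention: each selected coarse token (query) attends to
    the ratio x ratio fine tokens (keys/values) covering the same image region."""
    d = coarse.shape[1]
    fine_map = fine.reshape(fine_grid[0], fine_grid[1], d)
    out = coarse.copy()
    for idx in np.flatnonzero(selected):
        r, c = divmod(int(idx), coarse_grid[1])
        kv = fine_map[r * ratio:(r + 1) * ratio,
                      c * ratio:(c + 1) * ratio].reshape(-1, d)
        attn = softmax(coarse[idx] @ kv.T / np.sqrt(d))
        out[idx] = coarse[idx] + attn @ kv   # residual keeps the coarse structure
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((256, 704, 3))        # hypothetical input size
fine_px, fine_grid = patchify(image, 16)          # 16x16 fine patches
coarse_px, coarse_grid = patchify(image, 32)      # assumed large patch size

d = 64  # hypothetical embedding width
w_fine = rng.standard_normal((fine_px.shape[1], d)) / np.sqrt(fine_px.shape[1])
w_coarse = rng.standard_normal((coarse_px.shape[1], d)) / np.sqrt(coarse_px.shape[1])
fine, coarse = fine_px @ w_fine, coarse_px @ w_coarse

selected = np.zeros(len(coarse), dtype=bool)
selected[[0, 5, 100]] = True                      # stand-in for the IPS output
enhanced = enhance_coarse_tokens(coarse, fine, selected, coarse_grid, fine_grid, ratio=2)
print(len(fine), len(coarse))  # 704 176
```

In this toy setting, doubling the patch size cuts the token count by 4× (704 to 176), so the pairwise self-attention term shrinks by roughly 16×, while only the selected tokens pay for the extra cross-attention against fine patches.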

Key Experimental Results

Main Results

| Method | Backbone | NDS (%) | mAP (%) | Inference Time |
|---|---|---|---|---|
| StreamPETR (patch=16) | ViT | Baseline | Baseline | Baseline |
| ToC3D-faster | ViT | Slightly lower | Slightly lower | Faster |
| SEPatch3D-faster | ViT | Comparable | Comparable | −57% |

On nuScenes, SEPatch3D-faster accelerates inference by 57% relative to the StreamPETR baseline with a drop of less than 1 point in detection performance, and it is a further 20% faster than ToC3D-faster. Effectiveness is also validated on Argoverse 2.
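
As a quick sanity check on what the headline number implies, the arithmetic below converts the 57% figure into a speedup factor. It assumes the −57% table entry denotes a reduction in per-frame inference time relative to the baseline time \(t_{\text{base}}\); the summary does not state the convention, so this is an interpretation rather than a reported result.

\[
t_{\text{SEPatch3D}} = (1 - 0.57)\, t_{\text{base}} = 0.43\, t_{\text{base}},
\qquad
\frac{t_{\text{base}}}{t_{\text{SEPatch3D}}} = \frac{1}{0.43} \approx 2.33
\]

If the figure instead meant a 57% increase in throughput, the per-frame time would be \(t_{\text{base}} / 1.57 \approx 0.64\, t_{\text{base}}\), a noticeably smaller gain, so the convention matters when comparing against other methods.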

Ablation Study

The joint depth-and-trend decision in SPSS outperforms using depth or trend alone.
The adaptive entropy threshold outperforms fixed Top-K selection.
The cross-granularity enhancement in CGFE is critical for maintaining detection accuracy.

Key Findings

Detection performance begins to degrade when patch size exceeds 18, but selective enhancement can extend the acceleration benefit beyond this point.
Informative patches tend to correspond to texture-rich or edge regions — precisely where coarse patches incur the greatest information loss.
Spatiotemporal-aware selection effectively prevents abrupt patch size changes between consecutive frames.

Highlights & Insights

The "enlarge patch + selective enhancement" paradigm is better suited to 3D detection than pruning or merging, and it offers a new perspective on token compression.
Using spatiotemporal information from detection queries to guide backbone compression is an elegant form of cross-layer interaction.
Unlike the foreground-oriented pruning in ToC3D, this approach retains background information valuable for learning hard negatives.

Limitations & Future Work

The predefined patch size set (\(P_s\), \(P_l\)) and depth threshold \(\theta\) require manual specification.
Fine-grained patches must always be computed, even though they do not participate in all ViT blocks.
Validation is limited to the StreamPETR baseline; generalizability to other sparse detectors has not been tested.

The dynamic patch size selection strategy is extensible to other ViT applications requiring an efficiency–accuracy trade-off.
The paradigm of using spatiotemporal queries to guide backbone computation breaks the conventional unidirectional information flow.
Cross-granularity feature enhancement can be applied to multi-scale representation learning.

Rating

7/10. Clear motivation, a practical method, and significant acceleration, with direct value for autonomous driving scenarios.