ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention¶
Conference: ECCV 2024
arXiv: 2401.00912
Code: https://github.com/skyhehe123/ScatterFormer
Area: 3D Vision
Keywords: 3D Object Detection, Voxel Transformer, Linear Attention, Point Cloud Processing, LiDAR Perception
TL;DR¶
ScatterFormer is proposed, which is the first voxel Transformer to directly apply linear attention on variable-length voxel sequences across windows. By implementing a Scattered Linear Attention (SLA) module and a chunk-wise matrix multiplication algorithm, it achieves sub-millisecond latency. Paired with a Cross-Window Interaction (CWI) module to replace window shifting, it achieves state-of-the-art accuracy on Waymo and nuScenes while maintaining a detection speed of 23 FPS.
Background & Motivation¶
Background: Window-based Transformers (such as SST, DSVT) perform outstandingly in large-scale point cloud 3D object detection, obtaining context-aware feature representations by localizing attention computation. However, the sparsity of point clouds leads to highly variable numbers of voxels in each window (the variable-length sequence problem).
Limitations of Prior Work: Existing methods (SST, DSVT, FlatFormer) adopt group-based strategies to solve the parallel computation problem of variable-length sequences, sorting and padding voxels into fixed-length sequences. This introduces significant computational and memory overhead. In DSVT, sorting and resorting operations account for ~24% of the total backbone latency.
Key Challenge: The unequal number of voxels per window causes the attention matrix to occupy irregular memory space, preventing direct parallel computation. Meanwhile, the additional overhead introduced by existing solutions (padding + sorting) offsets the performance gains brought by attention.
Goal: Design a method to efficiently compute attention in parallel on variable-length voxel sequences without resorting and padding.
Key Insight: Leverage the "recurrence form" nature of linear attention—the hidden state matrix \(S \in \mathbb{R}^{d \times d}\) is independent of sequence length, allowing parallel attention across different windows to be completed in shared memory via chunk-wise computation.
Core Idea: Flatten voxels of all windows into a single sequence, and utilize the KV pre-computation property of linear attention to perform chunk-wise parallel calculation, avoiding the overhead of padding and sorting.
Method¶
Overall Architecture¶
The overall architecture of ScatterFormer is as follows: the input point cloud is voxelized and converted into high-dimensional embeddings through a VFE layer, then processed by Conditional Position Encoding (CPE). The encoded features are fed into a backbone consisting of 6 ScatterFormer Blocks. Each Block contains an SLA module, a CWI module, and an FFN, supplemented by Batch Normalization and skip connections. After passing through 3 Blocks, features are downsampled via sparse convolution, then converted to pillar features to generate compact BEV features for bounding box prediction.
Key Designs¶
- Linear Attention Basics: Standard self-attention has a complexity of \(O(N^2)\). Linear attention approximates softmax through kernel functions:
By changing the computation order from \((\phi(Q) \cdot \phi(K)^\mathsf{T}) \cdot V\) to \(\phi(Q) \cdot (\phi(K)^\mathsf{T} \cdot V)\), the complexity is reduced to \(O(N)\). A key property is that the hidden state matrix \(S \in \mathbb{R}^{d \times d}\) is independent of sequence length and can be converted into a recurrence form:
This implies that the sequence can be divided into non-overlapping chunks for computation.
- Scattered Linear Attention (SLA): The core innovative module. Voxels of the entire scene are sorted by window coordinates and then flattened into a single matrix. Voxels within the same window form sub-matrices in contiguous memory. After obtaining \(Q, K, V\) via shared projection layers, linear attention is computed separately for each window's sub-matrix:
where \(M\) is the number of non-empty windows, and \(m^j\) is the number of voxels in the \(j\)-th window.
Hardware-Efficient Implementation (Chunk-wise Algorithm): Optimization utilizing GPU memory hierarchy: - Split the flattened \(Q, K, V\) into fixed-length chunks. - Load chunks from slow HBM to fast SRAM (shared memory). - Assign a single thread to each window, iterating through all corresponding K/V chunks to accumulate matrix products into the hidden state matrix \(S\). - Assign independent threads to each query chunk \(q_i \in Q^j\) to multiply with the corresponding hidden state matrix to obtain the output. - Implemented using Triton, avoiding large output matrices, with less than 1ms latency.
Compared to a naive scatter-based implementation, the chunk-wise method has significant advantages in speed and memory usage.
-
Cross-Window Interaction (CWI): A cross-window interaction module that replaces window shifting. Window shifting requires re-calculating window coordinates and resorting voxels, which accounts for ~24% of latency in DSVT. The CWI module employs an Inception-style multi-branch design:
- Input features are divided into 4 parts along the channel dimension.
- Branch 1: \((S_h+1) \times 1 \times 1\) depthwise convolution (long 1D kernel along the height direction).
- Branch 2: \(1 \times (S_w+1) \times 1\) depthwise convolution (long 1D kernel along the width direction).
- Branch 3: \(3 \times 3 \times 3\) depthwise convolution (enhancing locality).
- Branch 4: Identity mapping.
The axially decomposed large kernel design allows voxel features to be efficiently mixed across different windows, achieving a better accuracy-latency trade-off than window shifting.
Loss & Training¶
- The detection head and loss function are identical to DSVT: heatmap estimation + bounding box regression + IoU loss confidence calibration.
- The detection head on nuScenes adopts the TransFusion architecture.
- Waymo: voxel size (0.32m, 0.32m, 0.1875m), window (12,12), 24 epochs, lr=0.006.
- nuScenes: voxel size (0.3m, 0.3m, 8m), window (20,20), 20 epochs, lr=0.004.
- Ground-truth sampling data augmentation is disabled in the last 4 epochs.
- 8 RTX A6000 GPUs, batch size 32.
Key Experimental Results¶
Main Results — Waymo Open Dataset (Val Set)¶
| Method | Frames | ALL L2 mAPH↑ | Vehicle L2↑ | Pedestrian L2↑ | Cyclist L2↑ |
|---|---|---|---|---|---|
| DSVT-1f | 1 | 72.1 | 71.0 | 71.5 | 73.7 |
| HEDNet-1f | 1 | 73.4 | 72.7 | 72.6 | 74.9 |
| ScatterFormer | 1 | 73.8 | 72.7 | 72.6 | 76.1 |
| DSVT-4f | 4 | 75.6 | 73.6 | 75.9 | 77.3 |
| ScatterFormer-4f | 4 | 76.7 | 74.7 | 76.6 | 78.1 |
Main Results — nuScenes (Val Set)¶
| Method | NDS↑ | mAP↑ |
|---|---|---|
| DSVT | 71.1 | 66.4 |
| HEDNet | 71.4 | 66.7 |
| ScatterFormer | 72.4 | 68.3 |
Ablation Study¶
| Configuration | Vehicle APH/L2 | Ped APH/L2 | Cyclist APH/L2 | Description |
|---|---|---|---|---|
| baseline (Full) | 71.1 | 70.9 | 73.1 | All components |
| w/o SLA | 68.0 (-3.1) | 67.0 (-3.9) | 69.9 (-3.2) | SLA is the most critical component |
| w/o CWI | 69.2 (-1.9) | 69.9 (-1.0) | 71.1 (-2.0) | CWI contributes to all classes |
| CWI→SW | 69.5 (-1.6) | 70.4 (-0.5) | 72.4 (-0.7) | CWI outperforms window shifting |
| w/o CPE | 69.8 (-1.3) | 67.2 (-3.7) | 70.5 (-2.6) | CPE is crucial for Pedestrian/Cyclist |
Window Size Ablation¶
| Window Size | Area Size | Vehicle | Pedestrian | Cyclist |
|---|---|---|---|---|
| 10 | 3.20m | 70.2 | 69.2 | 71.0 |
| 12 | 3.84m | 71.1 | 70.9 | 73.1 |
| 14 | 4.48m | 71.3 | 69.9 | 72.5 |
| 16 | 5.12m | 71.0 | 69.2 | 72.3 |
Key Findings¶
- ScatterFormer single-frame achieves a 1.7% improvement in Waymo L2 mAPH over DSVT and 0.4% over HEDNet.
- On nuScenes, mAP outperforms HEDNet by 1.6% (68.3 vs 66.7) and NDS by 1.0%.
- Detection speed reaches 23 FPS, significantly outperforming other Transformer detectors and approaching sparse convolutional detectors.
- SLA module latency is < 1ms, with the chunk-wise implementation performing significantly better in speed and memory than the scatter-based implementation.
- The model is insensitive to window size, with 12×12 being the optimal choice.
- CWI yields better performance than Shifted Window with lower latency.
- The model is not highly sensitive to different linear attention variants (Efficient Attn, Gated Linear Attn, Focused Linear Attn, XCA).
Highlights & Insights¶
- Pioneering Nature: The first method to directly apply attention to variable-length voxel sequences, completely avoiding sorting and padding.
- I/O-Aware Algorithm: Design of chunk-wise matrix multiplication utilizing GPU memory hierarchy (HBM→SRAM) is an excellent example of hardware-algorithm co-design.
- CWI Replacing Window Shifting: Sorting/resorting accounts for 24% of latency in DSVT; CWI thoroughly eliminates this overhead with simple axially decomposed large-kernel convolutions.
- Linear Complexity without Sacrificing Accuracy: Proves that linear attention is not only feasible in 3D detection but also superior to softmax attention.
- Significant Improvement for Cyclist: ScatterFormer-4f improves by 0.8 APH (L2) on Cyclist compared to DSVT-4f, indicating a special advantage in detecting sparse, small objects.
Limitations & Future Work¶
- Reliance on custom operators (Triton kernel), which have not been implemented as TensorRT plugins, requiring additional engineering effort for in-vehicle deployment.
- Performance of linear attention degrades with excessively large windows, likely because the "focusing capability" of linear attention is diluted as the token count increases.
- The current depthwise convolution requires support from a modified Spconv library, increasing the barrier to entry.
- Deeper integration with temporal modeling (multi-frame fusion) can be explored.
- Adaptive mechanisms such as gating could be introduced to enhance the expressiveness of linear attention.
- Failed to completely surpass multi-frame methods on Waymo test set Pedestrian and Cyclist categories.
Related Work & Insights¶
- DSVT: Alternating axis sorting + fixed-length grouping. This is the most direct baseline; ScatterFormer comprehensively outperforms it in accuracy and speed.
- SST: Pioneered window-based Transformers for point cloud detection, but has low mixed serial-parallel computation efficiency.
- FlatFormer: Flattens voxels into fixed-length sequences, but still incurs sorting overhead.
- FlashAttention: I/O-aware attention computation inspired the chunk-wise implementation of SLA.
- Inspiration: The recurrence form of linear attention \(S_t = S_{t-1} + k_t^\top v_t\) is naturally suited to solving variable-length sequence problems. This insight can be extended to other sparse data processing with Transformers.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to combine the recurrence property of linear attention with variable-length voxel sequences; SLA + CWI designs are elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated on two datasets + detailed ablation study + multiple linear attention comparison + runtime analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear problem motivation, detailed algorithm description, and intuitive illustrations.
- Value: ⭐⭐⭐⭐⭐ Pushes the frontier of voxel Transformers in both accuracy and efficiency, carrying major significance for real-time 3D detection.