CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking¶
- Conference: AAAI 2026
- arXiv: 2511.15580
- Area: 3D Single Object Tracking / Autonomous Driving
- Keywords: Point cloud tracking, spatial redundancy, information redundancy, information bottleneck, SVD, low-rank approximation, dynamic token compression
TL;DR¶
CompTrack is presented as the first framework to address both forms of redundancy in LiDAR point clouds at once. The Spatial Foreground Predictor (SFP) filters background noise using an information-entropy analysis, which removes spatial redundancy. The IB-DTC module estimates the effective rank of the foreground features via online SVD, sets the compression ratio adaptively from that rank, and compresses the foreground into low-rank proxy tokens, which removes information redundancy. CompTrack reaches state-of-the-art results on nuScenes (61.04% mean Success) while running at 90 FPS.
Background & Motivation¶
Background: LiDAR-based 3D single object tracking is a fundamental task in autonomous driving, with methods categorized into appearance matching and motion-centric paradigms.
Limitations of Prior Work: The inherent sparsity of LiDAR point clouds introduces dual redundancy — (1) Spatial redundancy: abundant background points overwhelm sparse target features; (2) Information redundancy: points on large flat surfaces in the foreground provide ambiguous localization cues (analogous to the aperture problem in optical flow), while corner points carry structural information.
Key Challenge: Existing methods primarily address spatial redundancy while entirely ignoring the information redundancy and low-rank structure of foreground feature matrices.
Key Insight: Foreground feature matrices are intrinsically low-rank and can be compressed via optimal low-rank approximation (truncated SVD), which naturally corresponds to the information bottleneck principle.
Core Idea: Spatial redundancy is removed by an information entropy-guided foreground predictor; information redundancy is resolved by estimating the effective rank via online SVD and compressing via cross-attention with learned queries.
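
The low-rank claim in the Key Insight can be checked numerically. The following is a minimal sketch, not the authors' code: it builds a synthetic feature matrix (the sizes `N`, `C`, the true rank `r`, and the noise level are all illustrative assumptions) and verifies the Eckart–Young property that the Frobenius error of the rank-`k` truncated SVD equals the root sum of squares of the discarded singular values.

```python
import numpy as np

def truncated_svd(X, k):
    """Return the rank-k truncated SVD of X and all singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]  # keep only the top-k components
    return X_k, s

rng = np.random.default_rng(0)

# Hypothetical foreground features: N tokens, C channels, true rank r plus small noise.
N, C, r = 128, 64, 8
X = rng.standard_normal((N, r)) @ rng.standard_normal((r, C))
X += 1e-3 * rng.standard_normal((N, C))

X_k, s = truncated_svd(X, k=r)

err = np.linalg.norm(X - X_k)        # Frobenius norm of the approximation error
tail = np.sqrt(np.sum(s[r:] ** 2))   # energy in the discarded singular values
rel_err = err / np.linalg.norm(X)    # relative error of the rank-r approximation

print(f"error={err:.6f}  tail={tail:.6f}  relative={rel_err:.2e}")
```

In this toy setting, 8 of 64 possible components reproduce the matrix almost exactly, which is the kind of structure the compression stage is designed to exploit.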
Method¶
Overall Architecture¶
CompTrack operates on a bird's-eye-view (BEV) representation with a two-stage pipeline: Stage 1 (SFP) filters the background, Stage 2 (IB-DTC) compresses the foreground, and a prediction head outputs \((x,y,z,\theta)\).
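
As a rough illustration of Stage 1, the sketch below builds a CenterPoint-style 2D Gaussian target, uses it as a stand-in for the predicted importance heatmap, and applies it element-wise to a BEV feature map. The grid size, `sigma`, the target center, and all function names are hypothetical; in the actual method the heatmap comes from a lightweight CNN, which is omitted here.

```python
import numpy as np

def gaussian_heatmap(h, w, center, sigma=2.0):
    """2D Gaussian target on an h x w BEV grid, peaking at 1.0 on the center cell."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def sfp_gate(bev_feat, heatmap):
    """Scale a (C, H, W) BEV feature map element-wise by an (H, W) heatmap."""
    return bev_feat * heatmap[None]

def mse_loss(pred, target):
    """Mean squared error between a predicted and a target heatmap."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
bev = rng.standard_normal((8, 32, 32))            # hypothetical BEV features (C=8)
target = gaussian_heatmap(32, 32, center=(10, 20))

# Stand-in for the CNN output: here we simply reuse the target heatmap.
gated = sfp_gate(bev, target)
loss_if_perfect = mse_loss(target, target)
```

Cells near the target center pass through almost unchanged, while distant background cells are scaled toward zero.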
Key Designs¶
- Spatial Foreground Predictor (SFP)
  - Function: Filters spatial redundancy from an information-theoretic perspective
  - Design Motivation: When BEV occupancy probability \(p \ll 1\), empty pillars carry negligible information, making their removal theoretically lossless
  - Implementation: A lightweight CNN produces a spatial importance heatmap, applied element-wise to enhance foreground and suppress background
  - Supervision: CenterPoint-style 2D Gaussian heatmap + MSE loss
- IB-DTC Module
  - Function: Compresses redundant foreground \(\mathbf{X}_{fg} \in \mathbb{R}^{N \times C}\) into proxy tokens \(\mathbf{X}_{proxy} \in \mathbb{R}^{K \times C}\) (\(K \ll N\))
  - Mechanism: Uses optimal low-rank approximation (Eckart–Young theorem) as a tractable surrogate for the IB objective
  - Three-step implementation:
    - Online rank estimation: Fast non-backpropagated SVD; effective rank \(K\) determined by cumulative energy threshold \(\tau=0.99\) (average \(K \approx 78\))
    - SVD-guided dynamic queries: \(\mathbf{Q}_{act} = \mathbf{S}_K \mathbf{Q}_{learn} + \mathbf{Q}_{SVD}\) (residual learning)
    - Guided cross-attention: \(\mathbf{X}_{proxy} = \text{Softmax}\left(\frac{(\mathbf{Q}_{act} \mathbf{W}_q)(\mathbf{X}'_{fg} \mathbf{W}_k)^{T}}{\sqrt{C}}\right) \mathbf{X}'_{fg} \mathbf{W}_v\)
  - Training: Adaptive masking. Tensors are padded to a fixed maximum length \(L\), and only the first \(K\) positions contribute to the loss
- End-to-End Optimization
  - \(\mathbf{L}_{total} = \theta_1 \mathbf{L}_{pred} + \theta_2 \mathbf{L}_{track}\)
  - SVD is used solely to determine integer indices; gradients propagate through learned queries and cross-attention
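
The three IB-DTC steps and the padding scheme can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the paper's implementation: the energy is assumed to be measured on squared singular values, \(\mathbf{S}_K\) is assumed to select the first \(K\) of \(L\) learned queries, and \(\mathbf{Q}_{SVD}\) is assumed to be the top-\(K\) right singular vectors. All function names, shapes, and the random weights are hypothetical. NumPy carries no gradients; in an autodiff framework the SVD would be wrapped in a stop-gradient.

```python
import numpy as np

def effective_rank(s, tau=0.99):
    """Smallest K whose top-K squared singular values hold a tau fraction of the energy."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, tau)) + 1
    return min(k, s.size)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ib_dtc(X_fg, Q_learn, Wq, Wk, Wv, tau=0.99):
    """Compress (N, C) foreground tokens into K proxy tokens, padded to length L."""
    L, C = Q_learn.shape
    # Step 1: online rank estimation (used only to pick the integer K).
    _, s, Vt = np.linalg.svd(X_fg, full_matrices=False)
    K = min(effective_rank(s, tau), L)
    # Step 2: SVD-guided queries, with learned queries as a residual (assumed form).
    Q_act = Q_learn[:K] + Vt[:K]
    # Step 3: cross-attention from K queries to the N foreground tokens.
    attn = softmax((Q_act @ Wq) @ (X_fg @ Wk).T / np.sqrt(C))
    X_proxy = attn @ (X_fg @ Wv)
    # Adaptive masking: pad to L and mark the first K positions as valid.
    padded = np.zeros((L, C))
    padded[:K] = X_proxy
    mask = np.arange(L) < K
    return padded, mask, K

rng = np.random.default_rng(0)
N, C, L = 200, 32, 16
X_fg = rng.standard_normal((N, 5)) @ rng.standard_normal((5, C))  # rank-5 toy features
Q_learn = 0.02 * rng.standard_normal((L, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

padded, mask, K = ib_dtc(X_fg, Q_learn, Wq, Wk, Wv)
print(K, padded.shape, int(mask.sum()))
```

Padding to a fixed `L` keeps tensor shapes static across a batch even though `K` varies per sample, and a loss computed only where `mask` is true ignores the padded positions.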
Key Experimental Results¶
KITTI Comparison¶
| Method | Mean Success / Precision (%) | FLOPs | FPS |
|---|---|---|---|
| P2P (IJCV'25) | 71.7 / 89.4 | 1.23G | 65 |
| MBPTrack (ICCV'23) | 70.3 / 87.9 | 2.88G | 50 |
| CompTrack | 71.4 / 89.3 | 0.94G | 90 |
nuScenes Comparison¶
| Method | Mean Success / Precision (%) |
|---|---|
| P2P | 59.22 / 71.19 |
| MBPTrack | 57.48 / 69.88 |
| CompTrack | 61.04 / 73.68 |
Waymo Cross-Dataset Generalization¶
| Method | Mean Success / Precision (%) | Pedestrian Success / Precision (%) |
|---|---|---|
| P2P | 47.2 / 62.9 | 37.4 / 58.1 |
| CompTrack | 48.6 / 65.7 | 39.0 / 62.7 |
Ablation Study (nuScenes)¶
| Config | SFP | IB-DTC | Mean Success (%) | FPS |
|---|---|---|---|---|
| Baseline | ✗ | ✗ | 59.38 | 48 |
| +SFP | ✓ | ✗ | 60.01 | 55 |
| +IB-DTC | ✗ | ✓ | 59.95 | 75 |
| Full | ✓ | ✓ | 61.04 | 90 |
SVD-Guided Query Fusion¶
| Strategy | Success | Precision |
|---|---|---|
| Learned query only | 60.70 | 73.25 |
| SVD only | 60.15 | 72.50 |
| Additive fusion | 61.04 | 73.68 |
Key Findings¶
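- SFP and IB-DTC are complementary: each alone gives a smaller gain (60.01 and 59.95 mean Success at 55 and 75 FPS, respectively), while together they raise mean Success from 59.38 to 61.04 and FPS from 48 to 90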
- Online SVD introduces less than 1 ms latency
- Performance is stable across energy thresholds in the range 0.99–0.999
- Average effective rank \(K \approx 78\), confirming the low-rank nature of foreground features
- On KITTI, CompTrack uses 24% fewer FLOPs than P2P (0.94G vs. 1.23G) and runs 38% faster (90 vs. 65 FPS)
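
The relative figures in the last finding follow directly from the KITTI comparison table. A short check, using only the values reported there (the dictionary layout is just for illustration):

```python
# Values copied from the KITTI comparison table.
p2p = {"flops_g": 1.23, "fps": 65}
comptrack = {"flops_g": 0.94, "fps": 90}

# Relative FLOPs reduction and relative speed-up, both measured against P2P.
flops_reduction = (p2p["flops_g"] - comptrack["flops_g"]) / p2p["flops_g"]
speedup = (comptrack["fps"] - p2p["fps"]) / p2p["fps"]

print(f"FLOPs reduction: {flops_reduction:.1%}")  # 23.6%
print(f"Speed-up: {speedup:.1%}")                 # 38.5%
```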
Highlights & Insights¶
- Clear dual redundancy decomposition: The spatial + information redundancy framework is novel and self-consistent; the aperture problem analogy is intuitive
- Theoretical connection from IB to low-rank approximation: Compression is not arbitrary. The IB framework motivates truncated SVD, which is the optimal low-rank approximation by the Eckart–Young theorem
- Residual fusion of SVD prior and learned queries: Simple yet effective, outperforming more complex concatenation schemes
- Accuracy improves alongside efficiency: Redundancy removal not only accelerates inference but also reduces interference
Limitations & Future Work¶
- Performance remains limited in extremely sparse scenarios with partially visible targets
- Temporal information is not exploited
- Fusion with RGB data has not been explored
- Variable \(K\) across samples in a batch increases implementation complexity
- The impact of pillar encoder choice is not thoroughly investigated
Related Work & Insights¶
- The "online SVD rank estimation → dynamic compression" paradigm in IB-DTC is generalizable to other feature redundancy scenarios
- The design pattern of low-rank prior + learnable residual queries is broadly applicable
- Information-theoretic analysis of point cloud sparsity provides theoretical grounding for efficiency optimization in 3D perception
Rating¶
⭐⭐⭐⭐
- Novelty ⭐⭐⭐⭐⭐: The dual redundancy framework and IB-DTC design are highly innovative
- Experimental Thoroughness ⭐⭐⭐⭐⭐: Three benchmarks, 21 SOTA comparisons, multi-dimensional ablations
- Writing Quality ⭐⭐⭐⭐: Clear motivation and coherent theoretical derivation
- Value ⭐⭐⭐⭐: Win-win on efficiency and accuracy; 90 FPS meets autonomous driving requirements