Real-Time 3D Object Detection with Inference-Aligned Learning¶
Conference: AAAI 2026 arXiv: 2511.16140 Code: GitHub Area: 3D Vision Keywords: 3D object detection, point cloud, indoor scene, optimal transport, label assignment, self-distillation, real-time
TL;DR¶
This paper proposes SR3D, a framework that bridges the training-inference gap in indoor dense 3D object detection via two training-phase components: Spatial-Priority Optimal Transport Assignment (SPOTA) and Ranking-Aware Adaptive Self-distillation (RAS). SR3D achieves state-of-the-art performance among dense detectors on ScanNet V2 and SUN RGB-D at real-time speed (42 ms latency).
Background & Motivation¶
Indoor point cloud 3D object detection is critical for real-time applications such as augmented reality, robotics, and navigation. Existing detectors follow two paradigms:
- Sparse detectors (VoteNet, 3DETR, V-DETR, etc.): achieve high localization accuracy by refining a small set of high-quality proposals, but incur large memory overhead and high latency (typically >130ms), making them unsuitable for real-time use.
- Dense detectors (GSDN, FCAF3D, TR3D, etc.): perform single-pass prediction by densely tiling anchors in the spatial domain, offering fast inference (~42ms) but substantially lower accuracy than sparse methods.
The authors identify the root cause of accuracy limitations in dense detectors as the training-inference gap, manifested in two missing properties:
- Lack of spatial reliability: label assignment during training relies on fixed heuristics (e.g., center prior, IoU threshold), ignoring the actual spatial quality of anchors, which leads to misidentification of high-quality anchors in cluttered indoor scenes.
- Lack of ranking awareness: training applies uniform supervision to all positive samples regardless of their relative localization quality, whereas the AP metric used at inference is inherently ranking-sensitive, causing inconsistency between classification confidence and localization accuracy.
Case Study Validating the Bottleneck¶
The authors conduct an elegant oracle experiment: replacing predicted classification scores with ground-truth IoU scores boosts AP25 from 70.8 to 91.8 and AP50 from 55.6 to 87.7. This directly demonstrates that the lack of ranking awareness is the primary bottleneck, as the severe misalignment between classification confidence and localization quality substantially limits detection performance.
Method¶
Overall Architecture¶
SR3D adopts a classic dense detection architecture: sparse convolutional backbone (MinkResNet34) + FPN multi-scale feature fusion + dual-branch task head (classification + regression). The two core components, SPOTA and RAS, are used only during training, introducing zero additional overhead at inference and preserving real-time speed.
1. Spatial-Priority Optimal Transport Assignment (SPOTA)¶
Standard OTA formulates label assignment as an optimal transport problem, but direct application to 3D detection is problematic: (1) 3D detection relies more on geometric cues than semantic ones; (2) jointly optimizing classification and regression costs leads to multi-objective conflicts.
Three key designs in SPOTA:
Normalized Vertex Distance: IoU provides insufficient discriminability for predictions with similar overlap ratios but different geometric structures. SPOTA introduces the normalized vertex distance \(\mathcal{R}_{VD}\) to capture fine-grained alignment differences at bounding box vertices:
\[
\mathcal{R}_{VD} = \frac{1}{8}\sum_{i=1}^{8} \frac{d(\hat{\mathbf{v}}_i, \mathbf{v}_i)}{\rho(\hat{\mathbf{b}}, \mathbf{b})}
\]
where \(\hat{\mathbf{v}}_i\) and \(\mathbf{v}_i\) are corresponding vertices of the predicted and ground-truth boxes, \(d(\cdot)\) denotes Euclidean distance, and \(\rho(\hat{\mathbf{b}}, \mathbf{b})\) is the diagonal length of the minimum enclosing box. Unlike DIoU, which only considers center distance, vertex distance simultaneously captures scale and shape variation.
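As a concrete sketch, the normalized vertex distance for axis-aligned 3D boxes can be computed as below. The `(cx, cy, cz, w, l, h)` box parameterization and the choice of the minimum axis-aligned enclosing box are assumptions for illustration, not the paper's exact implementation:

```python
import math
from itertools import product

def vertices(box):
    # box = (cx, cy, cz, w, l, h): axis-aligned box as center + sizes
    cx, cy, cz, w, l, h = box
    return [(cx + sx * w / 2, cy + sy * l / 2, cz + sz * h / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]

def normalized_vertex_distance(pred, gt):
    vp, vg = vertices(pred), vertices(gt)
    # diagonal length of the minimum axis-aligned box enclosing both boxes
    pts = vp + vg
    lo = [min(p[i] for p in pts) for i in range(3)]
    hi = [max(p[i] for p in pts) for i in range(3)]
    diag = math.dist(lo, hi)
    # average vertex-to-vertex distance, normalized by the enclosing diagonal
    return sum(math.dist(a, b) for a, b in zip(vp, vg)) / (8 * diag)
```

Identical boxes yield a distance of 0; the normalization by the enclosing diagonal keeps the value scale-invariant and bounded.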
Spatial-Priority Strategy: The classification cost term is entirely removed; label assignment is driven solely by geometric cues. The rationale is that semantic information in 3D point clouds is inherently encoded in geometric structure (object shape, edges, layout), and retaining an explicit classification term introduces redundancy and biases the model toward semantic patterns rather than robust geometric alignment.
Center Prior Constraint: A Gaussian center prior \(\gamma_c = 1 - \exp(-\mu d^2(\mathbf{c}, \mathbf{c}^{gt}))\) is introduced to stabilize optimization in the early stages of training.
The final cost matrix combines the vertex-distance term \(\mathcal{R}_{VD}\) with the center prior \(\gamma_c\) for each anchor–ground-truth pair; no classification term is included.
For each ground truth, the top-\(k\) anchors with the lowest cost are selected as positives (default \(k=6\), corresponding to the six principal directions in 3D Euclidean space).
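The assignment step above can be sketched as follows. This simplifies the optimal-transport formulation to a greedy per-ground-truth top-\(k\) selection over a purely geometric cost (vertex distance plus Gaussian center prior, no classification term); the additive combination and the function names are illustrative assumptions:

```python
import math

def center_prior(anchor_c, gt_c, mu=1.0):
    # Gaussian center prior: gamma_c = 1 - exp(-mu * d(c, c_gt)^2)
    return 1.0 - math.exp(-mu * math.dist(anchor_c, gt_c) ** 2)

def spota_assign(anchor_centers, gt_centers, vd_cost, k=6):
    """Select, per ground truth, the k anchors with the lowest geometric cost.

    vd_cost[i][j]: normalized vertex distance between anchor i and gt j.
    Returns {gt_index: [anchor indices chosen as positives]}.
    """
    positives = {}
    for j, gt in enumerate(gt_centers):
        costs = [vd_cost[i][j] + center_prior(a, gt)
                 for i, a in enumerate(anchor_centers)]
        positives[j] = sorted(range(len(anchor_centers)),
                              key=costs.__getitem__)[:k]
    return positives
```

Because both cost terms are geometric, anchors near a ground-truth center with well-aligned boxes dominate the top-\(k\), which is the spatial-priority behavior the paper argues for.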
2. Ranking-Aware Adaptive Self-distillation (RAS)¶
RAS injects localization quality and ranking information into the classification branch via a self-distillation mechanism, comprising two sub-components:
Ranking-aware Distillation Loss (RDL): Soft targets are constructed from the localization quality \(q\) (IoU) and the IoU-based soft rank \(r^{reg}\) (higher values indicate better localization) produced by the model's own regression branch, and are used to supervise the classification confidence \(\sigma\). This formulation imposes heavier penalties on poorly localized samples, suppressing inconsistent predictions with high confidence but low localization quality.
Soft Ranking Algorithm: A differentiable soft-rank function \(R_i = \frac{1}{N}\sum_{j \neq i}\mathrm{sigmoid}\!\left(\frac{s_j - s_i}{\tau}\right)\) is used to compute continuous ranks, preserving pairwise distance information from the original scores and providing richer structural signals than hard ranking.
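A minimal implementation of this soft-rank function (pairwise sigmoid comparisons, temperature \(\tau = 0.1\) as in the paper's defaults):

```python
import math

def soft_rank(scores, tau=0.1):
    """Differentiable soft rank: R_i = (1/N) * sum_{j != i} sigmoid((s_j - s_i) / tau).

    Higher scores get lower rank values; as tau -> 0 this approaches
    the hard rank (normalized count of higher-scoring samples).
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    n = len(scores)
    return [sum(sigmoid((sj - si) / tau)
                for j, sj in enumerate(scores) if j != i) / n
            for i, si in enumerate(scores)]
```

Unlike hard ranking, near-tied scores produce near-equal soft ranks, so the gradient signal reflects how close the scores actually are.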
Adaptive Weighting Strategy: The contributions of the standard classification loss (Focal Loss) and the self-distillation loss are dynamically balanced according to the relative rank of each positive sample's classification score. High-ranked (high-confidence) positives receive more self-distillation supervision to correct overconfidence, while low-ranked positives are primarily supervised by standard Focal Loss to preserve basic learning capacity.
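The paper's exact weighting function is not reproduced here; a minimal per-sample sketch of the idea, assuming a hypothetical linear blend driven by the soft rank of the classification score:

```python
def blend_losses(focal, distill, ranks):
    """Rank-adaptive blend of per-sample Focal and self-distillation losses.

    ranks: soft classification-score ranks in [0, 1); lower = more confident.
    Confident positives (low rank) lean on the distillation term to correct
    overconfidence; unconfident ones lean on Focal Loss.
    """
    return [r * f + (1.0 - r) * d
            for f, d, r in zip(focal, distill, ranks)]
```

The linear blend is an illustrative assumption; any monotone weighting of the rank realizes the same qualitative behavior.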
3. Loss & Training¶
- Backbone: MinkResNet34 sparse convolution + generative sparse transposed convolution FPN
- Optimizer: AdamW, initial learning rate 1e-3, warmup 300 steps from 1e-5, weight decay 1e-4
- Training schedule: 13 epochs; learning rate decayed by 10× at epochs 8, 11, and 12
- Voxel size: 0.01m
- Data augmentation: random sampling of 66% of points, horizontal flipping, rotation ±5°, scaling [0.6, 1.4]
- Inference: NMS (IoU threshold 0.5, confidence threshold 0.01)
- Hardware: single RTX 4090
- Default hyperparameters: \(k=6\), \(\mu=1\), \(\beta=1\), \(\tau=0.1\)
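The learning-rate schedule above can be sketched as a small helper. The linear warmup shape is an assumption (the paper's settings as summarized here only give the warmup start value and step count):

```python
def lr_at(step, epoch, base=1e-3, warmup_steps=300, warmup_start=1e-5,
          decay_epochs=(8, 11, 12)):
    # Linear warmup from 1e-5 to 1e-3 over the first 300 steps (assumed linear),
    # then a 10x decay at each of epochs 8, 11, and 12.
    if step < warmup_steps:
        return warmup_start + (base - warmup_start) * step / warmup_steps
    return base * 0.1 ** sum(epoch >= e for e in decay_epochs)
```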
Key Experimental Results¶
Main Results¶
Comparison on ScanNet V2 and SUN RGB-D indoor 3D detection benchmarks (parentheses denote averages over 25 evaluations):
| Method | Type | ScanNet AP25 | ScanNet AP50 | Latency | SUN AP25 | SUN AP50 | Latency |
|---|---|---|---|---|---|---|---|
| VoteNet | Sparse | 58.6 | 33.5 | 71ms | 57.7 | - | 41ms |
| 3DETR | Sparse | 65.0 | 47.0 | 170ms | 59.1 | 32.7 | - |
| CAGroup3D | Sparse | 75.1 (74.5) | 61.3 (60.3) | 472ms | 66.8 (66.4) | 50.2 (49.5) | - |
| V-DETR | Sparse | 77.4 (76.8) | 65.0 (64.5) | 240ms | 67.5 (66.8) | 50.4 (49.7) | - |
| DEST | Sparse | 78.5 (78.3) | 66.6 (66.2) | 263ms | 68.4 (67.4) | 51.8 (50.9) | - |
| GSDN | Dense | 62.8 | 34.8 | 49ms | - | - | - |
| FCAF3D | Dense | 71.5 (70.7) | 57.3 (56.0) | 64ms | 64.2 (63.8) | 48.9 (48.2) | 56ms |
| TR3D | Dense | 72.9 (72.0) | 59.3 (57.4) | 42ms | 67.1 (66.3) | 50.4 (49.6) | 36ms |
| TR3D+DLLA | Dense | 73.8 (72.8) | 60.2 (58.9) | - | 67.3 (67.0) | 50.6 (50.5) | - |
| SR3D (Ours) | Dense | 74.0 (73.2) | 59.7 (58.5) | 42ms | 68.1 (67.2) | 50.9 (50.5) | 36ms |
SR3D sets a new state of the art among dense detectors on nearly all metrics, trailing TR3D+DLLA only on ScanNet AP50 (59.7 vs. 60.2). Compared to the TR3D baseline, it improves AP25 by 1.1/1.0 (ScanNet/SUN) with no increase in latency, and it matches DLLA's accuracy while DLLA incurs higher computational cost due to its additional auxiliary branches.
Ablation Study¶
| SPOTA | RAS | AP25 | AP50 | Latency |
|---|---|---|---|---|
| ✗ | ✗ | 70.8 | 55.6 | 42ms |
| ✓ | ✗ | 72.3 | 57.4 | 42ms |
| ✗ | ✓ | 72.5 | 57.7 | 42ms |
| ✓ | ✓ | 73.2 | 58.5 | 42ms |
Both components are individually effective and complementary. The full model achieves +2.4 AP25 / +2.9 AP50 over the baseline with no change in latency.
SPOTA Design Ablation¶
| Setting | AP25 | AP50 |
|---|---|---|
| SPOTA (full) | 73.2 | 58.5 |
| + classification cost | 72.5 (−0.7) | 56.9 (−1.6) |
| − vertex distance | 72.7 (−0.5) | 57.8 (−0.7) |
Adding the classification cost leads to a significant performance drop, validating the spatial-priority strategy. Removing the vertex distance also causes notable degradation, confirming the importance of fine-grained geometric cues.
RAS vs. Other Quality-Aware Losses¶
| Method | AP25 | AP50 |
|---|---|---|
| QFL (Quality Focal Loss) | 71.9 | 57.7 |
| VFL (Varifocal Loss) | 71.7 | 58.3 |
| RAS (Ours) | 73.2 | 58.5 |
RAS substantially outperforms QFL (+1.3 AP25) and VFL (+1.5 AP25). The authors attribute this to the generally low IoU values in 3D detection: directly using IoU as classification supervision targets creates optimization conflicts, whereas RAS distills ranking signals rather than using raw IoU as labels, yielding more stable training.
Training Cost Comparison¶
| Method | Training Time/Epoch | Parameters | AP25 | AP50 |
|---|---|---|---|---|
| TR3D | 12.3 min | 14.7M | 72.0 | 57.4 |
| SR3D | 12.6 min | 14.7M | 73.2 | 58.5 |
Parameter counts are identical, training time increases by less than 3%, and inference overhead is zero — a highly cost-effective improvement.
Highlights & Insights¶
- Precise problem formulation: The oracle experiment (replacing classification scores with GT IoU boosts AP by 20+) clearly quantifies the severity of the training-inference gap, providing strong motivation for the proposed methods.
- Spatial priority over semantic priority: Counterintuitively removing the classification cost entirely and relying solely on geometric cues for label assignment is logically well-grounded — in 3D point clouds, semantics are already encoded in geometry, making the classification term a redundant source of interference.
- Self-distillation with adaptive weighting: Significant gains are achieved purely through training strategy improvements without introducing any additional modules or parameters, embodying the design philosophy of inference-aligned learning.
- Experimental rigor: Each model is trained 5 times and evaluated 5 times (25 total evaluations), with comprehensive ablation studies covering all design choices, hyperparameters, and qualitative visualizations.
Limitations & Future Work¶
- Limited to indoor scenes: SPOTA and RAS are validated on indoor benchmarks (ScanNet V2, SUN RGB-D); transferability to large-scale outdoor scenes (e.g., nuScenes LiDAR data with extreme sparsity and diverse scale distributions) remains to be verified.
- No inference acceleration: SR3D's contributions are concentrated in training strategy; model quantization, knowledge distillation, and lightweight design for inference acceleration are not explored.
- Ceiling of the dense paradigm: Although SR3D substantially narrows the gap with sparse detectors (e.g., DEST), an absolute accuracy gap remains (AP25: 74.0 vs. 78.5), and it is unclear how close the inherent representational capacity of the dense paradigm is to its upper bound.
- Single-modal input: Only point cloud coordinates and colors are used; the potential of multi-modal fusion (e.g., incorporating RGB images or text) is not explored.
Related Work & Insights¶
- Indoor 3D detection: Sparse methods (VoteNet → CAGroup3D → V-DETR → DEST) lead in accuracy but are slow; dense methods (GSDN → FCAF3D → TR3D) are fast but accuracy is constrained by fixed label assignment strategies.
- Dynamic label assignment: FreeAnchor, OTA, SimOTA, AlignOTA, DLLA, and others improve label quality through dynamic matching but do not address the lack of ranking awareness.
- Self-knowledge distillation: Born-Again Networks, CS-KD, and related works leverage a model's own knowledge to guide training; SR3D's novelty lies in embedding ranking awareness into the distillation process.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4 | The training-inference alignment perspective is novel; spatial-priority OTA and ranking-aware self-distillation are distinctive designs. |
| Technical Depth | 4 | Solid theoretical grounding (OT framework, soft ranking); the oracle experiment is elegant. |
| Experimental Thoroughness | 5 | Two datasets, 25 repeated evaluations, comprehensive ablation, hyperparameter analysis, and qualitative visualization. |
| Value | 4 | Zero-inference-overhead training strategy; plug-and-play, generalizable to other dense detectors. |
| Writing Quality | 4 | Problem motivation is clear and figures are intuitive, though the dense notation requires frequent cross-referencing. |
| Overall | 4.2 | A high-quality contribution with precise problem definition, elegant method design, and rigorous experimentation. |