Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding
Conference: CVPR2025
arXiv: 2603.12514
Code: github.com/shivasmic/3d-trauma-detection-ssl
Area: Medical Imaging
Keywords: 3D object detection, Self-Supervised Learning, Semi-Supervised Learning, Trauma Detection, VDETR
TL;DR
A label-efficient framework is proposed in two stages: first, a 3D U-Net encoder is pre-trained with self-supervised Masked Image Modeling (MIM) on 1,206 unlabeled CT scans; then V-DETR with Vertex RPE is trained with Mean Teacher semi-supervised learning. Using only 144 labeled cases, it reaches a test-set mAP of 45.30% for 3D abdominal trauma detection (+97% over the supervised-only baseline on the test set, +115% on the validation set).
Background & Motivation
Clinical Urgency of Abdominal Trauma Detection: Manual slice-by-slice analysis of CT scans in emergency departments is time-consuming and prone to inter-observer variability. Automated detection can significantly accelerate clinical decision-making.
Extreme Scarcity of Annotation: In the RSNA abdominal trauma dataset, only 206 out of 4,711 cases (4.4%) have segmentation annotations, making traditional fully supervised methods infeasible.
Specific Challenges in 3D Detection: Abdominal organs exhibit highly irregular shapes, and 2-D center distances are insufficient to describe 3D bounding box relationships. Moreover, general-purpose 3D feature extractors transfer poorly.
Complementary Potential of Self-Supervised + Semi-Supervised Learning: Self-supervised pre-training learns anatomical priors from unlabeled data, while semi-supervised learning stabilizes training by utilizing large amounts of unlabeled data.
Locality Advantage of VDETR: The Vertex RPE in V-DETR provides explicit geometric interior-exterior relationships by calculating the relative position encoding of 8 vertices.
Limitations of Prior Work: Directly adapting 2D detectors to 3D performs poorly, while models such as VoteNet and FCAF3D rely heavily on extensive annotations.
Method
Overall Architecture
The pipeline has three steps: (1) a 3D U-Net is pre-trained with patch-based MIM on all 1,206 CT scans; (2) the V-DETR detector is trained in two phases (frozen encoder, then unfrozen encoder); (3) Mean Teacher semi-supervised learning adds 2,000 unlabeled cases.
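The masking part of step (1) can be sketched as follows. This is only an illustration, not the authors' code: it assumes the 128³ patch is tiled into sub-blocks of 8 voxels per edge (one reading of the "8³ sub-blocks" and 75% ratio given under Key Designs), and the function names and fixed seed are hypothetical.

```python
import random


def sample_block_mask(patch_edge=128, block_edge=8, mask_ratio=0.75, seed=0):
    """Pick which sub-blocks of a cubic patch are hidden from the encoder.

    Assumption: the patch is tiled into cubes of `block_edge` voxels per side.
    Returns (set of masked (i, j, k) block indices, total number of blocks).
    """
    blocks_per_axis = patch_edge // block_edge
    all_blocks = [
        (i, j, k)
        for i in range(blocks_per_axis)
        for j in range(blocks_per_axis)
        for k in range(blocks_per_axis)
    ]
    n_masked = round(mask_ratio * len(all_blocks))
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return set(rng.sample(all_blocks, n_masked)), len(all_blocks)


def voxel_is_masked(voxel, masked_blocks, block_edge=8):
    """True if the voxel (x, y, z) falls inside a masked sub-block."""
    x, y, z = voxel
    return (x // block_edge, y // block_edge, z // block_edge) in masked_blocks


masked_blocks, total_blocks = sample_block_mask()
print(total_blocks, len(masked_blocks))
```

In a real pipeline the reconstruction loss would be computed only over voxels for which `voxel_is_masked` is true, which matches the description of reconstructing the masked regions.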
Key Designs
- Masked Image Modeling: Divides a 128³ patch into 8³ sub-blocks, randomly masks 75% of them, and then reconstructs the raw intensity values of the masked regions using the 3D U-Net.
- 3D Vertex RPE: For each query and voxel position, it calculates the offset vectors \(\Delta\mathbf{P}_i\) from the position to the 8 vertices of the predicted box and turns them into an attention bias via MLPs: \(\mathbf{A} = \text{softmax}(\mathbf{QK}^T + \mathbf{R})\), where \(\mathbf{R} = \sum_{i=1}^{8}\text{MLP}_i(F(\Delta\mathbf{P}_i))\).
- Two-phase Training: Phase I (epochs 0-20) freezes the encoder and trains only the V-DETR decoder. Phase II (epochs 20-100) unfreezes the encoder for joint fine-tuning, with the encoder learning rate set to 1/10 of the decoder's (1e-5 vs 1e-4) and a 3-epoch warmup to limit catastrophic forgetting of the pre-trained features.
- Mean Teacher Semi-Supervised Learning: The teacher receives weakly augmented inputs (Gaussian noise σ=0.01, intensity shift ±2%) and produces pseudo-labels, while the student receives strongly augmented inputs (σ=0.05, shift ±10%, elastic deformation). The consistency loss between the two sets of predictions combines center MSE, size MSE, and classification KL divergence.
- Classification Branch: Frozen encoder \(\rightarrow\) GAP \(\rightarrow\) 2-layer FC (256\(\rightarrow\)128\(\rightarrow\)7) with only 33,799 trainable parameters. A weighted BCE loss where \(w_i^{\text{pos}} = N_i^{\text{neg}} / N_i^{\text{pos}}\) (e.g., bowel injury weight 4.45) is used to handle extreme class imbalance.
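To make the geometric input of the Vertex RPE concrete, the sketch below computes the eight offset vectors \(\Delta\mathbf{P}_i\) for one position and one box. It is a simplified illustration under stated assumptions: the box is axis-aligned and given as center plus size, the helper names are hypothetical, and the non-linear transform \(F\) and the per-vertex MLPs that produce the bias \(\mathbf{R}\) are deliberately left out.

```python
from itertools import product


def box_vertices(center, size):
    """Return the 8 corners of an axis-aligned box (assumption: no rotation)."""
    return [
        tuple(c + sign * s / 2.0 for c, sign, s in zip(center, signs, size))
        for signs in product((-1, 1), repeat=3)
    ]


def vertex_offsets(position, center, size):
    """Offsets from one position to each of the 8 box vertices (the Delta P_i)."""
    return [
        tuple(p - v for p, v in zip(position, vertex))
        for vertex in box_vertices(center, size)
    ]


def is_inside(position, center, size):
    """A position is inside the box when, on every axis, the offsets to the
    vertices take both signs (or zero). This is the interior/exterior cue
    that the offsets expose to the attention bias."""
    offsets = vertex_offsets(position, center, size)
    return all(
        min(o[axis] for o in offsets) <= 0 <= max(o[axis] for o in offsets)
        for axis in range(3)
    )


offsets = vertex_offsets((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
print(len(offsets), is_inside((3.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 2.0, 2.0)))
```

The sign pattern of the offsets is what distinguishes interior from exterior positions, which a single center distance cannot do for elongated or irregular boxes.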
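Loss & Training
The training objective adds the consistency loss to the supervised detection loss with a time-dependent weight, i.e. a form such as \(\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda(t)\,\mathcal{L}_{\text{cons}}\), where \(\lambda(t)\) increases linearly from 0 to 0.3 over epochs 20-60. The classification task uses a weighted BCE loss to handle class imbalance.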
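The ramp-up of \(\lambda(t)\) can be written down directly from the numbers above. The sketch assumes the usual Mean Teacher form in which the weighted consistency term is added to the supervised loss, and that the weight stays at 0 before epoch 20 and at 0.3 after epoch 60; the function names are hypothetical.

```python
def consistency_weight(epoch, start=20, end=60, max_weight=0.3):
    """Linear ramp of the consistency weight lambda(t).

    Assumption: 0 up to `start`, linear growth until `end`, then constant.
    """
    if epoch <= start:
        return 0.0
    if epoch >= end:
        return max_weight
    return max_weight * (epoch - start) / (end - start)


def total_loss(supervised_loss, consistency_loss, epoch):
    """Supervised loss plus the weighted consistency term (assumed form)."""
    return supervised_loss + consistency_weight(epoch) * consistency_loss


for epoch in (0, 20, 40, 60, 100):
    print(epoch, consistency_weight(epoch))
```

Keeping the weight at zero during the frozen-encoder phase means early, unreliable teacher predictions do not influence the student.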
Key Experimental Results
3D Detection Validation Set Ablation
| Method | Best Epoch | mAP (looser IoU threshold) | mAP (stricter IoU threshold) |
|---|---|---|---|
| VDETR (without SSL) | 5 | 26.36% | 6.82% |
| VDETR + SSL | 99 | 56.57% | 45.12% |
| Gain | — | +115% | +562% |
3D Detection Test Set
| Metric | Without SSL | With SSL | Gain |
|---|---|---|---|
| mAP (looser IoU threshold) | 23.03% | 45.30% | +97% |
| mAP (stricter IoU threshold) | 16.67% | 28.72% | +72% |
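The "Gain" entries in both detection tables are relative improvements over the run without SSL, not absolute percentage-point differences. A short check, using only the values reported in the tables (the helper name is hypothetical):

```python
def relative_gain(baseline, improved):
    """Relative improvement in percent: 100 * (improved - baseline) / baseline."""
    return 100.0 * (improved - baseline) / baseline


# (without SSL, with SSL) pairs copied from the validation and test tables
validation_pairs = [(26.36, 56.57), (6.82, 45.12)]
test_pairs = [(23.03, 45.30), (16.67, 28.72)]

for baseline, improved in validation_pairs + test_pairs:
    print(f"{baseline:.2f}% -> {improved:.2f}%: +{relative_gain(baseline, improved):.0f}%")
```

For example, the test-set jump from 23.03% to 45.30% is about 22 percentage points in absolute terms but roughly a doubling in relative terms.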
Classification Task (Frozen Encoder Linear Probe)
| Category | Test Accuracy |
|---|---|
| Bowel Injury | 97.5% |
| High-grade Liver Injury | 98.3% |
| Overall Average | 94.07% |
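The size of the classification head and the positive-class weighting described under Key Designs can be reproduced with a few lines. This is a sketch under assumptions: both layers are taken to be fully connected with bias terms, the loss is written for a single probability rather than a batch, and the case counts passed to `pos_weight` are hypothetical numbers chosen only to give a weight of 4.45.

```python
import math


def linear_params(n_in, n_out):
    """Parameters of one fully connected layer with bias."""
    return n_in * n_out + n_out


def head_params(dims=(256, 128, 7)):
    """Total trainable parameters of the FC head 256 -> 128 -> 7."""
    return sum(linear_params(a, b) for a, b in zip(dims, dims[1:]))


def pos_weight(n_neg, n_pos):
    """Positive-class weight w_pos = N_neg / N_pos for one label."""
    return n_neg / n_pos


def weighted_bce(prob, target, w_pos, eps=1e-7):
    """Weighted binary cross-entropy for one predicted probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(w_pos * target * math.log(p) + (1 - target) * math.log(1.0 - p))


print(head_params())                    # 256*128 + 128 + 128*7 + 7
w = pos_weight(n_neg=445, n_pos=100)    # hypothetical counts
print(w, weighted_bce(0.5, 1, w), weighted_bce(0.5, 0, w))
```

With this weighting, a missed positive costs about 4.45 times as much as a false alarm at the same confidence, which is the intended counterweight to the class imbalance.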
Key Findings
- Without semi-supervised learning, the detection performance peaks at epoch 5 and then collapses (overfitting); adding semi-supervised learning stabilizes training and ensures convergence.
- Features learned from self-supervised pre-training are of extremely high quality: the frozen encoder linear probe achieves 94.07% accuracy at epoch 0 with no further improvement during subsequent training.
- Surprisingly, semi-supervised learning does not benefit the classification task (75.4% vs baseline 77.7%), likely due to pseudo-label noise.
Highlights & Insights
- Extreme Data Efficiency Validation: Achieved a usable 3D detector with only 144 labeled and 2,000 unlabeled cases.
- Synergistic Effect of Self-Supervision \(\rightarrow\) Semi-Supervision: Pre-training provides a stable feature representation, while semi-supervised learning provides implicit regularization; their combination avoids training collapse.
- 3D Medical Adaptation of Vertex RPE: Successfully adapted the SOTA method from 3D detection in natural scenes to the medical domain.
- Complete System Design: Open-sourced the entire pipeline including data preprocessing \(\rightarrow\) self-supervision \(\rightarrow\) detection + classification.
- Rigorous Self-Supervised Quality Validation: The quality of the pre-trained features is checked in two ways, via a reconstruction PSNR of 19.39 dB and a linear probe accuracy of 76%.
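For readers who want to relate the reported PSNR to a reconstruction error, the standard definition is \(\text{PSNR} = 10\log_{10}(\text{MAX}^2/\text{MSE})\). The sketch below applies it to flat lists of intensities; it assumes intensities normalized to a data range of 1.0, which the summary does not state, and the function names are hypothetical.

```python
import math


def psnr(original, reconstructed, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally long sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(data_range ** 2 / mse)


def rmse_from_psnr(psnr_db, data_range=1.0):
    """Invert the PSNR definition to get the root-mean-square error."""
    return data_range * 10.0 ** (-psnr_db / 20.0)


print(psnr([0.0, 0.0], [0.1, 0.1]))   # uniform error of 0.1 on a unit range
print(rmse_from_psnr(19.39))          # RMSE implied by the reported PSNR
```

Under the unit-range assumption, 19.39 dB corresponds to an RMSE of roughly 0.107 of the intensity range.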
Limitations & Future Work
- The evaluation set for the detector contains only 32 cases, limiting statistical significance.
- The AUC for classification is only 51.4% (due to probability calibration issues); although it does not affect binary predictions, it indicates that the model's confidence is unreliable.
- Validation is limited to the RSNA abdominal trauma dataset; generalization to other 3D medical detection tasks has not been tested.
- Compared to the RSNA 2023 competition winner (98% AUC, multi-model ensemble), a significant gap remains for this single-model approach.
- High computational overhead: full 3D voxel processing requires an A100 GPU, with a batch size of only 1-2.
Related Work & Insights
- 3D Detection: Voting-based and fully convolutional detectors such as VoteNet and FCAF3D rely heavily on dense annotations; transformer-based detectors such as 3DETR and Group-Free 3D operate directly in 3D but lack a locality bias.
- V-DETR: Solves the locality problem in 3D through an 8-vertex RPE; this work is the first to apply it to 3D medical detection.
- Self-Supervised Learning: MAE is effective in natural images; Eckstein et al. validated the value of pre-training for 3D medical detection.
- Semi-Supervised Detection: Mean Teacher and Unbiased Teacher are effective in 2D detection; this work extends them to 3D medical scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ (Individual components are existing methods, but their integration is logical and highly systematic.)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Ablation studies are convincing, but the test set is too small; lacks comparison with other 3D detectors.)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, but a bit too long with redundant details.)
- Value: ⭐⭐⭐⭐⭐ (3D medical detection under label scarcity is a real pain point, and the framework holds significant reference value.)