Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding¶
Conference: CVPR 2026 arXiv: 2603.12514 Code: GitHub Area: Medical Imaging / 3D Trauma Detection Keywords: Self-supervised learning, semi-supervised learning, Masked Image Modeling, 3D object detection, VDETR, Vertex Relative Position Encoding, abdominal CT, trauma detection
TL;DR¶
This paper proposes a two-stage label-efficient framework: a patch-based MIM self-supervised pretraining of a 3D U-Net encoder on 1,206 unlabeled CT volumes, followed by VDETR with 3D vertex relative position encoding for 3D lesion detection, augmented by Mean Teacher semi-supervised consistency regularization over 2,000 additional unlabeled volumes. Using only 144 annotated samples, the framework achieves 56.57% val mAP@0.50, a 115% improvement over fully supervised training.
Background & Motivation¶
Urgent clinical need for abdominal CT trauma detection: Emergency settings require rapid and accurate detection of internal injuries, yet manual analysis of 3D medical volumes is time-consuming and subject to inter-observer variability.
Severe scarcity of annotated data: Among 4,711 sequences in the RSNA Abdominal Trauma dataset, only 206 (4.4%) carry segmentation annotations, rendering conventional fully supervised methods impractical.
Loss of 3D spatial relationships in 2D slice-wise analysis: Treating CT volumes as independent 2D slices fails to capture the complex volumetric spatial structures present in the data.
Inadequacy of centroid-based positional metrics for irregular organs: Standard DETR-style position encodings measure the distance from a single box centroid to each voxel, providing insufficient geometric description for irregularly shaped organs and lesion regions.
Poor transfer of features pretrained on natural domains: 3D feature extractors pretrained on natural images or videos transfer poorly to medical imaging data characterized by Hounsfield Unit values and distinctive intensity distributions.
Underexplored integration of SSL, semi-supervised learning, and Transformer-based detection in 3D medical imaging: A systematic combination of these three paradigms remains an open gap in the literature.
Method¶
Overall Architecture¶
Input: raw DICOM CT sequences → preprocessing and standardization to \(512\times336\times336\) voxels (anisotropic spacing \(2.0\times1.0\times1.0\) mm) → Stage 1: patch-based MIM self-supervised pretraining of 3D U-Net encoder → Stage 2: frozen/unfrozen encoder + VDETR decoder for 3D detection + Mean Teacher semi-supervised training → Output: 3D bounding boxes + classification labels.
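The resample-then-standardize step can be sketched as below. This is an illustrative assumption, not the authors' exact implementation: `preprocess_volume`, its trilinear interpolation order, and the center-crop/zero-pad policy are choices made here for concreteness.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(vol, spacing, target_spacing=(2.0, 1.0, 1.0),
                      target_shape=(512, 336, 336)):
    """Resample a CT volume to anisotropic target spacing (mm), then
    center-crop / zero-pad each axis to a fixed voxel grid."""
    # Scale factor per axis = current spacing / target spacing.
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    vol = zoom(vol, factors, order=1)  # linear interpolation

    # Center-crop or zero-pad each axis to the target shape.
    out = np.zeros(target_shape, dtype=vol.dtype)
    src_sl, dst_sl = [], []
    for v, t in zip(vol.shape, target_shape):
        if v >= t:  # crop around the center
            start = (v - t) // 2
            src_sl.append(slice(start, start + t))
            dst_sl.append(slice(0, t))
        else:       # pad symmetrically with zeros
            start = (t - v) // 2
            src_sl.append(slice(0, v))
            dst_sl.append(slice(start, start + v))
    out[tuple(dst_sl)] = vol[tuple(src_sl)]
    return out
```

In practice the raw DICOM spacing varies per scanner, so the zoom factors differ per volume while the output grid stays fixed at \(512\times336\times336\).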
Key Design 1: Patch-based Masked Image Modeling for Self-Supervised Pretraining¶
- Function: Extracts \(128^3\) patches from 1,206 CT volumes (206 annotated + 1,000 unannotated), subdivides each patch into \(8^3\) sub-blocks, randomly masks 75% of sub-blocks, and trains a 3D U-Net to reconstruct the masked regions.
- Mechanism: Following the MAE paradigm, the reconstruction objective forces the encoder to learn meaningful anatomical structure patterns and spatial relationships without any manual annotation.
- Design Motivation: Medical data annotation is prohibitively expensive (only 4.4% labeled), whereas unlabeled data is abundant. Patch-level operations substantially reduce computational cost (\(128^3\) vs. \(512\times336\times336\)), while multi-patch sampling ensures adequate anatomical coverage. After 50 epochs of training, the encoder weights are frozen and serve as a fixed feature extraction backbone for downstream tasks.
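The block-wise masking step can be sketched as follows, assuming an \(8\times8\times8\) grid of sub-blocks per patch (so each sub-block covers \(16^3\) voxels of a \(128^3\) patch); the function name and zero-fill convention are illustrative, not taken from the paper's code.

```python
import torch

def random_block_mask(patch, grid=8, mask_ratio=0.75, generator=None):
    """Mask a random 75% of the grid x grid x grid sub-blocks of a cubic patch.

    patch: (B, 1, D, D, D) with D divisible by `grid`.
    Returns the masked patch and the boolean voxel-level mask.
    """
    B, _, D, _, _ = patch.shape
    sub = D // grid                        # sub-block edge length in voxels
    n_blocks = grid ** 3
    n_masked = int(n_blocks * mask_ratio)

    # Per-sample random permutation of sub-block indices (MAE-style).
    noise = torch.rand(B, n_blocks, generator=generator)
    ids = noise.argsort(dim=1)
    block_mask = torch.zeros(B, n_blocks, dtype=torch.bool)
    block_mask.scatter_(1, ids[:, :n_masked], True)

    # Expand the block mask to voxel resolution and zero out masked voxels.
    m = block_mask.view(B, 1, grid, grid, grid)
    voxel_mask = (m.repeat_interleave(sub, 2)
                   .repeat_interleave(sub, 3)
                   .repeat_interleave(sub, 4))
    masked = patch.masked_fill(voxel_mask, 0.0)
    return masked, voxel_mask

# MAE-style reconstruction objective, computed on masked voxels only:
# loss = ((unet(masked) - patch)[voxel_mask] ** 2).mean()
```

Restricting the loss to masked voxels (as in MAE) keeps the encoder from trivially copying visible intensities.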
Key Design 2: VDETR + 3D Vertex Relative Position Encoding¶
- Function: The pretrained encoder outputs \(32\times21\times21\times256\) feature maps; 4,096 tokens are sampled and fed into the VDETR decoder, which computes the geometric relationship between each voxel and the 8 vertices of the predicted bounding box via 3D RPE.
- Mechanism: For each query \(q\), the offset vectors from every sampled voxel position to the 8 vertices of the predicted box are computed as \(\Delta\mathbf{P}_i \in \mathbb{R}^{K \times N \times 3}\), \(i=1,\dots,8\). Each offset is passed through a nonlinear MLP \(\phi\), and the per-vertex outputs are summed into a positional bias \(\mathbf{R} = \sum_{i=1}^{8}\phi(\Delta\mathbf{P}_i)\) that is added to the standard attention logits: \(\mathbf{A} = \text{softmax}\!\left(\mathbf{Q}\mathbf{K}^\top/\sqrt{d} + \mathbf{R}\right)\).
- Design Motivation: Medical organ and lesion shapes are highly irregular; a single centroid distance cannot determine whether a voxel lies inside, outside, or on the boundary of a target. The 8-corner encoding provides complete geometric containment/exclusion information, enabling the model to learn correct locality inductive biases even from limited training data.
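A minimal sketch of the 8-corner bias, assuming boxes parameterized as axis-aligned (center, size) and a small per-corner MLP shared across vertices; the class name `VertexRPE` and hidden width are made up here for illustration.

```python
import torch
import torch.nn as nn

class VertexRPE(nn.Module):
    """Attention bias from offsets to the 8 corners of each predicted box.

    Sketch: each (token -> corner) offset is embedded by a tiny MLP and the
    8 embeddings are summed into one scalar bias per (query, token) pair.
    """
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, boxes, points):
        # boxes: (K, 6) as (cx, cy, cz, dx, dy, dz); points: (N, 3) voxel coords.
        centers, sizes = boxes[:, :3], boxes[:, 3:]
        # Sign pattern of the 8 corners of a unit cube.
        signs = torch.tensor([[sx, sy, sz] for sx in (-1, 1)
                              for sy in (-1, 1) for sz in (-1, 1)],
                             dtype=boxes.dtype)                      # (8, 3)
        corners = centers[:, None] + 0.5 * sizes[:, None] * signs    # (K, 8, 3)
        # Offsets from every token to every corner: (K, N, 8, 3).
        delta = corners[:, None] - points[None, :, None]
        # Sum per-corner embeddings into the bias R added before softmax.
        bias = self.mlp(delta).squeeze(-1).sum(-1)                   # (K, N)
        return bias
```

Because the signs of the 8 offsets jointly determine whether a point is inside, outside, or on a face of the box, the summed embedding can express containment in a way a single centroid distance cannot.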
Key Design 3: Two-Stage Training + Mean Teacher Semi-Supervised Learning¶
- Function: Phase I (epochs 0–20) freezes the encoder and trains only the decoder; Phase II (epochs 20–100) unfreezes the encoder for joint fine-tuning (learning rate 10× lower than the decoder), while introducing Mean Teacher semi-supervised training over 2,000 additional unlabeled volumes.
- Mechanism: The Teacher model generates pseudo-labels using weak augmentation (Gaussian noise \(\sigma=0.01\), intensity shift \(\pm2\%\)); the Student model is trained with strong augmentation (\(\sigma=0.05\), shift \(\pm10\%\), blur, elastic deformation), and a consistency loss enforces prediction agreement between the two.
- Design Motivation: Phase I prevents randomly initialized decoder gradients from corrupting pretrained features. The differential learning rates in Phase II (encoder \(1\times10^{-5}\) vs. decoder \(1\times10^{-4}\)) mitigate catastrophic forgetting. Semi-supervised training is activated only at epoch 20 (with \(\lambda\) linearly increasing from 0 to 0.3) to avoid training collapse caused by low-quality pseudo-labels when the decoder has not yet converged.
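The two Mean Teacher ingredients described above, the EMA weight update and the linear \(\lambda\) ramp over epochs 20-60, can be sketched as follows (the momentum value 0.999 is a common default, not stated in the source):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average of student weights into the teacher."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1 - momentum)

def consistency_weight(epoch, start=20, end=60, max_weight=0.3):
    """Linear ramp of the consistency weight lambda over epochs 20-60."""
    if epoch < start:
        return 0.0
    if epoch >= end:
        return max_weight
    return max_weight * (epoch - start) / (end - start)
```

Keeping \(\lambda=0\) before epoch 20 means the teacher's pseudo-labels carry no gradient weight until the decoder has had a full frozen-encoder phase to stabilize.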
Key Design 4: Multi-Label Injury Classification (Downstream Task II)¶
- Function: The frozen encoder's bottleneck features (\(32\times21\times21\times256\)) are passed through global average pooling → two FC layers (\(256\rightarrow128\rightarrow7\)) → 7 independent binary classifiers.
- Mechanism: Linear probe evaluation — only the 33,799-parameter classification head is trained (vs. the encoder's 5.6M parameters), directly assessing the discriminative capacity of the self-supervised representations.
- Design Motivation: Severe class imbalance (e.g., bowel injury has only 18% positive rate) is addressed by a weighted BCE loss \(w_i^{pos} = N_i^{neg}/N_i^{pos}\), imposing heavier penalties on false negatives for rare classes.
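The linear-probe head is small enough to reconstruct from the numbers given: global average pooling over the \(256\)-channel bottleneck, then \(256\rightarrow128\rightarrow7\) FC layers, whose parameter count works out to exactly the 33,799 quoted above. The channel-first tensor layout and `make_loss` helper are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class InjuryHead(nn.Module):
    """Linear-probe head: GAP over bottleneck features -> 256->128->7 logits.
    Parameters: 256*128+128 + 128*7+7 = 33,799, matching the paper's count."""
    def __init__(self, in_ch=256, hidden=128, n_labels=7):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_labels))

    def forward(self, feats):
        # feats: (B, C, D, H, W), channel-first layout assumed here.
        pooled = feats.mean(dim=(2, 3, 4))  # global average pooling
        return self.fc(pooled)

def make_loss(pos_counts, neg_counts):
    """Weighted BCE with per-label pos_weight_i = N_i^neg / N_i^pos,
    e.g. roughly 4.45 for bowel injury given its ~18% positive rate."""
    pos_weight = (torch.tensor(neg_counts, dtype=torch.float)
                  / torch.tensor(pos_counts, dtype=torch.float))
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

Because only the head is trained, each of the 7 sigmoid outputs acts as an independent binary classifier over the frozen representation.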
Loss & Training¶
Total detection loss: \(\mathcal{L}_{det} = \mathcal{L}_{sup} + \lambda(t)\,\mathcal{L}_{cons}\), where \(\mathcal{L}_{sup}\) is the supervised detection loss on the labeled volumes and \(\mathcal{L}_{cons}\) is the Mean Teacher consistency loss.
The consistency loss comprises three components: center MSE, size MSE, and classification KL divergence (temperature \(T=2.0\)); \(\lambda(t)\) increases linearly from 0 to 0.3 over epochs 20–60.
Classification loss: Weighted Binary Cross-Entropy \(\mathcal{L}_{cls} = \frac{1}{7}\sum_{i=1}^{7}\mathcal{L}_{BCE}^i\), with positive sample weights such as \(w_{bowel\ injury}^{pos}=4.45\).
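The three-branch consistency term can be sketched as below, assuming prediction dicts with center, size, and class-logit tensors; the exact weighting between the three branches is not given in the source, so they are summed unweighted here.

```python
import torch
import torch.nn.functional as F

def consistency_loss(student, teacher, T=2.0):
    """Three-branch student-teacher consistency: MSE on box centers,
    MSE on box sizes, and temperature-softened KL on class logits.
    `student`/`teacher`: dicts with 'center' (..,3), 'size' (..,3),
    'logits' (..,C). Teacher outputs are detached (no gradient)."""
    l_center = F.mse_loss(student["center"], teacher["center"].detach())
    l_size = F.mse_loss(student["size"], teacher["size"].detach())
    l_cls = F.kl_div(F.log_softmax(student["logits"] / T, dim=-1),
                     F.softmax(teacher["logits"].detach() / T, dim=-1),
                     reduction="batchmean") * (T * T)
    return l_center + l_size + l_cls
```

The \(T^2\) factor is the standard distillation correction that keeps the KL gradient magnitude comparable across temperatures.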
Key Experimental Results¶
Table 1: Detection Performance Comparison (Validation Set)¶
| Metric | VDETR (w/o semi-sup.) | VDETR + SSL | Gain |
|---|---|---|---|
| Best Epoch | 5 | 99 | — |
| mAP@0.10 | 27.27% | 56.57% | +107% |
| mAP@0.25 | 27.27% | 56.57% | +107% |
| mAP@0.50 | 26.36% | 56.57% | +115% |
| mAP@0.75 | 6.82% | 45.12% | +562% |
Key Findings: Without semi-supervised learning, the model peaks at epoch 5 and then collapses catastrophically (dropping to ~8%), demonstrating that 144 annotated samples alone are entirely insufficient for stable training. The addition of semi-supervised learning yields stable convergence.
Table 2: Detection Performance Comparison (Test Set, 32 volumes)¶
| Metric | VDETR (w/o semi-sup.) | VDETR + SSL | Gain |
|---|---|---|---|
| mAP@0.10 | 23.03% | 45.30% | +97% |
| mAP@0.25 | 23.03% | 45.30% | +97% |
| mAP@0.50 | 23.03% | 45.30% | +97% |
| mAP@0.75 | 16.67% | 28.72% | +72% |
Table 3: Classification Ablation Study¶
| Method | Encoder | Test Acc | Test AUC |
|---|---|---|---|
| Fine-tune + augmentation (144 samples) | Unfrozen | 77.7% | 57.7% |
| Fine-tune + augmentation + SSL (144 samples) | Unfrozen | 75.4% | 57.3% |
| Fine-tune + augmentation + Focal Loss | Unfrozen | 75.9% | 56.0% |
| Linear probe (2,244 samples) | Frozen | 94.07% | 51.4% |
Key Findings: Semi-supervised learning degrades classification performance (pseudo-label noise); expanding the labeled set from 144 to 2,244 samples combined with a frozen encoder linear probe achieves 94.07%, confirming that high-quality labels outweigh pseudo-labels.
Table 4: Per-Class Classification Performance on Test Set (482 volumes)¶
| Injury Category | Test Acc | Test AUC |
|---|---|---|
| Bowel healthy | 97.5% | 0.577 |
| Bowel injury | 97.5% | 0.584 |
| Liver healthy | 87.6% | 0.500 |
| Liver high-grade | 98.3% | 0.429 |
| Kidney high-grade | 96.1% | 0.470 |
| Spleen healthy | 87.1% | 0.518 |
| Extravasation | 94.4% | 0.521 |
| Overall | 94.07% | 0.514 |
Highlights & Insights¶
- Systematic integration of self-supervised and semi-supervised learning: The two-stage design is conceptually clean — MIM pretraining establishes a strong feature foundation, while Mean Teacher semi-supervised training addresses label scarcity in the detection phase. This pipeline is directly reusable for other medical detection scenarios with scarce annotations.
- Stability improvement from semi-supervised training is the most salient contribution: The transition from catastrophic collapse at epoch 5 to stable convergence over 100 epochs, with a 562% gain in mAP@0.75, demonstrates that consistency regularization stabilizes training well beyond its raw performance gains.
- Medical adaptation of 3D RPE: Introducing V-DETR's 8-corner position encoding into 3D medical image detection offers a fundamentally more expressive geometric description of irregular organ shapes compared to centroid-based distances.
- Linear probe achieves 94.07% from epoch 0: The self-supervised features are immediately transferable and discriminative without any encoder fine-tuning.
- Code is publicly available and the complete pipeline is fully reproducible.
Limitations & Future Work¶
- Absolute detection performance leaves room for improvement: A test mAP@0.50 of 45.30% remains far from clinical deployment readiness, and mAP@0.75 of only 28.72% indicates insufficient localization precision.
- Classification AUC is very low (51.4%): Despite high accuracy (94.07%), the model exhibits severe probability miscalibration, with sigmoid confidence scores poorly aligned with true probabilities. The authors attribute this to calibration issues but do not address it in the paper.
- Limited data scale: Only 206 annotated and 1,000 unlabeled volumes are used for pretraining, which is modest by contemporary large-scale pretraining standards.
- Semi-supervised training is ineffective or detrimental for classification: The gain from expanding labeled data from 144 to 2,244 samples (+16.37%) far exceeds the effect of semi-supervised learning (which actually decreases performance by 2.3%), indicating poor generalization of the semi-supervised strategy to the classification task.
- Evaluation on a single dataset (RSNA): Cross-dataset and cross-domain generalization are not assessed.
- No direct comparison with other 3D medical detection methods (e.g., nnDetection): Head-to-head comparisons with domain-specific state-of-the-art methods are absent.
- Future directions: the framework is extensible to multi-organ detection, CT-MRI cross-modal transfer, and larger-scale pretraining data.
Related Work & Insights¶
- vs. MAE (He 2022): MAE targets 2D natural images; this work extends the patch-based MIM paradigm to 3D medical volumes and demonstrates that reconstruction objectives remain effective in the CT domain (PSNR 19.39 dB, linear probe 76%).
- vs. V-DETR (2024): V-DETR achieves state-of-the-art results on the indoor scene dataset ScanNetV2; this work is the first to introduce 3D RPE into medical image detection. The core contribution lies not in RPE itself but in its systematic integration with self-supervised and semi-supervised learning.
- vs. Eckstein et al. (2024) on 3D medical object detection pretraining: That work demonstrates the importance of pretraining for 3D medical detection; this paper further integrates semi-supervised learning into the pipeline.
- vs. Mean Teacher (Tarvainen 2017): The classic semi-supervised framework is adapted from 2D image classification to 3D volumetric detection, with the addition of three-branch consistency losses over center, size, and classification predictions.
- vs. RSNA 2023 competition winning solution: The competition winner achieved 98% AUC via a two-stage pipeline with model ensembling; this work achieves 94.07% accuracy with a single model and frozen encoder, using 29% less data at substantially lower complexity.
Rating¶
- Novelty: ⭐⭐⭐ — Individual components (MIM, V-DETR, Mean Teacher) are not novel; the contribution lies in their systematic integration and the design of a complete pipeline for label-scarce scenarios.
- Experimental Thoroughness: ⭐⭐⭐ — Ablation studies cover SSL, semi-supervised learning, classification, and detection, but cross-dataset validation and direct comparisons with domain-specific SOTA are missing; the test set is small (32 volumes).
- Writing Quality: ⭐⭐⭐⭐ — The paper is well-structured, the design motivations for the two-stage training strategy are clearly articulated, and mathematical derivations are complete.
- Value: ⭐⭐⭐ — The integration paradigm of self-supervised and semi-supervised learning under label-scarce conditions is transferable; the application of 3D RPE to medical detection provides useful reference.
- Overall Value: TBD