SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation¶
Conference: CVPR 2026 arXiv: 2603.11616 Code: N/A Area: Medical Image Segmentation Keywords: Semi-supervised Learning, Multi-Source Data, Tooth Segmentation, CBCT, Pseudo Labels
TL;DR¶
This paper proposes SemiTooth, a multi-teacher multi-student semi-supervised framework coupled with a Stricter Weighted-Confidence (SWC) constraint. It effectively leverages unlabeled data from multiple sources for multi-source CBCT tooth segmentation and achieves cross-source generalization.
Background & Motivation¶
CBCT (Cone Beam Computed Tomography) tooth segmentation is a core task in clinical dental diagnosis, yet it faces two major challenges:
- Scarcity of annotated data: Voxel-level annotation is extremely costly, leaving large volumes of de-identified CBCT data unutilized.
- Multi-source data heterogeneity: CBCT data acquired from different institutions and devices exhibit significant distribution discrepancies (in density distribution, grayscale intensity, and feature clustering), making unified model training difficult.
Existing semi-supervised medical image segmentation methods (e.g., Mean Teacher, Co-training) are predominantly designed for single-source data and cannot effectively handle distribution shifts in multi-source settings. Furthermore, publicly available multi-source CBCT tooth segmentation datasets are extremely scarce, limiting method validation.
Method¶
Overall Architecture¶
SemiTooth is a multi-branch semi-supervised framework adopting a three-student two-teacher architecture:
- Multi-source data are reorganized into three subsets: main (labeled primary source), other (unlabeled data from other sources), and mixed (an unlabeled subset with distribution similar to the primary source, selected via Wasserstein distance measurement).
- Three student networks are each responsible for learning from one subset.
- Two teacher networks supervise the students on the mixed and other subsets respectively, providing stable pseudo labels through EMA updates.
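The Wasserstein-based subset partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it compares 1-D voxel-intensity distributions with `scipy.stats.wasserstein_distance`, and the helper name `partition_unlabeled` and the distance threshold are hypothetical choices.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def partition_unlabeled(labeled_vols, unlabeled_vols, threshold=5.0):
    """Split unlabeled volumes into a 'mixed' subset (distribution close to
    the labeled source) and an 'other' subset (larger discrepancy), using the
    1-D Wasserstein distance between voxel-intensity samples.

    `threshold` is a hypothetical cutoff; the paper does not report one.
    """
    # Pool all labeled-source intensities into a single reference sample.
    ref = np.concatenate([v.ravel() for v in labeled_vols])
    mixed, other = [], []
    for vol in unlabeled_vols:
        d = wasserstein_distance(ref, vol.ravel())
        (mixed if d < threshold else other).append(vol)
    return mixed, other
```

In practice one would compare feature-level or histogram statistics rather than raw voxels, but the grouping logic is the same: small distance to the labeled source means the volume joins the mixed bridge subset.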
Key Designs¶
- Multi-source subset partitioning strategy:
  - Inter-source Wasserstein distances divide the unlabeled data into a mixed group (distribution close to the labeled source) and an other group (larger distribution discrepancy).
  - The mixed subset serves as a bridge across sources, improving cross-source training robustness.
- Multi-teacher multi-student architecture:
  - Three student networks share similar architectures but learn independently, promoting effective knowledge transfer while maintaining diversity.
  - Teachers are updated via EMA: \(\theta_t^{(k)} \leftarrow \gamma \theta_t^{(k-1)} + (1-\gamma) \theta_s^{(k)}\), with decay rate \(\gamma = 0.99\).
  - Compared to Mean Teacher (single teacher, single student) and Co-training (no teacher, multiple students), SemiTooth combines the stability of teacher supervision with the cross-source capability of multi-student collaboration.
- Stricter Weighted-Confidence (SWC) constraint:
  - Each sample is uniformly partitioned into non-overlapping cubic regions \(\{r\}\).
  - Region-level confidence is the mean of the maximum class probability over all voxels in a region: \(c(r) = \mathbb{E}_{i \in r}[\max_c P_{i,c}^T]\).
  - Low-confidence regions (\(c(r) < \tau\)) are flagged as unreliable and ignored.
  - Within reliable regions, the voxel-level confidence \(c_i = \max_c P_{i,c}^T\) further weights the teacher–student alignment.
  - This dual-layer design balances structural reliability at the region level with fine-grained accuracy at the voxel level.
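The dual-layer SWC weighting can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: volumes whose sides divide evenly by the region size, a teacher softmax output of shape `(C, D, H, W)`, and a hypothetical helper name `swc_weights`; the paper's exact region size is not specified here.

```python
import numpy as np


def swc_weights(teacher_probs, region=4, tau=0.9):
    """Compute per-voxel SWC weights from teacher softmax output.

    teacher_probs: array (C, D, H, W), softmax over classes (axis 0).
    Returns weights: 0 inside low-confidence regions (c(r) < tau),
    otherwise the voxel-level confidence max_c P^T_{i,c}.
    Assumes D, H, W are divisible by `region`.
    """
    conf = teacher_probs.max(axis=0)               # voxel confidence c_i
    D, H, W = conf.shape
    r = region
    # Region-level confidence c(r): mean voxel confidence per r^3 cube.
    blocks = conf.reshape(D // r, r, H // r, r, W // r, r)
    region_conf = blocks.mean(axis=(1, 3, 5))
    # Gate: broadcast the reliable-region mask back to voxel resolution.
    gate = np.repeat(np.repeat(np.repeat(
        region_conf >= tau, r, axis=0), r, axis=1), r, axis=2)
    return conf * gate
```

Multiplying these weights into a voxel-wise teacher–student consistency term reproduces the two-level behavior: unreliable regions contribute nothing, and within reliable regions more confident voxels count more.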
Loss & Training¶
The total loss consists of three components:
- Supervised loss (labeled data from the primary source): \(\mathcal{L}_{sup} = CE(P^S(x^l), y)\)
- SWC consistency losses (unlabeled data from the other and mixed subsets): region-gated, confidence-weighted teacher–student consistency terms \(\mathcal{L}_{cons}^u\) and \(\mathcal{L}_{cons}^h\)
- Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{sup} + \alpha \mathcal{L}_{cons}^u + \beta \mathcal{L}_{cons}^h\)
where \(\alpha = \beta = 0.5\) balance contributions from different sources, and the confidence threshold is set to \(\tau = 0.9\).
Training details: V-Net is used as the backbone, with the Adam optimizer, learning rate \(10^{-4}\), batch size 4, trained for 300 epochs on 4 NVIDIA A4500 GPUs.
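The EMA teacher update \(\theta_t \leftarrow \gamma \theta_t + (1-\gamma)\theta_s\) used during training can be sketched framework-agnostically; here parameters are represented as name-to-array dictionaries rather than framework modules, which is a simplification of any real training loop.

```python
import numpy as np


def ema_update(teacher_params, student_params, gamma=0.99):
    """EMA teacher update: theta_t <- gamma * theta_t + (1 - gamma) * theta_s,
    applied parameter-by-parameter.

    Both arguments are dicts mapping parameter names to ndarrays.
    """
    for name, s in student_params.items():
        teacher_params[name] = gamma * teacher_params[name] + (1.0 - gamma) * s
    return teacher_params
```

With \(\gamma = 0.99\), the teacher integrates student weights slowly, which is what makes its pseudo labels more stable than the student's raw predictions.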
Key Experimental Results¶
Dataset: MS3Toothset¶
A self-constructed multi-source semi-supervised tooth segmentation dataset comprising 98 annotated samples (20 for testing) and 438 unlabeled samples from three sources (ShanghaiTech, PKU-SS, and AFMC). The sources exhibit significant differences in density, grayscale intensity, and feature distributions.
Main Results¶
| Method | Venue | Year | mIoU | Dice | Recall | Acc |
|---|---|---|---|---|---|---|
| V-Net | IEEE 3DV | 2016 | 61.36 | 73.65 | 70.77 | 66.75 |
| Mean Teacher | NeurIPS | 2017 | 67.69 | 78.72 | 78.06 | 73.68 |
| UA-MT | MICCAI | 2019 | 68.37 | 79.18 | 80.42 | 76.17 |
| ASDA | IEEE TIP | 2022 | 73.75 | 83.63 | 80.93 | 78.79 |
| MLRPL | MIA | 2024 | 72.86 | 83.29 | 79.75 | 77.39 |
| CMT | ACM MM | 2024 | 76.14 | 85.07 | 87.14 | 84.32 |
| Uni-HSSL | CVPR | 2025 | 75.76 | 85.42 | 84.26 | 81.88 |
| SemiTooth | - | 2025 | 76.67 | 85.69 | 88.66 | 86.44 |
Ablation Study¶
| Exp | V-Net | MT | ST | SWC | mIoU | Dice | Recall | Acc |
|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | | | | 61.36 | 73.65 | 70.77 | 66.75 |
| 2 | ✓ | ✓ | | | 67.69 | 78.72 | 78.06 | 73.68 |
| 3 | ✓ | ✓ | | ✓ | 69.94 | 80.29 | 79.67 | 75.34 |
| 4 | ✓ | ✓ | ✓ | | 75.37 | 84.56 | 83.07 | 80.48 |
| 5 | ✓ | ✓ | ✓ | ✓ | 76.67 | 85.69 | 88.66 | 86.44 |
Key Findings¶
- The SemiTooth multi-branch architecture (Exp4 vs. Exp2) contributes the largest performance gain (+7.68 mIoU), demonstrating that the multi-teacher multi-student structure is critical for cross-source learning.
- The SWC constraint yields +2.25 mIoU over the Mean Teacher baseline (Exp3 vs. Exp2) and +1.30 mIoU on top of SemiTooth (Exp5 vs. Exp4).
- t-SNE visualizations show that after training with SemiTooth, feature distributions across different sources become more clustered, validating the improvement in cross-source generalization.
Highlights & Insights¶
- Clear problem formulation: This work is the first to systematically address multi-source semi-supervised CBCT tooth segmentation and constructs a dedicated benchmark dataset.
- Well-motivated architecture: The asymmetric three-student two-teacher design is more effective than a naive multi-student framework, as the teachers provide stable pseudo-label training signals.
- Elegant SWC constraint: The dual-layer filtering strategy—region-level gating combined with voxel-level weighting—is better suited to the spatial structure of 3D CBCT than simple confidence thresholding.
- Wasserstein-based subset partitioning: Using distributional distance to bridge labeled and unlabeled sources is a practical and effective strategy.
Limitations & Future Work¶
- Limited dataset scale: Only 98 annotated and 438 unlabeled samples; validation on larger-scale data is needed.
- Restricted number of sources: Only three sources are considered; scalability to a larger number of sources has not been verified.
- Single backbone: Only V-Net is evaluated; more advanced 3D segmentation networks (e.g., nnU-Net, SwinUNETR) may yield further improvements.
- Out-of-distribution generalization unverified: Model performance on completely unseen new-source data remains unknown.
- Fixed cubic region size in SWC: An adaptive region partitioning scheme may be preferable.
Related Work & Insights¶
- The Mean Teacher paradigm is the foundational framework for semi-supervised medical image segmentation (SSMIS); SemiTooth extends it via a multi-branch design.
- CMT (ACM MM 2024) and Uni-HSSL (CVPR 2025) are recent multi-source semi-supervised methods, both of which SemiTooth outperforms.
- Cross-domain medical segmentation provides inspiration: Wasserstein distance for source distribution measurement and multi-teacher structures for cross-domain knowledge transfer.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐ |
| Theoretical Depth | ⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐ |
| Overall | ⭐⭐⭐ |