Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation

Conference: ECCV 2024
arXiv: 2311.17325
Code: https://github.com/zhenzhao/AD-MT
Area: Medical Image Segmentation / Semi-supervised Learning
Keywords: Semi-supervised Segmentation, Mean Teacher, Confirmation Bias, Pseudo Label, Conflict-Combating

TL;DR

Proposes AD-MT (Alternate Diverse Mean Teacher), which tackles confirmation bias in semi-supervised medical image segmentation with two components: random periodic alternate updating of two teacher models and an entropy-based conflict-combating strategy. It outperforms the previous SOTA (BCP) in five of the six reported settings across the ACDC, LA, and Pancreas datasets.

Background & Motivation

Background: Mainstream methods in semi-supervised medical image segmentation (SSMIS) are based on consistency regularization, generating pseudo-labels for unlabeled data via a teacher-student framework. The core challenge is confirmation bias: a single model inevitably generates noisy pseudo-labels and then reinforces its own errors by training on them.

Limitations of Prior Work:
- Single-Teacher (Mean Teacher): Employs only a single perspective, lacking a correction mechanism for pseudo-label noise.
- Multi-Student Co-training (e.g., MC-Net+): Introduces additional training parameters, and differences arising solely from different initializations or learning rates are insufficient.
- Multi-Teacher Ensemble (e.g., PS-MT): The teacher updating strategies are not carefully designed, leading to insufficient diversity; furthermore, simple averaging discards conflicting information between teachers.

Key Challenge: Diverse teacher supervision can alleviate confirmation bias, but it is difficult to make the teachers sufficiently different without adding training cost. In addition, conflicting predictions between teachers are typically discarded, which wastes information.

Goal: (a) How to make the two teacher models sufficiently different? (b) How to leverage (rather than discard) conflicting predictions between teachers?

Key Insight: Intervene at the teacher-updating stage. Diversity comes from three sources at once: complementary data batches, different augmentation strategies, and random switching periods. When the teachers conflict, compare the entropy of the teacher ensemble with that of the student and keep the more confident prediction.

Core Idea: Alternately update two teachers (maximizing diversity via complementary data, different augmentations, and random periods) and utilize an entropy-based conflict-combating module to learn from both consistent and conflicting predictions.

Method

Overall Architecture

The framework consists of one trainable student model and two non-trainable teacher models updated by EMA. Only one teacher is updated per iteration, and the two teachers take turns. Both teachers generate pseudo-labels for the unlabeled data; these are fused by the Conflict-Combating Module and used to supervise the student's predictions on strongly augmented data. The total loss is \(\mathcal{L} = \mathcal{L}_x + \lambda_t \mathcal{L}_u\), where \(\mathcal{L}_x\) is the supervised loss on labeled data, \(\mathcal{L}_u\) is the unsupervised loss on unlabeled data, and \(\lambda_t\) weights the unsupervised term.
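
The update rule and loss combination above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: model weights are represented as flat lists of floats, and the names `ema_update`, `total_loss`, and `active` are hypothetical. The decay of 0.99 is the EMA parameter reported in the training settings.

```python
def ema_update(teacher, student, alpha=0.99):
    """Move teacher weights toward the student: t <- alpha * t + (1 - alpha) * s."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def total_loss(loss_x, loss_u, lambda_t):
    """Combine supervised and unsupervised terms: L = L_x + lambda_t * L_u."""
    return loss_x + lambda_t * loss_u


# Toy weights (assumption: flat lists stand in for real network parameters).
student = [1.0, 2.0]
teachers = [[0.0, 0.0], [0.0, 0.0]]

# Only the currently active teacher is updated in a given iteration.
active = 0
teachers[active] = ema_update(teachers[active], student)
```

After this single step only `teachers[0]` has moved toward the student; `teachers[1]` is untouched until the active index switches.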

Key Designs

Random Periodic Alternate (RPA) Updating Module:
- Function: To ensure the two teacher models are as distinct as possible.
- Triple Diversity Strategy:
  - Complementary Data Batches: Only one teacher is updated at a time, making the unlabeled data seen by the two teachers during a training period fully complementary.
  - Different Augmentation Strategies: T1 uses color-jittering while T2 uses copy-paste augmentation—distinctly different in strength and nature.
  - Random Switching Periods: Instead of fixed alternation intervals, a new period is randomly generated from \([0, \mathcal{T}_{max}]\) at each switch.
- Design Motivation: Introducing divergence across three dimensions simultaneously—data, augmentation, and update pacing—to ensure teachers generate truly distinct "perspectives" in the feature space.
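
The alternation described in this list can be sketched as a schedule that says which teacher is updated at each iteration. This is a hedged illustration, not the authors' implementation: the function name `rpa_schedule`, the augmentation labels, and the choice to draw each period as a whole number of iterations between 1 and `max_period` (so that no period is empty) are assumptions made for the example.

```python
import random

# Hypothetical labels for the augmentation tied to each teacher (T1, T2).
AUGMENTATION = {0: "color_jitter", 1: "copy_paste"}


def rpa_schedule(num_iters, max_period, seed=0):
    """Return a list with the index (0 or 1) of the teacher updated at each iteration."""
    rng = random.Random(seed)
    schedule = []
    active = 0
    while len(schedule) < num_iters:
        # A fresh random period is drawn at every switch.
        # Assumption: periods are integers in [1, max_period].
        period = rng.randint(1, max_period)
        schedule.extend([active] * period)
        active = 1 - active  # hand over to the other teacher
    return schedule[:num_iters]


schedule = rpa_schedule(num_iters=20, max_period=5)
for step, teacher_idx in enumerate(schedule[:3]):
    print(step, "update teacher", teacher_idx + 1, "with", AUGMENTATION[teacher_idx])
```

Because each iteration's unlabeled batch updates exactly one teacher, the batches that shape the two teachers never overlap. With a maximum period of 0.5 epoch, `max_period` would correspond to half the number of iterations per epoch.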
Conflict-Combating Module (CCM):
- Function: To handle pixels where the two teachers' predictions do not match—not by discarding, but by leveraging them.
- Mechanism: Pixel-wise processing:
  - When teachers agree: Use an entropy-based weighted ensemble \(\psi_i = \frac{w_1 q_i^{t_1} + w_2 q_i^{t_2}}{w_1 + w_2}\), where \(w_k = e^{-H_{t_k}}\) (low entropy = high weight).
  - When teachers conflict: Compare the entropy of the ensemble prediction \(H_{\psi_i}\) with that of the student prediction \(H_{q_i^s}\), selecting the one with lower entropy (higher confidence) as supervision.
- Design Motivation: In the late stages of training, the student might become more accurate than the teachers in certain regions (as the student processes all data); selecting the student during conflicts avoids penalizing the student with incorrect teacher predictions.
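
A per-pixel sketch of the rule above, in plain Python. It is an illustration under assumptions rather than the authors' code: the names `entropy`, `argmax`, and `ccm_pixel` are hypothetical, "agreement" is taken to mean that both teachers predict the same class, and ties in entropy are resolved in favour of the teacher ensemble.

```python
import math


def entropy(p):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def argmax(p):
    return max(range(len(p)), key=p.__getitem__)


def ccm_pixel(q_t1, q_t2, q_s):
    """Return (supervision probabilities, source) for one pixel."""
    # Entropy-based weights: lower entropy gives a higher weight.
    w1 = math.exp(-entropy(q_t1))
    w2 = math.exp(-entropy(q_t2))
    psi = [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(q_t1, q_t2)]

    if argmax(q_t1) == argmax(q_t2):
        return psi, "teachers"  # teachers agree: use the weighted ensemble
    # Teachers conflict: keep whichever of ensemble and student is more confident.
    if entropy(psi) <= entropy(q_s):
        return psi, "teachers"
    return q_s, "student"


agree = ccm_pixel([0.9, 0.1], [0.8, 0.2], [0.6, 0.4])
student_wins = ccm_pixel([0.6, 0.4], [0.4, 0.6], [0.95, 0.05])
teachers_win = ccm_pixel([0.99, 0.01], [0.45, 0.55], [0.5, 0.5])
```

In the second call the two teachers cancel out to a near-uniform ensemble, so the confident student prediction is kept; in the third the student is uniform, so the ensemble is kept even though the teachers disagree.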

Loss & Training

Supervised Loss: Average of Dice + CE.
Unsupervised Loss: Also Dice + CE, but using the pseudo-labels fused via CCM.
Confidence Threshold \(\tau\): 0.95 for 2D datasets and 0.75 for 3D datasets (a high threshold in 3D would filter out too much information).
Other settings: EMA decay of 0.99 and a maximum switching period of \(\mathcal{T}_{max} = 0.5\) epoch.
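
A minimal sketch of a combined Dice + CE loss for a binary (foreground/background) case, in plain Python. The reading of "average" as `0.5 * (Dice + CE)`, the smoothing constants, and the function names are assumptions for illustration; the same form would apply to the unsupervised term with CCM pseudo-labels as targets.

```python
import math


def dice_loss(probs, targets, eps=1e-5):
    """Soft Dice loss over a flat list of foreground probabilities."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def ce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def seg_loss(probs, targets):
    """Assumed combination: the mean of the Dice and CE terms."""
    return 0.5 * (dice_loss(probs, targets) + ce_loss(probs, targets))


targets = [1, 0, 1]
good = seg_loss([1.0, 0.0, 1.0], targets)   # near-perfect prediction
bad = seg_loss([0.5, 0.5, 0.5], targets)    # uninformative prediction
```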

Key Experimental Results

Main Results

Dataset	Labeled Ratio	Metric	AD-MT (%)	BCP, prev SOTA (%)	Gain (points)
LA (3D)	5% (4 cases)	Dice	89.63	88.02	+1.61
LA (3D)	10% (8 cases)	Dice	90.55	89.62	+0.93
ACDC (2D)	5% (3 cases)	Dice	88.75	87.59	+1.16
ACDC (2D)	10% (7 cases)	Dice	89.46	88.84	+0.62
Pancreas (3D)	10% (6 cases)	Dice	80.21	73.83	+6.38
Pancreas (3D)	20% (12 cases)	Dice	82.61	82.91	-0.30

Ablation Study

Configuration	ACDC Dice	ACDC 95HD	Description
T1 only	86.83	2.65	Single-teacher baseline
T2 only	86.22	2.43	Copy-paste teacher alone: slightly lower Dice but slightly better 95HD than T1 only
T1+T2+RPA (no CCM)	87.88	2.03	Alternate updating adds 1.05 Dice points over T1 only
T1+T2+RPA+CCM (Full)	88.75	1.48	CCM adds a further 0.87 Dice points
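
The increments quoted for RPA and CCM follow directly from the Dice column of the ablation table. A small check in plain Python (variable names are illustrative):

```python
# ACDC Dice values copied from the ablation table.
dice = {"T1 only": 86.83, "T1+T2+RPA": 87.88, "Full": 88.75}

# Gain from adding the second teacher with RPA, then from adding CCM.
rpa_gain = round(dice["T1+T2+RPA"] - dice["T1 only"], 2)
ccm_gain = round(dice["Full"] - dice["T1+T2+RPA"], 2)

print("RPA gain:", rpa_gain, "CCM gain:", ccm_gain)
```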

Key Findings

Most significant improvement under Pancreas 10% (+6.38 Dice points): The advantage of AD-MT is most pronounced in low-data scenarios, indicating that diverse supervision from the two teachers carries the highest value when labeled data is extremely scarce.
RPA module contributes the most (+1.05 Dice points): Moving from a single teacher to two teachers with RPA is the largest source of improvement in the ablation.
CCM is more valuable in later training stages: Conflicts between teachers increase in later stages of training, where the student model has already acquired substantial capability and can provide valuable alternative supervision.
Threshold \(\tau\) sensitivity differs between 2D and 3D datasets: A high threshold (0.95) works well for 2D, while a lower threshold (0.75) is better for 3D.
Outperforms BCP in five of six settings without pre-training: BCP requires an additional pre-training phase, whereas AD-MT trains end-to-end, which is simpler.

Highlights & Insights

Addressing confirmation bias from the perspective of teacher updating strategies: Rather than relying on architectural differences or extra loss functions, the method creates differences between teachers across three dimensions: data partition, augmentation strategy, and update frequency. The approach is simple yet effective.
Leveraging rather than discarding conflicts: Standard approaches typically average or discard conflicting teacher predictions. In contrast, CCM compares entropies and keeps the more confident prediction, an idea worth borrowing for other multi-model ensemble methods.
Random periods are key: Fixed alternating periods might lead to a fixed data distribution observed by each of the two teachers; randomization breaks this rigid pattern.

Limitations & Future Work

Evaluated only on U-Net/V-Net backbones: Validation has not been performed on stronger backbones (e.g., nnU-Net, Swin UNETR).
Slightly lower performance than BCP on the Pancreas 20% setting: The comparative advantage diminishes under higher labeled ratios, indicating that diverse supervision experiences diminishing returns as labeled data increases.
The augmentation strategies for the two teachers are manually designed: The choice of color-jittering vs. copy-paste was not automatically searched, and other augmentation combinations might perform better.
Risks associated with CCM's student substitution strategy: In the early stages of training, the student is quite weak, and choosing the student's prediction during conflicts may introduce more noise—suggesting the need for a warm-up mechanism.

vs. PS-MT: PS-MT also utilizes multiple teachers but updates them at different epochs. This paper implements alternation within the same epoch, combined with complementary data and distinct augmentations, enabling more thorough differentiation.
vs. MC-Net+: MC-Net+ uses co-training with two students, introducing additional parameters. AD-MT only trains a single student (with teachers updated via EMA), incurring no additional training costs.
vs. BCP: BCP requires a pre-training schedule to establish a reliable initialization, while AD-MT runs end-to-end without pre-training and still gains 6.38 Dice points on the Pancreas 10% setting.

Rating

Novelty: ⭐⭐⭐⭐ RPA and CCM designs are innovative, and the triple-differentiation concept of the alternate updating strategy is clever.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 3 datasets, 2D/3D setups, detailed ablation studies, threshold sensitivity, and class-level analysis.
Writing Quality: ⭐⭐⭐⭐ Diagrams are clear, and the contrastive framework against existing methods is well-depicted.
Value: ⭐⭐⭐⭐ Simple yet highly effective, presenting direct value to semi-supervised medical image segmentation.