Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation
Conference: ECCV 2024
arXiv: 2311.17325
Code: https://github.com/zhenzhao/AD-MT
Area: Medical Image Segmentation / Semi-supervised Learning
Keywords: Semi-supervised Segmentation, Mean Teacher, Confirmation Bias, Pseudo Label, Conflict-Combating
TL;DR
Proposes AD-MT (Alternate Diverse Mean Teacher), which addresses the confirmation bias problem in semi-supervised medical image segmentation through random periodic alternate updating of two teacher models and an entropy-based conflict-combating strategy. It outperforms prior SOTA methods in most settings on the ACDC, LA, and Pancreas datasets, trailing BCP only slightly on Pancreas with 20% labels.
Background & Motivation
Background: Mainstream methods in semi-supervised medical image segmentation (SSMIS) are based on consistency regularization, generating pseudo-labels for unlabeled data via a teacher-student framework. The core challenge is confirmation bias: a single model inevitably generates noisy pseudo-labels and then reinforces its own errors by training on them.
Limitations of Prior Work:
- Single-Teacher (Mean Teacher): Employs only a single perspective, lacking a correction mechanism for pseudo-label noise.
- Multi-Student Co-training (e.g., MC-Net+): Introduces additional training parameters, and differences arising solely from different initializations or learning rates are insufficient.
- Multi-Teacher Ensemble (e.g., PS-MT): The teacher updating strategies are not carefully designed, leading to insufficient diversity; furthermore, simple averaging discards conflicting information between teachers.

Key Challenge: Diverse teacher supervision can alleviate confirmation bias, but making the teachers sufficiently different without extra training cost is difficult. In addition, conflicting predictions between teachers are typically discarded, which wastes information.
Goal: (a) How to make the two teacher models sufficiently different? (b) How to leverage (rather than discard) conflicting predictions between teachers?
Key Insight: Intervene at the teacher-updating level. Diversity comes from three complementary sources (complementary data batches, different augmentation strategies, and random switching periods); when the teachers conflict, compare the entropy of the teacher ensemble with that of the student and keep the more confident prediction.
Core Idea: Alternately update two teachers (maximizing diversity via complementary data, different augmentations, and random periods) and utilize an entropy-based conflict-combating module to learn from both consistent and conflicting predictions.
Method
Overall Architecture
One trainable student model + two non-trainable teacher models (EMA updated). Only one teacher is updated per iteration, with the two teachers updating alternately. Both teachers simultaneously generate pseudo-labels for unlabeled data, which are fused via the Conflict-Combating Module to supervise the student's predictions on strongly augmented data. The total loss is formulated as \(\mathcal{L} = \mathcal{L}_x + \lambda_t \mathcal{L}_u\).
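
The "one teacher per iteration" EMA update described above can be illustrated with a minimal sketch. The toy models here are plain dictionaries of scalar weights, and `ema_update`, `student`, `teachers`, and `active` are hypothetical names chosen for illustration; only the EMA coefficient of 0.99 comes from the summary.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return EMA-updated teacher weights: alpha * teacher + (1 - alpha) * student."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}

# Toy stand-ins for model parameters (hypothetical; real models hold tensors).
student = {"w": 1.0, "b": 2.0}
teachers = [{"w": 0.0, "b": 0.0}, {"w": 0.0, "b": 0.0}]

active = 0  # index of the teacher that is updated in this iteration
teachers[active] = ema_update(teachers[active], student)
# The other teacher keeps its weights until the schedule switches to it.
```

Because only the active teacher moves toward the student, the two teachers drift apart over time even though both track the same student.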
Key Designs
- Random Periodic Alternate (RPA) Updating Module:
  - Function: To ensure the two teacher models are as distinct as possible.
  - Triple Diversity Strategy:
    - Complementary Data Batches: Only one teacher is updated at a time, making the unlabeled data seen by the two teachers during a training period fully complementary.
    - Different Augmentation Strategies: T1 uses color-jittering while T2 uses copy-paste augmentation, which differ in both strength and nature.
    - Random Switching Periods: Instead of fixed alternation intervals, a new period is randomly generated from \([0, \mathcal{T}_{max}]\) at each switch.
  - Design Motivation: Introducing divergence across three dimensions simultaneously (data, augmentation, and update pacing) to ensure teachers generate truly distinct "perspectives" in the feature space.
- Conflict-Combating Module (CCM):
  - Function: To handle pixels where the two teachers' predictions do not match, not by discarding them but by leveraging them.
  - Mechanism: Pixel-wise processing:
    - When teachers agree: Use an entropy-based weighted ensemble \(\psi_i = \frac{w_1 q_i^{t_1} + w_2 q_i^{t_2}}{w_1 + w_2}\), where \(w_k = e^{-H_{t_k}}\) (low entropy = high weight).
    - When teachers conflict: Compare the entropy of the ensemble prediction \(H_{\psi_i}\) with that of the student prediction \(H_{q_i^s}\), selecting the one with lower entropy (higher confidence) as supervision.
  - Design Motivation: In the late stages of training, the student might become more accurate than the teachers in certain regions (as the student processes all data); selecting the student during conflicts avoids penalizing the student with incorrect teacher predictions.
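
The two modules above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function and variable names are hypothetical, switching periods are drawn as whole iterations from `[1, max_period]` (a lower bound of 1 avoids empty periods, whereas the summary gives the range as \([0, \mathcal{T}_{max}]\)), pixels are flattened to shape `(N, C)`, and the fused output is kept as soft probabilities.

```python
import random
import numpy as np

def rpa_schedule(num_iters, max_period, seed=0):
    """Yield the index (0 or 1) of the teacher to update at each iteration.

    A new period length is drawn at every switch (assumption: uniform
    integer in [1, max_period], measured in iterations).
    """
    rng = random.Random(seed)
    active = 0
    remaining = rng.randint(1, max_period)
    for _ in range(num_iters):
        yield active
        remaining -= 1
        if remaining == 0:
            active = 1 - active                      # hand over to the other teacher
            remaining = rng.randint(1, max_period)   # fresh random period

def entropy(p, eps=1e-8):
    """Per-pixel Shannon entropy of class probabilities with shape (N, C)."""
    return -np.sum(p * np.log(p + eps), axis=1)

def conflict_combating(q_t1, q_t2, q_s):
    """Fuse teacher and student predictions pixel by pixel; returns (N, C)."""
    w1 = np.exp(-entropy(q_t1))[:, None]             # low entropy -> high weight
    w2 = np.exp(-entropy(q_t2))[:, None]
    psi = (w1 * q_t1 + w2 * q_t2) / (w1 + w2)        # entropy-weighted ensemble

    conflict = q_t1.argmax(axis=1) != q_t2.argmax(axis=1)
    student_wins = entropy(q_s) < entropy(psi)
    # Use the student only where the teachers disagree AND it is more confident.
    return np.where((conflict & student_wins)[:, None], q_s, psi)

schedule = list(rpa_schedule(num_iters=20, max_period=5))

# Three toy pixels, two classes.
q_t1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]])
q_t2 = np.array([[0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
q_s = np.array([[0.5, 0.5], [0.95, 0.05], [0.5, 0.5]])
targets = conflict_combating(q_t1, q_t2, q_s)
# Pixel 0: teachers agree, so the ensemble is used.
# Pixel 1: teachers conflict and the student is more confident, so the student is used.
# Pixel 2: teachers conflict but the ensemble is more confident, so the ensemble is kept.
```

With \(\mathcal{T}_{max} = 0.5\) epoch, `max_period` would correspond to half the number of iterations per epoch; the value 5 above is only a toy setting.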
Loss & Training
- Supervised Loss: Average of the Dice loss and the cross-entropy (CE) loss on labeled data.
- Unsupervised Loss: The same Dice + CE form, computed against the pseudo-labels fused by CCM.
- Confidence Threshold \(\tau\): 0.95 for 2D datasets and 0.75 for 3D datasets (a high threshold in 3D would filter out too much information).
- EMA parameter of 0.99, maximum period \(\mathcal{T}_{max} = 0.5\) epoch.
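
As a rough illustration of how these pieces combine, the sketch below computes an averaged Dice + CE loss and applies a confidence mask to the pseudo-labels. It is a minimal NumPy sketch under assumptions: pixels are flattened to `(N, C)`, pseudo-labels are hardened with `argmax`, pixels below \(\tau\) are simply excluded, and `lambda_t = 0.1` is a placeholder because the summary does not give the weighting schedule. All names are hypothetical.

```python
import numpy as np

def dice_ce_loss(probs, target, mask=None, eps=1e-6):
    """Average of soft Dice loss and cross-entropy.

    probs:  (N, C) predicted class probabilities per pixel
    target: (N,) integer class labels
    mask:   optional (N,) boolean array; only True pixels contribute
    """
    n, c = probs.shape
    if mask is None:
        mask = np.ones(n, dtype=bool)
    m = mask.astype(float)
    onehot = np.eye(c)[target]                          # (N, C)

    # Cross-entropy averaged over the kept pixels.
    ce_per_pixel = -np.log(probs[np.arange(n), target] + eps)
    ce = (ce_per_pixel * m).sum() / max(m.sum(), 1.0)

    # Soft Dice per class over the kept pixels, then averaged over classes.
    inter = (probs * onehot * m[:, None]).sum(axis=0)
    denom = ((probs + onehot) * m[:, None]).sum(axis=0)
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

    return 0.5 * (dice + ce)

def confidence_mask(pseudo_probs, tau):
    """Keep pixels whose maximum pseudo-label probability reaches tau."""
    return pseudo_probs.max(axis=1) >= tau

# Labeled branch (toy data).
probs_x = np.array([[0.8, 0.2], [0.3, 0.7]])
labels_x = np.array([0, 1])
loss_x = dice_ce_loss(probs_x, labels_x)

# Unlabeled branch: fused pseudo-labels supervise the student prediction.
pseudo = np.array([[0.97, 0.03], [0.80, 0.20], [0.02, 0.98]])
student_probs = np.array([[0.9, 0.1], [0.7, 0.3], [0.1, 0.9]])
mask_2d = confidence_mask(pseudo, tau=0.95)   # 2D setting
mask_3d = confidence_mask(pseudo, tau=0.75)   # 3D setting keeps more pixels
loss_u = dice_ce_loss(student_probs, pseudo.argmax(axis=1), mask=mask_2d)

lambda_t = 0.1  # placeholder weight (assumption; schedule not given in the summary)
total_loss = loss_x + lambda_t * loss_u
```

The middle pixel (confidence 0.80) is dropped at \(\tau = 0.95\) but kept at \(\tau = 0.75\), which mirrors the remark that a high threshold filters out more information.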
Key Experimental Results
Main Results
| Dataset | Labeled Ratio | Metric | AD-MT (%) | BCP (prev. SOTA, %) | Gain (points) |
|---|---|---|---|---|---|
| LA (3D) | 5% (4 cases) | Dice | 89.63 | 88.02 | +1.61 |
| LA (3D) | 10% (8 cases) | Dice | 90.55 | 89.62 | +0.93 |
| ACDC (2D) | 5% (3 cases) | Dice | 88.75 | 87.59 | +1.16 |
| ACDC (2D) | 10% (7 cases) | Dice | 89.46 | 88.84 | +0.62 |
| Pancreas (3D) | 10% (6 cases) | Dice | 80.21 | 73.83 | +6.38 |
| Pancreas (3D) | 20% (12 cases) | Dice | 82.61 | 82.91 | -0.30 |
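
The Gain column is simply the difference between the two Dice columns. A small check using the values copied from the table above (the variable names are illustrative):

```python
# (setting, AD-MT Dice, BCP Dice) copied from the table above
results = [
    ("LA 5%", 89.63, 88.02),
    ("LA 10%", 90.55, 89.62),
    ("ACDC 5%", 88.75, 87.59),
    ("ACDC 10%", 89.46, 88.84),
    ("Pancreas 10%", 80.21, 73.83),
    ("Pancreas 20%", 82.61, 82.91),
]

gains = {name: round(admt - bcp, 2) for name, admt, bcp in results}
wins = sum(1 for g in gains.values() if g > 0)

for name, g in gains.items():
    print(f"{name}: {g:+.2f}")
print(f"AD-MT ahead in {wins} of {len(results)} settings")
```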
Ablation Study
| Configuration | ACDC Dice | ACDC 95HD | Description |
|---|---|---|---|
| T1 only | 86.83 | 2.65 | Single-teacher baseline |
| T2 only | 86.22 | 2.43 | Copy-paste teacher alone: slightly lower Dice, slightly better 95HD than T1 |
| T1+T2+RPA (no CCM) | 87.88 | 2.03 | Alternate updating adds 1.05 Dice points over T1 only |
| T1+T2+RPA+CCM (Full) | 88.75 | 1.48 | CCM adds a further 0.87 Dice points |
Key Findings
- Largest improvement on Pancreas 10% (+6.38 Dice points): The advantage of AD-MT is most pronounced in this low-label setting, suggesting that diverse supervision from the two teachers is most valuable when labeled data is extremely scarce.
- RPA module contributes the most (+1.05 Dice points): Moving from a single teacher (T1 only) to two RPA-updated teachers is the largest single source of improvement in the ablation.
- CCM is more valuable in later training stages: Conflicts between teachers increase in later stages of training, where the student model has already acquired substantial capability and can provide valuable alternative supervision.
- Threshold \(\tau\) sensitivity differs between 2D and 3D datasets: A high threshold (0.95) works well for 2D, while a lower threshold (0.75) is better for 3D.
- Outperforms BCP in five of the six settings without pre-training: BCP requires an additional pre-training phase, whereas AD-MT trains end-to-end in a single stage.
Highlights & Insights
- Addressing confirmation bias from the perspective of teacher updating strategies: Rather than relying on architectural differences or extra loss functions, the method creates differences between teachers across three dimensions: data partition, augmentation strategy, and update frequency. The approach is simple yet effective.
- Leveraging rather than discarding conflicts: Standard approaches typically average or discard conflicting teacher predictions. In contrast, CCM compares entropies and keeps the more confident prediction, an idea that transfers readily to other multi-model ensemble methods.
- Random periods are key: Fixed alternating periods might lead to a fixed data distribution observed by each of the two teachers; randomization breaks this rigid pattern.
Limitations & Future Work
- Evaluated only on U-Net/V-Net backbones: Validation has not been performed on stronger backbones (e.g., nnU-Net, Swin UNETR).
- Slightly lower performance than BCP on the Pancreas 20% setting (82.61 vs. 82.91 Dice): The comparative advantage shrinks at higher labeled ratios, suggesting that diverse supervision may yield diminishing returns as labeled data increases.
- The augmentation strategies for the two teachers are manually designed: The choice of color-jittering vs. copy-paste was not automatically searched, and other augmentation combinations might perform better.
- Risks associated with CCM's student substitution strategy: In the early stages of training, the student is quite weak, and choosing the student's prediction during conflicts may introduce more noise—suggesting the need for a warm-up mechanism.
Related Work & Insights
- vs. PS-MT: PS-MT also utilizes multiple teachers but updates them at different epochs. This paper implements alternation within the same epoch, combined with complementary data and distinct augmentations, enabling more thorough differentiation.
- vs. MC-Net+: MC-Net+ uses co-training with two students, introducing additional parameters. AD-MT only trains a single student (with teachers updated via EMA), incurring no additional training costs.
- vs. BCP: BCP requires a pre-training stage to establish a reliable initialization, while AD-MT trains end-to-end without pre-training and still gains 6.38 Dice points on the Pancreas 10% setting.
Rating
- Novelty: ⭐⭐⭐⭐ RPA and CCM designs are innovative, and the triple-differentiation concept of the alternate updating strategy is clever.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 3 datasets, 2D/3D setups, detailed ablation studies, threshold sensitivity, and class-level analysis.
- Writing Quality: ⭐⭐⭐⭐ Diagrams are clear, and the contrastive framework against existing methods is well-depicted.
- Value: ⭐⭐⭐⭐ Simple yet highly effective, presenting direct value to semi-supervised medical image segmentation.