Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation

CVPR 2026 Medical Imaging Weakly supervised semantic segmentation teacher-student framework pseudo-mask refinement gland segmentation colorectal cancer pathology

Conference: CVPR 2026
arXiv: 2603.08605
Authors: Hikmat Khan, Wei Chen, Muhammad Khalid Khan Niazi (The Ohio State University Wexner Medical Center)
Area: Medical Imaging
Keywords: Weakly supervised semantic segmentation, teacher-student framework, pseudo-mask refinement, gland segmentation, colorectal cancer pathology

TL;DR

This paper proposes a weakly supervised teacher-student framework that leverages sparse pathological annotations and an EMA-stabilized teacher network to generate progressively refined pseudo-masks. Through confidence filtering, adaptive fusion, and curriculum-guided refinement, the framework achieves efficient segmentation of glandular structures in colorectal cancer pathology images.

Background & Motivation

Clinical Need

Colorectal cancer (CRC) is the third most common cancer worldwide, and its pathological grading relies heavily on accurate segmentation of glandular structures. Pathologists must assess morphological features of glands—including size, shape, and arrangement density—to determine tumor grade, which directly informs treatment planning.

Annotation Bottleneck

High cost of pixel-level annotation: A single whole slide image (WSI) may contain hundreds of glandular structures; pixel-wise annotation can require hours or even days of expert pathologist effort.
Poor annotation consistency: Different pathologists exhibit subjective disagreement in delineating gland boundaries, particularly in high-grade tumors with abnormal gland morphology.
Clinical infeasibility: Large-scale pixel-level annotation is impractical to obtain within routine clinical diagnostic workflows.

Limitations of Weakly Supervised Methods

Existing weakly supervised semantic segmentation (WSSS) methods are predominantly based on class activation maps (CAMs) and suffer from the following issues:
- Incomplete activation: CAMs tend to highlight only the most discriminative local regions (e.g., gland centers), neglecting the complete structural boundaries of glands.
- Poor pseudo-mask quality: Pseudo-masks generated by CAMs are noisy with blurred boundaries; using them directly for training limits segmentation model performance.
- Inability to supervise unannotated structures: When only a subset of glands in the training data is annotated, CAM-based methods cannot effectively exploit information from unannotated glands.

Core Motivation

Can a method be designed that, using only a small number of sparse pathological annotations, progressively discovers and segments unannotated glandular regions via a teacher-student framework, while ensuring progressive improvement in pseudo-mask quality?

Method

Overall Architecture

The paper proposes a weakly supervised teacher-student framework comprising three core mechanisms working in concert.

1. EMA-Stabilized Teacher Network

Architecture: The teacher network shares the same architecture as the student network, but its parameters are updated from the student network via exponential moving average (EMA).
Parameter update rule: Teacher parameters $\theta_T$ are updated at each training step $t$ as: $$\theta_T^{(t)} = \alpha \cdot \theta_T^{(t-1)} + (1-\alpha) \cdot \theta_S^{(t)}$$ where $\alpha$ is the EMA decay coefficient (typically 0.99–0.999) and $\theta_S$ denotes student network parameters.
Advantage: EMA aggregates student parameters over multiple steps, yielding more stable and smooth predictions and avoiding pseudo-mask oscillations caused by single-step training fluctuations.
Pseudo-mask generation: The teacher network generates predictions over unannotated regions, serving as additional supervision signals for the student network.
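
The update rule above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: parameters are plain Python lists instead of network tensors, and the function name `ema_update` and the oscillating student trajectory are hypothetical.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return alpha * teacher + (1 - alpha) * student, element-wise.

    `teacher` and `student` are flat lists of floats standing in for
    network parameters (an assumption made for this sketch).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def track_teacher(student_steps, teacher_init, alpha):
    """Apply the EMA update once per student step and record the teacher."""
    teacher = list(teacher_init)
    history = []
    for student in student_steps:
        teacher = ema_update(teacher, student, alpha)
        history.append(teacher)
    return history


# Hypothetical student parameter that oscillates between 1.0 and 3.0.
noisy_student = [[1.0], [3.0]] * 10
teacher_history = track_teacher(noisy_student, teacher_init=[2.0], alpha=0.9)

# The student swings by 2.0 per step; the teacher stays close to the mean.
teacher_swing = max(h[0] for h in teacher_history) - min(h[0] for h in teacher_history)
```

With a larger $\alpha$ the teacher averages over more student steps, so it reacts more slowly but fluctuates less.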

2. Confidence Filtering and Adaptive Fusion

Confidence filtering: A confidence threshold $\tau$ is applied to teacher network predictions; only high-confidence regions retain pseudo-labels.
- Low-confidence regions, where the teacher's prediction confidence $p$ falls below the threshold ($p < \tau$), are marked as "ignored" and excluded from loss computation.
- This prevents erroneous teacher predictions from being propagated to the student network.
Adaptive fusion: The limited ground-truth annotations are fused with teacher-predicted pseudo-masks.
- In annotated regions, ground-truth labels serve as supervision.
- In unannotated regions, confidence-filtered teacher predictions are used.
- Fusion weights adapt with training progress: early stages trust ground truth more heavily, while pseudo-mask contributions increase gradually.
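
A minimal sketch of how filtering and fusion might fit together for a binary gland/background task. The summary does not give the exact formulation, so several pieces here are assumptions: the ignore marker `IGNORE`, confidence defined as `max(p, 1 - p)` of the teacher's foreground probability, a linear ramp for the pseudo-mask weight, and the function names themselves.

```python
import math

IGNORE = -1  # hypothetical marker for pixels excluded from the loss


def fuse_labels(sparse_gt, teacher_prob, tau):
    """Combine sparse ground truth with confidence-filtered teacher predictions.

    sparse_gt:    per-pixel 0, 1, or None (None means unannotated).
    teacher_prob: per-pixel foreground probability from the teacher.
    """
    fused = []
    for gt, p in zip(sparse_gt, teacher_prob):
        if gt is not None:
            fused.append(gt)                    # annotated: trust ground truth
        elif max(p, 1.0 - p) >= tau:
            fused.append(1 if p >= 0.5 else 0)  # confident pseudo-label
        else:
            fused.append(IGNORE)                # low confidence: ignored
    return fused


def pseudo_weight(step, total_steps, max_weight=1.0):
    """Assumed linear ramp: pseudo-masks count for little early, more later."""
    return max_weight * min(1.0, step / total_steps)


def fused_loss(student_prob, fused, sparse_gt, lam):
    """Weighted binary cross-entropy: weight 1 on ground truth, lam on pseudo-labels."""
    total, norm = 0.0, 0.0
    for q, y, gt in zip(student_prob, fused, sparse_gt):
        if y == IGNORE:
            continue
        w = 1.0 if gt is not None else lam
        q = min(max(q, 1e-7), 1.0 - 1e-7)       # avoid log(0)
        total += -w * (y * math.log(q) + (1 - y) * math.log(1.0 - q))
        norm += w
    return total / norm if norm > 0 else 0.0


sparse_gt = [1, None, None, None, 0]
teacher_prob = [0.2, 0.95, 0.6, 0.03, 0.9]
fused = fuse_labels(sparse_gt, teacher_prob, tau=0.9)
```

Note that the annotated pixels keep their ground-truth labels even where the teacher disagrees (first and last pixel in the example), which is the behaviour the fusion rule describes.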

3. Curriculum-Guided Refinement

Easy-to-hard strategy: Training initially focuses on morphologically clear, easily segmented glands (e.g., large, regular-shaped normal glands), progressively transitioning to morphologically complex cases (e.g., small, irregular tumor glands).
Iterative pseudo-mask update: At the end of each training stage, the updated teacher network regenerates pseudo-masks, progressively improving their quality.
Dynamic threshold adjustment: The confidence threshold is gradually lowered as training progresses, allowing more regions to participate in training and expanding pseudo-mask coverage.
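
The schedule below is one way such a curriculum could look. It is a sketch under stated assumptions: the summary does not specify the schedule shape, so the linear decay, the start and end values (`tau_start`, `tau_end`), the per-sample `difficulty` score, and all function names are hypothetical.

```python
def confidence_threshold(stage, num_stages, tau_start=0.95, tau_end=0.7):
    """Linearly lower the confidence threshold from tau_start to tau_end."""
    if num_stages < 2:
        return tau_start
    frac = min(max(stage / (num_stages - 1), 0.0), 1.0)
    return tau_start + (tau_end - tau_start) * frac


def pseudo_mask_coverage(teacher_prob, tau):
    """Fraction of pixels whose teacher confidence max(p, 1 - p) reaches tau."""
    kept = sum(1 for p in teacher_prob if max(p, 1.0 - p) >= tau)
    return kept / len(teacher_prob)


def curriculum_subset(difficulty, stage, num_stages, start_frac=0.5):
    """Indices of the easiest samples; the kept fraction grows to 1.0.

    `difficulty` is an assumed per-sample score (lower means easier), for
    example derived from gland size or shape regularity.
    """
    frac = 1.0 if num_stages < 2 else start_frac + (1.0 - start_frac) * stage / (num_stages - 1)
    k = max(1, round(frac * len(difficulty)))
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    return order[:k]


teacher_prob = [0.99, 0.8, 0.55, 0.1]
early = pseudo_mask_coverage(teacher_prob, confidence_threshold(0, 5))
late = pseudo_mask_coverage(teacher_prob, confidence_threshold(4, 5))
```

In this toy example the coverage rises from a quarter of the pixels at the first stage to three quarters at the last, which mirrors the intended effect of lowering the threshold.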

Training Pipeline

1. Initialize the student network using sparse pathological annotations.
2. Initialize the teacher network via EMA.
3. The teacher network generates pseudo-masks for unannotated regions.
4. Confidence filtering → adaptive fusion → construction of mixed supervision signals.
5. The student network is trained under mixed supervision.
6. EMA updates teacher network parameters.
7. The curriculum scheduler adjusts training difficulty and threshold.
8. Repeat steps 3–7 until convergence.
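
To make the loop concrete, here is a toy end-to-end sketch of the steps above. Everything specific in it is an assumption for illustration: the "network" is a two-parameter logistic model on a 1D feature, the "image" has 20 pixels with two annotated ones, and the hyperparameter values and the name `run_pipeline` are hypothetical. It is not the authors' training code.

```python
import math


def predict(params, x):
    """Foreground probability of a toy logistic model with params [w, b]."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def run_pipeline(steps=300, alpha=0.9, lr=0.5,
                 tau_start=0.9, tau_end=0.7, lam_max=0.5):
    # Toy data: 20 pixels with one feature each; a pixel is "gland" iff x > 0.
    xs = [(i + 0.5) / 10.0 - 1.0 for i in range(20)]
    truth = [1 if x > 0 else 0 for x in xs]
    labeled = {1: 0, 18: 1}          # sparse annotation: pixel index -> label

    student = [0.0, 0.0]             # step 1 (here: trivial initialization)
    teacher = list(student)          # step 2: teacher starts as a student copy
    coverage = []                    # number of pseudo-labeled pixels per step

    for t in range(steps):
        frac = t / (steps - 1)
        tau = tau_start + (tau_end - tau_start) * frac   # step 7: lower threshold
        lam = lam_max * frac                             # pseudo-mask weight ramp

        # Steps 3-4: teacher pseudo-masks, confidence filtering, fusion.
        targets = []                 # (feature, label, weight)
        n_pseudo = 0
        for i, x in enumerate(xs):
            if i in labeled:
                targets.append((x, labeled[i], 1.0))
            else:
                p = predict(teacher, x)
                if max(p, 1.0 - p) >= tau:
                    targets.append((x, 1 if p >= 0.5 else 0, lam))
                    n_pseudo += 1
        coverage.append(n_pseudo)

        # Step 5: one gradient step on the weighted cross-entropy.
        norm = sum(wt for _, _, wt in targets)
        grad_w = sum(wt * (predict(student, x) - y) * x for x, y, wt in targets)
        grad_b = sum(wt * (predict(student, x) - y) for x, y, wt in targets)
        student[0] -= lr * grad_w / norm
        student[1] -= lr * grad_b / norm

        # Step 6: EMA update of the teacher.
        teacher = [alpha * th + (1.0 - alpha) * s
                   for th, s in zip(teacher, student)]

    return student, teacher, coverage, xs, truth


student, teacher, coverage, xs, truth = run_pipeline()
teacher_mask = [1 if predict(teacher, x) >= 0.5 else 0 for x in xs]
```

In this toy setting no pixel is pseudo-labeled at the first step because the untrained teacher is uncertain everywhere; coverage grows as the teacher becomes confident and the threshold drops.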

Key Experimental Results

Datasets

OSU institutional dataset: 60 H&E-stained whole slide images from the Ohio State University Wexner Medical Center.
GlaS: The Gland Segmentation Challenge public dataset of colorectal cancer pathology images.
TCGA-COAD: The Cancer Genome Atlas colon adenocarcinoma dataset.
TCGA-READ: The Cancer Genome Atlas rectal adenocarcinoma dataset.
SPIDER: An independent pathological segmentation dataset.

Table 1: Segmentation Performance Comparison on GlaS

| Method | Supervision | mIoU (%) | mDice (%) |
| --- | --- | --- | --- |
| Fully supervised baseline (UNet) | Full pixel annotation | ~85 | ~92 |
| CAM-based WSSS | Image-level labels | ~65 | ~75 |
| Pseudo-label methods | Point/scribble annotation | ~72 | ~82 |
| Semi-supervised Mean Teacher | Partial pixel annotation | ~76 | ~85 |
| Ours | Sparse annotation | 80.10 | 89.10 |

Using only sparse annotations, the proposed method achieves mIoU of 80.10% and mDice of 89.10%, approaching the fully supervised baseline and substantially outperforming conventional CAM-based weakly supervised methods.
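
For reference, the two reported metrics can be computed as follows. This is a generic sketch on flat binary masks; averaging over the background and gland classes is an assumption, since the summary does not state how the paper averages, and the function names are hypothetical.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for one class, given flat lists of 0/1 values."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    return inter / union, 2.0 * inter / (pred_sum + target_sum)


def mean_iou_dice(pred, target):
    """Average IoU and Dice over the gland (1) and background (0) classes."""
    fg_iou, fg_dice = iou_and_dice(pred, target)
    bg_iou, bg_dice = iou_and_dice([1 - p for p in pred], [1 - t for t in target])
    return (fg_iou + bg_iou) / 2.0, (fg_dice + bg_dice) / 2.0


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
iou, dice = iou_and_dice(pred, target)
m_iou, m_dice = mean_iou_dice(pred, target)
```

For a single class and a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; after averaging over classes or images this identity holds only approximately.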

Table 2: Cross-Dataset Generalization Performance

| Dataset | Training Setting | mIoU (%) | mDice (%) | Notes |
| --- | --- | --- | --- | --- |
| GlaS (in-domain) | Standard training | 80.10 | 89.10 | Best performance |
| TCGA-COAD (cross-domain) | Zero-shot transfer | Robust | Robust | No additional annotations; good generalization |
| TCGA-READ (cross-domain) | Zero-shot transfer | Robust | Robust | Consistent with COAD |
| SPIDER (cross-domain) | Zero-shot transfer | Degraded | Degraded | Performance drop due to domain shift |

Cross-domain evaluation indicates:
- The framework generalizes well to similar colorectal cancer datasets (TCGA-COAD/READ), maintaining robust performance without additional annotations.
- Performance degrades on SPIDER, reflecting domain shift arising from different tissue preparation protocols and scanning equipment.

Highlights & Insights

Annotation efficiency: Only sparse pathological annotations are required (rather than full pixel-level annotation), substantially reducing annotation cost in clinical practice and enabling deployment of large-scale gland segmentation systems.
Progressive pseudo-mask refinement: The curriculum-guided iterative refinement strategy continuously improves pseudo-mask quality throughout training, avoiding the fundamental drawback of CAM methods that generate low-quality pseudo-masks in a single pass.
EMA stability: The EMA update mechanism of the teacher network effectively suppresses pseudo-label oscillation, ensuring stable convergence during training.
Clinical applicability: The framework design aligns with clinical reality—pathologists typically annotate only a small number of representative structures—and the method fully exploits this sparse annotation paradigm.
Cross-domain generalization: Zero-shot transfer results on TCGA-COAD/READ demonstrate that the framework learns generalizable feature representations of glandular structures.

Limitations & Future Work

SPIDER domain shift: Notable performance degradation on SPIDER indicates that the framework is sensitive to domain shift from different tissue preparation protocols and scanners; additional domain adaptation strategies are needed.
Ambiguous definition of sparse annotation: The paper does not explicitly quantify the proportion of "sparse annotation" (e.g., what fraction of glands are annotated), making it difficult to evaluate method performance under varying annotation budgets.
Limited dataset scale: The OSU institutional dataset contains only 60 WSIs; this small scale may limit assessment of the method's upper bound.
Insufficient comparison with SOTA WSSS methods: No direct comparison is made with recent advanced weakly supervised segmentation methods (e.g., pathology-adapted versions of SEAM, AffinityNet).
Hyperparameter sensitivity: The framework introduces additional hyperparameters (EMA decay coefficient, confidence threshold, curriculum schedule); their sensitivity and tuning guidance are not sufficiently discussed.

Related Work

Gland segmentation: Traditional methods rely on handcrafted features (morphological filtering, random forests); deep learning approaches (U-Net, DeepLab) have achieved breakthroughs but depend on large-scale pixel annotations.
Weakly supervised semantic segmentation (WSSS): The typical pipeline runs from CAM-based localization (Grad-CAM, LayerCAM) to pseudo-mask generation to segmentation network training; improvements include attention erasure, contrastive learning, and affinity propagation.
Teacher-student / semi-supervised segmentation: Mean Teacher, FixMatch, and related methods are widely applied to natural images; adapting these to pathology images requires handling challenges such as multi-scale structures and dense targets.
Pseudo-label refinement: Pseudo-label denoising strategies in self-training include confidence thresholding, uncertainty estimation, and prototype-based contrastive learning.
Positioning of this work: The paper combines the teacher-student framework with progressive pseudo-mask refinement, tailored specifically to sparse annotation scenarios in pathology images, filling a gap in the weakly supervised gland segmentation literature regarding stable pseudo-label generation mechanisms.

Rating

Novelty: ⭐⭐⭐ — EMA teacher-student and pseudo-mask refinement are established techniques; the contribution lies primarily in their combination and adaptation to the pathology domain.
Experimental Thoroughness: ⭐⭐⭐ — Multi-dataset evaluation and cross-domain experiments are convincing, but comparisons with a broader range of WSSS methods and detailed ablation studies are lacking.
Writing Quality: ⭐⭐⭐ — Structure is clear and positioning is well-defined, though methodological details remain somewhat high-level.
Value: ⭐⭐⭐⭐ — Directly addresses the annotation cost bottleneck in pathology, with clear practical value for clinical deployment.