DualFete: Revisiting Teacher-Student Interactions from a Feedback Perspective for Semi-supervised Medical Image Segmentation

Conference: AAAI 2026
arXiv: 2511.09319
Code: github.com/lyricsyee/dualfete
Area: Medical Image Segmentation / Semi-supervised Learning
Keywords: Semi-supervised medical image segmentation, teacher-student model, feedback mechanism, confirmation bias, dual-teacher framework

TL;DR

A feedback mechanism is introduced into the teacher-student semi-supervised learning framework: the student feeds back to the teacher whether pseudo-label-guided updates are consistent with the supervision direction of the labeled data. This feedback dynamic is further enhanced within a dual-teacher architecture, effectively suppressing error accumulation and confirmation bias in medical image segmentation.

Background & Motivation

Core Problem: Confirmation Bias

Semi-supervised medical image segmentation (SSMIS) trains segmentation models using a small amount of labeled data together with a large amount of unlabeled data. The teacher-student paradigm is the dominant framework: the teacher generates pseudo-labels for unlabeled data to supervise the student. However, this introduces a severe confirmation bias problem:

Inherent ambiguity of medical images: Target boundaries are unclear and regional uncertainty is high, making it easy to generate erroneous pseudo-labels.

Error self-reinforcement: After the student trains on incorrect pseudo-labels, it influences the teacher via EMA or similar mechanisms, causing the teacher to generate more similar errors — a vicious cycle.

Degeneration into self-training: Existing multi-model methods (e.g., cross-supervision) introduce diversity, but for highly nonlinear networks, the diversity gradually vanishes and the framework eventually degenerates into self-training.

Issues Revealed by Preliminary Experiments (Fig. 1)

Preliminary experiments on the LA dataset clearly illustrate the problem:
  • (a) Pseudo-label accuracy remains nearly unchanged throughout training (Mean Teacher).
  • (b) High-error regions are concentrated near boundaries (regional uncertainty).
  • (c) Consistent errors are prevalent.
  • (d) Feedback interaction can effectively reduce consistent errors (key finding).

Paper Goals

Existing methods lack an intrinsic error-correction mechanism. Inspired by metacognitive intervention, this paper proposes that the student evaluates whether pseudo-label-guided updates are consistent with the supervision direction of labeled data, and feeds the evaluation back to the teacher, endowing the teacher-student framework with a self-correction capability.

Method

Overall Architecture

DualFete comprises three levels:
  1. Basic feedback mechanism: Feedback is introduced into a single teacher-student model.
  2. Dual-teacher feedback model: Two teachers collaboratively guide a single student, each receiving individualized feedback.
  3. Full framework: Dual-teacher feedback + cross-supervision + strong-augmentation consistency.

Key Designs

1. Feedback-Coupled Teacher-Student Model

Core Idea: Quantify the effect of pseudo-label-guided student updates on performance over labeled data.

Let \(\mathcal{L}_l(\theta_S)\) and \(\mathcal{L}_l(\theta_S')\) denote the labeled-data loss of the student before and after one pseudo-label update step, respectively. The feedback signal is defined as:

\[\delta = \mathcal{L}_l(\theta_S) - \mathcal{L}_l(\theta_S')\]
  • \(\delta > 0\): The pseudo-label-guided update reduces the labeled loss → beneficial update → increase pseudo-label confidence.
  • \(\delta < 0\): The pseudo-label-guided update increases the labeled loss → harmful update → decrease pseudo-label confidence.
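The sign semantics of \(\delta\) can be illustrated with a deliberately tiny stand-in model (a one-parameter least-squares fit, not the paper's segmentation network; all names here are illustrative):

```python
# Toy illustration of the feedback signal delta: the student fits y = w * x;
# one pseudo-label gradient step changes w, and delta measures how the
# labeled-data loss responds to that step.

def labeled_loss(w, data):
    # mean squared error on the labeled set
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def pseudo_update(w, x_u, y_pseudo, lr=0.1):
    # one gradient step of the student on a single pseudo-labeled point
    grad = 2 * (w * x_u - y_pseudo) * x_u
    return w - lr * grad

labeled = [(1.0, 2.0), (2.0, 4.0)]   # consistent with the true relation y = 2x
w = 1.0                              # under-trained student

# A good pseudo-label (close to the true function) should yield delta > 0
w_good = pseudo_update(w, 3.0, 6.0)
delta_good = labeled_loss(w, labeled) - labeled_loss(w_good, labeled)

# A bad pseudo-label (pointing away from the truth) should yield delta < 0
w_bad = pseudo_update(w, 3.0, -6.0)
delta_bad = labeled_loss(w, labeled) - labeled_loss(w_bad, labeled)

print(delta_good > 0, delta_bad < 0)  # True True
```

In the paper this scalar then weights the teacher's feedback loss, raising or lowering pseudo-label confidence accordingly.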

The teacher minimizes the feedback loss accordingly:

\[\mathcal{L}_{fb}(\theta_T; \mathcal{D}_u') = -\delta \log \mathcal{P}(\hat{y}^u | x^u; \theta_T, \mathcal{D}_u')\]

Theoretical Basis: \(\delta\) is a first-order approximation of the inner product of two gradients — the pseudo-label direction \(\Delta\theta_S\) and the labeled-data direction \(\nabla_{\theta_S}\mathcal{L}_l\) — which is consistent with the meta-objective of Meta Pseudo Labels.
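This first-order claim is easy to check numerically on the same kind of toy setup (hypothetical scalar model, not the paper's network): for a small learning rate, \(\delta = \mathcal{L}_l(\theta) - \mathcal{L}_l(\theta + \Delta\theta) \approx -\nabla_\theta \mathcal{L}_l \cdot \Delta\theta\).

```python
# Numerical check that delta is, to first order, the (negative) inner
# product of the labeled-data gradient with the pseudo-label update step.

def labeled_loss(w):
    return (w - 2.0) ** 2            # labeled objective, minimum at w = 2

def labeled_grad(w):
    return 2 * (w - 2.0)

w = 1.0
lr = 1e-4
pseudo_grad = -3.0                   # gradient of some pseudo-label loss
delta_w = -lr * pseudo_grad          # student update driven by pseudo-labels

delta = labeled_loss(w) - labeled_loss(w + delta_w)
approx = -labeled_grad(w) * delta_w  # first-order term

print(abs(delta - approx) < 1e-6)    # True: the approximation holds
```

With \(\Delta\theta = -\eta \nabla_\theta \mathcal{L}_u\), this becomes \(\delta \approx \eta \, \nabla_\theta \mathcal{L}_u \cdot \nabla_\theta \mathcal{L}_l\): positive exactly when the pseudo-label direction aligns with the labeled-data direction.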

2. Dual-Teacher Feedback (Core Contribution of DualFete)

Limitation of single-teacher feedback: Updates to all voxel pseudo-labels are applied in a uniform direction, limiting the error-correction capacity.

DualFete introduces two teachers \(\phi\) and \(\psi\), decomposing the feedback into two dimensions:

Feedback Attributor — identifies which pseudo-labels triggered the student update:
  • \(\bar{y}^a\) (agreement region): regions where both teachers predict the same label.
  • \(\bar{y}^d\) (disagreement region): regions where the two teachers predict different labels.

Feedback Receiver — determines to which teacher and which component the feedback is applied:
  • Agreement feedback \(\delta_a\) → applied to the teacher on the lower-confidence side.
  • Disagreement feedback \(\delta_d\) → applied to the teacher on the higher-confidence side.

Design Intuition:
  • \(\delta_a > 0\): Consensus is correct → raise the confidence lower bound → stronger consensus.
  • \(\delta_a < 0\): Consensus is incorrect → lower the confidence on the lower-confidence side → more likely to produce disagreement for error correction.
  • \(\delta_d > 0\): The higher-confidence side in disagreement is correct → further reinforce it.
  • \(\delta_d < 0\): The higher-confidence side in disagreement is incorrect → flip the prediction toward the other teacher's label.
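The attributor/receiver logic can be sketched as simple per-voxel routing (toy scalar confidences and invented function names; the actual method works on full 3D probability maps):

```python
# Sketch of dual-teacher feedback routing: agreement feedback goes to the
# lower-confidence teacher, disagreement feedback to the higher-confidence one.

def route_feedback(label_a, conf_a, label_b, conf_b, delta_agree, delta_dis):
    """Return per-voxel (receiver, feedback) pairs for teachers 'A' / 'B'."""
    routed = []
    for la, ca, lb, cb in zip(label_a, conf_a, label_b, conf_b):
        if la == lb:
            # agreement region: receiver is the LOWER-confidence teacher
            receiver = 'A' if ca < cb else 'B'
            routed.append((receiver, delta_agree))
        else:
            # disagreement region: receiver is the HIGHER-confidence teacher
            receiver = 'A' if ca > cb else 'B'
            routed.append((receiver, delta_dis))
    return routed

# Voxels 0-1 agree; voxel 2 disagrees (teacher B more confident there)
out = route_feedback([1, 0, 1], [0.9, 0.6, 0.55],
                     [1, 0, 0], [0.7, 0.8, 0.85],
                     delta_agree=0.2, delta_dis=-0.1)
print(out)  # [('B', 0.2), ('A', 0.2), ('B', -0.1)]
```

Note how a negative agreement feedback reaches the weaker teacher (making future disagreement more likely), while a negative disagreement feedback reaches the stronger teacher (pushing it toward the other side's label) — exactly the error-correction dynamics listed above.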

The final dual-teacher feedback loss:

\[\mathcal{L}_{df}(\theta) = -\sum_{\bar{y} \in \{\bar{y}^a, \bar{y}^d\}} \delta_{\bar{y}} \log \mathcal{P}(\hat{y}^u | x^u; \theta, \mathcal{D}_u, \mathcal{M}_{\bar{y}}^\theta)\]

3. Pseudo-Label Fusion and Cross-Supervision

Pseudo-label strategy (Eq. 6):
  • The teachers agree → use the consensus label.
  • The teachers disagree → use the label from the higher-confidence teacher.
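A minimal sketch of this fusion rule, assuming per-voxel hard labels with scalar confidences (illustrative names, not the paper's code):

```python
# Pseudo-label fusion: per voxel, take the consensus label when the two
# teachers agree, otherwise the label of the more confident teacher.

def fuse_pseudo_labels(labels_a, conf_a, labels_b, conf_b):
    fused = []
    for la, ca, lb, cb in zip(labels_a, conf_a, labels_b, conf_b):
        if la == lb:
            fused.append(la)                      # consensus
        else:
            fused.append(la if ca >= cb else lb)  # higher-confidence side
    return fused

fused = fuse_pseudo_labels([1, 0, 1], [0.9, 0.6, 0.4],
                           [1, 1, 0], [0.8, 0.7, 0.9])
print(fused)  # [1, 1, 0]
```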

The total teacher loss comprises three terms:

\[\mathcal{L}_T(\theta) = \mathcal{L}_l(\theta) + \mathcal{L}_{df}(\theta) + \lambda \mathcal{L}_{cs}^{\mathcal{A}}(\theta; \bar{\theta}, \mathcal{A})\]

  • \(\mathcal{L}_l\): Fully supervised loss on labeled data.
  • \(\mathcal{L}_{df}\): Dual-teacher feedback loss.
  • \(\mathcal{L}_{cs}^{\mathcal{A}}\): Cross-supervision loss with strong augmentation (one teacher's prediction serves as the target for the other teacher's strongly augmented input).

Loss & Training

  • The student is updated using only unlabeled data and pseudo-labels, and is responsible for computing the feedback.
  • The teachers are updated using both labeled and unlabeled data, receiving feedback and cross-supervision.
  • Only the student model is used at inference; both teachers are used exclusively during training.
  • The student can optionally be fine-tuned on labeled data.

Key Experimental Results

Main Results

| Method | LA 5% (4) | LA 10% (8) | LA 20% (16) | Pancreas 10% (6) | Pancreas 20% (12) | BraTS 10% (25) | BraTS 20% (50) |
|---|---|---|---|---|---|---|---|
| FullySup | 52.55 | 82.74 | 86.96 | 55.60 | 72.38 | 74.43 | 80.16 |
| UA-MT | 82.26 | 86.28 | 88.74 | 66.44 | 76.10 | 84.64 | 85.32 |
| BCP | 88.02 | 89.62 | 91.26 | 73.83 | 82.91 | 85.14 | 86.13 |
| AD-MT | 89.63 | 90.55 | - | 80.21 | 82.61 | - | - |
| TraCoCo | - | 89.86 | 91.51 | 79.22 | 83.36 | 85.71 | 86.69 |
| DualFete | 90.35 | 91.28 | 91.89 | 81.99 | 83.49 | 86.13 | 85.83 |
| DualFete w.ft. | 90.22 | 91.12 | 91.91 | 82.45 | 83.85 | 86.25 | 86.46 |

Values are Dice (%). DualFete achieves the best performance in almost all settings. On Pancreas 10%, it improves by +1.78% Dice over the previous SOTA AD-MT.

Ablation Study

| Configuration | LA 20% Dice | Pancreas 20% Dice | Notes |
|---|---|---|---|
| Baseline (single teacher, no feedback) | 88.55 | 77.18 | Baseline |
| + Single-teacher feedback | 89.63 | 79.27 | Feedback mechanism is effective |
| + Dual-teacher + unified feedback | 89.83 | 76.83 | Mismatched attribution/receiver causes degradation |
| + Dual-teacher + agreement feedback | 90.34 | 79.56 | Individually effective |
| + Dual-teacher + disagreement feedback | 90.35 | 80.77 | Individually effective |
| + Dual-teacher + mismatched feedback | 87.69 | 78.06 | Reversed attribution/receiver → performance drop |
| + Dual-teacher + correct feedback | 90.89 | 81.12 | Both feedbacks in synergy yield best results |

Table 3 further verifies that the feedback loss is not equivalent to consistency regularization or entropy minimization: a model trained solely with \(\mathcal{L}_{df}\) is not robust to input perturbations and does not reduce prediction uncertainty.

Key Findings

  1. Qualitative analysis (Fig. 4): Experiments under 8 different constraint configurations clearly demonstrate the distinct roles of the two feedback types — \(\delta_a\) governs consensus quality and \(\delta_d\) governs disagreement dynamics; their synergy produces "productive predictive disagreement" while maintaining pseudo-label accuracy.
  2. Efficiency analysis (Table 5): Inference speed is nearly identical to FullySup (~1.9 s/case), since only the student model is used at inference. Training is slightly faster than TraCoCo (2.28 vs. 2.39 s/iter) and requires less memory (10.25 vs. 21.93 GB).
  3. Effect of fine-tuning: Fine-tuning yields notable gains in settings with more labels and on challenging datasets (Pancreas), but tends to overfit when labeled data is extremely scarce.
  4. Confidence threshold: 0.7 is the optimal threshold; filtering low-confidence targets is particularly important for the feedback mechanism.

Highlights & Insights

  • Originality of the feedback mechanism: This is the first work to introduce an intrinsic error-correction capability into the teacher-student semi-supervised framework, rather than relying on external heuristics (e.g., confidence filtering, uncertainty estimation).
  • Elegant dual-teacher feedback design: The combination of two feedback types (agreement/disagreement) and two receivers (high/low confidence) creates rich learning dynamics, breaking the uniform-update limitation of single feedback.
  • Closed loop between theory and experiment: The theoretical foundation of the feedback mechanism is derived from the bilevel optimization of meta-learning, and its effectiveness is validated through qualitative ablations under 8 constraint conditions.
  • Practical value: No additional overhead at inference; performance gains are most significant in extremely label-scarce settings (5% labels).

Limitations & Future Work

  • Some performance fluctuations are observed on the BraTS dataset, possibly due to overfitting caused by the small validation set (25 samples).
  • The dual-teacher architecture increases training memory and time costs (though lower than TraCoCo).
  • Computing the feedback signal requires additional forward passes (the student is first updated, then evaluated on labeled data), increasing computational overhead.
  • Validation is limited to 3D medical image segmentation; the method has not been extended to 2D natural images or other medical tasks.
  • The interaction between the two feedback types is complex; certain combinations can lead to collapse (e.g., \(\delta_d < 0\) used alone causes alternating prediction erosion), requiring careful design.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The feedback perspective offers a fresh reexamination of teacher-student interactions; the dual-teacher feedback design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets, multiple labeling ratios, extensive ablations and qualitative analyses.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear, though the dense notation requires some background knowledge.
  • Value: ⭐⭐⭐⭐⭐ — Makes an important contribution to the SSMIS field; the method is generalizable to other semi-supervised scenarios.