
Bidirectional Channel-selective Semantic Interaction for Semi-Supervised Medical Segmentation

Conference: AAAI 2026 · arXiv: 2601.05855 · Code: Available · Area: Medical Imaging · Keywords: Semi-supervised learning, medical image segmentation, channel selection, bidirectional interaction, weak-to-strong consistency

TL;DR

This paper proposes the BCSI framework, which employs a channel-selection router to dynamically identify critical feature channels and performs bidirectional channel-level interaction between labeled and unlabeled data streams. Combined with semantic-spatial perturbation-based weak-to-strong consistency learning, BCSI achieves substantial improvements in semi-supervised medical image segmentation.

Background & Motivation

Semi-supervised medical image segmentation aims to train segmentation models using a small amount of labeled data alongside large quantities of unlabeled data. Existing methods are primarily built upon two major frameworks:

Mean Teacher (MT) Framework: The student network is updated via gradient backpropagation, while the teacher network is updated via EMA. However, the teacher model is susceptible to accumulated prediction errors from the student model.

Dual-stream Consistency Framework: Employs multi-decoder or dual-branch architectures and enforces prediction consistency across different network heads for the same input to improve generalization. However, these methods tend to converge to similar decision boundaries and incur high computational costs.

Three core limitations of existing methods:

  • Error accumulation: Teacher model performance degrades in the MT framework
  • Model complexity: Multi-decoder structures introduce additional computational overhead
  • Lack of cross-stream interaction: Labeled and unlabeled data are trained separately, neglecting the potential interaction between the two

AllSpark observes that decoupled training causes labeled data to dominate, yielding low-quality pseudo-labels; SKCDF proposes decoupled data streams. Yet neither method supports bidirectional data interaction, and neither differentiates the contribution of individual channels — operating on all channels indiscriminately introduces redundant information and noise.

Method

Overall Architecture

The BCSI framework adopts a single encoder-decoder architecture (VNet) comprising three core components:

  1. Semantic-Spatial Perturbation (SSP): Applies two strong augmentations and one weak augmentation to the input data
  2. Channel-selection Router (CR): Dynamically selects the most relevant feature channels for interaction
  3. Bidirectional Channel-level Interaction (BCI): Performs bidirectional feature exchange between labeled and unlabeled data on the selected channels

Key Designs

1. Semantic-Spatial Perturbation (SSP)

Rather than relying on multi-decoder or dual-stream architectures, SSP enhances model robustness through two strong augmentation strategies:

  • Semantic Perturbation (Color Jitter): \(x_{s_{col}}^u = \alpha \cdot x^u + \beta + \mathcal{N}(\mu, \sigma^2)\), randomly altering brightness and contrast while adding Gaussian noise
  • Spatial Perturbation (Copy-Paste): \(x_{s_{mix}}^u = \mathcal{M} \odot x^l + (1 - \mathcal{M}) \odot x^u\), mixing spatial regions of labeled and unlabeled data via random binary masks

All three augmented views are fed through the model simultaneously: the weakly augmented prediction serves as the pseudo-label supervising both strongly augmented predictions, while a consistency constraint is additionally enforced between the two strong-augmentation predictions.
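The two perturbations above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: `alpha`, `beta`, and `sigma` are placeholder jitter parameters, and the mask here is an element-wise random mask rather than the cuboid copy-paste region a real Copy-Paste augmentation would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_perturb(x, alpha=1.1, beta=0.05, sigma=0.02):
    """Semantic perturbation (color jitter): scale contrast (alpha),
    shift brightness (beta), and add Gaussian noise."""
    return alpha * x + beta + rng.normal(0.0, sigma, size=x.shape)

def spatial_perturb(x_lab, x_unlab, mask):
    """Spatial perturbation (copy-paste): mix labeled and unlabeled
    inputs through a binary mask M, as in x_mix = M⊙x^l + (1−M)⊙x^u."""
    return mask * x_lab + (1.0 - mask) * x_unlab

x_l = rng.random((8, 8))                        # toy "labeled" slice
x_u = rng.random((8, 8))                        # toy "unlabeled" slice
m = (rng.random((8, 8)) > 0.5).astype(float)    # random binary mask

x_col = semantic_perturb(x_u)
x_mix = spatial_perturb(x_l, x_u, m)

# wherever the mask is 1, the mixed image carries labeled content
assert np.allclose(x_mix[m == 1], x_l[m == 1])
```

In training, the weakly augmented view of the same input would pass through the model once more to produce the pseudo-label that supervises both `x_col` and `x_mix` predictions.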

2. Channel-selection Router (CR)

Not all channels are beneficial for feature interaction. CR dynamically selects the most informative channels through a learned mechanism:

  • Input: Labeled features \(\mathcal{F}^l \in \mathbb{R}^{C \times h \times w \times d}\) and unlabeled features \(\mathcal{F}^u\) extracted by the encoder
  • Routing scores: A lightweight router \(\mathcal{G}(\cdot)\) generates channel importance scores \(\mathbf{s} \in \mathbb{R}^C\)
  • Sparse mask: A sparse channel mask is constructed via Top-K selection: \(\mathcal{R} = \delta(\mathbf{s} \geq \tau_K(\mathbf{s}))\)
  • Feature selection: \(\mathcal{F}_{sub}^l = \mathcal{F}^l \odot \mathcal{R}^l\), retaining only the \(K\) most relevant channels

In experiments, \(K=64\) out of 256 total channels is used. Compared to random selection, the router yields a Dice improvement of 1.17% on the LA dataset.
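The routing steps can be made concrete with a toy sketch. Everything here is an assumption for illustration: the router is stood in for by a random linear gate over global-average-pooled channel descriptors, whereas the paper's lightweight router \(\mathcal{G}(\cdot)\) is learned end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_router(feat, w, k):
    """Toy channel-selection router: pool each channel into a descriptor,
    score it with a linear gate, and keep the Top-K channels via a
    sparse binary mask R = δ(s >= τ_K(s))."""
    c = feat.shape[0]
    pooled = feat.reshape(c, -1).mean(axis=1)    # (C,) per-channel descriptor
    scores = w @ pooled                          # routing scores s ∈ R^C
    tau = np.sort(scores)[-k]                    # τ_K: K-th largest score
    mask = (scores >= tau).astype(feat.dtype)    # sparse channel mask R
    selected = feat * mask[:, None, None]        # F_sub = F ⊙ R
    return selected, mask

feat = rng.random((256, 4, 4))        # toy encoder features, C = 256
w = rng.normal(size=(256, 256))       # placeholder gate weights (learned in practice)
f_sub, r = channel_router(feat, w, k=64)

assert int(r.sum()) == 64             # exactly K channels retained
assert np.allclose(f_sub[r == 0], 0)  # non-selected channels zeroed out
```

With continuous scores the Top-K threshold keeps exactly \(K\) channels; the retained sub-tensor \(\mathcal{F}_{sub}\) is what enters the interaction module, and the mask is reused later to write results back.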

3. Bidirectional Channel-level Interaction (BCI)

BCI enables deep information exchange between labeled and unlabeled data over the selected channels:

Feature queues: Two randomly initialized feature queues \(\mathcal{Q}^l, \mathcal{Q}^u \in \mathbb{R}^{M \times L}\) (\(M\) = queue capacity, \(L\) = flattened per-channel feature length) store historical features and are updated first-in-first-out (FIFO).

Similarity retrieval: Cosine similarity between selected channels and stored features is computed to retrieve the most similar features:

\[\mathcal{F}_q^l = \{\arg\max_{f_q \in \mathcal{Q}^l} \text{Sim}(\mathcal{F}_{sub,k}^l, f_q)\}_{k=1}^K\]

Cross-attention interaction: Bidirectional feature fusion is achieved via Q-K-V attention:

\[\tilde{\mathcal{F}}_{sub}^l = \sigma(\mathbf{Q}(\mathcal{F}_{sub}^l) \cdot \mathbf{K}(\mathcal{F}_q^u)^\top / \sqrt{d}) \cdot \mathbf{V}(\mathcal{F}_q^u) + \mathcal{F}_{sub}^l\]

Feature re-injection: The interacted features are re-injected into the original feature map via the sparse mask: \(\tilde{\mathcal{F}}^l = \tilde{\mathcal{F}}_{sub}^l \odot \mathcal{R}^l + \mathcal{F}^l \odot (1 - \mathcal{R}^l)\)
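The retrieval and cross-attention steps can be sketched as follows. This is a simplified single-head NumPy illustration under stated assumptions: projection weights are identity matrices standing in for the learned \(\mathbf{Q}, \mathbf{K}, \mathbf{V}\) projections, the queue is a static random buffer rather than a FIFO-updated one, and the re-injection step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def retrieve(f_sub, queue):
    """For each selected channel (flattened to a vector), fetch the
    most cosine-similar entry from the cross-stream feature queue."""
    f = f_sub / (np.linalg.norm(f_sub, axis=1, keepdims=True) + 1e-8)
    q = queue / (np.linalg.norm(queue, axis=1, keepdims=True) + 1e-8)
    idx = (f @ q.T).argmax(axis=1)        # nearest queue entry per channel
    return queue[idx]

def cross_attention(f_sub, f_q, wq, wk, wv):
    """Single-head Q-K-V attention with a residual connection,
    treating each selected channel as one token."""
    d = wq.shape[1]
    q, k, v = f_sub @ wq, f_q @ wk, f_q @ wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v + f_sub               # residual keeps original content

K, L, M = 64, 16, 128                     # channels, per-channel length, queue size
f_sub = rng.random((K, L))                # selected labeled channels (flattened)
queue_u = rng.random((M, L))              # unlabeled feature queue
wq = wk = wv = np.eye(L)                  # identity stand-ins for learned projections

f_q = retrieve(f_sub, queue_u)            # similarity retrieval
f_tilde = cross_attention(f_sub, f_q, wq, wk, wv)
assert f_tilde.shape == (K, L)
```

The symmetric direction (unlabeled channels attending to the labeled queue) runs the same way, and the interacted channels are finally written back through the sparse mask per the re-injection equation.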

Loss & Training

The total loss consists of three terms: \(\mathcal{L}_{total} = \mathcal{L}_{sup} + \mathcal{L}_{cons} + \lambda_u \mathcal{L}_{unsup}\)

  • Supervised loss: Computes a weighted segmentation loss \(\mathcal{L}_{seg} = \mathcal{L}_{BCE} + \mathcal{L}_{IoU}\) over the three augmented views (two strong, one weak) of labeled data, guided by uncertainty weighting
  • Unsupervised loss: The weakly augmented prediction serves as a pseudo-label to supervise both strongly augmented predictions
  • Consistency loss: MSE-based consistency constraint between the two strong augmentation predictions
  • Warmup schedule: \(\lambda_u(t) = 0.1 \times e^{-5(1-t/t_{max})^2}\)
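The warmup schedule is a standard Gaussian ramp-up and is easy to verify numerically; the function name below is illustrative:

```python
import math

def unsup_weight(t, t_max, lam_max=0.1):
    """Gaussian ramp-up for the unsupervised loss weight:
    lambda_u(t) = lam_max * exp(-5 * (1 - t/t_max)^2)."""
    return lam_max * math.exp(-5.0 * (1.0 - t / t_max) ** 2)

assert abs(unsup_weight(30000, 30000) - 0.1) < 1e-12  # full weight at t_max
assert unsup_weight(0, 30000) < 0.001                 # near zero at the start
```

The weight starts near zero so that early, noisy pseudo-labels barely influence training, then rises smoothly to 0.1 by the final iteration.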

Training configuration: SGD optimizer (lr = 0.01, momentum = 0.9), 30k iterations, batch size 4, on a single NVIDIA RTX 4090 GPU.

Key Experimental Results

Main Results

Comparisons against 11 state-of-the-art methods on three 3D medical segmentation benchmarks:

Left Atrium dataset (100 3D MRI scans):

| Method | Dice↑ (10%) | 95HD↓ (10%) | Dice↑ (20%) | 95HD↓ (20%) |
| --- | --- | --- | --- | --- |
| VNet (SupOnly) | 82.74 | 13.35 | 84.93 | 14.50 |
| BCP | 89.62 | 6.81 | 90.38 | 6.68 |
| UnCo | 90.37 | 6.11 | 90.91 | 5.36 |
| BCSI (Ours) | 91.07 | 5.57 | 91.84 | 5.06 |

BraTS-2019 dataset (335 multi-modal MRI scans):

| Method | Dice↑ (10%) | 95HD↓ (10%) | Dice↑ (20%) | 95HD↓ (20%) |
| --- | --- | --- | --- | --- |
| VNet (SupOnly) | 74.43 | 37.11 | 80.16 | 22.68 |
| UnCo | 85.09 | 8.63 | 85.16 | 8.41 |
| BCSI (Ours) | 86.17 | 8.43 | 86.86 | 7.62 |

Pancreas-CT dataset (82 abdominal CT scans): BCSI achieves a Dice of 80.41% under 10% labeled data (vs. 78.53% for UnCo), with 95HD reduced to 6.33.

Ablation Study

Component ablation (LA 20%):

| SSP | BCI | CR | Dice↑ | 95HD↓ |
| :-: | :-: | :-: | --- | --- |
|  |  |  | 88.60 | 10.98 |
| ✓ |  |  | 90.58 | 6.43 |
| ✓ | ✓ |  | 89.23 | 7.50 |
| ✓ | ✓ | ✓ | 91.84 | 5.06 |

Interaction direction ablation: Bidirectional interaction (Lab↔Unlab, Dice=91.84%) outperforms unidirectional variants (Lab→Unlab=91.42%, Lab←Unlab=91.39%).

Number of selected channels: \(K=64\) yields the best performance (91.84%), while using all channels (\(K=256\)) performs worst (91.27%), confirming the necessity of selective interaction.

Key Findings

  • SSP weak-to-strong consistency outperforms conventional MT structures (Dice improvement of roughly 2%)
  • Bidirectional interaction surpasses unidirectional interaction; the labeled→unlabeled direction is marginally superior to the reverse
  • Router-based selection outperforms random selection (+1.17% Dice), demonstrating that dynamic channel selection effectively reduces noise interference
  • On BraTS-2019 with 20% labeled data, BCSI surpasses fully supervised VNet in terms of 95HD

Highlights & Insights

  1. Novel channel-selection paradigm: Not all feature channels are suitable for interaction; selecting only Top-K channels for exchange constitutes a form of "precise perturbation" that enriches features while avoiding redundant noise
  2. Single-model architecture: Avoids the complexity of dual-stream/multi-decoder designs, achieving diversity through data augmentation strategies rather than architectural modifications
  3. Feature queue design: FIFO queues storing historical features enhance the model's long-term memory, conceptually analogous to MoCo's momentum queue
  4. Dual-role recognition: The paper is the first to explicitly characterize feature interaction as simultaneously serving as augmentation and perturbation, leveraging this duality to improve model robustness

Limitations & Future Work

  • The paper does not discuss the applicability of the method to 2D medical images; experiments are conducted exclusively on 3D data
  • The CR router design is relatively simple (lightweight network); more sophisticated routing strategies such as MoE-style gating could be explored
  • The maximum queue length is fixed (2560); adaptive adjustment warrants further investigation
  • Comparisons with foundation model-based semi-supervised methods such as SAM are not fully addressed (only briefly mentioned in the discussion)
  • The channel-selection paradigm is generalizable to other tasks requiring cross-stream interaction, such as domain adaptation and federated learning

Related Work

  • AllSpark (CVPR 2024): Proposes channel-level cross-attention to regenerate labeled features from unlabeled data, but lacks selectivity
  • SKCDF: Decouples encoder and decoder roles but does not support bidirectional interaction
  • UniMatch: Introduces consistency training with an auxiliary feature perturbation stream, inspiring the weak-to-strong strategy

Rating

  • Novelty: ★★★★☆ — The channel-selection router and bidirectional interaction constitute valuable and original designs
  • Experimental Thoroughness: ★★★★★ — Three datasets, 11 comparison methods, and detailed ablation studies
  • Writing Quality: ★★★★☆ — Well-structured with detailed mathematical derivations
  • Value: ★★★★☆ — Single-model architecture with manageable computational overhead; code is publicly available