Domain-Adaptive Transformer for Data-Efficient Glioma Segmentation in Sub-Saharan MRI

Conference: NeurIPS 2025
arXiv: 2511.02928
Code: None
Area: Medical Imaging
Keywords: Glioma segmentation, domain adaptation, Transformer, resource-constrained, BraTS-Africa

TL;DR

This paper proposes SegFormer3D+, a domain-adaptive Transformer architecture tailored for heterogeneous MRI data from Sub-Saharan Africa. By integrating histogram matching, radiomics-guided stratified sampling, a frequency-aware dual-path encoder, and a dual attention mechanism, the model achieves a mean Dice of 0.81 for glioma segmentation with only 60 annotated cases for fine-tuning, outperforming nnU-Net (mean Dice 0.79) by 0.02, or about 2.5% in relative terms.

Background & Motivation

Background: Glioma is the most common malignant primary brain tumor in adults, and MRI is the gold standard for diagnosis and treatment planning. Deep learning segmentation methods such as nnU-Net and Swin-UNETR have demonstrated strong performance on high-quality datasets.

Limitations of Prior Work: Most models are trained on data from well-resourced institutions and suffer severe performance degradation when applied to MRI data from Sub-Saharan Africa (SSA). SSA scans typically exhibit lower resolution, increased motion artifacts, and inconsistent contrast due to aging scanners and heterogeneous acquisition protocols, resulting in substantial domain shift.

Key Challenge: The BraTS-Africa challenge introduced the first annotated glioma MRI dataset from SSA medical centers, yet it contains only 60 training cases. Existing methods individually explore histogram normalization, radiomics features, dual-path encoders, or attention mechanisms, but no prior work has systematically unified these techniques into a single domain-adaptive framework.

Goal: To design a robust segmentation architecture under conditions of severely limited annotated data and pronounced domain shift.

Key Insight: Approaching the problem from a systems engineering perspective, the paper combines multiple well-validated domain adaptation techniques into a unified framework—intensity normalization to address scanner variability, radiomics-based stratification to ensure balanced training, a frequency-aware encoder to capture artifact patterns, and dual attention to enhance fine-grained representations.

Core Idea: To integrate histogram matching, radiomics-guided stratification, a frequency-aware dual-path encoder, and spatial-channel dual attention into a unified domain-adaptive segmentation framework for robust glioma segmentation on low-resource MRI.

Method

Overall Architecture

The SegFormer3D+ pipeline takes multi-parametric MRI (T1, T1CE, T2, FLAIR) as input. It proceeds through histogram matching for intensity normalization → radiomics feature extraction for stratified sampling → a frequency-aware dual-path stem for low- and high-frequency feature extraction → a four-stage hierarchical Transformer encoder → spatial and channel dual attention fusion → a decoder producing segmentation maps for three tumor subregions (WT/TC/ET). Pre-training is performed on BraTS 2023 (\(n=1251\)), followed by fine-tuning on BraTS-Africa (\(n=60\)).

Key Designs

Histogram Matching Intensity Normalization:
- Function: Eliminates voxel intensity distribution discrepancies across different scanners.
- Mechanism: A high-quality BraTS 2023 T1CE scan is selected as reference. The cumulative distribution functions \(F_s\) and \(F_r\) are computed for source image \(I_s\) and reference image \(I_r\), respectively. A monotonic mapping \(M(x) = F_r^{-1}(F_s(x))\) is applied to perform voxel-wise transformation: \(\hat{I}_s = M(I_s)\).
- Design Motivation: Scanners from different SSA centers produce markedly different intensity distributions, constituting one of the primary sources of domain shift.
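
The mapping \(M(x) = F_r^{-1}(F_s(x))\) can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function name `match_histogram` is hypothetical, the CDFs are empirical, and \(F_r^{-1}\) is approximated by linear interpolation.

```python
import numpy as np

def match_histogram(source, reference):
    """Map source intensities so their empirical CDF follows the reference CDF."""
    src_values, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical CDFs F_s and F_r evaluated at the sorted unique intensities.
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size

    # M(x) = F_r^{-1}(F_s(x)); the inverse is approximated by interpolation.
    mapped = np.interp(src_cdf, ref_cdf, ref_values)

    # Look up the mapped value for every voxel and restore the volume shape.
    return mapped[src_idx.ravel()].reshape(source.shape)
```

Because the mapping is monotonic, the intensity ordering of voxels within a scan is preserved; only the distribution is reshaped toward the reference.
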
Radiomics-Guided Stratified Sampling:
- Function: Ensures the training data spans the domain distribution across varying acquisition quality levels.
- Mechanism: Eighteen first-order radiomics features (mean, variance, skewness, kurtosis, energy, entropy, etc.) are extracted from normalized T2-FLAIR volumes, reduced to 10 dimensions via PCA, and clustered into \(k=3\) groups using k-means. Stratified 5-fold cross-validation is then applied to BraTS-Africa.
- Design Motivation: Prevents the model from overfitting to dominant acquisition patterns and ensures that each fold contains scans of diverse quality.
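
The stratification steps above (features, PCA, k-means, stratified folds) can be sketched as follows. This is an illustrative NumPy-only version under stated assumptions: only six of the eighteen first-order features are computed, the k-means is a plain Lloyd loop, the fold assignment is a simple round-robin within each cluster, and all function names are hypothetical. The paper's exact feature set and tooling are not specified here.

```python
import numpy as np

def first_order_features(volume, bins=32):
    """Six first-order features of one volume (a subset, for illustration)."""
    v = volume.ravel().astype(np.float64)
    mean, var = v.mean(), v.var()
    std = np.sqrt(var) + 1e-8
    skewness = np.mean(((v - mean) / std) ** 3)
    kurtosis = np.mean(((v - mean) / std) ** 4)
    energy = np.sum(v ** 2)
    hist, _ = np.histogram(v, bins=bins)
    p = hist[hist > 0] / v.size
    entropy = -np.sum(p * np.log2(p))
    return np.array([mean, var, skewness, kurtosis, energy, entropy])

def pca_reduce(features, n_components):
    """Standardize features, then project onto the leading principal axes."""
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def stratified_folds(labels, n_folds=5, seed=0):
    """Deal the cases of each cluster round-robin into the folds."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(labels), dtype=int)
    offset = 0
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        folds[idx] = (np.arange(len(idx)) + offset) % n_folds
        offset += len(idx)
    return folds
```

With 60 cases and five folds, this assignment puts 12 cases in every fold, and the number of cases from any one cluster differs by at most one across folds.
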
Frequency-Aware Dual-Path Stem:
- Function: Simultaneously captures low-frequency structural information and high-frequency detail/artifact features at the encoder input.
- Mechanism: Two parallel 3D depthwise separable convolutions approximate low-pass and high-pass filtering: \(x_{\text{low}} = \text{DepthwiseConv3D}_{\text{low}}(x)\), \(x_{\text{high}} = \text{DepthwiseConv3D}_{\text{high}}(x) - x_{\text{low}}\), and \(x_{\text{stem}} = \text{Concat}([x_{\text{low}}, x_{\text{high}}])\). The low-pass path uses uniform initialization (\(1/27\) per kernel weight, i.e. a \(3 \times 3 \times 3\) mean filter), while the high-pass path uses Kaiming initialization.
- Design Motivation: MRI from low-resource environments frequently contains frequency-domain artifacts and noise patterns that a single convolutional stem cannot simultaneously capture; this design also avoids the computational overhead of explicit wavelet transforms.
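
A minimal NumPy sketch of the dual-path stem at initialization is given below. It is an assumption-laden illustration: only the depthwise part of the separable convolution is shown (the pointwise \(1 \times 1 \times 1\) mixing is omitted), zero padding is assumed, and the helper names are hypothetical. In the actual model both kernels would be trainable.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_conv3d(x, kernels):
    """Per-channel 3x3x3 cross-correlation with zero padding.

    x: (C, D, H, W); kernels: (C, 3, 3, 3).
    """
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    # windows has shape (C, D, H, W, 3, 3, 3).
    windows = sliding_window_view(padded, (3, 3, 3), axis=(1, 2, 3))
    return np.einsum("cdhwijk,cijk->cdhw", windows, kernels)

def kaiming_normal(channels, rng):
    """Kaiming-normal weights; fan-in is 27 taps per depthwise filter."""
    return rng.normal(0.0, np.sqrt(2.0 / 27.0), size=(channels, 3, 3, 3))

def frequency_stem(x, high_kernels):
    """Concatenate a low-pass path and a high-pass residual path."""
    channels = x.shape[0]
    # Low-pass path: uniform 1/27 weights, i.e. a local mean filter.
    low_kernels = np.full((channels, 3, 3, 3), 1.0 / 27.0)
    x_low = depthwise_conv3d(x, low_kernels)
    # High-pass path: second convolution minus the low-pass output.
    x_high = depthwise_conv3d(x, high_kernels) - x_low
    return np.concatenate([x_low, x_high], axis=0)
```

As a sanity check on the idea, if the second kernel were the identity (a single 1 at the center), `x_high` would reduce to `x - x_low`, the classic "image minus blur" high-pass filter.
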
Spatial-Channel Dual Attention Fusion:
- Function: Enhances representations of tumor-relevant spatial regions and discriminative feature channels.
- Mechanism: Spatial attention \(A_s = \sigma(\text{Conv3D}([\text{MaxPool}(F), \text{AvgPool}(F)]))\); channel attention \(A_c = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot \text{GAP}(F)))\); final features \(F' = F \odot A_s \odot A_c\).
- Design Motivation: The cascaded spatial and channel attention modules respectively highlight tumor spatial locations and discriminative feature channels, which is particularly important for refining ET subregion boundaries in low-contrast scans.
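
The attention formulas above can be traced with the following NumPy sketch. Several details are assumptions rather than facts from the paper: the spatial convolution uses a \(3 \times 3 \times 3\) kernel with zero padding, pooling for spatial attention is taken across the channel axis, biases are omitted, and all function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(feat, kernel):
    """A_s = sigmoid(Conv3D([MaxPool(F), AvgPool(F)])); kernel: (2, 3, 3, 3)."""
    pooled = np.stack([feat.max(axis=0), feat.mean(axis=0)])  # (2, D, H, W)
    padded = np.pad(pooled, ((0, 0), (1, 1), (1, 1), (1, 1)))
    windows = sliding_window_view(padded, (3, 3, 3), axis=(1, 2, 3))
    logits = np.einsum("cdhwijk,cijk->dhw", windows, kernel)
    return sigmoid(logits)[None]  # (1, D, H, W), broadcast over channels

def channel_attention(feat, w1, w2):
    """A_c = sigmoid(W2 . ReLU(W1 . GAP(F))); w1: (C/r, C), w2: (C, C/r)."""
    gap = feat.mean(axis=(1, 2, 3))           # global average pooling, (C,)
    hidden = np.maximum(w1 @ gap, 0.0)        # ReLU bottleneck
    return sigmoid(w2 @ hidden)[:, None, None, None]  # (C, 1, 1, 1)

def dual_attention(feat, kernel, w1, w2):
    """F' = F * A_s * A_c with broadcasting over space and channels."""
    return feat * spatial_attention(feat, kernel) * channel_attention(feat, w1, w2)
```

Since both gates lie in \((0, 1)\), the output can only attenuate features, never amplify them; the network learns which locations and channels to keep close to full strength.
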

Loss & Training

Composite Dice–cross-entropy loss: \(\mathcal{L} = (1 - \frac{2|P \cap G|}{|P| + |G|}) + CE(P, G)\)
Optimizer: AdamW (lr=\(1\text{e}{-4}\), weight decay=\(1\text{e}{-5}\), cosine schedule)
Data augmentation: random flipping, affine transforms (±10° rotation, 0.9–1.1 scaling), z-score normalization
Pre-training on BraTS 2023 for 75 epochs → fine-tuning on BraTS-Africa for 25 epochs (early stopping, patience=20)
Post-processing: connected component analysis retaining the largest connected component per class
Random 3D crop of \(96^3\), batch size 2
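
The composite loss can be illustrated with a small NumPy sketch. It assumes a soft Dice term (products of probabilities in place of the set intersection), a single Dice value pooled over all classes, and equal weighting of the two terms; the paper may average Dice per class or weight the terms differently. The function name is hypothetical.

```python
import numpy as np

def dice_ce_loss(probs, onehot, eps=1e-6):
    """Soft Dice + cross-entropy.

    probs:  (K, N) predicted class probabilities per voxel.
    onehot: (K, N) one-hot ground truth.
    """
    # Soft version of 2|P ∩ G| / (|P| + |G|).
    intersection = np.sum(probs * onehot)
    dice = 2.0 * intersection / (np.sum(probs) + np.sum(onehot) + eps)
    # Voxel-averaged cross-entropy CE(P, G).
    ce = -np.mean(np.sum(onehot * np.log(probs + eps), axis=0))
    return (1.0 - dice) + ce
```

A perfect prediction drives both terms to roughly zero, while a uniform prediction over four classes gives a Dice term of 0.75 plus a cross-entropy of \(\ln 4\).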

Key Experimental Results

Main Results (BraTS-Africa Validation Set, \(n=35\))

| Method | WT Dice | TC Dice | ET Dice | Mean Dice | HD95 |
|---|---|---|---|---|---|
| 3D U-Net | 0.86±0.03 | 0.71±0.05 | 0.68±0.06 | 0.75 | — |
| SegFormer3D | 0.88±0.03 | 0.73±0.04 | 0.70±0.05 | 0.77 | — |
| nnU-Net | 0.90±0.02 | 0.76±0.04 | 0.72±0.05 | 0.79 | 13.7+ |
| Swin-UNETR | 0.89±0.02 | 0.77±0.04 | 0.73±0.05 | 0.80 | — |
| SegFormer3D+ | 0.91±0.02 | 0.79±0.03 | 0.74±0.04 | 0.81 | 12.5 |
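
As a quick arithmetic check, the Mean Dice column is consistent with the unweighted average of the WT, TC, and ET scores, and the gain over nnU-Net works out to about 0.02 absolute, or roughly 2.5% relative. The snippet below uses only the values from the table; the variable names are illustrative.

```python
# (WT, TC, ET) Dice scores copied from the results table.
dice = {
    "3D U-Net": (0.86, 0.71, 0.68),
    "SegFormer3D": (0.88, 0.73, 0.70),
    "nnU-Net": (0.90, 0.76, 0.72),
    "Swin-UNETR": (0.89, 0.77, 0.73),
    "SegFormer3D+": (0.91, 0.79, 0.74),
}

# Unweighted mean over the three tumor subregions.
mean_dice = {name: sum(scores) / len(scores) for name, scores in dice.items()}

# Gain of SegFormer3D+ over nnU-Net, absolute and relative.
absolute_gain = mean_dice["SegFormer3D+"] - mean_dice["nnU-Net"]
relative_gain = absolute_gain / mean_dice["nnU-Net"]

for name, value in mean_dice.items():
    print(f"{name}: {value:.2f}")
print(f"absolute gain: {absolute_gain:.2f}, relative gain: {100 * relative_gain:.1f}%")
```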

Ablation Study

| Configuration | WT | TC | ET | Mean Dice | p-value |
|---|---|---|---|---|---|
| Full (Ours) | 0.91 | 0.79 | 0.74 | 0.81 | — |
| w/o Histogram Matching | 0.89 | 0.77 | 0.72 | 0.79 (−0.02) | 0.031 |
| w/o Frequency Stem | 0.90 | 0.78 | 0.73 | 0.80 (−0.01) | 0.089 |
| w/o Dual Attention | 0.89 | 0.76 | 0.71 | 0.79 (−0.02) | 0.019 |
| w/o Radiomics Stratification | 0.90 | 0.78 | 0.73 | 0.80 (−0.01) | 0.067 |
| All Removed | 0.88 | 0.73 | 0.70 | 0.77 (−0.04) | <0.001 |

Key Findings

The dual attention module contributes most (Dice drops by 0.02 upon removal, \(p=0.019\)), particularly improving ET boundary refinement.
Histogram matching ranks second in contribution (reported as +1.5%; the rounded ablation table shows a 0.02 drop upon removal, \(p=0.031\)), effectively reducing scanner-specific intensity bias.
The cumulative gain from all components is +4 percentage points (0.77 → 0.81), with \(p < 0.001\) when all components are removed.
HD95 decreases from the baseline range of 13.7–16.1 to 12.5, indicating more precise boundary localization.
The transfer learning strategy is effective: large-scale BraTS 2023 pre-training followed by few-shot fine-tuning on BraTS-Africa.

Highlights & Insights

Systems engineering perspective: Rather than pursuing a single novel component, the paper systematically integrates multiple validated techniques into a unified framework, which is more practical for resource-constrained scenarios.
Radiomics-guided stratification is a distinctive contribution—leveraging established tools from the tumor imaging field to address sampling bias in deep learning training.
The frequency-aware stem is elegantly simple: low/high frequency decomposition is achieved solely through different initialization strategies (uniform vs. Kaiming) and a subtraction of the low-pass output from the second path, without the need for complex wavelet transforms.
The work has direct equity implications for low-resource healthcare settings in Africa.

Limitations & Future Work

Only 60 training cases limit generalizability; future work requires larger SSA cohorts.
Self-supervised pre-training is unexplored and may be more effective than supervised pre-training under severe annotation scarcity.
Two ablations do not reach the conventional 0.05 significance level (frequency stem \(p=0.089\), radiomics stratification \(p=0.067\)), so the individual contributions of these components are not statistically established.
No comparison with recent foundation models (e.g., SAM-Med, UniSeg).
The choice of reference image for histogram matching may introduce bias.

vs. nnU-Net: The self-configuring approach performs well on standard data but falls short of domain-specific designs under severe domain shift; this paper reports a mean Dice of 0.81 versus 0.79 for nnU-Net (about +2.5% relative).
vs. Swin-UNETR: Both adopt Transformer architectures, but Swin-UNETR is not designed for domain shift; this paper's key advantages lie in dual attention and frequency-aware encoding.
vs. isolated domain adaptation techniques: Prior studies typically validate individual techniques in isolation (e.g., histogram matching alone or attention alone); this paper presents the first systematic evaluation of their combined effect.
The methodology offers transferable insights for other resource-constrained medical imaging scenarios (e.g., rural ultrasound, mobile CT).

Rating

Novelty: ⭐⭐⭐ — All components are based on existing techniques; however, their systematic integration for the specific scenario of SSA glioma segmentation carries engineering value.
Experimental Thoroughness: ⭐⭐⭐⭐ — Includes main results, ablation studies, qualitative analysis, and statistical significance testing, though the dataset scale is small.
Writing Quality: ⭐⭐⭐⭐ — Well-structured with detailed method descriptions; some equations could be made more concise.
Value: ⭐⭐⭐⭐ — Provides practical value for low-resource medical AI deployment and represents an important direction toward fairness and accessibility.