CLoE: Expert Consistency Learning for Missing Modality Segmentation¶
Conference: CVPR 2026 · arXiv: 2603.09316 · Code: N/A · Area: Medical Image Segmentation / Multimodal Learning · Keywords: missing modality, consistency learning, expert fusion, reliability gating, brain tumor
TL;DR¶
This work reformulates the robustness problem under missing modalities as decision-level expert consistency control. It proposes a dual-branch consistency learning scheme (global MEC + regional REC) coupled with a lightweight gating network that converts consistency scores into modality reliability weights, achieving an average WT Dice of 88.09% across 15 missing-modality combinations on BraTS 2020, surpassing all prior state-of-the-art methods.
Background & Motivation¶
Background: Multimodal MRI segmentation (e.g., four-modality T1/T1c/T2/FLAIR for brain tumor) typically assumes full modality availability at training time, employing encoder–decoder architectures such as U-Net and V-Net. In clinical practice, however, modality absence due to interrupted scans, protocol differences, or quality issues is extremely common.
Limitations of Prior Work: (1) GAN-based synthesis of missing modalities is computationally expensive and prone to hallucinations; (2) arithmetic fusion methods such as HeMIS cause attention mechanisms to fail when zero-padding is applied; (3) the spatial priors in RFNet are passive—they specify where to attend but provide no signal as to which expert is trustworthy; (4) consistency learning approaches such as Mean Teacher are dominated by background voxels in volumetric MRI, so global consistency does not imply alignment on small foreground structures.
Key Challenge: Modality absence is not merely a problem of reduced information; it also amplifies prediction disagreement among modality experts. Naïve fusion can amplify such disagreement rather than resolve it, particularly for small foreground structures of clinical importance.
Goal: How can inter-expert consistency be quantified as a reliability signal, and how can that signal guide dynamic fusion?
Key Insight: Elevate robustness from the representation level to the decision level—rather than learning better features, the goal is to control which expert's output is trusted.
Core Idea: Cosine similarity between expert predictions is used to quantify consistency at two levels—global (MEC) and foreground-region (REC)—and a gating network maps these consistency scores to fusion weights.
Method¶
Overall Architecture¶
CLoE comprises four core components: (1) parallel modality encoders \(\Phi_m\) that extract multi-scale features; (2) a shared expert decoder \(D^{sep}\) that generates single-modality predictions \(p^{(m)}\); (3) an ECL module that computes MEC/REC consistency scores and maps them to fusion weights \(w_m\) via a gating network; and (4) a fusion decoder \(D^{fuse}\) that takes the weighted fused features as input to produce the final segmentation.
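The four-stage dataflow above can be sketched as a minimal NumPy skeleton. Everything here is an illustrative assumption — the encoders, decoders, shapes, and the uniform weight stub stand in for the paper's actual networks, which are not released:

```python
import numpy as np

def encode(x):
    """Stub modality encoder Phi_m: returns one feature map per scale."""
    return [x.mean(axis=0, keepdims=True) * s for s in (1.0, 0.5)]  # two illustrative scales

def expert_decode(feats):
    """Stub shared expert decoder D_sep: per-modality foreground probabilities."""
    return 1.0 / (1.0 + np.exp(-feats[0].ravel()))  # sigmoid over shallow features

def fuse_decode(fused_feats):
    """Stub fusion decoder D_fuse: final segmentation probabilities."""
    return 1.0 / (1.0 + np.exp(-fused_feats[0].ravel()))

rng = np.random.default_rng(0)
volumes = {m: rng.normal(size=(4, 8)) for m in ("T1", "T1c", "T2")}  # FLAIR missing

feats = {m: encode(v) for m, v in volumes.items()}       # (1) parallel encoders
probs = {m: expert_decode(f) for m, f in feats.items()}  # (2) expert predictions p^(m)

# (3) ECL: consistency scores -> fusion weights w_m (uniform stub in this sketch)
w = {m: 1.0 / len(feats) for m in feats}

# (4) weighted multi-scale fusion f_l = sum_m w_m * f_l^(m), then fusion decoder
fused = [sum(w[m] * feats[m][l] for m in feats) for l in range(2)]
seg = fuse_decode(fused)
```

Note how a missing modality (FLAIR here) simply drops out of the dictionaries; all downstream sums run over available experts only.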
Key Designs¶
- Modality Expert Consistency (MEC) + Regional Expert Consistency (REC)
    - MEC: Probability prediction vectors from available experts are flattened and their pairwise cosine similarities are computed; the average over all available pairs serves as the loss: \(\mathcal{L}_{MEC} = \frac{1}{|\mathcal{P}|}\sum_{(a,b)}(1-\mathcal{S}(p^{(a)}, p^{(b)}))\)
    - REC: A lightweight projection head aggregates shallow features from available experts to generate a probabilistic region map \(r=\sigma(\pi(\frac{1}{|\mathcal{A}|}\sum f_1^{(m)}))\); cosine similarity is then computed on \(r\)-weighted predictions.
    - Design Motivation: MEC enforces global distributional alignment to prevent expert drift, while REC focuses on clinically critical foreground regions to avoid domination by background voxels.
- Consistency-Driven Dynamic Gating
    - For each expert \(m\), the global consistency score \(u_m\) and regional consistency score \(v_m\) are fed into a lightweight gating network \(\mathcal{G}\).
    - The network outputs reliability logits \(g_m = \mathcal{G}(u_m, v_m)\); softmax normalization over available experts yields fusion weights \(w_m\).
    - Multi-scale features are fused as \(f_\ell = \sum w_m \odot f_\ell^{(m)}\).
    - Design Motivation: Consistency is not only used as a training constraint but is also recycled as a reliability signal to guide fusion—consistent experts are trusted, while divergent ones are suppressed.
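The two consistency losses and the gating step can be sketched in NumPy. This follows the formulas above, but the gating network \(\mathcal{G}\) is stubbed as a fixed linear score \(u_m + v_m\), and the region map is a random binary stand-in — both are assumptions for illustration only:

```python
import numpy as np
from itertools import combinations

def cos_sim(a, b):
    """Cosine similarity S(a, b) between two flattened vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mec_loss(preds):
    """L_MEC: mean (1 - cosine similarity) over all available expert pairs."""
    pairs = list(combinations(preds, 2))
    return sum(1.0 - cos_sim(preds[a].ravel(), preds[b].ravel())
               for a, b in pairs) / len(pairs)

def rec_loss(preds, region_map):
    """L_REC: same pairwise term, computed on region-weighted predictions r * p."""
    pairs = list(combinations(preds, 2))
    return sum(1.0 - cos_sim((region_map * preds[a]).ravel(),
                             (region_map * preds[b]).ravel())
               for a, b in pairs) / len(pairs)

def fusion_weights(scores):
    """Gating: map per-expert (u_m, v_m) to softmax-normalized weights w_m.
    The learned network G is stubbed here as the fixed score u_m + v_m."""
    logits = {m: u + v for m, (u, v) in scores.items()}
    z = np.array(list(logits.values()))
    w = np.exp(z - z.max())
    w /= w.sum()
    return dict(zip(logits, w))

rng = np.random.default_rng(1)
p = {m: rng.random((4, 4)) for m in ("T1c", "T2", "FLAIR")}
r = (rng.random((4, 4)) > 0.5).astype(float)  # region map stand-in (binarized)
w = fusion_weights({"T1c": (0.9, 0.8), "T2": (0.7, 0.6), "FLAIR": (0.2, 0.1)})
```

With this construction, perfectly agreeing experts drive both losses to zero, and the expert with the highest consistency scores receives the largest fusion weight.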
Loss & Training¶
The overall objective is \(\mathcal{L}_{total} = \mathcal{L}_{seg} + \alpha \mathcal{L}_{ECL} + \beta \mathcal{L}_{contrast}\):

- \(\mathcal{L}_{seg}\): Weighted cross-entropy + Dice loss on fused predictions.
- \(\mathcal{L}_{ECL}\): Per-expert supervision (WCE + DL) + \(\eta(\mathcal{L}_{MEC} + \lambda_{rec}\mathcal{L}_{REC})\).
- \(\mathcal{L}_{contrast}\): Contrastive representation loss (SSIM for anatomical content alignment + cosine for modality style clustering + KL regularization).
- Training: Adam optimizer, lr = 0.0002, weight decay = 0.0001, batch size = 1, 500 epochs, input size = \(112^3\) 3D patch.
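The objective assembly can be sketched as follows. The WCE and soft Dice terms are standard formulations assumed here (the paper's exact class weighting is unspecified), and the contrastive term is omitted for brevity:

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss on probabilities p vs. binary target y."""
    inter = (p * y).sum()
    return 1.0 - (2 * inter + eps) / (p.sum() + y.sum() + eps)

def wce_loss(p, y, w_fg=2.0, eps=1e-7):
    """Weighted binary cross-entropy; foreground voxels up-weighted by w_fg
    (the weight value is an illustrative assumption)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(w_fg * y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def total_loss(p_fuse, expert_losses, l_mec, l_rec, y,
               alpha=1.0, beta=1.0, eta=1.0, lam_rec=1.0):
    """L_total = L_seg + alpha * L_ECL + beta * L_contrast."""
    l_seg = wce_loss(p_fuse, y) + dice_loss(p_fuse, y)
    l_ecl = sum(expert_losses) + eta * (l_mec + lam_rec * l_rec)
    l_contrast = 0.0  # SSIM + cosine + KL terms omitted in this sketch
    return l_seg + alpha * l_ecl + beta * l_contrast

y = np.array([0.0, 1.0, 1.0, 0.0])
p = np.array([0.1, 0.9, 0.8, 0.2])
loss = total_loss(p, expert_losses=[0.3, 0.25], l_mec=0.05, l_rec=0.1, y=y)
```

The hyperparameters \(\alpha, \beta, \eta, \lambda_{rec}\) default to 1.0 here purely as placeholders; the paper's tuned values are not reproduced.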
Key Experimental Results¶
Main Results¶
| Dataset | Metric | CLoE | DC-Seg | M³AE | RFNet | HeMIS |
|---|---|---|---|---|---|---|
| BraTS2020 (15-combo Avg) | WT Dice% | 88.09 | 87.54 | 86.90 | 86.98 | 75.10 |
| BraTS2020 (15-combo Avg) | TC Dice% | 80.23 | 79.63 | 79.10 | 78.23 | 65.45 |
| BraTS2020 (15-combo Avg) | ET Dice% | 65.06 | 65.00 | 61.70 | 61.47 | - |
| BraTS2020 Full | WT Dice% | 91.30 | 90.95 | 90.40 | 91.11 | 85.19 |
| MSD Prostate PZ Avg | Dice% | 80.12 | 79.59 | - | 77.35 | - |
Ablation Study¶
| Configuration | Avg Dice Change | ET Dice Change | Note |
|---|---|---|---|
| w/o REC | −1.98% | −3.41% | Regional consistency is critical for small foreground structures |
| w/o Weight Fusion | −2.47% | −3.96% | Dynamic fusion weights contribute the most |
| w/o MEC | −0.70% | — | Global consistency plays a fine-tuning role |
| w/o Gating Network | −0.47% | — | Parameterized contribution is limited but directionally correct |
Key Findings¶
- REC and Weight Fusion are the two most critical components of CLoE, yielding the largest gains on ET (the clinically most important enhancing tumor subregion).
- CLoE does not sacrifice performance in the full-modality setting (WT 91.30%), demonstrating that consistency constraints do not introduce degradation.
- Cross-dataset generalization (BraTS → MSD Prostate) validates the framework's applicability across domains.
- Even with bounding-box prompts, MedSAM fails to produce clear tumor boundaries, confirming the continued value of dedicated multimodal segmentation frameworks.
Highlights & Insights¶
- Reformulating robustness as decision-level consistency control is a clear and intuitively sound perspective.
- REC automatically identifies foreground regions of interest via a probabilistic region map without requiring manual ROI annotations, providing an elegant solution to the background-dominated consistency problem.
- The design of converting consistency scores into reliability weights effectively recycles the consistency signal as a fusion signal rather than treating it solely as a training constraint.
- Notable improvements on ET segmentation (+3.59% vs. RFNet), where foreground is extremely small, confirm the effectiveness of REC for clinically critical small targets.
Limitations & Future Work¶
- Ablation results indicate that MEC and the gating network each contribute modestly in isolation (−0.70% / −0.47%), suggesting potential for architectural simplification.
- Validation is limited to two datasets; evaluation on additional organs and modality combination patterns would be valuable.
- Under extreme missing-modality conditions (only one modality available), pairwise consistency comparison is infeasible and the meaning of consistency scores degrades.
- The probabilistic region map \(r\) is derived from shallow features, which may be unstable in early training stages.
- The gating network is highly compact (2D input → 1D output), and whether its expressive capacity is sufficient remains an open question.
Related Work & Insights¶
- vs. DC-Seg: DC-Seg performs latent-space disentanglement via VAE-based contrastive learning; CLoE adds decision-level consistency control and dynamic fusion on top of this, making the two approaches complementary.
- vs. M³AE: M³AE relies on large-scale masked autoencoder pre-training, yet CLoE surpasses it with a substantially lighter framework.
- vs. RFNet: RFNet employs region-aware spatial priors in a passive manner, whereas CLoE's REC actively learns foreground regions and explicitly quantifies expert reliability.
- The consistency-to-reliability framework is broadly applicable to other multi-source fusion scenarios, such as multi-sensor fusion in autonomous driving.
- The foreground-focused regional consistency idea offers transferable insights for small-object detection and segmentation tasks.
Rating¶
- Novelty: ⭐⭐⭐ — Each individual component is not entirely novel, but the problem formulation is clear and well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Full coverage of 15 missing-modality combinations, cross-dataset validation, and detailed ablation studies.
- Writing Quality: ⭐⭐⭐ — Methodology is described clearly, though space constraints leave some implementation details to the algorithm box.
- Value: ⭐⭐⭐ — Missing modality segmentation is an important clinical problem; performance gains are solid if not dramatic.