CLoE: Expert Consistency Learning for Missing Modality Segmentation

Conference: CVPR 2026
arXiv: 2603.09316
Code: Unavailable
Area: Medical Imaging
Keywords: Missing modality, multimodal segmentation, consistency learning, brain tumor segmentation, reliability gating

TL;DR

This paper proposes CLoE (Consistency Learning of Experts), which reformulates missing-modality robustness as a decision-level expert consistency control problem. Two complementary consistency branches, Modality Expert Consistency (MEC) and Region Expert Consistency (REC), reduce expert drift, and a gating network driven by consistency scores performs reliability-weighted fusion.

Background & Motivation

Multimodal MRI segmentation (e.g., brain tumor) frequently encounters missing modalities in clinical practice due to equipment failure or varying scanning protocols. Limitations of prior work:

Generative methods (GAN-based missing modality synthesis): Generation quality is unstable, and synthesized modalities can introduce artifacts.
Fixed-weight fusion / attention mechanisms (e.g., SE, CBAM): When missing modalities are filled with zero tensors, attention mechanisms become ineffective—magnitude-based attention cannot produce meaningful weights for zero inputs.
Consistency learning (e.g., Mean Teacher): Suffers from background dominance in volumetric MRI—global consistency can be satisfied without aligning small tumor regions.
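
The background-dominance issue in the last item can be made concrete with a toy example. The sketch below is illustrative only: the volume size, tumor size, and probability values are made-up assumptions, not figures from the paper. Two hypothetical experts agree on the background but disagree completely inside a small tumor, and a global cosine similarity barely registers the disagreement.

```python
import numpy as np

def cosine_similarity(x, y, eps=1e-8):
    """Cosine similarity between two arrays, flattened to vectors."""
    x, y = np.ravel(x), np.ravel(y)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

# Toy 32^3 volume with a 4^3 tumor: 64 of 32768 voxels are foreground.
shape = (32, 32, 32)
tumor = np.zeros(shape, dtype=bool)
tumor[14:18, 14:18, 14:18] = True

def make_probs(tumor_prob_inside):
    """Two-channel (background, tumor) probabilities for one toy expert."""
    p_tumor = np.where(tumor, tumor_prob_inside, 0.01)
    return np.stack([1.0 - p_tumor, p_tumor])

expert_a = make_probs(0.9)  # confidently finds the tumor
expert_b = make_probs(0.1)  # confidently misses it

# Similarity over the whole volume versus over tumor voxels only.
global_sim = cosine_similarity(expert_a, expert_b)
tumor_sim = cosine_similarity(expert_a[:, tumor], expert_b[:, tumor])
print(f"global: {global_sim:.4f}, tumor-only: {tumor_sim:.4f}")
```

With these toy values the global similarity stays above 0.99 while the tumor-only similarity is roughly 0.22, which is why a purely global consistency term can be satisfied without aligning the tumor.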

Key Challenge: Prior methods lack an explicit mechanism for determining "which modality expert should be trusted for a given case and region." Different modalities provide unequal evidence, yet no distinction is made during fusion.

Key Insight: CLoE redefines missing-modality robustness as a decision-level consistency problem—if predictions from all modality experts are consistent, the fused result is stable; inconsistency indicates that certain experts are unreliable and should be down-weighted.

Method

Overall Architecture

CLoE consists of three components: (1) parallel modality encoders \(\Phi_m\) for per-modality feature extraction; (2) weight-shared expert decoders \(D^{\text{sep}}\) that independently predict segmentation for each modality; and (3) a consistency-driven gating module that converts consistency scores into reliability weights, which are then used for weighted fusion before being passed to the fusion decoder \(D^{\text{fuse}}\).

Key Designs

Modality Expert Consistency (MEC): For every pair \((a,b)\) in the set \(\mathcal{P}\) of available modality pairs, the cosine similarity \(\mathcal{S}\) between the two experts' prediction maps \(\mathbf{p}^{(a)}\) and \(\mathbf{p}^{(b)}\) is computed to enforce global distribution alignment: \(\mathcal{L}_{\text{MEC}} = \frac{1}{|\mathcal{P}|}\sum_{(a,b)\in\mathcal{P}}(1 - \mathcal{S}(\mathbf{p}^{(a)}, \mathbf{p}^{(b)}))\). Design Motivation: When certain modalities are absent, inconsistent predictions among the remaining experts amplify fusion errors. MEC improves robustness by reducing case-wise drift.
Region Expert Consistency (REC): Since global consistency is easily dominated by background voxels, a learnable foreground region map is introduced, \(r = \sigma(\pi(\frac{1}{|\mathcal{A}|}\sum_{m\in\mathcal{A}}f_1^{(m)}))\), where \(\mathcal{A}\) is the set of available modalities, and consistency is computed on the region-weighted predictions \(\mathbf{p}_r\): \(\mathcal{L}_{\text{REC}} = \frac{1}{|\mathcal{P}|}\sum_{(a,b)\in\mathcal{P}}(1 - \mathcal{S}(\mathbf{p}_r^{(a)}, \mathbf{p}_r^{(b)}))\). Design Motivation: In brain tumor segmentation, the enhancing tumor (ET) region occupies a very small volume, rendering global consistency constraints nearly ineffective; REC explicitly emphasizes alignment in foreground regions.
Consistency-Driven Dynamic Gating: For each modality \(m\), a global consistency score \(u_m\) and a region consistency score \(v_m\), both measured against the other experts, are fed into a lightweight gating network \(\mathcal{G}\) to obtain reliability weights \(w_m = \text{softmax}(\mathcal{G}(u_m, v_m))\). Multi-scale features are fused according to these weights: \(f_\ell = \sum_m w_m \odot f_\ell^{(m)}\). Weights for missing modalities automatically collapse to zero. Design Motivation: An expert that disagrees with the others is treated as unreliable; deriving fusion weights directly from consistency measures is more principled than feature-magnitude-based attention.
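
The three designs above can be sketched together in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the region map `r` is a fixed mask rather than one learned through \(\pi\); the scores \(u_m\) and \(v_m\) are assumed to be the mean similarity to the other available experts; the gating network \(\mathcal{G}\) is replaced by a hypothetical fixed linear scorer `gate_logit`; the weights are scalars per modality; and missing modalities are masked before the softmax, which is one possible way to obtain zero weights. The modality names follow the standard BraTS inputs, and all data is random.

```python
import itertools
import numpy as np

def cos_sim(x, y, eps=1e-8):
    """Cosine similarity between two arrays, flattened to vectors."""
    x, y = np.ravel(x), np.ravel(y)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

def consistency_loss(preds, available):
    """Mean of (1 - cosine similarity) over all pairs of available experts."""
    pairs = list(itertools.combinations(available, 2))
    if not pairs:  # a single available modality has nothing to compare against
        return 0.0
    return sum(1.0 - cos_sim(preds[a], preds[b]) for a, b in pairs) / len(pairs)

def expert_scores(preds, available):
    """ASSUMED form of u_m / v_m: mean similarity to the other available experts."""
    scores = {}
    for m in available:
        sims = [cos_sim(preds[m], preds[n]) for n in available if n != m]
        scores[m] = float(np.mean(sims)) if sims else 1.0
    return scores

def gate_logit(u, v):
    """HYPOTHETICAL stand-in for the gating network G (a fixed linear scorer)."""
    return 4.0 * u + 4.0 * v

def reliability_weights(u, v, modalities, available):
    """Softmax over available experts; missing modalities are masked to weight 0."""
    logits = np.array([gate_logit(u[m], v[m]) if m in available else -np.inf
                       for m in modalities])
    exp = np.exp(logits - logits[np.isfinite(logits)].max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
modalities = ["FLAIR", "T1", "T1ce", "T2"]
available = ["FLAIR", "T1", "T2"]  # T1ce is missing in this toy case

# Toy foreground-probability maps: FLAIR and T1 nearly agree, T2 is unrelated.
base = rng.random((8, 8, 8))
preds = {
    "FLAIR": base,
    "T1": np.clip(base + 0.05 * rng.standard_normal(base.shape), 0.0, 1.0),
    "T2": rng.random((8, 8, 8)),
}

# Fixed region map standing in for the learned r; region-weighted predictions p_r.
r = np.zeros((8, 8, 8))
r[2:6, 2:6, 2:6] = 1.0
region_preds = {m: r * p for m, p in preds.items()}

loss_mec = consistency_loss(preds, available)
loss_rec = consistency_loss(region_preds, available)

u = expert_scores(preds, available)
v = expert_scores(region_preds, available)
w = reliability_weights(u, v, modalities, available)

# Weighted fusion of toy per-modality features at one scale.
features = {m: rng.random((4, 8, 8, 8)) for m in available}
fused = sum(w[i] * features[m] for i, m in enumerate(modalities) if m in available)
print(dict(zip(modalities, np.round(w, 3))))
```

In this toy case the expert that disagrees with the other two (T2) receives the smallest non-zero weight, and the missing T1ce receives exactly zero.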

Loss & Training

The total loss is a weighted sum of three terms:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{seg}} + \alpha \mathcal{L}_{\text{ECL}} + \beta \mathcal{L}_{\text{contrast}}\]

\(\mathcal{L}_{\text{seg}}\): Segmentation loss on fused features (WCE + Dice)
\(\mathcal{L}_{\text{ECL}}\): Independent segmentation supervision for each expert, plus the weighted consistency terms \(\eta(\mathcal{L}_{\text{MEC}} + \lambda_{\text{rec}}\mathcal{L}_{\text{REC}})\)
\(\mathcal{L}_{\text{contrast}}\): Contrastive representation learning loss (SSIM for content alignment + cosine for style alignment + KL regularization)
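
How the terms combine can be written out directly. The sketch below only mirrors the formulas above; the numeric loss values and the weights `alpha`, `beta`, `eta`, and `lambda_rec` are placeholder assumptions, since this summary does not report them. Summing (rather than averaging) the per-expert losses is also an assumption.

```python
def ecl_loss(expert_seg_losses, loss_mec, loss_rec, eta, lambda_rec):
    """Per-expert supervision (summed, by assumption) plus weighted MEC and REC."""
    return sum(expert_seg_losses) + eta * (loss_mec + lambda_rec * loss_rec)

def total_loss(loss_seg, loss_ecl, loss_contrast, alpha, beta):
    """L_total = L_seg + alpha * L_ECL + beta * L_contrast."""
    return loss_seg + alpha * loss_ecl + beta * loss_contrast

# Placeholder values for illustration only.
loss_ecl = ecl_loss(
    expert_seg_losses=[0.5, 0.25],  # two available experts
    loss_mec=0.2,
    loss_rec=0.4,
    eta=0.5,
    lambda_rec=2.0,
)
loss_total = total_loss(
    loss_seg=1.0, loss_ecl=loss_ecl, loss_contrast=0.5, alpha=0.5, beta=0.1
)
print(loss_ecl, loss_total)
```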

Training: Adam optimizer, lr=0.0002, weight decay=0.0001, 500 epochs, batch size=1. Modalities are randomly dropped during training to simulate missing modality scenarios.
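
Random modality dropping can be sketched as sampling a non-empty subset of modalities per training step. The code below is an assumption about one reasonable policy (uniform over availability patterns), not the paper's stated sampling scheme; the modality names are the standard BraTS inputs. It also shows where the 15 evaluation combinations come from: four modalities give \(2^4 - 1 = 15\) non-empty subsets.

```python
import itertools
import random

MODALITIES = ("FLAIR", "T1", "T1ce", "T2")

def all_availability_patterns(modalities=MODALITIES):
    """Every non-empty subset of the modalities, smallest subsets first."""
    return [
        combo
        for k in range(1, len(modalities) + 1)
        for combo in itertools.combinations(modalities, k)
    ]

def sample_available(rng, modalities=MODALITIES):
    """Pick one availability pattern uniformly at random (assumed policy)."""
    return rng.choice(all_availability_patterns(modalities))

def keep_available(volumes, available):
    """Return only the inputs whose modality survived the random drop."""
    return {m: vol for m, vol in volumes.items() if m in available}

patterns = all_availability_patterns()
rng = random.Random(0)
pattern = sample_available(rng)

# Strings stand in for image volumes in this toy example.
volumes = {m: f"<{m} volume>" for m in MODALITIES}
batch = keep_available(volumes, pattern)
print(len(patterns), pattern, sorted(batch))
```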

Key Experimental Results

Main Results

BraTS 2020 (15 missing modality combinations, average Dice %)

Region	Metric	CLoE	DC-Seg	M³AE	Gain (vs DC-Seg)
WT	Avg Dice	88.09	87.54	86.90	+0.55
TC	Avg Dice	80.23	79.63	79.10	+0.60
ET	Avg Dice	65.06	65.00	61.70	+0.06

MSD Prostate PZ (3 modality combinations)

Setting	CLoE	DC-Seg	RFNet
T2	80.33	79.21	75.18
ADC	77.12	75.89	72.07
T2&ADC	82.91	81.67	78.00
Average	80.12	79.59	77.35

Ablation Study

Configuration	WT Dice	TC Dice	ET Dice	Notes
w/o MEC	87.75	80.01	63.50	Moderate contribution from global consistency
w/o REC	86.40	79.39	61.65	ET drops by 3.41 points; region consistency is critical
w/o Gating	87.99	80.08	63.90	Gating provides fine-grained refinement
w/o Weight Fusion	86.52	78.33	61.10	ET drops by 3.96 points; fusion is the most important component
CLoE (full)	88.09	80.23	65.06	—
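
The drops quoted in the Notes column follow from subtracting each ablated row from the full model. A small script using only the numbers in the table above reproduces them and ranks the components by their effect on ET:

```python
full = {"WT": 88.09, "TC": 80.23, "ET": 65.06}
ablations = {
    "w/o MEC": {"WT": 87.75, "TC": 80.01, "ET": 63.50},
    "w/o REC": {"WT": 86.40, "TC": 79.39, "ET": 61.65},
    "w/o Gating": {"WT": 87.99, "TC": 80.08, "ET": 63.90},
    "w/o Weight Fusion": {"WT": 86.52, "TC": 78.33, "ET": 61.10},
}

# Dice drop (in points) of each ablation relative to the full model.
drops = {
    name: {region: round(full[region] - score, 2) for region, score in scores.items()}
    for name, scores in ablations.items()
}

# Rank configurations by how much ET Dice they lose.
ranked = sorted(drops, key=lambda name: drops[name]["ET"], reverse=True)
for name in ranked:
    print(f"{name:18s} ET drop: {drops[name]['ET']:.2f}")
```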

Key Findings

REC and Weight Fusion are the two most critical components; removing either lowers ET Dice (the most challenging small-region class) by more than 3 points.
Removing MEC alone has a relatively modest effect, indicating that global consistency provides less precise constraints than region-level consistency.
A single model handles all 15 missing modality combinations without requiring separate models for each configuration.

Highlights & Insights

Reformulating missing-modality robustness as a consistency control problem is conceptually clear and operationally tractable.
The foreground-weighted strategy in REC effectively addresses background dominance; in the ablation, removing REC lowers ET Dice (the small-target class) by 3.41 points.
The consistency → reliability → fusion weight pipeline is logically coherent; the gating network is extremely lightweight, so it adds little inference overhead.
Cross-dataset generalization is demonstrated from BraTS (4 modalities) to MSD Prostate (2 modalities).

Limitations & Future Work

Average ET Dice is only 65.06%, so small-target segmentation under missing modalities remains an open problem.
The gating network takes only two scalar inputs (\(u_m, v_m\)), which may carry limited information; richer feature representations could be explored.
Validation is conducted on only two datasets (BraTS and Prostate); other organ types and modality combinations are not covered.
No comprehensive comparison with SAM-based methods (e.g., MedSAM) is provided.

Related Work & Insights

Complementarity with DC-Seg (latent disentanglement): CLoE emphasizes decision-level consistency, whereas DC-Seg focuses on representation-level disentanglement; the two methods operate at different levels of abstraction.
The consistency learning paradigm (Mean Teacher) has proven highly effective in semi-supervised learning; this work adapts it to the missing modality setting and uses REC to counter the background dominance problem.
General insight for multimodal fusion: Assessing the reliability of each modality prior to fusion is more principled than naive attention-based weighting.

Rating

Novelty: ⭐⭐⭐⭐ The consistency → reliability formulation is novel; REC addresses a genuine problem
Experimental Thoroughness: ⭐⭐⭐ BraTS + Prostate provide adequate but limited coverage
Writing Quality: ⭐⭐⭐⭐ Method motivation is well-articulated; ablation design is sound
Value: ⭐⭐⭐⭐ Missing modality is a genuine clinical need; the approach is practical and conceptually clear