CLoE: Expert Consistency Learning for Missing Modality Segmentation

Conference: CVPR 2026
arXiv: 2603.09316
Code: N/A
Area: Medical Image Segmentation / Multimodal Learning
Keywords: missing modality, consistency learning, expert fusion, reliability gating, brain tumor

TL;DR

This work reformulates the robustness problem under missing modalities as decision-level expert consistency control. It proposes a dual-branch consistency learning scheme (global MEC + regional REC) coupled with a lightweight gating network that converts consistency scores into modality reliability weights. On BraTS 2020 it reaches an average WT Dice of 88.09% across the 15 missing-modality combinations, ahead of the compared baselines DC-Seg (87.54%), RFNet (86.98%), and M³AE (86.90%).

Background & Motivation

Background: Multimodal MRI segmentation (e.g., four-modality T1/T1c/T2/FLAIR for brain tumor) typically assumes full modality availability at training time, employing encoder–decoder architectures such as U-Net and V-Net. In clinical practice, however, modality absence due to interrupted scans, protocol differences, or quality issues is extremely common.

Limitations of Prior Work: (1) GAN-based synthesis of missing modalities is computationally expensive and prone to hallucinations; (2) arithmetic fusion methods such as HeMIS cause attention mechanisms to fail when zero-padding is applied; (3) the spatial priors in RFNet are passive—they specify where to attend but provide no signal as to which expert is trustworthy; (4) consistency learning approaches such as Mean Teacher are dominated by background voxels in volumetric MRI, so global consistency does not imply alignment on small foreground structures.

Key Challenge: Modality absence is not merely a problem of reduced information; it also amplifies prediction disagreement among modality experts. Naïve fusion can amplify such disagreement rather than resolve it, particularly for small foreground structures of clinical importance.

Goal: How can inter-expert consistency be quantified as a reliability signal, and how can that signal guide dynamic fusion?

Key Insight: Elevate robustness from the representation level to the decision level—rather than learning better features, the goal is to control which expert's output is trusted.

Core Idea: Cosine similarity between expert predictions is used to quantify consistency at two levels—global (MEC) and foreground-region (REC)—and a gating network maps these consistency scores to fusion weights.

Method

Overall Architecture

CLoE comprises four core components: (1) parallel modality encoders \(\Phi_m\) that extract multi-scale features; (2) a shared expert decoder \(D^{sep}\) that generates single-modality predictions \(p^{(m)}\); (3) an ECL module that computes MEC/REC consistency scores and maps them to fusion weights \(w_m\) via a gating network; and (4) a fusion decoder \(D^{fuse}\) that takes the weighted fused features as input to produce the final segmentation.

Key Designs

  1. Modality Expert Consistency (MEC) + Regional Expert Consistency (REC)
     • MEC: Probability prediction vectors from the available experts are flattened and their pairwise cosine similarities \(\mathcal{S}\) are computed; the loss is one minus the similarity, averaged over the set \(\mathcal{P}\) of available expert pairs: \(\mathcal{L}_{MEC} = \frac{1}{|\mathcal{P}|}\sum_{(a,b)\in\mathcal{P}}\left(1-\mathcal{S}(p^{(a)}, p^{(b)})\right)\)
     • REC: A lightweight projection head \(\pi\) aggregates shallow features from the set \(\mathcal{A}\) of available experts to generate a probabilistic region map \(r=\sigma\left(\pi\left(\frac{1}{|\mathcal{A}|}\sum_{m\in\mathcal{A}} f_1^{(m)}\right)\right)\); cosine similarity is then computed on \(r\)-weighted predictions.
     • Design Motivation: MEC enforces global distributional alignment to prevent expert drift, while REC focuses on clinically critical foreground regions to avoid domination by background voxels.

  2. Consistency-Driven Dynamic Gating
     • For each expert \(m\), the global consistency score \(u_m\) and regional consistency score \(v_m\) are fed into a lightweight gating network \(\mathcal{G}\).
     • The network outputs reliability logits \(g_m = \mathcal{G}(u_m, v_m)\); softmax normalization over the available experts yields fusion weights \(w_m\).
     • Multi-scale features are fused at each scale \(\ell\) as \(f_\ell = \sum_{m\in\mathcal{A}} w_m \odot f_\ell^{(m)}\).
     • Design Motivation: Consistency is used as a training constraint and is also recycled as a reliability signal to guide fusion: consistent experts are trusted, while divergent ones are suppressed.
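The two designs above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the summary does not define \(u_m\) and \(v_m\) exactly, so here each is assumed to be the mean cosine similarity of expert \(m\) to the other available experts; the gate is a hypothetical linear map with made-up coefficients `a`, `b`, `c`; the region map `region` is supplied directly instead of being produced by the projection head; and each expert receives a single scalar weight.

```python
import math
from itertools import combinations


def cosine(a, b, eps=1e-8):
    """Cosine similarity S(a, b) between two flattened prediction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def apply_region(preds, region):
    """Weight every expert's prediction voxel-wise by the region map r."""
    return {m: [r * p for r, p in zip(region, vec)] for m, vec in preds.items()}


def mec_loss(preds):
    """L_MEC: mean of (1 - S) over all pairs of available experts."""
    pairs = list(combinations(sorted(preds), 2))
    if not pairs:  # one available expert: nothing to compare
        return 0.0
    return sum(1.0 - cosine(preds[a], preds[b]) for a, b in pairs) / len(pairs)


def rec_loss(preds, region):
    """L_REC: the same pairwise term, computed on r-weighted predictions."""
    return mec_loss(apply_region(preds, region))


def consistency_scores(preds):
    """Assumed score: mean similarity of each expert to the other experts."""
    scores = {}
    for m in preds:
        sims = [cosine(preds[m], preds[o]) for o in preds if o != m]
        scores[m] = sum(sims) / len(sims) if sims else 1.0
    return scores


def gate_weights(u, v, a=4.0, b=4.0, c=0.0):
    """Hypothetical linear gate g_m = a*u_m + b*v_m + c, softmax over experts."""
    logits = {m: a * u[m] + b * v[m] + c for m in u}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {m: math.exp(g - top) for m, g in logits.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}


def fuse(features, weights):
    """f = sum_m w_m * f^(m), with one scalar weight per expert (assumption)."""
    size = len(next(iter(features.values())))
    return [sum(weights[m] * features[m][i] for m in features) for i in range(size)]


# Toy flattened foreground probabilities from three available experts.
preds = {
    "T1c": [0.9, 0.8, 0.1, 0.0, 0.1, 0.0],
    "FLAIR": [0.8, 0.9, 0.2, 0.1, 0.0, 0.0],
    "T2": [0.1, 0.2, 0.9, 0.8, 0.7, 0.9],  # deliberately disagrees
}
region = [1.0, 1.0, 0.5, 0.1, 0.0, 0.0]  # toy region map r in [0, 1]

u = consistency_scores(preds)                        # global scores u_m
v = consistency_scores(apply_region(preds, region))  # regional scores v_m
weights = gate_weights(u, v)                         # fusion weights w_m
fused = fuse(preds, weights)  # predictions stand in for features f^(m)
```

In this toy setup the T2 expert disagrees with the other two, so it gets the lowest consistency scores and therefore the smallest fusion weight, which is the "trust consistent experts, suppress divergent ones" behavior described above.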

Loss & Training

The overall objective is \(\mathcal{L}_{total} = \mathcal{L}_{seg} + \alpha \mathcal{L}_{ECL} + \beta \mathcal{L}_{contrast}\), where:

  • \(\mathcal{L}_{seg}\): Weighted cross-entropy + Dice loss on fused predictions.
  • \(\mathcal{L}_{ECL}\): Per-expert supervision (WCE + DL) + \(\eta(\mathcal{L}_{MEC} + \lambda_{rec}\mathcal{L}_{REC})\).
  • \(\mathcal{L}_{contrast}\): Contrastive representation loss (SSIM for anatomical content alignment + cosine for modality style clustering + KL regularization).

Training setup: Adam optimizer, lr = 0.0002, weight decay = 0.0001, batch size = 1, 500 epochs, 3D input patches of size \(112^3\).
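As a rough illustration of how the terms of the objective combine, the sketch below composes \(\mathcal{L}_{ECL}\) and \(\mathcal{L}_{total}\) from scalar loss values. The function names are hypothetical, the soft Dice shown is one common form rather than necessarily the paper's, and the numeric values of \(\alpha\), \(\beta\), \(\eta\), and \(\lambda_{rec}\) are placeholders because the summary does not report them.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """One common soft Dice form: 1 - 2*sum(p*g) / (sum(p) + sum(g))."""
    inter = sum(p * g for p, g in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def ecl_loss(expert_sup, l_mec, l_rec, eta, lambda_rec):
    """L_ECL = per-expert supervision + eta * (L_MEC + lambda_rec * L_REC)."""
    return expert_sup + eta * (l_mec + lambda_rec * l_rec)


def total_loss(l_seg, l_ecl, l_contrast, alpha, beta):
    """L_total = L_seg + alpha * L_ECL + beta * L_contrast."""
    return l_seg + alpha * l_ecl + beta * l_contrast


# Placeholder numbers for illustration only, not values from the paper.
l_ecl = ecl_loss(expert_sup=0.9, l_mec=0.2, l_rec=0.4, eta=0.5, lambda_rec=2.0)
l_total = total_loss(l_seg=0.6, l_ecl=l_ecl, l_contrast=1.0, alpha=0.5, beta=0.1)
```

Keeping the consistency terms inside \(\mathcal{L}_{ECL}\) means a single weight \(\alpha\) scales both the per-expert supervision and the MEC/REC terms, while \(\eta\) and \(\lambda_{rec}\) set the balance between them.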

Key Experimental Results

Main Results

| Dataset | Metric | CLoE | DC-Seg | M³AE | RFNet | HeMIS |
|---|---|---|---|---|---|---|
| BraTS 2020 (15-combo Avg) | WT Dice (%) | 88.09 | 87.54 | 86.90 | 86.98 | 75.10 |
| BraTS 2020 (15-combo Avg) | TC Dice (%) | 80.23 | 79.63 | 79.10 | 78.23 | 65.45 |
| BraTS 2020 (15-combo Avg) | ET Dice (%) | 65.06 | 65.00 | 61.70 | 61.47 | - |
| BraTS 2020 Full | WT Dice (%) | 91.30 | 90.95 | 90.40 | 91.11 | 85.19 |
| MSD Prostate | PZ Avg Dice (%) | 80.12 | 79.59 | - | 77.35 | - |
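The "15-combo" rows average over every non-empty subset of the four BraTS MRI modalities (\(2^4 - 1 = 15\)). The short sketch below enumerates those availability patterns and counts the expert pairs each one offers for pairwise consistency; the helper name `availability_patterns` is introduced here only for illustration.

```python
from itertools import combinations

MODALITIES = ("T1", "T1c", "T2", "FLAIR")


def availability_patterns(modalities):
    """All non-empty subsets of the modalities, smallest subsets first."""
    return [
        subset
        for k in range(1, len(modalities) + 1)
        for subset in combinations(modalities, k)
    ]


patterns = availability_patterns(MODALITIES)

# Number of expert pairs available for pairwise consistency in each pattern.
pair_counts = {p: len(list(combinations(p, 2))) for p in patterns}

# Patterns with a single modality offer no pair to compare.
no_pair = [p for p, n in pair_counts.items() if n == 0]
```

Four of the 15 patterns contain a single modality and therefore no expert pair, while the full four-modality pattern offers six pairs.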

Ablation Study

| Configuration | Avg Dice Change | ET Dice Change | Note |
|---|---|---|---|
| w/o REC | −1.98% | −3.41% | Regional consistency is critical for small foreground structures |
| w/o Weight Fusion | −2.47% | −3.96% | Dynamic fusion weights contribute the most |
| w/o MEC | −0.70% | - | Global consistency plays a fine-tuning role |
| w/o Gating Network | −0.47% | - | Parameterized contribution is limited but directionally correct |

Key Findings

  • REC and Weight Fusion are the two most critical components of CLoE, yielding the largest gains on ET (the clinically most important enhancing tumor subregion).
  • CLoE does not sacrifice performance in the full-modality setting (WT 91.30%, vs. 91.11% for RFNet and 90.95% for DC-Seg), indicating that the consistency constraints do not degrade the complete-input case.
  • Cross-dataset evaluation (BraTS → MSD Prostate, PZ average Dice 80.12% vs. 79.59% for DC-Seg) supports the framework's applicability across domains.
  • Even with bounding-box prompts, MedSAM fails to produce clear tumor boundaries, confirming the continued value of dedicated multimodal segmentation frameworks.

Highlights & Insights

  • Reformulating robustness as decision-level consistency control is a clear and intuitively sound perspective.
  • REC automatically identifies foreground regions of interest via a probabilistic region map without requiring manual ROI annotations, providing an elegant solution to the background-dominated consistency problem.
  • The design of converting consistency scores into reliability weights effectively recycles the consistency signal as a fusion signal rather than treating it solely as a training constraint.
  • The ET improvement over RFNet (+3.59 Dice points, 65.06% vs. 61.47%), where the foreground is extremely small, is consistent with REC helping on clinically critical small targets; the margin over DC-Seg (65.00%) is much narrower.

Limitations & Future Work

  • Ablation results indicate that MEC and the gating network each contribute modestly in isolation (−0.70% / −0.47%), suggesting potential for architectural simplification.
  • Validation is limited to two datasets; evaluation on additional organs and modality combination patterns would be valuable.
  • Under extreme missing-modality conditions (only one modality available), pairwise consistency comparison is infeasible and the meaning of consistency scores degrades.
  • The probabilistic region map \(r\) is derived from shallow features, which may be unstable in early training stages.
  • The gating network is highly compact (two scalar inputs \(u_m, v_m\) mapped to one reliability logit \(g_m\) per expert), and whether its expressive capacity is sufficient remains an open question.

Related Work & Insights

  • vs. DC-Seg: DC-Seg performs latent-space disentanglement via VAE-based contrastive learning; CLoE adds decision-level consistency control and dynamic fusion on top of this, making the two approaches complementary.
  • vs. M³AE: M³AE relies on large-scale masked autoencoder pre-training, yet CLoE surpasses it with a substantially lighter framework.
  • vs. RFNet: RFNet employs region-aware spatial priors in a passive manner, whereas CLoE's REC actively learns foreground regions and explicitly quantifies expert reliability.
  • The consistency-to-reliability framework is broadly applicable to other multi-source fusion scenarios, such as multi-sensor fusion in autonomous driving.
  • The foreground-focused regional consistency idea offers transferable insights for small-object detection and segmentation tasks.

Rating

  • Novelty: ⭐⭐⭐ — Each individual component is not entirely novel, but the problem formulation is clear and well-motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Full coverage of 15 missing-modality combinations, cross-dataset validation, and detailed ablation studies.
  • Writing Quality: ⭐⭐⭐ — Methodology is described clearly, though space constraints leave some implementation details to the algorithm box.
  • Value: ⭐⭐⭐ — Missing modality segmentation is an important clinical problem; performance gains are solid if not dramatic.