Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation¶
Conference: CVPR 2026 · arXiv: 2603.12547 · Code: to be released upon acceptance
Area: Medical Image Segmentation / State Space Models / Decoder Design
Keywords: Medical Image Segmentation, Mamba, Decoder-Centric, Deep Supervision, Co-Attention Gate
TL;DR¶
This paper proposes Deco-Mamba, a decoder-centric segmentation network that employs a Co-Attention Gate (CAG) for bidirectional encoder–decoder feature fusion, a Visual State Space Module (VSSM) for long-range dependency modeling, and deformable convolutions for detail recovery. A windowed distribution-aware KL-divergence deep supervision scheme is further introduced. The method achieves state-of-the-art performance on 7 medical segmentation benchmarks at moderate computational cost.
Background & Motivation¶
State of the Field¶
Two unresolved bottlenecks persist in medical image segmentation: (1) most methods are optimized for a single dataset or modality, resulting in poor cross-modal generalization; (2) research attention has focused disproportionately on the encoder (leveraging large pretrained backbones), leaving decoder design largely neglected. Existing Mamba-based methods (Mamba-UNet, U-Mamba, Swin-UMamba, etc.) mainly use Mamba to enhance the encoder and do not fully exploit its long-range modeling capability during decoding. Conventional deep supervision resizes intermediate outputs to full resolution, incurring information loss.
Approach¶
Goal: design a computationally efficient, cross-modally generalizable decoder that achieves fine-grained multi-scale feature reconstruction and boundary recovery with a low parameter count.
Method¶
Overall Architecture¶
The network follows a U-Net-like structure with a dual-branch encoder (7×7 CNN for high-resolution detail retention + PVT-V2 Transformer for global context capture). The six-stage decoder cascades Co-Attention Gate → VSSM Block → Deformable Residual Block (DRB), coupled with multi-scale distribution-aware supervision. Two model variants are provided: V0 (PVT-B0, 9.67M) and V1 (PVT-B2, 46.93M).
Key Designs¶
- Co-Attention Gate (CAG): Extends the conventional unidirectional Attention Gate to a bidirectional mechanism — encoder features and decoder features serve as each other's input and gating signal. The two resulting attention outputs are concatenated and refined via channel attention (adaptive max + average pooling → dual 1×1 convolutions → sigmoid) to select the most informative channels. Formally: \(D_i' = CA[AG(x=X_i, g=D_{i+1}), AG(x=D_{i+1}, g=X_i)]\)
- Visual State Space Mamba Block (VSSMB): Employs a continuous-time SSM with selective scanning along horizontal, vertical, and their reverse directions to model global context at linear complexity. Two VSSMBs are used in the bottleneck; stages 2–5 each use one; the final stage omits it.
- Deformable Residual Block (DRB): Combines standard 3×3 convolutions with deformable convolutions that predict pixel-level offsets and modulation masks (sigmoid-constrained to \([0, 2]\)), recovering local details and boundaries that SSM processing may smooth.
- Multi-Scale Distribution-Aware Deep Supervision (MSDA): Instead of resizing intermediate outputs to full resolution, the method computes the intra-window class frequency distribution \(\tilde{P}^{(s)}\) at each decoder stage's native resolution and aligns it with the predicted softmax distribution via KL divergence. Boundary weighting is defined as \(W_{h,w}^{(s)} = (1 - \max_n \tilde{P}_{h,w,n}^{(s)})^\alpha\), assigning higher weights to mixed-class regions near boundaries.
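The bidirectional gating in the CAG can be illustrated with a minimal NumPy sketch. Weight shapes, the shared gate parameters, and the pooling details below are illustrative assumptions for compactness, not the paper's exact implementation (which operates on convolutional feature maps with separate learned projections per direction):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W):
    """Additive attention gate: gating signal g re-weights features x.
    x, g: (C, H, W) feature maps, assumed already at the same resolution."""
    q = np.maximum(W["Wx"] @ x.reshape(x.shape[0], -1)
                   + W["Wg"] @ g.reshape(g.shape[0], -1), 0.0)  # ReLU
    alpha = sigmoid(W["psi"] @ q)                  # (1, H*W) spatial attention map
    return x * alpha.reshape(1, *x.shape[1:])      # broadcast over channels

def co_attention_gate(enc, dec, W):
    """Bidirectional CAG: encoder and decoder features gate each other;
    a channel attention (max + average pooled descriptors -> two projections
    -> sigmoid) then re-weights the concatenated result."""
    a = attention_gate(enc, dec, W)                # AG(x=X_i, g=D_{i+1})
    b = attention_gate(dec, enc, W)                # AG(x=D_{i+1}, g=X_i)
    cat = np.concatenate([a, b], axis=0)           # (2C, H, W)
    desc = cat.max(axis=(1, 2)) + cat.mean(axis=(1, 2))   # pooled descriptor
    ca = sigmoid(W["Wc2"] @ np.maximum(W["Wc1"] @ desc, 0.0))
    return cat * ca[:, None, None]                 # channel re-weighting

C, H, Wd = 4, 8, 8
rng = np.random.default_rng(0)
W = {"Wx": rng.standard_normal((C, C)), "Wg": rng.standard_normal((C, C)),
     "psi": rng.standard_normal((1, C)),
     "Wc1": rng.standard_normal((C, 2 * C)), "Wc2": rng.standard_normal((2 * C, C))}
out = co_attention_gate(rng.standard_normal((C, H, Wd)),
                        rng.standard_normal((C, H, Wd)), W)
print(out.shape)  # (8, 8, 8): 2C channels after concatenation
```

For brevity the two gating directions share one set of weights here; the key structural point is that each branch's output is the *other* branch's gating signal before the channel-attention refinement.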
Loss & Training¶
\(\mathcal{L}_{total} = \mathcal{L}_{dice} + \sum_s \lambda_s \mathcal{L}_{dist}^{(s)}\), with monotonically increasing stage weights \(\lambda_1 < \lambda_2 < ... < \lambda_S\) so that deeper (higher-resolution) stages contribute more. AdamW optimizer with cosine learning rate scheduling (warm restart \(T=2\)); input resolution 224×224; learning rate 1e-4, batch size 16 (primary datasets); training performed on a single A5000 24 GB GPU.
Key Experimental Results¶
| Dataset | Metric | Deco-Mamba-V1 | Prev. SOTA | Improvement |
|---|---|---|---|---|
| Synapse (8-class) | DSC↑ / HD95↓ | 85.07 / 14.72 | 83.59 / 15.99 (Cascaded-MERIT) | +1.48 / −1.27 |
| BTCV (13-class) | DSC↑ / HD95↓ | 78.45 / 11.77 | 75.87 / 17.02 (PAG-TransYnet) | +2.58 / −5.25 |
| ACDC (cardiac) | DSC↑ | 92.35 | 92.12 (PVT-EMCAD-B2) | +0.23 |
| MoNuSeg | DSC↑ | 85.14 | 81.45 (Swin-UMamba) | +3.69 |
| GlaS | DSC↑ | 96.91 | 96.91 (Cascaded-MERIT) | Tied |
Deco-Mamba-V0 (only 9.67M parameters) approaches the performance of Transformer-based methods with ~150M parameters.
Ablation Study¶
- Removing CNN branch: DSC 84.07 (−1.00); removing VSSMB: DSC 83.51 (−1.56)
- CAG vs. conventional AG: 82.98 → 85.07; vs. LGAG: 82.69 → 85.07; vs. CBAM: 84.01 → 85.07
- Deformable conv vs. standard conv: 84.53 vs. 85.07; vs. dynamic conv: 83.77 vs. 85.07
- MSDA vs. conventional deep supervision: conventional deep supervision improves DSC but worsens HD95 (15.89 vs. MSDA's 14.72), whereas MSDA improves both metrics
- vs. boundary-aware / distance-based boundary loss: HD95 of 21.43 / 20.64 vs. MSDA's 14.72
- Backbone comparison: PVT-B0 (9.67M) DSC 83.16; Swin-T (70.12M) DSC 83.76; PVT-B2 DSC 85.07
Highlights & Insights¶
- Impressive efficiency–accuracy trade-off: V0 with only 9.67M parameters outperforms Mamba-based methods such as SliceMamba and VM-UNet, approaching the 148M-parameter Cascaded-MERIT.
- MSDA loss avoids the information loss caused by resizing in conventional deep supervision by operating directly at native decoder resolutions.
- Comprehensive validation across 7 cross-modal benchmarks (CT / MRI / ultrasound / dermoscopy / pathology) demonstrates strong generalizability.
Limitations & Future Work¶
- Validation is limited to 2D slices; extension to 3D volumetric segmentation remains unexplored.
- The method relies on PVT pretrained weights; alternative pretraining strategies (e.g., self-supervised learning) have not been investigated.
- The multi-directional scanning strategy in VSSM lacks systematic ablation.
Related Work & Insights¶
- vs. EMCAD (CVPR 2024): Both target decoder enhancement, but EMCAD does not model long-range dependencies. Deco-Mamba-V0 with PVT-B0 already surpasses EMCAD with PVT-B2.
- vs. Cascaded-MERIT (147.86M): Deco-Mamba-V1 achieves +1.48% DSC using approximately one-third of the parameters.
- vs. Swin-UMamba: +3.69% DSC on MoNuSeg with fewer parameters.
Takeaways¶
- The decoder-centric design philosophy merits attention — a lightweight encoder paired with a strong decoder may be more efficient than the converse.
- Distribution-aware deep supervision is generalizable to other dense prediction tasks.
Rating¶
- Novelty: ⭐⭐⭐ CAG, VSSM, and MSDA each offer incremental contributions; the combination is effective but individual innovations are limited in scope.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 datasets, 20+ comparison methods, comprehensive ablation study.
- Writing Quality: ⭐⭐⭐⭐ Clear figures and tables; detailed module descriptions.
- Value: ⭐⭐⭐ Strong practical utility; valuable to the medical segmentation community; design is transferable to other tasks.