
Adaptive Morph-Patch Transformer for Aortic Vessel Segmentation

Conference: AAAI 2026 | arXiv: 2511.06897 | Code: https://github.com/iCherishxixixi/MPTransformer | Area: Image Segmentation | Keywords: aortic segmentation, morphology-aware patch, semantic clustering attention, diffeomorphic deformation, velocity field

TL;DR

This paper proposes the Morph-Patch Transformer (MPT), which generates morphology-aware patches via a velocity-field-based adaptive patch partitioning strategy to preserve vascular topological integrity, and introduces Semantic Clustering Attention (SCA) to dynamically aggregate features from semantically similar patches. The method achieves state-of-the-art performance on three aortic segmentation benchmarks: AVT, AortaSeg24, and TBAD.

Background & Motivation

Aortic vessel segmentation is critical for the diagnosis and treatment of cardiovascular diseases, directly affecting the reliability of computational fluid modeling, surgical planning, and disease progression monitoring. While Transformers have become the dominant paradigm in this domain, two fundamental challenges remain:

Fixed rectangular patches disrupt vascular integrity: Conventional Transformers partition images into fixed-size rectangular patches, which are ill-suited for the elongated, curved, and morphologically complex structure of blood vessels. Rectangular windows often truncate thin vessels, fragmenting semantic information. Even DPT (Deformable Patch Transformer), which introduces learnable deformation, remains constrained to rectangular patches and cannot adapt to vascular morphology.

Lack of cross-scale semantic similarity modeling: The hierarchical window attention of Swin Transformer enables multi-scale feature extraction, but fixed windows still fail to model semantic similarity across patches at different scales. Existing methods—including those leveraging dynamic Snake convolutions—enhance feature extraction for elongated structures but lack a semantic clustering mechanism.

Core insight: Diffeomorphic deformation driven by a velocity field naturally generates morphology-aware patches with topological continuity; Soft K-means clustering enables dynamic aggregation of semantically similar patches via SCA.

Method

Overall Architecture

The model follows a 3D UNet-based encoder-decoder structure with two core innovations: (1) the Morph Partition Block replaces fixed patch partitioning; and (2) a Spatial + Semantic Transformer Block fuses spatial relationships (window attention) with semantic relationships (clustering attention). Three variants are provided: MPT (pure 3D ViT), MPT-UNETR (hybrid 3D ViT-CNN), and MPTUNet (lightweight 2D).

Key Designs

  1. Morph Partition Block (morphology-aware patch partitioning):

    • A CNN predicts a velocity field \(\upsilon\), which is integrated via the scaling-and-squaring method to obtain a diffeomorphic deformation field \(\phi^{(1)}\).
    • Core formula: \(y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \phi(p_0 + p_n))\)
    • Unlike conventional deformable convolutions that directly predict offset fields, velocity field integration guarantees smooth, invertible, and topology-preserving transformations. Each point in the deformation field represents a coordinate offset; deformed features are sampled from the original input via bilinear interpolation.
    • Recurrence relation: \(\phi^{(1/2^{n-1})} = \phi^{(1/2^n)} \circ \phi^{(1/2^n)}\), yielding \(\phi^{(1)}\) after \(n\) iterations.
  2. Semantic Clustering Attention (SCA):

    • Differentiable Soft K-means is employed to extract core semantic features \(F_{core}\), clustering patch features according to softmax-weighted distances to the core features.
    • Update rule: \(f_{newcore}^s = \sum_{i=1}^m g_s(f^i) \cdot (f^i - f_{core}^s)\), where \(g_s\) is a differentiable membership function.
    • \(\lambda\), \(\mu\), and \(F_{core}\) are all learned by the network; \(g_s\) is parameterized as \(e^{\lambda^s f^i + \mu^s}\) to ensure differentiability.
    • SCA is computed as: \(\text{SCA} = \text{softmax}(QK^T/\sqrt{d}) \cdot V\), where Q comes from patch tokens and K/V come from the updated semantic centers.
  3. Fusion strategy: Morphology-aware patches from the Morph Partition Block, window attention from Swin Transformer (spatial relationships), and SCA (semantic relationships) are jointly integrated within the Transformer Block.
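
The scaling-and-squaring integration in design 1 can be sketched in a few lines. This is a minimal 2D NumPy illustration, not the authors' code: the paper operates on 3D volumes with a CNN-predicted velocity field, and the function names here are hypothetical.

```python
import numpy as np

def bilinear_sample(field, coords):
    """Sample a (H, W, C) field at fractional (y, x) coords of shape (H, W, 2)."""
    H, W, _ = field.shape
    y = np.clip(coords[..., 0], 0, H - 1)
    x = np.clip(coords[..., 1], 0, W - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (y - y0)[..., None], (x - x0)[..., None]
    return (field[y0, x0] * (1 - wy) * (1 - wx) + field[y0, x1] * (1 - wy) * wx
            + field[y1, x0] * wy * (1 - wx) + field[y1, x1] * wy * wx)

def scaling_and_squaring(velocity, n_steps=6):
    """Integrate a stationary velocity field into a displacement field.

    Starts from phi^(1/2^n) ~= v / 2^n, then applies the recurrence
    phi^(1/2^(k-1)) = phi^(1/2^k) o phi^(1/2^k) for n_steps squarings.
    """
    H, W, _ = velocity.shape
    grid = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"),
                    axis=-1).astype(float)
    phi = velocity / (2 ** n_steps)
    for _ in range(n_steps):
        # composition phi o phi: phi(p) + phi(p + phi(p)), via interpolation
        phi = phi + bilinear_sample(phi, grid + phi)
    return phi  # phi^(1); a deformed patch samples the input at p + phi(p)
```

For small velocities each squaring step stays close to the identity, which is why the composed map remains smooth and invertible, unlike a directly predicted offset field.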
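The Soft K-means update and SCA step in design 2 can likewise be sketched. The Gaussian-style membership kernel with a fixed `beta` and the residual center update are illustrative assumptions, not the paper's exact learned parameterization of \(g_s\), \(\lambda\), and \(\mu\):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_kmeans_update(tokens, centers, beta=1.0):
    """One differentiable Soft K-means step: soft memberships over distances,
    then the residual update sum_i g_s(f_i) * (f_i - f_core_s)."""
    d2 = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (m, s)
    g = softmax(-beta * d2, axis=1)                                 # memberships
    delta = g.T @ tokens - g.sum(axis=0)[:, None] * centers         # sum g*(f - c)
    return centers + delta

def semantic_clustering_attention(tokens, centers):
    """Cross-attention with Q from patch tokens, K/V from semantic centers."""
    centers = soft_kmeans_update(tokens, centers)
    d = tokens.shape[-1]
    attn = softmax(tokens @ centers.T / np.sqrt(d), axis=-1)        # (m, s)
    return attn @ centers                                           # (m, d)
```

Because attention is computed against a small set of semantic centers rather than all patch pairs, the cost scales with the number of clusters instead of the token count squared.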

Loss & Training

Training uses Dice loss and the Adam optimizer with a learning rate of 5e-5; the number of clusters for SCA is set to 32. Models are trained for 1000 epochs within the nnU-Net framework on an NVIDIA RTX 3090.
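The loss is standard soft Dice; a minimal NumPy sketch for a single binary probability map (the paper's multi-class setting would average this over classes):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|); lower is better."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```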

Key Experimental Results

Main Results: AVT and TBAD Datasets

| Model | Backbone Type | AVT Dice | AVT mIoU | AVT clDice | TBAD Dice | TBAD mIoU | TBAD clDice |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MedNeXt | CNN | 0.809 | 0.718 | 0.724 | 0.926 | 0.871 | 0.880 |
| SegMamba | Mamba | 0.829 | 0.730 | 0.711 | 0.932 | 0.881 | 0.918 |
| nnFormer | 3D ViT | 0.835 | 0.743 | 0.732 | 0.926 | 0.871 | 0.895 |
| DPT | 2D ViT-CNN | 0.886 | 0.800 | 0.825 | 0.924 | 0.868 | 0.917 |
| MambaVision | Mamba | 0.882 | 0.795 | 0.795 | 0.929 | 0.874 | 0.914 |
| TransFuse | 2D ViT-CNN | 0.880 | 0.794 | 0.796 | 0.927 | 0.872 | 0.895 |
| MPT | 3D ViT | 0.856 | 0.762 | 0.757 | 0.933 | 0.881 | 0.915 |
| MPTUNet | 2D ViT-CNN | 0.896 | 0.815 | 0.839 | 0.930 | 0.877 | 0.920 |

AortaSeg24 Dataset (23-class Fine-grained Segmentation)

| Model | Dice | mIoU | clDice |
| --- | --- | --- | --- |
| 3DUXNet | 0.784 | 0.666 | 0.964 |
| nnFormer | 0.779 | 0.666 | 0.923 |
| SwinUNETR | 0.781 | 0.664 | 0.937 |
| DSCViT | 0.788 | 0.673 | 0.965 |
| DPT | 0.778 | 0.662 | 0.959 |
| MambaVision | 0.795 | 0.682 | 0.960 |
| MPT | 0.804 | 0.690 | 0.926 |
| MPTUNETR | 0.809 | 0.695 | 0.955 |
| MPTUNet | 0.796 | 0.686 | 0.966 |

Key Findings

  • MPTUNet achieves Dice 0.896 on AVT, surpassing all baselines (including DPT 0.886, MambaVision 0.882, TransFuse 0.880), with the lowest variance (0.046), indicating strong stability.
  • AortaSeg24's 23-class fine-grained segmentation is highly challenging; MPTUNETR achieves the best Dice (0.809) and mIoU (0.695), ahead of the next-best MambaVision (Dice 0.795), while MPTUNet attains the highest clDice (0.966).
  • The clDice metric, specifically designed to measure vascular topological integrity, confirms the effectiveness of diffeomorphic deformation: MPTUNet achieves the highest clDice of 0.920 on TBAD.
  • Model efficiency: The MPT family is comparable to DPT in FLOPs and parameter count, while delivering substantially better performance.
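
The clDice metric referenced above is the harmonic mean of topology precision and topology sensitivity computed on vessel skeletons. A minimal sketch, assuming the skeletons of the predicted and ground-truth masks are already given (real implementations extract them, e.g. by morphological thinning):

```python
import numpy as np

def cl_dice(v_pred, v_true, s_pred, s_true, eps=1e-8):
    """centerlineDice on binary vessel masks v_* and their skeletons s_*."""
    tprec = (s_pred * v_true).sum() / (s_pred.sum() + eps)  # pred skeleton inside true mask
    tsens = (s_true * v_pred).sum() / (s_true.sum() + eps)  # true skeleton inside pred mask
    return 2.0 * tprec * tsens / (tprec + tsens + eps)
```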

Highlights & Insights

  • Velocity field → diffeomorphic deformation → topology-preserving patch partitioning: Unlike DCN, which directly predicts offsets, deformations obtained via ODE integration are inherently smooth and invertible—particularly well-suited for vascular structures that require topological continuity.
  • Differentiable semantic clustering via Soft K-means: Hard assignments in standard K-means are softened through an exponential kernel \(e^{-\beta \|f-f_{core}\|^2}\), with \(\lambda\), \(\mu\), and \(F_{core}\) fully learnable, yielding greater flexibility than fixed-window attention.
  • Three variants covering diverse requirements: MPT (pure ViT for accuracy), MPT-UNETR (hybrid architecture balancing efficiency), and MPTUNet (lightweight 2D design).

Limitations & Future Work

  • The CNN for velocity field prediction introduces additional computational overhead; inference speed is not thoroughly analyzed.
  • The number of clusters is fixed at 32; adaptive clustering strategies are not explored.
  • Validation is limited to aorta-related datasets; generalization to other vascular segmentation tasks (e.g., retinal vessels, coronary arteries) remains untested.
  • The trade-off between deformation precision and computational cost with respect to the number of scaling-and-squaring steps is insufficiently discussed.

Comparison with Related Approaches

| Method Category | Representative | Patch Strategy | Semantic Modeling | Topology Preservation |
| --- | --- | --- | --- | --- |
| Standard Transformer | UNETR, SwinUNETR | Fixed rectangular | Window attention | None |
| Deformable Transformer | DPT | Deformable rectangular | Standard attention | Weak |
| Snake convolution hybrid | TTCNet, DAU-Net | Fixed + Snake | Convolutional features | Implicit |
| MPT (Ours) | MPT/UNETR/UNet | Diffeomorphic deformation | SCA clustering attention | Strong (topology-preserving) |

Rating

  • Novelty: ⭐⭐⭐⭐ Velocity-field-driven morphology patches + differentiable semantic clustering attention
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets + 17 comparison methods + three model variants
  • Writing Quality: ⭐⭐⭐⭐ Clear methodological derivation and well-motivated problem formulation
  • Value: ⭐⭐⭐⭐ Broadly inspiring for segmentation of morphologically complex structures in medical imaging