EchoONE: Segmenting Multiple Echocardiography Planes in One Model

Conference: CVPR 2025
arXiv: 2412.02993
Code: https://github.com/a2502503/EchoONE
Area: Medical Imaging
Keywords: Echocardiography segmentation, Multi-plane segmentation, SAM adaptation, Prior Composable Mask, Unified model

TL;DR

This paper proposes EchoONE, the first unified model to address the Multi-Plane Segmentation (MPS) problem in echocardiography. It uses a Prior Composable Mask learning (PC-Mask) module to generate semantically aware dense prompts and a Local Feature Fusion and Adaptation (LFFA) module to inject local CNN features into the SAM decoder. EchoONE achieves state-of-the-art (SOTA) performance across 6 planes.

Background & Motivation

Background: An echocardiographic examination views cardiac structures from multiple planes (e.g., apical two-chamber, three-chamber, and four-chamber, and parasternal short-axis) for a comprehensive evaluation. Because anatomy differs sharply across views (long-axis vs. short-axis, differing numbers of chambers, and distinct anatomical landmarks), machine learning segmentation models are typically trained separately for each plane, which raises development and deployment complexity.

Limitations of Prior Work: Existing multi-plane solutions fall into two categories. (1) Multi-branch architectures: each branch processes one plane, and the branches are merged via cross-view attention or consistency constraints. This is still a "divide-and-conquer" approach and fails to generalize to new planes. (2) Mixed training: data from multiple planes are merged to train a single model, which usually causes a noticeable drop in performance (e.g., U-Net joint training across multiple planes causes Dice to drop from ~89 to ~86). SAM, a general segmentation foundation model, performs exceptionally well on natural images, but applying it directly to ultrasound images yields very poor performance (Dice of only ~25-27%) because of low contrast, high noise, and blurry tissue boundaries.

Key Challenge: The structural differences in multi-plane ultrasound images are too vast (long axes capture longitudinal views, while short axes capture transverse views, with completely different chamber morphologies), and SAM's standard prompt mechanism lacks the semantic affinity to distinguish these structural differences. How can a single model "know" which plane the current input belongs to and adaptively adjust its segmentation behavior?

Goal: (1) Construct a unified model to handle multi-plane echocardiographic segmentation without requiring prior knowledge of the input plane; (2) Design a semantically-aware dense prompt generation mechanism; (3) Effectively adapt SAM to multi-plane echocardiography scenarios.

Key Insight: The authors leverage prior structural knowledge—images from different planes exhibit clustering patterns in the feature space, allowing the pre-computation of the average mask of each cluster as a structural prior. For a new input image, a semantically-aware dense prompt can be generated by calculating its similarity to each cluster prototype to perform a weighted combination of the prior masks, eliminating the need for explicit plane labels.

Core Idea: Generate semantically-aware dense prompts using a similarity-weighted combination of clustering prior masks, enabling SAM to adaptively segment different ultrasound planes.
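The core idea can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names, the toy 2-D features, and the 2x2 masks are assumptions made for illustration, and the refinement network that turns the weighted priors into the final dense prompt is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compose_prior(feature, prototypes, prior_masks):
    """Scale each cluster's average mask by its similarity to the input.

    Returns the similarity weights and one weighted mask per cluster,
    kept as separate channels (a K-channel prior embedding).
    """
    weights = [cosine_similarity(feature, p) for p in prototypes]
    channels = [
        [[w * v for v in row] for row in mask]
        for w, mask in zip(weights, prior_masks)
    ]
    return weights, channels


# Toy setup (all values hypothetical): K = 2 clusters.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
prior_masks = [
    [[1.0, 1.0], [0.0, 0.0]],  # average mask of cluster 0
    [[0.0, 1.0], [0.0, 1.0]],  # average mask of cluster 1
]
feature = [3.0, 4.0]  # latent feature of a new image, plane unknown

weights, prior_embedding = compose_prior(feature, prototypes, prior_masks)
```

In this toy input the feature lies closer to the second prototype, so the second cluster's average mask receives the larger weight (0.8 versus 0.6). No plane label is consulted at any point.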

Method

Overall Architecture

EchoONE consists of three components: (1) The core SAM architecture (ViT-B image encoder + mask decoder); (2) The PC-Mask module, which is responsible for generating semantically-aware dense prompts that are fed into SAM's mask encoder; (3) A CNN branch that interacts with the SAM decoder via the LFFA module for local feature exchange, aiding SAM adaptation. A unified mask representation remaps the annotations of all datasets into 4 classes: background (0), left ventricle (1), left ventricular cavity (2), and myocardium (3).
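A unified mask representation of this kind is commonly implemented as a per-dataset lookup table. The sketch below assumes the four class IDs listed above; the dataset names and their source label values are hypothetical and do not come from the paper.

```python
# Unified class IDs as listed in the text above.
BACKGROUND, LEFT_VENTRICLE, LV_CAVITY, MYOCARDIUM = 0, 1, 2, 3

# Hypothetical lookup tables: source label -> unified label.
# Real datasets would need tables derived from their own annotation protocols.
REMAP_TABLES = {
    "dataset_a": {0: BACKGROUND, 1: LV_CAVITY, 2: MYOCARDIUM},
    "dataset_b": {0: BACKGROUND, 255: MYOCARDIUM},
}


def remap_mask(mask, dataset):
    """Map every pixel label of a 2-D mask into the unified label space."""
    table = REMAP_TABLES[dataset]
    return [[table[value] for value in row] for row in mask]


unified = remap_mask([[0, 1], [2, 1]], "dataset_a")
```

A lookup that raises `KeyError` on an unmapped label, as this one does, makes protocol mismatches visible instead of silently merging classes.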

Key Designs

  1. Prior Composable Mask learning (PC-Mask):

    • Function: Automatically generate high-quality semantically-aware dense prompts without prior knowledge of the input plane information.
    • Mechanism: First, in the latent space of a pre-trained ResNet34, all training images are clustered into \(K\) groups, each with a feature prototype \(u_i\) and a corresponding average mask \(m_i\). For a new input image \(I_j\), its cosine similarity to each prototype is calculated as \(w(i,j) = \text{cossim}(E_{Lat}(I_j), u_i)\), and the prior embedding is formed by concatenating the similarity-weighted average masks, \(PE_j = \text{concat}([w(i,j) \times m_i])\). Finally, a lightweight U-Net refines this prior embedding into the final dense prompt \(PCM_j = UNet_\theta(PE_j)\).
    • Design Motivation: The key lies in the fact that the semantics of PC-Mask come from the clustering priors rather than the input image itself, which allows it to provide structural guidance distinct from traditional segmentation networks. Furthermore, it does not rely on plane labels, making it naturally suited for a unified multi-plane model.
  2. Local Feature Fusion and Adaptation (LFFA):

    • Function: Inject local features extracted by the CNN into the SAM decoder to compensate for ViT's deficiency in capturing local details.
    • Mechanism: A CNN branch (Residual blocks + cross-branch attention) is designed. The local features output at each level are concatenated with the keys of corresponding Transformer blocks in the SAM decoder and fused using a 1x1 convolution, yielding the fused features \(f_{F,l} = \text{conv}_{1\times1}(\text{concat}(f_{CNN,l}, f_{DM-K,l}))\), which are then fed into the next Transformer block as the new image embedding. In addition to the original 2 decoder Transformer blocks of SAM, 3 extra learnable blocks are added to accommodate the injection of the 4 layers of CNN features.
    • Design Motivation: SAM's ViT is pre-trained on natural images and lacks local comprehension of low-contrast structures in ultrasound images. The CNN branch provides complementary local detail features, and directly injecting these into the decoder via skip-like connections is more effective than solely tuning the encoder, while also accelerating convergence.
  3. Unified Mask Representation:

    • Function: Unify training data originating from different sources and annotated under different protocols.
    • Mechanism: All annotations across various datasets are remapped into 4 unified semantic classes (background, LV, LV cavity, MYO). For datasets with only myocardial annotations, the LV cavity masks are generated by detecting anatomical landmarks and filling the enclosed regions.
    • Design Motivation: Multi-source datasets use different annotation protocols (some annotate the left ventricle, some the myocardium, and others the left atrium), making unification essential for joint training.
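The LFFA fusion step in item 2 (channel-wise concatenation followed by a 1x1 convolution) can be sketched in plain Python. This is an illustrative sketch only: it assumes both feature maps already share the same spatial grid, uses nested lists shaped `[channels][height][width]`, and the weights are made-up numbers, not learned parameters.

```python
def concat_channels(f_cnn, f_key):
    """Concatenate two [C][H][W] feature maps along the channel axis."""
    return f_cnn + f_key


def conv1x1(x, weight, bias):
    """1x1 convolution: a per-pixel linear map across channels.

    x:      [C_in][H][W]
    weight: [C_out][C_in]
    bias:   [C_out]
    """
    height, width = len(x[0]), len(x[0][0])
    out = []
    for w_row, b in zip(weight, bias):
        out.append([
            [b + sum(wc * x[c][i][j] for c, wc in enumerate(w_row))
             for j in range(width)]
            for i in range(height)
        ])
    return out


# Toy inputs (hypothetical): one CNN channel and one decoder-key channel.
f_cnn = [[[1.0, 2.0], [3.0, 4.0]]]
f_key = [[[10.0, 20.0], [30.0, 40.0]]]

# Fuse the 2 concatenated channels back into 1 output channel.
fused = conv1x1(concat_channels(f_cnn, f_key), weight=[[0.5, 0.25]], bias=[0.0])
```

Because the kernel is 1x1, each output pixel depends only on the channels at the same location, so the fusion mixes CNN and decoder information without changing spatial resolution.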

Loss & Training

The total loss is defined as \(\mathcal{L} = \mathcal{L}_{SEG} + 0.5 \cdot \mathcal{L}_{PCM}\). Here, \(\mathcal{L}_{SEG}\) supervises the final segmentation output (0.8 Dice + 0.2 BCE), and \(\mathcal{L}_{PCM}\) supervises mask learning in the PC-Mask module (using the same Dice + BCE combination). The model is optimized using the Adam optimizer (lr=1e-4) for 100 epochs. All images are resized to 256x256, and training is conducted on a single A6000 GPU.
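The weighting above can be written out as a small sketch. It assumes binary masks flattened to lists of per-pixel probabilities, soft Dice with a small smoothing constant, and the same 0.8/0.2 split for both terms; the smoothing values and the function names are illustrative assumptions, and multi-class handling is not covered.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened per-pixel probabilities."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)


def mask_loss(pred, target):
    """0.8 * Dice + 0.2 * BCE, as stated for the segmentation loss."""
    return 0.8 * dice_loss(pred, target) + 0.2 * bce_loss(pred, target)


def total_loss(seg_pred, pcm_pred, target):
    """L = L_SEG + 0.5 * L_PCM."""
    return mask_loss(seg_pred, target) + 0.5 * mask_loss(pcm_pred, target)


target = [1.0, 0.0, 1.0, 0.0]
seg_pred = [0.9, 0.1, 0.8, 0.2]  # hypothetical final segmentation output
pcm_pred = [0.6, 0.4, 0.6, 0.4]  # hypothetical PC-Mask output
loss = total_loss(seg_pred, pcm_pred, target)
```

The 0.5 factor means an error in the PC-Mask prompt is penalized half as strongly as the same error in the final segmentation.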

Key Experimental Results

Main Results

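| Method | 2CH mDice (%) | 4CH mDice (%) | PSAX mDice (%) | Overall Mean (%) |
|---|---|---|---|---|
| U-Net | 86.40 | 86.62 | 82.32 | ~85.3 |
| SAMUS | 87.51 | 88.37 | 85.92 | ~87.3 |
| SAM (zero-shot) | 26.24 | 27.22 | 25.10 | ~26.2 |
| MedSAM | 81.76 | 84.14 | 76.99 | ~81.0 |
| EchoONE | 89.67 | 90.28 | 88.26 | ~89.4 |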

EchoONE achieves the highest Dice on all three planes shown, with its largest margin over SAMUS on the PSAX short-axis plane (88.26 vs. 85.92, a gap of 2.34 points). This is likely attributable to the guidance that the PC-Mask structural prior provides on short-axis cardiac boundaries.
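The per-plane margins over the strongest baseline can be checked with a few lines of arithmetic. The numbers are copied from the table above; the variable names are only illustrative.

```python
# mDice values from the main results table.
echoone = {"2CH": 89.67, "4CH": 90.28, "PSAX": 88.26}
samus = {"2CH": 87.51, "4CH": 88.37, "PSAX": 85.92}

# Margin of EchoONE over SAMUS on each plane, in Dice points.
margins = {plane: round(echoone[plane] - samus[plane], 2) for plane in echoone}
largest_plane = max(margins, key=margins.get)

print(margins)        # {'2CH': 2.16, '4CH': 1.91, 'PSAX': 2.34}
print(largest_plane)  # PSAX
```

The PSAX margin is the largest of the three, although the spread between planes is under half a Dice point.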

Cross-Dataset Comparison

| Dataset | EchoONE mDice (%) | SAMUS mDice (%) | U-Net mDice (%) |
|---|---|---|---|
| Center A (Internal) | 89.67 | 87.51 | 86.75 |
| Center B (Internal) | 87.27 | 85.54 | 77.35 |
| HMC_QU (External) | 73.94 | 72.38 | 67.47 |
| EchoNet (External) | 87.62 | 85.77 | 83.10 |

On two external datasets completely excluded from training, EchoONE still achieves the best performance, demonstrating robust generalization capability. Even on HMC_QU, which contains low-quality images and noisy annotations, EchoONE outperforms all competing methods.

Key Findings

  • Directly applying SAM to ultrasound images fails (mDice of roughly 25-27% per plane), indicating a vast domain gap between natural and medical ultrasound images.
  • Among baseline methods, SAMUS (a SAM adaptation approach specifically designed for ultrasound) performs the best, yet EchoONE still outperforms it by about 2 Dice points on each reported plane (1.91 to 2.34).
  • The PC-Mask module contributes the primary performance gain, with the most pronounced improvement observed on the PSAX plane.
  • The LFFA module not only improves segmentation accuracy but also accelerates model convergence.

Highlights & Insights

  • The "prior composition" concept of PC-Mask is highly ingenious: instead of generating prompts directly from the image, they are learned as weighted combinations from pre-computed clustering priors. This endows the prompts with structural knowledge across datasets, ensuring stronger generalization.
  • The unified mask representation addresses the practical challenge of inconsistent annotations across multi-source ultrasound data, offering an engineering contribution of high value to the ultrasound community.
  • Processing 6 planes with a single model is a first in this field. Compared to traditional methods requiring 6 individual models, it drastically reduces deployment complexity.

Limitations & Future Work

  • The number of clusters \(K\) in PC-Mask is an empirical hyperparameter, and the paper does not fully discuss the impact of its selection on performance.
  • Evaluation is currently limited to echocardiography; whether the approach extends to other multi-planar medical imaging modalities (e.g., MRI, CT) warrants further exploration.
  • The unified mask representation simplifies left atrium (LA) annotations, which may lose some structural information regarding the left atrium.
  • Using ViT-B as the image encoder incurs a high computational overhead during inference, necessitating lighter backbones for mobile deployments.
  • The generalization performance on the external HMC_QU dataset reaches only 73.94% in Dice, leaving a gap compared to internal evaluation; there is still room for improvement in low-quality image scenarios.

Comparison with Related Work

  • vs. SAMUS: While SAMUS also utilizes a CNN side-branch to adapt SAM to ultrasound, it lacks a semantically aware dense prompt mechanism, leading to inferior performance compared to EchoONE in multi-plane scenarios.
  • vs. MedSAM: MedSAM employs bounding box prompts for medical image segmentation but lacks adaptability to structural variations across multiple ultrasound planes.
  • vs. Multi-branch approaches like TransFusion: These approaches use independent branches for each plane and cannot generalize to new planes. In contrast, EchoONE's unified design needs only one model for all planes, which makes it more practical to deploy.

Rating

  • Novelty: ⭐⭐⭐⭐ The clustering prior composable design of PC-Mask is novel, though the overall framework builds closely upon existing SAM adaptation studies.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely thorough, validated across 7 datasets (5 internal + 2 external) and 6 different planes.
  • Writing Quality: ⭐⭐⭐⭐ The problem definition is clear, the methodology is described rigorously, and the figures are highly informative.
  • Value: ⭐⭐⭐⭐ It represents the first attempt to address unified multi-plane ultrasound segmentation, holding direct clinical deployment value.