Vascular Anatomy-aware Self-supervised Pre-training for X-ray Angiogram Analysis¶
- Conference: AAAI 2026
- arXiv: 2602.11536
- Code: GitHub
- Area: Medical Imaging / Self-supervised Pre-training
- Keywords: X-ray angiography, self-supervised learning, masked image modeling, vascular anatomy awareness, foundation model, vessel segmentation, stenosis detection
TL;DR¶
This paper proposes VasoMIM, a domain-specific self-supervised pre-training framework for X-ray angiograms. It introduces an anatomy-guided masking strategy that prioritizes vessel regions, an anatomical consistency loss to preserve vascular topology in reconstructed images, and a newly constructed XA-170K pre-training dataset — the largest of its kind. VasoMIM comprehensively outperforms both general-purpose and medical SSL methods (including DINOv3 pre-trained on 1.69 billion images) across 4 downstream tasks and 6 datasets.
Background & Motivation¶
Background: Cardiovascular disease is the leading cause of death worldwide, and X-ray angiography serves as the clinical gold standard for diagnosis. Deep learning methods (e.g., UNet, Faster R-CNN) have achieved notable progress in vessel segmentation and stenosis detection, but are severely constrained by the scarcity of annotated data. Self-supervised learning (SSL) offers a promising solution; however, the field lacks dedicated SSL frameworks and large-scale datasets tailored to this modality.
Limitations of Prior Work:
- Random masking strategies in generic MIM are ill-suited: vascular structures in angiograms are extremely sparse, occupying only a small fraction of the image. Random, attention-guided, or loss-guided masking predominantly covers background regions, causing models to learn background reconstruction rather than vessel features.
- Pixel-level reconstruction objectives lack semantic discriminability: MSE loss encourages prediction of low-frequency background textures rather than high-frequency vascular details.
- Absence of large-scale datasets: unlike chest X-rays (e.g., CheXpert with 220K+ images) or CT, the angiography domain lacks large-scale pre-training datasets.
- General visual foundation models transfer poorly across domains: models pre-trained on natural images (e.g., DINOv3) lack the anatomical semantics specific to angiography.
Core Idea: Inject strong anatomical inductive biases into MIM — enabling the model to identify vascular regions and forcing it to focus on reconstructing them.
Method¶
Overall Architecture (VasoMIM)¶
Input X-ray angiogram → Frangi filter for vascular anatomy extraction + UNeXt-S segmentor for probability map generation → co-guidance → anatomy-guided masking strategy → ViT encoder + decoder for reconstruction → pixel reconstruction loss \(\mathcal{L}_{rec}\) + anatomical consistency loss \(\mathcal{L}_{cons}\)
Key Designs¶
- Frangi Filter for Vascular Anatomy Extraction:
  - Multi-scale Hessian analysis (σ = 1, 2, 3, 4) for tubular structure detection
  - Adaptive percentile thresholding (α = 92) to generate a coarse binary mask
  - Region growing to remove isolated artifacts, yielding the final binary vessel mask \(B \in \{0,1\}^{1 \times H \times W}\)
- Anatomy-Guided Masking Strategy:
  - Co-guidance: combines the Frangi filter mask \(B\) and the UNeXt-S segmentation probability map \(M\): \(G = \eta \cdot B + (1-\eta) \cdot M, \quad \eta=0.5\)
  - UNeXt-S compensates for the Frangi filter's tendency to miss low-contrast fine vessels.
  - Patch-level sampling probability: \(f(g_i) = \frac{\sum_j g_{ij}}{\sum_k \sum_j g_{kj}}\), assigning higher masking probability to patches with greater vessel density.
  - Weak-to-strong curriculum: random masking is blended more heavily in early training stages to avoid premature optimization on difficult targets; anatomical guidance is progressively increased: \(\beta_e = \beta_0 + \frac{e}{E}(\beta_E - \beta_0)\)
  - Per epoch: \(\beta_e \gamma N\) patches are sampled via the anatomy-guided strategy; \((1-\beta_e)\gamma N\) patches are sampled randomly.
- Anatomical Consistency Loss:
  - Core idea: applying the same segmentor \(\mathcal{S}\) to both the original image \(I\) and the reconstructed image \(I'\) should yield consistent outputs: \(\mathcal{L}_{cons} = \mathcal{L}_{CE}(\mathcal{S}(I), \mathcal{S}(I'))\)
  - A lightweight UNeXt-S (only 0.26M parameters) serves as a differentiable proxy, since the Frangi filter is non-differentiable.
  - Encourages the model to learn topologically accurate vascular representations rather than merely reproducing pixel intensities.
- Overall Training Objective: \(\mathcal{L}_{MIM} = \mathcal{L}_{rec} + \mathcal{L}_{cons}\)
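As a concrete illustration of the Frangi extraction step above, here is a minimal 2D Frangi-style vesselness plus thresholding sketch in numpy/scipy. The sigmas (1–4) and percentile (α = 92) follow the paper; the `beta`, `c`, and `min_size` values are illustrative assumptions, and the small-object removal is a stand-in for the paper's region-growing cleanup, not its exact implementation.

```python
import numpy as np
from scipy import ndimage

def frangi_vesselness(img, sigmas=(1, 2, 3, 4), beta=0.5, c=0.5):
    """Minimal 2D Frangi-style vesselness; responds to dark tubular
    structures (contrast-filled vessels) on a brighter background."""
    img = np.asarray(img, dtype=float)
    best = np.zeros_like(img)
    for s in sigmas:
        # Scale-normalized Hessian entries via Gaussian derivatives.
        Hxx = s**2 * ndimage.gaussian_filter(img, s, order=(0, 2))
        Hyy = s**2 * ndimage.gaussian_filter(img, s, order=(2, 0))
        Hxy = s**2 * ndimage.gaussian_filter(img, s, order=(1, 1))
        # Eigenvalues of the symmetric 2x2 Hessian, ordered |l1| <= |l2|.
        root = np.sqrt((Hxx - Hyy) ** 2 + 4.0 * Hxy**2)
        mu1 = 0.5 * (Hxx + Hyy + root)
        mu2 = 0.5 * (Hxx + Hyy - root)
        l1 = np.where(np.abs(mu1) <= np.abs(mu2), mu1, mu2)
        l2 = np.where(np.abs(mu1) <= np.abs(mu2), mu2, mu1)
        Rb = np.abs(l1) / (np.abs(l2) + 1e-10)   # deviation from a tube
        S = np.sqrt(l1**2 + l2**2)               # second-order structureness
        V = np.exp(-Rb**2 / (2 * beta**2)) * (1 - np.exp(-S**2 / (2 * c**2)))
        V[l2 < 0] = 0.0    # dark ridges have l2 > 0; suppress the rest
        best = np.maximum(best, V)               # max response over scales
    return best

def vessel_mask(img, alpha=92.0, min_size=64):
    """Percentile threshold + small-component removal -> binary mask B."""
    v = frangi_vesselness(img)
    mask = v > np.percentile(v, alpha)
    # Drop isolated connected components below min_size pixels.
    lab, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, lab, index=np.arange(1, n + 1))
    keep = np.isin(lab, 1 + np.flatnonzero(sizes >= min_size))
    return keep.astype(np.uint8)
```

On a synthetic image with a dark vertical stripe, the vesselness peaks on the stripe and the mask retains it while background stays zero.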
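The co-guidance map, patch-level sampling probability \(f(g_i)\), and weak-to-strong curriculum can be sketched together as follows. The function name and hyperparameter defaults (patch size 16, mask ratio 0.75, the \(\beta_0\)/\(\beta_E\) endpoints) are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def anatomy_guided_mask(B, M, epoch, total_epochs, mask_ratio=0.75,
                        patch=16, eta=0.5, beta0=0.0, betaE=1.0, rng=None):
    """Select which ViT patches to mask for one image.
    B: binary Frangi mask (H, W); M: segmentor probability map (H, W).
    Hyperparameter defaults here are assumptions for illustration."""
    rng = np.random.default_rng() if rng is None else rng
    G = eta * B + (1 - eta) * M                  # co-guidance map
    H, W = G.shape
    # Patch-level vessel density g_i -> sampling probability f(g_i).
    g = G.reshape(H // patch, patch, W // patch, patch).sum(axis=(1, 3)).ravel()
    p = g / g.sum()
    N = g.size
    n_mask = int(round(mask_ratio * N))
    # Weak-to-strong curriculum: beta_e = beta0 + (e/E) * (betaE - beta0).
    beta_e = beta0 + (epoch / total_epochs) * (betaE - beta0)
    n_anat = int(round(beta_e * n_mask))
    # beta_e * n_mask patches via anatomy-guided sampling, rest random.
    anat = rng.choice(N, size=n_anat, replace=False, p=p)
    rest = np.setdiff1d(np.arange(N), anat)
    rand = rng.choice(rest, size=n_mask - n_anat, replace=False)
    return np.sort(np.concatenate([anat, rand]))  # masked patch indices
```

Early in training (small `epoch`) most masked patches are random; by the final epoch nearly all are drawn in proportion to patch vessel density.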
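For a binary segmentor output, the anatomical consistency term reduces to a cross-entropy between \(\mathcal{S}(I)\) and \(\mathcal{S}(I')\). Below is a numpy stand-in for clarity only; in training the loss is computed on the differentiable UNeXt-S inside the autodiff graph, which this sketch does not reproduce.

```python
import numpy as np

def consistency_loss(p_orig, p_recon, eps=1e-7):
    """Binary cross-entropy between segmentor outputs on the original
    image (target) and the reconstructed image (prediction)."""
    p_recon = np.clip(p_recon, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(p_orig * np.log(p_recon)
                          + (1 - p_orig) * np.log(1 - p_recon)))
```

The total objective is then simply the sum `L_rec + consistency_loss(...)`, matching \(\mathcal{L}_{MIM} = \mathcal{L}_{rec} + \mathcal{L}_{cons}\); identical segmentor outputs minimize the term, while divergent ones are penalized.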
XA-170K Dataset¶
177,478 X-ray angiogram images collected from 4 public sources:
- CADICA: 42 patients, 6,594 frames
- SYNTAX: 231 patients, 2,943 images
- XCAD: 1,621 images
- CoronaryDominance: 1,574 patients, 160,320 images (primary source)
Experiments¶
Downstream Tasks and Datasets¶
- Vessel segmentation: ARCADE-V, CAXF, XCAV (DSC + clDice)
- Vessel segment segmentation: ARCADE-VS (DSC)
- Stenosis segmentation: ARCADE-S (DSC)
- Stenosis detection: Stenosis (mAP50, mAP75, mAP)
Main Results: Segmentation Tasks¶
| Method | Pre-training Data | ARCADE-V DSC | ARCADE-V clDice | XCAV DSC | ARCADE-S DSC | ARCADE-VS DSC | Avg. Rank |
|---|---|---|---|---|---|---|---|
| UNet (scratch) | - | 71.44 | 70.67 | 78.18 | 27.04 | 38.77 | 22.00 |
| MAE | XA-170K | 79.39 | 80.74 | 84.84 | 51.72 | 56.69 | 4.88 |
| DINOv3 | LVD-1698M | 79.36 | 80.90 | 82.76 | 53.57 | 54.36 | 7.25 |
| DeblurringMIM | XA-170K | 79.25 | 80.77 | 85.38 | 51.70 | 56.66 | 4.38 |
| RAD-DINO | LVD-142M+CXR-838K | 78.96 | 80.26 | 84.88 | 51.55 | 54.81 | 6.62 |
| VasoMIM-v1 | XA-170K | 79.90 | 81.57 | 85.80 | 54.52 | 58.03 | 2.12 |
| VasoMIM | XA-170K | 80.25 | 82.06 | 86.09 | 55.62 | 58.87 | 1.00 |
Key Findings:
- VasoMIM achieves the best performance on all metrics, with an average rank of 1.00.
- Substantial gains over UNet trained from scratch: +28.58 DSC on ARCADE-S and +20.10 DSC on ARCADE-VS.
- Domain-specific pre-training beats general large-scale pre-training: DINOv3, pre-trained on 1.69 billion natural images, is outperformed by VasoMIM pre-trained on only ~170K angiograms.
- VasoMIM further improves upon VasoMIM-v1 (the conference version) with statistical significance (p = 1.18×10⁻⁴, paired t-test).
Stenosis Detection Task¶
| Method | mAP50 | mAP75 | mAP |
|---|---|---|---|
| Faster R-CNN (scratch) | 88.37 | 19.01 | 36.63 |
| MAE | 92.30 | 24.28 | 39.69 |
| DINOv3 | 93.89 | 23.60 | 40.90 |
| VasoMIM-v1 | 94.25 | 25.01 | 40.91 |
| VasoMIM | 94.91 | 25.72 | 41.07 |
Ablation Study¶
Independent contributions of anatomy-guided masking and anatomical consistency loss:
| Guidance | \(\mathcal{L}_{cons}\) | ARCADE-V DSC | XCAV DSC |
|---|---|---|---|
| ✗ | ✗ | 79.31 | 84.52 |
| ✗ | ✓ | 79.85 (+0.54) | 85.79 (+1.27) |
| ✓ | ✗ | 79.87 (+0.56) | 85.92 (+1.40) |
| ✓ | ✓ | 80.25 (+0.94) | 86.09 (+1.57) |
- Both components are individually effective, and combining them yields the best results; the combined gain (+0.94 / +1.57) is smaller than the sum of the individual gains, indicating the two components partially overlap rather than being strictly additive.
- Anatomy-guided masking provides a larger boost on XCAV (+1.40 vs. +1.27), where vessel structures are more sparse.
Masking guidance visualization:
- Baseline (MAE random masking): only 5–10% of masked patches contain vessel structures.
- VasoMIM: the proportion of masked vessel patches progressively increases from ~20% to ~70% over training.
- Co-guidance captures low-contrast vascular branches missed by Frangi filtering alone.
Comparison with alternative reconstruction targets: The anatomical consistency loss using UNeXt-S (0.26M parameters) is far more lightweight than DINOv2-based (86M) feature distillation, while achieving comparable or superior performance.
Highlights & Insights¶
- Domain-knowledge-driven design: The Frangi filter is a classical vessel analysis tool; its integration as an anatomical prior into the MIM framework exemplifies effective combination of traditional methods and deep learning.
- Counter-intuitive "small data beats big data" finding: 170K domain-specific images outperform 1.69 billion natural images (DINOv3), underscoring the importance of domain adaptation.
- Complementarity of co-guidance: The Frangi filter provides hard edge responses but misses fine vessels, while UNeXt-S provides soft probability maps that compensate for such omissions.
- Curriculum from weak to strong guidance: Avoids difficult optimization in early stages by gradually increasing anatomical masking, reflecting sound engineering intuition.
- Elegant design of the anatomical consistency loss: A lightweight 0.26M segmentor replaces the non-differentiable Frangi filter with minimal computational overhead.
- Dataset contribution: XA-170K, the largest pre-training dataset in this domain, will be made publicly available.
Limitations & Future Work¶
- Limitations of the Frangi filter: Sensitivity to noise and potential misclassification of bony structures as vessels (partially mitigated by co-guidance).
- Only ViT-B/16 is used as backbone: Whether larger models (ViT-L/H) yield further gains remains unexplored.
- Full fine-tuning for downstream tasks: Parameter-efficient fine-tuning strategies (e.g., LoRA) and performance under more extreme few-shot settings are not investigated.
- Dataset scale remains limited: Although XA-170K is the largest in this domain, it is still substantially smaller than million-scale pre-training datasets for chest X-rays or CT.
- Restricted to coronary angiography: Generalizability to other angiographic modalities (e.g., cerebrovascular, peripheral vascular) has not been validated.
- UNeXt-S trained with Frangi pseudo-labels: Segmentation quality is bounded by Frangi filter performance, potentially introducing systematic bias.
Related Work & Insights¶
- General SSL: MAE, SimMIM, DINO, iBOT, I-JEPA, DINOv3
- Medical SSL: Model Genesis, LVM-Med, DeblurringMIM, RAD-DINO, CheXWorld, MedDINOv3
- MIM masking strategies: AMT (attention-guided), HPM (loss-guided), AnatoMask, HAP (human anatomy prior)
- Angiogram analysis: ARCADE dataset, XCAD, supervised methods based on UNet/Faster R-CNN
Rating ⭐⭐⭐⭐⭐¶
The method is elegantly designed with clear motivation and effective integration of domain knowledge. The experiments are highly comprehensive — covering 4 downstream tasks, 6 datasets, 20+ baselines, and detailed ablations. The dataset contribution and scaling law analysis further enhance the paper's value. This is a high-quality contribution to the field of self-supervised pre-training for medical image analysis.