Decoding with Structured Awareness: Integrating Directional, Frequency-Spatial, and Structural Attention for Medical Image Segmentation
Conference: AAAI 2026 · arXiv: 2512.05494 · Code: N/A · Area: Medical Imaging
Keywords: Medical image segmentation, decoder design, frequency-spatial fusion, directional-aware attention, multi-scale feature fusion
TL;DR
This paper proposes a decoder framework for medical image segmentation built from three modules: Adaptive Cross-Fusion Attention (ACFA) for directional awareness, Triple Feature Fusion Attention (TFFA) for spatial-frequency-wavelet fusion, and a Structural-aware Multi-scale Masking Module (SMMM) for skip-connection refinement. The framework achieves state-of-the-art performance across multiple benchmark datasets.
Background & Motivation
- Accurate delineation of organs, tumors, and lesions in medical image segmentation is critical for surgical planning and radiotherapy dose design.
- Limitations of Transformer decoders:
  - Insufficient edge detail capture: self-attention excels at global dependencies but is weak in local texture modeling.
  - Limited local texture recognition: fixed receptive fields struggle with ambiguous boundaries.
  - Inadequate spatial continuity modeling: simple additive skip connections lead to spatial detail loss and redundant information.
- Issue with U-Net skip connections: conventional skip connections rely on simple addition, failing to balance global and local features.
- CNN fixed receptive fields limit long-range dependency modeling; ViTs excel globally but are weak at short-range dependencies.
- A decoder framework is needed that enhances edge and structural detail representation while maintaining global perception.
Method
Overall Architecture
The encoder uses PVTv2-b2 (ImageNet pre-trained), and the decoder consists of three core modules:
- ACFA (Adaptive Cross-Fusion Attention): directional awareness module
- TFFA (Triple Feature Fusion Attention): frequency-spatial fusion module
- SMMM (Structural-aware Multi-scale Masking Module): skip connection optimization module
Key Designs
Module 1: ACFA — Adaptive Cross-Fusion Attention
Enhances model responsiveness to critical regions and structural directional modeling:
- For input feature map \(X \in \mathbb{R}^{B \times C \times H \times W}\), channel gating and spatial gating are applied:
  - Channel gating: \(\hat{X}_{l-1}^{CG} = X \odot \sigma(CG_{avg}(X) + CG_{max}(X))\)
  - Spatial gating: \(\hat{X}_{l-1}^{SG} = X \odot \sigma(f_{7 \times 7}^{Conv}(SG(X)))\)
- Spatially gated features are split into 4 groups along the channel dimension.
- Three directional branches with learnable directional parameters:
  - Planar direction: \(Tensor^{HW} \in [1, C/4, H, W]\)
  - Vertical direction: \(Tensor^{H} \in [1, C/4, H, 1]\)
  - Horizontal direction: \(Tensor^{W} \in [1, C/4, 1, W]\)
  - Each direction employs depth-wise separable convolutions to extract key responses.
- Fourth branch: standard convolution captures general contextual information, complementing details potentially missed by the directional branches.
- Four-branch features are concatenated and fused via LayerNorm and convolution.
Design Motivation: Structural directionality of organs and lesions in medical images (e.g., vessel orientation) is important; the module learns the most data-appropriate directional attention patterns end-to-end.
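The gating and four-branch structure above can be condensed into a minimal PyTorch sketch. Everything here is an illustrative assumption rather than the authors' implementation: the class name `ACFASketch`, the bottleneck inside the channel gate, the CBAM-style avg/max spatial descriptor, and the use of `GroupNorm(1, C)` as a channel-wise LayerNorm.

```python
import torch
import torch.nn as nn

class ACFASketch(nn.Module):
    """Illustrative ACFA sketch (assumed layer choices, not the authors' code)."""
    def __init__(self, c: int, h: int, w: int):
        super().__init__()
        assert c % 4 == 0
        g = c // 4
        # channel gating: shared bottleneck applied to avg- and max-pooled descriptors
        self.cg = nn.Sequential(nn.Conv2d(c, g, 1), nn.ReLU(), nn.Conv2d(g, c, 1))
        # spatial gating: 7x7 conv over stacked channel-avg / channel-max maps
        self.sg = nn.Conv2d(2, 1, 7, padding=3)
        # learnable directional tensors: planar, vertical, horizontal
        self.t_hw = nn.Parameter(torch.ones(1, g, h, w))
        self.t_h = nn.Parameter(torch.ones(1, g, h, 1))
        self.t_w = nn.Parameter(torch.ones(1, g, 1, w))
        # depth-wise separable conv per directional branch
        self.dw = nn.ModuleList([
            nn.Sequential(nn.Conv2d(g, g, 3, padding=1, groups=g), nn.Conv2d(g, g, 1))
            for _ in range(3)])
        self.ctx = nn.Conv2d(g, g, 3, padding=1)  # fourth branch: standard conv
        # channel-wise LayerNorm via GroupNorm(1, C), then 1x1 fusion conv
        self.fuse = nn.Sequential(nn.GroupNorm(1, c), nn.Conv2d(c, c, 1))

    def forward(self, x):
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.cg(avg) + self.cg(mx))   # channel gate
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.sg(s))                   # spatial gate
        g1, g2, g3, g4 = x.chunk(4, dim=1)                  # 4 channel groups
        branches = [self.dw[0](g1 * self.t_hw),             # planar direction
                    self.dw[1](g2 * self.t_h),              # vertical direction
                    self.dw[2](g3 * self.t_w),              # horizontal direction
                    self.ctx(g4)]                           # general context
        return self.fuse(torch.cat(branches, dim=1))

x = torch.randn(2, 16, 8, 8)
y = ACFASketch(16, 8, 8)(x)
```

Note that the learnable directional tensors fix the spatial size at construction time, which is why the sketch takes `h` and `w` as arguments.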
Module 2: TFFA — Triple Feature Fusion Attention
Fuses spatial-domain, Fourier-domain, and wavelet-domain features for joint frequency-spatial representation:
- Wavelet branch: employs DoG (Difference of Gaussians) and Mexican Hat wavelets.
  - DoG highlights regions of significant gray-level change, enhancing edge and contour perception: \(\psi_{a,b}^{DoG}(x) = -\frac{1}{\sqrt{a}} \frac{x-b}{a} e^{-\frac{(x-b)^2}{2a^2}}\)
  - Mexican Hat detects edge zero-crossings via second-order derivatives while suppressing noise: \(\psi_{a,b}^{MH}(x) = \frac{2}{\sqrt{3a}\,\pi^{1/4}} \left(1 - \left(\frac{x-b}{a}\right)^2\right) e^{-\frac{(x-b)^2}{2a^2}}\)
  - Scale parameter \(a\) and shift parameter \(b\) are both learnable.
- Fourier branch: transforms images from the spatial domain to the frequency domain.
  - High-frequency components → edges and textures; low-frequency components → contours and background.
  - Learnable weight matrices modulate frequency-domain features.
  - Compensates for the deficiency of convolutional models in perceiving large-scale structures.
- Spatial branch: point-wise convolution extracts spatial features.
- Attention-gated fusion: the three branch outputs are adaptively fused via dynamic attention weights, avoiding the over-smoothing of conventional fusion strategies.
Module 3: SMMM — Structural-aware Multi-scale Masking Module
Optimizes skip connections between encoder and decoder:
- Encoder and decoder features are each activated via point-wise convolutions to highlight spatial cues.
- Multi-scale perception:
  - Dual-path extraction using 3×3 and 5×5 depth-wise separable convolutions.
  - Two-stage channel splitting with ReLU activation to enlarge the receptive field.
  - Two-level feature concatenation followed by 3×3 and 5×5 convolutions for further fusion.
- Spatial saliency mask:
  - Three distinct channel-gated filters identify the most discriminative spatial regions.
  - Softmax weighting emphasizes high-response areas.
  - Effectively handles ambiguous lesions and indistinct contours.
- Filtered features are summed and processed by a dilated convolution with dilation=2 to expand the receptive field.
- Final normalization and point-wise convolution perform channel alignment.
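The SMMM pipeline above can be approximated with a short PyTorch sketch. The layer choices (single-stage depth-wise separable paths, 1×1 convolutions as the channel-gated filters, `GroupNorm` for the final normalization) are assumptions made for brevity; this is a reading aid, not the paper's code.

```python
import torch
import torch.nn as nn

class SMMMSketch(nn.Module):
    """Illustrative sketch of the Structural-aware Multi-scale Masking Module."""
    def __init__(self, c: int):
        super().__init__()
        self.pw_enc = nn.Conv2d(c, c, 1)   # point-wise activation, encoder path
        self.pw_dec = nn.Conv2d(c, c, 1)   # point-wise activation, decoder path
        # dual-path multi-scale extraction: 3x3 and 5x5 depth-wise separable convs
        self.dw3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c), nn.Conv2d(c, c, 1))
        self.dw5 = nn.Sequential(nn.Conv2d(c, c, 5, padding=2, groups=c), nn.Conv2d(c, c, 1))
        # three channel-gated filters whose responses form a softmax saliency mask
        self.gates = nn.ModuleList([nn.Conv2d(c, 1, 1) for _ in range(3)])
        self.dilated = nn.Conv2d(c, c, 3, padding=2, dilation=2)  # enlarge receptive field
        self.out = nn.Sequential(nn.GroupNorm(1, c), nn.Conv2d(c, c, 1))  # channel alignment

    def forward(self, enc, dec):
        x = torch.relu(self.pw_enc(enc)) + torch.relu(self.pw_dec(dec))
        x = self.dw3(x) + self.dw5(x)                     # dual-path multi-scale fusion
        mask = torch.softmax(torch.cat([g(x) for g in self.gates], dim=1), dim=1)
        x = sum(x * mask[:, i:i + 1] for i in range(3))   # saliency-weighted sum
        x = self.dilated(x)
        return self.out(x)

enc = torch.randn(2, 16, 8, 8)
dec = torch.randn(2, 16, 8, 8)
out = SMMMSketch(16)(enc, dec)
```

The softmax over the three gate responses is what distinguishes this from plain additive skip connections: each spatial location is forced to commit attention mass to its most discriminative filter response.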
Loss & Training
- Encoder: PVTv2-b2, ImageNet pre-trained.
- Optimizer: AdamW, lr = 1e-4.
- Batch size = 12; masks normalized to [0, 1]; no data augmentation.
- Training epochs: ISIC 2017/2018 → 200; Synapse → 300; ACDC → 400.
- Hardware: NVIDIA A100 GPU (40 GB).
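A minimal training-step sketch with the reported settings (AdamW, lr = 1e-4, batch size 12, masks scaled to [0, 1]). The one-layer stand-in model and the BCE loss are assumptions: the note does not name the loss function or expose the real PVTv2-b2 network.

```python
import torch

# Stand-in for the PVTv2-b2 encoder + proposed decoder (assumption for brevity)
model = torch.nn.Conv2d(3, 1, kernel_size=1)
# Reported setting: AdamW with learning rate 1e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.rand(12, 3, 32, 32)                               # batch size 12
masks = torch.randint(0, 256, (12, 1, 32, 32)).float() / 255.0   # normalize to [0, 1]

# Assumed loss: per-pixel binary cross-entropy (not specified in the note)
loss = torch.nn.functional.binary_cross_entropy_with_logits(model(images), masks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```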
Key Experimental Results
Main Results (Synapse Multi-organ Segmentation)
| Method | DSC↑ | HD95↓ | Spl | RKid | LKid | Pan |
|---|---|---|---|---|---|---|
| TransUNet | 77.49 | 31.69 | 85.08 | 77.02 | 81.87 | 55.86 |
| Swin-UNet | 79.13 | 21.55 | 90.66 | 79.61 | 83.28 | 56.58 |
| EMCAD | 83.63 | 15.68 | 92.17 | 84.10 | 88.08 | 68.51 |
| AD-LA Former | 83.48 | 21.31 | 88.72 | 70.82 | 86.50 | 84.69 |
| Ours | 83.92 | 18.91 | 92.46 | 86.47 | 89.26 | 69.95 |
ISIC 2017 Skin Lesion Segmentation:
| Method | DSC↑ | SE | SP | ACC |
|---|---|---|---|---|
| LKA | 90.99 | 90.55 | 98.49 | 96.98 |
| EMCAD | 90.06 | 93.70 | 96.81 | 96.55 |
| Ours | 91.40 | 92.75 | 97.78 | 97.26 |
ACDC Cardiac Segmentation:
| Method | DSC↑ | RV | Myo | LV |
|---|---|---|---|---|
| DMSA-UNet | 92.28 | 90.32 | 90.49 | 96.02 |
| EMCAD | 92.12 | 90.65 | 89.68 | 96.02 |
| Ours | 92.75 | 91.18 | 90.40 | 96.67 |
Ablation Study
Module Ablation on Synapse:
| Configuration | DSC↑ | HD95↓ |
|---|---|---|
| Baseline | 81.35 | 20.42 |
| + ACFA | 82.04 | 21.46 |
| + ACFA + TFFA | 83.23 | 16.72 |
| + ACFA + TFFA + SMMM | 83.92 | 18.91 |
Module Ablation on ISIC 2017:
| Configuration | DSC↑ | SE | ACC |
|---|---|---|---|
| Baseline | 85.95 | 83.68 | 96.05 |
| + ACFA | 87.82 | 85.12 | 95.92 |
| + ACFA + TFFA | 89.15 | 89.83 | 96.85 |
| + ACFA + TFFA + SMMM | 91.40 | 92.75 | 97.26 |
TFFA Internal Ablation (ISIC 2018):
| Fourier | Mexican Hat | DoG | DSC↑ | SE |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 90.32 | 89.31 |
| ✓ | ✗ | ✗ | 90.48 | 91.43 |
| ✓ | ✓ | ✗ | 90.57 | 92.63 |
| ✓ | ✓ | ✓ | 90.71 | 93.34 |
Computational Cost: Full model — 42.52M parameters, 18.29 GMac (Baseline: 25.07M / 11.85 GMac).
Key Findings
- ACFA improves DSC by 0.7 points on Synapse (81.35 → 82.04) with only 5.6M additional parameters, validating the effectiveness of directional awareness.
- TFFA contributes the largest gain: DSC from 82.04 → 83.23 (Synapse); SE from 85.12 → 89.83 (ISIC 2017).
- SMMM is more effective on detail-rich datasets (ISIC 2017 DSC +2.25).
- The three modules are complementary: each cumulative addition yields further gains on both Synapse and ISIC 2017, and the full combination delivers the best overall results.
Highlights & Insights
- Frequency-domain fusion design rationale: DoG performs band-pass filtering to enhance textures; Mexican Hat performs second-order derivative edge detection; Fourier captures global dependencies — the three are mutually complementary.
- Learnability of directional awareness: all three directional parameters are learnable, eliminating the need for hand-crafted directional filters.
- Deep-level skip connection optimization: SMMM replaces simple addition with spatial saliency masking, reducing redundant information propagation.
- Cross-dataset validation: state-of-the-art or near state-of-the-art performance is achieved on abdominal multi-organ, skin lesion, and cardiac segmentation tasks.
Limitations & Future Work
- HD95 on Synapse is not optimal (18.91 vs. HiFormer-B's 14.70), indicating room for further improvement in boundary precision.
- Parameter count increases from 25M to 42.5M (+70%), and computation from 11.85 to 18.29 GMac (+54%), incurring a non-trivial efficiency cost.
- Validation is limited to 2D segmentation; extension to 3D medical image segmentation remains unexplored.
- Gallbladder segmentation performance lags behind AD-LA Former (67.51 vs. 83.30), indicating small organ segmentation remains a weakness.
- The initialization and convergence behavior of the wavelet scale parameter \(a\) and shift parameter \(b\) are not thoroughly analyzed.
Related Work & Insights
- Compared to EMCAD (multi-scale attention decoder), this work provides a complementary perspective on multi-scale information via frequency-domain fusion.
- The value of frequency-domain analysis in medical imaging is reaffirmed: Fourier + wavelet combinations outperform single-transform approaches.
- The saliency masking concept in SMMM can serve as an alternative to conventional attention gating, more directly addressing feature redundancy.
- The directional awareness design is generalizable to tasks with strong directional structure, such as vessel segmentation and crack detection.
Rating
- Novelty: ⭐⭐⭐ — The module combination is relatively novel, though individual components are not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, comprehensive ablation, and computational cost analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with complete mathematical formulations.
- Value: ⭐⭐⭐⭐ — Provides a practical decoder design paradigm for medical image segmentation.