SALIENT: Frequency-Aware Paired Diffusion for Controllable Long-Tail CT Detection
Conference: CVPR2025
arXiv: 2602.23447
Code: Planned release (Non-commercial academic research license)
Area: Medical Image
Keywords: Diffusion models, wavelet domain, long-tail detection, CT augmentation, synthetic data, class imbalance, dose-response
TL;DR
Proposes SALIENT, a mask-conditioned generative framework based on wavelet-domain diffusion. Through frequency-aware, interpretable optimization objectives and paired lesion-mask volume generation, it achieves controllable and efficient synthetic data augmentation and precision recovery in long-tail CT detection. It systematically characterizes the augmentation dose-response curve for the first time.
Background & Motivation
- Two Major Failure Modes of Long-Tail Detection: (1) Intra-patient signal dilution—low target volume ratio (TVR), where lesions occupy a minuscule proportion of the field-of-view in whole-body CT scans; (2) Cross-dataset class imbalance—extreme long-tail distribution of rare lesions.
- Precision Ceiling: Even with a high AUROC, models exhibit low precision, poor AUPRC, and unstable F1 scores in detecting rare lesions, limiting clinical credibility.
- Limitations of Prior Work in Diffusion Models:
    - Pixel-space DDPMs in 3D incur prohibitive computational costs, requiring aggressive downsampling that leads to loss of details.
    - Existing mask-conditioned methods lack attribute-level controllable adjustment and paired supervision.
    - Frequency-domain diffusion methods rely on manual weight tuning and cannot disentangle interpretable image attributes.
- Unexplored Augmentation Dose Effect: Synthetic augmentation is typically assumed to help monotonically, but neither the optimal "therapeutic dose" nor the "toxic dose" beyond which additional synthetic data degrades performance has been systematically characterized.
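The precision ceiling described above follows from Bayes' rule: at low prevalence, even a detector with good sensitivity and specificity produces mostly false positives. The sketch below is a generic worked example, not taken from the paper; the operating point (0.90 sensitivity, 0.90 specificity) and the function name `precision_at_prevalence` are illustrative assumptions.

```python
def precision_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value (precision) from Bayes' rule."""
    true_pos = sensitivity * prevalence                    # P(pred+, disease+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(pred+, disease-)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point; the same detector is evaluated at three prevalences.
for prev in (0.50, 0.04, 0.01):
    ppv = precision_at_prevalence(0.90, 0.90, prev)
    print(f"prevalence={prev:.2f}  precision={ppv:.3f}")
```

With a fixed operating point, precision drops from 0.90 at balanced prevalence to roughly 0.08 at 1% prevalence, which is why AUPRC rather than AUROC is the informative metric in this setting.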
Method
1. Wavelet-Domain Diffusion Framework
- Applies a one-level Haar Discrete Wavelet Transform (DWT) to each CT slice to obtain four sub-bands: LL (low-frequency structure/brightness) and LH/HL/HH (high-frequency edges/texture details).
- Conducts diffusion in the wavelet coefficient space rather than the pixel space to explicitly separate global brightness from high-frequency structural details.
- Conditioning signals include: downsampled lesion masks + wavelet features of adjacent slices (2.5D context).
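To make the sub-band decomposition concrete, here is a minimal NumPy sketch of a one-level orthonormal 2D Haar DWT and its inverse. It is an illustration of the standard transform, not the authors' code; the function names are hypothetical, and the LH/HL labels follow one common convention (libraries differ on which is called which).

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt2(x):
    """One-level 2D Haar DWT of an array with even height and width."""
    x = np.asarray(x, dtype=np.float64)
    # Pass along axis 0: pairwise sums (low-pass) and differences (high-pass).
    lo = (x[0::2, :] + x[1::2, :]) / SQRT2
    hi = (x[0::2, :] - x[1::2, :]) / SQRT2
    # Pass along axis 1 on both intermediate results.
    ll = (lo[:, 0::2] + lo[:, 1::2]) / SQRT2
    lh = (lo[:, 0::2] - lo[:, 1::2]) / SQRT2
    hl = (hi[:, 0::2] + hi[:, 1::2]) / SQRT2
    hh = (hi[:, 0::2] - hi[:, 1::2]) / SQRT2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution array."""
    h, w = ll.shape
    lo = np.empty((h, 2 * w))
    hi = np.empty((h, 2 * w))
    lo[:, 0::2], lo[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    hi[:, 0::2], hi[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = (lo + hi) / SQRT2, (lo - hi) / SQRT2
    return x


# Stand-in for a CT slice; real inputs would be preprocessed slices.
slice_2d = np.random.default_rng(0).normal(size=(8, 8))
ll, lh, hl, hh = haar_dwt2(slice_2d)
print(ll.shape, np.allclose(haar_idwt2(ll, lh, hl, hh), slice_2d))
```

Each sub-band has half the height and width of the input, so a diffusion model operating on the four stacked sub-bands sees a 4-channel input at a quarter of the spatial size, and the transform is exactly invertible.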
2. Mask-Gated Frequency Scaling (FSA)
- Performs mask-gated frequency modulation on noisy wavelet coefficients before UNet input to generate modulated coefficients.
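The summary does not give the exact modulation rule, so the following is only one plausible reading: each sub-band is multiplied by a per-band gain inside the lesion mask and left unchanged outside it. The function `mask_gated_scale`, the gain values, and the LL/LH/HL/HH channel ordering are all assumptions made for illustration.

```python
import numpy as np


def mask_gated_scale(coeffs, mask, gains):
    """Scale wavelet sub-bands inside a mask (hypothetical formulation).

    coeffs: (4, H, W) noisy sub-bands, assumed ordered LL, LH, HL, HH
    mask:   (H, W) values in [0, 1], already at sub-band resolution
    gains:  four per-band scale factors applied where mask == 1
    """
    coeffs = np.asarray(coeffs, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    gains = np.asarray(gains, dtype=np.float64).reshape(-1, 1, 1)
    # gate == 1 outside the mask, == gain inside, and interpolates for soft masks.
    gate = 1.0 + mask[None, :, :] * (gains - 1.0)
    return coeffs * gate


coeffs = np.ones((4, 2, 2))
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
# Illustrative gains: leave LL/LH/HL untouched, damp HH inside the lesion.
out = mask_gated_scale(coeffs, mask, gains=[1.0, 1.0, 1.0, 0.5])
print(out[3])
```

In a learned variant the gains would presumably be trainable or predicted per sample; here they are fixed constants to keep the gating logic visible.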
3. Loss & Training
- Wavelet Sub-band Weighted Reconstruction Loss (\(L_{\text{wavelet}}\)): Weights the reconstruction error per sub-band, with higher weights near lesion boundaries, to suppress HH channel amplification.
- Low-Frequency Moment Regularization (\(L_{LL}\)): Constrains the mean and variance of the LL sub-band to match the real distribution, preventing brightness drift.
- High-Frequency Variance Control (\(L_{HF}\)): Constrains the variance of LH/HL/HH sub-bands to match real data, preserving texture fidelity.
- Total loss: \(L = L_{\text{wavelet}} + L_{LL} + L_{HF} + L_{\text{aux}}\)
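The two regularizers above are moment-matching terms. A minimal sketch of how they could be computed is shown below; the squared-difference penalty, the batch-level statistics, and the function names are assumptions, since the summary states only which moments are matched, not the exact functional form.

```python
import numpy as np


def ll_moment_loss(ll_fake, ll_real):
    """Penalize mismatch in mean and variance of the LL sub-band."""
    mean_gap = ll_fake.mean() - ll_real.mean()
    var_gap = ll_fake.var() - ll_real.var()
    return float(mean_gap ** 2 + var_gap ** 2)


def hf_variance_loss(hf_fake, hf_real):
    """Penalize variance mismatch per high-frequency band.

    hf_*: (3, H, W) arrays, assumed stacked as LH, HL, HH.
    """
    var_fake = hf_fake.var(axis=(1, 2))
    var_real = hf_real.var(axis=(1, 2))
    return float(np.mean((var_fake - var_real) ** 2))


rng = np.random.default_rng(0)
ll_real = rng.normal(size=(16, 16))
hf_real = rng.normal(size=(3, 16, 16))

# A uniform brightness shift changes the LL mean but not its variance.
print(ll_moment_loss(ll_real + 1.0, ll_real))
# Doubling high-frequency amplitude quadruples each band's variance.
print(hf_variance_loss(2.0 * hf_real, hf_real))
```

The first call isolates brightness drift (a pure mean shift), and the second isolates texture over-amplification, which are the two failure modes these terms are described as preventing.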
4. Structured Classifier-Free Guidance (CFG)
- Three forward passes: unconditional, mask-only conditional, and mask + adjacent-slice conditional.
- The adj-slice guidance strength decays along diffusion steps, emphasizing global anatomical consistency in early stages and focusing on lesion refinement in later stages.
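One way to combine the three passes is a nested guidance sum in which each term adds the contribution of one more condition. The sketch below assumes that composition and a linear decay of the adjacent-slice weight; neither the formula nor the schedule is confirmed by the summary, and `w_mask`, `w_max`, and both function names are hypothetical.

```python
import numpy as np


def adj_weight(t, num_steps, w_max=2.0):
    """Adjacent-slice guidance weight (assumed linear decay).

    t counts down from num_steps - 1 (pure noise) to 0 during sampling,
    so the weight is largest early and reaches zero at the final step.
    """
    return w_max * t / (num_steps - 1)


def structured_cfg(eps_uncond, eps_mask, eps_mask_adj, w_mask, w_adj):
    """Combine three noise predictions with separate guidance weights."""
    mask_term = eps_mask - eps_uncond    # effect of adding the mask
    adj_term = eps_mask_adj - eps_mask   # extra effect of adjacent slices
    return eps_uncond + w_mask * mask_term + w_adj * adj_term


# Toy noise predictions standing in for three UNet forward passes.
eps_u, eps_m, eps_ma = np.zeros(4), np.ones(4), np.full(4, 3.0)
for t in (999, 500, 0):
    w = adj_weight(t, num_steps=1000)
    print(t, structured_cfg(eps_u, eps_m, eps_ma, w_mask=1.5, w_adj=w))
```

With `w_mask = 1` and `w_adj = 1` the combination reduces to the fully conditioned prediction, and with `w_adj = 0` it reduces to ordinary mask-only classifier-free guidance, so the decay smoothly hands control from anatomical context to the lesion mask.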
5. 3D VAE Volumetric Lesion Mask Generation
- Trains MaskVAE3D to learn the latent pathological manifold from limited positive samples, sampling to generate morphologically diverse 3D lesion masks.
6. Semi-Supervised Segmentation Pairing
- Generates paired pseudo-segmentation masks for synthetic CT using UCMT (Uncertainty-Aware Cross-Model Training).
- Ultimately outputs paired (synthetic CT, lesion mask) for downstream mask-guided detection training.
Key Experimental Results
Dataset
- 5205 contrast-enhanced whole-body CT cases (200 mediastinal hematoma positive, 5005 negative), exhibiting a natural long-tail distribution (200/5205, or ~3.8% positivity rate).
- Unified preprocessing: orientation alignment, resampling, soft-tissue HU windowing, intensity normalization, and anatomical cropping of the mediastinum region.
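The windowing and normalization steps can be sketched as a clip followed by a linear rescale. The window below (level 40 HU, width 400 HU) is a commonly used soft-tissue setting and is an assumption here; the summary does not state the values actually used, and `window_and_normalize` is a hypothetical helper.

```python
import numpy as np


def window_and_normalize(hu, level=40.0, width=400.0):
    """Clip Hounsfield units to a window and rescale to [0, 1]."""
    lo = level - width / 2.0   # lower bound of the window, -160 HU by default
    hi = level + width / 2.0   # upper bound of the window, 240 HU by default
    clipped = np.clip(np.asarray(hu, dtype=np.float64), lo, hi)
    return (clipped - lo) / (hi - lo)


# Air, window centre, and dense bone as example inputs.
print(window_and_normalize([-1000.0, 40.0, 1000.0]))
```

Values below the window saturate at 0 and values above it at 1, which discards lung and bone contrast in favour of resolving soft-tissue differences in the mediastinum.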
Generation Quality
| Method | MS-SSIM↑ | FID↓ |
|---|---|---|
| MedDDPM (Pixel Space) | 0.63 | 118.4 |
| SALIENT (Wavelet Domain) | 0.83 | 46.5 |
- Segmentation Fidelity: Dice \(0.72 \pm 0.24\) vs \(0.27 \pm 0.16\) for MedDDPM.
- Computational Acceleration: \(4\times\) faster than 2.5D MedDDPM and \(28\times\) faster than 3D MedDDPM.
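For reference, the Dice score quoted above measures overlap between two binary masks as \(2|A \cap B| / (|A| + |B|)\). A minimal NumPy sketch follows; the convention of returning 1.0 when both masks are empty is an assumption, as evaluation toolkits differ on that edge case.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / denom


pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 0, 0])
print(dice_score(pred, target))
```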
Augmentation Dose-Response Experiments
| Seed Size | Optimal Dose | 1% Prevalence \(\Delta\)AUPRC | Remarks |
|---|---|---|---|
| \(n=50\) | \(2\times\) | +0.0605 | Stable therapeutic window |
| \(n=25\) | \(4\times\) | +0.12 | Dose shifts right, larger gain |
- Key Findings: As labeled seed size decreases, the optimal dose shifts right (\(2\times \to 4\times\)) and the AUPRC gain becomes larger, suggesting a seed-dependent augmentation mechanism under low-label conditions.
- AUROC remains consistently high, indicating that the core contribution of SALIENT is precision recovery (AUPRC) rather than mere separability inflation.
- Excessive synthetic augmentation (\(10\times\)) leads to performance degradation, confirming the existence of a "toxic dose."
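To give a feel for what the doses mean in absolute terms, the sketch below converts a seed size and a dose multiplier into sample counts. It assumes that a dose of \(k\times\) means \(k\) synthetic positives per real seed positive; the summary does not define the dose precisely, so this reading and the helper `augmented_mix` are hypothetical.

```python
def augmented_mix(n_seed: int, dose: float) -> dict:
    """Counts of real and synthetic positives under an assumed dose definition."""
    n_synth = int(round(n_seed * dose))  # assumed: dose = synthetic per real seed
    total = n_seed + n_synth
    return {
        "real": n_seed,
        "synthetic": n_synth,
        "synthetic_fraction": n_synth / total,
    }


# The two reported optima and the degraded 10x setting, using n=50 for the latter.
for n_seed, dose in ((50, 2), (25, 4), (50, 10)):
    print(n_seed, dose, augmented_mix(n_seed, dose))
```

Under this reading, both reported optima correspond to the same absolute number of synthetic positives (100), while the synthetic fraction of the positive pool rises from two thirds to four fifths. Whether that coincidence is meaningful depends on how the paper defines dose, so it should be treated as an observation about the arithmetic only.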
Radiologists' Blind Evaluation
- 5-point Likert scale evaluation: SALIENT outperforms MedDDPM in brightness/contrast realism, lesion-to-background fusion, high-frequency artifact suppression, and mask fidelity.
Highlights & Insights
- Elegant Design of Wavelet-Domain Diffusion: Combining frequency decomposition with diffusion models provides interpretable, attribute-level control "knobs" (brightness, structure, edge, and contrast) while substantially reducing 3D computational cost.
- Paired Generation: Generates paired (synthetic CT, mask) data end-to-end, directly supporting mask-guided detection training workflows without requiring manual annotation.
- Augmentation Dose-Response Analysis: Systematically characterizes the "therapeutic window" and "toxic dose" of synthetic augmentation for the first time, identifying the seed-dependent dose-shift pattern as a methodological contribution.
- Full-Pipeline Design: Covers the complete pipeline from 3D mask generation \(\to\) wavelet-domain synthesis \(\to\) semi-supervised pairing \(\to\) mask-guided detection \(\to\) subject-level aggregation.
- Remarkable Precision Recovery: Elevates AUPRC by up to 0.12 under extreme long-tail conditions (1% prevalence, \(n=25\) seeds at a \(4\times\) dose) while keeping AUROC consistently high.
Limitations & Future Work
- Evaluated on a single lesion type (mediastinal hematoma); generalization to other rare lesions (e.g., small nodules, micrometastases, microbleeds) remains to be demonstrated.
- Employs single-level Haar wavelets; more complex multi-level or learnable wavelet transforms might further enhance reconstruction quality.
- The semi-supervised paired masks rely on the quality of pseudo-labels from UCMT, meaning pseudo-label errors can propagate to downstream detection.
- The "therapeutic window" of the augmentation dose-response is an empirical discovery, lacking a theoretical explanation of why \(2\times\) or \(4\times\) is optimal.
- Evaluated solely on single-center data; generalization across multi-center, multi-device, and multi-protocol setups requires further verification.
- The morphological diversity of 3D VAE mask generation is limited by the number of positive samples in the training set (only 200 positive cases).
Rating
- Novelty: 4/5 — The combination of wavelet-domain diffusion, frequency-aware optimization objectives, and dose-response analysis is innovative. The structured CFG is also solid.
- Experimental Thoroughness: 4/5 — Comprehensive coverage of generation quality (MS-SSIM/FID), sub-band analysis, radiologists' blind evaluation, downstream detection, and the dose-response curve, though a single lesion type limits the generalizability of the conclusions.
- Writing Quality: 4/5 — Well-structured and richly illustrated (especially the informative frequency analysis figures), though some notation definitions are scattered across sections.
- Value: 4/5 — Provides a systematic and controllable synthetic augmentation solution for long-tail medical image detection, with the augmentation dose-response analysis framework offering methodological value.