LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
Conference: CVPR 2026 | arXiv: 2603.14644 | Code: Available | Area: Medical Imaging
Keywords: mammography, multi-vendor dataset, energy harmonization, histogram matching, benchmark
TL;DR
This paper introduces LUMINA, a multi-vendor full-field digital mammography (FFDM) dataset comprising 468 patients and 1,824 images, accompanied by a foreground-pixel histogram matching protocol for energy harmonization. The benchmark systematically evaluates CNN and Transformer models across three clinical tasks: diagnosis, BI-RADS classification, and breast density prediction.
Background & Motivation
- Background: Existing public mammography datasets (e.g., CBIS-DDSM, INbreast) suffer from notable deficiencies in scale, clinical annotation completeness, and vendor diversity. CBIS-DDSM is derived from legacy screen-film mammography (SFM) scans, while INbreast contains only 115 patients.
- Limitations of Prior Work: Multi-vendor acquisition systems differ in energy settings (high/low energy) and vendor-specific processing pipelines, resulting in substantial domain shift in image appearance and intensity distributions. Consequently, models generalize poorly across vendors.
- Goal: (1) Construct an FFDM benchmark dataset that emphasizes vendor diversity and energy metadata; (2) propose a model-agnostic foreground histogram harmonization method to mitigate vendor/energy-induced domain shift.
Method
Overall Architecture
The LUMINA workflow consists of three stages: (1) Data collection and curation — 1,824 FFDM images from six vendors, with pathology-confirmed malignancy labels, BI-RADS scores, and breast density annotations; (2) Foreground histogram harmonization (Energy Harmonization) — aligning all images to a low-energy reference distribution; (3) Multi-task benchmark evaluation — comparing CNNs (ResNet-50, DenseNet-121, EfficientNet-B0) and a Transformer (Swin-T) across the three clinical tasks.
Key Designs
- Multi-Vendor Dataset Construction: Data were collected from six vendors — IMS, Metaltronica, FUJIFILM, Siemens, Carestream, and GE — covering 468 patients (250 benign, 218 malignant) in 12–14 bit DICOM format. Annotations include pathology-confirmed outcomes, BI-RADS grades 0–6, and breast density categories A–D. Images in FUJIFILM's MONOCHROME1 format are uniformly converted to MONOCHROME2.
- Foreground-Only CDF Matching: The core idea is to exclude background pixels (intensity = 0) and perform CDF matching exclusively on the foreground breast region. Specifically, a foreground mask is defined as \(M_s = \{(x,y) \mid \mathbf{I}_s(x,y) > 0\}\); foreground histograms \(H_s(k)\) and \(H_r(k)\) are computed for the source and reference images respectively, normalized into CDFs \(\bar{C}_s(p)\) and \(\bar{C}_r(q)\), and intensity transformation is achieved via the mapping \(\mathcal{T}(p) = \arg\min_q |\bar{C}_s(p) - \bar{C}_r(q)|\). The reference histogram is drawn from the low-energy FFDM subset using 12-bit bins to preserve fine-grained detail. Design Motivation: Standard histogram matching is severely distorted by large areas of black background pixels; the foreground mask effectively mitigates this problem.
- Dual-View Shared-Backbone Network: CC (craniocaudal) and MLO (mediolateral oblique) views are processed independently through a weight-sharing backbone; the resulting features are concatenated and passed to a fully connected classifier. Compared to independent backbones, weight sharing reduces parameters by 48% (4.34M vs. 8.34M) while achieving comparable performance.
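The foreground-only CDF matching described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' released code: the function name is hypothetical, and `n_bins=4096` mirrors the paper's 12-bit binning assumption.

```python
import numpy as np

def match_foreground_histogram(source: np.ndarray, reference: np.ndarray,
                               n_bins: int = 4096) -> np.ndarray:
    """Map source foreground intensities onto the reference foreground CDF.

    Pixels with intensity 0 are treated as background and excluded from
    both histograms, so black borders cannot distort the matching.
    """
    src_fg = source[source > 0]
    ref_fg = reference[reference > 0]

    # Foreground histograms H_s, H_r over the 12-bit intensity range
    src_hist, edges = np.histogram(src_fg, bins=n_bins, range=(1, 4096))
    ref_hist, _ = np.histogram(ref_fg, bins=n_bins, range=(1, 4096))

    # Normalized CDFs C_s and C_r
    src_cdf = np.cumsum(src_hist) / max(src_hist.sum(), 1)
    ref_cdf = np.cumsum(ref_hist) / max(ref_hist.sum(), 1)

    # T(p) = argmin_q |C_s(p) - C_r(q)|, approximated with searchsorted
    mapping = np.searchsorted(ref_cdf, src_cdf).clip(0, n_bins - 1)

    # Apply the lookup table to foreground pixels; background stays 0
    centers = (edges[:-1] + edges[1:]) / 2
    src_bins = np.clip(np.digitize(source, edges) - 1, 0, n_bins - 1)
    return np.where(source > 0, centers[mapping[src_bins]],
                    0).astype(source.dtype)
```

Because the mapping is a per-image lookup table, the transform is training-free and can be applied once offline to the whole dataset.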
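The dual-view shared-backbone design can be sketched in PyTorch as below. The class name and the tiny stand-in backbone are illustrative assumptions; the paper uses backbones such as EfficientNet-B0.

```python
import torch
import torch.nn as nn

class DualViewClassifier(nn.Module):
    """CC and MLO views pass through ONE backbone (shared weights);
    features are concatenated and fed to a fully connected head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int = 2):
        super().__init__()
        self.backbone = backbone              # a single copy of the weights
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, cc: torch.Tensor, mlo: torch.Tensor) -> torch.Tensor:
        f_cc = self.backbone(cc)              # same weights for both views
        f_mlo = self.backbone(mlo)
        return self.head(torch.cat([f_cc, f_mlo], dim=1))

# Tiny stand-in backbone for illustration (not EfficientNet-B0)
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
model = DualViewClassifier(toy_backbone, feat_dim=8)
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```

Since the backbone appears once in the module, its parameters are counted (and updated) only once, which is where the reported ~48% parameter reduction comes from.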
Loss & Training
- Standard cross-entropy classification loss
- AdamW optimizer: \(\text{lr}=1 \times 10^{-3}\) for CNNs; \(\text{lr}=1 \times 10^{-5}\) for Swin-T
- 100 epochs with learning rate decay by 0.1 every 30 epochs; weight decay \(1 \times 10^{-5}\)
- 5-fold cross-validation; best model selected by validation AUC
- Data augmentation limited to horizontal flipping and resizing; grayscale images replicated to three channels
- PyTorch with CUDA determinism flags for reproducibility
- Training environment: 8 × NVIDIA A6000 GPUs
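The optimizer and schedule above can be written in PyTorch as follows; the helper function is illustrative, while the hyperparameter values are taken from the list above.

```python
import torch
import torch.nn as nn

def make_optimizer_and_scheduler(model: nn.Module, is_transformer: bool):
    """AdamW with the paper's learning rates and step decay."""
    lr = 1e-5 if is_transformer else 1e-3   # Swin-T vs. CNN learning rate
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-5)
    # decay lr by 0.1 every 30 epochs over a 100-epoch run
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
    return opt, sched

# Reproducibility flags (CUDA determinism, as mentioned in the paper)
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

model = nn.Linear(4, 2)
opt, sched = make_optimizer_and_scheduler(model, is_transformer=False)
```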
Key Experimental Results
Main Results
| Dataset / Task | Metric | Ours (Best) | Prev. SOTA | Notes |
|---|---|---|---|---|
| Diagnosis (Two-view, 512²) | AUC | 93.54% (EfficientNet-B0) | — | Best overall: dual-view + high resolution |
| Diagnosis (Single-view, 512²) | AUC | 92.13% (EfficientNet-B0) | — | Second best among single-view configs |
| BI-RADS Binary (224²) | AUC | 92.80% (EfficientNet-B0) | — | Low/high risk classification |
| BI-RADS Ternary (224²) | AUC | 83.27% (EfficientNet-B0) | — | Low/intermediate/high risk |
| Density Prediction (224²) | Macro-AUC | 89.43% (Swin-T) | — | Transformer better suited for density |
Ablation Study
| Configuration | Key Metric (AUC) | Notes |
|---|---|---|
| Shared backbone EfficientNet-B0 (224²) | 92.99% | 4.34M parameters |
| Independent backbone EfficientNet-B0 (224²) | 93.54% | 8.34M parameters; double the params, marginal gain |
| Raw images (no harmonization) | Baseline | Lower AUC across all tasks |
| Foreground histogram harmonization | +Gain | Consistent improvements in ACC/AUC/F1; Grad-CAM more focused |
Key Findings
- Dual-view models consistently outperform single-view counterparts, confirming the complementary value of CC and MLO views.
- EfficientNet-B0 achieves the best performance on diagnosis and BI-RADS tasks with only ~4M parameters; Swin-T excels at density prediction.
- Higher input resolution (512²) generally improves performance, though 224² remains competitive with substantially lower computational cost.
- Histogram harmonization not only improves quantitative metrics but also sharpens Grad-CAM attention, directing model focus toward lesion regions.
- Low-energy images benefit most from harmonization: high-energy images dominate the dataset, so models trained on raw data are biased toward high-energy appearance, and aligning all images to the low-energy reference removes this disadvantage for the low-energy minority.
Highlights & Insights
- Practical value of foreground masking: A simple yet effective idea — performing histogram matching after excluding background pixels. Though seemingly straightforward, this design is critical in mammography, where FFDM images contain large areas of black background.
- Model-agnostic preprocessing: The harmonization protocol can be applied as a lightweight preprocessing step to any backbone, making it straightforward to adopt in practice.
- Dataset systematicity: The combination of complete annotations (pathology + BI-RADS + density) with vendor/energy metadata is unique among existing datasets.
- Clinical insight: EfficientNet-B0 wins on diagnostic tasks with the fewest parameters, while Swin-T's global attention mechanism makes it better suited for density prediction — revealing a meaningful relationship between task type and model selection.
Limitations & Future Work
- The dataset scale remains modest (468 patients) compared to large-scale resources such as EMBED (~500K images).
- Data originate from a single institution in Turkey, limiting patient population diversity.
- The reference distribution for harmonization is a representative low-energy FFDM subset; no adaptive reference selection mechanism is explored.
- More advanced domain adaptation methods (e.g., adversarial training, frequency-domain alignment) are not investigated.
- No direct experimental comparison with established cross-vendor harmonization methods (e.g., ComBat, HarmoFL) is provided.
- The four-view model underperforms the dual-view model, likely due to overfitting caused by excessive parameters on a small dataset.
Related Work & Insights
- ComBat corrects batch effects via empirical Bayes in feature space rather than pixel space.
- HarmoFL reduces cross-site variation through frequency-domain amplitude normalization in federated learning settings.
- LUMINA complements datasets such as VinDr-Mammo (5,000 patients, single vendor, Vietnam) and RSNA (1,970 patients) — smaller in scale but offering greater vendor diversity.
- The pixel-space approach proposed here is more interpretable and requires no training, making it complementary to feature-space methods.
- Insight: For multi-center medical imaging research, lightweight pixel-space preprocessing may be more practical than complex domain adaptation pipelines.
- Combining LUMINA with MIL-PF (also CVPR 2026) is a promising direction: applying LUMINA's harmonization as preprocessing followed by frozen encoder + MIL classification.
Rating
- Novelty: ⭐⭐⭐ — Solid dataset contribution, though the proposed method (foreground histogram matching) is technically straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three tasks, multiple models, multiple resolutions, ablations, visualizations, and energy-level analysis.
- Writing Quality: ⭐⭐⭐⭐ — Rich tables and figures, transparent experimental setup, and a persuasive dataset comparison table.
- Value: ⭐⭐⭐⭐ — The multi-vendor benchmark makes a direct contribution to the community and is publicly released on OSF, Kaggle, and GitHub.