Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning¶
Conference: CVPR 2026 arXiv: 2603.12988 Code: GitHub Area: Medical Imaging Keywords: Fairness, Lung Disease Diagnosis, CT Classification, Multiple Instance Learning, Adversarial Training
TL;DR¶
A fairness-aware framework based on attention MIL and gradient reversal layers (GRL) is proposed for multi-class lung disease diagnosis from chest CT volumes, eliminating gender bias while preserving diagnostic accuracy.
Background & Motivation¶
Automated deep learning analysis of chest CT carries significant clinical value for lung cancer screening and COVID-19 detection; however, models may encode and amplify demographic disparities present in training data, leading to systematic unfairness toward minority groups. The Fair Disease Diagnosis Challenge held at the CVPR 2026 PHAROS-AIF-MIH Workshop requires classifying CT scans into four categories (Healthy, COVID-19, Adenocarcinoma, Squamous Cell Carcinoma), evaluated by the gender-stratified macro F1 mean:
\(P = \tfrac{1}{2}\left(\mathrm{F1}_{\text{macro}}^{\text{male}} + \mathrm{F1}_{\text{macro}}^{\text{female}}\right)\)
This metric explicitly penalizes gender-unequal predictions. The paper addresses three core challenges: (1) sparse pathological signals in CT volumes (only a minority of 200+ slices contain lesions); (2) severe demographic imbalance (only 18 female vs. 91 male squamous cell carcinoma cases); and (3) gender potentially serving as a latent shortcut feature exploited by the model.
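The metric can be made concrete with a short sketch, assuming the score \(P\) is the unweighted mean of the male and female macro-F1 scores (consistent with the reported 0.679 / 0.691 → 0.685); function and variable names here are illustrative, not from the paper:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def gender_stratified_score(y_true, y_pred, gender, classes):
    """P = mean of macro-F1 computed separately on male and female scans."""
    scores = []
    for g in ("M", "F"):
        idx = [i for i, s in enumerate(gender) if s == g]
        scores.append(macro_f1([y_true[i] for i in idx],
                               [y_pred[i] for i in idx], classes))
    return sum(scores) / 2
```

Because each gender contributes half the score regardless of its sample count, a model cannot buy overall performance by sacrificing the smaller subgroup.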
Method¶
Overall Architecture¶
An attention MIL model built on a ConvNeXt backbone, augmented with a GRL adversarial branch to enforce gender fairness. The pipeline comprises four stages: slice feature extraction, attention pooling, disease classification, and adversarial gender classification.
Key Designs¶
- Attention MIL Pooling: Each CT volume is treated as a bag of slices. A ConvNeXt-Base encoder extracts a \(D\)-dimensional embedding \(h_i = f_{\text{enc}}(x_i)\) per slice; a two-layer MLP attention network produces a score \(s_i\) per slice and assigns importance weights \(w_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}\), and the weighted aggregation \(H = \sum_i w_i h_i\) yields the scan-level representation. Zero-padded positions are excluded via a binary mask. The core motivation is to let the model automatically learn which slices are diagnostically relevant without slice-level annotations.
- Gradient Reversal Layer (GRL) Adversarial Debiasing: A gender classifier is appended to the scan embedding \(H\), with the GRL reversing and scaling gradients during backpropagation: \(z_{\text{gen}} = c(\mathcal{R}_\lambda(H))\). The training objective is \(\mathcal{L} = \mathcal{L}_{\text{disease}} + \lambda_{\text{adv}} \mathcal{L}_{\text{gender}}\). The design motivation is that, even without explicit gender inputs, the backbone may pick up gender cues from CT acquisition parameters and body morphology and exploit them as spurious correlations.
- Subgroup Oversampling and Stratified CV: A `WeightedRandomSampler` substantially increases the sampling weight of female squamous cell carcinoma cases, ensuring the rarest subgroup is represented in every batch. Five-fold cross-validation is stratified on the joint (class, gender) key, guaranteeing representation of all eight subgroups in each fold.
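The first two designs can be sketched together in PyTorch. This is a minimal illustration, not the authors' code: the ConvNeXt encoder is abstracted away (inputs are precomputed slice embeddings), and all dimensions and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AttentionMILWithGRL(nn.Module):
    """Masked attention pooling over slice embeddings, plus a GRL gender head."""
    def __init__(self, dim=128, n_classes=4, lam=1.0):
        super().__init__()
        # two-layer MLP attention network producing one score per slice
        self.attn = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.disease_head = nn.Linear(dim, n_classes)
        self.gender_head = nn.Linear(dim, 2)
        self.lam = lam

    def forward(self, h, mask):
        # h: (B, N, D) slice embeddings; mask: (B, N), 1 for real slices, 0 for padding
        scores = self.attn(h).squeeze(-1)                    # (B, N)
        scores = scores.masked_fill(mask == 0, float("-inf"))  # padded slices get weight 0
        w = torch.softmax(scores, dim=1)                     # attention weights w_i
        H = torch.einsum("bn,bnd->bd", w, h)                 # scan embedding H = sum_i w_i h_i
        z_disease = self.disease_head(H)
        z_gender = self.gender_head(GradReverse.apply(H, self.lam))
        return z_disease, z_gender, w
```

Training would combine the two heads as \(\mathcal{L} = \mathcal{L}_{\text{disease}} + \lambda_{\text{adv}} \mathcal{L}_{\text{gender}}\): the gender head is trained to predict gender, while the reversed gradient pushes the shared embedding \(H\) to become uninformative about it.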
Loss & Training¶
- Focal Loss + Label Smoothing: \(\mathcal{L}_{\text{disease}} = -\alpha(1-p_t)^\gamma \log \tilde{p}_t\), with \(\gamma=2\), \(\alpha=0.25\), smoothing \(\varepsilon=0.1\), concentrating gradients on hard samples and scarce subgroups.
- Two-Stage Fine-Tuning: Epochs 1–5 freeze the backbone (LR \(10^{-3}\)); from Epoch 6 onward the backbone is unfrozen (backbone LR \(10^{-5}\), heads LR \(10^{-4}\)) with cosine annealing.
- Gradient Accumulation: \(K=4\) steps, yielding an effective batch size of 16 volumes.
- Slice Sampling: At most \(M=32\) slices per volume; random sampling during training and uniform sampling during inference.
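The loss and the subgroup sampling weights can be sketched as follows. This is one plausible reading of the paper's combined focal + label-smoothing formula (smoothed cross-entropy, down-weighted by \(\alpha(1-p_t)^\gamma\)); the function names are hypothetical.

```python
from collections import Counter

import torch
import torch.nn.functional as F

def focal_smooth_loss(logits, target, gamma=2.0, alpha=0.25, eps=0.1):
    """Focal loss with label smoothing: easy samples (high p_t) are down-weighted."""
    n_classes = logits.size(1)
    logp = F.log_softmax(logits, dim=1)
    p_t = logp.exp().gather(1, target.unsqueeze(1)).squeeze(1)  # prob of true class
    # label-smoothed target distribution: (1-eps) on the true class, eps spread uniformly
    q = torch.full_like(logp, eps / n_classes)
    q.scatter_(1, target.unsqueeze(1), 1.0 - eps + eps / n_classes)
    ce = -(q * logp).sum(dim=1)                                 # smoothed CE per sample
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def subgroup_weights(labels, genders):
    """Inverse-frequency weights over the joint (class, gender) key,
    suitable for torch.utils.data.WeightedRandomSampler."""
    counts = Counter(zip(labels, genders))
    return [1.0 / counts[(l, g)] for l, g in zip(labels, genders)]
```

With inverse-frequency weights on the joint key, the 18 female SCC cases are drawn far more often than their raw frequency, which is what keeps the rarest subgroup present in every batch.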
Key Experimental Results¶
Dataset¶
889 3D chest CT scans (734 train / 155 validation), four classes: Adenocarcinoma (300), COVID-19 (240), Healthy (240), Squamous Cell Carcinoma (109). The overall gender distribution is relatively balanced (481 male / 408 female), but only 18 female vs. 91 male SCC cases; volume depth varies widely (20–800+ slices per scan).
Main Results¶
| Dataset | Metric | Ours | Best Single Fold | Notes |
|---|---|---|---|---|
| 889 CT scans (734 train) | Competition Score P | 0.685 ± 0.030 | 0.759 (Fold 1) | 5-fold mean |
| Same | Male macro-F1 | 0.679 ± 0.068 | 0.754 | Gender gap reduced after GRL |
| Same | Female macro-F1 | 0.691 ± 0.030 | 0.722 | Female F1 slightly higher than male |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| Mean Pooling (Baseline) | Low | Tumor signal diluted by healthy slices |
| + Max Pooling | Improved | Restores detection of sparse tumor slices |
| + Attention-MIL | Further improved | Learns to ignore blank lung regions |
| + Subgroup Oversampling | Significant gain in Female F1 | Prevents minority class collapse |
| + GRL | P = 0.685 | Closes fairness gap; male and female performance equalized |
Key Findings¶
- GRL successfully narrows the gender fairness gap: female macro-F1 (0.691) slightly exceeds male (0.679).
- Squamous cell carcinoma (SCC) remains the most challenging class (F1 = 0.366 ± 0.083), limited by data scarcity and clinical overlap.
- OOF threshold optimization maintains strict gender fairness (Male F1 0.679 vs. Female F1 0.688).
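The paper does not spell out the OOF thresholding procedure; one common variant is a grid search over per-class multiplicative weights, tuned on out-of-fold predictions against the competition score. A minimal sketch under that assumption (helper names are hypothetical):

```python
from itertools import product

import numpy as np

def apply_thresholds(probs, t):
    """Rescale class probabilities by per-class weights, then take argmax."""
    return np.argmax(probs * t, axis=1)

def search_thresholds(oof_probs, oof_true, score_fn, grid=(0.8, 1.0, 1.2)):
    """Grid-search per-class weights on out-of-fold predictions,
    maximizing a (possibly fairness-aware) score_fn."""
    n_classes = oof_probs.shape[1]
    best_t, best_s = np.ones(n_classes), -1.0
    for t in product(grid, repeat=n_classes):
        s = score_fn(oof_true, apply_thresholds(oof_probs, np.array(t)))
        if s > best_s:
            best_s, best_t = s, np.array(t)
    return best_t, best_s
```

Because the search uses only out-of-fold predictions, the chosen thresholds are not fit to the held-out validation set, which is why this step doubles as a post-hoc calibration rather than extra overfitting.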
Highlights & Insights¶
- The fairness problem is decomposed into three distinct failure modes (sparse signals, demographic imbalance, latent shortcut features), each addressed by a dedicated module.
- GRL adversarial training offers an elegant solution for eliminating gender bias in feature space, more thorough than simple data balancing.
- OOF threshold optimization avoids overfitting to the validation set and serves as a practical post-hoc fairness calibration technique.
Limitations & Future Work¶
- Female SCC data remains extremely scarce (only 18 cases); oversampling cannot fully compensate.
- Generative data augmentation (e.g., diffusion-based CT synthesis for scarce subgroups) is unexplored.
- The fairness constraint could be extended to a stronger fairness-constrained optimization formulation.
- Attention visualization and clinical interpretability are not discussed in depth.
- The 5-fold ensemble at inference incurs high computational cost, requiring each test sample to pass through all five models.
Related Work & Insights¶
- The GRL domain adaptation method of Ganin & Lempitsky (2015) is elegantly transferred to the fairness setting.
- The Attention-MIL framework of Ilse et al. (2018) is well-suited to handling sparse signals in CT volumes.
- The combination of Focal Loss (Lin et al., 2017) and Label Smoothing is friendly to scarce subgroups.
- The inference-time combination of 5-fold ensemble, TTA, and OOF threshold optimization yields robust challenge performance.
Rating¶
- Novelty: ⭐⭐⭐ All components are established methods; however, their integration for fair diagnostic classification is valuable.
- Experimental Thoroughness: ⭐⭐⭐ 5-fold CV and ablation are reasonably complete, though the dataset is small and ablations are primarily qualitative.
- Writing Quality: ⭐⭐⭐⭐ Problem decomposition is clear, motivation is well-articulated, and the challenge paper format is well-structured.
- Value: ⭐⭐⭐ A practical framework for fairness in medical AI, though it primarily constitutes a challenge solution.
Additional Notes¶
This paper presents a competition entry for the CVPR 2026 PHAROS-AIF-MIH Workshop Challenge, achieving a robust competition score of 0.685. The overall methodology is engineering-oriented; nonetheless, the multi-level fairness constraint design — oversampling at the data level, GRL at the feature level, and OOF thresholding at the decision level — provides a reusable toolkit for fairness in medical AI.