Robust Fair Disease Diagnosis in CT Images

Conference: CVPR 2026
arXiv: 2604.09710
Code: https://github.com/Purdue-M2/Fair-Disease-Diagnosis
Area: Medical Imaging
Keywords: CT Diagnosis, Fairness, Class Imbalance, CVaR, Logit Adjustment

TL;DR

This paper proposes a dual-objective training framework combining Logit-Adjusted Cross-Entropy (for class imbalance) and CVaR aggregation (for demographic fairness), achieving a gender-averaged macro F1 of 0.8403 with a fairness gap of only 0.0239 on CT disease diagnosis.

Background & Motivation

Background: Deep learning has achieved strong aggregate performance on CT diagnosis, yet aggregate metrics obscure uneven model behavior across patient subgroups.

Limitations of Prior Work: Class imbalance and demographic underrepresentation frequently co-occur in clinical data. For example, squamous cell carcinoma has only 84 training samples, of which only 5 are female. Standard training causes the model to learn disease features almost entirely from male samples.

Key Challenge: Logit adjustment corrects for class-frequency bias but is agnostic to group labels, while CVaR balances group-level losses but is agnostic to class structure. Neither alone addresses the highest-risk intersection: female patients with squamous cell carcinoma.

Goal: Design a unified training objective that simultaneously addresses class imbalance and demographic unfairness.

Key Insight: The two mechanisms operate along orthogonal axes: logit adjustment governs sample-level gradient direction (class axis), while CVaR governs group-level gradient magnitude (demographic axis).

Core Idea: The combination of Logit Adjustment and CVaR constitutes the minimal objective that is simultaneously sensitive to both the class axis and the demographic axis.

Method

Overall Architecture

Backbone and head: 3D ResNet-18 (pretrained on Kinetics-400) → 512 → 256 → 4-class head. Each training step (1) computes the Logit-Adjusted cross-entropy loss per sample, (2) averages those losses within each gender group, and (3) applies CVaR aggregation, which upweights whichever group currently has the higher loss.
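
Step (2) of the pipeline above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function name `group_mean_losses`, the string group labels, and the example losses are all hypothetical. Because the batch size is small, a batch may contain only one gender; the sketch assumes such a batch simply yields a single group mean, which the source does not specify.

```python
from collections import defaultdict


def group_mean_losses(sample_losses, sample_groups):
    """Average per-sample losses within each group present in the batch."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for loss, group in zip(sample_losses, sample_groups):
        totals[group] += loss
        counts[group] += 1
    # Groups with no samples in this batch are absent from the result
    # (an assumption of this sketch, not a detail confirmed by the source).
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical per-sample losses and gender labels for one small batch.
losses = [0.2, 1.0, 0.6]
groups = ["male", "female", "male"]
means = group_mean_losses(losses, groups)
print(means)
```

The resulting dictionary of group means is what a group-level aggregator such as CVaR would consume in step (3).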

Key Designs

Logit-Adjusted Cross-Entropy:
- Function: Corrects class-frequency bias at the sample level.
- Mechanism: \(\ell^{LA}(x,y) = -\log\frac{\exp(f_y(x)+\tau\log\pi_y)}{\sum_{y'}\exp(f_{y'}(x)+\tau\log\pi_{y'})}\), where \(\pi_y\) is the training prior of class \(y\). This is equivalent to an inter-class margin loss with larger margins for rare classes. At \(\tau=1\), the loss is Fisher-consistent for the balanced error rate.
- Design Motivation: Unlike inverse-frequency weighting, logit adjustment directly modifies inter-class decision boundary margins, so it remains effective even where the training data are separable and loss reweighting has little influence.
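
The logit-adjusted loss above can be checked numerically with a small standard-library sketch. The class priors below are hypothetical placeholders, not the dataset's actual frequencies, and `logit_adjusted_ce` is an illustrative name. With all logits equal and \(\tau = 1\), the loss reduces to \(-\log\pi_y\), so the rare class incurs a larger loss than the common one until the model gives it a larger margin; setting \(\tau = 0\) recovers plain cross-entropy.

```python
import math


def logit_adjusted_ce(logits, label, priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(class prior)."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]


# Hypothetical 4-class priors (illustrative only; they sum to 1).
priors = [0.40, 0.30, 0.20, 0.10]
uninformative = [0.0, 0.0, 0.0, 0.0]

loss_common = logit_adjusted_ce(uninformative, label=0, priors=priors)
loss_rare = logit_adjusted_ce(uninformative, label=3, priors=priors)
plain_ce = logit_adjusted_ce(uninformative, label=3, priors=priors, tau=0.0)
print(loss_common, loss_rare, plain_ce)
```
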
CVaR Fairness Aggregation:
- Function: Directs optimization pressure at the group level toward the currently worst-performing demographic group.
- Mechanism: \(\mathcal{L} = \min_\lambda \left\{ \lambda + \frac{1}{\alpha|\mathcal{G}|}\sum_{g\in\mathcal{G}}[\mathcal{L}_g - \lambda]_+ \right\}\), where \(\mathcal{L}_g\) is the mean loss of group \(g\), \([\cdot]_+ = \max(\cdot, 0)\), and \(\alpha\) controls fairness intensity. The optimal \(\lambda\) is found via binary search; the problem is convex in \(\lambda\), so the overhead is negligible.
- Design Motivation: CVaR provides a tractable upper bound on worst-case group risk without requiring specific assumptions about the group distribution.
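
A minimal sketch of the CVaR aggregation, written against the formula as stated here. Instead of the binary search mentioned above, it evaluates the objective at each group loss, which is sufficient because the objective is convex and piecewise linear in \(\lambda\) with kinks only at those values (for \(0 < \alpha \le 1\)). The function name and the two example group losses are hypothetical. Under this formula with two groups, \(\alpha \le 0.5\) returns the worst-group loss, \(\alpha = 1\) returns the plain mean, and values in between interpolate.

```python
def cvar_aggregate(group_losses, alpha):
    """Return (objective value, minimizing lambda) for the CVaR objective."""
    n = len(group_losses)

    def objective(lam):
        excess = sum(max(loss - lam, 0.0) for loss in group_losses)
        return lam + excess / (alpha * n)

    # Convex, piecewise linear in lambda with kinks at the group losses,
    # so a minimizer can be found among the group losses themselves.
    best_lam = min(group_losses, key=objective)
    return objective(best_lam), best_lam


# Hypothetical mean losses for two groups.
group_losses = [0.4, 0.9]
value_half, _ = cvar_aggregate(group_losses, alpha=0.5)  # worst-group loss
value_08, _ = cvar_aggregate(group_losses, alpha=0.8)    # between mean and max
value_one, _ = cvar_aggregate(group_losses, alpha=1.0)   # plain mean
print(value_half, value_08, value_one)
```
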
Orthogonality Analysis:
- Function: Theoretically justifies the complementarity of the two mechanisms.
- Mechanism: Logit adjustment is invariant to group membership; CVaR is invariant to class structure. Their combination is the minimal objective sensitive to both axes simultaneously. On female squamous cell carcinoma (5 samples): LA alone causes 94% of gradients to originate from male samples; CVaR alone balances group losses but rare classes remain neglected.
- Design Motivation: Demonstrates that this is not merely a stacking of two known techniques; their interaction produces effects unattainable by either component individually.
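
The 94% figure appears to follow directly from the sample counts quoted earlier (84 squamous cell carcinoma training samples, 5 of them female). The short calculation below makes that reading explicit; it assumes that "share of gradients" means the share of equally weighted per-sample loss terms for this class, which is an interpretation rather than something the source states.

```python
# Counts quoted in the summary for squamous cell carcinoma training samples.
total_squamous = 84
female_squamous = 5
male_squamous = total_squamous - female_squamous

# Share of this class's per-sample loss terms contributed by each gender
# when every sample is weighted equally (no group-level reweighting).
male_share = male_squamous / total_squamous
female_share = female_squamous / total_squamous

print(f"male share: {male_share:.1%}, female share: {female_share:.1%}")
```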

Loss & Training

Adam optimizer, lr = 1e-4, cosine annealing, 70 epochs. Batch size = 2 (constrained by 3D volume memory). \(\tau = 1.0\) is fixed; \(\alpha\) is searched over a grid from 0.4 to 0.9.
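
For orientation, a cosine-annealed learning rate with these settings could look like the sketch below. Only the base rate (1e-4) and the epoch count (70) come from the text; the single decay cycle, the per-epoch stepping, the absence of warmup, and the minimum rate of 0 are assumptions, and `cosine_annealed_lr` is an illustrative name.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=70, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine decay cycle."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One value per epoch boundary, from epoch 0 through epoch 70.
schedule = [cosine_annealed_lr(e) for e in range(71)]
print(schedule[0], schedule[35], schedule[70])
```

Under these assumptions the rate starts at the base value, passes through half of it at the midpoint, and decays to the minimum at the final epoch.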

Key Experimental Results

Main Results

Method	α	F1_male	F1_female	Score↑	Gap↓
Baseline (CE)	-	0.7957	0.6868	0.7413	0.1089
LA Only	-	0.8596	0.7375	0.7986	0.1221
CVaR Only	0.7	0.8738	0.7591	0.8165	0.1148
LA+CVaR	0.8	0.8283	0.8522	0.8403	0.0239
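
The Score and Gap columns can be reproduced from the two per-gender F1 columns. Score is the gender-averaged macro F1 named in the TL;DR; reading Gap as the absolute difference between the two F1 values is inferred from the numbers rather than stated explicitly. Because the inputs are already rounded to four decimals, a recomputed value can differ from the table in the last digit.

```python
def fairness_summary(f1_male, f1_female):
    """Gender-averaged macro F1 and the absolute gap between genders."""
    score = (f1_male + f1_female) / 2.0
    gap = abs(f1_male - f1_female)
    return score, gap


# Per-gender macro F1 values copied from the results table.
rows = {
    "Baseline (CE)": (0.7957, 0.6868),
    "LA Only": (0.8596, 0.7375),
    "CVaR Only": (0.8738, 0.7591),
    "LA+CVaR": (0.8283, 0.8522),
}
for name, (f1_m, f1_f) in rows.items():
    score, gap = fairness_summary(f1_m, f1_f)
    print(f"{name}: score={score:.4f}, gap={gap:.4f}")
```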

Ablation Study

Configuration	Score	Gap	Note
CE Baseline	0.7413	0.1089	Female squamous cell carcinoma recall = 0.08
LA Only	0.7986	0.1221	Score improves but gap widens
CVaR Only	0.8165	0.1148	More balanced but rare classes still neglected
LA+CVaR α=0.8	0.8403	0.0239	Only configuration where female F1 exceeds male

Key Findings

\(\alpha = 0.8\) is the optimal configuration and the only setting where female macro F1 (0.8522) surpasses male (0.8283).
Female squamous cell carcinoma F1 improves from 0.14 (baseline) to 0.63; recall from 0.08 to 0.46.
The \(\alpha\) sweep exhibits a three-phase non-monotonic pattern: \(0.4\)–\(0.6\) (broad tail dilutes fairness signal), \(0.7\)–\(0.8\) (precise focus on hard subgroups), \(0.9\) (overly narrow, performance rebounds).

Highlights & Insights

Orthogonality Analysis: Clearly demonstrates why two seemingly simple components produce synergistic effects that surpass their individual contributions.
Extreme Scenario with 5 Female Squamous Cell Carcinoma Samples: This severely imbalanced intersection perfectly illustrates why a dual-objective formulation is necessary.
Three-Phase Behavior of α: Reveals the nuanced influence of the CVaR concentration parameter, offering practical guidance for hyperparameter tuning.

Limitations & Future Work

Fairness is validated only for binary gender partitioning; extension to broader demographic attributes remains unexplored.
The training set contains only 734 samples, a very small scale for this kind of evaluation.
As a CVPR Workshop paper, the experimental scope is inherently constrained.

vs. DAW-FDD: DAW-FDD employs stratified CVaR but relies on explicit group annotations and is validated only on binary detection tasks.
vs. LDAM: LDAM lacks Fisher-consistency guarantees; logit adjustment is theoretically grounded at \(\tau = 1\).

Rating

Novelty: ⭐⭐⭐ The method combines existing components, though the theoretical analysis is valuable.
Experimental Thoroughness: ⭐⭐⭐ Dataset is small but ablations are complete.
Writing Quality: ⭐⭐⭐⭐ The orthogonality analysis is clearly articulated.
Value: ⭐⭐⭐⭐ Offers practical guidance for fairness in medical AI.