# GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification
- Conference: CVPR 2026
- arXiv: 2603.12800
- Code: Kaggle Dataset
- Area: Medical Imaging / Multimodal Learning / Ophthalmic Imaging
- Keywords: Glaucoma classification, multimodal fusion, masked autoencoder, trimodal dataset, graph attention
## TL;DR
This paper presents GLEAM, the first publicly available trimodal glaucoma dataset (SLO fundus photography + peripapillary OCT + visual field deviation maps; 1,200 cases; four-stage annotation), along with HAMM, a CNN-based Hierarchical Attention Masked Modeling framework. HAMM achieves cross-modal fusion via clinically inspired multi-head modality gating and relational graph attention, attaining a four-class classification accuracy of 81.08%.
## Background & Motivation
Background: Glaucoma is one of the leading causes of irreversible blindness worldwide, affecting approximately 70 million individuals. Clinical diagnosis relies on the integrated interpretation of multiple examinations: fundus imaging for optic disc morphology, OCT for retinal nerve fiber layer (RNFL) thickness measurement, and visual field testing for functional impairment assessment. Computer-aided diagnosis (CAD) systems have made steady progress over the past decade.
Limitations of Prior Work: Existing public datasets suffer from three key deficiencies: (1) most are unimodal (fundus or OCT only), lacking modality diversity; (2) classification granularity is coarse, typically limited to binary normal/glaucoma labels, which is insufficient to support staging-based treatment planning; (3) sample sizes are small or the datasets are not publicly released. Existing multimodal datasets such as GAMMA contain only 200 bimodal cases.
Key Challenge: Clinicians routinely integrate findings from three distinct examinations for cross-validation and holistic judgment, yet there is a lack of corresponding datasets and fusion frameworks to support automated diagnostic research.
Goal: (1) Construct the first publicly available trimodal, four-stage annotated, high-quality glaucoma dataset; (2) design an effective self-supervised multimodal fusion framework to fully exploit complementary inter-modal information.
Key Insight: Emulating the clinical reasoning of ophthalmologists—first assessing the quality and reliability of each modality, then cross-validating structural-functional consistency.
Core Idea: Multi-head gating mechanisms simulate clinician assessment of modality reliability; relational graph attention simulates cross-modal cross-validation; both are embedded within a CNN masked autoencoder for self-supervised pretraining.
## Method
### Overall Architecture
HAMM adopts a two-stage training strategy. Stage 1 (Pretraining): Inputs from the three modalities are randomly masked (masking ratio 0.7) and processed by three parallel ResNet-50 encoders, each with MCGA (Multimodal Channel Graph Attention) modules embedded at every layer for hierarchical cross-modal fusion, to extract features. Lightweight depthwise separable convolutional decoders reconstruct the masked regions, with MSE reconstruction loss as the training objective. Stage 2 (Fine-tuning): The decoders are discarded; the pretrained encoders are retained, and trimodal features are concatenated and passed through a GAP + two-layer fully connected classification head for four-class prediction, trained with cross-entropy loss.
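To make the two-stage pipeline concrete, here is a minimal PyTorch sketch of the Stage-2 classifier: three ResNet-50 trunks (one per modality), concatenated GAP features, and a two-layer FC head. The per-layer MCGA fusion and the Stage-1 decoders are omitted, and all names and sizes are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HAMMClassifier(nn.Module):
    """Sketch of HAMM's fine-tuning stage: three modality encoders,
    concatenated GAP features, and a two-layer FC head (4-way output)."""
    def __init__(self, num_classes: int = 4, hidden: int = 512):
        super().__init__()
        # One ResNet-50 trunk per modality (SLO, OCT, VF); the MCGA
        # modules embedded at every encoder layer are omitted here.
        self.encoders = nn.ModuleList([
            nn.Sequential(*list(resnet50().children())[:-2])  # drop GAP/FC
            for _ in range(3)
        ])
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(            # GAP + two FC layers
            nn.Linear(3 * 2048, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, slo, oct_, vf):
        feats = [self.gap(enc(x)).flatten(1)  # (B, 2048) per modality
                 for enc, x in zip(self.encoders, (slo, oct_, vf))]
        return self.head(torch.cat(feats, dim=1))  # (B, 4) logits
```

Training this head with cross-entropy on top of the pretrained encoders corresponds to Stage 2; Stage 1 instead attaches the lightweight decoders and optimizes the masked reconstruction loss described under Key Designs.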
### Key Designs
- Multimodal Channel Graph Attention (MCGA) Module:
- Function: Enables hierarchical cross-modal information interaction at each downsampling layer of the encoder.
- Mechanism: Operates in three steps: (a) GAP, GMP, and GeM pooling are applied to each modality's feature maps and concatenated, then projected through a fully connected layer to produce modality embeddings \(v_k\); (b) a multi-head gating mechanism \(\hat{v}_k = v_k \odot \frac{1}{H}\sum_{h=1}^{H} g^{(h)}(v_k)\) assigns adaptive reliability weights to each modality, simulating multiple ophthalmologists independently assessing modality quality; (c) a relational graph attention network captures inter-modal dependencies and models structural-functional consistency via relation-type embeddings \(R_{r_{ij}}^{(h)}\).
- Design Motivation: Mimics the clinical reasoning of ophthalmologists: first evaluating the reliability of each examination result, then cross-validating consistency across modalities. Hierarchical fusion (applied at every layer) outperforms late fusion, improving accuracy from 78.50% to 79.17% in the paper's experiments. A minimal code sketch of the module follows this list.
- CNN Masked Autoencoder Pretraining:
- Function: Learns robust cross-modal representations by reconstructing masked regions.
- Mechanism: For each modality, 70% of pixel regions are randomly masked; the encoder infers masked content from visible regions and information from other modalities. The decoder adopts a lightweight design (depthwise separable convolution + bilinear interpolation upsampling) with skip connections fusing features from each encoder layer. The training loss is MSE computed exclusively over the masked pixels, with \(p\) indexing the \(P\) masked positions of case \(i\) in modality \(k\): \(\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k \in K}\sum_{p=1}^{P}(s_i^k(p) - \hat{s}_i^k(p))^2\) (see the loss sketch after this list).
- Design Motivation: Ophthalmic images frequently suffer from information loss due to artifacts, blur, and anatomical occlusion; masked modeling naturally simulates these scenarios. CNN architectures (vs. Transformer-based MAE) are better suited to small-sample medical data due to visual inductive biases and reduced susceptibility to overfitting.
- GLEAM Trimodal Dataset:
- Function: Establishes the first publicly available trimodal, four-stage annotated glaucoma dataset.
- Mechanism: Retrospectively collected 1,200 paired cases (841 patients, aged 8–90 years, mean 55.4±16.7) from Shenyang Fourth People's Hospital, comprising SLO fundus images (Optos ultra-widefield), peripapillary OCT (Heidelberg Spectralis), and visual field PD maps (Zeiss perimeter). Four stages are annotated: normal (NG, 600 cases), early (EaG, 200 cases), intermediate (InG, 200 cases), and advanced (AdG, 200 cases), stratified based on EMR diagnoses and MD values (early: MD > −6 dB; intermediate: −12 dB ≤ MD ≤ −6 dB; advanced: MD < −12 dB); this MD-to-stage rule is sketched as code after this list.
- Design Motivation: Fills a critical gap in the field—existing datasets are either uni/bimodal or restricted to binary classification, precluding multimodal staging research. Three senior ophthalmologists independently annotated all cases with consensus review; inter-annotator Cohen's Kappa > 95.5% and intra-annotator Kappa > 97.4%.
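A rough PyTorch illustration of the three MCGA steps follows, assuming the module operates on the three modality feature maps at one encoder layer: triple pooling (GAP/GMP/GeM) into modality embeddings, multi-head sigmoid gating per the formula above, then attention across the three modality nodes biased by learned relation-type embeddings. The embedding size, gate-head count, GeM exponent, and the use of a single attention head are simplifying assumptions (the paper uses multi-head relational graph attention):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCGA(nn.Module):
    """Sketch of Multimodal Channel Graph Attention for 3 modalities."""
    def __init__(self, channels: int, dim: int = 256, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(3 * channels, dim)    # [GAP|GMP|GeM] -> v_k
        self.gates = nn.ModuleList([                 # gating heads g^(h)
            nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
            for _ in range(heads)
        ])
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learned relation embedding per ordered modality pair.
        self.rel = nn.Parameter(torch.zeros(3, 3, dim))
        self.p = nn.Parameter(torch.tensor(3.0))     # GeM exponent

    def pool(self, x):                               # x: (B, C, H, W)
        gap = x.mean(dim=(2, 3))
        gmp = x.amax(dim=(2, 3))
        gem = x.clamp(min=1e-6).pow(self.p).mean(dim=(2, 3)).pow(1 / self.p)
        return torch.cat([gap, gmp, gem], dim=1)     # (B, 3C)

    def forward(self, feats):                        # list of 3 (B,C,H,W) maps
        v = torch.stack([self.embed(self.pool(x)) for x in feats], 1)  # (B,3,D)
        # (b) multi-head gating: v_k * mean_h g^(h)(v_k)
        v = v * torch.stack([g(v) for g in self.gates]).mean(dim=0)
        # (c) attention over the 3 modality nodes with relation-type bias
        q, k, val = self.qkv(v).chunk(3, dim=-1)
        k = k.unsqueeze(1) + self.rel                # (B, 3, 3, D)
        scores = (q.unsqueeze(2) * k).sum(-1) / v.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ val       # (B, 3, D) fused
```

In the full module, the fused embeddings presumably re-weight each modality's feature-map channels before the next encoder stage (hence "channel" in the name); that final step is not shown here.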
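For the pretraining objective, here is a minimal sketch of region masking and the masked-pixel MSE, assuming 16×16 regions (the exact region size is an assumption, not restated in this summary):

```python
import torch

def random_region_mask(x: torch.Tensor, ratio: float = 0.7,
                       patch: int = 16) -> torch.Tensor:
    """Mask ~`ratio` of patch-sized regions; 1 = masked, 0 = visible."""
    b, _, h, w = x.shape
    masked = (torch.rand(b, 1, h // patch, w // patch,
                         device=x.device) < ratio).float()
    return masked.repeat_interleave(patch, 2).repeat_interleave(patch, 3)

def masked_mse(recon: torch.Tensor, target: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """MSE computed only over masked pixels, per the loss above."""
    err = (recon - target) ** 2 * mask          # zero out visible pixels
    return err.sum() / mask.expand_as(err).sum().clamp(min=1.0)
```

During pretraining the encoder sees `x * (1 - mask)` for each modality, and the total loss sums `masked_mse` across the three modalities.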
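The MD stratification reduces to a small rule; `stage_from_md` below is a hypothetical helper (the EMR diagnosis still separates normal from glaucomatous eyes before it applies):

```python
def stage_from_md(md_db: float) -> str:
    """Map visual-field mean deviation (dB) to a GLEAM severity stage."""
    if md_db > -6.0:
        return "EaG"   # early: MD > -6 dB
    if md_db >= -12.0:
        return "InG"   # intermediate: -12 dB <= MD <= -6 dB
    return "AdG"       # advanced: MD < -12 dB
```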
### Loss & Training
- Pretraining: MSE reconstruction loss (computed only on masked pixels), 20 epochs, learning rate \(1 \times 10^{-5}\), batch size 8.
- Fine-tuning: Cross-entropy classification loss, learning rate \(3 \times 10^{-6}\), batch size 16, early stopping (10 epochs without improvement in validation loss).
- Data Augmentation: SLO (random cropping / color jitter / vertical flip), OCT (color jitter), VF (vertical flip); horizontal flipping is synchronized across all three modalities to preserve anatomical consistency (see the sketch after this list).
- Results averaged over five independent training runs to ensure statistical reliability.
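A minimal sketch of the synchronized flip, assuming tensor (or PIL) inputs and torchvision; the per-modality augmentations listed above would be applied independently afterwards:

```python
import random
import torchvision.transforms.functional as TF

def synced_hflip(slo, oct_, vf, p: float = 0.5):
    """Apply one shared horizontal-flip decision to all three modalities
    so left/right anatomy stays consistent across SLO, OCT, and VF."""
    if random.random() < p:
        slo, oct_, vf = TF.hflip(slo), TF.hflip(oct_), TF.hflip(vf)
    return slo, oct_, vf
```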
## Key Experimental Results
### Main Results
| Method | Pretraining | Acc (%) | F1 (%) | AUROC (%) | Kappa |
|---|---|---|---|---|---|
| ResNet50 | — | 76.75±1.47 | 66.84±2.60 | 89.95±0.27 | 85.88 |
| ResNet50 | TL | 77.67±0.86 | 70.19±0.93 | 92.14±1.81 | 87.00 |
| ViT-S | TL | 77.75±1.52 | 69.62±3.79 | 91.79±0.48 | 88.03 |
| ConvNeXt-T | TL | 79.00±0.76 | 71.58±1.32 | 91.87±0.77 | 87.83 |
| MHCA | TL | 78.16±0.63 | 69.97±3.20 | 92.28±0.27 | 87.14 |
| DRIFA-Net | TL | 77.83±0.86 | 69.70±1.96 | 92.42±0.10 | 86.75 |
| Corolla | SCL | 78.67±0.74 | 72.87±1.21 | 92.39±0.55 | 88.50 |
| ETSCL | SCL | 79.08±0.80 | 72.52±2.11 | 92.73±0.32 | 87.31 |
| MultiMAE | SSL | 78.00±0.18 | 69.02±2.18 | 90.64±0.26 | 86.98 |
| UrFound | SSL | 78.67±0.35 | 70.67±1.46 | 92.49±0.44 | 87.86 |
| HAMM (Ours) | SSL | 81.08±0.63 | 75.90±0.80 | 93.03±0.26 | 90.07 |
Relative to the strongest baseline by accuracy (ETSCL), HAMM improves Acc by +2.00 points, F1 by +3.38, AUROC by +0.30, and Kappa by +2.76.
### Ablation Study
| MCGA | Pretraining | Acc (%) | F1 (%) | AUROC (%) | Kappa |
|---|---|---|---|---|---|
| ✗ | ✗ | 77.67 | 70.19 | 92.14 | 87.00 |
| ✓ | ✗ | 79.17 | 71.93 | 92.89 | 89.52 |
| ✗ | ✓ | 79.67 | 73.68 | 92.83 | 89.57 |
| ✓ | ✓ | 81.08 | 75.90 | 93.03 | 90.07 |
Modality combination ablation:
| Modality | Acc (%) | F1 (%) | AUROC (%) | Acc-EaG (%) |
|---|---|---|---|---|
| SLO | 60.25 | 37.25 | 74.72 | 3.00 |
| OCT | 61.75 | 42.39 | 76.70 | 8.00 |
| VF | 74.25 | 59.85 | 90.42 | 6.00 |
| SLO+OCT | 64.42 | 46.47 | 67.22 | 11.00 |
| SLO+VF | 77.67 | 68.36 | 91.87 | 26.00 |
| OCT+VF | 77.08 | 67.38 | 92.24 | 22.50 |
| SLO+OCT+VF | 81.08 | 75.90 | 93.03 | 51.50 |
External validation (GAMMA dataset):
| Method | Ensemble | Kappa |
|---|---|---|
| SmartDSP | ✓ | 85.49 |
| COROLLA | ✓ | 85.50 |
| GeCoM-Net | ✓ | 88.10 |
| ETSCL (+ extra modality) | ✗ | 88.44 |
| HAMM (Ours) | ✗ | 87.59 |
| HAMM (Ours) | ✓ | 89.35 |
### Key Findings
- Early glaucoma (EaG) classification is the most challenging: no single modality detects it reliably (VF, the best overall single modality, reaches only 6.0% EaG accuracy, and OCT only 8.0%), while trimodal fusion raises this to 51.50%, demonstrating that multimodal complementarity is essential for early diagnosis.
- VF is the most discriminative single modality (Acc 74.25%), yet fails on early-stage cases; SLO and OCT individually perform poorly but contribute substantially to early- and intermediate-stage classification when combined.
- A masking ratio of 0.7 is the optimal configuration (with 20 pretraining epochs); at this ratio, the model is compelled to rely more heavily on cross-modal information for inference.
- HAMM also outperforms comparison methods under missing-modality conditions (Acc 74.59% vs. UrFound's 72.48%), demonstrating robustness; a hypothetical evaluation sketch follows this list.
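The robustness protocol is not detailed in this summary; one common convention, sketched below, zero-fills the absent modality at test time (the `model(slo, oct_, vf)` interface is a hypothetical stand-in):

```python
import torch

@torch.no_grad()
def accuracy_with_missing(model, loader, drop: str = "vf") -> float:
    """Evaluate accuracy with one modality replaced by zeros."""
    correct = total = 0
    for slo, oct_, vf, y in loader:
        if drop == "slo":
            slo = torch.zeros_like(slo)
        elif drop == "oct":
            oct_ = torch.zeros_like(oct_)
        elif drop == "vf":
            vf = torch.zeros_like(vf)
        correct += (model(slo, oct_, vf).argmax(1) == y).sum().item()
        total += y.numel()
    return correct / total
```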
## Highlights & Insights
- The dataset contribution itself is of significant importance: GLEAM is the first publicly available trimodal, four-stage annotated glaucoma dataset with exceptionally high annotation quality (Kappa > 95.5%).
- The clinically inspired design of the MCGA module is particularly elegant—multi-head gating simulates independent assessment of modality reliability by multiple clinicians, while graph attention simulates cross-modal cross-validation.
- CNN-based MAE is applied to multimodal medical tasks for the first time, demonstrating greater suitability for small-sample scenarios compared to Transformer-based MAE.
- Trimodal fusion completely eliminates cross-class misclassification between NG and AdG (as verified by the confusion matrix), which has important implications for clinical safety.
## Limitations & Future Work
- Data are sourced from a single center (Shenyang Fourth People's Hospital); generalizability requires multi-center validation.
- Glaucoma subtypes (primary open-angle, normal-tension, angle-closure, etc.) are not distinguished, despite differing pathological features and spatial damage patterns across subtypes.
- The current framework addresses four-class classification; continuous severity estimation (e.g., predicting MD values) would be more clinically granular and practical, potentially benefiting from ordinal regression losses (a minimal sketch follows this list).
- Early-stage accuracy of 51.50%, while superior to baselines, leaves room for improvement; larger-scale data or dedicated class imbalance handling strategies may be warranted.
- Longitudinal follow-up data analysis (disease progression prediction) is not addressed.
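As one way to realize the ordinal-regression suggestion from the list above, a CORAL-style cumulative-link loss turns the four ordered stages (NG < EaG < InG < AdG) into K−1 = 3 binary "severity exceeds threshold" tasks; this is an illustrative sketch, not something the paper implements:

```python
import torch
import torch.nn.functional as F

def ordinal_coral_loss(logits: torch.Tensor, stage: torch.Tensor) -> torch.Tensor:
    """logits: (B, 3) threshold logits; stage: (B,) labels in {0,1,2,3}."""
    thresholds = torch.arange(3, device=stage.device)   # 0, 1, 2
    # Binary target for threshold t: 1 iff the true stage exceeds t.
    targets = (stage.unsqueeze(1) > thresholds).float() # (B, 3)
    return F.binary_cross_entropy_with_logits(logits, targets)
```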
## Related Work & Insights
- vs. RETFound / EyeCLIP: These are unimodal (fundus-only) self-supervised pretraining approaches that do not cover OCT or visual field data; HAMM explicitly models trimodal interactions.
- vs. MultiMAE: A Transformer-based multimodal masked modeling approach prone to overfitting on small-scale medical data (Acc 78.00% on GLEAM); HAMM employs a CNN architecture with MCGA, better suited to limited data regimes.
- vs. MHCA / DRIFA-Net: MHCA and DRIFA-Net have parameter counts of 248M and 931M, respectively; HAMM achieves 237M parameters with only 12.68G FLOPs (vs. DRIFA-Net's 88.48G), offering higher efficiency alongside superior performance.
- vs. GAMMA dataset: GAMMA contains only 200 bimodal cases with two/three-class labels; GLEAM offers 1,200 trimodal cases with four-class labels, representing a significant improvement in both scale and annotation granularity.
## Rating
- Novelty: ⭐⭐⭐⭐ First trimodal glaucoma dataset + clinically inspired MCGA module design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage including main experiments, modality ablation, component ablation, masking ratio analysis, external validation, missing-modality robustness, and reliability analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear methodological exposition, systematic experimental design, and well-articulated clinical motivation.
- Value: ⭐⭐⭐⭐ The dataset fills a critical gap in the field and directly advances ophthalmic AI research.