Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=K5tcKEQaUr
Code: TBD
Area: Medical Imaging / Self-Supervised Representation Learning
Keywords: Retinal fundus images, Masked Autoencoder, frequency bias, mutual information regularization, information bottleneck

TL;DR¶

The authors analyze Masked Autoencoders (MAE) from a spatial-frequency perspective and find that MAE favors low-frequency backgrounds while under-encoding diagnostically critical high-frequency details. They propose RetMAE, a framework that leaves the architecture unchanged and adds a High-Frequency Mutual Information (HighFreqMI) regularizer to the training objective. The resulting retinal encoder learns "frequency-balanced" representations and surpasses existing fundus foundation models while pre-training on only ~25.6k unlabeled fundus images.

Background & Motivation¶

Background: Foundation models for fundus photography primarily follow two paths: self-supervised learning (represented by MAE / RETFound) and vision-language pre-training (represented by RET-CLIP). The latter requires expensive and scarce image-text pairs, making MAE-based methods that consume large amounts of unlabeled data more practical for open-domain applications.

Limitations of Prior Work: The reconstruction objective of MAE (random masking + pixel-level MSE) implicitly assumes that "information density is uniform across image regions." However, fundus images are the opposite—most areas consist of smooth low-frequency backgrounds, while diagnostic structures (microaneurysms, exudates, hemorrhages, optic discs, vascular edges) are sparsely concentrated in high-frequency bands. This "uniform information assumption" is severely misaligned with the "strong spatial heterogeneity" of diagnostic signals in fundus images.

Key Challenge: The authors used Centered Kernel Alignment (CKA) to quantify the alignment between MAE features and frequency-separated inputs, revealing a counter-intuitive phenomenon (Table 1): MAE representations are highly aligned with low-frequency components (CKA=0.990) but nearly unaligned with high-frequency components (CKA=0.164). Conversely, linear probing AUROC shows the opposite—retaining only 25% of high-frequency tokens yields the highest AUROC (0.727), far exceeding random masking (0.647) or low-frequency retention (0.641) under the same token budget. In other words, MAE prioritizes encoding the least useful (low-frequency) information, suppressing high-frequency tokens that carry primary diagnostic signals.
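The alignment measurement described above can be illustrated with a minimal sketch of linear CKA. This assumes the linear-kernel variant of CKA (the summary does not say which kernel the authors used), and the feature matrices below are random stand-ins rather than real MAE features.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with one row per sample."""
    # Center each feature dimension so only covariance structure is compared.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 32))            # stand-in for encoder features
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
same_info = 3.0 * feats @ rotation            # rotated and rescaled copy
unrelated = rng.normal(size=(200, 32))        # independent features

cka_same = linear_cka(feats, same_info)       # close to 1: same information
cka_unrelated = linear_cka(feats, unrelated)  # much lower: little shared structure
```

CKA is invariant to rotations and isotropic rescaling of the features, which is why it suits comparisons between representations of different inputs (here, the full image versus its low- or high-frequency component).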

Goal: To correct the low-frequency bias of MAE without altering the backbone or introducing image-text pairs, learning "compact yet diagnostically sufficient" frequency-balanced representations.

Core Idea: Reformulate MAE as a Mutual Information (MI) Lagrangian and use a high-frequency MI regularization to pull the bottleneck's attention from low-frequency redundancy toward high-frequency diagnostic cues—a correction strictly at the objective function level.

Method¶

Overall Architecture¶

RetMAE attaches a high-frequency mutual information regularization branch alongside the standard MAE reconstruction branch. Given an input image, one path follows the standard pipeline: random masking → encoder \(f_\theta\) → decoder, trained with the pixel reconstruction loss \(\mathcal{L}_{rec}\). The other path applies "high-frequency masking" to the same image, keeping only the high-frequency tokens, and feeds them into an EMA teacher encoder to obtain a compact high-frequency context latent \(Z^{HF}_c\). The trainable latent \(Z\) is then aligned with this context by maximizing a MINE estimate of their mutual information (\(\mathcal{L}_{hmi}\)). The teacher is an EMA copy of the same encoder, so no new architecture is introduced, and the two losses are optimized jointly.

flowchart LR
    X[Fundus Image] --> RM[Random Mask] --> E[Encoder f_θ]
    X --> HFM[HF Mask<br/>Select HF tokens] --> T[EMA Teacher]
    E --> Z[Trainable Latent Z]
    T --> Zc[HF Context Z_HF^c]
    Z --> D[Decoder] --> Lrec[L_rec<br/>Min. Recon. Error]
    Z --> MINE[MINE MI Estimation]
    Zc --> MINE --> Lhmi[L_hmi<br/>Max. HF MI]
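
The EMA teacher in the diagram is typically maintained as an exponential moving average of the trainable encoder's weights. The sketch below shows that update on a toy parameter dictionary; the function name `ema_update` and the momentum values are assumptions for illustration, since the summary does not report the teacher's momentum.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher weights moved toward the student by a factor (1 - momentum)."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Toy scalar "weights"; a real encoder would hold tensors under each name.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# Three updates with a deliberately small momentum so the drift is visible.
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

Because the teacher changes slowly, the high-frequency context it produces is a more stable alignment target than the rapidly changing student output.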

Key Designs¶

1. Reformulating MAE as an MI Lagrangian. Following the information bottleneck perspective, the MAE objective is written as \(\mathcal{L}=I(X_V;Z)+\beta\,I(X_V;X_M\mid Z)\). The first term \(I(X_V;Z)\) measures the complexity of \(Z\) (compression/de-redundancy), and the second term \(I(X_V;X_M\mid Z)\) is the "information distortion" (retaining info to predict masked parts \(X_M\)). Theorem 1 proves that under the assumption of an isotropic Gaussian decoder with fixed variance, minimizing reconstruction MSE is equivalent to minimizing the conditional MI \(I(X_V;X_M\mid Z)\). Thus, \(\mathcal{L}_{rec}\) already handles the second term. The focus shifts to the first term \(I(X_V;Z)\), where MAE currently wastes capacity on low-frequency backgrounds.
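
The link between pixel MSE and an information-theoretic quantity rests on a standard Gaussian identity, sketched here under the assumption stated in Theorem 1 (isotropic Gaussian decoder with fixed variance). The symbols \(\mu_\phi\) (decoder mean), \(\sigma^2\) (fixed variance), and \(d\) (number of masked pixels) are notation introduced for this illustration, not taken from the paper.

\[
-\log p_\phi(x_M \mid z) \;=\; \frac{1}{2\sigma^2}\,\lVert x_M - \mu_\phi(z)\rVert_2^2 \;+\; \frac{d}{2}\log\!\left(2\pi\sigma^2\right)
\]

The second term is constant in the model parameters, so minimizing the expected squared error is the same as minimizing the expected negative log-likelihood, which upper-bounds \(H(X_M\mid Z)\). One way to connect this to the conditional MI is the decomposition

\[
I(X_V;X_M\mid Z) \;=\; H(X_M\mid Z) \;-\; H(X_M\mid X_V, Z),
\]

where, if \(Z\) is a deterministic function of \(X_V\), the last term equals \(H(X_M\mid X_V)\) and does not depend on the encoder. Under these assumptions, driving down the reconstruction loss drives down the conditional MI term.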

2. Tightening the marginal term \(I(X_V;Z)\) via high-frequency context alignment. Since directly constraining \(I(X_V;Z)\) is intractable, the authors bound it indirectly by aligning \(Z\) to a compact context representation. Theorem 2 provides the bound: if the context representation \(Z_c=g(X)\) is \(\varepsilon\)-compact (\(I(X;Z_c)\le\varepsilon\)) and \(Z\) is aligned to \(Z_c\) with error \(\delta\), then \(I(X_V;Z)\le I(X_V;Z_c)+\delta\le\varepsilon+\delta\). The last step holds because the visible patches \(X_V\) are a subset of \(X\), so \(I(X_V;Z_c)\le I(X;Z_c)\). By choosing a high-frequency-focused context, the model removes redundancy while directing the retained information toward diagnostic cues.

3. HighFreqMI: Aligning trainable latents to EMA HF contexts via MINE. The high-frequency context \(Z^{HF}_c\) is generated by feeding high-frequency tokens into the EMA teacher encoder. Because MI cannot be computed directly, it is estimated with MINE, which maximizes the Donsker–Varadhan lower bound using a critic network \(f_\psi\). Written as a loss (the negated bound): \(\mathcal{L}_{MINE}(Z,Z_c)=-\mathbb{E}_{p(Z,Z_c)}[f_\psi(Z,Z_c)]+\log\mathbb{E}_{p(Z)\otimes p(Z_c)}[\exp f_\psi(Z,Z_c)]\), where the second expectation is taken over independently drawn (unpaired) samples. The regularization term is \(\mathcal{L}_{hmi}=\mathcal{L}_{MINE}(Z,Z^{HF}_c)\). The total loss is \(\mathcal{L}_{total}=\lambda_{rec}\mathcal{L}_{rec}+\lambda_{hmi}\mathcal{L}_{hmi}\). \(\mathcal{L}_{hmi}\) is introduced only after a warmup period, once the EMA teacher has stabilized.
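
The MINE loss and the weighted total loss can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the critic is a fixed toy function rather than a trained network, the latents are scalar stand-ins, and the linear warmup ramp in `hmi_weight` is a hypothetical schedule (the summary only says a warmup is used).

```python
import numpy as np

def mine_loss(joint_scores, marginal_scores):
    """Negated Donsker-Varadhan bound: -(E_joint[f] - log E_marginal[exp f])."""
    joint_scores = np.asarray(joint_scores, dtype=float)
    marginal_scores = np.asarray(marginal_scores, dtype=float)
    # log-mean-exp, shifted by the maximum for numerical stability
    shift = marginal_scores.max()
    log_mean_exp = shift + np.log(np.mean(np.exp(marginal_scores - shift)))
    return float(-joint_scores.mean() + log_mean_exp)

def hmi_weight(step, warmup_steps, lambda_hmi=0.1):
    """Hypothetical linear ramp from 0 to lambda_hmi over warmup_steps."""
    return lambda_hmi * min(1.0, step / max(1, warmup_steps))

def total_loss(l_rec, l_hmi, step, warmup_steps, lambda_rec=1.0, lambda_hmi=0.1):
    """lambda_rec * L_rec + (warmed-up lambda_hmi) * L_hmi."""
    return lambda_rec * l_rec + hmi_weight(step, warmup_steps, lambda_hmi) * l_hmi

def toy_critic(z, z_c):
    """Fixed stand-in for the critic f_psi; MINE would train a network here."""
    return 0.25 * z * z_c

rng = np.random.default_rng(0)
z = rng.normal(size=4096)                      # stand-in for trainable latents
z_hf = z + 0.5 * rng.normal(size=4096)         # correlated stand-in for the HF context

joint = toy_critic(z, z_hf)                    # paired samples from p(Z, Z_c)
marginal = toy_critic(z, rng.permutation(z_hf))  # shuffling breaks the pairing
l_hmi = mine_loss(joint, marginal)             # negative when the pair shares information
```

A more negative `l_hmi` corresponds to a larger estimated MI, so minimizing it pulls \(Z\) toward the high-frequency context.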

4. HF Token Extraction and Optional Auxiliary MI. HF tokens are selected by processing the green channel (chosen for its vessel/lesion contrast) with Soft-FOV masking, Gaussian-blur-based suppression of low frequencies, and Butterworth high-pass filtering in the Fourier domain. The 25% of tokens with the highest filter response are selected. An optional auxiliary variant adds \(\mathcal{L}_{aux}=\mathcal{L}_{MINE}(Z,Z^{aux}_c)\), weighted by \(\lambda_{aux}\), to align \(Z\) with a frozen pre-trained retinal encoder (e.g., RET-CLIP). The weights are fixed at \(\lambda_{rec}{=}1, \lambda_{hmi}{=}0.1, \lambda_{aux}{=}0.01\).
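
A simplified version of the HF token selection can be sketched as below. It covers only the green channel, the Butterworth high-pass filter, and the top-25% selection; Soft-FOV masking and the Gaussian step are omitted. The patch size, cutoff frequency, and filter order are assumed values for illustration, not the paper's settings.

```python
import numpy as np

def butterworth_highpass(shape, cutoff=0.1, order=2):
    """Butterworth high-pass gain on the FFT grid (cutoff in cycles per pixel)."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1e-12  # avoid division by zero at the DC component
    return 1.0 / (1.0 + (cutoff / radius) ** (2 * order))

def select_hf_tokens(rgb, patch=16, keep_ratio=0.25, cutoff=0.1, order=2):
    """Return sorted indices of the patches with the most high-frequency energy."""
    green = rgb[..., 1].astype(float)
    gain = butterworth_highpass(green.shape, cutoff, order)
    filtered = np.real(np.fft.ifft2(np.fft.fft2(green) * gain))
    grid_h, grid_w = green.shape[0] // patch, green.shape[1] // patch
    cropped = filtered[: grid_h * patch, : grid_w * patch]
    # Mean squared filter response per patch
    energy = (cropped ** 2).reshape(grid_h, patch, grid_w, patch).mean(axis=(1, 3))
    k = max(1, int(round(keep_ratio * grid_h * grid_w)))
    top = np.argsort(energy.ravel())[::-1][:k]
    return np.sort(top)

# Synthetic 64x64 image: smooth background plus a fine checkerboard in one patch.
yy, xx = np.mgrid[0:64, 0:64]
green = 0.5 + 0.2 * np.cos(2 * np.pi * xx / 64)
checker = 0.3 * (((yy + xx) % 2) * 2 - 1)
green[16:32, 32:48] += checker[16:32, 32:48]   # patch at row 1, column 2 (index 6)
image = np.stack([green, green, green], axis=-1)

tokens = select_hf_tokens(image)               # 4 of the 16 patch indices
```

On this synthetic image the textured patch (index 6 in row-major order) is among the selected tokens, while the smooth background contributes almost no filter response.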

Key Experimental Results¶

Benchmarks include IDRiD, RFMiD (DR and AMD subsets), and CHAKSU, covering Diabetic Retinopathy (DR), AMD, and Glaucoma (GL). APTOS is used for Out-of-Distribution (OOD) testing. Evaluation primarily uses linear probing AUROC.

Main Results (Linear Probing AUROC)¶

Method	Aux Signal	IDRiD	RFMiD(DR)	RFMiD(AMD)	CHAKSU	APTOS†	AVG
MAE	✗	0.726	0.721	0.793	0.371	0.812	0.685
RETFound	✗	0.736	0.760	0.784	0.464	0.706	0.690
RetMAE	✗	0.816	0.848	0.852	0.516	0.862	0.779
UrFound	✓	0.836	0.955	0.953	0.604	0.927	0.855
MAE	✓	0.887	0.949	0.959	0.912	0.910	0.923
RET-CLIP	✓	0.898	0.955	0.962	0.930	0.940	0.937
RetMAE	✓	0.910	0.952	0.980	0.911	0.952	0.941

In the image-only (no auxiliary signal) setting, RetMAE reaches an average AUROC of 0.779, well above MAE (0.685) and RETFound (0.690). With auxiliary signals, it achieves 0.941, edging past RET-CLIP (0.937), and reaches 0.952 on the OOD APTOS dataset (marked † in the table).
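
As a quick arithmetic check, the AVG column can be recomputed from the per-dataset values in the table above. The snippet copies the reported numbers verbatim; the row labels are shorthand introduced here.

```python
# (per-dataset AUROC values, reported AVG) copied from the main results table
rows = {
    "MAE (no aux)":      ([0.726, 0.721, 0.793, 0.371, 0.812], 0.685),
    "RETFound (no aux)": ([0.736, 0.760, 0.784, 0.464, 0.706], 0.690),
    "RetMAE (no aux)":   ([0.816, 0.848, 0.852, 0.516, 0.862], 0.779),
    "UrFound (aux)":     ([0.836, 0.955, 0.953, 0.604, 0.927], 0.855),
    "MAE (aux)":         ([0.887, 0.949, 0.959, 0.912, 0.910], 0.923),
    "RET-CLIP (aux)":    ([0.898, 0.955, 0.962, 0.930, 0.940], 0.937),
    "RetMAE (aux)":      ([0.910, 0.952, 0.980, 0.911, 0.952], 0.941),
}

# Unweighted mean over the five datasets for each row
recomputed = {name: sum(vals) / len(vals) for name, (vals, _) in rows.items()}

# Largest disagreement between recomputed and reported averages
max_gap = max(abs(recomputed[name] - rows[name][1]) for name in rows)

# Gap between RetMAE and MAE when both use the auxiliary signal
aux_gain = rows["RetMAE (aux)"][1] - rows["MAE (aux)"][1]
```

The recomputed means agree with the reported AVG values to within rounding, which indicates that AVG is the unweighted mean over all five datasets, including the OOD APTOS column.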

Ablation Study (Signal-level vs. Latent-level HF, Avg. AUROC)¶

Method	Avg. AUROC (w/o \(\mathcal{L}_{hmi}\))	Avg. AUROC (+ \(\mathcal{L}_{hmi}\))
MAE	0.685	0.750
MAE w/ HF masking	0.679	0.737
MAE w/ HF input	0.746	0.769

Adding \(\mathcal{L}_{hmi}\) improves every MAE variant (+0.065 for plain MAE, +0.058 with HF masking, +0.023 with HF input), suggesting that latent-level HF regularization captures information that input-level modifications (HF masking/concatenation) do not provide on their own.

Key Findings¶

Diagnostic Signal Inverse Effect: Higher alignment with MAE representations (low-freq, CKA=0.990) correlates with lower AUROC (0.641); lower alignment (high-freq, CKA=0.164) yields the highest AUROC (0.727)—confirming MAE prioritizes the least diagnostically valuable bands.
Data Efficiency: With only ~25.6k unlabeled images, RetMAE achieves an average AUROC of 0.941 (with auxiliary signals), while RETFound and UrFound used 904k and 187k images, respectively.
HF Alignment over Language Supervision: With the same auxiliary signal, RetMAE improves on MAE by +0.018 average AUROC (0.941 vs. 0.923), suggesting that HF alignment, rather than text supervision, is the primary driver of performance.

Highlights & Insights¶

Empirical "Counter-intuitive" Evidence: Combining CKA with linear probing gives two independent measurements that point the same way: MAE encodes the least diagnostically useful frequency bands best.
Theoretical and Engineering Loop: Theorems 1 and 2 map the reconstruction loss and context alignment to MI bounds, giving the HF regularization an Information Bottleneck foundation.
Zero Architecture Modification: Improvements stem entirely from the objective function, allowing it to be applied "plug-and-play" to existing MAE encoders.
Data Efficiency: Outperforming RETFound with roughly 1/35th of its pre-training data (~25.6k vs. 904k images) is a compelling result for medical imaging, where large curated datasets are scarce.

Limitations & Future Work¶

HF extraction relies on a manual pipeline (Green channel + Soft-FOV + Butterworth) requiring hyperparameter tuning on a small subset with vessel/lesion labels; its robustness across different devices/modalities needs verification.
The best ablation result (HF input + \(\mathcal{L}_{hmi}\), 0.769) requires architecture changes; pure latent-level regularization (MAE + \(\mathcal{L}_{hmi}\), 0.750) is lighter but slightly weaker.
The approach assumes "high-frequency sparse diagnostic signals," which may not generalize to medical modalities with different information densities (e.g., CT, pathology slides).
Evaluation is limited to AUROC; lesion-level localization and clinical interpretability metrics are missing.

Related Work & Insights¶

Fundus Foundation Models: RETFound (fundus MIM), UrFound (anatomy-guided masking), RET-CLIP (image-text alignment). RetMAE is orthogonal and can potentially be combined with hierarchical masking.
Frequency-domain MIM: Previous methods focused on frequency-domain inputs or reconstruction targets; RetMAE regularizes "semantic latents derived from original HF regions."
MI Representation Learning: RetMAE specializes the MI-MAE framework (minimizing complexity via context) for the high-frequency domain.
Insight: When data exhibits "information density mismatch," calibrating the information flow of self-supervised objectives toward task-critical frequency bands is a powerful alternative to increasing supervision or model size.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of a frequency perspective on MAE, MI theory, and HF alignment regularization is original.
Experimental Thoroughness: ⭐⭐⭐⭐ Five benchmarks across multiple settings, though lesion-level localization is missing.
Writing Quality: ⭐⭐⭐⭐ Logical flow from phenomenon → theory → method → validation is clear.
Value: ⭐⭐⭐⭐ High data efficiency and zero architecture change make it highly practical for low-resource medical imaging.