Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=K5tcKEQaUr
Code: TBD
Area: Medical Imaging / Self-Supervised Representation Learning
Keywords: Retinal fundus images, Masked Autoencoder, frequency bias, mutual information regularization, information bottleneck

TL;DR

The authors analyze Masked Autoencoders (MAE) from a spatial frequency perspective, discovering a preference for low-frequency backgrounds and an under-encoding of diagnostically critical high-frequency details. They propose RetMAE: a framework that, without modifying the architecture, introduces a High-Frequency Mutual Information (HighFreqMI) regularization. This allows the retinal encoder to learn "frequency-balanced" representations, surpassing existing fundus foundation models using only ~25.6k unlabeled fundus images.

Background & Motivation

Background: Foundation models for fundus photography primarily follow two paths: self-supervised learning (represented by MAE / RETFound) and vision-language pre-training (represented by RET-CLIP). The latter requires expensive and scarce image-text pairs, making MAE-based methods that consume large amounts of unlabeled data more practical for open-domain applications.

Limitations of Prior Work: The reconstruction objective of MAE (random masking + pixel-level MSE) implicitly assumes that "information density is uniform across image regions." However, fundus images are the opposite—most areas consist of smooth low-frequency backgrounds, while diagnostic structures (microaneurysms, exudates, hemorrhages, optic discs, vascular edges) are sparsely concentrated in high-frequency bands. This "uniform information assumption" is severely misaligned with the "strong spatial heterogeneity" of diagnostic signals in fundus images.

Key Challenge: The authors used Centered Kernel Alignment (CKA) to quantify the alignment between MAE features and frequency-separated inputs, revealing a counter-intuitive phenomenon (Table 1): MAE representations are highly aligned with low-frequency components (CKA=0.990) but nearly unaligned with high-frequency components (CKA=0.164). Conversely, linear probing AUROC shows the opposite—retaining only 25% of high-frequency tokens yields the highest AUROC (0.727), far exceeding random masking (0.647) or low-frequency retention (0.641) under the same token budget. In other words, MAE prioritizes encoding the least useful (low-frequency) information, suppressing high-frequency tokens that carry primary diagnostic signals.
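The alignment scores above rely on CKA. As a rough illustration, linear CKA between two feature matrices can be computed as below. This is a minimal sketch: the choice of the linear (rather than kernel) variant and the layout of one row per image are assumptions, not details confirmed by the summary.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with one row per sample.

    x has shape (n, p) and y has shape (n, q); rows must refer to the same samples.
    """
    # Center each feature dimension so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Toy check with synthetic features (not retinal data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
print("rotated copy:", linear_cka(feats, 3.0 * feats @ rotation))
print("independent noise:", linear_cka(feats, rng.normal(size=(200, 16))))
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling, so a rotated and rescaled copy of the features scores close to 1, while unrelated features score much lower.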

Goal: To correct the low-frequency bias of MAE without altering the backbone or introducing image-text pairs, learning "compact yet diagnostically sufficient" frequency-balanced representations.

Core Idea: Reformulate MAE as a Mutual Information (MI) Lagrangian and use a high-frequency MI regularization to pull the bottleneck's attention from low-frequency redundancy toward high-frequency diagnostic cues—a correction strictly at the objective function level.

Method

Overall Architecture

RetMAE attaches a parallel high-frequency mutual information regularization branch alongside the standard MAE reconstruction branch. Given an input, one path follows standard random masking → encoder \(f_\theta\) → decoder for pixel reconstruction \(\mathcal{L}_{rec}\). The other path applies "high-frequency masking" (keeping high-frequency tokens only) to the same image and feeds it into an EMA teacher encoder to obtain a compact high-frequency context latent variable \(Z^{HF}_c\). The trainable latent variable \(Z\) is then aligned with this context by maximizing their MI, estimated via MINE (\(\mathcal{L}_{hmi}\)). The teacher is an exponential moving average of the trainable encoder, so no second architecture is introduced, and the two losses are optimized jointly.

flowchart LR
    X[Fundus Image] --> RM[Random Mask] --> E[Encoder f_θ]
    X --> HFM[HF Mask<br/>Select HF tokens] --> T[EMA Teacher]
    E --> Z[Trainable Latent Z]
    T --> Zc[HF Context Z_HF^c]
    Z --> D[Decoder] --> Lrec[L_rec<br/>Min. Recon. Error]
    Z --> MINE[MINE MI Estimation]
    Zc --> MINE --> Lhmi[L_hmi<br/>Max. HF MI]
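
The EMA teacher in the diagram is not trained by gradient descent; its weights track the trainable encoder. A minimal sketch of that update, assuming parameters stored as a name-to-array dictionary and a hypothetical momentum of 0.996 (the summary does not state the value used):

```python
import numpy as np


def ema_update(teacher: dict, student: dict, momentum: float = 0.996) -> dict:
    """Return new teacher parameters: t <- momentum * t + (1 - momentum) * s.

    The momentum default is an assumed value for illustration only.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts slowly toward the student.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(10):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher["w"])
```

A high momentum keeps the high-frequency context slowly varying, which is consistent with the summary's note that \(\mathcal{L}_{hmi}\) is warmed up until the teacher stabilizes.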

Key Designs

1. Reformulating MAE as an MI Lagrangian. Following the information bottleneck perspective, the MAE objective is written as \(\mathcal{L}=I(X_V;Z)+\beta\,I(X_V;X_M\mid Z)\). The first term \(I(X_V;Z)\) measures the complexity of \(Z\) (compression/de-redundancy), and the second term \(I(X_V;X_M\mid Z)\) is the "information distortion" (retaining info to predict masked parts \(X_M\)). Theorem 1 proves that under the assumption of an isotropic Gaussian decoder with fixed variance, minimizing reconstruction MSE is equivalent to minimizing the conditional MI \(I(X_V;X_M\mid Z)\). Thus, \(\mathcal{L}_{rec}\) already handles the second term. The focus shifts to the first term \(I(X_V;Z)\), where MAE currently wastes capacity on low-frequency backgrounds.
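
A short sketch of why Theorem 1 is plausible, under the stated assumption of an isotropic Gaussian decoder with fixed variance. This is an illustrative argument, not the authors' proof; the decoder mean \(\mu_\phi\), the variance \(\sigma^2\), and the dimension \(d\) of \(X_M\) are notation introduced here. For a decoder \(q_\phi(x_M\mid z)=\mathcal{N}(\mu_\phi(z),\sigma^2 I)\):

\[
-\log q_\phi(x_M\mid z)=\frac{1}{2\sigma^2}\lVert x_M-\mu_\phi(z)\rVert_2^2+\frac{d}{2}\log(2\pi\sigma^2)
\]

\[
H(X_M\mid Z)\le\mathbb{E}\left[-\log q_\phi(X_M\mid Z)\right]=\frac{1}{2\sigma^2}\,\mathbb{E}\lVert X_M-\mu_\phi(Z)\rVert_2^2+\text{const}
\]

\[
I(X_V;X_M\mid Z)=H(X_M\mid Z)-H(X_M\mid X_V,Z)=H(X_M\mid Z)-H(X_M\mid X_V)
\]

The last equality uses the fact that \(Z=f_\theta(X_V)\) is a deterministic function of \(X_V\). Since \(H(X_M\mid X_V)\) does not depend on the model, lowering the expected squared error lowers an upper bound on \(I(X_V;X_M\mid Z)\), and the bound is tight when the decoder matches the true conditional distribution.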

2. Tightening the marginal term \(I(X_V;Z)\) via high-frequency context alignment. Since directly constraining \(I(X_V;Z)\) is intractable, the authors constrain it indirectly by aligning \(Z\) with a compact context. Theorem 2 provides a bound: if the context representation \(Z_c=g(X)\) is \(\varepsilon\)-compact (\(I(X;Z_c)\le\varepsilon\)) and \(Z\) is aligned to \(Z_c\) with error \(\delta\), then \(I(X_V;Z)\le I(X_V;Z_c)+\delta\le\varepsilon+\delta\). The second inequality follows from the data processing inequality: \(X_V\) is a function of \(X\), so \(I(X_V;Z_c)\le I(X;Z_c)\le\varepsilon\). By choosing a high-frequency focused context, the model removes redundancy while directing retained information toward diagnostic cues.

3. HighFreqMI: Aligning trainable latents to EMA HF contexts via MINE. The high-frequency context \(Z^{HF}_c\) is generated by feeding high-frequency tokens into the EMA teacher encoder. As MI is not directly computable, it is estimated with MINE, which optimizes the Donsker–Varadhan lower bound using a learned critic \(f_\psi\): \(\mathcal{L}_{MINE}(Z_c,Z)=-\mathbb{E}_{p(Z_c,Z)}[f_\psi(Z_c,Z)]+\log\mathbb{E}_{p(Z_c)\otimes p(Z)}[\exp f_\psi(Z_c,Z')]\), where \(Z'\) denotes a sample from the marginal \(p(Z)\) drawn independently of \(Z_c\). Minimizing this loss maximizes the estimated MI. The regularization term is \(\mathcal{L}_{hmi}=\mathcal{L}_{MINE}(Z^{HF}_c,Z)\). The total loss is \(\mathcal{L}_{total}=\lambda_{rec}\mathcal{L}_{rec}+\lambda_{hmi}\mathcal{L}_{hmi}\). A warmup period is used for \(\mathcal{L}_{hmi}\) until the EMA teacher stabilizes.
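
The MINE loss above can be sketched numerically as follows. This is a minimal illustration with assumptions: the critic is a fixed bilinear function rather than a trained network, samples from the product of marginals are obtained by shuffling within a batch, and the linear warmup shape in `hmi_weight` is hypothetical (the summary only says a warmup is used).

```python
import numpy as np


def mine_loss(z_ctx, z, critic, rng):
    """Negative Donsker-Varadhan bound: -E_joint[f] + log E_marginals[exp f]."""
    joint_scores = critic(z_ctx, z)          # paired rows approximate p(Z_c, Z)
    shuffled = z[rng.permutation(len(z))]    # broken pairing approximates p(Z_c) x p(Z)
    marginal_scores = critic(z_ctx, shuffled)
    # Numerically stable log-mean-exp over the shuffled scores.
    peak = marginal_scores.max()
    log_mean_exp = peak + np.log(np.mean(np.exp(marginal_scores - peak)))
    return float(-joint_scores.mean() + log_mean_exp)


def bilinear_critic(weight):
    """Hypothetical critic f(a, b) = a^T W b, returning one score per row."""
    def critic(a, b):
        return np.einsum("nd,de,ne->n", a, weight, b)
    return critic


def hmi_weight(step, warmup_steps, lambda_hmi=0.1):
    """Linearly ramp the L_hmi weight up to lambda_hmi (assumed schedule)."""
    return lambda_hmi * min(1.0, step / max(1, warmup_steps))


# Toy usage: a context that is a noisy copy of z shares information with it.
rng = np.random.default_rng(0)
z = rng.normal(size=(4000, 4))
z_ctx = z + 0.1 * rng.normal(size=z.shape)
loss = mine_loss(z_ctx, z, bilinear_critic(0.25 * np.eye(4)), rng)
print("L_MINE:", loss, "weighted:", hmi_weight(50, 100) * loss)
```

A negative loss corresponds to a positive lower bound on the MI. In training, the critic parameters and the encoder would both be updated to push this loss down.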

4. HF Token Extraction and Optional Auxiliary MI. HF tokens are selected by processing the green channel (which gives the best vessel/lesion contrast) with Soft-FOV masking, Gaussian blurring, and Butterworth high-pass filtering in the Fourier domain to suppress low frequencies. The 25% of tokens with the highest response are selected. An auxiliary version adds \(\mathcal{L}_{aux}=\mathcal{L}_{MINE}(Z^{aux}_c,Z)\) to align \(Z\) with the context \(Z^{aux}_c\) produced by a frozen pre-trained retinal encoder (e.g., RET-CLIP). The loss weights are fixed at \(\lambda_{rec}{=}1, \lambda_{hmi}{=}0.1, \lambda_{aux}{=}0.01\).
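
The token selection step can be sketched as below. This is a simplified illustration, not the authors' pipeline: it keeps only the green channel, a Butterworth high-pass filter, and top-25% selection, and omits Soft-FOV masking and the Gaussian step. The cutoff, filter order, and patch size are hypothetical values.

```python
import numpy as np


def butterworth_highpass(shape, cutoff=0.1, order=2):
    """Butterworth high-pass gain on the FFT grid (cutoff in cycles per pixel)."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    radius[0, 0] = 1e-12  # avoid division by zero at the DC component
    return 1.0 / (1.0 + (cutoff / radius) ** (2 * order))


def select_hf_tokens(image_rgb, patch=16, keep_ratio=0.25, cutoff=0.1, order=2):
    """Return a boolean (rows, cols) patch grid marking the highest-energy tokens."""
    green = image_rgb[..., 1].astype(np.float64)
    gain = butterworth_highpass(green.shape, cutoff, order)
    filtered = np.real(np.fft.ifft2(np.fft.fft2(green) * gain))
    rows, cols = green.shape[0] // patch, green.shape[1] // patch
    cropped = filtered[: rows * patch, : cols * patch]
    # Mean squared high-pass response per patch.
    energy = (cropped ** 2).reshape(rows, patch, cols, patch).mean(axis=(1, 3))
    keep = max(1, int(round(keep_ratio * rows * cols)))
    top = np.argsort(energy.ravel())[::-1][:keep]
    mask = np.zeros(rows * cols, dtype=bool)
    mask[top] = True
    return mask.reshape(rows, cols)


# Toy usage: a flat image with one textured patch.
rng = np.random.default_rng(0)
image = np.full((64, 64, 3), 0.5)
image[16:32, 32:48, 1] += rng.normal(size=(16, 16))
print(select_hf_tokens(image).astype(int))
```

On a real fundus image the selected patches would be expected to concentrate around vessels, the optic disc rim, and lesions, which is the behavior the summary attributes to the HF mask.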

Key Experimental Results

Benchmarks include IDRiD, RFMiD (DR and AMD subsets), and CHAKSU, covering Diabetic Retinopathy (DR), AMD, and Glaucoma (GL). APTOS is used for Out-of-Distribution (OOD) testing. Evaluation primarily uses linear probing AUROC.
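
For readers unfamiliar with the protocol, linear probing trains only a linear classifier on frozen encoder features and reports AUROC on held-out data. The sketch below uses plain NumPy logistic regression on synthetic features; the optimizer, learning rate, regularization, and binary setup are assumptions rather than the paper's settings.

```python
import numpy as np


def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def fit_linear_probe(features, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit logistic regression on frozen features; return a scoring function."""
    x = np.asarray(features, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    mean, std = x.mean(axis=0), x.std(axis=0) + 1e-8
    x = (x - mean) / std
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (prob - y) / len(y) + l2 * w)
        b -= lr * float((prob - y).mean())
    return lambda f: ((np.asarray(f, dtype=np.float64) - mean) / std) @ w + b


# Toy usage with synthetic "frozen features" (not real encoder outputs).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
features = rng.normal(size=(400, 8))
features[:, 0] += 2.0 * labels  # one informative dimension
probe = fit_linear_probe(features[:300], labels[:300])
print("held-out AUROC:", auroc(labels[300:], probe(features[300:])))
```

Because the encoder stays frozen, differences in this score reflect the quality of the learned representation rather than fine-tuning capacity.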

Main Results (Linear Probing AUROC)

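| Method | Aux Signal | IDRiD | RFMiD (DR) | RFMiD (AMD) | CHAKSU | APTOS† | AVG |
|---|---|---|---|---|---|---|---|
| MAE | no | 0.726 | 0.721 | 0.793 | 0.371 | 0.812 | 0.685 |
| RETFound | no | 0.736 | 0.760 | 0.784 | 0.464 | 0.706 | 0.690 |
| RetMAE | no | 0.816 | 0.848 | 0.852 | 0.516 | 0.862 | 0.779 |
| UrFound | | 0.836 | 0.955 | 0.953 | 0.604 | 0.927 | 0.855 |
| MAE | yes | 0.887 | 0.949 | 0.959 | 0.912 | 0.910 | 0.923 |
| RET-CLIP | yes | 0.898 | 0.955 | 0.962 | 0.930 | 0.940 | 0.937 |
| RetMAE | yes | 0.910 | 0.952 | 0.980 | 0.911 | 0.952 | 0.941 |

† Out-of-distribution test set.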

In the image-only (no aux) setting, RetMAE (0.779 average AUROC) clearly outperforms MAE (0.685) and RETFound (0.690). With auxiliary signals, it achieves 0.941, surpassing RET-CLIP (0.937) and reaching 0.952 on the OOD APTOS dataset.
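
As a quick consistency check, the AVG column can be recomputed from the five per-dataset scores. The snippet below is only arithmetic on the numbers reported above; the row labels in parentheses are added here to tell the repeated method names apart.

```python
# (label, [IDRiD, RFMiD(DR), RFMiD(AMD), CHAKSU, APTOS], reported AVG)
table = [
    ("MAE (row 1)", [0.726, 0.721, 0.793, 0.371, 0.812], 0.685),
    ("RETFound (row 2)", [0.736, 0.760, 0.784, 0.464, 0.706], 0.690),
    ("RetMAE (row 3)", [0.816, 0.848, 0.852, 0.516, 0.862], 0.779),
    ("UrFound (row 4)", [0.836, 0.955, 0.953, 0.604, 0.927], 0.855),
    ("MAE (row 5)", [0.887, 0.949, 0.959, 0.912, 0.910], 0.923),
    ("RET-CLIP (row 6)", [0.898, 0.955, 0.962, 0.930, 0.940], 0.937),
    ("RetMAE (row 7)", [0.910, 0.952, 0.980, 0.911, 0.952], 0.941),
]


def row_mean(scores):
    """Unweighted mean over the five benchmark columns."""
    return sum(scores) / len(scores)


# Largest gap between a recomputed mean and the reported AVG.
max_gap = max(abs(row_mean(scores) - reported) for _, scores, reported in table)

for label, scores, reported in table:
    print(f"{label:18s} mean={row_mean(scores):.4f} reported={reported:.3f}")
print("largest gap:", round(max_gap, 4))
```

Every reported average agrees with the unweighted mean of its row to within rounding, so AVG includes the OOD APTOS column.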

Ablation Study (Signal-level vs. Latent-level HF, Avg. AUROC)

| Method | AVG | AVG with \(\mathcal{L}_{hmi}\) |
|---|---|---|
| MAE | 0.685 | 0.750 |
| MAE w/ HF masking | 0.679 | 0.737 |
| MAE w/ HF input | 0.746 | 0.769 |

Adding \(\mathcal{L}_{hmi}\) consistently improves performance across all MAE variants, indicating that latent-level HF regularization captures information unavailable through input-level modifications (HF masking/concatenation).

Key Findings

  • Diagnostic Signal Inverse Effect: Higher alignment with MAE representations (low-freq, CKA=0.990) correlates with lower AUROC (0.641); lower alignment (high-freq, CKA=0.164) yields the highest AUROC (0.727)—confirming MAE prioritizes the least diagnostically valuable bands.
  • Data Efficiency: With only ~25.6k unlabeled images, RetMAE with the auxiliary signal achieves an average AUROC of 0.941, while RETFound and UrFound used 904k and 187k images, respectively.
  • HF Alignment over Language Supervision: With the same auxiliary signal, RetMAE still improves on MAE (0.941 vs. 0.923 average AUROC, Δ +0.018), which suggests that HF alignment, rather than text supervision, is the primary driver of performance.
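
The data-efficiency and gain figures in the list above reduce to simple arithmetic, sketched here using only numbers quoted in this summary. The RetMAE count is the approximate ~25.6k figure, so the ratios are approximate as well.

```python
# Pre-training set sizes quoted in the summary (RetMAE's is approximate).
pretraining_images = {"RetMAE": 25_600, "UrFound": 187_000, "RETFound": 904_000}

# How many times more images each baseline used than RetMAE.
ratios = {
    name: count / pretraining_images["RetMAE"]
    for name, count in pretraining_images.items()
}

# Gain of RetMAE over MAE when both use the auxiliary signal (average AUROC).
aux_gain = round(0.941 - 0.923, 3)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the RetMAE pre-training set")
print("aux-setting gain over MAE:", aux_gain)
```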

Highlights & Insights

  • Empirical "Counter-intuitive" Evidence: Cross-checking CKA alignment against linear probing gives strong evidence that MAE learns the least useful frequency bands best.
  • Theoretical and Engineering Loop: Theorems 1 and 2 map reconstruction loss and context alignment to MI bounds, providing a solid Information Bottleneck foundation for the HF regularization.
  • Zero Architecture Modification: Improvements stem entirely from the objective function, allowing it to be applied "plug-and-play" to existing MAE encoders.
  • Data Efficiency: Reaching these results with ~25.6k pre-training images, roughly 1/35th of RETFound's 904k and 1/7th of UrFound's 187k, is a compelling result for medical imaging, where large datasets are hard to collect.

Limitations & Future Work

  • HF extraction relies on a manual pipeline (Green channel + Soft-FOV + Butterworth) requiring hyperparameter tuning on a small subset with vessel/lesion labels; its robustness across different devices/modalities needs verification.
  • The highest performance (HF input + \(\mathcal{L}_{hmi}\) at 0.769) requires architecture changes; the pure latent-level regularization is lighter but slightly less powerful.
  • The approach assumes "high-frequency sparse diagnostic signals," which may not generalize to medical modalities with different information densities (e.g., CT, pathology slides).
  • Evaluation is limited to AUROC; lesion-level localization and clinical interpretability metrics are missing.

Related Work & Insights

  • Fundus Foundation Models: RETFound (fundus MIM), UrFound (anatomy-guided masking), RET-CLIP (image-text alignment). RetMAE is orthogonal and can potentially be combined with hierarchical masking.
  • Frequency-domain MIM: Previous methods focused on frequency-domain inputs or reconstruction targets; RetMAE regularizes "semantic latents derived from original HF regions."
  • MI Representation Learning: RetMAE specializes the MI-MAE framework (minimizing complexity via context) for the high-frequency domain.
  • Insight: When data exhibits "information density mismatch," calibrating the information flow of self-supervised objectives toward task-critical frequency bands is a powerful alternative to increasing supervision or model size.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of a frequency perspective on MAE, MI theory, and HF alignment regularization is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Five benchmarks across multiple settings, though it lacks lesion-level localization.
  • Writing Quality: ⭐⭐⭐⭐ Logical flow from phenomenon → theory → method → validation is clear.
  • Value: ⭐⭐⭐⭐ High data efficiency and zero architecture change make it highly practical for low-resource medical imaging.