
Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

  • Conference: AAAI 2026
  • arXiv: 2512.12932
  • Code: github.com/victor-yifanwu/bio-coreset
  • Area: Medical Imaging / Bioinformatics / Foundation Models
  • Keywords: Data Pruning, Biological Foundation Models, Influence Functions, Coreset Selection, RNA-FM, ESM, Protein Language Models

TL;DR

This paper proposes a post-hoc data pruning framework based on influence functions, leveraging Subset-Based Self-Influence estimation and two selection strategies (Top-k Influence and Coverage-Centric Influence). Under an extreme pruning rate exceeding 99%, an RNA-FM pretrained on only 0.2M sequences matches or surpasses the full model trained on 23M sequences across multiple downstream tasks, revealing substantial redundancy in biological sequence datasets.

Background & Motivation

Background: Biological foundation models (BioFMs) such as RNA-FM (23M RNA sequences) and ESM (2.78B protein sequences) have demonstrated strong performance on structure prediction and functional annotation tasks, yet their training costs are prohibitively high (RNA-FM: 8×A100 GPUs, 30 days), severely limiting reproducibility and accessibility for academic laboratories.

Gap in Data Pruning for Biological Domains:

  • Extensive data pruning work exists in CV/NLP, but pruning for BioFM pretraining remains largely unexplored.
  • Training-dynamics-based methods (EL2N, AUM) require access to the full training process, making them infeasible for BioFMs.
  • Local-density-based methods require pairwise similarity computation, which does not scale to millions of sequences.
  • Influence-function-based methods require inverting the full-training-set Hessian, which is infeasible for models with hundreds of millions of parameters.

Core Problem: Can a post-hoc method identify the most informative training subset without accessing the full training process?

Key Insight: Leverage influence function theory to approximate full-dataset curvature information on a small subset, enabling efficient estimation of per-sample importance.

Method

Overall Architecture

Three-stage pipeline:

  1. Influence Score Estimation: Subset-based self-influence functions.
  2. Coreset Selection: Top I or CCI strategy.
  3. Pretraining from Scratch: Retrain the BioFM on the selected coreset.

Subset-Based Self-Influence

Classical Influence Functions: The influence of a training sample \(z_{tr}\) on a validation sample \(z_{val}\) is defined as:

\[\mathcal{I}(z_{tr}; z_{val}) = g_{z_{val}}^\top H_{\theta^*}^{-1} g_{z_{tr}}\]

where \(g\) denotes gradients and \(H\) denotes the Hessian. Computing \(H^{-1}\) is infeasible for large models.
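To ground the definition, here is a minimal, hypothetical NumPy example that evaluates the classical influence score exactly on a toy ridge-regression model, where the Hessian is small enough to invert directly; the data and model are illustrative and are not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # toy training inputs
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
lam = 1e-2                                          # ridge term keeps H invertible

# Hessian of the mean regularized squared loss and its minimizer theta*
H = X.T @ X / len(X) + lam * np.eye(5)
theta = np.linalg.solve(H, X.T @ y / len(X))

def grad(x, t):
    """Per-sample gradient of 0.5*(x @ theta - t)**2 + 0.5*lam*||theta||^2."""
    return (x @ theta - t) * x + lam * theta

g_tr = grad(X[0], y[0])                             # a training sample z_tr
g_val = grad(rng.normal(size=5), 0.3)               # a held-out sample z_val

# I(z_tr; z_val) = g_val^T H^{-1} g_tr
influence = g_val @ np.linalg.solve(H, g_tr)
print(f"I(z_tr; z_val) = {influence:.4f}")
```

For a BioFM with hundreds of millions of parameters, the \(d \times d\) Hessian in this example can neither be stored nor inverted, which is exactly the bottleneck the subset approximation below addresses.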

Key Innovation — Subset Approximation:

Assumption 1: A model \(\tilde{\theta}\) trained on a randomly sampled subset \(D_{sub}\) is approximately optimal on that subset.

Under this assumption, the full-training-set Hessian \(H_{tr}\) is replaced by the subset Hessian \(\tilde{H}_{sub}\):

\[\mathcal{I}(z_{tr}, D_{sub}) = \tilde{g}_{z_{tr}}^\top \tilde{H}_{sub}^{-1} \tilde{g}_{z_{tr}}\]

Theoretical Support (Proposition 1): Under the flat loss landscape condition of large models (confirmed by recent studies), subset curvature sufficiently approximates full-dataset curvature.

Further Acceleration — Diagonal Empirical Fisher Approximation:

\[\tilde{H}_{sub}^{-1} \approx \text{diag}(\tilde{F}_{sub})^{-1}\]

where \(\text{diag}(\tilde{F}_{sub}) = \frac{1}{M}\sum_{m=1}^M \tilde{g}_{z_m} \odot \tilde{g}_{z_m}\)

Computational complexity is reduced from \(O(M \cdot d^2 + d^3)\) to \(O(M \cdot d)\), making the approach feasible for models with billions of parameters.
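As a concrete illustration, the following hedged PyTorch sketch accumulates the diagonal empirical Fisher over the subset and then scores each candidate sequence by its self-influence; `model`, `subset_loader`, `candidate`, and `loss_fn` (e.g., a masked-LM loss returning a scalar) are placeholders, not the authors' released implementation.

```python
import torch

def diagonal_fisher(model, subset_loader, loss_fn, eps=1e-8):
    """diag(F_sub) = (1/M) * sum_m g_{z_m} ⊙ g_{z_m}, computed in O(M*d)."""
    params = [p for p in model.parameters() if p.requires_grad]
    fisher = [torch.zeros_like(p) for p in params]
    m = 0
    for sample in subset_loader:             # one sequence at a time: per-sample gradients
        model.zero_grad()
        loss_fn(model, sample).backward()
        for f, p in zip(fisher, params):
            f += p.grad.detach() ** 2         # elementwise square of the per-sample gradient
        m += 1
    return [f / m + eps for f in fisher]      # eps keeps the diagonal inverse finite

def self_influence(model, candidate, fisher, loss_fn):
    """I(z, D_sub) = g_z^T diag(F_sub)^{-1} g_z, one scalar per candidate sequence."""
    params = [p for p in model.parameters() if p.requires_grad]
    model.zero_grad()
    loss_fn(model, candidate).backward()
    return sum(((p.grad.detach() ** 2) / f).sum().item()
               for f, p in zip(fisher, params))
```

Each score costs a single backward pass plus elementwise operations over the parameters, consistent with the \(O(M \cdot d)\) complexity stated above.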

Practical Note: Lightweight fine-tuning for one epoch on the subset is sufficient to satisfy Assumption 1 at negligible cost.

Two Coreset Selection Strategies

  1. Top-k Influence (Top I): Selects the \(k\) samples with the highest influence scores.

    • Prioritizes samples with the greatest impact on model parameters.
    • Theoretically corresponds to the most informative data points.
  2. Coverage-Centric Influence (CCI): Applies stratified sampling over the influence score distribution.

    • Maintains a balanced distribution of "easy" and "hard" samples.
    • Inspired by Sorscher et al. 2022: retaining only the hardest samples under extreme pruning leads to overfitting.
    • Stratified sampling ensures distributional coverage (see the sketch after this list).
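A minimal NumPy sketch of the two selectors, operating on a precomputed vector of self-influence scores; the stratum count and random seed are illustrative choices, not values from the paper.

```python
import numpy as np

def top_i(scores: np.ndarray, k: int) -> np.ndarray:
    """Top-k Influence: indices of the k highest self-influence scores."""
    return np.argsort(scores)[::-1][:k]

def cci(scores: np.ndarray, k: int, n_strata: int = 50, seed: int = 0) -> np.ndarray:
    """Coverage-Centric Influence: stratified sampling over the score distribution,
    so the coreset keeps a balance of "easy" and "hard" samples."""
    order = np.argsort(scores)                     # easy -> hard
    strata = np.array_split(order, n_strata)       # equal-population buckets
    per_stratum = k // n_strata
    rng = np.random.default_rng(seed)
    picks = [rng.choice(s, size=min(per_stratum, len(s)), replace=False)
             for s in strata]
    return np.concatenate(picks)
```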

Experimental Setup

  • Extreme pruning rate: Only 0.2M sequences retained (RNA: ~1% of 23M; protein: ~4.4% of 4.5M).
  • Models trained from scratch for 10 epochs on the coreset.
  • Evaluation on diverse downstream tasks.

Key Experimental Results

RNA-FM Experiments

Functional and Engineering Prediction Tasks

| Method | Data Size | TypeCls ACC (%) | TypeCls F1 (%) | Modif AUC (%) | CRI-On SC (%) | CRI-On MSE ↓ |
|---|---|---|---|---|---|---|
| RNA-FM | 23M | 91.93 | 91.87 | 94.98 | 31.87 | 0.0118 |
| Random | 2M | 82.21 | 82.01 | 92.82 | 26.72 | 0.0158 |
| Random | 0.2M | 82.15 | 81.97 | 91.86 | 26.67 | 0.0161 |
| Top I | 0.2M | 82.51 | 82.53 | 93.20 | 27.08 | 0.0149 |
| CCI | 0.2M | 82.88 | 83.12 | 93.86 | 32.90 | 0.0135 |
  • CCI surpasses the full RNA-FM (23M sequences) on CRISPR On-Target prediction.
  • The 0.2M coreset outperforms 2M random selection.

Structure and Interaction Prediction Tasks

| Method | Sec. Struct. F1 (%) | Distance Map SC (%) | Contact Map Top-1.0L (%) | RBP Interaction ACC (%) |
|---|---|---|---|---|
| RNA-FM | 62.20 | 89.21 | 93.93 | 72.47 |
| Random 0.2M | 55.60 | 84.90 | 94.18 | 69.65 |
| Top I | 57.05 | 86.47 | 94.36 | 71.25 |
| CCI | 56.36 | 85.59 | 94.20 | 69.46 |
  • Top I outperforms CCI on structure-related tasks, as high-influence samples encode richer structural information.
  • Top I surpasses the full RNA-FM on contact map prediction.

ESM-C Protein Experiments (Generalizability Validation)

| Method | Data Size | Localization ACC (%) | Sec. Struct. ACC (%) | PPI MAE ↓ | PPI RMSE ↓ |
|---|---|---|---|---|---|
| ESM-C | 2.78B | 91.63 | 86.10 | 1.92 | 2.44 |
| Random | 2M | 75.76 | 67.20 | 2.39 | 2.87 |
| Random | 0.2M | 73.64 | 66.18 | 2.51 | 3.01 |
| Top I | 0.2M | 77.13 | 69.34 | 2.06 | 2.64 |
| CCI | 0.2M | 79.25 | 71.48 | 2.14 | 2.69 |
  • Both Top I and CCI at 0.2M outperform 2M random selection, further confirming substantial redundancy in protein data.
  • CCI performs better in the protein setting.

Ablation Study: Necessity of Fine-tuning

| Variant | Modif AUC (%) | Distance Map SC (%) |
|---|---|---|
| Top I (w/o ft) | 92.94 | 84.13 |
| CCI (w/o ft) | 93.31 | 84.95 |
| Top I | 93.20 | 86.47 |
| CCI | 93.86 | 85.59 |

Lightweight fine-tuning on the subset prior to influence score computation consistently improves results, validating the importance of Assumption 1.

Highlights & Insights

  1. Reveals substantial redundancy in biological training data: Less than 1% of the data suffices to approach or exceed the full model's performance.
  2. Post-hoc framework requires no access to the original training run: Only the pretrained weights and lightweight subset fine-tuning are needed, so it applies even to publicly released models whose training details are undisclosed.
  3. Complementarity of Top I and CCI:
    • CCI excels at functional/engineering prediction (benefits from distributional coverage).
    • Top I excels at structural/interaction prediction (benefits from information density).
  4. Complete theoretical derivation: Each step — from classical influence functions to subset approximation to diagonal Fisher factorization — is theoretically grounded.
  5. High practical value: Enables academic laboratories with limited computational resources to pretrain BioFMs at drastically reduced cost.

Limitations & Future Work

  1. RNA experiments are validated only on RNA-FM; larger RNA models (e.g., Evo 2) are not evaluated.
  2. Protein experiments are limited to 4.5M sequences due to resource constraints (far smaller than ESM-C's 2.78B training set); coreset effectiveness at full scale remains unverified.
  3. The error bound for Assumption 1 (approximate optimality on the subset) is not quantitatively analyzed.
  4. The diagonal Fisher approximation may be inaccurate when the Hessian has strong off-diagonal structure.
  5. Validation is restricted to self-supervised pretraining (MLM); data pruning effects in supervised fine-tuning settings are not discussed.
  6. No direct comparison with density- or diversity-based methods (e.g., Facility Location) is provided.
Related Work

  • Data Pruning: EL2N (training dynamics), Sorscher et al. 2022 (data density), D2 Pruning (difficulty-diversity balance)
  • Influence Functions: Koh & Liang 2017 (classical IF), DataInf, TRAK (efficient approximations)
  • Biological Foundation Models: RNA-FM (Chen et al. 2022, 23M sequences), ESM-C/ESM3 (Hayes et al. 2025, 2.78B sequences), Evo 2

Rating ⭐⭐⭐⭐

  • Novelty: ⭐⭐⭐⭐ — Subset influence function approximation offers a theoretical contribution; first systematic introduction of data pruning to BioFM pretraining.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Dual-modality validation on RNA and protein with comprehensive evaluation across diverse downstream tasks.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear and experimental presentation is rigorous.
  • Value: ⭐⭐⭐⭐⭐ — Directly reduces BioFM pretraining cost, offering substantial practical value to computationally constrained research groups.