BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models
Conference: ICML 2026
arXiv: 2512.00283
Code: None
Area: Scientific Computing / Neural Architecture Search
Keywords: Biological Foundation Models, Neural Architecture Search, Heterogeneous Search Space, DNA/Protein Modeling, Hybrid Architecture
TL;DR
BioArc proposes a heterogeneous neural architecture search framework for biological foundation models. By automatically discovering optimal hybrid architectures within a search space comprising five basic modules (CNN, LSTM, Transformer, Mamba, and Hyena), it surpasses existing SOTA biological foundation models with less than 1/25 of the parameters.
Background & Motivation
Background: Current biological foundation models (e.g., ESM, Nucleotide Transformer, DNABERT-2) almost exclusively adopt the Transformer architecture. However, Transformers were originally designed for human language and are not naturally adapted to the unique "grammar" of biological sequences.
Limitations of Prior Work: Biological sequences present dual challenges: they require processing extremely long contexts (e.g., whole genomes), where the quadratic complexity of standard Transformers is prohibitive; and they require precise capture of local structural motifs, which global attention mechanisms do not naturally prioritize. Furthermore, unlike NLP, the biological domain lacks a universally accepted architectural paradigm, so architecture design relies heavily on manual intuition.
Key Challenge: Biological information arises from natural evolution, and its underlying physicochemical laws are not yet fully revealed. Consequently, architecture design cannot be guided by a prior understanding of language as in NLP. Moreover, architecture selection is deeply coupled with tokenization strategies and training objectives, preventing isolated optimization.
Goal: (1) Automatically discover optimal biological sequence architectures within a heterogeneous search space; (2) systematically decouple the interaction between architecture, tokenization, and training strategies; (3) verify whether the discovered architectures can hierarchically capture biological grammar.
Key Insight: Existing NAS methods are limited to homogeneous spaces (e.g., only tuning CNN layers/channels) or hybrids of at most two types of modules. This work expands the search space to open-ended combinations of five heterogeneous modules, allowing data-driven discovery of optimal topologies.
Core Idea: Utilize heterogeneous NAS to search within a combinatorial space of CNN, LSTM, Transformer, Mamba, and Hyena modules to discover optimal hybrid architectures for DNA and proteins.
Method
Overall Architecture
The BioArc pipeline consists of four stages: (1) Search space design—defining five basic module types and their depth/width combinations; (2) Supernet construction—encoding all candidate architectures into a weight-shared supernet, where each path represents a candidate; (3) Supernet pre-training—employing a one-shot strategy where a path is randomly sampled for self-supervised pre-training during each forward pass; (4) Evaluation and ranking—independently fine-tuning sampled architectures and aggregating rankings via Z-score normalization to select the optimal architecture for full pre-training as a foundation model.
Key Designs
Heterogeneous Search Space and Representative Sampling:
- Function: Construct a search space containing CNN, LSTM, Transformer, Mamba, and Hyena modules, supporting open-ended combinations.
- Mechanism: Each path \(a = (l_1, l_2, \dots, l_d)\) is defined by depth \(d \in \mathcal{D}\), a module type tuple \(\mathbf{m}\), and a hidden dimension tuple \(\mathbf{h}\). Due to the uneven distribution of topological families in heterogeneous spaces, naive sampling overfits to redundant patterns. BioArc selects 360 representative architectures using a three-step strategy: distance filtering based on log-transformed dimensions to remove topological redundancy; imposing monotonic width constraints (fixing the widest layer at the end) to increase the sampling frequency of parameter-intensive modules; and finally using K-Means clustering to select cluster centers.
- Design Motivation: The combinatorial explosion of heterogeneous spaces makes exhaustive search impossible; representative sampling maintains structural diversity while keeping the search scale computationally feasible.
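The representative-sampling idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the width choices, depth set, feature encoding, distance threshold, and `k` are all hypothetical, the helper names (`enumerate_paths`, `featurize`, `distance_filter`, `kmeans_representatives`) are invented here, and the monotonic width constraint is applied at enumeration time for simplicity.

```python
import itertools
import math
import random

MODULES = ["cnn", "lstm", "transformer", "mamba", "hyena"]  # module types named in the text
HIDDEN_DIMS = [64, 128, 256]  # hypothetical width choices
DEPTHS = [1, 2]               # hypothetical depth set D (kept tiny so the sketch runs fast)


def enumerate_paths():
    """Yield each path as a tuple of (module, hidden_dim) layers with non-decreasing width."""
    for d in DEPTHS:
        for mods in itertools.product(MODULES, repeat=d):
            for dims in itertools.product(HIDDEN_DIMS, repeat=d):
                # Monotonic width constraint: the widest layer sits at the end.
                if all(dims[i] <= dims[i + 1] for i in range(d - 1)):
                    yield tuple(zip(mods, dims))


def featurize(path, max_depth=max(DEPTHS)):
    """One-hot module type plus normalized log2 width per layer slot, zero-padded to max_depth."""
    vec = []
    for i in range(max_depth):
        if i < len(path):
            module, dim = path[i]
            vec += [1.0 if module == m else 0.0 for m in MODULES]
            vec.append(math.log2(dim) / math.log2(max(HIDDEN_DIMS)))
        else:
            vec += [0.0] * (len(MODULES) + 1)
    return vec


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_filter(paths, min_dist):
    """Greedily drop paths that lie closer than min_dist to an already kept path."""
    kept = []
    for p in paths:
        f = featurize(p)
        if all(sq_dist(f, featurize(q)) >= min_dist ** 2 for q in kept):
            kept.append(p)
    return kept


def kmeans_representatives(paths, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the real path nearest to each centroid."""
    rng = random.Random(seed)
    feats = [featurize(p) for p in paths]
    centroids = rng.sample(feats, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in feats:
            j = min(range(k), key=lambda c: sq_dist(f, centroids[c]))
            clusters[j].append(f)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Two centroids may share a nearest path, so fewer than k paths can be returned.
    chosen = {min(range(len(paths)), key=lambda i: sq_dist(feats[i], c)) for c in centroids}
    return [paths[i] for i in sorted(chosen)]


paths = list(enumerate_paths())
filtered = distance_filter(paths, min_dist=0.2)
representatives = kmeans_representatives(filtered, k=8)
print(len(paths), len(filtered), len(representatives))
```

Selecting the path nearest to each centroid, rather than the centroid itself, keeps every representative a valid architecture from the search space.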
One-Shot Supernet Pre-training:
- Function: Simultaneously train all candidate architectures within a single weight-shared supernet.
- Mechanism: During each forward pass, a path \(a \sim \mathcal{A}\) is sampled uniformly at random, and only the shared weights on that path are updated to minimize the self-supervised loss: \(\min_W \mathbb{E}_{a \sim \mathcal{A}}[\mathcal{L}(\mathcal{A}(X; w(a)))]\), where \(W\) denotes the shared supernet weights and \(w(a)\) the subset used by path \(a\). A single supernet supports three pre-training objectives: Masked Modeling (MM), Contrastive Learning (CL), and Next Token Prediction (NTP). Random path sampling acts as a natural regularizer, preventing excessive co-adaptation between modules.
- Design Motivation: Compared to training each architecture independently, the one-shot strategy allows the evaluation of all candidates within a single pre-training run, significantly reducing computational costs.
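A toy sketch of single-path sampling with weight sharing may help make the mechanism concrete. Everything here is a stand-in chosen for illustration: each "module" is a fixed scalar nonlinearity scaled by one shared weight, the depth is fixed at 3, the self-supervised loss is replaced by a scalar reconstruction error, and gradients are taken by finite differences. It shows only the control flow (sample a path uniformly, update the weights on that path alone), not real CNN, LSTM, Transformer, Mamba, or Hyena blocks.

```python
import math
import random

MODULES = ["cnn", "lstm", "transformer", "mamba", "hyena"]
DEPTH = 3  # hypothetical fixed depth for the toy supernet

# Stand-ins for real blocks: one fixed scalar nonlinearity per module type.
ACTIVATIONS = {
    "cnn": lambda x: max(x, 0.0),
    "lstm": math.tanh,
    "transformer": lambda x: x,
    "mamba": lambda x: x / (1.0 + abs(x)),
    "hyena": math.sin,
}


def init_supernet(seed=0):
    """One shared scalar weight per (layer, module) choice."""
    rng = random.Random(seed)
    return {(layer, m): rng.uniform(0.5, 1.5) for layer in range(DEPTH) for m in MODULES}


def sample_path(rng):
    """Uniformly sample one module type per layer."""
    return tuple(rng.choice(MODULES) for _ in range(DEPTH))


def forward(weights, path, x):
    for layer, module in enumerate(path):
        x = weights[(layer, module)] * ACTIVATIONS[module](x)
    return x


def loss(weights, path, x, target):
    return (forward(weights, path, x) - target) ** 2


def train_step(weights, path, x, target, lr=0.05, eps=1e-5):
    """Update only the shared weights that lie on the sampled path."""
    grads = {}
    for layer, module in enumerate(path):
        key = (layer, module)
        original = weights[key]
        weights[key] = original + eps
        up = loss(weights, path, x, target)
        weights[key] = original - eps
        down = loss(weights, path, x, target)
        weights[key] = original  # restore before the actual update
        grads[key] = (up - down) / (2 * eps)  # central finite difference
    for key, g in grads.items():
        weights[key] -= lr * g
    return set(grads)


rng = random.Random(0)
weights = init_supernet()
for _ in range(100):
    train_step(weights, sample_path(rng), x=0.7, target=0.7)

# One more step with a snapshot, to show which weights a single step touches.
before = dict(weights)
path = sample_path(rng)
touched = train_step(weights, path, x=0.7, target=0.7)
print(path, sorted(touched))
```

Because every step touches only `DEPTH` of the `DEPTH * len(MODULES)` shared weights, no module can rely on a fixed neighbor being present, which is the intuition behind the regularization remark above.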
Z-score Normalized Cross-task Architecture Ranking:
- Function: Fairly aggregate rankings across multiple heterogeneous downstream tasks to select the optimal architecture.
- Mechanism: Raw performance \(P_t(a)\) for each task \(t\) is standardized using Z-score and averaged: \(\text{Score}(a) = \frac{1}{|\mathcal{T}|}\sum_{t \in \mathcal{T}} s_t \cdot \frac{P_t(a) - \mu_t}{\sigma_t}\), where \(\mu_t\) and \(\sigma_t\) are the mean and standard deviation of \(P_t\) over the candidate architectures and \(s_t \in \{1, -1\}\) aligns the optimization direction. During ranking, each path is fine-tuned independently rather than evaluated directly with the shared supernet weights, which eliminates interference from shared parameters.
- Design Motivation: Different tasks use varying metrics (e.g., Accuracy vs. RMSE); raw summation would bias towards tasks with wider numerical distributions. Z-score normalization ensures each task contributes equally to the final ranking.
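The scoring formula above can be written out directly. The sketch below is a minimal reading of it, with two assumptions flagged: the population standard deviation (`statistics.pstdev`) is used because the summary does not say which estimator the authors chose, and the architecture names and numbers are made up purely to exercise the function.

```python
from statistics import mean, pstdev


def zscore_rank(perf, higher_is_better):
    """Rank architectures by direction-aligned mean Z-score.

    perf: {arch: {task: raw metric}}; higher_is_better: {task: bool}.
    """
    archs = list(perf)
    tasks = list(higher_is_better)
    scores = {a: 0.0 for a in archs}
    for t in tasks:
        values = [perf[a][t] for a in archs]
        mu, sigma = mean(values), pstdev(values)
        sign = 1.0 if higher_is_better[t] else -1.0  # s_t in the formula
        for a in archs:
            # A task on which every architecture ties carries no ranking signal.
            z = 0.0 if sigma == 0 else (perf[a][t] - mu) / sigma
            scores[a] += sign * z / len(tasks)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical numbers: accuracy is on a 0-100 scale, RMSE is on a much narrower one.
perf = {
    "arch_A": {"acc": 80.0, "rmse": 3.0},
    "arch_B": {"acc": 90.0, "rmse": 2.0},
    "arch_C": {"acc": 85.0, "rmse": 2.5},
}
ranking = zscore_rank(perf, {"acc": True, "rmse": False})
print(ranking)
```

In this toy input the accuracy values spread over 10 points and the RMSE values over 1, yet after standardization both tasks contribute the same magnitude to each score, which is the equal-weighting property the design motivation describes.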
Key Experimental Results
Main Results (DNA Experiment on GUE benchmark, 12 tasks)
| Method | Parameters | TFP-0 | TFP-1 | TFP-2 | TFP-3 | TFP-4 | CPD-all | CPD-notata | CPD-tata |
|---|---|---|---|---|---|---|---|---|---|
| NT-2500M | 2500M | 66.31 | 68.30 | 58.70 | 49.08 | 67.59 | 67.39 | 67.46 | 69.66 |
| DNABERT-2 | 117M | 71.99 | 76.06 | 66.52 | 58.54 | 77.43 | 69.37 | 68.04 | 74.17 |
| VQDNA | 103M | 72.48 | 76.43 | 66.85 | 58.92 | 78.10 | 71.02 | 70.58 | 78.50 |
| Ours (mask-ft) | 4.89M | 84.80 | 86.00 | 85.80 | 77.10 | 89.20 | 83.60 | 85.43 | 89.40 |
| Ours (con-ft) | 3.28M | 84.80 | 86.10 | 86.50 | 77.50 | 89.30 | 83.53 | 84.66 | 90.05 |
BioArc outperforms all baselines on every DNA task shown, with parameter counts of roughly 1/24 (mask-ft) and 1/36 (con-ft) of DNABERT-2. On the eight tasks in the table, its margin over the strongest baseline (VQDNA) ranges from roughly 10 to 20 percentage points.
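The parameter ratios and margins can be recomputed from the table itself. The short script below does only that arithmetic; the numbers are copied from the rows above, and the variable names are ad hoc.

```python
# Values copied from the DNA table above (parameters in millions, scores per task).
params_m = {"DNABERT-2": 117.0, "Ours (mask-ft)": 4.89, "Ours (con-ft)": 3.28}
tasks = ["TFP-0", "TFP-1", "TFP-2", "TFP-3", "TFP-4", "CPD-all", "CPD-notata", "CPD-tata"]
scores = {
    "NT-2500M": [66.31, 68.30, 58.70, 49.08, 67.59, 67.39, 67.46, 69.66],
    "DNABERT-2": [71.99, 76.06, 66.52, 58.54, 77.43, 69.37, 68.04, 74.17],
    "VQDNA": [72.48, 76.43, 66.85, 58.92, 78.10, 71.02, 70.58, 78.50],
    "Ours (mask-ft)": [84.80, 86.00, 85.80, 77.10, 89.20, 83.60, 85.43, 89.40],
    "Ours (con-ft)": [84.80, 86.10, 86.50, 77.50, 89.30, 83.53, 84.66, 90.05],
}
baselines = ["NT-2500M", "DNABERT-2", "VQDNA"]

ratios, gains = {}, {}
for ours in ("Ours (mask-ft)", "Ours (con-ft)"):
    # How many times larger DNABERT-2 is than this model.
    ratios[ours] = params_m["DNABERT-2"] / params_m[ours]
    # Margin over the best baseline on each task, in percentage points.
    gains[ours] = [
        s - max(scores[b][i] for b in baselines) for i, s in enumerate(scores[ours])
    ]
    print(f"{ours}: DNABERT-2 is {ratios[ours]:.1f}x larger; "
          f"margin over best baseline {min(gains[ours]):.2f} to {max(gains[ours]):.2f} points")
```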
Main Results (Protein Experiment on PEER benchmark, controlled comparison)
| Method | Parameters | Solubility | HumanPPI | PPIAffinity↓ | Fold | Subcellular | Binary |
|---|---|---|---|---|---|---|---|
| ESM-2 8M (Official) | 8M | 73.48 | 80.16 | 3.098 | 22.14 | 71.47 | 91.25 |
| ESM-2 8M (Reprod.) | 8M | 71.84 | 74.68 | 3.567 | 18.25 | 70.36 | 90.68 |
| Ours 8M | 8M | 73.29 | 76.79 | 2.756 | 20.75 | 72.77 | 91.82 |
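Under identical pre-training conditions (full UniRef50, 50K steps), BioArc 8M outperforms the reproduced ESM-2 8M on all 6 protein tasks, which points to an architectural advantage rather than a difference in pre-training scale. Against the official ESM-2 8M numbers it leads on three tasks (PPIAffinity, Subcellular, Binary) and trails on Solubility, HumanPPI, and Fold.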
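Because PPIAffinity is a lower-is-better metric, a win count over this table has to respect metric direction. The snippet below tallies wins from the rows above; the `wins` helper is a hypothetical name introduced only for this check.

```python
# Rows copied from the PEER table above; True means higher is better for that task.
tasks = {
    "Solubility": True, "HumanPPI": True, "PPIAffinity": False,
    "Fold": True, "Subcellular": True, "Binary": True,
}
official = [73.48, 80.16, 3.098, 22.14, 71.47, 91.25]
reproduced = [71.84, 74.68, 3.567, 18.25, 70.36, 90.68]
ours = [73.29, 76.79, 2.756, 20.75, 72.77, 91.82]


def wins(candidate, reference):
    """Names of the tasks on which candidate beats reference, respecting metric direction."""
    return [
        name
        for (name, higher), c, r in zip(tasks.items(), candidate, reference)
        if (c > r if higher else c < r)
    ]


wins_vs_reproduced = wins(ours, reproduced)
wins_vs_official = wins(ours, official)
print(len(wins_vs_reproduced), wins_vs_reproduced)
print(len(wins_vs_official), wins_vs_official)
```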
Key Findings
| Finding | Details |
|---|---|
| Optimal DNA Architecture Pattern | Hyena (Long-range) → Transformer (Contextual) → CNN (Local feature extraction) |
| Shared Architectures for Similar Tasks | Top 10% architectures for TFP-3 and TFP-4 show 98.0% similarity |
| Foundation Model Performance | BioArc-F surpasses DNABERT-2 with 1/20 of the parameters and 1/10 of the training steps |
| Tokenization & Architecture Coupling | Transformers prefer 6-mer, CNNs prefer 1-mer; joint optimization is necessary |
| No Universal Pre-training Strategy | Masked modeling is best overall, but training from scratch is superior for some tasks |
| Interpretability Validation | Hybrid architectures hierarchically capture promoter grammar: Hyena establishes global context → Transformer anchors TSS → CNN detects Inr+DPE coordination |
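To make the tokenization finding concrete, here is a minimal k-mer tokenizer. It is a sketch under assumptions: the summary does not say whether the paper's 6-mer tokens overlap, so the hypothetical `stride` parameter exposes both conventions, and the example sequence is arbitrary.

```python
def kmer_tokenize(seq, k, stride=None):
    """Split a DNA string into k-mers.

    stride=k (the default here) gives non-overlapping tokens; stride=1 gives
    fully overlapping tokens. Trailing bases that do not fill a k-mer are dropped.
    """
    stride = k if stride is None else stride
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "ATGCGTACGTTA"  # arbitrary 12-nt example
one_mers = kmer_tokenize(seq, k=1)            # one token per nucleotide
six_mers = kmer_tokenize(seq, k=6)            # non-overlapping 6-mers
six_mers_overlap = kmer_tokenize(seq, k=6, stride=1)

# Vocabulary size over {A, C, G, T} grows as 4**k.
vocab_sizes = {k: 4 ** k for k in (1, 6)}
print(len(one_mers), six_mers, len(six_mers_overlap), vocab_sizes)
```

The trade-off is visible in the sizes: 1-mer tokens keep single-nucleotide resolution with a 4-symbol vocabulary and long sequences, while 6-mers shorten the sequence at the cost of a 4096-symbol vocabulary.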
Highlights & Insights
- First to construct a NAS search space containing five heterogeneous module types in the biological sequence domain, breaking the previous limitation of mixing only two types.
- With only 3-8M parameters, the discovered architectures significantly outperform DNA baselines roughly 20x to over 500x their size (103M to 2500M), demonstrating the importance of architectural inductive bias.
- Systematically decoupled the interaction between architecture, tokenization, and training strategies, providing actionable design principles for biological models.
- Interpretability analysis shows that hybrid architectures spontaneously learn hierarchical representations corresponding to known biological mechanisms.
Limitations & Future Work
- Validated only on DNA and protein modalities; the generalization to RNA and single-cell data remains unknown.
- The search space is fixed to 360 representative architectures, potentially missing promising topological combinations.
- Absolute performance on protein structure prediction (e.g., Fold task) still lags behind large-scale pre-trained models like ESM-2, indicating that architectural advantages cannot fully substitute for data scale.