JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model¶
Conference: NeurIPS 2025 arXiv: 2505.17257 Code: GitHub Area: Genomics Keywords: DNA foundation model, bidirectional modeling, Mamba-Attention, Mixture-of-Experts, genomics
TL;DR¶
JanusDNA is proposed as the first bidirectional DNA foundation model built on Janus Modeling, a pretraining paradigm that combines the bidirectional comprehension of masked modeling with the training efficiency of autoregressive modeling; paired with a Mamba-Attention-MoE hybrid architecture, it attains state-of-the-art performance across multiple genomic benchmarks.
Background & Motivation¶
Background: Large language models are being applied to DNA sequence modeling, yet direct transfer faces unique challenges—handling long-range dependencies in ultra-long sequences (>10k base pairs) while requiring bidirectional understanding.
Limitations of Prior Work:

- Sequence length vs. resolution trade-off: attention mechanisms struggle with long sequences, while k-mer tokenization expands the context window but sacrifices resolution (losing SNP information).
- Unidirectional understanding: decoder-based models (HyenaDNA, Evo) support only unidirectional context, whereas many regulatory elements (e.g., bidirectional promoters) require bidirectional modeling.
- Training inefficiency: MLM (BERT-style) involves only ~15% of tokens in the loss computation, which is extremely inefficient for long-sequence training.
Key Challenge: An inherent trade-off exists between bidirectional understanding capability (MLM) and training efficiency (autoregressive).
Goal: To construct an efficient bidirectional DNA foundation model that simultaneously handles long sequences and maintains high training efficiency.
Key Insight: Design a novel pretraining paradigm (Janus Modeling) in which all tokens contribute to loss computation (as in autoregressive training) while preserving bidirectional understanding (as in MLM).
Core Idea: Achieve full-token-loss bidirectional pretraining through independent bidirectional encoding combined with a carefully designed attention mask fusion mechanism.
Method¶
Overall Architecture¶
JanusDNA comprises three core components: (1) Janus Modeling—an efficient bidirectional pretraining method; (2) a Mamba-Attention-MoE hybrid architecture; and (3) a reverse complement (RC) processing strategy. The forward and reverse sequences are independently encoded through separate Mamba+MoE stacks and subsequently fused via FlexAttention, enabling bidirectional prediction without information leakage.
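A minimal sketch of this dataflow in plain PyTorch, to make the wiring concrete: `nn.GRU` stands in for the causal Mamba+MoE stacks, ordinary multi-head attention stands in for FlexAttention, and learned per-position queries keep the fusion leak-free. All names (`JanusSketch`, `fwd_enc`, `bwd_enc`, ...) are invented for illustration and are not the authors' implementation.

```python
import torch
import torch.nn as nn

class JanusSketch(nn.Module):
    """Illustrative dataflow only: GRUs stand in for the Mamba+MoE stacks,
    nn.MultiheadAttention stands in for FlexAttention."""
    def __init__(self, vocab_size=8, d_model=64, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.fwd_enc = nn.GRU(d_model, d_model, batch_first=True)  # left-to-right stand-in
        self.bwd_enc = nn.GRU(d_model, d_model, batch_first=True)  # run on the flipped sequence
        self.query = nn.Embedding(max_len, d_model)                # leak-free per-position queries
        self.fuse = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                         # tokens: (B, T) nucleotide ids
        B, T = tokens.shape
        x = self.embed(tokens)
        h_fwd, _ = self.fwd_enc(x)                     # H^F_t encodes x_1..x_t
        h_bwd, _ = self.bwd_enc(x.flip(1))
        h_bwd = h_bwd.flip(1)                          # H^B_t encodes x_t..x_T
        keys = torch.cat([h_fwd, h_bwd], dim=1)        # (B, 2T, d): forward then backward states
        # Position t may attend to forward states k < t and backward states j > t;
        # True marks *blocked* entries in nn.MultiheadAttention's boolean attn_mask.
        idx = torch.arange(T, device=tokens.device)
        allowed = torch.cat([idx[None, :] < idx[:, None],      # forward half: k < t
                             idx[None, :] > idx[:, None]], 1)  # backward half: j > t
        q = self.query(idx).unsqueeze(0).expand(B, -1, -1)
        fused, _ = self.fuse(q, keys, keys, attn_mask=~allowed)
        return self.head(fused)                        # logits for every position

# Toy usage: a batch of 2 sequences of 16 nucleotide ids.
model = JanusSketch()
logits = model(torch.randint(0, 8, (2, 16)))           # (2, 16, 8)
```

The property mirrored here is that the states attended to for position \(t\) never contain \(x_t\) itself, which is what lets a loss be taken over every position.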
Key Designs¶
- Janus Modeling (Efficient Bidirectional Training):
- Function: Enables every token to be predicted based on full bidirectional context, with all tokens contributing to the loss.
- Design Motivation: MLM computes loss over only 15% of tokens, resulting in low efficiency; autoregressive methods are efficient but unidirectional.
- Mechanism:
- Forward encoding: \(H_t^F = \text{ForwardEncoder}(x_1, ..., x_t)\)
- Backward encoding: \(H_t^B = \text{BackwardEncoder}(x_T, ..., x_t)\)
- Bidirectional fusion: a carefully designed attention mask \(\mathcal{M}_{ij}\) ensures that the prediction of \(x_t\) uses only \(H_k^F\ (k<t)\) and \(H_j^B\ (j>t)\) (see the mask sketch after this list)
- Training objective: \(\mathcal{L}_{bidirectional} = -\sum_{t=1}^{T} \log P(x_t | x_1,...,x_{t-1}, x_{t+1},...,x_T)\)
- Novelty: approximately 2× faster than MLM (which computes loss only on the sparse subset of masked tokens), with significantly higher learning efficiency.
- Hybrid Architecture (Mamba-Attention-MoE):
- Function: Combines the long-sequence efficiency of SSMs, the global comprehension of attention, and the sparse capacity expansion of MoE.
- Design Motivation: Pure attention cannot scale to million-base-pair sequences; pure SSMs lack global fusion capability.
- Mechanism:
- Mamba layers efficiently encode local context.
- MoE layers replace a portion of the FFN layers to expand model capacity via sparse activation (a generic top-1 MoE layer is sketched after this list).
- FlexAttention layers realize bidirectional fusion.
- MoE auxiliary loss: \(\mathcal{L}_{aux} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i\), where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is its mean router probability, ensures balanced expert utilization.
- Novelty: Capable of processing 1 million base pairs on a single 80 GB GPU.
- Reverse Complement (RC) Processing:
- Function: Processes the forward DNA strand and its reverse complement strand in parallel.
- Design Motivation: The double-stranded DNA structure carries equivalent information; non-palindromic motifs must be recognized in both orientations simultaneously.
- Mechanism: Both the forward strand and the RC strand are fed independently into the same model; output representations are pooled and merged (a minimal sketch follows this list).
- Attention Mask Design (FlexAttention Mask):
- Function: Controls information flow in attention over the \(2T\)-length input sequence.
- Design Motivation: Information leakage at position \(t\) during prediction must be strictly prevented.
- Mechanism: Four rules govern attention within the forward segment, within the backward segment, and across the two directions; a simplified sketch of the resulting leak-free rule follows this list.
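The leak-free rule from the Janus Modeling item can be written as a single visibility predicate over the assumed concatenated key layout \([H^F_{1..T}, H^B_{1..T}]\). The sketch below is a simplified illustration in the style of a FlexAttention `mask_mod`; the paper's full four-rule mask (which also governs attention inside each segment) is not reproduced here, and `janus_mask` is an invented name.

```python
import torch

T = 8  # toy sequence length

def janus_mask(b, h, q_idx, kv_idx):
    """Visibility predicate over the concatenated key sequence
    [H^F_0..H^F_{T-1}, H^B_0..H^B_{T-1}] (0-indexed): the query predicting
    position t may see forward state k only if k < t, and backward state j
    only if j > t, so the token at t never leaks into its own prediction."""
    is_fwd = kv_idx < T
    pos = torch.where(is_fwd, kv_idx, kv_idx - T)   # index within its own segment
    return torch.where(is_fwd, pos < q_idx, pos > q_idx)

# Dense equivalent, handy for inspecting or unit-testing the rule on a small T.
q = torch.arange(T)[:, None]
kv = torch.arange(2 * T)[None, :]
visible = janus_mask(0, 0, q, kv)           # (T, 2T) boolean visibility matrix
assert not visible[3, 3]                    # H^F_3 is hidden when predicting x_3
assert visible[3, 2] and visible[3, T + 4]  # H^F_2 and H^B_4 are visible
# With PyTorch >= 2.5, a predicate of this shape can be compiled into a block mask via
# torch.nn.attention.flex_attention.create_block_mask(janus_mask, None, None, T, 2 * T).
```

Each row of the dense matrix is a strictly lower-triangular forward half next to a strictly upper-triangular backward half, so every one of the \(T\) positions can be predicted without seeing itself.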
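For the sparse capacity expansion described in the hybrid-architecture item, a generic top-1 (Switch-style) MoE feed-forward layer is sketched below. This is a stand-in rather than the authors' exact router; `SwitchFFN`, the expert width, and the routing details are assumptions. The layer also returns the per-expert statistics \(f_i\) and \(P_i\) consumed by the auxiliary loss in Loss & Training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Generic top-1 sparse MoE feed-forward layer (a stand-in, not the authors'
    exact design): each token is routed to one expert, so parameter count grows
    with the number of experts at roughly constant per-token compute."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                   # x: (B, T, d)
        probs = F.softmax(self.router(x), dim=-1)           # (B, T, E) router probabilities
        top_p, top_i = probs.max(dim=-1)                    # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # only the chosen expert runs per token
            sel = top_i == e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(-1) * expert(x[sel])
        # Statistics consumed by the load-balancing auxiliary loss (Loss & Training below):
        f = torch.stack([(top_i == e).float().mean() for e in range(len(self.experts))])
        P = probs.mean(dim=(0, 1))                          # mean router probability per expert
        return out, f, P

# Toy usage.
layer = SwitchFFN()
y, f, P = layer(torch.randn(2, 16, 64))                     # y: (2, 16, 64); f, P: (8,)
```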
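The reverse-complement handling can be sketched as follows, assuming a toy A/C/G/T to 0/1/2/3 vocabulary and simple mean pooling; `revcomp`, `rc_merged_representation`, and the averaging step are illustrative choices, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

# Assumed toy vocabulary: A=0, C=1, G=2, T=3.
COMPLEMENT = torch.tensor([3, 2, 1, 0])      # A<->T, C<->G

def revcomp(tokens):
    """Reverse complement of a batch of token ids, shape (B, T)."""
    return COMPLEMENT[tokens].flip(dims=[1])

def rc_merged_representation(model, tokens):
    """Feed the forward strand and its reverse complement through the same
    model, pool each output over the sequence, and merge the two (here by a
    simple average, one possible merging choice)."""
    h_fwd = model(tokens).mean(dim=1)          # (B, d) pooled forward-strand representation
    h_rc = model(revcomp(tokens)).mean(dim=1)  # (B, d) pooled RC-strand representation
    return 0.5 * (h_fwd + h_rc)

# Toy usage with a trivial stand-in encoder.
encoder = nn.Embedding(4, 16)                                            # (B, T) -> (B, T, 16)
reps = rc_merged_representation(encoder, torch.randint(0, 4, (2, 10)))   # (2, 16)
```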
Loss & Training¶
- Primary loss: bidirectional prediction loss \(\mathcal{L}_{bidirectional}\) (all tokens participate).
- MoE auxiliary loss: ensures balanced expert load.
- Pretraining data: human reference genome HG38, tokenized at single-nucleotide resolution.
- Context length: 131,072 (extensible to 1M).
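A hedged sketch of how the two training terms combine, with \(f_i\) and \(P_i\) as defined for the auxiliary loss above; the balancing coefficient `alpha` and the toy shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(f, P, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, with f_i the fraction of tokens routed to
    expert i and P_i the mean router probability assigned to expert i."""
    return alpha * f.numel() * torch.sum(f * P)

def janus_training_loss(logits, tokens, f, P, alpha=0.01):
    """Bidirectional prediction loss over *all* positions plus the MoE auxiliary term."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
    return ce + load_balance_loss(f, P, alpha)

# Toy usage: 2 sequences of length 16, vocabulary of 8 tokens, 4 experts.
logits = torch.randn(2, 16, 8)
tokens = torch.randint(0, 8, (2, 16))
f = torch.full((4,), 0.25)      # perfectly balanced routing
P = torch.full((4,), 0.25)
loss = janus_training_loss(logits, tokens, f, P)
```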
Key Experimental Results¶
Main Results¶
Genomic Benchmarks (8 tasks, Top-1 Accuracy, 5-fold CV), selected tasks:
| Model | Active Params | Mouse Enhancers | Coding vs Inter. | Human Regulatory | Human NonTATA |
|---|---|---|---|---|---|
| HyenaDNA | 436k | 0.780 | 0.904 | 0.869 | 0.944 |
| Caduceus-PS | 470k | 0.793 | 0.910 | 0.873 | 0.945 |
| JanusDNA | 426k | 0.770 | 0.912 | 0.877 | 0.957 |
Nucleotide Transformer Benchmark (18 tasks) — Selected Histone Marks:
| Model | Active Params | H3 | H3K14ac | H3K36me3 | H3K4me3 |
|---|---|---|---|---|---|
| Enformer | 252M | 0.719 | 0.288 | 0.344 | 0.158 |
| NT-v2 | 500M | 0.784 | 0.551 | 0.625 | 0.410 |
| Caduceus-PH | 1.9M | 0.815 | 0.631 | 0.601 | 0.544 |
| JanusDNA | 2M | 0.835 | 0.729 | 0.702 | 0.688 |
DNALongBench eQTL Task (AUROC, sequence length 450k):
| Model | Artery Tibial | Muscle Skeletal | Nerve Tibial | Whole Blood |
|---|---|---|---|---|
| Enformer (252M) | 0.741 | 0.621 | 0.683 | 0.689 |
| Caduceus-PH (7.7M) | 0.690 | 0.789 | 0.842 | 0.769 |
| JanusDNA (7.7M) | 0.852 | 0.864 | 0.914 | 0.821 |
Ablation Study¶
Janus Modeling vs. Masked Modeling efficiency comparison (10k training steps, last-token prediction accuracy):

- Janus Modeling substantially outperforms Masked Modeling across all hidden dimensions (32/64/128).
- Janus training speed: ~27 minutes per 1,000 steps, approximately 2× faster than Masked Modeling.
- At hidden dimension 128, Janus achieves at 5k steps the accuracy that Masked Modeling requires 10k steps to reach.
Key Findings¶
- JanusDNA achieves state-of-the-art performance on 12 out of 18 NT benchmark tasks, surpassing models with 250× more parameters.
- JanusDNA substantially outperforms the specialist model Enformer on long-range eQTL tasks.
- Janus Modeling improves training efficiency by approximately 2× over MLM.
- Processing of 1 million base pairs on a single 80 GB GPU demonstrates strong practical utility.
- MoE layers effectively expand model capacity without significantly increasing computational cost.
Highlights & Insights¶
- The apt "Janus" metaphor: like the two-faced Roman god, the model encodes each direction independently before fusing them, echoing the double-stranded nature of DNA.
- Decoupling performance from parameter count: a 2M-parameter model surpasses models with 500M+ parameters.
- Training paradigm innovation: The approach simultaneously resolves two fundamental problems—the low efficiency of MLM and the unidirectionality of autoregressive methods.
- Elegant FlexAttention mask design: Achieves full-token bidirectional prediction without information leakage over inputs of length \(2T\).
Limitations & Future Work¶
- Pretraining is conducted exclusively on the human reference genome, lacking cross-species and genomic variation data.
- Epigenetic information (chromatin accessibility, histone modifications, and other multimodal data) has not been integrated.
- Computational resource requirements for long sequences remain substantial.
- Future work may explore modeling of functional features such as CTCF-mediated chromatin loops.
Related Work & Insights¶
- The approach differs from Caduceus's bidirectional SSM strategy: Caduceus achieves bidirectionality via bidirectional Mamba, whereas JanusDNA employs Janus Modeling combined with fusion attention.
- The trend of Mamba + Attention hybrid architectures is emerging concurrently in NLP (e.g., Jamba) and genomics.
- Sparse MoE capacity expansion is particularly valuable for ultra-long-sequence models.
- Single-nucleotide resolution tokenization is essential for SNP-related research.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The Janus Modeling training paradigm is highly innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 35 tasks, three major benchmarks, and comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with well-crafted figures and tables.
- Value: ⭐⭐⭐⭐⭐ Establishes a new paradigm for DNA foundation models with broad practical impact.