TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling¶
Conference: AAAI 2026 arXiv: 2507.19229 Code: Not released Area: AI for Science / Genomics Keywords: DNA foundation model, long-sequence modeling, reverse complement, groove fusion, multi-window attention, evolutionary training strategy
TL;DR¶
TrinityDNA is a bio-inspired DNA foundation model integrating three innovations: a Groove Fusion module for capturing major/minor groove structural features, a Gated Reverse Complement mechanism for handling double-strand complementary symmetry, and Sliding Multi-Window Attention for multi-scale long-range dependency modeling. Combined with an Evolutionary Training Strategy (ETS) progressing from prokaryotes to eukaryotes, TrinityDNA achieves an average MCC of 0.708 across 15 GUE benchmark tasks (surpassing NT with 2.5B parameters), leads on both prokaryotic and eukaryotic zero-shot tasks across 19 benchmarks, and introduces a new CDS annotation benchmark for long-sequence inference evaluation.
Background & Motivation¶
Background: DNA sequence modeling in genomics faces unique challenges—sequences are extremely long (tens to hundreds of thousands of base pairs), information density is low (abundant repetitive and non-coding regions), and sequences exhibit complex biological structures (double-strand complementarity, groove structures, long-range regulatory dependencies). Existing DNA foundation models (e.g., HyenaDNA, Caduceus/MambaDNA, DNABERT2) each have notable limitations.
Limitations of Prior Work:

- Locality bias in SSMs: Although SSMs can theoretically handle long sequences, empirical analysis (Figure 2) shows that influence scores in Caduceus decay rapidly with distance, losing focus at long range.
- Over-smoothing in full attention: As sequence length increases, self-attention entropy tends toward a uniform distribution (Figure 3), making all token weights nearly equal and drowning out useful signals.
- Lack of biological structural awareness: Existing models do not explicitly model the major/minor groove structures of DNA, nor do they fully exploit reverse-complement strand information.
- Poor cross-species generalization: Many models are trained on data from a single species, which limits transfer to other species.
Key Challenge: How can a model maintain computational efficiency while capturing both biological structural features of DNA and multi-scale dependencies in ultra-long sequences?
Key Insight: A trinity of "sequence + structure + strategy"—sequence modeling (multi-window attention), structure awareness (groove fusion + reverse complement), and training strategy (evolutionary learning from prokaryotes to eukaryotes).
Method¶
Overall Architecture¶
Input DNA sequence → Groove Fusion multi-scale convolutional tokenization → TrinityDNA Transformer blocks (SMWA + FFN) × L → Gated Reverse Complement double-strand fusion → Output
Key Designs¶
- Groove Fusion Module:
  - Design Motivation: The DNA double helix contains a major groove (5–7 nucleotides wide) and a minor groove (3–5 nucleotides wide), which play distinct roles in protein binding and molecular interactions.
  - Mechanism: Three convolutional kernels (\(k=3,5,7\)) perform multi-scale tokenization, corresponding to the spatial scales of the minor groove, transitional region, and major groove respectively: \(\text{GrooveFusion}(S) = \sum_{k \in \{3,5,7\}} \text{GELU}(\text{Conv}_k(S))\)
  - Effect: Pretraining perplexity reduced by 0.065.
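A minimal NumPy sketch of the fused multi-scale convolution (the kernel shapes, einsum-based `Conv1d`, and embedding dimension are illustrative assumptions; the paper's implementation is not released):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def conv1d_same(x, kernel):
    """1D convolution over the sequence axis with 'same' padding.
    x: (seq_len, d); kernel: (k, d, d) hypothetical weights."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        window = xp[i:i + k]                       # (k, d)
        out[i] = np.einsum("kd,kde->e", window, kernel)
    return out

def groove_fusion(x, kernels):
    """Sum of GELU-activated convolutions at k = 3, 5, 7 (minor groove,
    transitional region, major groove)."""
    return sum(gelu(conv1d_same(x, w)) for w in kernels)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((16, d))                   # 16 nucleotide embeddings
kernels = [0.1 * rng.standard_normal((k, d, d)) for k in (3, 5, 7)]
y = groove_fusion(x, kernels)                      # (16, d), same length as input
```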
- Sliding Multi-Window Attention (SMWA):
  - Design Motivation: Addresses over-smoothing in full attention and locality bias in SSMs.
  - Mechanism: Each attention head \(h\) is assigned its own window size \(L_h\) and attends to dependencies at its scale via a sliding window: \(\text{Attn}_h(S_i) = \text{Softmax}\left(\frac{Q_h(i) K_h(i+[-L_h, L_h])^T}{\sqrt{d_k}}\right) V_h(i+[-L_h, L_h])\)
  - Small-window heads capture local features (promoters, binding sites); large-window heads capture long-range regulatory relationships.
  - Efficiency: On the 1B-parameter model, FLOPs are reduced by 31% (64.5 → 44.5 TFLOPs) at the cost of only a 0.010 increase in perplexity.
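The per-head windowed attention above can be sketched in NumPy as follows (single shared Q/K/V for brevity; in the real model each head has its own projections, and the loop would be a banded/fused kernel rather than per-position Python):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sliding_window_attention(q, k, v, window):
    """Single head: position i attends only to [i - window, i + window]."""
    n, dk = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(dk)
        out[i] = softmax(scores) @ v[lo:hi]
    return out

def smwa(q, k, v, windows):
    """Multi-window attention: each head gets its own window size L_h;
    head outputs are concatenated along the feature axis."""
    heads = [sliding_window_attention(q, k, v, w) for w in windows]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(1)
n, dk = 12, 4
q, k, v = (rng.standard_normal((n, dk)) for _ in range(3))
out = smwa(q, k, v, windows=(2, 8))   # small + large window heads -> (12, 8)
```

With `window >= n - 1` a head degenerates to full attention, which is a handy sanity check on the implementation.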
- Gated Reverse Complement (GRC):
  - Design Motivation: Double-strand complementarity is fundamental to gene expression; both the forward strand \(S\) and its reverse complement \(S^R\) carry important information.
  - Mechanism: A parameter-shared Transformer processes the forward and reverse-complement strands simultaneously, fusing them via a gating mechanism: \(\text{GRC}(S, S^R) = f_\theta(S) + \sigma(W_G \cdot f_\theta(\text{Flip}(S^R)))\), where \(\sigma\) is the identity function and \(W_G\) is a learnable gating weight.
  - Effect: Perplexity reduced by 0.132 (the largest contribution among the three modules).
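A toy NumPy sketch of the GRC fusion, with a linear map standing in for the shared encoder \(f_\theta\) (where exactly Flip is applied relative to the encoder is an implementation detail assumed here):

```python
import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string, e.g. 'ATGC' -> 'GCAT'."""
    return seq.translate(COMPLEMENT)[::-1]

def grc(f_theta, s_fwd, s_rc, W_g):
    """GRC(S, S^R) = f(S) + sigma(W_g · Flip(f(S^R))) with sigma = identity.

    f_theta is the parameter-shared encoder applied to both strands; Flip
    reverses the sequence axis so reverse-complement features re-align
    position-wise with the forward strand before the gated additive fusion.
    """
    fwd = f_theta(s_fwd)           # (n, d)
    rev = f_theta(s_rc)[::-1]      # Flip along the sequence axis
    return fwd + rev @ W_g         # identity sigma, learnable gate W_g

# Toy usage: random embeddings and a linear "encoder" (stand-ins, not the paper's)
rng = np.random.default_rng(0)
d = 4
W_enc = rng.standard_normal((d, d))
f_theta = lambda x: x @ W_enc
W_g = 0.1 * np.eye(d)
s_fwd = rng.standard_normal((10, d))   # forward-strand embeddings
s_rc = rng.standard_normal((10, d))    # reverse-complement-strand embeddings
out = grc(f_theta, s_fwd, s_rc, W_g)   # (10, d)
```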
- Evolutionary Training Strategy (ETS):
  - Stage 1: Pretraining on prokaryotic (bacterial/archaeal) DNA at sequence length 8K to learn fundamental nucleotide patterns.
  - Stage 2: Continued training on multi-species data (fungi, vertebrates, etc.) at sequence lengths up to 100K to learn complex intron–exon structures and cross-gene regulatory elements.
  - The two stages yield TrinityMicroDNA (prokaryotes only) and TrinityDNA (prokaryotes + eukaryotes), respectively.
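The two-stage curriculum can be outlined with a toy driver loop (all classes and names here are stand-ins for illustration; the actual corpora, step counts, and MLM update are the paper's, not shown):

```python
import random

class ToyCorpus:
    """Stand-in corpus yielding random DNA strings (the real data comes from
    sources such as GTDB, IMG, and RefSeq)."""
    def __init__(self, name):
        self.name = name
    def sample(self, seq_len):
        return "".join(random.choice("ACGT") for _ in range(seq_len))

class ToyModel:
    """Stand-in model that just records each batch's stage and context length
    in place of a masked-LM update."""
    def __init__(self):
        self.log = []
    def step(self, batch, stage):
        self.log.append((stage, len(batch)))

def train_stage(model, corpus, seq_len, steps, stage):
    for _ in range(steps):
        model.step(corpus.sample(seq_len), stage)
    return model

model = ToyModel()
# Stage 1: prokaryotic DNA at 8K context (checkpoint -> TrinityMicroDNA)
train_stage(model, ToyCorpus("prokaryotes"), seq_len=8_000, steps=2, stage="stage1")
# Stage 2: continue on multi-species data at up to 100K context (-> TrinityDNA)
train_stage(model, ToyCorpus("multi-species"), seq_len=100_000, steps=2, stage="stage2")
```

The key design choice is that Stage 2 resumes from the Stage 1 checkpoint rather than reinitializing, which is what the ablation in this paper compares against training from scratch on joint data.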
Scaling Laws¶
- Across 6M to 1B parameters, TrinityDNA's perplexity–FLOPs Pareto frontier outperforms Transformer, Caduceus, EVO, and EVO2 at every compute level.
- Perplexity decreases steadily as the context window extends from 8K → 30K → 100K.
Key Experimental Results¶
Main Results 1: GUE Benchmark (15 Genome Understanding Tasks)¶
| Model | Params | H3 | H3K14ac | H3K36me3 | Human TF | Mouse TF | Splice | Avg |
|---|---|---|---|---|---|---|---|---|
| DNABERT | 86M | 0.731 | 0.401 | 0.473 | 0.642 | 0.564 | 0.841 | 0.552 |
| NT | 2.5B | 0.788 | 0.562 | 0.620 | 0.633 | 0.670 | 0.894 | 0.636 |
| DNABERT2 | 117M | 0.783 | 0.526 | 0.569 | 0.701 | 0.680 | 0.850 | 0.621 |
| Caduceus | 40M | 0.799 | 0.541 | 0.609 | - | - | - | 0.586 |
| TrinityDNA | 1B | 0.814 | 0.694 | 0.692 | 0.714 | 0.786 | 0.927 | 0.708 |
- TrinityDNA achieves the best results on most of the 15 tasks with an overall average MCC of 0.708, surpassing NT (0.636) despite NT having 2.5B parameters.
- Improvements are especially pronounced on tasks requiring long-range dependencies, such as histone modification prediction and transcription factor binding site prediction.
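The GUE scores above are Matthews correlation coefficients (MCC). For reference, a minimal implementation from binary confusion-matrix counts:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient: +1 = perfect, 0 = chance, -1 = inverted.
    Returns 0.0 when any marginal is empty (the conventional limit)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A balanced classifier that is right 90% of the time on 100 + 100 examples
score = mcc(tp=90, tn=90, fp=10, fn=10)   # -> 0.8
```

Unlike accuracy or F1, MCC stays informative under class imbalance, which is why GUE reports it.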
Main Results 2: Zero-Shot Performance (19 Tasks)¶
| Model | Params | Prokaryotic RNA/Protein DMS Avg | Eukaryotic ClinVar+DMS Avg |
|---|---|---|---|
| TrinityMicroDNA | 1B | 0.475 | 0.404 |
| TrinityDNA | 1B | 0.366 | 0.699 |
| EVO | 7B | 0.328 | 0.415 |
| EVO2 | 40B | 0.335 | 0.667 |
| EVO2 | 1B | 0.353 | 0.670 |
| Caduceus | 40M | 0.099 | 0.314 |
Key Findings:

- TrinityMicroDNA dominates all baselines on prokaryotic tasks (0.475 vs. 0.335 for EVO2-40B), validating the effectiveness of ETS.
- TrinityDNA surpasses the 40B-parameter EVO2 on eukaryotic tasks (0.699 vs. 0.667), demonstrating a dramatic efficiency advantage at the 1B scale.
- The complementary strengths confirm the value of ETS: the prokaryotic stage learns fundamental patterns; the eukaryotic stage learns complex structures.
Main Results 3: CDS Annotation Benchmark (Newly Proposed)¶
| Method | Category | Exact Match F1 | 75% Match F1 |
|---|---|---|---|
| Prodigal | Classical pipeline | 0.725 | 0.829 |
| GENSCAN | Classical pipeline | 0.702 | 0.799 |
| TrinityMicroDNA-1B | Pretrained model | 0.754 | 0.803 |
| Caduceus-40M | Pretrained model | 0.140 | 0.180 |
- TrinityMicroDNA surpasses the classical tool Prodigal on Exact Match F1 (0.754 vs. 0.725), demonstrating strong generalization.
- CDS annotation on 20K-length sequences validates long-sequence inference capability.
Ablation Study¶
| Component | PPL without | PPL with (Δ) | FLOPs Change |
|---|---|---|---|
| GRC | 2.731 | 2.599 (−0.132) | — |
| GFM | 2.599 | 2.534 (−0.065) | — |
| SMWA | 2.534 | 2.544 (+0.010) | −31% |
- GRC contributes the most, indicating that reverse complement information is critical for DNA modeling.
- SMWA incurs only a marginal 0.010 perplexity increase while reducing FLOPs by 31%, representing a favorable efficiency–performance trade-off.
- ETS validation: initializing from prokaryotic pretraining followed by joint-data fine-tuning outperforms training from scratch on joint data.
Efficiency Analysis¶
- At sequence lengths of 64K tokens, TrinityDNA maintains >80% of short-sequence throughput.
- This is attributed to SMWA and optimized fused kernels, whose memory footprint scales minimally with context length.
Highlights & Insights¶
- Deep integration of biological knowledge: Rather than straightforwardly adapting NLP models, the architecture is designed from the physical structure (grooves), chemical properties (base complementarity), and evolutionary patterns of DNA.
- "Sequence + structure + strategy" trinity design philosophy: Each component addresses one core challenge; the name "Trinity" is apt.
- Smaller model outperforms larger models: A 1B-parameter model surpasses EVO (7B) and EVO2 (40B), demonstrating that inductive biases matter more than brute-force scaling.
- Biological intuition behind ETS: Learning simple patterns first (prokaryotes: small, structurally simple genomes) before complex ones (eukaryotes: large genomes with introns and exons) is consistent with curriculum learning principles.
- Practical value of the CDS annotation benchmark: Evaluation is extended from manually crafted small tasks to real genome annotation scenarios, with 20K sequence lengths closer to real-world applications.
- Clear diagnosis of the over-smoothing problem: Figure 3 visually demonstrates entropy uniformization in full attention on long sequences, providing direct empirical motivation for SMWA.
Limitations & Future Work¶
- 1B parameters remains large: Inference costs are still non-trivial for practical bioinformatics pipelines.
- MLM-only pretraining objective: Autoregressive and other pretraining paradigms are not explored.
- Marginal perplexity increase from SMWA: Although computationally efficient, the slight perplexity degradation may affect tasks that are extremely sensitive to accuracy.
- Training data quality control: Multiple databases (GTDB, IMG, RefSeq, etc.) are integrated without detailed discussion of data cleaning and deduplication.
- Identity gating function in GRC: \(\sigma\) is set to the identity; more expressive gating mechanisms (e.g., sigmoid or softmax) are not explored.
- CDS annotation benchmark limited to prokaryotes: Eukaryotic CDS annotation (involving intron–exon structure) is more complex and remains unvalidated.
- Code not released: Reproducibility and community follow-up are constrained.
Related Work & Insights¶
- DNA foundation models: DNABERT, DNABERT2, Nucleotide Transformer (NT), HyenaDNA, Caduceus/MambaDNA, EVO, EVO2, VQDNA
- SSM architectures: S4, Mamba, Hyena
- Genomics tasks: GUE benchmark, ProteinGym, ClinVar, ENCODE
- Long-sequence modeling: BigBird, DuoAttention, Longformer
Rating ⭐⭐⭐⭐¶
The depth of biological knowledge integration is outstanding, the design philosophy is clear, and experimental results are strong (1B outperforming 40B). However, the code is not released, certain design choices (e.g., GRC gating function) are insufficiently discussed, and the CDS benchmark covers prokaryotes only. Overall, this is a valuable contribution to the DNA foundation model field.