# AntigenLM: Structure-Aware DNA Language Modeling for Influenza

- Conference: ICLR 2026
- arXiv: 2602.09067
- Code: https://github.com/peilab-cnic/AntigenLM
- Area: Biological Sequence Generation / DNA Language Models
- Keywords: DNA language model, influenza virus prediction, functional unit encoding, whole-genome modeling, vaccine design
## TL;DR
AntigenLM is a GPT-2-style DNA language model that preserves the integrity of genomic functional units. Pretrained on complete influenza virus whole genomes and subsequently fine-tuned, it autoregressively predicts antigenic sequences of future circulating strains, achieving significantly lower amino acid mismatch rates than the evolutionary model beth-1 and general-purpose genomic models.
## Background & Motivation
Background: Influenza viruses evolve rapidly to escape host immunity, necessitating frequent vaccine strain updates. Current WHO vaccine recommendations rely on phylodynamic indicators (e.g., LBI) and site-level evolutionary prediction models (e.g., beth-1).
Limitations of Prior Work:
- Site-level models (beth-1) treat mutations as independent events and cannot capture co-evolution across genomic segments.
- General-purpose genomic foundation models (DNABERT, HyenaDNA) are trained on multi-species heterogeneous corpora, losing species-specific structural information.
- Protein-level models (ESM, ProtGPT2) entirely ignore nucleotide-level evolutionary mechanisms such as synonymous mutations, non-coding regulatory elements, and codon adaptation.
Key Challenge: Viral evolution is driven by coordinated multi-segment genome-wide interactions (RNA–RNA interactions, segment reassortment constraints, polymerase–antigen co-adaptation); fragmented modeling discards critical signals.
Goal: To construct a DNA language model that preserves functional unit integrity and captures whole-genome dependencies at the nucleotide level for accurate influenza antigen sequence prediction.
Key Insight: The influenza genome is compact (~13k nucleotides), making it suitable for single-Transformer whole-genome modeling. By maintaining a fixed ordering and complete boundaries of the 8 gene segments, the model can learn cross-segment co-evolutionary patterns.
Core Idea: Maintaining the integrity and correct arrangement of genomic functional units during pretraining enables the DNA language model to capture high-order evolutionary constraints across segments.
## Method
### Overall Architecture
AntigenLM adopts a GPT-2-style decoder-only Transformer architecture. The input is the full-genome nucleotide sequence of influenza A virus (up to 13k tokens), and the output is the autoregressively generated next nucleotide. The pipeline consists of two stages: (1) unsupervised pretraining on 54,512 complete influenza genomes; and (2) fine-tuning separately for two downstream tasks—antigen sequence prediction and subtype classification.
### Key Designs
- Functional-Unit-Aware Pretraining:
    - Function: The 8 gene segments (PB2, PB1, PA, HA, NP, NA, MP, NS) are concatenated in a fixed descending-length order into a single whole-genome sequence.
    - Mechanism: Each training sample retains the complete genome; positional encodings span the full 13k positions without truncation or segmentation. A standard causal language modeling loss \(\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T-1} \log p(x_{t+1} \mid x_{\leq t})\) is used for training.
    - Design Motivation: Preserving segment ordering and boundary integrity allows the Transformer's attention mechanism to model cross-segment co-evolutionary dependencies (e.g., compensatory mutations between HA and NA), which is unachievable through fragmented training.
    - Novelty: General DNA models trained on multi-species heterogeneous corpora lose species-specific structure; this work targets a single species (influenza) while preserving biological structure.
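The pretraining data construction described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the single-character nucleotide vocabulary and the toy segment sequences are assumptions; only the fixed segment ordering and the shifted next-token targets come from the paper.

```python
# Minimal sketch of functional-unit-aware pretraining data preparation.
# Segment names and ordering are from the paper; tokenizer is hypothetical.

SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "MP", "NS"]  # fixed descending-length order
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed single-nucleotide vocabulary

def build_genome_tokens(segments: dict) -> list:
    """Concatenate the 8 segments in fixed order into one whole-genome token sequence."""
    genome = "".join(segments[name] for name in SEGMENT_ORDER)
    return [VOCAB[nt] for nt in genome]

def clm_pairs(tokens: list):
    """Causal-LM supervision: predict token t+1 from tokens up to t (input/target shift)."""
    return tokens[:-1], tokens[1:]

segments = {name: "ACGT" for name in SEGMENT_ORDER}  # toy 4-nt stand-in segments
tokens = build_genome_tokens(segments)
inputs, targets = clm_pairs(tokens)
```

Because every sample spans the complete genome, attention can in principle connect positions in different segments, which is the property the ablations below probe.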
- Two-Stage Functional Unit Encoding Strategy:
    - Function: Implicit positional alignment during pretraining; explicit sentinel tokens during fine-tuning.
    - Mechanism: During pretraining, fixed segment ordering combined with positional encoding implicitly encodes segment boundaries. During fine-tuning, special tokens such as `<HA>`, `<NA>`, and `<sep>` are introduced to explicitly delimit functional regions, directing attention and constraining decoding to avoid cross-segment continuation.
    - Design Motivation: Omitting explicit markers during pretraining allows the model to freely learn structural patterns, while adding markers during fine-tuning enables precise generation control, balancing generality with controllability.
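One way to picture the fine-tuning stage of this scheme is a tokenizer whose vocabulary is extended with sentinel tokens. The sketch below is hypothetical (the paper does not publish its tokenizer; the subtype token name `<H1N1>` is an assumption); it only illustrates how sentinels and nucleotides can be split into a single token stream.

```python
import re

# Hypothetical fine-tuning vocabulary: sentinel tokens plus nucleotides.
SPECIALS = ["<H1N1>", "<H3N2>", "<HA>", "<NA>", "<sep>"]  # subtype token names assumed
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list("ACGT"))}

# Match either a sentinel token (<...>) or a single nucleotide.
TOKEN_RE = re.compile(r"<[^>]+>|[ACGT]")

def encode(text: str) -> list:
    """Split a fine-tuning string into sentinel and nucleotide token ids."""
    return [VOCAB[t] for t in TOKEN_RE.findall(text)]

ids = encode("<H1N1><HA>ACG<NA>GT<sep>")
```

During pretraining no such sentinels exist in the stream, which is exactly the "implicit first, explicit later" split the design motivation describes.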
- Temporal Prediction Fine-Tuning Scheme:
    - Function: HA/NA sequences from three consecutive months are used to predict the antigen sequence of the following month.
    - Mechanism: The input format is \(\text{block}^{(1)}\text{block}^{(2)}\text{block}^{(3)}\text{block}^{(\star)}\), where each block is `<subtype><HA>HA<NA>NA<sep>`. Training optimizes the causal LM loss over the full sequence; at inference, three historical blocks are fed as context and the future block is generated autoregressively.
    - Design Motivation: Concatenating antigen sequences from multiple time points implicitly encodes evolutionary trajectories, enabling the model to infer the next evolutionary direction from patterns of sequence change.
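The block format above can be made concrete with a small constructor. This is a sketch under the paper's stated format only; the toy sequences and helper names are illustrative.

```python
def make_block(subtype: str, ha: str, na: str) -> str:
    """One block per month, in the paper's format: <subtype><HA>HA<NA>NA<sep>."""
    return f"<{subtype}><HA>{ha}<NA>{na}<sep>"

def make_prompt(history: list) -> str:
    """Concatenate three historical months; the fourth block is left for the model
    to generate autoregressively at inference time."""
    return "".join(make_block(*month) for month in history)

# Toy 3-month history (real HA/NA sequences are ~1.7k / ~1.4k nt).
history = [("H3N2", "ATG", "GGC"),
           ("H3N2", "ATG", "GGT"),
           ("H3N2", "ATA", "GGT")]
prompt = make_prompt(history)
```

At training time the target month's block is appended to `prompt` and the causal LM loss is taken over the whole concatenation, so the month-to-month differences themselves become part of the supervision signal.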
- Dual-Head Multi-Task Design:
    - Function: Shared Transformer backbone with an LM head (next-nucleotide prediction) and a classification head (subtype classification).
    - Mechanism: The LM head shares weights with the embedding matrix; the classification head extracts hidden states at sentinel token positions and projects them to subtype logits, trained with cross-entropy loss.
    - Design Motivation: The generative task captures global evolutionary dynamics, while the classification task provides a supervisory signal that improves representation quality; both tasks share the backbone and mutually reinforce each other.
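The shapes involved in the dual-head design can be checked with a tiny NumPy forward pass. This is a shape-level sketch only: `backbone` is a stand-in for the shared Transformer layers, and all dimensions are toy values, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, C, T = 12, 16, 4, 10   # toy vocab size, hidden dim, subtype classes, sequence length

E = rng.normal(size=(V, D))      # token embedding matrix (tied with the LM head)
W_cls = rng.normal(size=(D, C))  # classification head projection

def backbone(h):
    """Stand-in for the shared Transformer layers (shapes only)."""
    return np.tanh(h)

tokens = rng.integers(0, V, size=T)
H = backbone(E[tokens])               # (T, D) hidden states from the shared backbone
lm_logits = H @ E.T                   # LM head: weight tying with E gives (T, V) logits
sentinel_pos = 0                      # assumed position of a sentinel token in the input
cls_logits = H[sentinel_pos] @ W_cls  # classification head: (C,) subtype logits
```

Weight tying (`H @ E.T`) means the LM head adds no parameters beyond the embeddings, which matters for a model this compact.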
### Loss & Training
- Pretraining: Standard causal LM loss; AdamW optimizer (learning rate \(1 \times 10^{-4}\), linear warmup for 5% of steps + cosine decay, dropout 0.1, gradient clipping 1.0).
- Effective batch size = 32 genomes/step (8 GPUs × 1 sample × 4 gradient accumulation steps).
- Compact model scale: 6-layer Transformer, 384 hidden dimensions, 6 attention heads, FFN inner dimension 1536.
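The warmup-then-cosine schedule above is standard; a minimal implementation of it, using the paper's base learning rate of 1e-4 and 5% warmup (total step count is a toy assumption), looks like this:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 1e-4,
          warmup_frac: float = 0.05) -> float:
    """Linear warmup for the first 5% of steps, then cosine decay to zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * step / warmup          # linear ramp from 0 to base_lr
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 10_000  # assumed total step count for illustration
```

With 8 GPUs, 1 sample per device, and 4 accumulation steps, each optimizer step under this schedule sees the stated effective batch of 32 whole genomes.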
## Key Experimental Results
### Main Results: Next-Season Antigen Sequence Prediction (Japan, post-2022)
| Method | H1N1-HA AA Mismatch | H3N2-HA AA Mismatch | H1N1-NA AA Mismatch | H3N2-NA AA Mismatch |
|---|---|---|---|---|
| WHO Current System | ~10+ | ~10+ | ~2 | ~5+ |
| beth-1 | ~6–8 | ~6–8 | ~1–2 | ~3–4 |
| LBI | High | High | — | — |
| AntigenLM | ~3–4 | ~3–4 | <1 | ~1–2 |
- AntigenLM reduces mismatches by >70% relative to WHO recommendations and by ~50% relative to beth-1 on H1N1-HA and H3N2-NA.
- Next-month prediction: average of 3–4 amino acid mismatches on HA (<1% of 566 AAs) and 1–2 on NA (<0.5% of 469 AAs).
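The amino acid mismatch metric reported here is a position-wise count over aligned sequences. A minimal version, with toy 10-residue strings standing in for the 566-AA HA and 469-AA NA proteins, might look like:

```python
def aa_mismatches(pred: str, ref: str) -> int:
    """Count position-wise amino acid differences between two aligned sequences."""
    if len(pred) != len(ref):
        raise ValueError("sequences must be aligned to equal length")
    return sum(p != r for p, r in zip(pred, ref))

# Toy example: two aligned 10-residue fragments differing at two positions.
m = aa_mismatches("MKTIIALSYI", "MKTVIALSYL")
```

On this scale, the reported 3-4 HA mismatches correspond to under 1% of the 566 HA residues.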
### Ablation Study: Pretraining Strategy Comparison
| Pretraining Configuration | Next-Month Token Perplexity | Sequence Generation Validity | Subtype Classification F1 |
|---|---|---|---|
| Full-genome (full model) | 1.26 | High | 99.81% |
| Incomplete-genome (random cropping) | 3.55 | Low (frequently invalid sequences) | Lower; frequent subtype confusion |
| Segment-wise (single segment) | 4.42 | Moderate | Lower |
| Antigen-only (nuc) | 4.56 | Moderate | 100% (subtype determined by antigen) |
| Antigen-only (protein) | — | Moderate | 100% |
### Key Findings
- Whole-genome context is critical: Removing non-HA/NA internal segments raises perplexity from 1.26 to 4.56, demonstrating that segments such as PB1/PB2/PA provide meaningful predictive signal.
- Functional unit integrity matters more than data volume: Incomplete-genome uses the same amount of data but disrupts segment boundaries, yielding the worst performance.
- Cross-subtype generalization: H7N9 constitutes only 4.68% of pretraining data and 0.3% (48 sequences) of fine-tuning data, yet AntigenLM still predicts accurately.
- Geographic generalization: Trained exclusively on European and Asian data, AntigenLM still significantly outperforms beth-1 on completely unseen U.S. data.
## Highlights & Insights
- The functional unit preservation principle is broadly applicable: This strategy is not limited to influenza; any genome with well-defined functional unit structures (e.g., segmented RNA viruses) can benefit from analogous structure-aware pretraining.
- Compact, domain-specialized models can outperform large generalist ones: A 6-layer, 384-dimensional model specifically designed for influenza whole genomes surpasses general-purpose genomic models with far more parameters.
- Elegance of the two-stage encoding scheme: Withholding explicit markers during pretraining lets the model freely learn structural patterns, while introducing them during fine-tuning enables precise generation control—successfully balancing generality with controllability.
## Limitations & Future Work
- Predictions are inherently probabilistic and should serve as a complement to expert decision-making rather than a replacement.
- Validation is limited to influenza A; generalizability to other pathogens (e.g., SARS-CoV-2, HIV) remains unexplored.
- The small model scale (6 layers) may need to be expanded for more complex genomes.
- Training data depend on GISAID, introducing geographic sampling bias.
## Related Work & Insights
- vs. beth-1: beth-1 models site-level independent mutations, whereas AntigenLM models cross-segment co-evolution; AntigenLM outperforms beth-1 on all tasks.
- vs. HyenaDNA/DNABERT: General-purpose DNA models trained on multi-species corpora lose species-specific structure and frequently generate invalid sequences (large length deviations, missing markers).
- vs. ProtGPT2: Protein-level models cannot capture nucleotide-level evolutionary signals such as synonymous mutations.
## Rating
- Novelty: ⭐⭐⭐⭐ Functional-unit–aware pretraining is a novel design principle, though the overall architecture follows standard GPT-2.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five pretraining ablations, comparisons with three categories of methods, and cross-subtype/cross-geographic generalization evaluations—highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear logical structure with well-motivated reasoning.
- Value: ⭐⭐⭐⭐ Directly applicable to vaccine design; the functional unit preservation principle has broad transferable significance.