
AntigenLM: Structure-Aware DNA Language Modeling for Influenza

Conference: ICLR 2026
arXiv: 2602.09067
Code: https://github.com/peilab-cnic/AntigenLM
Area: Biological Sequence Generation / DNA Language Models
Keywords: DNA language model, influenza virus prediction, functional unit encoding, whole-genome modeling, vaccine design

TL;DR

AntigenLM is a GPT-2-style DNA language model that preserves the integrity of genomic functional units. Pretrained on complete influenza virus whole genomes and subsequently fine-tuned, it autoregressively predicts antigenic sequences of future circulating strains, achieving significantly lower amino acid mismatch rates than the evolutionary model beth-1 and general-purpose genomic models.

Background & Motivation

Background: Influenza viruses evolve rapidly to escape host immunity, necessitating frequent vaccine strain updates. Current WHO vaccine recommendations rely on phylodynamic indicators (e.g., LBI) and site-level evolutionary prediction models (e.g., beth-1).

Limitations of Prior Work:

  • Site-level models (beth-1) treat mutations as independent events and cannot capture co-evolution across genomic segments.
  • General-purpose genomic foundation models (DNABERT, HyenaDNA) are trained on multi-species heterogeneous corpora, losing species-specific structural information.
  • Protein-level models (ESM, ProtGPT2) entirely ignore nucleotide-level evolutionary mechanisms such as synonymous mutations, non-coding regulatory elements, and codon adaptation.

Key Challenge: Viral evolution is driven by coordinated multi-segment genome-wide interactions (RNA–RNA interactions, segment reassortment constraints, polymerase–antigen co-adaptation); fragmented modeling discards critical signals.

Goal: To construct a DNA language model that preserves functional unit integrity and captures whole-genome dependencies at the nucleotide level for accurate influenza antigen sequence prediction.

Key Insight: The influenza genome is compact (~13k nucleotides), making it suitable for single-Transformer whole-genome modeling. By maintaining a fixed ordering and complete boundaries of the 8 gene segments, the model can learn cross-segment co-evolutionary patterns.

Core Idea: Maintaining the integrity and correct arrangement of genomic functional units during pretraining enables the DNA language model to capture high-order evolutionary constraints across segments.

Method

Overall Architecture

AntigenLM adopts a GPT-2-style decoder-only Transformer architecture. The input is the full-genome nucleotide sequence of influenza A virus (up to 13k tokens), and the output is the autoregressively generated next nucleotide. The pipeline consists of two stages: (1) unsupervised pretraining on 54,512 complete influenza genomes; and (2) fine-tuning separately for two downstream tasks—antigen sequence prediction and subtype classification.
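
The paper describes the architecture only at a high level. Below is a minimal configuration sketch of such a nucleotide-level GPT-2-style model, assuming the Hugging Face transformers library and single-nucleotide tokens; the special-token inventory and all names are illustrative assumptions, not the released implementation.

```python
# Minimal configuration sketch (assumption: Hugging Face `transformers`,
# single-nucleotide tokens plus sentinel tokens; not the authors' code).
from transformers import GPT2Config, GPT2LMHeadModel

SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<sep>", "<HA>", "<NA>"]  # sentinel tokens named in the paper
SUBTYPE_TOKENS = ["<H1N1>", "<H3N2>", "<H7N9>"]                        # assumed spelling of <subtype>
NUCLEOTIDES = ["A", "C", "G", "T"]
vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + SUBTYPE_TOKENS + NUCLEOTIDES)}

config = GPT2Config(
    vocab_size=len(vocab),
    n_positions=14_336,   # headroom for a ~13k-nt whole genome plus special tokens
    n_layer=6,            # compact scale reported under "Loss & Training"
    n_embd=384,
    n_head=6,
    n_inner=1536,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)
model = GPT2LMHeadModel(config)

def encode(seq: str) -> list[int]:
    """Map a nucleotide string (e.g. 'ATGG...') to token ids, one token per base."""
    return [vocab[base] for base in seq.upper()]
```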

Key Designs

  1. Functional-Unit–Aware Pretraining:

    • Function: The 8 gene segments (PB2, PB1, PA, HA, NP, NA, MP, NS) are concatenated in a fixed descending-length order into a single whole-genome sequence.
    • Mechanism: Each training sample retains the complete genome; positional encodings span the full 13k positions without truncation or segmentation. A standard causal language modeling loss \(\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T-1} \log p(x_{t+1} \mid x_{\leq t})\) is used for training (a minimal loss sketch appears after this list).
    • Design Motivation: Preserving segment ordering and boundary integrity allows the Transformer's attention mechanism to model cross-segment co-evolutionary dependencies (e.g., compensatory mutations between HA and NA), which is unachievable through fragmented training.
    • Novelty: General DNA models trained on multi-species heterogeneous corpora lose species-specific structure; this work targets a single species (influenza) while preserving biological structure.
  2. Two-Stage Functional Unit Encoding Strategy:

    • Function: Implicit positional alignment during pretraining; explicit sentinel tokens during fine-tuning.
    • Mechanism: During pretraining, fixed segment ordering combined with positional encoding implicitly encodes segment boundaries. During fine-tuning, special tokens such as <HA>, <NA>, and <sep> are introduced to explicitly delimit functional regions, directing attention and constraining decoding to avoid cross-segment continuation.
    • Design Motivation: Omitting explicit markers during pretraining allows the model to freely learn structural patterns, while adding markers during fine-tuning enables precise generation control—balancing generality with controllability.
  3. Temporal Prediction Fine-Tuning Scheme:

    • Function: HA/NA sequences from three consecutive months are used to predict the antigen sequence of the following month.
    • Mechanism: The input format is \(\text{block}^{(1)}\text{block}^{(2)}\text{block}^{(3)}\text{block}^{(\star)}\), where each block = <subtype><HA>HA<NA>NA<sep>. Training optimizes the causal LM loss over the full sequence; at inference, three historical blocks are fed as context and the future block is generated autoregressively (see the block-construction sketch after this list).
    • Design Motivation: Concatenating antigen sequences from multiple time points implicitly encodes evolutionary trajectories, enabling the model to infer the next evolutionary direction from patterns of sequence change.
  4. Dual-Head Multi-Task Design:

    • Function: Shared Transformer backbone + LM head (next-nucleotide prediction) + Classification head (subtype classification).
    • Mechanism: The LM head shares weights with the embedding matrix; the Classification head extracts hidden states at sentinel token positions and projects them to subtype logits, trained with cross-entropy loss (see the dual-head sketch after this list).
    • Design Motivation: The generative task captures global evolutionary dynamics, while the classification task provides supervisory signal to improve representation quality; both tasks share the backbone and mutually reinforce each other.
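
For design 1, here is a minimal sketch of building the fixed-order whole-genome input and computing the causal LM loss, assuming PyTorch plus the model/encode sketch above; helper names are illustrative, not the released training code.

```python
# Sketch of functional-unit-aware pretraining input and the causal LM loss
# (assumes PyTorch plus the `model` / `encode` sketch above; not the released code).
import torch

SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "MP", "NS"]  # fixed descending-length order

def build_genome_ids(segments: dict[str, str]) -> torch.Tensor:
    """Concatenate the 8 segments in fixed order into one whole-genome token sequence."""
    full = "".join(segments[name] for name in SEGMENT_ORDER)
    return torch.tensor(encode(full), dtype=torch.long)

def clm_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-nucleotide loss L_CLM = -sum_t log p(x_{t+1} | x_{<=t}); labels are shifted internally."""
    input_ids = input_ids.unsqueeze(0)   # (1, T): one complete genome per device step
    return model(input_ids=input_ids, labels=input_ids).loss
```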
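
For designs 2 and 3, block construction might look as follows; the sentinel-token spelling and field layout follow our reading of block = <subtype><HA>HA<NA>NA<sep> and may differ from the released code.

```python
# Sketch of the temporal fine-tuning format: three historical monthly blocks as
# context, the fourth (future) block as the generation target. Token spellings
# are assumptions.
def make_block(subtype: str, ha_seq: str, na_seq: str) -> str:
    return f"<{subtype}><HA>{ha_seq}<NA>{na_seq}<sep>"

def make_training_example(months: list[dict]) -> str:
    """Four consecutive monthly records ({'subtype', 'ha', 'na'}); the causal LM
    loss is optimized over the whole concatenation."""
    assert len(months) == 4
    return "".join(make_block(m["subtype"], m["ha"], m["na"]) for m in months)

def make_inference_prompt(months: list[dict]) -> str:
    """Three historical blocks only; the model autoregressively generates the fourth."""
    assert len(months) == 3
    return "".join(make_block(m["subtype"], m["ha"], m["na"]) for m in months)
```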
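
For design 4, a dual-head sketch assuming the backbone is a Hugging Face GPT2Model; module names, the sentinel-position convention, and num_subtypes are illustrative assumptions.

```python
# Sketch of the dual-head multi-task design: shared backbone, weight-tied LM head,
# and a classification head reading the hidden state at a sentinel-token position.
import torch
import torch.nn as nn
from transformers import GPT2Model

class DualHeadModel(nn.Module):
    def __init__(self, config, num_subtypes: int):
        super().__init__()
        self.backbone = GPT2Model(config)                        # shared Transformer backbone
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.backbone.wte.weight           # tie LM head to the embedding matrix
        self.cls_head = nn.Linear(config.n_embd, num_subtypes)   # subtype classifier

    def forward(self, input_ids, sentinel_pos):
        hidden = self.backbone(input_ids).last_hidden_state                # (B, T, H)
        lm_logits = self.lm_head(hidden)                                   # next-nucleotide logits
        sentinel_h = hidden[torch.arange(hidden.size(0)), sentinel_pos]    # states at sentinel tokens
        cls_logits = self.cls_head(sentinel_h)                             # trained with cross-entropy
        return lm_logits, cls_logits
```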

Loss & Training

  • Pretraining: Standard causal LM loss; AdamW optimizer (learning rate \(1 \times 10^{-4}\), linear warmup for 5% of steps + cosine decay, dropout 0.1, gradient clipping 1.0); a configuration sketch follows this list.
  • Effective batch size = 32 genomes/step (8 GPUs × 1 sample × 4 gradient accumulation steps).
  • Compact model scale: 6-layer Transformer, 384 hidden dimensions, 6 attention heads, FFN inner dimension 1536.
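
A configuration sketch of this recipe, assuming PyTorch and transformers' cosine schedule helper; total_steps is a placeholder, loader is any iterator over genome token tensors, and clm_loss is the helper sketched under Key Designs.

```python
# Sketch of the reported optimization recipe: AdamW at 1e-4, 5% linear warmup then
# cosine decay, gradient clipping at 1.0, gradient accumulation of 4 (with 8-way
# data parallelism this matches the reported 32 genomes per effective step).
import torch
from transformers import get_cosine_schedule_with_warmup

def pretrain(model, loader, total_steps=100_000, accum_steps=4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),   # linear warmup for 5% of steps
        num_training_steps=total_steps,
    )
    optimizer.zero_grad()
    for step, input_ids in enumerate(loader):       # one whole genome per device per step
        loss = clm_loss(model, input_ids) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```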

Key Experimental Results

Main Results — Next-Season Antigen Sequence Prediction (Japan, post-2022)

Method              | H1N1-HA AA Mismatch | H3N2-HA AA Mismatch | H1N1-NA AA Mismatch | H3N2-NA AA Mismatch
WHO Current System  | ~10+                | ~10+                | ~2                  | ~5+
beth-1              | ~6–8                | ~6–8                | ~1–2                | ~3–4
LBI                 | High                | High                | –                   | –
AntigenLM           | ~3–4                | ~3–4                | <1                  | ~1–2
  • AntigenLM reduces mismatches by >70% relative to WHO recommendations and by ~50% relative to beth-1 on H1N1-HA and H3N2-NA.
  • Next-month prediction: average of 3–4 amino acid mismatches on HA (<1% of 566 AAs) and 1–2 on NA (<0.5% of 469 AAs); the mismatch metric is sketched below.
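
A minimal sketch of how we read the amino acid mismatch metric: translate the predicted and observed coding sequences and count differing positions. It assumes in-frame, equal-length sequences and uses Biopython for translation; this is our interpretation, not the authors' evaluation script.

```python
# Amino-acid mismatch count (assumption: equal-length, in-frame coding sequences;
# Biopython handles codon-to-amino-acid translation).
from Bio.Seq import Seq

def aa_mismatches(pred_nt: str, true_nt: str) -> int:
    pred_aa = str(Seq(pred_nt).translate())
    true_aa = str(Seq(true_nt).translate())
    assert len(pred_aa) == len(true_aa), "sequences must translate to the same protein length"
    return sum(p != t for p, t in zip(pred_aa, true_aa))
```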

Ablation Study — Pretraining Strategy Comparison

Pretraining Configuration           | Next-Month Token Perplexity | Sequence Generation Validity       | Subtype Classification F1
Full-genome (full model)            | 1.26                        | High                               | 99.81%
Incomplete-genome (random cropping) | 3.55                        | Low (frequently invalid sequences) | Lower; frequent subtype confusion
Segment-wise (single segment)       | 4.42                        | Moderate                           | Lower
Antigen-only (nucleotide)           | 4.56                        | Moderate                           | 100% (subtype determined by antigen)
Antigen-only (protein)              | –                           | Moderate                           | 100%

Key Findings

  • Whole-genome context is critical: Removing non-HA/NA internal segments raises perplexity from 1.26 to 4.56, demonstrating that segments such as PB1/PB2/PA provide meaningful predictive signal.
  • Functional unit integrity matters more than data volume: Incomplete-genome uses the same amount of data but disrupts segment boundaries, yielding the worst performance.
  • Cross-subtype generalization: H7N9 constitutes only 4.68% of pretraining data and 0.3% (48 sequences) of fine-tuning data, yet AntigenLM still predicts accurately.
  • Geographic generalization: Trained exclusively on European and Asian data, AntigenLM still significantly outperforms beth-1 on completely unseen U.S. data.

Highlights & Insights

  • The functional unit preservation principle is broadly applicable: This strategy is not limited to influenza; any genome with well-defined functional unit structures (e.g., segmented RNA viruses) can benefit from analogous structure-aware pretraining.
  • Compact, domain-specialized models can outperform large generalist ones: A 6-layer, 384-dimensional model specifically designed for influenza whole genomes surpasses general-purpose genomic models with far more parameters.
  • Elegance of the two-stage encoding scheme: Withholding explicit markers during pretraining lets the model freely learn structural patterns, while introducing them during fine-tuning enables precise generation control—successfully balancing generality with controllability.

Limitations & Future Work

  • Predictions are inherently probabilistic and should serve as a complement to expert decision-making rather than a replacement.
  • Validation is limited to influenza A; generalizability to other pathogens (e.g., SARS-CoV-2, HIV) remains unexplored.
  • The small model scale (6 layers) may need to be expanded for more complex genomes.
  • Training data depend on GISAID, introducing geographic sampling bias.

Comparison with Prior Methods

  • vs. beth-1: beth-1 models site-level independent mutations, whereas AntigenLM models cross-segment co-evolution; AntigenLM outperforms beth-1 on all tasks.
  • vs. HyenaDNA/DNABERT: General-purpose DNA models trained on multi-species corpora lose species-specific structure and frequently generate invalid sequences (large length deviations, missing markers).
  • vs. ProtGPT2: Protein-level models cannot capture nucleotide-level evolutionary signals such as synonymous mutations.

Rating

  • Novelty: ⭐⭐⭐⭐ Functional-unit–aware pretraining is a novel design principle, though the overall architecture follows standard GPT-2.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five pretraining ablations, comparisons with three categories of methods, and cross-subtype/cross-geographic generalization evaluations—highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Clear logical structure with well-motivated reasoning.
  • Value: ⭐⭐⭐⭐ Directly applicable to vaccine design; the functional unit preservation principle has broad transferable significance.