
CrossNovo: Bidirectional Representations Augmented Autoregressive Biological Sequence Generation

Conference: NeurIPS 2025 · arXiv: 2510.08169 · Code: To be confirmed · Area: Biological Sequence Generation / Proteomics · Keywords: de novo peptide sequencing, autoregressive, non-autoregressive, CTC, knowledge distillation

TL;DR

CrossNovo integrates autoregressive (AR) and non-autoregressive (NAR) decoders through a shared spectrum encoder, importance annealing, and gradient-blocked knowledge distillation, enabling the bidirectional global understanding of NAR to augment AR sequence generation. On the 9-Species benchmark, it achieves amino acid accuracy of 0.811 (+2.6%) and peptide recall of 0.654 (+5.3%).

Background & Motivation

Background: De novo peptide sequencing from mass spectrometry data is a core task in proteomics. AR models (e.g., ContraNovo) generate sequences step-by-step but lack a global view; NAR models (e.g., PrimeNovo) generate in parallel but suffer from training instability.

Limitations of Prior Work: AR models are prone to errors when distinguishing amino acids with similar masses (e.g., I/L, K/Q), as each step only attends to local history; NAR models have bidirectional attention but produce incoherent sequences and exhibit unstable training.

Key Challenge: The causal dependency of AR ensures generation coherence but limits access to global information; the strong bidirectional understanding of NAR cannot guarantee sequence consistency.

Goal: To introduce bidirectional representations from NAR to enhance the global understanding of AR, while preserving AR generation quality.

Key Insight: A shared encoder allows both decoders to learn complementary representations → NAR representations serve as additional context distilled into AR → gradient blocking prevents AR loss from corrupting NAR features.

Core Idea: The NAR decoder provides bidirectional global representations, which are distilled into the AR decoder via gradient-blocked cross-decoder attention, combined with a shared mass spectrum encoder for hybrid sequence generation.

Method

Overall Architecture

Input spectrum → Shared Spectrum Encoder (Transformer with domain-adapted sinusoidal encoding for m/z-intensity pairs) → encoded features \(E^{(b)}\) → AR Decoder (causal self-attention + cross-attention to \(E^{(b)}\) + prefix/suffix mass constraints) and NAR Decoder (non-causal self-attention + cross-attention to \(E^{(b)}\), CTC loss) → two-stage training: multi-task learning (importance annealing), then knowledge distillation (gradient blocking).
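A minimal PyTorch-style sketch of this layout, assuming standard `nn.Transformer*` building blocks; the class and parameter names (`CrossNovoSketch`, `nar_queries`, the dimensions) are illustrative, not taken from the paper or its code.

```python
import torch
import torch.nn as nn

class CrossNovoSketch(nn.Module):
    """Skeleton of the shared-encoder, two-decoder layout; not the paper's released code."""

    def __init__(self, d_model=512, n_layers=6, n_heads=8, vocab_size=28, max_len=64):
        super().__init__()
        # Shared spectrum encoder over embedded (m/z, intensity) peak pairs -> E^(b).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # AR decoder: causal self-attention + cross-attention to encoder features.
        self.ar_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # NAR decoder: same block type, run without a causal mask; a CTC head sits on top.
        self.nar_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.ar_head = nn.Linear(d_model, vocab_size)
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)              # +1 for the CTC blank
        self.nar_queries = nn.Parameter(torch.randn(max_len, d_model))  # fixed-length NAR inputs

    def forward(self, peak_emb, ar_inputs):
        # peak_emb:  (B, n_peaks, d_model) sinusoidally encoded m/z-intensity pairs
        # ar_inputs: (B, T, d_model) embedded shifted targets (plus mass-constraint features)
        enc = self.encoder(peak_emb)                                     # shared features E^(b)
        causal = nn.Transformer.generate_square_subsequent_mask(
            ar_inputs.size(1)).to(ar_inputs.device)
        ar_hidden = self.ar_decoder(ar_inputs, enc, tgt_mask=causal)
        nar_hidden = self.nar_decoder(self.nar_queries.expand(enc.size(0), -1, -1), enc)
        return self.ar_head(ar_hidden), self.ctc_head(nar_hidden), nar_hidden
```

Both decoders cross-attend to the same encoder output `enc`, mirroring the shared-encoder design; the stage-2 change, where the AR decoder additionally attends to detached NAR features, is sketched under Key Designs below.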

Key Designs

  1. Shared Spectrum Encoder + Mass Constraints:

    • Function: Encodes mass spectrum data into shared features and injects biochemical priors into AR.
    • Mechanism: m/z-intensity pairs are encoded with domain-adapted sinusoidal encoding (adapted to their value range), followed by \(b\) layers of Transformer self-attention, yielding \(E^{(b)}\). At each AR step, prefix/suffix mass constraints (cumulative mass of generated tokens and remaining mass) are additionally provided.
    • Design Motivation: The shared encoder ensures both decoders observe a consistent spectral representation; mass constraints inject biochemical knowledge into AR.
  2. Importance Annealing Multi-task Training (Stage 1):

    • Function: Jointly trains AR (CE loss) and NAR (CTC loss) with dynamically adjusted weights.
    • Mechanism: \(\mathcal{L} = \lambda_{AT} \mathcal{L}_{AT} + (1-\lambda_{AT}) \mathcal{L}_{NAT}\), where \(\lambda_{AT}(i) = i/T_{total}\) — early training emphasizes NAR (learning global patterns), while later training emphasizes AR (fine-grained sequence generation guidance); see the annealing sketch after this list.
    • Design Motivation: NAR learns global bidirectional understanding faster (parallel generation); fine-grained AR guidance requires NAR to have already learned good representations.
  3. Gradient-Blocked Knowledge Distillation (Stage 2):

    • Function: Distills bidirectional NAR representations into AR.
    • Mechanism: The encoder and NAR decoder are frozen; only the AR decoder is fine-tuned. The AR cross-attention attends jointly to \([\mathbb{GB}(V^{(L')}) \oplus E^{(b)}]\) — the last-layer NAR features (with gradient blocking \(\mathbb{GB}\) = detach) concatenated with the encoder features (see the gradient-blocking sketch after this list).
    • Design Motivation: Gradient blocking is critical — ablation studies show that removing it causes gradient explosion. NAR's bidirectional features provide AR with a global perspective without being corrupted by AR loss in return.
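As referenced in design 2, a tiny sketch of the linear importance-annealing schedule; the function names (`lambda_at`, `stage1_loss`) are illustrative, and a linear ramp over the total number of training steps is assumed.

```python
def lambda_at(step: int, total_steps: int) -> float:
    """Linear importance annealing: the weight on the AR (cross-entropy) loss grows from 0 to 1."""
    return min(step / total_steps, 1.0)

def stage1_loss(loss_ar, loss_nar_ctc, step, total_steps):
    """Stage-1 multi-task loss: early steps emphasize the NAR/CTC term, later steps the AR term."""
    w = lambda_at(step, total_steps)
    return w * loss_ar + (1.0 - w) * loss_nar_ctc
```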
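And for design 3, a hedged sketch of the gradient-blocked distillation context: the detached last-layer NAR features are concatenated with the encoder output to form the memory attended to by the AR decoder's cross-attention. The names `nar_hidden` and `enc_features` and their shapes are assumptions for illustration.

```python
import torch

def build_ar_memory(nar_hidden: torch.Tensor, enc_features: torch.Tensor) -> torch.Tensor:
    """Concatenate gradient-blocked NAR features with encoder features along the sequence axis.

    nar_hidden:   (B, L', d_model) last-layer NAR decoder states V^(L')
    enc_features: (B, n_peaks, d_model) shared-encoder output E^(b)
    """
    blocked = nar_hidden.detach()                      # GB(.) = stop-gradient: AR loss cannot update NAR
    return torch.cat([blocked, enc_features], dim=1)   # memory for the AR decoder's cross-attention
```

During stage 2, this concatenated memory replaces \(E^{(b)}\) alone in the AR decoder's cross-attention, while the encoder and NAR parameters stay frozen.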

Loss & Training

  • AR: Cross-entropy \(\mathcal{L}_{AT} = -\sum_t \log p(a_t | a_{<t}, \mathcal{S})\)
  • NAR: CTC loss (handling variable-length alignment)
  • 8×A100, AdamW lr=5e-4, cosine annealing, optimal beam size=5
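A hedged sketch of how the two losses above could be computed with stock PyTorch ops; the padding and blank indices (both set to 0 here) and the tensor shapes are assumptions, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def ar_ce_loss(ar_logits, targets, pad_idx=0):
    """AR cross-entropy over next-token predictions; ar_logits: (B, T, V), targets: (B, T)."""
    return F.cross_entropy(ar_logits.transpose(1, 2), targets, ignore_index=pad_idx)

def nar_ctc_loss(ctc_logits, targets, target_lengths, blank_idx=0):
    """NAR CTC loss handling variable-length alignment; ctc_logits: (B, L, V+1), targets: (B, S)."""
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)      # (L, B, V+1), as F.ctc_loss expects
    input_lengths = torch.full((ctc_logits.size(0),), ctc_logits.size(1), dtype=torch.long)
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=blank_idx)
```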

Key Experimental Results

Main Results

| Benchmark | Metric | CrossNovo | ContraNovo | PrimeNovo (NAR) |
|---|---|---|---|---|
| 9-Species-v1 | AA Accuracy | 0.811 | 0.785 | 0.788 |
| 9-Species-v1 | Peptide Recall | 0.654 | 0.621 | 0.638 |
| 9-Species-v2 | AA Accuracy | 0.906 | 0.882 | 0.891 |
| 9-Species-v2 | Peptide Recall | 0.786 | 0.752 | 0.777 |

Ablation Study

| Configuration | AA Accuracy | Peptide Recall | Note |
|---|---|---|---|
| w/o gradient blocking | — | — | Gradient explosion |
| w/o shared encoder | 0.698 | 0.546 | Severe degradation |
| Full model | 0.811 | 0.654 | Best |

Key Findings

  • Consistently outperforms all baselines across all 9 species, with +9% improvement on species where AR excels (Human, Mouse).
  • Achieves consistently superior disambiguation accuracy for amino acids with similar masses (I/L, K/Q).
  • Zero-shot antibody sequencing yields +5% peptide recall, demonstrating generalization capability.
  • Gradient blocking is essential for training stability — its removal causes gradient explosion.

Highlights & Insights

  • The AR+NAR fusion knowledge distillation paradigm is broadly applicable: The idea of NAR providing global representations distilled into AR via gradient blocking is transferable to other sequence generation tasks (e.g., speech recognition, machine translation).
  • Importance annealing is a simple yet effective curriculum strategy: Allowing global understanding to mature early and refining sequence generation later avoids complex training schedules.

Limitations & Future Work

  • Two-stage training increases pipeline complexity.
  • Beam search introduces inference latency (beam=5 is optimal but slower).
  • Validation is limited to peptide sequencing; other biological sequences (e.g., RNA) remain untested.
  • The NAR decoder is unused at inference time (it serves only as a distillation teacher during training), so its parameters add cost without contributing at test time.

Comparison with Baselines

  • vs ContraNovo: A purely AR model lacking a global view; CrossNovo compensates via NAR distillation.
  • vs PrimeNovo: A purely NAR model with poor generation coherence; CrossNovo preserves AR sequence consistency.
  • vs CTC-based methods: CTC is used for NAR training rather than inference, avoiding the decoding degradation issues associated with CTC.

Rating

  • Novelty: ⭐⭐⭐⭐ The AR+NAR fusion with gradient-blocked distillation is a novel design.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks + ablation studies + downstream antibody task.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear and well-structured.
  • Value: ⭐⭐⭐⭐ Provides a generalizable paradigm for AR/NAR fusion in sequence generation.