# CrossNovo: Bidirectional Representations Augmented Autoregressive Biological Sequence Generation
Conference: NeurIPS 2025 · arXiv: 2510.08169 · Code: to be confirmed · Area: Biological Sequence Generation / Proteomics · Keywords: de novo peptide sequencing, autoregressive, non-autoregressive, CTC, knowledge distillation
## TL;DR
CrossNovo integrates autoregressive (AR) and non-autoregressive (NAR) decoders through a shared spectrum encoder, importance annealing, and gradient-blocked knowledge distillation, letting the NAR decoder's bidirectional, global understanding augment AR sequence generation. On the 9-Species benchmark, it reaches an amino-acid accuracy of 0.811 (+2.6%) and a peptide recall of 0.654 (+5.3%).
## Background & Motivation
Background: De novo peptide sequencing from mass spectrometry data is a core task in proteomics. AR models (e.g., ContraNovo) generate sequences step-by-step but lack a global view; NAR models (e.g., PrimeNovo) generate in parallel but suffer from training instability.
Limitations of Prior Work: AR models are prone to errors when distinguishing amino acids with similar masses (e.g., I/L, K/Q), since each step conditions only on the already-generated prefix and never sees the rest of the sequence; NAR models have bidirectional attention but produce incoherent sequences and train unstably.
Key Challenge: The causal dependency of AR ensures generation coherence but limits access to global information; the strong bidirectional understanding of NAR cannot guarantee sequence consistency.
Goal: To introduce bidirectional representations from NAR to enhance the global understanding of AR, while preserving AR generation quality.
Key Insight: A shared encoder allows both decoders to learn complementary representations → NAR representations serve as additional context distilled into AR → gradient blocking prevents AR loss from corrupting NAR features.
Core Idea: The NAR decoder provides bidirectional global representations, which are distilled into the AR decoder via gradient-blocked cross-decoder attention, combined with a shared mass spectrum encoder for hybrid sequence generation.
## Method
### Overall Architecture
Input spectrum → Shared Spectrum Encoder (Transformer with domain-adapted sinusoidal encoding for m/z-intensity pairs) → encoded features \(E^{(b)}\) → AR Decoder (causal self-attention + cross-attention to \(E^{(b)}\) + prefix/suffix mass constraints) + NAR Decoder (non-causal self-attention + cross-attention to \(E^{(b)}\), CTC loss) → Two-stage training: multi-task learning (importance annealing) → knowledge distillation (gradient blocking).
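This dual-decoder layout can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the authors' code: the module names, layer sizes, and the linear projection of (m/z, intensity) pairs are all placeholders (the paper uses a domain-adapted sinusoidal encoding, and the AR decoder additionally receives mass constraints, both omitted here for brevity).

```python
import torch
import torch.nn as nn

class CrossNovoSketch(nn.Module):
    """Hypothetical sketch: one shared spectrum encoder feeding an AR decoder
    (causal mask) and a NAR decoder (no mask, i.e. bidirectional)."""
    def __init__(self, d_model=64, vocab=28, n_layers=2, n_heads=4):
        super().__init__()
        # Placeholder for the paper's domain-adapted sinusoidal peak encoding.
        self.peak_proj = nn.Linear(2, d_model)  # (m/z, intensity) -> d_model
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        ar_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.ar_decoder = nn.TransformerDecoder(ar_layer, n_layers)
        nar_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.nar_decoder = nn.TransformerDecoder(nar_layer, n_layers)
        self.embed = nn.Embedding(vocab, d_model)
        self.ar_head = nn.Linear(d_model, vocab)
        self.nar_head = nn.Linear(d_model, vocab + 1)  # +1: CTC blank token

    def forward(self, peaks, ar_tokens, nar_queries):
        mem = self.encoder(self.peak_proj(peaks))  # shared features E^(b)
        L = ar_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(L)
        # AR branch: causal self-attention + cross-attention to E^(b).
        ar_h = self.ar_decoder(self.embed(ar_tokens), mem, tgt_mask=causal)
        # NAR branch: no tgt_mask, so self-attention is bidirectional.
        nar_h = self.nar_decoder(nar_queries, mem)
        return self.ar_head(ar_h), self.nar_head(nar_h)
```

The only structural difference between the two branches is the causal mask; everything else (including the cross-attention memory) is shared or symmetric.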
### Key Designs
- Shared Spectrum Encoder + Mass Constraints:
- Function: Encodes mass spectrum data into shared features and injects biochemical priors into AR.
- Mechanism: each (m/z, intensity) pair is embedded with a domain-adapted sinusoidal encoding (scaled to the value range of mass spectra), then passed through \(b\) Transformer self-attention layers to produce \(E^{(b)}\). At each AR decoding step, prefix/suffix mass constraints (the cumulative mass of already-generated residues and the remaining mass budget) are provided as additional inputs.
- Design Motivation: The shared encoder ensures both decoders observe a consistent spectral representation; mass constraints inject biochemical knowledge into AR.
- Importance Annealing Multi-task Training (Stage 1):
- Function: Jointly trains AR (CE loss) and NAR (CTC loss) with dynamically adjusted weights.
- Mechanism: \(\mathcal{L} = \lambda_{AT} \mathcal{L}_{AT} + (1-\lambda_{AT}) \mathcal{L}_{NAT}\), where \(\lambda_{AT}(i) = i/T_{total}\) — early training emphasizes NAR (learning global patterns), while later training emphasizes AR (fine-grained sequence generation guidance).
- Design Motivation: NAR learns global bidirectional understanding faster (parallel generation); fine-grained AR guidance requires NAR to have already learned good representations.
- Gradient-Blocked Knowledge Distillation (Stage 2):
- Function: Distills bidirectional NAR representations into AR.
- Mechanism: The encoder and NAR are frozen; only AR is fine-tuned. The AR cross-attention attends jointly to \([\mathbb{GB}(V^{(L')}) \oplus E^{(b)}]\) — the last-layer NAR features (with gradient blocking \(\mathbb{GB}\) = detach) concatenated with encoder features.
- Design Motivation: Gradient blocking is critical — ablation studies show that removing it causes gradient explosion. NAR's bidirectional features provide AR with a global perspective without being corrupted by AR loss in return.
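The Stage-2 memory construction above reduces to a detach-and-concatenate step. A minimal sketch, assuming the features are already computed as tensors (`nar_feats` stands for the last-layer NAR features \(V^{(L')}\), `enc_feats` for \(E^{(b)}\); the function name is hypothetical):

```python
import torch

def build_ar_memory(nar_feats, enc_feats):
    """Cross-attention memory for the AR decoder in Stage 2:
    [GB(V^(L')) ⊕ E^(b)] — gradient-blocked NAR features concatenated
    with encoder features along the sequence axis."""
    gb = nar_feats.detach()  # GB(.): stops AR gradients from reaching NAR
    return torch.cat([gb, enc_feats], dim=1)
```

Note that `detach()` blocks gradients only on the NAR path; in the paper the encoder is additionally kept fixed in Stage 2 by freezing its parameters, so only the AR decoder is updated.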
### Loss & Training
- AR: Cross-entropy \(\mathcal{L}_{AT} = -\sum_t \log p(a_t | a_{<t}, \mathcal{S})\)
- NAR: CTC loss (handling variable-length alignment)
- Training setup: 8×A100 GPUs; AdamW (lr = 5e-4) with a cosine-annealing schedule; inference uses beam search with beam size 5 (found optimal).
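The Stage-1 objective combining these two losses is simple enough to sketch as a plain function (hypothetical helper; `loss_at` and `loss_nat` stand for the AR cross-entropy and NAR CTC losses already computed for the current batch):

```python
def annealed_loss(loss_at, loss_nat, step, total_steps):
    """Stage-1 multi-task loss with importance annealing:
    L = lambda_AT * L_AT + (1 - lambda_AT) * L_NAT,
    where lambda_AT(i) = i / T_total. Early steps weight the NAR CTC
    loss (global patterns); late steps weight the AR CE loss."""
    lam = step / total_steps
    return lam * loss_at + (1 - lam) * loss_nat
```

At step 0 the objective is pure NAR; by the final step it is pure AR, with a linear ramp in between.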
## Key Experimental Results
### Main Results
| Benchmark | Metric | CrossNovo | ContraNovo | PrimeNovo (NAR) |
|---|---|---|---|---|
| 9-Species-v1 | AA Accuracy | 0.811 | 0.785 | 0.788 |
| 9-Species-v1 | Peptide Recall | 0.654 | 0.621 | 0.638 |
| 9-Species-v2 | AA Accuracy | 0.906 | 0.882 | 0.891 |
| 9-Species-v2 | Peptide Recall | 0.786 | 0.752 | 0.777 |
### Ablation Study
| Configuration | AA Accuracy | Peptide Recall | Note |
|---|---|---|---|
| w/o gradient blocking | — | — | Gradient explosion |
| w/o shared encoder | 0.698 | 0.546 | Severe degradation |
| Full model | 0.811 | 0.654 | Best |
### Key Findings
- Consistently outperforms all baselines across all 9 species, with +9% improvement on species where AR excels (Human, Mouse).
- Achieves consistently superior disambiguation accuracy for amino acids with similar masses (I/L, K/Q).
- Zero-shot antibody sequencing yields +5% peptide recall, demonstrating generalization capability.
- Gradient blocking is essential for training stability — its removal causes gradient explosion.
## Highlights & Insights
- The AR+NAR fusion knowledge distillation paradigm is broadly applicable: The idea of NAR providing global representations distilled into AR via gradient blocking is transferable to other sequence generation tasks (e.g., speech recognition, machine translation).
- Importance annealing is a simple yet effective curriculum strategy: Allowing global understanding to mature early and refining sequence generation later avoids complex training schedules.
## Limitations & Future Work
- Two-stage training increases pipeline complexity.
- Beam search introduces inference latency (beam=5 is optimal but slower).
- Validation is limited to peptide sequencing; other biological sequences (e.g., RNA) remain untested.
- The NAR decoder is used only during training (for distillation) and is discarded at inference, leaving its parameters idle.
## Related Work & Insights
- vs ContraNovo: A purely AR model lacking a global view; CrossNovo compensates via NAR distillation.
- vs PrimeNovo: A purely NAR model with poor generation coherence; CrossNovo preserves AR sequence consistency.
- vs CTC-based methods: CTC is used for NAR training rather than inference, avoiding the decoding degradation issues associated with CTC.
## Rating
- Novelty: ⭐⭐⭐⭐ The AR+NAR fusion with gradient-blocked distillation is a novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks + ablation studies + downstream antibody task.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear and well-structured.
- Value: ⭐⭐⭐⭐ Provides a generalizable paradigm for AR/NAR fusion in sequence generation.