🧬 Computational Biology
🧠 NeurIPS2025 · 44 paper notes
📌 Same area in other venues: 🧪 ICML2026 (8) · 💬 ACL2026 (2) · 📷 CVPR2026 (5) · 🔬 ICLR2026 (24) · 🤖 AAAI2026 (15) · 📹 ICCV2025 (3)
🔥 Top topics: Biomolecules ×20 · Diffusion Models ×8
- AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation
-
To address the unavailability of holo protein structures in real-world drug discovery, this paper proposes AANet—a framework that aligns representations via tri-modal contrastive learning (ligand–holo pocket–detected cavity) and aggregates multiple candidate binding sites through cross-attention. AANet substantially outperforms SOTA methods in blind screening on apo/predicted protein structures (EF1% on DUD-E: 11.75 → 37.19).
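The EF1% figure quoted above is an enrichment factor: the hit rate among the top-ranked 1% of screened compounds divided by the hit rate over the whole library. The snippet below is a minimal sketch of that metric under a common definition; the function name, the rounding of the top-fraction cutoff, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at a given top fraction (higher score = better rank).

    labels are 1 for actives and 0 for decoys. The cutoff rounding used
    here is an assumption; published protocols differ on this detail.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active compound is required")
    n_top = max(1, int(round(n_total * fraction)))
    # Rank compounds by descending score and count actives in the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


# Toy library: 1000 compounds, 10 actives, 5 of them ranked in the top 10.
toy_scores = [1000 - i for i in range(1000)]
toy_labels = [1 if i in {0, 1, 2, 3, 4, 500, 501, 502, 503, 504} else 0
              for i in range(1000)]
toy_ef = enrichment_factor(toy_scores, toy_labels, fraction=0.01)
```

A random ranking gives an expected enrichment factor of 1, so values such as 37.19 indicate that actives are heavily concentrated at the top of the ranked list.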
- Amortized Active Generation of Pareto Sets
-
This paper proposes the A-GPS framework, which learns a conditional generative model over the Pareto set to perform online discrete black-box multi-objective optimization. It employs a non-dominance class probability estimator (CPE) as an implicit substitute for explicit hypervolume computation in PHVI, and achieves amortized posterior preference conditioning via preference direction vectors (without retraining). The approach demonstrates superior sample efficiency on synthetic benchmarks and protein design tasks.
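To make the notion of non-dominance concrete, the sketch below labels each point in a set of objective vectors as non-dominated (1) or dominated (0). Labels of this kind are what a non-dominance class probability estimator would be trained to predict. The function names and the maximization convention are assumptions for illustration; this is not the A-GPS implementation.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better
    in at least one (all objectives are maximized, by assumption)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominance_labels(points):
    """Return 1 for points on the Pareto front of `points`, else 0."""
    return [int(not any(dominates(q, p) for q in points)) for p in points]


# Two objectives; the first three points trade off against each other.
observed = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
labels = non_dominance_labels(observed)
```

This brute-force check is quadratic in the number of points, which is fine for illustration but would need a faster sort-based routine for large candidate sets.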
- Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra
-
This paper proposes ChefNMR, the first end-to-end framework based on 3D atomic diffusion models that directly predicts the molecular structure of unknown small molecules (especially complex natural products) from 1D NMR spectra and molecular formulae alone, achieving state-of-the-art performance on both synthetic and experimental datasets.
- GraphFLA: Augmenting Biological Fitness Prediction Benchmarks with Landscape Features
-
GraphFLA is an efficient fitness landscape analysis framework that computes 20 biologically meaningful landscape features (covering ruggedness, epistasis, navigability, and neutrality) across 5,300+ real-world landscapes from ProteinGym, RNAGym, and CIS-BP. The analysis reveals that model performance depends strongly on landscape topology: for example, VenusREM outperforms ProSST on highly navigable landscapes but underperforms it on highly epistatic ones. GraphFLA processes one million mutants in just 20 seconds (vs. 5 hours for MAGELLAN).
- Autoencoding Random Forests
-
RFAE is the first principled encode-decode framework for random forests. It exploits the positive-definiteness and universality of the RF kernel to derive low-dimensional encodings via diffusion-map spectral decomposition, and decodes back to the original feature space through k-NN regression in leaf-node space. Across 20 tabular datasets, RFAE achieves an average reconstruction rank of 1.80, substantially outperforming TVAE (3.38) and AE (3.27), and is successfully applied to MNIST reconstruction and scRNA-seq batch-effect removal.
- BarcodeMamba+: Advancing State-Space Models for Fungal Biodiversity Research
-
BarcodeMamba+ is an SSM-based foundation model for fungal ITS DNA barcode classification. By adopting a pretrain-then-finetune paradigm to leverage large-scale unlabeled sequences, and incorporating three enhancements—hierarchical label smoothing, inverse square-root weighted loss, and multi-head outputs—it substantially outperforms BLAST, CNN, and Transformer baselines across all taxonomic ranks on three test sets, achieving a top species-level accuracy of 88.9%.
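One plausible reading of an inverse square-root weighted loss is that each class's loss term is scaled by a weight proportional to 1/sqrt(n_c), where n_c is the number of training examples in class c. The sketch below computes such weights; the function name and the normalization to a mean weight of 1 are assumptions, and the model's actual weighting scheme may differ.

```python
import math
from collections import Counter


def inverse_sqrt_class_weights(labels):
    """Per-class weights proportional to 1 / sqrt(class count).

    Weights are rescaled so their mean over classes is 1 (an assumption
    made here to keep the overall loss scale comparable).
    """
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean_weight = sum(raw.values()) / len(raw)
    return {cls: w / mean_weight for cls, w in raw.items()}


# A common species with 100 sequences and a rare one with 4.
toy_labels = ["common"] * 100 + ["rare"] * 4
weights = inverse_sqrt_class_weights(toy_labels)
```

Compared with plain inverse-frequency weighting, which would up-weight the rare class here by 25×, the square root softens the ratio to 5×, which limits how strongly very rare taxa dominate the gradient.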
- CrossNovo: Bidirectional Representations Augmented Autoregressive Biological Sequence Generation
-
CrossNovo integrates autoregressive (AR) and non-autoregressive (NAR) decoders through a shared spectrum encoder, importance annealing, and gradient-blocked knowledge distillation, enabling the bidirectional global understanding of NAR to augment AR sequence generation. On the 9-Species benchmark, it achieves amino acid accuracy of 0.811 (+2.6%) and peptide recall of 0.654 (+5.3%).
- Compressing Biology: Evaluating the Stable Diffusion VAE for Phenotypic Drug Discovery
-
This work presents the first systematic evaluation of the Stable Diffusion VAE (SD-VAE) for reconstructing Cell Painting fluorescence microscopy images. Results show that SD-VAE preserves phenotypic information well at both the pixel level and the biological signal level (with negligible drop in Fraction Retrieved), and that the general-purpose feature extractor InceptionV3 matches or outperforms the domain-specific model OpenPhenom on retrieval tasks.
- ConfRover: Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression
-
ConfRover is an autoregressive framework that factorizes protein MD trajectories into frame-wise conditional generation, \(p(\mathbf{x}^{1:L}) = \prod_{l=1}^{L} p(\mathbf{x}^{l} \mid \mathbf{x}^{<l})\). Its modular architecture, consisting of an encoder, a causal Transformer, and an SE(3) diffusion decoder, unifies three tasks within a single model for the first time: trajectory simulation, time-independent conformational sampling, and conformational interpolation. It achieves comprehensive improvements over MDGen on the ATLAS benchmark.
- Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
-
This paper identifies an inconsistency between sampling and simulation in diffusion models (particularly at small diffusion timesteps), proposes a Fokker-Planck-based regularization term to enforce consistency, and combines it with a time-partitioned Mixture-of-Experts (MoE) strategy to achieve consistent and efficient sampling and molecular dynamics simulation across multiple biomolecular systems.
- De novo generation of functional terpene synthases using TpsGPT
-
TpsGPT fine-tunes a distilled ProtGPT2 Tiny (38.9M parameters) on 79K terpene synthase (TPS) sequences to generate 28K candidate sequences, which are subsequently filtered through a multi-stage pipeline (perplexity / pLDDT / EnzymeExplorer / CLEAN / InterPro / Foldseek) to yield 7 de novo TPS sequences that are evolutionarily distant (<60% sequence identity) yet structurally conserved. Wet-lab experiments confirm that 2 of the 7 candidates possess TPS enzymatic activity, demonstrating de novo design of functional enzymes at a GPU cost below $200.
- DesignX: Human-Competitive Algorithm Designer for Black-Box Optimization
-
This paper proposes DesignX, the first automated algorithm design framework that jointly learns two sub-tasks—optimizer workflow generation and dynamic hyperparameter control—through dual Transformer agents pre-trained at scale on 10k synthetic problems. DesignX surpasses human-designed optimizers on both synthetic benchmarks and real-world tasks including protein docking, AutoML, and UAV path planning.
- EDBench: Large-Scale Electron Density Data for Molecular Modeling
-
This work constructs EDBench, the largest electron density (ED) dataset to date (3.3 million molecules, computed via B3LYP/6-31G** DFT), and designs a three-category benchmark evaluation framework covering prediction, retrieval, and generation tasks. It provides the first systematic assessment of deep learning models' ability to understand and exploit electron density.
- FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models
-
This paper presents FGBench, a dataset comprising 625K molecular property reasoning questions focused on functional group-level reasoning evaluation. Through three dimensions (single functional group effect, multi-functional group interaction, and molecular comparison), it systematically reveals the severe deficiencies of current LLMs in fine-grained chemical reasoning.
- Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning
-
This paper proposes Flow Density Control (FDC), which generalizes the fine-tuning of pretrained flow/diffusion models from KL-regularized expected reward maximization to a unified framework supporting arbitrary distributional utility functions with arbitrary divergence regularization. The approach decomposes nonlinear objectives into a sequence of linear fine-tuning subproblems and provides convergence guarantees.
- Fractional Diffusion Bridge Models
-
This paper proposes Fractional Diffusion Bridge Models (FDBM), which incorporate fractional Brownian motion (fBM) into the generative diffusion bridge framework. The Hurst exponent \(H\) controls the roughness and long-range dependence of trajectories, yielding improvements over Brownian motion baselines on protein conformation prediction and image translation tasks.
- Generative Distribution Embeddings: Lifting Autoencoders to the Space of Distributions for Multiscale Representation Learning
-
This paper proposes Generative Distribution Embeddings (GDE), which lifts autoencoders to the space of distributions — the encoder operates on sets of samples while the decoder is replaced by a conditional generative model — thereby learning distribution-level representations. The framework is validated on 6 computational biology tasks.
- Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings
-
This work proposes LD-FPG, a framework that encodes full-atom MD trajectories into a low-dimensional latent space via Chebyshev graph neural networks and applies DDPM in that space to generate novel conformational ensembles. To the authors' knowledge, this is the first approach to generate protein conformations that includes all heavy atoms of the side chains.
- Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry
-
This work constructs a multi-level interpretability toolkit for SynFlowNet (a GFlowNet grounded in synthetic reaction templates), integrating gradient saliency, counterfactual perturbation, sparse autoencoders (SAE), and motif probes to reveal how internal representations encode physicochemical properties and functional group information relevant to medicinal chemistry.
- Is Sequence Information All You Need for Bayesian Optimization of Antibodies?
-
This paper systematically compares the roles of sequence and structural information in antibody Bayesian optimization, finding that sequence-only methods augmented with protein language model (pLM) soft constraints can match the performance of structure-based methods, thereby questioning the necessity of structural information in antibody Bayesian optimization.
- Iterative Foundation Model Fine-Tuning on Multiple Rewards
-
This paper proposes IterativeRS (Iterative Rewarded Soups), which alternates between independent fine-tuning of per-objective expert policies and policy merging. The method unifies reward combination and expert merging approaches, outperforming MORLHF and Rewarded Soups on small molecule design, DNA sequence generation, and text summarization tasks.
- JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles
-
This paper proposes JAMUN, a conformational ensemble generation method built on the Walk-Jump Sampling (WJS) framework. By performing Langevin dynamics on a noise-smoothed manifold and using an SE(3)-equivariant denoiser to jump back to the original distribution, JAMUN achieves peptide conformational sampling an order of magnitude faster than conventional molecular dynamics while retaining transferability to out-of-training systems.
- JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model
-
JanusDNA is proposed as the first bidirectional DNA foundation model, combining a Mamba-Attention-MoE hybrid architecture with the Janus Modeling pretraining paradigm to achieve bidirectional understanding at the training efficiency of autoregressive methods, attaining state-of-the-art performance across multiple genomic benchmarks.
- Learning Conformational Ensembles of Proteins Based on Backbone Geometry
-
This paper proposes BBFlow, a flow matching generative model based on protein backbone geometry for conformational ensemble sampling. BBFlow requires neither evolutionary sequence information nor pretrained folding models, achieves inference speeds more than an order of magnitude faster than AlphaFlow, and generalizes to multi-chain proteins.
- Learning Relative Gene Expression Trends from Pathology Images in Spatial Transcriptomics
-
This paper proposes STRank, a loss function that reformulates gene expression estimation from pathology images as a ranking score estimation task. By modeling the stochastic noise inherent in expression counts via binomial/multinomial distributions, STRank enables models to learn robust relative expression relationships from spatial transcriptomics data subject to batch effects and random fluctuations.
- Manipulating 3D Molecules in a Fixed-Dimensional E(3)-Equivariant Latent Space
-
This paper proposes MolFLAE, a 3D molecular variational autoencoder that learns a fixed-dimensional, E(3)-equivariant latent space. By introducing learnable virtual nodes and a Bayesian Flow Network (BFN) decoder, MolFLAE enables zero-shot molecular editing — including atom-count editing, structural reconstruction, and property interpolation — and demonstrates practical utility in drug optimization targeting the human glucocorticoid receptor (hGR).
- Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Models
-
This paper proposes Mol-LLaMA, a large molecular language model for general molecular understanding. By designing three types of instruction data and a 2D-3D molecular representation fusion module, Mol-LLaMA surpasses GPT-4o in molecular feature understanding while exhibiting interpretability and reasoning capabilities.
- Multimodal 3D Genome Pre-training
-
This paper proposes MIX-HIC — the first multimodal foundation model for 3D genomics — which integrates Hi-C contact maps and epigenomic signals via cross-modal interaction blocks and cross-modal mapping blocks. Pre-trained on over 1.27 million paired samples, MIX-HIC achieves state-of-the-art performance across three downstream tasks: Hi-C prediction, chromatin loop detection, and CAGE-seq expression prediction.
- Multiscale Guidance of Protein Structure Prediction with Heterogeneous Cryo-EM Data
-
CryoBoltz leverages cryo-EM density maps to guide the sampling trajectory of a pretrained diffusion-based structure prediction model (Boltz-1) via a multiscale guidance mechanism (global → local), generating multi-conformational atomic models consistent with experimental data without any retraining.
- One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra
-
By employing MIST as a spectra-to-fingerprint encoder and MolForge as a fingerprint-to-structure decoder, combined with a prior-adjusted thresholding strategy, this work achieves a more than tenfold improvement on the MassSpecGym benchmark for de novo molecular structure generation from mass spectra (top-1 accuracy from 2.3% to 31%).
- Pharmacophore-Guided Generative Design of Novel Drug-Like Molecules
-
This paper proposes a pharmacophore-guided molecular generation framework that simultaneously maximizes pharmacophore similarity and minimizes structural similarity within the reward function of a reinforcement learning model (FREED++), generating candidate drug molecules that retain bioactivity features while exhibiting high structural novelty.
- Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number
-
This paper proposes PAFlow, a 3D molecule generation model built on the flow matching framework, which guides the vector field via a protein–ligand interaction predictor and determines atom counts through a learnable atom number predictor. PAFlow achieves a new state-of-the-art Avg. Vina Score of −8.31 on CrossDocked2020, substantially outperforming existing methods.
- PROSPERO: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhood
-
This paper proposes ProSpero, an active learning framework that discovers high-fitness and novel protein sequences even under surrogate model mismatch, via inference-time sampling of a frozen pretrained generative model (EvoDiff) guided by a surrogate, a targeted masking strategy, and biologically-constrained SMC sampling.
- Protein Design with Dynamic Protein Vocabulary
-
ProDVa introduces natural protein fragments as a "dynamic vocabulary" for generative protein design, employing a three-component architecture consisting of a text encoder, a protein language model, and a fragment encoder. Using less than 0.04% of the training data required by prior work, ProDVa designs functionally aligned and structurally foldable protein sequences, surpassing the SOTA model Pinal by 7.38% on the pLDDT>70 ratio.
- Quantifying the Role of OpenFold Components in Protein Structure Prediction
-
This paper proposes a systematic methodology for evaluating the contribution of individual Evoformer components in OpenFold/AlphaFold2 to protein structure prediction accuracy. The study finds that MSA column attention and MLP Transition layers are the most critical components, and that the importance of multiple components is significantly correlated with protein sequence length.
- Random Search Neural Networks for Efficient and Expressive Graph Learning
-
This paper proposes Random Search Neural Networks (RSNN), which replace random walks with randomized depth-first search (DFS) for graph structure sampling. On sparse graphs, RSNN achieves complete edge coverage with only \(O(\log|V|)\) searches. Paired with a universal sequence model, RSNN attains universal approximation capability, and consistently outperforms RWNN on molecular and protein benchmarks using up to 16× fewer samples.
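The sampling primitive described above, a depth-first search whose neighbor order is randomized, can be sketched as follows. The adjacency-dict representation, the function name, and the choice to return both the visit order and the tree edges are illustrative assumptions; how RSNN turns searches into model inputs is not reproduced here.

```python
import random


def randomized_dfs(adjacency, start, rng):
    """Depth-first search that picks a random unvisited neighbor at each step.

    adjacency maps each node to a list of neighbors (undirected graph).
    Returns the node visit order and the list of traversed tree edges.
    """
    visited = {start}
    order = [start]
    tree_edges = []
    stack = [start]
    while stack:
        node = stack[-1]
        unvisited = [n for n in adjacency[node] if n not in visited]
        if not unvisited:
            stack.pop()  # dead end: backtrack to the previous node
            continue
        nxt = rng.choice(unvisited)
        visited.add(nxt)
        order.append(nxt)
        tree_edges.append((node, nxt))
        stack.append(nxt)
    return order, tree_edges


# A 6-node ring with one chord (0-3).
ring = {
    0: [1, 5, 3], 1: [0, 2], 2: [1, 3],
    3: [2, 4, 0], 4: [3, 5], 5: [4, 0],
}
order, tree_edges = randomized_dfs(ring, start=0, rng=random.Random(0))
```

Unlike a random walk, a single search of this kind visits every node in a connected graph exactly once; different random seeds traverse different edges, so repeating the search accumulates coverage of the edge set.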
- SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding
-
SpecMER introduces speculative decoding into protein sequence generation, employing a K-mer-guided batch selection strategy to choose the candidate most consistent with evolutionary conservation from multiple draft model outputs for target model verification. It achieves 24–32% speedup while preserving distributional consistency, and the generated sequences demonstrate significantly improved NLL and pLDDT structural confidence scores compared to unguided baselines.
- Steering Generative Models with Experimental Data for Protein Fitness Optimization
-
This work systematically evaluates strategies for steering protein generative models (discrete diffusion models and language models) toward fitness optimization, finding that plug-and-play guidance methods using small labeled datasets (~200 samples)—particularly DAPS—outperform RL-based fine-tuning, and proposes a Thompson sampling strategy incorporating predictive uncertainty for adaptive optimization.
- Towards Multiscale Graph-based Protein Learning with Geometric Secondary Structural Motifs
-
This paper proposes SSHG (Secondary Structure-based Hierarchical Graph), a framework that constructs two-level hierarchical graph representations from protein secondary structure motifs — an intra-motif residue-level graph and an inter-motif global graph — and employs a two-stage GNN to learn local and global features respectively. Theoretical guarantees of maximal expressiveness are provided, with empirical improvements in both accuracy and computational efficiency on enzyme classification and ligand affinity prediction tasks.
- Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling
-
This paper proposes UAE-3D, a multimodal variational autoencoder that compresses atomic types, chemical bonds, and 3D coordinates of molecules into a unified, near-lossless latent space. By eliminating the complexity of handling multimodality and equivariance, a general-purpose Diffusion Transformer achieves state-of-the-art 3D molecular generation.
- Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design
-
This paper proposes an uncertainty-aware multi-objective reinforcement learning framework that guides a 3D molecular diffusion model (EDM) to simultaneously optimize drug-likeness (QED), synthetic accessibility (SAS), and binding affinity. The framework dynamically shapes the reward function using predictive uncertainty from surrogate models, consistently outperforms baselines across three benchmark datasets, and validates candidate molecules through molecular dynamics simulations and ADMET analysis.
- Unified All-Atom Molecule Generation with Neural Fields
-
This paper proposes FuncBind, a framework that represents molecules as continuous atomic density functions via neural fields, constructing a unified conditional generative model capable of target-conditioned generation across three drug modalities: small molecules, macrocyclic peptides, and antibody CDR loops.
- UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
-
This work introduces UniSite-DS, the first UniProt-centric (i.e., organized by unique protein) ligand binding site dataset, and UniSite, the first end-to-end binding site detection framework. UniSite directly predicts multiple, potentially overlapping binding sites using a set prediction loss with bijective matching, and the work further proposes IoU-based AP as a more accurate evaluation metric.
- Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion
-
This paper reveals the fundamental reason for the superiority of masking diffusion models — they implicitly condition on the known jump-time distribution — and proposes the Schedule-Conditioned Diffusion (SCUD) framework, which generalizes this advantage to arbitrary discrete diffusion models. Combined with structured forward processes, SCUD surpasses masking diffusion on both image and protein generation tasks.