🧬 Computational Biology

🧪 ICML2025 · 48 paper notes


🔥 Top topics: Biomolecules ×14 · Diffusion Models ×6 · Reinforcement Learning ×2

ADIOS: Antibody Development via Opponent Shaping: Introducing opponent shaping from multi-agent reinforcement learning into antibody design, this paper proposes the ADIOS meta-learning framework: the outer loop optimises the antibody, and the inner loop simulates adaptive viral escape, so that the designed antibodies ("shapers") not only counter current viral variants but also actively steer viral evolution toward weaker, more easily targeted variants.
Aligning Protein Conformation Ensemble Generation with Physical Feedback: This work proposes Energy-based Alignment (EBA), which integrates energy feedback from physical force fields into the fine-tuning process of diffusion generative models. By aligning the generative distribution with the physical energy landscape via a Boltzmann factor-weighted classification objective, the method achieves state-of-the-art (SOTA) performance in protein conformation ensemble generation on the ATLAS MD benchmark.
CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models: Proposes CFP-Gen, a large-scale diffusion language model that achieves combinatorial protein generation under multimodal functional constraints (functional annotations + sequence motifs + 3D structures) via Annotation-Guided Feature Modulation (AGFM) and Residue-level Control Function Encoding (RCFE), improving the F1 score by 30% compared to ESM3.
Compositional Flows for 3D Molecule and Synthesis Pathway Co-design: Proposes CGFlow (Compositional Generative Flows)—extending flow matching to the step-by-step generation of compositional objects. It interleaves discrete compositional structure sampling (synthesis pathways) and continuous state transport (3D conformations). Applied as 3DSynthFlow to synthesizable drug design, it achieves SOTA results in both binding affinity and synthesizability across 15 targets of LIT-PCBA for the first time.
ComRecGC: Global Graph Counterfactual Explainer through Common Recourse: This paper formally defines the Common Recourse global counterfactual explanation problem for Graph Neural Networks (GNNs) for the first time, proves its NP-hardness, and proposes the ComRecGC algorithm. By searching for counterfactual graphs using Multi-head Vertex-Reinforced Random Walk (VRRW) and extracting common recourses via DBSCAN clustering, ComRecGC consistently outperforms existing baselines by 10%–30% in coverage across four real-world datasets: NCI1, Mutagenicity, AIDS, and Proteins.
DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models: Proposes the DeepSeq pipeline, which uses large language models (in particular an agentic GPT-4o with real-time web search) to automatically annotate cell types in single-cell RNA sequencing data. It reaches a maximum accuracy of 82.5%, addressing the throughput bottleneck of large-scale omics data annotation.
Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling: The authors propose the CpSDE framework, which alternates sampling between a harmonic SDE generative model (AtomSDE) and a residue-type predictor (ResRouter) to achieve the first all-type cyclic peptide design based on 3D receptor structures, surpassing existing linear peptide design methods in both stability and affinity.
eccDNAMamba: A Pre-Trained Model for Ultra-Long eccDNA Sequence Analysis: eccDNAMamba is the first bidirectional state space encoder tailored for circular DNA. Combining BPE tokenization, circular data augmentation, and SpanBERT-style pre-training, it pairs linear time complexity with the modeling of ultra-long eccDNA sequences of up to 200 kbp. It significantly outperforms DNABERT-2, HyenaDNA, and Caduceus in cancer classification and genuine eccDNA identification tasks.
Efficient Molecular Conformer Generation with SO(3)-Averaged Flow Matching and Reflow: Proposes an SO(3)-Averaged Flow training objective to eliminate the need for rotation alignment between the prior and data distributions by analytically averaging over all rotations in the rotation group SO(3). Combined with Reflow and distillation, it achieves high-quality few-step or even single-step molecular conformer generation.
Elucidating the Design Space of Multimodal Protein Language Models: This work systematically explores the design space of token-based multimodal protein language models (PLMs). Through innovations across four dimensions (bit-wise discrete modeling, geometry-aware architectures, representation alignment, and multimer data expansion), it reduces the folding RMSD of a 650M-parameter model from 5.52 to 2.36, surpassing a 3B baseline model and approaching the level of specialized folding models.
Empower Structure-Based Molecule Optimization with Gradient Guided Bayesian Flow Networks: This paper proposes the MolJO framework, which leverages the continuously differentiable parameter space \(\boldsymbol{\theta}\) of Bayesian Flow Networks (BFNs) to achieve joint gradient-guided optimization of both molecular coordinates (continuous) and atom types (discrete). It incorporates a sliding-window backward correction strategy to balance exploration and exploitation, outperforming existing methods with a 51.3% Success Rate on CrossDocked2020.
ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models: This paper provides the first systematic analysis of the impact of [MASK] tokens on performance in MLMs, revealing that corrupted semantics exerts a more detrimental effect than unreal tokens. Based on this finding, ExLM is proposed: by expanding each [MASK] into multiple hidden states and modeling their dependencies with transition matrices, it effectively mitigates the semantic multimodality problem, yielding significant improvements on both text and molecular modeling tasks.
Flexibility-conditioned Protein Structure Design with Flow Matching: BackFlip (predicting residue-level flexibility from backbones) and FliPS (an SE(3)-equivariant flow matching model conditioned on flexibility profiles) are proposed, achieving the first flexibility-controlled generation of protein backbones with desired dynamic properties, validated by 300 ns molecular dynamics simulations.
GenMol: A Drug Discovery Generalist with Discrete Diffusion: This work proposes GenMol, a generalist molecular generation framework based on Masked Discrete Diffusion, which generates SAFE sequences via non-autoregressive bidirectional parallel decoding. By introducing fragment remasking and Molecular Context Guidance (MCG), it covers four major drug discovery scenarios: de novo generation, fragment-constrained generation, target-directed hit generation, and lead optimization, using a single model and comprehensively outperforming previous state-of-the-art methods.
Geometric Generative Modeling with Noise-Conditioned Graph Networks: Proposes Noise-Conditioned Graph Networks (NCGNs) to dynamically adjust message passing range and graph resolution in GNN architectures based on noise levels: long-range connections with low resolution at high noise levels, and local connections with high resolution at low noise levels, outperforming static architecture baselines in 3D point cloud, spatial transcriptomics, and image generation.
Geometric Representation Condition Improves Equivariant Molecule Generation: GeoRCG proposes a two-stage molecule generation framework: first generating an informative low-dimensional geometric representation, and then generating the complete molecule conditioned on this representation. This achieves an average improvement of 50% on conditional molecule generation tasks, while reducing the number of diffusion steps from 1000 to 100.
Global Context-aware Representation Learning for Spatially Resolved Transcriptomics: The proposed Spotscape framework captures global similarity relations among spots via the Similarity Telescope module (rather than solely relying on spatial local neighbors). By introducing prototypical contrastive learning and a similarity scale matching strategy to handle multi-slice batch effects, it comprehensively outperforms existing methods in tasks such as spatial domain identification, trajectory inference, and multi-slice integration and alignment.
Graph Generative Pre-trained Transformer (G2PT): Proposes G2PT, which encodes a graph as a sequence of node and edge tokens, generates graphs using a GPT-style autoregressive Transformer via next-token prediction, and achieves goal-directed molecular generation through Rejection Fine-Tuning (RFT) and PPO reinforcement learning, achieving SOTA performance on both general graph and molecular datasets.
Improved Off-policy Reinforcement Learning in Biological Sequence Design: This paper proposes \(\delta\)-Conservative Search (\(\delta\)-CS), a novel off-policy search method for biological sequence design. By applying token-level noise injection (random masking with probability \(\delta\)) to high-scoring offline sequences and then denoising them with a GFlowNet policy, while adaptively adjusting the degree of conservatism based on proxy model uncertainty, \(\delta\)-CS significantly outperforms existing methods on DNA, RNA, protein, and peptide design tasks.
Improving Flow Matching by Aligning Flow Divergence: Analyses the error between the learned probability path and the true probability path in Flow Matching (FM) from a PDE perspective. It proves that this error is controlled by the divergence gap of the vector fields, and proposes a joint flow and divergence matching (FDM) training objective, which significantly improves FM performance on density estimation, DNA sequence generation, and video prediction tasks.
Kinetic Langevin Diffusion for Crystalline Materials Generation: KLDM proposes using Kinetic Langevin Diffusion to address the issue of fractional atomic coordinates residing on a hypertorus in crystal material generation. By introducing an auxiliary velocity variable, the diffusion process is shifted to a flat Euclidean space while preserving periodic translational symmetry, achieving competitive performance on crystal structure prediction and de novo generation tasks.
Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing: LIPNovo proposes a new paradigm of performing latent imputation before peptide prediction to address missing fragmentation information in mass spectrometry. By utilizing learnable peak queries and bipartite matching to impute latent representations of theoretical peaks, it significantly outperforms state-of-the-art models like Casanovo across three benchmarks (improving amino acid precision by 5.6%–20%).
LDMol: A Text-to-Molecule Diffusion Model with Structurally Informative Latent Space Surpasses AR Models: This work proposes LDMol, which constructs a structure-aware latent space through SMILES-enumeration contrastive learning. A conditional latent diffusion model is then trained on this space to achieve text-to-molecule generation, outperforming autoregressive (AR) models on text-based data generation tasks for the first time.
Leveraging Partial SMILES Validation Scheme for Enhanced Drug Design in Reinforcement Learning Frameworks: Proposes the PSV-PPO algorithm, which introduces a Partial SMILES Validation (PSV) truth table at each step of autoregressive SMILES molecule generation to penalize invalid tokens in real time, enhancing chemical space exploration while maintaining molecular validity.
MF-LAL: Drug Compound Generation Using Multi-Fidelity Latent Space Active Learning: The MF-LAL framework is proposed to unify multi-fidelity surrogate models and molecular generative models into a hierarchical latent space. Through active learning, it efficiently integrates molecular docking (low-fidelity) and binding free energy calculation (high-fidelity) oracles, generating candidate drug molecules with significantly improved binding free energies (an average improvement of approximately 50% in ABFE score).
Multivariate Conformal Selection: Extends Conformal Selection from univariate responses to multivariate settings, introduces the concept of Regional Monotonicity, designs distance-based (mCS-dist) and learning-based (mCS-learn) nonconformity scores, and guarantees finite-sample FDR control while improving selection power.
Neural Graph Matching Improves Retrieval Augmented Generation in Molecular Machine Learning: This work proposes MARASON, introducing Neural Graph Matching into the Retrieval-Augmented Generation (RAG) framework for molecular machine learning. Through a differentiable fragment-level alignment mechanism, it integrates reference spectra retrieved from databases into the mass spectrum prediction for the target molecule, improving top-1 retrieval accuracy from 19% to 28% on the NIST dataset.
PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion: PepTune combines a Masked Discrete Language Model (MDLM) with a Monte Carlo Tree Search (MCTS) multi-objective-guided strategy within the discrete peptide SMILES space to optimize multiple therapeutic properties (binding affinity, solubility, membrane permeability, etc.) simultaneously, enabling the de novo design of peptide drugs containing non-canonical amino acids and cyclization modifications.
Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule: Proposes the VLB-Optimal Scheduling (VOS) strategy. By theoretically analyzing the path-dependent VLB characteristics of joint noise scheduling of multimodal data (continuous 3D coordinates + discrete 2D topology), it utilizes dynamic programming to search for the optimal noise scheduling path, achieving state-of-the-art performance in SBDD with a 95.9% PoseBusters pass rate on CrossDock.
PolyConf: Unlocking Polymer Conformation Generation through Hierarchical Generative Models: Introduces PolyConf, the first hierarchical generative framework specifically designed for polymer conformation generation. Phase 1 utilizes a masked autoregressive (MAR) model and a diffusion process to generate the local conformations of repeating units in a random order, while Phase 2 utilizes an SO(3) diffusion model to generate orientation transformations to assemble the local conformations into a complete polymer molecular structure. Additionally, the work constructs the first polymer conformation benchmark, PolyBench (containing 50k+ polymers and approximately 2,000 atoms per conformation), on which PolyConf consistently outperforms prior methods by over 25% across all structural and energy metrics.
Protein Structure Tokenization: Benchmarking and New Recipe: Proposes StructTokenBench—the first comprehensive evaluation framework for protein structure tokenizers (PSTs). It assesses existing methods across four dimensions: downstream effectiveness, sensitivity, distinctiveness, and codebook utilization efficiency. It further introduces AminoAseed, a strategy that significantly improves the quality of VQ-VAE-type tokenizers through codebook reparameterization and Pareto-optimal configurations (achieving a 6.31% improvement and a 124% increase in utilization compared to ESM3).
Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction: Protriever is proposed as the first end-to-end differentiable protein homology sequence retrieval framework, which jointly trains the retriever and the reader. It achieves sequence-model SOTA performance on protein fitness prediction tasks while being two orders of magnitude faster than traditional MSA retrieval.
Reliable Algorithm Selection for Machine Learning-Guided Design: Proposes a design algorithm selection method that formulates the success determination of candidate design algorithm configurations as a multiple hypothesis testing problem. By incorporating Prediction-Powered Inference (PPI) techniques to correct prediction errors, the method guarantees with high probability the selection of algorithm configurations that satisfy user-defined success criteria on unlabelled design distributions.
Roll the Dice & Look Before You Leap: Going Beyond the Creative Limits of Next-Token Prediction: This paper designs a suite of minimal algorithmic tasks to quantify the "creative limits" of language models, demonstrating that next-token prediction (NTP) learning is short-sighted in open-ended tasks requiring "leaps of thought," whereas multi-token methods (such as teacherless training and discrete diffusion models) and input-layer noise injection (seed-conditioning) can significantly improve generative diversity and originality.
Scalable Equilibrium Sampling with Sequential Boltzmann Generators: SBG achieves efficient equilibrium sampling of hexapeptide (66 atoms) systems in Cartesian coordinates for the first time by utilizing a Transformer-based normalizing flow (TarFlow) and sequential Monte Carlo with annealed Langevin dynamics.
Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching: Proposes STFlow, a flow-matching generative model that explicitly captures inter-cellular interactions by modeling the joint distribution of gene expressions across the entire slide. It leverages local spatial attention for efficient whole-slide encoding, achieving an 18% improvement over the best baseline on HEST-1k and STImage-1K4M.
Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment: Proposes RADM (Rotationally Aligned Diffusion Model), which constructs an aligned latent space by learning sample-dependent SO(3) rotational transformations, enabling non-equivariant diffusion models to generate 3D molecules effectively, achieving generation quality comparable to SOTA equivariant models while offering better scalability and sampling efficiency.
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data: This paper introduces scSSL-Bench, a systematic benchmark that evaluates 19 self-supervised learning (SSL) methods across 9 single-cell datasets on three downstream tasks: batch correction, cell type annotation, and missing modality prediction. The results reveal task-specific trade-offs between general-purpose SSL methods and domain-specific approaches.
SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model: Proposes SPACE (Species-Profile Adaptive Collaborative Experts), demonstrating that supervised genomic profile prediction learns more effective DNA representations than unsupervised sequence pre-training, and achieves state-of-the-art (SOTA) performance on 11 out of 18 downstream Nucleotide Transformer (NT) tasks using a species-aware MoE encoder and a dual-gated decoder.
Steering Protein Language Models: This work transfers the Activation Steering technique from the LLM domain to protein language models (PLMs) for the first time. By editing internal model activations during inference, the proposed method guides protein sequence generation and optimization toward target properties (e.g., thermostability, solubility) without any additional training. Additionally, it introduces an Activation Steering-based Protein Optimization (ASPO) algorithm for mutation site identification using steering vector dissimilarity, significantly outperforming traditional methods in lysozyme and GFP optimization tasks.
SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics: SToFM is proposed as the first multi-scale spatial transcriptomics foundation model. By integrating gene-scale domain adaptation, micro-scale subpatch partitioning, and macro-scale virtual cell injection, combined with an SE(2) Transformer and pre-trained on a large-scale corpus of 88M cells, SToFM significantly outperforms existing methods in tasks such as tissue domain semantic segmentation and cell-type annotation.
SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics: The authors propose SUICA, which compresses super-high dimensional sparse spatial transcriptomic data into a compact embedding space via a graph-enhanced autoencoder. It then utilizes Implicit Neural Representations (INR) to model the continuous mapping from coordinates to embeddings, achieving spatial imputation, gene imputation, and denoising across various ST platforms.
Supercharging Graph Transformers with Advective Diffusion: Proposes Advective Diffusion Transformer (AdvDIFFormer), a physics-inspired Graph Transformer model. By combining non-local diffusion (global attention) and advection (local message passing) mechanisms, it achieves provable generalization error bounds under topological distribution shifts, outperforming GNNs that solely rely on local diffusion.
Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra: This paper introduces DeepWAS (Deep genome Wide Association Studies), which leverages modern fast linear algebra techniques (banded matrix approximation + iterative solvers) to resolve the computational bottleneck of large-scale LD matrix inversion in GWAS. This achieves, for the first time, the training of functional annotation-driven genetic variant effect prediction models by maximizing the full marginal likelihood with large-scale neural networks. The authors discover that larger models yield improved performance only under full-likelihood training (contrary to traditional summary statistics fitting).
UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design: UniMoMo is proposed as the first unified 3D binder design framework for three types of molecules: small molecules, peptides, and antibodies. It uses "Graph of Blocks" as a unified representation, an iterative all-atom autoencoder for compressing the latent space, and an E(3)-equivariant diffusion model for generation, outperforming domain-specific models across three benchmarks.
UniSim: A Unified Simulator for Time-Coarsened Dynamics of Biomolecules: UniSim is the first deep generative model for cross-domain (small molecules/peptides/proteins) all-atom time-coarsened molecular dynamics. Through a three-stage pipeline—multi-head pre-training for unified atomic representation, a stochastic interpolation vector field model for long-timestep state propagation, and a force-guidance kernel for parameter-efficient adaptation to different chemical environments—it achieves transferable dynamics simulation across molecular domains.
Weisfeiler and Leman Go Gambling: Why Expressive Lottery Tickets Win: This work theoretically connects GNN expressiveness (Weisfeiler-Leman test) with the lottery ticket hypothesis (LTH) for the first time. It proposes and proves the Strong Expressive Lottery Ticket Hypothesis (SELTH), demonstrating that there exist trainable subnetworks in sparsely initialized GNNs that preserve 1-WL expressiveness, and that sparsely initialized networks with higher expressiveness are more likely to become "winning tickets." Additionally, it highlights the severe consequences of unrecoverable expressiveness loss caused by improper pruning in scenarios such as drug discovery.
WGFormer: An SE(3)-Transformer Driven by Wasserstein Gradient Flows for Molecular Generation: This paper proposes WGFormer, an SE(3)-Transformer driven by Wasserstein gradient flows. Operating within an autoencoder framework, WGFormer optimizes molecular conformations by minimizing energy functions on latent mixture models of atoms, consistently outperforming the state-of-the-art (SOTA) on ground-state conformation prediction tasks.
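
As a small illustration of one mechanism summarised in the notes above, the token-level noise injection described for \(\delta\)-Conservative Search (random masking with probability \(\delta\), followed by denoising with a policy) can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the mask symbol, the function names, and the uniform stand-in policy are hypothetical and do not come from the paper, which uses a trained GFlowNet policy and additionally adapts \(\delta\) to proxy model uncertainty (not shown here).

```python
import random

MASK = "_"  # hypothetical mask symbol, not taken from the paper


def inject_noise(sequence, delta, rng):
    """Mask each token independently with probability delta."""
    return [MASK if rng.random() < delta else token for token in sequence]


def denoise(masked, policy, rng):
    """Fill masked positions left to right; unmasked tokens are kept as-is."""
    restored = []
    for token in masked:
        restored.append(policy(restored, rng) if token == MASK else token)
    return restored


def uniform_policy(prefix, rng, alphabet="ACGU"):
    """Stand-in for a trained policy: samples a token uniformly at random."""
    return rng.choice(alphabet)


if __name__ == "__main__":
    rng = random.Random(0)
    parent = list("GGCAUCGAUC")               # a made-up high-scoring sequence
    masked = inject_noise(parent, 0.3, rng)   # delta = 0.3 chosen arbitrarily
    child = denoise(masked, uniform_policy, rng)
    print("".join(masked), "".join(child))
```

A smaller `delta` keeps the proposal closer to the original sequence (more conservative), while `delta = 1` masks every token and reduces to sampling entirely from the policy.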