
🧬 Computational Biology

🧠 NeurIPS2025 · 76 paper notes

📌 Same area in other venues: 📷 CVPR2026 (21) · 🔬 ICLR2026 (156) · 💬 ACL2026 (5) · 🧪 ICML2026 (52) · 🤖 AAAI2026 (20) · 📹 ICCV2025 (4)

🔥 Top topics: Biomolecules ×23 · Diffusion Models ×14 · LLM ×3 · Reasoning ×3 · Alignment/RLHF ×2

A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification: This paper presents ESCAPE—the first standardized multilabel antimicrobial peptide classification benchmark, integrating 80,000+ peptides from 27 public databases, along with a dual-branch Transformer + bidirectional cross-attention baseline model that achieves a 2.56% relative improvement in mAP over the second-best method.
A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random: Within a Gaussian mixture model clustering framework, this paper jointly addresses variable selection (distinguishing signal, redundant, and noise variables) and MNAR missing data modeling. A two-stage strategy—LASSO-penalized ranking followed by BIC-based role assignment—combined with spectral-distance adaptive penalty weights enables efficient inference in high-dimensional settings. Identifiability and asymptotic selection consistency are established theoretically.
AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation: To address the unavailability of holo protein structures in real-world drug discovery, this paper proposes AANet—a framework that aligns representations via tri-modal contrastive learning (ligand–holo pocket–detected cavity) and aggregates multiple candidate binding sites through cross-attention. AANet substantially outperforms SOTA methods in blind screening on apo/predicted protein structures (EF1% on DUD-E: 11.75 → 37.19).
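For readers unfamiliar with the metric quoted above, EF1% (enrichment factor at 1%) compares the hit rate among the top-scored 1% of a screened library with the hit rate of the whole library. The following is a minimal sketch of the common definition, not AANet's evaluation code; the function name and the rounding rule for the top-1% cutoff are assumptions.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor: hit rate in the top `fraction` of ranked
    compounds divided by the hit rate of the full library.

    scores: higher means predicted more likely to bind.
    labels: 1 for active, 0 for decoy.
    """
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    # Assumed cutoff rule: floor of n * fraction, but at least one compound.
    n_top = max(1, int(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(label for _, label in ranked[:n_top])
    return (top_hits / n_top) / (total_actives / n)


# Toy library: 200 compounds, 10 actives, all actives ranked first.
labels = [1] * 10 + [0] * 190
scores = [float(200 - i) for i in range(200)]
ef1 = enrichment_factor(scores, labels)
```

Under this definition a random ranking has an expected enrichment factor of 1, and with an exact 1% cutoff the value cannot exceed 100.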
Amortized Active Generation of Pareto Sets: This paper proposes the A-GPS framework, which learns a conditional generative model over the Pareto set to perform online discrete black-box multi-objective optimization. It employs a non-dominance class probability estimator (CPE) as an implicit substitute for explicit hypervolume computation in PHVI, and achieves amortized posterior preference conditioning via preference direction vectors (without retraining). The approach demonstrates superior sample efficiency on synthetic benchmarks and protein design tasks.
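To make the notion of non-dominance concrete, the sketch below labels each observed point as Pareto-optimal or not among the points seen so far. Labels of this kind are what a non-dominance classifier could be trained on; the helper names and the convention that all objectives are maximized are assumptions, and this is not the A-GPS implementation.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_labels(points):
    """Return one boolean per point: True if no other point dominates it."""
    return [
        not any(dominates(other, p) for j, other in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


# Five hypothetical designs scored on two objectives.
observed = [(1, 5), (2, 4), (3, 3), (1, 1), (2, 2)]
labels = non_dominated_labels(observed)
```

Here the first three designs trade one objective against the other and are all non-dominated, while `(1, 1)` and `(2, 2)` are each beaten on both objectives by some other design.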
Amortized Sampling with Transferable Normalizing Flows: This work proposes Prose — a 285M-parameter all-atom transferable normalizing flow based on the TarFlow architecture, trained on 21,700 short-peptide MD trajectories (totaling 4.3 ms of simulation time). Prose enables zero-shot uncorrelated proposal sampling for arbitrary short-peptide systems, outperforms MD baselines under equal energy evaluation budgets, and generates samples 4,000× faster than the prior transferable Boltzmann generator (TBG).
Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra: This paper proposes ChefNMR, the first end-to-end framework based on 3D atomic diffusion models that directly predicts the molecular structure of unknown small molecules (especially complex natural products) from 1D NMR spectra and molecular formulae alone, achieving state-of-the-art performance on both synthetic and experimental datasets.
GraphFLA: Augmenting Biological Fitness Prediction Benchmarks with Landscape Features: GraphFLA is an efficient fitness landscape analysis framework that computes 20 biologically meaningful landscape features (ruggedness / epistasis / navigability / neutrality) across 5,300+ real-world landscapes (ProteinGym / RNAGym / CIS-BP), revealing that model performance is highly dependent on landscape topology—e.g., VenusREM outperforms ProSST on highly navigable landscapes but underperforms it on highly epistatic ones—while processing one million mutants in just 20 seconds (vs. 5 hours for MAGELLAN).
Autoencoding Random Forests: RFAE is the first principled encode-decode framework for random forests. It exploits the positive-definiteness and universality of the RF kernel to derive low-dimensional encodings via diffusion-map spectral decomposition, and decodes back to the original feature space through k-NN regression in leaf-node space. Across 20 tabular datasets, RFAE achieves an average reconstruction rank of 1.80, substantially outperforming TVAE (3.38) and AE (3.27), and is successfully applied to MNIST reconstruction and scRNA-seq batch-effect removal.
BarcodeMamba+: Advancing State-Space Models for Fungal Biodiversity Research: BarcodeMamba+ is an SSM-based foundation model for fungal ITS DNA barcode classification. By adopting a pretrain-then-finetune paradigm to leverage large-scale unlabeled sequences, and incorporating three enhancements—hierarchical label smoothing, inverse square-root weighted loss, and multi-head outputs—it substantially outperforms BLAST, CNN, and Transformer baselines across all taxonomic ranks on three test sets, achieving a top species-level accuracy of 88.9%.
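One plausible reading of the "inverse square-root weighted loss" is that each class's loss term is scaled by one over the square root of its training count. The sketch below computes such weights; the function name and the rescaling to a mean weight of 1 are assumptions, not details confirmed by the summary.

```python
import math
from collections import Counter


def inverse_sqrt_class_weights(labels):
    """Weight each class by 1 / sqrt(count), rescaled so the weights
    average to 1 over the classes (the rescaling is an assumption)."""
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {cls: w / mean_raw for cls, w in raw.items()}


# Toy label set: one common species and one rare species.
train_labels = ["common"] * 100 + ["rare"] * 4
weights = inverse_sqrt_class_weights(train_labels)
```

With these counts, a 25-fold imbalance in frequency becomes a 5-fold ratio in weight, which is gentler than plain inverse-frequency weighting.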
Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX: This paper constructs ChemX — a suite of 10 multimodal chemical data extraction benchmark datasets manually annotated and validated by domain experts, spanning nanomaterials and small molecules. It systematically evaluates state-of-the-art agentic systems including ChatGPT Agent, SLM-Matrix, FutureHouse, and nanoMINER, as well as frontier LLMs such as GPT-5 and GPT-5 Thinking. The proposed single-agent method achieves F1=0.61 on the nanozyme dataset through structured document preprocessing (marker-pdf → Markdown → LLM extraction), surpassing all general-purpose multi-agent systems, while revealing systemic challenges in chemical information extraction such as SMILES parsing failures and terminology ambiguity.
Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations: This paper introduces ChemCoTBench, the first CoT-based benchmark for evaluating chemical reasoning in LLMs. It decomposes complex chemical problems into modular chemical operations (adding/deleting/substituting functional groups), and is accompanied by ChemCoTDataset — a large-scale dataset of 22,000 expert-annotated CoT samples — enabling systematic evaluation of both reasoning and non-reasoning LLMs across molecular understanding, editing, optimization, and reaction prediction.
CrossNovo: Bidirectional Representations Augmented Autoregressive Biological Sequence Generation: CrossNovo integrates autoregressive (AR) and non-autoregressive (NAR) decoders through a shared spectrum encoder, importance annealing, and gradient-blocked knowledge distillation, enabling the bidirectional global understanding of NAR to augment AR sequence generation. On the 9-Species benchmark, it achieves amino acid accuracy of 0.811 (+2.6%) and peptide recall of 0.654 (+5.3%).
Compressing Biology: Evaluating the Stable Diffusion VAE for Phenotypic Drug Discovery: This work presents the first systematic evaluation of the Stable Diffusion VAE (SD-VAE) for reconstructing Cell Painting fluorescence microscopy images. Results show that SD-VAE preserves phenotypic information well at both the pixel level and the biological signal level (with negligible drop in Fraction Retrieved), and that the general-purpose feature extractor InceptionV3 matches or outperforms the domain-specific model OpenPhenom on retrieval tasks.
ConfRover: Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression: ConfRover is an autoregressive framework that factorizes protein MD trajectories into frame-wise conditional generation, $p(\mathbf{x}^{1:L}) = \prod_{l=1}^{L} p(\mathbf{x}^{l} \mid \mathbf{x}^{<l})$. Its modular architecture (an encoder, a causal Transformer, and an SE(3) diffusion decoder) unifies three tasks within a single model for the first time: trajectory simulation, time-independent conformational sampling, and conformational interpolation. It achieves comprehensive improvements over MDGen on the ATLAS benchmark.
Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models: This paper identifies an inconsistency between sampling and simulation in diffusion models (particularly at small diffusion timesteps), proposes a Fokker-Planck-based regularization term to enforce consistency, and combines it with a time-partitioned Mixture-of-Experts (MoE) strategy to achieve consistent and efficient sampling and molecular dynamics simulation across multiple biomolecular systems.
Constrained Discrete Diffusion: This paper proposes CDD (Constrained Discrete Diffusion), which embeds a differentiable constrained optimization projection operator into the denoising process of discrete diffusion models. Without retraining, CDD enforces sequence-level constraints at sampling time, achieving zero constraint violations across three task categories: toxic text generation, molecular design, and instruction following.
Curly Flow Matching for Learning Non-gradient Field Dynamics: The authors propose Curly Flow Matching (Curly-FM), which designs a Schrödinger Bridge problem with a non-zero reference drift. This allows flow matching to learn non-gradient field dynamics, such as periodic and rotational behaviors, breaking the limitation of traditional methods that can only model gradient fields.
De novo generation of functional terpene synthases using TpsGPT: TpsGPT fine-tunes a distilled ProtGPT2 Tiny (38.9M parameters) on 79K terpene synthase (TPS) sequences to generate 28K candidate sequences, which are subsequently filtered through a multi-stage pipeline (perplexity / pLDDT / EnzymeExplorer / CLEAN / InterPro / Foldseek) to yield 7 de novo TPS sequences that are evolutionarily distant (<60% sequence identity) yet structurally conserved. Wet-lab experiments confirm that 2 of the 7 candidates possess TPS enzymatic activity, demonstrating de novo design of functional enzymes at a GPU cost below $200.
DesignX: Human-Competitive Algorithm Designer for Black-Box Optimization: This paper proposes DesignX, the first automated algorithm design framework that jointly learns two sub-tasks—optimizer workflow generation and dynamic hyperparameter control—through dual Transformer agents pre-trained at scale on 10k synthetic problems. DesignX surpasses human-designed optimizers on both synthetic benchmarks and real-world tasks including protein docking, AutoML, and UAV path planning.
Diffusion Generative Modeling on Lie Group Representations: This paper proposes a novel theoretical framework for constructing diffusion processes on the representation space of Lie groups (rather than on the Lie groups themselves). By mapping the curved dynamics of non-Abelian Lie groups into Euclidean space via generalized score matching, the framework enables simulation-free training of Lie group diffusion models, and demonstrates that standard score matching is a special case corresponding to the translation group.
EDBench: Large-Scale Electron Density Data for Molecular Modeling: This work constructs EDBench, the largest electron density (ED) dataset to date (3.3 million molecules, computed via B3LYP/6-31G** DFT), and designs a three-category benchmark evaluation framework covering prediction, retrieval, and generation tasks. It provides the first systematic assessment of deep learning models' ability to understand and exploit electron density.
Energy Loss Functions for Physical Systems: This paper proposes a physics-based energy loss function framework. By deriving an energy-difference loss grounded in pairwise distances via reverse KL divergence and the Boltzmann distribution, the framework naturally satisfies SE(d) invariance and substantially outperforms MSE and cross-entropy losses on molecular generation and spin ground-state prediction tasks.
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling: This paper proposes Energy Matching, which unifies flow matching and energy-based models via a single time-independent scalar potential field: far from the data manifold, the model performs efficient transport along optimal transport paths; near the manifold, it transitions to a Boltzmann equilibrium distribution for likelihood modeling. The method achieves FID 3.34 on CIFAR-10, improving on existing EBMs by more than 50%.
Evaluating Multiple Models Using Labeled and Unlabeled Data: This paper proposes SSME (Semi-Supervised Model Evaluation), which leverages a small amount of labeled data and a large amount of unlabeled data to estimate the joint distribution $P(y, \mathbf{s})$ of multiple classifiers via a semi-supervised mixture model, enabling accurate classifier performance evaluation with errors reduced to 1/5 of those incurred when using labeled data alone.
FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models: This paper presents FGBench, a dataset comprising 625K molecular property reasoning questions focused on functional group-level reasoning evaluation. Through three dimensions (single functional group effect, multi-functional group interaction, and molecular comparison), it systematically reveals the severe deficiencies of current LLMs in fine-grained chemical reasoning.
Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning: This paper proposes Flow Density Control (FDC), which generalizes the fine-tuning of pretrained flow/diffusion models from KL-regularized expected reward maximization to a unified framework supporting arbitrary distributional utility functions with arbitrary divergence regularization. The approach decomposes nonlinear objectives into a sequence of linear fine-tuning subproblems and provides convergence guarantees.
Fractional Diffusion Bridge Models: This paper proposes Fractional Diffusion Bridge Models (FDBM), which incorporate fractional Brownian motion (fBM) into the generative diffusion bridge framework. The Hurst exponent $H$ controls the roughness and long-range dependence of trajectories, yielding improvements over Brownian motion baselines on protein conformation prediction and image translation tasks.
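As background for the role of $H$ (standard material on fractional Brownian motion, not taken from the paper): $B_H$ is the zero-mean Gaussian process with covariance

$$
\mathbb{E}\left[B_H(t)\,B_H(s)\right] = \tfrac{1}{2}\left(|t|^{2H} + |s|^{2H} - |t-s|^{2H}\right), \qquad H \in (0, 1).
$$

Setting $H = 1/2$ recovers standard Brownian motion with independent increments. For $H > 1/2$ the increments are positively correlated, giving smoother paths with long-range dependence; for $H < 1/2$ they are negatively correlated, giving rougher paths.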
g-DPO: Scalable Preference Optimization for Protein Language Models: To address the quadratic growth of preference pairs with respect to sample size when applying DPO to protein language models (PLMs)—which renders training intractable—this paper proposes g-DPO: (1) redundant preference pairs are pruned via union-mask-based clustering in sequence space, retaining more informative comparisons within local neighborhoods; (2) grouped likelihood amortization via shared union masks enables computation of log-likelihoods for all sequences within a group in a single forward pass. Across three protein engineering tasks, g-DPO achieves statistically indistinguishable in silico and in vitro performance compared to standard DPO, while delivering 1.7–5.4× training speedups.
Generalizable Insights for Graph Transformers in Theory and Practice: This paper proposes the Generalized-Distance Transformer (GDT), a graph Transformer architecture based on standard attention (requiring no modifications to the attention mechanism). It theoretically proves that GDT's expressiveness is equivalent to the GD-WL algorithm, and through large-scale experiments covering 8 million graphs and 270 million tokens, establishes for the first time a fine-grained empirical hierarchy of positional encoding (PE) expressiveness. Under a few-shot transfer setting, GDT surpasses state-of-the-art methods without any fine-tuning.
Generative Distribution Embeddings: Lifting Autoencoders to the Space of Distributions for Multiscale Representation Learning: This paper proposes Generative Distribution Embeddings (GDE), which lifts autoencoders to the space of distributions — the encoder operates on sets of samples while the decoder is replaced by a conditional generative model — thereby learning distribution-level representations. The framework is validated on 6 computational biology tasks.
Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings: This work proposes LD-FPG, a framework that encodes full-atom MD trajectories into a low-dimensional latent space via Chebyshev graph neural networks and applies DDPM in that space to generate novel conformational ensembles. To the authors' knowledge, this is the first approach to generate protein conformations that includes all heavy atoms of the side chains.
GFlowNets for Learning Better Drug-Drug Interaction Representations: To address the severe class imbalance in drug-drug interaction (DDI) prediction, this paper proposes combining GFlowNet with a variational graph autoencoder (VGAE). By reward-guided generative sampling, the framework synthesizes training samples for rare interaction types, thereby enhancing predictive performance on infrequent yet clinically critical interaction categories.
Graph Diffusion that can Insert and Delete: This paper proposes GrIDDD, the first model to extend discrete denoising diffusion probabilistic models (DDPM) to support dynamic insertion and deletion of graph nodes during generation, allowing molecular graph size to adapt throughout the diffusion process. GrIDDD matches or surpasses existing methods on property targeting and molecular optimization tasks.
Inferring Stochastic Dynamics with Growth from Cross-Sectional Data: This paper proposes Unbalanced Probabilistic Flow Inference (UPFI), which jointly infers the drift, diffusion, and growth rate of stochastic dynamical systems from cross-sectional data via a Lagrangian formulation of the Fokker-Planck equation, constituting the first method to accurately handle scenarios involving cell proliferation and death.
Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry: This work constructs a multi-level interpretability toolkit for SynFlowNet (a GFlowNet grounded in synthetic reaction templates), integrating gradient saliency, counterfactual perturbation, sparse autoencoders (SAE), and motif probes to reveal how internal representations encode physicochemical properties and functional group information relevant to medicinal chemistry.
Is Sequence Information All You Need for Bayesian Optimization of Antibodies?: This paper systematically compares the roles of sequence and structural information in antibody Bayesian optimization, finding that sequence-only methods augmented with protein language model (pLM) soft constraints can match the performance of structure-based methods, thereby questioning the necessity of structural information in antibody Bayesian optimization.
Iterative Foundation Model Fine-Tuning on Multiple Rewards: This paper proposes IterativeRS (Iterative Rewarded Soups), which alternates between independent fine-tuning of per-objective expert policies and policy merging. The method unifies reward combination and expert merging approaches, outperforming MORLHF and Rewarded Soups on small molecule design, DNA sequence generation, and text summarization tasks.
JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles: This paper proposes JAMUN, a conformational ensemble generation method built on the Walk-Jump Sampling (WJS) framework. By performing Langevin dynamics on a noise-smoothed manifold and using an SE(3)-equivariant denoiser to jump back to the original distribution, JAMUN achieves peptide conformational sampling an order of magnitude faster than conventional molecular dynamics while retaining transferability to out-of-training systems.
JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model: JanusDNA is proposed as the first bidirectional DNA foundation model, combining a Mamba-Attention-MoE hybrid architecture with the Janus Modeling pretraining paradigm to achieve bidirectional understanding at the training efficiency of autoregressive methods, attaining state-of-the-art performance across multiple genomic benchmarks.
KLASS: KL-Guided Fast Inference in Masked Diffusion Models: This paper proposes KLASS (KL-Adaptive Stability Sampling), a training-free sampling method that leverages token-level KL divergence and confidence scores to identify stable tokens for parallel decoding, achieving up to 2.78× speedup on masked diffusion models without sacrificing—and in many cases improving—generation quality.
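The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes the KL divergence is measured between a position's predicted distributions at two consecutive denoising steps, and the function names, KL direction, and threshold values are all hypothetical rather than taken from KLASS.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def stable_positions(prev_probs, curr_probs, kl_max=0.01, conf_min=0.9):
    """Indices of masked positions judged safe to unmask in parallel.

    A position counts as stable when its predicted distribution barely
    moved since the previous step (small KL) and its top token is
    confident. Both thresholds are illustrative values.
    """
    stable = []
    for i, (p_prev, p_curr) in enumerate(zip(prev_probs, curr_probs)):
        if kl_divergence(p_curr, p_prev) <= kl_max and max(p_curr) >= conf_min:
            stable.append(i)
    return stable


# Three masked positions over a 3-token vocabulary (made-up numbers).
prev = [[0.95, 0.03, 0.02], [0.40, 0.35, 0.25], [0.10, 0.85, 0.05]]
curr = [[0.96, 0.02, 0.02], [0.92, 0.05, 0.03], [0.50, 0.45, 0.05]]
ready = stable_positions(prev, curr)
```

Only position 0 passes both checks: position 1 is confident but its prediction changed sharply between steps, and position 2 is not confident.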
Learning Conformational Ensembles of Proteins Based on Backbone Geometry: This paper proposes BBFlow, a flow matching generative model based on protein backbone geometry for conformational ensemble sampling. BBFlow requires neither evolutionary sequence information nor pretrained folding models, achieves inference speeds more than an order of magnitude faster than AlphaFlow, and generalizes to multi-chain proteins.
Learning Relative Gene Expression Trends from Pathology Images in Spatial Transcriptomics: This paper proposes STRank, a loss function that reformulates gene expression estimation from pathology images as a ranking score estimation task. By modeling the stochastic noise inherent in expression counts via binomial/multinomial distributions, STRank enables models to learn robust relative expression relationships from spatial transcriptomics data subject to batch effects and random fluctuations.
Learning Repetition-Invariant Representations for Polymer Informatics: This paper proposes GRIN (Graph Repetition-Invariant Network), which achieves invariance to the number of repeated monomer units in polymer representations via Max aggregation and a specialized graph construction strategy, addressing a fundamental symmetry problem in polymer representation learning.
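The intuition for why max aggregation helps can be shown with a toy readout. The sketch below pools fixed, hypothetical node features and ignores message passing and the paper's graph construction, so it illustrates the aggregation property only, not GRIN itself.

```python
def max_pool(node_features):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*node_features)]


def sum_pool(node_features):
    """Element-wise sum, shown for contrast."""
    return [sum(column) for column in zip(*node_features)]


# Hypothetical node features for one repeat unit (two atoms, three features).
repeat_unit = [[0.2, 1.0, -0.5], [0.7, 0.1, 0.3]]
one_unit = repeat_unit * 1
three_units = repeat_unit * 3  # same monomer repeated three times

same_under_max = max_pool(one_unit) == max_pool(three_units)
same_under_sum = sum_pool(one_unit) == sum_pool(three_units)
```

Duplicating identical node features never changes an element-wise maximum, whereas a sum grows with the number of repeats.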
Manipulating 3D Molecules in a Fixed-Dimensional E(3)-Equivariant Latent Space: This paper proposes MolFLAE, a 3D molecular variational autoencoder that learns a fixed-dimensional, E(3)-equivariant latent space. By introducing learnable virtual nodes and a Bayesian Flow Network (BFN) decoder, MolFLAE enables zero-shot molecular editing — including atom-count editing, structural reconstruction, and property interpolation — and demonstrates practical utility in drug optimization targeting the human glucocorticoid receptor (hGR).
MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs: MEIcoder uses neuron-specific Most Exciting Inputs (MEIs) as biological priors and combines them with an SSIM loss and adversarial training to achieve state-of-the-art visual stimulus reconstruction from neural population activity in the primary visual cortex (V1), with particular strengths in small-dataset and low-neuron-count regimes.
Modeling Microenvironment Trajectories on Spatial Transcriptomics with NicheFlow: NicheFlow is a Flow Matching-based generative model that represents cellular microenvironments as point clouds and jointly models the temporal evolution of cell states and spatial coordinates via Variational Flow Matching and optimal transport, substantially outperforming single-cell-level trajectory inference methods on embryonic development, brain development, and aging datasets.
Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Models: This paper proposes Mol-LLaMA, a large molecular language model for general molecular understanding. By designing three types of instruction data and a 2D-3D molecular representation fusion module, Mol-LLaMA surpasses GPT-4o in molecular feature understanding while exhibiting interpretability and reasoning capabilities.
Multimodal 3D Genome Pre-training: This paper proposes MIX-HIC — the first multimodal foundation model for 3D genomics — which integrates Hi-C contact maps and epigenomic signals via cross-modal interaction blocks and cross-modal mapping blocks. Pre-trained on over 1.27 million paired samples, MIX-HIC achieves state-of-the-art performance across three downstream tasks: Hi-C prediction, chromatin loop detection, and CAGE-seq expression prediction.
Multiscale Guidance of Protein Structure Prediction with Heterogeneous Cryo-EM Data: CryoBoltz leverages cryo-EM density maps to guide the sampling trajectory of a pretrained diffusion-based structure prediction model (Boltz-1) via a multiscale guidance mechanism (global → local), generating multi-conformational atomic models consistent with experimental data without any retraining.
Omni-Mol: Multitask Molecular Model for Any-to-Any Modalities: This paper proposes Omni-Mol, a unified molecular understanding and generation framework built upon a multimodal LLM. Through a 1.42M-sample instruction tuning dataset, Gradient Adaptive LoRA (GAL), and a Mixture-of-GAL-Experts (MoGE) architecture, Omni-Mol is the first single model to jointly learn 16 molecular tasks (Mol2Mol / Mol2Text / Mol2Num / Text2Mol), achieving SOTA on 13 tasks with only 2.2B parameters.
One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra: By employing MIST as a spectra-to-fingerprint encoder and MolForge as a fingerprint-to-structure decoder, combined with a prior-adjusted thresholding strategy, this work achieves a more than tenfold improvement on the MassSpecGym benchmark for de novo molecular structure generation from mass spectra (top-1 accuracy from 2.3% to 31%).
Pharmacophore-Guided Generative Design of Novel Drug-Like Molecules: This paper proposes a pharmacophore-guided molecular generation framework that simultaneously maximizes pharmacophore similarity and minimizes structural similarity within the reward function of a reinforcement learning model (FREED++), generating candidate drug molecules that retain bioactivity features while exhibiting high structural novelty.
Post Hoc Regression Refinement via Pairwise Rankings: This paper proposes RankRefine, a model-agnostic post-processing regression refinement method that fuses predictions from a base regressor with estimates derived from pairwise rankings via inverse-variance weighting. Without any retraining, the method achieves up to 10% relative MAE reduction in molecular property prediction using only 20 pairwise comparisons and a general-purpose LLM.
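Inverse-variance weighting itself is a standard way to combine two independent estimates of the same quantity. The sketch below shows the generic formula with hypothetical numbers; how RankRefine obtains the ranking-derived estimate and its variance is not covered here.

```python
def inverse_variance_fuse(mean_a, var_a, mean_b, var_b):
    """Combine two independent estimates of the same quantity, weighting
    each by the inverse of its variance. Returns (mean, variance)."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    fused_mean = (w_a * mean_a + w_b * mean_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)
    return fused_mean, fused_var


# Hypothetical numbers: base regressor says 2.0 (variance 1.0),
# the ranking-derived estimate says 4.0 (variance 3.0).
fused_mean, fused_var = inverse_variance_fuse(2.0, 1.0, 4.0, 3.0)
```

The fused value (2.5) sits closer to the more certain estimate, and the fused variance (0.75) is smaller than either input variance, which is why even a noisy second estimate can reduce error.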
PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation: PRESCRIBE is a framework that jointly models epistemic uncertainty (model unfamiliarity with inputs) and aleatoric uncertainty (inherent randomness of biological systems) in single-cell perturbation prediction via multivariate deep evidential regression. It generates a pseudo E-distance as a unified uncertainty proxy; filtering unreliable predictions based on this metric yields accuracy improvements exceeding 3%.
Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number: This paper proposes PAFlow, a 3D molecule generation model built on the flow matching framework, which guides the vector field via a protein–ligand interaction predictor and determines atom counts through a learnable atom number predictor. PAFlow achieves a new state-of-the-art Avg. Vina Score of −8.31 on CrossDocked2020, substantially outperforming existing methods.
PROSPERO: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhood: This paper proposes ProSpero, an active learning framework that discovers high-fitness and novel protein sequences even under surrogate model mismatch, via inference-time sampling of a frozen pretrained generative model (EvoDiff) guided by a surrogate, a targeted masking strategy, and biologically-constrained SMC sampling.
Protein Design with Dynamic Protein Vocabulary: ProDVa introduces natural protein fragments as a "dynamic vocabulary" for generative protein design, employing a three-component architecture consisting of a text encoder, a protein language model, and a fragment encoder. Using less than 0.04% of the training data required by prior work, ProDVa designs functionally aligned and structurally foldable protein sequences, surpassing the SOTA model Pinal by 7.38% on the pLDDT>70 ratio.
Quantifying the Role of OpenFold Components in Protein Structure Prediction: This paper proposes a systematic methodology for evaluating the contribution of individual Evoformer components in OpenFold/AlphaFold2 to protein structure prediction accuracy. The study finds that MSA column attention and MLP Transition layers are the most critical components, and that the importance of multiple components is significantly correlated with protein sequence length.
Random Search Neural Networks for Efficient and Expressive Graph Learning: This paper proposes Random Search Neural Networks (RSNN), which replace random walks with randomized depth-first search (DFS) for graph structure sampling. On sparse graphs, RSNN achieves complete edge coverage with only $O(\log|V|)$ searches. Paired with a universal sequence model, RSNN attains universal approximation capability, and consistently outperforms RWNN on molecular and protein benchmarks using up to 16× fewer samples.
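A randomized depth-first search and a simple edge-coverage measure can be sketched as below. The sketch counts an edge as covered when some search traverses it as a tree edge, which is an assumption about what "coverage" means here; the function names are hypothetical and this is not the RSNN code.

```python
import random


def randomized_dfs_edges(adjacency, start, rng):
    """Run one depth-first search that visits neighbours in random order.
    Returns the set of tree edges it traverses, each as a sorted tuple."""
    visited = {start}
    traversed = set()
    stack = [start]
    while stack:
        node = stack[-1]
        unvisited = [n for n in adjacency[node] if n not in visited]
        if not unvisited:
            stack.pop()  # dead end: backtrack
            continue
        nxt = rng.choice(unvisited)
        visited.add(nxt)
        traversed.add(tuple(sorted((node, nxt))))
        stack.append(nxt)
    return traversed


def edge_coverage(adjacency, num_searches, seed=0):
    """Fraction of edges covered by the union of several random searches."""
    rng = random.Random(seed)
    all_edges = {tuple(sorted((u, v))) for u in adjacency for v in adjacency[u]}
    covered = set()
    nodes = list(adjacency)
    for _ in range(num_searches):
        covered |= randomized_dfs_edges(adjacency, rng.choice(nodes), rng)
    return len(covered) / len(all_edges)


# A small tree (acyclic, like many molecular fragments): one search covers it.
tree = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1], 4: [1], 5: [2]}
tree_coverage = edge_coverage(tree, num_searches=1)
```

On a graph with cycles, a single search leaves some edges untraversed (a 6-ring loses exactly one), so the union over several searches with different random starts and neighbour orders is what fills in the rest.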
Remasking Discrete Diffusion Models with Inference-Time Scaling: This paper proposes the ReMDM sampler, which enables iterative error correction in masked discrete diffusion models by allowing already-decoded tokens to be remasked during generation. This mechanism supports inference-time compute scaling and yields substantial quality improvements on text, image, and molecular design tasks.
Retrosynthesis Planning via Worst-path Policy Optimisation in Tree-structured MDPs: This work reformulates retrosynthesis planning as a worst-path optimisation problem in tree-structured MDPs — the value of a synthesis tree is determined by its weakest path, since any dead-end path renders the entire tree invalid. The proposed method, InterRetro, optimises this worst-path objective via weighted self-imitation learning, achieving 100% success rate on Retro*-190, reducing path length by 4.9%, and attaining 92% of full performance with only 10% of training data.
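The worst-path idea can be illustrated with a toy tree evaluator. The nested-dict encoding and the 1.0 / 0.0 leaf values below are assumptions made for illustration; the sketch shows only the objective, not InterRetro's training procedure.

```python
def worst_path_value(node):
    """Value of a synthesis tree under a worst-path objective.

    A leaf is a dict with a 'value' (1.0 for a purchasable building
    block, 0.0 for a dead end in this toy encoding). An internal node
    has 'children', all of which must be made, so the tree is only as
    good as its weakest root-to-leaf path.
    """
    children = node.get("children")
    if not children:
        return node["value"]
    return min(worst_path_value(child) for child in children)


# Hypothetical route: the target splits into two intermediates.
solved_route = {"children": [
    {"value": 1.0},
    {"children": [{"value": 1.0}, {"value": 1.0}]},
]}
broken_route = {"children": [
    {"value": 1.0},
    {"children": [{"value": 1.0}, {"value": 0.0}]},  # one dead end
]}
```

Calling `worst_path_value(broken_route)` gives 0.0, whereas averaging over its three leaves would give about 0.67 and hide the fact that the route cannot be completed.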
scMRDR: A Scalable and Flexible Framework for Unpaired Single-Cell Multi-Omics Data Integration: This paper proposes scMRDR, a framework based on β-VAE that disentangles latent representations of single-cell multi-omics data into modality-shared and modality-specific components, achieving scalable integration of unpaired multi-omics data through isometric regularization, adversarial training, and masked reconstruction loss.
scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery: This work proposes the scPilot framework and scBench benchmark, enabling LLMs to perform "omics-native reasoning" (ONR) directly on single-cell RNA-seq data—reading marker genes, forming hypotheses, invoking tools for verification, and iteratively refining conclusions—achieving an 11% improvement in cell-type annotation accuracy and a 30% reduction in trajectory inference graph-edit distance.
Self Iterative Label Refinement via Robust Unlabeled Learning: This paper proposes an iterative pipeline that leverages a robust unlabeled-unlabeled (UU) learning framework to refine LLM-generated pseudo-labels, surpassing the self-refinement approaches of GPT-4o and DeepSeek-R1 on both classification and generative safety alignment tasks with minimal human annotation.
SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding: SpecMER introduces speculative decoding into protein sequence generation, employing a K-mer-guided batch selection strategy to choose the candidate most consistent with evolutionary conservation from multiple draft model outputs for target model verification. It achieves 24–32% speedup while preserving distributional consistency, and the generated sequences demonstrate significantly improved NLL and pLDDT structural confidence scores compared to unguided baselines.
Split Gibbs Discrete Diffusion Posterior Sampling: This paper proposes SGDD (Split Gibbs Discrete Diffusion), a plug-and-play posterior sampling algorithm for discrete diffusion models based on the split Gibbs sampling principle. By introducing auxiliary variables and a Hamming-distance-based regularization potential, SGDD decomposes posterior sampling into alternating likelihood and prior sampling steps, achieving substantial improvements over baselines on DNA sequence design, discrete image inverse problems, and music infilling tasks.
Steering Generative Models with Experimental Data for Protein Fitness Optimization: This work systematically evaluates strategies for steering protein generative models (discrete diffusion models and language models) toward fitness optimization, finding that plug-and-play guidance methods using small labeled datasets (~200 samples)—particularly DAPS—outperform RL-based fine-tuning, and proposes a Thompson sampling strategy incorporating predictive uncertainty for adaptive optimization.
Towards Multiscale Graph-based Protein Learning with Geometric Secondary Structural Motifs: This paper proposes SSHG (Secondary Structure-based Hierarchical Graph), a framework that constructs two-level hierarchical graph representations from protein secondary structure motifs — an intra-motif residue-level graph and an inter-motif global graph — and employs a two-stage GNN to learn local and global features respectively. Theoretical guarantees of maximal expressiveness are provided, with empirical improvements in both accuracy and computational efficiency on enzyme classification and ligand affinity prediction tasks.
Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling: This paper proposes UAE-3D, a multimodal variational autoencoder that compresses atomic types, chemical bonds, and 3D coordinates of molecules into a unified, near-lossless latent space. Because this latent space eliminates the complexity of handling multimodality and equivariance, a general-purpose Diffusion Transformer trained on it achieves state-of-the-art 3D molecular generation.
Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design: This paper proposes an uncertainty-aware multi-objective reinforcement learning framework that guides a 3D molecular diffusion model (EDM) to simultaneously optimize drug-likeness (QED), synthetic accessibility (SAS), and binding affinity. The framework dynamically shapes the reward function using predictive uncertainty from surrogate models, consistently outperforms baselines across three benchmark datasets, and validates candidate molecules through molecular dynamics simulations and ADMET analysis.
Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction: This paper proposes OligoICP, a method that uses the interquartile range (IQR) of TabPFN's predicted distributions as a label-free model selection heuristic, outperforming both specialized SOTA models and naive ensembles on siRNA knockdown efficiency prediction.
Understanding and Enhancing Mask-Based Pretraining towards Universal Representations: This paper employs high-dimensional linear regression theory to precisely characterize the effect of masking ratio on test risk in mask-based pretraining via a bias-variance decomposition, revealing that the optimal masking ratio depends on both the downstream task and model size. Building on this theory, the paper proposes R2MAE (Random Ratio MAE), which consistently outperforms fixed masking ratios across vision, language, DNA, and single-cell modeling benchmarks.
Unified All-Atom Molecule Generation with Neural Fields: This paper proposes FuncBind, a framework that represents molecules as continuous atomic density functions via neural fields, constructing a unified conditional generative model capable of target-conditioned generation across three drug modalities: small molecules, macrocyclic peptides, and antibody CDR loops.
UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection: This work introduces UniSite-DS, the first UniProt (unique protein)-centric ligand binding site dataset, and UniSite, the first end-to-end binding site detection framework. UniSite directly predicts multiple, potentially overlapping binding sites via a set prediction loss with bijective matching, and the work further proposes IoU-based AP as a more accurate evaluation metric.
Variational Regularized Unbalanced Optimal Transport: Single Network, Least Action: This paper proposes Var-RUOT, which incorporates the necessary optimality conditions of the Regularized Unbalanced Optimal Transport (RUOT) problem into the parameterization and loss design, enabling the solution of RUOT by learning a single scalar field. The approach yields solutions with lower action and improves training stability. The paper also analyzes the effect of growth penalty functions on biological priors.
Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion: This paper reveals the fundamental reason for the superiority of masking diffusion models — they implicitly condition on the known jump-time distribution — and proposes the Schedule-Conditioned Diffusion (SCUD) framework, which generalizes this advantage to arbitrary discrete diffusion models. Combined with structured forward processes, SCUD surpasses masking diffusion on both image and protein generation tasks.
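
The IQR-based selection idea summarized in the OligoICP entry above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `iqr` and `select_model`, the input layout (one list of predictive samples per test input, per candidate model), and the rule "prefer the model with the smallest mean IQR" are all assumptions made for illustration. The summary states only that the IQR of the predicted distributions serves as a selection heuristic that needs no labels.

```python
from statistics import mean, quantiles


def iqr(samples):
    """Interquartile range (Q3 - Q1) of one predictive distribution."""
    q1, _, q3 = quantiles(samples, n=4)
    return q3 - q1


def select_model(predictive_samples):
    """Pick a candidate model without using any labels.

    predictive_samples: dict mapping a (hypothetical) model name to a list
    of per-input sample lists drawn from that model's predicted distribution.
    Assumption: a narrower predictive distribution (lower mean IQR) is
    treated as the better candidate.
    """
    scores = {
        name: mean(iqr(samples) for samples in per_input)
        for name, per_input in predictive_samples.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Toy data: two candidate models, two test inputs each.
candidates = {
    "model_narrow": [[0.48, 0.50, 0.51, 0.52, 0.53], [0.70, 0.71, 0.72, 0.73, 0.74]],
    "model_wide": [[0.10, 0.30, 0.50, 0.70, 0.90], [0.20, 0.40, 0.60, 0.80, 1.00]],
}
best, scores = select_model(candidates)
print(best, scores)
```

Because the score depends only on the spread of the predictions, it can be computed on unlabeled target data, which is what makes the heuristic usable when no ground-truth efficacy measurements are available.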