🧬 Computational Biology
🔬 ICLR2026 · 24 paper notes
📌 Same area in other venues: 🧪 ICML2026 (8) · 💬 ACL2026 (2) · 📷 CVPR2026 (5) · 🤖 AAAI2026 (15) · 🧠 NeurIPS2025 (44) · 📹 ICCV2025 (3)
🔥 Top topics: Biomolecules ×9 · Diffusion Models ×6 · LLM ×2
- AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design
-
This work constructs AFD-Instruction, the first large-scale antibody functional annotation instruction dataset (430K+ entries), aligning antibody sequences with natural-language functional descriptions via a multi-agent literature extraction pipeline. The dataset is used to instruction-tune general-purpose LLMs for antibody understanding and function-guided design, achieving an average accuracy improvement of 20+ points across five classification tasks.
- AntigenLM: Structure-Aware DNA Language Modeling for Influenza
-
AntigenLM is a GPT-2-style DNA language model that preserves the integrity of genomic functional units. Pretrained on complete influenza virus genomes and then fine-tuned, it autoregressively predicts the antigenic sequences of future circulating strains, achieving significantly lower amino acid mismatch rates than the evolutionary model beth-1 and general-purpose genomic models.
- ConfHit: Conformal Generative Design with Oracle Free Guarantees
-
This paper proposes ConfHit, a framework that employs density-ratio-weighted conformal permutation p-values to perform certification (determining whether a generated batch contains a hit) and design (pruning the candidate set while preserving statistical guarantees). Without requiring an experimental oracle and under distributional shift, ConfHit provides finite-sample \(1-\alpha\) coverage guarantees for generative molecular design.
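As a rough illustration of the kind of statistic involved, the sketch below computes a generic weighted conformal p-value, in which calibration points are reweighted (for example by an estimated density ratio) to account for distribution shift. It is not ConfHit's permutation procedure: the function name, the convention that larger scores are more extreme, and the way weights are supplied are all assumptions made for this example.

```python
import numpy as np

def weighted_conformal_pvalue(cal_scores, cal_weights, test_score, test_weight):
    """Generic weighted conformal p-value (illustrative, not the ConfHit procedure).

    cal_scores  : nonconformity scores of calibration points (larger = more extreme)
    cal_weights : nonnegative weights, e.g. estimated density ratios (assumption)
    test_score  : nonconformity score of the test point
    test_weight : weight assigned to the test point itself
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_weights = np.asarray(cal_weights, dtype=float)
    # Weight mass of calibration points at least as extreme as the test point.
    tail_mass = cal_weights[cal_scores >= test_score].sum()
    # The test point always counts toward both numerator and denominator.
    return float((tail_mass + test_weight) / (cal_weights.sum() + test_weight))

# With uniform weights this reduces to the usual (count + 1) / (n + 1) p-value.
p_uniform = weighted_conformal_pvalue([0.1, 0.2, 0.3, 0.4], [1, 1, 1, 1], 0.5, 1.0)
```

With four uniformly weighted calibration scores that all fall below the test score, `p_uniform` is 1/5; rejecting when the p-value is at most \(\alpha\) is the usual way such a statistic is turned into a level-\(\alpha\) test.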
- Controlling Repetition in Protein Language Models
-
This work presents the first systematic study of pathological repetition in protein language models (PLMs). It introduces a unified repetition metric \(R(x)\) and a utility metric \(U(x)\), and proposes UCCS (Utility-Controlled Contrastive Steering), a method that injects steering vectors decoupled from repetition into hidden layers at inference time. This suppresses repetition while preserving folding credibility, without retraining the model.
- CryoNet.Refine: A One-step Diffusion Model for Rapid Refinement of Structural Models with Cryo-EM Density Map Restraints
-
CryoNet.Refine is proposed as the first AI-based framework for cryo-EM atomic model refinement. It integrates a one-step diffusion model initialized from Boltz-2 weights, a novel differentiable density generator that physically simulates synthetic density maps, and the first use of density map correlation (cosine similarity) as a differentiable loss function, jointly optimized with geometric constraint losses including Ramachandran, rotamer, and bond angle terms. A test-time optimization strategy enables per-case customization. The method comprehensively outperforms Phenix.real_space_refine on 120 protein and DNA/RNA complex benchmarks (CC_mask: 0.59 vs. 0.54; Ramachandran favored: 98.92%).
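To make the density-correlation term concrete, here is a minimal NumPy sketch of cosine similarity between two density maps sampled on the same voxel grid, together with the corresponding loss. The function names and the `1 - similarity` form of the loss are assumptions for illustration; the summary does not specify how CryoNet.Refine masks, normalizes, or weights this term, and a differentiable version would use an autodiff framework rather than NumPy.

```python
import numpy as np

def density_cosine_similarity(map_a, map_b, eps=1e-12):
    """Cosine similarity between two density maps on an identical voxel grid."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must share the same grid shape")
    # eps guards against division by zero for an all-zero map.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def density_loss(simulated_map, experimental_map):
    """Hypothetical loss: 0 when the maps are proportional, larger as they diverge."""
    return 1.0 - density_cosine_similarity(simulated_map, experimental_map)
```

Because cosine similarity is invariant to a global positive scale factor, a loss of this form compares the shape of the two maps rather than their absolute intensity.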
- Discrete Diffusion Trajectory Alignment via Stepwise Decomposition
-
This paper proposes SDPO (Stepwise Decomposition Preference Optimization), which decomposes the trajectory alignment problem of discrete diffusion models into stepwise posterior alignment subproblems, avoiding the difficulty of backpropagating gradients through the entire denoising chain. SDPO achieves significant improvements over existing methods across three tasks: DNA sequence design, protein inverse folding, and language modeling.
- DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials
-
DistMLIP is a distributed inference platform based on a zero-redundancy graph-level parallelization strategy that addresses the lack of multi-GPU support in existing machine learning interatomic potentials (MLIPs). On 8 GPUs, it enables simulations approaching one million atoms, achieving up to 8× speedup over spatial partitioning methods while supporting systems 3.4× larger.
- DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models
-
DriftLite exploits the inherent degrees of freedom between the drift and potential function in the Fokker-Planck equation to actively stabilize particle weights by solving a lightweight linear system for the optimal control drift at each step. This approach addresses weight degeneracy in Sequential Monte Carlo (SMC) at minimal computational cost, substantially outperforming Guidance-SMC baselines on Gaussian mixture, molecular system, and protein–ligand co-folding tasks.
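The weight degeneracy mentioned above is commonly diagnosed with the effective sample size (ESS) of the particle weights. The sketch below shows that standard diagnostic only; it does not implement DriftLite's control drift, and the function names are chosen for this example.

```python
import numpy as np

def normalized_weights(log_weights):
    """Normalize log-weights stably by subtracting the maximum before exponentiating."""
    log_w = np.asarray(log_weights, dtype=float)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def effective_sample_size(log_weights):
    """ESS = 1 / sum(w_i^2) for normalized weights w_i.

    Equals the particle count when weights are uniform and approaches 1
    when a single particle carries almost all of the weight.
    """
    w = normalized_weights(log_weights)
    return float(1.0 / np.sum(w ** 2))
```

An ESS that collapses toward 1 as sampling proceeds is the symptom that resampling, or a drift correction of the kind the summary describes, is meant to counteract.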
- EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering
-
EvoFlows is an edit-based flow-matching approach that learns mutational trajectories between evolutionarily related protein sequences. It applies a controllable number of edits (insertions, deletions, substitutions) to a template sequence while jointly predicting what to mutate and where.
- Fine-Tuning Diffusion Models via Intermediate Distribution Shaping
-
This work unifies rejection-sampling-based fine-tuning methods under the GRAFT framework, proving that they implicitly perform KL-regularized reward maximization. Building on this, P-GRAFT is proposed to perform distribution shaping at intermediate denoising steps (achieving a better bias–variance trade-off), and Inverse Noise Correction is introduced to improve flow model quality without reward signals, yielding an 8.81% VQAScore improvement on text-to-image generation.
- Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology
-
This paper proposes the Stamp framework, which leverages spatial transcriptomics gene expression data as a supervisory signal. Through spatially-aware gene encoder pretraining and hierarchical multi-scale contrastive alignment, it enables joint representation learning of pathology images and spatial transcriptomics data, achieving state-of-the-art performance across 4 downstream tasks on 6 datasets.
- HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction
-
This paper proposes HistoPrism, an efficient Transformer architecture that injects cancer-type conditioning via cross-attention to predict pan-cancer gene expression from H&E histology images. It further introduces the Gene Pathway Coherence (GPC) evaluation framework based on Hallmark/GO pathways, achieving substantial improvements over STPath at the pathway level—particularly on low-variance, biologically fundamental pathways.
- How to Make the Most of Your Masked Language Model for Protein Engineering
-
This work proposes a temperature-annealed stochastic beam search (SBS) sampling method for masked language models (MLMs), leveraging a wild-type marginal approximation of pseudo-log-likelihood (PLL) for efficient full-sequence evaluation. In vitro experiments on real therapeutic antibody optimization demonstrate that the choice of sampling algorithm is at least as important as model selection; SBS with guidance achieves a 100% success rate.
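One common reading of a wild-type marginal approximation is sketched below: the model is run once on the unmasked wild-type sequence, and any same-length variant is then scored by summing the log-probabilities of its residues under those per-position distributions, so no further forward passes are needed per candidate. Whether this matches the paper's exact formulation is an assumption; `log_probs_wt` stands in for the output of a hypothetical MLM, and the alphabet ordering is arbitrary.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids; ordering is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def wt_marginal_score(variant, log_probs_wt):
    """Approximate pseudo-log-likelihood of `variant` from wild-type marginals.

    log_probs_wt : array of shape (L, 20) holding log p(residue | wild-type context)
                   at each position, from one forward pass of a hypothetical MLM
                   on the wild-type sequence.
    """
    log_probs_wt = np.asarray(log_probs_wt, dtype=float)
    if len(variant) != log_probs_wt.shape[0]:
        raise ValueError("variant length must match the wild-type length")
    residue_idx = np.array([AA_INDEX[aa] for aa in variant])
    positions = np.arange(len(variant))
    # Sum the log-probability of each variant residue at its position.
    return float(log_probs_wt[positions, residue_idx].sum())
```

Under this approximation, the score difference between a variant and the wild type depends only on the mutated positions, which is what makes ranking many candidates during beam search cheap.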
- Intrinsic Lorentz Neural Network
-
This paper proposes ILNN, a fully intrinsic hyperbolic neural network in which all operations are performed entirely within the Lorentz model, eliminating the geometric inconsistencies introduced by Euclidean operations in existing methods. ILNN achieves state-of-the-art performance on image classification, genomics, and graph classification tasks.
- mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules
-
This paper proposes mCLM (Modular Chemical Language Model), which represents molecules as sequences of synthesizable building blocks, enabling LLMs to generate molecules that simultaneously satisfy pharmacological function and automated synthesis feasibility. mCLM achieves significant improvements in pharmacokinetic and toxicity properties across 430 FDA-approved drugs.
- Protein as a Second Language for LLMs
-
This work treats amino acid sequences as a "second language" for LLMs. By constructing a protein–natural language bilingual dataset and an adaptive context construction mechanism, the proposed framework enables general-purpose LLMs to achieve an average ROUGE-L improvement of 7%—up to 17.2%—on protein question-answering tasks without any training, even surpassing domain-specific fine-tuned models.
- Protein Counterfactuals via Diffusion-Guided Latent Optimization
-
This paper proposes MCCOP, a framework that performs gradient-guided counterfactual optimization in a continuous joint sequence–structure latent space, using a pretrained diffusion model as a manifold prior. With as few as 2–3 mutations, MCCOP generates biologically plausible protein variants that flip predictor outputs, simultaneously enabling model interpretation and protein design hypothesis generation.
- Protein Structure Tokenization via Geometric Byte Pair Encoding
-
This paper proposes GeoBPE — the first tokenizer to extend Byte Pair Encoding (BPE) from discrete text to continuous protein backbone geometry. By alternating between local merging (k-medoids clustering + quantization) and global correction (differentiable inverse kinematics), GeoBPE constructs a hierarchical structural motif vocabulary that achieves >10× compression ratio and >10× data efficiency over VQ-VAE-based protein structure tokenizers (PSTs), ranking first across 24 test sets spanning 12 downstream tasks.
- Reverse Distillation: Consistently Scaling Protein Language Model Representations
-
To address the anomalous scaling phenomenon in protein language models (PLMs), where larger models do not necessarily yield better performance, this paper proposes a reverse distillation framework. It uses the representations of a smaller model as a base, extracts orthogonal residual information from a larger model via SVD, and constructs Matryoshka nested embeddings, ensuring that larger reverse-distilled models consistently outperform smaller ones. After reverse distillation, ESM-2 15B becomes the strongest model in its family for the first time.
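The idea of keeping the small model's embedding as a prefix and appending only what the larger model adds can be sketched as follows. This is one plausible reading of the summary, not the paper's algorithm: the least-squares projection, the function name `reverse_distill`, and the choice of `k` residual directions are all assumptions made for illustration.

```python
import numpy as np

def reverse_distill(h_small, h_large, k):
    """Illustrative nested embedding: [small embedding | top-k residual directions].

    h_small : (n, d_small) embeddings from the smaller model
    h_large : (n, d_large) embeddings of the same inputs from the larger model
    k       : number of residual directions to keep (hypothetical parameter)
    """
    h_small = np.asarray(h_small, dtype=float)
    h_large = np.asarray(h_large, dtype=float)
    # Part of the large embedding that is linearly predictable from the small one.
    coef, *_ = np.linalg.lstsq(h_small, h_large, rcond=None)
    # What remains is orthogonal to the column space of h_small.
    residual = h_large - h_small @ coef
    # Top-k principal directions of the residual via SVD.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    extra = residual @ vt[:k].T
    # The first d_small columns are exactly the small model's embedding.
    return np.concatenate([h_small, extra], axis=1)
```

Because the appended columns carry only information the smaller model lacks, truncating the result to its first `d_small` columns recovers the smaller model's embedding, which is the Matryoshka-style nesting the summary refers to.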
- Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics
-
This paper proposes STAR-MD, an SE(3)-equivariant causal diffusion Transformer that achieves microsecond-scale protein dynamics trajectory generation via joint spatio-temporal attention and contextual noise perturbation. STAR-MD attains state-of-the-art performance across all metrics on the ATLAS benchmark and stably extrapolates to microsecond timescales unseen during training.
- SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling
-
SynCoGen is a multimodal generative framework that combines masked graph diffusion and flow matching to jointly sample molecular building-block reaction graphs and 3D atomic coordinates, achieving high-quality 3D molecule generation while guaranteeing synthetic feasibility.
- Thompson Sampling via Fine-Tuning of LLMs
-
This paper proposes ToSFiT, which extends Thompson Sampling to large-scale unstructured discrete spaces by fine-tuning large language models to directly parameterize the Probability of Maximality (PoM), thereby circumventing the intractability of acquisition function maximization.
- Tracing Pharmacological Knowledge in Large Language Models
-
This work presents the first systematic causal analysis of how biomedical LLMs encode drug-group semantics. It reveals that drug-group knowledge is stored in early layers, that it is distributed across multiple tokens rather than the last token alone, and that linearly separable semantic information is already present at the embedding layer.
- Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge
-
PVB (Pretrained Variational Bridge) unifies the training objectives of single-structure pretraining and paired-trajectory fine-tuning via an encoder-decoder architecture combined with augmented bridge matching, enabling cross-domain biomolecular trajectory generation. It further accelerates protein–ligand holo-state exploration through RL fine-tuning based on adjoint matching.