
🧬 Computational Biology

🤖 AAAI2026 · 15 paper notes


🔥 Top topics: Biomolecules ×4 · Diffusion Models ×2 · Multimodal/VLM ×2 · Alignment/RLHF ×2

Apo2Mol: 3D Molecule Generation via Dynamic Pocket-Aware Diffusion Models

This paper proposes Apo2Mol, a diffusion-based all-atom framework that simultaneously generates 3D ligand molecules and corresponding holo (bound-state) pocket conformations from protein apo (unbound) conformations. Trained on 24K experimentally resolved apo-holo structure pairs, it achieves state-of-the-art performance in binding affinity (Vina min −7.86) and drug-likeness.

Constrained Best Arm Identification with Tests for Feasibility

This paper proposes a new framework for best arm identification (BAI) with feasibility constraints, allowing the decision-maker to test arm performance and feasibility constraints separately. An asymptotically optimal algorithm is designed that adaptively eliminates suboptimal arms via whichever criterion—performance or feasibility—is easier to satisfy.

ConSurv: Multimodal Continual Learning for Survival Analysis

This paper proposes ConSurv, the first multimodal continual learning framework for survival analysis. Through two core components — Multi-Stage Mixture-of-Experts (MS-MoE) and Feature-Constrained Replay (FCR) — ConSurv effectively mitigates catastrophic forgetting in settings that integrate whole slide pathology images and genomic data, comprehensively outperforming existing methods on the newly constructed MSAIL benchmark.

Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes

This paper proposes GODD (Geometric OOD Diffusion Model), which captures distributional structural priors via an equivariant asymmetric autoencoder and uses them to guide the generation process of a diffusion model. This lets models trained on data-rich molecular distributions generalize to data-scarce ones, yielding a 12.6% improvement in success rate on OOD structural shift benchmarks.

Dual-Path Knowledge-Augmented Contrastive Alignment Network for Spatially Resolved Transcriptomics

This paper proposes DKAN, a Dual-path Knowledge-Augmented contrastive Alignment Network that integrates semantic information from external gene databases as a cross-modal coordinator. Combined with a unified one-stage contrastive learning paradigm and an adaptive weighting mechanism, DKAN predicts spatially resolved gene expression from H&E-stained whole slide images (WSI), achieving state-of-the-art performance across three public ST datasets.

Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

This paper proposes three complementary chromosome-level parallelization schemes for genomic workflows: static scheduling (optimizing processing order), dynamic scheduling (knapsack-based batching with online RAM prediction), and a symbolic regression RAM predictor. Together they significantly reduce out-of-memory errors and execution time in both simulated and real precision medicine pipelines.
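The knapsack-based batching idea can be illustrated with a minimal sketch. This is not the paper's implementation: the per-chromosome RAM figures are made-up placeholders standing in for the online predictor, and `best_batch` and `schedule` are hypothetical names. Each round solves a small 0/1 knapsack to pick the pending chromosomes that fill the RAM budget most tightly.

```python
def best_batch(ram_gb, budget_gb):
    """0/1 knapsack: choose the subset with the largest total RAM that fits the budget."""
    # best[cap] = (total RAM used, chosen names) for a capacity of `cap` GB
    best = [(0, ())] * (budget_gb + 1)
    for name, need in ram_gb.items():
        # Walk capacities downward so each chromosome is used at most once.
        for cap in range(budget_gb, need - 1, -1):
            candidate = best[cap - need][0] + need
            if candidate > best[cap][0]:
                best[cap] = (candidate, best[cap - need][1] + (name,))
    return list(best[budget_gb][1])


def schedule(ram_gb, budget_gb):
    """Repeatedly carve out the tightest-fitting batch until nothing is pending."""
    remaining = dict(ram_gb)
    batches = []
    while remaining:
        batch = best_batch(remaining, budget_gb)
        if not batch:
            raise ValueError("at least one task exceeds the RAM budget")
        batches.append(batch)
        for name in batch:
            del remaining[name]
    return batches


# Hypothetical predicted peak RAM per chromosome, in whole GB (assumed positive).
predicted_ram_gb = {"chr1": 12, "chr2": 11, "chr3": 9, "chrX": 7, "chr21": 3, "chr22": 3}
batches = schedule(predicted_ram_gb, budget_gb=16)
for i, batch in enumerate(batches):
    print(i, batch, sum(predicted_ram_gb[c] for c in batch))
```

Solving one knapsack per round is a greedy simplification; it guarantees that no batch exceeds the budget but does not promise the globally smallest number of batches.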

GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance

This paper proposes GP-MoLFormer-Sim, a training-free test-time guidance method for molecular generation. It uses the contextual embeddings of a chemical language model (GP-MoLFormer) to estimate similarity to target molecules and dynamically adjusts logits during autoregressive decoding. Combined with a genetic algorithm (GP-MoLFormer-Sim+GA), the method ranks 2nd on average across the 23 tasks of the PMO benchmark and outperforms MOLLEO, which relies on GPT-4, under a strict black-box oracle setting.
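A minimal sketch of similarity-guided logit adjustment at a single decoding step follows. It rests on assumptions: the summary does not give the exact update rule, so an additive bonus is used for illustration, the similarity scores are plain placeholder numbers rather than values derived from GP-MoLFormer embeddings, and `guide_logits` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guide_logits(logits, similarity, strength):
    """Add a similarity bonus to each candidate token's logit (assumed additive rule)."""
    return [l + strength * s for l, s in zip(logits, similarity)]


vocab = ["C", "N", "O", "="]
logits = [1.0, 0.5, 0.2, 0.1]      # placeholder model scores for the next token
similarity = [0.1, 0.9, 0.3, 0.2]  # placeholder similarity-to-target estimates

plain = softmax(logits)
guided = softmax(guide_logits(logits, similarity, strength=2.0))
for tok, p, g in zip(vocab, plain, guided):
    print(f"{tok}: unguided={p:.3f} guided={g:.3f}")
```

Because the adjustment only touches logits at decoding time, the underlying model weights stay untouched, which is what makes this style of guidance training-free.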

HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology

This paper proposes HiFusion, a framework comprising two complementary modules — Hierarchical Intra-Spot Modeling (HISM) and Context-Aware Cross-Scale Fusion (CCF) — to accurately predict spatial gene expression from H&E-stained whole-slide images, achieving state-of-the-art performance on two benchmark datasets under both 2D cross-validation and 3D sample-specific evaluation settings.

Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

This paper proposes a post-hoc data pruning framework based on influence functions, leveraging Subset-Based Self-Influence estimation and two selection strategies (Top-k Influence and Coverage-Centric Influence). Under an extreme pruning rate exceeding 99%, an RNA-FM pretrained on only 0.2M sequences matches or surpasses the full model trained on 23M sequences across multiple downstream tasks, revealing substantial redundancy in biological sequence datasets.
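The two selection strategies can be contrasted with a small sketch. The summary does not define them precisely, so this assumes that Top-k keeps the highest-scoring samples and that the coverage-centric variant spreads its picks across the whole score range; the function names and scores are hypothetical.

```python
def top_k(scores, k):
    """Keep the k samples with the highest influence score (assumed direction)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def coverage_centric(scores, k, n_bins):
    """Pick round-robin from score-ranked bins so every score range is represented."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    bins = [order[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]
    picked, depth = [], 0
    while len(picked) < k:
        took_any = False
        for members in bins:
            if depth < len(members) and len(picked) < k:
                picked.append(members[depth])
                took_any = True
        if not took_any:  # every bin is exhausted
            break
        depth += 1
    return sorted(picked)


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]  # placeholder influence scores
print(top_k(scores, 4))                 # indices of the four highest scores
print(coverage_centric(scores, 4, 4))   # one index from each score quartile

# Retained fraction reported above: 0.2M of 23M sequences.
kept_fraction = 0.2e6 / 23e6
print(f"kept {kept_fraction:.2%} of the data")
```

Keeping 0.2M of 23M sequences retains under 1% of the data, consistent with the stated pruning rate above 99%.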

Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling

This paper proposes the CHMR framework, which addresses missing biological modalities via structure-aware propagation, and introduces Tree-VQ to model hierarchical dependencies among molecules, cells, and genes. Evaluated on 728 tasks across 9 benchmarks, CHMR achieves a 3.6% improvement in classification and 17.2% in regression, enabling robust cell-aware molecular representation learning.

MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

This paper proposes MergeDNA, which performs context-aware dynamic DNA tokenization via differentiable Token Merging, combined with a hierarchical autoencoder and adaptive masked token modeling for pretraining. With 380M parameters, it surpasses the 1.3B-parameter GENERator.

ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders

This paper proposes ProtSAE, which incorporates semantic annotations and domain ontology knowledge as guidance signals during sparse autoencoder training to address the semantic entanglement problem of conventional SAEs. The method aligns latent features of protein language models with biological concepts (molecular function, biological process, ion binding sites, etc.) with high precision, while maintaining high reconstruction fidelity and supporting concept-level generation steering.

S2Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening

This paper proposes S2Drug, a two-stage contrastive learning framework. Stage 1 performs large-scale protein sequence–ligand contrastive pre-training on ChEMBL with a bilateral data sampling strategy to reduce noise and redundancy. Stage 2 fine-tunes on PDBbind by fusing sequence and 3D structural information via a residue-level gating module and incorporating a binding site prediction auxiliary task. S2Drug substantially outperforms existing methods on the DUD-E and LIT-PCBA virtual screening benchmarks.

SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection

This paper proposes SpaCRD, a transfer learning-based multimodal deep fusion framework that integrates histology images and spatial transcriptomics (ST) data through a Variational Reconstruction-guided Bidirectional Cross-Attention (VRBCA) fusion network. It achieves state-of-the-art performance in cancer tissue region (CTR) detection across samples, platforms, and batches on 23 paired datasets.

TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling

TrinityDNA is a bio-inspired DNA foundation model integrating three innovations: a Groove Fusion module for capturing major/minor groove structural features, a Gated Reverse Complement mechanism for handling double-strand complementary symmetry, and Sliding Multi-Window Attention for multi-scale long-range dependency modeling. Combined with an Evolutionary Training Strategy (ETS) progressing from prokaryotes to eukaryotes, TrinityDNA achieves an average MCC of 0.708 across 15 GUE benchmark tasks (surpassing NT with 2.5B parameters), leads on both prokaryotic and eukaryotic zero-shot tasks across 19 benchmarks, and introduces a new CDS annotation benchmark for long-sequence inference evaluation.
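The double-strand symmetry that the Gated Reverse Complement mechanism targets can be illustrated with a toy sketch. Only the reverse-complement operation itself is standard; the sigmoid gate, the base-count "encoder", and the names `gated_mix` and `base_counts` are assumptions for illustration and do not reflect TrinityDNA's actual architecture.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the sequence read from the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def base_counts(seq):
    """Toy stand-in for a learned encoder: counts of A, C, G, T."""
    return [seq.count(base) for base in "ACGT"]


def gated_mix(forward_feat, reverse_feat, gate_logit):
    """Blend features from both strands with a sigmoid gate (assumed form)."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * f + (1.0 - gate) * r for f, r in zip(forward_feat, reverse_feat)]


seq = "AAGC"
rc = reverse_complement(seq)
mixed = gated_mix(base_counts(seq), base_counts(rc), gate_logit=0.0)
print(seq, rc, mixed)
```

With the gate at 0.5 (a logit of zero) the blend is identical whichever strand is treated as the forward one; a learned gate would let a model weight the strands unequally where the task calls for it.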