
🧬 Computational Biology

🧪 ICML2026 · 52 paper notes

📌 Same area in other venues: 📷 CVPR2026 (21) · 🔬 ICLR2026 (156) · 💬 ACL2026 (5) · 🤖 AAAI2026 (20) · 🧠 NeurIPS2025 (76) · 📹 ICCV2025 (4)

🔥 Top topics: Biomolecules ×22 · Diffusion Models ×7 · Layout & Composition ×3 · Multimodal/VLM ×3 · Self-Supervised Learning ×3

Active Timepoint Selection for Learning Measure-Valued Trajectories: This paper investigates "when a distribution snapshot is most valuable to sample." It uses Linearized Optimal Transport (LOT) to linearize measure trajectories in Wasserstein space and employs a multi-output Gaussian Process (GP) with time warping to provide epistemic uncertainty, enabling the active selection of timepoints that best reduce trajectory reconstruction error.
Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance: This paper proposes using the distance between embeddings from frozen pretrained molecular models (GeoDiff, MoLFormer), termed PED, as a measure of molecular similarity, with no specialized similarity training. PED serves both for candidate ranking in virtual screening and as a reward signal for molecular generation; it correlates strongly with industry-standard 3D similarity (ROCS/ROSHAMBO2), outperforms traditional metrics in EF1% on the LIT-PCBA benchmark, and accelerates generation sampling by up to 3.3×.
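The ranking use of an embedding distance described in the PED entry above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vectors stand in for embeddings that would come from a frozen pretrained model, cosine distance is one possible choice of metric (the summary does not say which distance the paper uses), and the names `cosine_distance` and `rank_by_embedding_distance` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_embedding_distance(query, library):
    """Sort library molecules from most to least similar to the query."""
    scored = [(name, cosine_distance(query, emb)) for name, emb in library.items()]
    return sorted(scored, key=lambda item: item[1])


# Toy embeddings (assumed); real ones would come from a frozen encoder.
query_embedding = [1.0, 0.0, 0.5]
library_embeddings = {
    "mol_a": [0.9, 0.1, 0.4],
    "mol_b": [-1.0, 0.2, 0.0],
    "mol_c": [0.5, 0.5, 0.5],
}

ranking = rank_by_embedding_distance(query_embedding, library_embeddings)
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

The same scalar could in principle be negated and used as a reward during generation, which is the second use the entry mentions.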
CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation: CARD utilizes "radix \(r\) decomposition" to bijectively map molecular 3D coordinates into coarse-to-fine sequences of discrete-continuous mixed tokens. This enables a cross-system general autoregressive Transformer to act as a "zero-free-energy proposal" for directly estimating the absolute free energy of arbitrary molecular systems via BAR. It achieves the accuracy of classical MFES on 70 new solvation systems while being approximately 40x faster during inference.
Circuit Tracing in Autoregressive Protein Language Models: ProGenMech introduces "Cross-Layer Transcoders (CLT)" to the autoregressive protein language model ProGen3. Using a zero-shot circuit discovery algorithm, it identifies sparse latent circuits (less than 2%) that faithfully replicate generative probability distributions and zero-shot fitness scores while mapping to biologically conserved motifs such as the HRD/DFG motifs in kinases.
CoSiNE: Conditional Site-Independent Neural Evolution Model for Antibody Sequences: CoSiNE models the antibody affinity maturation process using a neural-parameterized conditional site-independent Continuous-Time Markov Chain (CTMC). It captures inter-site epistatic effects while maintaining tractability and enables antigen-specific antibody optimization via Guided Gillespie sampling, outperforming existing language and evolutionary models in zero-shot variant effect prediction.
Constrained Flow Optimization via Sequential Fine-Tuning for Molecular Design: Addressing the scenario of "maximizing rewards (e.g., binding affinity, dipole moment) under hard domain constraints (e.g., synthetic accessibility, energy upper bounds)," this paper proposes the CFO algorithm. CFO decomposes constrained generative optimization into a sequence of standard KL-regularized fine-tuning subproblems using the Augmented Lagrangian method. By adaptively updating penalty factors \(\rho_k\) and dual variables \(\lambda_k\), CFO achieves provable convergence and significant Pareto improvements in reward-constraint trade-offs across low-dimensional toy tasks and FlowMol molecular design.
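The alternation between solving a penalized subproblem and updating \(\lambda_k\) and \(\rho_k\) in the CFO entry above follows the general augmented Lagrangian pattern, which can be sketched on a one-dimensional toy problem. This is not the paper's algorithm: the KL-regularized fine-tuning subproblem is replaced here by plain gradient ascent on a scalar, and the reward, constraint, step sizes, and the rule for growing the penalty are all assumptions chosen for illustration.

```python
def reward_grad(x):
    """Gradient of the toy reward r(x) = -(x - 3)^2 (assumed)."""
    return -2.0 * (x - 3.0)


def constraint(x):
    """Toy constraint g(x) = x - 1; x is feasible when g(x) <= 0 (assumed)."""
    return x - 1.0


def solve_subproblem(x, lam, rho, steps=500, lr=0.01):
    """Gradient ascent on r(x) - (1 / (2 rho)) * max(0, lam + rho * g(x))^2."""
    for _ in range(steps):
        multiplier = max(0.0, lam + rho * constraint(x))
        # d g / d x = 1 for this toy constraint.
        x += lr * (reward_grad(x) - multiplier * 1.0)
    return x


def augmented_lagrangian(x=0.0, lam=0.0, rho=1.0, rounds=30, rho_max=16.0):
    prev_violation = float("inf")
    for _ in range(rounds):
        x = solve_subproblem(x, lam, rho)          # inner (sub)problem
        violation = max(0.0, constraint(x))
        lam = max(0.0, lam + rho * constraint(x))  # dual variable update
        if violation > 0.5 * prev_violation:       # slow progress: raise penalty
            rho = min(2.0 * rho, rho_max)
        prev_violation = violation
    return x, lam, rho


x_star, lam_star, rho_final = augmented_lagrangian()
print(f"x = {x_star:.4f}, lambda = {lam_star:.4f}, rho = {rho_final}")
```

On this toy problem the unconstrained optimum is \(x = 3\), the constrained optimum is \(x = 1\), and the dual variable settles near 4, the value at which the reward gradient and the constraint force balance.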
CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data: Biological count data from sequencing (scRNA-seq, ATAC-seq, etc.) are inherently natural numbers and fit neither continuous nor categorical diffusion well. To address this, the paper proposes CountsDiff, a diffusion framework operating directly on the natural numbers \(\mathbb{N}_0\). It reparameterizes Blackout diffusion using a survival probability schedule \(p(t)\) with explicit loss weighting, and integrates modern diffusion tools including continuous-time training, classifier-free guidance, non-monotonic reverse trajectories via churn/remasking (attrition), and stochastic rounding. Even with a minimal implementation, it matches or exceeds SOTA discrete generative models and specialized imputation methods on CIFAR-10/CelebA images and scRNA-seq imputation.
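To make the role of a survival probability \(p(t)\) in the CountsDiff entry above concrete, here is a minimal sketch of a forward corruption step on count data in the style of a pure-death process: each unit of a count survives independently with probability \(p(t)\), so the corrupted count is a binomial thinning of the original. The exponential schedule, its rate, and the function names are assumptions for illustration; the paper's actual schedule and loss weighting are not given in the summary.

```python
import math
import random


def survival_probability(t, rate=3.0):
    """Assumed schedule: p(0) = 1 and p(t) decays toward 0 as t grows."""
    return math.exp(-rate * t)


def thin_counts(counts, p, rng):
    """Binomial thinning: keep each unit of each count with probability p."""
    return [sum(1 for _ in range(c) if rng.random() < p) for c in counts]


rng = random.Random(0)
x0 = [12, 0, 3, 40, 7]  # toy gene-count vector (assumed)

for t in (0.0, 0.25, 0.5, 1.0):
    p = survival_probability(t)
    xt = thin_counts(x0, p, rng)
    print(f"t={t:.2f} p={p:.3f} x_t={xt}")
```

Because thinning can only remove units, every corrupted value stays a natural number between 0 and the original count, which is the property that distinguishes this setup from Gaussian or categorical noising.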
Cross-Chirality Generalization by Axial Vectors for Hetero-Chiral Protein-Peptide Interaction Design: This paper proposes AFI (Axial Feature Injection), which injects axial vector features into the polar vector channels of \(E(3)\)-equivariant scalarized models via linear mixing to reduce them to \(SE(3)\)-equivariance and enable chirality sensitivity. By applying this to UniMoMo, the authors developed PepMirror, which generates hetero-chiral (D-L) peptide binders in a zero-shot manner using only homo-chiral (L-L) training data. Wet-lab experiments on the CD38 target validated it as the first experimentally confirmed AI de novo D-peptide design framework.
Demystifying Multimodal Biomolecular Co-design with Intrinsic Geodesic Coupling: The authors re-model the co-generation of heterogeneous modalities ("sequence + 3D structure") as a Temporal Optimal Transport (TOT) problem. By using bi-level optimization with a Gaussian Process surrogate (GeoCoupling), the model automatically learns non-diagonal temporal coupling curves during training (i.e., allowing structure and sequence to denoise at their respective optimal paces). This approach outperforms "synchronous coupling" and "random coupling" baselines in both SBDD and unconditional protein co-design tasks, revealing a universal "structure-leading" generation principle where geometry precedes semantics.
Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference: To address the modeling challenges of "multi-disease, long-tail/rare diseases, and shared risk factors" in Electronic Health Records (EHR), the authors reformulate multi-disease risk as "risk-factor-modulated latent disease pathways." They employ a latent hypergraph (where hyperedges represent subsets of diseases sharing risk factors) to express high-order structures, coupled with a repulsive prior to ensure sparse and identifiable pathways. A logic-preserving structured variational inference framework is used for scalable posterior estimation with calibrated uncertainty.
DNAChunker: Learnable Tokenization for DNA Language Models: DNAChunker embeds an end-to-end learnable "dynamic chunker" into masked DNA language models. It compresses base-pair sequences into variable-length chunks via bidirectional Mamba encoding and cosine similarity boundary prediction. Combined with mask protection and residual gating to prevent information leakage, it outperforms 2.5B-scale multi-species pre-trained baselines on five genomic benchmarks using only 172M parameters and the human reference genome.
EvoEGF-Mol: Evolving Exponential Geodesic Flow for Structure-based Drug Design: EvoEGF-Mol maps the continuous coordinates and discrete atom/bond types of SBDD into the same natural parameter space of the exponential family. By replacing singular Dirac endpoints with dynamically tightening target distributions and evolving them synchronously along exponential geodesics under the Fisher-Rao geometry, it pushes the PoseBusters pass rate on CrossDock to 93.4%, approaching the level of reference molecules.
Flexible Kernels for Protein Property Prediction: This paper designs a family of "flexible kernels" (LOCK / CLOCK) for protein sequences, directly encoding biophysical priors from evolutionary substitution matrices (BLOSUM) and the local linearity assumption of "property additivity under mutation" into Gaussian Process kernels. These kernels frequently outperform complex methods relying on large-scale model embeddings in data-scarce protein property prediction tasks and can absorb information from structural foundation models in a zero-shot manner for multi-task learning.
Flow Sampling: Learning to Sample from Unnormalized Densities via Denoising Conditional Processes: This paper proposes Flow Sampling, which inverts the flow matching/diffusion model paradigm from "data-driven" to "noise-driven"—constructing a denoising diffusion drift conditioned on source noise samples. By using a detached model to sample \(X_1\) on the interpolant and utilizing the energy gradient of \(X_1\) as the regression target, it learns an efficient diffusion sampler under data-free conditions and naturally extends to constant-curvature Riemannian manifolds.
From Feasible to Practical: Pareto-Optimal Synthesis Planning: PareSP uses multi-objective Monte Carlo Tree Search (MCTS) to jointly optimize synthesis pathway cost, time, feasibility, and environmental impact, identifying the complete Pareto front rather than a single "optimal" path. On USPTO and ASKCOS benchmarks, it achieves a 23% reduction in cost and a 35% reduction in time compared to single-objective methods, while maintaining \(\ge 95\%\) chemical feasibility.
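The notion of a Pareto front used in the synthesis-planning entry above can be shown with a small non-dominated filter. This sketch covers only the final filtering step, not the tree search; the route names and (cost, time) values are made up for illustration, and all objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, objectives in candidates.items():
        dominated = any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical routes scored by (cost, time); lower is better for both.
routes = {
    "route_a": (10, 5),
    "route_b": (8, 7),
    "route_c": (12, 6),
    "route_d": (8, 9),
}

front = pareto_front(routes)
print(front)
```

Here `route_c` is dominated by `route_a` and `route_d` by `route_b`, while `route_a` and `route_b` trade cost against time and both remain on the front.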
From Holo Pockets to Electron Density: GPT-style Drug Design with Density: This paper changes the conditioning signal for structure-based drug design from a "rigid empty pocket" to a "low-resolution electron cloud of the filler (containing ligand and solvent)." It proposes EDMolGPT, the first decoder-only autoregressive model in this domain, which achieves a bioactive recovery of 41% across 101 DUD-E targets, significantly outperforming previous electron density (ED)-based methods.
Generative Modeling of Discrete Latent Structures via Dynamic Policy Gradients: GReinSS employs a reward dynamically rescaled by parameters, \(r(\tau)=\sum_i \Pr(X_i\mid\tau)/\Pr(X_i\mid\theta)\), to transform policy gradients into an unbiased gradient ascent of the "observed data log-likelihood." This enables generative modeling and inference across combinatorially exploding discrete latent spaces. It consistently outperforms GFlowNets, naive policy gradients, and VAE/Diffusion/Autoregressive GEM baselines on simulated graph/set reconstruction, and surpasses standard RSEM on isoform reconstruction for real short-read RNA sequencing.
Hyperbolic Neural Population Geometry Benefits Computation: This paper establishes a theoretical framework for the experimental phenomenon where "hippocampal population activity exhibits a hyperbolic structure." It proves that place cells with exponentially distributed receptive field widths statistically induce tree-like/hyperbolic stimulus geometry. It further reveals that the update rules of Modern Hopfield Networks essentially compute the MMSE optimal decoder. Based on this, the authors propose an associative memory model defined in hyperbolic space (Karcher-flow model), with capacity growing exponentially with dimension and double-exponentially with the maximum norm, significantly exceeding existing models.
iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis: iLoRA employs a Bayesian approach to infer a sparse microbial interaction graph from each microbiome sample (Poisson edges \(\rightarrow\) Laplace sparsification \(\rightarrow\) GNN embedding). This graph is then used to generate an input-conditioned LoRA matrix \(A\), enabling the LLM to learn which bacteria are "cross-talking" while simultaneously performing IBD diagnosis.
Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback: IGSR decomposes symbolic regression into a two-step "LLM proposes basis functions \(\psi_j\) + pruning via granular influence scores \(\Delta_j\)" cycle. This cycle is embedded into Monte Carlo Tree Search (MCTS) to explore the combinatorial space. It achieves state-of-the-art MSE and symbolic recall across six biomedical benchmarks and LLM-SRBench, while discovering a novel relationship between DNA methylation and RNA Pol II pausing validated via wet-lab experiments.
Insertion Based Sequence Generation with Learnable Order Dynamics: This paper proposes LoFlexMDM, a generative model that replaces the fixed generation order of the two-step insertion-based masked diffusion model ("mask insertion + unmasking") with learnable, sample-dependent order dynamics. By generalizing discrete flow matching to variable-length sequences, parameterizing learnable insertion/unmasking times with Kumaraswamy CDFs, and jointly training the generator and target order network via REINFORCE, the model learns near-optimal generation orders on molecule and graph tasks. De novo molecule quality is improved by up to 17.5 percentage points compared to FlexMDM.
Learning Protein Structure-Function Relationships through Knowledge-guided Representation Decomposition: ProtDiS decomposes pre-trained protein micro-environment embeddings (such as ESM-3) into 8 biophysically interpretable "knowledge channels" and 1 residual channel through information bottleneck and redundancy elimination. This leads to consistent improvements in structural representation across twelve downstream tasks, particularly in scenarios where proteins share similar structures but possess different functions.
Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach: L3-PPI transforms the biological "L3 rule" (where more length-3 paths between protein pairs indicate a higher likelihood of interaction) into a learnable graph prompt. It utilizes a pre-trained GNN to recognize L3 patterns and a gated network to generate virtual L3 paths, regularizing the path count based on PPI labels. This serves as a plug-and-play classification head that improves the performance of various PPI representation models by 2-4 percentage points on average.
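The "L3 rule" in the entry above rests on counting length-3 paths between two proteins. A minimal sketch of that raw count on a toy graph follows; it shows only the classical counting step, not the learnable graph prompt or gated network, and the edge list and function names are assumptions for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def count_l3_paths(neighbors, u, v):
    """Count simple paths u - a - b - v with exactly three edges."""
    count = 0
    for a in neighbors[u]:
        if a == v:
            continue  # intermediate node must differ from the endpoint
        for b in neighbors[a]:
            if b == u or b == v:
                continue  # no revisiting the endpoints
            if v in neighbors[b]:
                count += 1
    return count


# Toy interaction network (assumed).
edges = [("A", "C"), ("C", "D"), ("D", "B"), ("A", "E"), ("E", "D")]
ppi = build_adjacency(edges)

print(count_l3_paths(ppi, "A", "B"))
print(count_l3_paths(ppi, "A", "D"))
```

In the toy graph, A reaches B through C-D and through E-D, so that pair scores higher under the rule than A and D, which are connected only by length-2 paths.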
Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining: C-FREE decomposes molecules into \(k\)-EgoNet subgraphs with fixed radii. It encodes 2D topology and multiple 3D conformations using GINE, PaiNN, and Transformer architectures, followed by pretraining via JEPA-style latent space prediction. Without negative samples, data augmentation, or positional encodings, it outperforms multimodal baselines like UniMol and MolFM (trained on 19M–77M molecules) on 8 MoleculeNet tasks using only 0.33M molecules (GEOM).
LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation: The universal uniform/mask noise prior is replaced with a family-specific Dirichlet prior obtained via Ancestral Sequence Reconstruction (ASR). This allows Dirichlet flow matching to perform structured mutations starting from an "evolved scaffold," followed by a mutate–select–amplify rerouting at an intermediate timestep. Across 8,886 Pfam families, this approach pushes family recognition accuracy close to natural sequences (95.3% vs. 96.6%) while maintaining high novelty and folding confidence.
Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models: Starting from the hidden states of a pre-trained masked diffusion model (MDM), this paper trains a lightweight "mutual information predictor head" to output the full conditional mutual information matrix between all token pairs in a single forward pass. By selecting "conditionally independent" token subsets for parallel decoding based on this matrix, it reduces inference NFE by 3-5x on Sudoku and proteins (ESM-C) while maintaining or even exceeding sequential decoding quality.
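One simple way to turn a pairwise mutual information matrix into a parallel decoding set, as the entry above describes at a high level, is a greedy threshold rule: add a position only if its estimated mutual information with every already selected position is small. The greedy rule, the threshold, and the toy matrix are assumptions for illustration; the summary does not specify the paper's selection procedure.

```python
def select_independent_positions(mi, masked_positions, threshold):
    """Greedily pick positions whose pairwise MI with all picks is below threshold."""
    selected = []
    for pos in masked_positions:
        if all(mi[pos][other] < threshold for other in selected):
            selected.append(pos)
    return selected


# Toy symmetric MI matrix for four masked tokens (assumed values).
mi_matrix = [
    [0.00, 0.90, 0.01, 0.02],
    [0.90, 0.00, 0.03, 0.80],
    [0.01, 0.03, 0.00, 0.02],
    [0.02, 0.80, 0.02, 0.00],
]

parallel_set = select_independent_positions(mi_matrix, [0, 1, 2, 3], threshold=0.1)
print(parallel_set)
```

Position 1 is strongly coupled to position 0 and is therefore deferred to a later step, while the remaining three positions can be decoded together in this toy case.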
On the Collapse of Generative Paths: A Criterion and Correction for Diffusion Steering: This paper identifies Marginal Path Collapse (MPC)—where intermediate composite densities become non-integrable—as a silent failure in inference-time guidance that combines multiple heterogeneous diffusion/flow models via a ratio-of-densities. It proposes a necessary and sufficient Path Existence Criterion (PEC) \(C(t)>0\) to diagnose collapse and introduces ACE, which dynamically corrects paths by applying bump functions to exponents \(\gamma_i(t)\). By extending Feynman–Kac correctors to time-varying exponents, ACE significantly outperforms constant-exponent baselines like NR and FKC in synthetic Checkerboard, flexible pose scaffold decoration, and COCO-MIG multi-attribute generation tasks.
Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction: This paper proposes GILC (Gradient-Informed Logit Correction), which treats a pre-trained denoising network as a variational proxy for the value function. By employing a mechanism that "bypasses the model Jacobian and performs gradient correction directly on clean prediction logits," it achieves controllable generation for discrete diffusion without any re-training. It outperforms training-free baselines and matches or exceeds fine-tuning methods across DNA, protein, and molecular science tasks.
Protein Autoregressive Modeling via Multiscale Structure Generation: PAR adapts the "next-scale prediction" concept from the visual autoregressive (VAR) domain to protein \(C\alpha\) backbone generation. By using multiscale downsampling, an autoregressive Transformer, and a flow-based decoder instead of single-scale diffusion models—combined with noisy context learning and scheduled sampling to mitigate exposure bias—it achieves an unconditional FPSD of 161.0 while unlocking zero-shot point-prompt generation and motif scaffolding with a 2.5× sampling speedup.
Protein Circuit Tracing via Cross-layer Transcoders: The authors adapt cross-layer transcoders from NLP to the protein language model (pLM) ESM2, proposing the ProtoMech framework. This framework recovers 79% of downstream performance using sparse latent circuits composed of < 1% of total latents and enables designing high-fitness protein variants by steering along the discovered circuits, outperforming baselines in over 70% of cases.
Protein Fold Classification at Scale: Benchmarking and Pretraining: The authors constructed TEDBench, an unprecedentedly large (approx. 490k entries, 965 classes) and non-redundant protein fold classification benchmark based on AlphaFold structures clustered via TED + Foldseek. They further proposed MiAE, an SE(3)-invariant Masked Autoencoder. Utilizing an extreme masking rate of 90% and an asymmetric architecture with a heavy encoder and light decoder, MiAE outperforms significantly larger models like ESM2-650M and SaProt-650M in linear probing and fine-tuning with only 100M parameters.
Protein Language Model Embeddings Improve Generalization of Implicit Transfer Operators: This paper integrates residue embeddings from pre-trained protein language models (pLM) directly into Transferable Implicit Transfer Operators (TITO). The resulting PLaTITO, trained on mdCATH using only 56 ms of trajectories and 1100 GPU hours, allows a coarse-grained \(C_\alpha\) model with as few as 19M parameters to outperform BioEmu in equilibrium sampling of outlier systems such as fast-folding proteins.
Rethinking Genomic Modeling Through Optical Character Recognition: OpticalDNA renders 1D DNA sequences into multi-page "document images", which are then "read" by an OCR-style vision-language model. By compressing nucleotide content into a few reconstructible visual tokens, it outperforms genomic foundation models that are \(985\times\) larger on long-sequence tasks of up to 450k bases, using approximately \(20\times\) fewer effective tokens and only 256K trainable parameters.
RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking: Single-step retrosynthesis is decomposed into two independent modules: "proposal" and "selection." A single ChemAlign Transformer, optimized through enhanced training, generates candidate precursors. Subsequently, LambdaMART performs Learning to Rank (LTR) on the merged and deduplicated candidate pool. On the USPTO-50K dataset, the single-model top-1 accuracy reaches 55.00%, increasing to 59.4% after reranking, and the reranking gains are attributed to specific feature sets.
Routing by Reaching: Composition of Pre-trained GFlowNets for Multi-Objective Generation: This paper proposes a training-free framework for composing GFlowNets. By using the "reaching probability" of each pre-trained model as weights to mix their respective forward policies, the framework enables direct sampling for arbitrary linear scalarizations or logical operator combinations during the inference phase. It is theoretically proven to exactly recover the target distribution in the linear case.
Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models: scLDM utilizes a unified Multi-head Cross-Attention Block (MCAB) to encode exchangeable gene expression data into sets of fixed-length, permutation-invariant latent variables. By replacing Gaussian priors with DiT + Flow Matching + joint multi-attribute classifier-free guidance, it significantly outperforms scVI, scDiffusion, and CFGen in reconstruction, (un/conditional) generation, and perturbation response prediction tasks across multiple scRNA-seq datasets.
scCBGM: Interpretable Single-Cell Counterfactual Editing: This paper proposes scCBGM, a single-cell concept bottleneck generative model. By transferring the "concept bottleneck" architecture to scRNA-seq data and employing decoder skip connections along with cross-covariance decoupling penalties, it achieves interpretable and controllable counterfactual editing of "what would happen if a biological concept were changed" for individual cells. It can also be integrated into flow matching models to enhance generation quality.
SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning: SIGMA uses token-level contrastive loss to force the hidden states of different SMILES permutations of the same molecule onto the same trajectory. It further introduces IsoBeam to prune isomorphic redundant paths during the decoding stage, enabling sequence models to truly "think by graph rather than by string" in chemical space.
Site4Drug: Predicting Drug-Binding Target Sites with an AI Agent: Site4Drug reformulates the upstream bottleneck of "where to target on a protein" as a constraint-first evidence integration problem. An LLM Agent derives feasibility signals such as topology, PTM, Motif, and cysteine networks from sequences, outputting ranked candidate sites with scores, risk labels, and traceable logs, while automatically recommending whether to use antibody/peptide or small molecule modalities.
SPATIA: Multimodal Generation and Prediction of Spatial Cell Phenotypes: Addressing the spatial transcriptomics challenge of joint modeling for "cell morphology + gene expression + spatial location," SPATIA utilizes a hierarchical attention mechanism (cell→niche→tissue) for unified representation and a spatial-conditioned morphology generation module (weak pairing + confidence-aware Optimal Transport reweighting + morphology-profile alignment flow matching). It sets new SOTA across 25.9M cells and 12 tasks for both generation and prediction.
Stein Diffusion Guidance: Training-Free Posterior Correction for Sampling Beyond High-Density Regions: SDG unifies the "training-free diffusion guidance" and "Stochastic Optimal Control (SOC) posterior sampling" paradigms. By deriving the variational upper bound of the guidance term via SOC, it is revealed that existing Tweedie-based methods omit a crucial KL regularization term. Consequently, the authors design a "back-and-forth" correction mechanism using Stein Variational Gradient Descent (SVGD): first performing a Tweedie reverse-projection to the data manifold \(\mathcal{M}_T\), then applying a Stein correction, and finally forward-projecting back to the noise manifold \(\mathcal{M}_t\). This approach significantly outperforms baselines such as DPS/LGD/MPGD/UGD in both image guidance and molecule-protein docking tasks, demonstrating particular strength in sampling rare, high-value samples from low-density regions.
STRIDE: Post-Training LLMs to Reason and Refine Bio-Sequences via Edit Trajectories: STRIDE reformulates "protein/molecule sequence optimization" as "trajectory planning in edit space." It trains an LLM to explicitly generate executable atomic edit scripts (INSERT/DELETE/REPLACE). By using Levenshtein shortest edit paths for SFT and GRPO-style reinforcement learning to align with task rewards, STRIDE increases the success rate of protein all-action stress tests from 42% to 89% and novelty from 47% to 97% in variable-length, syntactically-constrained discrete sequence optimization.
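The Levenshtein shortest edit paths mentioned in the STRIDE entry above can be recovered with the standard dynamic program plus a backtrace. The sketch below emits a script of INSERT/DELETE/REPLACE operations and replays it to check that it reproduces the target. The tuple format `(kind, position, token)`, the right-to-left ordering of the script, and the example sequences are assumptions for illustration, not the paper's script format.

```python
def edit_script(source, target):
    """Return a minimal edit script turning source into target.

    Operations are listed right to left, so each position refers to the
    sequence as it stands when that operation is applied.
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete
                dist[i][j - 1] + 1,                 # insert
                dist[i - 1][j - 1] + substitution,  # match or replace
            )

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, no edit needed
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("REPLACE", i - 1, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("DELETE", i - 1, None))
            i -= 1
        else:
            ops.append(("INSERT", i, target[j - 1]))
            j -= 1
    return ops


def apply_script(source, ops):
    """Replay an edit script produced by edit_script."""
    seq = list(source)
    for kind, pos, token in ops:
        if kind == "REPLACE":
            seq[pos] = token
        elif kind == "DELETE":
            del seq[pos]
        else:
            seq.insert(pos, token)
    return "".join(seq)


script = edit_script("MKTAY", "MKVTAYG")
print(script)
print(apply_script("MKTAY", script))
```

Listing operations from the right end first keeps earlier positions valid as the sequence length changes, which avoids any index bookkeeping when the script is executed.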
Supervised Graph Contrastive Learning for Gene Regulatory Networks: The authors treat "gene knockdown experiments" as supervisory signals for graph contrastive learning. By ensuring that graph augmentations for Gene Regulatory Networks (GRN) are based on real biological perturbations rather than random noise, the method achieves clearer disease subtype clustering on patient-specific GRNs and consistently outperforms existing graph representation learning baselines across 13 downstream tasks.
SwitchCraft: A Programmatic Framework for Designing State-Switching Proteins: SwitchCraft formalizes the task of "designing a protein that switches between multiple functional states" as an optimization problem over combinatorial constraints. By backpropagating multiple state-dependent losses (motif, binding, conformational change, contact) through the structure prediction model Boltz-1, it directly optimizes amino acid logits via gradient descent. This represents the first general computational framework for multi-state protein design, demonstrated through in silico experiments including positive/negative allostery, motif switching, induced binding, ligand modification, ligand discrimination, and de novo design of cpGFP fluorescent biosensors.
TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering: TadA-Bench utilizes million-scale TadA variant sequences from 31 rounds of real directed evolution wet-lab experiments to formalize protein engineering as a fixed-data replay task of "predicting future rounds using preceding ones." Equipped with a Seq2Graph unified labeling pipeline, it reveals that mainstream biological foundation models fail markedly at "future-round discovery."
TD3B: Transition-Directed Discrete Diffusion for Allosteric Binder Generation: TD3B formalizes the design of agonists and antagonists as a generation task for "directional transition operators." By employing a framework consisting of a direction Oracle, affinity gating, and tree-search amortized fine-tuning of a masked discrete diffusion model, it enables pre-trained peptide generators to produce sequences that directionally shift transitions between active and inactive protein conformations.
Temporal Score Rescaling for Temperature Sampling in Diffusion and Flow Models: By multiplying the score output of pre-trained diffusion/flow models by an analytical rescaling factor \(r_t\), which depends only on the timestep, variable \(k\), and \(\sigma\), the sampling distribution can be made "locally" sharper or flatter during the inference stage without any fine-tuning. This method is fully compatible with deterministic samplers such as DDIM.
Towards A Generative Protein Evolution Machine with DPLM-Evo: This paper proposes DPLM-Evo, which extends the discrete diffusion of protein language models from "mask-replacement only" to "explicit modeling of substitution + insertion + deletion." By decoupling variable-length observed sequences into an upsampled latent alignment space (\(2L\)) and utilizing contextualized evolutionary noise kernels, DPLM-Evo achieves variable-length evolutionary generation and trajectory-based protein post-editing/optimization. It achieves SOTA on ProteinGym single-sequence variant effect prediction.
Towards Universal Gene Regulatory Network Inference: Unlocking Generalizable Regulatory Knowledge in Single-cell Foundation Models: This paper points out that single-cell foundation models (scFMs) contain rich gene regulatory knowledge that is often obscured by "reconstructive pre-training." It proposes two probes, Virtual Value Perturbation (VVP) and Gradient Trajectory (GDT), to distill pairwise gene features from frozen scFMs that generalize across genes and datasets. This approach pushes AUPRC on the BEELINE benchmark from ~0.5 to 0.8–0.97, pioneering a new paradigm of "Universal GRN inference (UGRN)."
Transformed Latent Variable Multi-Output Gaussian Processes: This paper proposes T-LVMOGP: it transforms the core modeling problem of Multi-Output Gaussian Processes (MOGP)—the construction of cross-output covariance \(k_{p,p'}(x, x')\)—into "computing an inner product with a single scalar base kernel in a Lipschitz-regularized RCNN embedding space." Fully integrated into the SVGP framework, it enables MOGP to handle \(P > 10,000\) outputs (including spatial transcriptomics data with ZINB likelihoods) with high scalability and expressivity for the first time, while comprehensively outperforming baselines such as SV-LMC, OILMM, and GS-LVMOGP.
Viral Proteins Reveal Geometry of Protein Language Models: Using viral proteins as probes, this paper discovers a "nativeness axis" (PC1) in the embedding space of ESM-series protein language models (pLMs), dominated by masked reconstruction perplexity. This axis ranks sequences from well-modeled cellular proteins, through viral proteins, to shuffled/random sequences. It further demonstrates that embeddings retain "residual viral signals" beyond perplexity—linear probes can distinguish viral from cellular proteins near performance ceilings, whereas perplexity alone cannot.
What Makes a Representation Good for Single-Cell Perturbation Prediction?: This paper proposes PerturbedVAE, arguing that an effective representation for single-cell perturbation prediction must explicitly separate the dominant perturbation-invariant background programs from the sparse perturbation-response signals, while organizing the latter with a causal structure to better generalize to unseen dual-gene combinatorial perturbations.