EvoEGF-Mol: Evolving Exponential Geodesic Flow for Structure-based Drug Design

Conference: ICML 2026
arXiv: 2601.22466
Code: https://github.com/BLEACH366/EvoEGF-Mol (Available)
Area: Scientific Computing / Molecular Generation / Structure-based Drug Design
Keywords: Information Geometry, Exponential Geodesic, Fisher-Rao Metric, Flow Matching, SBDD

TL;DR

EvoEGF-Mol places continuous coordinates and discrete atom/bond types of SBDD into a unified natural parameter space of the exponential family. By replacing singular Dirac endpoints with dynamically shrinking target distributions and evolving them synchronously along exponential geodesics under the Fisher-Rao geometry, it pushes the PoseBusters pass rate on CrossDock to 93.4%, approaching the level of reference molecules.

Background & Motivation

Background: Structure-based drug design (SBDD) aims to generate small molecule ligands \(M=(\mathbf{x}_M,\mathbf{v}_M,\mathbf{b}_M)\) based on a protein pocket \(P\), encompassing three types of heterogeneous variables: 3D atomic coordinates, atom types, and bond types. Mainstream methods have shifted from early autoregressive models (Pocket2Mol, PocketFlow) to diffusion and flow matching paradigms (TargetDiff, DecompDiff, FLOWR, DynamicFlow, ECloudGen), with unified probabilistic frameworks emerging (MolCRAFT, MolPilot).

Limitations of Prior Work: Almost all methods design probability paths for continuous coordinates and discrete categories separately: Gaussian noise in Euclidean space for the former, discrete schedules on the categorical probability simplex for the latter. This "divide and conquer" approach leads to modality mismatch, where geometric coordinates may be close to convergence while atomic identities remain ambiguous, disrupting the strong geometric-chemical coupling inherent in drug molecules.

Key Challenge: Heterogeneous variables lack a unified definition of "distance." Gaussian variance for coordinates and Dirichlet concentration for categories measure uncertainty in different spaces. Forcing them together using weighted losses requires manual tuning and inherently violates the internal geometry between distributions.

Goal: (1) Provide a unified probabilistic object description for coordinates, atom types, and bond types; (2) construct geometrically sound probability paths on this object; (3) avoid instantaneous collapse caused by Dirac endpoints to ensure training signals remain effective throughout \(t\in[0,1]\).

Key Insight: Information geometry dictates that exponential geodesics (e-geodesics) under the Fisher-Rao metric and exponential connection correspond exactly to linear interpolation in the natural parameter space \(\bm{\eta}\). By viewing Gaussian coordinates and Dirichlet categories as a "product" of the exponential family, they share the same linear schedule, naturally eliminating modality temporal mismatch.

Core Idea: Represent molecules as a composite exponential family distribution "Gaussian × Dirichlet × Dirichlet," evolve along e-geodesics, and replace fixed Dirac endpoints with "gradually tightening" dynamic endpoints. This preserves Fisher-Rao geometric consistency while avoiding variance/support collapse at boundary singularities.

Method

Overall Architecture

The input is the protein pocket \(P=\{(\mathbf{x}_P^{(i)},\mathbf{v}_P^{(i)})\}\) and prior \(\bm{\eta}_0\); the output is the ligand triplet \(M=(\mathbf{x}_M,\mathbf{v}_M,\mathbf{b}_M)\). The entire pipeline operates within the natural parameter space:

  1. Unified Representation: Treat \(M\) as a product exponential family \(p(\mathbf{x}|\bm{\eta}^{\mathbf{x}})\,p(\mathbf{v}|\bm{\eta}^{\mathbf{v}})\,p(\mathbf{b}|\bm{\eta}^{\mathbf{b}})\). Coordinates use isotropic Gaussians and atom/bond categories use Dirichlet distributions. As members of the exponential family, their natural parameters are concatenated into a long vector \(\bm{\eta}=(\bm{\eta}^{\mathbf{x}},\bm{\eta}^{\mathbf{v}},\bm{\eta}^{\mathbf{b}})\).
  2. Dynamic Endpoint Geodesics: Construct time-evolving targets using \(\bm{\eta}_t=(1-t)\bm{\eta}_0+t\tilde{\bm{\eta}}_1(t)\), following e-geodesics. The path always remains within the open convex natural parameter domain \(\Omega\).
  3. Progressive Parameter Refinement Network: Similar to BFN and PIF, a neural network \(\bm{\Phi}(M_t,t,P)\) receives the current noisy sample and directly predicts the terminal parameters \(\hat{\bm{\eta}}_1\). Training uses a second-order expansion of the KL divergence, which yields a quadratic loss under the Fisher-Rao norm.
  4. Sampling: Starting from \(M_0\sim p(\cdot|\bm{\eta}_0)\), each step predicts \(\hat{\bm{\eta}}_1\), combines it to get \(\hat{\bm{\eta}}_t\), and resamples for the next input until \(t=1\) to output the molecule.
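
As a rough illustration of steps 1 and 2, the sketch below interpolates a Gaussian and a Dirichlet component linearly in natural-parameter space with a fixed endpoint. The dictionary layout, helper names, and numeric values are assumptions made for this example and are not taken from the paper's code.

```python
import math

def gaussian_to_natural(mu, sigma):
    """Map N(mu, sigma^2 I) to natural parameters (sigma^-2 * mu, -0.5 * sigma^-2)."""
    precision = 1.0 / sigma ** 2
    return [precision * m for m in mu], -0.5 * precision

def natural_to_gaussian(eta_mean, eta_prec):
    """Invert gaussian_to_natural; requires eta_prec < 0."""
    precision = -2.0 * eta_prec
    return [e / precision for e in eta_mean], math.sqrt(1.0 / precision)

def dirichlet_to_natural(alpha):
    """Dirichlet natural parameters: eta = alpha - 1."""
    return [a - 1.0 for a in alpha]

def natural_to_dirichlet(eta):
    return [e + 1.0 for e in eta]

def lerp(a, b, t):
    """Elementwise (1 - t) * a + t * b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def e_geodesic_point(prior, target, t):
    """One point on the e-geodesic of a (Gaussian, Dirichlet) product family.

    prior/target are dicts with keys 'mu', 'sigma', 'alpha' (hypothetical layout).
    Both components share the same t, which is the synchronisation idea.
    """
    m0, p0 = gaussian_to_natural(prior["mu"], prior["sigma"])
    m1, p1 = gaussian_to_natural(target["mu"], target["sigma"])
    mu_t, sigma_t = natural_to_gaussian(lerp(m0, m1, t), lerp([p0], [p1], t)[0])
    alpha_t = natural_to_dirichlet(
        lerp(dirichlet_to_natural(prior["alpha"]),
             dirichlet_to_natural(target["alpha"]), t))
    return {"mu": mu_t, "sigma": sigma_t, "alpha": alpha_t}

# Illustrative values only.
prior = {"mu": [0.0, 0.0, 0.0], "sigma": 2.0, "alpha": [1.0, 1.0, 1.0]}
target = {"mu": [1.0, 2.0, 3.0], "sigma": 0.5, "alpha": [5.0, 1.0, 1.0]}
mid = e_geodesic_point(prior, target, 0.5)
print(mid)
```

In this toy case the Gaussian precision interpolates linearly (0.25 and 4 give 2.125 at \(t=0.5\)), and the mean becomes a precision-weighted average rather than a straight Euclidean midpoint.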

Key Designs

  1. Synchronous e-Geodesics in Unified Natural Parameter Space:

    • Function: Allows coordinates, atom types, and bond types to "tighten synchronously" under the same schedule, eliminating temporal mismatch between continuous and discrete modalities.
    • Mechanism: For an exponential family \(p(\mathbf{x}|\bm{\eta})=h(\mathbf{x})\exp(\langle\bm{\eta},\mathbf{T}(\mathbf{x})\rangle-A(\bm{\eta}))\), the e-geodesic is equivalent to \(\bm{\eta}_t=(1-t)\bm{\eta}_0+t\bm{\eta}_1\). For isotropic Gaussians, natural parameters \(\sigma_t^{-2}\bm{\mu}_t\) and \(-\tfrac{1}{2}\sigma_t^{-2}\) are linearly interpolated; for Dirichlet distributions, \(\bm{\eta}=\bm{\alpha}-\mathbf{1}\) is also linearly interpolated. Thus, heterogeneous variables are driven by the same \(t\).
    • Design Motivation: Previous methods combined Euclidean MSE and categorical CE with manual weights, ignoring the "distance between distributions." The Fisher-Rao metric automatically weights supervision based on the intrinsic uncertainty of each component. The expanded loss \(D_{\mathrm{KL}}\approx \tfrac{1}{2}\sum_{\mathbf{c}}(\bm{\xi}_t^\mathbf{c})^\top \mathbf{G}^\mathbf{c}(\bm{\eta}_t^\mathbf{c})\bm{\xi}_t^\mathbf{c}\), where \(\mathbf{c}\) ranges over the components \(\mathbf{x},\mathbf{v},\mathbf{b}\), makes this Fisher calibration explicit.
  2. Dynamic Tightening Endpoints Replacing Dirac Targets:

    • Function: Avoids instantaneous variance zeroing or simplex support collapse caused by geometric singularities, maximizing the available training window.
    • Mechanism: Replaces the fixed endpoint \(\bm{\eta}_1\) with a time-dependent \(\tilde{\bm{\eta}}_1(t)\), controlled by a smoothing parameter \(\lambda\). For coordinates, \(\tilde{\sigma}_1(t)=\lambda(1-t)\); for categories, \(\tilde{\bm{\alpha}}_1(t)=(1-\lambda(1-t))\mathbf{e}_k+\lambda(1-t)\tfrac{1}{K}\mathbf{1}_K\). When \(t<1\), the endpoint remains inside the manifold with bounded natural parameters, preventing path divergence.
    • Design Motivation: Analysis in §3.2 and Fig. 2 shows that if e-geodesics aim at a Dirac distribution, natural parameters trend toward infinity at the endpoint, causing variance \(\sigma_t^2\) to collapse as \(t \to 1\). Dynamic endpoints distribute the tightening process evenly across the time axis.
  3. Progressive Parameter Refinement Training + Fisher-Calibrated KL Loss:

    • Function: Directly learns a network in parameter space to "predict terminal parameters from a noisy state," avoiding the difficulty of explicitly estimating velocity fields in sample space while automatically balancing multimodal losses.
    • Mechanism: During training, sample \(t\sim\mathcal{U}(0,1)\), compute \(\bm{\eta}_t\) and noisy samples \(M_t\). The network predicts \(\hat{\bm{\eta}}_1\) to reconstruct \(\hat{\bm{\eta}}_{t+\Delta t}\), supervised by the difference from the true evolution \(\bm{\xi}_t\). The coordinate component simplifies to weighted MSE \(\mathcal{L}_\mathbf{x}=\mathbb{E}[\tfrac{t^2\sigma_t^2}{2\tilde{\sigma}_1^4(t)}\|\mathbf{x}^*-\hat{\mathbf{x}}\|^2]\); categorical components reduce to Dirichlet KL involving multivariate Beta terms and digamma differences \(\Delta\psi_k\).
    • Design Motivation: Velocity fields in joint discrete-continuous sample spaces are non-unique and hard to train; learning terminal parameters in parameter space reduces the objective to a familiar regression form. The weights \(\mathbf{G}^\mathbf{c}(\bm{\eta}_t^\mathbf{c})\) are naturally provided by Fisher information. The paper also notes that SLDM is a special case of EGF under static endpoints + regularization.
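
The dynamic endpoint schedules from the second design can be sketched as follows. The two endpoint formulas are transcribed from the Mechanism bullet; the `path_precision` helper is an inference from combining \(\bm{\eta}_t=(1-t)\bm{\eta}_0+t\tilde{\bm{\eta}}_1(t)\) with the Gaussian natural parameter \(-\tfrac{1}{2}\sigma^{-2}\), and the values of `lam` and `sigma0` are illustrative assumptions, not the paper's settings.

```python
def sigma_endpoint(t, lam):
    """Coordinate endpoint scale: sigma_1~(t) = lam * (1 - t)."""
    return lam * (1.0 - t)

def alpha_endpoint(t, lam, k, num_classes):
    """Categorical endpoint: (1 - w) * e_k + w * (1/K) * 1, with w = lam * (1 - t)."""
    w = lam * (1.0 - t)
    return [(1.0 - w) * (1.0 if i == k else 0.0) + w / num_classes
            for i in range(num_classes)]

def path_precision(t, sigma0, lam):
    """sigma_t^-2 on the path eta_t = (1 - t) * eta_0 + t * eta_1~(t).

    Derived by interpolating -0.5 * sigma^-2 linearly; only defined for t < 1,
    where the endpoint scale is still strictly positive.
    """
    s1 = sigma_endpoint(t, lam)
    return (1.0 - t) / sigma0 ** 2 + t / s1 ** 2

lam, sigma0 = 0.5, 1.0  # illustrative values
for t in (0.0, 0.5, 0.9, 0.99):
    print(t,
          sigma_endpoint(t, lam),
          alpha_endpoint(t, lam, k=1, num_classes=4),
          path_precision(t, sigma0, lam))
```

For every \(t<1\) the endpoint scale is positive and all categorical entries are strictly positive, so the path precision stays finite. With a fixed Dirac endpoint (\(\sigma_1=0\)) the same interpolation would give an infinite precision for any \(t>0\), which is the collapse the dynamic endpoints are meant to avoid.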

Loss & Training

The total loss sums three Fisher-weighted KL terms, one each for coordinates \(\mathcal{L}_\mathbf{x}\), atom types \(\mathcal{L}_\mathbf{v}\), and bond types \(\mathcal{L}_\mathbf{b}\), taken in expectation over \(t\sim\mathcal{U}(0,1)\). Because the Fisher matrix of a product exponential family is block-diagonal, the components decouple naturally, yet their weights stay coordinated. Training follows the "predict endpoint + one-step refinement" paradigm of BFN/PIF, with the endpoint tightening speed controlled by the hyperparameter \(\lambda\).
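
To make the Fisher-weighted quadratic form concrete, the following sketch checks numerically that the KL divergence between two nearby members of an exponential family is close to \(\tfrac{1}{2}\bm{\xi}^\top\mathbf{G}\bm{\xi}\). It uses a 1-D Gaussian in natural coordinates as a stand-in for a single component; the function names and the chosen parameter values are assumptions for illustration, and the paper's actual loss also includes Dirichlet terms that are not reproduced here.

```python
import math

def log_partition(eta1, eta2):
    """A(eta) for a 1-D Gaussian, eta1 = mu / sigma^2, eta2 = -1 / (2 sigma^2)."""
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

def mean_params(eta1, eta2):
    """Gradient of A: (E[x], E[x^2]) = (mu, mu^2 + sigma^2)."""
    return -eta1 / (2.0 * eta2), eta1 ** 2 / (4.0 * eta2 ** 2) - 1.0 / (2.0 * eta2)

def fisher(eta1, eta2):
    """Hessian of A, i.e. the Fisher information in natural coordinates."""
    g11 = -1.0 / (2.0 * eta2)
    g12 = eta1 / (2.0 * eta2 ** 2)
    g22 = -eta1 ** 2 / (2.0 * eta2 ** 3) + 1.0 / (2.0 * eta2 ** 2)
    return g11, g12, g22

def kl_exact(eta, eta_new):
    """KL(p_eta || p_eta_new), written as the Bregman divergence of A."""
    xi1, xi2 = eta_new[0] - eta[0], eta_new[1] - eta[1]
    m1, m2 = mean_params(*eta)
    return log_partition(*eta_new) - log_partition(*eta) - (xi1 * m1 + xi2 * m2)

def kl_quadratic(eta, eta_new):
    """Second-order approximation 0.5 * xi^T G(eta) xi."""
    xi1, xi2 = eta_new[0] - eta[0], eta_new[1] - eta[1]
    g11, g12, g22 = fisher(*eta)
    return 0.5 * (g11 * xi1 ** 2 + 2.0 * g12 * xi1 * xi2 + g22 * xi2 ** 2)

eta = (1.0, -0.5)         # mu = 1, sigma = 1
eta_new = (1.01, -0.505)  # small perturbation xi
exact, approx = kl_exact(eta, eta_new), kl_quadratic(eta, eta_new)
print(exact, approx)
```

For a product family the Fisher matrix is block-diagonal, so the same quadratic form would simply be summed over the coordinate, atom-type, and bond-type blocks.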

Key Experimental Results

Main Results

On CrossDock, the unified framework is compared against SOTA diffusion/autoregressive baselines. Evaluation includes PoseBusters pass rates, Vina scores, strain energy, connectivity, QED, SA, and Clash Ratio.

| Dataset | Metric | Ours (EvoEGF-Mol) | Prev. SOTA (MolCRAFT) | Gain |
|---|---|---|---|---|
| CrossDock | PB-Valid (↑) | 93.4% | 84.6% | +8.8 pp |
| CrossDock | Connected (↑) | 98.6% | 96.7% | +1.9 pp |
| CrossDock | Strain (Med., ↓) | 25.96 | 195 | -86.7% |
| CrossDock | Vina Min (Avg., ↓) | -6.98 | -7.21 | +0.23 (worse) |
| CrossDock | SA (↑) | 0.75 | 0.67 | +0.08 |
| CrossDock | Clash Ratio (↓) | 0.24 | 0.26 | -0.02 |

The PoseBusters pass rate of 93.4% is close to the 95.0% of the reference molecule set. Median strain energy drops from 195 (MolCRAFT) to 25.96, while Vina Min is slightly weaker than MolCRAFT (-6.98 vs. -7.21). The gains are therefore concentrated in physical plausibility rather than in raw docking scores.

Ablation Study

| Evaluation Suite / Config | Key Metric | Description |
|---|---|---|
| CrossDock vs. Dirac EGF (Fig. 2) | Training Window Width | Static endpoints cause immediate variance/support collapse; dynamic endpoints smooth the training signal across the timeline. |
| MolGenBench (In) | Pass Rate / Hit Recovery (↑) | EvoEGF-Mol achieves top-2 hit rates and fragment recovery rates across In/In(RM.)/Not protein splits. |
| Relation to SLDM | Formal Comparison (Appx. E) | Proves SLDM is a special case of EGF under regularized static endpoints; EvoEGF is a generalized dynamic solution. |
| Fisher Calibration vs. Manual Weighting | Multimodal Balance | The KL second-order expansion provides natural weights \(\mathbf{G}^\mathbf{c}\) for each component, removing the need for manual cross-modality tuning. |

Key Findings

  • Geometric priors (e-geodesics) contribute more to "physically plausible" molecules than manually designed probability paths: median strain energy fell sharply (195 to 25.96) and the clash ratio edged down (0.26 to 0.24), indicating generated conformations are closer to real chemistry.
  • Endpoint dynamics are crucial: under the same e-geodesic backbone, switching from Dirac to \(\lambda\)-tightening targets significantly alleviates variance collapse without requiring architectural changes.
  • On real-world tasks in MolGenBench, the improvement in scaffold recovery hit rates suggests the framework generalizes beyond CrossDock and is valuable for actual drug candidate retrieval.

Highlights & Insights

  • The information geometry perspective is a powerful tool for "truly unified" continuous-discrete generation: linear interpolation in natural parameter space aligns Gaussian variance and Dirichlet concentration to the same beat, which is far more elegant than weighting two separate losses.
  • Dynamic endpoints offer a universal "singularity therapy": any generative flow targeting a Dirac distribution encounters endpoint divergence. Changing the endpoint to a time-dependent tightening distribution can be migrated to other exponential family tasks (language, point clouds, attributed graphs).
  • The progressive parameter refinement paradigm reduces the training objective to familiar regression + Dirichlet KL, integrating seamlessly with BFN/PIF systems with low implementation costs.
  • The analysis of SLDM as a "special case" is insightful: placing existing methods in a broader geometric framework clarifies their boundary conditions.

Limitations & Future Work

  • The framework currently uses a specific exponential family choice (isotropic Gaussian + Dirichlet); whether more complex families (e.g., Mixture of Gaussians, von Mises) are equally stable remains unverified.
  • The tightening speed \(\lambda\) is a global constant; different parts of a molecule (scaffold vs. substituents) may require adaptive tightening rates.
  • Experiments are limited to two pocket datasets (CrossDock, MolGenBench). Validation in more difficult drug scenarios like lipids, peptides, or covalent binding is needed.
  • Inference still requires multiple iterations, which is costly compared to one-shot regression models. Exploring few-step sampling via consistency model (RCM) ideas is worthwhile.

Related Work & Insights

  • vs. MolCRAFT / MolPilot (BFN/VOS systems): They still rely on separate Euclidean and categorical schedules, requiring VOS for alignment; EvoEGF eliminates mismatch via a unified exponential family.
  • vs. FLOWR / DynamicFlow (Flow Matching): They concatenate continuous OT flows with discrete categorical FM; EvoEGF provides a more intrinsic "product exponential family + e-geodesic" path with automatic Fisher balancing.
  • vs. Fisher-Flow / SFM / E-Geodesic FM: Previous works applied information geometry to pure discrete or simplex data; this work extends it to "hybrid continuous-discrete" molecular structures and solves the Dirac singularity.
  • vs. SLDM: The proof that SLDM is a special case of EGF under regularized static endpoints provides a unified geometric lens for several recent "end-to-end straight-line diffusion" methods.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to truly evolve SBDD continuous-discrete variables on a unified Fisher-Rao manifold and solve singularities with dynamic endpoints.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated on CrossDock and MolGenBench, though missing head-to-head comparisons with some late-2025 methods (e.g., ECloudGen) under identical conditions.
  • Writing Quality: ⭐⭐⭐⭐ Clear information geometry derivations; natural parameters for each family are explicitly defined.
  • Value: ⭐⭐⭐⭐⭐ The unified geometric framework and dynamic endpoints are versatile tools applicable to other generative problems; PoseBusters improvements are highly practical for drug discovery.