EvoEGF-Mol: Evolving Exponential Geodesic Flow for Structure-based Drug Design

Conference: ICML 2026
arXiv: 2601.22466
Code: https://github.com/BLEACH366/EvoEGF-Mol (Available)
Area: Scientific Computing / Molecule Generation / Structure-based Drug Design
Keywords: Information Geometry, Exponential Geodesic, Fisher-Rao Metric, Flow Matching, SBDD

TL;DR

EvoEGF-Mol maps the continuous coordinates and discrete atom/bond types of SBDD into a single natural parameter space of a product exponential family. It replaces singular Dirac endpoints with dynamically tightening target distributions and evolves all modalities synchronously along exponential geodesics under the Fisher-Rao geometry, raising the PoseBusters pass rate on CrossDock to 93.4%, close to the 95.0% of the reference molecules.

Background & Motivation

Background: Structure-based drug design (SBDD) aims to generate small molecule ligands \(M=(\mathbf{x}_M,\mathbf{v}_M,\mathbf{b}_M)\)—including 3D atomic coordinates, atom types, and bond types—based on a protein pocket \(P\). Mainstream methods have shifted from early autoregressive models (AR, Pocket2Mol, PocketFlow) to diffusion and flow matching paradigms (TargetDiff, DecompDiff, FLOWR, DynamicFlow, ECloudGen), alongside unified probabilistic frameworks (MolCRAFT, MolPilot).

Limitations of Prior Work: Almost all methods design probability paths separately for continuous coordinates and discrete categories—the former via Gaussian noise in Euclidean space and the latter via discrete schedules on category probability simplices. This "divide and conquer" approach leads to modality mismatch: geometric coordinates may have nearly converged while atom identities remain ambiguous, disrupting the strong geometric-chemical coupling essential to drug molecules.

Key Challenge: Heterogeneous variables lack a unified "distance." Gaussian variance for coordinates and Dirichlet concentration for categories measure uncertainty in different spaces. Forcing them together with weighted losses requires manual tuning and inherently violates the intrinsic geometry of the distributions.

Goal: (1) Provide a description of coordinates, atom types, and bond types as a unified probabilistic object; (2) Construct geometrically sound probability paths on this object; (3) Avoid instantaneous collapse caused by Dirac endpoints to ensure training signals remain effective throughout \(t\in[0,1]\).

Key Insight: Information geometry reveals that exponential geodesics (e-geodesics) under the Fisher-Rao metric and exponential connection correspond exactly to linear interpolation of natural parameters \(\bm{\eta}\). If Gaussian coordinates and Dirichlet categories are treated as "products" of exponential families, they share a unified linear schedule, naturally eliminating modality mismatch.

Core Idea: Molecules are represented as a composite "Gaussian × Dirichlet × Dirichlet" exponential family distribution evolving along e-geodesics. Fixed Dirac endpoints are replaced with "dynamically tightening" endpoints over time, preserving Fisher-Rao geometric consistency while avoiding the instantaneous collapse of variance or support caused by boundary singularities.

Method

Overall Architecture

EvoEGF-Mol targets the long-standing pacing mismatch that arises when continuous coordinates and discrete atom/bond types follow separate probability paths. It treats the molecule triplet \(M=(\mathbf{x}_M,\mathbf{v}_M,\mathbf{b}_M)\) as a product exponential family distribution, "Gaussian × Dirichlet × Dirichlet." The states of all variables are collected into a single natural parameter vector \(\bm{\eta}\), which evolves along an e-geodesic in Fisher-Rao geometry from a prior \(\bm{\eta}_0\) toward a target. The network predicts the endpoint parameters from a noisy molecule, and the target distributions tighten dynamically over time to avoid endpoint singularities. The model outputs ligands conditioned on the protein pocket \(P\).
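The shared path can be illustrated with a minimal sketch, assuming plain Python lists for vectors. The function names (`gaussian_to_natural`, `gaussian_e_geodesic`, `dirichlet_e_geodesic`) are hypothetical and not taken from the released code. The sketch interpolates the natural parameters of an isotropic Gaussian and of a Dirichlet with the same \(t\), then maps the Gaussian back to a mean and a standard deviation.

```python
def gaussian_to_natural(mu, sigma):
    """Isotropic Gaussian (mean vector, scalar std) -> natural parameters."""
    precision = 1.0 / sigma**2
    return [precision * m for m in mu], -0.5 * precision


def natural_to_gaussian(eta1, eta2):
    """Inverse map: natural parameters -> (mean vector, std)."""
    sigma2 = -1.0 / (2.0 * eta2)
    return [sigma2 * e for e in eta1], sigma2**0.5


def lerp(a, b, t):
    """Linear interpolation between two scalars."""
    return (1.0 - t) * a + t * b


def gaussian_e_geodesic(mu0, sigma0, mu1, sigma1, t):
    """Point at time t on the e-geodesic between two isotropic Gaussians."""
    e1_0, e2_0 = gaussian_to_natural(mu0, sigma0)
    e1_1, e2_1 = gaussian_to_natural(mu1, sigma1)
    eta1 = [lerp(a, b, t) for a, b in zip(e1_0, e1_1)]
    eta2 = lerp(e2_0, e2_1, t)
    return natural_to_gaussian(eta1, eta2)


def dirichlet_e_geodesic(alpha0, alpha1, t):
    """Interpolate eta = alpha - 1, then map back to concentrations."""
    return [lerp(a0 - 1.0, a1 - 1.0, t) + 1.0 for a0, a1 in zip(alpha0, alpha1)]


# Illustrative values only: a broad prior and a sharper target, same t for both.
mu_t, sigma_t = gaussian_e_geodesic([0.0, 0.0, 0.0], 1.0, [1.0, 2.0, 3.0], 0.1, t=0.5)
alpha_t = dirichlet_e_geodesic([1.0, 1.0, 1.0], [10.0, 1.0, 1.0], t=0.5)
print(mu_t, sigma_t, alpha_t)
```

Because interpolation acts on precisions rather than variances, the sharper endpoint dominates early: at \(t=0.5\) the mean is already close to the target in this example.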

Key Designs

1. Synchronized e-geodesics in Unified Natural Parameter Space: Aligning Coordinates and Categories

Prior work designs Euclidean Gaussian noise for coordinates and discrete schedules on simplices for categories separately, so geometry can converge before chemical identity is clear. EvoEGF's key observation is that for any exponential family \(p(\mathbf{x}|\bm{\eta})=h(\mathbf{x})\exp(\langle\bm{\eta},\mathbf{T}(\mathbf{x})\rangle-A(\bm{\eta}))\), the e-geodesic under the Fisher-Rao metric is the linear interpolation of natural parameters \(\bm{\eta}_t=(1-t)\bm{\eta}_0+t\bm{\eta}_1\). By linearly interpolating the natural parameters of isotropic Gaussians (\(\sigma_t^{-2}\bm{\mu}_t\), \(-\tfrac{1}{2}\sigma_t^{-2}\)) and Dirichlet distributions (\(\bm{\eta}=\bm{\alpha}-\mathbf{1}\)), heterogeneous variables tighten synchronously under the same time \(t\). This geometric consistency carries over to the loss: to leading order in the natural-parameter perturbation \(\bm{\xi}_t^\mathbf{c}\), the KL divergence reduces to the quadratic form \(D_{\mathrm{KL}}\approx \tfrac{1}{2}\sum_{\mathbf{c}}(\bm{\xi}_t^\mathbf{c})^\top \mathbf{G}^\mathbf{c}(\bm{\eta}_t^\mathbf{c})\bm{\xi}_t^\mathbf{c}\). The supervision weight of each component \(\mathbf{c}\) is therefore given directly by its Fisher Information Matrix \(\mathbf{G}^\mathbf{c}\), which removes the need to hand-tune the ratio between coordinate MSE and classification CE.
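The quadratic approximation can be checked numerically on the simplest case. The sketch below assumes a 1D Gaussian in natural coordinates \((\mu/\sigma^2,\,-1/(2\sigma^2))\); all function names are illustrative. It computes the exact KL divergence as the Bregman divergence of the log-partition function \(A\) and compares it with \(\tfrac{1}{2}\bm{\xi}^\top\mathbf{G}\bm{\xi}\), where \(\mathbf{G}\) is the Hessian of \(A\).

```python
import math


def log_partition(eta1, eta2):
    """A(eta) for a 1D Gaussian with natural parameters (mu/s^2, -1/(2 s^2))."""
    return -eta1**2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)


def mean_params(eta1, eta2):
    """Gradient of A: the expected sufficient statistics (E[x], E[x^2])."""
    return -eta1 / (2.0 * eta2), eta1**2 / (4.0 * eta2**2) - 1.0 / (2.0 * eta2)


def fisher(eta1, eta2):
    """Hessian of A, i.e. the Fisher information in natural coordinates."""
    g11 = -1.0 / (2.0 * eta2)
    g12 = eta1 / (2.0 * eta2**2)
    g22 = -eta1**2 / (2.0 * eta2**3) + 1.0 / (2.0 * eta2**2)
    return [[g11, g12], [g12, g22]]


def kl_exact(eta, xi):
    """KL(p_eta || p_{eta+xi}) as the Bregman divergence of A."""
    m1, m2 = mean_params(*eta)
    return (log_partition(eta[0] + xi[0], eta[1] + xi[1])
            - log_partition(*eta) - xi[0] * m1 - xi[1] * m2)


def kl_quadratic(eta, xi):
    """Leading-order approximation 0.5 * xi^T G(eta) xi."""
    g = fisher(*eta)
    return 0.5 * (g[0][0] * xi[0]**2 + 2.0 * g[0][1] * xi[0] * xi[1]
                  + g[1][1] * xi[1]**2)


eta = (0.5, -0.5)    # corresponds to mu = 0.5, sigma^2 = 1
xi = (1e-2, 1e-2)    # small perturbation in natural parameters
print(kl_exact(eta, xi), kl_quadratic(eta, xi))
```

The two values agree to within a few percent for this perturbation, and the gap shrinks as \(\bm{\xi}\) gets smaller.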

2. Dynamic Tightening Endpoints Replacing Dirac Targets: Distributing Training Signals Timeline-wide

Targeting e-geodesics directly at Dirac endpoints causes catastrophic issues: natural parameters tend toward infinity at the endpoints, coordinate variance \(\sigma_t^2\) collapses to 0 as \(t\to1\), and category supports vanish instantly. This concentrates all effective training signals at the very end (as analyzed in §3.2 and Fig. 2). The solution is to replace the fixed endpoint \(\bm{\eta}_1\) with a time-dependent \(\tilde{\bm{\eta}}_1(t)\), controlled by a smoothing hyperparameter \(\lambda\)—setting \(\tilde{\sigma}_1(t)=\lambda(1-t)\) for coordinates and \(\tilde{\bm{\alpha}}_1(t)=(1-\lambda(1-t))\mathbf{e}_k+\lambda(1-t)\tfrac{1}{K}\mathbf{1}_K\) for categories. As long as \(t<1\), the endpoint remains within the open convex natural parameter domain \(\Omega\), ensuring parameters are bounded and paths do not diverge. The tightening process is spread evenly across \(t\in[0,1]\), maximizing the training window.
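The two tightening schedules can be written down directly. This is a minimal sketch: the function names and the value \(\lambda=0.5\) are assumptions for illustration, since the summary does not state the value used in the paper.

```python
def tightened_sigma(t, lam):
    """Coordinate endpoint std: lam * (1 - t), which reaches zero only at t = 1."""
    return lam * (1.0 - t)


def tightened_alpha(t, lam, k, num_classes):
    """Category endpoint: one-hot e_k mixed with the uniform vector, weight lam * (1 - t)."""
    s = lam * (1.0 - t)
    return [(1.0 - s) * (1.0 if j == k else 0.0) + s / num_classes
            for j in range(num_classes)]


lam = 0.5  # hypothetical smoothing value
for t in (0.0, 0.5, 0.9, 1.0):
    print(t, tightened_sigma(t, lam), tightened_alpha(t, lam, k=1, num_classes=4))
```

For every \(t<1\) the standard deviation and all category entries stay strictly positive, and the Dirac / one-hot limit is reached only at \(t=1\).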

3. Progressive Parameter Refinement Training + Fisher-calibrated KL Loss: Reducing Targets to Regression

Since velocity fields for joint discrete-continuous sample spaces are non-unique and difficult to train, EvoEGF follows the BFN/PIF approach: the network \(\bm{\Phi}(M_t,t,P)\) receives a noisy sample and directly predicts the endpoint parameters \(\hat{\bm{\eta}}_1\) instead of explicitly estimating a velocity field. During training, \(t\sim\mathcal{U}(0,1)\) is sampled to calculate \(\bm{\eta}_t\) and noisy samples \(M_t\). The predicted \(\hat{\bm{\eta}}_1\) is used to reconstruct \(\hat{\bm{\eta}}_{t+\Delta t}\), which is supervised against the first-order KL difference \(\bm{\xi}_t\) of the true evolution. The coordinate component simplifies to a weighted MSE \(\mathcal{L}_\mathbf{x}=\mathbb{E}[\tfrac{t^2\sigma_t^2}{2\tilde{\sigma}_1^4(t)}\|\mathbf{x}^*-\hat{\mathbf{x}}\|^2]\), and the category component reduces to a Dirichlet KL term. Because the Fisher matrix of product exponential families is block-diagonal, the weights \(\mathbf{G}^\mathbf{c}(\bm{\eta}_t^\mathbf{c})\) are naturally determined, decoupling the components while maintaining coordination.
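One way to see where the coordinate weight comes from is the short derivation below. It assumes a zero-mean prior \(\mathcal{N}(\mathbf{0},\sigma_0^2\mathbf{I})\) (an assumption, as the prior is not specified here) and compares the interpolated Gaussians built from the true endpoint \(\mathbf{x}^*\) and from the prediction \(\hat{\mathbf{x}}\) at the same time \(t\). The paper's own derivation via \(t+\Delta t\) may differ in detail.

```latex
\sigma_t^{-2} = (1-t)\,\sigma_0^{-2} + t\,\tilde{\sigma}_1^{-2}(t),
\qquad
\sigma_t^{-2}\bm{\mu}_t = t\,\tilde{\sigma}_1^{-2}(t)\,\mathbf{x}^*
\;\Longrightarrow\;
\bm{\mu}_t = \frac{t\,\sigma_t^{2}}{\tilde{\sigma}_1^{2}(t)}\,\mathbf{x}^*

D_{\mathrm{KL}}\!\left(\mathcal{N}(\bm{\mu}_t,\sigma_t^2\mathbf{I})\,\big\|\,\mathcal{N}(\hat{\bm{\mu}}_t,\sigma_t^2\mathbf{I})\right)
= \frac{\|\bm{\mu}_t-\hat{\bm{\mu}}_t\|^2}{2\sigma_t^2}
= \frac{t^2\sigma_t^2}{2\tilde{\sigma}_1^4(t)}\,\|\mathbf{x}^*-\hat{\mathbf{x}}\|^2
```

Under these assumptions the result reproduces the weight \(\tfrac{t^2\sigma_t^2}{2\tilde{\sigma}_1^4(t)}\) in \(\mathcal{L}_\mathbf{x}\).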

Loss & Training

The total loss is the sum of the expected Fisher-weighted KL divergences for coordinates \(\mathcal{L}_\mathbf{x}\), atom types \(\mathcal{L}_\mathbf{v}\), and bond types \(\mathcal{L}_\mathbf{b}\), with \(t\sim\mathcal{U}(0,1)\). The components are decoupled, yet their relative weights are set jointly by the Fisher information rather than by hand. Training follows the BFN/PIF "predict endpoint parameters + one-step refinement" paradigm. Time steps are sampled uniformly in \([0,1]\), and the endpoint tightening speed is controlled by the single hyperparameter \(\lambda\). Sampling starts from \(M_0\sim p(\cdot|\bm{\eta}_0)\); each step predicts \(\hat{\bm{\eta}}_1\), constructs \(\hat{\bm{\eta}}_t\), and resamples the next input until \(t=1\).
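The sampling loop can be sketched for the coordinate component alone. This is a toy under stated assumptions, not the released implementation: `predict_endpoint` is a hypothetical stand-in for the network \(\bm{\Phi}(M_t,t,P)\), the prior is taken as \(\mathcal{N}(\mathbf{0},\sigma_0^2\mathbf{I})\), \(\lambda=0.5\) is an illustrative value, and the loop stops one step short of \(t=1\) so that \(\tilde{\sigma}_1(t)\) stays positive.

```python
import random


def interpolate(x_hat, t, lam, sigma0):
    """Natural-parameter interpolation between N(0, sigma0^2 I) and
    the tightened endpoint N(x_hat, (lam * (1 - t))^2 I); returns (mean, std)."""
    sigma_tilde = lam * (1.0 - t)
    precision = (1.0 - t) / sigma0**2 + t / sigma_tilde**2
    var = 1.0 / precision
    mean = [var * t / sigma_tilde**2 * x for x in x_hat]
    return mean, var**0.5


def sample(predict_endpoint, dim=3, steps=100, lam=0.5, sigma0=1.0, seed=0):
    rng = random.Random(seed)
    t = 0.0
    x = [rng.gauss(0.0, sigma0) for _ in range(dim)]  # M_0 drawn from the prior
    for i in range(1, steps):
        x_hat = predict_endpoint(x, t)         # predicted endpoint mean
        t = i / steps                          # next time, kept below 1
        mean, std = interpolate(x_hat, t, lam, sigma0)
        x = [rng.gauss(m, std) for m in mean]  # resample the next input
    return x


# Oracle predictor used only to exercise the loop; a trained model would go here.
target = [1.0, -2.0, 0.5]
out = sample(lambda x, t: target)
print(out)
```

With the oracle predictor the final sample lands close to `target`, since the interpolated standard deviation at the last step is on the order of \(\lambda/\text{steps}\).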

Key Experimental Results

Main Results

The unified framework was compared against SOTA diffusion and autoregressive baselines on CrossDock. Evaluation metrics include PoseBusters pass rates, Vina scores, strain energy, connectivity, QED, SA, and Clash Ratio.

Dataset	Metric	Ours (EvoEGF-Mol)	Prev. Best (MolCRAFT)	Gain
CrossDock	PB-Valid (↑)	93.4%	84.6%	+8.8 pp
CrossDock	Connected (↑)	98.6%	96.7%	+1.9 pp
CrossDock	Strain (Med., ↓)	25.96	195	-86.7%
CrossDock	Vina Min (Avg., ↓)	-6.98	-7.21	+0.23 (slightly worse)
CrossDock	SA (↑)	0.75	0.67	+0.08
CrossDock	Clash Ratio (↓)	0.24	0.26	-0.02

The PoseBusters pass rate of 93.4% is close to the 95.0% of the reference molecule set. The median strain energy drops from 195 (MolCRAFT) to 25.96, while Vina Min is slightly worse (-6.98 vs. -7.21), so the gain lies in physical plausibility rather than in raw docking scores.
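The differences in the table can be recomputed from the listed values. The helper names below are illustrative. For metrics marked ↓, such as strain and Vina Min, a negative difference is an improvement and a positive one is a regression.

```python
def delta(ours, prev):
    """Absolute difference, ours minus previous best, rounded to two decimals."""
    return round(ours - prev, 2)


def relative_change(ours, prev):
    """Relative change in percent with respect to the previous best."""
    return round(100.0 * (ours - prev) / abs(prev), 1)


# (ours, previous best) as listed in the table
results = {
    "PB-Valid (%)": (93.4, 84.6),
    "Connected (%)": (98.6, 96.7),
    "Strain (median)": (25.96, 195.0),
    "Vina Min (avg)": (-6.98, -7.21),
    "SA": (0.75, 0.67),
    "Clash Ratio": (0.24, 0.26),
}

for name, (ours, prev) in results.items():
    print(name, delta(ours, prev), relative_change(ours, prev))
```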

Ablation Study

Evaluation Suite / Configuration	Key Metric	Description
CrossDock vs Dirac Endpoint EGF (Fig.2)	Training Window Width	Static endpoints cause instantaneous collapse of variance/support; dynamic endpoints smoothly expand across the timeline.
MolGenBench (In)	Pass Rate / Hit Recovery (↑)	EvoEGF-Mol achieves top-2 hit rates and fragment recovery across In/In(RM.)/Not protein splits.
Relationship with SLDM	Formal Comparison (Appx. E)	Proves SLDM is a special case of EGF under regularized static endpoints; EvoEGF is a generalized dynamic solution.
Fisher Calibration vs Manual Weighting	Multi-modal Balance	KL quadratic expansion provides natural weights \(\mathbf{G}^\mathbf{c}\) for each component, avoiding manual cross-modal tuning.

Key Findings

Geometric priors (e-geodesics) contribute more to "physically plausible" molecules than manually designed probability paths: median strain energy falls from 195 to 25.96 and the clash ratio from 0.26 to 0.24, indicating that generated conformations are closer to real chemistry.
Dynamic endpoints are critical: within the same e-geodesic framework, changing Dirac to \(\lambda\)-tightening targets significantly mitigates variance collapse without modifying architecture.
On real tasks in MolGenBench, the gain in scaffold recovery hit rates suggests the framework is not just overfitting constrained benchmarks like CrossDock but is valuable for actual drug candidate retrieval.

Highlights & Insights

Information geometry is a powerful tool for "truly" unifying continuous-discrete generation: linear interpolation in natural parameter space pulls Gaussian variance and Dirichlet concentration toward targets at the same pace, which is more elegant than "weighting two separate losses."
Dynamic endpoints are a general remedy for endpoint singularities: any generative flow aiming for Dirac targets encounters endpoint divergence. Replacing the endpoint with a time-dependent tightening distribution should transfer with little effort to other exponential family generation tasks (language, point clouds, graphs).
The progressive refinement paradigm reduces the training target to familiar regression and Dirichlet KL, fitting seamlessly into the BFN/PIF ecosystem with low engineering cost.
The "special case" analysis of SLDM is insightful: placing existing methods within a broader geometric framework clarifies boundary conditions.

Limitations & Future Work

The framework currently uses specific exponential family choices (isotropic Gaussian + Dirichlet); whether more complex families (e.g., Mixture Models, von Mises) are equally stable remains unverified.
The tightening speed \(\lambda\) is a global constant; different parts of a molecule (scaffold vs. substituents) might require adaptive tightening rates.
Experiments only cover two pocket datasets; harder drug scenarios like lipids, peptides, or covalent binding lack validation.
Inference still requires multi-step iterations; exploring few-step sampling via consistency model (RCM) ideas is worth investigating.

Comparison with Related Work

vs MolCRAFT / MolPilot (BFN/VOS systems): They still use separate Euclidean and categorical schedules requiring VOS-style noise schedule alignment; EvoEGF eliminates mismatch via a unified exponential family.
vs FLOWR / DynamicFlow (Flow Matching): They stitch continuous OT flows with discrete categorical FM; EvoEGF provides the geometrically intrinsic "product exponential family + e-geodesic" path and balances modalities via Fisher Information.
vs Fisher-Flow / SFM / E-Geodesic FM (Information Geometry): Previous works applied information geometry to purely discrete or simplex data; this work extends it to "continuous-discrete hybrid" structures and solves the singularity of original e-geodesics via dynamic endpoints.
vs SLDM: The authors prove SLDM is a special case of EGF under static regularized endpoints, providing a unified geometric perspective for recent "end-to-end straight-line diffusion" methods.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to evolve continuous and discrete SBDD variables on the same Fisher-Rao manifold and to systematize "dynamic endpoints."
Experimental Thoroughness: ⭐⭐⭐⭐ Validated on CrossDock and MolGenBench, though missing direct comparisons with some 2025 methods (e.g., ECloudGen) under identical conditions.
Writing Quality: ⭐⭐⭐⭐ Clear information geometry derivations; explicit natural parameters for each family; appendix clarifies boundary singularities and the relationship with SLDM.
Value: ⭐⭐⭐⭐⭐ Both the unified geometric framework and dynamic endpoints are generalizable tools for other generative problems; the PoseBusters improvement is highly practical for drug discovery.