Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule¶

Conference: ICML 2025
arXiv: 2505.07286
Code: https://github.com/AlgoMole/MolCRAFT
Area: Diffusion Models / Molecule Generation
Keywords: Structure-Based Drug Design, Bayesian Flow Network, Multimodal Noise Schedule, VLB Optimization, Molecular Geometry

TL;DR¶

Proposes VLB-Optimal Scheduling (VOS). The paper shows that when continuous 3D coordinates and discrete 2D topology are noised jointly, the variational lower bound (VLB) depends on the path taken by the joint noise schedule, and it then uses dynamic programming to search for the lowest-cost path. The resulting model, MolPilot, reaches a 95.9% PoseBusters pass rate on CrossDock, reported as state of the art for SBDD.

Background & Motivation¶

Background: Structure-Based Drug Design (SBDD) uses deep generative models, such as diffusion models and Bayesian Flow Networks (BFNs), to generate ligand molecules conditioned on protein structures.

Limitations of Prior Work: Existing models exhibit a significant drop in intramolecular validity (bond length/angle distribution) under strict PoseBusters testing, indicating inconsistency between the generated 3D geometry and the 2D molecular topology.

Key Challenge: Default noise scheduling causes the 3D modality (atom coordinates) to be denoised first, preventing the model from effectively using the 2D topology information to constrain the 3D structure. The "3D-dominant" probability path is suboptimal.

Goal: Design a noise schedule across modalities under which the 2D and 3D modalities can guide each other.

Key Insight: The VLB is path-dependent in the multimodal setting (unlike the single-modality setting), so the schedule path itself becomes something to optimize.

Core Idea: Multimodal VLB is no longer invariant to the shape of the schedule. The optimal inter-modality scheduling path can be found through decoupled timestep training followed by dynamic programming search.

Method¶

Overall Architecture¶

Built on the BFN framework, the pipeline has three steps: (1) Train a single model on a generalized loss, sampling the timesteps \((t_c, t_d)\) for the continuous and discrete modalities independently so that the model covers all noise combinations; (2) Estimate the 2D cost matrix \(C(t_c, t_d)\); (3) Use dynamic programming to search for the minimum cumulative cost path, yielding the optimal schedule \(\boldsymbol{\beta}^*\).
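
Step (2) can be pictured as averaging per-sample losses within each cell of a timestep grid. The sketch below is only an illustration under stated assumptions: `per_sample_loss` is a hypothetical stand-in for evaluating the trained model at given timesteps, `toy_loss` is an invented error surface, and the grid size and sample count are arbitrary choices rather than values from the paper.

```python
import numpy as np

def estimate_cost_matrix(per_sample_loss, n_grid=10, n_samples=64, seed=0):
    """Monte Carlo estimate of C(t_c, t_d) on an n_grid x n_grid grid.

    per_sample_loss is a hypothetical callable: it takes two arrays of
    timesteps (t_c, t_d) and returns an array of losses of the same shape.
    """
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_grid + 1)
    cost = np.zeros((n_grid, n_grid))
    for i in range(n_grid):
        for j in range(n_grid):
            # Draw timesteps uniformly inside grid cell (i, j).
            t_c = rng.uniform(edges[i], edges[i + 1], size=n_samples)
            t_d = rng.uniform(edges[j], edges[j + 1], size=n_samples)
            cost[i, j] = per_sample_loss(t_c, t_d).mean()
    return cost

def toy_loss(t_c, t_d):
    # Invented surface (assumption): error shrinks as either modality
    # becomes cleaner, with a penalty when the two are out of sync.
    return (1.0 - t_c) ** 2 + (1.0 - t_d) ** 2 + 0.5 * (t_c - t_d) ** 2

C = estimate_cost_matrix(toy_loss)
print(C.shape)
```

In this sketch rows index \(t_c\) and columns index \(t_d\); the resulting matrix is the kind of input a path search in step (3) would consume.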

Key Designs¶

Path-Dependent VLB (Eq.8-10):
- Function: Proves that multimodal VLB is dependent on the coupled path of the noise schedule.
- Mechanism: In the single-modality setting, \(\mathcal{L}^\infty\) only depends on \(\beta(0), \beta(1)\) (endpoint invariance). In the multimodal setting, it is formulated as a path integral \(\int_{\beta_c,\beta_d} \mathbb{E}\|\mathbf{x} - \tilde{\mathbf{x}}_\phi\|^2 d\boldsymbol{\beta}\), where different paths result in different VLBs.
- Design Motivation: This provides the theoretical foundation for designing the optimal schedule.
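
To make the path-dependence concrete, the following toy computation integrates two invented per-modality error functions along three different paths between the same endpoints. Everything here is an assumption for illustration: `err_c` and `err_d` are not the model's errors, the accuracies are rescaled to \([0,1]\), and constant factors are dropped. The point is only that a line integral of this form generally changes with the path.

```python
import numpy as np

def line_integral(path, err_c, err_d):
    """Midpoint-rule approximation of the line integral
    of (err_c d beta_c + err_d d beta_d) along a polyline.

    path: array of shape (n, 2) listing (beta_c, beta_d) points in order.
    """
    path = np.asarray(path, dtype=float)
    mid = 0.5 * (path[1:] + path[:-1])   # segment midpoints
    step = path[1:] - path[:-1]          # (d beta_c, d beta_d) per segment
    integrand = (err_c(mid[:, 0], mid[:, 1]) * step[:, 0]
                 + err_d(mid[:, 0], mid[:, 1]) * step[:, 1])
    return float(np.sum(integrand))

def err_c(beta_c, beta_d):
    # Toy assumption: coordinate error drops once topology is more accurate.
    return np.exp(-beta_d)

def err_d(beta_c, beta_d):
    # Toy assumption: topology error is constant.
    return np.ones_like(beta_d)

s = np.linspace(0.0, 1.0, 2001)
zeros, ones = np.zeros_like(s), np.ones_like(s)

# Three paths from (0, 0) to (1, 1).
c_first = np.vstack([np.column_stack([s, zeros]), np.column_stack([ones, s])])
d_first = np.vstack([np.column_stack([zeros, s]), np.column_stack([s, ones])])
diagonal = np.column_stack([s, s])

val_c_first = line_integral(c_first, err_c, err_d)
val_d_first = line_integral(d_first, err_c, err_d)
val_diag = line_integral(diagonal, err_c, err_d)
print(val_c_first, val_d_first, val_diag)
```

With these toy errors the three paths give different totals even though they share endpoints, which is the property that makes the schedule path worth searching over. In a single-modality setting there is only one axis, so no such choice of path exists.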
Generalized Loss + Decoupled Timestep Training (Eq.14):
- Function: Trains a single model to support arbitrary combinations of \((t_c, t_d)\).
- Mechanism: \(\dot{\mathcal{L}}^\infty = \frac{1}{2}\int_0^1\int_0^1 \mathbb{E}\|\mathbf{x} - \tilde{\mathbf{x}}_\phi(\boldsymbol{\theta}, \boldsymbol{t})\|^2 dt_c dt_d\), where \(t_c, t_d \sim U(0,1)\) are sampled independently during training.
- Design Motivation: Enables evaluation of the loss over the entire \([0,1]^2\) plane with a single training session, avoiding the need to retrain for each individual schedule.
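
A minimal sketch of the decoupled sampling idea, assuming a hypothetical squared-error surface `toy_sq_error` in place of the network's actual reconstruction error. It estimates the double integral in the generalized loss by Monte Carlo, drawing \(t_c\) and \(t_d\) independently for each example.

```python
import numpy as np

def sample_decoupled_timesteps(batch_size, rng):
    """Draw t_c and t_d independently from U(0, 1), one pair per example."""
    t_c = rng.uniform(0.0, 1.0, size=batch_size)
    t_d = rng.uniform(0.0, 1.0, size=batch_size)
    return t_c, t_d

def generalized_loss_estimate(sq_error, batch_size, rng):
    """Monte Carlo estimate of 0.5 * double integral of sq_error over [0, 1]^2."""
    t_c, t_d = sample_decoupled_timesteps(batch_size, rng)
    return 0.5 * float(np.mean(sq_error(t_c, t_d)))

def toy_sq_error(t_c, t_d):
    # Hypothetical error surface (assumption), chosen so the exact
    # value of the integral is easy to compute by hand: 0.5 * (1/3 + 1/3).
    return (1.0 - t_c) ** 2 + (1.0 - t_d) ** 2

rng = np.random.default_rng(0)
estimate = generalized_loss_estimate(toy_sq_error, 200_000, rng)
print(estimate)
```

By contrast, sampling one shared timestep and setting \(t_c = t_d\) would only ever visit the diagonal of the \([0,1]^2\) plane, so off-diagonal noise combinations would never be seen during training.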
Dynamic Programming Search for Optimal Path (Algorithm 1):
- Function: Finds the minimum cumulative cost path on a discretized grid of \((t_c, t_d)\).
- Mechanism: \(J(t_c,t_d) = \min_{(\epsilon_c,\epsilon_d)}\bigl(J(t_c-\epsilon_c, t_d-\epsilon_d) + \boldsymbol{\alpha}C(t_c,t_d)\bigr)\), evaluated over the grid from \((0,0)\) to \((1,1)\); the optimal path is recovered by backtracking from \(J(1,1)\).
- Design Motivation: Exhaustive search over all paths is intractable; DP finds the minimum-cost path on the discretized grid in time polynomial in the grid size.
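
The recurrence above can be sketched as a standard grid dynamic program. This is not the paper's Algorithm 1 but a simplified version under assumptions made here: the allowed moves advance \(t_c\), \(t_d\), or both by exactly one grid cell, and `alpha` is treated as a single scalar weight on the cell cost.

```python
import numpy as np

def optimal_schedule_path(cost, alpha=1.0):
    """Minimum cumulative-cost monotone path from cell (0, 0) to (n-1, m-1).

    Assumptions (not from the paper): moves advance t_c, t_d, or both by
    one grid cell, and alpha is a scalar weight on the cell cost.
    """
    cost = np.asarray(cost, dtype=float)
    n, m = cost.shape
    J = np.full((n, m), np.inf)
    parent = {}
    J[0, 0] = alpha * cost[0, 0]
    for i in range(n):
        for j in range(m):
            for di, dj in ((1, 0), (0, 1), (1, 1)):
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue  # predecessor lies outside the grid
                cand = J[pi, pj] + alpha * cost[i, j]
                if cand < J[i, j]:
                    J[i, j] = cand
                    parent[(i, j)] = (pi, pj)
    # Backtrack from the end cell to recover the path.
    path = [(n - 1, m - 1)]
    while path[-1] != (0, 0):
        path.append(parent[path[-1]])
    path.reverse()
    return path, float(J[-1, -1])

# Toy cost matrix: rows index t_c, columns index t_d.
toy_cost = [[1, 9, 9],
            [1, 9, 9],
            [1, 1, 1]]
path, total = optimal_schedule_path(toy_cost)
print(path, total)
```

Each cell is visited once and checks three predecessors, so the work is proportional to the number of grid cells. Because this sketch charges a cost per visited cell, it slightly favors diagonal moves; weighting each move by its step length would be one way to remove that bias.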

Intuition of the Optimal Schedule¶

The discovered optimal path exhibits a two-stage characteristic: (1) Shape-driven drafting phase (\(t<0.3\)): generates coarse 3D atom positions first; (2) Topology-driven fitting phase (\(t>0.8\)): refines the 3D conformation conditioned on the 2D molecular topology.

Key Experimental Results¶

Main Results¶

Method	PB-Valid ↑	Vina Dock (kcal/mol) ↓	scRMSD < 2 Å ↑
TargetDiff	50.5%	-7.80	37.1%
DecompDiff	71.7%	-7.03	24.2%
MolCRAFT (Default Schedule)	84.3%	-7.90	38.7%
MolPilot (VOS)	95.9%	-8.08	34.0%

Ablation Study¶

Schedule Type	PB-Valid ↑	Description
Default Schedule	84.3%	3D-dominant path
Linear Interpolation	~90%	Uniform coupling
VOS Optimal	95.9%	Two-stage path

Key Findings¶

VOS raises the PoseBusters pass rate by 11.6 percentage points over the default schedule (84.3% to 95.9%).
Bond length/angle distribution plots demonstrate that MolPilot yields predictions closest to the ground-truth distribution.
Performance remains robust in OOD (PoseBusters) evaluation.

Highlights & Insights¶

Theory-Driven Schedule Design: The schedule is derived from the path-dependency of the VLB rather than from heuristic trial and error.
Single Training, Multi-Schedule Evaluation: Because timesteps are decoupled in the generalized loss, one trained model can score every candidate schedule without retraining.
Transferable Concept: The VOS methodology could in principle be applied to noise schedule design in other multimodal generative tasks, for example scheduling the text and image modalities in joint text-to-image generation.

vs TargetDiff/DecompDiff (Diffusion): These methods use fixed noise schedules and do not account for the schedule coupling between modalities. MolPilot identifies the optimal coupling path via VOS.
vs MolCRAFT (BFN): MolPilot is built directly on the MolCRAFT framework; VOS can be viewed as a noise schedule optimization for BFNs.
vs EquiFM: EquiFM designs mixing paths using information-theoretic heuristics and lacks theoretical optimality guarantees, whereas VOS is grounded in VLB optimality.

Limitations & Future Work¶

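The cost of the DP search grows exponentially with the number of modalities: a grid with \(N\) points per axis has \(N^k\) cells for \(k\) modalities, so scaling beyond two modalities would require approximate algorithms.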
The interpretability of the optimal schedule still warrants further theoretical analysis (e.g., why the two-stage path is optimal).
The success rate of RMSD < 2Å on molecular docking tasks is only 44%, leaving substantial room for improvement.
Training with the generalized loss (decoupled timesteps) may result in slower convergence compared to standard training.
The efficacy of VOS remains unverified on other molecular generation tasks such as protein design.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The path-dependency of multimodal VLB is a significant theoretical discovery.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated extensively on CrossDock + PoseBusters OOD.
Writing Quality: ⭐⭐⭐⭐ Well-integrated theory and experiments.
Value: ⭐⭐⭐⭐⭐ Broad implications for SBDD and multimodal generation.