# Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge
- Conference: ICLR 2026
- arXiv: 2602.07588
- Code: None
- Area: Medical Imaging
- Keywords: molecular dynamics, trajectory generation, variational bridge matching, pretraining, reinforcement learning fine-tuning
## TL;DR
PVB (Pretrained Variational Bridge) unifies the training objectives of single-structure pretraining and paired-trajectory fine-tuning via an encoder-decoder architecture combined with augmented bridge matching, enabling cross-domain biomolecular trajectory generation. It further accelerates protein–ligand holo-state exploration through RL fine-tuning based on adjoint matching.
## Background & Motivation
Background: Molecular dynamics (MD) simulation is a fundamental tool for characterizing molecular behavior, yet its computational cost is prohibitively high, requiring femtosecond-level time steps. Recent deep generative models have begun learning dynamics at coarsened time scales to generate trajectories efficiently.
Limitations of Prior Work: Existing methods suffer from three key issues: (1) insufficient generalization across molecular systems; (2) limited molecular diversity in trajectory data, preventing full exploitation of structural information; and (3) a predominant focus on single-molecule simulation, with little attention to multi-molecule systems such as protein–ligand complexes.
Key Challenge: The most closely related prior work, UniSim, achieves cross-domain generalization via 3D molecular pretraining; however, a training objective mismatch exists between pretraining (unconditional generation of single structures \(x\)) and fine-tuning (conditional generation of trajectory pairs \((x_t, x_{t+\tau})\)), leading to insufficient transfer of pretrained knowledge.
Goals: (1) design a unified training framework in which pretraining and fine-tuning share the same generative objective; (2) apply the generated trajectories to rapid holo-state exploration in protein–ligand docking.
Key Insight: A latent variable \(\mathbf{Y}_0\) is introduced to model the generative process as a Markov chain \(\mathbf{X}_0 \to \mathbf{Y}_0 \to \mathbf{Y}_1\). A variational encoder maps the initial structure to a noisy latent space, and an augmented bridge matching decoder transports it to the target state.
Core Idea: A unified encoder-decoder framework with augmented bridge matching eliminates the objective mismatch between pretraining and fine-tuning, while adjoint-matching-based RL fine-tuning accelerates holo-state exploration.
## Method
### Overall Architecture
PVB adopts an encoder-decoder architecture. The input is a molecular conformation \((z, C, x)\) (atomic numbers, covalent bonds, 3D coordinates), and the output is the conformation at the next time step. Training proceeds in three stages (a data-pairing sketch follows the list):

1. Pretraining: trained on large-scale high-resolution single-structure data, setting \((\mathbf{X}_0, \mathbf{Y}_1) = (x, x)\).
2. Fine-tuning: fine-tuned on paired MD trajectory data, setting \((\mathbf{X}_0, \mathbf{Y}_1) = (x_t, x_{t+\tau})\).
3. RL fine-tuning (optional): reinforcement learning via adjoint matching guides trajectories toward the holo state.
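A minimal sketch of how the three training stages differ only in how the pair \((\mathbf{X}_0, \mathbf{Y}_1)\) is constructed. The `Conformation` container and helper names below are hypothetical illustrations (the paper releases no code), not the authors' interfaces:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Conformation:
    z: np.ndarray  # (N,) atomic numbers
    C: np.ndarray  # (B, 2) covalent bond index pairs
    x: np.ndarray  # (N, 3) Cartesian coordinates, in angstroms

def make_pretraining_pair(structure: Conformation):
    # Stage 1: single-structure pretraining sets (X0, Y1) = (x, x).
    return structure.x, structure.x

def make_finetuning_pair(traj: list[Conformation], t: int, lag: int):
    # Stage 2: MD fine-tuning uses the lagged pair (X0, Y1) = (x_t, x_{t+tau}).
    return traj[t].x, traj[t + lag].x
```

Because both stages emit pairs of the same form, the same encoder-decoder objective applies unchanged, which is the point of the unified design.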
### Key Designs
- Variational Encoder:
- Function: Maps the initial state \(\mathbf{X}_0\) to the latent variable \(\mathbf{Y}_0\).
- Mechanism: The prior is set as \(q_e(d\mathbf{Y}_0|\mathbf{X}_0) \coloneqq \mathcal{N}(x_0, \sigma_e^2 I)\) with \(\sigma_e = \sqrt{0.5}\) Å. A neural network \(\varphi_e\) learns the posterior distribution \(p_e\) by minimizing the KL divergence: \(\mathcal{L}_{KL} = -\frac{1}{2}\mathbb{E}[1 + \log \mathbf{V} - 2\log\sigma_e - \frac{\mathbf{V}}{\sigma_e^2}]\).
- Design Motivation: The primary purpose of introducing a latent variable is to prevent the conditional distribution from degenerating into a Dirac measure during single-structure pretraining. A sufficiently large \(\sigma_e\) ensures that the encoding process retains adequate structural information while avoiding trivial decoder collapse.
- Augmented Bridge Matching Decoder:
- Function: Generates the target state \(\mathbf{Y}_1\) from the latent variable \(\mathbf{Y}_0\) while preserving the coupling between \((\mathbf{Y}_0, \mathbf{Y}_1)\).
- Mechanism: A Brownian bridge path measure is defined, and a vector field \(\varphi_d\) is trained to minimize \(\mathcal{L}_{ABM} = \mathbb{E}_{t, (\mathbf{Y}_0, \mathbf{Y}_1)}[\|\varphi_d(t, \mathbf{Y}_0, \mathbf{Y}_t) - \frac{\mathbf{Y}_1 - \mathbf{Y}_t}{1-t}\|^2]\). At inference, the target is generated by simulating the non-Markovian SDE \(d\mathbf{Y}_t = \varphi_d^*(t, \mathbf{Y}_0, \mathbf{Y}_t)dt + \sigma d\mathbf{B}_t\).
- Design Motivation: Augmented bridge matching ensures that the endpoint coupling \(\Pi_{0,1}\) is exactly preserved throughout the generative process, which is critical for faithfully reproducing MD dynamical properties. Proposition 1 guarantees that the encoder-decoder composition provides an unbiased estimate of the target conditional distribution. (A combined training-step sketch follows this list.)
- Adjoint Matching-Based RL Fine-tuning:
- Function: Introduces a control vector field \(u\) to steer the generative distribution so that trajectories rapidly approach the protein–ligand holo state.
- Mechanism: The KL-regularized objective \(\max_u \mathbb{E}[r(\mathbf{Y}_1) - \frac{\beta}{2}\int_0^1 \|u\|^2 dt]\) is optimized. Via Girsanov's theorem and adjoint matching, the stochastic optimal control (SOC) problem is converted to the regression loss \(\mathcal{L}_{adj} = \mathbb{E}[\|u(t, \mathbf{Y}_0, \mathbf{Y}_t) + \sigma\tilde{a}(t)\|^2]\), where the reduced (lean) adjoint state \(\tilde{a}(t)\) is obtained by integrating an adjoint ODE backward in time from the terminal reward gradient.
- Design Motivation: Directly simulating from the apo to the holo state requires millisecond-level time scales, which is computationally intractable. RL fine-tuning steers the generative distribution via an explicit reward function, bypassing inefficient local exploration. Adjoint matching, rather than direct gradient accumulation, enables memory-efficient training.
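The encoder and decoder losses combine into a single training step. Below is a minimal PyTorch sketch under stated assumptions: `phi_e` predicts a per-atom log-variance with the posterior mean fixed at \(x_0\) (under which the closed-form Gaussian KL reduces exactly to the \(\mathcal{L}_{KL}\) above), `phi_d` is the bridge vector field, and \(\mathbf{Y}_t\) is drawn from the standard Brownian-bridge marginal \(\mathcal{N}((1-t)\mathbf{Y}_0 + t\mathbf{Y}_1,\; \sigma^2 t(1-t) I)\). Batching conventions and the diffusion coefficient \(\sigma\) are illustrative, not the paper's exact settings.

```python
import math
import torch

def pvb_training_step(phi_e, phi_d, x0, y1, sigma_e=math.sqrt(0.5), sigma=1.0,
                      w_kl=1.0, w_abm=1.0):
    """One step of the unified objective w_KL * L_KL + w_ABM * L_ABM.

    Schematic: x0, y1 are (batch, atoms, 3) coordinate tensors; phi_e, phi_d
    are user-supplied networks. In pretraining y1 == x0; in fine-tuning y1
    is the lagged frame x_{t+tau}.
    """
    # Variational encoder: posterior N(x0, V) vs. prior N(x0, sigma_e^2 I).
    log_v = phi_e(x0)                          # per-atom log-variance
    v = log_v.exp()
    y0 = x0 + v.sqrt() * torch.randn_like(x0)  # sample Y0 ~ p_e(. | X0)
    # Closed-form KL for equal means; matches L_KL in the text.
    kl = -0.5 * (1.0 + log_v - 2.0 * math.log(sigma_e) - v / sigma_e**2)

    # Augmented bridge matching: regress onto the Brownian-bridge drift.
    t = torch.rand(x0.shape[0], 1, 1)          # bridge time in [0, 1)
    y_t = (1 - t) * y0 + t * y1 \
        + sigma * torch.sqrt(t * (1 - t)) * torch.randn_like(x0)
    target = (y1 - y_t) / (1 - t)              # (Y1 - Yt) / (1 - t)
    abm = (phi_d(t, y0, y_t) - target).pow(2)

    return w_kl * kl.mean() + w_abm * abm.mean()
```

At inference, \(\mathbf{Y}_0\) is sampled from the prior \(\mathcal{N}(x_0, \sigma_e^2 I)\) and the SDE \(d\mathbf{Y}_t = \varphi_d^*(t, \mathbf{Y}_0, \mathbf{Y}_t)dt + \sigma d\mathbf{B}_t\) is integrated, e.g. with Euler–Maruyama steps.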
### Loss & Training
- Pretraining + Fine-tuning: \(\mathcal{L} = w_{KL} \cdot \mathcal{L}_{KL} + w_{ABM} \cdot \mathcal{L}_{ABM}\)
- RL Fine-tuning: \(\mathcal{L}_{adj}\), with reward function \(r(\mathbf{X}) = -\text{rmsd}(\mathbf{X}, \mathbf{X}_{ref})\), where \(\mathbf{X}_{ref}\) is the known holo structure
- The control vector field is reparametrized as \(u = \frac{1}{\sigma}(\varphi_d^u - \varphi_d^*)\), requiring no additional network (see the sketch after this list)
- Pretraining data: PCQM4Mv2 + ANI-1x (small molecules), PDB (proteins), PDBBind2020 (protein–ligand complexes).
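A schematic sketch of one adjoint-matching update, under assumptions: an Euler–Maruyama rollout, a simple backward-Euler discretization of the lean adjoint ODE, and a `u_net` callable that in practice can be realized as \(\frac{1}{\sigma}(\varphi_d^u - \varphi_d^*)\) per the reparametrization above. Step counts and interfaces are illustrative, not the paper's exact recipe.

```python
import torch

def adjoint_matching_step(phi_d_star, u_net, y0, reward, sigma=1.0, n_steps=20):
    """One RL fine-tuning step. Forward: roll out the controlled bridge SDE.
    Backward: integrate the lean adjoint ODE from a(1) = -grad r(Y1), then
    regress the control onto -sigma * a(t), i.e. minimize ||u + sigma*a||^2.
    `reward` maps a (batch, atoms, 3) tensor to per-sample scalar rewards.
    """
    dt = 1.0 / n_steps
    ys, y = [y0], y0
    with torch.no_grad():                      # the rollout needs no graph
        for k in range(n_steps):
            t = k * dt
            drift = phi_d_star(t, y0, y) + sigma * u_net(t, y0, y)
            y = y + drift * dt + sigma * dt**0.5 * torch.randn_like(y)
            ys.append(y)

    # Terminal condition of the lean adjoint ODE: a(1) = -grad r(Y1).
    y1 = ys[-1].clone().requires_grad_(True)
    a = -torch.autograd.grad(reward(y1).sum(), y1)[0]

    loss = 0.0
    for k in reversed(range(n_steps)):
        t = k * dt
        yk = ys[k].clone().requires_grad_(True)
        b = phi_d_star(t, y0, yk)              # base (pretrained) drift
        # Backward-Euler step of da/dt = -(grad_y b)^T a via a vector-Jacobian product.
        vjp = torch.autograd.grad(b, yk, grad_outputs=a)[0]
        a = (a + vjp * dt).detach()
        # Adjoint matching regression: ||u(t, Y0, Yt) + sigma * a(t)||^2.
        loss = loss + (u_net(t, y0, ys[k]) + sigma * a).pow(2).mean()
    return loss / n_steps
```

For holo-state exploration, `reward` would implement \(r(\mathbf{X}) = -\text{rmsd}(\mathbf{X}, \mathbf{X}_{ref})\) against the known holo reference. Note that only `u_net` receives gradients; the adjoint state is detached, which is what makes the method memory-efficient compared with backpropagating through the whole rollout.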
## Key Experimental Results
### Main Results (ATLAS Protein Trajectory Generation)
| Model | JSD-Rg ↓ | JSD-TIC ↓ | JSD-MSM ↓ | VAL-CA ↑ | Decorr-TIC0 ↑ |
|---|---|---|---|---|---|
| MD (10ns) | 0.379 | 0.684 | 0.596 | 0.926 | 0.000 |
| ITO | 0.792 | 0.400 | 0.469 | 0.001 | 0.714 |
| MDGEN | 0.493 | 0.400 | 0.463 | 0.098 | 0.857 |
| UniSim | 0.538 | 0.372 | 0.344 | 0.106 | 0.786 |
| PVB | 0.457 | 0.371 | 0.333 | 0.975 | 0.929 |
### Protein–Ligand Complex Results (MISATO)
| Model | EMD-ligand ↓ | EMD-CoM ↓ | RMSE-CONTACT ↓ |
|---|---|---|---|
| ITO | 0.494 | 0.479 | 0.987 |
| UniSim | 0.196 | 0.128 | 0.049 |
| PVB | 0.133 | 0.089 | 0.055 |
### Key Findings
- PVB substantially outperforms all baselines on VAL-CA (conformational validity) with a score of 0.975, compared to 0.106 for UniSim, indicating highly physically plausible generated conformations.
- On slow dynamical modes (TIC, MSM), PVB surpasses 10 ns MD simulation and achieves the highest decorrelation rate (0.929).
- On protein–ligand complexes, PVB reduces the ligand distributional error (EMD-ligand) by 32% relative to UniSim (0.196 → 0.133).
- After RL fine-tuning, the median ligand RMSD in protein–ligand docking decreases by approximately 18% compared to Vina+PVB (without RL).
## Highlights & Insights
- Unified Pretraining–Fine-tuning Framework: The objective mismatch between single-structure pretraining and paired trajectory fine-tuning is elegantly resolved via latent variables and augmented bridge matching. This idea is transferable to other generative models requiring cross-task transfer.
- Substantial Advantage in VAL-CA: Nearly all conformations generated by PVB are free of bond breaks or atomic clashes (97.5% valid), far exceeding ITO's 0.1%, attributable to the encoder-decoder architecture's effective preservation of structural information.
- Application of Adjoint Matching: Introducing SOC theory into RL fine-tuning for molecular trajectory generation enables memory-efficient training. This paradigm is generalizable to other diffusion or flow matching models that require guided generation.
## Limitations & Future Work
- Only heavy atoms are considered; hydrogen atoms and solvent effects are not modeled.
- RL fine-tuning requires a known holo state as a reward signal, limiting applicability in truly blind docking scenarios.
- The scale of pretraining data remains limited, particularly for protein–ligand complex structures.
- The temporal resolution of generated trajectories is constrained by the coarsened time step \(\tau\).
## Related Work & Insights
- vs. UniSim: Both employ pretraining strategies, but UniSim's pretraining only produces an atomic representation model, whereas PVB integrates the generative model into pretraining and achieves objective consistency via the latent space.
- vs. AlphaFlow: AlphaFlow generates i.i.d. protein conformation samples and cannot preserve temporal dependencies, thus precluding estimation of dynamical observables.
- vs. ITO/MDGEN: Both are trained from scratch and lack cross-domain generalization; ITO additionally exhibits extremely low conformational validity.
## Rating
- Novelty: ⭐⭐⭐⭐ The unified encoder-decoder + augmented bridge matching framework is elegantly designed; adjoint matching-based RL fine-tuning is a novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers protein monomers (ATLAS + mdCATH) and protein–ligand complexes (MISATO + PDBBind) with comprehensive evaluation metrics.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous and the framework is described clearly.
- Value: ⭐⭐⭐⭐ Provides a unified and efficient solution for cross-domain molecular dynamics simulation; the substantial improvement in conformational validity has practical application value.