Manipulating 3D Molecules in a Fixed-Dimensional E(3)-Equivariant Latent Space

Conference: NeurIPS 2025
arXiv: 2506.00771
Code: GitHub
Area: Medical Imaging
Keywords: Molecular Generation, VAE, E(3)-Equivariance, Latent Space Manipulation, Bayesian Flow Networks

TL;DR

This paper proposes MolFLAE, a 3D molecular variational autoencoder that learns a fixed-dimensional, E(3)-equivariant latent space. By introducing learnable virtual nodes and a Bayesian Flow Network (BFN) decoder, MolFLAE enables zero-shot molecular editing — including atom-count editing, structural reconstruction, and property interpolation — and demonstrates practical utility in drug optimization targeting the human glucocorticoid receptor (hGR).

Background & Motivation

Medicinal chemists routinely consider 3D molecular structure when optimizing drug candidates, seeking molecules with different scaffolds that nonetheless preserve key features such as shape, pharmacophore, and chemical properties. Existing deep learning approaches decompose 3D molecular editing into a collection of narrowly defined subtasks (e.g., molecular inpainting, property-guided optimization), typically relying on task-specific supervision and architectures, which limits flexibility and generalizability.

The central challenge is that 3D molecules possess a variable number of atoms, permutation invariance over atomic orderings, and SE(3)-equivariance under spatial rotations and translations. Most existing 3D generative models operate in the product space over individual atoms or functional groups, resulting in latent representations of variable dimensionality — which directly precludes vector-level operations such as interpolation and extrapolation, rendering zero-shot molecular editing infeasible or highly nontrivial.

Inspired by successful examples of latent space navigation in the image domain (e.g., style transfer, image editing), the authors argue that constructing a fixed-dimensional, well-structured latent space for 3D molecules would enable flexible zero-shot molecular manipulation.

Method

Overall Architecture

MolFLAE adopts a VAE architecture: the encoder maps variable-size 3D molecules to a fixed-dimensional E(3)-equivariant latent space, and the decoder uses a Bayesian Flow Network (BFN) to reconstruct full molecular structures from the latent encoding. The training objective is:

\[\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{reg}}\]

Key Designs

Learnable Virtual Node Encoder: The core innovation is the introduction of \(N_Z\) learnable virtual nodes appended to the molecular point cloud. \(N_Z\) is set to 10; at least four non-coplanar points are needed to form a non-degenerate simplex in 3D space, which is what allows the virtual nodes to encode chirality information. These virtual nodes are jointly updated with real atoms through an E(3)-equivariant neural network. After updating, the embeddings of real atoms are discarded, and only the virtual node embeddings are retained as the fixed-length latent code:

\[(\_ , [\mathbf{z}_x, \mathbf{z}_h]) = \boldsymbol{\phi}_\theta([\mathbf{x}_M, \mathbf{v}_M], [\mathbf{x}_Z, \mathbf{v}_Z])\]

Here \(\mathbf{z}_x \in \mathbb{R}^{N_Z \times 3}\) is the E(3)-equivariant spatial component, and \(\mathbf{z}_h \in \mathbb{R}^{N_Z \times D_f}\) is the E(3)-invariant feature component. This design partially decouples spatial and semantic features — \(\mathbf{z}_x\) encodes shape and orientation, while \(\mathbf{z}_h\) encodes substructure composition.
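The virtual-node idea can be illustrated with a toy NumPy sketch. It is not the authors' network: the simplified EGNN-style layer, the random weights from the hypothetical helper `init_params`, and the choice to initialise virtual-node coordinates at the molecule's centroid are all assumptions made so that the example is small and exactly equivariant. What it shows is that the output shape depends only on \(N_Z\), not on the number of atoms.

```python
import numpy as np

def egnn_layer(x, h, w_msg, w_coord, w_node):
    """One simplified EGNN-style update on a fully connected point cloud.

    x: (N, 3) coordinates (equivariant); h: (N, D) features (invariant).
    """
    n, d = h.shape
    diff = x[:, None, :] - x[None, :, :]               # (N, N, 3), rotates with the input
    dist2 = (diff ** 2).sum(axis=-1, keepdims=True)    # (N, N, 1), invariant
    pair = np.concatenate(
        [np.broadcast_to(h[:, None, :], (n, n, d)),
         np.broadcast_to(h[None, :, :], (n, n, d)),
         dist2],
        axis=-1,
    )                                                  # (N, N, 2D + 1)
    msg = np.tanh(pair @ w_msg)                        # (N, N, D) invariant messages
    gate = np.tanh(msg @ w_coord)                      # (N, N, 1) invariant scalars
    x_new = x + (diff * gate).mean(axis=1)             # equivariant coordinate update
    h_new = h + np.tanh(np.concatenate([h, msg.mean(axis=1)], axis=-1) @ w_node)
    return x_new, h_new

def init_params(d, n_layers, rng):
    """Random weights standing in for trained parameters (hypothetical helper)."""
    return [
        (0.1 * rng.standard_normal((2 * d + 1, d)),
         0.1 * rng.standard_normal((d, 1)),
         0.1 * rng.standard_normal((2 * d, d)))
        for _ in range(n_layers)
    ]

def encode(x_mol, h_mol, h_virt, params):
    """Append virtual nodes, update all nodes jointly, keep only the virtual nodes."""
    n_z = h_virt.shape[0]
    # Assumption of this sketch: virtual nodes start at the molecule's centroid.
    centroid = x_mol.mean(axis=0, keepdims=True)
    x = np.concatenate([x_mol, np.repeat(centroid, n_z, axis=0)], axis=0)
    h = np.concatenate([h_mol, h_virt], axis=0)
    for w_msg, w_coord, w_node in params:
        x, h = egnn_layer(x, h, w_msg, w_coord, w_node)
    return x[-n_z:], h[-n_z:]                          # (z_x, z_h); atom embeddings dropped

rng = np.random.default_rng(0)
D, N_Z = 8, 10
params = init_params(D, n_layers=2, rng=rng)
h_virt = rng.standard_normal((N_Z, D))                 # stands in for learnable features
for n_atoms in (5, 17):
    z_x, z_h = encode(rng.standard_normal((n_atoms, 3)),
                      rng.standard_normal((n_atoms, D)), h_virt, params)
    print(n_atoms, z_x.shape, z_h.shape)
```

For both molecule sizes the latent code has shapes `(10, 3)` and `(10, 8)`. Because the layer only uses relative positions and squared distances, rotating or translating the input moves `z_x` in the same way and leaves `z_h` unchanged.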

VAE Regularization: KL divergence regularization is imposed on the latent space to ensure smoothness and continuity:

\[\mathcal{L}_{\text{reg}} = \text{KL}\left(\mathcal{N}([\boldsymbol{\mu}_x, \boldsymbol{\mu}_h], [\boldsymbol{\sigma}_x^2, \boldsymbol{\sigma}_h^2]) \| \mathcal{N}([0,0], [\text{var}_x, \text{var}_h]\mathbf{I})\right)\]

where \(\boldsymbol{\mu}_x = \mathbf{z}_x\) and \([\boldsymbol{\sigma}_x^2, \boldsymbol{\mu}_h, \boldsymbol{\sigma}_h^2] = \text{Linear}(\mathbf{z}_h)\). This encourages latent space smoothness and facilitates interpolation between molecules.
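For diagonal Gaussians this KL term has a standard closed form, sketched below. The function names and the way the spatial and feature parts are passed in are assumptions for illustration; only the formula itself, \(\tfrac{1}{2}\sum\left[\log(\text{var}/\sigma^2) + (\sigma^2 + \mu^2)/\text{var} - 1\right]\), is the textbook result for a zero-mean prior with variance \(\text{var}\).

```python
import numpy as np

def kl_diag_gaussian_to_prior(mu, var, prior_var):
    """KL( N(mu, diag(var)) || N(0, prior_var * I) ), summed over all entries."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * np.sum(np.log(prior_var / var) + (var + mu ** 2) / prior_var - 1.0)

def reg_loss(mu_x, var_x, mu_h, var_h, prior_var_x, prior_var_h):
    """Sum of the spatial and feature KL terms, each with its own prior variance."""
    return (kl_diag_gaussian_to_prior(mu_x, var_x, prior_var_x)
            + kl_diag_gaussian_to_prior(mu_h, var_h, prior_var_h))

# Toy example: 10 virtual nodes, 3 spatial and 4 feature dimensions (hypothetical sizes).
rng = np.random.default_rng(0)
loss = reg_loss(
    mu_x=rng.standard_normal((10, 3)), var_x=np.full((10, 3), 0.5),
    mu_h=rng.standard_normal((10, 4)), var_h=np.full((10, 4), 0.5),
    prior_var_x=4.0, prior_var_h=1.0,
)
print(loss)
```

The term vanishes only when the posterior matches the prior exactly, so a larger prior variance for the spatial part tolerates latent coordinates that spread further from the origin.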

Bayesian Flow Network (BFN) Decoder: BFN is selected as the decoder because it handles both continuous data (atomic coordinates, modeled with Gaussian distributions) and discrete data (atom types, modeled with categorical distributions) within a single framework. A Gaussian sender distribution \(p_S(\mathbf{y}^x | \mathbf{x}_M; \alpha) = \mathcal{N}(\mathbf{y}^x | \mathbf{x}_M, \alpha^{-1}\mathbf{I})\) is used for coordinates, and a transformed categorical sender distribution is used for atom types. The reconstruction loss is the sum of the coordinate loss and the atom-type loss: \(\mathcal{L}_{\text{recon}} = \mathcal{L}_x^n + \mathcal{L}_v^n\).
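To make the coordinate channel concrete, the sketch below draws noisy messages from the Gaussian sender above and applies the conjugate Gaussian update used for continuous data in the general BFN framework. It is a minimal illustration under assumptions: a constant accuracy `alpha`, a standard-normal starting belief, and no neural network or atom-type channel. It is not the training loop of MolFLAE.

```python
import numpy as np

def sender_sample(x, alpha, rng):
    """Draw y ~ N(x, alpha^{-1} I): the true coordinates plus Gaussian noise."""
    return x + rng.standard_normal(x.shape) / np.sqrt(alpha)

def bayesian_update(mu, rho, y, alpha):
    """Conjugate update of a Gaussian belief N(mu, rho^{-1} I) after observing y."""
    rho_new = rho + alpha                      # precisions add
    mu_new = (rho * mu + alpha * y) / rho_new  # precision-weighted mean
    return mu_new, rho_new

rng = np.random.default_rng(0)
x_true = rng.standard_normal((5, 3))           # toy "atomic coordinates"
mu, rho = np.zeros_like(x_true), 1.0           # assumed prior belief N(0, I)
alpha = 1.0                                    # assumed constant accuracy per step
for _ in range(2000):
    y = sender_sample(x_true, alpha, rng)
    mu, rho = bayesian_update(mu, rho, y, alpha)
print(np.abs(mu - x_true).max())               # shrinks as more messages arrive
```

Each message sharpens the belief about the coordinates; in a full BFN the network reads these belief parameters and predicts the data, and the loss compares the sender with the network's predicted receiver distribution.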

Loss & Training

Training datasets: QM9 (134K small molecules, ≤9 heavy atoms), GEOM-Drugs (430K drug molecules), ZINC-9M (9.3M molecules)
Hydrogen atoms are treated explicitly for QM9 and GEOM-Drugs, and implicitly for ZINC-9M
The number of atoms in generated molecules is controlled by sampling from the atom-count prior of the training set
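Sampling from the atom-count prior can be sketched as drawing from the empirical histogram of molecule sizes in the training set. The function names and the toy size list below are hypothetical; the source only states that such a prior is used.

```python
import numpy as np

def fit_atom_count_prior(atom_counts):
    """Empirical distribution over molecule sizes seen in training."""
    values, freq = np.unique(np.asarray(atom_counts), return_counts=True)
    return values, freq / freq.sum()

def sample_atom_counts(values, probs, n_samples, rng):
    """Draw molecule sizes in proportion to their training-set frequency."""
    return rng.choice(values, size=n_samples, p=probs)

# Toy stand-in for the sizes of training molecules.
train_sizes = [9, 12, 12, 15, 15, 15, 18, 21]
values, probs = fit_atom_count_prior(train_sizes)
rng = np.random.default_rng(0)
sizes = sample_atom_counts(values, probs, n_samples=5, rng=rng)
print(sizes)
```

Each sampled size then fixes how many atoms the decoder generates for that molecule.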

Key Experimental Results

Main Results: Unconditional 3D Molecular Generation

Method	QM9 Atom Stability (%)	QM9 Mol. Stability (%)	QM9 Validity (%)	GEOM-Drugs Validity (%)
EDM	98.7	82.0	91.9	92.6
GEOLDM	98.9	89.4	93.8	99.3
GEOBFN 100	98.6	87.2	93.0	93.1
UniGEM	99.0	89.8	95.0	98.4
MolFLAE 100	99.4	92.0	96.8	99.7

MolFLAE achieves the best result among the listed methods on every reported metric: atom stability, molecular stability, and validity on QM9, and validity on GEOM-Drugs.

Ablation Study: Latent Space Disentanglement Validation

Setting	MACCS Similarity (substructure)	Shape Similarity (shape)	Validity (%)
Retain \(\mathbf{z}_x\)	0.421 ↓	0.394 ↑	100.0
Retain \(\mathbf{z}_h\)	0.580 ↑	0.174 ↓	100.0

Retaining \(\mathbf{z}_x\) yields substantially higher shape similarity than retaining \(\mathbf{z}_h\) (0.394 vs. 0.174), while the reverse holds for MACCS fingerprint similarity (0.580 vs. 0.421), confirming partial spatial–semantic disentanglement.

Molecular Analogue Generation (with Atom Count Variation)

Atom Count Change	MCS-IoU Similarity (%)	Validity (%)	Atom Stability (%)
-2	69.79	100.0	84.58
-1	76.69	99.89	83.28
0	84.08	99.76	82.48
+1	76.05	99.89	82.38
+2	69.95	99.68	82.53
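The summary does not define MCS-IoU. One natural reading, assumed here, is an intersection-over-union in which the maximum common substructure (MCS) plays the role of the intersection. The helper below only does that arithmetic on atom counts; finding the MCS itself would need a cheminformatics toolkit and is left out.

```python
def mcs_iou(n_atoms_a, n_atoms_b, n_atoms_mcs):
    """Assumed definition: |MCS| / (|A| + |B| - |MCS|), with sizes in atoms."""
    union = n_atoms_a + n_atoms_b - n_atoms_mcs
    if union <= 0:
        raise ValueError("atom counts must give a positive union size")
    return n_atoms_mcs / union

# Hypothetical example: a 20-atom reference, a 21-atom analogue (+1 atom),
# and a 17-atom maximum common substructure.
score = mcs_iou(20, 21, 17)
print(round(100 * score, 2))   # 17 / 24, expressed as a percentage
```

Under this reading, changing the atom count necessarily lowers the best achievable score, which is consistent with the similarity peaking at an atom count change of 0 in the table.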

Key Findings

Interpolation in the latent space produces intermediate molecules with high validity (>99.8%), and physical/structural properties exhibit a pronounced linear trend (Pearson \(r > 0.9\))
hGR drug optimization case study: latent-space mixing of 90% AZD2906 and 10% BI-653048 yields 100 candidates, of which the top-10 outperform BI-653048 in docking score, and 8/10 outperform AZD2906 in hydrophilicity
Sample 34 retains key pharmacophores from both known active compounds, with an RMSD of only 1.35 Å between the docking pose and the generated conformation
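Because every molecule maps to a latent code of the same shape, interpolation and the 90%/10% mixing above reduce to weighted sums of arrays. The sketch below uses random arrays as stand-ins for encoded molecules and a made-up property curve; the function names are hypothetical, and it assumes both codes are already expressed in a common spatial frame, which matters because the spatial part of the latent is equivariant.

```python
import numpy as np

def mix_latents(z_a, z_b, weight_a):
    """Convex combination of two latent codes of identical shape."""
    return weight_a * z_a + (1.0 - weight_a) * z_b

def interpolate(z_a, z_b, n_steps):
    """Evenly spaced latent codes from z_a (first) to z_b (last)."""
    return [mix_latents(z_a, z_b, 1.0 - t) for t in np.linspace(0.0, 1.0, n_steps)]

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1D sequences."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(0)
z_a = rng.standard_normal((10, 3))     # stand-in for one encoded molecule
z_b = rng.standard_normal((10, 3))     # stand-in for another encoded molecule
z_mix = mix_latents(z_a, z_b, 0.9)     # 90% of the first code, 10% of the second

path = interpolate(z_a, z_b, n_steps=11)
steps = np.linspace(0.0, 1.0, 11)
toy_property = 2.0 * steps + 0.05 * rng.standard_normal(11)  # made-up property values
print(pearson_r(steps, toy_property))
```

In practice each code on the path would be decoded into a molecule and a property computed from it; the correlation between interpolation step and property is what the reported \(r > 0.9\) refers to.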

Highlights & Insights

The fixed-dimensional, equivariant latent space design addresses the fundamental obstacle to 3D molecular editing — variable dimensionality that precludes vector-level operations
The virtual node concept is elegant: analogous to the [CLS] token in NLP, it achieves information compression in the 3D molecular context
The partial spatial–semantic disentanglement is an emergent property requiring no additional supervision
The hGR drug optimization case study convincingly demonstrates the practical value of the approach for real-world drug design

Limitations & Future Work

The disentanglement of the latent space remains partial; stronger disentanglement may require imposing invariance constraints on molecular conformational changes
The method is currently validated only on small and drug-like molecules (QM9, GEOM-Drugs, ZINC-9M); scalability to larger biomolecules such as proteins has not been explored
Sampling with the BFN decoder, although more efficient than autoregressive methods, remains relatively slow
No direct comparison with other fixed-dimensional latent space methods (e.g., UAE-3D) on editing tasks is provided

Related Work & Insights

Unlike generative models such as EDM (diffusion-based) and GEOBFN (BFN-based), which are designed primarily for generation, MolFLAE enables editing through the VAE latent space
The virtual node encoding strategy can inspire fixed-dimensional representations for other structured data modalities (e.g., proteins, material crystals)
The BFN decoder's capacity for joint continuous–discrete modeling is generalizable to other molecular types

Rating

Novelty: ⭐⭐⭐⭐⭐ The combination of a fixed-dimensional E(3)-equivariant latent space, virtual nodes, and BFN is pioneering in the 3D molecular domain
Experimental Thoroughness: ⭐⭐⭐⭐ Covers unconditional generation, analogue design, reconstruction, interpolation, and drug optimization, but lacks direct editing comparisons with closely related methods
Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear and visualizations are rich, though some experimental details are relegated to the appendix
Value: ⭐⭐⭐⭐⭐ Provides a general and flexible framework for 3D molecular editing with direct applicability to drug discovery