Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment
- Conference: ICML 2025
- arXiv: 2506.10186
- Code: GitHub
- Area: Molecule Generation / Diffusion Models
- Keywords: 3D Molecule Generation, Non-Equivariant, Rotational Alignment, Latent Diffusion, AutoEncoder
TL;DR
Proposes RADM (Rotationally Aligned Diffusion Model), which learns sample-dependent SO(3) rotations to construct an aligned latent space in which non-equivariant diffusion models can generate 3D molecules effectively. Generation quality is comparable to SOTA equivariant models, with better scalability and sampling efficiency.
Background & Motivation
In 3D molecule generation, rotating or translating a molecule in three-dimensional space does not change its chemical properties (SE(3) symmetry). Mainstream methods satisfy this constraint with equivariant networks such as EGNN.
However, equivariant architectures have clear disadvantages:
- Complex parameterization: EGNN and similar models require special message-passing rules to maintain equivariance.
- Lack of standard implementations: no equivariant architecture has the unified, standardized status that Transformers have in vision and NLP.
- Poor efficiency and scalability: it is difficult to leverage modern acceleration techniques such as FlashAttention.
Key Challenge: Is equivariance strictly necessary? The probability of a molecule is the total probability over all of its possible 3D orientations; this does not require every orientation to be equally likely.
Method
Overall Architecture
Trained in two stages: (1) Train an autoencoder with rotational alignment \(\rightarrow\) construct an aligned latent space; (2) Train a non-equivariant diffusion model in the aligned latent space.
Key Designs
1. Rotation Parameterization
An arbitrary matrix \(\mathbf{M} \in \mathbb{R}^{3 \times 3}\) is projected onto SO(3) via its singular value decomposition (SVD).
This parameterization is smooth when \(\det(\mathbf{M}) \neq 0\), making it suitable for gradient optimization.
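One common way to write this projection (a sketch; the paper's exact notation may differ) takes the SVD \(\mathbf{M} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top\) and returns \(\mathbf{R} = \mathbf{U}\,\mathrm{diag}(1, 1, \det(\mathbf{U}\mathbf{V}^\top))\,\mathbf{V}^\top\), where the last diagonal entry flips one axis when needed so that \(\det(\mathbf{R}) = +1\). The NumPy function below, under the hypothetical name `project_to_so3`, illustrates the idea:

```python
import numpy as np

def project_to_so3(M: np.ndarray) -> np.ndarray:
    """Project a 3x3 matrix onto SO(3) using its SVD (illustrative sketch)."""
    U, _, Vt = np.linalg.svd(M)
    # U @ Vt is orthogonal, so its determinant is +1 or -1.
    # Flipping the last axis when it is -1 guarantees a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))   # stand-in for the rotation network's output
R = project_to_so3(M)

# R should be orthogonal with determinant +1.
orthogonality_error = np.abs(R @ R.T - np.eye(3)).max()
determinant = np.linalg.det(R)
```

Because the function is built from differentiable matrix operations, the same construction can be written in an autodiff framework so that gradients flow back to whatever network produced \(\mathbf{M}\).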
2. Rotation Network
A vanilla GNN (non-equivariant) is used to generate a sample-dependent rotation matrix \(\mathbf{R}_\theta\) from the molecule \((\mathbf{x}, \mathbf{h})\). Atomic coordinates and features are concatenated, passed through message passing, and finally mean-pooled and processed by a 2-layer MLP to obtain \(\mathbf{M}\).
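The data flow described above can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's network: the single mean-aggregation message-passing round, the hidden size, the `tanh` activations, and the names `init_params` and `rotation_head` are all assumptions made for illustration.

```python
import numpy as np

def init_params(feat_dim, hidden=32, seed=0):
    """Random weights for the toy network (sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    d_in = 3 + feat_dim
    return {
        "W_msg": rng.normal(scale=0.1, size=(2 * d_in, hidden)),
        "W1": rng.normal(scale=0.1, size=(hidden, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, 9)),
        "b2": np.zeros(9),
    }

def rotation_head(x, h, params):
    """Map a molecule (x: N x 3 coordinates, h: N x F features) to a 3x3 matrix M."""
    feats = np.concatenate([x, h], axis=1)              # concatenate coords and features
    # One round of message passing on a fully connected graph:
    # every atom receives the mean of all atom features.
    context = np.broadcast_to(feats.mean(axis=0, keepdims=True), feats.shape)
    node = np.tanh(np.concatenate([feats, context], axis=1) @ params["W_msg"])
    pooled = node.mean(axis=0)                          # mean pooling over atoms
    hidden = np.tanh(pooled @ params["W1"] + params["b1"])   # MLP layer 1
    return (hidden @ params["W2"] + params["b2"]).reshape(3, 3)  # MLP layer 2 -> M

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))   # 5 atoms, 3D coordinates
h = rng.normal(size=(5, 4))   # 5 atoms, 4 illustrative feature channels
params = init_params(feat_dim=4)
M = rotation_head(x, h, params)
```

Because raw coordinates are fed in directly, the output changes when the input molecule is rotated, which is what lets the network pick a sample-dependent orientation. The resulting \(\mathbf{M}\) would then be projected to SO(3) as described in the rotation parameterization.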
3. Non-Equivariant Autoencoder
- Encoder: 1-layer EGNN (same as GeoLDM for easy ablation)
- Decoder: Non-equivariant GNN — Key Design: The decoder must be non-equivariant so that the reconstruction loss is sensitive to rotation, thereby providing gradient signals for the rotation network.
Reconstruction loss: \(\mathcal{L} = -\mathbb{E}_{q_{\theta,\eta}(\mathbf{z}_x, \mathbf{z}_h | \mathbf{x}, \mathbf{h})}[\log p_\psi(\mathbf{R}_\theta\mathbf{x}, \mathbf{h} | \mathbf{z}_x, \mathbf{z}_h)]\)
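A toy numerical check can make the decoder argument concrete. In the sketch below, two hand-written maps stand in for a whole autoencoder (both are assumptions for illustration, not the paper's models): `equivariant_ae` commutes with rotations, while `non_equivariant_ae` applies a fixed per-axis scaling that does not. The squared reconstruction error is evaluated on the same molecule under two different rotations.

```python
import numpy as np

def random_rotation(rng):
    """Sample a proper rotation via QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0   # flip one axis so that det(Q) = +1
    return Q

def recon_error(x, autoencoder):
    """Sum of squared differences between input and reconstruction."""
    return float(np.sum((x - autoencoder(x)) ** 2))

W = np.diag([1.0, 0.5, 0.2])            # fixed per-axis scaling, breaks equivariance
equivariant_ae = lambda x: 0.9 * x      # f(x R^T) = f(x) R^T
non_equivariant_ae = lambda x: x @ W    # f(x R^T) != f(x) R^T in general

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
R1, R2 = random_rotation(rng), random_rotation(rng)
x1, x2 = x @ R1.T, x @ R2.T             # same molecule, two orientations

eq_errors = (recon_error(x1, equivariant_ae), recon_error(x2, equivariant_ae))
non_eq_errors = (recon_error(x1, non_equivariant_ae), recon_error(x2, non_equivariant_ae))
```

The two entries of `eq_errors` coincide, so an equivariant pipeline gives the rotation network nothing to optimize. The entries of `non_eq_errors` differ, which is the orientation-dependent signal the rotation network needs.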
4. Non-Equivariant Latent Diffusion Model
A standard denoising diffusion model is trained in the aligned latent space. The noise prediction network can be either of the following:
- Vanilla GNN (concatenating coordinates and features)
- DiT (Diffusion Transformer): directly reuses high-performance Transformer implementations from the vision field.
Training objective: the standard DDPM noise-prediction loss.
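In its usual simplified form, the DDPM objective samples a timestep \(t\) and Gaussian noise \(\boldsymbol{\epsilon}\), forms \(\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}\), and minimizes \(\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\phi(\mathbf{z}_t, t)\|^2\). The sketch below shows one such training step in NumPy. The linear beta schedule, the symbol \(\phi\), the latent shape, and the zero-output placeholder for the denoiser are assumptions for illustration; the paper's schedule and loss weighting may differ.

```python
import numpy as np

def make_alphas_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, eps, alphas_bar):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    a = alphas_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def ddpm_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
alphas_bar = make_alphas_bar()
z0 = rng.normal(size=(5, 3))                  # aligned latent for one molecule
t = int(rng.integers(0, len(alphas_bar)))     # random timestep
eps = rng.normal(size=z0.shape)               # noise to be predicted
z_t = add_noise(z0, t, eps, alphas_bar)

# Placeholder for the denoising network (GNN or DiT): it predicts zeros here.
eps_pred = np.zeros_like(z_t)
loss = ddpm_loss(eps, eps_pred)
```

In a real training loop, `eps_pred` would come from the noise prediction network evaluated at `(z_t, t)`, and the loss would be averaged over a batch of molecules and timesteps.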
5. Translation Handling
Coordinates are projected into an \((N-1) \times 3\)-dimensional subspace by subtracting the center of mass. The center of mass of the predicted noise is also subtracted at each diffusion step.
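A minimal sketch of this projection, assuming the center of mass is taken as the unweighted mean of atom positions (as is common in this line of work; the paper may weight atoms differently). The function name `remove_center_of_mass` and the toy update step are illustrative.

```python
import numpy as np

def remove_center_of_mass(x):
    """Subtract the mean position so the coordinates sum to zero along each axis."""
    return x - x.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 3)) + 10.0          # 5 atoms, shifted away from the origin
x = remove_center_of_mass(raw)

# The predicted noise is projected the same way at every diffusion step,
# so an update of the form x - step * eps_pred stays in the zero-mean subspace.
eps_pred = remove_center_of_mass(rng.normal(size=(5, 3)))
x_next = x - 0.1 * eps_pred
```

Since both `x` and `eps_pred` have zero mean, so does any linear combination of them, which is why the trajectory never drifts out of the \((N-1) \times 3\)-dimensional subspace.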
Key Experimental Results
QM9 Dataset
| Model | Atom Stability (%) | Molecule Stability (%) | Valid (%) | Valid & Unique (%) |
|---|---|---|---|---|
| EDM (Equivariant) | 98.7 | 82.0 | 91.9 | 90.7 |
| GeoLDM (Equivariant) | 98.9 | 89.4 | 93.8 | 92.7 |
| GDM-aug (Non-Equivariant) | 97.6 | 71.6 | 90.4 | 89.5 |
| RADM (Non-Equivariant) | ~98.8 | ~87 | ~93 | ~92 |
GEOM-Drugs Dataset
| Model | Atom Stability (%) | Valid (%) |
|---|---|---|
| EDM | 81.3 | 92.6 |
| GeoLDM | 84.4 | 99.3 |
| GDM-aug | 77.7 | 91.8 |
| RADM | Large improvement over GDM-aug | - |
Efficiency Comparison
- RADM sampling speed is significantly faster than equivariant diffusion models.
- Using DiT as the denoising network allows leveraging FlashAttention acceleration.
- Non-equivariant architectures are more parameter-efficient.
Key Findings
- Non-equivariant RADM significantly outperforms all prior non-equivariant methods (GDM, GDM-aug, GraphLDM).
- Generation quality is close to SOTA equivariant models (GeoLDM).
- Rotational alignment is crucial: ablation studies show that performance drops sharply when the rotation network is removed.
- Non-equivariant decoders are necessary (equivariant decoders would make the reconstruction loss invariant to rotation, preventing the rotation network from learning).
Highlights & Insights
- Revisiting the necessity of equivariance: from a probabilistic standpoint, equivariance constraints are not strictly necessary, which challenges a default assumption in the field.
- Inspiration for rotational alignment: 3D vision datasets (like ShapeNet) are aligned; why not molecules?
- Autoencoder learns unsupervised alignment: Cleverly uses the reconstruction objective to indirectly supervise the rotation network to align molecules.
- Potential for unified architectures: Non-equivariant models can directly utilize general-purpose architectures like DiT, bridging molecule generation and vision generation.
- SVD Rotation Parameterization: Smooth and unconstrained, suitable for end-to-end gradient learning.
Limitations & Future Work
- The autoencoder and diffusion model are trained separately, which might not achieve joint optimality.
- The encoder still uses EGNN (equivariant), so the framework is not fully non-equivariant.
- Only validated on small molecule datasets (QM9, GEOM-Drugs), without testing on macromolecules such as proteins.
- Rotational alignment only handles SO(3); permutation equivariance is still guaranteed by attention mechanisms.
- Latent space dimensions are the same as the original space, meaning real dimensional compression is not achieved.
Related Work & Insights
- Equivariant Diffusion: EDM, GeoLDM, MiDi
- Non-Equivariant Methods: GDM, GDM-aug, GraphLDM
- Rotation Representation: SVD Parameterization, Euler Angles, Exponential Coordinates
- Latent Diffusion: LDM, GeoLDM
Rating
⭐⭐⭐⭐ (4/5)
The argument is clear and powerful: equivariance is not mandatory. The design of the rotationally aligned autoencoder is elegant and opens a path to connect molecular generation with general-purpose generative architectures. The experiments are thorough but limited in scale, covering only small molecules.