Unified All-Atom Molecule Generation with Neural Fields¶
- Conference: NeurIPS 2025
- arXiv: 2511.15906
- Code: GitHub
- Area: Medical Imaging / Drug Design
- Keywords: Neural fields, molecule generation, all-atom representation, structure-based drug design, diffusion models
TL;DR¶
This paper proposes FuncBind, a framework that uses neural fields to represent molecules as continuous atomic density functions. On top of this representation it builds a single conditional generative model for target-conditioned generation across three drug modalities: small molecules, macrocyclic peptides, and antibody CDR loops.
Background & Motivation¶
- Background: Structure-based drug design (SBDD) is a core task in drug discovery, aiming to generate candidate molecules with high affinity given a target protein's 3D structure. Current generative models typically focus on a single molecular modality: small-molecule models use point cloud or voxel representations, while protein models rely on residue-level point clouds and sequence databases.
- Limitations of Prior Work: This modality-specific design limits generalization — knowledge cannot be transferred across modalities, yet real-world drug discovery frequently involves multi-modal molecular interface design.
- Key Challenge: The lack of a modality-agnostic representation prevents cross-modal learning and limits the diversity of training data from which physical properties can be learned.
- Goal: Inspired by the success of cross-modal unified modeling in structure prediction (e.g., AlphaFold3, RoseTTAFold), this work proposes using neural fields as a unified molecular representation — modeling molecules as continuous functions mapping 3D coordinates to atomic densities — enabling joint training across three drug modalities within a single model.
Method¶
Overall Architecture¶
FuncBind is a two-stage latent-space conditional generative model:
1. Stage 1: Train a neural field autoencoder (VAE) to learn latent representations of molecules.
2. Stage 2: Train a conditional denoiser in latent space for novel molecule generation.
At inference, given a target protein structure, noise is sampled and iteratively denoised to produce a latent vector, which is decoded into an atomic density field and post-processed to recover the molecular structure.
Key Designs¶
- Spatial Feature Map Neural Field Representation: Unlike FuncMol, which uses a global embedding, FuncBind organizes the latent variable \(z\) as a spatial feature map (\(C \times L^3\)), where the features at each spatial location capture local information. The encoder \(E_\psi\) is a 3D CNN that maps low-resolution molecular voxels into feature map space. The decoder \(D_\phi\) is a multiplicative filter network based on Gabor filters; it extracts a position-dependent embedding \(z_x\) from the feature map via nearest-neighbor interpolation and decodes the atomic density field at that position: \(D_\phi(x, z_x) \to \mathbb{R}^n\). This design captures local atomic detail while remaining compatible with expressive denoising architectures such as U-Nets. The VAE is trained with a reconstruction loss plus KL regularization.
- Conditional Denoiser: The denoiser receives a noisy latent \(y = z + \sigma\varepsilon\) and is conditioned on the target structure \(z^{tar}\), the molecular modality \(c\), and the noise level \(\sigma\). The target encoder \(E_{\psi'}\) has an architecture similar to the molecular encoder but independent parameters. The core network \(U_\theta\) is a 3D U-Net wrapped in the Karras et al. (EDM) preconditioning scheme, which rescales the network's input, output, and skip connection as functions of \(\sigma\).
A key architectural decision is to forgo SE(3) equivariance constraints in favor of data augmentation (rotations and translations).
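The augmentation idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' pipeline: it assumes the complex is available as raw atomic coordinates, applies one shared random rotation and translation to ligand and target so that their relative pose is preserved, and uses hypothetical names (`random_rotation`, `augment_complex`, `max_shift`) and an arbitrary shift range.

```python
import numpy as np

def random_rotation(rng):
    """Draw a 3x3 rotation matrix uniformly from SO(3) via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))   # fix column signs so Q is Haar-distributed on O(3)
    if np.linalg.det(q) < 0:      # flip one axis to land in SO(3) (no reflections)
        q[:, 0] = -q[:, 0]
    return q

def augment_complex(ligand_xyz, target_xyz, rng, max_shift=1.0):
    """Apply the same random rigid transform to ligand and target coordinates.

    max_shift (in Å) is an illustrative assumption, not a value from the paper.
    """
    rot = random_rotation(rng)
    center = target_xyz.mean(axis=0)                    # rotate about the target centroid
    shift = rng.uniform(-max_shift, max_shift, size=3)  # one translation for the whole complex

    def transform(xyz):
        return (xyz - center) @ rot.T + center + shift

    return transform(ligand_xyz), transform(target_xyz)

rng = np.random.default_rng(0)
ligand = rng.standard_normal((5, 3))
target = 3.0 * rng.standard_normal((8, 3))
ligand_aug, target_aug = augment_complex(ligand, target, rng)
```

Because both molecules receive the same transform, all ligand-target distances are unchanged; only the global frame the network sees differs between training examples.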
- Sampling Strategies: Two score-based sampling methods are supported — diffusion models (via reverse SDE integration) and Walk-Jump Sampling (single noise-level sampling, simpler to train and faster to mix). Both leverage the Tweedie–Miyasawa formula to relate the denoiser to the conditional score function.
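To make the link between denoiser, score, and sampling concrete, here is a toy sketch under strong assumptions: the "data" are 1D Gaussian, so the ideal denoiser \(E[z \mid y]\) is known in closed form and stands in for a trained network. The Tweedie–Miyasawa relation then gives the score as \((E[z \mid y] - y)/\sigma^2\). The walk step uses plain unadjusted Langevin dynamics for brevity, which may differ from the sampler used in the paper, and all names (`MU`, `S`, `denoiser`, `walk_jump`) are hypothetical.

```python
import numpy as np

MU, S = 2.0, 0.5  # toy data distribution z ~ N(MU, S^2); purely illustrative

def denoiser(y, sigma):
    """Exact posterior mean E[z | y] for y = z + sigma * eps (stand-in for a trained network)."""
    return (S**2 * y + sigma**2 * MU) / (S**2 + sigma**2)

def score(y, sigma):
    """Tweedie-Miyasawa: grad_y log p_sigma(y) = (E[z | y] - y) / sigma^2."""
    return (denoiser(y, sigma) - y) / sigma**2

def walk_jump(n_chains, n_steps, sigma, step, rng):
    """Walk: Langevin MCMC on the sigma-smoothed density. Jump: a single denoising step."""
    y = sigma * rng.standard_normal(n_chains)  # start from pure noise
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        y = y + step * score(y, sigma) + np.sqrt(2.0 * step) * noise
    return y, denoiser(y, sigma)

rng = np.random.default_rng(0)
y_walk, z_jump = walk_jump(n_chains=5000, n_steps=2000, sigma=1.0, step=0.01, rng=rng)
print("mean of walked y:", y_walk.mean(), "mean of jumped z:", z_jump.mean())
```

Only one noise level is involved, which is why this style of sampling needs a denoiser trained at a single \(\sigma\). The jumped values are posterior means, so they are more concentrated than the data distribution itself.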
Loss & Training¶
- Autoencoder: Reconstruction loss + KL divergence; query coordinates near atomic centers are oversampled during training so that optimization focuses on the regions that carry signal.
- Denoiser: MSE denoising loss across multiple noise levels with Karras adaptive reweighting.
- Post-processing: Latent codes are rendered into 0.25 Å resolution voxels; peak detection and gradient ascent localize atomic coordinates; OpenBabel infers chemical bonds and residue identities.
- The model comprises 5B parameters and is jointly trained across all three modalities.
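The post-processing step (render to voxels, detect peaks, refine by gradient ascent) can be illustrated with a small sketch. It makes several assumptions: a single-channel sum of Gaussians stands in for the decoded field (the real decoder outputs one channel per atom type), the gradient is analytic rather than obtained from the network, and bond inference with OpenBabel is omitted. `WIDTH`, `threshold`, `lr`, and all function names are hypothetical; only the 0.25 Å spacing comes from the text.

```python
import itertools
import numpy as np

WIDTH = 0.4     # Gaussian width in Å; illustrative assumption
SPACING = 0.25  # voxel resolution in Å, as quoted in the text

def density(points, atoms):
    """Stand-in for the decoded field: sum of isotropic Gaussians centred on `atoms`."""
    d2 = ((points[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * WIDTH**2)).sum(-1)

def density_grad(points, atoms):
    """Analytic gradient of `density` with respect to the query points."""
    diff = atoms[None, :, :] - points[:, None, :]          # (P, A, 3), pointing toward atoms
    weight = np.exp(-(diff**2).sum(-1) / (2 * WIDTH**2))    # (P, A)
    return (weight[..., None] * diff).sum(1) / WIDTH**2

def local_maxima(grid, threshold):
    """Voxels above `threshold` that are >= all 26 neighbours (np.roll wraps at borders)."""
    is_peak = grid > threshold
    for shift in itertools.product((-1, 0, 1), repeat=3):
        if any(shift):
            is_peak &= grid >= np.roll(grid, shift, axis=(0, 1, 2))
    return np.argwhere(is_peak)

def recover_atoms(atoms, box=8.0, threshold=0.5, lr=0.05, n_steps=50):
    """Render the field on a grid, pick peak voxels, then refine them off-grid."""
    axis = np.arange(0.0, box, SPACING)
    points = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)
    grid = density(points, atoms).reshape(len(axis), len(axis), len(axis))
    coords = local_maxima(grid, threshold) * SPACING        # voxel index -> Å
    for _ in range(n_steps):                                # gradient ascent on the continuous field
        coords = coords + lr * density_grad(coords, atoms)
    return coords

true_atoms = np.array([[2.05, 3.30, 4.05], [3.55, 3.45, 4.20], [4.30, 4.70, 4.45]])
found = recover_atoms(true_atoms)
print(found)
```

The grid search only needs to land within one voxel of each peak; the ascent on the continuous field then removes the quantization error, which is the main benefit of decoding a function rather than a fixed voxel grid. Exact ties between neighbouring voxels could yield duplicate peaks, which a production version would need to merge.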
Key Experimental Results¶
Main Results: Small Molecule Generation (CrossDocked2020)¶
| Metric | FuncBind | VoxBind (SOTA) | MolCraft | TargetDiff |
|---|---|---|---|---|
| VinaScore ↓ | -5.71 | -6.94 | -6.59 | -5.47 |
| VinaDock ↓ | -7.26 | -8.30 | -7.92 | -7.80 |
| QED ↑ | 0.50 | 0.57 | 0.50 | 0.48 |
| SA ↑ | 0.65 | 0.70 | 0.69 | 0.58 |
| Diversity ↑ | 0.70 | 0.73 | 0.72 | 0.72 |
| Strain Energy ↓ | 217 | 162 | 195 | 1243 |
| # Atoms | 19.0 | 23.4 | 22.7 | 24.2 |
Main Results: Antibody CDR Loop Redesign (SAbDab)¶
| Method | H3-AAR ↑ | H3-RMSD ↓ | H1-AAR ↑ | H1-RMSD ↓ |
|---|---|---|---|---|
| FuncBind | 47.5% | 2.04 Å | 86.9% | 0.41 Å |
| AbDiffuser | 34.1% | 3.35 Å | 76.3% | 1.58 Å |
| DiffAb† | 26.8% | 3.60 Å | 65.8% | 1.19 Å |
| RAbD† | 22.1% | 2.90 Å | 22.9% | 2.26 Å |
Main Results: Macrocyclic Peptide Generation¶
| Method | TS ↑ | L-RMSD ↓ | I-RMSD ↓ | TM-Score ↑ | Samples with Vina dock better than seed ↑ |
|---|---|---|---|---|---|
| FuncBind | 0.33 | 2.6 Å | 1.8 Å | 0.36 | 41% |
| AfCycDesign | 0.34 | 7.6 Å | 3.7 Å | 0.33 | 29% |
| RFPeptide | 0.31 | 12 Å | 3.3 Å | 0.33 | 8.8% |
Ablation Study¶
| Configuration | Observation |
|---|---|
| Unified model vs. single-modality model | Comparable performance, but the unified model achieves significantly higher uniqueness |
| Diffusion vs. Walk-Jump Sampling | Diffusion outperforms on most metrics; WJS is simpler to train and mixes faster |
| Global embedding vs. spatial feature map | Spatial feature maps scale better to large molecules and are compatible with U-Net architectures |
Key Findings¶
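- In the CDR redesign table, FuncBind beats every listed baseline on both AAR and RMSD. Against the best baseline in each column, H3-AAR rises from 34.1% to 47.5% and H1-AAR from 76.3% to 86.9%, while H3-RMSD falls from 2.90 Å to 2.04 Å and H1-RMSD from 1.19 Å to 0.41 Å (ratios of roughly 1.1× to 2.9×).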
- Wet-lab validation: H3 loop redesign achieves a 45% binding rate on rigid epitopes.
- The model generates novel, chemically valid non-canonical amino acids, with fewer than 1% implausible structures.
- 41% of generated macrocyclic peptide samples achieve Vina docking scores superior to the seed molecule.
Highlights & Insights¶
- Power of unified representation: Unifying three structurally diverse molecular modalities via neural fields is an elegant design choice; the spatial feature map enables both fine local structural capture and compatibility with established vision network architectures.
- Data augmentation over equivariance: Replacing SE(3) equivariance constraints with rotation/translation augmentation simplifies the architecture without sacrificing performance.
- Closed-loop wet-lab validation: The full pipeline from generation to experimental validation is demonstrated for CDR design, with a 45% binding rate indicating genuine translational value.
- New benchmark contribution: A dataset of ~190K synthetic macrocyclic peptide–protein complexes is introduced.
Limitations & Future Work¶
- Small-molecule metrics remain below those of specialized models such as VoxBind and MolCraft (e.g., VinaScore of -5.71 versus -6.94 and -6.59, respectively).
- The approach relies on accurate 3D structures of molecular interfaces, which are costly to obtain in practical drug discovery settings.
- Practical constraints such as synthesizability (small molecules) and developability (antibodies) are not addressed.
- The 5B-parameter model incurs substantial training and inference costs, posing a high deployment barrier.
- Cross-modal transfer learning has not been thoroughly explored.
Related Work & Insights¶
- FuncMol: The direct predecessor; this work extends it by introducing spatial feature maps and conditional generation.
- VoxBind: The SOTA voxel-based method; FuncBind can be viewed as its continuous counterpart.
- AlphaFold3/RoseTTAFold: Successful precedents for cross-modal unified structure prediction.
- Insight: Neural field representations hold considerable potential for molecular science and could be extended to a broader range of biological systems.
Rating¶
- Novelty: ⭐⭐⭐⭐ The neural field approach unifying three molecular modalities is novel, though the core techniques (VAE + diffusion) are well-established.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across three modalities, wet-lab validation, and a new dataset contribution.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with complete technical descriptions.
- Value: ⭐⭐⭐⭐⭐ A unified molecular generation framework of significant practical importance; wet-lab results substantially strengthen the claims.