Unified All-Atom Molecule Generation with Neural Fields

Conference: NeurIPS 2025
arXiv: 2511.15906
Code: GitHub
Area: Medical Imaging / Drug Design
Keywords: Neural fields, molecule generation, all-atom representation, structure-based drug design, diffusion models

TL;DR

This paper proposes FuncBind, a framework that represents molecules as continuous atomic density functions via neural fields, constructing a unified conditional generative model capable of target-conditioned generation across three drug modalities: small molecules, macrocyclic peptides, and antibody CDR loops.

Background & Motivation

  • Background: Structure-based drug design (SBDD) is a core task in drug discovery, aiming to generate candidate molecules with high affinity given a target protein's 3D structure. Current generative models typically focus on a single molecular modality: small-molecule models use point cloud or voxel representations, while protein models rely on residue-level point clouds and sequence databases.
  • Limitations of Prior Work: This modality-specific design limits generalization — knowledge cannot be transferred across modalities, yet real-world drug discovery frequently involves multi-modal molecular interface design.
  • Key Challenge: The lack of a modality-agnostic representation prevents cross-modal learning and limits the diversity of training data from which physical properties can be learned.
  • Goal: Inspired by the success of cross-modal unified modeling in structure prediction (e.g., AlphaFold3, RoseTTAFold), this work proposes using neural fields as a unified molecular representation — modeling molecules as continuous functions mapping 3D coordinates to atomic densities — enabling joint training across three drug modalities within a single model.

Method

Overall Architecture

FuncBind is a two-stage latent-space conditional generative model:

  1. Stage 1: Train a neural field autoencoder (VAE) to learn latent representations of molecules.
  2. Stage 2: Train a conditional denoiser in latent space for novel molecule generation.

At inference, given a target protein structure, noise is sampled and iteratively denoised to produce a latent vector, which is decoded into an atomic density field and post-processed to recover the molecular structure.

Key Designs

  1. Spatial Feature Map Neural Field Representation: Unlike FuncMol, which uses a global embedding, FuncBind organizes the latent variable \(z\) as a spatial feature map (\(C \times L^3\)), where features at each spatial location capture local information. The encoder \(E_\psi\) is a 3D CNN that maps low-resolution molecular voxels into feature map space. The decoder \(D_\phi\) is a multiplicative filter network based on Gabor filters, extracting position-dependent embeddings \(z_x\) from the feature map via nearest-neighbor interpolation to decode the atomic density field: \(D_\phi(x, z_x) \to \mathbb{R}^n\). This design captures local atomic detail while remaining compatible with expressive denoising architectures such as U-Nets. The VAE training loss combines reconstruction loss with KL regularization:
\[\mathcal{L}_{AE} = \sum_{v \in \mathcal{D}} \mathbb{E}_{z \sim q_\psi(z|v)} \left[\int \|D_\phi(x,z) - v(x)\|_2^2 dx\right] + \beta \text{KL}(q_\psi(z|v) \| \mathcal{N}(0,I_d))\]
  2. Conditional Denoiser: The denoiser receives a noisy latent \(y = z + \sigma\varepsilon\) and is conditioned on the target structure \(z^{tar}\), the molecular modality \(c\), and the noise level \(\sigma\). The target encoder \(E_{\psi'}\) has an architecture similar to the molecular encoder but separate parameters. The core network \(U_\theta\) is a 3D U-Net with the Karras preconditioning scheme:
\[\hat{z}_\theta(y|z^{tar}, \sigma, c) = \frac{1}{\sigma^2+1}y + \frac{\sigma}{\sqrt{\sigma^2+1}} U_\theta\left(\frac{y}{\sqrt{\sigma^2+1}}, z^{tar}, \frac{1}{4}\log\sigma, c\right)\]

A key architectural decision is to forgo SE(3) equivariance constraints in favor of data augmentation (rotations and translations).
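
To make the preconditioned denoiser formula concrete, here is a minimal NumPy sketch that wraps an arbitrary network with the four coefficients from that formula. The names `precondition` and `zero_network`, the argument order, and the latent shape are assumptions made for this illustration, not the paper's code; the placeholder network simply returns zeros so that only the skip term remains.

```python
import numpy as np

def precondition(y, sigma, network, z_tar=None, modality=None):
    """Apply the preconditioning from the formula (data scale fixed to 1).

    `network` stands in for U_theta and is called as
    network(scaled_input, z_tar, noise_embedding, modality).
    """
    denom = sigma ** 2 + 1.0
    c_skip = 1.0 / denom               # weight on the noisy input y
    c_out = sigma / np.sqrt(denom)     # weight on the network output
    c_in = 1.0 / np.sqrt(denom)        # rescales the network input
    c_noise = 0.25 * np.log(sigma)     # noise-level conditioning signal
    return c_skip * y + c_out * network(c_in * y, z_tar, c_noise, modality)

def zero_network(y_scaled, z_tar, c_noise, modality):
    """Hypothetical placeholder for the 3D U-Net: always predicts zeros."""
    return np.zeros_like(y_scaled)

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 8, 8, 8))      # assumed C x L^3 latent feature map
z_hat = precondition(y, sigma=2.0, network=zero_network)
```

With the zero placeholder, the output reduces to \(y / (\sigma^2 + 1)\), which makes the skip-connection weight easy to check in isolation.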

  3. Sampling Strategies: Two score-based sampling methods are supported: diffusion models (via reverse SDE integration) and Walk-Jump Sampling (WJS; sampling at a single noise level, simpler to train and faster to mix). Both use the Tweedie–Miyasawa formula to relate the denoiser to the conditional score function.
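
The Tweedie–Miyasawa relation states that the score of the noisy density is \(\nabla_y \log p_\sigma(y) = (\hat{z}(y) - y)/\sigma^2\), so any denoiser yields a score estimate. The toy sketch below illustrates this together with a walk-jump loop. It is a sketch under stated assumptions: the prior is a standard Gaussian (so the optimal denoiser has a closed form), conditioning on the target and modality is omitted, and the walk uses plain overdamped Langevin steps for brevity. All function names are hypothetical.

```python
import numpy as np

def score_from_denoiser(denoiser, y, sigma):
    """Tweedie-Miyasawa: score of the smoothed density from a denoiser."""
    return (denoiser(y, sigma) - y) / sigma ** 2

def gaussian_denoiser(y, sigma):
    """Closed-form posterior mean E[z | y] for the toy prior z ~ N(0, I)."""
    return y / (1.0 + sigma ** 2)

def walk_jump(denoiser, y0, sigma, n_steps, step_size, rng):
    """Walk: Langevin steps on noisy y at one fixed sigma. Jump: denoise once."""
    y = y0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=y.shape)
        drift = step_size * score_from_denoiser(denoiser, y, sigma)
        y = y + drift + np.sqrt(2.0 * step_size) * noise
    return y, denoiser(y, sigma)

rng = np.random.default_rng(0)
y_samples, z_samples = walk_jump(
    gaussian_denoiser, np.zeros((2000, 3)), sigma=1.0,
    n_steps=200, step_size=0.1, rng=rng,
)
```

For this toy prior the noisy marginal is \(\mathcal{N}(0, (1+\sigma^2) I)\), so the walk samples should have variance close to 2 when \(\sigma = 1\), up to discretization error from the finite step size.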

Loss & Training

  • Autoencoder: Reconstruction loss + KL divergence; coordinates near atomic centers are sampled more densely during training to focus optimization on those regions.
  • Denoiser: MSE denoising loss across multiple noise levels with Karras adaptive reweighting.
  • Post-processing: Latent codes are rendered into 0.25 Å resolution voxels; peak detection and gradient ascent localize atomic coordinates; OpenBabel infers chemical bonds and residue identities.
  • The model comprises 5B parameters and is jointly trained across all three modalities.
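
The peak-detection part of the post-processing step can be sketched as follows. This is an illustration only: the decoded field is replaced by a sum of Gaussians, the Gaussian width, density threshold, and function names are assumptions, and the gradient-ascent refinement and OpenBabel bond inference are not shown.

```python
import numpy as np

def render_density(centers, grid_axis, width=0.5):
    """Stand-in for a decoded field: sum of isotropic Gaussians on a cubic grid."""
    gx, gy, gz = np.meshgrid(grid_axis, grid_axis, grid_axis, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1)            # shape (N, N, N, 3)
    density = np.zeros(gx.shape)
    for center in centers:
        sq_dist = np.sum((grid - center) ** 2, axis=-1)
        density += np.exp(-sq_dist / (2.0 * width ** 2))
    return density

def find_peaks(density, threshold=0.5):
    """Indices of voxels above threshold that exceed all 26 neighbours."""
    n0, n1, n2 = density.shape
    padded = np.pad(density, 1, mode="constant", constant_values=-np.inf)
    is_peak = density > threshold
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if dx == 0 and dy == 0 and dz == 0:
                    continue
                neighbour = padded[1 + dx:1 + dx + n0,
                                   1 + dy:1 + dy + n1,
                                   1 + dz:1 + dz + n2]
                is_peak &= density > neighbour
    return np.argwhere(is_peak)

grid_axis = np.linspace(-4.0, 4.0, 33)                # 0.25 Å spacing
true_centers = np.array([[-1.5, 0.1, 0.0], [1.4, 0.3, -0.1]])
density = render_density(true_centers, grid_axis)
peak_coords = grid_axis[find_peaks(density)]          # coarse atom positions
```

The recovered positions are accurate only to the grid spacing; a refinement step such as the gradient ascent mentioned above would be needed to move each peak off the grid toward the true center.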

Key Experimental Results

Main Results: Small Molecule Generation (CrossDocked2020)

| Metric | FuncBind | VoxBind (SOTA) | MolCraft | TargetDiff |
| --- | --- | --- | --- | --- |
| VinaScore ↓ | -5.71 | -6.94 | -6.59 | -5.47 |
| VinaDock ↓ | -7.26 | -8.30 | -7.92 | -7.80 |
| QED ↑ | 0.50 | 0.57 | 0.50 | 0.48 |
| SA ↑ | 0.65 | 0.70 | 0.69 | 0.58 |
| Diversity ↑ | 0.70 | 0.73 | 0.72 | 0.72 |
| Strain Energy ↓ | 217 | 162 | 195 | 1243 |
| # Atoms | 19.0 | 23.4 | 22.7 | 24.2 |

Main Results: Antibody CDR Loop Redesign (SAbDab)

| Method | H3-AAR ↑ | H3-RMSD ↓ | H1-AAR ↑ | H1-RMSD ↓ |
| --- | --- | --- | --- | --- |
| FuncBind | 47.5% | 2.04 Å | 86.9% | 0.41 Å |
| AbDiffuser | 34.1% | 3.35 Å | 76.3% | 1.58 Å |
| DiffAb† | 26.8% | 3.60 Å | 65.8% | 1.19 Å |
| RAbD† | 22.1% | 2.90 Å | 22.9% | 2.26 Å |

Main Results: Macrocyclic Peptide Generation

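| Method | TS ↑ | L-RMSD ↓ | I-RMSD ↓ | TM-Score ↑ | Vina dock better than seed ↑ |
| --- | --- | --- | --- | --- | --- |
| FuncBind | 0.33 | 2.6 Å | 1.8 Å | 0.36 | 41% |
| AfCycDesign | 0.34 | 7.6 Å | 3.7 Å | 0.33 | 29% |
| RFPeptide | 0.31 | 12 Å | 3.3 Å | 0.33 | 8.8% |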

Ablation Study

| Configuration | Observation |
| --- | --- |
| Unified model vs. single-modality model | Comparable performance, but the unified model achieves significantly higher uniqueness |
| Diffusion vs. Walk-Jump Sampling | Diffusion outperforms on most metrics; WJS is simpler to train and mixes faster |
| Global embedding vs. spatial feature map | Spatial feature maps scale better to large molecules and are compatible with U-Net architectures |

Key Findings

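  • FuncBind achieves the best AAR and RMSD among the listed baselines on CDR loop redesign; for example, H3-AAR is 47.5% versus 34.1% for AbDiffuser, and H1-RMSD is 0.41 Å versus 1.19 Å for DiffAb.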
  • Wet-lab validation: H3 loop redesign achieves a 45% binding rate on rigid epitopes.
  • The model generates novel, chemically valid non-canonical amino acids, with fewer than 1% implausible structures.
  • 41% of generated macrocyclic peptide samples achieve Vina docking scores superior to the seed molecule.

Highlights & Insights

  • Power of unified representation: Unifying three structurally diverse molecular modalities via neural fields is an elegant design choice; the spatial feature map enables both fine local structural capture and compatibility with established vision network architectures.
  • Data augmentation over equivariance: Replacing SE(3) equivariance constraints with rotation/translation augmentation simplifies the architecture without sacrificing performance.
  • Closed-loop wet-lab validation: The full pipeline from generation to experimental validation is demonstrated for CDR design, with a 45% binding rate indicating genuine translational value.
  • New benchmark contribution: A dataset of ~190K synthetic macrocyclic peptide–protein complexes is introduced.

Limitations & Future Work

  • Small-molecule metrics remain slightly below those of specialized models such as VoxBind and MolCraft.
  • The approach relies on accurate 3D structures of molecular interfaces, which are costly to obtain in practical drug discovery settings.
  • Practical constraints such as synthesizability (small molecules) and developability (antibodies) are not addressed.
  • The 5B-parameter model incurs substantial training and inference costs, posing a high deployment barrier.
  • Cross-modal transfer learning has not been thoroughly explored.

Related Work & Insights

  • FuncMol: The direct predecessor; this work extends it by introducing spatial feature maps and conditional generation.
  • VoxBind: The SOTA voxel-based method; FuncBind can be viewed as its continuous counterpart.
  • AlphaFold3/RoseTTAFold: Successful precedents for cross-modal unified structure prediction.
  • Insight: Neural field representations hold considerable potential for molecular science and could be extended to a broader range of biological systems.

Rating

  • Novelty: ⭐⭐⭐⭐ The neural field approach unifying three molecular modalities is novel, though the core techniques (VAE + diffusion) are well-established.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across three modalities, wet-lab validation, and a new dataset contribution.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with complete technical descriptions.
  • Value: ⭐⭐⭐⭐⭐ A unified molecular generation framework of significant practical importance; wet-lab results substantially strengthen the claims.