PhysX-3D: Physical-Grounded 3D Asset Generation

Conference: NeurIPS 2025 arXiv: 2507.12465 Code: Project Page Area: 3D Vision Keywords: Physical-property 3D generation, 3D dataset, physical annotation, articulated object modeling, embodied AI

TL;DR

PhysX proposes the first end-to-end physical-property-driven 3D asset generation paradigm, comprising PhysXNet (the first 3D dataset with systematic annotations across five physical dimensions—absolute scale, material, functional affordance, kinematics, and functional description—covering 26K+ objects) and PhysXGen (a dual-branch feed-forward generation framework that injects physical knowledge into a pretrained 3D structural latent space).

Background & Motivation

Recent years have witnessed remarkable progress in 3D asset generation, yet existing methods focus almost exclusively on geometry and texture, neglecting physical properties. Real-world objects, however, inherently possess rich physical and semantic characteristics:

Absolute Scale: the true physical dimensions of an object

Material: material name, Young's modulus, Poisson's ratio, density

Functional Affordance: per-part priority for touching/grasping interactions

Kinematics: joint type, range of motion, motion direction, parent–child part relationships

Functional Description: basic, functional, and motion descriptive text

Coverage across existing datasets is highly fragmented (see Table 1 for comparison):

  • PartNet-Mobility contains only 2.7K objects, with kinematic annotations alone
  • ABO provides material and scale annotations, but only at the object level (not the part level)
  • Objaverse is large-scale but contains no physical annotations whatsoever

This absence of physical properties severely hinders the practical deployment of 3D assets in simulation, robotics, and embodied AI. The core motivation of PhysX is to establish a complete physical 3D asset pipeline, from upstream data annotation to downstream generative modeling.

Method

Overall Architecture

PhysX consists of two core components:

  • PhysXNet dataset: 26K+ physical 3D objects, plus PhysXNet-XL with over 6 million procedurally augmented objects
  • PhysXGen generative model: a dual-branch feed-forward framework built upon the pretrained TRELLIS 3D generative model

Key Designs

  1. Human-in-the-Loop Annotation Pipeline: organized into two stages:

    • Initial Data Acquisition: Each part is first rendered via alpha compositing (the target part in red, all others in gray) to maximize visual clarity and minimize occlusion interference. GPT-4o then performs automatic annotation to obtain basic physical attributes (material, density, part name, affordance, functional description). Human annotators review and correct the VLM outputs.
    • Kinematic Parameter Determination: For all parts with constrained motion (excluding free or rigid connections), the contact region between each child–parent mesh pair is computed. Plane fitting is applied to obtain candidate motion-axis directions, and candidate positions are generated (for revolute joints, K-means is additionally used to determine the axis position). Human annotators then select the best candidate and finalize the kinematic parameters.

During preprocessing, overly fine-grained parts in PartNet are merged (area \(\leq 0.2\), or face count \(\leq 100\) with area \(\leq 0.06\)), and the merging results are manually verified. Kinematic types comprise A (free), B (prismatic/translational joint), C (revolute joint), D (hinge joint), E (rigid connection), and the composite type CB.
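
The kinematic candidate-generation step described above can be sketched in a few lines. This is a minimal illustration, assuming the contact region between a child–parent pair is already available as an (M, 3) point array; the NumPy-only k-means stands in for a library clustering routine, and all function names are hypothetical, not from the paper's code.

```python
import numpy as np

def fit_contact_plane(points):
    """Fit a plane to contact-region points via SVD; the plane
    normal serves as a candidate motion-axis direction."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value
    # is the direction of least variance, i.e. the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal / np.linalg.norm(normal), centroid

def revolute_axis_candidates(points, k=4, iters=20, seed=0):
    """Tiny k-means proposing candidate axis positions for
    revolute joints (stand-in for a library implementation)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers

# Toy contact region: noisy points near the z = 0 plane.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200, 3))
pts[:, 2] *= 0.01                    # flatten onto z ≈ 0
normal, center = fit_contact_plane(pts)
candidates = revolute_axis_candidates(pts)
```

A human annotator would then select the best axis among `candidates`, as the pipeline describes.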

  2. PhysXGen Dual-Branch Generation Architecture: consists of two stages:

    • Physical 3D VAE: A physical VAE encoder \(\mathcal{E}_{phy}\) and decoder \(\mathcal{D}_{phy}\) are constructed to encode physical attributes—scale \(P_{dim} \in \mathbb{R}^{N \times 1}\), affordance \(P_{aff} \in \mathbb{R}^{N \times 1}\), density \(P_\rho \in \mathbb{R}^{N \times 1}\), and kinematic parameters \(P_{mov} \in \mathbb{R}^{N \times 11}\)—concatenated as \(P_{phy} \in \mathbb{R}^{N \times 14}\), together with CLIP-encoded functional descriptions \(P_{sem} \in \mathbb{R}^{N \times 768 \times 3}\), into a physical latent space \(P_{plat} \in \mathbb{R}^{N \times 8}\). The structural branch employs pretrained DINOv2 feature encoding. A key design is the residual connection establishing an information pathway from \(\mathcal{D}_{phy}\) to \(\mathcal{D}_{aes}\), exploiting the correlation between physical and structural representations.

    • Physical Latent Generation: A Transformer-based diffusion model trained with a conditional flow matching (CFM) objective. The physical branch uses 14 Transformer blocks (fewer than the 24 blocks in the structural branch to reduce computational cost). The structural branch provides guidance to the physical branch via learnable skip-connection layers. The total loss is \(\mathcal{L}_{diff} = \mathcal{L}_{aes} + \mathcal{L}_{phy}\).
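
The training objective of the second stage can be sketched as follows; the toy velocity model stands in for the two Transformer branches, the structural-latent dimensionality is illustrative, and none of these names come from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 512, 8                        # tokens × physical-latent channels (P_plat in R^{N×8})

def toy_velocity_model(x_t, t, cond):
    # Stand-in for the 14-block physical Transformer branch; the real
    # model attends over x_t and t, with structural guidance via `cond`
    # (the learnable skip-connection features).
    return 0.1 * x_t + cond

def cfm_loss(x1, model, cond):
    """Conditional flow matching: sample a point on the straight
    noise -> data path and regress the constant velocity x1 - x0."""
    x0 = rng.standard_normal(x1.shape)       # Gaussian prior sample
    t = rng.uniform(size=(x1.shape[0], 1))   # per-token time in [0, 1)
    x_t = (1 - t) * x0 + t * x1              # linear interpolation path
    v_target = x1 - x0                       # ground-truth velocity field
    return np.mean((model(x_t, t, cond) - v_target) ** 2)

phys_latent = rng.standard_normal((N, D))    # physical latent sample
struct_latent = rng.standard_normal((N, D))  # structural latent (dim illustrative)
guidance = rng.standard_normal((N, D))       # skip-connection features

loss_phy = cfm_loss(phys_latent, toy_velocity_model, guidance)
loss_aes = cfm_loss(struct_latent, toy_velocity_model, guidance)
total_loss = loss_aes + loss_phy             # L_diff = L_aes + L_phy
```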

  3. PhysXNet-XL Procedural Augmentation: Starting from PhysXNet, intra-class and cross-class combination rules are applied to procedurally generate 6 million+ physical 3D objects. Intra-class combination covers 9 categories including cabinets, tables, and bottles; cross-class combination identifies drawers and doors as modular components that can be flexibly integrated. Structural and physical consistency is enforced throughout.

Loss & Training

VAE Loss:

\[\mathcal{L}_{vae} = \mathcal{L}_{aes}^{color} + \mathcal{L}_{aes}^{geometry} + \mathcal{L}_{phy} + \mathcal{L}_{sem} + \mathcal{L}_{kl} + \mathcal{L}_{reg}\]
  • \(\mathcal{L}_{aes}^{color}\): L2 + LPIPS
  • \(\mathcal{L}_{aes}^{geometry}\): mask + normal + depth
  • \(\mathcal{L}_{phy}\) / \(\mathcal{L}_{sem}\): normalized L2
  • AdamW optimizer, lr = 1e-4, 8× A100 GPUs
  • 24K training / 1K validation / 1K test
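
The attribute-level loss terms above can be sketched for a diagonal-Gaussian latent; the rescaling shown is one plausible reading of "normalized L2", and the rendering-based color/geometry terms are omitted since they require rendered views.

```python
import numpy as np

def normalized_l2(pred, target, eps=1e-8):
    """Normalized L2 for the physical/semantic terms: channels are
    rescaled to comparable ranges before the squared error."""
    scale = np.abs(target).max(axis=0, keepdims=True) + eps
    return np.mean(((pred - target) / scale) ** 2)

def kl_standard_normal(mu, logvar):
    """KL(q || N(0, I)) for a diagonal-Gaussian latent."""
    return 0.5 * np.mean(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

rng = np.random.default_rng(0)
N = 64
p_phy_gt = rng.uniform(0.0, 100.0, size=(N, 14))   # scale/affordance/density/kinematics
p_phy_pred = p_phy_gt + rng.normal(0.0, 1.0, size=(N, 14))
mu = rng.normal(size=(N, 8))                       # latent mean (P_plat channels)
logvar = 0.1 * rng.normal(size=(N, 8))             # latent log-variance

l_phy = normalized_l2(p_phy_pred, p_phy_gt)
l_kl = kl_standard_normal(mu, logvar)
# Color/geometry terms (L2 + LPIPS, mask + normal + depth) omitted;
# only the per-token attribute terms are shown here.
l_vae_partial = l_phy + l_kl
```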

Key Experimental Results

Main Results

PhysXGen vs. baseline methods:

| Method | PSNR↑ | CD↓ | F-Score↑ | Abs. Scale↓ | Material↑ | Affordance↑ | Motion COV↑ | Motion MMD↓ | Description↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRELLIS | 24.31 | 13.2 | 76.9 | – | – | – | – | – | – |
| TRELLIS+PhysPre | 24.31 | 13.2 | 76.9 | 13.21 | 8.63 | 7.23 | 0.24 | 0.12 | 6.55 |
| PhysXGen | 24.53 | 12.7 | 77.3 | 7.24 | 13.01 | 11.30 | 0.33 | 0.08 | 10.11 |

Comparison with GPT baseline (TRELLIS+PartField+GPT-4o):

| Method | Abs. Scale↓ | Material↑ | Affordance↑ | Motion COV↑ | Motion MMD↓ | Description↑ |
| --- | --- | --- | --- | --- | --- | --- |
| TRELLIS+PartField+GPT | 8.81 | 7.95 | 6.73 | 0.09 | 0.24 | 14.31 |
| PhysXGen | 7.24 | 13.01 | 11.30 | 0.33 | 0.08 | 10.11 |

PhysXGen achieves gains of 24%, 64%, 28%, and 72% over the GPT baseline on absolute scale, material, kinematics, and affordance, respectively.

Ablation Study

Effect of the dual-branch architecture:

| Dep-VAE | Dep-Diff | PSNR↑ | Abs. Scale↓ | Material↑ | Affordance↑ | Motion COV↑ | Description↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | 24.31 | 13.21 | 8.63 | 7.23 | 0.24 | 6.55 |
|  |  | 24.31 | 12.01 | 10.69 | 8.95 | 0.26 | 7.71 |
|  |  | 24.32 | 10.57 | 9.86 | 9.32 | 0.28 | 7.54 |
| ✓ | ✓ | 24.53 | 7.24 | 13.01 | 11.30 | 0.33 | 10.11 |

Key Findings

  • Physical–structural correlation is central: joint modeling of the two modalities yields substantial improvements in physical attributes while also enhancing geometric quality (CD reduced from 13.2 to 12.7)
  • The dual-branch design in both the VAE and the diffusion model is indispensable; removing either branch significantly degrades physical attribute generation quality
  • GPT-4o performs better on functional descriptions (benefiting from its language capabilities), but is substantially inferior to end-to-end learning on structured physical attributes
  • Absolute scale prediction is challenged by long-tail distributions spanning 1–1000 cm, where neither linear nor logarithmic normalization yields satisfactory results
  • Kinematics is the most challenging attribute, requiring simultaneous accurate prediction of discrete part hierarchies and continuous motion parameters
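
The scale-normalization issue behind this finding can be reproduced with a toy experiment (illustrative only, not the paper's evaluation):

```python
import numpy as np

# Scales sampled log-uniformly over 1-1000 cm to mimic the long tail.
rng = np.random.default_rng(0)
scales = 10 ** rng.uniform(0.0, 3.0, size=10_000)

# Linear min-max normalization vs. logarithmic normalization.
linear = (scales - scales.min()) / (scales.max() - scales.min())
logn = np.log10(scales) / 3.0        # maps [1, 1000] cm onto [0, 1]

small = scales < 10                   # roughly a third of the samples
# Linear normalization squashes all small objects below ~0.01, so a
# fixed regression error wipes out their relative accuracy; the log
# map spreads them over [0, 1/3] instead, but compresses large ones.
```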

Highlights & Insights

  • Addressing a critical gap: this work is the first to systematically define and annotate a complete physical-property spectrum for 3D objects, offering substantial value to the embodied AI community
  • Scalable annotation pipeline design: the combination of GPT-4o and human verification is both efficient and reliable, and is directly reusable for new datasets
  • Exploitation of physical–structural correlation is a well-motivated design choice: physical attributes (e.g., material → density → motion characteristics) are inherently correlated with geometric shape, making joint modeling natural and effective
  • The 6-million-scale procedural augmentation of PhysXNet-XL provides a viable path toward large-scale physical 3D data

Limitations & Future Work

  • Physical attribute generation may produce spatially inconsistent artifacts (e.g., discontinuous material or affordance predictions across adjacent regions)
  • Regression-based prediction of kinematic parameters struggles to accurately determine part counts and parent–child hierarchical relationships
  • The dataset is constrained by PartNet's indoor/CAD model distribution, lacking outdoor and real-scan data
  • Functional description relies on CLIP encodings, whose non-invertibility limits the ability to decode embeddings back into text
  • Only four physical property types are used for generation; finer-grained quantities such as friction coefficients are not included

Additional Notes

  • TRELLIS serves as the foundation for structured 3D generation; PhysXGen extends it by superimposing a physical branch onto its latent space
  • Compared to PartNet-Mobility, PhysXNet represents a qualitative leap in both annotation dimensions and scale
  • The "part-level visual isolation + VLM annotation" strategy in the annotation pipeline generalizes to other scenarios requiring fine-grained annotation
  • The work has direct downstream value for robotic manipulation and physics-based simulation

Rating

  • Novelty: ⭐⭐⭐⭐ First to define the physical-property 3D generation problem; dataset contribution is particularly notable
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comparisons against multiple baselines and ablations validate core design choices, though the test set is limited to indoor CAD models
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and the annotation pipeline is described in thorough detail
  • Value: ⭐⭐⭐⭐⭐ The dataset provides significant impetus for the embodied AI and robotics communities; the research direction is of great importance