
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis

Conference: CVPR 2026
arXiv: 2508.13911
Code: Project Page
Area: 3D Vision / Physical Simulation
Keywords: 4D synthesis, physics-aware Gaussian, feed-forward inference, DPO alignment, single image to 4D, MPM simulation

TL;DR

PhysGM is the first framework to predict 3DGS and physical attributes (material category, Young's modulus, Poisson's ratio) from a single image in a feed-forward manner. A two-stage training pipeline (supervised pretraining followed by DPO preference fine-tuning) avoids both SDS and differentiable physics engines. Trained on the PhysAssets dataset of 50K+ assets, the method generates high-fidelity 4D physical simulations in under one minute and surpasses per-scene optimization methods in both CLIP similarity and human preference rate.

Background & Motivation

Background: Physical 4D synthesis conventionally requires multi-view 3DGS reconstruction (hours), manual physical parameter specification, and subsequent simulation. SDS-based methods (OmniPhysGS/DreamPhysics) attempt to distill physical priors from video models, but require differentiable physics engines, making them computationally expensive and unstable.

Limitations of Prior Work: Three key bottlenecks: (a) dependence on pre-reconstructed 3DGS (dense multi-view inputs and per-scene optimization); (b) physical attributes either manually specified or SDS-optimized (inflexible/unstable); (c) naive coupling of 3DGS and physics modules that ignores physical cues embedded in appearance.

Key Challenge: Per-scene optimization inherently lacks generalizability — each new scene must be processed from scratch. SDS is data-driven but requires a differentiable physics engine and remains unstable.

Goal: Can per-scene optimization be entirely bypassed by learning a generative model that directly synthesizes complete physical 4D simulations from sparse inputs via feed-forward inference?

Key Insight: Reframe the problem from "slow iterative reconstruction" to "amortized feed-forward inference" — train a large Transformer on large-scale data to learn generalizable physical priors.

Core Idea: A feed-forward Transformer that jointly predicts 3DGS and physical attributes, combined with probabilistic physical modeling and DPO preference fine-tuning (instead of SDS), completing 4D inference in a single forward pass.

Method

Overall Architecture

Input: 1–4 RGB images + camera parameters → DINOv3 image encoding + Plücker ray camera encoding → token concatenation + 3 global tokens → 24-layer Transformer → dual-branch prediction (DPT head → 3DGS parameters \(\psi\); physics head → distribution over physical attributes \(\theta\)) → MPM simulator → 4D dynamic sequence.

Key Designs

Multi-Modal Tokenization and Global Physics Tokens:
- Function: Uniformly encode image and geometric information; introduce global tokens to aggregate scene-level physical information.
- Mechanism: DINOv3 encodes image patches; Plücker ray coordinates encode the primary ray of each pixel; 3 learnable global tokens, later consumed by the physics head, are appended to the concatenated token sequence. For single-image inference, MVAdapter synthesizes auxiliary back/left/right views.
- Design Motivation: Global tokens keep physical attribute prediction from relying solely on local features, so that material properties can be inferred from holistic scene appearance cues.
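
As a rough illustration of the per-pixel Plücker encoding, the sketch below computes the 6-D vector (direction, moment) for the ray through one pixel. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world rotation `R`, and a camera center `origin`. The function name, the half-pixel offset, and the ordering of direction and moment are illustrative assumptions, not details taken from the paper.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(u, v, fx, fy, cx, cy, R, origin):
    """Return (dx, dy, dz, mx, my, mz) for the ray through pixel (u, v).

    R is a 3x3 camera-to-world rotation (nested lists); origin is the
    camera center in world coordinates. Hypothetical helper for illustration.
    """
    # Ray direction in the camera frame (+z forward, pixel-center convention).
    d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
    # Rotate into the world frame and normalize to unit length.
    d_world = tuple(sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # The moment o x d pins down the line's position in space.
    m = cross(origin, d)
    return d + m

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
center_ray = plucker_ray(31.5, 31.5, 50.0, 50.0, 32.0, 32.0, identity, (1.0, 0.0, 0.0))
corner_ray = plucker_ray(0, 0, 50.0, 50.0, 32.0, 32.0, identity, (1.0, 2.0, 3.0))
```

Because the moment is a cross product with the direction, the two halves of the encoding are always orthogonal, which makes a convenient sanity check when wiring such an encoder into a tokenizer.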
Probabilistic Physical Attribute Prediction Head:
- Function: Predict material category (classification) and probability distributions over continuous physical parameters (regression) from global tokens.
- Mechanism: Classification head \(f_{material}(g_k) \to C\); regression head outputs mean and variance \((\mu_\theta, \log\sigma_\theta^2) = f_{phys}(g_k)\), defining the conditional distribution \(P(\theta|I) = \mathcal{N}(\theta|\mu_\theta, \text{diag}(\sigma_\theta^2))\); physical parameters are obtained by sampling at inference time.
- Design Motivation: Probabilistic modeling captures the inherent uncertainty that "the same appearance may correspond to multiple physical parameters" and enables sampling of multiple candidates for DPO.
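
The probabilistic head can be sketched as follows: given a predicted mean and log-variance, candidates are drawn with the reparameterization \(\theta = \mu + \sigma\epsilon\), and the diagonal-Gaussian log-density is what a preference objective would later evaluate. The two-dimensional parameter layout (log10 of Young's modulus, Poisson's ratio) and all numeric values are hypothetical placeholders; the notes do not state the parameterization used by the paper.

```python
import math
import random

def sample_physics(mu, log_var, rng):
    """Draw one vector theta ~ N(mu, diag(exp(log_var)))."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def gaussian_log_prob(theta, mu, log_var):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (t - m) ** 2 / math.exp(lv))
        for t, m, lv in zip(theta, mu, log_var)
    )

# Hypothetical head output: [log10 Young's modulus in Pa, Poisson's ratio].
mu = [6.0, 0.3]
log_var = [math.log(0.25), math.log(0.0004)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
candidates = [sample_physics(mu, log_var, rng) for _ in range(4)]
log_probs = [gaussian_log_prob(c, mu, log_var) for c in candidates]
```

A point estimate corresponds to the degenerate case of zero variance, where every draw would be identical and no winner/loser pair could be formed.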
DPO Preference Fine-Tuning as a Replacement for SDS:
- Function: Align simulation outputs with ground-truth videos via preference learning, entirely bypassing differentiability requirements.
- Mechanism: The pretrained policy is frozen as \(\pi_{ref}\); \(K\) candidate physical parameter sets are sampled from the trainable policy \(\pi_\omega\), and each is passed through MPM simulation and rendering; perceptual distances to ground truth are computed using SAM-2 segmentation and CoTracker-3 trajectory extraction; the closest and farthest candidates form the winner/loser pair; the DPO loss is then minimized: \(L_{DPO} = -\mathbb{E}[\log\sigma(\beta\log\frac{\pi_\omega(\phi_w|z)}{\pi_{ref}(\phi_w|z)} - \beta\log\frac{\pi_\omega(\phi_l|z)}{\pi_{ref}(\phi_l|z)})]\), where \(\phi_w\) and \(\phi_l\) are the winner and loser parameter sets, \(z\) is the conditioning input, and \(\beta\) controls how far \(\pi_\omega\) may drift from \(\pi_{ref}\).
- Design Motivation: SDS requires gradients to flow through the physics engine. DPO treats simulation and rendering as a black box — learning from output quality comparisons alone — substantially simplifying training.
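
A minimal sketch of the DPO objective for a single winner/loser pair is given below, assuming both policies are diagonal Gaussians over the physical parameters. The value `beta=0.1` and the toy one-dimensional numbers are illustrative assumptions; the notes do not report the value used in the paper. Only log-probabilities of already-simulated candidates enter the loss, so nothing is differentiated through the simulator.

```python
import math

def gaussian_log_prob(theta, mu, log_var):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (t - m) ** 2 / math.exp(lv))
        for t, m, lv in zip(theta, mu, log_var)
    )

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # softplus(-margin) equals -log(sigmoid(margin)) and avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy 1-D setup: the winner lies at +1, the loser at -1.
phi_w, phi_l = [1.0], [-1.0]
ref_mu, ref_log_var = [0.0], [0.0]   # frozen reference policy
pol_mu, pol_log_var = [0.5], [0.0]   # trainable policy, mean moved toward the winner

ref_w = gaussian_log_prob(phi_w, ref_mu, ref_log_var)
ref_l = gaussian_log_prob(phi_l, ref_mu, ref_log_var)

# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)
loss_after_shift = dpo_loss(
    gaussian_log_prob(phi_w, pol_mu, pol_log_var),
    gaussian_log_prob(phi_l, pol_mu, pol_log_var),
    ref_w, ref_l,
)
```

In this toy case the loss starts at log 2 and decreases once the policy mean moves toward the winner, which is the direction gradient descent on the physics head would push it.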
PhysAssets Dataset (50K+):
- Function: Construct the first large-scale dataset pairing 3D assets with physical annotations and reference simulation videos.
- Mechanism: Assets are aggregated from Objaverse, OmniObject3D, ABO, and HSSD; Qwen3VL, a multimodal LLM, infers material categories and physical parameters from multi-view images; Framepack generates ground-truth simulation videos.
- Design Motivation: Supports both supervised pretraining (with GT physical parameters) and DPO fine-tuning (with GT simulation videos), filling a critical data gap in the field.

Loss & Training

Two stages: Stage 1 performs large-scale supervised pretraining with joint optimization of reconstruction loss (MSE + Alpha + LPIPS) and physical prediction loss. Stage 2 freezes the backbone and fine-tunes only the physics head via DPO. Training uses 32 × A800 GPUs for 3 days with batch size 8 per GPU. MPM simulation parameters: sub-step time \(2\times10^{-5}\) s, frame time \(4\times10^{-2}\) s, 50 frames per sequence.
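
The MPM settings above imply a fixed simulation budget per sequence, which the short calculation below works out. The variable names are illustrative, and the playback rate assumes the frames are shown in real time.

```python
substep_dt = 2e-5   # seconds per MPM sub-step
frame_dt = 4e-2     # seconds of simulated time per rendered frame
num_frames = 50     # frames per sequence

substeps_per_frame = round(frame_dt / substep_dt)   # 2000 sub-steps per frame
total_substeps = substeps_per_frame * num_frames    # 100000 sub-steps per sequence
simulated_seconds = frame_dt * num_frames           # about 2 s of simulated motion
realtime_fps = 1.0 / frame_dt                       # about 25 frames per second
```

Each candidate evaluated during DPO fine-tuning therefore costs on the order of 100,000 MPM sub-steps, which is consistent with MPM being named as the main runtime bottleneck.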

Key Experimental Results

Main Results (5 Material Categories)

Method	metal CLIP	jelly CLIP	plasticine CLIP	snow CLIP	sand CLIP	avg CLIP	avg UPR
OmniPhysGS	0.215	0.229	0.214	0.183	0.205	0.209	10%
DreamPhysics	0.227	0.246	0.244	0.207	0.222	0.229	17.2%
PhysGM (w/o DPO)	0.270	0.270	0.255	0.254	0.298	0.269	30%
PhysGM (w/ DPO)	0.273	0.277	0.269	0.255	0.300	0.275	42.8%

Ablation Study

Configuration	avg CLIP_sim	avg UPR	Note
PhysGM w/o DPO	0.269	30%	Pretraining only
PhysGM w/ DPO	0.275	42.8%	DPO significantly improves UPR (+12.8 percentage points)

Key Findings

Feed-forward outperforms per-scene optimization: PhysGM surpasses SDS baselines (requiring hours per scene) on both CLIP_sim and UPR across all material types — demonstrating that feed-forward inference does not sacrifice quality.
DPO improves perceptual quality rather than metric scores: Post-DPO CLIP_sim improvement is marginal, but UPR increases substantially by 12.8 percentage points — indicating that preference fine-tuning primarily enhances human-perceived physical realism.
Probabilistic modeling is the foundation for DPO: Replacing the probability distribution with a point estimate prevents effective multi-candidate sampling for DPO, causing fine-tuning to fail.
Joint training outperforms disentangled modules: Jointly predicting appearance and physics outperforms separate modules — validating the hypothesis that appearance encodes physical cues.
Speed: Complete 4D simulation in under 1 minute vs. hours for SDS-based methods.

Highlights & Insights

Feed-forward physical inference paradigm — A paradigm shift from "per-scene optimization" to "amortized inference." PhysGM demonstrates that large models trained on large-scale data can learn generalizable physical priors, replacing hours of optimization with a single forward pass.
Novel application of DPO in generative modeling — Transferring DPO from language model preference alignment to physical simulation quality alignment. The idea of constructing preference pairs from non-differentiable simulator outputs is highly innovative.
Elegant probabilistic physical modeling — Predicting distributions rather than point estimates simultaneously captures uncertainty and provides the sampling basis for DPO — a design that serves two purposes at once.

Limitations & Future Work

LLM-dependent dataset annotation: Physical parameters inferred by Qwen3VL may lack precision; measurements from dedicated physical experiments would be more reliable.
GT video quality: Reference simulation videos generated by Framepack may not themselves be fully physically realistic.
Limited material categories: Only 5 material categories are covered; composite materials and fluids are not addressed.
Predominantly single-object scenes: The capability to handle multi-object interaction scenes remains to be validated.
Future directions: Incorporating real physical experiment videos as ground truth; expanding material categories to include fluids and cloth; supporting interactive physical manipulation by users.

Related Work & Insights

vs PhysGaussian: PhysGaussian pioneered 3DGS + MPM coupling but requires manual parameter specification per scene; PhysGM automatically predicts physical attributes without pre-reconstruction.
vs OmniPhysGS/DreamPhysics: These SDS-based methods require lengthy per-scene optimization (reported as >12 h and >0.5 h respectively); PhysGM completes inference in under 1 minute with superior results.
vs LGM/GS-LRM: These feed-forward 3D reconstruction methods handle only static scenes; PhysGM is the first to embed physical inference for dynamic 4D generation.
Insight: The DPO + non-differentiable simulation paradigm is generalizable to any generative task requiring feedback from black-box simulators (robotic control, fluid simulation, etc.).

Rating

Novelty: ⭐⭐⭐⭐⭐ First feed-forward physics-aware 4D generation framework; DPO as a replacement for SDS is a genuinely novel contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ Five-material comparison + ablations + user study; more quantitative ablations would strengthen the paper.
Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, systematic methodology, and naturally motivated two-stage training.
Value: ⭐⭐⭐⭐⭐ Pioneering value for 4D synthesis and physics-aware 3D vision.

Experiments

Table 1: Physical Simulation Quality Comparison (5 Material Categories)

Method	metal CLIPsim	jelly CLIPsim	plasticine CLIPsim	snow CLIPsim	sand CLIPsim	avg CLIPsim	avg UPR
OmniPhysGS	0.2149	0.2291	0.2135	0.1834	0.2047	0.2091	10%
DreamPhysics	0.2273	0.2459	0.2437	0.2071	0.2217	0.2291	17.2%
PhysGM (w/o DPO)	0.2698	0.2700	0.2547	0.2541	0.2980	0.2693	30%
PhysGM (w/ DPO)	0.2732	0.2774	0.2691	0.2548	0.2997	0.2748	42.8%

Table 2: Multi-View Reconstruction Quality (GSO Dataset)

Method	Resolution	PSNR↑	SSIM↑	LPIPS↓
LGM	256	21.44	0.832	0.122
PhysGM (ours)	256	25.47	0.916	0.071
GS-LRM	512	30.52	0.952	0.050
PhysGM (ours)	512	28.95	0.953	0.039

Table 3: Efficiency and Generalization Comparison

Method	Training Paradigm	Generalizable	Inference Time	CLIPsim
OmniPhysGS	SDS	✗	>12h	0.2091
DreamPhysics	SDS	✗	>0.5h	0.2291
PhysGM	DPO	✓	<1min	0.2748
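
The inference times in Table 3 are bounds rather than exact measurements, so any speedup figure derived from them is a lower bound. The arithmetic below makes that explicit; the dictionary layout is illustrative.

```python
MINUTES_PER_HOUR = 60

# Table 3 reports the baselines as taking MORE than these times per scene.
baseline_minutes = {
    "OmniPhysGS": 12 * MINUTES_PER_HOUR,
    "DreamPhysics": 0.5 * MINUTES_PER_HOUR,
}
# PhysGM is reported as taking LESS than one minute.
physgm_minutes = 1.0

# Dividing a lower bound by an upper bound gives a lower bound on the speedup.
min_speedup = {name: t / physgm_minutes for name, t in baseline_minutes.items()}
print(min_speedup)  # {'OmniPhysGS': 720.0, 'DreamPhysics': 30.0}
```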

Key Findings

Feed-forward vs. per-scene optimization: PhysGM completes single-image-to-4D simulation in under 1 minute (inference <30s + MPM simulation), compared to >12h for OmniPhysGS and >0.5h for DreamPhysics.
DPO substantially improves simulation quality: Adding DPO raises CLIPsim from 0.2693 to 0.2748 and UPR from 30% to 42.8% (a 12.8 percentage point gain in human preference rate).
Reconstruction quality is competitive with specialized methods: PSNR exceeds LGM by 4.03 dB at 256 resolution; at 512 resolution, LPIPS is superior to GS-LRM using only 10% of its training data.
Only fully automatic solution: PhysGM is the only method that simultaneously requires no pre-optimized 3DGS, no predefined physical parameters, generalizes across scenes, does not depend on an LLM at inference time, and completes inference in under 30 seconds.

Highlights

Paradigm innovation: Transforms physical 4D synthesis from per-scene optimization to feed-forward inference, achieving a speedup of over 720× compared to OmniPhysGS (12h).
DPO for physical simulation alignment: The first application of DPO in the physical simulation domain, bypassing the requirement for differentiable physics engines by constructing preference pairs from black-box simulator outputs.
Probabilistic physical prediction: Outputting distributions over physical attributes rather than point estimates naturally supports DPO sampling and uncertainty modeling.
SAM-2 + CoTracker-3 for preference label construction: An automated preference annotation pipeline that quantifies simulation fidelity relative to ground truth via instance segmentation and trajectory tracking.
Large-scale physical annotation dataset PhysAssets: 50K+ assets spanning metal, jelly, plasticine, snow, sand, and other material categories, filling a critical data gap in the field.
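
The automated preference labelling named in the highlights can be sketched as a distance computation followed by an argmin/argmax. In the described pipeline the point tracks would come from CoTracker-3 run on SAM-2 foreground regions; here they are plain lists of (x, y) points, and the mean Euclidean track distance is an assumed stand-in for the perceptual distance, whose exact form is not given in these notes.

```python
import math

def trajectory_distance(tracks_a, tracks_b):
    """Mean Euclidean distance between corresponding 2-D track points.

    Each argument is a list of frames; each frame is a list of (x, y)
    points in the same order. Hypothetical metric for illustration.
    """
    total, count = 0.0, 0
    for frame_a, frame_b in zip(tracks_a, tracks_b):
        for (xa, ya), (xb, yb) in zip(frame_a, frame_b):
            total += math.hypot(xa - xb, ya - yb)
            count += 1
    return total / count

def select_preference_pair(candidate_tracks, gt_tracks):
    """Return (winner_index, loser_index, distances) for K candidates."""
    dists = [trajectory_distance(c, gt_tracks) for c in candidate_tracks]
    winner = min(range(len(dists)), key=dists.__getitem__)  # closest to ground truth
    loser = max(range(len(dists)), key=dists.__getitem__)   # farthest from ground truth
    return winner, loser, dists

def shift(tracks, dx):
    """Translate every tracked point horizontally by dx pixels."""
    return [[(x + dx, y) for (x, y) in frame] for frame in tracks]

# Two frames, two tracked points each; three candidates with known offsets.
gt = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
candidates = [shift(gt, 0.5), shift(gt, 3.0), shift(gt, 0.0)]
winner, loser, dists = select_preference_pair(candidates, gt)
```

Only the ranking of the candidates matters for forming the pair, so any monotone rescaling of the distance would yield the same winner and loser.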

Limitations & Future Work

MPM simulation as a computational bottleneck: MPM simulation (200³ grid resolution) remains the primary time-consuming step in 4D synthesis, limiting real-time applicability; efficient alternatives for fluid and fracture simulation are lacking.
Sim-to-real gap: Training data is based on synthetic simulation videos (generated by Framepack), and simplified constitutive models introduce an inherent discrepancy with real physics, limiting robustness in real-world deployment.
SH degree limitation: Spherical harmonics are set to degree 0 (diffuse only), precluding modeling of view-dependent specular effects.
Single-image depth ambiguity: 3D reconstruction accuracy from a single image is limited by occlusion and depth uncertainty.
Annotation accuracy: Although 50K+ assets are included, physical attribute annotations are inferred by an MLLM rather than physically measured, which limits their accuracy.

Related Work & Insights

vs PhysGaussian: Pioneered 3DGS + MPM coupling but requires manual per-scene parameter tuning; PhysGM predicts physical attributes automatically.
vs DreamPhysics/OmniPhysGS: Distill physical parameters from video models via SDS, requiring a differentiable simulator and hours of optimization per scene; PhysGM uses DPO to bypass differentiability.
vs PhysDreamer: Also optimizes Young's modulus via SDS but lacks generalizability; PhysGM is the first to achieve cross-scene generalization.
vs PhysSplat: Uses an LLM to infer physical parameters but depends on pre-reconstructed 3DGS; PhysGM operates end-to-end in a feed-forward manner.
vs LGM/GS-LRM: Feed-forward 3D reconstruction methods that predict only static geometry, without physical attributes.

Rating

Novelty: ⭐⭐⭐⭐⭐ First feed-forward physics-aware 4D synthesis framework; DPO for physical alignment is a field-first contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative comparison across 5 material categories + multi-view reconstruction ablation + user study; real-world quantitative evaluation is lacking.
Writing Quality: ⭐⭐⭐⭐ Clear logic, well-motivated two-stage training, and detailed method description.
Value: ⭐⭐⭐⭐⭐ Fundamentally transforms the paradigm of physical 4D synthesis — from hour-scale optimization to second-scale inference.
