PhysSkin: Real-Time and Generalizable Physics-Based Skin Simulation¶
Conference: CVPR 2026 | arXiv: 2603.23194 | Code: Project Page | Area: Physics Simulation / 3D Animation | Keywords: Physics-based Animation, Neural Skinning Field, Self-Supervised Learning, Subspace Physics, Linear Blend Skinning
TL;DR¶
PhysSkin is a generalizable physics-informed framework that learns continuous skinning weight fields directly from static 3D geometry via a neural skinning field autoencoder, coupled with a physics-informed self-supervised learning strategy (energy minimization + smoothness + orthogonality constraints), enabling real-time physics-based animation that generalizes across shapes and discretizations without any annotated data or simulation trajectories.
Background & Motivation¶
Real-time physics-based animation is a long-standing goal in computer vision and graphics, with significant implications for VR/AR, character animation, and interactive digital content creation. Current methods face the following challenges:
Classical subspace methods (e.g., full-space FEM/MPM): require solving large-scale nonlinear optimization in high-dimensional full space, making real-time performance infeasible; even with subspace dimensionality reduction, the mapping matrix must be optimized for a specific mesh topology, precluding generalization.
Neural subspace methods (e.g., CROM, Simplicits): employ neural networks to learn subspace mappings, but require training a separate network per object, preventing cross-shape generalization.
Supervised skinning methods (e.g., RigNet, Anymate): learn skeletons and skinning weights from expert-annotated data, but annotation is costly, physical constraints are absent, and approaches often rely on category-specific priors (e.g., human/animal skeleton templates).
Core Problem: How can a physically consistent, cross-shape generalizable, discretization-agnostic deformation subspace mapping be learned without any annotated data?
Method¶
Overall Architecture¶
The core idea of PhysSkin is to learn a continuous skinning weight field as basis functions for the subspace mapping in the spirit of Linear Blend Skinning (LBS), lifting handle transformations (subspace coordinates) to full-space deformations.
Pipeline:
1. 3D shape → sampled surface points + volumetric cubature points
2. Surface points → Transformer encoder → shape latent representation
3. Latent representation → cross-attention decoder → continuous skinning weight field
4. Physics-informed self-supervised loss optimizes network parameters
5. At inference: given a new shape → feedforward inference of skinning field → subspace dynamics solving → real-time animation
Key Designs¶
- Skinning Field Subspace Representation (Theoretical Foundation)
- Based on LBS, the full-space displacement is represented as a weighted superposition of \(m\) affine transformations: \(\phi(\mathbf{X}, \mathbf{z}) = \mathbf{X} + \sum_{i=1}^m W_i(\mathbf{X}) \mathbf{Z}_i \begin{bmatrix}\mathbf{X}\\1\end{bmatrix}\)
- \(W_i(\mathbf{X})\): skinning weight of the \(i\)-th handle at spatial point \(\mathbf{X}\)
- \(\mathbf{Z}_i \in \mathbb{R}^{3\times 4}\): transformation of the \(i\)-th handle
- Subspace coordinates \(\mathbf{z} \in \mathbb{R}^{12m}\) (\(m \ll n\)), full space \(s \in \mathbb{R}^{3n}\)
- Implicit time integration is used to solve dynamics in the subspace: \(\mathbf{z}_{t+1} = \arg\min_{\mathbf{z}} \frac{1}{2h^2}\|\mathbf{z} - 2\mathbf{z}_t + \mathbf{z}_{t-1}\|_\mathbf{M}^2 + E_{pot}(\phi(\mathbf{X}, \mathbf{z}))\)
- Subspace dimensionality is far smaller than full space → Newton's method converges rapidly → real-time animation
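The LBS lift from subspace coordinates to full-space positions can be sketched in a few lines of numpy. This is a toy illustration, not the paper's code: the weights and handle transformations here are random stand-ins for the learned skinning field and the solved subspace coordinates.

```python
import numpy as np

def lbs_deform(X, W, Z):
    """Lift subspace coordinates to full-space positions via LBS.

    X : (n, 3) rest-pose points
    W : (n, m) skinning weights, W[:, i] = W_i(X)
    Z : (m, 3, 4) per-handle affine transformations
    Returns deformed positions phi(X, z) of shape (n, 3).
    """
    n = X.shape[0]
    Xh = np.concatenate([X, np.ones((n, 1))], axis=1)   # (n, 4) homogeneous coords
    # Per-handle affine action on every point: (m, n, 3)
    per_handle = np.einsum("mij,nj->mni", Z, Xh)
    # Weighted superposition of handle contributions
    return X + np.einsum("nm,mni->ni", W, per_handle)

# Toy sizes: n = 500 points, m = 10 handles -> 12m = 120 subspace dims vs 3n = 1500
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
W = rng.standard_normal((500, 10)) * 0.1
Z = rng.standard_normal((10, 3, 4)) * 0.01

phi = lbs_deform(X, W, Z)
print(phi.shape)   # (500, 3)
```

Setting all handle transformations to zero recovers the rest pose, which is the sanity check that the mapping is a displacement field on top of \(\mathbf{X}\).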
- Neural Skinning Field Autoencoder (Architectural Core)
- Encoder: Transformer-based point cloud encoder following Michelangelo
- Samples 4096 surface points to extract shape latent representation \(\mathbf{F}_s \in \mathbb{R}^{256 \times 768}\)
- Uses cross-attention + 8-layer self-attention for iterative refinement
- Pre-trained on ShapeNet via SDF reconstruction; frozen during training
- Decoder (three-stage cross-attention design):
- Stage 1: \(m\) learnable handle tokens \(\mathbf{Q}_h\) extract handle latent representations \(\mathbf{F}_h\) from \(\mathbf{F}_s\) via cross-attention
- Stage 2: arbitrary spatial query points \(\mathbf{X}\) extract per-point skinning features \(\mathbf{F}_p\) from \(\mathbf{F}_h\) via cross-attention
- Stage 3: ResNet-style MLP decodes features into skinning weights \(W(\mathbf{X}) \in \mathbb{R}^m\)
- Design motivation: the three-stage cross-attention realizes a natural hierarchy of "shape → handles → points" and is mesh-agnostic
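The shape-to-handles-to-points hierarchy can be sketched with single-head attention in numpy. This is a heavily simplified stand-in: the real decoder uses multi-head attention with learned projections and a larger latent width (768); the toy dimension `d`, the random handle tokens, and the linear point embedding `E` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Single-head scaled dot-product attention (a simplification of the
    paper's multi-head cross-attention blocks)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

d, m, n_pts = 64, 10, 200               # toy latent width, handles, query points
F_s = rng.standard_normal((256, d))     # shape latent tokens from the frozen encoder
Q_h = rng.standard_normal((m, d))       # learnable handle tokens (random here)
X   = rng.standard_normal((n_pts, 3))   # spatial query points

# Stage 1: handle tokens attend to the shape latent -> handle latents F_h
F_h = attention(Q_h, F_s, F_s)          # (m, d)

# Stage 2: embedded query points attend to handle latents -> point features F_p
E = rng.standard_normal((3, d))         # stand-in for a learned point embedding
F_p = attention(X @ E, F_h, F_h)        # (n_pts, d)

# Stage 3: an MLP head maps point features to m skinning weights per point
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, m)) / np.sqrt(d)
weights = np.maximum(F_p @ W1, 0) @ W2  # (n_pts, m), one weight per handle
print(weights.shape)   # (200, 10)
```

The key property the sketch preserves is mesh-agnosticism: `X` can be any set of query points, so the same decoder evaluates the skinning field at arbitrary resolution.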
- Cubature Point Sampling (Discretization-Agnostic Design)
- Instead of fixed mesh topology, surface and volumetric points are sampled
- Surface points: Sharp Edge Sampling (SES) to capture geometric details
- Volumetric points: converted to watertight mesh → voxel grid → ray casting to classify interior/exterior points
- 1000 points are randomly sampled from the candidate set per training batch
- Design motivation: volumetric points capture interior deformation behavior that surface points alone cannot characterize
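The volumetric side of the sampling can be sketched as follows. To keep the example self-contained, an analytic signed distance function (a unit sphere) stands in for the paper's watertight-mesh, voxel-grid, and ray-casting pipeline; only the grid-then-classify-then-subsample pattern is carried over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the watertight-mesh pipeline: interior/exterior is decided by
# an analytic SDF (unit sphere) instead of voxel-grid ray casting.
def sdf(p):
    return np.linalg.norm(p, axis=-1) - 1.0

# Candidate volumetric points on a regular grid over the bounding box
axis = np.linspace(-1.2, 1.2, 24)
gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
candidates = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

# Keep only interior candidates
interior = candidates[sdf(candidates) < 0.0]

# Per training batch: randomly sample 1000 cubature points from the candidates
batch = interior[rng.choice(len(interior), size=1000, replace=False)]
print(batch.shape)   # (1000, 3)
```

Resampling a fresh batch each step means the losses below are evaluated on varying point sets, which is what keeps training independent of any fixed discretization.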
- ONI Orthogonalization Layer
- Applies an Orthogonalization by Newton's Iteration (ONI) module at the final MLP layer
- Uses ELU activation to allow signed skinning weights (without enforcing non-negativity), enhancing expressiveness
- Design motivation: directly promotes orthogonality in the network forward pass, alleviating the optimization pressure on the loss
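Orthogonalization by Newton's Iteration avoids an explicit eigendecomposition by iterating toward \(S^{-1/2}\) with matrix products only. The sketch below uses a standard coupled Newton-Schulz iteration; the exact variant and normalization inside the paper's ONI layer may differ.

```python
import numpy as np

def oni(V, iters=20):
    """Orthogonalize the rows of V in the spirit of the ONI layer.

    V : (m, d) raw row vectors; returns W = (V V^T)^{-1/2} V with W @ W.T ~ I.
    """
    m = V.shape[0]
    S = V @ V.T
    tr = np.trace(S)
    Sn = S / tr                          # scale so eigenvalues lie in (0, 1]
    Y, Z = Sn.copy(), np.eye(m)
    for _ in range(iters):               # coupled Newton iteration: Z -> Sn^{-1/2}
        T = 0.5 * (3.0 * np.eye(m) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return (Z @ V) / np.sqrt(tr)         # undo the trace scaling: S^{-1/2} V

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 16))
W = oni(V)
print(np.round(W @ W.T, 4))              # approximately the 4x4 identity
```

Because every operation is a differentiable matrix product, the layer can sit in the forward pass and backpropagate normally, which is what lets it relieve pressure on the orthogonality loss.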
Loss & Training¶
Physics-Informed Self-Supervised Learning (PISSL) — Joint Optimization of Three Constraints
- Potential Energy Minimization Loss \(\mathcal{L}_{pot}\):
- Samples random subspace coordinates \(\mathbf{z}\) from a Gaussian distribution and minimizes expected potential energy
- Uses linear interpolation between linear elastic and Neo-Hookean material models for improved stability
- Ensures the skinning field encodes low-energy deformation modes
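The blended material model can be sketched directly at the level of energy densities. The Lame parameters, the blend weight, and the use of random small deformation gradients (standing in for gradients induced by sampled \(\mathbf{z}\)) are all assumptions for illustration.

```python
import numpy as np

mu, lam, alpha = 1.0, 1.0, 0.5       # Lame parameters and blend weight (assumed)

def linear_elastic(F):
    """Small-strain linear elastic energy density."""
    eps = 0.5 * (F + F.T) - np.eye(3)
    return mu * np.sum(eps**2) + 0.5 * lam * np.trace(eps)**2

def neo_hookean(F):
    """A standard compressible Neo-Hookean energy density."""
    J = np.linalg.det(F)
    return (0.5 * mu * (np.trace(F.T @ F) - 3.0)
            - mu * np.log(J) + 0.5 * lam * np.log(J)**2)

def blended_energy(F):
    """Linear interpolation between the two material models."""
    return (1.0 - alpha) * linear_elastic(F) + alpha * neo_hookean(F)

# Expected potential over random samples: small random deformation gradients
# stand in for the gradients induced by Gaussian-sampled subspace coordinates.
rng = np.random.default_rng(0)
Fs = np.eye(3) + 0.05 * rng.standard_normal((64, 3, 3))
L_pot = np.mean([blended_energy(F) for F in Fs])
print(L_pot)   # positive scalar; zero at the rest pose F = I
```

Both densities vanish at \(F = I\), so minimizing the expectation pushes the field toward weight configurations whose induced deformations stay near low-energy modes.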
- Spatial Smoothness Loss \(\mathcal{L}_{smooth}\):
- \(\mathcal{L}_{smooth} = \mathbb{E}_{\mathbf{X}}\sum_{i=1}^m \|\nabla\Phi_\theta^i(\mathbf{X})\|^2\)
- Penalizes the magnitude of spatial gradients of skinning weights, ensuring artifact-free deformations
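As a sketch of the smoothness term, the sum of squared spatial gradients can be estimated with central finite differences; the paper differentiates the network analytically, and the two-mode analytic skinning field below is a toy assumption.

```python
import numpy as np

def smoothness_loss(skin_fn, X, h=1e-3):
    """Finite-difference estimate of E_X [ sum_i ||grad W_i(X)||^2 ]."""
    grads = []
    for k in range(3):                   # central differences along x, y, z
        e = np.zeros(3)
        e[k] = h
        grads.append((skin_fn(X + e) - skin_fn(X - e)) / (2.0 * h))
    G = np.stack(grads, axis=-1)         # (n, m, 3) spatial gradients
    return np.mean(np.sum(G**2, axis=(1, 2)))

# Toy skinning field: m = 2 smooth analytic weights per point
def skin_fn(X):
    return np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 3))
print(smoothness_loss(skin_fn, X))       # small positive scalar
```

A constant field has zero loss, so the penalty only suppresses rapid spatial variation of the weights, which is what prevents tearing-style deformation artifacts.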
- Orthogonality Constraint Loss \(\mathcal{L}_{orth}\):
- Computes the sum of squared inter-column inner products across all skinning modes to enforce orthogonality
- On-the-fly \(\ell_2\) normalization: normalizes each column of the skinning mode matrix at every training step → prevents numerical drift → facilitates convergence of the orthogonality constraint
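The constraint reduces to penalizing off-diagonal entries of a Gram matrix. A minimal sketch, with the columns of a mode matrix sampled at cubature points standing in for the skinning modes:

```python
import numpy as np

def orthogonality_loss(U):
    """Sum of squared inter-column inner products between skinning modes.

    U : (n, m) matrix whose columns are the m skinning modes sampled at
    n cubature points. Columns are l2-normalized on the fly, mirroring
    the paper's per-step normalization.
    """
    Un = U / np.linalg.norm(U, axis=0, keepdims=True)   # on-the-fly l2 normalization
    G = Un.T @ Un                                       # (m, m) Gram matrix
    off = G - np.diag(np.diag(G))                       # keep only cross terms
    return np.sum(off**2)

rng = np.random.default_rng(0)
U = rng.standard_normal((1000, 8))
print(orthogonality_loss(U))             # small but nonzero for random modes

# An exactly orthogonal basis gives zero loss
Q, _ = np.linalg.qr(rng.standard_normal((1000, 8)))
print(orthogonality_loss(Q))
```

Normalizing before forming the Gram matrix is what decouples the constraint from mode magnitudes: without it, the loss could be driven down by shrinking modes rather than decorrelating them.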
- ConFIG Conflict-Aware Gradient Correction:
- The three losses frequently conflict in their optimization directions (energy vs. smoothness vs. orthogonality)
- ConFIG is used to correct destructive gradient interference, achieving balanced optimization
- Design motivation: naive joint optimization leads to instability and non-convergence due to gradient conflicts
Total loss: \(\mathcal{L} = \mathcal{L}_{smooth} + \lambda_{pot}\mathcal{L}_{pot} + \lambda_{orth}\mathcal{L}_{orth}\)
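A minimal sketch of the conflict-free update in the spirit of ConFIG: solve for a direction with equal positive projection onto every unit loss gradient via a pseudoinverse, then rescale by the summed projections of the raw gradients. This follows the pseudoinverse formulation of the ConFIG paper as we understand it; the scaling convention here is an assumption.

```python
import numpy as np

def config_update(grads):
    """Conflict-free combined update direction for a list of loss gradients.

    grads : list of (d,) gradient vectors, one per loss term.
    """
    # Stack unit gradients and solve G v = 1 for an equal-projection direction
    G = np.stack([g / np.linalg.norm(g) for g in grads])
    v = np.linalg.pinv(G) @ np.ones(len(grads))
    v /= np.linalg.norm(v)
    # Rescale by the summed projections of the raw gradients onto v
    scale = sum(float(g @ v) for g in grads)
    return scale * v

# Two conflicting gradients (negative inner product between them)
g1 = np.array([1.0, 0.2, 0.0])
g2 = np.array([-0.5, 1.0, 0.0])
u = config_update([g1, g2])

# The corrected direction makes equal positive progress on both losses
c1 = (u / np.linalg.norm(u)) @ (g1 / np.linalg.norm(g1))
c2 = (u / np.linalg.norm(u)) @ (g2 / np.linalg.norm(g2))
print(c1, c2)   # equal positive cosines
```

By construction no loss term's gradient is opposed by the update, which is the property that naive gradient summation loses when the energy, smoothness, and orthogonality terms pull in conflicting directions.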
Key Experimental Results¶
Main Results¶
RigNet Dataset — Skinning Field Quality Evaluation
| Method | Orthogonality \(\Omega_{orth} \downarrow\) | Condition Number \(\kappa_{log} \downarrow\) | Spectral Entropy \(H_{spec} \uparrow\) |
|---|---|---|---|
| RigNet | 0.5324 | 2.7997 | 0.9762 |
| M-I-A | 1.4098 | 27.7357 | 0.7224 |
| Anymate | 1.5737 | 2.6093 | 0.9682 |
| Puppeteer | 0.5615 | 5.5605 | 0.9798 |
| PhysSkin | 0.0033 | 1.0453 | 0.9999 |
PhysSkin achieves orthogonality two orders of magnitude lower than the second-best method (RigNet).
ShapeNet Dataset
| Method | \(\Omega_{orth} \times 10^{-2} \downarrow\) | \(\kappa_{log} \downarrow\) | \(H_{spec} \uparrow\) |
|---|---|---|---|
| Simplicits (per-object training) | 0.2621 | 1.5205 | 0.9941 |
| Anymate | 5.3520 | 4.9221 | 0.8858 |
| PhysSkin | 0.0098 | 1.0460 | 0.9997 |
Even though Simplicits trains a dedicated network per object, PhysSkin's single generalized model substantially outperforms it.
Real-Time Animation Efficiency Comparison
| 3D Shape | Vertices | FEM per step (ms) | MPM per step (ms) | PhysSkin per step (ms) |
|---|---|---|---|---|
| Airplane | 10K | 79.83 | 141.83 | 12.26 |
| Bag | 121K | 3012.47 | 233.79 | 13.39 |
| Camera | 80K | 2121.02 | 203.38 | 12.52 |
| Pillow | 127K | 3170.93 | 251.81 | 13.74 |
PhysSkin is 6.5–230× faster than FEM and 11.5–18.3× faster than MPM, with runtime nearly independent of vertex count.
Ablation Study¶
| Configuration | \(\Omega_{orth} \times 10^{-2} \downarrow\) | \(\kappa_{log} \downarrow\) | \(H_{spec} \uparrow\) |
|---|---|---|---|
| w/o skinning normalization | 6.5533 | 8.5492 | 0.8113 |
| w/o ONI layer | 0.0081 | 1.0844 | 0.9997 |
| w/o ConFIG optimization | 8.9247 | 11.8595 | 0.7594 |
| w/o \(\mathcal{L}_{orth}\) | 100.0 | 29.18 | NaN |
| w/o \(\mathcal{L}_{smooth}\) | 0.0050 | 1.0567 | 0.9998 |
| Full Model | 0.0033 | 1.0453 | 0.9999 |
Key Findings¶
- ConFIG is the most critical component: its removal degrades orthogonality by 2700× (0.0033→8.9247), demonstrating that gradient conflict is the central optimization challenge.
- Orthogonality constraint is indispensable: removing \(\mathcal{L}_{orth}\) causes the orthogonality metric to reach 100 and spectral entropy to collapse to NaN.
- Skinning normalization has a significant impact: its removal degrades orthogonality by approximately 2000×.
- Runtime is nearly decoupled from vertex count: scaling from 10K to 127K vertices increases PhysSkin's per-step time only from 12.26 to 13.74 ms.
- A single model generalizes across all shapes: one PhysSkin model handles objects across all categories, whereas Simplicits requires per-object training.
Highlights & Insights¶
- Fully annotation-free physics-based skinning: no simulation trajectories, no expert-annotated skeletons or skinning weights are required — the method operates solely from static geometry, substantially lowering the barrier to 3D animation.
- Physics-constrained optimization strategy is the core contribution: the combination of on-the-fly normalization and ConFIG gradient correction resolves the fundamental conflicts inherent in multi-constraint optimization.
- Discretization-agnostic continuous skinning field: a single model can handle meshes of varying topology and resolution, and can be directly applied to 3D Gaussian splatting representations.
- Original evaluation metrics: three skinning quality metrics grounded in matrix analysis and spectral theory (orthogonality, condition number, spectral entropy) are proposed, filling the gap left by the absence of ground-truth references in self-supervised skinning evaluation.
- Real-time performance stems from a fundamental reduction in problem dimensionality: the subspace dimension \(12m\) (\(m\) handles × 12 parameters) is far smaller than the full-space dimension \(3n\), enabling rapid convergence of Newton's method in the subspace.
Limitations & Future Work¶
- Absence of semantic priors: skinning weights are driven entirely by physical constraints without incorporating semantic information (e.g., joint locations, functional parts), which may be suboptimal for complex topologies.
- Fixed number of handles: the choice of \(m\) bounds expressive capacity, but the paper does not sufficiently discuss adaptive selection strategies.
- Simplified material models: only hyperelastic materials (Neo-Hookean) are supported; plasticity, viscoelasticity, fracture, and other complex material behaviors are not covered.
- Evaluation limited to skinning quality: while animation results are demonstrated, quantitative accuracy comparisons against ground-truth simulation trajectories are absent.
- Dependence on pre-trained encoder: the shape encoder Michelangelo is pre-trained on ShapeNet; generalization to 3D shapes outside ShapeNet remains unverified.
Related Work & Insights¶
- Relationship to Simplicits (SIGGRAPH 2024): PhysSkin directly inherits the skinning field subspace formulation from Simplicits but addresses its two major shortcomings: (1) generalization — a single model vs. per-object training; (2) training stability — ConFIG + normalization vs. naive optimization.
- Distinction from Anymate: Anymate employs supervised learning from annotated data, whereas PhysSkin is fully self-supervised; Anymate outputs discrete skeleton weights while PhysSkin outputs a continuous field.
- Implications for multi-objective optimization: the ConFIG gradient correction approach may be broadly applicable to other multi-constraint learning settings such as physics-informed neural networks (PINNs).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Generalizable physics self-supervised skinning field with multi-constraint gradient correction; the solution is complete and original.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets, multiple baselines, comprehensive ablation, and efficiency comparisons.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear; architectural visualization is excellent.
- Value: ⭐⭐⭐⭐⭐ — Real-time performance + generalizability + annotation-free design make this industrially practical for 3D animation.