PhysSkin: Real-Time and Generalizable Physics-Based Skin Simulation¶
Conference: CVPR 2026 | arXiv: 2603.23194 | Code: Project Page | Area: Physics Simulation / 3D Animation | Keywords: Physics-based Animation, Neural Skinning Field, Self-Supervised Learning, Subspace Physics, Linear Blend Skinning
TL;DR¶
PhysSkin is a generalizable physics-informed framework that learns continuous skinning weight fields directly from static 3D geometry via a neural skinning field autoencoder, coupled with a physics-informed self-supervised learning strategy (energy minimization + smoothness + orthogonality constraints), enabling real-time physics-based animation that generalizes across shapes and discretizations without any annotated data or simulation trajectories.
Background & Motivation¶
Real-time physics-based animation is a long-standing goal in computer vision and graphics, with significant implications for VR/AR, character animation, and interactive digital content creation. Current methods face the following challenges:
Classical subspace methods (e.g., full-space FEM/MPM): require solving large-scale nonlinear optimization in high-dimensional full space, making real-time performance infeasible; even with subspace dimensionality reduction, the mapping matrix must be optimized for a specific mesh topology, precluding generalization.
Neural subspace methods (e.g., CROM, Simplicits): employ neural networks to learn subspace mappings, but require training a separate network per object, preventing cross-shape generalization.
Supervised skinning methods (e.g., RigNet, Anymate): learn skeletons and skinning weights from expert-annotated data, but annotation is costly, physical constraints are absent, and approaches often rely on category-specific priors (e.g., human/animal skeleton templates).
Core Problem: How can a physically consistent, cross-shape generalizable, discretization-agnostic deformation subspace mapping be learned without any annotated data?
Method¶
Overall Architecture¶
The core idea of PhysSkin is to learn a continuous skinning weight field as basis functions for the subspace mapping in the spirit of Linear Blend Skinning (LBS), lifting handle transformations (subspace coordinates) to full-space deformations.
Pipeline:
1. 3D shape → sampled surface points + volumetric cubature points
2. Surface points → Transformer encoder → shape latent representation
3. Latent representation → cross-attention decoder → continuous skinning weight field
4. Physics-informed self-supervised loss optimizes network parameters
5. At inference: given a new shape → feedforward inference of skinning field → subspace dynamics solving → real-time animation
Key Designs¶
- Skinning Field Subspace Representation (Theoretical Foundation)
- Based on LBS, the full-space displacement is represented as a weighted superposition of \(m\) affine transformations: \(\phi(\mathbf{X}, \mathbf{z}) = \mathbf{X} + \sum_{i=1}^m W_i(\mathbf{X}) \mathbf{Z}_i \begin{bmatrix}\mathbf{X}\\1\end{bmatrix}\)
- \(W_i(\mathbf{X})\): skinning weight of the \(i\)-th handle at spatial point \(\mathbf{X}\)
- \(\mathbf{Z}_i \in \mathbb{R}^{3\times 4}\): transformation of the \(i\)-th handle
- Subspace coordinates \(\mathbf{z} \in \mathbb{R}^{12m}\) (\(m \ll n\)), full space \(s \in \mathbb{R}^{3n}\)
- Implicit time integration is used to solve dynamics in the subspace: \(\mathbf{z}_{t+1} = \arg\min_{\mathbf{z}} \frac{1}{2h^2}\|\mathbf{z} - 2\mathbf{z}_t + \mathbf{z}_{t-1}\|_\mathbf{M}^2 + E_{pot}(\phi(\mathbf{X}, \mathbf{z}))\)
- Subspace dimensionality is far smaller than full space → Newton's method converges rapidly → real-time animation
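The LBS lift from subspace coordinates to full-space positions can be sketched in a few lines of numpy. This is a toy illustration, not the paper's code: the weights and handle transformations here are random stand-ins for the learned skinning field and the solved subspace coordinates.

```python
import numpy as np

def lbs_deform(X, W, Z):
    """Lift subspace coordinates to full-space positions via LBS.

    X : (n, 3) rest-pose points
    W : (n, m) skinning weights, W[:, i] = W_i(X)
    Z : (m, 3, 4) per-handle affine transformations
    Returns deformed positions phi(X, z) of shape (n, 3).
    """
    n = X.shape[0]
    Xh = np.concatenate([X, np.ones((n, 1))], axis=1)   # (n, 4) homogeneous coords
    # Per-handle affine action on every point: (m, n, 3)
    per_handle = np.einsum("mij,nj->mni", Z, Xh)
    # Weighted superposition of handle contributions
    return X + np.einsum("nm,mni->ni", W, per_handle)

# Toy sizes: n = 500 points, m = 10 handles -> 12m = 120 subspace dims vs 3n = 1500
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
W = rng.standard_normal((500, 10)) * 0.1
Z = rng.standard_normal((10, 3, 4)) * 0.01

phi = lbs_deform(X, W, Z)
print(phi.shape)   # (500, 3)
```

Setting all handle transformations to zero recovers the rest pose, which is the sanity check that the mapping is a displacement field on top of \(\mathbf{X}\).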
- Neural Skinning Field Autoencoder (Architectural Core)
- Encoder: Transformer-based point cloud encoder following Michelangelo
- Samples 4096 surface points to extract shape latent representation \(\mathbf{F}_s \in \mathbb{R}^{256 \times 768}\)
- Uses cross-attention + 8-layer self-attention for iterative refinement
- Pre-trained on ShapeNet via SDF reconstruction; frozen during training
- Decoder (three-stage cross-attention design):
- Stage 1: \(m\) learnable handle tokens \(\mathbf{Q}_h\) extract handle latent representations \(\mathbf{F}_h\) from \(\mathbf{F}_s\) via cross-attention
- Stage 2: arbitrary spatial query points \(\mathbf{X}\) extract per-point skinning features \(\mathbf{F}_p\) from \(\mathbf{F}_h\) via cross-attention
- Stage 3: ResNet-style MLP decodes features into skinning weights \(W(\mathbf{X}) \in \mathbb{R}^m\)
- Design motivation: the three-stage cross-attention realizes a natural hierarchy of "shape → handles → points" and is mesh-agnostic
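The shape-to-handles-to-points hierarchy can be sketched with single-head attention in numpy. This is a heavily simplified stand-in: the real decoder uses multi-head attention with learned projections and a larger latent width (768); the toy dimension `d`, the random handle tokens, and the linear point embedding `E` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Single-head scaled dot-product attention (a simplification of the
    paper's multi-head cross-attention blocks)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

d, m, n_pts = 64, 10, 200               # toy latent width, handles, query points
F_s = rng.standard_normal((256, d))     # shape latent tokens from the frozen encoder
Q_h = rng.standard_normal((m, d))       # learnable handle tokens (random here)
X   = rng.standard_normal((n_pts, 3))   # spatial query points

# Stage 1: handle tokens attend to the shape latent -> handle latents F_h
F_h = attention(Q_h, F_s, F_s)          # (m, d)

# Stage 2: embedded query points attend to handle latents -> point features F_p
E = rng.standard_normal((3, d))         # stand-in for a learned point embedding
F_p = attention(X @ E, F_h, F_h)        # (n_pts, d)

# Stage 3: an MLP head maps point features to m skinning weights per point
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, m)) / np.sqrt(d)
weights = np.maximum(F_p @ W1, 0) @ W2  # (n_pts, m), one weight per handle
print(weights.shape)   # (200, 10)
```

The key property the sketch preserves is mesh-agnosticism: `X` can be any set of query points, so the same decoder evaluates the skinning field at arbitrary resolution.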
- Cubature Point Sampling (Discretization-Agnostic Design)
- Instead of fixed mesh topology, surface and volumetric points are sampled
- Surface points: Sharp Edge Sampling (SES) to capture geometric details
- Volumetric points: converted to watertight mesh → voxel grid → ray casting to classify interior/exterior points
- 1000 points are randomly sampled from the candidate set per training batch
- Design motivation: volumetric points capture interior deformation behavior that surface points alone cannot characterize
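The volumetric side of the sampling can be sketched as follows. To keep the example self-contained, an analytic signed distance function (a unit sphere) stands in for the paper's watertight-mesh, voxel-grid, and ray-casting pipeline; only the grid-then-classify-then-subsample pattern is carried over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the watertight-mesh pipeline: interior/exterior is decided by
# an analytic SDF (unit sphere) instead of voxel-grid ray casting.
def sdf(p):
    return np.linalg.norm(p, axis=-1) - 1.0

# Candidate volumetric points on a regular grid over the bounding box
axis = np.linspace(-1.2, 1.2, 24)
gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
candidates = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

# Keep only interior candidates
interior = candidates[sdf(candidates) < 0.0]

# Per training batch: randomly sample 1000 cubature points from the candidates
batch = interior[rng.choice(len(interior), size=1000, replace=False)]
print(batch.shape)   # (1000, 3)
```

Resampling a fresh batch each step means the losses below are evaluated on varying point sets, which is what keeps training independent of any fixed discretization.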
- ONI Orthogonalization Layer
- Applies an Orthogonalization by Newton's Iteration (ONI) module at the final MLP layer
- Uses ELU activation to allow signed skinning weights (without enforcing non-negativity), enhancing expressiveness
- Design motivation: directly promotes orthogonality in the network forward pass, alleviating the optimization pressure on the loss
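Orthogonalization by Newton's Iteration avoids an explicit eigendecomposition by iterating toward \(S^{-1/2}\) with matrix products only. The sketch below uses a standard coupled Newton-Schulz iteration; the exact variant and normalization inside the paper's ONI layer may differ.

```python
import numpy as np

def oni(V, iters=20):
    """Orthogonalize the rows of V in the spirit of the ONI layer.

    V : (m, d) raw row vectors; returns W = (V V^T)^{-1/2} V with W @ W.T ~ I.
    """
    m = V.shape[0]
    S = V @ V.T
    tr = np.trace(S)
    Sn = S / tr                          # scale so eigenvalues lie in (0, 1]
    Y, Z = Sn.copy(), np.eye(m)
    for _ in range(iters):               # coupled Newton iteration: Z -> Sn^{-1/2}
        T = 0.5 * (3.0 * np.eye(m) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return (Z @ V) / np.sqrt(tr)         # undo the trace scaling: S^{-1/2} V

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 16))
W = oni(V)
print(np.round(W @ W.T, 4))              # approximately the 4x4 identity
```

Because every operation is a differentiable matrix product, the layer can sit in the forward pass and backpropagate normally, which is what lets it relieve pressure on the orthogonality loss.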
Loss & Training¶
Physics-Informed Self-Supervised Learning (PISSL) — Joint Optimization of Three Constraints
- Potential Energy Minimization Loss \(\mathcal{L}_{pot}\):
- Samples random subspace coordinates \(\mathbf{z}\) from a Gaussian distribution and minimizes expected potential energy
- Uses linear interpolation between linear elastic and Neo-Hookean material models for improved stability
- Ensures the skinning field encodes low-energy deformation modes
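The blended material model can be sketched directly at the level of energy densities. The Lame parameters, the blend weight, and the use of random small deformation gradients (standing in for gradients induced by sampled \(\mathbf{z}\)) are all assumptions for illustration.

```python
import numpy as np

mu, lam, alpha = 1.0, 1.0, 0.5       # Lame parameters and blend weight (assumed)

def linear_elastic(F):
    """Small-strain linear elastic energy density."""
    eps = 0.5 * (F + F.T) - np.eye(3)
    return mu * np.sum(eps**2) + 0.5 * lam * np.trace(eps)**2

def neo_hookean(F):
    """A standard compressible Neo-Hookean energy density."""
    J = np.linalg.det(F)
    return (0.5 * mu * (np.trace(F.T @ F) - 3.0)
            - mu * np.log(J) + 0.5 * lam * np.log(J)**2)

def blended_energy(F):
    """Linear interpolation between the two material models."""
    return (1.0 - alpha) * linear_elastic(F) + alpha * neo_hookean(F)

# Expected potential over random samples: small random deformation gradients
# stand in for the gradients induced by Gaussian-sampled subspace coordinates.
rng = np.random.default_rng(0)
Fs = np.eye(3) + 0.05 * rng.standard_normal((64, 3, 3))
L_pot = np.mean([blended_energy(F) for F in Fs])
print(L_pot)   # positive scalar; zero at the rest pose F = I
```

Both densities vanish at \(F = I\), so minimizing the expectation pushes the field toward weight configurations whose induced deformations stay near low-energy modes.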
- Spatial Smoothness Loss \(\mathcal{L}_{smooth}\):
- \(\mathcal{L}_{smooth} = \mathbb{E}_{\mathbf{X}}\sum_{i=1}^m \|\nabla\Phi_\theta^i(\mathbf{X})\|^2\)
- Penalizes the magnitude of spatial gradients of skinning weights, ensuring artifact-free deformations
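As a sketch of the smoothness term, the sum of squared spatial gradients can be estimated with central finite differences; the paper differentiates the network analytically, and the two-mode analytic skinning field below is a toy assumption.

```python
import numpy as np

def smoothness_loss(skin_fn, X, h=1e-3):
    """Finite-difference estimate of E_X [ sum_i ||grad W_i(X)||^2 ]."""
    grads = []
    for k in range(3):                   # central differences along x, y, z
        e = np.zeros(3)
        e[k] = h
        grads.append((skin_fn(X + e) - skin_fn(X - e)) / (2.0 * h))
    G = np.stack(grads, axis=-1)         # (n, m, 3) spatial gradients
    return np.mean(np.sum(G**2, axis=(1, 2)))

# Toy skinning field: m = 2 smooth analytic weights per point
def skin_fn(X):
    return np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 3))
print(smoothness_loss(skin_fn, X))       # small positive scalar
```

A constant field has zero loss, so the penalty only suppresses rapid spatial variation of the weights, which is what prevents tearing-style deformation artifacts.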
- Orthogonality Constraint Loss \(\mathcal{L}_{orth}\):
- Computes the sum of squared inter-column inner products across all skinning modes to enforce orthogonality
- On-the-fly \(\ell_2\) normalization: normalizes each column of the skinning mode matrix at every training step → prevents numerical drift → facilitates convergence of the orthogonality constraint
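The constraint reduces to penalizing off-diagonal entries of a Gram matrix. A minimal sketch, with the columns of a mode matrix sampled at cubature points standing in for the skinning modes:

```python
import numpy as np

def orthogonality_loss(U):
    """Sum of squared inter-column inner products between skinning modes.

    U : (n, m) matrix whose columns are the m skinning modes sampled at
    n cubature points. Columns are l2-normalized on the fly, mirroring
    the paper's per-step normalization.
    """
    Un = U / np.linalg.norm(U, axis=0, keepdims=True)   # on-the-fly l2 normalization
    G = Un.T @ Un                                       # (m, m) Gram matrix
    off = G - np.diag(np.diag(G))                       # keep only cross terms
    return np.sum(off**2)

rng = np.random.default_rng(0)
U = rng.standard_normal((1000, 8))
print(orthogonality_loss(U))             # small but nonzero for random modes

# An exactly orthogonal basis gives zero loss
Q, _ = np.linalg.qr(rng.standard_normal((1000, 8)))
print(orthogonality_loss(Q))
```

Normalizing before forming the Gram matrix is what decouples the constraint from mode magnitudes: without it, the loss could be driven down by shrinking modes rather than decorrelating them.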
- ConFIG Conflict-Aware Gradient Correction:
- The three losses frequently conflict in their optimization directions (energy vs. smoothness vs. orthogonality)
- ConFIG is used to correct destructive gradient interference, achieving balanced optimization
- Design motivation: naive joint optimization leads to instability and non-convergence due to gradient conflicts
Total loss: \(\mathcal{L} = \mathcal{L}_{smooth} + \lambda_{pot}\mathcal{L}_{pot} + \lambda_{orth}\mathcal{L}_{orth}\)
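A minimal sketch of the conflict-free update in the spirit of ConFIG: solve for a direction with equal positive projection onto every unit loss gradient via a pseudoinverse, then rescale by the summed projections of the raw gradients. This follows the pseudoinverse formulation of the ConFIG paper as we understand it; the scaling convention here is an assumption.

```python
import numpy as np

def config_update(grads):
    """Conflict-free combined update direction for a list of loss gradients.

    grads : list of (d,) gradient vectors, one per loss term.
    """
    # Stack unit gradients and solve G v = 1 for an equal-projection direction
    G = np.stack([g / np.linalg.norm(g) for g in grads])
    v = np.linalg.pinv(G) @ np.ones(len(grads))
    v /= np.linalg.norm(v)
    # Rescale by the summed projections of the raw gradients onto v
    scale = sum(float(g @ v) for g in grads)
    return scale * v

# Two conflicting gradients (negative inner product between them)
g1 = np.array([1.0, 0.2, 0.0])
g2 = np.array([-0.5, 1.0, 0.0])
u = config_update([g1, g2])

# The corrected direction makes equal positive progress on both losses
c1 = (u / np.linalg.norm(u)) @ (g1 / np.linalg.norm(g1))
c2 = (u / np.linalg.norm(u)) @ (g2 / np.linalg.norm(g2))
print(c1, c2)   # equal positive cosines
```

By construction no loss term's gradient is opposed by the update, which is the property that naive gradient summation loses when the energy, smoothness, and orthogonality terms pull in conflicting directions.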
Key Experimental Results¶
Main Results¶
RigNet Dataset — Skinning Field Quality Evaluation
| Method | Orthogonality \(\Omega_{orth} \downarrow\) | Condition Number \(\kappa_{log} \downarrow\) | Spectral Entropy \(H_{spec} \uparrow\) |
|---|---|---|---|
| RigNet | 0.5324 | 2.7997 | 0.9762 |
| M-I-A | 1.4098 | 27.7357 | 0.7224 |
| Anymate | 1.5737 | 2.6093 | 0.9682 |
| Puppeteer | 0.5615 | 5.5605 | 0.9798 |
| PhysSkin | 0.0033 | 1.0453 | 0.9999 |
PhysSkin achieves orthogonality two orders of magnitude lower than the second-best method (RigNet).
ShapeNet Dataset
| Method | \(\Omega_{orth} \times 10^{-2} \downarrow\) | \(\kappa_{log} \downarrow\) | \(H_{spec} \uparrow\) |
|---|---|---|---|
| Simplicits (per-object training) | 0.2621 | 1.5205 | 0.9941 |
| Anymate | 5.3520 | 4.9221 | 0.8858 |
| PhysSkin | 0.0098 | 1.0460 | 0.9997 |
Even though Simplicits trains a dedicated network per object, PhysSkin's single generalized model substantially outperforms it.
Real-Time Animation Efficiency Comparison
| 3D Shape | Vertices | FEM per step (ms) | MPM per step (ms) | PhysSkin per step (ms) |
|---|---|---|---|---|
| Airplane | 10K | 79.83 | 141.83 | 12.26 |
| Bag | 121K | 3012.47 | 233.79 | 13.39 |
| Camera | 80K | 2121.02 | 203.38 | 12.52 |
| Pillow | 127K | 3170.93 | 251.81 | 13.74 |
PhysSkin is 6.5–230× faster than FEM and 11.5–18.3× faster than MPM, with runtime nearly independent of vertex count.
Ablation Study¶
| Configuration | \(\Omega_{orth} \times 10^{-2} \downarrow\) | \(\kappa_{log} \downarrow\) | \(H_{spec} \uparrow\) |
|---|---|---|---|
| w/o skinning normalization | 6.5533 | 8.5492 | 0.8113 |
| w/o ONI layer | 0.0081 | 1.0844 | 0.9997 |
| w/o ConFIG optimization | 8.9247 | 11.8595 | 0.7594 |
| w/o \(\mathcal{L}_{orth}\) | 100.0 | 29.18 | NaN |
| w/o \(\mathcal{L}_{smooth}\) | 0.0050 | 1.0567 | 0.9998 |
| Full Model | 0.0033 | 1.0453 | 0.9999 |
Key Findings¶
- ConFIG is the most critical component: its removal degrades orthogonality by 2700× (0.0033→8.9247), demonstrating that gradient conflict is the central optimization challenge.
- Orthogonality constraint is indispensable: removing \(\mathcal{L}_{orth}\) causes the orthogonality metric to reach 100 and spectral entropy to collapse to NaN.
- Skinning normalization has a significant impact: its removal degrades orthogonality by approximately 2000×.
- Runtime is nearly decoupled from vertex count: scaling from 10K to 127K vertices increases PhysSkin's per-step time only from 12.26 to 13.74 ms.
- A single model generalizes across all shapes: one PhysSkin model handles objects across all categories, whereas Simplicits requires per-object training.
Highlights & Insights¶
- Fully annotation-free physics-based skinning: no simulation trajectories, no expert-annotated skeletons or skinning weights are required — the method operates solely from static geometry, substantially lowering the barrier to 3D animation.
- Physics-constrained optimization strategy is the core contribution: the combination of on-the-fly normalization and ConFIG gradient correction resolves the fundamental conflicts inherent in multi-constraint optimization.
- Discretization-agnostic continuous skinning field: a single model can handle meshes of varying topology and resolution, and can be directly applied to 3D Gaussian splatting representations.
- Original evaluation metrics: three skinning quality metrics grounded in matrix analysis and spectral theory (orthogonality, condition number, spectral entropy) are proposed, filling the gap left by the absence of ground-truth references in self-supervised skinning evaluation.
- Real-time performance stems from a fundamental reduction in problem dimensionality: the subspace dimension \(12m\) (\(m\) handles × 12 parameters) is far smaller than the full-space dimension \(3n\), enabling rapid convergence of Newton's method in the subspace.
Limitations & Future Work¶
- Absence of semantic priors: skinning weights are driven entirely by physical constraints without incorporating semantic information (e.g., joint locations, functional parts), which may be suboptimal for complex topologies.
- Fixed number of handles: the choice of \(m\) bounds expressive capacity, but the paper does not sufficiently discuss adaptive selection strategies.
- Simplified material models: only hyperelastic materials (Neo-Hookean) are supported; plasticity, viscoelasticity, fracture, and other complex material behaviors are not covered.
- Evaluation limited to skinning quality: while animation results are demonstrated, quantitative accuracy comparisons against ground-truth simulation trajectories are absent.
- Dependence on pre-trained encoder: the shape encoder Michelangelo is pre-trained on ShapeNet; generalization to 3D shapes outside ShapeNet remains unverified.
Related Work & Insights¶
- Relationship to Simplicits (SIGGRAPH 2024): PhysSkin directly inherits the skinning field subspace formulation from Simplicits but addresses its two major shortcomings: (1) generalization — a single model vs. per-object training; (2) training stability — ConFIG + normalization vs. naive optimization.
- Distinction from Anymate: Anymate employs supervised learning from annotated data, whereas PhysSkin is fully self-supervised; Anymate outputs discrete skeleton weights while PhysSkin outputs a continuous field.
- Implications for multi-objective optimization: the ConFIG gradient correction approach may be broadly applicable to other multi-constraint learning settings such as physics-informed neural networks (PINNs).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Generalizable physics self-supervised skinning field with multi-constraint gradient correction; the solution is complete and original.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets, multiple baselines, comprehensive ablation, and efficiency comparisons.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear; architectural visualization is excellent.
- Value: ⭐⭐⭐⭐⭐ — Real-time performance + generalizability + annotation-free design make this industrially practical for 3D animation.