MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition
Conference: ECCV 2024
arXiv: 2407.07284
Code: Project Page
Area: 3D Vision
Keywords: 3D Gaussian Splatting, Multi-identity Representation, Tensor Decomposition, Human Animation, Monocular Video
TL;DR
MIGS unifies the 3DGS parameters of multiple human identities into a single low-rank tensor via CP tensor decomposition. This cuts the parameter count by roughly an order of magnitude for 30 identities and makes animation more robust to unseen poses.
Background & Motivation
Background: 3D Gaussian Splatting (3DGS) has been successfully applied to human avatar modeling, enabling real-time rendering with high visual quality. Existing methods like 3DGS-Avatar and GauHuman combine 3DGS with the SMPL human prior to learn animatable human representations from monocular videos.
Limitations of Prior Work: Existing 3DGS human avatar methods rely on per-identity optimization: each individual requires an independently trained model. This leads to: (a) parameter counts that grow linearly with the number of people, since \(N_i\) individuals require \(N_i \times N_g \times M\) parameters; (b) limited training data per identity, which causes a sharp decline in animation quality under out-of-distribution (OOD) poses.
Key Challenge: Per-identity models can only learn limited human deformation patterns, yet real-world applications require animation robustness under extreme poses (e.g., highly challenging dances). Joint multi-identity learning can share deformation knowledge, but how can this be achieved without exploding the parameter count?
Goal: Learn a unified multi-identity 3DGS representation from monocular videos to simultaneously compress parameters and enhance animation robustness under OOD poses through cross-identity knowledge sharing.
Key Insight: Drawing inspiration from the classic TensorFaces, the Gaussian parameters of all identities can be organized into a high-order tensor, and low-rank approximation can be achieved using CP decomposition.
Core Idea: Different human bodies share similar structural features; therefore, the multi-identity Gaussian parameter tensor exhibits a low-rank structure that can be efficiently represented using CP decomposition.
Method
Overall Architecture
The pipeline of MIGS consists of three steps: (1) defining 3D Gaussians in the canonical space for each identity \(i\), with parameters including position \(\boldsymbol{\mu}\), scale \(\boldsymbol{s}\), rotation quaternion \(\boldsymbol{q}\), feature vector \(\boldsymbol{f}\), and opacity \(\alpha\); (2) stacking the Gaussian parameters of all identities into a third-order tensor \(\boldsymbol{\mathcal{W}} \in \mathbb{R}^{N_i \times N_g \times M}\); (3) performing CP decomposition on this tensor to learn only the decomposed factor matrices. During animation, a non-rigid deformation network \(f_d\) and an LBS-based rigid transformation are used to transform the canonical Gaussians into the observation space.
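As a rough illustration of the LBS-based rigid step mentioned above, the sketch below blends per-bone transforms with skinning weights and applies the result to deformed Gaussian centers. It is a minimal NumPy sketch under stated assumptions, not the authors' code: the function name `lbs_transform` is hypothetical, and the skinning weights are passed in directly rather than predicted by a learned network.

```python
import numpy as np


def lbs_transform(points, weights, bone_transforms):
    """Blend per-bone rigid transforms and apply them to points.

    points:          (N, 3) deformed Gaussian centers
    weights:         (N, B) skinning weights, each row summing to 1
                     (assumed given here; a network would predict them)
    bone_transforms: (B, 4, 4) homogeneous bone transforms
    Returns the transformed points (N, 3) and blended transforms (N, 4, 4).
    """
    # T_n = sum_b w_{n,b} * B_b
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.concatenate([points, ones], axis=1)
    moved = np.einsum("nij,nj->ni", blended, homogeneous)
    return moved[:, :3], blended


# Toy usage: two bones, the second translated by +1 along x.
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, 0, 3] = 1.0
pts = np.zeros((2, 3))
w = np.array([[1.0, 0.0], [0.5, 0.5]])
posed, _ = lbs_transform(pts, w, bones)
```

In this toy case the first point follows bone 0 and stays at the origin, while the second is split evenly between the two bones and moves halfway along x.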
Key Designs
- High-Order Tensor Construction: For \(N_i\) identities, each having \(N_g\) Gaussians, and each Gaussian containing \(M=43\) parameters (3 for position, 3 for scale, 4 for rotation, 32 for features, 1 for opacity), the tensor is constructed as:
\[\boldsymbol{\mathcal{W}} \in \mathbb{R}^{N_i \times N_g \times M}\]
This naturally decouples the three dimensions: "identity", "Gaussian index", and "parameter type".
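- CP Tensor Decomposition: CANDECOMP/PARAFAC (CP) decomposition is applied to the tensor \(\boldsymbol{\mathcal{W}}\). First, the tensor is unfolded along the second dimension to obtain \(\boldsymbol{W}_{(2)} \in \mathbb{R}^{N_g \times (N_i M)}\), which is then approximated as (the order of the two Khatri-Rao factors follows the column ordering of the unfolding):
\[\boldsymbol{W}_{(2)} \approx \boldsymbol{U}_3 \left(\boldsymbol{U}_1 \odot \boldsymbol{U}_2\right)^{\top}\]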
where \(\boldsymbol{U}_1 \in \mathbb{R}^{M \times R}\), \(\boldsymbol{U}_2 \in \mathbb{R}^{N_i \times R}\), \(\boldsymbol{U}_3 \in \mathbb{R}^{N_g \times R}\), and \(\odot\) represents the Khatri-Rao product. In practice, only \((M + N_i + N_g)R\) parameters need to be learned instead of \(M \cdot N_i \cdot N_g\). When \(R=100, N_g=5 \times 10^4, N_i=30\), the parameter count is reduced from \(6.5 \times 10^7\) to \(5 \times 10^6\), decreasing by an order of magnitude.
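- Non-Rigid and Rigid Deformation: The non-rigid deformation network \(f_d\) takes a canonical position \(\boldsymbol{\mu}_c\) and a latent code \(\boldsymbol{z}_p\), and outputs offsets for position, scale, and rotation together with a feature vector \(\boldsymbol{z}\): \((\delta\boldsymbol{\mu}, \delta\boldsymbol{s}, \delta\boldsymbol{q}, \boldsymbol{z}) = f_d(\boldsymbol{\mu}_c; \boldsymbol{z}_p)\). The rigid transformation is based on SMPL's LBS: \(\boldsymbol{T} = \sum_{b=1}^{B} f_r(\boldsymbol{\mu}_d)_b \boldsymbol{B}_b\), where \(f_r\) gives the skinning weights at the deformed position \(\boldsymbol{\mu}_d\) and \(\boldsymbol{B}_b\) is the transformation of bone \(b\). Color is predicted from the feature vectors and spherical harmonics using an MLP \(f_c\).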
- Initialization Strategy: \(N_g\) points are sampled from the SMPL mesh of the first identity to initialize its Gaussians. The CP decomposition of the resulting parameter matrix is computed with the CPPower algorithm from TensorLy, yielding \(\boldsymbol{U}_1\), \(\boldsymbol{U}_3\), and the first row of \(\boldsymbol{U}_2\). That first row is then copied to all other rows of \(\boldsymbol{U}_2\).
- Personalization and New Identities: (a) Personalization: Other parameters are frozen while only the color MLP \(f_c\) is fine-tuned to recover high-frequency details. (b) New Identities: A new row is added to \(\boldsymbol{U}_2\), and only this new row and \(f_c\) are optimized, preventing disruption to the learned multi-identity deformation knowledge.
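The tensor construction, CP factorization, and new-identity row described in the list above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, the order of the 43 parameter columns is assumed, the factors are random instead of learned, and initializing a new identity row from the mean of the existing rows is a guess, since the source does not say how that row is initialized.

```python
import numpy as np

M, R = 43, 4       # 43 parameters per Gaussian; toy rank
N_i, N_g = 3, 6    # toy identity and Gaussian counts
rng = np.random.default_rng(0)

U1 = rng.normal(size=(M, R))    # parameter-type factor
U2 = rng.normal(size=(N_i, R))  # identity factor
U3 = rng.normal(size=(N_g, R))  # Gaussian-index factor


def reconstruct(U1, U2, U3):
    """Full tensor: W[i, g, m] = sum_r U2[i, r] * U3[g, r] * U1[m, r]."""
    return np.einsum("ir,gr,mr->igm", U2, U3, U1)


def identity_gaussians(U1, U2, U3, i):
    """(N_g, M) parameters of identity i, without building the full tensor."""
    return (U3 * U2[i]) @ U1.T


def split_params(params):
    """Split the 43 columns; this column order is assumed for illustration."""
    return {
        "position": params[:, 0:3],
        "scale": params[:, 3:6],
        "rotation": params[:, 6:10],
        "features": params[:, 10:42],
        "opacity": params[:, 42:43],
    }


def add_identity(U2):
    """Append one row for a new identity (mean initialization is assumed)."""
    return np.concatenate([U2, U2.mean(axis=0, keepdims=True)], axis=0)


def cp_param_count(M, N_i, N_g, R):
    """Number of learned values in the three factor matrices."""
    return (M + N_i + N_g) * R


def dense_param_count(M, N_i, N_g):
    """Number of values in N_i separately stored per-identity models."""
    return M * N_i * N_g


W = reconstruct(U1, U2, U3)
person = split_params(identity_gaussians(U1, U2, U3, 1))
U2_extended = add_identity(U2)
print(cp_param_count(43, 30, 50_000, 100))  # 5007300
print(dense_param_count(43, 30, 50_000))    # 64500000
```

Slicing one identity directly from the factors avoids materializing the full tensor, and the two printed counts correspond to the \(5 \times 10^6\) versus \(6.5 \times 10^7\) comparison quoted above.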
Loss & Training
The loss functions from 3DGS-Avatar are adopted: RGB photometric loss + mask loss + skinning weight regularization + as-isometric-as-possible regularization. During training, frames from different identities are sampled alternately for rendering optimization. Notably, per-frame latent codes are omitted to prevent overfitting to the training frames.
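One possible reading of "sampled alternately" is a round-robin over identities with a random frame drawn for each. The sketch below shows only that scheduling idea. The function name, the identity labels, and the strict round-robin order are all assumptions, and the actual sampling scheme may differ.

```python
import itertools
import random


def alternate_identities(frames_per_identity, num_steps, seed=0):
    """Yield (identity, frame) pairs, cycling through identities in turn.

    frames_per_identity maps an identity label to its list of frame ids.
    A strict round-robin order is assumed here for illustration.
    """
    rng = random.Random(seed)
    identity_cycle = itertools.cycle(sorted(frames_per_identity))
    for _ in range(num_steps):
        identity = next(identity_cycle)
        yield identity, rng.choice(frames_per_identity[identity])


# Toy usage with three hypothetical identities.
frames = {"id_a": [0, 1, 2], "id_b": [0, 1], "id_c": [0, 1, 2, 3]}
schedule = list(alternate_identities(frames, num_steps=6))
order = [identity for identity, _ in schedule]
```

Each training step would then render the chosen frame and backpropagate through the shared factors, so every identity contributes gradients at a similar rate.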
Key Experimental Results
Main Results
ZJU-MoCap Novel View Synthesis (Trained on 6 identities):
| Method | 377 PSNR↑ | 386 PSNR↑ | 392 PSNR↑ | 394 PSNR↑ |
|---|---|---|---|---|
| HumanNeRF | 30.41 | 33.20 | 31.04 | 30.31 |
| 3DGS-Avatar | 30.64 | 33.63 | 31.66 | 30.54 |
| MIGS (Ours) | 32.85 | 34.98 | 33.88 | 32.28 |
AIST++ Dance Dataset (Trained on 30 identities):
| Method | Basic PSNR↑ | Basic LPIPS*↓ | Advanced PSNR↑ | Advanced LPIPS*↓ |
|---|---|---|---|---|
| HumanNeRF | 24.58 | 29.20 | 22.01 | 39.01 |
| 3DGS-Avatar | 28.89 | 18.20 | 25.51 | 28.86 |
| MIGS | 29.82 | 17.73 | 26.54 | 26.02 |
Ablation Study
Impact of CP Decomposition Rank R (AIST++ Advanced Test, LPIPS*↓):
| Number of Identities | R=10 | R=100 | R=200 |
|---|---|---|---|
| 10 | ~28 | ~26 | ~26 |
| 20 | ~32 | ~27 | ~27 |
| 30 | ~38 | ~28 | ~27 |
\(R=10\) is insufficient to capture the diversity of multiple identities, \(R=100\) is already sufficient, and \(R=200\) yields no significant improvement. After personalization fine-tuning, the results of \(R=100\) and \(R=200\) are almost identical.
Key Findings
- Increasing the number of training identities improves robustness to OOD poses (lower LPIPS) but makes the results smoother; personalization fine-tuning can recover the lost detail.
- On the challenging dance poses in AIST++, MIGS outperforms the per-identity baselines on both splits (e.g., 26.54 vs. 25.51 PSNR against 3DGS-Avatar on the advanced split), especially under extreme poses such as crossed limbs.
- Learning a new identity requires only a short video of about 10 seconds and optimizes only the new row of \(\boldsymbol{U}_2\) and \(f_c\).
Highlights & Insights
- Physical Intuition of Low-Rank Assumption: Different human bodies share skeletal structures and motion patterns, which provides strong prior support for the low-rank structure of the parameter tensor.
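- Extremely High Parameter Efficiency: With \(R=100\), the shared representation for 30 identities needs only about \(1/13\) of the parameters of 30 separately trained per-identity models (\(5 \times 10^6\) versus \(6.5 \times 10^7\)).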
- Scalable Design: Introducing a new identity only requires adding a single row, avoiding the need to retrain the entire model.
- This work elegantly ports the classic tensor decomposition approach (the TensorFaces concept) into the 3DGS era.
Limitations & Future Work
- Results tend to become smoother when the number of identities is large, relying heavily on personalization fine-tuning.
- All identities share the same number of Gaussians \(N_g\), so the representation cannot adapt to differences in body size.
- The method has only been validated on up to 30 identities; scalability to thousands of identities remains unverified.
- The non-rigid deformation network is still a shared MLP, which may limit the expression of extreme deformations.
Related Work & Insights
- TensorFaces (2002): A pioneering work that uses multilinear tensor decomposition to represent facial variation patterns; it is the direct inspiration for MIGS.
- 3DGS-Avatar: The single-identity baseline on which MIGS builds; MIGS drops its per-frame latent code to improve generalization.
- SNARF: A differentiable forward skinning method, used in the rigid transformation module of MIGS.
Rating
- Novelty: ⭐⭐⭐⭐ — First to introduce CP tensor decomposition to multi-identity 3DGS modeling.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated on two datasets with detailed ablation and new identity generalization experiments.
- Writing Quality: ⭐⭐⭐⭐ — Clear mathematical derivations and intuitive explanations.
- Value: ⭐⭐⭐⭐ — Highly efficient multi-identity representation holds practical potential for applications.