LEIA: Latent View-Invariant Embeddings for Implicit 3D Articulation¶

Conference: ECCV 2024
arXiv: 2409.06703
Code: Project Page
Area: 3D Vision
Keywords: articulated objects, neural radiance fields, hypernetworks, implicit representation, state interpolation

TL;DR¶

LEIA represents the different states of an articulated object with learned view-invariant latent embeddings. A hypernetwork maps these embeddings to modulations of the NeRF weights, enabling smooth interpolation to articulated configurations unseen during training, without any prior motion knowledge or 3D supervision.

Background & Motivation¶

Background: NeRF has achieved great success in static scene reconstruction, but extending it to dynamic objects or object articulation remains a challenging problem.

Limitations of Prior Work:
- Existing methods rely on heuristic assumptions about the number of moving parts or object categories, which limits practical applications.
- Methods like PARIS require decoupling objects into static and moving parts, which fails in multi-part articulation scenarios.
- Video-based dynamic NeRF methods cannot handle large articulations of everyday objects effectively.

Key Challenge: The need to flexibly model arbitrary types and quantities of articulated motions without relying on motion priors or part decoupling.

Goal: To learn 3D representations of articulated objects under different states starting only from multi-view images, and to generate intermediate states that were not seen during training.

Key Insight: Encoding each articulated state as a learnable latent embedding and mapping it to the weight parameterization of NeRF via a hypernetwork.

Core Idea: Mapping view-invariant state latent embeddings to NeRF weights using a hypernetwork, and generating novel articulated states via interpolation in the latent space.

Method¶

Overall Architecture¶

The system comprises three core components: (1) a learnable latent dictionary \(Z\) that assigns an embedding vector to each articulated state; (2) a hypernetwork \(h_l\) that maps the latent embeddings into weight modulation matrices for NeRF; (3) a base NeRF network \(f_\theta\) based on the Instant-NGP architecture. During training, each batch samples a state and supervises it with its corresponding multi-view images. During inference, linear interpolation is performed within the latent space to generate unseen intermediate states.

Key Designs¶

State-conditioned HyperNetwork Modulation: Instead of directly predicting the full NeRF weights \(\theta\) (which is computationally expensive), the hypernetwork predicts low-rank matrices \(P^l \in \mathbb{R}^{K \times r}\) and \(Q^l \in \mathbb{R}^{r \times K}\) (\(r \ll K\)), modifying the base network weights via element-wise modulation:

\[\theta_t^l = \eta(P^l \times Q^l) \circ \theta^l\]

where \(\eta\) is an activation function and \(\circ\) represents element-wise multiplication. This low-rank modulation is analogous to subnet selection, where the rank \(r\) controls the compression-performance trade-off.
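The modulation above can be sketched for a single layer in NumPy. This is a minimal illustration, not the authors' implementation: the hypernetwork is reduced to one hypothetical linear map per factor (`W_p`, `W_q`), \(\eta\) is assumed to be a sigmoid, and the sizes `K`, `r`, `D` are arbitrary toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modulate_layer(theta, z, W_p, W_q, rank):
    """Return eta(P @ Q) * theta for one K x K layer.

    W_p and W_q are hypothetical linear hypernetwork heads (an assumption);
    eta is assumed to be a sigmoid.
    """
    K = theta.shape[0]
    P = (W_p @ z).reshape(K, rank)   # P^l in R^{K x r}
    Q = (W_q @ z).reshape(rank, K)   # Q^l in R^{r x K}
    mask = sigmoid(P @ Q)            # K x K modulation with rank-r pre-activation
    return mask * theta              # element-wise modulation of base weights

rng = np.random.default_rng(0)
K, r, D = 8, 2, 4                    # toy sizes, chosen for illustration only
theta = rng.normal(size=(K, K))      # base NeRF layer weights
W_p = rng.normal(size=(K * r, D))
W_q = rng.normal(size=(r * K, D))
z = rng.normal(size=D)               # latent embedding of one state

theta_t = modulate_layer(theta, z, W_p, W_q, r)
print(theta_t.shape)
```

With these toy sizes the hypernetwork emits \(2Kr = 32\) values per layer instead of the \(K^2 = 64\) needed to predict the weights directly, and the gap widens as \(K\) grows relative to \(r\).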

Learnable Latent Dictionary: The latent dictionary \(Z = \{t: z_t \mid t \in \{0,1,\dots,T\}\}\) uses nn.Embedding as a lookup table, where each state ID maps to a learnable embedding \(z_t \in \mathbb{R}^D\). All hypernetworks share the same latent embedding as input, which serves as the bridge between state semantics and the NeRF parameterization.

Latent Space Linear Interpolation: Given two states \(t_1, t_2\) and their embeddings \(z_1, z_2\), the intermediate states are generated by weighted linear interpolation:

\[z_{\text{inter}} = (1 - \beta_i) \cdot z_1 + \beta_i \cdot z_2\]

where \(\beta_i \in \{\frac{1}{\alpha}, \frac{2}{\alpha}, \dots, \frac{\alpha-1}{\alpha}\}\), which yields \(\alpha - 1\) intermediate states. Each interpolated latent vector passes through the hypernetwork to yield the NeRF weights, enabling novel-view rendering of the new state.
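The interpolation schedule can be checked with a few lines of NumPy. The function name `interpolate_latents` and the toy embeddings are illustrative assumptions. In the method itself, each interpolated vector would then be fed to the hypernetwork.

```python
import numpy as np

def interpolate_latents(z_1, z_2, alpha):
    """Return the alpha - 1 latents strictly between z_1 and z_2."""
    betas = [i / alpha for i in range(1, alpha)]   # 1/alpha, ..., (alpha-1)/alpha
    return [(1.0 - b) * z_1 + b * z_2 for b in betas]

# Toy embeddings of two observed states (illustrative values).
z_1 = np.array([0.0, 0.0])
z_2 = np.array([1.0, 2.0])

inter = interpolate_latents(z_1, z_2, alpha=4)
for z in inter:
    print(z)
```

With `alpha=4` the weights are 0.25, 0.5 and 0.75, so three intermediate latents are produced and neither endpoint is repeated.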

Loss & Training¶

The total loss comprises the following components:

Smooth L1 Reconstruction Loss: \(L_{\text{SmoothL1-NeRF}} = \sum_{r \in R} \text{SmoothL1Loss}(\hat{C}(r), C(r))\), computed between the rendered color \(\hat{C}(r)\) and the ground-truth color \(C(r)\) of each ray; it is more robust to outliers than L2.
Foreground Mask Loss: \(L_{\text{mask}}\), which is the BCE loss between predicted opacity and the ground-truth (GT) foreground mask.
Latent Manifold Loss: \(\mathcal{L}_{\text{manifold}}(l_i) = \frac{1}{K}\sum_{k=1}^{K} \|l_i - n_k\|_2^2\), where \(n_k\) are the \(K\) nearest neighbors of the latent embedding \(l_i\). It encourages local consistency among neighboring embeddings, promoting a smooth and continuous manifold structure.
Occlusion Regularization: \(L_{\text{occ}} = \frac{1}{K}\sum_{k=1}^{K} \sigma_k \cdot m_k\), which reduces density accumulation in front of the camera.
Depth Smoothness Regularization: \(L_{\text{DS}}\), which enforces smooth transitions of depth values between adjacent pixels.
Positional Encoding (an input encoding rather than a loss term): sinusoidal positional encoding is added to the latent vectors to inject information about the order of the states.
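Three of the terms listed above are simple enough to sketch in NumPy. This is an illustration under stated assumptions, not the paper's code: the Smooth L1 threshold `beta=1.0` follows the common default, neighbors in the manifold loss are found by Euclidean distance with the point itself excluded, and `mask` in the occlusion term is assumed to be a binary vector marking the ray samples to penalize.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 summed over all elements: quadratic below beta, linear above."""
    diff = np.abs(pred - target)
    per_elem = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return float(per_elem.sum())

def manifold_loss(latents, i, k):
    """Mean squared distance from latents[i] to its k nearest neighbors."""
    d2 = np.sum((latents - latents[i]) ** 2, axis=1)
    d2[i] = np.inf                      # exclude the point itself (assumption)
    nearest = np.argsort(d2)[:k]
    return float(d2[nearest].mean())

def occlusion_reg(sigma, mask):
    """Mean of density times mask over the K samples of a ray."""
    return float(np.mean(sigma * mask))

# Toy inputs, chosen so the values can be verified by hand.
print(smooth_l1(np.array([0.5, 3.0]), np.zeros(2)))
print(manifold_loss(np.array([[0.0], [1.0], [3.0]]), i=0, k=2))
print(occlusion_reg(np.array([2.0, 4.0, 6.0, 8.0]), np.array([1.0, 1.0, 0.0, 0.0])))
```

How these terms are weighted against each other is not stated in this summary, so no weights are shown.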

Training is conducted using the AdamW optimizer, with one state sampled per batch.
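The per-batch state sampling and the sinusoidal encoding of the latent vector might combine as in the sketch below. The Transformer-style encoding with base 10000, the additive combination, the array `Z` standing in for the nn.Embedding weights, and the helper `sample_state_input` are all assumptions made for illustration. The optimizer step is omitted.

```python
import numpy as np

def sinusoidal_encoding(t, dim, base=10000.0):
    """Transformer-style encoding of state index t (assumed form); dim must be even."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    i = np.arange(dim // 2)
    angles = t / base ** (2 * i / dim)
    enc = np.empty(dim)
    enc[0::2] = np.sin(angles)          # even slots hold sines
    enc[1::2] = np.cos(angles)          # odd slots hold cosines
    return enc

rng = np.random.default_rng(0)
num_states, D = 4, 8                    # toy sizes for illustration
Z = rng.normal(scale=0.1, size=(num_states, D))   # stand-in for nn.Embedding weights

def sample_state_input(rng):
    """Pick one state for this batch and build the hypernetwork input."""
    t = int(rng.integers(num_states))   # one state sampled per batch
    return t, Z[t] + sinusoidal_encoding(t, D)

t, z_in = sample_state_input(rng)
print(t, z_in.shape)
```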

Key Experimental Results¶

Main Results¶

Evaluation uses 12 objects (8 categories) from the PartNet-Mobility dataset and compares LEIA against PARIS and the VanillaInt baseline on interpolated-state reconstruction:

Metric	PARIS	VanillaInt	LEIA
PSNR↑	27.81	27.81	29.55
SSIM↑	0.96	0.94	0.96
LPIPS↓	0.06	0.07	0.06
CD↓	0.45	0.37	0.36

LEIA's performance advantage is particularly significant on multi-part articulated objects (Storage2-4, Sunglasses, Box), as PARIS's motion parameter estimation fails in multi-part scenarios.

Ablation Study¶

Component	PSNR	SSIM	LPIPS
With manifold loss	29.40	0.95	0.05
W/o manifold loss	28.54	0.94	0.06
With depth reg.	29.63	0.96	0.05
W/o depth reg.	26.93	0.93	0.07
With occlusion reg.	29.64	0.95	0.05
W/o occlusion reg.	28.64	0.95	0.06
4 states	29.69	0.96	0.05
2 states	28.04	0.95	0.06

Key Findings¶

The manifold loss prevents the latent space from overfitting to extreme states, acting as the key to achieving meaningful interpolation.
Training with 4 states instead of 2 raises PSNR from 28.04 to 29.69, as the additional states help build a better structure in the latent space.
t-SNE visualization reveals clear separability for different joint embeddings, forming smooth trajectories along their respective motion directions.
Positional encoding helps with fine parts (e.g., sunglasses) but may introduce noise for large objects.

Highlights & Insights¶

Agnostic to Motion Priors: Avoids specifying motion types (rotation/translation) or requiring part decoupling; a unified model handles all scenarios.
Scalable to Multiple Parts: t-SNE validates that the latent space can automatically separate the motions of different joints, which methods like PARIS fail to do.
Low-rank Modulation Strategy: Inspired by LoRA, modulating NeRF weights via low-rank matrices significantly reduces the parameter count of the hypernetwork.
Real-world Validation: Generates plausible intermediate states even on real-world scenes of a chest of drawers captured with a mobile phone.

Limitations & Future Work¶

The intermediate states do not guarantee physical consistency (e.g., door hinge constraints), which is the trade-off for discarding motion priors.
The method fails under severe self-occlusion (e.g., a microwave transitioning from closed to open, where the shape change is too drastic).
Systematic evaluation has only been performed on the PartNet-Mobility synthetic dataset, with limited validation on real-world scenarios.
Introducing lightweight physical constraints could be considered to improve the plausibility of current interpolated states.

Related Work¶

PARIS [ICCV 2023]: Decouples objects into static/dynamic parts and estimates motion parameters for each individually -> limiting generalization to multi-part setups.
A-SDF [ICCV 2021]: Requires articulation encoding and 3D supervision -> limiting practical applicability.
CLA-NeRF [ICRA 2022]: Requires articulation pose inputs -> dependency on priors.
Hypernetworks for INR Compression [NeurIPS 2023]: Source of the low-rank modulation concept.

Rating¶

Novelty: ⭐⭐⭐⭐ Elegant and clear logic in unifying articulated state modeling with hypernetworks and a latent dictionary.
Experimental Thoroughness: ⭐⭐⭐ Thorough ablation study, but evaluation dataset scale is limited (only 12 objects) and lacks extensive real-world data.
Writing Quality: ⭐⭐⭐⭐ Method explanation is clear, and mathematical derivations are complete.
Value: ⭐⭐⭐⭐ Offers a concise and scalable solution for articulated object modeling.