Twinner: Shining Light on Digital Twins in a Few Snaps

Conference: CVPR 2025
arXiv: 2503.08382
Code: None
Area: 3D Vision
Keywords: Digital Twins, PBR Material Reconstruction, Environmental Illumination Estimation, Large Reconstruction Models, Voxel Grid Representation

TL;DR

Twinner is proposed as the first large feed-forward reconstruction model that simultaneously recovers scene illumination, object geometry, and PBR material properties from a sparse set of images. Using a tricolumn representation, procedural synthetic data, and fine-tuning on real data via a differentiable PBR renderer, Twinner outperforms existing feed-forward methods and comes close to per-scene optimization approaches on StanfordORB.

Background & Motivation

Digital twins require the recovery of geometry, materials, and illumination. However, existing methods suffer from several limitations:

Illumination Baking Issue: Methods like NeRF and 3D-GS bake lighting into the radiance field, preventing relighting.
Slow Per-Scene Optimization: Optimization-based approaches such as NVDiffRec require long reconstruction times and depend heavily on hand-crafted priors.
Scarcity of High-Quality PBR Data: Feed-forward models like MetalLRM must be trained on large-scale, high-quality PBR-textured 3D assets, which are scarce.
Synthetic-to-Real Domain Gap: Models trained solely on synthetic datasets generalize poorly to real-world images.

This work targets the last two of these challenges: training data scarcity and the synthetic-to-real domain gap.

Method

Overall Architecture

Twinner extends the LightplaneLRM architecture: input images are encoded by a DINO ViT, and a DiT transformer predicts a tricolumn 3D representation (encoding albedo, roughness, metalness, and normals), while another DiT predicts the environmental illumination cubemap. A differentiable PBR renderer then renders the predicted materials and illumination into shaded images, which are compared with original images to provide indirect PBR supervision.

Key Designs

1. Tricolumn Voxel Grid Representation

Function: Reduces the number of tokens to a quadratic scale while preserving the advantages of direct voxel representations.
Mechanism: The $R \times R \times R$ voxel grid is unrolled along each of the three axes into three planes $V_{xy}, V_{yz}, V_{zx} \in \mathbb{R}^{(CR) \times R \times R}$. Each plane pixel carries a feature vector of dimension $CR$, which is the stack of $C$-dimensional features for the $R$ voxels in the column behind that pixel. Features for any query point $(x,y,z)$ are obtained via 3D interpolation: $V(x,y,z) = f(V_{xy}(x,y;z), V_{yz}(y,z;x), V_{zx}(z,x;y))$, where the coordinate after the semicolon selects the position along the column.
Design Motivation: Standard triplanes mix features from different voxels, leading to projection artifacts. Tricolumn maintains independent representations for each voxel while keeping the token count at $3R^2$ (quadratic), avoiding the $R^3$ (cubic) memory limitations of transformers.
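The lookup described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the channel layout (voxel index major, feature index minor), the helper names `unroll_voxel_grid`, `sample_column_plane`, and `query_tricolumn`, and the choice of summation for the fusion function $f$ are all assumptions made for the example.

```python
import numpy as np

def unroll_voxel_grid(G):
    """Unroll a dense grid G[c, x, y, z] of shape (C, R, R, R) into three
    column planes of shape (C*R, R, R). Layout is an assumption: channel
    k*C + c of a plane pixel holds feature c of the k-th voxel in its column."""
    C, R = G.shape[0], G.shape[1]
    V_xy = G.transpose(3, 0, 1, 2).reshape(C * R, R, R)  # column runs along z
    V_yz = G.transpose(1, 0, 2, 3).reshape(C * R, R, R)  # column runs along x
    V_zx = G.transpose(2, 0, 3, 1).reshape(C * R, R, R)  # column runs along y
    return V_xy, V_yz, V_zx

def _lerp_idx(t, R):
    """Return the two neighbouring integer indices and the blend weight."""
    t = float(np.clip(t, 0.0, R - 1))
    i0 = min(int(np.floor(t)), R - 2)
    return i0, i0 + 1, t - i0

def sample_column_plane(plane, u, v, w):
    """Trilinear lookup: (u, v) are in-plane coordinates, w is the position
    along the column. All coordinates are in voxel-index units [0, R-1]."""
    CR, R, _ = plane.shape
    C = CR // R
    grid = plane.reshape(R, C, R, R)  # (column index, feature, u, v)
    u0, u1, fu = _lerp_idx(u, R)
    v0, v1, fv = _lerp_idx(v, R)
    w0, w1, fw = _lerp_idx(w, R)
    out = np.zeros(C)
    for wi, ww in ((w0, 1 - fw), (w1, fw)):
        for ui, wu in ((u0, 1 - fu), (u1, fu)):
            for vi, wv in ((v0, 1 - fv), (v1, fv)):
                out += ww * wu * wv * grid[wi, :, ui, vi]
    return out

def query_tricolumn(V_xy, V_yz, V_zx, x, y, z):
    """V(x,y,z) = f(V_xy(x,y;z), V_yz(y,z;x), V_zx(z,x;y)), with f assumed to be a sum."""
    return (sample_column_plane(V_xy, x, y, z)
            + sample_column_plane(V_yz, y, z, x)
            + sample_column_plane(V_zx, z, x, y))

# Usage: random features, C = 8 channels, R = 16 voxels per side (illustrative sizes).
G = np.random.default_rng(0).normal(size=(8, 16, 16, 16))
feat = query_tricolumn(*unroll_voxel_grid(G), 3.2, 7.5, 10.9)
print(feat.shape)  # (8,)
```

Because every voxel keeps its own slot in the column, no features from different depths are blended at a plane pixel, which is the property the design motivation contrasts with triplanes. As a size check with an illustrative resolution of $R = 64$, the three planes hold $3 \cdot 64^2 = 12{,}288$ pixels versus $64^3 = 262{,}144$ voxels in a dense grid.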

2. Procedural PBR Dataset + Real Data Fine-Tuning

Function: Resolves the scarcity of high-quality PBR training data.
Mechanism: Inspired by Zeroverse, synthetic objects are procedurally generated with PBR textures and known environmental lighting, providing direct material and illumination supervision $\mathcal{L}_S$. The model is then fine-tuned on real data (e.g., CO3Dv2) using the photometric loss of a differentiable PBR renderer—comparing the output shaded renderings against real-world images without requiring ground-truth PBR annotations.
Design Motivation: Procedural data, although lacking realism, provides clean PBR ground truth, while real-world data bridges the domain gap. Combining both stages offers complementary benefits.

3. Environmental Illumination Prediction

Function: Predicts the scene's cubemap environmental illumination from a sparse set of input views.
Mechanism: An auxiliary DiT module ingests image feature tokens and predicts the six faces of the cubemap. The cubemap is processed by a differentiable mipmap and utilized for split-sum approximated PBR rendering, where the photometric loss provides indirect illumination supervision.
Design Motivation: Accurate lighting estimation is crucial for material-lighting disentanglement. Performing direct prediction from input views is significantly faster than per-scene optimization.
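For reference, the split-sum approximation mentioned above is usually written in the following form in differentiable renderers such as NVDiffRec. Whether Twinner uses exactly this factorization is an assumption here; the symbols $\omega_i$, $\omega_o$, $n$, $f$, $D$, and $L_i$ are the standard ones and do not come from this summary.

$$L_o(\omega_o) \approx \int_\Omega f(\omega_i, \omega_o)\,(\omega_i \cdot n)\,d\omega_i \;\cdot\; \int_\Omega L_i(\omega_i)\,D(\omega_i, \omega_o)\,(\omega_i \cdot n)\,d\omega_i$$

The first factor depends only on the BRDF $f$ (roughness and viewing angle) and can be precomputed as a lookup table. The second factor is the environment map $L_i$ filtered by the specular lobe $D$; storing it at increasing blur levels is what the mipmap over the predicted cubemap provides, with the mip level selected by roughness.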

Loss & Training

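Synthetic data stage: Direct material supervision on the rendered albedo ($a$), roughness ($r$), metalness ($m$), and normal ($n$) maps, comparing each ground-truth image $\mathcal{I}$ with its prediction $\hat{\mathcal{I}}$: $$\mathcal{L}_S = \ell_\mathcal{M}(\mathcal{I}_a, \hat{\mathcal{I}}_a) + \ell_\mathcal{M}(\mathcal{I}_r, \hat{\mathcal{I}}_r) + \ell_\mathcal{M}(\mathcal{I}_m, \hat{\mathcal{I}}_m) + \ell_\mathcal{M}(\mathcal{I}_n, \hat{\mathcal{I}}_n)$$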

Real data stage: Photometric loss (including LPIPS + L2 + mask BCE + depth L1) + normal consistency loss $\mathcal{L}_N$
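The synthetic-stage loss $\mathcal{L}_S$ can be sketched as a sum of four per-map terms. The summary does not define $\ell_\mathcal{M}$, so the sketch below substitutes a masked mean squared error as a hypothetical stand-in; the function names, dictionary keys, and mask convention are likewise assumptions for illustration only.

```python
import numpy as np

MAP_KEYS = ("albedo", "roughness", "metalness", "normal")

def masked_l2(target, pred, mask):
    """Hypothetical stand-in for ell_M: mean squared error over foreground pixels.

    target, pred: arrays of shape (H, W, K); mask: array of shape (H, W) in {0, 1}.
    """
    m = mask[..., None].astype(pred.dtype)
    denom = max(float(m.sum()) * pred.shape[-1], 1.0)  # avoid division by zero
    return float((m * (pred - target) ** 2).sum() / denom)

def synthetic_loss(gt, pred, mask):
    """L_S: sum of the per-map losses over albedo, roughness, metalness, normals."""
    return sum(masked_l2(gt[k], pred[k], mask) for k in MAP_KEYS)

# Usage with toy 4x4 maps: only the albedo prediction is off, by 0.5 everywhere.
H = W = 4
channels = {"albedo": 3, "roughness": 1, "metalness": 1, "normal": 3}
gt = {k: np.zeros((H, W, c)) for k, c in channels.items()}
pred = {k: v.copy() for k, v in gt.items()}
pred["albedo"] += 0.5
loss = synthetic_loss(gt, pred, np.ones((H, W)))
print(loss)  # 0.25
```

The real-data stage has no ground-truth maps to plug into such a function, which is why it relies on comparing shaded renderings with the input photographs instead.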

Key Experimental Results

StanfordORB Benchmark

| Method | Type | Novel Relight PSNR (dB)↑ | NVS PSNR (dB)↑ | Illumination Angular Error↓ | Time |
| --- | --- | --- | --- | --- | --- |
| NVDiffRec | Optimization | 22.5 | 26.8 | 18.2° | ~hours |
| MetalLRM | Feed-forward | 20.1 | 24.5 | 22.4° | ~seconds |
| Twinner | Feed-forward | 21.8 | 25.9 | 13.1° | ~seconds |

Illumination Prediction Performance

| Method | Illumination Angular Error↓ |
| --- | --- |
| Baseline feed-forward method | 22.4° |
| Optimization methods | 18.2° |
| Twinner | 13.1° |

Key Findings

Twinner reduces the illumination angular error by about 42% relative to the feed-forward baseline (22.4° → 13.1°) and by about 28% relative to optimization methods (18.2° → 13.1°).
Relighting PSNR exceeds that of the feed-forward method MetalLRM by ~1.7 dB (21.8 vs. 20.1) and is within ~0.7 dB of the optimization-based NVDiffRec (22.5).
Inference takes on the order of seconds, versus hours for per-scene optimization.
The tricolumn representation shows clear advantages over the traditional triplane in recovering fine geometric details and accurate material properties.
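The relative changes can be recomputed directly from the table values above. The snippet is only an arithmetic check; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

# Illumination angular error in degrees, taken from the tables above.
vs_feed_forward = relative_reduction(22.4, 13.1)
vs_optimization = relative_reduction(18.2, 13.1)

# Novel-relighting PSNR differences in dB.
gain_over_feed_forward = 21.8 - 20.1
gap_to_optimization = 22.5 - 21.8

print(f"{vs_feed_forward:.1%}")         # 41.5%
print(f"{vs_optimization:.1%}")         # 28.0%
print(f"{gain_over_feed_forward:.1f}")  # 1.7
print(f"{gap_to_optimization:.1f}")     # 0.7
```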

Highlights & Insights

Elegant Tricolumn Representation: Combines the quadratic token count of triplanes with the per-voxel independence of volumetric grids.
Pragmatic Two-Stage Training: Procedural synthetic data provides direct PBR ground truth, while real-world data effectively bridges the domain gap.
First LRM for Complete Joint Reconstruction: Simultaneously predicts geometry, material, and illumination, offering an end-to-end feed-forward pipeline for digital twins.

Limitations & Future Work

Self-occlusion and cast shadows of the object are not modeled during rendering.
The PBR model is simplified compared to the standard Disney microfacet model.
Environmental illumination is assumed to be spatially invariant, which limits performance under complex indoor lighting.
The setup currently supports only 4 input views; incorporating more views could further improve performance.

Related Work

LRM/LightplaneLRM: Foundational feed-forward reconstruction architectures, but limited to recovering radiance fields.
NVDiffRec: Per-scene optimization for PBR reconstruction, which achieves high reconstruction quality but is slow.
MetalLRM: A pioneer in feed-forward PBR prediction, but constrained by the synthetic-to-real domain gap.

Rating

⭐⭐⭐⭐⭐ — Comprehensive technical contribution: innovative tricolumn architecture, a pragmatic procedural plus real-world fine-tuning strategy, and the first complete feed-forward digital twin LRM. The results on StanfordORB demonstrate that feed-forward methods can closely match optimization-based approaches.