Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention¶
Conference: ICLR 2026
arXiv: 2602.06478
Code: efficient-lvsm.github.io
Area: 3D Vision
Keywords: novel view synthesis, Transformer, dual-stream architecture, KV-cache, attention decoupling
TL;DR¶
This paper proposes Efficient-LVSM, a dual-stream architecture that decouples input view encoding from target view generation, reducing the complexity of novel view synthesis from \(O(N_{in}^2)\) to \(O(N_{in})\). On RealEstate10K, the model achieves state-of-the-art performance (29.86 dB PSNR) using only 50% of LVSM's training time, with a 4.4× inference speedup.
Background & Motivation¶
Novel View Synthesis (NVS)—reconstructing 3D scenes from 2D images—is a central problem in computer vision. The field has evolved from per-scene optimization methods such as NeRF and 3DGS to feed-forward Transformer approaches like LVSM, which directly synthesize novel views from posed images without hand-crafted 3D priors.
However, LVSM's decoder-only design concatenates all input and target tokens before applying full self-attention, leading to two critical bottlenecks:
- Inefficiency: Attention complexity scales quadratically with the number of input views as \(O(N^2)\); input representations are recomputed redundantly when generating multiple target views.
- Performance limitations: Heterogeneous tokens—content-rich input views and pose-only target queries—share the same attention parameters, hindering each stream from learning specialized representations.
Core Problem¶
How can one decouple input encoding from target generation within an end-to-end feed-forward NVS framework, while simultaneously improving both efficiency and quality?
The encoder-decoder variant of LVSM avoids redundant computation but compresses all inputs into a fixed-size latent, causing significant information loss and degraded reconstruction quality. A new architecture is needed that preserves multi-level, fine-grained features while supporting efficient inference.
Method¶
1. Dual-Stream Architecture¶
The core idea is to fully decouple input view processing and target view generation into two independent streams:
- Input Encoder: Applies intra-view self-attention independently to each input view, with no cross-view interaction.
- Target Decoder: Target tokens first undergo self-attention, then query input features via cross-attention.
This design reduces the attention complexity from \(O(N^2 M)\) to \(O(NM + N)\), where \(N\) is the number of input views and \(M\) the number of target views.
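To make the savings concrete, here is a back-of-the-envelope comparison of attention cost (the numbers `V`, `M`, `P` are illustrative assumptions, not figures from the paper):

```python
# Back-of-the-envelope attention cost (illustrative numbers, not from the paper).
V, M, P = 4, 8, 1024   # assumed: V input views, M target views, P patches per view

# Decoder-only: each target is synthesized by full self-attention over all
# input tokens plus its own, so the quadratic term is re-paid M times.
dec_only = M * (V * P + P) ** 2

# Dual-stream: inputs are encoded once (per-view P^2), then each target pays
# self-attention (P^2) plus cross-attention to all input tokens (P * V*P).
dual = V * P**2 + M * (P**2 + P * V * P)

print(f"decoder-only ~ {dec_only:.2e} attention pairs")
print(f"dual-stream  ~ {dual:.2e} attention pairs ({dec_only / dual:.1f}x fewer)")
```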
2. Intra-View Self-Attention¶
The input encoder applies self-attention exclusively among patches within each individual input view.
Each view is processed independently, naturally enabling zero-shot generalization to varying numbers of input views.
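A minimal PyTorch sketch of the idea (module name and shapes are assumptions, not the authors' code): folding the view axis into the batch axis makes attention block-diagonal across views, so no cross-view mixing can occur.

```python
import torch
import torch.nn as nn

class IntraViewSelfAttention(nn.Module):
    """Self-attention restricted to the patches of a single view (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, views, patches, dim]
        b, v, p, d = x.shape
        x = x.reshape(b * v, p, d)    # fold views into batch: attention is block-diagonal
        out, _ = self.attn(x, x, x)   # patches attend only within their own view
        return out.reshape(b, v, p, d)
```

Because the module never mixes tokens across the view axis, it works unchanged for any number of input views, which is what underlies the zero-shot generalization noted above.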
3. Self-then-Cross Attention (Target Decoder)¶
The target decoder alternates between self-attention and cross-attention layers.
Self-attention enables target tokens to exchange scene-level information with one another, while cross-attention extracts the required content from input features. Ablation studies show that a 6+6 alternating structure outperforms 12 layers of pure cross-attention.
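A sketch of one decoder block under these assumptions (layer names, pre-norm layout, and MLP width are illustrative choices, not the paper's exact spec); stacking six such blocks yields the 6+6 structure:

```python
import torch
import torch.nn as nn

class TargetDecoderBlock(nn.Module):
    """One self-then-cross step; stack six for the 6+6 structure (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tgt: torch.Tensor, enc_feats: torch.Tensor) -> torch.Tensor:
        # tgt: [batch, target_tokens, dim]; enc_feats: [batch, input_tokens, dim]
        h = self.norm1(tgt)
        tgt = tgt + self.self_attn(h, h, h)[0]                    # target <-> target
        h = self.norm2(tgt)
        tgt = tgt + self.cross_attn(h, enc_feats, enc_feats)[0]   # target -> inputs
        return tgt + self.mlp(self.norm3(tgt))
```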
4. Co-Refinement Mechanism¶
Unlike conventional encoder-decoder designs that consume only the final encoder layer's features, co-refinement allows each decoder layer to query the corresponding encoder layer. This enables the decoder to leverage:
- Fine-grained texture details from early layers
- High-level semantic information from later layers
Visualizations confirm that co-refinement produces feature maps capturing significantly more target-view detail than a vanilla encoder-decoder baseline.
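A minimal sketch of how the layer-wise pairing could be wired (assuming one decoder block per encoder layer and encoder features shaped `[batch, views, patches, dim]`; this is not the authors' implementation):

```python
def co_refinement_forward(encoder_layers, decoder_blocks, inputs, tgt):
    # inputs: [batch, views, patches, dim]; tgt: [batch, target_tokens, dim]
    enc_per_layer = []
    feats = inputs
    for layer in encoder_layers:          # intra-view encoding, layer by layer
        feats = layer(feats)
        enc_per_layer.append(feats)       # keep every level, not just the last

    for block, feats in zip(decoder_blocks, enc_per_layer):
        kv = feats.flatten(1, 2)          # [b, v, p, d] -> [b, v*p, d] for cross-attention
        tgt = block(tgt, kv)              # decoder layer i queries encoder layer i
    return tgt
```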
5. REPA Distillation¶
REPA is employed to distill visual features from a pretrained DINOv3 model into intermediate encoder layers by maximizing patch-level feature similarity.
The DINOv3 teacher and the projection layers are discarded at inference time, so REPA adds no inference overhead. A key finding is that REPA yields substantially larger gains for Efficient-LVSM than for LVSM, since LVSM's full self-attention entangles feature maps across different views.
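The alignment objective can be sketched as a patch-wise negative cosine similarity (the projector architecture and the choice of alignment layer are assumptions; only the spirit of REPA is shown):

```python
import torch
import torch.nn.functional as F

def repa_loss(enc_feats: torch.Tensor, dino_feats: torch.Tensor,
              proj: torch.nn.Module) -> torch.Tensor:
    # enc_feats: [batch, patches, dim_enc] from an intermediate encoder layer
    # dino_feats: [batch, patches, dim_dino] from a frozen DINOv3 teacher
    pred = proj(enc_feats)                                 # trainable projection head
    sim = F.cosine_similarity(pred, dino_feats, dim=-1)    # per-patch similarity
    return -sim.mean()                                     # minimize = maximize alignment
```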
6. KV-Cache and Incremental Inference¶
The decoupled design naturally supports KV-caching:
- Key/value pairs for input views are computed once and cached for reuse.
- Adding a new target view: cached representations are directly reused for rendering.
- Adding a new input view: only the new view is processed and appended to the cache.
The marginal cost of adding input or target views is nearly constant, making the model well-suited for interactive applications.
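A simplified sketch of the incremental loop (an assumed API, not the released code; with co-refinement the cache would hold one feature tensor per encoder layer rather than a single tensor per view):

```python
import torch

class IncrementalNVS:
    """KV-cache-style incremental inference (assumed API, sketch)."""

    def __init__(self, encoder, decoder):
        self.encoder, self.decoder = encoder, decoder
        self.cache = []                                # one feature tensor per input view

    def add_input_view(self, view_tokens: torch.Tensor) -> None:
        # Only the new view is encoded; cached entries stay untouched,
        # so the marginal cost per added view is roughly constant.
        self.cache.append(self.encoder(view_tokens))

    def render_target(self, target_tokens: torch.Tensor) -> torch.Tensor:
        # A new target view reuses all cached input features as-is.
        enc_feats = torch.cat(self.cache, dim=1)       # [b, total_input_tokens, d]
        return self.decoder(target_tokens, enc_feats)
```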
Key Experimental Results¶
Scene-Level (RealEstate10K, 2 Input Views)¶
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| GS-LRM | 28.10 | 0.892 | 0.114 |
| LVSM Dec-Only (512) | 29.53 | 0.904 | 0.141 |
| Efficient-LVSM (512) | 29.86 | 0.905 | 0.147 |
Object-Level (ABO / GSO, 512 Resolution)¶
| Method | ABO PSNR ↑ | GSO PSNR ↑ |
|---|---|---|
| GS-LRM | 29.09 | 30.52 |
| LVSM Dec-Only | 32.10 | 32.36 |
| Efficient-LVSM | 32.65 | 32.92 |
Efficiency Comparison¶
- Training convergence: approximately 2× faster than LVSM
- Inference speed: 4.4× faster than LVSM Dec-Only (up to 14.9× in certain settings)
- Training resources: 64 A100 GPUs for 3 days, representing only 50% of LVSM's training cost
- Incremental inference: latency and memory remain nearly constant as views are added
Zero-Shot Generalization¶
Trained with 4 input views, the model generalizes at test time to varying numbers of input views, enabled by the per-view independent processing design.
Highlights & Insights¶
- Rigorous problem analysis: The limitations of LVSM's full self-attention are systematically analyzed from the perspectives of information heterogeneity and computational complexity, motivating the dual-stream decoupled design.
- Elegant co-refinement design: Layer-wise alignment between encoder and decoder features fully exploits multi-scale information.
- Strong practical value: KV-cache combined with incremental inference enables deployment in interactive 3D scene browsing applications.
- Dual gains in efficiency and quality: Efficient-LVSM surpasses LVSM across all benchmarks while substantially reducing training and inference costs.
- Conditional finding on REPA distillation: The work reveals a coupling between distillation effectiveness and attention architecture design.
Limitations & Future Work¶
- Intra-view attention in the input encoder contains no cross-view interaction; scene understanding relies entirely on the decoder's cross-attention, which may be insufficient for scenes with heavy occlusion.
- Scaling behavior at larger model sizes or higher resolutions remains unexplored.
- Evaluation is limited to static scenes; applicability to dynamic scenes is unknown.
- LPIPS at 512 resolution is marginally worse than LVSM Dec-Only (0.147 vs. 0.141), indicating room for improvement in perceptual quality.
Related Work & Insights¶
| Dimension | LVSM Dec-Only | LVSM Enc-Dec | Efficient-LVSM |
|---|---|---|---|
| Input complexity | \(O(N^2)\) | \(O(N^2)\) | \(O(N)\) |
| Parameter sharing | Shared across heterogeneous tokens | Separate; only final encoder layer used | Separate + layer-wise co-refinement |
| KV-Cache | Not supported | Not supported | Supported |
| Incremental inference | Not supported | Not supported | Supported |
| Variable-view generalization | Poor | Poor | Strong (zero-shot) |
| RealEstate10K PSNR | 29.53 | 28.55 | 29.86 |
Compared to Gaussian-splatting-based methods such as pixelSplat and MVSplat, Efficient-LVSM does not rely on explicit 3D representations, is fully end-to-end, and achieves higher quality, though at greater computational cost.
The following broader insights emerge from this work:
- Transferability of the dual-stream decoupling principle: Any Transformer task involving heterogeneous inputs—such as multimodal understanding or robot perception—can benefit from decoupling the provider and query streams.
- Generality of co-refinement: Layer-wise cross-querying can be applied to other encoder-decoder architectures beyond NVS.
- KV-Cache for 3D vision: Importing the mature KV-cache technique from LLMs into 3D vision provides a new pathway toward real-time interactive rendering.
- Coupling between distillation and architecture: The substantially different REPA gains across architectures suggest that distillation strategies should be co-designed with the model architecture.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The dual-stream co-refinement design represents a clear contribution to NVS, though individual components are not entirely novel in isolation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive ablations covering scene-level and object-level benchmarks across efficiency, quality, and generalization dimensions.
- Writing Quality: ⭐⭐⭐⭐ — The progression from problem analysis to solution derivation is logically coherent and well-illustrated.
- Value: ⭐⭐⭐⭐⭐ — Simultaneous gains in efficiency and quality, strong practical utility of KV-cache support, and clear guidance for future work.