Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention¶

Conference: ICLR2026
arXiv: 2602.06478
Code: efficient-lvsm.github.io
Area: 3D Vision
Keywords: novel view synthesis, Transformer, Dual-Stream Architecture, KV-Cache, Attention Decoupling

TL;DR¶

This paper proposes Efficient-LVSM, a dual-stream architecture that decouples input view encoding from target view generation, reducing the complexity of novel view synthesis from $O(N_{in}^2)$ to $O(N_{in})$. On RealEstate10K, the model achieves state-of-the-art performance (29.86 dB PSNR) using only 50% of LVSM's training time, with a 4.4× inference speedup.

Background & Motivation¶

Novel View Synthesis (NVS), the task of rendering a scene from unseen viewpoints given a set of 2D images, is a central problem in computer vision. The field has evolved from per-scene optimization methods such as NeRF and 3DGS to feed-forward Transformer approaches like LVSM, which directly synthesize novel views from posed images without hand-crafted 3D priors.

However, LVSM's decoder-only design concatenates all input and target tokens before applying full self-attention, leading to two critical bottlenecks:

Inefficiency: Attention complexity scales quadratically with the number of input views as $O(N^2)$; input representations are recomputed redundantly when generating multiple target views.
Performance limitations: Heterogeneous tokens—content-rich input views and pose-only target queries—share the same attention parameters, hindering each stream from learning specialized representations.

Core Problem¶

How can one decouple input encoding from target generation within an end-to-end feed-forward NVS framework, while simultaneously improving both efficiency and quality?

The encoder-decoder variant of LVSM avoids redundant computation but compresses all inputs into a fixed-size latent representation, causing significant information loss and degraded reconstruction quality. A new architecture is needed that preserves multi-level fine-grained features while supporting efficient inference.

Method¶

1. Dual-Stream Architecture¶

The core idea is to fully decouple input view processing and target view generation into two independent streams:

Input Encoder: Applies intra-view self-attention independently to each input view, with no cross-view interaction.
Target Decoder: Target tokens first undergo self-attention, then query input features via cross-attention.

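This design reduces attention complexity from $O(N^2 M)$ to $O(NM + N)$, where $N$ is the number of input views and $M$ is the number of target views: the encoder cost grows linearly in $N$ and is paid once, while each target view adds one cross-attention pass over the $N$ encoded inputs.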
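
The scaling claim can be checked with a rough counting model. The sketch below counts query-key pairs per attention layer and ignores projections, MLPs, and attention heads; the function names and the `patches` parameter (tokens per view) are illustrative assumptions, not taken from the paper.

```python
def full_attention_pairs(n_in: int, n_tgt: int, patches: int) -> int:
    """Query-key pairs per layer when each target is concatenated with all inputs."""
    tokens = (n_in + 1) * patches      # all input views plus one target view
    return n_tgt * tokens ** 2         # full self-attention, repeated per target


def decoupled_attention_pairs(n_in: int, n_tgt: int, patches: int) -> int:
    """Query-key pairs per layer for the dual-stream layout (toy counting model)."""
    encoder = n_in * patches ** 2                      # intra-view self-attention, run once
    decoder_self = n_tgt * patches ** 2                # target tokens attend to each other
    decoder_cross = n_tgt * patches * n_in * patches   # target queries over all input tokens
    return encoder + decoder_self + decoder_cross


for n_in in (2, 4, 8):
    print(n_in, full_attention_pairs(n_in, 1, 4), decoupled_attention_pairs(n_in, 1, 4))
```

Under this model the decoupled count grows by a fixed amount per added input view, while the full-attention count grows with the square of the total token count.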

2. Intra-View Self-Attention¶

The input encoder applies self-attention exclusively among patches within each individual input view:

\[\mathbf{S_i}^l = \mathbf{S_i}^{l-1} + \text{Self-Attn}_{\text{input}}^l(\mathbf{S_i}^{l-1})\]

Each view is processed independently, naturally enabling zero-shot generalization to varying numbers of input views.
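
A minimal sketch of this per-view update follows, assuming single-head attention with no learned projections, normalization, or MLP. The helper names `attend`, `encode_view`, and `encode_views` are hypothetical. It illustrates why the encoder is indifferent to the number of input views: each view's encoding depends only on its own patches.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention; learned projections are omitted."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def encode_view(patches):
    """One residual step S_i <- S_i + Self-Attn(S_i), restricted to a single view."""
    update = attend(patches, patches, patches)
    return [[p + u for p, u in zip(patch, upd)] for patch, upd in zip(patches, update)]


def encode_views(views):
    # No cross-view interaction: each view is encoded on its own.
    return [encode_view(view) for view in views]


view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.5, 0.5], [1.0, 1.0]]
view_c = [[2.0, -1.0], [0.0, 0.0]]

two_views = encode_views([view_a, view_b])
three_views = encode_views([view_a, view_b, view_c])
# Adding view_c leaves the encodings of view_a and view_b unchanged.
print(two_views[0] == three_views[0], two_views[1] == three_views[1])
```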

3. Self-then-Cross Attention (Target Decoder)¶

The target decoder alternates between self-attention and cross-attention:

\[\mathbf{T_j}^l = \mathbf{T_j}^{l-1} + \text{Self-Attn}_{\text{target}}^l(\mathbf{T_j}^{l-1})\]

\[\mathbf{T_j}^l = \mathbf{T_j}^l + \text{Cross-Attn}_{\text{target}}^l(\mathbf{T_j}^l, \mathbf{S_1}^l, \ldots, \mathbf{S_N}^l)\]

Self-attention enables target tokens to exchange scene-level information with one another, while cross-attention extracts the required content from input features. Ablation studies show that a 6+6 alternating structure outperforms 12 layers of pure cross-attention.
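
The two residual updates can be sketched as a single decoder layer. This is a toy version under stated assumptions: single-head attention, no learned projections, and input tokens from all views simply concatenated into one memory. `decoder_layer` and the helpers are hypothetical names.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention; learned projections are omitted."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def add(tokens, update):
    return [[a + b for a, b in zip(row, upd)] for row, upd in zip(tokens, update)]


def decoder_layer(target, input_views):
    """Self-then-cross update for the tokens of one target view."""
    target = add(target, attend(target, target, target))         # self-attention
    memory = [patch for view in input_views for patch in view]   # S_1, ..., S_N
    return add(target, attend(target, memory, memory))           # cross-attention


# A pose-only style query (all zeros) picks up content from the single input token.
print(decoder_layer([[0.0, 0.0]], [[[1.0, 2.0]]]))
```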

4. Co-Refinement¶

Unlike conventional encoder-decoder designs that consume only the final encoder layer's features, co-refinement allows each decoder layer to query the corresponding encoder layer. This enables the decoder to leverage:

Fine-grained texture details from early layers
High-level semantic information from later layers

Visualizations confirm that co-refinement produces feature maps capturing significantly more target-view detail than a vanilla encoder-decoder baseline.
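
The difference in wiring between co-refinement and a vanilla encoder-decoder can be made concrete with toy layers. In the sketch below, each "feature" is just a depth counter, so it is visible which encoder layer every decoder layer reads. All function names are illustrative, and the toy layers stand in for real attention blocks.

```python
def co_refined_forward(views, target, encoder_layers, decoder_layers):
    """Run both streams in lockstep: decoder layer l reads encoder layer l."""
    for encode, decode in zip(encoder_layers, decoder_layers):
        views = [encode(view) for view in views]
        target = decode(target, views)
    return target


def last_layer_forward(views, target, encoder_layers, decoder_layers):
    """Baseline wiring: every decoder layer reads the final encoder output."""
    for encode in encoder_layers:
        views = [encode(view) for view in views]
    for decode in decoder_layers:
        target = decode(target, views)
    return target


def toy_encode(depth):
    # Stand-in for an encoder layer: the feature is simply its own depth.
    return depth + 1


def make_toy_decode(log):
    def decode(target, views):
        log.append(views[0])  # record which encoder depth this decoder layer saw
        return target
    return decode


co_log, last_log = [], []
co_refined_forward([0, 0], "target", [toy_encode] * 3, [make_toy_decode(co_log)] * 3)
last_layer_forward([0, 0], "target", [toy_encode] * 3, [make_toy_decode(last_log)] * 3)
print(co_log, last_log)  # prints [1, 2, 3] [3, 3, 3]
```

With layer-wise pairing the decoder sees depths 1, 2, 3 in turn, which corresponds to early-layer detail and later-layer semantics; the baseline sees only the deepest features.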

5. REPA Distillation¶

REPA is employed to distill visual features from a pretrained DINOv3 model into intermediate encoder layers by maximizing patch-level similarity:

\[\mathcal{L}_{REPA} = \frac{1}{N}\sum_{i=1}^{N}\text{sim}(f(\mathbf{I}), h_\phi(\mathbf{X_k}))\]

The pretrained encoder and projection layers are discarded at inference time, incurring no additional inference overhead. A key finding is that REPA yields substantially larger gains for Efficient-LVSM than for LVSM, since full self-attention in the latter entangles feature maps across different views.
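
A hedged sketch of the patch-level alignment term follows. It assumes cosine similarity for $\text{sim}$, a plain linear map for $h_\phi$, and that training minimizes the negative mean similarity; none of these details are confirmed by the text above. `teacher_patches` stands in for $f(\mathbf{I})$ and `student_patches` for $\mathbf{X_k}$.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon guards against zero vectors


def project(features, weight):
    """Hypothetical linear projection h_phi, given as a list of weight rows."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def repa_alignment(teacher_patches, student_patches, weight):
    """Mean patch-wise similarity between teacher features and projected student features."""
    sims = [cosine(t, project(s, weight))
            for t, s in zip(teacher_patches, student_patches)]
    return sum(sims) / len(sims)


def repa_loss(teacher_patches, student_patches, weight):
    # Assumed sign convention: minimizing the loss maximizes the similarity.
    return -repa_alignment(teacher_patches, student_patches, weight)


identity = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 2.0]]
print(repa_alignment(patches, patches, identity))
```

Because `project` and the teacher features are used only inside the loss, dropping them at inference time leaves the forward pass of the model untouched.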

6. KV-Cache and Incremental Inference¶

The decoupled design naturally supports KV-caching:

Key/value pairs for input views are computed once and cached for reuse.
Adding a new target view: cached representations are directly reused for rendering.
Adding a new input view: only the new view is processed and appended to the cache.

The marginal cost of adding input or target views is nearly constant, making the model well-suited for interactive applications.
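
The caching behavior can be sketched with a small class. This is a toy under stated assumptions: keys and values are the encoded features themselves (identity projections), attention is single-head, and `InputViewCache` with its methods is a hypothetical interface rather than the released code.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention; learned projections are omitted."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


class InputViewCache:
    """Caches per-view keys and values so each input view is encoded only once."""

    def __init__(self, encode):
        self.encode = encode   # hypothetical per-view encoder
        self.keys = []
        self.values = []
        self.encode_calls = 0

    def add_view(self, patches):
        features = self.encode(patches)   # only the new view is processed
        self.encode_calls += 1
        self.keys.extend(features)        # identity key/value projections in this toy
        self.values.extend(features)

    def render(self, target_queries):
        # New target views reuse the cache; no input view is re-encoded.
        return attend(target_queries, self.keys, self.values)


cache = InputViewCache(encode=lambda patches: [list(p) for p in patches])
cache.add_view([[1.0, 0.0], [0.0, 1.0]])
cache.add_view([[0.5, 0.5]])
first = cache.render([[1.0, 0.0]])
second = cache.render([[0.0, 1.0]])
cache.add_view([[2.0, 2.0]])
print(cache.encode_calls, len(cache.keys))
```

Rendering more targets never increases `encode_calls`, and adding an input view increases it by exactly one, which is the property that keeps the marginal cost of incremental inference low.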

Key Experimental Results¶

Scene-Level (RealEstate10K, 2 Input Views)¶

Method	PSNR ↑	SSIM ↑	LPIPS ↓
GS-LRM	28.10	0.892	0.114
LVSM Dec-Only (512)	29.53	0.904	0.141
Efficient-LVSM (512)	29.86	0.905	0.147

Object-Level (ABO / GSO, 512 Resolution)¶

Method	ABO PSNR	GSO PSNR
GS-LRM	29.09	30.52
LVSM Dec-Only	32.10	32.36
Efficient-LVSM	32.65	32.92

Efficiency Comparison¶

Training convergence: approximately 2× faster than LVSM
Inference speed: 4.4× faster than LVSM Dec-Only (up to 14.9× in certain settings)
Training resources: 64 A100 GPUs for 3 days, representing only 50% of LVSM's training cost
Incremental inference: latency and memory remain nearly constant as views are added

Zero-Shot Generalization¶

Trained with 4 input views, the model generalizes at test time to varying numbers of input views, enabled by the per-view independent processing design.

Highlights & Insights¶

Rigorous problem analysis: The limitations of LVSM's full self-attention are systematically analyzed from the perspectives of information heterogeneity and computational complexity, motivating the dual-stream decoupled design.
Elegant co-refinement design: Layer-wise alignment between encoder and decoder features fully exploits multi-scale information.
Strong practical value: KV-cache combined with incremental inference enables deployment in interactive 3D scene browsing applications.
Dual gains in efficiency and quality: Efficient-LVSM surpasses LVSM in PSNR on the reported scene-level and object-level benchmarks (LPIPS on RealEstate10K is slightly worse) while substantially reducing training and inference costs.
Conditional finding on REPA distillation: The work reveals a coupling between distillation effectiveness and attention architecture design.

Limitations & Future Work¶

Intra-view attention in the input encoder contains no cross-view interaction; scene understanding relies entirely on the decoder's cross-attention, which may be insufficient for scenes with heavy occlusion.
Scaling behavior at larger model sizes or higher resolutions remains unexplored.
Evaluation is limited to static scenes; applicability to dynamic scenes is unknown.
LPIPS at 512 resolution is marginally worse than LVSM Dec-Only (0.147 vs. 0.141), indicating room for improvement in perceptual quality.

Dimension	LVSM Dec-Only	LVSM Enc-Dec	Efficient-LVSM
Input complexity	$O(N^2)$	$O(N^2)$	$O(N)$
Parameter sharing	Heterogeneous tokens shared	Separated but uses only last layer	Separated + layer-wise co-refinement
KV-Cache	Not supported	Not supported	Supported
Incremental inference	Not supported	Not supported	Supported
Variable-view generalization	Poor	Poor	Strong (zero-shot)
RealEstate10K PSNR	29.53	28.55	29.86

Compared to Gaussian splatting-based methods such as pixelSplat and MVSplat, Efficient-LVSM does not rely on explicit 3D representations, is more end-to-end, and achieves higher quality, though at greater computational cost.

The following broader insights emerge from this work:

Transferability of the dual-stream decoupling principle: Transformer tasks involving heterogeneous inputs, such as multimodal understanding or robot perception, may benefit from decoupling the provider and query streams.
Generality of co-refinement: Layer-wise cross-querying can be applied to other encoder-decoder architectures beyond NVS.
KV-Cache for 3D vision: Importing the mature KV-cache technique from LLMs into 3D vision provides a new pathway toward real-time interactive rendering.
Coupling between distillation and architecture: The substantially different REPA gains across architectures suggest that distillation strategies should be co-designed with the model architecture.

Rating¶

Novelty: ⭐⭐⭐⭐ — The dual-stream co-refinement design represents a clear contribution to NVS, though individual components are not entirely novel in isolation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive ablations covering scene-level and object-level benchmarks across efficiency, quality, and generalization dimensions.
Writing Quality: ⭐⭐⭐⭐ — The progression from problem analysis to solution derivation is logically coherent and well-illustrated.
Value: ⭐⭐⭐⭐⭐ — Simultaneous gains in efficiency and quality, strong practical utility of KV-cache support, and clear guidance for future work.
