ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Conference: NeurIPS 2025 · arXiv: 2505.23734 · Area: 3D Vision · Keywords: 3D Gaussian Splatting, Feed-Forward 3DGS, Information Bottleneck, Multi-View Compression, Novel View Synthesis

TL;DR

Grounded in the Information Bottleneck (IB) principle, this paper analyzes the capacity bottleneck of feed-forward 3DGS and proposes ZPressor, a lightweight, architecture-agnostic module that compresses multi-view inputs into a compact anchor-view representation, enabling existing models to scale to 100+ input views (480P, 80GB GPU) with consistent performance gains on DL3DV-10K and RealEstate10K.

Background & Motivation

Feed-forward 3DGS methods (e.g., pixelSplat, MVSplat, DepthSplat) represent a significant advance in novel view synthesis. Unlike traditional 3DGS that requires per-scene optimization, feed-forward approaches predict 3DGS parameters in a single forward pass through an encoder, substantially improving practicality.

However, existing feed-forward 3DGS models face a fundamental scalability bottleneck: as the number of input views increases, performance degrades rather than improves, while memory consumption grows rapidly. For example, DepthSplat achieves only 19.23 PSNR with 36-view input (compared to 23.32 with 12 views), and pixelSplat runs out of memory beyond 8 views. This issue is not purely an engineering problem — even memory-efficient attention or activation checkpointing cannot resolve the performance degradation.

The authors provide an information-theoretic explanation: the joint entropy of the multi-view features \(H(\mathbf{F}_1, \mathbf{F}_2, \dots, \mathbf{F}_K)\) is strictly smaller than the sum \(\sum_i H(\mathbf{F}_i)\) whenever views overlap, so a large number of views carries substantial information redundancy. In existing pixel-aligned designs, the number of 3D Gaussians grows linearly with the number of input views, causing representational overload.
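
For reference, the underlying fact is the subadditivity of entropy (textbook information theory, not an equation from the paper); the gap between the two sides is exactly the cross-view redundancy:

\[
H(\mathbf{F}_1, \dots, \mathbf{F}_K) \;\le\; \sum_{i=1}^{K} H(\mathbf{F}_i), \qquad \underbrace{\sum_{i=1}^{K} H(\mathbf{F}_i) - H(\mathbf{F}_1, \dots, \mathbf{F}_K)}_{\text{total correlation (redundancy)}} \;\ge\; 0,
\]

with equality only when the per-view features are independent. Strongly overlapping views drive this gap up, which is precisely the redundancy ZPressor targets.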

The core idea is to apply the IB principle to guide compression: learning a compact latent representation \(\mathcal{Z}\) that retains task-relevant information (for NVS) while discarding redundancy. Concretely, input views are divided into anchor views and support views; cross-attention aggregates information from support views into anchor views, producing a compressed latent state \(\mathcal{Z}\).

Method

Overall Architecture

ZPressor is an architecture-agnostic module that can be inserted after the encoder of any feed-forward 3DGS model. Given features \(\mathcal{X} = \{\mathbf{F}_i\}_{i=1}^K\) from \(K\) input views, ZPressor compresses them into \(\mathcal{Z} = \text{ZPressor}(\mathcal{X})\), which is then fed into the pixel-aligned Gaussian prediction network \(\Psi_{pred}\).

Key Designs

  1. Anchor View Selection:

    • Farthest Point Sampling (FPS) is applied to select \(N\) anchor views from \(K\) camera positions.
    • Formula: \(\mathbf{T}_{a_{i+1}} = \arg\max_{\mathbf{T}_j \in \mathcal{T} \setminus \mathcal{S}}(\min_{\mathbf{T}_k \in \mathcal{S}} d(\mathbf{T}_j, \mathbf{T}_k))\)
    • Design Motivation: Maximizing spatial coverage — anchors must span diverse viewpoints of the scene within a limited count, directly affecting the informational completeness of the compressed representation.
  2. Support-to-Anchor Assignment:

    • Each support view is assigned to its nearest anchor: \(\mathcal{C}_i = \{f(\mathbf{T}) \in \mathcal{X}_{support} \mid \|\mathbf{T} - \mathbf{T}_{a_i}\| \leq \|\mathbf{T} - \mathbf{T}_{a_j}\|, \forall j \neq i\}\)
    • Design Motivation: Spatially proximate views contain the most complementary information; assigning them to the same anchor maximizes fusion effectiveness.
  3. View Information Fusion:

    • Fusion is implemented via cross-attention: \(\mathcal{Z} = \text{Attention}(Q, K, V)\), where \(Q \leftarrow \mathcal{X}_{anchor}\), \(K, V \leftarrow \mathcal{X}_{support}\).
    • Anchor features serve as queries while support view features provide keys and values, enabling effective injection of support information into anchors.
    • Additional self-attention layers enhance intra-cluster information flow; stacking multiple blocks further improves fusion.
    • Design Motivation: Cross-attention satisfies two key properties: (1) it injects support information while keeping anchors as the primary subject; (2) it captures inter-group correlations while maintaining compactness and avoiding redundancy. Gradients also flow from the prediction side back to both anchor and support views. (A minimal code sketch of all three designs follows this list.)
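
Putting the three designs together, here is a minimal, self-contained PyTorch sketch of a ZPressor-style module. This is my reconstruction from the descriptions above, not the authors' code: the FPS seed view, Euclidean distance between camera centers, token-level features, and the attention/normalization layout are all assumptions.

```python
import torch
import torch.nn as nn

def farthest_point_sampling(pos: torch.Tensor, n_anchors: int) -> torch.Tensor:
    """Greedy FPS over camera positions pos (K, 3); returns N anchor indices.
    Euclidean distance between camera centers stands in for the paper's d(., .)."""
    dist = torch.cdist(pos, pos)               # (K, K) pairwise distances
    selected = [0]                             # seed choice is an assumption
    min_dist = dist[0].clone()                 # distance to nearest selected anchor
    for _ in range(n_anchors - 1):
        nxt = int(torch.argmax(min_dist))      # farthest view from the anchor set
        selected.append(nxt)
        min_dist = torch.minimum(min_dist, dist[nxt])
    return torch.tensor(selected, dtype=torch.long)

class FusionBlock(nn.Module):
    """Cross-attention (anchor tokens as queries, support tokens as keys/values),
    followed by self-attention among the anchor tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, anchor: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        anchor = anchor + self.cross(self.n1(anchor), support, support)[0]
        h = self.n2(anchor)
        return anchor + self.self_attn(h, h, h)[0]

def zpressor(feats: torch.Tensor, cam_pos: torch.Tensor,
             blocks: nn.ModuleList, n_anchors: int) -> torch.Tensor:
    """Compress K per-view feature maps (K, L, C) into N anchor states (N, L, C)."""
    anchor_idx = farthest_point_sampling(cam_pos, n_anchors)
    mask = torch.ones(feats.shape[0], dtype=torch.bool)
    mask[anchor_idx] = False
    support_idx = mask.nonzero(as_tuple=True)[0]
    # Assign each support view to its nearest anchor (Voronoi-style clustering).
    assign = torch.cdist(cam_pos[support_idx], cam_pos[anchor_idx]).argmin(dim=1)
    out = []
    for i in range(n_anchors):
        z = feats[anchor_idx[i]].unsqueeze(0)             # (1, L, C) anchor tokens
        members = support_idx[assign == i]
        # If a cluster has no support views, let the anchor attend to itself.
        kv = feats[members].flatten(0, 1).unsqueeze(0) if len(members) else z
        for blk in blocks:                                # stacked fusion blocks
            z = blk(z, kv)
        out.append(z)
    return torch.cat(out, dim=0)                          # compressed state Z

# Toy usage: K=12 views, 64 tokens of width 128 each, compressed to N=6 anchors.
K, L, C, N = 12, 64, 128, 6
feats, cam_pos = torch.randn(K, L, C), torch.randn(K, 3)
blocks = nn.ModuleList([FusionBlock(C) for _ in range(2)])
print(zpressor(feats, cam_pos, blocks, N).shape)  # torch.Size([6, 64, 128])
```

The returned tensor plays the role of the compressed state \(\mathcal{Z}\) that feeds the Gaussian prediction network \(\Psi_{pred}\) in the architecture above.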

Loss & Training

  • IB-principled training objective: \(\mathcal{L} = \mathbb{E}_{\mathcal{Z} \sim p_\theta(\mathcal{Z}|\mathcal{X})}[-\log q_\phi(\mathcal{Y}|\mathcal{Z})] + \beta\, \mathbb{E}_{\mathcal{X}}[\text{KL}[p_\theta(\mathcal{Z}|\mathcal{X}) \,\|\, r(\mathcal{Z})]]\)
  • The prediction term \(-\log q_\phi(\mathcal{Y}|\mathcal{Z})\) is instantiated as a rendering loss (MSE + LPIPS); a sketch follows this list.
  • Rather than an explicit KL penalty, the compression term is enforced structurally: fixing the anchor count \(N\) (to a value the training budget can accommodate) caps the capacity of \(\mathcal{Z}\).
  • All baseline training configurations (learning rate, AdamW optimizer, and other hyperparameters) are strictly followed; no additional data or regularization is introduced.
  • Training context views: 6 (DepthSplat/MVSplat) or 4 (pixelSplat); anchor count set to 6.
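
A minimal sketch of the prediction term as described, using the publicly available `lpips` package; the VGG backbone and the weight `lam` are assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual metric; backbone choice assumed

def rendering_loss(pred: torch.Tensor, gt: torch.Tensor,
                   lam: float = 0.05) -> torch.Tensor:
    """Prediction term of the IB objective: MSE + lam * LPIPS.
    pred, gt: rendered / ground-truth images, (B, 3, H, W) scaled to [-1, 1]."""
    return F.mse_loss(pred, gt) + lam * lpips_fn(pred, gt).mean()

# The compression term carries no explicit loss weight here: the bottleneck
# is imposed structurally by the fixed anchor count N.
```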

Key Experimental Results

Main Results (DL3DV-10K)

| Views | Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| 36 | DepthSplat | 19.23 | 0.666 | 0.286 |
| 36 | DepthSplat + ZPressor | 23.88 (+4.65) | 0.815 | 0.150 |
| 24 | DepthSplat | 20.38 | 0.711 | 0.253 |
| 24 | DepthSplat + ZPressor | 24.26 (+3.88) | 0.820 | 0.147 |
| 12 | DepthSplat | 23.32 | 0.807 | 0.162 |
| 12 | DepthSplat + ZPressor | 24.30 (+0.98) | 0.821 | 0.146 |

Ablation Study

| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | Inference Time (s) | Peak Memory (GB) |
|---|---|---|---|---|---|
| DepthSplat Baseline | 23.32 | 0.808 | 0.162 | 0.260 | 6.80 |
| + ZPressor (Full) | 24.30 | 0.821 | 0.146 | 0.184 | 3.80 |
| + ZPressor (w/o multi-block) | 24.18 | 0.817 | 0.149 | 0.140 | 3.79 |
| + ZPressor (w/o self-attention) | 23.85 | 0.810 | 0.156 | 0.183 | 3.80 |

Key Findings

  • ZPressor's gains are more pronounced with denser input views: +4.65 dB PSNR at 36 views vs. +0.98 dB at 12 views.
  • pixelSplat runs out of memory beyond 8 views; with ZPressor it scales to 36 views (PSNR 26.59).
  • ZPressor not only improves quality but also reduces inference time (0.260s → 0.184s) and peak memory (6.80GB → 3.80GB).
  • Information fusion analysis confirms that removing fusion (w/o fusion: PSNR 23.80) or replacing support views with repeated anchors (fuse anchors: PSNR 24.23) both underperform the default (24.30), validating the contribution of support view information.
  • Bottleneck constraint analysis: small scene coverage (CG50) is sufficiently handled with 7 anchors; more anchors introduce redundancy. Larger scene coverage (CG100) requires more anchors.

Highlights & Insights

  • Applying the IB principle to analyze capacity issues in feed-forward 3DGS provides a theoretical foundation rather than a purely engineering optimization. This principled methodology enables ZPressor to be adapted across different architectures in an architecture-agnostic manner.
  • Simultaneously reducing latency and memory while improving quality is highly compelling: compressing away information redundancy not only saves resources but also removes the interference of redundant information, improving representation quality.
  • The bottleneck constraint analysis revealing the relationship between scene information density and the optimal anchor count is particularly insightful: 7 anchors suffice for small scenes, and adding more introduces redundancy — a perfect demonstration of the compression–preservation trade-off in the IB principle.
  • Consistent improvements across three baseline models with distinct architectures strongly validate the architecture-agnostic claim.
  • With ZPressor, DepthSplat's peak memory drops from 6.80GB to 3.80GB (−44%) and inference time from 0.260s to 0.184s (−29%), representing significant efficiency gains.

Limitations & Future Work

  • Effectiveness is limited under extremely dense inputs (e.g., 1000 views): even compressing them down to 50 anchor views still poses computational challenges. Combining ZPressor with 3D Gaussian merging or memory-efficient rendering warrants exploration.
  • The anchor count \(N\) is currently a manually set hyperparameter and does not adapt to scene information density. A policy network to predict the optimal anchor count could be considered.
  • Evaluation is limited to static scene datasets; applicability to dynamic scenes (4D scene reconstruction) remains unexplored.
  • Compression is performed only along the view dimension; spatial resolution compression is not addressed. Combining both dimensions could further improve scalability.
  • Training still uses a small number of context views (6); whether training with more views is beneficial deserves investigation.
  • Detailed comparison with concurrent work such as FreeSplat is limited.

Comparison with Related Work

  • vs. FreeSplat/GGN: These methods reduce redundancy by merging Gaussians via cross-view projection checks, but lack a principled framework; ZPressor offers a more systematic solution grounded in IB theory.
  • vs. StreamGS: StreamGS merges redundant Gaussians in 3D space, whereas ZPressor compresses information at the encoder stage; the two address redundancy at different levels and are in principle complementary.
  • vs. engineering optimizations (memory-efficient attention, etc.): These only alleviate memory pressure and cannot resolve the performance degradation; ZPressor addresses the root cause at the level of representation learning.

Rating

  • Novelty: ⭐⭐⭐⭐ Analyzing feed-forward 3DGS through the IB principle is a novel perspective, though cross-attention-based compression itself is not particularly new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three baseline models, two large-scale datasets, and comprehensive ablation and efficiency analyses make this highly complete.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical analysis is clear; the logical chain from principle to design to experiments is coherent and complete.
  • Value: ⭐⭐⭐⭐⭐ Addresses the core scalability problem of feed-forward 3DGS; the plug-and-play module offers exceptionally high practical value.