# MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
- Conference: ICLR 2026
- arXiv: 2510.19210
- Code: https://cvsp-lab.github.io/MoE-GS
- Area: 3D Vision / Dynamic Scene Reconstruction
- Keywords: 3D Gaussian Splatting, dynamic scene, mixture of experts, novel view synthesis, knowledge distillation

## TL;DR
This paper proposes MoE-GS, the first framework to introduce a Mixture-of-Experts architecture into dynamic Gaussian Splatting. Through a Volume-aware Pixel Router, it adaptively fuses multiple heterogeneous deformation priors (HexPlane / per-Gaussian / polynomial / interpolation), consistently surpassing state-of-the-art methods on the N3V and Technicolor datasets, while maintaining efficiency via single-pass rendering, gate-aware pruning, and knowledge distillation.

## Background & Motivation
Background: Novel view synthesis for dynamic scenes has extended from NeRF to 3DGS, giving rise to a variety of dynamic Gaussian methods: MLP-based deformation networks (4DGaussians, E-D3DGS), polynomial motion models (STG), and interpolation-based approaches (Ex4DGS).
Limitations of Prior Work: Through empirical analysis, the authors identify inconsistencies at three levels: (a) Scene-level — different methods exhibit large performance variation across scenes, with no universally optimal approach; (b) Spatial-level — within a single scene, different regions are best reconstructed by different methods; (c) Temporal-level — the optimal method for a given video changes dynamically across frames.
Key Challenge: Each deformation model embeds a specific inductive bias — HexPlane favors low-motion regions, per-Gaussian embeddings suit fast and consistent optical flow, polynomials handle globally smooth motion, and interpolation addresses locally diverse motion. Real-world scenes typically contain mixed motion patterns that no single method can comprehensively cover.
Goal: To adaptively fuse multiple heterogeneous dynamic Gaussian experts so that the model automatically selects the most appropriate deformation prior for different spatial and temporal regions.
Key Insight: Drawing inspiration from the MoE architecture, each dynamic GS method is treated as an expert, and a router is designed to adaptively fuse experts at the pixel level. The key challenge is that the router must simultaneously perceive 3D volumetric information and 2D pixel information.
Core Idea: By splatting per-Gaussian 3D routing weights into pixel space via differentiable weight splatting, the method achieves volume-aware adaptive expert fusion.

## Method

### Overall Architecture
Stage 1: Each expert is trained independently → Stage 2: Expert parameters are frozen and the Volume-aware Pixel Router is trained → Inference: the router adaptively fuses the rendered outputs of \(N\) experts. Optional post-processing: pruning or distillation.
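Written out, the per-pixel fusion at inference takes the form below, where \(G'_k(u)\) is the router's gating weight and \(I_{E_k}\) is the \(k\)-th expert's render; this is the form implied by the gating and distillation formulas in the next subsection, not an equation quoted verbatim from the paper:

\[
I_{MoE}(u) = \sum_{k=1}^{N} G'_k(u)\, I_{E_k}(u)
\]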

### Key Designs

- **Volume-aware Pixel Router** (see the code sketch after this list):
    - Function: Adaptively assigns expert weights at the pixel level while perceiving 3D volumetric information.
    - Mechanism: Per-Gaussian weights \(\bm{w}_i^{per} = [w_i, w_i^{dir}, (t \cdot w_i^{time})]^T\) (encoding view and time dependency) are learned for each Gaussian, splatted into 2D pixel space to obtain \(w_{2D}(u)\), then refined by a lightweight MLP and normalized via softmax to produce gating weights \(G'_k(u)\).
    - Design Motivation: A Pixel Router (pure 2D MLP) lacks volumetric awareness and produces overly smoothed results; a Volume Router (directly adjusting opacity in 3D space) suffers from unstable optimization. The Volume-aware Pixel Router optimizes in 2D space (stable) while leveraging 3D features (volumetric context).
    - Comparison: Pixel Router 31.12 < Volume Router 32.05 < Volume-aware Pixel Router 33.23 (PSNR).
- **Single-Pass Multi-Expert Rendering** (also covered by the sketch after this list):
    - Function: Merges all experts' Gaussians into a single batch, performing projection and rasterization only once.
    - Mechanism: Each Gaussian is augmented with a one-hot expert identity \(e_j \in \mathbb{R}^K\); during alpha blending, colors are separated by expert identity: \(C_k(u) = \sum_j T_j(u) \alpha_j(u) c_j \cdot (e_j)_k\).
    - Effect: FPS increases from 40 to 68 (Table 5).
- **Gate-Aware Pruning** (an importance-score sketch follows the list):
    - Function: Removes Gaussians that contribute little to the MoE output.
    - Mechanism: The gradient of the gating weights with respect to the per-Gaussian weights is accumulated over the training views: \(\mathcal{E}_i = \frac{1}{|\mathcal{D}|} \sum_v \|\frac{\partial G'_k(v)}{\partial \bm{w}_i^{per}(v)}\|\); Gaussians below a threshold are pruned.
    - Effect: At 55% pruning, PSNR drops by only 0.02 dB, FPS increases from 44 to 83, and memory is reduced from 878 to 351 MB.
- **Knowledge Distillation** (a loss sketch follows the Loss & Training subsection below):
    - Function: Transfers MoE performance to a single expert for lightweight deployment.
    - Mechanism: \(\mathcal{L}_k^{KD} = \lambda \cdot \mathcal{L}(G'_k \cdot I_{E_k}, G'_k \cdot I_{GT}) + (1-\lambda) \cdot \mathcal{L}((1-G'_k) \cdot I_{E_k}, (1-G'_k) \cdot I_{MoE})\): regions with high router weights are supervised by ground truth, while regions with low weights use the MoE output as pseudo-labels.
    - Design Motivation: When \(N \geq 4\), multi-expert inference incurs significant overhead; distilling into a single expert preserves near-MoE performance.
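
Below is a minimal PyTorch sketch of how the Volume-aware Pixel Router and single-pass multi-expert rendering fit together. The splatting of per-Gaussian routing weights is assumed to reuse the same differentiable rasterizer that renders color; the module names, tensor layouts, and MLP width are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VolumeAwarePixelRouter(nn.Module):
    """Turns splatted per-Gaussian routing weights into per-pixel expert gates."""

    def __init__(self, in_dim: int, num_experts: int, hidden: int = 32):
        super().__init__()
        # Lightweight per-pixel MLP, applied as 1x1 convolutions over the image.
        self.mlp = nn.Sequential(
            nn.Conv2d(in_dim, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_experts, kernel_size=1),
        )

    def forward(self, w2d: torch.Tensor) -> torch.Tensor:
        # w2d: (B, in_dim, H, W), the per-Gaussian weights w_i^per splatted
        # into pixel space by the differentiable rasterizer.
        logits = self.mlp(w2d)
        # Softmax over the expert dimension yields the gating weights G'_k(u).
        return F.softmax(logits, dim=1)


def composite_per_expert(colors, alphas, expert_onehot):
    """Front-to-back alpha blending for one pixel, keeping colors separated by
    expert identity: C_k(u) = sum_j T_j * alpha_j * c_j * (e_j)_k."""
    # colors: (J, 3), alphas: (J,), expert_onehot: (J, K) for the J Gaussians
    # overlapping this pixel, already depth-sorted front to back.
    transmittance = 1.0
    per_expert = torch.zeros(expert_onehot.shape[1], 3)
    for c, a, e in zip(colors, alphas, expert_onehot):
        per_expert += transmittance * a * e[:, None] * c[None, :]
        transmittance = transmittance * (1.0 - a)
    return per_expert  # (K, 3): one partial color per expert


def fuse_experts(gates: torch.Tensor, expert_images: torch.Tensor) -> torch.Tensor:
    """Pixel-wise MoE fusion of the per-expert renders."""
    # gates: (B, K, H, W), expert_images: (B, K, 3, H, W)
    return (gates.unsqueeze(2) * expert_images).sum(dim=1)  # (B, 3, H, W)
```

In the actual pipeline the per-expert partial colors \(C_k(u)\) are produced inside one rasterization pass over the merged Gaussian set, rather than by rendering each expert separately.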
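A hedged sketch of the gate-aware importance score: for each training view, the gradient of the gating map with respect to the per-Gaussian routing weights is reduced to a per-Gaussian norm and accumulated, and low-scoring Gaussians are pruned. Differentiating the summed gating map is a simplification of the per-pixel formula \(\mathcal{E}_i\) above, and the 55% ratio merely mirrors the reported experiment.

```python
import torch


def gate_aware_importance(gates: torch.Tensor, per_gaussian_weights: torch.Tensor) -> torch.Tensor:
    """One view's contribution to the importance score of each Gaussian.

    gates: (K, H, W) gating map for this view, computed as a differentiable
    function of per_gaussian_weights: (N, D) routing weights with requires_grad=True.
    """
    (grad,) = torch.autograd.grad(gates.sum(), per_gaussian_weights, retain_graph=True)
    return grad.norm(dim=-1)  # (N,) gradient-norm score for this view


# Accumulated over the training views and averaged, low-scoring Gaussians are dropped:
# importance = torch.stack(per_view_scores).mean(dim=0)
# keep_mask = importance > importance.quantile(0.55)  # e.g. prune ~55% of Gaussians
```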

### Loss & Training
- Training loss: L1 + SSIM (standard 3DGS loss).
- Two-stage training: Stage 1 trains each expert independently; Stage 2 freezes the experts and trains only the router.
- Smaller training budget suffices: MoE-GS trained with only 20% of the budget still outperforms any single expert trained with 100%.
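
A minimal sketch of the distillation loss \(\mathcal{L}_k^{KD}\) defined under Key Designs, assuming `render_loss` is the same L1 + SSIM photometric loss used for training and `gate_k` is the router weight \(G'_k\) rendered for the student expert; the value of the mixing weight `lam` is an assumption.

```python
import torch


def kd_loss(student_img: torch.Tensor, gt_img: torch.Tensor, moe_img: torch.Tensor,
            gate_k: torch.Tensor, render_loss, lam: float = 0.5) -> torch.Tensor:
    """Gated distillation for expert k: ground truth supervises high-gate regions,
    while the MoE render acts as a pseudo-label for low-gate regions."""
    # Images are (3, H, W); gate_k is the (H, W) routing map G'_k and broadcasts.
    gt_term = render_loss(gate_k * student_img, gate_k * gt_img)
    moe_term = render_loss((1.0 - gate_k) * student_img, (1.0 - gate_k) * moe_img)
    return lam * gt_term + (1.0 - lam) * moe_term
```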

## Key Experimental Results

### Main Results
| Method | N3V Avg. PSNR↑ | Technicolor Avg. PSNR↑ |
|---|---|---|
| 4DGaussians | 31.43 | 30.79 |
| E-D3DGS | 32.33 | 33.06 |
| STG | 31.92 | 33.69 |
| Ex4DGS | 32.10 | 33.45 |
| MoE-GS (N=3) | 33.23 | 34.55 |
| MoE-GS (N=4) | 33.27 | — |

On N3V, MoE-GS (N=3) outperforms the strongest single expert, E-D3DGS, by 0.9 dB PSNR; on Technicolor, it surpasses the best single expert (STG) by 0.86 dB.

### Ablation Study
| Router Variant | PSNR↑ | SSIM↑ |
|---|---|---|
| Pixel Router | 31.12 | 0.952 |
| Volume Router | 32.05 | 0.951 |
| Volume-aware Pixel Router | 33.23 | 0.954 |

| Efficiency Strategy | PSNR↑ | FPS↑ | Memory (MB)↓ |
|---|---|---|---|
| w/o both | 32.54 | 36 | 747 |
| Full MoE-GS (N=3) | 33.23 | 68 | 270 |

### Key Findings
- Expert diversity matters: The gain from N=2→3 is significant (+0.69 dB), while N=3→4 yields a marginal improvement (+0.04 dB).
- Low training budgets remain effective: MoE-GS with 20% of the training budget (32.60) still outperforms any single expert trained with 100%.
- Router visualization shows that routing weights semantically correspond to motion patterns — high-motion regions tend to favor the per-Gaussian deformation expert.
- A distilled single expert can achieve performance close to the full MoE (detailed figures are provided in the appendix).

## Highlights & Insights
- Splatting as routing: The method cleverly reuses the 3DGS splatting mechanism for routing weight propagation — learning 3D weights while optimizing and fusing in 2D space — simultaneously achieving volumetric awareness and optimization stability.
- Complementary heterogeneous experts: Different deformation priors (embedding / polynomial / interpolation) each excel in distinct motion regimes; the MoE architecture is naturally suited to exploit such complementarity.
- Complete efficiency toolbox: From single-pass rendering and gate-aware pruning to full knowledge distillation, the framework provides a complete deployment path spanning high quality to high efficiency.

## Limitations & Future Work
- The MoE framework inherently increases parameter count and training cost (\(N\) experts means roughly \(N\times\) the training time, though the budget can be reduced to 20% with only a modest quality drop).
- The two-stage training (experts first, then router) is not joint end-to-end optimization and may not reach the global optimum.
- The expert combination is a manually selected fixed set; automated expert selection or construction is not explored.
- Validation is limited to multi-view video datasets; extension to monocular dynamic scenes remains unexplored.

## Related Work & Insights
- vs. 4DGaussians: 4DGaussians uses HexPlane embeddings for deformation, performing well in low-motion scenes but poorly in high-motion ones; MoE-GS can automatically select the appropriate expert.
- vs. STG: STG models trajectories with polynomials, yielding globally smooth results but insufficient local detail; as one MoE expert, it contributes its global prior.
- vs. E-D3DGS: E-D3DGS is the strongest single baseline on N3V (32.33 dB PSNR), yet MoE-GS reaches 33.23 dB by fusing multiple experts.

## Rating
- Novelty: ⭐⭐⭐⭐ — First to introduce MoE into dynamic GS; the Volume-aware Pixel Router is an elegant design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two standard benchmarks, multiple \(N\) configurations, comprehensive ablations, efficiency analysis, and distillation evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Well-motivated (three-level analysis), with clear method descriptions.
- Value: ⭐⭐⭐⭐ — MoE + GS is a promising direction, though generalizability warrants further validation.