Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

Conference: CVPR 2026
arXiv: 2603.12766
Code: None
Area: 3D Vision
Keywords: 4D editing, 3DGS, dynamic scenes, motion propagation, optimal transport, color uncertainty

TL;DR

This paper proposes Catalyst4D, a framework that propagates the results of mature static 3D editing into 4D dynamic Gaussian scenes. It has two components: Anchor-based Motion Guidance (AMG), which establishes region-level correspondences using optimal transport, and Color Uncertainty-guided Appearance Refinement (CUAR), which automatically identifies and corrects occlusion artifacts. In the reported comparisons, the method achieves the highest CLIP semantic similarity on every listed scene.

Background & Motivation

Background: Static scene editing with 3DGS has reached considerable maturity—methods such as DGE, DreamCatalyst, and SGSST support fine-grained object manipulation and global style transfer with good spatial consistency. 4D scene reconstruction has also advanced significantly (Swift4D, 4DGS, etc.), typically adopting a canonical 3D Gaussian plus a learned deformation field \(\mathcal{F}_\theta\) to represent dynamics.

Limitations of Prior Work: Dynamic 4D scene editing remains highly challenging. Existing methods (Instruct 4D-to-4D, CTRL-D, Instruct-4DGS) primarily rely on 2D diffusion models to edit per-frame images and subsequently fit a 4D representation, leading to: (1) spatial distortion—2D editing lacks geometric reasoning; (2) temporal flickering—inconsistent 2D edits across frames; (3) unintended modification of non-target regions—due to the global influence of 2D diffusion models.

Key Challenge: 3D editing delivers high quality but is limited to static scenes. The deformation network of a 4D representation is trained only on the original geometry. Edited Gaussians, which have been cloned, split, and pruned, drift away from that original distribution, so the network cannot infer their motion: new Gaussians have no motion prior.

Goal: Transfer mature 3D static editing capabilities to 4D dynamic scenes while preserving geometric accuracy and temporal consistency.

Key Insight: Decouple spatial editing from temporal propagation—first edit the first frame using a mature 3D editor, then extend the edited result to all time steps via geometry-aware motion propagation.

Core Idea: Use anchor matching with optimal transport to establish region-level motion correspondences between pre- and post-edit Gaussians, aggregate and propagate known deformations from source Gaussians to edited Gaussians, and apply color uncertainty-driven appearance refinement to correct temporal artifacts.

Method

Overall Architecture

Catalyst4D takes as input an existing 4D reconstruction \((\mathcal{G}_c, \mathcal{F}_\theta)\) and the edited first-frame Gaussians \(\mathcal{G}_{\text{edit}}^1\). The pipeline consists of two stages: (1) the AMG module constructs anchors on both the original first-frame Gaussians \(\mathcal{G}^1\) and the edited Gaussians \(\mathcal{G}_{\text{edit}}^1\), establishes correspondences via optimal transport, and aggregates source Gaussian deformations to propagate motion to the edited Gaussians across all time steps; (2) the CUAR module renders optical flow from the first frame to frame \(t\), warps the first-frame edited image to subsequent frames as pseudo ground truth, estimates per-Gaussian color uncertainty, and selectively refines high-uncertainty regions. The framework is compatible with both Swift4D (multi-camera) and 4DGS (monocular).
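
The warping step in stage (2) can be illustrated with a minimal NumPy sketch. It assumes the rendered flow stores per-pixel (dx, dy) displacements from frame 1 to frame \(t\) and uses a nearest-neighbor forward splat. The function name `forward_warp`, the flow layout, and the handling of holes and collisions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def forward_warp(image, flow):
    """Scatter each first-frame pixel to its flow target (nearest-neighbor splat).

    image: (H, W, C) edited first-frame rendering.
    flow:  (H, W, 2) assumed layout: flow[y, x] = (dx, dy) from frame 1 to frame t.
    Returns the warped image and a boolean mask of target pixels that were hit.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Target coordinates, rounded to the nearest pixel.
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    warped = np.zeros_like(image)
    valid = np.zeros((h, w), dtype=bool)
    # If several source pixels land on the same target, the last write wins.
    warped[ty[inside], tx[inside]] = image[ys[inside], xs[inside]]
    valid[ty[inside], tx[inside]] = True
    return warped, valid
```

Pixels where `valid` is false received no source color and would have to be excluded from supervision in this simplified setting.
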

Key Designs

Anchor-based Motion Guidance (AMG):
- Function: Establish stable region-level motion correspondences between pre- and post-edit Gaussians, avoiding the noise inherent in point-wise matching.
- Mechanism: Anchors are constructed on both the original and edited point clouds. Candidate rays are generated by uniformly sampling point pairs on the minimum bounding sphere, and rays intersecting the local neighborhood \(\mathcal{N}_{ei}\) are identified via a cylinder test with radius \(\delta=\frac{\sqrt{3}}{2}d_{\text{mean}}\). The distance-weighted centroid \(\mathbf{p}=\frac{\sum_{\mathbf{x}\in\mathcal{N}_{ei}}d_x\mathbf{x}}{\sum_{\mathbf{x}\in\mathcal{N}_{ei}}d_x}\) serves as the anchor. The two anchor sets \(A_{\text{src}}, A_{\text{edit}}\) are matched via unbalanced optimal transport (Sinkhorn algorithm) to produce a soft correspondence matrix \(P\in\mathbb{R}^{n\times m}\). The per-frame positional deformation \(\Delta\boldsymbol{\mu}_{\mathbf{g}}^t\) for each edited Gaussian is computed by weighted aggregation of source Gaussian deformations, with weights incorporating opacity and Mahalanobis distance.
- Design Motivation: Anchors are structurally stable, spatially representative region-level reference points that are more robust than point-wise KNN; optimal transport establishes semantically consistent correspondences and naturally prevents cross-semantic motion entanglement (e.g., hand motion erroneously affecting the torso).
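
The anchor-matching step can be sketched with a small unbalanced Sinkhorn loop in NumPy. This is a generic KL-relaxed formulation under stated assumptions: squared Euclidean cost between anchor positions, uniform anchor masses, and the hypothetical parameters `eps` (entropic regularization) and `rho` (marginal relaxation). The summary does not specify the cost or settings the authors use.

```python
import numpy as np

def unbalanced_sinkhorn(anchors_src, anchors_edit, eps=0.05, rho=1.0, n_iters=200):
    """Return a soft correspondence matrix P of shape (n, m).

    anchors_src:  (n, 3) source anchor positions.
    anchors_edit: (m, 3) edited anchor positions.
    """
    n, m = len(anchors_src), len(anchors_edit)
    # Assumed cost: squared Euclidean distance, normalized to [0, 1].
    diff = anchors_src[:, None, :] - anchors_edit[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / max(cost.max(), 1e-12)
    kernel = np.exp(-cost / eps)

    a = np.full(n, 1.0 / n)  # assumed uniform mass on source anchors
    b = np.full(m, 1.0 / m)  # assumed uniform mass on edited anchors
    u, v = np.ones(n), np.ones(m)
    # KL-relaxed marginals soften the usual Sinkhorn scaling updates.
    power = rho / (rho + eps)
    for _ in range(n_iters):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]
```

Because the marginals are only softly enforced, anchors without a good counterpart can carry less mass instead of being forced onto a poor match, which is the usual reason for choosing the unbalanced variant.
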
Color Uncertainty-guided Appearance Refinement (CUAR):
- Function: Identify and correct color artifacts arising from changes in occlusion relationships.
- Mechanism: The deformation field is used to render optical flow maps \(F_{1\to t}^v\) from the first frame to frame \(t\), warping the first-frame edited image to subsequent frames as pseudo ground truth. Per-Gaussian color uncertainty is estimated as \(\xi_t^v=1-\exp(-\|SH(\mathbf{sh},\mathbf{v})_t-SH(\mathbf{sh},\mathbf{v})_1\|_1)\), composited into a pixel-level uncertainty map \(U_t^v\) via \(\alpha\)-blending, and binarized into an artifact mask \(M_t^v=(U_t^v>\epsilon\cdot\text{mean}(U_t^v))\). L1+SSIM refinement loss is applied only within the high-uncertainty masked regions; outside the mask, L1 regularization against the pre-refinement rendering prevents unintended modification.
- Design Motivation: Editing operations inevitably affect interior Gaussians, and changes in occlusion relationships expose them upon motion. Rather than applying a diffusion model for post-hoc inpainting (which would introduce new inconsistencies), CUAR uses the highly reliable first-frame edited result as supervision via geometric warping—preserving consistency with the 3D edit.
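
The uncertainty and mask formulas above can be written out in a few lines of NumPy. The sketch assumes the SH colors have already been evaluated for the view at frame 1 and frame \(t\), composites along a single depth-sorted ray for clarity, and uses a hypothetical default `eps=1.5`, since the summary does not state the value of \(\epsilon\).

```python
import numpy as np

def color_uncertainty(color_t, color_1):
    """xi = 1 - exp(-||c_t - c_1||_1) per Gaussian; inputs are (N, 3) colors."""
    l1 = np.abs(color_t - color_1).sum(axis=-1)
    return 1.0 - np.exp(-l1)

def composite_ray(xi, alpha):
    """Alpha-blend per-Gaussian uncertainty along one ray, sorted front to back."""
    # Transmittance before each Gaussian: product of (1 - alpha) of those in front.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return float((xi * alpha * transmittance).sum())

def artifact_mask(uncertainty_map, eps=1.5):
    """M = U > eps * mean(U); eps=1.5 is an illustrative value only."""
    return uncertainty_map > eps * uncertainty_map.mean()
```

A Gaussian whose color does not change between the two frames gets uncertainty 0, and the value approaches 1 as the L1 color difference grows.
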
Region-decoupled Deformation Aggregation:
- Function: Ensure that each edited Gaussian inherits motion only from its semantically corresponding region.
- Mechanism: For each edited Gaussian \(\mathbf{g}\), its influencing anchors \(A_{\text{edit}}^{\text{sub}}\) are identified; the correspondence mapping locates the source anchors \(A_{\text{src}}^{\text{sub}}\); the source Gaussians \(\mathcal{G}_{\text{src}}^{1,\text{sub}}\) contributing to those source anchors are retrieved and their temporal deformations aggregated. Weights are given by \(w_{\mathbf{g}'}=\sigma_{\mathbf{g}'}\exp(-\frac{1}{2}(\boldsymbol{\mu}_{\mathbf{g}'}-\boldsymbol{\mu}_{\mathbf{g}})^T\boldsymbol{\Sigma}_{\mathbf{g}'}^{-1}(\boldsymbol{\mu}_{\mathbf{g}'}-\boldsymbol{\mu}_{\mathbf{g}}))\).
- Design Motivation: By mediating through anchor-level correspondences, each edited Gaussian receives motion signals only from semantically matched regions, preventing the cross-part motion entanglement characteristic of KNN-based approaches.
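
The weighting formula can be sketched for a single edited Gaussian as follows. The sketch assumes the source subset \(\mathcal{G}_{\text{src}}^{1,\text{sub}}\) has already been retrieved through the anchor correspondences and that the weights are normalized to sum to one before aggregation. The normalization and the function name `aggregate_deformation` are assumptions, not details confirmed by the summary.

```python
import numpy as np

def aggregate_deformation(mu_g, mu_src, cov_src, opacity_src, delta_src):
    """Weighted aggregation of source deformations for one edited Gaussian.

    mu_g:        (3,)      position of the edited Gaussian.
    mu_src:      (K, 3)    positions of the retrieved source Gaussians.
    cov_src:     (K, 3, 3) covariances of the source Gaussians.
    opacity_src: (K,)      opacities (sigma) of the source Gaussians.
    delta_src:   (K, 3)    positional deformations of the sources at frame t.
    """
    diff = mu_src - mu_g
    cov_inv = np.linalg.inv(cov_src)  # batched inverse over the K matrices
    # Squared Mahalanobis distance under each source Gaussian's covariance.
    maha = np.einsum("ki,kij,kj->k", diff, cov_inv, diff)
    weights = opacity_src * np.exp(-0.5 * maha)
    # Assumed normalization; yields a zero deformation if all weights vanish.
    weights = weights / max(weights.sum(), 1e-12)
    return weights @ delta_src
```

Nearby, opaque source Gaussians dominate the result, while distant ones contribute almost nothing because the exponential term decays quickly.
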

Loss & Training

The refinement loss is \(L_{\text{refine}}=(1-\zeta)L_{\text{fore}}+\zeta L_{\text{back}}\), where \(L_{\text{fore}}\) is the L1+SSIM loss (\(\eta=0.2\)) between the rendered image and the warped pseudo ground truth within the masked region, and \(L_{\text{back}}\) is L1 regularization between the rendered image and the pre-refinement rendering outside the mask. Hyperparameters: \(\zeta=0.3\); \(\epsilon\) controls mask coverage. The deformation network is not retrained. Anchor construction takes <30s, Sinkhorn solving ~15s, motion guidance ~1min, CUAR 25–35min, for a total training time of ~50min per scene.
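
A minimal NumPy sketch of the refinement loss is given below. Two things are assumed rather than taken from the summary: the foreground term follows the common 3DGS convention \((1-\eta)L_1+\eta(1-\text{SSIM})\), and SSIM is replaced by a single-window (global) version to keep the example short. A real implementation would use a windowed SSIM inside an autodiff framework.

```python
import numpy as np

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole array (a simplification, see lead-in)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def refine_loss(render, pseudo_gt, pre_render, mask, zeta=0.3, eta=0.2):
    """L_refine = (1 - zeta) * L_fore + zeta * L_back.

    render, pseudo_gt, pre_render: (H, W, C) images in [0, 1].
    mask: (H, W) boolean artifact mask (True = high uncertainty).
    """
    m = mask[..., None].astype(render.dtype)
    channels = render.shape[-1]

    # Foreground: compare against the warped pseudo ground truth inside the mask.
    fg_render, fg_gt = render * m, pseudo_gt * m
    n_fg = max(m.sum() * channels, 1.0)
    l1_fore = np.abs(fg_render - fg_gt).sum() / n_fg
    l_fore = (1 - eta) * l1_fore + eta * (1 - global_ssim(fg_render, fg_gt))

    # Background: stay close to the pre-refinement rendering outside the mask.
    n_bg = max((1 - m).sum() * channels, 1.0)
    l_back = (np.abs(render - pre_render) * (1 - m)).sum() / n_bg

    return (1 - zeta) * l_fore + zeta * l_back
```

With \(\zeta=0.3\), the masked foreground term carries 70% of the weight, so refinement focuses on the flagged artifact regions while the background term only discourages drift elsewhere.
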

Key Experimental Results

Main Results

| Scene | Method | CLIP Sim ↑ | Consistency ↑ | Time ↓ |
|---|---|---|---|---|
| Sear-steak | Catalyst4D | 0.252 | 0.983 | 50min |
| Sear-steak | CTRL-D | 0.249 | 0.985 | 55min |
| Sear-steak | Instruct-4DGS | 0.220 | 0.980 | 40min |
| Sear-steak | IN4D | 0.246 | 0.962 | 2h (2 GPUs) |
| Coffee-martini | Catalyst4D | 0.249 | 0.986 | 50min |
| Coffee-martini | CTRL-D | 0.246 | 0.983 | 55min |
| Trimming | Catalyst4D | 0.251 | 0.967 | 40min |
| Trimming | CTRL-D | 0.248 | 0.962 | 50min |

Ablation Study

| Configuration | CLIP Sim ↑ | Consistency ↑ | Note |
|---|---|---|---|
| Full model | 0.252 | 0.971 | AMG + CUAR complete model |
| w/o AMG | 0.245 | 0.966 | Missing motion guidance degrades semantics and temporal coherence |
| w/o CUAR | 0.248 | 0.969 | Missing appearance refinement causes color artifacts |
| KNN-Guide | — | — | Cross-part motion entanglement (hand motion affects torso) |
| DeformNet-Guide | — | — | Edited Gaussians deviate from training distribution, causing geometric artifacts |

Key Findings

- AMG contributes more than CUAR: removing it reduces CLIP Sim by 0.007 (0.252 to 0.245), whereas removing CUAR reduces it by 0.004 (0.252 to 0.248).
- The KNN baseline exhibits typical cross-semantic motion entanglement (visualized in Figure 6), validating the necessity of region-level anchor correspondences.
- Directly applying the deformation network to infer the motion of edited Gaussians fails, because editing operations cause Gaussians to deviate from the canonical training distribution.
- Catalyst4D achieves the best semantic fidelity (CLIP Sim) on every listed scene and remains highly competitive in temporal consistency.
- Training takes 40 to 50min per scene, which is well below IN4D (2h on 2 GPUs) and on par with CTRL-D (50 to 55min).
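
The ablation deltas quoted above follow directly from the ablation table. The short check below recomputes them from the table values, with the dictionary layout being purely illustrative.

```python
# CLIP Sim values copied from the ablation table.
ablation_clip_sim = {"Full model": 0.252, "w/o AMG": 0.245, "w/o CUAR": 0.248}

full = ablation_clip_sim["Full model"]
# Drop in CLIP Sim when each module is removed, rounded to three decimals.
drops = {
    name: round(full - score, 3)
    for name, score in ablation_clip_sim.items()
    if name != "Full model"
}
print(drops)
```
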

Highlights & Insights

- The decoupled strategy of "edit 3D first, then propagate to 4D" circumvents the difficulties of direct 4D editing and inherits the quality of mature 3D editing methods.
- Optimal transport for establishing region-level correspondences is more stable and semantically consistent than point-wise KNN, which makes it a strong tool for 3D correspondence construction.
- CUAR's color uncertainty estimation automatically identifies regions requiring correction. It needs no additional annotation and directly exploits temporal SH color discrepancies.
- The method supports both monocular and multi-camera settings and is compatible with two 4D representations (Swift4D and 4DGS), which suggests good generality.

Limitations & Future Work

- The upper bound on editing quality is determined by the first-frame 3D editing method: the propagated result is only as good as the 3D input.
- Neither the deformation network nor the Gaussian densities are re-optimized, so motion guidance may fail locally when the underlying 4D reconstruction is of poor quality.
- Severe topological changes (object appearance/disappearance) may challenge anchor correspondences.
- Failure cases occur on the D-NeRF trex scene, where background Gaussians drift into the edited foreground region.
- Evaluation is limited to three datasets; generalization to larger scenes and additional editing types (e.g., lighting, material) requires further investigation.

Comparison with Related Work

- vs. Instruct 4D-to-4D / Instruct-4DGS: These methods rely on 2D diffusion models for per-frame editing and lack precise localization. Catalyst4D starts from 3D editing and directly constrains Gaussians via gradients, achieving more precise localization without modifying non-target regions.
- vs. CTRL-D: CTRL-D adopts a DreamBooth-finetuned 2D-to-4D route and achieves visually close results, but the 2D-to-4D reconstruction gap introduces blurriness and over-smoothing, and non-edited regions (e.g., objects on a table) are unintentionally modified.
- vs. static 3D editing methods (DGE / DreamCatalyst / SGSST): Catalyst4D extends the editing capability of these methods from static to dynamic scenes, so the relationship is complementary rather than competitive.

Rating

Novelty: ⭐⭐⭐⭐ The 3D-to-4D propagation paradigm and the anchor-plus-optimal-transport mechanism represent clear contributions.
Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, four baselines, independent ablations of AMG and CUAR, and honest disclosure of failure cases.
Writing Quality: ⭐⭐⭐⭐ Logic is clear, figures are intuitive, and mathematical notation is rigorous.
Value: ⭐⭐⭐ 4D editing is a frontier problem, but its application scope is relatively narrow; the method offers inspiration for other cross-representation transfer tasks.