FreeArtGS: Articulated Gaussian Splatting Under Free-Moving Scenario¶
Conference: CVPR 2026 · arXiv: 2603.22102 · Code: https://freeartgs.github.io/ · Area: 3D Vision · Keywords: Articulated object reconstruction, Gaussian splatting, free-moving scenario, joint estimation, motion segmentation
TL;DR¶
FreeArtGS addresses articulated object reconstruction from monocular RGB-D video under a free-moving scenario, where both object pose and joint state change arbitrarily and simultaneously. The proposed three-stage pipeline — motion-driven part segmentation, robust joint estimation, and end-to-end 3DGS optimization — substantially outperforms all baselines on the newly introduced FreeArt-21 benchmark and existing datasets.
Background & Motivation¶
- Background: Articulated object reconstruction is a fundamental problem in 3D vision with significant implications for augmented reality and robot simulation. Existing approaches fall into three categories: (a) single-image generation via foundation models, with limited generalizability; (b) reconstruction from two articulated states captured by fixed multi-view cameras, requiring axis alignment across states; and (c) monocular video reconstruction under the assumption of a stationary base part.
- Limitations of Prior Work: Single-image generation methods lack post-optimization and generalize poorly; multi-view dual-state methods are impractical because aligning the joint axis across the two states is difficult; monocular video methods rely on the "static base" assumption, which is frequently violated in real-world manipulation (e.g., both halves of scissors or pliers move simultaneously), and a base that never moves also leaves part of the object's surface unobserved.
- Key Challenge: In practice, articulated objects are often freely manipulated — object pose and joint state change concurrently with no fixed reference part. Existing methods are fundamentally incapable of handling this most natural usage scenario.
- Goal: To reconstruct the complete appearance, geometry, and joint parameters of articulated objects from monocular RGB-D video under the free-moving scenario.
- Key Insight: Combining dense 2D point tracking priors with 3DGS optimization — point tracking provides motion cues to drive part segmentation, while optimization yields high-precision final reconstruction.
- Core Idea: Point tracking and feature priors are used for free-motion part segmentation; relative transformation sequences are used to estimate joint type and axis; end-to-end 3DGS optimization jointly refines appearance, geometry, and joint parameters.
Method¶
Overall Architecture¶
Input: Monocular RGB-D video with foreground masks (generated by SAM). Output: Canonical Gaussians for two parts \(\mathcal{G}_c^0, \mathcal{G}_c^1\) and joint parameters \(\mathcal{J}\). The pipeline consists of three modules: (1) Free-moving part segmentation — decomposing the articulated object into two rigid parts from motion; (2) Joint estimation — inferring joint type and axis from per-part camera transformations; (3) End-to-end optimization — jointly optimizing appearance, geometry, camera poses, and articulation parameters.
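To make the output representation concrete, here is a minimal sketch of the joint parameters and the per-frame rigid transform they induce on the moving part. The names (`JointParams`, `joint_transform`) are illustrative, not taken from the paper's code release.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class JointParams:
    joint_type: str    # "revolute" or "prismatic"
    axis: np.ndarray   # unit direction u, shape (3,)
    origin: np.ndarray # pivot point o (used by revolute joints), shape (3,)
    states: np.ndarray # per-frame joint state: theta_i (rad) or d_i (m)

def rodrigues(u, theta):
    """Rotation of angle theta about unit axis u (Rodrigues' formula)."""
    K = np.array([[0, -u[2], u[1]], [u[2], 0, -u[0]], [-u[1], u[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def joint_transform(j: JointParams, i: int) -> np.ndarray:
    """4x4 transform J_i mapping the moving part's canonical Gaussians
    to their pose at frame i."""
    T = np.eye(4)
    if j.joint_type == "revolute":
        R = rodrigues(j.axis, j.states[i])
        T[:3, :3] = R
        T[:3, 3] = j.origin - R @ j.origin  # rotate about the pivot point o
    else:                                   # prismatic: slide along u
        T[:3, 3] = j.axis * j.states[i]
    return T
```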
Key Designs¶
- Free-moving Part Segmentation:
- Function: Decompose the articulated object in free-moving video into two rigid parts.
- Mechanism: The core assumption is that, within a short temporal window, each part's motion is approximately an independent rigid-body transformation. AllTracker provides pixel-level 2D trajectories, which are lifted to 3D using depth. DINOv3 features initialize per-point part weights \(w_{t,p} \in [0,1]\). Within a sliding window of 8 frames, two rigid-body transformations \(T^0, T^1\) and the soft part weights are jointly optimized. The main loss is a Huber penalty on each point's motion residual under its assigned part's transformation. Key regularization terms include: an entropy loss encouraging near-binary assignments, a feature-space neighbor-graph smoothness loss for spatial consistency, and a BCE loss against the initialization weights to prevent drift from the semantic prior. (A minimal sketch of this optimization appears after this list.)
- Design Motivation: No part is assumed stationary; segmentation is driven purely by differential relative motion. Feature-space regularization prevents unstable point tracking results from leading to degenerate solutions.
- Joint Estimation:
- Function: Infer joint type (revolute/prismatic) and axis parameters from the sequence of part transformations.
- Mechanism: An off-the-shelf pose estimator provides each part's per-frame camera pose \(E_i^k \in SE(3)\); each part's 3DGS is reconstructed separately with pose optimization. Both parts are registered to a unified coordinate system, using the part with minimal motion as the reference. Joint type is determined from the relative transformation sequence \(\{T_i\}\): a small rotational span combined with strong linearity of the translations indicates a prismatic joint; otherwise, a revolute joint is assumed. For revolute joints, the rotation axis is obtained in closed form from pairwise relative rotations; for prismatic joints, PCA yields the translation axis. Two robustness measures are applied: (a) pairwise relative transformations between adjacent frames \(T_{i \to (i+1)}\) are used instead of absolute transformations; (b) outlier transformations are filtered with a \(2\sigma\) threshold. (A sketch of this stage also follows the list.)
- Design Motivation: Absolute transformations \(T_i\) are highly sensitive to point tracking noise; pairwise relative transformations are more robust. Outlier filtering further improves stability.
- End-to-end Optimization:
- Function: Jointly refine appearance, geometry, camera poses, and articulation parameters.
- Mechanism: Joint parameters are parameterized as revolute (\(u, o, \theta_i\)) or prismatic (\(u, d_i\)). Blended Rendering is introduced: after applying rigid-body transformations to the canonical Gaussians, the two copies are alpha-blended according to the soft part weight \(w \in [0,1]\): \(\mathcal{G}_i = w(\mathcal{G}_c \circ I) \cup (1-w)(\mathcal{G}_c \circ \mathcal{J}_i)\). Supervision includes RGB (L1 + SSIM), depth (L1), and foreground mask (L1), giving the total loss \(\mathcal{L}_{E2E} = \sum_i (\mathcal{L}_{rgb}^i + \lambda_{depth}\mathcal{L}_{depth}^i + \lambda_{mask}\mathcal{L}_{mask}^i)\). (A sketch of Blended Rendering follows the list.)
- Design Motivation: The first two modules provide coarse but reasonable initialization; end-to-end optimization leverages differentiable rendering to tightly couple appearance and kinematics, correcting small errors in the coarse joint estimates. Blended Rendering allows part assignments to be refined at fine granularity during optimization.
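The sliding-window segmentation step boils down to a small joint optimization. Below is a minimal PyTorch sketch under my own simplifications (a single window, a hand-rolled axis-angle pose parameterization; `fit_two_rigid_parts` and `nbr_idx` are illustrative names, not from the paper's code). The loss weights follow the Loss & Training section below.

```python
import torch
import torch.nn.functional as F

def fit_two_rigid_parts(p_src, p_dst, w_init, nbr_idx, iters=100):
    """Jointly fit two rigid transforms and soft part weights to tracked
    3D points. p_src/p_dst: (N, 3) lifted track positions at the two ends
    of a temporal window; w_init: (N,) DINO-based initial part weights in
    (0, 1); nbr_idx: (N, K) feature-space kNN graph."""
    pose = (0.01 * torch.randn(2, 6)).requires_grad_(True)  # axis-angle + translation per part
    logit = torch.logit(w_init.clamp(1e-4, 1 - 1e-4)).requires_grad_(True)
    opt = torch.optim.Adam([pose, logit], lr=1e-2)

    def rodrigues(r):
        """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
        theta = r.norm().clamp_min(1e-8)
        k = r / theta
        K = torch.zeros(3, 3)
        K[0, 1], K[0, 2], K[1, 0] = -k[2], k[1], k[2]
        K[1, 2], K[2, 0], K[2, 1] = -k[0], -k[1], k[0]
        return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

    for _ in range(iters):
        w = torch.sigmoid(logit)               # soft part assignment
        res = []
        for part in range(2):                  # motion residual under each part's transform
            R, t = rodrigues(pose[part, :3]), pose[part, 3:]
            res.append(F.huber_loss(p_src @ R.T + t, p_dst, reduction="none").sum(-1))
        l_main = (w * res[0] + (1 - w) * res[1]).mean()   # Huber main loss, weighted by assignment
        l_ent = -(w * (w + 1e-8).log()                    # entropy: push w toward 0 or 1
                  + (1 - w) * (1 - w + 1e-8).log()).mean()
        l_smooth = (w[:, None] - w[nbr_idx]).abs().mean() # smoothness over feature-space neighbors
        l_init = F.binary_cross_entropy(w, w_init)        # stay near the semantic initialization
        loss = 200 * l_main + 10 * l_smooth + 0.01 * l_ent + 5 * l_init
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logit).detach()
```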
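The joint-estimation stage admits a compact NumPy sketch. This is a simplification of my own: the type test below uses only the rotational span (the paper additionally checks linearity of the translations), and `rot_thresh_deg` is an illustrative placeholder, not the paper's value.

```python
import numpy as np

def estimate_joint_axis(T_rel, rot_thresh_deg=5.0):
    """T_rel: list of 4x4 relative part transforms per frame (moving part
    expressed in the reference part's frame). Returns (type, unit axis)."""
    # (a) Pairwise transforms between adjacent frames, not absolute ones.
    pair = [np.linalg.inv(T_rel[i]) @ T_rel[i + 1] for i in range(len(T_rel) - 1)]

    def log_rotation(R):
        """Axis-angle vector of a rotation matrix."""
        cos = np.clip((np.trace(R) - 1) / 2, -1.0, 1.0)
        theta = np.arccos(cos)
        if theta < 1e-8:
            return np.zeros(3)
        w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
        return theta * w / (2 * np.sin(theta))

    rotvecs = np.array([log_rotation(T[:3, :3]) for T in pair])
    transls = np.array([T[:3, 3] for T in pair])

    # (b) 2-sigma outlier filtering on per-step rotation magnitude.
    mag = np.linalg.norm(rotvecs, axis=1)
    keep = np.abs(mag - mag.mean()) <= 2 * mag.std() + 1e-12

    if np.degrees(mag[keep].max()) < rot_thresh_deg:
        # Prismatic: principal direction of the translations (PCA via SVD).
        X = transls[keep] - transls[keep].mean(0)
        axis = np.linalg.svd(X, full_matrices=False)[2][0]
        return "prismatic", axis / np.linalg.norm(axis)
    # Revolute: average the sign-aligned per-step rotation axes in closed form.
    axes = rotvecs[keep & (mag > 1e-8)]
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    axes[axes @ axes[0] < 0] *= -1   # resolve per-step sign ambiguity
    axis = axes.mean(0)
    return "revolute", axis / np.linalg.norm(axis)
```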
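Blended Rendering can be pictured as duplicating each canonical Gaussian and splitting its opacity by the soft part weight. A minimal sketch, showing means and opacities only (covariance/SH handling, the rasterizer call, and the SSIM term are omitted; the `lam_*` defaults are placeholders, not the paper's values):

```python
import torch

def blended_gaussians(mu_c, opa_c, w, R_j, t_j):
    """Implements G_i = w * (G_c o I)  U  (1 - w) * (G_c o J_i):
    a static copy weighted by w and a copy moved by the frame's joint
    transform J_i = (R_j, t_j) weighted by 1 - w, rendered together."""
    mu = torch.cat([mu_c, mu_c @ R_j.T + t_j], dim=0)
    opa = torch.cat([w * opa_c, (1.0 - w) * opa_c], dim=0)
    return mu, opa

def frame_loss(pred_rgb, pred_depth, pred_mask, gt_rgb, gt_depth, gt_mask,
               lam_depth=0.5, lam_mask=0.5):
    """Per-frame supervision of the end-to-end stage: L1 on RGB, depth,
    and foreground mask (the RGB term also has an SSIM part in the paper,
    omitted here)."""
    l_rgb = (pred_rgb - gt_rgb).abs().mean()
    l_depth = (pred_depth - gt_depth).abs().mean()
    l_mask = (pred_mask - gt_mask).abs().mean()
    return l_rgb + lam_depth * l_depth + lam_mask * l_mask
```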
Loss & Training¶
Part segmentation stage: \(\mathcal{L} = 200\mathcal{L}_{main} + 10\mathcal{L}_{smooth} + 0.01\mathcal{L}_{ent} + 5\mathcal{L}_{init}\), with 100 iterations per frame pair. Part reconstruction and end-to-end optimization each run for 30,000 iterations, implemented on top of NeRFStudio. The full pipeline takes approximately 25 minutes (100-frame 640×360 video, RTX 4090).
Key Experimental Results¶
Main Results (FreeArt-21, Revolute Joints)¶
| Method | Axis↓ (deg) | Position↓ (cm) | State↓ (deg) | CD-w↓ (cm) | CD-m↓ (cm) | PSNR↑ (dB) |
|---|---|---|---|---|---|---|
| ArticulateAnything | 42.00 | 59.38 | - | - | - | - |
| Video2Articulation | 20.00 | 16.31 | 27.37 | 2.29 | 10.74 | - |
| FreeArtGS | 1.04 | 0.29 | 1.43 | 0.14 | 0.28 | 24.02 |
Ablation Study (FreeArt-21, Revolute Joints)¶
| Configuration | Axis↓ (deg) | Position↓ (cm) | State↓ (deg) | CD-w↓ (cm) | PSNR↑ (dB) |
|---|---|---|---|---|---|
| Full model | 1.04 | 0.29 | 1.43 | 0.14 | 24.02 |
| w/o Smooth Loss | 28.01 | 17.73 | 18.74 | 5.72 | 10.60 |
| w/o Init Loss | 9.35 | 19.58 | 14.64 | 0.75 | 13.07 |
| w/o Noise Resistance | 4.75 | 2.22 | 1.30 | 0.17 | 22.65 |
| w/o Blended Rendering | 1.72 | 1.88 | 1.88 | 0.12 | 22.23 |
Key Findings¶
- FreeArtGS achieves roughly a 20× improvement in joint-axis accuracy over Video2Articulation (1.04° vs. 20.00°) and a 56× improvement in axis-position accuracy (0.29 cm vs. 16.31 cm).
- Smooth Loss contributes most: removing it causes axis error to surge from 1.04° to 28.01°, demonstrating that instability in point tracking must be mitigated through feature-space regularization.
- Init Loss is also critical: its removal increases position error from 0.29 cm to 19.58 cm, confirming that semantic priors from DINOv3 features are essential for correct part assignment.
- Noise Resistance (outlier filtering) provides notable improvements in joint estimation robustness.
- Blended Rendering improves PSNR by approximately 2 dB while preserving joint accuracy.
- FreeArtGS also outperforms all methods on the Video2Articulation-S dataset (static base setting), demonstrating generality.
- On six real-world objects, the average axis error is 2.73° and geometric CD is 2.48 cm.
Highlights & Insights¶
- Value of problem formulation: This work is the first to formally define and address articulated object reconstruction under the free-moving scenario — the most natural manipulation setting, and strictly more general than the settings assumed by prior work (static base part, multi-view dual-state).
- Prior + optimization combination: Off-the-shelf models (AllTracker, DINOv3, SAM) provide initialization priors; optimization provides final accuracy. Neither alone suffices — priors are noisy, and pure optimization lacks reliable initialization.
- FreeArt-21 benchmark construction: A VR-based teleoperation system is used to manipulate PartNet-Mobility objects in SAPIEN, generating free-moving data covering 7 categories and 21 objects (5 categories with revolute joints, 2 with prismatic), filling a critical gap in the field.
- 25-minute full pipeline: Processing a 100-frame video requires only 25 minutes (segmentation 6 min + joint estimation 1 min + end-to-end optimization 18 min), offering strong practical viability.
Limitations & Future Work¶
- The current method assumes exactly two rigid parts and cannot handle multi-part articulated structures (e.g., robot arms); sequential capture of each moving part could serve as a potential extension.
- The pipeline depends on multiple off-the-shelf models (AllTracker, DINOv3, SAM, pose estimator); cascading errors may amplify in complex scenes. A unified feed-forward model would be the ideal long-term solution.
- RGB-D input is required; pure RGB video is not currently supported because video depth estimation is not yet accurate enough for this setting.
- While the method exhibits some robustness to hand occlusion during manipulation, severe occlusion may still cause failure.
Related Work & Insights¶
- vs. Video2Articulation: V2A relies on a pretrained feed-forward reconstruction model (MonST3R) to predict dynamics, which fails under the free-moving scenario; FreeArtGS instead segments parts from motion differences via optimization.
- vs. ArticulateAnything: AA employs VLM-based inference to generate URDFs, which is prone to hallucination and yields incorrect axes in most cases; FreeArtGS obtains accurate joints through geometric optimization.
- vs. RSRD: RSRD assumes each part has a distinctive motion pattern, which is unsuitable for articulated objects where part motions are kinematically coupled; it achieves the worst results across all metrics on V2A-S.
- vs. dynamic reconstruction methods: Feed-forward dynamic reconstruction methods (e.g., MonST3R) cannot recover precise motion in the free-moving scenario; FreeArtGS combines feed-forward priors with optimization to bridge this gap.
Rating¶
- Novelty: ⭐⭐⭐⭐ The free-moving setting constitutes a genuinely novel problem formulation; the method is a non-trivial integration of existing techniques (3DGS, point tracking, rigid-body fitting).
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Triple validation via a self-constructed benchmark, existing datasets, and real-world objects; comprehensive ablations with full metric coverage.
- Writing Quality: ⭐⭐⭐⭐ The problem is clearly defined, the method is well-organized, and design motivations for each module are thoroughly articulated.
- Value: ⭐⭐⭐⭐⭐ A new problem, a new benchmark, and strong results with direct applicability to digital twins and robot learning.