CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions

Conference: ICLR 2026
arXiv: 2602.01844
Code: https://github.com/whynot-zyl/CloDS
Area: 3D Vision
Keywords: cloth dynamics, unsupervised learning, Gaussian splatting, differentiable rendering, intuitive physics

TL;DR

CloDS proposes the first framework for unsupervised cloth dynamics learning from multi-view videos. By introducing Spatial Mapping Gaussian Splatting (SMGS) to establish a differentiable mapping between 2D images and 3D meshes, combined with dual-position opacity modulation to address self-occlusion, the method enables a GNN to learn cloth dynamics approaching fully supervised performance without any physical parameter supervision.

Background & Motivation

Background: Deep learning has achieved significant progress in simulating dynamic systems (fluids, cloth, multi-body dynamics), but existing methods heavily rely on known physical properties as supervision signals (e.g., particle positions, mesh node coordinates).

Limitations of Prior Work:
- In real-world scenarios, physical properties (material parameters, environmental conditions) are often unknown, limiting practical applicability.
- Intuitive physics methods (learning dynamics from visual observations) primarily target rigid body interactions and perform poorly on continuum mechanics, especially cloth.
- Dynamic scene novel view synthesis methods fail to generalize to unseen frames; video prediction methods struggle to maintain temporal consistency under frequent self-occlusion.

Key Challenge: Cloth has an infinite-dimensional state space, complex self-occlusion, and large nonlinear deformations. Existing particle representations (e.g., NeuroFluid) are ill-suited for the thin-sheet structure of cloth, while directly applying mesh-based Gaussian Splatting introduces perspective distortion due to self-occlusion.

Goal:
- Define and solve the Cloth Dynamics Grounding (CDG) problem: unsupervised cloth dynamics learning from multi-view videos.
- Design a differentiable 2D↔3D mapping that enables GNN dynamics models to be trained with pixel-level losses.
- Resolve rendering distortions under large deformations and severe self-occlusion.

Key Insight: Decompose the problem into the joint learning of three probabilistic models — rendering \(p(Y_t|M_t)\), inverse rendering \(p(M_t|Y_{1:t})\), and dynamics transition \(p(M_{t+1}|M_t)\) — chained through a Differentiable Visual Computing (DVC) framework.

Core Idea: Establish a differentiable, temporally consistent 2D-3D mapping via mesh-based Gaussian Splatting with dual-position opacity modulation, inverting 3D mesh sequences from video as pseudo-labels for dynamics learning.

Method

Overall Architecture

A three-stage pipeline: (1) construct Gaussian components from the first-frame mesh and multi-view images; (2) recover 3D mesh sequences \(\tilde{M}_{1:T}\) from subsequent frames via backpropagation; (3) train a GNN dynamics model on the recovered mesh sequences.

Key Designs

Spatial Mapping Gaussian Splatting (SMGS):
- Function: Establish a temporally consistent differentiable 2D-3D mapping.
- Mechanism: Anchor Gaussian components to mesh faces, with each Gaussian center determined by barycentric interpolation \(\mu_t = \beta_1 X_{t,1}^W + \beta_2 X_{t,2}^W + \beta_3 X_{t,3}^W\). As the mesh deforms, Gaussians update automatically using the same \(\beta\) coefficients and new vertex positions. Rotation matrices are determined by face normals and Gram-Schmidt orthogonalization; scales are determined by edge lengths.
- Design Motivation: SMGS builds on GaMeS, which cannot handle self-occlusion under large deformations. The key innovation of SMGS is dual-position opacity modulation.
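
The anchoring described above can be illustrated with a small sketch. It assumes one Gaussian per triangle, a thin scale `eps` along the normal, and in-plane scales taken directly from two edge lengths; the helper names (`gaussian_center`, `face_frame`, `face_scales`) and these scale choices are illustrative assumptions, not the paper's or GaMeS's exact parameterization.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    return math.sqrt(dot(v, v))

def normalize(v):
    n = norm(v)
    return tuple(x / n for x in v)

def gaussian_center(beta, tri):
    """mu = beta_1 X_1 + beta_2 X_2 + beta_3 X_3 for one triangle."""
    return tuple(sum(b * v[k] for b, v in zip(beta, tri)) for k in range(3))

def face_frame(tri):
    """Orthonormal frame (r1, r2, r3): face normal plus two in-plane axes."""
    x1, x2, x3 = tri
    e1, e2 = sub(x2, x1), sub(x3, x1)
    r1 = normalize(cross(e1, e2))  # face normal
    r2 = normalize(e1)             # first in-plane axis
    # Gram-Schmidt step: remove the r2 component from the second edge.
    e2_perp = sub(e2, tuple(dot(e2, r2) * c for c in r2))
    r3 = normalize(e2_perp)
    return r1, r2, r3

def face_scales(tri, eps=1e-4):
    """Assumed scales: thin along the normal, edge lengths in the plane."""
    x1, x2, x3 = tri
    return eps, norm(sub(x2, x1)), norm(sub(x3, x1))

# The same beta is reused after the mesh deforms, so the Gaussian follows the face.
beta = (0.2, 0.3, 0.5)
rest_tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
moved_tri = tuple((x, y, z + 2.0) for x, y, z in rest_tri)
center_rest = gaussian_center(beta, rest_tri)
center_moved = gaussian_center(beta, moved_tri)
```

Because the center is a fixed affine combination of the vertices, it is differentiable with respect to vertex positions, which is what lets image-space gradients reach the mesh.
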
Dual-Position Opacity Modulation:
- Function: Dynamically control the opacity of each Gaussian component via \(\alpha_{i,t} = f_\theta(\mu_{i,t}^W, \mu_{i,t}^M)\).
- Mechanism: Opacity is modulated by jointly considering world coordinates (relative position \(\mu^W\)) and mesh coordinates (absolute position \(\mu^M\)), where \(f_\theta\) is an MLP. World coordinates prevent perspective distortion (correctly assigning weights under overlap); mesh coordinates prevent cloth from becoming transparent when it moves to previously unobserved regions.
- Design Motivation: GaMeS does not adjust opacity during motion, leading to incorrect weights in occluded regions and rendering errors. Using world coordinates alone causes transparency in novel regions; combining both addresses both failure modes simultaneously.
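
A minimal sketch of the opacity function \(f_\theta\) follows. The layer sizes, ReLU activations, sigmoid output, random initialization, and the choice of 2D mesh coordinates are all assumptions made for illustration; the source only states that \(f_\theta\) is an MLP taking both positions.

```python
import math
import random

def make_mlp(sizes, seed=0):
    """Random-weight MLP as a list of (weights, biases); sizes like [5, 8, 1]."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers

def opacity(layers, mu_world, mu_mesh):
    """alpha = f_theta(mu^W, mu^M): concatenate both positions, squash to (0, 1)."""
    h = list(mu_world) + list(mu_mesh)
    if len(h) != len(layers[0][0][0]):
        raise ValueError("input size does not match the first layer")
    for i, (w, b) in enumerate(layers):
        h = [sum(wi * x for wi, x in zip(row, h)) + bi for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on hidden layers only
    return 1.0 / (1.0 + math.exp(-h[0]))  # sigmoid keeps opacity in (0, 1)

# Assumed shapes: 3D world position plus 2D mesh-space position, 5 inputs in total.
f_theta = make_mlp([5, 8, 1], seed=0)
alpha = opacity(f_theta, mu_world=(0.1, 0.2, 0.3), mu_mesh=(0.5, 0.5))
```

Feeding both positions into one network means the opacity of a Gaussian can depend on where it currently sits in space and on which part of the cloth it belongs to, which matches the two failure modes described above.
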
Inverse Rendering for 3D Label Recovery:
- Function: Recover 3D mesh node positions from 2D images via backpropagation.
- Mechanism: Given the current mesh \(M_t\), optimize displacement \(\Delta x_t^W\) such that the rendered result \(\tilde{I}_{t+1}\) matches the target image \(Y_{t+1}\). Applied recursively, the full sequence \(\tilde{M}_{1:T}\) can be recovered from the first-frame mesh.
- Design Motivation: No 3D mesh annotations (e.g., physics simulator outputs) are required; multi-view video alone suffices to obtain pseudo-labels for training the dynamics model.
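
The recursive recovery loop can be sketched with a toy stand-in. Everything here is a simplifying assumption: the "renderer" is an orthographic projection of node positions from two views rather than SMGS, the loss is squared error rather than \(\mathcal{L}_1\) plus an edge term, and the gradient is written by hand instead of coming from automatic differentiation. Only the structure (optimize a displacement per frame, then chain frames) mirrors the description above.

```python
VIEWS = (2, 0)  # hypothetical cameras: one drops z (sees x, y), one drops x (sees y, z)

def project(points, drop_axis):
    """Toy orthographic 'render': keep the two coordinates a view can see."""
    return [tuple(c for k, c in enumerate(p) if k != drop_axis) for p in points]

def observe(mesh):
    """Multi-view observation of a mesh: one projection per view."""
    return [project(mesh, axis) for axis in VIEWS]

def recover_next(mesh_t, targets, steps=200, lr=0.2):
    """Gradient descent on a displacement so projections of mesh_t + delta match targets."""
    delta = [[0.0, 0.0, 0.0] for _ in mesh_t]
    for _ in range(steps):
        grad = [[0.0, 0.0, 0.0] for _ in mesh_t]
        for axis, target in zip(VIEWS, targets):
            kept = [k for k in range(3) if k != axis]
            for i, p in enumerate(mesh_t):
                for j, k in enumerate(kept):
                    residual = p[k] + delta[i][k] - target[i][j]
                    grad[i][k] += 2.0 * residual  # d/d(delta) of residual**2
        for i in range(len(mesh_t)):
            for k in range(3):
                delta[i][k] -= lr * grad[i][k]
    return [tuple(p[k] + d[k] for k in range(3)) for p, d in zip(mesh_t, delta)]

def recover_sequence(mesh_1, target_frames):
    """Chain single-step recoveries, starting from the known first-frame mesh."""
    meshes = [mesh_1]
    for targets in target_frames:
        meshes.append(recover_next(meshes[-1], targets))
    return meshes

# Two-node toy "mesh" over three frames; only frame 1 is given in 3D.
true_meshes = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.2, -0.1), (1.0, 0.1, 0.3)],
    [(0.2, 0.5, -0.3), (0.9, 0.3, 0.5)],
]
recovered = recover_sequence(true_meshes[0], [observe(m) for m in true_meshes[1:]])
```

Initializing each step from the previous frame's mesh is what makes the recovery temporally consistent: the optimizer only has to explain the change between consecutive frames.
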

Loss & Training

Stage 1 (Gaussian construction): Standard 3DGS loss \(\mathcal{L}_{\text{render}} = (1-\lambda)\mathcal{L}_1 + \lambda\mathcal{L}_{\text{D-SSIM}}\), with \(\lambda=0.2\).
Stage 2 (mesh recovery): \(\mathcal{L}_{\text{geometry}} = \mathcal{L}_1(\text{SMGS}(\tilde{x}_{t+1}^W), Y_{t+1}) + \gamma\mathcal{L}_{\text{edge}}\), where the edge loss preserves inter-node distances to prevent excessive deformation.
Stage 3 (dynamics training): \(\mathcal{L}_{\text{node}} = \sum_{t=1}^T \text{MSE}(\hat{x}_t^W, x_t^W)\), where \(\hat{x}_t^W\) is the GNN prediction, \(x_t^W\) is the target node position (the mesh recovered in Stage 2 when only video is available), and the rollout length is \(T=8\).
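
The three losses can be sketched in a few lines. Several pieces are assumptions rather than details from the source: SSIM is computed over the whole image in a single window (real implementations use local windows), D-SSIM is taken as `1 - SSIM`, the edge loss is the mean squared deviation of edge lengths from their rest lengths, and `gamma` has no default because its value is not given here. Images are flat lists of intensities in [0, 1].

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def l1_loss(pred, target):
    return mean([abs(p - t) for p, t in zip(pred, target)])

def global_ssim(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (a simplification)."""
    mp, mt = mean(pred), mean(target)
    vp = mean([(p - mp) ** 2 for p in pred])
    vt = mean([(t - mt) ** 2 for t in target])
    cov = mean([(p - mp) * (t - mt) for p, t in zip(pred, target)])
    return ((2 * mp * mt + c1) * (2 * cov + c2)) / ((mp ** 2 + mt ** 2 + c1) * (vp + vt + c2))

def render_loss(pred, target, lam=0.2):
    """Stage 1: (1 - lambda) * L1 + lambda * D-SSIM, with D-SSIM assumed to be 1 - SSIM."""
    return (1.0 - lam) * l1_loss(pred, target) + lam * (1.0 - global_ssim(pred, target))

def edge_loss(nodes, rest_nodes, edges):
    """Assumed form: mean squared change in edge length relative to the rest mesh."""
    return mean([(math.dist(nodes[i], nodes[j]) - math.dist(rest_nodes[i], rest_nodes[j])) ** 2
                 for i, j in edges])

def geometry_loss(rendered, target, nodes, rest_nodes, edges, gamma):
    """Stage 2: image L1 term plus gamma times the edge-length regularizer."""
    return l1_loss(rendered, target) + gamma * edge_loss(nodes, rest_nodes, edges)

def rollout_node_loss(pred_seq, target_seq):
    """Stage 3: sum over rollout steps of the per-frame MSE on node positions."""
    total = 0.0
    for pred, target in zip(pred_seq, target_seq):
        sq = [(a - b) ** 2 for p, t in zip(pred, target) for a, b in zip(p, t)]
        total += mean(sq)
    return total
```

With these definitions, a stretched edge is penalized even if the rendered image happens to match, which is the role the edge term plays in keeping the recovered mesh from deforming excessively.
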

Key Experimental Results

Main Results (Cloth Dynamics Learning, RMSE)

Method	Supervision	Viewed Interp.	Viewed Extrap.	Unviewed Interp.	Unviewed Extrap.
MGN	All mesh	0.1286	0.1291	0.1358	0.1314
MGN*	50 meshes	0.1380	0.1388	0.1460	0.1362
CloDS	50 mesh + 50 video	0.1321	0.1344	0.1399	0.1339
CloDS**	All video	0.1294	0.1307	0.1388	0.1325

Dynamic Scene Novel View Synthesis

Model	PSNR↑	SSIM×10↑	LPIPS×1000↓
4DGS	23.21	9.718	15.82
GaMeS	33.02	9.937	5.21
SMGS (Ours)	36.24	9.959	3.53
3DGS (upper bound)	39.63	9.986	2.53
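
To put the PSNR column in perspective, the sketch below converts the gaps in the table into approximate mean-squared-error ratios using the standard definition \(\text{PSNR} = 10\log_{10}(\text{MAX}^2/\text{MSE})\). This is only indicative: reported PSNR is usually averaged per image, so the ratio of average MSEs need not match exactly. The function names are illustrative.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_from_psnr_gap(gap_db):
    """A PSNR gain of gap_db corresponds to dividing the MSE by this factor."""
    return 10.0 ** (gap_db / 10.0)

# PSNR values copied from the table above.
PSNR = {"4DGS": 23.21, "GaMeS": 33.02, "SMGS": 36.24, "3DGS": 39.63}

# Positive gap: SMGS is better than the named method; negative: worse.
gaps_db = {name: round(PSNR["SMGS"] - value, 2)
           for name, value in PSNR.items() if name != "SMGS"}
mse_factor_vs_games = mse_ratio_from_psnr_gap(gaps_db["GaMeS"])
mse_factor_vs_4dgs = mse_ratio_from_psnr_gap(gaps_db["4DGS"])
```

Under this reading, the 3.22 dB gain over GaMeS corresponds to roughly halving the squared pixel error, and the 13.03 dB gain over 4DGS to roughly a twentyfold reduction.
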

Key Findings

CloDS learns dynamics from video approaching fully supervised performance: CloDS** trained on all videos achieves RMSE within about 2.2% of MGN trained on all meshes on every split (relative gaps range from 0.6% to 2.2%), demonstrating the viability of the unsupervised approach.
SMGS substantially outperforms baseline rendering methods: PSNR exceeds GaMeS by 3.22 dB and 4DGS by 13.03 dB, while remaining 3.39 dB below the 3DGS upper bound.
Both components of dual-position opacity modulation are indispensable: removing world coordinates causes perspective distortion; removing mesh coordinates causes the cloth to become transparent when it moves into previously unobserved regions.
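Strong generalization: On cloth with unseen initial states, the extrapolation RMSE of CloDS** is only about 1.4% higher than on viewed configurations (0.1325 vs. 0.1344 is not the comparison; the relevant pair is 0.1325 vs. 0.1307), and for the mixed-supervision CloDS variant it is marginally lower (0.1339 vs. 0.1344).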
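
The percentages quoted in these findings can be checked directly from the RMSE table. The snippet below is a small verification sketch; the dictionary layout and function name are illustrative, and the numbers are copied from the main results table.

```python
SPLITS = ("viewed_interp", "viewed_extrap", "unviewed_interp", "unviewed_extrap")

# RMSE values copied from the main results table.
RMSE = {
    "MGN":     (0.1286, 0.1291, 0.1358, 0.1314),
    "MGN*":    (0.1380, 0.1388, 0.1460, 0.1362),
    "CloDS":   (0.1321, 0.1344, 0.1399, 0.1339),
    "CloDS**": (0.1294, 0.1307, 0.1388, 0.1325),
}

def rel_gap_pct(value, reference):
    """Relative difference of value over reference, in percent."""
    return 100.0 * (value - reference) / reference

# Gap between video-only CloDS** and fully supervised MGN on each split.
gap_to_mgn = {split: rel_gap_pct(RMSE["CloDS**"][i], RMSE["MGN"][i])
              for i, split in enumerate(SPLITS)}

# Unviewed vs. viewed extrapolation for the two CloDS variants.
extrap_shift = {name: rel_gap_pct(RMSE[name][3], RMSE[name][1])
                for name in ("CloDS", "CloDS**")}

for split, gap in gap_to_mgn.items():
    print(f"{split}: CloDS** is {gap:.2f}% above MGN")
```

The largest gap to MGN occurs on unviewed interpolation, and the unviewed-versus-viewed extrapolation shift is small for both CloDS variants.
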

Highlights & Insights

Introducing the DVC framework for cloth dynamics is a far-sighted contribution: It establishes a complete closed loop of rendering → inverse rendering → dynamics, enabling physical dynamics to be learned from only the first-frame mesh and multi-view video. This paradigm is transferable to any scenario requiring physical model learning from visual observations.
Dual-position opacity modulation is simple yet critical: By feeding both world and mesh coordinates into a single MLP to modulate opacity, the method resolves rendering distortion under self-occlusion with minimal computational overhead.
The decoupled three-stage training design is highly practical: Stages 1 and 2 require no temporal loss (single-step training); only Stage 3 uses rollout. This allows the first two stages to be parallelized, substantially reducing training complexity.

Limitations & Future Work

Validation is conducted solely on Blender synthetic datasets; noise, lighting variation, and multi-object interaction in real-world scenes remain insufficiently tested.
The method requires a first-frame mesh as a prior, which may be unavailable in real-world settings.
The GNN dynamics model (MGN) is fixed; stronger dynamics models (e.g., Transformer-based) are not explored.
Only single-piece cloth is handled; cloth–object interactions (e.g., dressing) are not considered.
Cloth trajectories in the dataset are relatively simple (flag); more complex scenarios (e.g., garment folding) may require substantially more training data.

Comparison with Related Work

vs. NeuroFluid: NeuroFluid learns fluid dynamics via particle representations and differentiable rendering, but particles are ill-suited for the thin-sheet structure of cloth; CloDS adopts a mesh representation more appropriate for cloth.
vs. GaMeS: Both use mesh-based Gaussian Splatting, but GaMeS does not handle opacity changes during motion, leading to incorrect rendering in occluded regions.
vs. 4DGS/M5D-GS: 4D dynamic scene methods perform poorly under large deformations and do not learn generalizable dynamics models.

Rating

Novelty: ⭐⭐⭐⭐⭐ Pioneers the CDG problem and a corresponding solution framework; dual-position opacity modulation is a notable innovation.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across multiple tasks, but validation is limited to synthetic data.
Writing Quality: ⭐⭐⭐⭐ Mathematical modeling is clear, though notation is somewhat dense.
Value: ⭐⭐⭐⭐ Opens a new direction for learning cloth dynamics from visual observations, with potential transfer to robotic manipulation and related applications.