ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness¶
Conference: ICCV 2025
arXiv: 2503.10624
Code: https://boqian-li.github.io/ETCH
Area: 3D Vision / Human Body Reconstruction / Clothed Human Body Fitting
Keywords: Body Fitting, Clothed Humans, SE(3) Equivariance, Tightness Vector, SMPL, Point Cloud
TL;DR¶
This paper proposes ETCH, a framework that models SE(3)-equivariant tightness vectors from clothing surfaces to body surfaces, reducing clothed human body fitting to a tightness-aware sparse marker fitting task. On the CAPE and 4D-Dress datasets, ETCH reduces body fitting error (V2V and MPJPE) on loose garments by 16.7%–69.5% and improves shape accuracy by an average of 49.9% compared with state-of-the-art methods, both tightness-agnostic and tightness-aware.
Background & Motivation¶
Problem Definition¶
Given a 3D point cloud of a clothed human, the goal is to fit the underlying SMPL parametric body model (pose \(\theta\), shape \(\beta\), translation \(t\)). This task is critical for applications such as motion capture, virtual try-on, and immersive telepresence.
Limitations of Prior Work¶
Optimization-based methods (e.g., NICP): Rely on multi-stage pipelines (multi-view rendering → 2D keypoints → triangulation → SMPL optimization), are sensitive to pose initialization, and suffer from cascading errors when 2D keypoint detection fails under loose garments.
Tightness-agnostic learning methods (e.g., ArtEq): Achieve pose generalization via joint-level SE(3) equivariance but cannot handle cases where clothing deviates significantly from the body (e.g., skirts, puffy jackets), as they directly regress body parameters.
Tightness-aware methods (e.g., IPNet/PTF): Attempt to disentangle clothing layers but use scalar tightness representations (UV maps or double-layer occupancy fields), which still struggle under out-of-distribution poses and shapes.
Core Insight¶
Although clothing does not strictly follow joint motion, the displacement vectors from the outer clothing surface to the body surface approximately satisfy SE(3) equivariance locally. Combining equivariance with tightness modeling simultaneously achieves pose generalization and clothing disentanglement.
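The equivariance property behind this insight can be checked numerically in the idealized case where clothing and body undergo the same rigid motion. The sketch below uses made-up toy points (not data from the paper) and shows that the cloth-to-body displacement vectors rotate with the input while translation cancels. Real garments satisfy this only approximately and locally, which is the paper's point.

```python
import numpy as np

def random_rotation(rng):
    """Draw a proper rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:          # flip one axis so that det(q) = +1
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
cloth = rng.normal(size=(5, 3))       # hypothetical outer-surface points
body = 0.8 * cloth                    # hypothetical inner-surface points
tightness = body - cloth              # displacement vectors, cloth -> body

R, t = random_rotation(rng), rng.normal(size=3)
cloth_moved = cloth @ R.T + t         # same rigid motion applied to both layers
body_moved = body @ R.T + t
tightness_moved = body_moved - cloth_moved

# Translation cancels; the vectors rotate with the input and keep their length.
equivariant = np.allclose(tightness_moved, tightness @ R.T)
same_length = np.allclose(np.linalg.norm(tightness_moved, axis=1),
                          np.linalg.norm(tightness, axis=1))
```

Because the length is unchanged under the motion, it makes sense to predict magnitude from invariant features and direction from equivariant ones, as the method section describes.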
Method¶
Overall Architecture¶
ETCH operates in two stages:
1. Equivariant tightness vector prediction: Input point cloud \(X\) → EPN network extracts SO(3)-equivariant/invariant features → predicts tightness direction \(D\) (equivariant features), magnitude \(B\) (invariant features), label \(L\), and confidence \(C\) (invariant features + Point Transformer).
2. Marker aggregation and SMPL optimization: Points on the outer surface are projected toward the inner body surface along tightness vectors → grouped by label and aggregated via confidence-weighted voting into 86 sparse markers → Levenberg–Marquardt optimization of SMPL parameters.
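The projection step that links the two stages can be sketched as follows. This is a minimal illustration that assumes the network outputs are already available as arrays; the function name `project_to_body` and the toy values are hypothetical and not taken from the ETCH code.

```python
import numpy as np

def project_to_body(points, directions, magnitudes):
    """Move each outer-surface point along its tightness vector v_i = b_i * d_i."""
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return points + magnitudes[:, None] * directions

# Toy input: three cloth points lying 2, 5 and 10 cm outside a body surface at x = 0.10.
points = np.array([[0.12, 0.0, 0.0], [0.15, 0.3, 0.0], [0.20, 0.6, 0.0]])
directions = np.tile([-1.0, 0.0, 0.0], (3, 1))   # all pointing back toward the body
magnitudes = np.array([0.02, 0.05, 0.10])
inner = project_to_body(points, directions, magnitudes)
```

All three projected points land on the plane x = 0.10, regardless of how far the garment sat from the body.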
Key Design 1: Equivariant Tightness Vector¶
The tightness vector \(\mathbf{v}_i = b_i \mathbf{d}_i\) consists of a direction and a magnitude:
- Direction prediction (SE(3)-equivariant): Uses EPN's SO(3)-equivariant features \(\mathbf{f}^{equiv} \in \mathbb{R}^{N \times 60 \times C}\); a self-attention network learns weights \(w_{ij}\) over the rotation group dimension, combines 60 discrete rotation matrices via weighted summation, projects onto SO(3) via SVD to obtain per-point rotation matrices \(\hat{\mathcal{R}}_i\), and multiplies by a unit vector to yield the direction.
- Magnitude prediction (invariant features): Mean pooling over equivariant features yields invariant features \(\mathbf{f}^{inv}\); contextual information is captured via Point Transformer before regressing the magnitude value.
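The "weighted sum, then SVD projection" step for the direction can be illustrated in isolation. The sketch below is an assumption-laden toy: three rotations about the z-axis stand in for the 60 discrete group elements, the weights are made up rather than produced by attention, and the choice of reference unit vector is arbitrary.

```python
import numpy as np

def project_to_so3(M):
    """Nearest rotation (in Frobenius norm) to a 3x3 matrix, via SVD."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # rule out reflections
    return U @ D @ Vt

def rot_z(angle):
    """Rotation by `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

anchors = np.stack([rot_z(a) for a in (0.0, 0.4, 0.8)])  # stand-in for the discrete rotations
weights = np.array([0.2, 0.5, 0.3])                      # hypothetical weights, summing to 1
blended = np.einsum("j,jab->ab", weights, anchors)       # weighted sum, generally not a rotation
R_hat = project_to_so3(blended)                          # per-point rotation estimate
direction = R_hat @ np.array([0.0, 1.0, 0.0])            # rotate a fixed unit vector
```

The projection matters because a convex combination of rotation matrices shrinks vectors; after projecting back onto SO(3), the resulting direction has unit length again.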
Key Design 2: Marker Labels and Confidence¶
- Label \(L\): An 86-class classification task indicating which body marker each outer-surface point corresponds to; implemented with Point Transformer + softmax.
- Confidence \(C\): Computed via group convolution and soft aggregation; ground-truth confidence is defined by exponential decay based on geodesic distance: \(c_i = \exp(-\lambda \times g(\mathbf{m}_k, \mathbf{y}_j; \mathcal{S}_Y))\).
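The ground-truth confidence is a plain exponential decay, which the following sketch evaluates for a few distances. The decay rate and the distances are hypothetical values chosen for illustration; the geodesic distances themselves would come from a mesh-processing step that is not shown here.

```python
import numpy as np

def confidence_from_geodesic(geodesic_dist, lam):
    """c = exp(-lam * g): equals 1 at the marker and decays with geodesic distance."""
    return np.exp(-lam * np.asarray(geodesic_dist, dtype=float))

lam = 10.0                                  # hypothetical decay rate
dist = np.array([0.0, 0.05, 0.10, 0.30])    # hypothetical geodesic distances to the marker
conf = confidence_from_geodesic(dist, lam)
```

A larger \(\lambda\) concentrates confidence on points very close to the marker, while a smaller one lets more distant points contribute to the later voting step.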
Key Design 3: Sparse Marker Aggregation¶
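For each marker \(k\), the top-\(m\) highest-confidence inner points whose predicted labels match \(k\) are selected and aggregated via confidence-weighted pooling:
\[\hat{\mathbf{m}}_k = \frac{\sum_{i \in \mathcal{T}_k} c_i^{\alpha}\, \hat{\mathbf{y}}_i}{\sum_{i \in \mathcal{T}_k} c_i^{\alpha}}\]
where \(\mathcal{T}_k\) denotes the selected points, \(\hat{\mathbf{y}}_i\) the predicted inner point, and \(c_i\) its confidence.
The exponent \(\alpha\) further amplifies the influence of high-confidence points. SMPL parameters are then optimized over the aggregated markers using the Levenberg–Marquardt algorithm.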
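The aggregation can be sketched as a per-marker weighted vote. This is an illustrative reading of the description above, not the authors' implementation: the function name, the default `top_m` and `alpha` values, and the use of NaN for markers that receive no votes are all assumptions.

```python
import numpy as np

def aggregate_markers(inner_points, labels, confidence, num_markers, top_m=3, alpha=2.0):
    """Confidence-weighted vote per marker; NaN rows mark markers with no votes."""
    markers = np.full((num_markers, 3), np.nan)
    for k in range(num_markers):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue
        idx = idx[np.argsort(confidence[idx])[::-1][:top_m]]  # keep the top-m most confident
        w = confidence[idx] ** alpha                           # sharpen the weights
        markers[k] = (w[:, None] * inner_points[idx]).sum(axis=0) / w.sum()
    return markers

# Toy vote: marker 0 has two confident points and one low-confidence outlier.
inner_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
labels = np.array([0, 0, 0, 1])
confidence = np.array([0.9, 0.9, 0.1, 0.5])
markers = aggregate_markers(inner_points, labels, confidence, num_markers=3, top_m=2)
```

In this toy case the outlier at x = 10 is dropped by the top-\(m\) selection, so marker 0 is the midpoint of the two confident points. A marker with no votes stays undefined, which mirrors the failure mode on partial inputs noted under the limitations.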
Loss & Training¶
Multi-task supervision (non-end-to-end training):
\[\mathcal{L} = w_d \mathcal{L}_d + w_b \mathcal{L}_b + w_l \mathcal{L}_l + w_c \mathcal{L}_c\]
- \(\mathcal{L}_d\): Cosine loss for direction
- \(\mathcal{L}_b\): MSE loss for magnitude
- \(\mathcal{L}_l\): Cross-entropy loss for labels
- \(\mathcal{L}_c\): MSE loss for confidence
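A framework-free sketch of the weighted sum is given below. Several details are assumptions: the cosine loss is written as one minus cosine similarity, the dictionary keys are invented for this example, and the unit weights are placeholders because the note does not report the values of \(w_d, w_b, w_l, w_c\).

```python
import numpy as np

def etch_style_loss(pred, target, weights):
    """Weighted sum of the four per-point terms; returns (total, per-term dict)."""
    d_hat = pred["dir"] / np.linalg.norm(pred["dir"], axis=1, keepdims=True)
    d_gt = target["dir"] / np.linalg.norm(target["dir"], axis=1, keepdims=True)
    logits = pred["label_logits"] - pred["label_logits"].max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    terms = {
        "d": np.mean(1.0 - np.sum(d_hat * d_gt, axis=1)),                    # cosine loss
        "b": np.mean((pred["mag"] - target["mag"]) ** 2),                    # MSE on magnitude
        "l": -np.mean(log_prob[np.arange(len(log_prob)), target["label"]]),  # cross-entropy
        "c": np.mean((pred["conf"] - target["conf"]) ** 2),                  # MSE on confidence
    }
    return sum(weights[k] * terms[k] for k in terms), terms

# One point, two label classes: everything is predicted perfectly except the label,
# whose logits are uniform, so only the cross-entropy term is non-zero.
target = {"dir": np.array([[0.0, 0.0, 1.0]]), "mag": np.array([0.05]),
          "label": np.array([1]), "conf": np.array([1.0])}
pred = {"dir": np.array([[0.0, 0.0, 2.0]]), "mag": np.array([0.05]),
        "label_logits": np.array([[0.0, 0.0]]), "conf": np.array([1.0])}
weights = {"d": 1.0, "b": 1.0, "l": 1.0, "c": 1.0}   # placeholder weights
total, terms = etch_style_loss(pred, target, weights)
```

The direction term ignores vector length by construction, which is why the prediction `[0, 0, 2]` incurs no direction penalty against the target `[0, 0, 1]`.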
Key Experimental Results¶
Datasets¶
- CAPE: 15 subjects, tight garments, 26K training / 1K validation frames, cross-subject split.
- 4D-Dress: 32 subjects / 64 outfits, loose garments with large dynamics, 59K training / 1.9K validation frames, cross-motion-sequence split.
Main Results¶
| Method | Type | CAPE V2V↓ | CAPE MPJPE↓ | 4D-Dress V2V↓ | 4D-Dress MPJPE↓ |
|---|---|---|---|---|---|
| NICP | Agnostic | 1.726 | 1.343 | 4.754 | 3.654 |
| ArtEq | Agnostic | 2.200 | 1.557 | 2.328 | 1.657 |
| IPNet | Aware | 2.593 | 1.917 | 3.826 | 2.625 |
| PTF | Aware | 2.036 | 1.497 | 2.796 | 2.053 |
| ETCH | Aware | 1.647 | 0.922 | 1.939 | 1.116 |
Key findings: ETCH achieves the best performance across all datasets and metrics. On 4D-Dress, MPJPE is reduced by 32.6% relative to ArtEq and 45.6% relative to PTF.
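The relative reductions quoted here follow directly from the table. The short check below recomputes them from the 4D-Dress MPJPE column; the helper name `relative_reduction` is introduced only for this example.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# 4D-Dress MPJPE values copied from the main results table.
mpjpe_4ddress = {"NICP": 3.654, "ArtEq": 1.657, "IPNet": 2.625, "PTF": 2.053, "ETCH": 1.116}
reductions = {
    name: round(relative_reduction(value, mpjpe_4ddress["ETCH"]), 1)
    for name, value in mpjpe_4ddress.items() if name != "ETCH"
}
```

This yields 32.6 for ArtEq and 45.6 for PTF, matching the figures above. The same formula applied to NICP gives 69.5, the upper end of the range quoted in the TL;DR.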
Ablation Study¶
| Setting | Tightness | Correspondence | Direction Features | CAPE V2V↓ | 4D-Dress V2V↓ |
|---|---|---|---|---|---|
| Ours (full) | Vector | Sparse Marker | Equivariant | 1.647 | 1.939 |
| Ours-A (no equivariance) | Vector | Sparse Marker | XYZ | 1.661 | 2.033 |
| Ours-C (dense correspondence) | Vector | Dense | Equivariant | 1.909 | 2.285 |
| Ours-D (scalar tightness) | Scalar | Dense | Equivariant | 1.777 | 2.410 |
| Ours-E (invariant only) | Vector | Sparse Marker | Invariant | 1.888 | 2.842 |
Key findings:
1. Sparse markers vs. dense correspondence: sparse markers reduce V2V by 13.7% / 15.1% on CAPE / 4D-Dress.
2. In a one-shot setting (~1% of training data), equivariant features reduce direction error by 67.2%–89.8%, demonstrating strong out-of-distribution generalization.
3. Vector tightness substantially outperforms scalar tightness on loose garments (4D-Dress).
4. Shape accuracy (β parameter MAE) improves by an average of 49.9%.
Challenging Subsets (4D-Dress)¶
| Challenge Type | ETCH V2V↓ | 2nd-best V2V↓ | Gain |
|---|---|---|---|
| Loose garments | 2.276 | 3.264 (PTF) | 30.3% |
| Extreme shapes | 1.831 | 2.137 (ArtEq) | 14.3% |
| Challenging poses | 1.992 | 2.420 (ArtEq) | 17.7% |
Highlights & Insights¶
- Novelty of vector tightness: Unlike TightCap's scalar UV map or IPNet's double-layer occupancy field, modeling tightness as a displacement vector from clothing to body is inherently directional and correctly points toward the inner body surface.
- Complementarity of equivariance and tightness: ArtEq's joint-level equivariance cannot handle loose garments, while scalar tightness lacks directional information. The core insight of ETCH is that cloth-to-body displacement vectors satisfy approximate local SE(3) equivariance, enabling the two properties to complement each other.
- Voting mechanism via sparse markers: Compared to per-point dense correspondence optimization, sparse markers with confidence-weighted aggregation form a voting strategy that is robust to outliers.
- No additional priors required: Unlike IPNet/PTF/NICP, which require VPoser pose priors or shape regularization, ETCH optimizes directly over 86 markers and achieves superior results.
- Strong one-shot generalization: Equivariant features yield correct direction predictions even when trained on only ~1% of the data.
Limitations & Future Work¶
- Failure on partial inputs: Missing regions in point clouds prevent marker capture, leading to fitting failures.
- Fine structures not supported: The current marker layout does not cover finger and facial details; extension to SMPL-X requires a unified framework for handling multi-scale receptive fields.
- Unknown scalability: While performance is strong on current medium-scale datasets, it is unclear whether the approach saturates under billion-scale scan–body paired data.
- Double-edged nature of Chamfer post-optimization: Post-optimization benefits tight garments (CAPE) but degrades results for loose garments (4D-Dress) by inflating the body toward the outer clothing surface.
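The last limitation can be reproduced in a toy setting. The sketch below assumes a symmetric, unsquared Chamfer distance and models the body and scan as concentric circles; the radii are arbitrary, and a grid search over a single radius stands in for the actual post-optimization over SMPL parameters.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances, (|a|, |b|)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy cross-section: a unit-circle "body" and scans at two clothing offsets.
angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(angles), np.sin(angles), np.zeros_like(angles)], axis=1)
true_radius = 1.0
tight_scan = 1.02 * circle     # tight garment: scan almost on the body
loose_scan = 1.30 * circle     # loose garment: scan far outside the body

def best_radius(scan):
    """Body radius that minimises the Chamfer distance to the scan (grid search)."""
    radii = np.linspace(0.8, 1.5, 71)
    return float(radii[np.argmin([chamfer_distance(r * circle, scan) for r in radii])])

inflation_tight = best_radius(tight_scan) - true_radius
inflation_loose = best_radius(loose_scan) - true_radius
```

Minimizing Chamfer distance to the scan always pulls the fitted body onto the clothing surface. With a tight garment the resulting inflation is small, but with a loose one the body grows by the full clothing offset, which is consistent with the degradation reported on 4D-Dress.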
Related Work & Insights¶
- The distinction between registration and fitting is noteworthy: registration focuses on matching the outer surface, while fitting targets alignment of the inner body surface — a fundamental difference under loose garments.
- There remains significant room for applying equivariant networks (VN/TFN/SE3-Transformer) to non-rigid human body scenarios; ETCH demonstrates the practical utility of "approximate local equivariance."
- Synthetic data or generative 3D human models may offer a viable path to scaling training data.
Rating ⭐⭐⭐⭐¶
- Novelty: ⭐⭐⭐⭐⭐ (Vector-form equivariant tightness is a novel and effective design that elegantly combines two orthogonal ideas.)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive ablations, challenging subsets, one-shot experiments, and qualitative results yield compelling conclusions.)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, precise terminology, and a clever acronym.)
- Value: ⭐⭐⭐⭐ (Trained on a single RTX 4090 in 4 days; marker fitting takes 5 seconds per sample — practically deployable.)