NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
Conference: ECCV 2024
arXiv: 2407.12727
Area: Image Generation
TL;DR
This paper proposes NL2Contact, the first method to use natural language descriptions for controllable modeling of 3D hand-object contact maps. A staged diffusion model generates the hand pose and then the contact maps from text. The paper also constructs ContactDescribe, the first hand-object contact dataset with fine-grained language descriptions.
Background & Motivation
Hand-object contact modeling is crucial for applications such as animation, VR/AR, and robotic grasping. Existing methods (e.g., ContactOpt, S2Contact) rely on geometric constraints to infer contact from point clouds, but suffer from two core problems:
- Uncontrollable: These methods cannot specify or control contact patterns. Their results tend toward "full-grasp" modes in which all fingers touch the object, which deviates from how people actually use objects (e.g., using scissors requires only two fingers).
- Lack of Semantics: Existing intent labels (verbs, object affordances) are too coarse-grained to describe contact patterns precisely.
To address this, the paper introduces a new task: guiding 3D hand-object contact modeling with natural language, using the expressive power of language to control contact more precisely.
Method
Overall Architecture
NL2Contact consists of three core modules:
- Text-to-Hand-Object Fusion Module: Fuses cross-modal features of text, hand pose, and object point clouds.
- Staged Latent Diffusion Module: A two-stage diffusion model that first generates the hand pose and then the contact map.
- Contact Optimization: Iteratively optimizes hand pose parameters using the generated contact maps.
The input consists of an initial hand pose \(\widetilde{\mathcal{H}}\), an object point cloud \(\mathcal{O} \in \mathbb{R}^{2048 \times 3}\), and a text description \(\mathbb{T}\). The output is the contact probability maps on both the hand and the object.
Key Designs
ContactDescribe Dataset Construction:
- Based on the ContactPose dataset (2,300 grasping instances, 25 types of everyday objects, 50 participants).
- Multi-level (coarse-to-fine) language descriptions are designed: high-level describes the grasping action, mid-level describes the grasp types and finger states, and low-level specifies the location of contacting joints.
- Assisted by ChatGPT, diverse natural language descriptions were generated, with 5 different descriptions for each grasping instance, totaling 11,500 descriptions.
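The coarse-to-fine structure above can be pictured as a small record with one field per level. This is only an illustrative sketch: the class name `ContactDescription`, its field names, the `to_prompt` helper, and the example sentences are all hypothetical and are not taken from the dataset. Only the counts (2,300 grasps, 5 descriptions each) come from the text.

```python
from dataclasses import dataclass


@dataclass
class ContactDescription:
    """Hypothetical record for one coarse-to-fine description."""
    high: str  # high level: the grasping action
    mid: str   # mid level: grasp type and finger states
    low: str   # low level: location of the contacting joints

    def to_prompt(self) -> str:
        # Concatenate the three levels from coarse to fine.
        return " ".join([self.high, self.mid, self.low])


NUM_GRASPS = 2300           # grasping instances, as stated in the text
DESCRIPTIONS_PER_GRASP = 5  # descriptions per instance, as stated in the text

# Made-up example sentences, for illustration only.
example = ContactDescription(
    high="Hold the mug.",
    mid="The thumb and index finger are bent; the other fingers are open.",
    low="The thumb tip and the index fingertip touch the handle.",
)

total_descriptions = NUM_GRASPS * DESCRIPTIONS_PER_GRASP
```

The product `total_descriptions` reproduces the 11,500 descriptions reported above.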
Text-to-Hand Fusion:
- VPoser is used to encode hand pose features \(f_\theta^g \in \mathbb{R}^{64}\), and BERT is used to extract text token embeddings \(f_{text} \in \mathbb{R}^{768 \times n}\).
- Text, object, and hand pose features are fused through two cascaded multi-head attention modules.
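A minimal NumPy sketch of two cascaded multi-head cross-attention steps may help make the fusion concrete. Everything beyond the feature sizes 64 and 768 is an assumption: the projections are random and untrained, and the attention order (hand attends to text, then to the object), the residual connections, the number of heads, and the per-point object feature size are illustrative choices rather than the paper's configuration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, context, num_heads, rng):
    """Cross-attention with random (untrained) projections.

    query: (Lq, d), context: (Lk, d) -> output: (Lq, d)
    """
    d = query.shape[-1]
    dh = d // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    def split(x):  # (L, d) -> (heads, L, dh)
        return x.reshape(x.shape[0], num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(query @ Wq), split(context @ Wk), split(context @ Wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, Lq, Lk)
    out = (attn @ v).transpose(1, 0, 2).reshape(query.shape[0], d)
    return out @ Wo


rng = np.random.default_rng(0)
n_tokens, n_points, d = 12, 2048, 64

f_text = rng.standard_normal((768, n_tokens))  # text token embeddings (768 x n)
f_hand = rng.standard_normal((1, d))           # hand pose feature (64-d)
f_obj = rng.standard_normal((n_points, d))     # assumed per-point object features

# Assumed linear projection of text tokens to the shared width d.
W_text = rng.standard_normal((768, d)) / np.sqrt(768)
text_tokens = f_text.T @ W_text                # (n_tokens, d)

# Cascade: first attend to text, then to the object (order is an assumption).
hand_text = f_hand + multi_head_attention(f_hand, text_tokens, 4, rng)
fused = hand_text + multi_head_attention(hand_text, f_obj, 4, rng)
```

Here `fused` has shape `(1, 64)`: a single conditioning vector that has seen both the text tokens and the object points.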
Staged Diffusion:
- Stage 1 (Hand Pose Diffusion): Conditioned on the Text-to-Hand embedding, it denoises and generates the hand pose within the VPoser latent space.
- Stage 2 (Contact Diffusion): With the Stage 1 parameters frozen, and conditioned on the generated hand pose and the text-object fused features, it uses a PointNet encoder to encode the contact map into a latent \(\mathbf{x}_c^0 \in \mathbb{R}^{32 \times 32}\), which a U-Net then denoises to generate the contact map.
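The two-stage order can be sketched as two sampling loops run back to back, with the output of the first feeding the condition of the second. This sketch assumes standard DDPM ancestral sampling with a linear noise schedule; the paper's sampler, schedule, and step count are not given here. The two denoisers are hypothetical stand-ins for the trained networks, the pose latent size of 64 is assumed, and decoding the latents back to a hand pose and a contact map is omitted.

```python
import numpy as np


def make_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule (an assumption, chosen for simplicity).
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def ddpm_sample(denoiser, shape, cond, rng, T=50):
    """Ancestral DDPM sampling with an epsilon-prediction denoiser."""
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


# Hypothetical stand-ins for the trained networks.
def pose_denoiser(x, t, f_h):       # plays the role of G^1_theta
    return 0.1 * x + 0.01 * f_h.mean()


def contact_denoiser(x, t, f_c):    # plays the role of G^2_theta (a U-Net)
    return 0.1 * x + 0.01 * f_c.mean()


rng = np.random.default_rng(0)

# Stage 1: sample a hand pose latent conditioned on the text-to-hand feature.
f_h = rng.standard_normal(64)
x_h = ddpm_sample(pose_denoiser, (64,), f_h, rng)

# Stage 2: condition on the generated pose plus text-object features,
# then sample the 32 x 32 contact latent.
f_text_obj = rng.standard_normal(64)
f_c = np.concatenate([x_h, f_text_obj])
x_c = ddpm_sample(contact_denoiser, (32, 32), f_c, rng)
```

Because Stage 2 only reads `x_h`, the Stage 1 network can stay frozen while the contact model is trained.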
Loss & Training
Hand pose diffusion loss: \(\mathcal{L}_{\text{pose}} = \mathbb{E}[\|\epsilon - G_\theta^1(\mathbf{x}_h^t | t, f_h)\|_2^2]\)
Contact diffusion loss: \(\mathcal{L}_{\text{contact}} = \mathbb{E}[\|\epsilon - G_\theta^2(\mathbf{x}_c^t | t, f_c)\|_2^2]\)
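Both diffusion losses have the same noise-prediction form, so one function can illustrate them. The sketch below draws a single Monte Carlo sample of the expectation, assuming the standard DDPM forward process \(\mathbf{x}^t = \sqrt{\bar\alpha_t}\,\mathbf{x}^0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) and a linear schedule. The function name `diffusion_loss`, the schedule, and the `oracle` denoiser (which only exists to check the code) are illustrative assumptions.

```python
import numpy as np


def diffusion_loss(denoiser, x0, cond, alpha_bars, rng):
    """One-sample estimate of E[ || eps - G(x^t | t, cond) ||_2^2 ]."""
    t = rng.integers(0, len(alpha_bars))   # random timestep
    eps = rng.standard_normal(x0.shape)    # target noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = denoiser(x_t, t, cond)
    return float(np.sum((eps - eps_hat) ** 2))


alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))         # e.g. a contact latent


def oracle(x_t, t, cond):
    # Recovers the true noise by inverting the forward process (needs x0,
    # so it is only useful as a sanity check, not as a model).
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])


loss_oracle = diffusion_loss(oracle, x0, None, alpha_bars, rng)
loss_zero = diffusion_loss(lambda x_t, t, c: np.zeros_like(x_t), x0, None, alpha_bars, rng)
```

A denoiser that predicts the noise exactly drives the loss to (numerically) zero, while a denoiser that always outputs zeros pays the full squared norm of the noise.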
Contact optimization loss: \(\mathcal{L}_{opt} = \|\mathcal{C}'_H - \hat{\mathcal{C}}_H\|_2^2 + \lambda_O \|\mathcal{C}'_O - \hat{\mathcal{C}}_O\|_2^2 + \lambda_{pen}\mathcal{L}_{pen}\)
where \(\lambda_O=5\), \(\lambda_{pen}=3\), and the penetration loss \(\mathcal{L}_{pen}\) penalizes hand-object interpenetrations.
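The optimization objective is a weighted sum of three terms, which a short NumPy function can make explicit. The sketch assumes that the primed maps are the contacts implied by the current hand pose and the hatted maps are the generated targets, and it takes the penetration term as a precomputed scalar, since its exact definition is not given here. The array sizes (778 hand vertices, as in the MANO mesh, and 2,048 object points) are illustrative.

```python
import numpy as np

LAMBDA_O = 5.0    # object-contact weight, from the text
LAMBDA_PEN = 3.0  # penetration weight, from the text


def contact_opt_loss(c_hand, c_hand_target, c_obj, c_obj_target, pen):
    """Weighted sum of hand-contact, object-contact, and penetration terms.

    `pen` is an already-computed penetration loss (a scalar); how it is
    computed is outside the scope of this sketch.
    """
    hand_term = np.sum((c_hand - c_hand_target) ** 2)
    obj_term = np.sum((c_obj - c_obj_target) ** 2)
    return float(hand_term + LAMBDA_O * obj_term + LAMBDA_PEN * pen)


rng = np.random.default_rng(0)
c_hand_target = rng.random(778)   # assumed: one contact value per hand vertex
c_obj_target = rng.random(2048)   # one contact value per object point

# When the current contacts match the targets and nothing penetrates,
# the loss vanishes.
loss_at_target = contact_opt_loss(
    c_hand_target, c_hand_target, c_obj_target, c_obj_target, pen=0.0
)
```

In an actual optimizer the hand pose parameters would be updated by gradient descent on this value, which requires a differentiable framework rather than plain NumPy.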
Key Experimental Results
Main Results
Grasp Pose Optimization Experiments (ContactDescribe + HO3D):
| Method | MPJPE (mm)↓ | Inter. volume↓ | Coverage↑ | Precision↑ | Recall↑ |
|---|---|---|---|---|---|
| Perturbed Pose | 79.9 | 8.4 | 2.3% | 9.9% | 11.5% |
| ContactNet | 45.2 | 15.6 | 18.4% | 31.6% | 47.6% |
| ContactOpt | 25.1 | 12.8 | 19.7% | 38.7% | 54.8% |
| S2Contact | 29.4 | 12.2 | 22.2% | 42.5% | 56.1% |
| NL2Contact | 21.7 | 7.1 | 30.5% | 49.2% | 59.9% |
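For reference, MPJPE (mean per-joint position error) is conventionally the average Euclidean distance between predicted and ground-truth joints. The sketch below shows that standard definition only; whether the paper applies any alignment before measuring is not stated here, and the 21-joint hand layout is an assumption.

```python
import numpy as np


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error, in the units of the inputs.

    pred_joints, gt_joints: arrays of shape (num_joints, 3).
    """
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())


# Toy check: every joint is offset by (3, 4, 0), so every error is 5.
gt = np.zeros((21, 3))                      # assumed 21 hand joints
pred = gt + np.array([3.0, 4.0, 0.0])
error = mpjpe(pred, gt)
```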
Grasp Generation Experiments:
| Method | Inter. volume↓ | Coverage↑ | Diversity↑ | SD↓ |
|---|---|---|---|---|
| GrabNet | 15.50 | 99% | 2.06 | 2.34 |
| GraspTTA | 7.37 | 76% | 1.43 | 5.34 |
| ContactGen | 9.96 | 97% | 5.04 | 2.70 |
| NL2Contact | 5.89 | 99% | 5.91 | 2.31 |
Ablation Study
Generalization experiments on the HO3D dataset show that, on unseen objects, NL2Contact attains a penetration volume close to the best result (4.39 vs. 3.52 for S2Contact) while achieving the lowest MPJPE (8.4 mm), which supports the generalization capability of hand-centric contact descriptions.
Key Findings
- Compared to ContactOpt, NL2Contact reduces MPJPE by 3.4 mm (25.1 → 21.7) while also reducing the penetration volume relative to the perturbed input pose (8.4 → 7.1), whereas the other methods increase it.
- Language guidance endows contact generation with controllability, enabling the generation of specific finger contact patterns consistent with the descriptions.
- In grasp generation, NL2Contact achieves the lowest penetration (5.89), the highest diversity (5.91), and the lowest SD (2.31), and ties GrabNet for the highest coverage (99%).
Highlights & Insights
- Pioneering Task Definition: Natural language guided 3D hand-object contact modeling, where linguistic descriptions provide more precise control than verbs or affordance labels.
- LLM-Assisted Annotation: Cleverly utilizes ChatGPT to generate diverse natural language descriptions from structured prompts, avoiding the high cost and inconsistent quality of manual annotation.
- Reasonable Staged Design: Generates hand poses first and then contact maps, decoupling the cross-modal complexity of text-to-3D interaction.
- High Practicality: Contact map generation takes only about 3 seconds per instance, and training takes around 11 hours on a single V100 GPU.
Limitations & Future Work
- Relies on initial hand pose inputs; optimization can be challenging for severely erroneous initial poses.
- The ContactDescribe dataset is limited in scale (25 types of objects); generalization to more object categories needs further validation.
- The ambiguity of natural language descriptions may introduce uncertainty into the contact generation.
Rating
⭐⭐⭐⭐ (4/5)
- Novelty: ★★★★★ — Pioneering the language-guided contact modeling task + first fine-grained contact language dataset
- Technical Quality: ★★★★ — Reasonable staged diffusion design, effective cross-modal fusion
- Experiments: ★★★★ — Multi-task validation (optimization + generation), multi-dataset evaluation
- Writing Quality: ★★★★ — Clear structure, intuitive illustrations