AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis¶
- Conference: CVPR 2026
- arXiv: 2603.08021
- Code: Project Page
- Area: 3D Vision / Hand-Object Interaction
- Keywords: grasp generation, affordance, cross-modal diffusion, hand-object interaction, semantic instruction
TL;DR¶
AffordGrasp proposes a diffusion-based cross-modal framework that generates physically plausible and semantically consistent hand grasp poses from text instructions and object point clouds, via affordance-guided latent diffusion and a Distribution Adjustment Module (DAM), significantly outperforming existing methods on four benchmarks.
Background & Motivation¶
Background: Semantic grasp generation aims to synthesize hand poses for object interaction conditioned on user instructions, which is a critical capability for AR/VR and embodied intelligence.
Limitations of Prior Work: - A large modality gap exists between 3D geometry and natural language, making fine-grained geometry–semantic alignment (e.g., distinguishing "grasp the handle" from "grasp the rim") difficult to achieve through direct fusion; - Existing diffusion pipelines lack explicit spatial and semantic constraints, frequently producing physically implausible or semantically inconsistent grasps.
Key Challenge: How can generated grasp poses precisely correspond to the interaction intent described by a language instruction while remaining physically feasible?
Key Insight: Object affordance is introduced as a cross-modal bridge, linking linguistic semantics with 3D geometry through affordance regions, supplemented by a Distribution Adjustment Module that enforces physical and semantic consistency after sampling.
Core Idea: Affordance-driven latent diffusion + Distribution Adjustment Module = physical plausibility + semantic precision.
Method¶
Overall Architecture¶
Input: object point cloud \(P_g\) + text instruction \(I\) → Affordance Generator predicts affordance map \(P_a\) → multimodal encoding fusion \(f = \{f_I, f_{pg}, f_{pa}\}\) → latent diffusion model generates hand latent code → DAM refinement → MANO layer outputs hand mesh \(h_m\).
Key Designs¶
- Affordance Generator:
- Function: Predicts a per-point affordance probability map on the object point cloud conditioned on the instruction.
- Mechanism: Built on the LASO architecture, initially trained on AffordPose, then extended to OakInk and GRAB via a self-training loop. Focal Loss + Dice Loss is used to handle positive–negative sample imbalance: \(\mathcal{L} = \mathcal{L}_{\text{focal}} + \lambda_1 \mathcal{L}_{\text{dice}}\)
- Design Motivation: Affordance regions provide an explicit intermediate representation that decouples the spatial information of "where to grasp" from the language instruction, reducing the difficulty of cross-modal fusion.
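The Focal + Dice objective above can be sketched as follows. This is a minimal NumPy illustration of the per-point binary formulation, not the paper's implementation; the function names and the sphere of hyperparameters (`gamma`, `alpha`, `lam1`) are assumptions.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss over per-point affordance probabilities.

    p: predicted probability per point, y: binary affordance label.
    The (1 - p_t)^gamma factor down-weights easy points, which matters
    when most of the point cloud is a negative (non-affordance) region.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def dice_loss(p, y, eps=1e-7):
    """Soft Dice loss: overlap-based, so a small positive region on a
    large cloud still contributes a meaningful gradient."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def affordance_loss(p, y, lam1=1.0):
    # L = L_focal + lambda_1 * L_dice, with lambda_1 a tunable weight
    return focal_loss(p, y) + lam1 * dice_loss(p, y)
```

Both terms target the same imbalance from different angles: focal loss reweights per-point errors, while Dice operates on region overlap, so their combination is a common choice for small-region segmentation.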
- Cross-Modal Latent Diffusion Model:
- Function: Learns the conditional distribution in the latent space of hand poses.
- Mechanism:
- A pretrained VAE encodes hand mesh vertices \(h_v^{gt} \in \mathbb{R}^{778 \times 3}\) into a compact latent code \(h_z\)
- Forward diffusion: \(z^t = \sqrt{\alpha_t} z^0 + \sqrt{1 - \alpha_t} \epsilon\)
- A conditional U-Net is trained to predict noise: \(\mathcal{L}_{\text{LDM}} = \mathbb{E}\|\epsilon - \epsilon_\theta(z^t, f, t)\|_2^2\)
- Conditioning features \(f = \{f_I, f_{pg}, f_{pa}\}\): \(f_I\) is encoded by RoBERTa, while \(f_{pg}\) and \(f_{pa}\) are encoded by two independent PointNets
- Design Motivation: Operating in latent space is more efficient and better preserves the spatial structure of hand poses.
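The forward diffusion step and its inversion (the closed-form latent recovery that the DAM later consumes) can be sketched in NumPy. This is standard DDPM algebra, not the paper's code; shapes and names are illustrative.

```python
import numpy as np

def forward_diffuse(z0, alpha_bar_t, rng):
    """Closed-form forward process: z_t = sqrt(a)*z0 + sqrt(1-a)*eps,
    where a is the cumulative noise schedule coefficient at step t."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return zt, eps

def recover_latent(zt, eps_pred, alpha_bar_t):
    """Invert the forward process given a noise prediction; with the
    true noise this recovers z0 exactly, mirroring the DAM input
    h_z = (z_t - sqrt(1-a)*eps_theta) / sqrt(a)."""
    return (zt - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
```

The U-Net's only job at training time is predicting `eps`; everything else (corruption and recovery) is this fixed algebra.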
- Distribution Adjustment Module (DAM):
- Function: Refines the latent code after diffusion sampling to enforce physical constraints and semantic alignment.
- Mechanism:
- Recovers the latent hand representation from noise prediction: \(\hat{h}_z = \frac{1}{\sqrt{\alpha_t}}(z^t - \sqrt{1-\alpha_t}\epsilon_\theta(z^t, f, t))\)
- Fuses spatial features: \(f_{\text{spatial}} = \text{Norm}(f_{pg} + f_{pa}) + \hat{h}_z\)
- Multi-head attention interacts with the instruction: \(f_{\text{align}} = \text{Attention}(f_I, f_{\text{spatial}}, f_{\text{spatial}}) + f_I\)
- Dual residual refinement: \(\tilde{h}_z = \text{Norm}(\text{MLP}(f_{\text{align}}) + \hat{h}_z)\)
- Design Motivation: Denoised outputs from diffusion models may be insufficiently precise regarding contact constraints and semantic details. DAM corrects these in a single forward pass as a lightweight post-processing module, avoiding the high computational cost of gradient-based sampling correction.
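The DAM forward pass above can be sketched as a few NumPy operations. This is a simplified illustration under stated assumptions: single-head attention stands in for the paper's multi-head attention, a single linear layer stands in for the MLP, and all feature tensors are assumed to share shape `(tokens, d)`.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    """Single-head scaled dot-product attention (the paper uses
    multi-head attention; one head keeps the sketch short)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v

def dam_refine(h_z, f_pg, f_pa, f_I, mlp_w, mlp_b):
    """One forward pass of the Distribution Adjustment Module sketch.

    h_z: recovered hand latent; f_pg / f_pa: geometry / affordance
    point features; f_I: instruction tokens. All (tokens, d).
    """
    f_spatial = layer_norm(f_pg + f_pa) + h_z              # spatial fusion
    f_align = attention(f_I, f_spatial, f_spatial) + f_I   # residual 1
    # single linear layer stands in for the MLP in the paper
    return layer_norm(f_align @ mlp_w + mlp_b + h_z)       # residual 2
```

Note the two residual paths: one preserves the instruction features through attention, the other preserves the diffusion latent through the MLP, which is why the refinement corrects details without discarding the sampled structure.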
Loss & Training¶
- Two-stage training: the diffusion model is trained first (DAM frozen), then the diffusion model is frozen and DAM is trained.
- DAM loss: \(\mathcal{L} = \mathcal{L}_{\text{recon}}(h_v, h_p, h_v^{gt}, h_p^{gt}) + \lambda_2 \mathcal{L}_{\text{physical}}(h_m, h_m^{gt}, P_g)\)
- Reconstruction loss covers MANO parameter and vertex alignment; physical loss penalizes interpenetration and contact inconsistency.
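The interpenetration term of the physical loss can be illustrated with a toy example. The paper's \(\mathcal{L}_{\text{physical}}\) is computed against the actual object point cloud \(P_g\); here a spherical object proxy is an assumption made purely for a self-contained sketch.

```python
import numpy as np

def penetration_penalty(hand_verts, center, radius):
    """Toy interpenetration penalty against a spherical object proxy.

    For each hand vertex, measure how far it sinks below the object
    surface (positive depth only when inside) and sum the depths.
    The real loss replaces the sphere with distances to P_g.
    """
    d = np.linalg.norm(hand_verts - center, axis=-1)
    depth = np.clip(radius - d, 0.0, None)  # > 0 only inside the object
    return float(depth.sum())
```

Vertices outside the object contribute zero, so the penalty is sparse and only activates on colliding grasps, complementing the contact-consistency term that rewards touching the surface.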
Key Experimental Results¶
Main Results¶
| Dataset | Method | Penetration Vol.↓ | Displacement↓ | Contact Rate↑ | Semantic Acc.↑ |
|---|---|---|---|---|---|
| OakInk | FastGrasp | 7.88 | 2.27 | 88% | 78.05% |
| OakInk | AffordGrasp | 7.31 | 1.43 | 98% | 80.08% |
| GRAB | FastGrasp | 4.61 | 1.20 | 94% | 61.50% |
| GRAB | AffordGrasp | 3.06 | 1.66 | 94% | 62.50% |
| HO-3D (OOD) | D-VQVAE | 13.12 | 2.33 | 95% | 64.00% |
| HO-3D (OOD) | AffordGrasp | 7.38 | 2.33 | 97% | 72.00% |
Ablation Study¶
| Configuration | Penetration Vol.↓ | Contact Rate↑ | Semantic Acc.↑ | Note |
|---|---|---|---|---|
| w/o affordance | 8.27 | 97% | 76.56% | Removing affordance increases penetration |
| w/o DAM | 8.12 | 97% | 79.11% | Removing DAM slightly reduces semantics |
| Full pipeline | 7.31 | 98% | 80.08% | Both modules synergize optimally |
Key Findings¶
- Outstanding cross-domain generalization: the model trained on GRAB substantially outperforms baselines in zero-shot settings on HO-3D and AffordPose.
- Affordance regions effectively reduce penetration volume (object–hand collision), demonstrating the importance of spatial guidance.
- DAM's dual residual mechanism preserves the core structure of the diffusion output while effectively correcting local details.
Highlights & Insights¶
- The concept of affordance as a cross-modal bridge is concise and effective, avoiding the instability of multi-round VLM inference.
- The self-training annotation pipeline addresses the scarcity of affordance-labeled data.
- DAM, as a lightweight post-processing module, is transferable to other conditional generation tasks.
Limitations & Future Work¶
- Physical priors (gravity, friction) are not explicitly modeled, which may limit performance in certain real-world scenarios.
- AffordPose is excluded from training due to the absence of MANO parameters, limiting object diversity.
- Multi-step DDIM sampling is still required at inference, leaving room for improvement in real-time applicability.
Related Work & Insights¶
- Compared to SemGrasp, the method avoids 2D projection occlusion issues by operating directly in 3D space.
- The DAM design is conceptually similar to ControlNet's post-processing refinement, but requires no retraining of the base model.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combined design of affordance guidance and DAM is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four datasets, in-domain and out-of-domain evaluation, extensive ablations, and physics simulation validation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with high-quality figures and tables.
- Value: ⭐⭐⭐⭐ Practically advances semantic grasping for embodied intelligence.