AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis¶

Conference: CVPR 2026
arXiv: 2603.08021
Code: Project Page
Area: 3D Vision / Hand-Object Interaction
Keywords: grasp generation, affordance, cross-modal diffusion, hand-object interaction, semantic instruction

TL;DR¶

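AffordGrasp is a diffusion-based cross-modal framework that generates physically plausible, semantically consistent hand grasp poses from a text instruction and an object point cloud. It combines affordance-guided latent diffusion with a Distribution Adjustment Module (DAM) and outperforms prior methods on most reported metrics across four benchmarks. On out-of-domain HO-3D, for example, it cuts penetration volume from 13.12 to 7.38 and raises semantic accuracy from 64.00% to 72.00% relative to D-VQVAE.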

Background & Motivation¶

Background: Semantic grasp generation aims to synthesize hand poses for object interaction conditioned on user instructions, which is a critical capability for AR/VR and embodied intelligence.

Limitations of Prior Work:
- A large modality gap exists between 3D geometry and natural language, making fine-grained geometry–semantic alignment (e.g., distinguishing "grasp the handle" from "grasp the rim") difficult to achieve through direct fusion.
- Existing diffusion pipelines lack explicit spatial and semantic constraints, frequently producing physically implausible or semantically inconsistent grasps.

Key Challenge: How can generated grasp poses precisely correspond to the interaction intent described by a language instruction while remaining physically feasible?

Key Insight: Object affordance is introduced as a cross-modal bridge, linking linguistic semantics with 3D geometry through affordance regions, supplemented by a Distribution Adjustment Module that enforces physical and semantic consistency after sampling.

Core Idea: Affordance-driven latent diffusion + Distribution Adjustment Module = physical plausibility + semantic precision.

Method¶

Overall Architecture¶

Input: object point cloud \(P_g\) + text instruction \(I\) → Affordance Generator predicts affordance map \(P_a\) → multimodal encoding fusion \(f = \{f_I, f_{pg}, f_{pa}\}\) → latent diffusion model generates hand latent code → DAM refinement → MANO layer outputs hand mesh \(h_m\).

Key Designs¶

Affordance Generator:
- Function: Predicts a per-point affordance probability map on the object point cloud conditioned on the instruction.
- Mechanism: Built on the LASO architecture, initially trained on AffordPose, then extended to OakInk and GRAB via a self-training loop. Focal loss and Dice loss are combined to handle the imbalance between positive (affordance) and negative points: \(\mathcal{L} = \mathcal{L}_{\text{focal}} + \lambda_1 \mathcal{L}_{\text{dice}}\)
- Design Motivation: Affordance regions provide an explicit intermediate representation that decouples the spatial information of "where to grasp" from the language instruction, reducing the difficulty of cross-modal fusion.
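
The combined focal + Dice objective can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the focal-loss hyperparameters `gamma` and `alpha`, the Dice smoothing term `smooth`, and the default `lambda_1 = 1.0` are assumptions, since the summary does not state their values.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss over per-point probabilities p and 0/1 labels y.
    gamma and alpha are assumed defaults, not values from the paper."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)           # probability of the true class
    a_t = np.where(y == 1, alpha, 1.0 - alpha)   # class-balancing weight
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def dice_loss(p, y, smooth=1.0):
    """Soft Dice loss: 1 - Dice overlap between prediction and label."""
    intersection = np.sum(p * y)
    return float(1.0 - (2.0 * intersection + smooth) / (np.sum(p) + np.sum(y) + smooth))

def affordance_loss(p, y, lambda_1=1.0):
    """L = L_focal + lambda_1 * L_dice (lambda_1 = 1.0 is an assumption)."""
    return focal_loss(p, y) + lambda_1 * dice_loss(p, y)
```

With one positive point out of eight, a prediction close to the label yields a much smaller `affordance_loss` than its complement; the Dice term keeps the few positive points from being swamped by the many negatives.
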
Cross-Modal Latent Diffusion Model:
- Function: Learns the conditional distribution in the latent space of hand poses.
- Mechanism:
  - A pretrained VAE encodes hand mesh vertices \(h_v^{gt} \in \mathbb{R}^{778 \times 3}\) into a compact latent code \(h_z\)
  - Forward diffusion: \(z^t = \sqrt{\alpha_t} z^0 + \sqrt{1 - \alpha_t} \epsilon\)
  - A conditional U-Net is trained to predict noise: \(L_{LDM} = \mathbb{E}\|\epsilon - \epsilon_\theta(z^t, f, t)\|_2^2\)
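  - Conditioning features \(f = \{f_I, f_{pg}, f_{pa}\}\): the instruction feature \(f_I\) is encoded by RoBERTa, while the geometry feature \(f_{pg}\) and affordance feature \(f_{pa}\) are encoded by two independent PointNets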
- Design Motivation: Operating in latent space is more efficient and better preserves the spatial structure of hand poses.
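
The forward-diffusion step and the noise-prediction loss above can be sketched as below. Several things here are assumptions for illustration only: \(\alpha_t\) is treated as the cumulative product of a linear beta schedule (as in standard DDPM), `T = 1000` and the schedule endpoints are hypothetical, and `eps_pred` stands in for the output of the conditional U-Net \(\epsilon_\theta(z^t, f, t)\), which is not implemented here.

```python
import numpy as np

def make_alphas(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficients alpha_t (assumed linear beta schedule)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alphas, eps):
    """Forward diffusion: z^t = sqrt(alpha_t) z^0 + sqrt(1 - alpha_t) eps."""
    a = alphas[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def ldm_loss(eps, eps_pred):
    """Mean squared L2 distance between true and predicted noise over a batch."""
    return float(np.mean(np.sum((eps - eps_pred) ** 2, axis=-1)))
```

In training, one would draw a random timestep `t` and noise `eps` per sample, form `z_t = q_sample(z0, t, alphas, eps)`, and minimize `ldm_loss(eps, eps_pred)` with `eps_pred` produced by the conditioned denoiser.
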
Distribution Adjustment Module (DAM):
- Function: Refines the latent code after diffusion sampling to enforce physical constraints and semantic alignment.
- Mechanism:
  - Recovers the latent hand representation from noise prediction: \(\hat{h}_z = \frac{1}{\sqrt{\alpha_t}}(z^t - \sqrt{1-\alpha_t}\epsilon_\theta(z^t, f, t))\)
  - Fuses spatial features: \(f_{\text{spatial}} = \text{Norm}(f_{pg} + f_{pa}) + \hat{h}_z\)
  - Multi-head attention interacts with the instruction: \(f_{\text{align}} = \text{Attention}(f_I, f_{\text{spatial}}, f_{\text{spatial}}) + f_I\)
  - Dual residual refinement: \(\tilde{h}_z = \text{Norm}(\text{MLP}(f_{\text{align}}) + \hat{h}_z)\)
- Design Motivation: Denoised outputs from diffusion models may be insufficiently precise regarding contact constraints and semantic details. DAM corrects these in a single forward pass as a lightweight post-processing module, avoiding the high computational cost of gradient-based sampling correction.
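
The four DAM steps can be traced in a small NumPy sketch. It is a hedged illustration under several assumptions that the summary does not confirm: all features share one width `d`, \(f_I\) is a sequence of `L` token features, \(f_{pg}\) and \(f_{pa}\) are `N` per-point features, the latent \(\hat{h}_z\) is broadcast over points, attention is single-head instead of multi-head, the attended tokens are mean-pooled before the MLP, and the MLP is two layers with ReLU (weights `W1`, `W2` are hypothetical).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recover_latent(z_t, eps_pred, alpha_t):
    """h_hat = (z^t - sqrt(1 - alpha_t) * eps_pred) / sqrt(alpha_t)."""
    return (z_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)

def attention(q, k, v):
    """Single-head scaled dot-product attention (simplification of multi-head)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def dam_refine(z_t, eps_pred, alpha_t, f_I, f_pg, f_pa, W1, W2):
    h_hat = recover_latent(z_t, eps_pred, alpha_t)           # (d,)
    f_spatial = layer_norm(f_pg + f_pa) + h_hat              # (N, d), h_hat broadcast
    f_align = attention(f_I, f_spatial, f_spatial) + f_I     # (L, d), first residual
    pooled = f_align.mean(axis=0)                            # (d,), assumed pooling
    mlp_out = np.maximum(pooled @ W1, 0.0) @ W2              # (d,), assumed 2-layer MLP
    return layer_norm(mlp_out + h_hat)                       # second residual + Norm
```

The two residual paths (`+ f_I` and `+ h_hat`) are what the summary calls dual residual refinement: the refined latent stays anchored to the diffusion estimate while the attention branch contributes an instruction-aware correction.
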

Loss & Training¶

Two-stage training: the diffusion model is trained first (DAM frozen), then the diffusion model is frozen and DAM is trained.
DAM loss: \(\mathcal{L} = \mathcal{L}_{\text{recon}}(h_v, h_p, h_v^{gt}, h_p^{gt}) + \lambda_2 \mathcal{L}_{\text{physical}}(h_m, h_m^{gt}, P_g)\)
Reconstruction loss covers MANO parameter and vertex alignment; physical loss penalizes interpenetration and contact inconsistency.

Key Experimental Results¶

Main Results¶

| Dataset | Method | Penetration Vol.↓ | Displacement↓ | Contact Rate↑ | Semantic Acc.↑ |
| --- | --- | --- | --- | --- | --- |
| OakInk | FastGrasp | 7.88 | 2.27 | 88% | 78.05% |
| OakInk | AffordGrasp | 7.31 | 1.43 | 98% | 80.08% |
| GRAB | FastGrasp | 4.61 | 1.20 | 94% | 61.50% |
| GRAB | AffordGrasp | 3.06 | 1.66 | 94% | 62.50% |
| HO-3D (OOD) | D-VQVAE | 13.12 | 2.33 | 95% | 64.00% |
| HO-3D (OOD) | AffordGrasp | 7.38 | 2.33 | 97% | 72.00% |
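
To read the table in relative terms, the short script below computes the fractional reduction of each lower-is-better metric against the listed baseline. The numbers are copied from the table; the helper name `relative_reduction` and the dictionary layout are illustrative choices. A positive value means AffordGrasp improves on the baseline, and a negative value means it is worse, which is the case for displacement on GRAB (1.66 vs. 1.20).

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric (positive = improvement)."""
    return (baseline - ours) / baseline

# (dataset, metric) -> (baseline value, AffordGrasp value), taken from the table
results = {
    ("OakInk", "penetration"): (7.88, 7.31),
    ("OakInk", "displacement"): (2.27, 1.43),
    ("GRAB", "penetration"): (4.61, 3.06),
    ("GRAB", "displacement"): (1.20, 1.66),
    ("HO-3D", "penetration"): (13.12, 7.38),
}

for (dataset, metric), (base, ours) in results.items():
    print(f"{dataset:8s} {metric:12s} {relative_reduction(base, ours):+.1%}")
```
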

Ablation Study¶

| Configuration | Penetration Vol.↓ | Contact Rate↑ | Semantic Acc.↑ | Note |
| --- | --- | --- | --- | --- |
| w/o affordance | 8.27 | 97% | 76.56% | Penetration rises (7.31 → 8.27) and semantic accuracy drops the most |
| w/o DAM | 8.12 | 97% | 79.11% | Penetration rises (7.31 → 8.12); semantic accuracy drops slightly |
| Full pipeline | 7.31 | 98% | 80.08% | Best on all three metrics |

Key Findings¶

Outstanding cross-domain generalization: the model trained on GRAB substantially outperforms baselines in zero-shot settings on HO-3D and AffordPose.
Affordance guidance lowers penetration volume (object–hand collision) from 8.27 to 7.31 in the ablation, demonstrating the importance of spatial guidance.
DAM's dual residual mechanism preserves the core structure of the diffusion output while effectively correcting local details.

Highlights & Insights¶

The concept of affordance as a cross-modal bridge is concise and effective, avoiding the instability of multi-round VLM inference.
The self-training annotation pipeline addresses the scarcity of affordance-labeled data.
DAM, as a lightweight post-processing module, is transferable to other conditional generation tasks.

Limitations & Future Work¶

Physical priors (gravity, friction) are not explicitly modeled, which may limit performance in certain real-world scenarios.
AffordPose is excluded from grasp-synthesis training because it lacks MANO parameters (it is used to train the Affordance Generator but not the grasp model), which limits object diversity.
Multi-step DDIM sampling is still required at inference, leaving room for improvement in real-time applicability.

Compared to SemGrasp, the method avoids 2D projection occlusion issues by operating directly in 3D space.
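The DAM design is loosely analogous to ControlNet in that an added module is trained while the base diffusion model stays frozen; unlike ControlNet, DAM acts as a post-hoc refinement of the denoised latent rather than as conditioning injected into the denoiser.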

Rating¶

Novelty: ⭐⭐⭐⭐ The combined design of affordance guidance and DAM is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four datasets, in-domain and out-of-domain evaluation, extensive ablations, and physics simulation validation.
Writing Quality: ⭐⭐⭐⭐ Clear structure with high-quality figures and tables.
Value: ⭐⭐⭐⭐ Practically advances semantic grasping for embodied intelligence.