Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
Conference: CVPR 2025
arXiv: 2504.04744
Code: Project Page
Area: 3D Vision
Keywords: 3D affordance grounding, multi-modal fusion, point cloud, VLM, embodied intelligence
TL;DR
This work proposes the first multimodal multi-view 3D affordance grounding task and the AGPIL dataset (containing 30,972 point cloud-image-language triplets). It designs LMAffordance3D, a VLM-based framework that fuses 2D/3D spatial features with linguistic semantics to generalize from full-view to partial/rotation-view scenarios.
Background & Motivation
Background: Affordance grounding aims to identify manipulable regions of objects, serving as a key link between perception and action in embodied intelligence. Prior research has primarily focused on 2D images or single-modality 3D point clouds.
Limitations of Prior Work:
- 2D affordance methods are difficult to directly map to 3D space for robotic manipulation tasks.
- 3D methods (such as 3D AffordanceNet) rely solely on geometric information, leading to limited generalization capability and confusion when handling similar objects.
- Existing datasets either utilize only single-modality input, lack guidance from language instructions, or overlook the issue of incomplete point clouds caused by occlusion and rotation in real-world scenarios.
Key Challenge: In the real 3D world, viewpoints, occlusions, and rotations often limit object observations to partial point clouds. Humans, by contrast, learn new object affordances from linguistic instructions, visual demonstrations, and interactions together, and existing methods fail to use all three simultaneously.
Key Insight: Inspired by cognitive science, this work models 3D affordance grounding as a multimodal task (incorporating language, images, and point clouds) and constructs a comprehensive benchmark covering three viewpoints (full, partial, and rotation) and two settings (seen and unseen).
Method
Overall Architecture
LMAffordance3D is an end-to-end single-stage framework consisting of four core components:
1. Vision Encoder: Processes images (ResNet18 \(\to\) 2D features) and point clouds (PointNet++ \(\to\) 3D features), fusing them via an MLP and Self-Attention to obtain multimodal spatial features \(F_S\).
2. VLM Core: Employs LLaVA-7B as the backbone. A tokenizer encodes language instructions into \(F_T\), and an Adapter (a two-layer MLP with an activation layer) maps spatial features \(F_S\) into the semantic space \(F_{SP}\). These are then concatenated and fed into the VLM.
3. Decoder: Based on cross-attention, it uses spatial features as Query, instruction features as Key, and semantic features as Value to decode the affordance features \(F_A\).
4. Segmentation Head: Employs upsampling, two linear layers, Batch Normalization (BN), and Sigmoid to output point-wise affordance probabilities of shape \((B, 2048, 1)\).
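To make the VLM Core step concrete, here is a minimal NumPy sketch of the Adapter and the concatenation that follows it. It is an illustration under stated assumptions, not the authors' implementation: the token counts, feature widths, ReLU activation, random weights, and the text-then-spatial concatenation order are all hypothetical choices not given in the notes above.

```python
import numpy as np

def adapter(f_s, w1, b1, w2, b2):
    """Two-layer MLP with one activation in between (ReLU is an assumption)."""
    hidden = np.maximum(f_s @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
n_spatial, d_spatial = 64, 128   # fused spatial tokens and their width
n_text, d_vlm = 12, 256          # instruction tokens and VLM embedding width

f_s = rng.standard_normal((n_spatial, d_spatial))  # stand-in for F_S
f_t = rng.standard_normal((n_text, d_vlm))         # stand-in for F_T

# Randomly initialised adapter weights (untrained, illustration only).
w1 = rng.standard_normal((d_spatial, d_vlm)) * 0.02
b1 = np.zeros(d_vlm)
w2 = rng.standard_normal((d_vlm, d_vlm)) * 0.02
b2 = np.zeros(d_vlm)

f_sp = adapter(f_s, w1, b1, w2, b2)              # F_SP, now in the VLM's width
vlm_input = np.concatenate([f_t, f_sp], axis=0)  # order is an assumption

print(f_sp.shape, vlm_input.shape)
```

The point of the sketch is the shape contract: the Adapter only has to bring \(F_S\) to the same embedding width as \(F_T\) so that both can be stacked along the token axis.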
Key Designs
1. Multimodal Vision Encoder Design
    - Function: Extracts 2D and 3D features using ResNet18 and PointNet++ respectively, then fuses them via an MLP and Self-Attention.
    - Mechanism: RGB images contain color, scene, and interaction information, while point clouds capture shape, scale, and geometric characteristics. The two modalities are complementary and aligned through a shared semantic space.
    - Design Motivation: Instead of directly using CLIP (which has a large parameter count and is difficult to deploy), a lightweight design is adopted to suit robotic deployment scenarios.
2. Cross-Attention-Based Decoder
    - Function: Splits the VLM output into instruction features and semantic features, fusing spatial and semantic information via cross-attention.
    - Mechanism: Spatial features (Query) query the semantic features (Value), with the attention distribution guided by the instruction features (Key).
    - Design Motivation: To ensure that different language instructions can guide the model to focus on different affordance regions of the same object.
3. AGPIL Dataset Construction
    - Function: Builds the first multimodal, multi-view 3D affordance dataset, containing 30,972 images, 41,628 point clouds, and 30,972 language instructions.
    - Mechanism: Point clouds are acquired from 3D AffordanceNet (covering full, partial, and rotation views), images are sourced from AGD20K and PIAD, and language instructions are generated by GPT-4o combined with images and then manually filtered.
    - Design Motivation: Covers 23 object categories and 17 affordance classes. Each annotation is represented as a \((2048, 17)\) probability matrix, and seen/unseen splits are defined to thoroughly evaluate generalization capabilities.
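The Query/Key/Value roles of the cross-attention decoder (design 2 above) can be sketched in a few lines of NumPy. This is a single-head version without learned projections, written only to show the data flow. The token counts and feature width are hypothetical, and the sketch assumes the instruction and semantic features have the same number of tokens, which attention requires for Key and Value but which the notes above do not state.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f_spatial, f_instruction, f_semantic):
    """Scaled dot-product attention with
    Query = spatial, Key = instruction, Value = semantic features."""
    d = f_spatial.shape[-1]
    scores = f_spatial @ f_instruction.T / np.sqrt(d)  # (n_q, n_kv)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ f_semantic, weights               # (n_q, d), (n_q, n_kv)

rng = np.random.default_rng(0)
n_q, n_kv, d = 64, 16, 32  # hypothetical sizes

f_spatial = rng.standard_normal((n_q, d))       # stand-in for spatial features
f_instruction = rng.standard_normal((n_kv, d))  # stand-in for instruction features
f_semantic = rng.standard_normal((n_kv, d))     # stand-in for semantic features

f_a, attn = cross_attention(f_spatial, f_instruction, f_semantic)
print(f_a.shape, attn.shape)
```

Because the attention weights depend only on the spatial queries and the instruction keys, changing the instruction changes which semantic tokens each spatial location draws from, which matches the stated design motivation.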
Loss & Training
- Focal Loss: Handles the imbalance between positive and negative samples.
- Dice Loss: Optimizes the overlap between the predicted and ground-truth affordance regions.
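For reference, the two loss terms can be written in their standard forms as below. This is a sketch under assumptions: the `alpha` and `gamma` defaults are the common focal-loss defaults rather than values taken from the paper, binary targets are used for simplicity even though AGPIL annotations are probabilistic, and how the two terms are weighted or combined is not specified in the notes above.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over points.
    p: predicted probabilities, y: binary targets (0 or 1)."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩Y| / (|P| + |Y|)."""
    intersection = np.sum(p * y)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(y) + eps))

# Toy example: 8 points, 2 of them positive (imbalanced on purpose).
y = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
good = np.where(y == 1, 0.9, 0.1)  # mostly correct prediction
bad = np.where(y == 1, 0.1, 0.9)   # mostly wrong prediction

print(focal_loss(good, y), dice_loss(good, y))
print(focal_loss(bad, y), dice_loss(bad, y))
```

The `(1 - p_t) ** gamma` factor down-weights points that are already classified well, which is what lets focal loss cope with the many easy negative points, while Dice scores the predicted region as a whole.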
Key Experimental Results
Main Results
| Method | Full-view AUC↑ | Full-view SIM↑ | Partial AUC↑ | Rotation AUC↑ |
|---|---|---|---|---|
| 3D AffordanceNet | 0.807 | 0.483 | 0.761 | 0.595 |
| IAG | 0.849 | 0.545 | 0.809 | 0.679 |
| OpenAD | 0.858 | 0.587 | 0.815 | 0.733 |
| PointRefer | 0.877 | 0.595 | 0.821 | 0.756 |
| Ours | 0.890 | 0.610 | 0.848 | 0.782 |
The margin over PointRefer widens under the unseen setting: full-view AUC is 0.774 vs. 0.755, and MAE (lower is better) is 0.095 vs. 0.118.
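The metrics named above can be computed per object roughly as follows. This is a sketch using definitions common in the affordance-grounding literature, not the authors' evaluation code: SIM is taken to be the histogram intersection of the two sum-normalised maps, and AUC is computed against a binarised ground truth, where the binarisation rule (`gt > 0` here) is an assumption.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between per-point scores (lower is better)."""
    return float(np.mean(np.abs(pred - gt)))

def sim(pred, gt, eps=1e-12):
    """Histogram intersection of the two maps after each is normalised to sum to 1."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float(np.minimum(p, q).sum())

def auc(pred, gt_binary):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half. Quadratic in the number of points, fine for 2048."""
    pos = pred[gt_binary == 1]
    neg = pred[gt_binary == 0]
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))

# Toy 4-point object: the first two points carry the affordance.
gt = np.array([1.0, 1.0, 0.0, 0.0])
pred = np.array([0.9, 0.8, 0.2, 0.1])

print("AUC:", auc(pred, (gt > 0).astype(int)))
print("SIM:", sim(pred, gt))
print("MAE:", mae(pred, gt))
```

On this toy input the AUC is 1.0 (every positive point outranks every negative one), SIM is 0.85, and MAE is 0.15.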
Ablation Study
- Among the 17 affordance categories, "stab" achieves the highest AUC of 0.997, while "wrapping" is the most challenging with an AUC of only 0.689.
- Performance progressively decreases from Full \(\to\) Partial \(\to\) Rotation views, highlighting the challenges posed by incomplete point clouds.
- Performance drops by approximately 10-15% from Seen \(\to\) Unseen configurations, yet the proposed method shows a larger margin over the baselines on Unseen data than on Seen data.
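To give a feel for why the partial and rotation views are harder, the sketch below derives both from a full point cloud. AGPIL takes these views from 3D AffordanceNet, and this code does not reproduce that pipeline. The half-space crop is a crude stand-in for real occlusion, and the function names, view direction, and random point cloud are all hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random 3D rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs so the result is well distributed
    if np.linalg.det(q) < 0:     # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def rotation_view(points, rotation):
    """Apply a rotation to every point (rows are xyz coordinates)."""
    return points @ rotation.T

def partial_view(points, view_dir):
    """Keep only points on the viewer's side of the centroid.
    A simplification: real partial views come from visibility, not a half-space."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    centered = points - points.mean(axis=0)
    return points[centered @ view_dir > 0]

rng = np.random.default_rng(0)
full = rng.standard_normal((2048, 3))  # stand-in for a full-view point cloud

rot = random_rotation(rng)
rotated = rotation_view(full, rot)
partial = partial_view(full, np.array([0.0, 0.0, 1.0]))

print(full.shape, rotated.shape, partial.shape)
```

The rotated cloud keeps every point but loses the canonical pose, while the partial cloud loses geometry outright, so a model that relies on geometry alone has less to work with in both cases.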
Key Findings
- Multimodal fusion (images + point clouds + language) significantly outperforms single-modality methods, improving the AUC by 3-8%.
- Linguistic instruction guidance enables the model to distinguish between different functional regions of the same object.
- The gains over PointRefer are larger in the partial-view and rotation-view scenarios (AUC +0.027 and +0.026) than in full view (+0.013), suggesting that the semantic understanding of the VLM compensates for geometric uncertainty.
- Generalization capability on unseen objects is significantly enhanced, demonstrating that multimodal fusion improves knowledge transferability.
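The per-view margins behind these findings can be recomputed from the Main Results table. The snippet below only does arithmetic on the AUC values reported there, taking PointRefer as the strongest baseline. The dictionary layout and variable names are illustrative.

```python
# AUC values copied from the Main Results table (seen setting).
auc_table = {
    "PointRefer": {"full": 0.877, "partial": 0.821, "rotation": 0.756},
    "Ours": {"full": 0.890, "partial": 0.848, "rotation": 0.782},
}

gains = {}
for view in ("full", "partial", "rotation"):
    baseline = auc_table["PointRefer"][view]
    ours = auc_table["Ours"][view]
    absolute = ours - baseline              # gain in AUC points
    relative = 100.0 * absolute / baseline  # gain as a percentage of the baseline
    gains[view] = (absolute, relative)
    print(f"{view:9s} +{absolute:.3f} AUC ({relative:.2f}% relative)")
```

In relative terms the gain is roughly 1.5% in full view and about 3.3% to 3.4% in the partial and rotation views, so the two incomplete-view settings benefit about equally.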
Highlights & Insights
- Introduces the first 3D affordance grounding task formulation that simultaneously leverages language instructions, visual observations, and interactions.
- The AGPIL dataset fills the gap in multimodal, multi-view, and probabilistic annotation resources.
- The paradigm of incorporating VLMs into 3D affordance tasks is noteworthy, as prior knowledge from the VLM substantially boosts generalization on unseen categories.
- Features an end-to-end, single-stage design (without requiring 2D detection bounding boxes), offering superior scalability.
Limitations & Future Work
- Images and point clouds originate from different scenes (matched via category-level associations), leading to visual-geometric inconsistency.
- LLaVA-7B introduces substantial inference overhead, which is unfavorable for real-world robot deployment.
- The rotation-view setting remains a performance bottleneck; rotation-equivariant networks could be considered for enhancement.
- Only static object affordance is supported, without considering dynamic scenes.
- The language instructions are phrase-level in granularity; more complex instruction understanding remains unexplored.
Related Work & Insights
- 3D AffordanceNet pioneered 3D affordance datasets, and this work extends it towards multimodality and multi-view configurations.
- AffordanceLLM validated the effectiveness of VLMs in 2D affordance, and this work extends it to 3D for the first time.
- Insight: The visual-semantic alignment capability of VLMs can serve as a cross-modal bridge, inspiring future exploration of more advanced 3D understanding tasks.
Rating
⭐⭐⭐⭐ — The task definition is novel and the dataset construction is solid. The technical approach is reasonable, though the innovation is moderate (mostly modular combination). The completeness of the multi-view and unseen settings is highly commendable.