# Weakly-Supervised Learning of Dense Functional Correspondences
- Conference: ICCV 2025
- arXiv: 2509.03893
- Code: Project Page
- Area: Robotics
- Keywords: Dense Functional Correspondence, Weakly-Supervised Learning, Vision-Language Models, Contrastive Learning, Robotic Manipulation
## TL;DR
This paper defines the task of Dense Functional Correspondence—establishing pixel-level dense correspondences between objects of different categories based on shared functionality (e.g., "pouring")—and proposes a weakly-supervised learning framework that distills functional and structural knowledge into a new model via VLM-based pseudo-labeling of functional parts combined with multi-view contrastive learning.
## Background & Motivation
Establishing pixel-level correspondences across images is fundamental to shape reconstruction, image editing, and robotic manipulation. Existing approaches face increasing challenges across three levels of difficulty:
1. Same object, different viewpoints: Multi-view correspondence; relatively straightforward.
2. Same category, different instances: E.g., correspondence between two cats; existing methods (NOCS, keypoint detection) already provide reasonable solutions.
3. Different categories: E.g., functional correspondence between a kettle and a bottle; the most challenging yet practically important scenario.
Key Insight: "Form follows function"—parts of objects that perform similar functions (e.g., a kettle's spout and a bottle's mouth) tend to share shape and appearance similarities, even when the overall objects look very different. This provides a natural bridge for establishing dense correspondences across object categories.
Limitations of prior work:
- Self-supervised representations (DINOv2, Stable Diffusion): Effective for intra-category correspondence but suffer significant accuracy degradation across categories.
- Vision-language models (CogVLM, ManipVQA): Capable of zero-shot detection of functional part bounding boxes, but unable to perform fine-grained pixel-level correspondence reasoning.
- Keypoint methods (Lai et al.): Define only 5 keypoints, insufficient to capture subtle similarities between highly dissimilar objects.
- Affordance learning: Identifies interaction regions within a single image; cannot establish dense correspondences across images.
## Method
### Overall Architecture
The method consists of three stages:
- Evaluation dataset construction: 2D dense functional correspondence annotations are derived via 3D object alignment.
- Training dataset construction: Large-scale functional part labels are obtained using VLM (CogVLM) pseudo-labeling with GPT-4-generated prompts.
- Model training: A function-conditioned MLP is trained on top of frozen DINOv2 features, combining a functional part contrastive loss and a multi-view spatial contrastive loss.
### Key Designs
- Formal Definition of Dense Functional Correspondence (a minimal code sketch follows this item):
  - Function: Provides a rigorous mathematical definition that turns "functional similarity" into a computable 3D distance.
  - Mechanism: Given a function \(\mathcal{F}\) and an image pair \((I_1, I_2)\), a functional correspondence mapping \(f(I_1, I_2; \mathcal{F}): M(I_1;\mathcal{F}) \to M(I_2;\mathcal{F})\) is defined to minimize \(\sum_{p \in M(I_1;\mathcal{F})} \|\pi^{-1}(p) - \pi^{-1}(f(p))\|_2\), where \(\pi^{-1}\) denotes back-projection from pixel to 3D surface point and \(M(I;\mathcal{F})\) denotes the functional-part region of image \(I\).
  - Design Motivation: Defining 2D correspondences through 3D alignment avoids the infeasibility of manual dense annotation and naturally provides an evaluation benchmark.
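As a concrete reading of this definition, here is a minimal sketch (all names are hypothetical; it assumes the two 3D assets have already been functionally aligned and that per-pixel back-projection functions are available) showing that the optimal mapping reduces to a per-pixel nearest-neighbor assignment in 3D:

```python
import numpy as np

def functional_correspondence(mask1_px, mask2_px, backproject_1, backproject_2):
    """Nearest-neighbor realization of f(I1, I2; F) under the definition above.

    mask1_px, mask2_px : (N1, 2) and (N2, 2) int arrays of pixel coordinates
        lying inside the functional-part masks M(I1; F) and M(I2; F).
    backproject_i : callable mapping an (N, 2) pixel array to (N, 3) surface
        points of the *aligned* 3D objects (the pi^{-1} in the definition).
    Returns an (N1,) index array: for each pixel in mask1, the pixel in mask2
    whose back-projected 3D point is closest; since the objective sums an
    independent distance term per pixel, the argmin can be taken per pixel.
    """
    pts1 = backproject_1(mask1_px)               # (N1, 3) 3D surface points
    pts2 = backproject_2(mask2_px)               # (N2, 3)
    # Pairwise Euclidean distances between back-projected points.
    d = np.linalg.norm(pts1[:, None, :] - pts2[None, :, :], axis=-1)
    return d.argmin(axis=1)                       # independent argmin per pixel
```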
- VLM Pseudo-Labeling Pipeline (a sketch of the multi-view aggregation step follows this item):
  - Function: Automatically annotates functional parts of Objaverse 3D assets using large-scale pre-trained VLMs.
  - Mechanism: GPT-4 generates category–function–part text prompts → CogVLM predicts bounding boxes on multi-view renderings → 2D labels are back-projected and aggregated onto 3D point clouds → post-processing yields 2D pixel-level pseudo-labels. The pipeline covers 24 functional categories, 160 object categories, and 8,285 3D assets.
  - Design Motivation: Manual annotation of dense correspondences is infeasible; the pipeline leverages the zero-shot capability of VLMs for pseudo-labeling, with multi-view aggregation and 3D consistency to improve label quality.
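A rough sketch of the multi-view aggregation step (the voting rule, threshold, and data layout are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def aggregate_part_labels(points, views, vote_threshold=0.5):
    """Vote CogVLM's per-view bounding boxes onto a 3D point cloud (sketch).

    points : (P, 3) point cloud sampled from an Objaverse asset.
    views  : list of dicts with
        'project' - callable (P, 3) -> (P, 2) pixel coords for that rendering,
        'visible' - (P,) bool visibility mask for that rendering,
        'bbox'    - (x0, y0, x1, y1) functional-part box predicted by CogVLM.
    Each visible point votes "functional part" if its projection falls inside
    the predicted box; points supported by enough views are kept, and the
    resulting 3D labels can then be re-rendered as 2D pixel-level pseudo-labels.
    """
    votes = np.zeros(len(points))
    counts = np.zeros(len(points))
    for v in views:
        uv = v['project'](points)                 # (P, 2) pixel coordinates
        x0, y0, x1, y1 = v['bbox']
        inside = (uv[:, 0] >= x0) & (uv[:, 0] <= x1) & \
                 (uv[:, 1] >= y0) & (uv[:, 1] <= y1)
        votes += v['visible'] & inside
        counts += v['visible']
    # Fraction of views (where the point is visible) that label it functional.
    frac = votes / np.maximum(counts, 1)
    return frac >= vote_threshold                 # (P,) bool part labels
```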
- Function-Conditioned MLP + Dual Contrastive Learning (both losses are sketched in code after this item):
  - Function: Trains a function-text-conditioned feature extraction network to jointly learn functional semantics and spatial structure.
  - Mechanism: The model \(g_\theta(p|I,\mathcal{F})\) consists of a 3-layer MLP on top of weighted multi-layer DINOv2 features and CLIP text embeddings. Training incorporates:
    - Functional part contrastive loss \(\mathcal{L}_{func}\): InfoNCE loss where functional part pixels form positive pairs and non-functional part pixels form negative pairs, with non-functional pixels also mutually repelled:
      \[
      \mathcal{L}_{func} = -\log\frac{e^{\text{sim}(p_1^+, p_2^+)/\tau}}{e^{\text{sim}(p_1^+, p_2^+)/\tau} + e^{\text{sim}(p_1^+, p_2^-)/\tau} + e^{\text{sim}(p_1^-, p_2^-)/\tau}}
      \]
    - Multi-view spatial contrastive loss \(\mathcal{L}_{spatial}\): Corresponding pixels of the same object across different viewpoints serve as positive pairs, preventing feature collapse:
      \[
      \mathcal{L}_{spatial} = -\log\frac{e^{\text{sim}(q, q_+^\prime)/\tau}}{e^{\text{sim}(q, q_+^\prime)/\tau} + e^{\text{sim}(q, q_-^\prime)/\tau}}
      \]
    - An optional mask prediction loss \(\mathcal{L}_{mask}\).
  - Design Motivation: Using the functional contrastive loss alone leads to mode collapse (all spout features become identical); the spatial contrastive loss preserves structural information (the top and bottom of a spout should have distinct features). The two losses are complementary and indispensable.
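A minimal PyTorch sketch of the function-conditioned head and a generic InfoNCE helper that can instantiate both \(\mathcal{L}_{func}\) and \(\mathcal{L}_{spatial}\). Feature dimensions, the temperature, and the DINOv2 layer-weighting are assumptions, not the released code; the helper also omits the extra term in \(\mathcal{L}_{func}\) that repels non-functional pixels from each other.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunctionConditionedHead(nn.Module):
    """3-layer MLP over (frozen DINOv2 pixel feature, CLIP function embedding)."""
    def __init__(self, dino_dim=768, clip_dim=512, hidden=1024, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dino_dim + clip_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, dino_feat, clip_text):
        # dino_feat: (N, dino_dim) per-pixel features; clip_text: (N, clip_dim)
        z = self.mlp(torch.cat([dino_feat, clip_text], dim=-1))
        return F.normalize(z, dim=-1)       # unit norm -> dot product = cosine sim

def info_nce(anchor, positive, negatives, tau=0.07):
    """Generic InfoNCE: one paired positive vs. a shared set of negatives.

    anchor, positive : (N, D) unit-norm features, paired row by row.
    negatives        : (M, D) unit-norm features shared by all anchors.
    """
    pos = (anchor * positive).sum(-1, keepdim=True) / tau        # (N, 1)
    neg = anchor @ negatives.T / tau                              # (N, M)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(len(anchor), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, target)

# L_func    : anchors = part pixels of image 1, positives = part pixels of
#             image 2, negatives = non-part pixels.
# L_spatial : anchors = pixels in one view, positives = their multi-view
#             correspondences, negatives = other pixels of the second view.
```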
### Loss & Training
- Final loss: \(\mathcal{L} = \mathcal{L}_{func} + \lambda_{spatial}\mathcal{L}_{spatial} + \lambda_{mask}\mathcal{L}_{mask}\)
- \(\lambda_{spatial} = 10\), \(\lambda_{mask} = 1\)
- DINOv2-B backbone is frozen; only the MLP (3 layers, 1024-dim hidden) is trained.
- Adam optimizer, learning rate \(1 \times 10^{-4}\), batch size of 50 image pairs, 128 sampled points per image.
- Random color background augmentation is applied.
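Combining the above, a hypothetical training step under the reported hyperparameters. It reuses `FunctionConditionedHead` and `info_nce` from the earlier sketch; the batch keys and index sets are illustrative, and the optional mask head is omitted.

```python
import torch

# Reuses FunctionConditionedHead and info_nce from the previous sketch.
head = FunctionConditionedHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)   # DINOv2 stays frozen
lambda_spatial, lambda_mask = 10.0, 1.0                     # lambda_mask unused here

def training_step(batch):
    """One step on a batch of image pairs (50 pairs, 128 sampled points each).

    `batch` is assumed to hold frozen DINOv2 pixel features, the CLIP embedding
    of the function text, and precomputed index sets: functional-part /
    non-part pixels of both images (positives pre-paired one-to-one) and
    multi-view correspondences for the first object.
    """
    f1 = head(batch['dino_1'], batch['text'])           # object 1, sampled pixels
    f2 = head(batch['dino_2'], batch['text'])           # object 2 (other category)
    fv = head(batch['dino_1_alt_view'], batch['text'])  # object 1, second viewpoint

    # Functional part contrastive loss (simplified: omits non-part/non-part repulsion).
    l_func = info_nce(f1[batch['part1']], f2[batch['part2']], f2[batch['nonpart2']])
    # Multi-view spatial contrastive loss for the same object.
    l_spatial = info_nce(f1[batch['pix1']], fv[batch['corr']], fv[batch['noncorr']])

    loss = l_func + lambda_spatial * l_spatial           # + lambda_mask * l_mask (optional)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```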
## Key Experimental Results
### Main Results
Synthetic evaluation dataset (1,800+ image pairs, 24 functions, 85% cross-category):
| Method | Norm.Dist↓ | PCK@23p↑ | Best F1@23p↑ | AP@23p↑ |
|---|---|---|---|---|
| Chance | 0.310 | 0.165 | 0.416 | 0.256 |
| DINO | 0.212 | 0.381 | 0.578 | 0.381 |
| SD-DINO | 0.227 | 0.376 | 0.563 | 0.341 |
| CogVLM + DINO | 0.180 | 0.416 | 0.678 | 0.556 |
| Ours (full) | 0.170 | 0.486 | 0.768 | 0.685 |
Real-world evaluation dataset (HANDAL, 500+ image pairs, 13 functions):
| Method | Norm.Dist↓ | PCK@23p↑ | Best F1@23p↑ | AP@23p↑ |
|---|---|---|---|---|
| DINO | 0.206 | 0.408 | 0.589 | 0.382 |
| CogVLM + DINO | 0.172 | 0.440 | 0.695 | 0.561 |
| Ours (full w/ mask) | 0.153 | 0.501 | 0.808 | 0.730 |
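For reference, an assumed reading of the two simpler table metrics. These definitions are inferred from the column headers, not taken from the paper's code; Best F1 and AP additionally involve a match-acceptance threshold and are omitted here.

```python
import numpy as np

def correspondence_metrics(pred_px, gt_px, image_size=224, thresh_px=23):
    """Assumed metric definitions inferred from the column headers.

    Norm.Dist : mean Euclidean pixel error between predicted and ground-truth
                correspondences, normalized by the image size.
    PCK@23p   : fraction of query pixels whose prediction falls within 23
                pixels of the ground truth.
    pred_px, gt_px : (N, 2) pixel locations in the target image for the same
                N query pixels of the source image.
    """
    err = np.linalg.norm(pred_px.astype(float) - gt_px.astype(float), axis=-1)
    return {'norm_dist': float(err.mean() / image_size),
            'pck@23p': float((err <= thresh_px).mean())}
```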
### Ablation Study
| Configuration | Norm.Dist↓ | PCK@23p↑ | AP@23p↑ | Note |
|---|---|---|---|---|
| Functional only | 0.228 | 0.287 | 0.441 | Mode collapse; structural information lost |
| Spatial only | 0.204 | 0.470 | 0.412 | Lacks functional semantics |
| Full (w/o mask) | 0.170 | 0.486 | 0.685 | Functional + spatial are complementary |
| Full (w/ mask) | 0.172 | 0.480 | 0.684 | Mask loss improves real-world performance |
### Key Findings
- Training on purely synthetic data generalizes to real images: Models trained on synthetic Objaverse data perform well on the HANDAL real-world dataset.
- Functional and spatial contrastive losses are mutually necessary: Using functional loss alone causes mode collapse (PCK only 0.287); spatial loss alone lacks functional understanding (AP only 0.412).
- Inference speed advantage: The model runs approximately 50× faster than CogVLM and approximately 1,000× faster than ManipVQA.
- Prompting ManipVQA with function names (ManipVQA-F) performs substantially worse than prompting with part names (ManipVQA-P), indicating that zero-shot functional reasoning remains difficult.
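The speed advantage is plausible because, once the conditioned features are computed, inference reduces to a single forward pass plus nearest-neighbor matching. A hypothetical readout, assuming correspondences are taken as cosine-similarity argmax in the learned feature space (consistent with the contrastive training, but not spelled out in the summary above):

```python
import torch

@torch.no_grad()
def match_functional_pixels(feat1, feat2, mask1, mask2):
    """feat1, feat2 : (H*W, D) unit-norm conditioned features of the two images.
    mask1, mask2   : (H*W,) bool masks restricting matching to the objects
                     (or to predicted functional parts).
    Returns, for every masked pixel of image 1, the flat index of its best
    match among the masked pixels of image 2 (cosine-similarity argmax).
    """
    idx2 = mask2.nonzero(as_tuple=True)[0]
    sim = feat1[mask1] @ feat2[idx2].T        # (N1, N2) cosine similarities
    return idx2[sim.argmax(dim=1)]            # flat pixel indices into image 2
```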
## Highlights & Insights
- Task definition contribution: The paper is the first to formally define the Dense Functional Correspondence task, filling the gap between sparse functional keypoint correspondence and intra-category dense correspondence.
- Clever data strategy: GPT-4 and CogVLM are combined for large-scale pseudo-labeling, with 3D consistency verification to minimize human intervention.
- Complementary distillation: The semantic understanding capability of VLMs and the spatial correspondence capability of DINOv2 are jointly distilled into a lightweight MLP.
- Evaluation methodology: Ground-truth 2D dense correspondences are automatically derived via 3D object alignment, providing a scalable evaluation pipeline.
- Practical value: The approach has direct applicability to robotic imitation learning, e.g., transferring manipulation demonstrations from one object to another of a different category.
## Limitations & Future Work
- The method assumes input images have already been segmented; additional segmentation modules would be required in practical deployment.
- The granularity of functional categories (24 functions) may be insufficient to cover all real-world needs.
- Objaverse asset quality varies; functional part annotations for some assets may contain noise.
- Evaluation is limited to tool-type and container-type objects; applicability to broader object categories (e.g., furniture, electronic devices) remains to be validated.
- The practical effectiveness of functional correspondences in downstream robotic manipulation tasks has not been explored.
## Related Work & Insights
- Compared to intra-category correspondence methods such as NOCS, functional correspondence operates at a higher level of abstraction that transcends category boundaries.
- The paradigm of VLM pseudo-labeling combined with contrastive learning distillation is generalizable to other tasks requiring dense annotations with high annotation costs.
- The idea of defining 2D correspondences via 3D alignment is extensible to other correspondence tasks involving 3D structure.
- Distinction from affordance grounding: affordance focuses on "how to interact with an object," whereas functional correspondence focuses on "aligning functionally equivalent parts across different objects."
## Rating
- Novelty: ⭐⭐⭐⭐⭐ Novel task definition + innovative data construction pipeline + elegant training methodology
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on both synthetic and real-world datasets with clear ablations, but lacks downstream task validation
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear, method pipeline is coherent, and figures are well-crafted
- Value: ⭐⭐⭐⭐⭐ Opens a new research direction with significant application potential in robotic manipulation