Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction

Conference: CVPR 2025
arXiv: 2505.13091
Code: None
Area: 3D Vision
Keywords: Tactile Sensing, 3D Reconstruction, Diffusion Models, Reinforcement Learning, Active Exploration

TL;DR

This paper proposes Touch2Shape, which utilizes a touch-conditioned diffusion model to generate compact shape representations in a low-dimensional latent space. Combined with reinforcement learning to train a touch exploration policy, it achieves active 3D shape exploration and reconstruction from tactile images, guiding the next touch location without requiring complete shape generation at each step.

Background & Motivation

Background: 3D generation and reconstruction are core tasks in computer vision. Current approaches primarily rely on visual inputs (single-view/multi-view RGB, depth images), and diffusion models have demonstrated strong capabilities in 3D generation (e.g., SDFusion, DiffusionSDF). However, these methods are mainly targeted at global shape prediction given partial observations.

Limitations of Prior Work: (1) Vision-based methods rely on pre-defined partial observations and cannot actively explore targets; (2) Although visual methods can estimate global shapes, they tend to miss local details and therefore struggle with complex shapes; (3) In real-world scenarios, occlusions and lighting conditions can severely affect visual data acquisition. Tactile sensing is free from these constraints and can obtain precise local 3D contact information, but current tactile-based 3D reconstruction methods (such as TouchSDF) lack global shape understanding.

Key Challenge: Vision is strong globally but weak on local details and constrained by the environment, while touch excels at local details but lacks global information and requires active planning. The open question is how to combine the generative power of diffusion models with the precise local information of tactile sensing, and how to design an efficient exploration policy on top of that combination.

Goal: Design a unified framework that (1) utilizes a touch-conditioned diffusion model for shape reconstruction, (2) uses the latent representations generated by the diffusion model to guide active touch exploration policies, and (3) generates the complete shape only at the final step, avoiding high computational overhead at each step.

Key Insight: Diffusion models can encode compact shape representations in a low-dimensional latent space. This representation can be used both for final shape decoding and as input to a reinforcement learning policy to predict the next touch location. This achieves the unification of "generation" and "exploration" in the latent space.

Core Idea: Generate low-dimensional latent vectors using a touch-conditioned diffusion model. This vector simultaneously drives shape reconstruction and the exploration policy, removing the need to generate a full T-SDF volume at each step, and thereby decoupling shape decoding from exploration.

Method

Overall Architecture

The system consists of four pre-trained modules and an inference pipeline that uses them jointly. First, a VQ-VAE is pre-trained to encode 3D shapes (T-SDF volumes) into a low-dimensional latent space and decode them. Then, TouchCNN is pre-trained to transform tactile images into touch charts (3D coordinates of contact patches). Next, a Contrastive Touch Encoder is trained with contrastive learning to align touch and shape encodings in a shared space. During the training phase, the pre-trained modules are frozen, and the touch-conditioned diffusion model and Touch Shape Fusion are trained. During inference, the robotic arm touches the target object to obtain tactile images, the diffusion model denoises under the tactile conditions to obtain the latent vector, and the policy network predicts the next touch location from this vector. Only in the final step are the shape decoder and Touch Shape Fusion used to generate the complete 3D shape.
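
The inference loop described above can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: `explore_and_reconstruct` and the four callables (`touch`, `denoise`, `choose_action`, `decode`) are hypothetical names, and the stubs at the bottom only stand in for the real robot, diffusion model, policy network, and decoder. The point it shows is that the decoder runs once, after the last touch.

```python
from typing import Any, Callable, List, Sequence, Set, Tuple

Latent = List[float]


def explore_and_reconstruct(
    touch: Callable[[int], Any],
    denoise: Callable[[Sequence[Any]], Latent],
    choose_action: Callable[[Latent, Latent, Set[int]], int],
    decode: Callable[[Latent, Sequence[Any]], Any],
    first_action: int,
    num_touches: int,
) -> Tuple[Any, List[int]]:
    """Touch, denoise to a latent, pick the next touch; decode only at the end."""
    if num_touches < 1:
        raise ValueError("need at least one touch")
    charts: List[Any] = []      # history of touch charts T_0 .. T_{n-1}
    visited: Set[int] = set()   # action indices already touched
    action = first_action
    z_init: Latent = []
    z: Latent = []
    for step in range(num_touches):
        charts.append(touch(action))
        visited.add(action)
        z = denoise(charts)     # latent conditioned on all touches so far
        if step == 0:
            z_init = z          # the policy compares against the first latent
        if step < num_touches - 1:
            action = choose_action(z_init, z, visited)
    # The full shape is generated a single time, from the final latent.
    return decode(z, charts), sorted(visited)


# Hypothetical stubs, used only to exercise the control flow.
decode_calls = [0]


def _stub_touch(action: int) -> Any:
    return ("chart", action)


def _stub_denoise(charts: Sequence[Any]) -> Latent:
    return [float(len(charts))]


def _stub_choose(z_init: Latent, z: Latent, visited: Set[int]) -> int:
    return min(a for a in range(50) if a not in visited)


def _stub_decode(z: Latent, charts: Sequence[Any]) -> Any:
    decode_calls[0] += 1
    return ("shape", z[0], len(charts))


shape, touched = explore_and_reconstruct(
    _stub_touch, _stub_denoise, _stub_choose, _stub_decode,
    first_action=7, num_touches=3,
)
```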

Key Designs

Touch-conditioned Diffusion:
- Function: Generate shape representations in a low-dimensional latent space based on tactile information.
- Mechanism: Given a latent vector \(z\) encoded by a pre-trained shape encoder, noise is added at a random time step, and the denoising network \(E_\theta\) is trained using tactile conditions \(C(T_0,...,T_{n-1})\). The loss function is \(L_{diff}(t,n) = \|E_\theta(z_t, r(t), C(T_0,...,T_{n-1})) - \epsilon_t\|_2\). The Touch Embedding extracts features from up to N tactile image charts (an \(N \times M \times 4\) tensor containing coordinates and touch states) via positional encoding and convolutions, generating N tokens. It supports both touch-only and vision-touch modes, the latter extracting image feature tokens via ResNet and concatenating them with touch tokens before feeding them to the denoising network.
- Design Motivation: Operating in the latent space avoids the huge overhead of generating a \(64^3\) T-SDF volume at each step; tactile images are inherently local information, and the generative ability of diffusion models can infer global structure from local inputs.
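
To make the training objective concrete, here is a minimal sketch of one noise-prediction loss evaluation on a latent vector. Several things are assumptions not stated in the summary: the standard DDPM forward process with a linear beta schedule, passing the raw step `t` instead of the time embedding \(r(t)\), and the two toy denoisers, which exist only to show the two extremes of the loss.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def add_noise(z, t, alpha_bar, rng):
    """Forward process: z_t = sqrt(a_t) * z + sqrt(1 - a_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z]
    a = alpha_bar[t]
    z_t = [math.sqrt(a) * zi + math.sqrt(1.0 - a) * ei for zi, ei in zip(z, eps)]
    return z_t, eps


def diffusion_loss(denoiser, z, t, cond, alpha_bar, rng):
    """L2 norm between the predicted and the true noise, as in L_diff."""
    z_t, eps = add_noise(z, t, alpha_bar, rng)
    eps_hat = denoiser(z_t, t, cond)
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(eps_hat, eps)))


alpha_bar = make_alpha_bar(1000)
z = [0.5, -1.0, 2.0, 0.0]  # toy latent; real latents come from the shape encoder


def oracle_denoiser(z_t, t, cond):
    # Toy stand-in: the "condition" is the clean latent itself, so the
    # noise can be recovered exactly. A real condition would be touch tokens.
    a = alpha_bar[t]
    return [(zt - math.sqrt(a) * c) / math.sqrt(1.0 - a) for zt, c in zip(z_t, cond)]


def zero_denoiser(z_t, t, cond):
    # Toy stand-in that ignores the condition and predicts no noise.
    return [0.0 for _ in z_t]


loss_oracle = diffusion_loss(oracle_denoiser, z, 500, z, alpha_bar, random.Random(0))
loss_zero = diffusion_loss(zero_denoiser, z, 500, None, alpha_bar, random.Random(0))
```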

Touch Shape Fusion Module:
- Function: Optimize the shape details generated by the diffusion model using tactile information.
- Mechanism: Merge all historical touch details into a global touch shape, voxelize it, and extract multi-scale features with an additional voxel encoder. These features are fused with features of different scales during the VQ-VAE decoding process. The fusion uses a softmax weighting mechanism: \(M_1(c,k,j,i) = \frac{\lambda \cdot F_3^e(c,k,j,i) \cdot e^{F_1^d(c,k,j,i)}}{\sum_{c'} e^{F_1^d(c',k,j,i)}}\), where \(\lambda\) is a learnable weight. The decoder is derived from the pre-trained VQ-VAE and is fine-tuned during fusion training.
- Design Motivation: Shapes generated by diffusion models in the latent space are globally consistent but may lose local details due to low-dimensional compression. The precise local 3D information provided by tactile charts can directly supplement details during the decoding phase.
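
The fusion formula can be read per voxel: the decoder features are turned into channel-wise softmax weights, which then scale the touch-encoder features. The sketch below applies it to a single voxel position \((k,j,i)\) using plain lists. `fuse_voxel` and `channel_softmax` are illustrative names, and how \(M_1\) is subsequently combined with the decoder feature map is not specified in this summary, so that step is left out.

```python
import math


def channel_softmax(values):
    """Numerically stable softmax over the channel axis of one voxel."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_voxel(enc_feat, dec_feat, lam):
    """M(c) = lam * F_e(c) * softmax_c(F_d)(c) for one voxel position."""
    weights = channel_softmax(dec_feat)
    return [lam * e * w for e, w in zip(enc_feat, weights)]


# Toy features for one voxel with three channels.
fused_uniform = fuse_voxel([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], lam=1.0)
fused_peaked = fuse_voxel([2.0, 2.0, 2.0], [10.0, 0.0, 0.0], lam=0.5)
```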

Reinforcement Learning-based Exploration Policy Training:
- Function: Learn the optimal sequence of touch locations to maximize reconstruction quality.
- Mechanism: At each step, the denoised latent vector \(z'\) from the diffusion model is fed into the policy network. The policy network encodes the difference between the initial and current latent vectors, integrates the embeddings of 50 candidate actions (position indices on a sphere), and predicts the Q-value for each action using fully connected layers. The reward function is based on the change in diffusion loss: \(R = H(L_{diff}(t,n-1) - L_{diff}(t,n))\), so a positive reward is given when the new touch lowers the diffusion loss, i.e. makes the denoising more accurate. The policy is trained using DQN. The key innovation is that no full shape has to be generated and no Chamfer Distance computed at every step; only diffusion losses in the latent space are compared.
- Design Motivation: Prior methods (such as ActiveVT) required generating a full mesh and computing CD as the reward at each step, which is computationally expensive. This work defines the reward in the latent space, decoupling shape decoding from the exploration policy.
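
The reward and action selection above can be sketched in a few lines. The assumptions are labeled: \(H\) is taken to be a Heaviside step (the summary does not define it), already-touched positions are masked out, and the epsilon-greedy rule and discount `gamma` are the usual DQN ingredients rather than details confirmed by the source. All function names are illustrative.

```python
import random

NUM_ACTIONS = 50  # candidate touch positions on a sphere, per the summary


def step_reward(loss_prev, loss_curr):
    """R = H(L_prev - L_curr), with H assumed to be a Heaviside step."""
    return 1.0 if loss_prev - loss_curr > 0.0 else 0.0


def select_action(q_values, visited, epsilon, rng):
    """Epsilon-greedy choice over positions not yet touched (masking assumed)."""
    candidates = [a for a in range(len(q_values)) if a not in visited]
    if not candidates:
        raise ValueError("all candidate positions have been touched")
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=lambda a: q_values[a])


def dqn_target(reward, next_q_values, gamma, done):
    """Standard one-step DQN target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


q = [0.1] * NUM_ACTIONS
q[3], q[7] = 0.9, 0.5
greedy = select_action(q, visited={3}, epsilon=0.0, rng=random.Random(0))
target = dqn_target(step_reward(0.8, 0.5), [0.2, 0.5], gamma=0.9, done=False)
```

Because the reward only needs two scalar diffusion losses, no mesh or T-SDF volume has to be decoded inside the training loop.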

Loss & Training

Training is divided into three stages: (1) VQ-VAE pre-training (on ABC+ShapeNet); (2) Diffusion model training (1M iterations, lr=1e-5, batch=12) + Touch Shape Fusion training (250k iterations, lr=1e-4, batch=8), which can run in parallel; (3) Policy training (200 epochs, lr=3e-4, batch=16). The entire process takes about one week on an RTX 4090. Contrastive learning uses MoCo, where tactile features are queries and shape features are keys.
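
For the contrastive stage, the core of a MoCo-style objective is the InfoNCE loss, with a tactile feature as the query and shape features as keys. The sketch below covers only that loss for a single query. MoCo's momentum encoder and key queue are omitted, the temperature of 0.07 is an assumed value, and the two-dimensional vectors are toy data.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """-log softmax of the positive logit among positive + negative logits."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


touch_query = [1.0, 0.0]                    # toy tactile feature
matching_shape = [2.0, 0.0]                 # same direction as the query
other_shapes = [[0.0, 1.0], [-1.0, 0.0]]    # toy negatives

loss_aligned = info_nce(touch_query, matching_shape, other_shapes)
loss_misaligned = info_nce(touch_query, [0.0, 1.0], [[2.0, 0.0], [-1.0, 0.0]])
```

The loss is near zero when the tactile query points the same way as its paired shape key and large when a negative key matches better, which is the alignment pressure the contrastive encoder relies on.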

Key Experimental Results

Main Results

| Dataset | Mode | Method | Grasp #0 | Grasp #1 |
|---|---|---|---|---|
| ABC (CD↓) | Touch-only (T) | VTRecon | 25.586 | 9.016 |
| ABC (CD↓) | Touch-only (T) | ActiveVT | 24.864 | 8.220 |
| ABC (CD↓) | Touch-only (T) | Touch2Shape | 40.283 | 6.794 |
| ABC (CD↓) | Vision-Touch (T+V) | VTRecon | 2.653 | 2.637 |
| ABC (CD↓) | Vision-Touch (T+V) | ActiveVT | 2.538 | 2.486 |
| ABC (CD↓) | Vision-Touch (T+V) | Touch2Shape | 1.475 | 1.406 |

| Dataset | Method | 1 touch | 10 touches | 20 touches |
|---|---|---|---|---|
| ShapeNet (EMD↓) | TouchSDF | 0.136 | 0.112 | 0.081 |
| ShapeNet (EMD↓) | Ours (T) | 0.124 | 0.056 | 0.053 |
| ShapeNet (EMD↓) | Ours (T+V) | 0.048 | 0.046 | 0.042 |
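
For readers unfamiliar with the CD metric reported above, here is a minimal sketch of a symmetric Chamfer Distance between two point sets. The exact variant used in the paper (squared or unsquared distances, number of sampled points, any scaling factor) is not stated in this summary, so the squared, mean-reduced form below is an assumption, and its absolute values are not comparable to the table.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_way_chamfer(source, target):
    """Mean squared distance from each source point to its nearest target point."""
    return sum(min(squared_distance(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance: sum of the two one-way terms."""
    return one_way_chamfer(points_a, points_b) + one_way_chamfer(points_b, points_a)


square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
shifted = [(x, y, z + 0.5) for x, y, z in square]  # same square, moved by 0.5 in z

cd_same = chamfer_distance(square, square)
cd_shifted = chamfer_distance(square, shifted)
```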

Ablation Study

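| Mode | Contrastive Learning (CL) | Fusion | CD↓ |
|---|---|---|---|
| Touch-only (T) | ✗ | ✗ | 4.430 |
| Touch-only (T) | ✓ | ✗ | 3.298 |
| Touch-only (T) | ✓ | ✓ | 3.134 |
| Vision-only (V) | ✗ | ✗ | 2.242 |
| Vision-only (V) | ✓ | ✗ | 2.068 |
| Vision-Touch (T+V) | ✓ | ✓ | 1.304 |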

Key Findings

Vision-tactile fusion is highly effective: In T+V mode, the CD is only 1.304, well below vision-only (2.068) and touch-only (3.134), which supports the complementarity of the two modalities.
Contrastive touch encoder contributes the most: In touch-only mode, adding CL reduces the CD from 4.430 to 3.298 (a 25.6% gain), showing that aligned embedding of touch and shape is crucial.
Effective exploration policy: The RL policy reduces the CD after 5 grasps to 6.63% of the initial value (T-only mode), outperforming the random policy (8.14%) and the uniform policy (7.44%).
Touch Shape Fusion adds a further improvement: In touch-only mode, CD decreases from 3.298 to 3.134 (about 5%), suggesting that precise tactile geometric information corrects local details in the decoder output.
Higher initial CD: At the initial touch, Touch2Shape has a higher CD (40.283 vs ActiveVT 24.864) because global shapes generated by the diffusion model with minimal information deviate heavily, but it quickly outperforms other methods as more touches are made.
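
The percentages quoted in these findings follow directly from the ablation table. The short check below recomputes them as relative CD reductions; only the CD values come from the table, and `relative_reduction` is an illustrative helper name.

```python
def relative_reduction(before, after):
    """Percentage by which a lower-is-better metric drops from before to after."""
    return 100.0 * (before - after) / before


# CD values copied from the ablation table.
cl_gain = round(relative_reduction(4.430, 3.298), 1)       # touch-only, adding CL
fusion_gain = round(relative_reduction(3.298, 3.134), 1)   # touch-only, adding fusion
tv_vs_v = round(relative_reduction(2.068, 1.304), 1)       # T+V vs vision-only (with CL)
tv_vs_t = round(relative_reduction(3.134, 1.304), 1)       # T+V vs touch-only (CL + fusion)

print(cl_gain, fusion_gain, tv_vs_v, tv_vs_t)  # prints: 25.6 5.0 36.9 58.4
```

Contrastive learning therefore accounts for roughly five times the relative gain of the fusion module in touch-only mode.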

Highlights & Insights

Unified exploration and reconstruction in latent space: By unifying both shape reconstruction and exploration policies within the latent space of the diffusion model, the high computational cost of generating complete 3D shapes at each step is avoided. This elegant design turns the diffusion model from a mere "generator" into the core "perception-planning" engine.
Diffusion loss-based reward design: The RL reward is the change in \(L_{diff}\) rather than a CD computed at each step. This not only reduces computation but also leverages the diffusion model's implicit uncertainty estimate: if a touch makes the model more "certain" about the target shape, that touch carried high information gain.
Bridging touch and shape modalities via contrastive learning: Using MoCo to align tactile charts and shape latents in the same space allows the diffusion model to directly leverage tactile conditions to generate latents close to the ground-truth shape.

Limitations & Future Work

Only validated in simulation environments without deployment to real robot platforms; the sim-to-real gap may impact practical usability.
The VQ-VAE's resolution of \(64^3\) limits reconstruction accuracy; higher resolutions would incur larger computational costs.
The action space for the exploration policy is limited to 50 spherical positions, which may not be fine-grained enough for complex shapes (e.g., objects with holes).
Currently only handles single-object reconstruction; extending this to scene-level tactile exploration is a more challenging direction.
Future work could integrate neural rendering techniques, using active tactile sensing to synthesize multi-view visual images.

Comparison with Related Work

vs ActiveVT: ActiveVT requires generating a full mesh, encoding latents, and computing CD at each step for policy input and rewards, whereas Touch2Shape operates exclusively in the latent space and only decodes the shape at the final step. In T+V mode, Touch2Shape significantly leads with a CD of 1.406 vs 2.486.
vs TouchSDF: TouchSDF maps tactile images to local SDFs and joins them via implicit neural functions, lacking the ability for global shape generation. Touch2Shape offsets this drawback by leveraging the global generation capabilities of diffusion models, yielding an EMD of 0.053 vs 0.081 at 20 touches.
vs SDFusion: SDFusion also utilizes latent diffusion for 3D generation but is based on vision/text conditioning. Touch2Shape introduces tactile signals as a new conditioning modality, extending the application scenarios of 3D diffusion models.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to apply diffusion models to touch-conditioned 3D shape exploration; the idea of unifying exploration and reconstruction in the latent space is highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Complete experiments conducted on two datasets (ABC and ShapeNet) with various modality settings, exploration policy comparisons, and thorough ablation studies.
Writing Quality: ⭐⭐⭐⭐ The methodology is clearly described and diagrams are intuitive, though some equation formatting is slightly disorganized.
Value: ⭐⭐⭐⭐ Outlines a new paradigm for the intersection of tactile sensing and 3D reconstruction, providing valuable insights for active robotic perception.