Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions¶
- Conference: CVPR 2026
- arXiv: 2604.11579
- Code: https://mm.kaist.ac.kr/projects/SeeingThroughTouch/
- Area: Multimodal VLM
- Keywords: tactile localization, visual-tactile alignment, material segmentation, cross-modal learning, dataset
TL;DR¶
This paper introduces the tactile localization task—identifying regions in an image that share the same material properties as a given tactile input—and addresses it via local visual-tactile alignment and a material-diversity pairing strategy for learning dense cross-modal features. Two new tactile-material segmentation datasets are also constructed.
Background & Motivation¶
Background: Visual-tactile learning has primarily focused on global alignment (determining whether an image and a tactile signal correspond to the same material), lacking spatial localization capability—i.e., the ability to identify regions in a visual scene that "feel the same."
Limitations of Prior Work: (1) Global alignment methods cannot localize material regions; (2) existing datasets predominantly consist of close-up captures where visual frames exhibit minimal variation and a single material fills the entire frame, with no scene-level multi-material images; (3) no evaluation benchmark exists for tactile-material segmentation.
Key Challenge: Tactile localization requires fine-grained local cross-modal correspondence, whereas existing methods and data only provide coarse-grained global alignment.
Core Idea: Learn local visual-tactile alignment to produce tactile saliency maps, and expand effective training pairs through material-diversity pairing.
Method¶
Overall Architecture¶
A tactile encoder extracts tactile features (global pooling) and a visual encoder extracts spatial feature maps → a dense similarity map is computed as \(M[h,w] = \bar{f}_t \cdot f_v[h,w]\) → max-pooling yields a similarity score for contrastive learning → at inference, the similarity map is directly used for tactile localization.
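A minimal PyTorch sketch of this pipeline, assuming the aligner modules have already projected both modalities into a shared D-dimensional space (function and variable names are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def similarity_map(tactile_feat, visual_feats):
    """Dense similarity M[h, w] = f_t . f_v[h, w].

    tactile_feat: (B, D)       globally pooled tactile embedding
    visual_feats: (B, D, H, W) spatial visual feature map
    """
    t = F.normalize(tactile_feat, dim=-1)
    v = F.normalize(visual_feats, dim=1)
    return torch.einsum("bd,bdhw->bhw", t, v)        # (B, H, W)

def contrastive_score(sim_map):
    """Max-pool the similarity map to one score per image for contrastive training."""
    return sim_map.flatten(1).max(dim=1).values      # (B,)

def tactile_saliency(sim_map, image_hw):
    """At inference, upsample the similarity map to image resolution for localization."""
    return F.interpolate(sim_map.unsqueeze(1), size=image_hw,
                         mode="bilinear", align_corners=False).squeeze(1)
```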
Key Designs¶
- Local Visual-Tactile Alignment:
- Function: Learn spatially resolved cross-modal features.
- Mechanism: The tactile feature is globally pooled into a single vector whose dot product with the visual feature at each spatial position yields a similarity map; max-pooling over this map produces the score used for contrastive learning. DINOv2 serves as the shared encoder backbone, with the visual backbone frozen and only the aligners trained.
- Design Motivation: Max-pooling directs the model to attend to the best-matching region in the image rather than averaging responses over all regions, making it naturally suited for localization.
- Material-Diversity Pairing Strategy:
- Function: Expand effective training pairs and enhance cross-instance generalization.
- Mechanism: In-domain pairing treats tactile samples and visual frames from different instances of the same material category as positive pairs; out-of-domain pairing matches scene-level web images to tactile samples by material category, exploiting the assumption that "similar materials produce similar tactile signals" (see the pairing sketch after this list).
- Design Motivation: In Touch-and-Go, visual frames from the same instance are nearly identical, resulting in very few effective training pairs; cross-instance and cross-domain pairing substantially increases diversity.
- In-the-Wild Image Collection and Filtering:
- Function: Supplement scene-level multi-material images.
- Mechanism: An LLM generates diverse search phrases for each material category (e.g., "brick chimney in a cozy living room"); images are collected from search engines, filtered for misclassified samples using CLIP similarity, and supplemented with images from the MINC material dataset.
- Design Motivation: Images in the TG dataset are too close-range and single-material to support training for scene-level localization.
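A rough sketch of how the material-diversity pairing could be implemented as positive-pair sampling; the sample fields (`material`, `modality`, `domain`) and the out-of-domain mixing probability are assumptions rather than the paper's exact pipeline:

```python
import random
from collections import defaultdict

def build_material_index(samples):
    """Group samples by material category.

    Each sample is assumed to be a dict such as
    {"path": ..., "material": "brick", "modality": "tactile" or "visual",
     "domain": "touch_and_go" or "web"}.
    """
    index = defaultdict(lambda: {"tactile": [], "visual": []})
    for s in samples:
        index[s["material"]][s["modality"]].append(s)
    return index

def sample_positive_pair(index, material, p_out_of_domain=0.5):
    """Draw a (tactile, visual) positive pair for one material category.

    In-domain: a tactile sample paired with any visual frame of the same
    material, possibly from a different instance. Out-of-domain: a scene-level
    web image of that material, relying on the assumption that similar
    materials produce similar tactile signals.
    """
    tactile = random.choice(index[material]["tactile"])
    visuals = index[material]["visual"]
    web = [v for v in visuals if v["domain"] == "web"]
    if web and random.random() < p_out_of_domain:
        visual = random.choice(web)                      # out-of-domain pairing
    else:
        in_domain = [v for v in visuals if v["domain"] != "web"]
        visual = random.choice(in_domain or visuals)     # cross-instance in-domain pairing
    return tactile, visual
```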
Loss & Training¶
Symmetric contrastive learning loss (InfoNCE); the visual backbone is frozen while the tactile encoder and two aligner modules are trained.
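A minimal sketch of this objective, assuming the B×B logit matrix is built by scoring every tactile sample in the batch against every image via the max-pooled similarity (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(tactile_feat, visual_feats, temperature=0.07):
    """Symmetric InfoNCE over max-pooled tactile-to-image similarity scores.

    tactile_feat: (B, D), visual_feats: (B, D, H, W); matching indices are positives.
    """
    t = F.normalize(tactile_feat, dim=-1)
    v = F.normalize(visual_feats, dim=1)
    sim = torch.einsum("id,jdhw->ijhw", t, v)                 # (B, B, H, W)
    logits = sim.flatten(2).max(dim=2).values / temperature   # (B, B) max-pooled scores
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2v = F.cross_entropy(logits, targets)               # tactile -> image
    loss_v2t = F.cross_entropy(logits.t(), targets)           # image -> tactile
    return 0.5 * (loss_t2v + loss_v2t)
```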
Key Experimental Results¶
Main Results¶
| Dataset | Metric | Strongest Baseline | STT vs. Baseline |
|---|---|---|---|
| TG-Seg (new) | mIoU | ImageBind | Significantly better, by a large margin |
| Web-Mat-Seg (new) | mIoU | UniTouch | Significantly better, by a large margin |
| OpenSurfaces | F1 | Baseline methods | Improved |
Ablation Study¶
| Configuration | mIoU (relative) | Notes |
|---|---|---|
| Full (in-domain + out-of-domain) | Best | Full model |
| In-domain pairing only | Second best | Lacks scene-level generalization |
| Standard pairing only | Poor | Too few effective training pairs |
| Global alignment substitute | Poor | No spatial localization capability |
Key Findings¶
- Local alignment substantially outperforms global alignment, confirming that spatially resolved cross-modal features are critical for localization.
- Material-diversity pairing—especially out-of-domain images—is the key factor for generalizing to scene-level images.
- The model's ability to handle weak tactile signals (e.g., light touch or ambiguous material) improves significantly with the addition of out-of-domain data.
Highlights & Insights¶
- New Task Definition: Tactile localization is a natural yet formally unstudied problem that can inspire broader research into sensory interaction.
- Exploiting "Similar Materials, Similar Touch": This simple assumption substantially expands training data and represents a general strategy for addressing data scarcity in cross-modal learning.
Limitations & Future Work¶
- The tactile sensor type is fixed (GelSight); cross-sensor generalization has not been verified.
- Material category granularity is limited (18 classes).
- Future work could extend to finer-grained material attributes (e.g., roughness, hardness).
Related Work & Insights¶
- vs. ImageBind/UniTouch: Global alignment methods with no localization capability.
- vs. TaRF: Performs tactile localization in 3D NeRF, restricted to reconstructed scenes.
Rating¶
- Novelty: ⭐⭐⭐⭐ Novel task definition with a practical data strategy.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two new datasets constructed for evaluation.
- Writing Quality: ⭐⭐⭐⭐ Motivation and methodology are clearly described.
- Value: ⭐⭐⭐⭐ Opens a new direction for tactile localization research.