Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions¶

Conference: CVPR 2026
arXiv: 2604.11579
Code: https://mm.kaist.ac.kr/projects/SeeingThroughTouch/
Area: Multimodal VLM
Keywords: tactile localization, visual-tactile alignment, material segmentation, cross-modal learning, dataset

TL;DR¶

This paper introduces the tactile localization task—identifying regions in an image that share the same material properties as a given tactile input—and addresses it via local visual-tactile alignment and a material-diversity pairing strategy for learning dense cross-modal features. Two new tactile-material segmentation datasets are also constructed.

Background & Motivation¶

Background: Visual-tactile learning has primarily focused on global alignment, that is, deciding whether an image and a tactile signal correspond to the same material. Such models lack spatial localization: they cannot identify which regions of a visual scene "feel the same" as a given touch input.

Limitations of Prior Work: (1) Global alignment methods cannot localize material regions; (2) existing datasets predominantly consist of close-up captures where visual frames exhibit minimal variation and a single material fills the entire frame, with no scene-level multi-material images; (3) no evaluation benchmark exists for tactile-material segmentation.

Key Challenge: Tactile localization requires fine-grained local cross-modal correspondence, whereas existing methods and data only provide coarse-grained global alignment.

Core Idea: Learn local visual-tactile alignment to produce tactile saliency maps, and expand effective training pairs through material-diversity pairing.

Method¶

Overall Architecture¶

A tactile encoder produces a globally pooled tactile feature \(\bar{f}_t\), and a visual encoder produces a spatial feature map \(f_v\) → a dense similarity map is computed as \(M[h,w] = \bar{f}_t \cdot f_v[h,w]\) → max-pooling over \(M\) yields a single image-level similarity score for contrastive learning → at inference, the similarity map \(M\) is used directly as the tactile localization output.
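
The pipeline above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature dimension, the map size, the random inputs, and the function names `similarity_map` and `image_level_score` are all assumptions made for illustration, and no feature normalization is applied because the source does not specify one.

```python
import numpy as np

def similarity_map(f_t, f_v):
    """Dense similarity map M[h, w] = f_t . f_v[h, w].

    f_t: (C,) globally pooled tactile feature.
    f_v: (H, W, C) spatial visual feature map.
    """
    return f_v @ f_t  # contracts the channel axis, result has shape (H, W)

def image_level_score(sim_map):
    """Max-pool the map into one score, as used for contrastive training."""
    return float(sim_map.max())

# Hypothetical sizes and random features, for illustration only.
rng = np.random.default_rng(0)
f_t = rng.normal(size=8)
f_v = rng.normal(size=(4, 5, 8))

M = similarity_map(f_t, f_v)
score = image_level_score(M)

# At inference the map itself is the output; its argmax is the best-matching location.
h, w = np.unravel_index(np.argmax(M), M.shape)
```

Because the score comes from the single best-matching location, a training signal on `score` only requires that some region of the image matches the touch input, which is what lets the same map be read as a localization result.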

Key Designs¶

Local Visual-Tactile Alignment:
- Function: Learn spatially resolved cross-modal features.
- Mechanism: The tactile feature is globally pooled into a single vector, and its dot product with the visual feature at each spatial position yields a similarity map; max-pooling over that map gives the image-level score used for contrastive learning. DINOv2 serves as the shared encoder backbone, with the visual backbone frozen and the aligners trained.
- Design Motivation: Max-pooling directs the model to attend to the best-matching region in the image rather than averaging responses over all regions, making it naturally suited for localization.
Material-Diversity Pairing Strategy:
- Function: Expand effective training pairs and enhance cross-instance generalization.
- Mechanism: Two kinds of pairing are used. In-domain pairing: tactile samples and visual frames from different instances of the same material category are combined as positive pairs. Out-of-domain pairing: scene-level web images are collected and matched to tactile samples by material category, relying on the assumption that "similar materials produce similar tactile signals."
- Design Motivation: In Touch-and-Go, visual frames from the same instance are nearly identical, resulting in very few effective training pairs; cross-instance and cross-domain pairing substantially increases diversity.
In-the-Wild Image Collection and Filtering:
- Function: Supplement scene-level multi-material images.
- Mechanism: An LLM generates diverse search phrases for each material category (e.g., "brick chimney in a cozy living room"); images are collected from search engines, filtered for misclassified samples using CLIP similarity, and supplemented with images from the MINC material dataset.
- Design Motivation: Images in the Touch-and-Go (TG) dataset are too close-range and single-material to support training for scene-level localization.
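
The material-diversity pairing idea can be sketched as a label-based pairing routine. This is a simplified illustration under assumptions, not the paper's data pipeline: the function name `build_positive_pairs`, the `pairs_per_tactile` parameter, and the sample IDs are hypothetical, and in-domain frames and web images are treated as one image pool keyed by material label.

```python
import random
from collections import defaultdict

def build_positive_pairs(tactile_samples, images, pairs_per_tactile=2, seed=0):
    """Pair each tactile sample with images that share its material label.

    tactile_samples, images: lists of (sample_id, material_label).
    Returns a list of (tactile_id, image_id) positive pairs.
    """
    rng = random.Random(seed)

    # Index images by material, regardless of which instance or domain they come from.
    images_by_material = defaultdict(list)
    for image_id, material in images:
        images_by_material[material].append(image_id)

    pairs = []
    for tactile_id, material in tactile_samples:
        candidates = images_by_material.get(material, [])
        k = min(pairs_per_tactile, len(candidates))
        for image_id in rng.sample(candidates, k):
            pairs.append((tactile_id, image_id))
    return pairs

# Hypothetical IDs: one close-up frame plus scene-level web images.
tactile = [("t_brick_0", "brick"), ("t_wood_0", "wood")]
images = [
    ("tg_brick_frame_3", "brick"),
    ("web_brick_chimney", "brick"),
    ("web_wood_floor", "wood"),
    ("web_glass_window", "glass"),
]
pairs = build_positive_pairs(tactile, images)
```

Here the brick touch sample is paired with both a close-up frame and a scene-level web image, so each tactile sample sees far more varied visual positives than its own near-identical video frames would provide.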

Loss & Training¶

Training uses a symmetric contrastive (InfoNCE) loss. The visual backbone is frozen, while the tactile encoder and two aligner modules are trained.
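
A symmetric InfoNCE loss over a batch of max-pooled similarity scores might look like the following NumPy sketch. The temperature value, the helper names, and the convention that matched pairs lie on the diagonal are assumptions; the source does not state these details, nor how same-material items inside one batch are treated.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(scores, temperature=0.07):
    """Symmetric InfoNCE over a (B, B) score matrix.

    scores[i, j] is the max-pooled similarity between tactile sample i
    and image j; entry (i, i) is assumed to be the matched pair.
    """
    logits = scores / temperature
    idx = np.arange(scores.shape[0])
    loss_t2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # tactile -> image
    loss_v2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> tactile
    return 0.5 * (loss_t2v + loss_v2t)

# Uninformative scores versus scores that favor the matched pairs.
uniform_loss = symmetric_info_nce(np.zeros((4, 4)))
aligned_loss = symmetric_info_nce(5.0 * np.eye(4))
```

With uninformative scores the loss equals the log of the batch size, and it falls toward zero as matched pairs score higher than mismatched ones in both directions.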

Key Experimental Results¶

Main Results¶

| Dataset | Metric | STT | Prev. SOTA | Gain |
| --- | --- | --- | --- | --- |
| TG-Seg (new) | mIoU | Significantly better | ImageBind | Large |
| Web-Mat-Seg (new) | mIoU | Significantly better | UniTouch | Large |
| OpenSurfaces | F1 | Better | Baseline | Improved |
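
For readers unfamiliar with the mIoU metric in the table, the sketch below shows one way a similarity map could be scored against a ground-truth material mask. The min-max normalization, the 0.5 threshold, and averaging over samples are assumptions for illustration; the source does not describe the paper's exact evaluation protocol.

```python
import numpy as np

def binarize(sim_map, threshold=0.5):
    """Min-max normalize a similarity map and threshold it into a mask."""
    lo, hi = sim_map.min(), sim_map.max()
    norm = (sim_map - lo) / (hi - lo + 1e-8)
    return norm >= threshold

def iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(pred_masks, gt_masks):
    """Average IoU over a list of (prediction, ground truth) mask pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(pred_masks, gt_masks)]))

# Toy 2x3 similarity map and ground-truth mask.
sim = np.array([[0.9, 0.8, 0.1],
                [0.7, 0.2, 0.0]])
gt = np.array([[1, 1, 0],
               [0, 0, 0]])
pred = binarize(sim)
sample_iou = iou(pred, gt)
```

In this toy case the prediction covers three cells, two of which are in the ground truth, so the IoU is 2/3.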

Ablation Study¶

| Configuration | mIoU | Notes |
| --- | --- | --- |
| Full (in-domain + out-of-domain) | Best | Full model |
| In-domain pairing only | Second | Lacks scene-level generalization |
| Standard pairing only | Poor | Too few effective training pairs |
| Global alignment substitute | Poor | No spatial localization capability |

Key Findings¶

- Local alignment substantially outperforms global alignment, confirming that spatially resolved cross-modal features are critical for localization.
- Material-diversity pairing, especially with out-of-domain images, is the key factor for generalizing to scene-level images.
- The model's ability to handle weak tactile signals (e.g., light touch or ambiguous material) improves significantly with the addition of out-of-domain data.

Highlights & Insights¶

- New Task Definition: Tactile localization is a natural yet formally unstudied problem that can inspire broader research into sensory interaction.
- Exploiting "Similar Materials, Similar Touch": This simple assumption substantially expands training data and represents a general strategy for addressing data scarcity in cross-modal learning.

Limitations & Future Work¶

- The tactile sensor type is fixed (GelSight); cross-sensor generalization has not been verified.
- Material category granularity is limited (18 classes).
- Future work could extend to finer-grained material attributes (e.g., roughness, hardness).

vs. ImageBind/UniTouch: These are global alignment methods and have no localization capability.
vs. TaRF: TaRF performs tactile localization inside a 3D NeRF and is therefore restricted to reconstructed scenes.

Rating¶

Novelty: ⭐⭐⭐⭐ Novel task definition with a practical data strategy.
Experimental Thoroughness: ⭐⭐⭐⭐ Two new datasets constructed for evaluation.
Writing Quality: ⭐⭐⭐⭐ Motivation and methodology are clearly described.
Value: ⭐⭐⭐⭐ Opens a new direction for tactile localization research.