Cross-Modal Taxonomic Generalization in (Vision-) Language Models

  • Conference: ACL 2026
  • arXiv: 2603.07474
  • Code: https://github.com/sally-xu-42/cross-modal-taxonomic-gen
  • Area: Causal Inference
  • Keywords: Cross-Modal Generalization, Taxonomic Knowledge, Hypernymy, Vision-Language Model, Visual Coherence

TL;DR

This paper systematically studies whether the language model inside a VLM can generalize taxonomic knowledge (hypernym relations) learned purely from text to visual inputs. It finds that even without any vision-language hypernym supervision, a pretrained LM can identify hypernym categories in images — but this cross-modal generalization requires visual coherence among the members of each category.

Method

Key Designs

  1. Random Hypernym Ablation: Randomly removes 10–100% of the leaf-concept image-to-hypernym mappings from training, measuring how generalization degrades as supervision decreases.

  2. Systematic Hypernym Ablation: Removes entire hypernym categories from the training data — a stricter test than random ablation.

  3. Counterfactual Shuffling Experiments: Cross-category shuffling (breaks visual coherence) vs. within-category shuffling (preserves visual coherence). If the LM were merely executing arbitrary symbolic rules ("IF crow THEN bird"), both shuffles should hurt performance equally.
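The two shuffling conditions in Design 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation (see the linked repo); the toy taxonomy and the one-image-per-leaf setup are hypothetical stand-ins:

```python
import random

random.seed(0)

# Hypothetical toy taxonomy: leaf concepts grouped under hypernym categories.
taxonomy = {
    "bird": ["crow", "robin", "sparrow"],
    "dog": ["beagle", "poodle", "husky"],
}

# One placeholder "image" per leaf concept (stand-in for real image sets).
images = {leaf: f"img_{leaf}" for leaves in taxonomy.values() for leaf in leaves}

def within_category_shuffle(taxonomy, images):
    """Reassign images to leaves only within the same hypernym category,
    so every category still shows visually coherent members."""
    shuffled = {}
    for leaves in taxonomy.values():
        pool = [images[leaf] for leaf in leaves]
        random.shuffle(pool)
        shuffled.update(dict(zip(leaves, pool)))
    return shuffled

def cross_category_shuffle(taxonomy, images):
    """Reassign images across all leaves regardless of category,
    destroying visual coherence among category members."""
    leaves = [leaf for ls in taxonomy.values() for leaf in ls]
    pool = [images[leaf] for leaf in leaves]
    random.shuffle(pool)
    return dict(zip(leaves, pool))
```

The contrast is the point: after `within_category_shuffle`, a "bird" leaf still gets a bird image, while `cross_category_shuffle` can hand a crow a poodle image — only the latter should collapse generalization if visual coherence is what carries it.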

Key Experimental Results

  • A pretrained LM performs significantly above chance even with zero hypernym supervision; a randomly initialized LM stays near chance
  • Cross-category shuffling collapses generalization, while within-category shuffling maintains performance — showing that visual coherence is a necessary condition
  • DINOv2 (no text training) and SigLIP (text-trained) show no significant difference as image encoders, confirming the generalization originates in the LM
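The zero-supervision point in the first result comes from Design 1's random ablation, which can be sketched as below. This is a hedged illustration with made-up triples; the names `random_hypernym_ablation` and the triple format are assumptions, not the paper's API:

```python
import random

random.seed(0)

# Hypothetical (image, leaf, hypernym) training triples.
triples = [(f"img_{i}", f"leaf_{i % 5}", f"hyp_{i % 2}") for i in range(20)]

def random_hypernym_ablation(triples, fraction):
    """Uniformly drop `fraction` of the image-leaf-hypernym mappings;
    fraction=1.0 corresponds to zero hypernym supervision."""
    keep = round(len(triples) * (1 - fraction))
    return random.sample(triples, keep)

# Sweeping fraction from 0.1 to 1.0 traces the generalization curve
# as supervision is progressively removed.
ablated = {f: random_hypernym_ablation(triples, f) for f in (0.1, 0.5, 1.0)}
```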

Highlights & Insights

  • Rigorous experimental design: each factor is isolated through carefully controlled comparisons
  • Visual coherence is the bridge: the LM doesn't blindly execute "IF crow THEN bird" rules — it requires that bird-category members actually cluster in representation space
  • Provides empirical support for the "Platonic Representation Hypothesis"

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐