Cross-Modal Taxonomic Generalization in (Vision-) Language Models

  • Conference: ACL 2026
  • arXiv: 2603.07474
  • Code: https://github.com/sally-xu-42/cross-modal-taxonomic-gen
  • Area: Causal Inference
  • Keywords: Cross-Modal Generalization, Taxonomic Knowledge, Hypernymy, Vision-Language Model, Visual Coherence

TL;DR

This paper systematically studies whether the language model inside a VLM can generalize taxonomic knowledge (hypernym relations) learned purely from text to visual inputs. It finds that even without any vision-language hypernym supervision, a pretrained LM can identify hypernym categories in images — but this cross-modal generalization requires visual coherence among the members of each category.

Method

Key Designs

  1. Random Hypernym Ablation: Randomly removes 10–100% of the leaf-concept image-to-hypernym mappings from training, measuring how generalization degrades as supervision decreases.

  2. Systematic Hypernym Ablation: Removes entire hypernym categories from the training data — a stricter test than random ablation.

  3. Counterfactual Shuffling Experiments: Cross-category shuffling (breaks visual coherence) vs. within-category shuffling (preserves visual coherence). If the LM were merely executing arbitrary symbolic rules ("IF crow THEN bird"), both shuffles should hurt performance equally.
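The two shuffling conditions in Design 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation (see the linked repo); the toy taxonomy and the one-image-per-leaf setup are hypothetical stand-ins:

```python
import random

random.seed(0)

# Hypothetical toy taxonomy: leaf concepts grouped under hypernym categories.
taxonomy = {
    "bird": ["crow", "robin", "sparrow"],
    "dog": ["beagle", "poodle", "husky"],
}

# One placeholder "image" per leaf concept (stand-in for real image sets).
images = {leaf: f"img_{leaf}" for leaves in taxonomy.values() for leaf in leaves}

def within_category_shuffle(taxonomy, images):
    """Reassign images to leaves only within the same hypernym category,
    so every category still shows visually coherent members."""
    shuffled = {}
    for leaves in taxonomy.values():
        pool = [images[leaf] for leaf in leaves]
        random.shuffle(pool)
        shuffled.update(dict(zip(leaves, pool)))
    return shuffled

def cross_category_shuffle(taxonomy, images):
    """Reassign images across all leaves regardless of category,
    destroying visual coherence among category members."""
    leaves = [leaf for ls in taxonomy.values() for leaf in ls]
    pool = [images[leaf] for leaf in leaves]
    random.shuffle(pool)
    return dict(zip(leaves, pool))
```

The contrast is the point: after `within_category_shuffle`, a "bird" leaf still gets a bird image, while `cross_category_shuffle` can hand a crow a poodle image — only the latter should collapse generalization if visual coherence is what carries it.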

Key Experimental Results

  • A pretrained LM performs significantly above chance even with zero hypernym supervision; a randomly initialized LM stays near chance
  • Cross-category shuffling collapses generalization, while within-category shuffling maintains performance — showing that visual coherence is a necessary condition
  • DINOv2 (no text training) and SigLIP (text-trained) show no significant difference as image encoders, confirming the generalization originates in the LM
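The zero-supervision point in the first result comes from Design 1's random ablation, which can be sketched as below. This is a hedged illustration with made-up triples; the names `random_hypernym_ablation` and the triple format are assumptions, not the paper's API:

```python
import random

random.seed(0)

# Hypothetical (image, leaf, hypernym) training triples.
triples = [(f"img_{i}", f"leaf_{i % 5}", f"hyp_{i % 2}") for i in range(20)]

def random_hypernym_ablation(triples, fraction):
    """Uniformly drop `fraction` of the image-leaf-hypernym mappings;
    fraction=1.0 corresponds to zero hypernym supervision."""
    keep = round(len(triples) * (1 - fraction))
    return random.sample(triples, keep)

# Sweeping fraction from 0.1 to 1.0 traces the generalization curve
# as supervision is progressively removed.
ablated = {f: random_hypernym_ablation(triples, f) for f in (0.1, 0.5, 1.0)}
```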

Highlights & Insights

  • Rigorous experimental design: each factor is isolated through carefully controlled comparisons
  • Visual coherence is the bridge: the LM doesn't blindly execute "IF crow THEN bird" rules — it requires that bird-category members actually cluster in representation space
  • Provides empirical support for the "Platonic Representation Hypothesis"

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐