Cross-Modal Taxonomic Generalization in (Vision-) Language Models¶
- Conference: ACL 2026
- arXiv: 2603.07474
- Code: https://github.com/sally-xu-42/cross-modal-taxonomic-gen
- Area: Causal Inference
- Keywords: Cross-Modal Generalization, Taxonomic Knowledge, Hypernymy, Vision-Language Model, Visual Coherence
TL;DR¶
This paper systematically studies whether the language model inside a VLM can generalize taxonomic knowledge (hypernym relations) learned purely from text to visual inputs. The finding: even without any vision-language hypernym supervision, a pretrained LM can identify hypernym categories in images, but this cross-modal generalization requires visual coherence among category members.
Method¶
Key Designs¶
- Random Hypernym Ablation: Randomly removes 10-100% of the image-hypernym mappings for leaf-node categories, measuring how generalization degrades as supervision is reduced.
- Systematic Hypernym Ablation: Removes entire hypernym categories from the training data, a stricter test than random ablation.
- Counterfactual Shuffling Experiments: Cross-category shuffling (breaks visual coherence) vs. within-category shuffling (preserves visual coherence). If the LM were merely executing arbitrary symbolic rules ("IF crow THEN bird"), both manipulations should hurt performance equally. A sketch of all three manipulations follows this list.
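To make the three manipulations concrete, here is a minimal Python sketch over a hypothetical list of `{"image", "leaf", "hypernym"}` dicts. The function names and data format are illustrative assumptions, not the paper's released code.

```python
import random

def random_hypernym_ablation(examples, drop_frac, seed=0):
    """Randomly remove a fraction (10-100%) of image-hypernym mappings."""
    rng = random.Random(seed)
    ablated = []
    for ex in examples:
        ex = dict(ex)
        if rng.random() < drop_frac:
            ex["hypernym"] = None  # drop hypernym supervision; keep image + leaf label
        ablated.append(ex)
    return ablated

def systematic_hypernym_ablation(examples, held_out_hypernyms):
    """Remove every example belonging to a held-out hypernym category."""
    return [ex for ex in examples if ex["hypernym"] not in held_out_hypernyms]

def shuffle_images(examples, within_category, seed=0):
    """Reassign images across leaf labels.

    within_category=True  -> shuffle only among leaves sharing a hypernym
                             (visual coherence preserved).
    within_category=False -> shuffle across all categories
                             (visual coherence broken).
    """
    rng = random.Random(seed)
    groups = {}
    for i, ex in enumerate(examples):
        key = ex["hypernym"] if within_category else "all"
        groups.setdefault(key, []).append(i)
    shuffled = [dict(ex) for ex in examples]
    for idxs in groups.values():
        images = [examples[i]["image"] for i in idxs]
        rng.shuffle(images)
        for i, img in zip(idxs, images):
            shuffled[i]["image"] = img
    return shuffled
```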
Key Experimental Results¶
- The pretrained LM stays significantly above chance even with zero hypernym supervision, while a randomly initialized LM remains near chance (a minimal significance-test sketch follows this list)
- Cross-category shuffling causes generalization to collapse, while within-category shuffling maintains performance, showing that visual coherence is a necessary condition
- DINOv2 (no text training) and SigLIP (text-trained) show no significant difference as image encoders, confirming that the generalization comes from the LM rather than the image encoder
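One way to check an "above chance" claim for a single run is a one-sided binomial test against the 1/K chance level. The counts below are made-up placeholders, and the paper's own statistical procedure may differ.

```python
from scipy.stats import binomtest

def above_chance(num_correct, num_trials, num_classes):
    """One-sided binomial test: is accuracy above the 1/num_classes chance level?"""
    result = binomtest(num_correct, num_trials, p=1.0 / num_classes,
                       alternative="greater")
    return result.statistic, result.pvalue  # observed accuracy, p-value

# e.g. 240 correct out of 500 trials with 10 hypernym categories (illustrative numbers)
acc, p = above_chance(240, 500, 10)
print(f"accuracy={acc:.3f}, p={p:.2e}")
```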
Highlights & Insights¶
- Elegant experimental design: controlled comparisons systematically isolate each factor
- Visual coherence is the bridge: the LM does not blindly execute "IF crow THEN bird" rules; it requires bird-category members to actually cluster in representation space (one way to quantify this is sketched after this list)
- Provides empirical support for the "Platonic Representation Hypothesis"
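As an illustration, "visual coherence" could be quantified by how tightly each hypernym's image embeddings cluster, e.g. with a cosine silhouette score over the encoder's features. This metric choice is an assumption for the sketch, not necessarily the measure used in the paper.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def visual_coherence(embeddings, hypernym_labels):
    """Silhouette score of image embeddings grouped by hypernym.

    Higher values mean members of a hypernym category cluster together in the
    encoder's representation space; cross-category shuffling should push this
    toward zero, while within-category shuffling should leave it roughly intact.
    """
    return silhouette_score(np.asarray(embeddings), np.asarray(hypernym_labels),
                            metric="cosine")
```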
Rating¶
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐