The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition¶

Conference: CVPR 2026
arXiv: 2505.24840
Code: yuanqing-ai.github.io/llm-hierarchy
Area: Multimodal VLM
Keywords: Hierarchical Visual Recognition, Classification Consistency, LLM Bottleneck, Taxonomy Knowledge, Visual Question Answering

TL;DR¶

This paper reveals that open-source LLMs lack hierarchical taxonomic knowledge about the visual world (often failing to recognize even basic biological classification systems), making the LLM the bottleneck for hierarchical visual recognition in Vision LLMs.

Background & Motivation¶

Root Cause¶

Key Challenge: Taxonomies are central to visual recognition — e.g., Boston Terrier → Terrier → Dog → Mammal → Animal forms a semantic path. An ideal general-purpose visual recognition system should be capable of mapping inputs to both leaf nodes and internal nodes of a taxonomy while maintaining hierarchical consistency.

Vision LLMs (VLLMs) unify diverse visual tasks and hold the potential to build such systems, yet existing benchmarks focus primarily on leaf-node classification accuracy and overlook hierarchical consistency.

Core findings and contradictions:

State of the Field¶

Background: Both open-source and commercial VLLMs exhibit severely inconsistent hierarchical recognition. For example, Qwen2.5-VL-72B reaches an HCA of only 32.82 on iNat21-Plant, meaning it makes at least one error on more than 67% of the taxonomy paths.

Limitations of Prior Work¶

Limitations of Prior Work: The root cause lies not in the visual encoder or projector — which retain highly discriminative and well-structured features — but in the LLM, which lacks taxonomic knowledge.

Starting Point¶

Key Insight: Fine-tuning VLLMs can help but does not fundamentally resolve the issue. Moreover, fine-tuning yields greater improvements in the LLM's textual hierarchical consistency than in the VLLM's visual hierarchical consistency, further confirming the LLM bottleneck effect.

Method¶

Overall Architecture¶

This is an analytical paper rather than a methods paper. The authors construct approximately one million four-choice VQA tasks based on 6 taxonomies and 4 image datasets to systematically evaluate the hierarchical visual recognition capabilities of VLLMs.

Key Designs¶

Evaluation Metrics:
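- HCA (Hierarchical Consistent Accuracy): \(HCA = \frac{1}{N}\sum_{i=1}^N \prod_{j=1}^{L^i} \mathbb{1}[f_\theta(x^i; Y_j) = y_j^i]\), where \(N\) is the number of samples, \(L^i\) is the number of levels on the path of sample \(x^i\), \(Y_j\) is the label set at level \(j\), and \(y_j^i\) is the ground-truth node at that level. A sample counts as correct only if every node along its path is predicted correctly.
- Leaf Accuracy \(Acc_{leaf}\): considers only the finest-grained prediction. Because the leaf is one of the nodes on the path, \(Acc_{leaf}\) is an upper bound for HCA.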
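
The two metrics above can be computed side by side. The following is a minimal Python sketch, assuming each sample comes with one prediction per taxonomy level, ordered from root to leaf; the function name and the toy labels are illustrative and not taken from the paper's code.

```python
from typing import Sequence


def hca_and_leaf_accuracy(
    predictions: Sequence[Sequence[str]],
    labels: Sequence[Sequence[str]],
) -> tuple[float, float]:
    """Return (HCA, leaf accuracy) for root-to-leaf label paths.

    predictions[i][j] is the model's answer for sample i at level j,
    labels[i][j] is the ground-truth node at that level.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    path_hits = 0
    leaf_hits = 0
    for pred_path, true_path in zip(predictions, labels):
        if len(pred_path) != len(true_path) or not true_path:
            raise ValueError("each prediction path must match its label path")
        # The product of indicators in the HCA formula equals "all levels correct".
        path_hits += all(p == t for p, t in zip(pred_path, true_path))
        # Leaf accuracy only looks at the last (finest-grained) level.
        leaf_hits += pred_path[-1] == true_path[-1]
    n = len(labels)
    return path_hits / n, leaf_hits / n


# Toy example (hypothetical data): four samples on a three-level taxonomy.
labels = [
    ["Mammal", "Dog", "Boston Terrier"],
    ["Mammal", "Dog", "Boston Terrier"],
    ["Mammal", "Cat", "Siamese"],
    ["Bird", "Finch", "House Finch"],
]
predictions = [
    ["Mammal", "Dog", "Boston Terrier"],  # fully consistent
    ["Mammal", "Cat", "Boston Terrier"],  # leaf correct, middle level wrong
    ["Mammal", "Cat", "Persian"],         # leaf wrong
    ["Bird", "Finch", "House Finch"],     # fully consistent
]
hca, leaf_acc = hca_and_leaf_accuracy(predictions, labels)
print(f"HCA={hca:.2f}  leaf accuracy={leaf_acc:.2f}")
```

The second sample illustrates the failure mode the metric is designed to expose: it counts toward leaf accuracy but not toward HCA.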

VQA Task Construction:
- 6 taxonomies: iNat21-Animal, iNat21-Plant, ImgNet-Artifact, ImgNet-Animal, CUB-200, Oxford-Pets.
- Four-choice questions are generated for each taxonomic level, with distractors sampled from the same level.
- Coverage spans all levels from coarse-grained (e.g., Vertebrate/Invertebrate) to fine-grained (e.g., specific species).
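
The construction described above can be sketched as follows. This is a minimal Python illustration, assuming the taxonomy is available as one list of node names per level; the function name, the toy taxonomy, and the decision to skip levels with fewer than four nodes are assumptions of this sketch, since the source does not say how such levels are handled.

```python
import random


def build_level_questions(path, nodes_by_level, rng, num_choices=4):
    """Build one multiple-choice question per taxonomy level.

    path            ground-truth nodes for one image, root to leaf
    nodes_by_level  nodes_by_level[j] lists every node at level j
    rng             a random.Random instance, for reproducible sampling
    """
    questions = []
    for level, answer in enumerate(path):
        # Distractors come from the same level as the correct answer.
        candidates = [node for node in nodes_by_level[level] if node != answer]
        if len(candidates) < num_choices - 1:
            continue  # assumption of this sketch: skip levels that are too small
        options = rng.sample(candidates, num_choices - 1) + [answer]
        rng.shuffle(options)
        questions.append({"level": level, "options": options, "answer": answer})
    return questions


# Hypothetical three-level toy taxonomy.
nodes_by_level = [
    ["Mammal", "Bird", "Reptile", "Fish", "Amphibian"],
    ["Dog", "Cat", "Horse", "Finch", "Eagle"],
    ["Boston Terrier", "Siamese", "Arabian", "House Finch", "Bald Eagle"],
]
path = ["Mammal", "Dog", "Boston Terrier"]
questions = build_level_questions(path, nodes_by_level, random.Random(0))
for q in questions:
    print(q["level"], q["options"], "->", q["answer"])
```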

Bottleneck Localization Analysis:
- Probing the visual encoder embeddings of VLLMs reveals that they retain discriminative features and hierarchical structure.
- Probing LLM embeddings reveals that although sufficient hierarchical cues are encoded in an orthogonal structure, the model fails to decode them.
- Fine-tuning experiments show that VLLM fine-tuning improves both the LLM's textual hierarchical consistency and the VLLM's visual hierarchical consistency, but the former improves more substantially.
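
To make the idea of probing concrete, here is a small Python sketch. It uses a nearest-centroid classifier as a simpler stand-in for a trained linear probe, and synthetic 2-D points in place of real frozen embeddings; the source does not describe the paper's probe design, so every name and number below is an assumption. The same features are probed at two taxonomy levels, and high accuracy at both indicates that the features carry both fine-grained and hierarchical information.

```python
import math
import random


def centroid_probe_accuracy(train, test):
    """Fit class centroids on train and report accuracy on test.

    train and test are lists of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [s / counts[label] for s in total] for label, total in sums.items()
    }
    hits = 0
    for vec, label in test:
        # Predict the class whose centroid is closest in Euclidean distance.
        pred = min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))
        hits += pred == label
    return hits / len(test)


# Synthetic stand-in for frozen embeddings: siblings lie close together,
# different parents lie far apart (hypothetical values).
rng = random.Random(0)
leaf_centers = {
    "Terrier": (-1.0, 0.0),
    "Poodle": (1.0, 0.0),
    "Finch": (9.0, 10.0),
    "Eagle": (11.0, 10.0),
}
parent_of = {"Terrier": "Dog", "Poodle": "Dog", "Finch": "Bird", "Eagle": "Bird"}


def sample(n_per_class):
    data = []
    for leaf, (cx, cy) in leaf_centers.items():
        for _ in range(n_per_class):
            data.append(([rng.gauss(cx, 0.2), rng.gauss(cy, 0.2)], leaf))
    return data


def relabel_to_parent(data):
    return [(vec, parent_of[leaf]) for vec, leaf in data]


train, test = sample(20), sample(20)
leaf_probe_acc = centroid_probe_accuracy(train, test)
parent_probe_acc = centroid_probe_accuracy(
    relabel_to_parent(train), relabel_to_parent(test)
)
print(f"leaf-level probe: {leaf_probe_acc:.2f}, parent-level probe: {parent_probe_acc:.2f}")
```

In an actual analysis the feature vectors would be extracted from the visual encoder or from the LLM's hidden states rather than sampled synthetically.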

Loss & Training¶

Fine-tuning experiments adopt standard SFT using the constructed VQA data.
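
As an illustration of how a four-choice item could be turned into a supervised fine-tuning example, consider the Python sketch below. The prompt wording, the record fields (`image`, `prompt`, `response`), and the file name are hypothetical; the source only states that standard SFT is applied to the constructed VQA data.

```python
import json


def to_sft_record(image_path, question):
    """Format one four-choice question as a prompt/response pair."""
    letters = "ABCD"
    option_lines = [
        f"{letters[i]}. {option}" for i, option in enumerate(question["options"])
    ]
    prompt = (
        "Which category does the object in the image belong to?\n"
        + "\n".join(option_lines)
        + "\nAnswer with the option letter."
    )
    # The training target is the letter of the ground-truth option.
    response = letters[question["options"].index(question["answer"])]
    return {"image": image_path, "prompt": prompt, "response": response}


# Hypothetical item at the "Dog vs. other mammals" level.
question = {"options": ["Cat", "Horse", "Dog", "Finch"], "answer": "Dog"}
record = to_sft_record("images/boston_terrier_001.jpg", question)
print(json.dumps(record, indent=2))
```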

Key Experimental Results¶

Main Results¶

Model	iNat21-Animal HCA	iNat21-Plant HCA	CUB-200 HCA	ImgNet-Animal HCA
Qwen2.5-VL-72B	35.73	32.82	66.36	64.08
GPT-4o	42.95	35.53	81.96	67.69
BioCLIP2	41.84	37.91	55.80	8.34
LLaVA-OV-7B	4.53	4.46	11.51	34.36

Ablation Study¶

Configuration	Key Metric	Description
Leaf Accuracy vs. HCA Gap	Large	e.g., Qwen2.5-VL-72B: 54.20 leaf accuracy vs. 35.73 HCA
BioCLIP2 Leaf Accuracy	95.94	Domain-specialist model achieves very high leaf accuracy but HCA remains only 41.84
Visual Encoder Probing	Highly Discriminative	Bottleneck does not reside on the visual side
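
The gap in the rows above can be read as a conditional error rate. A path can only be fully correct if its leaf is correct, so HCA divided by leaf accuracy is the share of leaf-correct samples whose whole path is also correct. The Python sketch below applies this to the reported figures, assuming each HCA and leaf-accuracy pair was measured on the same taxonomy and the same samples; the helper name is illustrative.

```python
def inconsistent_share(hca, leaf_acc):
    """Share of leaf-correct samples with at least one wrong higher-level node.

    Both arguments are percentages; HCA can never exceed leaf accuracy.
    """
    if not 0 < leaf_acc <= 100 or not 0 <= hca <= leaf_acc:
        raise ValueError("expected 0 <= hca <= leaf_acc <= 100 with leaf_acc > 0")
    return 1.0 - hca / leaf_acc


# (HCA, leaf accuracy) pairs as reported in the tables of this summary.
reported = {
    "Qwen2.5-VL-72B": (35.73, 54.20),
    "BioCLIP2": (41.84, 95.94),
}
shares = {
    name: inconsistent_share(hca, leaf) for name, (hca, leaf) in reported.items()
}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of leaf-correct samples break the path")
```

Under that assumption, roughly a third of the leaf-correct samples for Qwen2.5-VL-72B, and more than half for BioCLIP2, contain an error at some higher level of the taxonomy.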

Key Findings¶

- A large gap exists between leaf accuracy and HCA: models can identify specific species but are unaware of their higher-level taxonomic categories.
- Domain-specific CLIP models (BioCLIP2) outperform VLLMs in leaf accuracy but achieve similarly low HCA scores.
- A significant performance gap remains between open-source VLLMs and GPT-4o.
- VLLMs perform better on ImgNet-Artifact than on biological taxonomies, as hierarchical knowledge of tools and everyday objects is more commonly represented.

Highlights & Insights¶

- The paper raises a previously overlooked yet important research question: the hierarchical visual recognition capability of VLLMs.
- The conclusion that "the LLM is the bottleneck" has direct implications for VLLM development: improving the visual encoder alone is insufficient, and enriching the LLM's taxonomic knowledge is equally necessary.
- HCA is a stricter and more practically meaningful evaluation metric than leaf accuracy.
- The finding that hierarchical information is encoded but not decodable in LLM embeddings suggests that targeted training strategies may be able to activate this latent knowledge.

Limitations & Future Work¶

- The authors explicitly note that their conclusions apply primarily to open-source LLMs and should not be extrapolated to commercial LLMs, as their internal representations are inaccessible.
- The four-choice VQA evaluation format may underestimate hierarchical inconsistency in open-ended generation settings.
- The paper does not deeply explore effective methods for injecting taxonomic knowledge into LLMs.
- While fine-tuning provides some benefit, it does not fundamentally resolve the issue; more principled solutions are needed.

- CLIP-based models also exhibit hierarchical consistency issues, though domain-specific BioCLIP2 achieves exceptionally high leaf accuracy.
- This work is complementary to VR-FGVC studies such as Zhang et al. and Liu et al., as this paper focuses on hierarchy rather than fine-grained recognition alone.
- The findings have implications for agent system design: if the LLM does not understand hierarchical structure, it will struggle in tasks requiring multi-granularity understanding.

Rating¶

- Novelty: ⭐⭐⭐⭐ First systematic evaluation of hierarchical visual recognition in VLLMs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ ~1M VQA tasks, 10+ models, 6 taxonomies, and in-depth probing analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear paper structure with strong and carefully qualified conclusions.
- Value: ⭐⭐⭐⭐⭐ Identifies a fundamental weakness of VLLMs with important implications for the community.

Additional Notes¶

- Ten open-source VLLMs are evaluated, including LLaVA-OV, InternVL, Qwen2.5-VL, and Qwen3-VL, alongside GPT-4o.
- Four CLIP models (OpenCLIP, SigLIP, BioCLIP, BioCLIP2) serve as non-LLM baselines.
- The 6 taxonomies cover both biological and artifact categories, with hierarchical depths ranging from 2 to 7 levels.
- HCA scores on the iNaturalist taxonomies are very low across all models (the best-performing model, GPT-4o, achieves only 42.95% on iNat21-Animal), highlighting this as a challenging and underexplored problem.
- On the Oxford-Pets dataset, BioCLIP2 achieves an HCA of 58%+, demonstrating the benefit of domain-specific training.