On Large Multimodal Models as Open-World Image Classifiers
Conference: ICCV 2025 | arXiv: 2503.21851 | Code: GitHub | Area: Multimodal VLM | Keywords: large multimodal models, open-world classification, evaluation protocol, LMM, image classification
TL;DR
This paper systematically evaluates 13 large multimodal models (LMMs) on open-world image classification, proposes an evaluation protocol comprising four complementary metrics, and reveals systematic error patterns in LMMs regarding granularity judgment and fine-grained discrimination.
Background & Motivation
Traditional image classification requires a predefined set of categories (the closed-world setting), whereas LMMs natively support open-ended output: they directly generate category names in response to queries such as "What is the object in this image?" without requiring a fixed candidate list. While this capability is highly practical, existing research exhibits notable shortcomings:
Problem 1: Most LMM classification evaluations remain confined to the closed-world setting (providing candidate category lists to the model), failing to reflect the genuine open-world capability of LMMs. Although Zhang et al. (2024) attempted open-world evaluation, it was limited to four datasets and a single metric (text inclusion).
Problem 2: Existing evaluation metrics are overly simplistic. The text inclusion metric merely checks whether the correct label string appears within the prediction, and cannot handle the following scenarios:
- Semantically equivalent but textually different predictions (e.g., sofa vs. couch)
- Different but reasonable levels of granularity (e.g., dog vs. pug)
- Literally correct but semantically erroneous matches (e.g., can spuriously matching trash can)
Goal: To provide the first large-scale, multi-dimensional benchmark for evaluating LMMs on open-world classification, analyze error patterns through multiple metrics, and offer directional guidance for future research.
Method
Overall Architecture
Rather than introducing a new model, this paper constructs a comprehensive evaluation framework for open-world classification:
1. Formally defines the open-world classification task
2. Proposes four complementary evaluation metrics
3. Conducts large-scale experiments across 10 datasets and 13 models
4. Analyzes error types and mitigation strategies through metric combinations
Key Designs
- Task Formalization (Open-World Classification):
  - Function: Defines the LMM as a function \(f_{\text{LMM}}: \mathcal{X} \times \mathcal{T} \rightarrow \mathcal{T}\), taking an image and a query text as input and producing a textual prediction as output.
  - Mechanism: In the closed-world setting, the query includes a candidate set \(\mathcal{C}\); in the open-world setting, the output space is unconstrained, and the model freely predicts from the space of all possible semantic concepts \(\mathcal{Y}\), where \(|\mathcal{C}| \ll |\mathcal{Y}|\).
  - Design Motivation: The generative capacity of LMMs should not be constrained by predefined category lists; the open-world setting better reflects real-world application scenarios.
- Four Evaluation Metrics:
  - Function: Measure the alignment between predictions and ground truth from different perspectives.
  - Mechanism:
    - Text Inclusion (TI): Checks whether the ground-truth label is a substring of the predicted text, \(\text{TI}(y, \hat{y}) = \mathbf{1}[y \subseteq \hat{y}]\). Simple but overly strict; semantically equivalent but textually different predictions are penalized.
    - Llama Inclusion (LI): Uses Llama 3.2 3B as a judge to determine whether the prediction is semantically consistent with the ground truth. A classification-specific instantiation of the LLM-as-judge paradigm.
    - Semantic Similarity (SS): \(\text{SS} = \langle g_{\text{emb}}(\hat{y}), g_{\text{emb}}(y) \rangle\), the cosine similarity between Sentence-BERT embeddings of the prediction and the ground truth, yielding a continuous score in \([0, 1]\).
    - Concept Similarity (CS): \(\text{CS} = \max_{p \in \text{split}(\hat{y})} \langle g_{\text{emb}}(p), g_{\text{emb}}(y) \rangle\), which segments the prediction with spaCy and takes the maximum cosine similarity between any segment and the ground truth, addressing the dilution caused by evaluating verbose predictions holistically.
  - Design Motivation: No single metric provides a complete assessment; inconsistencies across metrics expose specific error patterns (see the metric sketch after this list).
- Error Analysis Framework:
  - Function: Uses metric discrepancies to localize the type of model error.
  - Mechanism:
    - High CS but low LI → granularity error (correct but too generic, e.g., predicting "animal" instead of "Labrador")
    - High LI but low CS → annotation ambiguity or multi-label issues
    - Both low → fine-grained discrimination failure (wrong but specific, e.g., confusing two similar flower species)
  - Design Motivation: Rather than simply reporting accuracy, the combined metrics provide actionable directions for improvement (see the triage sketch below).
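To make the four metrics concrete, here is a minimal sketch of how they could be computed. It assumes sentence-transformers and spaCy are installed; the embedding model name (all-MiniLM-L6-v2), the noun-chunk segmentation, and the judge prompt are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the four metrics; library choices and prompts are assumptions.
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")                   # segments predictions for CS
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for the paper's Sentence-BERT encoder

def text_inclusion(label: str, pred: str) -> bool:
    """TI: is the ground-truth label a substring of the prediction?"""
    return label.lower() in pred.lower()

def semantic_similarity(label: str, pred: str) -> float:
    """SS: cosine similarity between whole-prediction and label embeddings."""
    emb = embedder.encode([pred, label], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def concept_similarity(label: str, pred: str) -> float:
    """CS: max cosine similarity between any segment of the prediction and the
    label, so verbose predictions are not diluted."""
    segments = [c.text for c in nlp(pred).noun_chunks] or [pred]
    label_emb = embedder.encode(label, convert_to_tensor=True)
    seg_embs = embedder.encode(segments, convert_to_tensor=True)
    return util.cos_sim(seg_embs, label_emb).max().item()

def llama_inclusion(label: str, pred: str, judge) -> bool:
    """LI: ask an LLM judge whether prediction and label name the same concept.
    `judge` is a hypothetical callable wrapping Llama 3.2 3B."""
    answer = judge(f"Does '{pred}' refer to the same object as '{label}'? Answer yes or no.")
    return answer.strip().lower().startswith("yes")
```

For instance, `concept_similarity("pug", "a small pug sitting on a sofa")` stays high because the segment "a small pug" matches the label, whereas the whole-sentence SS score would be diluted by the surrounding words.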
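And a sketch of the metric-discrepancy triage described above; the 0.5 threshold on CS is an illustrative placeholder, not a value from the paper.

```python
def diagnose(cs: float, li: bool, threshold: float = 0.5) -> str:
    """Map a (Concept Similarity, Llama Inclusion) pair to an error type."""
    if li:
        # Judge accepts the prediction as semantically consistent.
        return "correct" if cs >= threshold else "annotation ambiguity / multi-label"
    if cs >= threshold:
        # Some segment is close to the label but the judge rejects it:
        # typically a correct-but-too-generic prediction.
        return "granularity error (too generic)"
    # Neither metric supports the prediction: wrong but specific.
    return "fine-grained discrimination failure"
```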
Loss & Training
This paper is an evaluation study and does not involve model training. All experiments uniformly adopt the standard prompt "What type of object is in this image?" for zero-shot inference, as sketched below.
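A minimal sketch of this unified protocol, contrasted with the closed-world variant; `lmm.generate` is a hypothetical wrapper interface over any of the evaluated models (e.g., Qwen2VL), not the paper's code or a real library API.

```python
PROMPT = "What type of object is in this image?"

def classify_open_world(lmm, image) -> str:
    # Open-world: no candidate list is supplied; the output space is unconstrained.
    return lmm.generate(image=image, prompt=PROMPT)

def classify_closed_world(lmm, image, candidates: list[str]) -> str:
    # Closed-world: the query embeds the candidate set C.
    prompt = f"{PROMPT} Choose one of: {', '.join(candidates)}."
    return lmm.generate(image=image, prompt=prompt)
```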
Key Experimental Results
Main Results
LMM open-world classification performance (mean over datasets within each granularity group; values in %):
| Model | Prototypical TI | Prototypical LI | Fine-grained TI | Very Fine-grained TI | Notes |
|---|---|---|---|---|---|
| Qwen2VL 7B | 46.4 | 78.7 | 34.6 | 0.8 | Best LMM |
| InternVL2 8B | 40.6 | 74.4 | 22.3 | 2.3 | Second best |
| LLaVA-1.5 7B | 34.6 | 63.1 | 8.4 | 0.0 | Weaker performance |
| CaSED (OW baseline) | 24.5 | 46.3 | 27.4 | 0.7 | Contrastive OW method |
| CLIP (closed-world) | 76.4 | — | 85.0 | — | Closed-world upper bound |
| SigLIP (closed-world) | 81.8 | — | 92.6 | — | Closed-world upper bound |
Ablation Study
Effect of prompt design on granularity:
| Prompt Strategy | CS Gain | Notes |
|---|---|---|
| Default prompt | — | "What type of object?" |
| Fine-grained prompt | +5–10% | Adds guidance such as "be specific" |
| CoT reasoning | +2–5% | Adds Chain-of-Thought steps |
Effect of inference strategy on fine-grained discrimination:
| Strategy | Fine-grained Improvement | Notes |
|---|---|---|
| Direct prediction | baseline | Default mode |
| Structured reasoning | +3–8% LI | Reduces confusion between similar categories |
Key Findings
- LMMs outperform contrastive baselines in the open world: Generative LMMs perform better than category-list-free methods such as CaSED and CLIP retrieval, though they remain significantly behind closed-world models that receive candidate lists.
- Performance degrades sharply with finer granularity: Accuracy drops precipitously from prototypical (~46% TI) to very fine-grained (~1% TI).
- Granularity errors are the dominant failure mode: Models tend to produce overly generic predictions (e.g., "bird" instead of "scarlet tanager"); prompt-based guidance can partially mitigate this.
- Model scale helps but is not decisive: InternVL2 2B→8B shows consistent gains, whereas LLaVA-OV 0.5B→7B regresses on some metrics.
- Annotation ambiguity is pervasive: Verification with a tagging model reveals that many "incorrect" predictions are in fact valid, as images contain multiple plausible labels.
Highlights & Insights
- The contribution lies in evaluation rather than method design, yet it is equally important: this is the first work to elevate LMM open-world classification from "running a quick benchmark" to rigorous evaluation science.
- The complementary design of the four metrics is elegant: TI captures strict matching, LI captures semantic correctness, SS/CS provide continuous scores, and inter-metric discrepancies diagnose specific failure modes.
- The granularity–performance cliff reveals a fundamental bottleneck in LMM visual perception: not an inability to recognize objects, but an inability to articulate their precise names.
- The analysis of annotation ambiguity is honest and valuable, cautioning the community against over-interpreting "errors" in open-world evaluation.
Limitations & Future Work
- The most recent LMMs (e.g., GPT-4o, Gemini 2) are not included; whether conclusions hold for stronger models requires further verification.
- The evaluation prompt is uniformly set to "What type of object," which is unnatural for non-object datasets (e.g., DTD textures, UCF101 actions).
- The Llama inclusion metric relies on the judgment capability of Llama 3.2 3B, which may itself introduce bias.
- Few-shot and retrieval-augmented strategies, which might systematically improve fine-grained recognition, remain unexplored.
Related Work & Insights
- Comparison with CaSED (CVPR 2023) highlights the complementary strengths and weaknesses of generative and contrastive models for open-world recognition.
- The connection between granularity errors and prompt engineering suggests the practical value of task-specific prompt template libraries.
- The evaluation framework can be directly applied to open-world evaluation of other vision-language understanding tasks.
- Zhang et al. (2024) is the most direct predecessor; the present work surpasses it comprehensively in both scale and depth.
Rating
- Novelty: ⭐⭐⭐ — An evaluation study rather than a methodological contribution, though the task formalization and metric design exhibit originality.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 13 models × 10 datasets × 4 metrics; coverage is exceptionally broad.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, progressively layered analysis, and rich figures and tables.
- Value: ⭐⭐⭐⭐ — Provides the LMM community with much-needed evaluation infrastructure and insightful error analysis.