RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/cvl-umass/RealBirdID
Area: Multimodal VLM
Keywords: Fine-grained recognition, abstention, bird benchmark, taxonomic hierarchy, MLLM calibration

TL;DR

RealBirdID is a fine-grained bird identification benchmark focused on "identifying species if possible, and providing reasons if not." By mining \(3.4\text{k}\) "unanswerable" images from real iNaturalist disputes (categorized into three abstention reasons: need vocalization / angle or occlusion / poor quality) paired with "answerable" samples from the same genus, the study evaluates models using three metrics. Results show that top-tier MLLMs like GPT-5 and Gemini-2.5 Pro achieve less than \(13\%\) species-level accuracy, struggle to distinguish answerable from unanswerable samples, and mostly provide incorrect reasons for abstention.

Background & Motivation

Background: Bird identification has long served as a yardstick for fine-grained visual recognition (FGVR), with datasets like CUB-200, NABirds, and iNaturalist driving modeling for parts, attributes, and taxonomic hierarchies. Recently, MLLMs and open-vocabulary prompting have significantly improved zero-shot classification, suggesting that fine-grained recognition might be "solved" in the era of large models.

Limitations of Prior Work: Existing benchmarks almost exclusively contain "answerable" (in-schema) samples, where every image has a ground-truth species label. In real-world deployment, many images cannot be identified to the species level from a single frame: critical clues might be non-visual (requiring vocalization) or obscured by occlusion, angle, or low resolution. When forced to choose, models confidently hallucinate, which is dangerous in medical or legal scenarios.

Key Challenge: Current evaluations only reward "confident prediction" and fail to examine whether a model can "stay silent when necessary and explain why." Text-based abstention benchmarks such as SQuAD 2.0 and AbstentionBench exist, but they make the question itself unanswerable. In contrast, this study keeps the prompt fixed and requires the model to decide whether to abstain from the visual evidence alone, a setting that remains a gap in the vision-language domain.

Goal: To evaluate two sub-capabilities at once: (1) identifying the species from an exhaustive list of candidates within a genus, and (2) abstaining with the correct reason (vocalization/angle/quality) when faced with truly "unanswerable" samples from that genus.

Key Insight: Instead of generating synthetic hard samples, the authors mine observations from iNaturalist where community experts engaged in genuine disputes and could only reach a genus-level consensus. These human discussion records naturally provide labels for "unanswerable with reasons."

Core Idea: Redefine the fine-grained recognition task as "identify species or abstain with evidence," curate a benchmark with partially labeled taxonomy (where leaf nodes may lack ground truth), and design metrics to evaluate the trade-off between accuracy and abstention across the hierarchical tree to expose the weaknesses of current MLLMs.

Method

This is a benchmark study: the core contribution is the task definition, dataset construction, and evaluation metrics rather than a specific model pipeline.

Overall Architecture

The evaluation loop of RealBirdID involves preparing a subset pair for each genus: an Answerable set (A) (exhaustively sampled images with species ground truth) and an Unanswerable set (UA) (images identified only to the genus level with human-provided abstention reasons). The model outputs a species prediction with uncertainty or free-form text. The evaluation scans uncertainty thresholds to shift predictions from leaf species to intermediate genus nodes (abstention), plotting three types of trade-off curves and using the Area Under the Curve (AUC) as summary metrics. The benchmark covers \(248\) genera, \(3442\) species, and \(35138\) images (\(31885\) in A + \(3253\) in UA) to score both CLIP-style encoders and MLLMs.

Key Designs

1. Task Redefinition: Species Identification or Reasoned Abstention with Fixed Prompts

The task shifts from "Image \(\to\) Species" to "Image \(\to\) Species OR Abstention + Reason." Unlike text-based benchmarks that modify the question to create unanswerability, RealBirdID keeps the prompt functionally equivalent ("What is the species of this bird?"), forcing the model to infer strictly from visual content whether the task is impossible. Abstention reasons are consolidated into three categories: Vocalization, Angle/Occlusion, and Quality. This design decouples abstention ability from language understanding of the prompt.
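The "Species OR Abstention + Reason" output can be pictured as a small tagged record. The following is a minimal sketch of such a record, not the benchmark's actual data format; the names `AbstainReason` and `Prediction` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AbstainReason(Enum):
    """The three abstention categories named in the summary."""
    VOCALIZATION = "vocalization"
    ANGLE_OCCLUSION = "angle_occlusion"
    QUALITY = "quality"


@dataclass(frozen=True)
class Prediction:
    """Hypothetical container: either a species name or an abstention reason."""
    species: Optional[str] = None
    reason: Optional[AbstainReason] = None

    def __post_init__(self):
        # Exactly one of the two fields must be set.
        if (self.species is None) == (self.reason is None):
            raise ValueError("set exactly one of species or reason")

    @property
    def abstained(self) -> bool:
        return self.reason is not None


answer = Prediction(species="Empidonax traillii")
refusal = Prediction(reason=AbstainReason.VOCALIZATION)
```

Forcing exactly one field to be set mirrors the task definition: a model either commits to a species or abstains with a stated reason, never both.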

2. Mining "Unanswerable with Reasons" from iNaturalist Disputes

UA samples are extracted from real iNaturalist community disputes. Starting from \(1.4\) million verifiable observations (deliberately not restricted to Research Grade, so that species-level disagreements are captured), samples are filtered for bird presence (YOLOv3) and image quality (MANIQA), leaving \(410\text{k}\) observations. Observations with at least a genus-level prediction and \(\ge 2\) contributors are retained, and dead birds, eggs, and feathers are excluded. Regular expressions and heuristics are then used to parse comments and identification histories for ambiguity signals (range, sexual dimorphism, molt, view, hybrid, taxonomic uncertainty, quality). After mapping to the reason schema and expert verification using Birds of the World, \(3.4\text{k}\) UA samples with reasons are finalized.

The Answerable set is constructed by exhaustively sampling all descendant species within the same genus using the iNaturalist Taxon API. Additionally, a SINR geo-species model is used to map coordinates to species occurrence probabilities, providing a "likely candidate list" for range-map experiments.
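The comment-parsing step described above can be sketched with a few regular expressions. This is only an illustration under stated assumptions: the patterns below are hypothetical (the summary does not list the authors' expressions), and the sketch maps comments directly to the three final categories rather than to the finer intermediate signals such as molt or hybrid.

```python
import re

# Hypothetical keyword patterns, one per abstention category.
REASON_PATTERNS = {
    "vocalization": re.compile(
        r"\b(call|song|vocali[sz]ation|recording)s?\b", re.IGNORECASE
    ),
    "angle_occlusion": re.compile(
        r"\b(angle|occluded|obscured|hidden|facing away)\b", re.IGNORECASE
    ),
    "quality": re.compile(
        r"\b(blurry|blurred|too distant|low[- ]res(olution)?|out of focus)\b",
        re.IGNORECASE,
    ),
}


def tag_reasons(comment: str) -> list[str]:
    """Return every category whose pattern occurs in the comment."""
    return [name for name, pat in REASON_PATTERNS.items() if pat.search(comment)]


print(tag_reasons("Need a recording of the call to separate these two."))
print(tag_reasons("Photo is too blurry and the wing bars are hidden."))
```

Keyword matching of this kind only proposes candidates; the summary states that the mined samples were additionally verified by experts, which a sketch like this cannot replace.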

3. Hierarchical Metrics and Hierarchical Probability Aggregation (TreeGT)

Since UA images lack species-level ground truth, standard accuracy cannot be computed for them. Three threshold-based metrics are introduced:

- Metric 1 (UA/A Trade-off): Compares the abstention rate on the A set with that on the UA set as the threshold varies. An ideal model abstains on all UA images and on no A images.
- Metric 2 (IG, Information Gain): Based on an "Accuracy vs. Information Gain" curve. Predictions are aggregated up the taxonomy (species \(\to\) genus \(\to\) class), and information gain grows with taxonomic depth (species \(>\) genus). This avoids rewarding trivial models that abstain on everything.
- Metric 3 (Calibration AUC): Measures species/genus accuracy at each abstention rate to check whether accuracy increases as high-entropy samples are discarded.
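One plausible reading of the UA/A trade-off is an ROC-style curve: sweep a threshold on the model's uncertainty, record the abstention rate on A (x-axis) and on UA (y-axis), and integrate. The sketch below follows that reading; the function name `ua_a_auc`, the "abstain when uncertainty is at or above the threshold" rule, and the trapezoidal integration are assumptions, not the paper's confirmed procedure.

```python
def ua_a_auc(unc_a, unc_ua):
    """Area under the (A abstention rate, UA abstention rate) curve.

    unc_a / unc_ua: per-image uncertainty scores for the A and UA sets.
    A sample is treated as abstained when its uncertainty >= threshold.
    """
    # Descending thresholds make both abstention rates grow monotonically.
    thresholds = sorted(set(unc_a) | set(unc_ua), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nobody abstains
    for t in thresholds:
        x = sum(u >= t for u in unc_a) / len(unc_a)
        y = sum(u >= t for u in unc_ua) / len(unc_ua)
        points.append((x, y))
    # Trapezoidal rule over consecutive curve points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc


# Toy scores: UA images are uniformly more uncertain than A images.
print(ua_a_auc([0.1, 0.2], [0.8, 0.9]))
```

Under this reading, a model that abstains on every UA image before any A image scores 1.0, and a model whose uncertainty carries no information about answerability scores about 0.5.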

For encoders like CLIP, which lack an "abstain" class, the authors use TreeGT hierarchical aggregation: the probability of a genus is the sum of the softmax probabilities of all its child species (e.g., \(P(\text{Genus}) = \sum P(\text{Species}_i)\)). For MLLMs, free-form text is parsed to extract species/reasons, and species probability vectors are reconstructed via retrieval.
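The aggregation rule for encoders can be sketched in a few lines: apply softmax over species logits, sum the child probabilities to obtain genus probabilities, and fall back to the genus when the species-level confidence is too low. The back-off rule below (comparing the top species probability against a threshold) is an assumption made for illustration; the summary only states that thresholds on uncertainty move predictions from species to genus nodes. The helper names and the toy logits are likewise hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def genus_probs(species_probs, species_to_genus):
    """P(genus) = sum of the probabilities of its child species."""
    out = {}
    for species, p in species_probs.items():
        genus = species_to_genus[species]
        out[genus] = out.get(genus, 0.0) + p
    return out


def predict_with_backoff(species_probs, species_to_genus, threshold):
    """Answer at species level if confident enough, else back off to genus."""
    best_species = max(species_probs, key=species_probs.get)
    if species_probs[best_species] >= threshold:
        return ("species", best_species)
    g = genus_probs(species_probs, species_to_genus)
    return ("genus", max(g, key=g.get))


names = ["Empidonax traillii", "Empidonax alnorum", "Contopus virens"]
to_genus = {n: n.split()[0] for n in names}
probs = dict(zip(names, softmax([2.0, 2.0, 0.0])))  # toy logits

print(genus_probs(probs, to_genus))
print(predict_with_backoff(probs, to_genus, threshold=0.5))
```

In this toy case the two Empidonax species split the probability mass almost evenly, so neither clears the 0.5 threshold, yet their sum makes the genus a confident answer. This is the situation the hierarchical metrics are meant to credit.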

Key Experimental Results

Evaluated models include CLIP, MetaCLIP, WildCLIP, SigLIP, and BioCLIP, alongside MLLMs like InternVL3-8B, Qwen2.5-VL-7B, Gemma-3-12B, Llama-3.2-11B, Gemini-2.5 Pro, and GPT-5.

Main Results: Classification Performance and Abstention Trade-off

Model	IG (Performance↑)	UA/A (Trade-off↑)
BioCLIP	68.9	49.6
MetaCLIP-L/14	66.0	42.5
SigLIP-so400m	53.7	53.2
CLIP-L/14	62.0	48.1
Gemini-2.5 Pro	57.7	46.2
GPT-5	56.4	44.1
Qwen2.5-VL-7B	54.2	41.7

Key Observation: BioCLIP is the strongest classifier (IG 68.9) but is outperformed by SigLIP on the abstention trade-off (UA/A 49.6 vs. 53.2). The correlation between classification accuracy and abstention capability is reported as not significant (\(r=0.60\)). Increasing model or data size within the same family improves IG but not necessarily UA/A.

Species/Genus Accuracy and Calibration

Model	A-Species Acc	A-Genus Acc	UA-Genus Acc
BioCLIP	17.0	57.0	57.6
MetaCLIP-L/14	11.8	56.1	63.6
GPT-5	10.4	45.6	58.6
Gemini-2.5 Pro	12.7	52.8	60.1

Species-level accuracy across \(3442\) classes remains low (\(3.7\%–17\%\)), with top MLLMs \(\le 13\%\). MLLMs generally lag behind specialized encoders.

Correctness of Abstention Reasons

GT Reason	Qwen2.5-VL	Llama-3.2V	InternVL3	Gemma-3
Quality	0.158	0.086	0.279	0.041
Angle/Occlusion	0.144	0.080	0.291	0.052
Vocalization	0.098	0.077	0.278	0.044

Models exhibit a systematic bias: almost all failures are attributed to "Quality." For instance, Qwen2.5-VL correctly labels \(100\%\) of true quality issues but misidentifies \(42.4\%\) of "Angle/Occlusion" as quality. Vocalization is rarely predicted, as visual training bias prevents MLLMs from identifying missing audio cues as a reason for failure.

Key Findings

- Abstaining \(\neq\) Abstaining Well: Abstention capability is decoupled from classification accuracy. Improving standard FGVR accuracy does not automatically improve reliable abstention.
- Abstention Rates are Fragile: Across 15 equivalent instructions, InternVL3's abstention rate varies significantly (\(\sigma=10.10\)). Without explicit permission to abstain, MLLMs rarely do so (\(1\%–2\%\)).
- Range Maps Aid Classification, Not Abstention: Incorporating geographic priors improves IG (\(57.2 \to 88.1\)) but has minimal impact on UA/A (\(45.8 \to 47.7\)) and can even degrade MLLM abstention performance.

Highlights & Insights

- Abstention labels from expert disputes: Mining real-world community disagreements where experts fail to reach species-level consensus provides the most authentic "unanswerable with reasons" labels.
- Visual-only abstention: By fixing the prompt, the benchmark isolates the model's ability to "admit ignorance" based purely on visual evidence.
- Hierarchical partial labeling: Using "Accuracy vs. Taxonomic Depth" effectively handles images without leaf-level ground truth.
- Vocalization blind spot: MLLMs systematically fail to recognize the need for audio cues, highlighting a specific gap in multi-modal training data.

Limitations & Future Work

- The dataset is for validation only; without a training set, it is difficult to directly train models to abstain.
- Single-frame processing might underestimate answerability where subsequent frames in an observation provide clearer views.
- The three coarse reason categories collapse more nuanced ambiguities (molt, hybrids, etc.).
- Dependence on constrained decoding and text parsing for MLLM evaluation may introduce noise.

Related Work

- vs. Textual Abstention (AbstentionBench): Those focus on unanswerable questions via text; RealBirdID focuses on unanswerable images via visual evidence.
- vs. Hierarchical Classification: Unlike methods assuming full labels at every node, RealBirdID handles partially labeled taxonomies where leaf nodes may be missing.
- vs. CUB-200 / iNaturalist: These assume all samples are answerable (\(UA=0\)). RealBirdID is the first bird benchmark to pair A/UA sets to evaluate the trade-off.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐