UNICBench: UNIfied Counting Benchmark for MLLM

Conference: CVPR 2026
arXiv: 2603.00595
Code: Public evaluation toolkit
Area: Multimodal benchmark / MLLM evaluation
Keywords: counting benchmark, multimodal LLM, image-text-audio, unified evaluation, stratified difficulty

TL;DR

UNICBench is the first unified cross-modal (Image/Text/Audio), multi-level counting benchmark. It contains 14,301 QA pairs (5,508 + 5,888 + 2,905), categorized by three capability levels (Pattern/Semantic/Reasoning) × three difficulty levels (Easy/Medium/Hard). A systematic evaluation of 45 SOTA MLLMs shows that basic counting is approaching human level, while significant gaps remain on reasoning-level and hard tasks.

Background & Motivation

Background: Counting is a core cognitive ability for multimodal large language models (MLLMs), related to number sense (a basic cognitive capability in humans and animals). While MLLMs have progressed rapidly on general VQA and reasoning benchmarks, there is still no benchmark that systematically evaluates counting across modalities as an independent capability.

Limitations of Prior Work: (1) Image counting dataset annotation formats are inconsistent (points/boxes/density maps), making them difficult to use directly for MLLM QA evaluation; (2) Text and audio counting data are extremely scarce—almost no public QA datasets exist for document deduplication counting or audio event counting; (3) Evaluation protocols are inconsistent—splits, prompts, seeds, and matching rules vary across works, making results incomparable; (4) High API costs and rate limits for closed-source models hinder fair cross-model comparison.

Key Challenge: Counting ability spans three levels: perceptual localization, semantic filtering, and rule-based reasoning. Existing benchmarks either cover only a single modality or fail to distinguish capability levels, making it impossible to systematically locate MLLM counting bottlenecks.

Goal: To establish a counting benchmark covering image/text/audio modalities with a unified QA format and evaluation protocol, capable of diagnosing capability shortcomings hierarchically.

Key Insight: Design a cross-classification system with three capability levels (Pattern/Semantic/Reasoning) and three difficulty levels (Easy/Medium/Hard), complemented by evidence-first GT and deterministic numerical parsing.

Core Idea: Decompose counting capability into three levels: perceptual counting → semantic filtering → rule-based reasoning. Evaluate all three uniformly across image/text/audio, using metrics such as MAE and HitRate to diagnose MLLM counting bottlenecks level by level.

Method

Overall Architecture

UNICBench addresses the lack of a unified yardstick for counting in MLLMs: image, text, and audio each come with their own annotation formats and evaluation protocols, which makes side-by-side comparison impossible. It standardizes all tri-modal counting problems into a unified "Question, Evidence, Answer" structure: each problem includes an input (an image, a document, or an audio clip), a natural-language question, an integer answer, and a traceable piece of evidence. Two mechanisms are then layered on this unified corpus. First, each problem is cross-labeled by the counting capability it requires and by scene difficulty. Second, a fixed evaluation protocol standardizes splits, prompts, and seeds, with modality-specific answer-matching rules. Finally, results are reported along the Capability × Difficulty × Modality dimensions for direct performance diagnosis.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 420, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    subgraph T["3-level Capability × 3-level Difficulty Cross-classification"]
        direction TB
        T1["Capability Axis L1 Pattern / L2 Semantic / L3 Reasoning"]
        T2["Difficulty Axis Easy 1–10 / Medium 11–100 / Hard >100"]
    end
    A["Tri-modal Raw Data<br/>Multi-source Image / Text / Audio Collection"]
    subgraph B["Evidence-first Annotation + Unified Cross-modal Schema"]
        direction TB
        B1["Unified Pre-processing<br/>Points/Boxes/Timestamps → Coordinates / Char Spans / Timestamps"] --> B2["Dual Annotation + Arbitration<br/>gt_count + Structured gt_evidence"]
    end
    T -. Each QA labeled by grid .-> B
    A --> B
    B --> C
    subgraph C["Standardized Evaluation Protocol + Deterministic Numerical Parsing"]
        direction TB
        C1["Fixed split/prompt/seed + Modality-specific matching rules"] --> C2["Integer Parsing from Response<br/>MAE / MSE / HitRate / SuccessRate"]
    end
    C --> D["3D Cross-reporting<br/>Capability × Difficulty × Modality"]
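
The final reporting stage of the pipeline above amounts to a group-by over three labels. The following is a minimal sketch, not the toolkit's implementation: the tuple layout, the function name `cross_report`, and the sample rows are all hypothetical and only illustrate how a per-cell MAE could be computed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question results (made-up values, not benchmark data):
# (capability, difficulty, modality, ground-truth count, parsed prediction)
results = [
    ("Pattern", "Easy", "image", 4, 4),
    ("Pattern", "Hard", "image", 250, 180),
    ("Pattern", "Hard", "image", 400, 310),
    ("Semantic", "Easy", "text", 3, 5),
    ("Reasoning", "Medium", "text", 20, 14),
    ("Pattern", "Easy", "audio", 2, 2),
]


def cross_report(rows):
    """Return the mean absolute error for each (capability, difficulty, modality) cell."""
    cells = defaultdict(list)
    for capability, difficulty, modality, gt, pred in rows:
        cells[(capability, difficulty, modality)].append(abs(gt - pred))
    return {cell: mean(errors) for cell, errors in cells.items()}


report = cross_report(results)
for cell in sorted(report):
    print(cell, report[cell])
```

Keeping the three labels as a single dictionary key makes it easy to marginalize later, for example by averaging over modality to recover a Capability × Difficulty grid.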

Key Designs

1. Capability × Difficulty Cross-classification: Localizing "Incorrect Counting"

Reporting only an overall accuracy cannot explain whether a model fails because of poor perception or poor calculation. UNICBench splits counting capability into three levels of increasing complexity: Pattern (L1) is direct perceptual counting, where the answer is the size of the instance set, \(y=|E|\), e.g., "How many people are in the image?"; Semantic (L2) requires filtering by attributes or deduplicating before counting, \(y=|\{e \in E \mid P(e)\}|\), e.g., "How many people are wearing red clothes?"; Reasoning (L3) involves combinatorial counting according to rules, \(y=g(|S_1|,\ldots)\), e.g., "How many folders were modified in 2022?". Difficulty levels are mapped to Easy (1–10), Medium (11–100), and Hard (>100) based on objective measures (target density, occlusion, repetition rate). This cross-labeling makes it possible to tell whether the perception layer fails in dense scenes or whether the reasoning layer fails even in simple scenes.
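
The three capability formulas can be made concrete with a toy example. This is an illustrative sketch only: the `Person` type, the function names, and the sample data are hypothetical, and `difficulty_bin` assumes that the 1–10 / 11–100 / >100 ranges refer to the ground-truth count (the source also mentions density, occlusion, and repetition rate, which are not modeled here).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass(frozen=True)
class Person:  # hypothetical instance type, for illustration only
    shirt: str
    seated: bool


def count_pattern(instances: Sequence) -> int:
    """L1 Pattern: y = |E|."""
    return len(instances)


def count_semantic(instances: Iterable, predicate: Callable[[object], bool]) -> int:
    """L2 Semantic: y = |{e in E | P(e)}|."""
    return sum(1 for e in instances if predicate(e))


def count_reasoning(rule: Callable[..., int], *subsets: Sequence) -> int:
    """L3 Reasoning: y = g(|S_1|, ...), with the rule g supplied by the caller."""
    return rule(*(len(s) for s in subsets))


def difficulty_bin(gt_count: int) -> str:
    """Assumed mapping from ground-truth count to Easy / Medium / Hard."""
    if gt_count < 1:
        raise ValueError("the stated bins start at a count of 1")
    if gt_count <= 10:
        return "Easy"
    if gt_count <= 100:
        return "Medium"
    return "Hard"


people = [Person("red", True), Person("blue", False), Person("red", False)]
red = [p for p in people if p.shirt == "red"]
seated = [p for p in people if p.seated]

print(count_pattern(people))                               # 3
print(count_semantic(people, lambda p: p.shirt == "red"))  # 2
print(count_reasoning(lambda a, b: a - b, red, seated))    # 1
print(difficulty_bin(57))                                  # Medium
```

Here the L3 rule is "how many more people wear red than are seated", one arbitrary choice of \(g\); any function of the subset sizes would fit the same shape.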

2. Evidence-first Annotation + Unified Cross-modal Schema: Making Every Answer Traceable

Meaningful cross-modal comparison requires ground truth of the same form in all three modalities. UNICBench stores not just a gt_count for each question but also a structured gt_evidence: coordinates for images, character-level spans for text, and timestamps for audio. This records what was counted, so each answer can be verified individually. Question templates are also handled by level: L1 uses deterministic templates to suppress linguistic variation, while L2/L3 allow free-form text with explicitly stated filtering rules to avoid ambiguity. Quality control relies on independent double annotation plus multi-stage arbitration until the annotations agree 100%. The unified schema makes ground truth verifiable and allows "image counts" to be compared directly with "audio counts."
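
A record in such a schema might look like the sketch below. Only the field names `gt_count` and `gt_evidence` come from the description above; every other field name, the tuple encodings of evidence, and the helper `evidence_matches_count` are assumptions made for illustration, not the benchmark's actual format.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional, Tuple, Union

Point = Tuple[float, float]     # image: (x, y) coordinate of one instance
CharSpan = Tuple[int, int]      # text: [start, end) character offsets
TimeSpan = Tuple[float, float]  # audio: (start_s, end_s) in seconds
Evidence = Union[Point, CharSpan, TimeSpan]


@dataclass
class CountingQA:
    modality: Literal["image", "text", "audio"]
    source: str                  # hypothetical: file path or raw document text
    question: str
    gt_count: int
    gt_evidence: List[Evidence]
    capability: Literal["Pattern", "Semantic", "Reasoning"]
    difficulty: Literal["Easy", "Medium", "Hard"]


def evidence_matches_count(qa: CountingQA) -> Optional[bool]:
    """For L1/L2 items, check that there is one evidence entry per counted instance.

    Returns None for Reasoning items, where the count is a function of several
    subset sizes and cannot be checked by length alone.
    """
    if qa.capability == "Reasoning":
        return None
    return len(qa.gt_evidence) == qa.gt_count


document = "apple, pear, apple"
qa = CountingQA(
    modality="text",
    source=document,
    question="How many times does 'apple' appear?",
    gt_count=2,
    gt_evidence=[(0, 5), (13, 18)],
    capability="Pattern",
    difficulty="Easy",
)

print(evidence_matches_count(qa))
print([document[start:end] for start, end in qa.gt_evidence])
```

Storing evidence as offsets rather than copied text means a reviewer can re-slice the original input and confirm each counted instance independently.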

3. Standardized Evaluation Protocol + Deterministic Numerical Parsing: Eliminating Incomparability

Inconsistent splits, prompts, seeds, and matching rules in previous works made scores impossible to align. UNICBench standardizes these: splits, prompts, and random seeds are fixed, and matching rules are customized by modality (exact matching for numerical classes, \(\epsilon\)-tolerance for continuous quantities). Since model outputs are natural language, a deterministic numerical parser extracts the integer from each response, so that formatting differences do not cause misjudgment. Finally, a set of complementary metrics is reported: MAE and MSE measure numerical deviation, SuccessRate measures how often a response yields a parsable number, and HitRate@100%/@90%/@80% measures hits within different error tolerances. Together they cover both accuracy and robustness.
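
The source does not spell out the parsing rules, so the following is only one plausible deterministic parser, written under stated assumptions: it prefers the first integer after the last occurrence of the word "answer", otherwise falls back to the last integer in the response, accepts thousands separators, and ignores decimals and spelled-out numbers. The name `parse_count` and both regular expressions are hypothetical.

```python
import re
from typing import Optional

# An integer, optionally with thousands separators, that is not part of a decimal.
_NUM = re.compile(r"(?<![\d.])(\d{1,3}(?:,\d{3})+|\d+)(?!\.?\d)")
_MARKER = re.compile(r"\banswer\b", re.IGNORECASE)


def _to_int(token: str) -> int:
    return int(token.replace(",", ""))


def parse_count(response: str) -> Optional[int]:
    """Extract one integer from a free-form response, or None if there is none."""
    text = response.strip()

    # Rule 1 (assumed): the first integer after the last "answer" marker wins.
    last_marker = None
    for last_marker in _MARKER.finditer(text):
        pass
    if last_marker is not None:
        after = _NUM.search(text, last_marker.end())
        if after is not None:
            return _to_int(after.group(1))

    # Rule 2 (assumed): otherwise take the last integer anywhere in the response.
    tokens = _NUM.findall(text)
    return _to_int(tokens[-1]) if tokens else None


print(parse_count("The answer is 7 because 3 + 4."))
print(parse_count("I count 3 cats and 4 dogs, so 7 in total."))
print(parse_count("Roughly 1,200 people are visible."))
print(parse_count("I cannot tell."))
```

Because the rules are fixed and free of randomness, the same response always maps to the same integer (or to None, which would count against SuccessRate).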

Metric Definitions

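UNICBench is an evaluation benchmark and does not involve model training. With \(y_i\) the ground-truth count, \(\hat{y}_i\) the parsed prediction, and \(N\) the number of evaluated questions, the core metrics are \(\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|\) and \(\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\), which measure deviation from ground truth; HitRate@X% is the fraction of predictions that count as hits at the X% level (reported at 100%, 90%, and 80%); SuccessRate is the fraction of responses from which a parsable number can be extracted.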
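
These metrics are short enough to sketch directly. Several choices below are assumptions, since the source does not fix them: MAE and MSE are averaged over parsed responses only, HitRate is taken over all questions (an unparsed response counts as a miss), and HitRate@X% is read as "relative error at most (100 − X)%", so that @100% is an exact match. The function name `evaluate` and the sample values are hypothetical.

```python
from typing import Dict, Optional, Sequence


def evaluate(
    gt: Sequence[int],
    preds: Sequence[Optional[int]],
    levels: Sequence[int] = (100, 90, 80),
) -> Dict[str, float]:
    """Compute MAE, MSE, SuccessRate, and HitRate@X% under the assumptions above."""
    total = len(gt)
    parsed = [(y, p) for y, p in zip(gt, preds) if p is not None]
    n = len(parsed)

    metrics = {
        "SuccessRate": n / total,
        "MAE": sum(abs(y - p) for y, p in parsed) / n if n else float("nan"),
        "MSE": sum((y - p) ** 2 for y, p in parsed) / n if n else float("nan"),
    }
    for level in levels:
        # Integer arithmetic avoids float rounding at the tolerance boundary:
        # |y - p| / y <= (100 - level) / 100  <=>  |y - p| * 100 <= (100 - level) * y
        hits = sum(1 for y, p in parsed if abs(y - p) * 100 <= (100 - level) * y)
        metrics[f"HitRate@{level}%"] = hits / total
    return metrics


# Made-up example: the last response could not be parsed.
gt = [4, 100, 250, 10]
preds = [4, 90, 180, None]
m = evaluate(gt, preds)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this toy run the prediction 90 for a true count of 100 sits exactly on the 10% boundary and is counted as a hit at the 90% level, which is why the comparison is done in integers.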

Key Experimental Results

Main Results (Top-10 Models in Image Modality)

| Model | Overall MAE↓ | Easy MAE↓ | Hard MAE↓ | Pattern MAE↓ | Reasoning MAE↓ |
| --- | --- | --- | --- | --- | --- |
| GPT-5-mini | 29.8 | 2.1 | 155.0 | 25.4 | 5.3 |
| o4-mini | 42.9 | 2.2 | 239.1 | 39.1 | 4.1 |
| GPT-4o | 43.2 | 2.4 | 238.4 | 41.7 | 5.4 |
| GPT-o3 | 49.0 | 2.8 | 277.1 | 44.3 | 4.4 |
| GPT-5 | 54.1 | 2.5 | 312.4 | 55.1 | 5.9 |
| Claude-Sonnet-4 | 78.1 | 5.4 | 444.6 | 68.8 | 4.4 |
| Gemini-2.5-Pro | 90.0 | 4.3 | 504.9 | 71.1 | 4.6 |
| Gemini-2.5-Flash | 140.5 | 12.0 | 694.2 | 131.4 | 6.7 |
| GLM-4.1V-9B | 97.9 | 3.0 | 542.2 | 90.0 | 3.1 |
| GPT-4o-mini | 73.3 | 2.3 | 424.6 | 72.7 | 5.3 |

Cross-modal/Cross-difficulty Analysis

| Dimension | Finding |
| --- | --- |
| Easy vs. Hard | Easy MAE is in the low single digits for most models (2.1–12.0 in the image table), while Hard MAE runs from about 155 to 694: a gap of roughly two orders of magnitude |
| Pattern vs. Reasoning | Image Reasoning MAE is low (3–7), but the sample size is small (4.6% of image questions); high Pattern MAE comes from high-density scenes |
| Text modality | Reasoning has the largest share (43.7%); models generally perform poorly on deduplication and cross-paragraph aggregation |
| Audio modality | Ambient sound event density is low (1.56/sample); meeting speech density is extremely high (81.51/sample) |
| Long-tail distribution | The GT count distribution is heavily right-skewed and long-tailed; model error explodes in high-count regions |

Key Findings

  • On simple counting tasks (L1+Easy), models converge: Easy MAE stays between about 2 and 12 across the listed models.
  • Large performance gaps exist in the Hard partition: the Hard MAE of the best model (GPT-5-mini, 155.0) and the worst (Gemini-2.5-Flash, 694.2) differ by about 4.5x.
  • Reasoning tasks in the text modality (deduplicating citations, cross-paragraph statistics) are currently the biggest shortcoming for MLLMs.
  • Open-source models perform surprisingly well on Reasoning (GLM-4.1V MAE 3.1), but show significant gaps in Pattern.
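
The ratios quoted in these findings can be re-derived from the main results table. The short check below copies the Easy and Hard MAE columns from that table; the variable names are arbitrary.

```python
# (Easy MAE, Hard MAE) per model, copied from the image-modality results table.
image_mae = {
    "GPT-5-mini": (2.1, 155.0),
    "o4-mini": (2.2, 239.1),
    "GPT-4o": (2.4, 238.4),
    "GPT-o3": (2.8, 277.1),
    "GPT-5": (2.5, 312.4),
    "Claude-Sonnet-4": (5.4, 444.6),
    "Gemini-2.5-Pro": (4.3, 504.9),
    "Gemini-2.5-Flash": (12.0, 694.2),
    "GLM-4.1V-9B": (3.0, 542.2),
    "GPT-4o-mini": (2.3, 424.6),
}

# Spread between the worst and best model on the Hard partition.
hard = {model: h for model, (_, h) in image_mae.items()}
best, worst = min(hard, key=hard.get), max(hard, key=hard.get)
hard_spread = hard[worst] / hard[best]
print(f"Hard MAE spread: {worst} / {best} = {hard_spread:.1f}x")

# Per-model ratio of Hard MAE to Easy MAE.
hard_over_easy = {model: h / e for model, (e, h) in image_mae.items()}
for model, ratio in sorted(hard_over_easy.items(), key=lambda kv: kv[1]):
    print(f"{model:18s} {ratio:6.1f}x")
```

The per-model Hard/Easy ratio is not uniform: it ranges from under 60x for Gemini-2.5-Flash, whose Easy MAE is already high, to above 180x for GPT-4o-mini and GLM-4.1V-9B.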

Highlights & Insights

  • First unified counting benchmark across three modalities—evaluating "counting" as an independent core cognitive capability.
  • 3-level Capability × 3-level Difficulty cross-classification enables precise diagnosis, identifying specific failure points.
  • Evidence-first GT design ensures every answer is traceable and verifiable.
  • Long-tail distribution analysis reveals systematic failures in high-count scenarios, indicating cognitive blind spots rather than random errors.
  • Extensive evaluation of 45 models gives the conclusions broad empirical support.

Limitations & Future Work

  • Audio counting data volume is relatively small (2,069 samples vs. 5,300 for image), limiting the robustness of audio-dimension findings.
  • High evaluation costs for closed-source APIs (GPT-5 level) restrict replication and expansion.
  • Cross-modal joint counting (e.g., using both vision and audio in video) is not yet addressed.
  • Reasoning in the image modality accounts for only 4.6%, resulting in a small sample size for that level's conclusions.
  • The impact of enhancement strategies like few-shot or chain-of-thought on counting performance has not been explored.

Related Work & Insights

  • vs MMBench/MMMU: General benchmarks do not systematically evaluate counting; UNICBench fills the gap for deep evaluation of this specific capability.
  • vs FSC-147/ShanghaiTech: Traditional datasets use density maps or point annotations; UNICBench unifies them into QA format for MLLMs.
  • vs DocVQA/ChartQA: These involve counting but not as a core capability; UNICBench focuses on counting and provides hierarchical diagnosis.
  • The hierarchical evaluation paradigm (Capability × Difficulty × Modality) can be extended to benchmark other specific capabilities like spatial reasoning or temporal understanding.
  • Systematic failures under long-tail distributions suggest that MLLMs may lack true "counting" ability and rely more on pattern matching.

Rating

  • Novelty: ⭐⭐⭐⭐ First unified cross-modal counting benchmark with a rational classification system.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation of 45 models with three-dimensional cross-analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear classification system and rich visualizations.
  • Value: ⭐⭐⭐⭐ Reveals systematic flaws in MLLM counting capabilities; the benchmark has long-term utility.