CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Conference: CVPR 2026
arXiv: 2508.18753
Code: github.com/ChelsieLei/CrossHOI-Bench
Area: Multimodal VLM / Human-Object Interaction Detection
Keywords: HOI Detection, VLM Evaluation, Multiple Choice Benchmark, Cross-Paradigm Comparison
TL;DR
This paper proposes CrossHOI-Bench, the first HOI benchmark for unified evaluation of VLMs and HOI-specific models via multiple-choice questions. Curated positive and negative examples avoid erroneous penalties from incomplete annotations. Under this protocol, large VLMs outperform SOTA HOI methods zero-shot by \(5.18\) points in Instance-F1, yet show systematic weaknesses in multi-action recognition and cross-human attribution.
Background & Motivation
Background: HOI detection has long been dominated by task-specific models (ADA-CM, CMMP, HOLa, etc.). However, large generative VLMs (Qwen2.5-VL, InternVL3) have demonstrated strong open-scene understanding, raising the core question: "Can VLMs directly perform HOI detection?"
Limitations of Prior Work: (1) Existing HOI benchmarks (e.g., HICO-DET) rely on exact label matching and suffer from incomplete annotations—any correct but unlabelled prediction is penalized. (2) Ambiguity cannot be resolved by exhaustive annotation: single images often lack visual evidence to distinguish specific actions (e.g., "boarding" vs "exiting"). (3) The distribution of HICO-DET training and test sets is highly similar (\(KL=0.088\)), making it difficult to assess true model capability in complex tail-class scenarios.
Key Challenge: A unified protocol is needed for fair comparison across both paradigms. Existing evaluation methodologies fundamentally underestimate true capabilities: HOI-specific methods report \(<50\%\) mAP, while VLMs achieve only approximately \(15\%\) on HICO-DET.
Goal: Design an unbiased, cross-paradigm HOI evaluation benchmark to reveal the performance gaps and complementary strengths of VLMs and HOI-specific methods.
Key Insight: Reformulate HOI detection as a multi-answer multiple-choice task, utilizing curated positive and negative examples to eliminate the issue of incomplete annotation.
Core Idea: An MCQ format combined with curated negative examples and three evaluation settings enables the first fair and direct comparison between VLMs and HOI-specific models.
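The train/test similarity cited above (\(KL=0.088\)) can be illustrated with a small calculation. The sketch below is a minimal example, not the authors' script: the direction of the divergence (test relative to train), the natural-log base, the smoothing constant, and the toy labels are all assumptions, since the summary does not specify them.

```python
import math
from collections import Counter

def class_distribution(labels, classes, eps=1e-6):
    """Smoothed relative frequency of each class in `labels` (eps is an assumed smoothing term)."""
    counts = Counter(labels)
    total = len(labels) + eps * len(classes)
    return [(counts[c] + eps) / total for c in classes]

def kl_divergence(p, q):
    """KL(P || Q) in nats for two aligned discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy interaction labels (hypothetical, not taken from HICO-DET).
train = ["ride bicycle"] * 6 + ["hold cup"] * 3 + ["cut with knife"] * 1
test_similar = ["ride bicycle"] * 6 + ["hold cup"] * 3 + ["cut with knife"] * 1
test_shifted = ["ride bicycle"] * 2 + ["hold cup"] * 3 + ["cut with knife"] * 5

classes = sorted(set(train))
p_train = class_distribution(train, classes)
kl_similar = kl_divergence(class_distribution(test_similar, classes), p_train)
kl_shifted = kl_divergence(class_distribution(test_shifted, classes), p_train)
print(f"similar split: {kl_similar:.3f}  shifted split: {kl_shifted:.3f}")
```

A test split that mirrors the training frequencies yields a divergence near zero, while one that over-represents tail classes yields a larger value, which is the direction of change the benchmark aims for.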
Method
Overall Architecture
CrossHOI-Bench addresses two questions: whether large VLMs can perform HOI detection directly, and how to compare them fairly with HOI-specific models. It reformulates HOI detection as a "Multi-Answer Multiple Choice" task in which each image is paired with a four-option question. Positive examples are derived from ground truth plus manual supplements, while negative examples are curated "plausible-but-incorrect" actions. As a result, models that identify unlabelled but correct actions are not penalized, which removes the systematic underestimation of exact-match benchmarks. The pipeline consists of two parts: a "Three-stage Data Construction" (Task Reformulation \(\rightarrow\) Automated Negative Filtering \(\rightarrow\) Manual Refinement) and "Three Evaluation Settings" for unified scoring.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
A["HICO-DET Images"] --> BUILD
subgraph BUILD["Three-stage Data Construction"]
direction TB
B["Task Reformulation: HOI to MCQ<br/>Allows multiple correct answers per question"] --> C["Automated Negative Filtering<br/>GPT-4.1 filters semantic inconsistency → Consensus from Qwen2.5-VL-32B + GPT-4o"]
C --> D["Manual Refinement: Supplement hard positive/negative examples<br/>Remove easy scenes → Increase Test Set KL from 0.088 to 0.629"]
end
BUILD --> E["Main Benchmark: 1,274 Qs<br/>+ V-COCO / SWiG-HOI Expansion (Total 3,773)"]
E --> EVAL
subgraph EVAL["Three Evaluation Settings"]
direction TB
F["Setting 1: Full Detection (Human IoU ≥ 0.5 + Recognition)"]
G["Setting 2: Given Box Recognition (Isolates recognition from detection error)"]
H["Setting 3: Image-level (Recognize all interactions in view)"]
end
EVAL --> I["Multi-dimensional Scoring<br/>Macro / Instance / Micro-F1 · EM · Precision / Recall"]
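Setting 1 in the diagram requires a predicted human box to overlap the target person with IoU \(\ge 0.5\) before recognition is scored. The following is a minimal sketch of that check. The `(x1, y1, x2, y2)` box format and the "pick the best-overlapping prediction" rule in the hypothetical helper `match_target` are assumptions; the summary only states the threshold.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_target(pred_boxes, target_box, thresh=0.5):
    """Index of the predicted box that best overlaps the target, or None if none reaches `thresh`."""
    best_idx, best_iou = None, 0.0
    for i, box in enumerate(pred_boxes):
        iou = box_iou(box, target_box)
        if iou >= thresh and (best_idx is None or iou > best_iou):
            best_idx, best_iou = i, iou
    return best_idx

# Toy boxes (hypothetical coordinates).
target = (0, 0, 10, 10)
preds = [(20, 20, 30, 30), (1, 1, 10, 10), (0, 0, 10, 9)]
matched = match_target(preds, target)
missed = match_target([(5, 0, 15, 10)], target)
print(matched, missed)
```

In Setting 2 this step is skipped because the human box is given, which is why the gap between the two settings can be read as the cost of localization.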
Key Designs
1. Three-stage Data Construction: Decoupling "Incorrect Answers" from "Missing Labels"
Exact-matching benchmarks penalize models that predict correct but unlabelled actions, leading to artificially low scores (approx. \(15\%\)) for VLMs. CrossHOI-Bench uses a three-stage construction to eliminate this penalty. Stage 1 reformulates the task into four-option questions, allowing multiple correct answers (e.g., both "hold knife" and "cut with knife"). Stage 2 automates negative filtering: GPT-4.1 separates semantically consistent/inconsistent candidates, followed by consensus verification from Qwen2.5-VL-32B and GPT-4o. Only actions identified as incorrect by both models are retained as negative examples. Stage 3 involves manual refinement: removing simple scenarios (single person, plain background) and adding hard positives (ambiguous actions like "boarding" vs "exiting" are both categorized as correct) and hard negatives (actions of nearby people, fine-grained similarities like "holding" vs "hugging"). The test set was redistributed to increase its KL divergence from the training set from \(0.088\) to \(0.629\).
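The Stage 2 consensus rule (keep a candidate as a negative only if both verifier models call it incorrect) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the `Judge` callable type, the helper `make_stub_judge`, and the stub verdicts are hypothetical stand-ins for real calls to Qwen2.5-VL-32B and GPT-4o.

```python
from typing import Callable, Iterable, List

# A judge takes (image_id, action) and returns True if it deems the action incorrect.
Judge = Callable[[str, str], bool]

def consensus_negatives(image_id: str, candidates: Iterable[str], judges: List[Judge]) -> List[str]:
    """Keep only the candidates that every judge marks as incorrect for this image."""
    return [a for a in candidates if all(judge(image_id, a) for judge in judges)]

def make_stub_judge(rejected: set) -> Judge:
    """Hypothetical stand-in for a VLM call: rejects a fixed set of actions."""
    return lambda image_id, action: action in rejected

judge_a = make_stub_judge({"hug person", "throw frisbee", "ride horse"})
judge_b = make_stub_judge({"hug person", "ride horse", "hold cup"})

candidates = ["hold cup", "hug person", "throw frisbee", "ride horse"]
negatives = consensus_negatives("img_0001", candidates, [judge_a, judge_b])
print(negatives)  # ['hug person', 'ride horse']
```

Requiring agreement trades coverage for reliability: a candidate that only one model rejects is dropped rather than risk labelling a plausible action as wrong.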
2. Three Complementary Evaluation Settings: Diagnostic Error Isolation
To determine whether VLM failures stem from localization or recognition, the benchmark employs three settings. Setting 1 (Full HOI Detection) tests end-to-end capability (IoU \(\ge 0.5\) + Recognition). Setting 2 (Given Box Recognition) isolates recognition ability by providing human bounding boxes. Setting 3 (Image-level) evaluates global scene understanding by identifying all human interactions in the image. This tiered approach reveals that small VLMs match HOI methods in pure recognition but degrade sharply when they must localize people themselves. Scoring includes Macro-F1 (class balance), Instance-F1 (per-question), Micro-F1 (global), Exact Match (EM), and average precision and recall (Avg.Prec / Avg.Rec).
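The scoring metrics listed above can be made concrete with a short sketch. It assumes the standard multi-label definitions (Instance-F1 as the mean of per-question F1, Micro-F1 from pooled counts, Macro-F1 as the mean of per-class F1, EM as exact set equality); the summary gives only the one-word glosses, so the benchmark's exact formulas may differ. The function names and toy questions are hypothetical.

```python
from collections import defaultdict

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 where undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def score(questions):
    """questions: list of (gold_set, pred_set) pairs of selected action labels."""
    inst_f1, exact = [], 0
    pooled = [0, 0, 0]                          # global tp, fp, fn
    per_class = defaultdict(lambda: [0, 0, 0])  # per-action tp, fp, fn
    for gold, pred in questions:
        tp, fp, fn = len(gold & pred), len(pred - gold), len(gold - pred)
        inst_f1.append(prf(tp, fp, fn)[2])
        exact += int(gold == pred)
        for i, v in enumerate((tp, fp, fn)):
            pooled[i] += v
        for a in gold & pred:
            per_class[a][0] += 1
        for a in pred - gold:
            per_class[a][1] += 1
        for a in gold - pred:
            per_class[a][2] += 1
    n = len(questions)
    return {
        "instance_f1": sum(inst_f1) / n,
        "micro_f1": prf(*pooled)[2],
        "macro_f1": sum(prf(*c)[2] for c in per_class.values()) / len(per_class),
        "exact_match": exact / n,
    }

# Toy questions (hypothetical): one under-prediction, one exact hit, one over-prediction.
questions = [
    ({"hold knife", "cut with knife"}, {"hold knife"}),
    ({"ride bicycle"}, {"ride bicycle"}),
    ({"hold cup"}, {"hold cup", "hug person"}),
]
result = score(questions)
print(result)
```

The first toy question mirrors the multi-action weakness discussed in this summary: selecting only one of two correct actions keeps precision at 1.0 but halves recall and fails Exact Match.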
Key Experimental Results
Main Results: Setting 1 (Full HOI Detection)
| Method | Type | Macro-F1 | Instance-F1 | EM | Avg.Prec | Avg.Rec |
|---|---|---|---|---|---|---|
| ADA-CM | HOI | 43.02 | 47.76 | 19.15 | 76.25 | 51.80 |
| HOLa | HOI | 43.61 | 47.12 | 19.78 | 74.31 | 52.15 |
| CMD-SE | HOI | 47.49 | 44.66 | 20.33 | 78.33 | 46.96 |
| InternVL3-38B | VLM | 38.04 | 38.68 | 20.33 | 84.72 | 33.56 |
| Qwen2.5-VL-32B | VLM | 50.71 | 52.94 | 26.06 | 75.03 | 51.97 |
| Qwen2.5-VL-7B | VLM | 29.73 | 30.53 | 14.29 | 75.92 | 24.89 |
Main Results: Setting 2 (Given Human Box, Pure Recognition)
| Method | Type | Macro-F1 | Instance-F1 | EM | Avg.Prec | Avg.Rec |
|---|---|---|---|---|---|---|
| InternVL3-38B | VLM | 58.94 | 67.41 | 35.64 | 81.90 | 57.85 |
| Qwen2.5-VL-32B | VLM | 62.90 | 69.52 | 35.01 | 75.30 | 66.61 |
| Qwen2.5-VL-7B | VLM | 48.93 | 57.25 | 25.98 | 74.49 | 46.87 |
Ablation Study
| Dimension | VLMs | HOI-Specific Methods |
|---|---|---|
| Overall F1 | Large models outperform SOTA HOI methods zero-shot; small models match them only in pure recognition | Outperformed by large VLMs; ahead of small VLMs when detection is required |
| Multi-action Recognition | Tend to predict only one action | Better at identifying co-occurring actions |
| Cross-human Attribution | Often attribute neighbors' actions to the target person | More accurate human-action binding |
| Precision (Setting 1) | Comparable or higher (75.03-84.72) | Stable (74.31-78.33) |
| Recall (Setting 1) | Highly variable (24.89-51.97) | More balanced (46.96-52.15) |
Key Findings
- Qwen2.5-VL-32B zero-shot outperforms all HOI methods in Instance-F1 by \(5.18\) points (52.94 vs 47.76 for ADA-CM, the strongest HOI method on this metric).
- Small VLMs (7-8B) match HOI methods in pure recognition settings but suffer significant performance drops when detection is required.
- Systematic VLM weaknesses: Under-prediction of multiple actions and misattribution of actions between different human subjects.
- Qwen3-VL-30B exhibited extreme behavior: Avg.Prec \(= 100\%\) but Recall \(\approx 0\%\), indicating a near-total failure to make predictions.
Highlights & Insights
- First comparison of VLMs and HOI-specific methods under a unified protocol, so conclusions about the two paradigms rest on directly comparable scores.
- Reveals that existing HOI benchmarks systematically underestimate model performance due to design flaws.
- The MCQ format and curated negative methodology can be transferred to other tasks with incomplete annotations.
- The three complementary settings and sub-benchmarks provide a rigorous, multi-dimensional analysis of model capabilities.
Limitations & Future Work
- The benchmark scale is relatively small (1,274 main images), potentially insufficient to cover all HOI scenarios.
- The MCQ format may simplify the true complexity of open-world HOI understanding.
- Negative example curation relies on VLM consensus, which may introduce model-specific biases.
- The impact of VLM fine-tuning on benchmark performance was not deeply analyzed.
Related Work & Insights
- vs HICO-DET/V-COCO: The fundamental difference lies in using multi-answer MCQ vs exact matching, eliminating systematic penalties for incomplete annotation.
- vs SWiG-HOI: While SWiG-HOI expands the label space (5,500+ classes), CrossHOI-Bench focuses on fair cross-paradigm comparison.
- Evaluation Methodology: VLM capabilities across many tasks may be systematically underestimated due to benchmark design; restructuring evaluation may lead to new discoveries.
- VLM Potential: The HOI understanding of large VLMs has been significantly underrated. Future work should prioritize multi-action recognition and cross-human attribution.
Rating
⭐⭐⭐⭐⭐ (4.5/5)
- Novelty ⭐⭐⭐⭐: Benchmark design for unified cross-paradigm evaluation fills a critical gap.
- Experimental Thoroughness ⭐⭐⭐⭐⭐: Comprehensive multi-model, multi-setting, and multi-dimensional analysis.
- Writing Quality ⭐⭐⭐⭐⭐: Clear problem statement, rigorous design, and compelling conclusions.
- Value ⭐⭐⭐⭐⭐: Uncovers significant evaluation biases and paradigm complementarities for the community.