VDRP: Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection¶
- Conference: NeurIPS 2025
- arXiv: 2510.25094
- Code: https://github.com/mlvlab/VDRP
- Area: Video/Image Understanding
- Keywords: HOI Detection, Zero-shot Learning, Prompt Learning, CLIP, Visual Diversity
TL;DR¶
This paper proposes the VDRP framework, which addresses two core challenges in zero-shot HOI detection — intra-class visual diversity and inter-class visual entanglement — through visual diversity-aware prompt learning (via group-level variance injection and Gaussian perturbation) and region-aware prompt augmentation (via LLM-generated regional concept retrieval).
Background & Motivation¶
Human-Object Interaction (HOI) detection requires localizing humans and objects and recognizing the interactions between them. Zero-shot HOI detection demands that models generalize to unseen verb-object combinations at test time, introducing two core visual challenges:
Intra-class visual diversity: The same verb (e.g., "holding a baseball glove") exhibits substantial visual variation across different poses, viewpoints, and scenes. The authors quantitatively show that verb classes have a significantly higher diversity score (\(0.364 \pm 0.060\)) than object classes (\(0.274 \pm 0.048\)), indicating that a single static prompt cannot capture the visual variation of verbs.
Inter-class visual entanglement: Semantically distinct verbs (e.g., "eating," "licking," "sitting next to") produce highly similar visual patterns under global or union-region features. t-SNE visualizations reveal substantial overlap among different verb categories.
Limitations of Prior Work: Most CLIP-based prompt methods (e.g., GEN-VLKT, ADA-CM) use only a single static prompt per verb; CMMP incorporates spatial cues, but its text prompts remain region-agnostic; EZ-HOI leverages LLM descriptions but ignores intra-class variation.
Core Idea: Simultaneously encode visual variation statistics (variance injection and perturbation) and region-specific semantics (concept retrieval and augmentation) into the prompt embeddings, with the two components complementarily addressing the two challenges above.
Method¶
Overall Architecture¶
A two-stage HOI detection pipeline is adopted: (1) a frozen DETR detector localizes humans and objects; (2) a CLIP-based interaction classifier extracts human (\(\mathbf{x}_h\)), object (\(\mathbf{x}_o\)), and union-region (\(\mathbf{x}_u\)) features via lightweight adapters. The key innovation is on the text-prompt side: visual diversity-aware prompts are generated first, then augmented with regional concepts to produce region-aware prompts, and the final verb classification averages logits over the three regions.
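To make the scoring concrete, here is a minimal sketch of the region-averaged classification step, assuming unit-normalized features, per-region prompt matrices `t_h`, `t_o`, `t_u`, and a CLIP-style temperature; the function name, shapes, and temperature value are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def verb_logits(x_h, x_o, x_u, t_h, t_o, t_u, temperature: float = 0.01):
    """Average cosine-similarity logits over the human, object, and union regions.

    x_*: (D,) region features from the adapted CLIP visual encoder.
    t_*: (V, D) region-aware verb prompt embeddings for the matching region.
    """
    logits = []
    for x, t in ((x_h, t_h), (x_o, t_o), (x_u, t_u)):
        x = F.normalize(x, dim=-1)
        t = F.normalize(t, dim=-1)
        logits.append(t @ x / temperature)      # (V,) logits for this region
    return torch.stack(logits).mean(dim=0)      # verb scores averaged over the three regions
```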
Key Designs¶
- Visual Diversity-aware Prompt Learning (VDP; see the first sketch after this list):
- Union-region CLS features are extracted for each verb \(v\) from the training set, and the variance \(\boldsymbol{\sigma}_v^2\) is computed.
- Semantically similar verb groups \(\mathcal{G}(v)\) are constructed based on cosine similarity in CLIP text embedding space, and the group-level variance is computed as \(\bar{\boldsymbol{\sigma}}_v^2 = \frac{1}{|\mathcal{G}(v)|}\sum_{v' \in \mathcal{G}(v)} \boldsymbol{\sigma}_{v'}^2\) (a stabilized estimate, particularly important for rare verbs).
- An MLP maps the group-level variance to a modulation vector \(\mathbf{d}_v\), which is injected into the shared context embedding: \(\hat{\mathbf{E}}_v = \mathbf{E} + \mathbf{d}_v \alpha\).
- After the prompt passes through the CLIP text encoder, a variance-guided Gaussian perturbation is applied: \(\tilde{\mathbf{t}}^v = \mathbf{t}^v + (\epsilon \odot \tilde{\boldsymbol{\sigma}}_v)\beta\).
- Region-aware Prompt Augmentation (RAP; see the second sketch after this list):
- An LLM (LLaMA-7B / GPT-4) generates \(K\) visual concept descriptions per verb for each region (human / object / union).
- These are encoded into concept pools \(\mathcal{C}_{(\cdot)}^v\) using the CLIP text encoder.
- Given a region feature \(\mathbf{x}_{(\cdot)}\), cosine similarities with the concepts are computed, and Sparsemax (rather than Softmax) is applied to produce sparse weights, retaining only the most relevant concepts.
- A weighted aggregation yields the regional concept vector \(\bar{\mathbf{c}}_{(\cdot)}^v\), which augments the diversity-aware prompt: \(\hat{\mathbf{t}}_{(\cdot)}^v = \mathbf{t}^v + \bar{\mathbf{c}}_{(\cdot)}^v \gamma\).
- Spatially Augmented Union-region Features: A SpatialHead fuses union-region features with human and object features along with their bounding boxes, incorporating spatial priors (see the third sketch below).
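The following is a minimal PyTorch sketch of the VDP step (first item above), under stated assumptions: the verb group \(\mathcal{G}(v)\) is approximated by a top-k neighborhood in CLIP text-embedding space, `text_encode` stands in for the CLIP text encoder applied to the learned context tokens, the square root of the group-level variance serves as the perturbation scale \(\tilde{\boldsymbol{\sigma}}_v\), and the MLP shape, group size, and \(\alpha, \beta\) values are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_level_variance(verb_text_emb: torch.Tensor,  # (V, D) CLIP text embeddings per verb
                         verb_var: torch.Tensor,        # (V, D) per-verb feature variance sigma_v^2
                         topk: int = 5) -> torch.Tensor:
    """Average per-verb variance over the top-k most similar verbs in CLIP text space."""
    emb = F.normalize(verb_text_emb, dim=-1)
    sim = emb @ emb.T                                   # (V, V) cosine similarities
    idx = sim.topk(topk, dim=-1).indices                # group G(v): k nearest verbs (incl. v itself)
    return verb_var[idx].mean(dim=1)                    # (V, D) group-level variance sigma_bar_v^2

class DiversityPrompt(nn.Module):
    """Inject group-level variance into a shared context, then perturb the encoded prompt."""
    def __init__(self, dim: int, ctx_len: int, alpha: float = 0.1, beta: float = 0.1):
        super().__init__()
        self.context = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)   # shared context E
        self.var_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.alpha, self.beta = alpha, beta

    def forward(self, group_var: torch.Tensor, text_encode) -> torch.Tensor:
        d_v = self.var_mlp(group_var)                                    # modulation vector d_v
        ctx = self.context.unsqueeze(0) + self.alpha * d_v.unsqueeze(1)  # E_hat_v = E + d_v * alpha
        t_v = text_encode(ctx)                                           # (V, D) encoded verb prompts
        if self.training:                                                 # variance-guided perturbation
            eps = torch.randn_like(t_v)
            t_v = t_v + self.beta * (eps * group_var.sqrt())             # t_tilde = t + (eps ⊙ sigma_tilde) * beta
        return t_v
```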
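A companion sketch of the RAP step (second item above) for a single verb and region, with a standard Sparsemax implementation (Martins & Astudillo, 2016); the unscaled cosine scores, the feature shapes, and the \(\gamma\) value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparsemax: a softmax alternative that assigns exact zeros to low-scoring entries."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    z_cumsum = z_sorted.cumsum(dim=-1)
    support = 1 + k * z_sorted > z_cumsum                    # prefix of entries kept in the support
    k_z = support.sum(dim=-1, keepdim=True).clamp(min=1)     # support size k(z)
    tau = (z_cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

def region_aware_prompt(x_region: torch.Tensor,      # (D,) region feature: x_h, x_o, or x_u
                        concept_pool: torch.Tensor,  # (K, D) CLIP-encoded concepts for this verb/region
                        t_v: torch.Tensor,           # (D,) diversity-aware verb prompt
                        gamma: float = 0.1) -> torch.Tensor:
    x = F.normalize(x_region, dim=-1)
    c = F.normalize(concept_pool, dim=-1)
    weights = sparsemax(c @ x)             # sparse weights; irrelevant concepts get exactly zero
    c_bar = weights @ concept_pool         # weighted regional concept vector c_bar
    return t_v + gamma * c_bar             # region-aware prompt t_hat
```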
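The SpatialHead is described only at a high level; the hypothetical sketch below shows one plausible fusion of the three region features with normalized box coordinates (the concatenate-and-MLP design and the 8-d box encoding are assumptions, not the paper's architecture).

```python
import torch
import torch.nn as nn

class SpatialHead(nn.Module):
    """Hypothetical fusion of union, human, and object features with box geometry."""
    def __init__(self, dim: int, box_dim: int = 8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(3 * dim + box_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x_u, x_h, x_o, box_h, box_o):
        # box_h, box_o: (4,) boxes normalized by image size -> an 8-d spatial prior
        spatial = torch.cat([box_h, box_o], dim=-1)
        return self.fuse(torch.cat([x_u, x_h, x_o, spatial], dim=-1))
```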
Loss & Training¶
- Focal Loss is used for multi-label verb classification (see the sketch after this list).
- Lightweight adapters are inserted into multiple Transformer blocks of the CLIP visual encoder; only 4.50M parameters are trained.
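For reference, a standard multi-label focal loss over per-verb sigmoid outputs, matching the training objective described above; the balancing factor 0.25 and focusing exponent 2 are the usual defaults and may differ from the paper's settings.

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                          alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss over verb classes; logits and targets have shape (N, V)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()    # down-weight easy examples
```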
Key Experimental Results¶
Main Results¶
| Zero-shot Setting | Metric (mAP) | Ours (VDRP) | Prev. SOTA (EZ-HOI) | Gain |
|---|---|---|---|---|
| NF-UC | HM / Unseen | 33.85 / 36.45 | 31.76 / 33.66 | +2.09 / +2.79 |
| RF-UC | HM / Unseen | 32.77 / 31.29 | 31.18 / 29.02 | +1.59 / +2.27 |
| UO | HM / Unseen | 34.41 / 36.13 | 32.14 / 33.28 | +2.27 / +2.85 |
| UV | HM / Unseen | 29.80 / 26.69 | 29.09 / 25.10 | +0.71 / +1.59 |
Ablation Study¶
| Configuration | NF-UC Unseen | RF-UC Unseen | UO Unseen | UV Unseen |
|---|---|---|---|---|
| BASE | 28.32 | 25.64 | 28.60 | 22.41 |
| + VDP (diversity prompt) | 32.19 | 29.16 | 33.29 | 23.78 |
| + RAP (region prompt) | 34.93 | 26.46 | 33.90 | 24.53 |
| + VDRP (full) | 36.45 | 31.29 | 36.13 | 26.69 |
Key Findings¶
- Both VDP and RAP individually yield significant improvements, and their combination achieves the best performance, demonstrating that intra-class diversity and inter-class discriminability are complementary dimensions.
- Under the NF-UC setting, VDRP improves Unseen class performance by +8.13 (from 28.32 to 36.45), substantially outperforming each individual module.
- Only 4.50M trainable parameters are required, far fewer than CLIP4HOI (56.7M) and HOICLIP (66.18M).
Highlights & Insights¶
- Variance as Signal: Converting intra-class visual variance from "noise" into "signal" by injecting it into prompts to guide learning is an elegant design principle.
- Sparsemax for Concept Retrieval: Compared to Softmax, Sparsemax assigns exact zero weights to irrelevant concepts, preventing noise interference.
- Quantitative Analysis-Driven Design: The methodology of first quantitatively characterizing the problem via diversity scores and t-SNE, then designing targeted solutions, is a valuable practice worth emulating.
Limitations & Future Work¶
- Regional concepts depend on LLM generation, and their quality is bounded by the LLM's capability.
- The definition of "semantically similar verb groups" for group-level variance relies on CLIP text embedding similarity, which may introduce bias.
- Evaluation is conducted only on HICO-DET; validation on other HOI benchmarks such as V-COCO is lacking.
- Hyperparameters such as perturbation strengths \(\alpha, \beta, \gamma\) require careful tuning.
Related Work & Insights¶
- vs. EZ-HOI: EZ-HOI uses LLM descriptions to distinguish semantic differences between verbs but ignores intra-class variation; VDRP addresses both dimensions simultaneously.
- vs. CMMP: CMMP incorporates spatial cues but the text prompts are region-agnostic; VDRP's regional concept retrieval operates at a finer granularity.
- vs. CoOp/CoCoOp: This work extends prompt learning from classification tasks to the multi-region setting of HOI detection.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combined design of variance injection into prompts and region-aware concept retrieval is original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Full coverage of four zero-shot settings with thorough ablations, though cross-dataset validation is absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is analyzed with quantitative clarity and method diagrams are intuitive.
- Value: ⭐⭐⭐⭐ Meaningfully advances zero-shot HOI detection; the variance injection idea is transferable to other visual tasks.