HOI-IDiff: An Image-like Diffusion Method for Human-Object Interaction Detection

Conference: CVPR 2025
Institution: Lancaster University
Keywords: Human-Object Interaction Detection, Diffusion Models, Multinomial Diffusion, HOI Image

Background & Motivation

Human-Object Interaction (HOI) detection is a core task in scene understanding that aims to detect <human, object, interaction> triplets in images. For example, "a person riding a bicycle" requires identifying the human, the bicycle, and the interaction relationship "riding".

Traditional HOI detection methods (such as QPIC, CDN) are typically based on Transformer decoders, utilizing a set of learnable queries to predict HOI triplets. The issue with this approach is that the number of queries is fixed, making it difficult to handle scenes with highly variable numbers of interactions; moreover, one-step prediction lacks the capability for iterative optimization.

Recently, diffusion models have demonstrated outstanding iterative denoising capabilities in generative tasks. Can diffusion models be applied to HOI detection? Existing attempts (such as DiffHOI) perform diffusion directly on bounding box coordinates, but achieve limited effectiveness because:

1. The core of HOI lies not in precise box coordinates, but in who is doing what with whom.
2. Continuous Gaussian diffusion is unnatural for categorical outputs (such as interaction types).
3. There is no mechanism for using detector priors.

The core innovation of HOI-IDiff is: re-encoding HOI triplets into an "image" and then applying a specially designed multinomial diffusion process on this image.

Method

Core Innovation 1: HOI Image Construction

Encode all HOI relationships in each scene into an \(H \times W \times 2\) probability image:

\[I_{\text{HOI}}[h, w, :] = v_{\text{obj}}(h) \otimes m_{\text{int}}(w)\]

where:

- \(H\) = number of human-object pairs in the scene
- \(W\) = number of interaction categories (117 classes for HICO-DET)
- Channel 0: object class probability \(v_{\text{obj}} \in \Delta^{|\mathcal{O}|}\) (a probability distribution on the simplex)
- Channel 1: interaction type probability \(m_{\text{int}} \in \{0, 1\}^{|\mathcal{A}|}\) (a multi-label binary indicator)

Intuitive Understanding: Each row of the HOI Image corresponds to a human-object pair, and each column corresponds to an interaction type. The pixel value represents the probability of that interaction occurring. This representation transforms structural prediction into an image generation problem.
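A minimal sketch of how such an image might be assembled, in plain Python. The layout is an assumption: the text above does not say how the object distribution is reduced to one value per pixel, so this sketch repeats the probability of the pair's most likely object class across the row in channel 0 and stores the binary interaction indicator in channel 1. `build_hoi_image` and its arguments are hypothetical names, not the authors' code.

```python
def build_hoi_image(object_probs, interaction_labels, num_interactions):
    """Assemble an H x W x 2 nested list from per-pair predictions.

    object_probs: list of H probability vectors over object classes.
    interaction_labels: list of H sets of active interaction indices.
    Assumed layout (not confirmed by the source): channel 0 is the top
    object-class probability of the pair, channel 1 is a 0/1 indicator.
    """
    image = []
    for probs, active in zip(object_probs, interaction_labels):
        obj_conf = max(probs)  # scalar summary of v_obj for this pair
        row = [[obj_conf, 1.0 if w in active else 0.0]
               for w in range(num_interactions)]
        image.append(row)
    return image


# Two human-object pairs, three object classes, four interaction types.
hoi = build_hoi_image(
    object_probs=[[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
    interaction_labels=[{0, 2}, {3}],
    num_interactions=4,
)
print(len(hoi), len(hoi[0]), len(hoi[0][0]))  # H, W, channels
```

Each row of `hoi` is one human-object pair and each column one interaction type, matching the row/column reading described above.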

Core Innovation 2: Multinomial Diffusion

Standard Gaussian diffusion adds Gaussian noise to continuous data, but the entries of the HOI Image are probabilities that must stay non-negative and sum to 1, and Gaussian noise would violate these constraints.

The forward process of Multinomial Diffusion:

\[q(x_t \mid x_{t-1}) = \text{Cat}\left(x_t;\ (1 - \beta_t)\, x_{t-1} + \beta_t / K\right)\]

where \(K\) is the number of categories. Key differences from Gaussian diffusion:

- The coefficient is \((1-\beta_t)\) instead of \(\sqrt{1-\beta_t}\).
- The noise term is the uniform distribution \(1/K\) instead of a Gaussian distribution.
- The probabilities always sum to 1.

Property	Gaussian Diffusion	Multinomial Diffusion
Data type	Continuous value	Probability distribution
Noise type	Gaussian \(\mathcal{N}(0,1)\)	Uniform \(1/K\)
Forward coefficient	\(\sqrt{1-\beta_t}\)	\((1-\beta_t)\)
Probability constraint	None	Always satisfies \(\sum=1\)
Final state	\(\mathcal{N}(0,I)\)	Uniform distribution
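
The forward transition can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the paper's implementation: the function names and the three-step `betas` schedule are made up for the example. The closed-form marginal used in `marginal_probs` is not stated above; it follows by induction from the one-step formula, with the cumulative product of \((1-\beta_s)\) as the weight on \(x_0\).

```python
import random


def forward_probs(x_prev, beta, num_classes):
    """Class probabilities of q(x_t | x_{t-1}): mix x_{t-1} with uniform."""
    return [(1.0 - beta) * p + beta / num_classes for p in x_prev]


def sample_one_hot(probs, rng):
    """Draw one category from probs and return it as a one-hot list."""
    k = rng.choices(range(len(probs)), weights=probs)[0]
    return [1.0 if i == k else 0.0 for i in range(len(probs))]


def marginal_probs(x0, betas, num_classes):
    """Closed-form probabilities of q(x_t | x_0) after all steps in betas."""
    keep = 1.0
    for beta in betas:
        keep *= 1.0 - beta  # cumulative product of (1 - beta_s)
    return [keep * p + (1.0 - keep) / num_classes for p in x0]


K = 4
x0 = [0.0, 1.0, 0.0, 0.0]   # clean one-hot label
betas = [0.1, 0.2, 0.3]     # hypothetical noise schedule

# Iterating the one-step probabilities reproduces the closed form.
p = x0
for beta in betas:
    p = forward_probs(p, beta, K)
closed = marginal_probs(x0, betas, K)

# A single noisy sample after the first step.
rng = random.Random(0)
x1 = sample_one_hot(forward_probs(x0, betas[0], K), rng)
```

Because every step is a convex combination of a probability vector and the uniform distribution, `p` stays on the simplex at every step, and it approaches the uniform distribution as the cumulative product shrinks.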

Core Innovation 3: Slice Patchification

Traditional ViTs partition images into local patches (such as 16×16), but the semantic structure of the HOI Image is different—each row represents complete human-object pair information, and each column represents complete interaction type information. Local patches would break this row-column semantics.

Slice Patchification proposes slice-based partitioning:

- Horizontal Slices: \(H\) row vectors of width \(W\) (each slice is a complete human-object pair)
- Vertical Slices: \(W\) column vectors of height \(H\) (each slice is a complete interaction type)

The two sets of slices are processed by Transformers separately and then fused. This guarantees global dependencies within rows and within columns, while establishing relations between rows and columns through cross-attention.
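
The tokenization step alone can be sketched as follows. This covers only how the two slice sets might be cut from an \(H \times W \times 2\) image stored as nested lists; the Transformer branches and their fusion are not shown, and the function names are hypothetical.

```python
def horizontal_slices(image):
    """One token per row: flatten the W x C pixels of each human-object pair."""
    return [[c for pixel in row for c in pixel] for row in image]


def vertical_slices(image):
    """One token per column: flatten the H x C pixels of each interaction type."""
    height, width = len(image), len(image[0])
    return [[c for h in range(height) for c in image[h][w]]
            for w in range(width)]


# Toy H=2, W=3, C=2 image whose values encode (row, column, channel).
image = [[[100 * h + 10 * w + c for c in range(2)] for w in range(3)]
         for h in range(2)]
rows = horizontal_slices(image)   # H tokens of length W * C
cols = vertical_slices(image)     # W tokens of length H * C
```

Every token here contains a whole row or a whole column, so no human-object pair or interaction type is split across tokens.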

Core Innovation 4: Detector Prior Initialization

Standard diffusion starts denoising from pure noise, but HOI detection can leverage the output of an object detector (such as DETR) as a prior:

\[x_T = (1 - \alpha) \cdot \text{Uniform} + \alpha \cdot \text{DetectorPrior}\]

The detector prior provides initial human-object pairing guesses, significantly reducing the number of denoising steps.
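
A small sketch of this initialization, assuming the detector prior is given as one normalized probability vector per row (the text does not specify the axis of normalization) and that \(\alpha\) lies in \([0, 1]\). The prior values and `alpha=0.5` are illustrative, and `init_from_prior` is a hypothetical name.

```python
def init_from_prior(detector_prior, alpha):
    """Mix each row of a detector prior with the uniform distribution.

    detector_prior: list of probability vectors (each assumed to sum to 1).
    alpha: weight on the prior; alpha=0 gives the usual uniform start.
    """
    x_T = []
    for probs in detector_prior:
        uniform = 1.0 / len(probs)
        x_T.append([(1.0 - alpha) * uniform + alpha * p for p in probs])
    return x_T


prior = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]   # hypothetical detector scores
x_T = init_from_prior(prior, alpha=0.5)       # alpha value is illustrative
```

Since both terms are probability vectors and the weights sum to 1, each row of `x_T` is again a valid distribution, which keeps the starting point compatible with the multinomial process.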

Experimental Results

HICO-DET

Method	Full mAP	Rare mAP	Non-Rare mAP
QPIC	29.07	21.85	31.23
CDN	32.07	27.19	33.53
GEN-VLKT	33.75	29.25	35.10
HOICLIP	34.69	31.12	35.74
Standard diffusion baseline	42.50	40.12	43.21
HOI-IDiff	47.71	48.36	47.52

V-COCO

Method	Scenario 1 (role AP)	Scenario 2 (role AP)
QPIC	58.8	61.0
CDN	63.9	65.9
HOICLIP	66.2	68.5
HOI-IDiff	73.4	76.1

Ablation Study

Configuration	HICO-DET Full mAP
Standard Gaussian diffusion	42.50
+ Multinomial diffusion	44.23
+ Slice Patchification	45.89
+ Detector prior	46.84
+ All optimizations	47.71

The step-by-step improvement from 42.50 to 47.71 mAP (+5.21 in total, with each row adding between 0.87 and 1.73 over the previous one) supports the contribution of each component.

Method Analysis

Why is Slice Patchification effective?

Traditional patches disrupt the row-column semantic structure of the HOI Image. For example, a 16×16 patch covers only 16 interaction types for each of 16 human-object pairs, so it represents neither a complete human-object pair nor a complete interaction type. Slices keep each semantic unit intact.

Why is multinomial diffusion better than Gaussian diffusion?

The pixels of the HOI Image are probability distributions; Gaussian noise would produce negative and unnormalized values, requiring extra normalization steps. Multinomial diffusion maintains the probability constraints throughout the entire process, generating intermediate results that are all valid probability distributions.
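
The contrast can be checked numerically with a toy example. The sketch below applies one Gaussian-style step and one multinomial-style step to the same one-hot vector over 117 classes; the noise level `beta = 0.5` and the seed are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(0)
K = 117                                   # HICO-DET interaction classes
x0 = [1.0 if k == 5 else 0.0 for k in range(K)]
beta = 0.5                                # illustrative noise level

# Gaussian step: sqrt(1 - beta) * x + sqrt(beta) * eps, with eps ~ N(0, 1).
gaussian = [math.sqrt(1.0 - beta) * p + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for p in x0]

# Multinomial step: convex mix of x with the uniform distribution.
multinomial = [(1.0 - beta) * p + beta / K for p in x0]

print("gaussian    min=%.3f sum=%.3f" % (min(gaussian), sum(gaussian)))
print("multinomial min=%.3f sum=%.3f" % (min(multinomial), sum(multinomial)))
```

The Gaussian result contains negative entries and no longer sums to 1, so it would need clipping and renormalization before it could be read as a distribution, while the multinomial result needs no correction.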

Limitations & Future Work

- The size of the HOI Image varies with the number of human-object pairs in the scene, which requires padding for batch processing.
- The number of denoising steps in multinomial diffusion is still relatively large (typically 100 steps).
- Efficiency in dense interaction scenes (>50 human-object pairs) still needs to be improved.
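
The padding mentioned in the first limitation could look like the sketch below. It is one common way to batch variable-height inputs, not a description of the authors' pipeline: images are padded along \(H\) to the tallest scene in the batch, and a boolean mask records which rows are real. `pad_batch` and the zero pad value are assumptions.

```python
def pad_batch(images, pad_value=0.0):
    """Pad HOI images along H to a common height and return a validity mask.

    images: list of H_i x W x C nested lists sharing the same W and C.
    Returns (padded, mask) where mask[b][h] is True for real rows.
    """
    max_h = max(len(img) for img in images)
    width, channels = len(images[0][0]), len(images[0][0][0])
    padded, mask = [], []
    for img in images:
        extra = max_h - len(img)
        pad_rows = [[[pad_value] * channels for _ in range(width)]
                    for _ in range(extra)]
        padded.append(img + pad_rows)  # new list; the input is not mutated
        mask.append([True] * len(img) + [False] * extra)
    return padded, mask


# Scenes with 1 and 3 human-object pairs, W=2 interaction types, C=2.
small = [[[0.9, 1.0], [0.9, 0.0]]]
large = [[[0.5, 0.0], [0.5, 1.0]] for _ in range(3)]
batch, mask = pad_batch([small, large])
```

The mask would then be used to exclude padded rows from attention and from the loss.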

Summary

HOI-IDiff brings the iterative refinement of diffusion models to HOI detection by recasting the task as a "probability image generation" problem. Three components built on this representation (multinomial diffusion, Slice Patchification, and the detector prior) work together to reach a new state of the art on both HICO-DET and V-COCO. The idea of turning structured prediction into image generation may also inspire work on other structured prediction tasks.