SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection

Conference: CVPR 2025
arXiv: 2503.00414
Code: None (The paper states it will be released on GitHub)
Area: AIGC Detection
Keywords: Open-Vocabulary HOI Detection, Multi-granular Feature Alignment, Hierarchical Group Comparison, CLIP Adaptation, LLM-Assisted Classification

TL;DR

This paper proposes the Stratified Granular Comparison Network (SGC-Net). A Granularity Sensing Alignment (GSA) module aggregates multi-layer CLIP visual features, and a Hierarchical Group Comparison (HGC) module uses an LLM to recursively generate discriminative descriptions. Together they address insufficient feature granularity and semantic confusion in open-vocabulary HOI detection.

Background & Motivation

Open-Vocabulary HOI Detection (OV-HOI) requires recognizing unseen interaction categories while training only on base interaction categories. Existing CLIP-based methods face two core challenges:

  1. Insufficient Feature Granularity: Existing methods mainly rely on the final layer of CLIP visual features for text alignment. However, the final layer focuses on high-level semantics while neglecting local details (such as arm poses and facial expressions) captured by intermediate layers, which are critical for HOI detection.

  2. Semantic Similarity Confusion: CLIP is trained on large-scale long-tailed data, leading to bias toward certain categories (e.g., confusing "hug cat" and "hold cat"). Descriptions generated by LLMs solely based on labels also struggle to sufficiently distinguish semantically similar interaction categories (e.g., both "hold cat" and "chase cat" might be described with "arms extended").

Although the existing method CMD-SE attempts to utilize intermediate layer features, its loss function requires minimizing the discrepancy between continuous and discrete variables, which is difficult to optimize.

Method

Overall Architecture

SGC-Net is an end-to-end OV-HOI detection network that does not require a pre-trained object detector. It contains two core modules: (1) the Granularity Sensing Alignment (GSA) module, which partitions the CLIP visual encoder into blocks and aggregates multi-granular features using distance-aware Gaussian weights; (2) the Hierarchical Group Comparison (HGC) module, which uses an LLM to recursively construct a category hierarchy and compares HOI representations with text embeddings at each level.

Key Design 1: Granularity Sensing Alignment (GSA) Module

  • Function: Effectively aggregate local details and global semantics from multi-layer CLIP visual features.
  • Mechanism: Divide the 12 layers of CLIP's visual encoder into \(S\) blocks (e.g., {6-8}, {9-11}, {12}). Layer features within each block are fused using distance-aware Gaussian weights (with trainable \(\sigma\)): \(\alpha_l^s = \exp(-\frac{(d-l)^2}{2\sigma^2})\). Block-level features are also aggregated via weighted summation, where the final layer is treated as an independent block with a larger weight to preserve pre-trained vision-language alignment. Meanwhile, visual prompt tuning is used to introduce learnable tokens to facilitate alignment between intermediate layers and text.
  • Design Motivation: Directly aggregating shallow and deep features disrupts CLIP's pre-trained vision-language alignment. The block partitioning strategy ensures small feature variances within each block for safe fusion, while Gaussian weights allow adaptive learning of hierarchical information. This is easier to optimize than CMD-SE and retains pre-trained alignment.
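The Gaussian-weighted fusion described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the text does not define \(d\), so the sketch assumes it is a reference layer index and uses the last layer of each block; the normalization by the weight total, the fixed `sigma` (trainable in the paper), the block weights `[0.25, 0.25, 0.5]`, and all function names are likewise assumptions.

```python
import math

def gaussian_layer_weights(layers, ref_layer, sigma):
    """Return exp(-(d - l)^2 / (2 * sigma^2)) for every layer index l in a block."""
    return [math.exp(-((ref_layer - l) ** 2) / (2.0 * sigma ** 2)) for l in layers]

def fuse_block(features, layers, ref_layer, sigma):
    """Fuse per-layer feature vectors with Gaussian weights.

    Dividing by the weight total is an assumption made here so that the
    fused vector stays on the same scale as the inputs.
    """
    weights = gaussian_layer_weights(layers, ref_layer, sigma)
    total = sum(weights)
    dim = len(features[0])
    return [sum(w * f[j] for w, f in zip(weights, features)) / total
            for j in range(dim)]

def aggregate_blocks(block_features, block_weights):
    """Weighted sum of block-level feature vectors."""
    dim = len(block_features[0])
    return [sum(w * f[j] for w, f in zip(block_weights, block_features))
            for j in range(dim)]

# Partition reported in the text; the last layer forms its own block.
blocks = [[6, 7, 8], [9, 10, 11], [12]]

# Toy stand-in for CLIP features: layer l yields the 2-d vector [l, 1.0].
fused_blocks = [
    fuse_block([[float(l), 1.0] for l in block], block,
               ref_layer=block[-1], sigma=1.0)
    for block in blocks
]

# Hypothetical block weights; the final block gets the largest one.
final_feature = aggregate_blocks(fused_blocks, [0.25, 0.25, 0.5])
print(fused_blocks)
print(final_feature)
```

With a small `sigma`, each block is dominated by the layer closest to the reference layer; a larger `sigma` spreads the weight more evenly across the block.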

Key Design 2: Hierarchical Group Comparison (HGC) Module

  • Function: Recursively generate discriminative text descriptions to resolve confusion among semantically similar categories.
  • Mechanism: A three-step process: (a) Grouping: use K-means to cluster CLIP text features of initial descriptions generated by the LLM; (b) Comparison: for large groups, summarize group characteristics using the LLM to generate comparative descriptions; for small groups, query the LLM directly for inter-category comparison; (c) Hierarchical Classification: traverse the category hierarchy from top to bottom, compare HOI features with text embeddings at each level, and use an iterative evaluator \(u_i^k = \mathbb{I}(p_i^{k+1} > p_i^k + \tau)\) to filter out unreliable low-level descriptions.
  • Design Motivation: With many categories, the pairwise comparison description matrix grows quadratically. The grouping strategy controls this complexity while maintaining discriminative power, and recursive comparison refines the decision boundary step by step from coarse to fine.
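To make the quadratic-growth argument concrete, the following sketch counts descriptions under a deliberately simplified model. It assumes equal-sized groups, one summary description per group, and pairwise comparisons only inside each group; the paper's actual accounting and its recursion over large groups may differ. `num_groups = 20` is a hypothetical choice of K, and 600 is the number of HOI categories in HICO-DET.

```python
def pairwise_count(n):
    """Number of unordered category pairs among n categories."""
    return n * (n - 1) // 2

def grouped_count(n, num_groups):
    """Descriptions needed with one level of grouping.

    Assumes equal-sized groups, one summary description per group, and
    pairwise comparisons only inside each group (a simplification).
    """
    group_size = n // num_groups
    return num_groups + num_groups * pairwise_count(group_size)

n_categories = 600   # HOI categories in HICO-DET
num_groups = 20      # hypothetical K for K-means

print(pairwise_count(n_categories))             # 179700
print(grouped_count(n_categories, num_groups))  # 8720
```

Even a single level of grouping cuts the count by roughly a factor of twenty in this toy setting, and recursing into large groups would reduce it further.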

Key Design 3: Iterative Evaluator and Fusion Strategy

  • Function: Adaptively select the most informative parts of hierarchical descriptions.
  • Mechanism: The evaluator monitors the sequence of scores across hierarchy levels and computes a running average \(r(\boldsymbol{x}, i)\) over its monotonically increasing portion. This average is then fused with the top-level score as \(s(\boldsymbol{x}, i) = (1-\lambda)(p_i^1 + \boldsymbol{t} \cdot \boldsymbol{x}^T) + \lambda \cdot r(\boldsymbol{x}, i)\), where \(\lambda\) balances the two terms.
  • Design Motivation: Unreliable low-level descriptions introduce errors and redundancy. The automatic evaluation and filtering mechanism ensures that only truly discriminative descriptions are utilized.
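The evaluator and fusion formulas can be sketched as follows. This is one possible reading, stated under assumptions: the sketch stops at the first level where the indicator \(u_i^k\) is 0 and averages only the scores kept up to that point, and it treats \(\boldsymbol{t} \cdot \boldsymbol{x}^T\) as a precomputed scalar similarity (`label_sim`). The function names and the example values of `tau` and `lam` are hypothetical.

```python
def running_average(level_scores, tau):
    """Average the scores of hierarchy levels that keep improving.

    A deeper level k+1 is kept only if p^{k+1} > p^k + tau, i.e. u^k = 1.
    Stopping at the first failure is an assumption of this sketch.
    """
    kept = [level_scores[0]]
    for prev, nxt in zip(level_scores, level_scores[1:]):
        if nxt > prev + tau:
            kept.append(nxt)
        else:
            break
    return sum(kept) / len(kept)

def fused_score(level_scores, label_sim, lam, tau):
    """s = (1 - lam) * (p^1 + t.x^T) + lam * r, with t.x^T passed as label_sim."""
    r = running_average(level_scores, tau)
    return (1.0 - lam) * (level_scores[0] + label_sim) + lam * r

# Toy scores for one category across four hierarchy levels.
scores = [0.5, 0.7, 0.72, 0.9]
r = running_average(scores, tau=0.05)
s = fused_score(scores, label_sim=0.3, lam=0.5, tau=0.05)
print(r, s)
```

In this toy example the third level improves by only 0.02, which is below `tau`, so it and every deeper level are discarded even though the fourth score is high.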

Loss & Training

The total loss is \(\mathcal{L} = \lambda_b \sum_{i \in \{h,o\}} \mathcal{L}_b^i + \lambda_{iou} \sum_{i \in \{h,o\}} \mathcal{L}_{iou}^i + \lambda_{cls} \mathcal{L}_{cls}\). It combines the human/object bounding box regression loss, the IoU loss, and the interaction classification loss, with the Hungarian algorithm used for label matching. The weights are \(\lambda_b=5\), \(\lambda_{iou}=5\), and \(\lambda_{cls}=2\).
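As a quick illustration of how the weights combine, the sketch below evaluates the weighted sum for made-up per-term loss values. The function name, the dictionary layout, and the numbers are hypothetical; the Hungarian matching step that pairs predictions with ground truth is assumed to have already happened and is not shown.

```python
def total_loss(box_losses, iou_losses, cls_loss,
               lam_b=5.0, lam_iou=5.0, lam_cls=2.0):
    """Weighted sum of box regression, IoU, and classification losses.

    box_losses and iou_losses map "h" (human) and "o" (object) to scalars.
    """
    box_term = sum(box_losses[k] for k in ("h", "o"))
    iou_term = sum(iou_losses[k] for k in ("h", "o"))
    return lam_b * box_term + lam_iou * iou_term + lam_cls * cls_loss

# Made-up loss values for one matched prediction.
loss = total_loss(
    box_losses={"h": 0.1, "o": 0.2},
    iou_losses={"h": 0.3, "o": 0.1},
    cls_loss=0.5,
)
print(loss)  # 5 * 0.3 + 5 * 0.4 + 2 * 0.5, approximately 4.5
```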

Key Experimental Results

Main Results: HICO-DET Dataset (Without Pre-trained Detectors)

| Method | Pre-trained Detector | Unseen | Seen | Full |
|---|---|---|---|---|
| THID | No | 15.53 | 24.32 | 22.38 |
| CMD-SE | No | 16.70 | 23.95 | 22.35 |
| SGC-Net | No | 23.27 | 28.34 | 27.22 |
| HOICLIP | Yes | 23.48 | 34.47 | 32.26 |

Main Results: SWIG-HOI Dataset

| Method | Non-rare | Rare | Unseen | Full |
|---|---|---|---|---|
| CMD-SE | 21.46 | 14.64 | 10.70 | 15.26 |
| SGC-Net | 23.67 | 16.55 | 12.46 | 17.20 |

Ablation Study

| Configuration | Non-rare | Rare | Unseen | Full |
|---|---|---|---|---|
| Base | 15.69 | 11.53 | 7.32 | 11.45 |
| + GSA | 22.74 | 16.00 | 11.64 | 16.49 |
| + HGC | 21.18 | 14.19 | 10.69 | 14.81 |
| SGC-Net | 23.67 | 16.55 | 12.46 | 17.20 |

Key Findings

  • The GSA module contributes the most (+5.04 Full mAP over Base, versus +3.36 for HGC), demonstrating that multi-granular feature aggregation is crucial for OV-HOI.
  • Without a pre-trained detector, SGC-Net nearly matches the detector-based HOICLIP on Unseen categories (23.27 vs. 23.48), although HOICLIP remains ahead on Seen and Full.
  • The optimal partitioning strategy is {6-8}, {9-11}, {12}, where the final layer is kept as a separate block to preserve CLIP's pre-trained alignment.
  • Using three blocks significantly outperforms using one or two blocks (Full: 17.20 vs 14.81/14.78).
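The gains quoted above follow directly from the ablation table. The short script below recomputes them from the table values; the dictionary layout and the helper name are only for illustration.

```python
# Values copied from the ablation table.
ablation = {
    "Base":    {"Non-rare": 15.69, "Rare": 11.53, "Unseen": 7.32,  "Full": 11.45},
    "+ GSA":   {"Non-rare": 22.74, "Rare": 16.00, "Unseen": 11.64, "Full": 16.49},
    "+ HGC":   {"Non-rare": 21.18, "Rare": 14.19, "Unseen": 10.69, "Full": 14.81},
    "SGC-Net": {"Non-rare": 23.67, "Rare": 16.55, "Unseen": 12.46, "Full": 17.20},
}

def gain_over_base(config, metric="Full"):
    """Improvement of a configuration over Base, rounded to two decimals."""
    return round(ablation[config][metric] - ablation["Base"][metric], 2)

gains = {name: gain_over_base(name) for name in ("+ GSA", "+ HGC", "SGC-Net")}
print(gains)  # {'+ GSA': 5.04, '+ HGC': 3.36, 'SGC-Net': 5.75}
```

The individual gains sum to 8.40, more than the 5.75 achieved by the full model, so the two modules appear to be partly overlapping rather than strictly additive.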

Highlights & Insights

  1. Elegant Balance Between Multi-Granularity and Alignment: The partitioning + Gaussian weight strategy utilizes intermediate layer details while preserving CLIP's alignment, which is simpler and more effective than the CMD-SE approach.
  2. Recursive Comparison Strategy via LLM: Through a three-step process of grouping, comparison, and hierarchical classification, the method avoids generating \(O(n^2)\) pairwise descriptions and keeps the number of LLM queries manageable.
  3. Adaptive Filtering with Iterative Evaluators: The evaluator automatically identifies reliable hierarchical descriptions and uses only those, which prevents noise propagation.

Limitations & Future Work

  • Absolute performance on SWIG-HOI remains relatively low (Full is only 17.20), suggesting that large-vocabulary OV-HOI remains a challenge.
  • The quality of descriptions generated by the LLM is bounded by prompt engineering, and different LLMs may yield varying outcomes.
  • The recursive depth of hierarchical classification is limited, as exceeding a certain depth may introduce noise.
  • The idea of multi-granular feature aggregation can be generalized to other vision tasks requiring CLIP adaptation.
  • The LLM-assisted class comparison strategy provides valuable insights for fine-grained classification tasks.

Rating

⭐⭐⭐⭐ Both modules address clearly stated problems with well-motivated designs, and the network achieves competitive performance without a pre-trained detector. The ablation experiments validate the contribution of each component.