LINK: Learning Instance-level Knowledge from Vision-Language Models for Human-Object Interaction Detection

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=CTdweIFocz
Code: To be confirmed
Area: Human Understanding / Human-Object Interaction Detection
Keywords: HOI Detection, Vision-Language Models, Knowledge Distillation, Zero-shot, Open-vocabulary, Geometric Encoding

TL;DR

LINK is a plug-and-play, two-stage HOI detection framework that pairs a geometric encoder with a VLM linking decoder and trains them with a progressive teacher-student strategy. By converting sparse HOI annotations into dense supervision over all human-object pairs, it reports state-of-the-art results across fully-supervised, zero-shot, and open-vocabulary settings.

Background & Motivation

Background: Human-Object Interaction (HOI) detection aims to parse images into <human, action, object> triplets, serving as a fundamental task for robotics and abnormal behavior analysis. Recent works have integrated pre-trained Vision-Language Models (VLMs) like CLIP into HOI detection, leveraging their strong image-text alignment to improve the recognition of rare or unseen interactions, thereby advancing zero-shot and few-shot HOI detection.

Limitations of Prior Work: The authors point out two long-standing tensions. The first is between specialization and generalization: specialized architectures perform strongly on fully-supervised benchmarks but degrade sharply in zero-shot or cross-domain scenarios, whereas zero-shot methods, often lightweight modifications of CLIP, recognize new categories well but cannot compete with specialized models in fully-supervised settings. The second is sparse supervision: humans and objects form a dense interaction graph in an image, yet the ground truth (GT) labels only a few edges (positive pairs), so many "valid but unlabeled positive pairs" and "informative negative samples" go unused.

Key Challenge: VLMs are pre-trained on image-level pairs providing global semantic representations, whereas HOI requires instance-level, fine-grained spatial and semantic discrimination. Transferring the former to the latter under sparse supervision is the fundamental difficulty in adapting VLMs for HOI.

Goal: To build a unified two-stage HOI detector that matches specialized models on standard benchmarks while retaining generalization in zero-shot and open-vocabulary settings.

Core Idea: [Architecture Decoupling] Making interaction queries dependent only on VLM features and detection boxes, decoupled from specific detectors to allow plug-and-play functionality with any object detector. [Dense Supervision] Using teacher-student distillation to expand supervision from a few matched pairs to all candidate human-object pairs, enabling the model to learn robust and transferable HOI representations by contrasting subtle spatial and semantic differences between positive and negative instances.

Method

Overall Architecture

LINK follows a two-stage pipeline: first, an off-the-shelf detector (DETR / H-Deformable-DETR) generates boxes; then, interaction reasoning is performed on each human-object pair. The architecture consists of a Human-Object Geometric Encoder, which injects spatial awareness and constructs pairwise queries, and a VLM Linking Decoder, which aggregates VLM feature maps via dual-path cross-attention (a spatial branch and a semantic branch). Training follows a Progressive Learning Strategy: a teacher is trained first and then provides multi-level dense distillation to the student over all human-object pairs. Crucially, query features are derived solely from RoI-Aligned VLM feature maps and boxes, independent of detector-specific queries.

flowchart TD
    A[Input Image] --> B[Object Detector<br/>DETR/H-Def-DETR Boxes Bh,Bo]
    A --> C[VLM Vision Encoder<br/>Feature Map F]
    C --> D[ROI Align<br/>Unary Queries Qh,Qo]
    B --> D
    D --> E[HO Geometric Encoder<br/>+PE+Pairwise Geometry<br/>→ Pairwise Queries Qh-o]
    C --> F[VLM Linking Decoder<br/>Spatial Branch Latent + Semantic Branch Native]
    E --> F
    F --> G[CLIP Text Initialized FFN<br/>→ HOI Logits]
    H[Pre-trained Teacher Isomorphic Network] -. Multi-level KD<br/>map/query/logits .-> F

Key Designs

1. Human-Object Geometric Encoder: Providing spatial awareness to "semantics-only" VLMs. CLIP-like VLMs are pre-trained with image-level contrastive objectives, excelling in global semantics but lacking regional spatial discrimination. The authors supplement each box with positional information: boxes \(B=(x_1,y_1,x_2,y_2)\) are normalized and their centers \(C\) and sizes \(S\) are used for 2D sine positional encoding \(PE(B)=PE(C)\oplus PE(S)\), which is added to unary queries and refined via self-attention \(Q=\text{Self-Attn}(Q+PE(B))\). Pairwise queries are then formed by enumerating combinations \(Q_{h\text{-}o}=\text{Linear}(\mathcal{C}[Q_i,Q_j]),\ i\in H,\ j\in O\cup H\) (including human-human interactions). Following UPT, pairwise spatial relation vectors \(R_{i,j}\) (IoU, directional vectors, relative sizes) are encoded and fused with semantic queries. This decouples the queries from detector features, avoiding detector bias and explicitly injecting spatial dependencies.
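The box encoding and pair enumeration described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the number of frequencies, the temperature, the exclusion of self-pairs (i = j), and the helper names `sine_encode`, `box_pe`, `iou`, and `enumerate_pairs` are all assumptions made for illustration. Only IoU is shown from the pairwise relation vector \(R_{i,j}\).

```python
import math

def sine_encode(value, num_freqs=4, temperature=10000.0):
    """Encode a scalar in [0, 1] as [sin, cos] pairs at num_freqs frequencies."""
    feats = []
    for k in range(num_freqs):
        angle = 2.0 * math.pi * value / (temperature ** (k / num_freqs))
        feats += [math.sin(angle), math.cos(angle)]
    return feats

def box_pe(box, img_w, img_h, num_freqs=4):
    """PE(B) = PE(C) concatenated with PE(S) for a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    x1, x2 = x1 / img_w, x2 / img_w   # normalize coordinates to [0, 1]
    y1, y2 = y1 / img_h, y2 / img_h
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    size = (x2 - x1, y2 - y1)
    pe_c = [f for v in center for f in sine_encode(v, num_freqs)]
    pe_s = [f for v in size for f in sine_encode(v, num_freqs)]
    return pe_c + pe_s                # list concatenation plays the role of (+)

def iou(a, b):
    """Intersection over union of two boxes, one entry of the relation vector."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def enumerate_pairs(labels, human_label=0):
    """All (i, j) with i a human and j any other detection (object or human).

    Skipping i == j is an assumption of this sketch.
    """
    humans = [i for i, lab in enumerate(labels) if lab == human_label]
    return [(i, j) for i in humans for j in range(len(labels)) if i != j]

# Toy detections: two humans (label 0) and one object (hypothetical label 17).
boxes = [(0, 0, 100, 200), (50, 100, 150, 200), (300, 0, 400, 200)]
labels = [0, 17, 0]
pairs = enumerate_pairs(labels)
pe = box_pe(boxes[0], img_w=400, img_h=200)
```

With four frequencies, each of the four scalars (center x, center y, width, height) contributes eight values, so `pe` has 32 entries. Because humans are also allowed as the second element, the pair list includes human-human candidates.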

2. VLM Linking Decoder: Dual branches for fine-grained reasoning and transferability. Instead of simple cross-attention, the authors split the process. The spatial branch uses a connector to reduce feature map dimensionality into a latent bottleneck \(F^l=\text{MLP}(F)\), forcing queries to focus on geometric relations in a compressed space. It employs box-encoding-guided attention (\(CA_{be}\)) to constrain the attention map for fine-grained spatial reasoning. The semantic branch projects queries into the VLM's native high-dimensional space \(Q^n_{h\text{-}o}=\text{Linear}(Q_{h\text{-}o})\) and performs standard cross-attention \(CA\) to aggregate high-level global semantics. The fused outputs are passed through an FFN initialized with CLIP text embeddings, preserving spatial precision while inheriting VLM open-set semantics.
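A toy sketch can make the two branches concrete. It is only an illustration under strong assumptions: a single linear map stands in for the connector MLP, an additive mask over tokens outside a box stands in for the box-encoding-guided attention \(CA_{be}\) (whose exact form is not given here), and the two outputs are fused by concatenation. The names `cross_attend`, `W_down`, `W_up`, and `inside_box` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weight):
    """y = W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def cross_attend(query, tokens, bias=None):
    """Single-head attention of one query over tokens (keys = values = tokens)."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    if bias is not None:              # additive bias on the attention scores
        scores = [s + b for s, b in zip(scores, bias)]
    weights = softmax(scores)
    dim = len(tokens[0])
    out = [sum(w * tok[c] for w, tok in zip(weights, tokens)) for c in range(dim)]
    return out, weights

# Toy VLM feature map F: 3 tokens in a 4-dimensional "native" space.
F = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
W_down = [[1.0, 0.0, 0.0, 0.0],      # assumed connector: native (4) -> latent (2)
          [0.0, 1.0, 0.0, 0.0]]
W_up = [[1.0, 0.0],                  # assumed projection: latent (2) -> native (4)
        [0.0, 1.0],
        [0.0, 0.0],
        [0.0, 0.0]]
q = [0.5, -0.5]                      # one pairwise query in the latent space

# Assumed stand-in for box guidance: mask tokens outside the pair's box.
inside_box = [True, True, False]
bias = [0.0 if inside else -1e9 for inside in inside_box]

F_latent = [linear(tok, W_down) for tok in F]
spatial_out, spatial_w = cross_attend(q, F_latent, bias)   # spatial branch
semantic_out, semantic_w = cross_attend(linear(q, W_up), F)  # semantic branch
fused = spatial_out + semantic_out   # concatenation as an assumed fusion
```

In this toy setup the spatial branch attends only to tokens inside the box and works in the compressed space, while the semantic branch sees every token at the VLM's native width.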

3. Progressive Teacher-Student Learning: Supplementing sparse GT with dense supervision. In the first stage, a teacher is trained using only GT. In the second stage, the student is trained from scratch, supervised by GT and densely guided by the teacher on all candidate human-object pairs. Since the teacher and student share the same input and architecture, they can be aligned instance-by-instance. Distillation uses KL divergence \(KD_{KL}(f_{stu},f_t)=\text{KL}(\sigma(f_t/\tau)\,\|\,\sigma(f_{stu}/\tau))\) across three levels: Feature map level (\(L^{feat}_{KD}=KD(F_{stu},F''_t)\)), Query level (token-wise alignment across encoder/decoder layers), and Logits level (distilling \(\Psi_s=\log\frac{P}{1+\exp(-O_s)-P}\)). This forces the model to resolve ambiguities by contrasting subtle differences between instances.
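The distillation term can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: it assumes \(\sigma\) is a softmax over the last dimension, uses an arbitrary temperature of 2.0, and averages the loss over all candidate pairs. The function names `kd_kl` and `dense_kd_loss` are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl(student, teacher, tau=2.0):
    """KL( softmax(teacher / tau) || softmax(student / tau) ) for one instance."""
    p = softmax(teacher, tau)   # teacher distribution (target)
    q = softmax(student, tau)   # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def dense_kd_loss(student_rows, teacher_rows, tau=2.0):
    """Average the per-instance KL over every candidate human-object pair.

    Rows are aligned by index, which is possible because teacher and student
    see the same boxes and therefore enumerate the same pairs.
    """
    losses = [kd_kl(s, t, tau) for s, t in zip(student_rows, teacher_rows)]
    return sum(losses) / len(losses)

# Two candidate pairs, three action classes each (toy values).
teacher = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0]]
student = [[1.5, 0.3, -0.5], [0.0, 0.0, 3.0]]
loss = dense_kd_loss(student, teacher)
```

The second pair contributes zero loss because the student already matches the teacher there. The first pair contributes a positive term even though it may carry no GT label, which is the sense in which the supervision becomes dense.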

Key Experimental Results

Main Results (HICO-DET / V-COCO, Fully-supervised)

| Method | Backbone / VLM | HICO-DET Full | HICO-DET Rare | V-COCO AP_role |
|---|---|---|---|---|
| LAIN | R50 / CLIP-B | 36.02 | 35.70 | 65.1 |
| Ours (LINK) | R50 / CLIP-B | 37.43 | 37.18 | 66.5 |
| HOLa | R50 / CLIP-L | 39.05 | 38.66 | 66.0 |
| Ours (LINK) | R50 / CLIP-L | 42.92 | 45.03 | 68.1 |
| BC-HOI | R50 / BLIP-2 | 43.01 | 45.76 | 70.6 |
| Ours (LINK) | R50 / BLIP-2 | 43.72 | 45.82 | 68.5 |
| HORP | Swin-L / CLIP-L | 47.53 | 46.81 | 68.3 |
| Ours (LINK) | Swin-L / CLIP-L | 49.06 | 53.63 | 69.2 |

With R50 + CLIP-L, LINK's HICO-DET Full/Rare scores are +3.87 / +6.37 mAP above HOLa, the previous best in that configuration (relative +9.9% / +16.5%).
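The absolute and relative gains quoted for R50 + CLIP-L can be recomputed from the HOLa and LINK rows of the table. The snippet below is only an arithmetic check; the helper name `gain` is hypothetical.

```python
def gain(new, old):
    """Return (absolute gain in mAP, relative gain in percent)."""
    absolute = new - old
    return round(absolute, 2), round(100.0 * absolute / old, 1)

full_gain = gain(42.92, 39.05)   # LINK vs. HOLa, HICO-DET Full
rare_gain = gain(45.03, 38.66)   # LINK vs. HOLa, HICO-DET Rare
print(full_gain)  # (3.87, 9.9)
print(rare_gain)  # (6.37, 16.5)
```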

Ablation Study (HICO-DET Fully-supervised, Table 6)

| # | Encoder | Decoder + Distillation | Full | Rare | N-Rare |
|---|---|---|---|---|---|
| A1 | Self-Attn | Cross-Attn (baseline) | 36.10 | 33.67 | 36.97 |
| A2 | Self-Attn | VLM-Link | 39.23 | 39.76 | 39.02 |
| A3 | Geometrical | Cross-Attn | 38.30 | 35.46 | 39.31 |
| A4 | Geometrical | VLM-Link | 41.20 | 41.43 | 41.13 |
| A5 | | A4 + Logit-level KD | 41.89 | 43.82 | 41.27 |
| A6 | | A5 + Query-level KD | 42.34 | 43.62 | 41.84 |
| A7 | | A6 + Map-level KD | 42.92 | 45.03 | 42.20 |
| A8 | | A7 + multi-teacher (CLIP+SigLIP) | 43.54 | 45.58 | 42.93 |

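The geometric encoder (A3, +2.20 Full over A1) and the VLM-Link decoder (A2, +3.13 Full over A1) each help on their own, and combining them (A4) gives the best result without distillation. Adding the three distillation levels (A5→A7) raises Full at every step (41.20→42.92) and lifts Rare from 41.43 to 45.03 overall, despite a slight Rare dip at A6 (43.82→43.62). Relative to the A1 baseline, Rare improves from 33.67 to 45.03.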

Key Findings

  • Zero-shot settings (RF-UC / NF-UC / UO / UV): Achieved two best and two second-best results; RF-UC unseen 32.25 surpasses previous SOTA by +1.64.
  • Open-vocabulary SWiG-HOI: Full set 17.97 mAP, +2.71 (relative +17.8%) over previous best, with Rare subset relative gain of +22.1%.
  • Cross-model Universality: Consistent gains across CLIP/BLIP (contrastive), DINOv2 (self-supervised), and SigLIP2/Florence2 (multitask/multimodal), with the largest gains in long-tail HOI (≤10 samples).
  • Few-shot (1→32-shot) results are best-in-class on both HICO-DET and V-COCO.

Highlights & Insights

  • Detector decoupling is the switch for generalization: By not using DETR-specific queries and relying on VLM feature maps + boxes, the model becomes plug-and-play and removes detector bias—the structural reason it excels in both fully-supervised and zero-shot settings.
  • Isomorphic teacher-student alignment addresses sparse supervision: Because teacher and student share the same architecture and inputs, their outputs align instance-to-instance, so distillation can legitimately cover all candidate pairs without the noise that naive pseudo-labeling would introduce.
  • Dual-branch labor division: The spatial branch ensures fine-grained precision (specialization), while the semantic branch preserves VLM global semantics (generalization).
  • Multi-teacher (CLIP+SigLIP) integration shows the framework can act as a container for aggregating knowledge from heterogeneous foundation models.

Limitations & Future Work

  • The two-stage process with multi-level distillation on all pairs involves significant training and memory costs.
  • Inference remains two-stage, so performance is bounded by the quality of the upstream object detector.
  • The geometric encoding follows established designs (UPT/PViC); innovation lies more in the "decoupling + dense distillation" combination.
  • The benefits and conflicts of even larger heterogeneous teacher ensembles have not been fully explored.

Related Work & Insights

  • Two-stage vs. One-stage: LINK follows the two-stage (decoupled) route for flexibility and modularity, contrasting with query-based one-stage models like GEN-VLKT.
  • VLM Adaptation for HOI: While related to HOICLIP and ADA-CM in using CLIP, LINK differs by focusing on architectural decoupling and dense distillation rather than prompt/memory modification.
  • Inspiration: The isomorphic teacher-student dense distillation approach can be transferred to other tasks with sparse graph structures, such as Scene Graph Generation or Relationship Detection.

Rating

  • Novelty: ⭐⭐⭐ — While components like geometric units and KD are known, the combination for "detector decoupling + isomorphic dense distillation" effectively addresses the specialization/generalization trade-off.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers fully-supervised, zero-shot, and open-vocabulary settings across multiple benchmarks and six foundation models; very solid.
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation, systematic methodology, and complete formulas.
  • Value: ⭐⭐⭐⭐ — Plug-and-play nature and significant gains on long-tail/unseen classes provide strong practical value for HOI applications.