CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner

Conference: ECCV 2024
arXiv: 2403.10082
Code: Yes
Area: Video Understanding
Keywords: Skeleton-based Action Recognition, One-Shot Learning, Large Language Model, Cross-Modal Guidance, Global-Local-Global

TL;DR

This paper proposes CrossGLG, a framework that uses LLM-generated text descriptions to guide skeleton feature learning in a "global \(\to\) local \(\to\) global" manner. It outperforms prior methods in one-shot 3D action recognition while its model size is only 2.8% of that of the previous SOTA model.

Background & Motivation

One-shot skeleton action recognition faces two core challenges:

Loss of Local Information: Existing methods rely only on low-level information such as joint positions and fail to attend to crucial local regions.

Weak Generalization Ability: Lacking high-level semantic guidance makes it difficult to generalize to unseen action classes.

Humans can recognize key movement cues and infer the overall action with only a few observations. Inspired by this, the authors propose leveraging human knowledge-rich text descriptions generated by Large Language Models (LLMs) to guide skeleton feature learning. This is the first framework to introduce LLM text information into one-shot skeleton-based action recognition.

Method

Overall Architecture

CrossGLG adopts a dual-branch architecture:

  • Skeleton Encoding Branch (blue): Processes skeleton sequences only; used exclusively during inference.
  • Cross-Modal Guidance Branch (green): Leverages LLM text to guide skeleton feature learning during training.

Both branches share the same classifier, allowing the skeleton branch to implicitly learn the high-level semantics contained in the text during training.

LLM-Generated Knowledge-Rich Action Descriptions

Two prompts are designed to obtain information from ChatGPT:

  1. Global Action Description Prompt: Describes which joints/body parts are most important when performing a certain action (e.g., "wave hand" \(\to\) arms, wrists).
  2. Joint Motion Description Prompt: Generates fine-grained local motion descriptions for each joint (e.g., "right hand: swing rapidly left and right above the head").

Adapting to different datasets is achieved simply by replacing the action name and joint list in the prompts.
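
The template substitution described above can be sketched as follows. The prompt wording, the `build_prompts` helper, and the abbreviated joint list are all hypothetical and only illustrate how the action name and joint list might be swapped in; they are not the paper's exact prompts.

```python
# Hypothetical prompt templates: the wording is illustrative, not the paper's exact prompts.
GLOBAL_PROMPT = (
    "Which of the following joints are most important when a person performs "
    "the action '{action}'? Joints: {joints}."
)
JOINT_PROMPT = (
    "Describe how each of the following joints moves when a person performs "
    "the action '{action}'. Joints: {joints}."
)


def build_prompts(action: str, joints: list[str]) -> dict[str, str]:
    """Fill both templates for one action class and one dataset's joint list."""
    joint_list = ", ".join(joints)
    return {
        "global": GLOBAL_PROMPT.format(action=action, joints=joint_list),
        "joint_motion": JOINT_PROMPT.format(action=action, joints=joint_list),
    }


# Abbreviated joint list for illustration only (NTU RGB+D skeletons have 25 joints).
prompts = build_prompts("wave hand", ["head", "right wrist", "right hand"])
print(prompts["global"])
print(prompts["joint_motion"])
```

Switching datasets then only requires passing a different action name and joint list to the same helper.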

Global-to-Local Guidance

Joint Importance Discriminator (JID):

  1. The first \(N_{pre}\) encoding blocks of the skeleton encoder output \(f_{pre}\), which is pooled along the temporal dimension to obtain the overall motion features of the joints.
  2. The JID (two linear layers + softmax) predicts the importance of each joint \(k_{out}\).
  3. Extract key joint distribution \(k_{gt}\) from the global text (mapping noun phrases extracted using Stanford CoreNLP to joints).
  4. Calibration Loss: \(L_{calibrate} = \text{MSE}(k_{out}, k_{gt})\)
  5. In subsequent encoding blocks, \(k_{out}\) is used to reweight features after spatial interaction.
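
A minimal NumPy sketch of the JID steps above, under several assumptions that the summary does not specify: temporal pooling is a mean, a ReLU sits between the two linear layers, the softmax runs over joints, \(k_{gt}\) spreads uniform mass over the joints named in the text, and reweighting is an element-wise product. Shapes, weights, and the joint indices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, C, H = 16, 25, 64, 32  # frames, joints, channels, hidden size (all illustrative)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def key_joint_distribution(key_joints, num_joints):
    """k_gt: uniform mass over joints named in the global text (an assumed encoding)."""
    k = np.zeros(num_joints)
    k[list(key_joints)] = 1.0
    return k / k.sum()


def jid(f_pre, w1, b1, w2, b2):
    """Predict one importance score per joint from f_pre of shape (T, J, C)."""
    joint_feat = f_pre.mean(axis=0)               # temporal pooling -> (J, C); mean is assumed
    hidden = np.maximum(joint_feat @ w1 + b1, 0)  # first linear layer + ReLU (assumed)
    logits = (hidden @ w2 + b2).squeeze(-1)       # second linear layer -> (J,)
    return softmax(logits)                        # k_out sums to 1 over joints


f_pre = rng.normal(size=(T, J, C))
w1, b1 = rng.normal(size=(C, H)) * 0.1, np.zeros(H)
w2, b2 = rng.normal(size=(H, 1)) * 0.1, np.zeros(1)

k_out = jid(f_pre, w1, b1, w2, b2)
k_gt = key_joint_distribution({10, 11, 23, 24}, J)  # hypothetical arm/wrist joint indices
l_calibrate = float(np.mean((k_out - k_gt) ** 2))   # MSE(k_out, k_gt)

# Reweight per-joint features for later blocks; broadcasts over frames and channels.
f_reweighted = f_pre * k_out[None, :, None]
```

In a real model the weights would be trained jointly with the encoder, so that minimizing `l_calibrate` pulls `k_out` toward the text-derived distribution.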

Local-to-Global Guidance

Cross-Modal Interaction Module (\(M\)-layer interaction blocks):

  1. The text encoder (DeBERTa-V2-XLarge) encodes joint motion descriptions.
  2. Text and skeleton features are projected into a common space.
  3. Each layer of the interaction block performs three non-local interactions:
     • Text-Text Self-Attention: Synthesizes semantic context of other joints for each joint's text features.
     • Text-Guided Skeleton Cross-Attention: Uses text features as queries to guide the integration of skeleton features.
     • Fusion: Text and skeleton features are added and processed by an MLP.
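
The three interactions in each block can be sketched in NumPy as below. This is a simplified illustration, not the authors' implementation: it uses single-head attention without learned query/key/value projections, assumes a residual connection in the self-attention step, assumes a two-layer ReLU MLP, and assumes the fused output feeds the next block as the text stream. Both inputs are taken to be already projected into a common space of width `D`.

```python
import numpy as np

rng = np.random.default_rng(0)
J, D, M = 25, 32, 2  # joints, shared embedding width, number of blocks (illustrative)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def interaction_block(text, skel, w_in, w_out):
    """Text self-attention, text-queried cross-attention, then additive fusion + MLP."""
    text_ctx = text + attention(text, text, text)  # each joint's text sees other joints' text
    skel_ctx = attention(text_ctx, skel, skel)     # queries = text, keys/values = skeleton
    fused = text_ctx + skel_ctx                    # fusion by addition
    return np.maximum(fused @ w_in, 0) @ w_out     # two-layer MLP (assumed form)


text = rng.normal(size=(J, D))  # per-joint text features in the common space
skel = rng.normal(size=(J, D))  # per-joint skeleton features in the common space
weights = [
    (rng.normal(size=(D, 2 * D)) * 0.1, rng.normal(size=(2 * D, D)) * 0.1)
    for _ in range(M)
]

out = text
for w_in, w_out in weights:
    out = interaction_block(out, skel, w_in, w_out)
```

Using text as the query side means the output keeps one row per joint description, each summarizing the skeleton features most relevant to that description.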

Dual-Branch Training and Inference

  • During training: Both branches output to a shared MLP classifier.
  • Total Loss: \(L_{overall} = L_s + 0.5 \cdot L_{calibrate} + 0.2 \cdot L_c\)
  • During inference: Completely text-free; only the skeleton encoding branch is used, adding just +0.1M parameters for InfoGCN and HDGCN (+0.5M for MotionBERT).
  • Classification is performed using a distribution calibration method for one-shot learning.

Loss & Training

  • \(L_s\): Cross-entropy classification loss for the skeleton branch.
  • \(L_c\): Cross-entropy classification loss for the cross-modal branch.
  • \(L_{calibrate}\): MSE calibration loss for JID.
  • Gradients of the cross-modal branch can flow back to the encoding blocks, allowing high-level semantic information to be injected into the skeleton encoder during the training process.
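
As a small worked example of how the three terms combine, the sketch below computes \(L_{overall} = L_s + 0.5 \cdot L_{calibrate} + 0.2 \cdot L_c\) in plain Python. The logits and joint distributions are made-up toy values, and the helper names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample, computed from raw class logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def overall_loss(l_s, l_calibrate, l_c, w_calibrate=0.5, w_c=0.2):
    """L_overall = L_s + 0.5 * L_calibrate + 0.2 * L_c."""
    return l_s + w_calibrate * l_calibrate + w_c * l_c


# Toy values for a 3-class problem with 4 joints (made up for illustration).
l_s = cross_entropy([2.0, 0.5, -1.0], target=0)           # skeleton branch
l_c = cross_entropy([1.5, 1.0, -0.5], target=0)           # cross-modal branch
l_cal = mse([0.4, 0.3, 0.2, 0.1], [0.5, 0.5, 0.0, 0.0])   # JID calibration
total = overall_loss(l_s, l_cal, l_c)
print(f"L_s={l_s:.4f}  L_c={l_c:.4f}  L_calibrate={l_cal:.4f}  L_overall={total:.4f}")
```

The skeleton-branch loss carries full weight, which is consistent with that branch being the only one used at inference.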

Key Experimental Results

Main Results: NTU RGB+D 120

| Method | 20 Classes | 40 Classes | 60 Classes | 80 Classes | 100 Classes | Params (M) |
|---|---|---|---|---|---|---|
| APSR | 29.1 | 34.8 | 39.2 | 42.8 | 45.3 | - |
| MotionBERT | 35.5 | 54.3 | 56.5 | 52.8 | 61.0 | 60.3 |
| InfoGCN | 37.0 | 53.9 | 58.8 | 55.7 | 56.1 | 1.6 |
| InfoGCN+GAP | 35.1 | 54.8 | 50.8 | 53.2 | 59.9 | 1.6 |
| InfoGCN+CrossGLG | 45.3 | 56.8 | 62.1 | 61.6 | 62.6 | 1.7 |

Plug-and-Play Effectiveness Validation

| Skeleton Encoder | 20 Classes: Baseline \(\to\) +CrossGLG | Param Overhead |
|---|---|---|
| MotionBERT | 35.5 \(\to\) 51.0 (+15.8) | +0.5M |
| HDGCN | 39.0 \(\to\) 43.0 (+4.0) | +0.1M |
| InfoGCN | 37.0 \(\to\) 45.3 (+8.3) | +0.1M |

NTU RGB+D 60

| Method | 10 Classes | 20 Classes | 30 Classes | 40 Classes | 50 Classes |
|---|---|---|---|---|---|
| MotionBERT | 58.3 | 61.0 | 70.0 | 70.3 | 74.5 |
| InfoGCN | 51.1 | 62.1 | 65.7 | 72.1 | 72.3 |
| InfoGCN+CrossGLG | 57.9 | 67.1 | 70.9 | 73.4 | 75.6 |

Ablation Study

| G2L | L2G | 20 Classes | 60 Classes | 100 Classes |
|---|---|---|---|---|
| ✗ | ✗ | 37.0 | 58.8 | 56.1 |
| ✓ | ✗ | 43.3 | 60.9 | 61.7 |
| ✗ | ✓ | 42.5 | 61.7 | 58.6 |
| ✓ | ✓ | 45.3 | 62.1 | 62.6 |

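Ablation on JID insertion position (\(N_{pre}\)): Inserting the JID after the 5th encoding block works best. Features from shallower blocks are not rich enough, while a deeper insertion point leaves too few subsequent blocks for the joint reweighting to take effect.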

Key Findings

  • Both G2L and L2G bring significant improvements individually, and their combination yields the best results.
  • Under the NTU RGB+D 120 20-class setting, InfoGCN+CrossGLG improves over plain InfoGCN by 8.3 percentage points (37.0 \(\to\) 45.3).
  • The model size is only 2.8% of MotionBERT (1.7M vs 60.3M).
  • It also outperforms SOTA on the Kinetics dataset, verifying its generalization ability in complex scenarios.
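
The two headline figures in this list can be re-derived from the numbers reported in the tables above. The variable names are illustrative.

```python
# Numbers copied from the NTU RGB+D 120 results table.
crossglg_params_m, motionbert_params_m = 1.7, 60.3  # parameter counts in millions
infogcn_20, crossglg_20 = 37.0, 45.3                # 20-class accuracy

size_ratio_pct = 100 * crossglg_params_m / motionbert_params_m
gain_points = crossglg_20 - infogcn_20

print(f"model size ratio: {size_ratio_pct:.1f}%")    # prints "model size ratio: 2.8%"
print(f"20-class gain: {gain_points:.1f} points")    # prints "20-class gain: 8.3 points"
```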

Highlights & Insights

  1. LLM Knowledge Distillation into Skeleton Models: Using LLM text to guide training while requiring absolutely no text during inference elegantly solves the modal asymmetry issue.
  2. Plug-and-Play Design: Brings significant improvements to encoders like MotionBERT/HDGCN/InfoGCN with only +0.1M to +0.5M additional parameters.
  3. Global-Local-Global Paradigm: First utilizes global text to focus on key joints (local), and then performs local feature interactions to aggregate global representations.
  4. Visualization Validation: For unseen action classes, the model can focus on the correct key joints without any fine-tuning.

Limitations & Future Work

  • The quality of the LLM-generated text depends on the prompt design and the LLM's capabilities; the descriptions' quality across different actions might be uneven.
  • The joint importance in JID is static (fixed for each action class) and does not account for dynamic changes across different stages of the same action.
  • The framework is evaluated only on NTU and Kinetics datasets, without exploring more challenging fine-grained action recognition scenarios.
  • Although the text encoder (DeBERTa-V2-XLarge) is not used during inference, it is still required during training, which increases training costs.
  • Comparison with GAP (fully supervised text guidance): In the one-shot setting on NTU RGB+D 120, InfoGCN+GAP falls below the InfoGCN baseline in three of the five settings (20, 60, and 80 classes), which suggests that the CrossGLG design is better suited to few-shot scenarios.
  • Although APSR introduces semantic information, the amount introduced is too small, and it cannot detect key joints during inference.
  • The dual-branch shared classifier design can be extended to other cross-modal knowledge transfer scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Introduces LLMs to one-shot skeleton-based recognition for the first time; the global-local-global guidance mechanism is novel)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Validated on three datasets + multiple encoders + detailed ablations)
  • Writing Quality: ⭐⭐⭐⭐ (Clear motivation, intuitive framework diagram)
  • Value: ⭐⭐⭐⭐⭐ (Plug-and-play + extreme efficiency, highly practical)