CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner¶
Conference: ECCV 2024
arXiv: 2403.10082
Code: Yes
Area: Video Understanding
Keywords: Skeleton-based Action Recognition, One-Shot Learning, Large Language Model, Cross-Modal Guidance, Global-Local-Global
TL;DR¶
This paper proposes CrossGLG, a framework that uses LLM-generated text descriptions to guide skeleton feature learning in a "global \(\to\) local \(\to\) global" manner. With InfoGCN as the encoder (1.7M parameters, about 2.8% of MotionBERT's 60.3M), it outperforms the compared methods in most one-shot 3D action recognition settings.
Background & Motivation¶
One-shot skeleton action recognition faces two core challenges:
- Loss of Local Information: Existing methods rely only on low-level information such as joint positions and fail to attend to the crucial local regions.
- Weak Generalization Ability: Without high-level semantic guidance, models struggle to generalize to unseen action classes.
Humans can recognize key movement cues and infer the overall action from only a few observations. Inspired by this, the authors propose using text descriptions that are rich in human knowledge, generated by Large Language Models (LLMs), to guide skeleton feature learning. According to the authors, this is the first framework to introduce LLM text information into one-shot skeleton-based action recognition.
Method¶
Overall Architecture¶
CrossGLG adopts a dual-branch architecture:
- Skeleton Encoding Branch (blue): Processes skeleton sequences only; it is the only branch used during inference.
- Cross-Modal Guidance Branch (green): Leverages LLM text to guide skeleton feature learning; it is used only during training.
Both branches share the same classifier, allowing the skeleton branch to implicitly learn the high-level semantics contained in the text during training.
LLM-Generated Knowledge-Rich Action Descriptions¶
Two prompts are designed to obtain information from ChatGPT:
1. Global Action Description Prompt: Describes which joints/body parts are most important when performing a certain action (e.g., "wave hand" \(\to\) arms, wrists).
2. Joint Motion Description Prompt: Generates fine-grained local motion descriptions for each joint (e.g., "right hand: swing rapidly left and right above the head").
Adapting to different datasets is achieved simply by replacing the action name and joint list in the prompts.
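The prompt-swapping idea can be sketched as plain string templating. The template wording below is hypothetical (the paper's exact prompts are not reproduced in these notes), as are the names `GLOBAL_PROMPT`, `JOINT_PROMPT`, and `build_prompts`.

```python
# Hypothetical templates: the wording is illustrative, not the paper's prompts.
GLOBAL_PROMPT = (
    "Which of the following joints are most important when a person performs "
    "the action '{action}'? Joints: {joints}."
)
JOINT_PROMPT = (
    "Describe how the joint '{joint}' moves when a person performs "
    "the action '{action}'."
)


def build_prompts(action, joints):
    """Return one global prompt and one local prompt per joint for an action."""
    global_prompt = GLOBAL_PROMPT.format(action=action, joints=", ".join(joints))
    joint_prompts = {
        joint: JOINT_PROMPT.format(joint=joint, action=action) for joint in joints
    }
    return global_prompt, joint_prompts


# Switching datasets only changes these two arguments.
global_prompt, joint_prompts = build_prompts(
    "wave hand", ["right hand", "right elbow", "head"]
)
```

Only the action name and the joint list vary between datasets, so the templates themselves stay fixed.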
Global-to-Local Guidance¶
Joint Importance Discriminator (JID):
1. The first \(N_{pre}\) encoding blocks of the skeleton encoder output \(f_{pre}\), which is pooled along the temporal dimension to obtain the overall motion features of the joints.
2. The JID (two linear layers + softmax) predicts the importance of each joint, \(k_{out}\).
3. The key-joint distribution \(k_{gt}\) is extracted from the global text by mapping noun phrases, extracted with Stanford CoreNLP, to joints.
4. Calibration Loss: \(L_{calibrate} = \text{MSE}(k_{out}, k_{gt})\)
5. In subsequent encoding blocks, \(k_{out}\) is used to reweight features after spatial interaction.
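The JID steps above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the notes do not confirm: mean pooling over time, a ReLU between the two linear layers, a uniform \(k_{gt}\) over the key joints, and plain multiplicative reweighting. All class and function names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


class JointImportanceDiscriminator:
    """Two linear layers + softmax over joints (layer sizes are assumptions)."""

    def __init__(self, channels, hidden, rng):
        self.w1 = rng.normal(0.0, 0.1, (channels, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, f_pre):
        # f_pre: (T, J, C) output of the first N_pre encoding blocks.
        joint_feat = f_pre.mean(axis=0)  # temporal pooling -> (J, C)
        hidden = np.maximum(joint_feat @ self.w1 + self.b1, 0.0)  # ReLU: assumption
        scores = (hidden @ self.w2 + self.b2)[:, 0]  # one score per joint
        return softmax(scores)  # k_out, sums to 1 over joints


def key_joint_distribution(joint_names, key_joints):
    """Build k_gt as a uniform distribution over the key joints (assumption)."""
    mask = np.array([float(name in key_joints) for name in joint_names])
    if mask.sum() == 0:
        return np.full(len(joint_names), 1.0 / len(joint_names))
    return mask / mask.sum()


def calibration_loss(k_out, k_gt):
    """L_calibrate = MSE(k_out, k_gt)."""
    return float(np.mean((k_out - k_gt) ** 2))


def reweight(features, k_out):
    """Scale each joint's features (T, J, C) by its predicted importance."""
    return features * k_out[None, :, None]


rng = np.random.default_rng(0)
joint_names = ["head", "left hand", "right hand", "left foot", "right foot"]
f_pre = rng.normal(size=(8, len(joint_names), 16))  # toy (T, J, C) features
jid = JointImportanceDiscriminator(channels=16, hidden=32, rng=rng)
k_out = jid(f_pre)
k_gt = key_joint_distribution(joint_names, {"right hand"})
l_calibrate = calibration_loss(k_out, k_gt)
reweighted = reweight(f_pre, k_out)
```

In a real training loop the JID weights would be learned by backpropagating \(L_{calibrate}\) together with the classification losses; the sketch only shows the forward computation.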
Local-to-Global Guidance¶
Cross-Modal Interaction Module (\(M\) interaction blocks):
1. The text encoder (DeBERTa-V2-XLarge) encodes the joint motion descriptions.
2. Text and skeleton features are projected into a common space.
3. Each interaction block performs three non-local interactions:
    - Text-Text Self-Attention: Synthesizes semantic context of other joints for each joint's text features.
    - Text-Guided Skeleton Cross-Attention: Uses text features as queries to guide the integration of skeleton features.
    - Fusion: Text and skeleton features are added and processed by an MLP.
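A single-head NumPy sketch of one interaction block may help make the three steps concrete. The residual connections, the per-joint feature layout `(J, D)`, the toy dimensions, and the choice to feed the fused output into the next block as the text-side input are all assumptions; `InteractionBlock` is a hypothetical name.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: q is (Nq, D), k and v are (Nk, D)."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v


class InteractionBlock:
    def __init__(self, dim, rng):
        def make():
            return rng.normal(0.0, dim ** -0.5, (dim, dim))

        self.self_attn = [make() for _ in range(3)]   # Q, K, V for text-text
        self.cross_attn = [make() for _ in range(3)]  # Q (text), K, V (skeleton)
        self.mlp1, self.mlp2 = make(), make()

    def __call__(self, text, skel):
        # 1) Text-text self-attention: each joint's text sees the other joints.
        q, k, v = (text @ w for w in self.self_attn)
        text = text + attention(q, k, v)
        # 2) Text-guided cross-attention: text queries gather skeleton features.
        q = text @ self.cross_attn[0]
        k, v = skel @ self.cross_attn[1], skel @ self.cross_attn[2]
        guided = attention(q, k, v)
        # 3) Fusion: add the two modalities, then apply an MLP.
        fused = text + guided
        return fused + np.maximum(fused @ self.mlp1, 0.0) @ self.mlp2


rng = np.random.default_rng(0)
J, D_TEXT, D_SKEL, D = 5, 12, 6, 8  # toy sizes, not the paper's
text_raw = rng.normal(size=(J, D_TEXT))  # stand-in for text encoder outputs
skel_raw = rng.normal(size=(J, D_SKEL))  # stand-in for per-joint skeleton features
# Project both modalities into a common D-dimensional space.
text = text_raw @ rng.normal(0.0, D_TEXT ** -0.5, (D_TEXT, D))
skel = skel_raw @ rng.normal(0.0, D_SKEL ** -0.5, (D_SKEL, D))

x = text
for block in [InteractionBlock(D, rng) for _ in range(2)]:  # M = 2 here
    x = block(x, skel)
cross_modal_feature = x.mean(axis=0)  # pooled feature for the shared classifier
```

Using text as the query means the attention output has one row per joint description, each a weighted mix of skeleton features, which matches the "text guides skeleton integration" reading of the notes.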
Dual-Branch Training and Inference¶
- During training: Both branches output to a shared MLP classifier.
- Total Loss: \(L_{overall} = L_s + 0.5 \cdot L_{calibrate} + 0.2 \cdot L_c\)
- During inference: Completely text-free, using only the skeleton encoding branch; the overhead on the encoder is +0.1M parameters for InfoGCN and HDGCN, and +0.5M for MotionBERT.
- Classification is performed using a distribution calibration method for one-shot learning.
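These notes do not say which distribution calibration method is used. As a rough illustration only, the sketch below shows one simplified form of the idea: the single support feature of each novel class is averaged with the means of its nearest base classes, and a query is assigned to the closest calibrated mean. The function names, `k`, and the toy data are assumptions, and the nearest-mean classifier is a stand-in rather than the paper's classifier.

```python
import numpy as np


def calibrate_mean(support, base_means, k=2):
    """Average the support feature with the means of its k nearest base classes."""
    dists = np.linalg.norm(base_means - support, axis=1)
    nearest = np.argsort(dists)[:k]
    return (base_means[nearest].sum(axis=0) + support) / (k + 1)


def classify(query, class_means):
    """Assign the query to the class with the closest calibrated mean."""
    dists = np.linalg.norm(class_means - query, axis=1)
    return int(np.argmin(dists))


# Toy 2-D features: four base-class means and one exemplar per novel class.
base_means = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
supports = np.array([[1.0, 1.0], [9.0, 11.0]])

calibrated = np.stack([calibrate_mean(s, base_means) for s in supports])
pred = classify(np.array([0.5, 1.5]), calibrated)
```

Borrowing statistics from base classes is what lets a single exemplar stand in for a class distribution; without it, the class estimate would be just the one support feature.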
Loss & Training¶
- \(L_s\): Cross-entropy classification loss for the skeleton branch.
- \(L_c\): Cross-entropy classification loss for the cross-modal branch.
- \(L_{calibrate}\): MSE calibration loss for JID.
- Gradients of the cross-modal branch can flow back to the encoding blocks, allowing high-level semantic information to be injected into the skeleton encoder during the training process.
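The weighted sum \(L_{overall} = L_s + 0.5 \cdot L_{calibrate} + 0.2 \cdot L_c\) can be written out for a single sample as below. The helper names and the toy logits are assumptions; only the weights 0.5 and 0.2 come from the notes.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def overall_loss(skel_logits, cross_logits, k_out, k_gt, label,
                 w_calibrate=0.5, w_cross=0.2):
    l_s = cross_entropy(skel_logits, label)    # skeleton branch
    l_c = cross_entropy(cross_logits, label)   # cross-modal branch
    l_calibrate = mse(k_out, k_gt)             # JID calibration
    return l_s + w_calibrate * l_calibrate + w_cross * l_c


# Toy values for a 3-class, 3-joint example.
loss = overall_loss(
    skel_logits=[2.0, 0.5, -1.0],
    cross_logits=[1.5, 0.2, 0.1],
    k_out=[0.6, 0.3, 0.1],
    k_gt=[1.0, 0.0, 0.0],
    label=0,
)
```

Both cross-entropy terms use the same label because the two branches share one classifier and predict the same action class.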
Key Experimental Results¶
Main Results: NTU RGB+D 120¶
| Method | 20 Classes | 40 Classes | 60 Classes | 80 Classes | 100 Classes | Params (M) |
|---|---|---|---|---|---|---|
| APSR | 29.1 | 34.8 | 39.2 | 42.8 | 45.3 | - |
| MotionBERT | 35.5 | 54.3 | 56.5 | 52.8 | 61.0 | 60.3 |
| InfoGCN | 37.0 | 53.9 | 58.8 | 55.7 | 56.1 | 1.6 |
| InfoGCN+GAP | 35.1 | 54.8 | 50.8 | 53.2 | 59.9 | 1.6 |
| InfoGCN+CrossGLG | 45.3 | 56.8 | 62.1 | 61.6 | 62.6 | 1.7 |
Plug-and-Play Effectiveness Validation¶
| Skeleton Encoder | NTU RGB+D 120, 20 Classes: Baseline \(\to\) +CrossGLG (gain) | Param Overhead |
|---|---|---|
| MotionBERT | 35.5 \(\to\) 51.0 (+15.5) | +0.5M |
| HDGCN | 39.0 \(\to\) 43.0 (+4.0) | +0.1M |
| InfoGCN | 37.0 \(\to\) 45.3 (+8.3) | +0.1M |
NTU RGB+D 60¶
| Method | 10 Classes | 20 Classes | 30 Classes | 40 Classes | 50 Classes |
|---|---|---|---|---|---|
| MotionBERT | 58.3 | 61.0 | 70.0 | 70.3 | 74.5 |
| InfoGCN | 51.1 | 62.1 | 65.7 | 72.1 | 72.3 |
| InfoGCN+CrossGLG | 57.9 | 67.1 | 70.9 | 73.4 | 75.6 |
Ablation Study¶
| G2L | L2G | 20 Classes | 60 Classes | 100 Classes |
|---|---|---|---|---|
| x | x | 37.0 | 58.8 | 56.1 |
| v | x | 43.3 | 60.9 | 61.7 |
| x | v | 42.5 | 61.7 | 58.6 |
| v | v | 45.3 | 62.1 | 62.6 |
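Ablation on the JID insertion position (\(N_{pre}\)): Inserting the JID after the 5th encoding block works best. Inserting it earlier gives the JID features that are not yet rich enough, while inserting it later leaves too few subsequent blocks for the predicted joint importance to influence.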
Key Findings¶
- Both G2L and L2G bring significant improvements individually, and their combination yields the best results.
- Under the NTU RGB+D 120 20-class setting, InfoGCN+CrossGLG improves over plain InfoGCN by 8.3 percentage points (37.0 \(\to\) 45.3).
- The model size is only 2.8% of MotionBERT (1.7M vs 60.3M).
- It also outperforms SOTA on the Kinetics dataset, verifying its generalization ability in complex scenarios.
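The headline numbers can be rechecked directly from the tables in these notes. The snippet below is only arithmetic over values copied from the NTU RGB+D 120 tables; the variable names are arbitrary.

```python
# Values copied from the main-results and ablation tables (NTU RGB+D 120).
params_m = {"MotionBERT": 60.3, "InfoGCN+CrossGLG": 1.7}
ablation_20 = {
    "baseline": 37.0,
    "G2L only": 43.3,
    "L2G only": 42.5,
    "G2L + L2G": 45.3,
}

# Parameter count of InfoGCN+CrossGLG relative to MotionBERT.
param_ratio = params_m["InfoGCN+CrossGLG"] / params_m["MotionBERT"]

# Gain of each ablation variant over the baseline, in percentage points.
gains = {
    name: round(acc - ablation_20["baseline"], 1)
    for name, acc in ablation_20.items()
    if name != "baseline"
}

print(f"parameter ratio: {param_ratio:.1%}")  # parameter ratio: 2.8%
print(gains)  # {'G2L only': 6.3, 'L2G only': 5.5, 'G2L + L2G': 8.3}
```

The two single-module gains (6.3 and 5.5 points) add up to more than the combined gain (8.3 points), so the two guidance directions overlap in part rather than contributing independently.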
Highlights & Insights¶
- LLM Knowledge Distillation into Skeleton Models: Using LLM text to guide training while requiring absolutely no text during inference elegantly solves the modal asymmetry issue.
- Plug-and-Play Design: Brings significant improvements to encoders such as MotionBERT, HDGCN, and InfoGCN with a small parameter overhead (+0.1M for HDGCN and InfoGCN, +0.5M for MotionBERT).
- Global-Local-Global Paradigm: First utilizes global text to focus on key joints (local), and then performs local feature interactions to aggregate global representations.
- Visualization Validation: For unseen action classes, the model can focus on the correct key joints without any fine-tuning.
Limitations & Future Work¶
- The quality of the LLM-generated text depends on the prompt design and the LLM's capabilities; the descriptions' quality across different actions might be uneven.
- The joint importance in JID is static over time: \(k_{out}\) is predicted from temporally pooled features and supervised by a per-class target \(k_{gt}\), so it does not account for dynamic changes across different stages of the same action.
- The framework is evaluated only on NTU and Kinetics datasets, without exploring more challenging fine-grained action recognition scenarios.
- Although the text encoder (DeBERTa-V2-XLarge) is not used during inference, it is still required during training, which increases training costs.
Related Work & Insights¶
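- Comparison with GAP (fully supervised text guidance): In the one-shot setting, InfoGCN+GAP falls below the plain InfoGCN baseline in three of the five NTU RGB+D 120 settings (20, 60, and 80 classes). This suggests that text guidance designed for full supervision does not transfer directly, and it supports the CrossGLG design for one-shot scenarios.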
- Although APSR introduces semantic information, the amount introduced is too small, and it cannot detect key joints during inference.
- The dual-branch shared classifier design can be extended to other cross-modal knowledge transfer scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Introduces LLMs to one-shot skeleton-based recognition for the first time; the global-local-global guidance mechanism is novel)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Validated on three datasets + multiple encoders + detailed ablations)
- Writing Quality: ⭐⭐⭐⭐ (Clear motivation, intuitive framework diagram)
- Value: ⭐⭐⭐⭐⭐ (Plug-and-play + extreme efficiency, highly practical)