CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner¶

Conference: ECCV 2024
arXiv: 2403.10082
Code: Yes
Area: Video Understanding
Keywords: Skeleton-based Action Recognition, One-Shot Learning, Large Language Model, Cross-Modal Guidance, Global-Local-Global

TL;DR¶

This paper proposes the CrossGLG framework, which utilizes LLM-generated text descriptions to guide skeleton feature learning in a "global \(\to\) local \(\to\) global" manner, significantly outperforming competitors in one-shot 3D action recognition with only 2.8% of the parameter size of the SOTA model.

Background & Motivation¶

One-shot skeleton action recognition faces two core challenges:

Loss of Local Information: Existing methods rely only on low-level information such as joint positions and fail to attend to the local regions that matter most for an action.

Weak Generalization Ability: Lacking high-level semantic guidance makes it difficult to generalize to unseen action classes.

Humans can recognize key movement cues and infer the overall action with only a few observations. Inspired by this, the authors propose leveraging human knowledge-rich text descriptions generated by Large Language Models (LLMs) to guide skeleton feature learning. This is the first framework to introduce LLM text information into one-shot skeleton-based action recognition.

Method¶

Overall Architecture¶

CrossGLG adopts a dual-branch architecture:
- Skeleton Encoding Branch (blue): Processes skeleton sequences only; it is the only branch used during inference.
- Cross-Modal Guidance Branch (green): Leverages LLM text to guide skeleton feature learning during training.

Both branches share the same classifier, allowing the skeleton branch to implicitly learn the high-level semantics contained in the text during training.

LLM-Generated Knowledge-Rich Action Descriptions¶

Two prompts are designed to obtain information from ChatGPT:
1. Global Action Description Prompt: Describes which joints/body parts are most important when performing a certain action (e.g., "wave hand" \(\to\) arms, wrists).
2. Joint Motion Description Prompt: Generates fine-grained local motion descriptions for each joint (e.g., "right hand: swing rapidly left and right above the head").

Adapting to different datasets is achieved simply by replacing the action name and joint list in the prompts.
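As an illustration of how the two prompts can be re-targeted by swapping the action name and joint list, here is a minimal sketch. The template wording and the `build_prompts` helper are hypothetical and are not the prompts used in the paper.

```python
# Hypothetical prompt templates; the paper's exact wording is not reproduced here.
GLOBAL_TEMPLATE = (
    "Which of the following joints or body parts are most important when a "
    "person performs the action '{action}'? Joints: {joints}."
)
JOINT_TEMPLATE = (
    "For the action '{action}', briefly describe how each of the following "
    "joints moves. Joints: {joints}."
)


def build_prompts(action, joints):
    """Fill both templates for one action class and one dataset's joint list."""
    joint_list = ", ".join(joints)
    return {
        "global": GLOBAL_TEMPLATE.format(action=action, joints=joint_list),
        "joint_motion": JOINT_TEMPLATE.format(action=action, joints=joint_list),
    }


# Switching datasets only changes the two arguments below.
prompts = build_prompts("wave hand", ["head", "right hand", "left hand"])
```

The returned strings would then be sent to the LLM once per action class, so the text can be generated offline before training.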

Global-to-Local Guidance¶

Joint Importance Discriminator (JID):
1. The first \(N_{pre}\) encoding blocks of the skeleton encoder output \(f_{pre}\), which is pooled along the temporal dimension to obtain the overall motion features of the joints.
2. The JID (two linear layers + softmax) predicts the importance of each joint, \(k_{out}\).
3. The key joint distribution \(k_{gt}\) is extracted from the global text by mapping noun phrases (extracted with Stanford CoreNLP) to joints.
4. Calibration loss: \(L_{calibrate} = \text{MSE}(k_{out}, k_{gt})\).
5. In subsequent encoding blocks, \(k_{out}\) is used to reweight features after spatial interaction.
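The JID steps can be sketched in dependency-free Python as follows. This is a toy illustration under several assumptions that the summary does not confirm: the two linear layers score each joint independently, the softmax runs over joints, \(k_{gt}\) places uniform mass on the joints named in the global text, and all weights and features are made up.

```python
import math


def temporal_pool(f_pre):
    """Average features over time: [T][J][C] -> [J][C]."""
    num_frames = len(f_pre)
    num_joints = len(f_pre[0])
    num_channels = len(f_pre[0][0])
    return [
        [sum(f_pre[t][j][c] for t in range(num_frames)) / num_frames
         for c in range(num_channels)]
        for j in range(num_joints)
    ]


def linear(x, weight, bias):
    """Dense layer: weight is [C_out][C_in], bias is [C_out]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def jid(pooled, w1, b1, w2, b2):
    """Assumed layout: two linear layers score each joint, softmax over joints."""
    scores = [linear(linear(feat, w1, b1), w2, b2)[0] for feat in pooled]
    return softmax(scores)


def key_joint_target(joint_names, key_joints):
    """Assumed target k_gt: uniform mass on joints named in the global text."""
    mask = [1.0 if name in key_joints else 0.0 for name in joint_names]
    total = sum(mask)
    if total == 0:
        return [1.0 / len(mask)] * len(mask)
    return [m / total for m in mask]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reweight(features, k_out):
    """Scale each joint's feature vector by its predicted importance."""
    return [[k * v for v in feat] for k, feat in zip(k_out, features)]


joint_names = ["head", "right hand", "left hand", "right foot"]
# Toy f_pre with 2 frames, 4 joints, 2 channels.
f_pre = [
    [[0.0, 0.0], [1.0, 2.0], [1.0, 1.0], [0.0, 1.0]],
    [[0.0, 0.0], [3.0, 2.0], [1.0, 3.0], [0.0, 1.0]],
]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # hypothetical weights
w2, b2 = [[1.0, 1.0]], [0.0]

pooled = temporal_pool(f_pre)
k_out = jid(pooled, w1, b1, w2, b2)
k_gt = key_joint_target(joint_names, {"right hand", "left hand"})
l_calibrate = mse(k_out, k_gt)
reweighted = reweight(pooled, k_out)
```

In a real model the weights would be trained so that \(k_{out}\) moves toward \(k_{gt}\); here they are fixed only to make the data flow visible.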

Local-to-Global Guidance¶

Cross-Modal Interaction Module (\(M\)-layer interaction blocks):
1. The text encoder (DeBERTa-V2-XLarge) encodes joint motion descriptions.
2. Text and skeleton features are projected into a common space.
3. Each layer of the interaction block performs three non-local interactions:
   - Text-Text Self-Attention: Synthesizes semantic context of other joints for each joint's text features.
   - Text-Guided Skeleton Cross-Attention: Uses text features as queries to guide the integration of skeleton features.
   - Fusion: Text and skeleton features are added and processed by an MLP.
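One interaction block can be sketched with plain scaled dot-product attention. This is a simplified illustration, not the paper's implementation: it assumes single-head attention without learned query/key/value projections, assumes both inputs are already in the common space, and uses one linear layer with ReLU as a stand-in for the fusion MLP.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def mlp(x, weight, bias):
    """One linear layer + ReLU, a stand-in for the fusion MLP."""
    return [max(0.0, dot(row, x) + b) for row, b in zip(weight, bias)]


def interaction_block(text, skeleton, weight, bias):
    # 1) Text-text self-attention: each joint's text sees the other joints' text.
    text_ctx = attention(text, text, text)
    # 2) Text-guided cross-attention: text features query the skeleton features.
    skel_ctx = attention(text_ctx, skeleton, skeleton)
    # 3) Fusion: element-wise sum followed by the MLP.
    fused = [[t + s for t, s in zip(tv, sv)] for tv, sv in zip(text_ctx, skel_ctx)]
    return [mlp(f, weight, bias) for f in fused]


# Toy inputs: 3 joints, 2-dimensional common space.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
skeleton = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
weight, bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # hypothetical MLP weights
fused = interaction_block(text, skeleton, weight, bias)
```

Stacking \(M\) such blocks would feed each block's output back in as the next block's text-side input; the summary does not specify that wiring, so it is left out here.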

Dual-Branch Training and Inference¶

During training: Both branches output to a shared MLP classifier.
Total Loss: \(L_{overall} = L_s + 0.5 \cdot L_{calibrate} + 0.2 \cdot L_c\)
During inference: Completely text-free, using only the skeleton encoding branch, which adds a small parameter overhead to the base encoder (+0.1M for InfoGCN and HDGCN, +0.5M for MotionBERT).
Classification is performed using a distribution calibration method for one-shot learning.

Loss & Training¶

\(L_s\): Cross-entropy classification loss for the skeleton branch.
\(L_c\): Cross-entropy classification loss for the cross-modal branch.
\(L_{calibrate}\): MSE calibration loss for JID.
Gradients of the cross-modal branch can flow back to the encoding blocks, allowing high-level semantic information to be injected into the skeleton encoder during the training process.
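The weighted sum \(L_{overall} = L_s + 0.5 \cdot L_{calibrate} + 0.2 \cdot L_c\) can be written out as a small sketch. The function and argument names are illustrative, and the logits are assumed to come from the shared classifier applied to each branch's features.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample: log-sum-exp(logits) minus the target logit."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def overall_loss(skel_logits, cross_logits, target, k_out, k_gt,
                 w_calibrate=0.5, w_cross=0.2):
    """L_overall = L_s + 0.5 * L_calibrate + 0.2 * L_c."""
    l_s = cross_entropy(skel_logits, target)    # skeleton branch
    l_c = cross_entropy(cross_logits, target)   # cross-modal branch
    l_calibrate = mse(k_out, k_gt)              # JID calibration
    return l_s + w_calibrate * l_calibrate + w_cross * l_c


# Toy example: two classes, uniform logits, perfectly calibrated JID.
loss = overall_loss([0.0, 0.0], [0.0, 0.0], 0, [0.5, 0.5], [0.5, 0.5])
```

With uniform two-class logits each cross-entropy term equals \(\ln 2\) and the calibration term is zero, so the toy loss is \(1.2 \ln 2\). At inference only the skeleton logits are needed.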

Key Experimental Results¶

Main Results: NTU RGB+D 120¶

Method	20 Classes	40 Classes	60 Classes	80 Classes	100 Classes	Params (M)
APSR	29.1	34.8	39.2	42.8	45.3	-
MotionBERT	35.5	54.3	56.5	52.8	61.0	60.3
InfoGCN	37.0	53.9	58.8	55.7	56.1	1.6
InfoGCN+GAP	35.1	54.8	50.8	53.2	59.9	1.6
InfoGCN+CrossGLG	45.3	56.8	62.1	61.6	62.6	1.7

Plug-and-Play Effectiveness Validation¶

Skeleton Encoder	20 Classes Baseline \(\to\) +CrossGLG	Param Overhead
MotionBERT	35.5 \(\to\) 51.0 (+15.5)	+0.5M
HDGCN	39.0 \(\to\) 43.0 (+4.0)	+0.1M
InfoGCN	37.0 \(\to\) 45.3 (+8.3)	+0.1M

NTU RGB+D 60¶

Method	10 Classes	20 Classes	30 Classes	40 Classes	50 Classes
MotionBERT	58.3	61.0	70.0	70.3	74.5
InfoGCN	51.1	62.1	65.7	72.1	72.3
InfoGCN+CrossGLG	57.9	67.1	70.9	73.4	75.6

Ablation Study¶

G2L	L2G	20 Classes	60 Classes	100 Classes
x	x	37.0	58.8	56.1
v	x	43.3	60.9	61.7
x	v	42.5	61.7	58.6
v	v	45.3	62.1	62.6

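Ablation on JID insertion position (\(N_{pre}\)): Inserting the JID after the 5th encoding block works best. Inserting it earlier yields features that are not yet rich enough, while inserting it later limits the influence of the joint reweighting on the remaining blocks.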

Key Findings¶

Both G2L and L2G bring significant improvements individually, and their combination yields the best results.
Under the NTU RGB+D 120 20-class setting, InfoGCN+CrossGLG improves over plain InfoGCN by 8.3 percentage points (37.0 \(\to\) 45.3).
InfoGCN+CrossGLG has only about 2.8% of MotionBERT's parameters (1.7M vs. 60.3M).
It also outperforms SOTA on the Kinetics dataset, verifying its generalization ability in complex scenarios.
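The headline numbers can be recomputed directly from the tables above. The short check below uses only values reported in this summary (NTU RGB+D 120, 20-class setting, and the parameter column); the variable names are illustrative.

```python
# Accuracy values (%) from the NTU RGB+D 120 tables, 20-class setting.
baseline_20 = 37.0   # InfoGCN
g2l_only_20 = 43.3   # + global-to-local guidance only
l2g_only_20 = 42.5   # + local-to-global guidance only
full_20 = 45.3       # InfoGCN+CrossGLG

# Parameter counts in millions.
params_crossglg_m = 1.7
params_motionbert_m = 60.3

# Gains in percentage points over the baseline.
gain_g2l = round(g2l_only_20 - baseline_20, 1)
gain_l2g = round(l2g_only_20 - baseline_20, 1)
gain_full = round(full_20 - baseline_20, 1)

# Parameter ratio as a percentage of MotionBERT.
param_ratio_pct = round(100 * params_crossglg_m / params_motionbert_m, 1)
```

Each guidance direction alone recovers most of the full gain, and the parameter ratio matches the quoted 2.8%.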

Highlights & Insights¶

LLM Knowledge Distillation into Skeleton Models: Using LLM text to guide training while requiring absolutely no text during inference elegantly solves the modal asymmetry issue.
Plug-and-Play Design: Brings significant improvements to encoders like MotionBERT/HDGCN/InfoGCN with only +0.1M to +0.5M additional parameters.
Global-Local-Global Paradigm: First utilizes global text to focus on key joints (local), and then performs local feature interactions to aggregate global representations.
Visualization Validation: For unseen action classes, the model can focus on the correct key joints without any fine-tuning.

Limitations & Future Work¶

The quality of the LLM-generated text depends on the prompt design and the LLM's capabilities, so description quality may vary across actions.
The joint importance in JID is static (fixed for each action class) and does not account for dynamic changes across different stages of the same action.
The framework is evaluated only on NTU and Kinetics datasets, without exploring more challenging fine-grained action recognition scenarios.
Although the text encoder (DeBERTa-V2-XLarge) is not used during inference, it is still required during training, which increases training costs.

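Comparison with GAP (fully supervised text guidance): In the one-shot setting, InfoGCN+GAP falls below the InfoGCN baseline in three of the five NTU RGB+D 120 settings (20, 60, and 80 classes), whereas InfoGCN+CrossGLG improves on the baseline in all five, suggesting that the CrossGLG design is better suited to few-shot scenarios.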
Although APSR introduces semantic information, the amount introduced is too small, and it cannot detect key joints during inference.
The dual-branch shared classifier design can be extended to other cross-modal knowledge transfer scenarios.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ (Introduces LLMs to one-shot skeleton-based recognition for the first time; the global-local-global guidance mechanism is novel)
Experimental Thoroughness: ⭐⭐⭐⭐ (Validated on three datasets + multiple encoders + detailed ablations)
Writing Quality: ⭐⭐⭐⭐ (Clear motivation, intuitive framework diagram)
Value: ⭐⭐⭐⭐⭐ (Plug-and-play + extreme efficiency, highly practical)