On the Utility of 3D Hand Poses for Action Recognition

Conference: ECCV 2024
arXiv: 2403.09805
Code: https://github.com/s-shamil/HandFormer
Area: Video Understanding
Keywords: 3D hand pose, action recognition, multimodal transformer, skeleton-based, egocentric

TL;DR

This paper proposes HandFormer, a lightweight multimodal Transformer that combines densely sampled 3D hand poses (to capture fine-grained actions) with sparsely sampled RGB frames (to provide scene semantics). By efficiently modeling hand-object interactions through micro-action temporal decomposition and trajectory encoding, it achieves state-of-the-art (SOTA) performance on Assembly101 and H2O. Notably, the pose-only model outperforms existing skeleton-based methods with \(5\times\) fewer FLOPs.

Background & Motivation

Background: The popularity of AR/VR headsets has driven research on egocentric hand-object interaction recognition. Current SOTA methods (e.g., SlowFast, Video Transformer) mainly rely on multi-view or single-view RGB streams, whose high computational overhead makes them unsuitable for resource-constrained AR/VR devices. Meanwhile, skeleton-based action recognition methods (such as MS-G3D and ISTA-Net) primarily target whole-body skeletons, which makes them a poor fit for hand poses.

Limitations of Prior Work:
- Whole-body skeleton methods are not applicable to hands: In whole-body actions, there exist static reference joints (e.g., the head), and the changes of moving joints relative to the reference joints serve as key clues. However, hand joints are highly coupled (Pearson correlation coefficient of 0.93 vs. 0.33 for whole body) and move together as a whole, lacking static reference points. Consequently, long-range spatio-temporal dependency modeling is of limited effectiveness.
- RGB-only methods are computationally expensive: Densely sampling RGB frames is costly, whereas reducing temporal resolution leads to a loss of fine-grained action differentiation (e.g., "screwing tight" vs. "unscrewing").
- Pose-only methods cannot recognize objects: Hand poses excel at identifying verbs but cannot encode information about the interacting objects.
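The coupling statistic cited above is a Pearson correlation between joint motions. The sketch below only illustrates the coefficient itself on made-up 1-D trajectories. The joint names and values are hypothetical, and how the paper aggregates correlations over joints, axes, and clips is not described in these notes.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical 1-D trajectories: two fingertips that move together,
# and a nearly static joint such as a head in a whole-body skeleton.
index_tip = [0.0, 1.0, 2.1, 2.9, 4.0]
thumb_tip = [0.5, 1.4, 2.6, 3.5, 4.4]
still_ref = [1.0, 1.1, 0.9, 1.0, 1.05]

coupled = pearson(index_tip, thumb_tip)    # close to 1: joints move as a unit
uncoupled = pearson(index_tip, still_ref)  # close to 0: one joint acts as a reference
print(round(coupled, 2), round(uncoupled, 2))
```

A coefficient near 1 for most joint pairs means relative joint motion carries little extra signal, which is the argument the notes use against long-range joint-to-joint modeling for hands.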

Key Challenge: Hand action recognition requires high temporal resolution to understand subtle movements, as well as visual semantics to identify objects, but balancing both incurs high computational costs.

Goal: How to design a lightweight architecture that can capture fine-grained hand movements at high temporal resolution while incorporating sufficient visual semantics to understand the scene and objects?

Key Insight: Based on the statistical differences between hand and whole-body skeletons (i.e., highly coupled hand joints, dominant whole-skeleton motion, and lack of static reference points), the authors propose temporal decomposition (micro-action) and trajectory encoding to replace traditional long-range spatio-temporal attention. Meanwhile, sparse RGB frames are sufficient to provide object semantics.

Core Idea: Dense, high-temporal-resolution hand poses and sparse RGB frames are grouped into a sequence of micro-actions. Trajectory encoding factorizes spatial and temporal modeling, which yields multimodal SOTA accuracy at lower cost than the Pose+RGB baselines in the results table (47.6 vs. 66.2 and 68.9 GFLOPs).

Method

Overall Architecture

Input: A dense 3D hand pose sequence for an action clip (\(\mathcal{T}\) frames, \(J\) keypoints per hand per frame) plus sparsely sampled RGB frames. The sequence is partitioned into \(K\) micro-action segments, each containing \(N\) dense pose frames and 1 RGB frame. For each micro-action, a Trajectory Encoder encodes the poses and a Frame Encoder encodes the RGB frame. A Multimodal Tokenizer fuses the two feature streams, a Temporal Transformer aggregates the resulting tokens over time, and the model outputs the action class.

Key Designs

Micro-action temporal decomposition:
- Function: Segments a long sequence into \(K\) short windows of fixed length \(N\), where each window is referred to as a micro-action, analogous to "words" forming a "sentence".
- Mechanism: The pose sequence is resampled to \(\mathcal{T}'=(K-1)\times R + N\) frames via linear interpolation (default \(\mathcal{T}'=120\), \(N=15\), \(K=8\), which implies \(R=15\), i.e., non-overlapping windows), and \(K\) micro-actions are sliced using a sliding window with step size \(R\). Each micro-action \(M_k = [I_{h(k)}, \{P'_{g(k)+i}\}_{i=0}^{N-1}]\) consists of one RGB frame and \(N\) pose frames.
- Design Motivation: Hand joints are highly coupled, so long-range spatio-temporal modeling (e.g., CTR-GCN, ISTA-Net) offers limited benefit and may introduce redundancy. Decomposition into short windows restricts spatio-temporal dependency modeling to local regions and improves parameter sharing. In the ablation, \(N=15\) gives the best verb accuracy (63.70%), about 4.6 percentage points above frame-by-frame processing (\(N=1\), 59.12%) and about 1.4 points above the full sequence (\(N=120\), 62.29%).
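The windowing step can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the input is a toy 1-D signal instead of per-joint 3D coordinates, and the stride `r=15` is derived from the stated defaults through \(\mathcal{T}'=(K-1)\times R + N\).

```python
def resample_linear(seq, target_len):
    """Linearly interpolate a list of per-frame values to target_len frames."""
    if target_len == 1 or len(seq) == 1:
        return [seq[0]] * target_len
    out = []
    scale = (len(seq) - 1) / (target_len - 1)
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append((1 - frac) * seq[lo] + frac * seq[hi])
    return out


def micro_action_windows(seq, n=15, k=8, r=15):
    """Resample to T' = (K-1)*R + N frames, then slice K windows of N frames with stride R."""
    t_prime = (k - 1) * r + n
    resampled = resample_linear(seq, t_prime)
    return [resampled[i * r : i * r + n] for i in range(k)]


# Toy signal with 200 frames; a real input would hold 3D joint coordinates per frame.
signal = [float(t) for t in range(200)]
windows = micro_action_windows(signal)
print(len(windows), len(windows[0]))
```

With the defaults, every clip is mapped to 8 windows of 15 frames regardless of its original length, so the downstream encoders always see fixed-size inputs.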
Trajectory Encoder:
- Function: Encodes the dense pose sequence within a micro-action into a single feature vector.
- Mechanism: Adopts a Lagrangian perspective: instead of processing all joints frame by frame, the \(N\)-frame 3D coordinates of each joint are treated as a trajectory (a \(3\times N\)-dimensional vector). A Single-Joint TCN with parameters shared across all joints encodes the trajectories into \(2J\) Local Trajectory Tokens. A separate Wrist-TCN processes the wrist 6D pose (position and orientation) across the entire action sequence to generate a Global Wrist Token that serves as a global motion reference. Self-attention and spatio-temporal average pooling are then applied to these tokens.
- Design Motivation: Whole-skeleton motion dominates hand actions, and the relationships between joints remain relatively stable. Trajectory encoding naturally captures the short-term motion patterns of each joint, while the Global Wrist Token compensates for the lack of static reference joints in the hand (ablation shows that its inclusion improves verb accuracy from 64.17% to 64.90%).
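The Lagrangian regrouping described above amounts to switching from a frame-major to a joint-major layout before any learned encoding. The sketch below shows only that data rearrangement, with a hypothetical function name and dummy coordinates. The shared Single-Joint TCN and the Wrist-TCN are not modeled, and flattening each trajectory to length \(3N\) is an assumption made for illustration.

```python
def to_joint_trajectories(window):
    """Regroup one micro-action from frame-major to joint-major layout.

    window: list of N frames; each frame is a list of joints; each joint is (x, y, z).
    Returns one flat trajectory of length 3*N per joint.
    """
    num_joints = len(window[0])
    trajectories = []
    for j in range(num_joints):
        traj = []
        for frame in window:
            traj.extend(frame[j])  # append (x, y, z) of joint j at this frame
        trajectories.append(traj)
    return trajectories


N, J = 15, 21  # frames per micro-action; joints per hand (21 as in the ablation table)
# Two hands give 2*J joints per frame; coordinates here are dummy values.
window = [[(float(t), float(j), 0.0) for j in range(2 * J)] for t in range(N)]
trajs = to_joint_trajectories(window)
print(len(trajs), len(trajs[0]))
```

Each of the \(2J\) trajectories would then be passed through the same encoder, which is what makes parameter sharing across joints possible.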
Multimodal Tokenizer:
- Function: Fuses RGB features and pose features to generate enhanced multimodal tokens.
- Mechanism: The frame feature \(f_k^{\text{RGB}}\) and trajectory encoding \(f_k^{\text{Pose}}\) are concatenated and projected to a shared space via an MLP to obtain PoseRGB features, which are then split and added back to the original RGB and pose features. This residual-style interaction allows the two modalities to mutually enhance each other.
- Frame Encoder: A frozen DINOv2 ViT (or a pre-trained ResNet50) processes a single RGB frame, producing features from both a hand crop enlarged by \(1.25\times\) and the global scene.
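The residual-style interaction in the tokenizer can be sketched as concatenate, project, split, add. This is an illustrative sketch under several assumptions: both features share the same width `d`, the MLP is replaced by a single linear map with random weights, and the first half of the projected vector is assigned to RGB. The names `linear` and `fuse` are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b, with weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse(f_rgb, f_pose, weight, bias):
    """Concatenate both features, project, split, and add back to each modality."""
    d = len(f_rgb)
    joint = linear(f_rgb + f_pose, weight, bias)  # stand-in for the MLP (assumption)
    rgb_out = [a + b for a, b in zip(f_rgb, joint[:d])]
    pose_out = [a + b for a, b in zip(f_pose, joint[d:])]
    return rgb_out, pose_out


random.seed(0)
d = 4  # toy feature width; the notes list d=256 or d=512 for the Transformer
weight = [[random.uniform(-0.1, 0.1) for _ in range(2 * d)] for _ in range(2 * d)]
bias = [0.0] * (2 * d)
rgb_tok, pose_tok = fuse([1.0] * d, [0.5] * d, weight, bias)
print(len(rgb_tok), len(pose_tok))
```

Because the fused term is added to the original features, a projection that outputs zeros leaves both modalities unchanged, so the interaction can only refine the unimodal tokens rather than replace them.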
Temporal Transformer:
- Function: Aggregates the multimodal tokens of \(K\) micro-actions for temporal modeling.
- Mechanism: The \(2K\) tokens (one RGB token and one pose token per micro-action) along with sinusoidal positional encodings and learnable modality embeddings are fed into a standard Transformer (HandFormer-B: \(d=256\), 2 layers; HandFormer-L: \(d=512\), 4 layers). The output of the [CLS] token is used for classification.
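The input sequence to the Temporal Transformer can be sketched as follows. The sinusoidal encoding is the standard formulation. Everything else is an assumption for illustration: the helper names, a single prepended [CLS] token (the verb and object class tokens mentioned under the losses are omitted), and both tokens of micro-action \(k\) sharing position \(k\).

```python
import math


def sinusoidal_encoding(position, dim):
    """Standard sinusoidal positional encoding for one position (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def build_sequence(rgb_tokens, pose_tokens, modality_emb, cls_token):
    """Return [CLS] followed by (RGB, pose) pairs with position and modality terms added."""
    dim = len(cls_token)
    seq = [cls_token]
    for k, (rgb, pose) in enumerate(zip(rgb_tokens, pose_tokens)):
        pos = sinusoidal_encoding(k, dim)  # assumption: shared position within a micro-action
        seq.append([r + p + m for r, p, m in zip(rgb, pos, modality_emb["rgb"])])
        seq.append([q + p + m for q, p, m in zip(pose, pos, modality_emb["pose"])])
    return seq


K, d = 8, 8  # K micro-actions; toy token width
zeros = [0.0] * d
seq = build_sequence([zeros] * K, [zeros] * K, {"rgb": zeros, "pose": zeros}, zeros)
print(len(seq))  # 17
```

With \(K=8\) the sequence holds \(2K+1=17\) tokens, which is one reason the temporal stage stays cheap compared with attention over all frames and joints.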

Loss & Training

The total loss is formulated as \(\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{verb} + \lambda_2 \mathcal{L}_{obj} + \lambda_3 \mathcal{L}_{ant}\), where:

\(\mathcal{L}_{cls}\): Cross-entropy loss for action classification.
\(\mathcal{L}_{verb}\), \(\mathcal{L}_{obj}\): Auxiliary supervision for verbs and objects respectively. The verb class token only attends to pose encodings, while the object class token only attends to RGB features, enforcing modality decoupling.
\(\mathcal{L}_{ant}\) (Feature Anticipation Loss): Given the PoseRGB feature of the \(k\)-th micro-action, it predicts the RGB feature of the \((k+1)\)-th micro-action (L1 distance), encouraging the model to understand the "visual state changes caused by hand movements".
Hyperparameters: \(\lambda_1=1.0\), \(\lambda_2=1.0\), \(\lambda_3=2.0\); SGD, lr=0.025, 50 epochs, step decay.
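How the four terms combine under the stated weights can be shown with plain arithmetic. The per-term values below are dummy numbers, the function names are hypothetical, and averaging the L1 distance over feature dimensions is an assumption (the notes only say "L1 distance").

```python
def l1_distance(pred, target):
    """Mean absolute difference between two feature vectors (mean reduction is assumed)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(l_cls, l_verb, l_obj, l_ant, lambdas=(1.0, 1.0, 2.0)):
    """L = L_cls + lambda1 * L_verb + lambda2 * L_obj + lambda3 * L_ant."""
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_verb + lam2 * l_obj + lam3 * l_ant


# Anticipation term: predicted RGB feature of micro-action k+1 vs. the actual one.
l_ant = l1_distance([0.2, 0.4], [0.0, 0.0])
loss = total_loss(l_cls=2.0, l_verb=1.0, l_obj=1.5, l_ant=l_ant)
print(round(loss, 3))
```

With \(\lambda_3=2.0\), the anticipation term is weighted twice as heavily as each auxiliary classification term.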

Key Experimental Results

Main Results

Evaluation on Assembly101 (a large-scale multi-view hand assembly dataset with 1380 fine-grained actions) and H2O (36 hand-object interaction actions):

Method	Modality	GFLOPs	Assembly101 Action	Assembly101 Verb	H2O Action
MS-G3D	Pose	21.2	28.78	63.46	50.83
ISTA-Net	Pose	35.2	28.14	62.70	89.09*
TSM	RGB	33.0	35.27	58.27	-
RGBPoseConv3D	Pose+RGB	68.9	33.61	61.99	83.47
MS-G3D + TSM (late fusion)	Pose+RGB	66.2	39.74	65.12	-
HandFormer-B/21×8 (Pose)	Pose	4.2	28.80	65.33	57.44
HandFormer-B/21×8 (Pose+RGB)	Pose+RGB	47.6	41.06	69.23	93.39

*ISTA-Net additionally utilizes 6D object poses on H2O, and is not directly comparable.

Key Conclusions:
- The pose-only model outperforms MS-G3D and ISTA-Net in verb recognition (65.33% vs. 63.46% and 62.70%) with at least \(5\times\) fewer FLOPs (4.2 vs. 21.2 and 35.2 GFLOPs).
- The multimodal model outperforms the strong MS-G3D + TSM late-fusion baseline (41.06% vs. 39.74% action accuracy) and beats RGBPoseConv3D by about 7.5 percentage points.
- On H2O, action accuracy reaches 93.39%, about 2.5 percentage points above the previous best, H2OTR (90.90%).
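The ratios and margins quoted in these conclusions follow directly from the results table. The snippet below is a quick arithmetic check using values copied from that table and from the H2OTR figure quoted above. The variable names are illustrative.

```python
# GFLOPs copied from the results table.
gflops = {"MS-G3D": 21.2, "ISTA-Net": 35.2, "HandFormer-B (Pose)": 4.2}
ratio_msg3d = gflops["MS-G3D"] / gflops["HandFormer-B (Pose)"]
ratio_ista = gflops["ISTA-Net"] / gflops["HandFormer-B (Pose)"]

# Accuracy margins in percentage points.
gain_over_rgbposeconv3d = 41.06 - 33.61  # Assembly101 action
gain_over_h2otr = 93.39 - 90.90          # H2O action

print(f"{ratio_msg3d:.1f}x {ratio_ista:.1f}x "
      f"{gain_over_rgbposeconv3d:.2f} {gain_over_h2otr:.2f}")
# prints: 5.0x 8.4x 7.45 2.49
```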

Ablation Study

Configuration	Verb Accuracy	Description
Full model (21 joints + Global Wrist)	64.90%	Full pose-only model
w/o Global Wrist Token	64.17%	-0.73%, indicating global motion reference is helpful
11 joints	64.77%	Removing secondary joints has minimal impact
6 joints (fingertips + wrist)	63.70%	Highly efficient but acceptable accuracy

Micro-action length \(N\)	1	15	30	60	120
Verb Accuracy	59.12%	63.70%	63.68%	63.51%	62.29%

Module Combination	Assembly101 Action	H2O Action
Baseline (no tokenizer, no auxiliary loss)	38.98%	85.95%
+ Multimodal Tokenizer	40.19%	88.84%
+ Feature Anticipation Loss	40.24%	89.26%
+ Verb & Object Loss	41.06%	93.39%
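Reading the module-combination table as cumulative (each row adds one component to the row above, which is an interpretation of the "+" prefixes), the per-component gains can be computed directly. Values are copied from the table. The variable names are illustrative.

```python
# (configuration, Assembly101 action %, H2O action %) copied from the table above.
rows = [
    ("Baseline", 38.98, 85.95),
    ("+ Multimodal Tokenizer", 40.19, 88.84),
    ("+ Feature Anticipation Loss", 40.24, 89.26),
    ("+ Verb & Object Loss", 41.06, 93.39),
]

# Gain of each row over the previous one, in percentage points.
deltas = [
    (name, round(a - prev_a, 2), round(h - prev_h, 2))
    for (_, prev_a, prev_h), (name, a, h) in zip(rows, rows[1:])
]
for name, d_assembly, d_h2o in deltas:
    print(f"{name}: {d_assembly:+.2f} / {d_h2o:+.2f}")
```

Under that reading, the tokenizer and the verb and object losses account for most of the improvement on both datasets, while the anticipation loss adds a smaller increment.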

Key Findings

Micro-action length of 15 is optimal: Too short (frame-by-frame) loses trajectory information, while too long (full sequence) loses local fine-grained motion. 15 frames achieves the best balance.
Just 1 RGB frame outperforms video-only TSM: This shows that pose is efficient enough for motion encoding, and RGB only needs to provide object semantics.
3D pose vs. 2D pose: 3D inputs outperform 2D inputs by about 4.8 percentage points (63.70% vs. 58.92%), because hand self-occlusion is severe and 2D projections discard depth.
Single-view + Pose is comparable to multi-view: Single-view RGB + 3D pose achieves a verb recognition accuracy of 69.23%, which is close to the 8-view RGB fusion (70.99%) while utilizing \(5\times\) fewer FLOPs.
Cross-view generalization: Training on view 4 and testing on view 1 still outperforms TSM directly trained on view 1, indicating that 3D pose provides native view invariance.

Highlights & Insights

In-depth statistical analysis of hand vs. whole-body skeletons: Quantitatively showing that hand joints are highly coupled via Pearson correlation coefficient (0.93 vs. 0.33) provides a solid motivation for micro-action temporal decomposition. This approach of "analyzing data characteristics first, then designing the architecture" is highly instructive.
Lagrangian trajectory encoding: Treating each joint as an independently tracked entity and characterizing it by its temporal trajectory rather than spatial snapshots elegantly exploits the coupled motion of hand joints while achieving native spatio-temporal factorization.
Complementary design of dense pose + sparse RGB: This paradigm of "cheap modalities providing temporal resolution, expensive modalities providing semantics" can be transferred to other multimodal scenarios (e.g., IMU + sparse video, audio + sparse frames).
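Feature Anticipation Loss: This loss turns the causal intuition of "initial visual state + hand motion \(\rightarrow\) prediction of final state" into a self-supervised signal. In the ablation table, adding it raises H2O action accuracy from 88.84% to 89.26% (about 0.4 percentage points); the larger gain of about 4 points on H2O comes from the verb and object losses (89.26% to 93.39%).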

Limitations & Future Work

Reliance on the quality of hand pose estimation: Pose estimates are noisy under occlusion or when a hand leaves the field of view. The authors acknowledge this issue but do not propose a solution. Future work could design noise-robust trajectory encodings or introduce pose confidence weighting.
Uniform sampling of RGB frames: It is currently assumed that the RGB frames of all micro-actions are equally important, but in reality, key frames (e.g., the moment of grasping) contain more information than transition frames. Adaptive frame sampling strategies could further improve efficiency and accuracy.
Frame Encoder relies on large models: A frozen DINOv2 ViT-g/14 is used by default. ResNet50 is offered as an alternative, but it requires extra TSM pre-training. A lightweight frame encoder trained end to end remains an open improvement.
Validation limited to two datasets: Both Assembly101 and H2O are hand interactions in controlled environments, lacking validation in in-the-wild scenarios (e.g., Ego4D, Epic-Kitchens).

Related Work & Insights

vs. MS-G3D / ISTA-Net (skeleton methods): They model long-range spatio-temporal dependencies through graph convolutions over large receptive fields or full-sequence attention, which is suitable for whole-body skeletons but redundant for hands. HandFormer avoids long-range dependencies via micro-action decomposition, proving more efficient and accurate in hand scenarios.
vs. RGBPoseConv3D (multimodal skeleton + RGB): Directly transferring whole-body skeleton methods to hands yields poor results (Assembly101 action 33.61%). HandFormer achieves 41.06% through dedicated hand trajectory encoding + sparse RGB fusion, demonstrating that hand movements require specialized designs.
vs. TSM / SlowFast (video methods): Video-only methods perform well in action recognition but are computationally expensive. HandFormer delegates motion modeling to the low-cost pose stream, while RGB only needs sparse sampling to provide object semantics.
The concept of micro-actions in this paper resembles tokenization in NLP: continuous motion sequences are discretized into semantic units, which are then modeled as a sequence.

Rating

Novelty: ⭐⭐⭐⭐ The analysis of differences between hand and whole-body skeletons is a strong starting point, and the micro-action + trajectory encoding design is novel, though the individual modules (TCN, Transformer, multimodal fusion) are existing components.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ The ablation studies are comprehensive (number of joints, micro-action length, loss components, 2D vs. 3D, cross-view, multi-view alternatives, frame count ablations), achieving SOTA performance on both datasets.
Writing Quality: ⭐⭐⭐⭐⭐ The motivation is clearly derived, and the logical flow from statistical analysis to design decisions is highly coherent, with informative figures and tables.
Value: ⭐⭐⭐⭐ Highly practical for hand interaction recognition in AR/VR scenarios; the framework paradigm of dense pose + sparse RGB is highly transferable.