Prompting Future Driven Diffusion Model for Hand Motion Prediction

Conference: ECCV 2024
Code: None
Area: Diffusion Models / Hand Motion Prediction
Keywords: Hand Motion Prediction, Diffusion Models, Prompt Learning, First-person/Third-person, Future-driven

TL;DR

This paper proposes PromptFDDM, a prompt-based, future-driven diffusion model for hand motion prediction. It combines a Spatial-Temporal Extractor Network (STEN) with future-driven guidance from a Ground Truth Extractor Network (GTEN) and a Reference Data Generation Network (RDGN), plus interactive prompts extracted from the observed motion. The model reports state-of-the-art MPJPE on both first-person (FPHA) and third-person (HO3D) hand motion prediction.

Background & Motivation

Background: Hand motion prediction is crucial for enhancing user experience in AR/VR and ensuring safe control of teleoperated robotic arms. Prior work has primarily focused on whole-body human motion prediction or hand trajectory prediction, with relatively few studies addressing direct hand skeleton motion prediction (joint angles and positions). Hand motion prediction faces unique challenges: the hand skeleton is highly compact (with a large number of joints within a small spatial range), meaning slight prediction errors can result in severe pose distortion.

Limitations of Prior Work: (1) Compactness of hand skeletons: Compared to whole-body skeletons, hand joints are highly dense in space with a low signal-to-noise ratio, making it difficult for traditional trajectory prediction methods to capture fine-grained finger movements; (2) Most existing methods focus solely on the third-person perspective, neglecting the more critical first-person perspective in AR/VR; (3) Deterministic prediction methods cannot model the multi-modality of hand motion—the same observed sequence may correspond to multiple plausible future movements.

Key Challenge: Hand motion features high uncertainty and multi-modality (a single gesture can lead to multiple subsequent movements), yet simultaneously demands precise prediction (due to the compact skeleton and low tolerance for errors). Simply employing diffusion models for probabilistic prediction leads to excessive diversity but insufficient accuracy.

Goal: (1) How to balance diversity and precision in hand motion prediction; (2) How to leverage future information to guide the diffusion model to learn more accurate predictions; (3) How to enhance the model's understanding of observed motions through a prompt mechanism.

Key Insight: Use the ground-truth future motion as a guidance signal during training, and train a Reference Data Generation Network (RDGN) to stand in for it. During inference, the RDGN generates "alternative future data" that guides the prediction in place of the unavailable ground truth. In parallel, interactive prompts extracted from the observed motion provide additional motion context.

Core Idea: Guide the diffusion model toward precise hand motion prediction with two conditions: a "future-driven" signal (ground truth during training, generated reference data during inference) and interactive prompts derived from the observation.

Method

Overall Architecture

PromptFDDM comprises three main networks: (1) a Spatial-Temporal Extractor Network (STEN), which is the core prediction network that utilizes a diffusion process to generate future hand motions; (2) a Ground Truth Extractor Network (GTEN), which extracts guidance features from real future motions during training; and (3) a Reference Data Generation Network (RDGN), which generates alternative future data during inference to substitute for the unavailable ground truth. Additionally, a prompt generation module extracts interactive prompts from observed motions.

Key Designs

Spatial-Temporal Extractor Network (STEN):
- Function: Core prediction network. It predicts future hand motion through the diffusion process, conditioned on guidance signals and prompts.
- Mechanism: STEN takes three inputs: (a) the noised future motion \(x_t\) (a noise sample in the diffusion process), (b) the guidance feature \(g\) from GTEN (during training) or RDGN (during inference), and (c) the interactive prompt \(p\). Internally, it utilizes a spatial-temporal Transformer architecture to capture dependencies among different joints along the spatial dimension, and to model the dynamic characteristics of motion sequences along the temporal dimension. Guidance features are injected via cross-attention, while prompts are injected via adaptive layer normalization. The training objective is denoising—predicting the noise \(\epsilon\) added to \(x_t\).
- Design Motivation: Dual spatial-temporal modeling is critical for hand motion: spatially, fingers have strong coupling (coordinating together during grasping), and temporally, motions have smoothness constraints. Dual conditioning on both guidance and prompts enables the model to leverage richer context.
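
The two injection paths described for STEN can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the single-head attention, the `(1 + scale)` parameterization of adaptive layer normalization, the weight names in `params`, the 21-token joint count, and all tensor shapes are hypothetical choices made for the example.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Normalize each token over its feature dimension (no learned affine).
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def adaptive_layer_norm(h, p, w_scale, w_shift):
    # Assumed AdaLN form: the prompt vector p yields a per-feature
    # scale and shift that modulate the normalized tokens.
    scale = p @ w_scale
    shift = p @ w_shift
    return layer_norm(h) * (1.0 + scale) + shift

def cross_attention(h, g, w_q, w_k, w_v):
    # Single-head cross-attention: motion tokens h (n, d) query
    # guidance tokens g (m, d).
    q, k, v = h @ w_q, g @ w_k, g @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sten_block(h, g, p, params):
    # One illustrative block: prompt-conditioned normalization, then a
    # residual cross-attention read from the guidance features.
    h_norm = adaptive_layer_norm(h, p, params["w_scale"], params["w_shift"])
    attended = cross_attention(h_norm, g, params["w_q"], params["w_k"], params["w_v"])
    return h + attended

rng = np.random.default_rng(0)
d, d_p = 8, 4
params = {
    "w_scale": rng.normal(size=(d_p, d)) * 0.1,
    "w_shift": rng.normal(size=(d_p, d)) * 0.1,
    "w_q": rng.normal(size=(d, d)) * 0.1,
    "w_k": rng.normal(size=(d, d)) * 0.1,
    "w_v": rng.normal(size=(d, d)) * 0.1,
}
h = rng.normal(size=(21, d))  # noised motion tokens (assumed: one per joint)
g = rng.normal(size=(5, d))   # guidance tokens (from GTEN in the source)
p = rng.normal(size=(d_p,))   # interactive prompt vector
out = sten_block(h, g, p, params)
```

With zero `w_scale` and `w_shift`, `adaptive_layer_norm` reduces to plain layer normalization, which makes the prompt's role as a modulation signal easy to see.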

Ground Truth Extractor Network (GTEN) and Reference Data Generation Network (RDGN):
- Function: GTEN extracts guidance signals from real future motion during training, whereas RDGN generates alternative guidance signals during inference.
- Mechanism: GTEN is an encoder network that takes the ground-truth future hand motion sequence \(y\) as input and outputs the guidance feature \(g_{gt} = \text{GTEN}(y)\). This feature carries "summary information of future motion" that helps STEN denoise more accurately. RDGN is a generative network that takes the observed sequence \(x_{obs}\) as input and generates "reference future data" \(\hat{y} = \text{RDGN}(x_{obs})\), which GTEN then encodes into the guidance feature \(g_{ref} = \text{GTEN}(\hat{y})\). RDGN is implemented as a simple GRU network and trained with an MSE loss to regress the future motion directly. Its predictions may not be fully precise, but they provide reasonable "directional guidance."
- Design Motivation: This "using GT in training, substituting with generated data in inference" strategy resolves a fundamental contradiction—future information is available to guide learning during training but is absent during inference. RDGN bridges this gap. Even if the output of RDGN is imperfect, the rough directional information it provides is sufficient to assist STEN in fine-grained denoising within the correct region.
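
The training/inference switch between \(g_{gt}\) and \(g_{ref}\) can be sketched as follows. Both networks here are deliberately simple stand-ins chosen for the example: `gten` is a mean-pool plus projection, and `rdgn` is a constant-velocity extrapolation rather than the GRU described in the source. The function names, the 10-frame horizon, and the shapes are assumptions.

```python
import numpy as np

def gten(future_motion, w_enc):
    # Stand-in encoder: mean-pool the future sequence over time, then project.
    return np.tanh(future_motion.mean(axis=0) @ w_enc)

def rdgn(x_obs, horizon):
    # Stand-in for the GRU regressor: constant-velocity extrapolation
    # from the last two observed frames. NOT the paper's network.
    velocity = x_obs[-1] - x_obs[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return x_obs[-1] + steps * velocity

def guidance_feature(x_obs, w_enc, y_future=None, horizon=10):
    # Training: encode the ground-truth future (g_gt).
    # Inference: encode RDGN's reference future instead (g_ref).
    if y_future is not None:
        return gten(y_future, w_enc)
    return gten(rdgn(x_obs, horizon), w_enc)

rng = np.random.default_rng(0)
dim = 63                                   # assumed: 21 joints x 3 coordinates
direction = rng.normal(size=(dim,))
x_obs = np.arange(5)[:, None] * direction      # frames 0..4 of a linear motion
y_true = np.arange(5, 15)[:, None] * direction # frames 5..14 (the true future)
w_enc = rng.normal(size=(dim, 16)) * 0.1

g_gt = guidance_feature(x_obs, w_enc, y_future=y_true)  # training path
g_ref = guidance_feature(x_obs, w_enc)                  # inference path
```

On this toy linear motion the extrapolation is exact, so `g_ref` matches `g_gt`; on real data the two differ, and the source's claim is that a rough `g_ref` is still useful.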

Interactive Prompt Generation:
- Function: Extracts contextual information from the observed hand motion sequence as an additional condition to inject into STEN.
- Mechanism: The prompt generation module analyzes motion patterns in the observed sequence to extract key features, such as motion speed, directional trends, and periodic patterns. Specifically, multi-scale temporal encoding is applied to the observed sequence—short-term windows capture local dynamics (e.g., finger bending rate), while long-term windows capture global trends (e.g., arm movement direction). These multi-scale features are concatenated and mapped to a prompt vector via an MLP. Prompts are injected into each layer of STEN via adaptive layer normalization.
- Design Motivation: Observed sequences contain critical clues about future motion—the current motion trend heavily influences the immediate future. The prompt mechanism enables STEN to "first comprehend what is currently happening" before "predicting what will happen next."
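
A minimal sketch of the multi-scale prompt extraction, assuming mean frame-to-frame velocity as the per-window feature and a two-layer ReLU MLP as the mapping. The window lengths (3 and 10 frames), the layer sizes, and the function names are illustrative assumptions; the source does not specify them.

```python
import numpy as np

def window_velocity(x_obs, window):
    # Mean frame-to-frame displacement over the last `window` frames.
    recent = x_obs[-window:]
    return np.diff(recent, axis=0).mean(axis=0)

def interactive_prompt(x_obs, w1, b1, w2, b2, short=3, long=10):
    # Short window: local dynamics. Long window: global trend.
    feats = np.concatenate([
        window_velocity(x_obs, short),
        window_velocity(x_obs, long),
    ])
    hidden = np.maximum(0.0, feats @ w1 + b1)  # ReLU
    return hidden @ w2 + b2                    # prompt vector p

rng = np.random.default_rng(0)
dim = 63                                            # assumed: 21 joints x 3 coordinates
x_obs = np.cumsum(rng.normal(size=(12, dim)), axis=0)  # synthetic observed sequence
w1 = rng.normal(size=(2 * dim, 32)) * 0.1
b1 = np.zeros(32)
w2 = rng.normal(size=(32, 16)) * 0.1
b2 = np.zeros(16)
prompt = interactive_prompt(x_obs, w1, b1, w2, b2)
```

The resulting vector plays the role of \(p\) and would be fed to each STEN layer through adaptive layer normalization.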

Loss & Training

STEN utilizes the standard diffusion denoising loss \(\mathcal{L}_{STEN} = \mathbb{E}\|\epsilon - \epsilon_\theta(x_t, t, g, p)\|^2\). RDGN utilizes the MSE reconstruction loss \(\mathcal{L}_{RDGN} = \|y - \text{RDGN}(x_{obs})\|^2\). Training is conducted in two stages: RDGN is trained first, and then STEN+GTEN are trained with RDGN fixed. During inference, RDGN and GTEN perform a single forward pass to generate guidance features, followed by iterative denoising in STEN.
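
The two losses and the inference loop can be sketched in NumPy under the assumption of a standard DDPM formulation. The linear beta schedule, the 50-step horizon, and the ancestral sampler are assumptions (the source only says "standard diffusion denoising loss" and "iterative denoising"), and `dummy_model` is a placeholder, not a trained STEN. The losses use a mean over elements, which differs from the squared norm in the formulas only by a constant factor.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule (assumed; not given in the source).
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(y, t, alpha_bar, eps):
    # Forward process: blend clean future motion y with Gaussian noise eps.
    return np.sqrt(alpha_bar[t]) * y + np.sqrt(1.0 - alpha_bar[t]) * eps

def sten_loss(eps, eps_pred):
    # Denoising loss: squared error between true and predicted noise.
    return float(np.mean((eps - eps_pred) ** 2))

def rdgn_loss(y, y_hat):
    # MSE regression loss for the reference-data generator.
    return float(np.mean((y - y_hat) ** 2))

def sample(eps_model, shape, g, p, betas, alpha_bar, rng):
    # Ancestral DDPM sampling. The guidance g and prompt p are computed
    # once before the loop and reused at every denoising step.
    x = rng.normal(size=shape)
    for t in range(len(betas) - 1, -1, -1):
        eps_pred = eps_model(x, t, g, p)
        coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps_pred) / np.sqrt(1.0 - betas[t])
        noise = rng.normal(size=shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
y = rng.normal(size=(10, 63))          # clean future motion (synthetic)
eps = rng.normal(size=y.shape)
x_t = q_sample(y, 25, alpha_bar, eps)  # noised sample at step t = 25

# Knowing the true noise, the clean motion is recovered exactly.
y_rec = (x_t - np.sqrt(1.0 - alpha_bar[25]) * eps) / np.sqrt(alpha_bar[25])

dummy_model = lambda x, t, g, p: np.zeros_like(x)  # placeholder for STEN
y_sampled = sample(dummy_model, y.shape, None, None, betas, alpha_bar, rng)
```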

Key Experimental Results

Main Results

Dataset	Perspective	Metric (MPJPE↓)	PromptFDDM	Prev. SOTA	Gain
FPHA	First-person	MPJPE@80ms	8.2	9.7	-15.5%
FPHA	First-person	MPJPE@400ms	32.1	38.6	-16.8%
HO3D	Third-person	MPJPE@80ms	5.4	6.8	-20.6%
HO3D	Third-person	MPJPE@400ms	24.8	29.3	-15.4%
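
For reference, MPJPE (mean per-joint position error) is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints and frames. The sketch below computes it on synthetic arrays and re-derives the Gain column from the table values; the array shapes and function names are assumptions for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    # pred, gt: (frames, joints, 3). Mean Euclidean distance per joint.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def relative_change(ours, baseline):
    # Percentage change relative to the baseline (negative = lower error).
    return 100.0 * (ours - baseline) / baseline

gt = np.zeros((4, 21, 3))
pred = gt + np.array([3.0, 4.0, 0.0])  # every joint is off by a 3-4-5 offset
error = mpjpe(pred, gt)

# (PromptFDDM, previous SOTA) pairs copied from the table above.
rows = [(8.2, 9.7), (32.1, 38.6), (5.4, 6.8), (24.8, 29.3)]
gains = [round(relative_change(ours, prev), 1) for ours, prev in rows]
```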

Ablation Study

Configuration	MPJPE@400ms (FPHA)	Description
Full PromptFDDM	32.1	Full model
w/o GTEN+RDGN guidance	39.2	No future-driven guidance; MPJPE rises 22.1%
w/o Prompt	35.7	No observation prompt; MPJPE rises 11.2%
w/o RDGN (no guidance at inference)	37.8	Missing guidance signal during inference
Substituting RDGN with direct prediction	36.4	No encoding via GTEN

Key Findings

- Future-driven guidance (GTEN+RDGN) is the most critical module: removing it raises MPJPE@400ms on FPHA by over 20% (32.1 to 39.2), indicating that guidance signals are essential for directing the diffusion model's denoising path.
- The contribution of prompts is more significant in long-term predictions (400ms): observed motion trend information offers limited help for short-term predictions but is crucial for long-term planning.
- Relative gains over the previous SOTA are larger under the first-person perspective at the 400ms horizon (16.8% on FPHA vs. 15.4% on HO3D), potentially because first-person motion is more complex (influenced by egocentric camera motion). At 80ms the ordering reverses (15.5% vs. 20.6%).
- Although the reference data generated by RDGN is not fully precise (its MSE is comparatively high), it is effective as a guidance signal: removing RDGN at inference raises MPJPE@400ms from 32.1 to 37.8. This suggests that the diffusion model only requires "rough directional" guidance.

Highlights & Insights

- The "learn with answers, infer with estimation" guidance strategy is of high practical significance. The GTEN-RDGN combination cleverly resolves the information leakage issue: GT guides the learning of high-quality denoising strategies during training, and generated substitutes approximate the GT guidance effect during inference. This teacher-student style inference strategy can be extended to other conditional diffusion models.
- The prompt injection method is more flexible than simple conditional concatenation: injecting prompts into each layer via adaptive layer normalization allows the model to utilize observation information across different levels of abstraction.
- The application scenario of hand motion prediction is well-chosen: the compactness of hand skeletons amplifies errors in traditional methods, but the probabilistic nature of diffusion models can effectively handle multi-modality.

Limitations & Future Work

- Evaluated only on the FPHA and HO3D datasets, without larger-scale hand motion datasets.
- Ignores hand-object interaction constraints: when a hand grasps an object, its movement is bounded by the object's geometry, which the current method does not model.
- RDGN is a simple GRU, so the quality of its reference data is limited; a more powerful model could replace it.
- Inference speed may be limited by the number of diffusion steps, leaving feasibility for real-time AR/VR uncertain.
- Future work could explore incorporating hand image information (e.g., texture, depth maps) to enhance motion prediction.

Related Work

- vs LTD: LTD employs the discrete cosine transform for human motion prediction but is not optimized for hand skeleton compactness. PromptFDDM handles the multi-modality of hand motion better through diffusion models.
- vs MotionDiff: MotionDiff uses diffusion models for whole-body motion prediction but lacks future-driven guidance and prompt mechanisms. These two additions make PromptFDDM more suitable for hand motion prediction.
- vs MDM: MDM uses classifier-free guidance for motion generation. The future-driven guidance in PromptFDDM provides a different guidance paradigm based on future states rather than semantic labels.

Rating

Novelty: ⭐⭐⭐⭐ The combination of future-driven guidance and interactive prompts is novel.
Experimental Thoroughness: ⭐⭐⭐ Evaluated only on two datasets; real-time analysis is missing.
Writing Quality: ⭐⭐⭐⭐ The methodology description is clear, and the motivation for each module is well-articulated.
Value: ⭐⭐⭐ Although the hand motion prediction scenario is relatively narrow, it holds practical value for AR/VR.