
HuMoCon: Concept Discovery for Human Motion Understanding

Conference: CVPR 2025
arXiv: 2505.20920
Code: Coming soon
Area: Video Understanding/Human Motion Analysis
Keywords: Human Motion Understanding, Concept Discovery, Video-Motion Alignment, Velocity Reconstruction, VQ-VAE, Multimodal LLM

TL;DR

HuMoCon is a motion-video understanding framework designed for human action analysis. Its core innovation lies in the encoder pre-training stage, where it discovers semantic motion concepts (codebook) via explicit video-motion feature alignment and a velocity reconstruction-based high-frequency information preservation mechanism. This significantly enhances the human motion understanding and reasoning capabilities of downstream LLMs.

Background & Motivation

Background: Human activity understanding is a fundamental task for building human-centric AI systems. While recent advances in LLMs have driven progress in video and motion sequence analysis, precise and fine-grained understanding of human actions remains a challenge. Traditional methods rely on pre-defined action categories (classification being the mainstream formulation), which limits their flexibility, while statistical methods (such as action quality assessment) require extensive expert knowledge.

Limitations of Prior Work: (1) Motion sequence data is costly to acquire and annotate, and it lacks environmental context; (2) Although video data is informative and easy to obtain, it contains a large amount of irrelevant noise (e.g., the model might learn biases like "running typically occurs in a stadium"); (3) Pioneering works like MotionLLM handle both video and motion but only achieve implicit alignment via paired data during the LLM fine-tuning stage, leaving the encoders without explicit cross-modal alignment; (4) Although the Masked Autoencoder framework is effective, its masking strategy leads to the loss of high-frequency information, resulting in over-smoothed temporal reconstructions.

Key Challenge: Videos provide rich context but contain noise, while motion sequences are precise but lack context. A method is needed to explicitly fuse the complementary information of both modalities during the encoding stage while preserving high-frequency motion details.

Goal: To design a motion concept discovery framework that simultaneously achieves explicit cross-modal alignment, high-frequency motion information preservation, and semantic motion concept extraction during the encoder pre-training stage.

Method

Overall Architecture

HuMoCon adopts a two-stage pipeline. (1) Encoder pre-training: human motion concept discovery is performed with a VQ-VAE structure, and the video and motion encoders are co-trained using four learning objectives (masked reconstruction, discriminative informativeness, actionable informativeness, and feature alignment). (2) LLM fine-tuning: a projection layer is first trained to map the encoded features into the LLM space, followed by multimodal instruction tuning so that the LLM can understand video and motion inputs.

Key Designs

VQ-VAE Concept Discovery Framework:
- Function: Encodes video/motion into discrete semantic concept representations.
- Mechanism: The encoder maps input into continuous features, which are then quantized into discrete representations in a codebook via a VQ-VAE. The codebook represents the discovered motion concepts. This discretization not only enhances the semantic nature of the features but also improves representational robustness.
- Masked Reconstruction: Masks are applied to the quantized discrete features, and a decoder reconstructs the original inputs from them, which drives the model to learn low-frequency semantic features.
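
The quantize-then-mask step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the nearest-neighbour lookup is the standard VQ-VAE quantization rule, while the toy codebook, the feature values, the `mask_ratio`, and the `mask_token` value are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each continuous frame feature to the index of its nearest codeword."""
    indices = []
    for f in features:
        distances = [squared_distance(f, c) for c in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def mask_indices(indices, mask_ratio, mask_token=-1, seed=0):
    """Replace a random subset of concept indices with a mask token.

    mask_token and mask_ratio are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    n_mask = int(len(indices) * mask_ratio)
    masked_positions = set(rng.sample(range(len(indices)), n_mask))
    return [mask_token if i in masked_positions else idx
            for i, idx in enumerate(indices)]

# Toy example: 3 hypothetical codewords ("concepts") in a 2-D feature space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.1, 0.8], [0.95, 0.05]]

concept_ids = quantize(features, codebook)      # one concept index per frame
masked_ids = mask_indices(concept_ids, mask_ratio=0.5)
```

A decoder would then be trained to reconstruct the original input from `masked_ids`; that part is omitted here because the source does not specify the decoder.
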
Velocity Reconstruction Mechanism:
- Function: Preserves high-frequency motion information and alleviates the temporal over-smoothing issue of masked autoencoders.
- Mechanism: Defines "state" as frame-level encoded features and "velocity" as the difference between adjacent frame states—where the velocity of a video is optical flow, and the velocity of motion is the joint displacement difference between adjacent frames. Two auxiliary learning objectives are introduced:
  - Discriminative Informativeness: A hypernetwork generates classifiers based on codebook vectors to determine if the input state matches its corresponding concept category, thereby enhancing the discriminability of the concepts.
  - Actionable Informativeness: Utilizes the gradient information from the discriminative hypernetwork to reconstruct velocity, capturing dynamic change details based on the insight that "the gradient of a discriminative function can indicate the direction of state transitions."
- Design Motivation: Masked reconstruction only captures low-frequency/smooth features; explicitly reconstructing velocity (i.e., frame-to-frame shifts) recovers high-frequency dynamic information.
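
To make the design motivation concrete, the sketch below compares a state-level error with a velocity-level error on a toy sequence. It only illustrates why frame differences expose over-smoothing; it does not reproduce the gradient-based velocity reconstruction through the hypernetwork, and the 1-D sequences are made-up values.

```python
def velocities(states):
    """Frame-to-frame differences: v_t = s_{t+1} - s_t, per feature dimension."""
    return [[b - a for a, b in zip(s0, s1)]
            for s0, s1 in zip(states, states[1:])]

def mse(xs, ys):
    """Mean squared error over two equally shaped sequences of vectors."""
    total, count = 0.0, 0
    for x, y in zip(xs, ys):
        for a, b in zip(x, y):
            total += (a - b) ** 2
            count += 1
    return total / count

# A 1-D zig-zag motion and an over-smoothed reconstruction of it (toy data).
target = [[0.0], [1.0], [0.0], [1.0], [0.0]]
smooth = [[0.5], [0.5], [0.5], [0.5], [0.5]]

state_error = mse(target, smooth)                              # 0.25
velocity_error = mse(velocities(target), velocities(smooth))   # 1.0
```

The flat reconstruction looks acceptable under the state-level error, but the velocity-level error is four times larger because every frame-to-frame change has been lost.
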
Explicit Cross-Modal Feature Alignment:
- Function: Achieves explicit alignment of video and motion features during the encoding stage.
- Mechanism: Collects paired video-motion data from Motion-X, maps the discrete video and motion features into a shared space via two projection layers, and uses a cosine-similarity-based alignment loss (with temperature-scaled softmax normalization) to pull paired features closer and push unpaired features apart.
- Design Motivation: Videos provide environmental context while motion provides precise human-centric dynamics. Explicit alignment allows the two modalities to complement each other, proving more effective than the implicit alignment in MotionLLM.
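
A minimal sketch of an alignment loss of this kind is given below, assuming an InfoNCE-style form in which each video feature is classified against all motion features in the batch. The exact formulation, the temperature value of 0.1, and the toy features (taken to be already projected into the shared space) are assumptions, not details from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(video_feats, motion_feats, temperature=0.1):
    """Mean cross-entropy of matching video i to motion i within the batch.

    Logits are temperature-scaled cosine similarities; the softmax is
    computed with the log-sum-exp trick for numerical stability.
    """
    losses = []
    for i, v in enumerate(video_feats):
        logits = [cosine(v, m) / temperature for m in motion_feats]
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax of the paired entry
    return sum(losses) / len(losses)

# Toy batch of two pairs (hypothetical, already-projected features).
video = [[1.0, 0.0], [0.0, 1.0]]
motion_aligned = [[0.9, 0.1], [0.1, 0.9]]   # pair i is closest to video i
motion_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

loss_aligned = alignment_loss(video, motion_aligned)
loss_swapped = alignment_loss(video, motion_swapped)
```

The loss is low when each video is most similar to its own motion and high when the pairing is wrong, which is the pull-together and push-apart behaviour described above.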

Loss & Training

The total loss consists of five parts:

\[\mathcal{L}^{\text{total}} = \mathcal{L}^{\text{rec}}_{\text{motion}} + \mathcal{L}^{\text{rec}}_{\text{video}} + \lambda^{\text{dis}}\mathcal{L}^{\text{dis}} + \lambda^{\text{act}}\mathcal{L}^{\text{act}} + \lambda^{\text{align}}\mathcal{L}^{\text{align}}\]

\(\mathcal{L}^{\text{rec}}\): Masked reconstruction loss (L2), ensuring the encoded features retain sufficient input information.
\(\mathcal{L}^{\text{dis}}\): Discriminative informativeness loss (cross-entropy), enhancing the discriminability of the concepts.
\(\mathcal{L}^{\text{act}}\): Actionable informativeness loss (L2), reconstructing velocity via gradient information.
\(\mathcal{L}^{\text{align}}\): Cross-modal alignment loss (cosine similarity + softmax), aligning paired video-motion features.
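
The weighted sum above can be written as a small helper, shown here only to make the structure explicit: the two reconstruction terms enter unweighted, and the other three are scaled by their coefficients. The default weights and the example loss values are placeholders, since the source does not report the coefficients used.

```python
def total_loss(rec_motion, rec_video, dis, act, align,
               lambda_dis=1.0, lambda_act=1.0, lambda_align=1.0):
    """Combine the five loss terms following the total-loss equation.

    The lambda defaults are hypothetical; the source does not list them.
    """
    return (rec_motion + rec_video
            + lambda_dis * dis
            + lambda_act * act
            + lambda_align * align)

# Made-up per-term values, chosen so the arithmetic is easy to follow.
example = total_loss(0.5, 0.25, 1.0, 2.0, 4.0,
                     lambda_dis=0.5, lambda_act=0.25, lambda_align=0.125)
# 0.5 + 0.25 + 0.5*1.0 + 0.25*2.0 + 0.125*4.0 = 2.25
```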

During the LLM fine-tuning stage, LoRA (rank=8) is used for lightweight adaptation.
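
To give a sense of why rank-8 LoRA counts as lightweight, the sketch below counts trainable parameters for one adapted weight matrix. LoRA replaces the update to a `d_out x d_in` weight with the product of a `d_out x r` and an `r x d_in` matrix; the layer width of 4096 is a hypothetical value, and the source does not state which layers are adapted.

```python
def lora_param_count(d_in, d_out, rank=8):
    """Trainable parameters of one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters of the frozen dense weight matrix being adapted."""
    return d_in * d_out

d = 4096  # hypothetical hidden size, not taken from the paper
lora = lora_param_count(d, d)    # 65536
full = full_param_count(d, d)    # 16777216
fraction = lora / full           # 0.00390625, i.e. about 0.4%
```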

Key Experimental Results

Benchmark/Metric	HuMoCon	MotionLLM	Comparison
Activity-QA (Video)	SOTA	Suboptimal	Significantly outperforms MotionLLM
BABEL-QA (Motion)	SOTA	Suboptimal	Outperforms prior work quantitatively and qualitatively
Concept Discovery Visualization	Clear semantic clustering	Implicit alignment	More meaningful motion concepts

Highlights & Insights

"Velocity Reconstruction" alleviates the over-smoothing issue of masked autoencoders: This is a general insight. Masked reconstruction naturally loses high-frequency information, which can be recovered by explicitly reconstructing frame differences (velocity/optical flow). Using the discriminator gradient to assist velocity reconstruction is also an ingenious design.
Concept discovery paradigm: The VQ-VAE codebook is not merely a discretization tool but is endowed with semantic meaning as "motion concepts"—each codeword corresponds to an atomic motion pattern, offering better interpretability than directly feeding continuous features into the LLM.
The difference between explicit and implicit alignment: Experiments clearly show that formulating explicit cross-modal alignment at the encoder stage is far more effective than relying solely on implicit alignment via paired data during LLM fine-tuning.
Borrowing InfoCon ideas from robotic manipulation: Migrating the concepts of discriminative and actionable informativeness from robotic manipulation to human motion understanding represents an inspiring cross-domain knowledge transfer.

Limitations & Future Work

The alignment loss is computed only on paired data; unpaired video or motion data cannot leverage this objective.
Encoder pre-training requires simultaneous processing of both video and motion data, leading to relatively high computational costs.
The size of the VQ-VAE codebook is a hyperparameter, which may affect the granularity of the discovered concepts.
The experiments are mainly validated on Activity-QA and BABEL-QA; broader application scenarios (e.g., action prediction, motion correction) remain to be explored.

Related Work

Human Motion Understanding: MotionCLIP (CLIP-aligned OOD motion generation), MotionGPT (LLM unifying multiple motion tasks), MotionLLM (the first dual-modality video+motion LLM, implicit alignment)
Video Understanding: Video-LLaVA (image+video dual-modality reasoning), with subsequent works extending to faster and more precise reasoning
Multimodal Pre-training: CLIP (image-text contrastive learning), VALOR/VAST (cross-modal alignment for enhanced robustness)

Rating

Novelty: ⭐⭐⭐⭐ (The combination of velocity reconstruction and explicit alignment offers a novel encoder pre-training paradigm)
Value: ⭐⭐⭐⭐ (A general human activity analysis framework, with code to be open-sourced soon)
Technical Depth: ⭐⭐⭐⭐⭐ (VQ-VAE concept discovery + discriminative/actionable informativeness + cross-modal alignment, featuring an exquisitely coordinated multi-loss design)
Writing Quality: ⭐⭐⭐⭐ (The system overview is clear, though it requires careful reading due to the formulas)