🧑 Human Understanding

🔬 ICLR2026 · 45 paper notes

📌 Same area in other venues: 📷 CVPR2026 (138) · 🧪 ICML2026 (5) · 🤖 AAAI2026 (20) · 🧠 NeurIPS2025 (21) · 📹 ICCV2025 (41)

🔥 Top topics: Human Pose ×7 · Multimodal/VLM ×4 · Diffusion Models ×3 · LLM ×2

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behaviour Analysis

This paper proposes BAH, the first multimodal dataset for Ambivalence/Hesitancy (A/H) recognition in videos. It contains 1,118 videos (8.26 hours) from 224 participants across 9 Canadian provinces, annotated by behavioral science experts, and provides baseline experimental results at both frame and video levels.

BANZ-FS: BANZSL Fingerspelling Dataset

This paper constructs BANZ-FS, the first large-scale dataset for two-handed fingerspelling in BANZSL (British, Australian, and New Zealand Sign Language). It aggregates over 35K multi-level aligned fingerspelling instances from broadcast news, laboratory recordings, and web vlogs, and systematically benchmarks SOTA models across detection, isolated recognition, and contextual recognition tasks.

CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

CLUTCH combines three components: 3D-HIW, a set of 32,000 in-the-wild hand motion samples auto-labeled by a VLM; SHIFT, a decomposed VQ-VAE that discretizes trajectory and pose, and left and right hands, separately; and LLM fine-tuning with a geometric reconstruction loss in the motion space. It is the first to model text \(\leftrightarrow\) hand motion in in-the-wild scenarios (e.g., playing piano, kneading dough, writing), achieving SOTA performance in both text-to-motion and motion-to-text tasks.

Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics

The Q Avatar framework is proposed to quantify source model transferability via cross-domain Bellman consistency. By utilizing an adaptive, hyperparameter-free weighting function to hybridize source and target domain Q-functions, reliable knowledge transfer is achieved in cross-domain RL with different state-action spaces, guaranteeing no negative transfer regardless of source model quality or domain similarity.

Curvature-Guided Task Synergy for Skeleton based Temporal Action Segmentation

CurvSeg addresses the inherent conflict between "temporal invariance for classification" and "temporal sensitivity for boundary localization" in skeleton-based temporal action segmentation. It uses the geometric curvature of classification feature trajectories as a boundary prior: curvature is high within action segments and low at transitions. This establishes a bidirectional, closed-loop synergy between classification and localization, complemented by a dual-expert MoE that distills task-specific features. As a plug-and-play module, it improves the segmentation accuracy of baselines such as DeST and LaSA across four datasets.
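
A rough illustration of measuring how sharply a per-frame feature trajectory bends. The turning angle between consecutive displacement vectors is one common discrete proxy for curvature; the exact curvature definition used by CurvSeg is not given in this note, so the function below (`turning_angles`, a hypothetical name) is an assumption rather than the paper's formulation.

```python
import math

def turning_angles(traj):
    """Turning angle (radians) at each interior point of a feature trajectory.

    `traj` is a list of equal-length feature vectors, one per frame.
    This is a generic discrete-curvature proxy, not CurvSeg's own definition.
    """
    angles = []
    for prev, cur, nxt in zip(traj, traj[1:], traj[2:]):
        a = [c - p for p, c in zip(prev, cur)]   # displacement into the point
        b = [n - c for c, n in zip(cur, nxt)]    # displacement out of the point
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        if na == 0.0 or nb == 0.0:
            angles.append(0.0)  # no movement: treat the angle as zero
            continue
        cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp for safety
    return angles

straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
corner = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
print(turning_angles(straight))
print(turning_angles(corner))
```

A trajectory that moves in a straight line yields zero angles, while a right-angle turn yields an angle of π/2; a per-frame signal of this kind is what a boundary prior could be built on.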

DenseMarks: Learning Canonical Embeddings for Head Images via Point Trajectories

DenseMarks uses a ViT embedder to map every pixel of a head image to coordinates in a 3D canonical unit cube. Trained using automated pairs from off-the-shelf point trackers on in-the-wild talking head videos combined with contrastive loss, it achieves a cross-identity, cross-pose consistent, and interpretable dense correspondence representation, reaching SOTA in geometry-aware point matching and monocular head tracking.

Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

DHVAE explicitly decomposes dual-person interaction motion into three disentangled latent variables: "Person A action," "Person B action," and "Global interaction context." It applies contrastive learning constraints on the global latent variable to ensure contact plausibility and employs DDIM for diffusion denoising within a hierarchical latent space, achieving new SOTA results on InterHuman and InterX with a smaller and faster model.

EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation

EasyTune transforms the fine-tuning paradigm for diffusion models from "calculating reward gradients after the full denoising trajectory" to independently optimizing at each denoising step. This breaks the recursive gradient dependency between steps, reducing VRAM usage from \(O(T)\) to \(O(1)\) and enabling denser optimization. Combined with a Self-refined Preference Learning (SPL) module that converts retrieval models into motion reward models without human annotation, it outperforms DRaFT-50 by 7.7% in alignment (MM-Dist) on HumanML3D, while using only 31.16% of its additional VRAM and speeding up training by 7.3×.

EdgeCAPE: Edge Weight Prediction for Category-Agnostic Pose Estimation

EdgeCAPE is the first to introduce learnable weighted pose graph prediction into Category-Agnostic Pose Estimation (CAPE). By predicting edge weights and new edges for the skeleton graph, and incorporating a Markov Attention Bias to enhance spatial dependency modeling, it achieves SOTA on the MP-100 benchmark, with a 1.99% gain over GraphCape in the 1-shot setting.

EMBridge: Enhancing Gesture Generalization from EMG Signals Through Cross-modal Representation Learning

EMBridge proposes using hand poses as high-quality anchors. Through a triple mechanism of Q-Former, Masked Pose Reconstruction Loss (MPRL), and Community-Aware Soft Contrastive Learning (CASCLe), it aligns the representation space of noisy sEMG signals with a semantically structured pose space, achieving zero-shot EMG gesture classification on wearable devices for the first time.

EmoPrefer: Can Large Language Models Understand Human Emotion Preferences?

To address high evaluation costs in Descriptive Multimodal Emotion Recognition (DMER), EmoPrefer is proposed as the first emotion preference dataset and benchmark. It systematically explores whether MLLMs can replace human annotators for emotion preference judgment. The best approach (Qwen2.5-Omni) achieves a 67.21% two-class WAF, leaving room for further improvement.

Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis

The Event-T2M framework is proposed to decompose text prompts into event-level atomic actions. By combining a TMR encoder and an Event-level Cross-Attention (ECA) module with a Conformer-based diffusion model, it significantly improves the quality and semantic alignment of complex multi-event motion generation.

From Pixels to Semantics: Unified Facial Action Representation Learning for Micro-Expression Analysis

This paper proposes D-FACE, which utilizes a conditional VQ-VAE pre-trained on large-scale face videos to discretize facial muscle movements between two frames into "identity- and domain-invariant" semantic-level action tokens. By employing a Transformer with sparse attention pooling and emotion-description guided CLIP alignment for micro-expression recognition, it marks the first shift in MER from relying on pixel-level motion descriptors (optical flow/frame difference) to semantic-level tokens, while simultaneously enabling cross-identity/cross-domain micro-expression generation.

From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

This paper proposes a new paradigm called "Sparse Interleaved Input"—where \(N\) cameras each capture one frame at different time steps instead of synchronous full-frame sampling. The DenseWarper framework (epipolar spatial fusion + deformable convolution temporal completion) is then used to restore sparse interleaved heatmaps into dense, spatio-temporally consistent pose sequences. It outperforms traditional synchronous multi-view inputs with only \(1/N\) of the data and improves the effective output frame rate by \(N\) times.
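
To make the data budget concrete, here is a small sketch of an interleaved capture schedule. It assumes a simple round-robin order in which camera `t mod N` fires at time step `t`; the actual ordering used by the paper is not stated in this note, and both function names are hypothetical.

```python
def interleaved_schedule(num_cameras, num_steps):
    """Index of the single camera that captures at each time step.

    Round-robin ordering is an assumption made for illustration.
    """
    return [t % num_cameras for t in range(num_steps)]

def frame_counts(num_cameras, num_steps):
    """Frames captured under sparse interleaved vs. synchronous sampling."""
    sparse = num_steps                     # one frame per time step in total
    synchronous = num_cameras * num_steps  # every camera at every time step
    return sparse, synchronous

print(interleaved_schedule(4, 8))
print(frame_counts(4, 8))
```

With 4 cameras over 8 time steps, the sparse scheme captures 8 frames where synchronous sampling captures 32, which is the \(1/N\) ratio described above.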

GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences

This paper proposes the snippet paradigm: organizing gait silhouette sequences into multiple "snippets," where each snippet consists of frames randomly sampled within a continuous interval. This approach balances short-range temporal context with long-range temporal dependencies. Using a 2D convolutional backbone, it achieves 77.5% Rank-1 on Gait3D, surpassing all 3D convolutional methods.
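
The sampling idea can be sketched as follows. This is a minimal illustration that assumes equal-length, non-overlapping intervals and sorts the sampled indices within each snippet; the interval layout, the sorting, and the name `sample_snippets` are assumptions, not details taken from the paper.

```python
import random

def sample_snippets(num_frames, num_snippets, frames_per_snippet, rng=None):
    """Pick frame indices for each snippet from its own continuous interval.

    Intervals are equal-length and non-overlapping (an assumption).
    """
    rng = rng or random.Random()
    interval = num_frames // num_snippets
    if interval < frames_per_snippet:
        raise ValueError("interval too short for the requested snippet size")
    snippets = []
    for s in range(num_snippets):
        start = s * interval
        # Random frames inside one interval give short-range context;
        # the set of snippets together spans the whole sequence.
        picked = rng.sample(range(start, start + interval), frames_per_snippet)
        snippets.append(sorted(picked))
    return snippets

snippets = sample_snippets(60, 3, 4, random.Random(0))
print(snippets)
```

Each snippet draws only from its own interval, so frames within a snippet are temporally close, while different snippets cover distant parts of the sequence.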

GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

GenCape treats the keypoint skeleton structure in Category-Agnostic Pose Estimation (CAPE) as a latent variable to be generated. It employs an iterative structure-aware variational auto-encoder (i-SVAE) to infer instance-specific soft adjacency matrices from support images. A Combined Graph Transfer (CGT) module then performs Bayesian fusion of multiple sampled graphs based on uncertainty and query relevance. This approach completely eliminates the need for pre-defined skeletons and text priors, achieving new SOTA results on MP-100 for both 1-shot and 5-shot settings (mPCK +1.59% over FMMP).

Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

A VLM translates high-level instructions into a part-level bipartite graph of "Relative Motion Dynamics (RMD)," automatically constructing target states and reward functions for reinforcement learning. This enables physically simulated characters to complete long-horizon interactions with static, dynamic, and articulated objects without motion capture data or manual reward tuning.

HUMOF: Human Motion Forecasting in Interactive Social Scenes

HUMOF uniformly encodes "human-human interaction" (HHI) and "human-scene interaction" (HSI) in dynamic social scenes into hierarchical features (high-level semantics + low-level geometry). These features are injected layer-by-layer via a coarse-to-fine Transformer reasoning module, achieving state-of-the-art (SOTA) performance on four public datasets.

InclusiveVidPose: Bridging the Pose Estimation Gap for Individuals with Limb Deficiencies in Videos

This paper constructs the first large-scale video human pose estimation dataset specifically for individuals with limb deficiencies (amputation, congenital limb differences, prosthetic users), named InclusiveVidPose. Building on the COCO 17-point schema, it adds 8 new residual limb keypoints and proposes the LiCC metric to quantify a model's ability to distinguish between "actual residual limbs/missing limbs" and "complete limbs," revealing systematic failures of existing SOTA models for this population.

InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

InfBaGel aligns Human-Object-Scene Interaction (HOSI) motion generation with the few-step denoising process of a consistency model. By using dynamic perception to iteratively update scene occupancy, bump-aware guidance to suppress interpenetration, and mixed-data training to bypass the scarcity of HOSI labels, the framework achieves real-time generation of long-range interactions—such as carrying large objects while avoiding obstacles—without requiring HOSI-specific annotations.

Instilling an Active Mind in Avatars via Cognitive Simulation

This paper attributes the "monotonous lip-syncing and movements" of digital humans to the exclusive simulation of human "System 1 (Fast Thinking)." It proposes using an MLLM agent as "System 2 (Slow Thinking)" to generate high-level semantic plans. Furthermore, a symmetric MMDiT with a Pseudo Last Frame is designed to integrate text, audio, and image modalities without conflict, enabling avatars to achieve accurate lip-syncing alongside context-aware and emotional performances.

Interaction-aware Representation Modeling With Co-Occurrence Consistency for Egocentric Hand-Object Parsing

For pixel-level segmentation of hands and active objects in egocentric images, this paper proposes InterFormer. It utilizes interaction boundary priors to dynamically generate "interaction-aware queries," purifies decoding features, and incorporates a "conditional co-occurrence loss" to encode the physical common sense that "an active object should not appear if its interacting hand is not detected" into training. It achieves SOTA performance on EgoHOS and cross-domain mini-HOI4D.

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

This paper proposes TEMU-VTOFF, a Dual-DiT architecture for the virtual try-off (VTOFF) task. It divides the work between a feature extractor and a garment generator, using Multi-modal Hybrid Attention (MHA) to fuse image, text, and mask information and resolve visual ambiguity. A DINOv2-driven garment aligner preserves high-frequency details. TEMU-VTOFF achieves SOTA performance in multi-category scenarios on both the VITON-HD and Dress Code datasets.

KinemaDiff: Towards Diffusion for Coherent and Physically Plausible Human Motion Prediction

KinemaDiff directly embeds human skeletal topology and joint-level dynamics into the diffusion process itself. By replacing the conventional practice of "implicitly encoding priors via network architecture" with a Joint-Adaptive Noise Generator and a Structural Alignment Regularizer, it significantly enhances the physical plausibility and accuracy of stochastic human motion prediction while maintaining diversity.

LINK: Learning Instance-level Knowledge from Vision-Language Models for Human-Object Interaction Detection

LINK utilizes a plug-and-play two-stage HOI detection framework comprising a "geometric encoder + VLM linking decoder," supplemented by a progressive learning strategy under a teacher-student paradigm. By converting sparse HOI annotations into dense supervision covering all human-object pairs, it achieves SOTA performance across fully-supervised, zero-shot, and open-vocabulary settings.

Motion-Aligned Word Embeddings for Text-to-Motion Generation

MATE pushes "motion semantic alignment" down to the word embedding layer of the LLM text encoder. By fine-tuning only this thin layer (3.2M parameters) through motion localization and word-level decoupling, it binds motion-related words like "clockwise" to skeletal movements. This produces a plug-and-play motion-aware text encoder that improves mainstream T2M models like MoMask and MDM to new SOTA levels with virtually no architectural changes.

Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding

Motion-R1 combines a "Decomposed Chain-of-Thought (CoT) Data Engine" with "RL Binding": the former uses an LLM to decompose high-level instructions into temporal/causal sub-action chains for cold-start SFT; the latter uses GRPO to directly incorporate "motion similarity + semantic similarity + format" as rewards, eliminating the need for expensive human preference annotations while generating semantically aligned and realistic 3D human motions.
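
As a sketch of how a GRPO-style update could use these rewards: GRPO scores each sampled output relative to the other outputs in its group by normalizing rewards with the group mean and standard deviation. The equal reward weights and both function names below are hypothetical; the note does not say how Motion-R1 weights the three terms.

```python
import statistics

def total_reward(motion_sim, semantic_sim, format_ok, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms; equal weights are an assumption."""
    w_m, w_s, w_f = weights
    return w_m * motion_sim + w_s * semantic_sim + w_f * (1.0 if format_ok else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [
    total_reward(0.8, 0.7, True),
    total_reward(0.4, 0.5, True),
    total_reward(0.6, 0.6, False),
]
print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, no learned value function or human preference label is needed; samples that beat their group's average get a positive advantage.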

MotionGPT3: Human Motion as a Second Modality

By treating human motion as a "second modality," this work replaces discrete VQ tokens with a continuous VAE latent space and utilizes a symmetric motion branch with shared attention instead of a single-stream backbone. Combined with a lightweight diffusion head attached to the autoregressive backbone, a unified model performs both text-to-motion generation and motion-to-text understanding, achieving \(2\times\) to \(4\times\) faster training convergence.

PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

This paper constructs the PersonaX multimodal dataset (containing LLM-inferred Big Five behavior traits, facial embeddings, and biographical metadata) and proposes a two-layer analysis framework that combines structured independence testing with unstructured causal representation learning (with theoretical identifiability guarantees) to reveal cross-modal causal structures.

Pose-RFT: Aligning MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

Addressing the alignment gap where pose-specific MLLMs are forced into "average solutions" under supervised fine-tuning due to one-to-many ambiguity, this paper proposes Pose-RFT. It reformulates 3D human pose generation as a hybrid action reinforcement learning problem of "discrete text + continuous pose," utilizes the HyGRPO algorithm to optimize both output types separately, and incorporates four task-specific rewards, significantly outperforming existing pose-specific MLLMs on multiple benchmarks.

Pose Prior Learner: Unsupervised Categorical Prior Learning for Pose Estimation

This paper proposes the Pose Prior Learner (PPL), which utilizes a hierarchical memory module to learn explicit and visualizable pose priors (keypoint priors + connectivity priors) from scratch using purely self-supervised image reconstruction. These priors constrain and iteratively refine pose estimation for single images. PPL outperforms manual-prior and prior-free baselines on several human and animal datasets and can complete reasonable full-body poses even under heavy occlusion.

PulpMotion: Framing-Aware Multimodal Camera and Human Motion Generation

This paper introduces the text-conditioned joint generation of "human motion + camera trajectories" for the first time. Using a model-agnostic framework, it treats "screen composition (projection of human joints onto the camera view)" as an auxiliary modality bridge. During the sampling stage, generation is guided toward compositional consistency, ensuring characters remain in-frame with a cinematic aesthetic. It achieves new SOTA on this task across both DiT and MAR architectures.

QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture

QuaMo proposes a 3D human kinematics capture method based on Quaternion Differential Equations (QDE). By solving kinematic equations under the unit-sphere constraint and introducing a meta-PD controller with second-order acceleration enhancement, the method achieves discontinuity-free, low-jitter, online real-time human motion estimation, outperforming prior state-of-the-art methods on datasets such as Human3.6M.
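
For readers unfamiliar with quaternion kinematics, the sketch below integrates the standard equation \(\dot{q} = \tfrac{1}{2}\, q \otimes (0, \omega)\) with an explicit Euler step and renormalizes so the result stays on the unit sphere. It illustrates only the unit-norm constraint; the paper's solver and its meta-PD controller are not reproduced here, and the function names are hypothetical.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def integrate(q, omega, dt):
    """One Euler step of dq/dt = 0.5 * q * (0, omega), then renormalize."""
    dq = quat_mul(q, (0.0, *omega))  # omega is body-frame angular velocity
    q_new = tuple(qi + 0.5 * dt * di for qi, di in zip(q, dq))
    norm = math.sqrt(sum(c * c for c in q_new))
    return tuple(c / norm for c in q_new)  # project back onto the unit sphere

q = (1.0, 0.0, 0.0, 0.0)
for _ in range(1000):
    q = integrate(q, (0.0, 0.0, 1.0), 0.001)  # 1 rad/s about z for 1 second
print(q)
```

After one second at 1 rad/s about the z-axis, the quaternion is close to \((\cos 0.5, 0, 0, \sin 0.5)\), i.e. a rotation of about one radian, and its norm stays at 1.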

ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation

ReactDance utilizes a multi-scale motion representation via Hierarchical Finite Scalar Quantization (HFSQ) to decouple "coarse posture" from "high-frequency details." Combined with non-autoregressive Blocked Local Context (BLC) parallel sampling, it generates high-fidelity and long-term coherent "reactor" dances exceeding 2000 frames (60s+) in under 2 seconds.

Sapiens2: High-Resolution Foundation Models for Human-Centric Vision

Sapiens2 employs a unified pre-training objective of "mask reconstruction + self-distillation contrastive learning" to train 0.4B–5B high-resolution Transformers on 1 billion curated human images. Supporting 4K hierarchical backbones, it sets new SOTA benchmarks across multiple human dense tasks including pose estimation, body part segmentation, surface normals, point clouds, and albedo.

SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment

SesaHand employs a controllable diffusion framework with a dual-pronged "semantic + structural alignment" approach to synthesize realistic hand images with 3D mesh labels. The semantic branch uses Chain-of-Thought (CoT) to refine "human behavior semantics" from VLM descriptions, removing irrelevant details, while the structural branch uses hierarchical self-attention fusion for hand-body alignment and a bias term for efficient hand cross-attention enhancement. The generated images significantly improve in-the-wild 3D hand reconstruction (e.g., MPVPE).

Sparkle: A Robust and Versatile Representation for Point Cloud-based Human Motion Capture

Addressing the dilemma in point cloud motion capture where "point-level methods are detail-rich but noise-sensitive, while skeletal methods are robust but lose detail," this paper proposes the Sparkle representation, which explicitly decouples and then unifies 24 skeletal joints (internal kinematics) and 32 surface anchors (external geometry). Coupled with the SparkleMotion framework (Point-aligned Skeleton Tracker + Skeleton-guided Anchor Estimator + Sparkle-based SMPL Solver), it sets new SOTA results across 11 datasets and under varying sensors and occlusion noise.

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Addressing the lack of public data for "active interactive digital humans," this work constructs SpeakerVid-5M—the first large-scale, high-quality dataset for audio-visual dyadic interactive digital human generation (8,743 hours, 5.2 million single-person clips, 770,000 dialogue pairs). It also introduces an auto-regressive video dialogue baseline and the VidChatBench evaluation benchmark.

Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

Text2Interact addresses text-driven two-person 3D interaction generation. It first utilizes InterCompose to synthesize high-quality interaction data from LLMs and single-person motion priors, then employs InterActor with word-level text conditioning, dual-person motion interaction attention, and adaptive interaction loss to enhance motion realism, text alignment, and cross-distribution generalization.

TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

This paper proposes the new task of "Free-Form Hand-Object Interaction (HOI) Generation" along with WildO2, an in-the-wild 3D dataset automatically reconstructed from web videos. A three-stage framework, TOUCH (Contact Map Prediction → Multi-level Conditional Diffusion → Physical Constraint Refinement), is designed to move beyond "stable grasping" priors, enabling the generation of diverse and physically plausible hand poses, such as pushing, poking, and rotating, based on fine-grained textual instructions.

TriC-Motion: Tri-domain Causal Modeling for Text-to-Action Generation

TriC-Motion models human motion across temporal, spatial, and frequency domains in parallel within a diffusion denoising framework. It employs a score-guided gating mechanism for tri-domain fusion and introduces causal counterfactual intervention to strip away motion-irrelevant noise cues, achieving a new SOTA R@1 of 0.612 on HumanML3D.

Unified Multi-Modal Interactive and Reactive 3D Motion Generation via Rectified Flow

DualFlow utilizes a dual-branch Transformer framework based on Rectified Flow to unify text, music, actor motions, and retrieved dyadic motion exemplars. It supports both interactive dyadic motion generation and actor-reactor reactive motion generation, achieving superior semantic alignment, motion quality, and synchronization on MDD, InterHuman-AS, and DD100 with fewer inference steps.
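
The link between Rectified Flow and fewer inference steps can be seen in a toy sketch. Rectified flow trains a velocity field along straight-line interpolations \(x_t = (1 - t)\,x_0 + t\,x_1\), whose ideal velocity is the constant \(x_1 - x_0\); sampling then integrates that field with an ODE solver. The example uses the ideal velocity in place of a learned network, and all names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Straight-line path between noise x0 (t=0) and data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 2.0]
target = [1.0, -1.0]

def ideal_velocity(x, t):
    # Stand-in for a trained network: the constant velocity of a straight path.
    return [b - a for a, b in zip(noise, target)]

print(euler_sample(noise, ideal_velocity, 1))
print(euler_sample(noise, ideal_velocity, 4))
```

When the path is perfectly straight, a single Euler step already lands on the target; a learned field is only approximately straight, so a few steps are typically used in practice.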

UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

UniHand unifies the long-separated tasks of "estimating hand pose from video" and "generating hand motion under structured conditions" into a single conditional motion synthesis problem. By using a Joint VAE to align MANO parameters and 2D/3D skeletons into a shared latent space, and a latent diffusion model to fuse multiple conditions (including a "hand perceptron" that selects hand-specific tokens from global image features), it achieves SOTA results on DexYCB / HO3D / HOT3D even under severe occlusion and missing frames (DexYCB PA-MPJPE 4.08mm).

Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

LIGHT transforms the "diffusion forcing" mechanism—where each token can have its own noise level—into a classifier-free guidance approach. By allowing the body, hands, and objects to follow different denoising paces, clean modalities guide noisy ones via cross-attention. This generates text-driven human-object interaction (HOI) animations with more realistic contact without relying on manual contact priors.
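
For context, standard classifier-free guidance extrapolates from a weaker prediction toward a stronger one. The sketch below shows only that generic combination rule; how LIGHT obtains the two predictions from per-modality noise levels is not described in this note, so the argument names are assumptions.

```python
def guided_prediction(pred_weak, pred_strong, scale):
    """Generic guidance rule: weak + scale * (strong - weak).

    In standard classifier-free guidance, `pred_weak` is the unconditional
    prediction and `pred_strong` the conditional one. Mapping these onto
    LIGHT's noisy and clean modalities is an assumption.
    """
    return [w + scale * (s - w) for w, s in zip(pred_weak, pred_strong)]

print(guided_prediction([0.0, 1.0], [1.0, 1.0], 2.0))
```

A scale of 1 returns the stronger prediction unchanged, and larger scales push further in the direction in which the two predictions differ.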

Zero-Shot Human Pose Estimation Using Diffusion-Based Inverse Solvers

For the sparse pose estimation task of "recovering full-body 22-joint poses from only a VR headset + two controllers (3 upper-body sensors)," this paper proposes InPose: it decomposes the pose into scale-free rotations and scale-dependent joint positions. It uses only rotations as a conditional diffusion prior while treating position measurements as an Inverse Kinematics (IK) likelihood term to guide denoising, achieving zero-shot generalization to users of different body shapes without any fine-tuning.