🧑 Human Understanding

📷 CVPR2026 · 138 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (45) · 🧪 ICML2026 (5) · 🤖 AAAI2026 (20) · 🧠 NeurIPS2025 (21) · 📹 ICCV2025 (41)

🔥 Top topics: Face & Gaze ×15 · Re-Identification ×12 · Human Pose ×12 · Diffusion Models ×10 · Multimodal/VLM ×9

ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

ActAvatar utilizes "structured text prompts + phase-aware cross-attention" to allow talking avatar videos to perform specific actions within designated time windows. Combined with "depth-progressive audio influence" and "two-stage training," it maintains lip-sync, action accuracy, and image quality without relying on pose skeletons, reaching results on par with 14B-scale models using a 5B model.

All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark

This paper proposes LIDMark, the first framework to unify deepfake detection, tampering localization, and source tracing into a single proactive forensics system. By embedding a 152-dimensional Landmark-Identity watermark (136D facial landmarks + 16D source ID), it utilizes intrinsic/extrinsic consistency to achieve three-in-one forensics, outperforming existing methods in both PSNR/SSIM and detection accuracy.

AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars

AudioAvatar reconstructs a canonical 3D Gaussian whole-body digital human from a single portrait and allows audio to directly modulate the motion trajectory of each Gaussian particle (skipping the lossy intermediate chain of "audio → parametric pose → rendering"). By leveraging large-scale audio-driven video diffusion models for feature distillation, it significantly outperforms pose-driven baselines in lip synchronization, facial micro-expressions, and gesture naturalness.

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

The authors upgrade "talking head generation" from unidirectional broadcasting to genuine bidirectional conversation. By utilizing causal diffusion forcing in the motion latent space, the model receives user audio/motion while auto-regressively generating avatar head movements. Combined with KV caching, the latency is reduced to ~500ms (6.8x faster than baselines). Furthermore, a label-free DPO (Direct Preference Optimization), which generates negative samples by "dropping user conditions," enables the avatar to learn expressive reactions like nodding and smiling, achieving over an 80% preference rate against the strongest baseline in human evaluations.
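
For readers unfamiliar with DPO, the following minimal sketch shows the standard DPO loss for a single preference pair. It assumes scalar log-likelihoods are available for both samples; how the paper obtains them for a diffusion model is not stated in this summary. Here the "positive" sample is generated with user conditions and the "negative" sample with user conditions dropped, as described above. The function name `dpo_loss` and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one preference pair (inputs are log-probabilities).

    logp_*     : log-likelihood under the policy being trained
    ref_logp_* : log-likelihood under the frozen reference model
    beta       : temperature on the implicit reward (hypothetical default)
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # positive sample over the negative one than the reference model does.
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the condition-aware sample.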

AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

The AVATAR framework is proposed to improve GRPO through two core components: an off-policy training architecture (stratified replay buffer) and Time Advantage Shaping (TAS, using U-shaped weighting to emphasize the beginning and end of reasoning chains). This approach addresses three major issues of GRPO—data inefficiency, vanishing advantages, and uniform credit assignment—significantly outperforming the GRPO baseline on audio-visual reasoning benchmarks.
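
The U-shaped weighting can be illustrated with a small sketch. The quadratic shape, the `floor` and `peak` values, and both function names are assumptions made for illustration; the summary only states that the beginning and end of the reasoning chain are emphasized.

```python
def u_shaped_weights(num_steps, floor=0.5, peak=1.0):
    """Return one weight per reasoning step: high at both ends, low in the middle.

    Assumed form: a parabola over normalized position x in [-1, 1].
    """
    if num_steps == 1:
        return [peak]
    weights = []
    for t in range(num_steps):
        x = 2.0 * t / (num_steps - 1) - 1.0  # map step index to [-1, 1]
        weights.append(floor + (peak - floor) * x * x)
    return weights


def shape_advantages(advantage, num_steps, floor=0.5, peak=1.0):
    """Spread one sequence-level advantage over steps with U-shaped weights."""
    return [advantage * w for w in u_shaped_weights(num_steps, floor, peak)]
```

Compared with uniform credit assignment, where every step receives the same advantage, this scheme gives the first and last steps the full advantage and the middle steps a reduced share.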

BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

Addressing the pain point that real-world collection of gait data for "one person wearing hundreds of outfits" is nearly impossible, this paper maps 521 real subjects into a virtual engine. By randomly generating 100 outfits per person, the authors construct an identity-consistent synthetic gait dataset, BarbieGait. A companion clothing-invariant baseline, GaitCLIF, is proposed, achieving SOTA results on BarbieGait and real-world datasets including CCPG, SUSTech1K, Gait3D, and GREW.

Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes

The authors model driver gaze as an autoregressive dynamical system: each frame of the traffic scene is encoded into a "gaze-centric" heterogeneous spatio-temporal graph. An Affinity Relational Transformer (ART) models the interaction between the gaze and traffic objects, while an Object-level Density Network (ODN) predicts the next-step gaze distribution, which is autoregressively unrolled into continuous gaze trajectories. This unified model simultaneously generates SOTA-level gaze time series, scanpaths, and saliency maps.

Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding

Addressing the loophole in existing MLLM benchmarks that default to "single-view sufficiency" and only reward single-image recognition, this work constructs CVBench—3,000 human understanding questions where each item is verifiably "unsolvable via single-view, solvable via cross-view" (12 spatio-temporal tasks, 4-way synchronized cameras). Evaluation reveals that even the strongest models lag nearly 50 points behind humans, identifying a systematic failure mechanism across all models: "single-view bias."

BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification

Addressing the large modality gap and infrared sample scarcity in Visible-Infrared Person Re-Identification (VI-ReID), BIT discards the conventional approach of aligning features into a shared space. Instead, it adopts a matching-based paradigm: a bi-directional cross-interaction module allows visible-infrared image pairs to mutually absorb complementary information, followed by a query-aware scoring module that mines reliable reciprocal correspondences at the patch level to compute final similarity. BIT achieves SOTA results on SYSU-MM01, LLCM, and RegDB benchmarks.

BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer

BoostSLT introduces a plug-and-play module that wraps any sign language translation model. It segments long videos into semantic segments based on motion energy, translates segments independently, and reconstructs fragmented translations into coherent long sentences using a Diffusion Language Model. Without relying on gloss annotations, it significantly improves BLEU and ROUGE for long-sentence and document-level translation.

Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection

Addressing the issues of data scarcity, class imbalance, label noise, and confounding bias in Facial Action Unit (AU) detection, this paper proposes the UDCT framework: it models Transformer attention weights as Gaussian distributions to explicitly represent uncertainty, uses this uncertainty to reweight sample losses against noise/imbalance, and employs per-AU causal backdoor adjustment to sever spurious AU correlations caused by confounders. The method achieves competitive and more robust results on BP4D / DISFA (average F1 of 67.36% on DISFA).

Causal Motion Diffusion Models for Autoregressive Motion Generation

The CMDM framework is proposed, which unifies diffusion denoising and autoregressive generation within a motion-language-aligned causal latent space. By employing frame-level independent noise levels and a causal uncertainty sampling schedule, it achieves high-quality, low-latency text-to-motion generation and long-sequence streaming synthesis.

CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation

This paper proposes CIGPose, a causal intervention graph pose estimation framework. It identifies visual context confounders through structural causal models, utilizes prediction uncertainty to locate keypoints affected by confounding, and replaces them with learned context-free canonical embeddings. A hierarchical GNN then models skeletal anatomical constraints, achieving a new SOTA of 67.0% AP on COCO-WholeBody.

CLEX: Complementary Label Exchange Learning for Noisy Facial Expression Recognition

CLEX suppresses spurious activations by randomly exchanging a subset of non-target logits between a primary and an augmented branch, followed by scale-invariant normalization. It then employs a "complementary suppression loss" to specifically suppress responses of randomly retained non-target classes. Without requiring clean data or noise priors, CLEX achieves SOTA performance across various noise rates on three in-the-wild FER datasets: RAF-DB, AffectNet, and FERPlus.
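
The exchange step can be sketched as follows. This is a simplified illustration on plain Python lists, not the authors' implementation: the per-class swap probability `swap_prob` and the function name are assumptions, and the scale-invariant normalization and complementary suppression loss are omitted.

```python
import random


def exchange_non_target_logits(primary, augmented, target, swap_prob=0.5, rng=None):
    """Randomly swap non-target logits between two branches.

    primary, augmented : lists of per-class logits from the two branches
    target             : index of the labeled class, which is never swapped
    swap_prob          : hypothetical probability of swapping each non-target class
    """
    rng = rng or random.Random()
    new_primary, new_augmented = list(primary), list(augmented)
    for c in range(len(primary)):
        if c == target:
            continue  # keep the target logit of each branch in place
        if rng.random() < swap_prob:
            new_primary[c], new_augmented[c] = new_augmented[c], new_primary[c]
    return new_primary, new_augmented
```

Because only non-target entries move, each branch keeps its own evidence for the labeled class while its non-target responses are mixed with those of the other view.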

COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation

The COG framework is proposed to model cross-view correspondences as a confidence-aware Optimal Transport (OT) problem. By predicting point-wise confidence as transport marginal constraints, it suppresses non-overlapping regions and outliers, achieving single-reference 6DoF pose estimation for novel objects under unsupervised conditions that is comparable to supervised methods.
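
To make "confidence as transport marginal constraints" concrete, here is a minimal entropic optimal transport sketch using plain Sinkhorn iterations. It assumes that per-point confidences are normalized to sum to one and used directly as the row and column marginals; the paper's exact formulation may differ, and all names and values below are illustrative.

```python
import math


def normalize(confidences):
    """Turn non-negative confidences into a marginal that sums to 1."""
    total = sum(confidences)
    return [c / total for c in confidences]


def sinkhorn(cost, row_marginal, col_marginal, epsilon=0.5, iterations=200):
    """Entropic OT plan whose row/column sums match the given marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: the third source point and third target point have low
# confidence (e.g. a non-overlapping region), so they receive little mass.
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.2]]
source_conf = normalize([0.9, 0.9, 0.05])
target_conf = normalize([0.9, 0.9, 0.05])
plan = sinkhorn(cost, source_conf, target_conf)
```

Since each row of `plan` sums to the corresponding confidence-derived marginal, a low-confidence point can contribute only a small share of the total correspondence mass.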

Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement

Aiming at the natural but ambiguous query of "reference image + short keyword attributes," this paper proposes the CA-ReID task. It utilizes pose-guided "Part-Aware Representation (PAR)" to bind textual attributes to corresponding body regions and employs "Dense Disentanglement Loss (DDL)" to separate identity and attribute dimensions. This approach improves Recall@1 for Hard queries by up to +17% on the self-built composite attribute benchmark.

CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

CoordSpeaker first employs a "gesture captioning" framework to offline generate multi-granularity descriptive text for gesture data lacking textual annotations. It then utilizes a conditional latent diffusion model with a "hierarchical condition injection denoiser" to coordinate heterogeneous audio and text conditions. This generates full-body speaker gestures that are both rhythmically aligned with speech and responsive to textual instructions (e.g., "bowing while speaking").

COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification

COPE addresses the deep-seated issues of "feature interference" and "information loss" in occluded ReID using three lightweight modules: Cross-Identity Consistent Occlusion (CICO) imposes the same occlusion across different identities and constrains feature consistency in occluded areas; Prompt-Based Background Filling (PBF) uses CLIP text prompts to locate the foreground and randomly fill the background; and Prompt Similarity Scoring (PSS) post-processes retrieval during inference based on foreground completeness scores. It achieves approximately 82% Rank-1 and 75–76% mAP on Occluded-Duke with almost no additional inference cost.

D³FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Challenges

Aiming at the compound challenge of "visual disturbances (occlusion/pose) + label noise" in in-the-wild facial expression recognition (FER), D³FER feeds weak/strong dual-channel augmentations into a Query-Key momentum dual-branch network. It utilizes a cross-batch dynamic queue to cache both confidence scores for adaptive threshold-based sample filtering and label correction, as well as features for supervised contrastive learning. During inference, the smoother Key branch is used, achieving new SOTA results on RAF-DB/FERPlus/AffectNet and their occlusion/pose/noise subsets.

Decoupled Generative Modeling for Human-Object Interaction Synthesis

DecHOI decomposes "Human-Object Interaction Synthesis" into two lightweight diffusion experts: a Trajectory Generator first plans global paths for the human and object without manual waypoints, followed by an Action Generator that completes fine-grained full-body actions conditioned on these paths. It utilizes an adversarial discriminator targeting end-joint contact dynamics to bridge the realism gap. DecHOI outperforms CHOIS/HOIFHLI on most metrics in FullBodyManipulation and 3D-FUTURE datasets and supports real-time replanning when encountering moving obstacles.

DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation

The DecoVLN framework is proposed to decouple the three processes of observation, reasoning, and error correction in VLN tasks. By utilizing an adaptive memory optimization mechanism and a state-action pair-based correction fine-tuning strategy, the framework achieves state-of-the-art (SOTA) performance on R2R-CE and RxR-CE using only egocentric RGB input.

DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations

By employing a hybrid motion representation—explicit global transformation for head pose and implicit latent code for facial expressions—alongside dual-branch pose injection and progressive blended CFG, this work achieves high-fidelity disentangled control of pose and expression in one-shot portrait animation for the first time, supporting fine-grained editing of pose or expression independently.

Differentially Private 2D Human Pose Estimation

This paper presents the first unified differential privacy framework for 2D human pose estimation. It combines two denoising mechanisms, "Gradient Subspace Projection" and "Feature-level Differential Privacy (adding noise only to private features of the raw image)," into Feature-Projective DP. Under formal privacy guarantees, it significantly narrows the accuracy gap with non-private models (achieving 82.61% PCKh@0.5 on MPII at \(\varepsilon=0.8\), recovering 73% of the privacy-induced loss).
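
As background on where the privacy-induced accuracy loss comes from, the sketch below shows the generic DP-SGD gradient step (per-sample clipping followed by Gaussian noise). It is not the paper's Feature-Projective DP, whose projection and feature-level details are not given in this summary; the function name and parameters are illustrative.

```python
import math
import random


def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Generic DP-SGD step: clip each sample's gradient, sum, add noise, average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(dim):
            total[k] += grad[k] * scale
    # Gaussian noise calibrated to the clipping bound (the sensitivity).
    sigma = noise_multiplier * clip_norm
    batch = len(per_sample_grads)
    return [(total[k] + rng.gauss(0.0, sigma)) / batch for k in range(dim)]
```

The added noise has the same scale in every gradient dimension, which is one reason methods restrict updates to a lower-dimensional subspace or limit noise to the private part of the input.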

DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

DyaDiT is a multi-modal Diffusion Transformer designed for dyadic dialogue scenarios. It employs an Orthogonalized Cross-Attention (ORCA) module to disentangle two channels of overlapping audio, integrates social conditions such as relationships/personality and a motion dictionary prior, and generates upper-body gestures that align with both dialogue dynamics and social context, surpassing existing dyadic gesture methods in objective metrics and user preference.

Dynamic Label Noise Suppression with Optimal Teacher Pool for Facial Expression Recognition

To address the prevalent label noise in Facial Expression Recognition (FER) datasets, this paper proposes the OTP-NS framework. It replaces a single EMA teacher with an "Optimal Teacher Pool" to break parameter coupling and noise accumulation between the teacher and student. Additionally, two sample-level denoising components, Similarity-Aware Label Smoothing (SALS) and Centroid Confidence Weighting (CWL), are integrated. The method outperforms existing SOTA across various noise ratios on multiple benchmarks with zero additional inference overhead.

Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification

To address the issue in Lifelong Person Re-Identification (LReID) where fixed network architectures cannot accommodate continuously accumulating knowledge, leading to catastrophic forgetting, this paper proposes the dynamic expansion framework VIA. It models each new domain independently using cascaded dual LoRA adapters, reuses cross-domain commonalities via a shared expert pool with routing, and adaptively adjusts encoder learning rates based on domain similarity. Ultimately, it improves the average mAP across 5 seen domains from 66.4% of the baseline to 77.7%.

E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation

E-3DPSM is proposed as an event-based state machine for egocentric 3D human pose estimation. It models pose estimation as a continuous-time state evolution process, utilizing bidirectional SSMs for temporal modeling and a learnable Kalman-style fusion module to integrate direct and incremental predictions. It achieves 80Hz real-time inference, reduces MPJPE by 19%, and improves temporal stability by 2.7 times.
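
The idea behind Kalman-style fusion can be shown with the classic scalar update. This sketch uses fixed, known variances, whereas the summary describes a learnable fusion module, so treat it only as the textbook analogue; the function and argument names are assumptions.

```python
def kalman_fuse(incremental_est, incremental_var, direct_est, direct_var):
    """Fuse an incremental (propagated) estimate with a direct measurement.

    Returns the fused estimate and its variance under the scalar Kalman update.
    """
    # The gain grows as the incremental estimate becomes less certain.
    gain = incremental_var / (incremental_var + direct_var)
    fused = incremental_est + gain * (direct_est - incremental_est)
    fused_var = (1.0 - gain) * incremental_var
    return fused, fused_var
```

With equal variances the fused value is the midpoint of the two estimates; as the direct prediction becomes more reliable, the result moves toward it.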

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

This paper proposes EgoPoseFormer v2 (EPFv2), achieving SOTA accuracy in egocentric 3D human motion estimation (MPJPE 4.02cm, a 15-22% improvement over its predecessor) on the EgoBody3M benchmark with 0.8ms GPU latency. This is realized through an end-to-end Transformer architecture (Single Global Query + Causal Temporal Attention + Conditional Multi-view Cross-Attention) and an automatic labeling system based on uncertainty distillation.

EventGait: Towards Robust Gait Recognition with Event Streams

EventGait performs gait recognition using event cameras and proposes a dual-stream framework: a dynamic stream uses a Mixture of Spiking Experts (MoSE) with varying membrane time constants to adaptively capture multi-scale motion, while a static stream employs Cross-modal Structural Alignment (CroSA) with DINOv2 as a teacher to distill dense shape priors into sparse events. It matches camera-based methods in normal light and significantly outperforms them in low light (+37.3% at night) on both synthetic and real event gait benchmarks.

FaceCoT: Chain-of-Thought Reasoning in MLLMs for Face Anti-Spoofing

This work constructs FaceCoT, the first large-scale VQA dataset for face anti-spoofing (FAS), containing 1.08 million samples across 14 attack types with six-level Chain-of-Thought (CoT) reasoning annotations (from global description to local reasoning to final conclusion). It proposes the CoT-Enhanced Progressive Learning (CEPL) two-stage training strategy, achieving an average AUC improvement of 4.06% and an HTER reduction of 5.00% across 11 benchmarks, surpassing all SOTA methods.

FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling

FisherPoser models the estimation of full-body poses from three 6-DoF signals (HMD + two controllers) as probabilistic inference on the \(SO(3)\) manifold. Instead of a single rotation, each joint outputs a Matrix-Fisher distribution. By utilizing "five-region tokens + parent-to-child recursion along kinematic chains," both pose and uncertainty are propagated hierarchically. This approach sets new SOTA records for MPJPE/MPJRE on the AMASS sparse VR benchmark while providing well-calibrated per-joint confidence.

FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

FlexAvatar resolves the entanglement between driving signals and target viewpoints by introducing learnable "bias sink" tokens that unify training across monocular and multi-view data, enabling the generation of complete, high-quality, and animatable 3D head avatars from a single image.

FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

FloodDiffusion tailors the diffusion forcing framework from the video domain to text-driven streaming human motion generation. Through three key modifications (lower-triangular time scheduling, bi-directional attention within an active window, and frame-level time-varying text conditioning), it achieves a SOTA streaming FID of 0.057 on HumanML3D, approaching the performance of non-streaming methods for the first time.

FLOW: Optimal Transport-Driven Feature Warping for Generalized Remote Physiological Measurement

FLOW treats "distributional shift" in end-to-end rPPG models during cross-domain deployment as a feature-level Optimal Transport (OT) problem. It first utilizes a lightweight Temporal Refinement Module (TRM) to unify and denoise temporal features across domains, then applies Prototype-based Cross-temporal Optimal Transport (PCOT) using a learnable prototype bank for soft alignment. Coupled with two regularization terms, it achieves cross-domain SOTA on four rPPG benchmarks in a plug-and-play, backbone-agnostic manner.

FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

FlowPalm utilizes RAFT optical flow to statistically derive non-rigid deformation fields from real palmprint pairs, constructing a "deformation library." During diffusion sampling, these deformations are injected into the pipeline via three stages: crease warping for the backbone and warped noise for the texture. This produces geometrically diverse and identity-consistent synthetic palmprints. Notably, a recognition model trained solely on synthetic data (85.20% TAR) outperforms one trained on real data (73.59%).

FMPose3D: monocular 3D pose estimation via flow matching

The authors reformulate monocular 2D-to-3D pose lifting as a "conditional distribution transport" problem. By utilizing Flow Matching to learn an ODE velocity field, the method transports Gaussian noise to valid 3D pose distributions in only 3 integration steps. A Reprojection Error-based Expectation Aggregation (RPEA) module merges multiple hypotheses into a single estimate. This approach outperforms diffusion-based methods on Human3.6M, MPI-INF-3DHP, and animal datasets while being approximately 5 times faster during inference.
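
A toy sketch of the two ingredients named above: integrating a velocity field with a few Euler steps, and merging several hypotheses using reprojection-error weights. The velocity field here is a hand-written stand-in for the learned network, and the softmax-style weighting in `aggregate` is an assumed form of the aggregation, not the paper's exact RPEA rule.

```python
import math


def euler_integrate(x0, velocity, num_steps=3):
    """Transport a sample from t=0 to t=1 with fixed-step Euler integration."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for a learned field: points straight at a fixed target pose."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


def aggregate(hypotheses, reproj_errors, temperature=1.0):
    """Weighted mean of hypotheses; lower reprojection error means higher weight."""
    weights = [math.exp(-e / temperature) for e in reproj_errors]
    total = sum(weights)
    dim = len(hypotheses[0])
    return [sum(w * h[k] for w, h in zip(weights, hypotheses)) / total for k in range(dim)]
```

In this toy setting each hypothesis starts from a different noise sample, is integrated in three steps, and the resulting candidates are averaged with weights that favor those agreeing with the 2D evidence.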

Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production

Addressing the common flaw in the Gloss-to-Pose stage of Sign Language Production (SLP)—which typically models only global sequences while ignoring fine-grained joint-level dependencies—this paper proposes the Focal–General Diffusion Model (FGDM). By employing a two-stage denoising structure that "focuses on joints first, then coordinates the global sequence," combined with a frame-wise Adaptive Semantic Graph Convolutional Network (ASGCN) and a Semantic Consistent Guidance (SCG) mechanism that injects CTC-based semantic supervision into diffusion training, the model achieves new SOTA performance on PHOENIX14T and USTC-CSL.

Forecasting 3D Scanpaths in Egocentric Video

This paper extends the task of "predicting where a person will look next" from 2D images to egocentric videos for the first time. It defines a new task of predicting future gaze sequences (3D scanpaths) within a 3D world coordinate system and proposes a Transformer architecture that uses the "last observed camera pose" as a canonical frame to fuse video, head pose, and historical gaze, establishing the first baseline on the Aria Digital Twin dataset.

FrankenMotion: Part-level Human Motion Generation and Composition

Addressing the limitation where text-to-human motion generation only allows sequence or action-level control but lacks control over individual body parts, this paper first utilizes an LLM agent (FrankenAgent) to automatically label existing mocap datasets into a three-level, temporally aligned fine-grained dataset named FrankenStein (Sequence / Atomic Action / Body Part). Subsequently, a diffusion-based model called FrankenMotion is trained, driven by per-frame text prompts for each body part, enabling the composition of complex motions not seen during training (e.g., "raising the left arm while sitting").

From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

The TAR-FAS framework is proposed, reconstructing the Face Anti-Spoofing (FAS) task into a Chain-of-Thought with Visual Tools (CoT-VT) paradigm for the first time. This allows MLLMs to adaptively invoke external visual tools (LBP/FFT/HOG, etc.) during reasoning, upgrading from "intuitive judgment" to "fine-grained investigation," achieving SOTA on the 1-to-11 cross-domain protocol.

FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition

This paper proposes FusionAgent, a Multimodal Large Language Model (MLLM) agent framework for dynamic sample-level model selection in whole-body biometric recognition. By encapsulating expert models (Face Recognition, Gait Recognition, Person Re-ID) as tools, the agent learns to adaptively select the optimal model combination for each test sample through Reinforcement Fine-Tuning (RFT). Combined with a novel ACT score fusion strategy, it significantly outperforms existing SOTA fusion methods.

Gaussian-Mixture Latent Flow for Stochastic 3D Human Motion Prediction

To address two long-standing issues in stochastic human motion prediction (sacrificing plausibility for accuracy and diversity, and the inability to reliably quantify uncertainty), this paper learns a data-driven Gaussian Mixture Prior via EM in the latent space to decouple different motion modes. It then employs a fully invertible Latent Flow Matching (integrated with a skeleton-aware Transformer) for prediction. This approach enables both precise log-likelihood for uncertainty measurement and SOTA accuracy and plausibility on Human3.6M and AMASS.

Gaze Target Estimation Anywhere with Concepts

This paper introduces "Promptable Gaze Estimation (PGE)," a new task where a specific individual in a scene is designated via natural language or a coordinate, and the model directly produces a heatmap of their gaze location end-to-end. The authors provide the Gaze-Co dataset with 120K concept annotations and the first PGE model, GazeAnywhere, achieving SOTA across multiple benchmarks.

GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion

Using a single upward-facing desktop fisheye camera to capture a 360° scene, GazeOnce360 employs rotational convolutions, eye keypoint supervision, and global-local dual-resolution cross-attention to simultaneously detect and regress 3D gaze directions for multiple people in an end-to-end manner. On the self-built synthetic dataset MPSGaze360, it reduces gaze error from 18.96° (multi-stage pipeline) to 10.39° while achieving a ~4x speedup.
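
Gaze errors quoted in degrees are conventionally the angle between the predicted and ground-truth 3D gaze vectors. A minimal sketch of that metric follows; the function name is illustrative, and the summary does not state whether the paper applies any further averaging scheme.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))
```

The vectors do not need to be unit length, since the function normalizes them before taking the arccosine.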

GazeShift: Unsupervised Gaze Estimation and Dataset for VR

Addressing the dilemma of "off-axis near-eye IR cameras + no reliable labels" in VR headsets, this work releases VRGaze (68 subjects, 2.1M frames), the first large-scale off-axis gaze dataset. It proposes GazeShift, which uses "gaze redirection between two frames of the same eye" as an unsupervised proxy task. By decoupling gaze and appearance via standard cross-attention and using the model's own attention maps as soft masks to focus on the eye region, the model achieves a 1.84° error on VRGaze with only 342K parameters and 55 MFLOPs (5ms inference on headset GPUs), approaching supervised performance.

Geometric Neural Distance Fields for Learning Human Motion Priors

This paper proposes NRMF (Neural Riemannian Motion Fields), which models the third-order dynamics of human motion—"pose, velocity, and acceleration"—as the zero-level sets of three conditional neural distance fields. Equipped with a geometric projection algorithm and a geometric integrator, this single unconditional prior robustly handles tasks such as denoising, in-betweening, monocular fitting, and generation. It outperforms VAE and diffusion-based priors on benchmarks like AMASS, 3DPW, and PROX.

Goldilocks Test Sets for Face Verification

While mainstream face verification test sets have reached saturation (LFW 99.8%), this study does not rely on reducing image quality or adding occlusions to create difficulty. Instead, it extracts three types of "natural but difficult" image pairs from high-quality controlled face databases: extreme beard differences (Hadrian), strong exposure differences (Eclipse), and identical twins (ND-Twins). A set of "Goldilocks Three Rules" is proposed to ensure the test sets are "just right" in difficulty, resulting in difficulty levels that surpass artificial benchmarks relying on synthetic masks or reduced resolution.

HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation

Aiming at training oscillations caused by the "source supervision gradient vs. target consistency gradient" conflict in synthetic \(\to\) real domain pose estimation, HamiPose performs orthogonal decomposition of target gradients by keypoints, uses confidence gating to allow only non-conflicting components, and applies a Hamiltonian optimizer with a symplectic integrator to add "controlled momentum" for suppressing high-frequency jitters, achieving SOTA on multiple UDA pose benchmarks.
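
The gradient-conflict idea can be illustrated with a simple projection in the style of gradient surgery: split the target gradient into a part parallel to the source gradient and an orthogonal remainder, and drop the parallel part when it opposes the source. This is a generic sketch under that assumption; it omits the per-keypoint decomposition, confidence gating, and Hamiltonian optimizer described above, and the function name is hypothetical.

```python
def remove_conflicting_component(target_grad, source_grad):
    """Return the target gradient with any source-opposing component removed."""
    dot = sum(t * s for t, s in zip(target_grad, source_grad))
    if dot >= 0:
        return list(target_grad)  # no conflict: keep the gradient unchanged
    source_norm_sq = sum(s * s for s in source_grad)
    coeff = dot / source_norm_sq
    # Subtract the projection onto the source gradient, keeping only the
    # orthogonal remainder.
    return [t - coeff * s for t, s in zip(target_grad, source_grad)]
```

After the projection, the remaining target update is orthogonal to the source gradient, so to first order it no longer undoes progress on the supervised source loss.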

HandX: Scaling Bimanual Motion and Interaction Generation

HandX is constructed as a unified bimanual motion generation infrastructure (comprising 54.2 hours of motion data + 485,000 fine-grained text annotations). A decoupled automatic annotation strategy (kinematic feature extraction + LLM reasoning for description generation) is proposed. Diffusion and autoregressive generation paradigms are benchmarked, demonstrating clear data and model scaling trends.

Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation

HESP utilizes an Adaptive Gaussian VAE (AG-VAE) that explicitly decomposes the latent space into multiple semantic sub-manifolds, combined with Dynamic Cross-Modal Memory (DCMM) and Hierarchical Cross-modal Attention (HCA). This makes text-driven 3D human motion generation more controllable and interpretable, outperforming baselines such as SALAD, MoMask, and MDM in FID and R-Precision on HumanML3D and KIT-ML.

How to Take a Memorable Picture? Empowering Users with Actionable Feedback

This paper defines a new task of Memorability Feedback (MemFeed) and proposes MemCoach—a training-free activation steering method for Multimodal Large Language Models (MLLMs). By injecting memorability-aware knowledge into the model's activation space using a teacher-student strategy, it enables MLLMs to generate natural language actionable suggestions for improving photo memorability.

HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human-Scene Interaction

HSI-GPT2 is a Large Motion Model (LMM) for "unified understanding + generation" of Human-Scene Interaction (HSI). It employs a dual-granularity motion tokenizer to decouple actions into semantic and detail codebooks. By utilizing an LLM as a semantic planner and a diffusion decoder as a de-tokenizer, the model achieves high physical fidelity. Integrated with a Motion Chain-of-Thought (MoCoT) data engine and Group Relative Policy Optimization (GRPO), the model performs step-by-step reasoning, significantly outperforming HSI-GPT on HumanML3D and HUMANISE benchmarks for generation, description, and completion tasks.

HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations

HUMAPS-4D is a large-scale human motion dataset that synchronizes optical motion capture, multi-view RGB, IMU, instrumented pressure insoles, surface electromyography (sEMG), anthropometry, and three-layer semantic annotations under a unified protocol (32 participants × 30 actions × 10 repetitions, totaling 14 hours and 5.76 million frames). Its goal is to establish a rigorous benchmark for inferring full-body 3D poses/actions from physiological signals like plantar pressure without relying on cameras.

HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph

HyperGait utilizes hypergraph convolution to extract "high-order nonlinear correlations" among body parts and across temporal frame segments within gait parsing sequences (GPS). Using only the single parsing modality as input, it achieves an 80.5% Rank-1 accuracy on the real-world Gait3D dataset, surpassing the previous parsing-based SOTA (MultiGaitP) by 4.1 percentage points.

IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion

IMU-HOI treats "hand-object contact" as a first-class probabilistic signal. Starting from sparse IMUs attached to the body (6 units) and the object (1 unit), a three-stage fusion pipeline simultaneously recovers full-body human poses and 6-DoF object trajectories, reducing object trajectory error by 44%–64% compared to strong baselines across three HOI benchmarks.

InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

InterAgent is the first text-driven, physics-based dual-humanoid agent control framework. It employs a multi-stream autoregressive diffusion Transformer (Inter-DiT) to decouple proprioception, exteroception, and action, and utilizes an "Interaction Graph + Sparse Edge Attention" to characterize fine-grained joint-to-joint relationships, generating physically plausible and semantically faithful dual-humanoid interactions from a single text command.

LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference

This work proposes the LabanLite symbolic motion representation and the LaMoGen framework, enabling LLMs to autonomously compose motion sequences through interpretable Laban symbolic reasoning for the first time, surpassing traditional text-motion joint embedding methods in temporal precision and controllability.

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

LAMP leverages the known 6-DoF poses of a headset to lift 2D human keypoints detected across multiple cameras into a unified world-frame 3D ray cloud at an early stage. A spatio-temporal Transformer then fits human SMPL motion directly to this ray cloud. This "lift-then-fit" paradigm completely decouples the wearer's head motion from the observed person's motion, achieving SOTA performance on monocular benchmarks and significantly outperforming baselines in multi-camera egocentric scenarios.
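
The "lift" step can be pictured as back-projecting each detected keypoint into a ray expressed in the world frame. The following is a minimal sketch, assuming a pinhole camera and a known camera-to-world pose; the function name, the intrinsics `fx, fy, cx, cy`, and the pose `R_wc, t_wc` are illustrative and not taken from the paper.

```python
import math

def pixel_to_world_ray(u, v, fx, fy, cx, cy, R_wc, t_wc):
    """Back-project pixel (u, v) to a ray (origin, unit direction) in the world frame.

    Assumes a pinhole camera. R_wc (3x3) and t_wc (3,) map camera coordinates
    to world coordinates. All names are hypothetical.
    """
    # Direction of the viewing ray in the camera frame (z points forward).
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the direction into the world frame.
    d_world = [sum(R_wc[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    # The ray starts at the camera centre, which is t_wc in world coordinates.
    return list(t_wc), [c / norm for c in d_world]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = pixel_to_world_ray(
    320, 240, 500.0, 500.0, 320.0, 240.0, identity, [0.0, 1.6, 0.0]
)
```

Because every ray already lives in world coordinates, rays from different cameras and time steps can be pooled into one set before any body model is fitted, which is what separates the wearer's head motion from the observed person's motion.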

LCA: Large-scale Codec Avatars - The Unreasonable Effectiveness of Large-scale Avatar Pretraining

LCA applies the large-scale pre-training/post-training paradigm to 3D avatar modeling for the first time: pre-training on 1 million in-the-wild videos to learn broad appearance and geometric priors, followed by post-training on high-quality multi-view studio data to enhance fine expressions and fidelity, breaking the inherent trade-off between generalization and fidelity.

Learning Effective Sign Features without Text for Gloss-free Sign Language Translation

This paper proposes SignDINO—a "sign-aware" pre-training strategy adapted from DINO self-distillation. By providing the teacher with only global frames and the student with local masked views of hands/faces, the model is forced to infer discriminative local cues from global frames alone. This enables pre-training of a sign language tokenizer entirely without gloss or text annotations, achieving or exceeding SOTA performance on four public GFSLT datasets that usually rely on text-based pre-training.

Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection

To address the "query overfitting to seen classes" and "diffuse CLIP attention" issues in open-vocabulary human-object interaction (OV-HOI) detection, this paper proposes the SD-IF framework. It utilizes reinforcement learning (RL)-driven semantic perturbations to push queries out of seen semantic clusters and employs an actor-critic mechanism to "focus" attention on actual interaction regions. Experiments show that SD-IF significantly outperforms previous SOTAs in unseen class mAP on HICO-DET and SWIG-HOI.

LiveGesture: Streamable Co-Speech Gesture Generation Model

This paper proposes LiveGesture—the first fully streamable, zero look-ahead speech-driven full-body gesture generation framework. It employs a streaming vector quantized motion tokenizer (SVQ, featuring asymmetric bidirectional encoding + causal decoding) to discretize each body region into causal motion tokens. A Hierarchical Autoregressive Transformer (region experts xAR + causal spatio-temporal fusion xAR-Fuse) then generates SMPL-X full-body gestures frame-by-frame while receiving audio, achieving or exceeding offline SOTA performance on BEAT2 under strict streaming constraints.

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

LLaMo extends pretrained LLMs into a unified large model capable of both "motion-to-text" (understanding) and "text-to-motion" (generation) using "modality-split Mixture-of-Transformers + continuous causal motion tokens + flow matching decoding heads + exit heads." The key is freezing text modules to preserve the original language capabilities of the LLM, while supporting real-time (≥30 FPS) streaming generation of arbitrary length.

M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction

M4Human is the largest multimodal mmWave radar benchmark for Human Mesh Reconstruction (HMR) to date, featuring 661k frames, 50 actions, and 20 subjects. It provides synchronized RGB, Depth, Raw Radar Tensor (RT), and Radar Point Cloud (RPC) modalities with high-fidelity 3D mesh annotations based on optical motion capture (MoCap). It also introduces RT-Mesh, the first lightweight baseline for direct HMR from RT.

MAMMA: Markerless Accurate Multi-person Motion Acquisition

MAMMA is a markerless multi-person motion capture pipeline: starting from multi-view videos, it uses a Transformer (MammaNet) with independent queries for each landmark to predict 512 contact-aware and visibility-aware dense 2D surface landmarks. By fitting SMPL-X to these landmarks, it achieves an accuracy within 0.862 mm of commercial marker-based systems (Vicon) in close-range two-person interaction scenarios, while eliminating the need for tedious marking and manual data cleaning.

MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision

MatchED proposes a lightweight (approx. 21K parameters) plug-and-play module that generates crisp (single-pixel wide) edge maps through one-to-one bipartite matching based on spatial distance and confidence during training. It can be attached to any edge detector for end-to-end training and, for the first time, matches or exceeds standard post-processing methods without relying on NMS and thinning.
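
One-to-one matching of this kind can be illustrated with a toy assignment problem. The sketch below is not the paper's module: it assumes a cost of "spatial distance minus weighted confidence" (the exact cost is not given in this note) and uses brute-force search, which is only viable for a handful of pixels. The names `match_edges` and `conf_weight` are hypothetical.

```python
from itertools import permutations
import math

def match_edges(gt_pixels, preds, conf_weight=1.0):
    """Return {gt_index: pred_index} minimising the total matching cost.

    gt_pixels: list of (row, col) ground-truth edge pixels.
    preds: list of ((row, col), confidence) predicted edge pixels.
    Assumed cost: Euclidean distance minus conf_weight * confidence.
    Brute force over assignments, so keep the inputs tiny.
    """
    def cost(g, p):
        (py, px), conf = preds[p]
        gy, gx = gt_pixels[g]
        return math.hypot(py - gy, px - gx) - conf_weight * conf

    # Each ground-truth pixel gets exactly one distinct prediction.
    best = min(
        permutations(range(len(preds)), len(gt_pixels)),
        key=lambda perm: sum(cost(g, p) for g, p in enumerate(perm)),
    )
    return dict(enumerate(best))

gt = [(0, 0), (5, 5)]
preds = [((0, 1), 0.9), ((0, 2), 0.8), ((5, 4), 0.7)]
matches = match_edges(gt, preds)
unmatched = sorted(set(range(len(preds))) - set(matches.values()))
```

Predictions left in `unmatched` receive no positive target, which is the intuition behind thin edges: of several nearby responses, only one can be credited for a given ground-truth pixel. A practical implementation would use a polynomial-time assignment solver instead of enumeration.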

MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

To address the challenge in visible-infrared person re-identification where "illumination differences span across multiple frequency bands and the optimal band varies by sample," MFEN utilizes a Mixture-of-Experts (MoE) structure with multiple frequency band experts and a gating mechanism to adaptively fuse frequency domain cues per sample. Complemented by image-level Random Frequency Augmentation (RFA) and optimization-level Frequency-assisted Optimization (FAO), it achieves or approaches SOTA performance on three VI-ReID datasets.

MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs

To address the highly ill-posed problem of directly regressing dense hand poses from sparse IMUs due to the semantic gap, MGDHand pre-trains a MANO-IMU fusion teacher to encode priors into three categories: static shape, dynamic pose, and temporal motion. It then employs multi-granularity decoupled distillation (SSD/DPD/TMD) to transfer these priors to an IMU-only student in their respective semantic domains. On the VIHand dataset, this reduces MPJPE by 40.7% compared to a student without distillation.

Miburi: Towards Expressive Interactive Gesture Synthesis

Miburi is proposed as the first online causal framework that directly utilizes the internal token streams of the speech-text foundation model Moshi and a 2D causal Transformer to achieve real-time synchronized full-body gesture and facial expression synthesis.

MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation

MimicTalker focuses on 3D head motion generation for "real-time dyadic conversations." It employs frame-by-frame causal processing combined with Gated Multi-scale Memory (MICE) to achieve zero-latency perception of the interlocutor. Intent and topic semantics extracted by an LLM are used for Semantics-augmented Dynamic Interaction (SDI) to modulate speaker features. Furthermore, a Semantics-guided Motion Style Memory (MSM), utilizing an "intent-as-key, style-as-value" external memory bank, maintains motion style consistency during long conversations. This allows for the generation of natural, coherent, and style-consistent real-time reactions across both 25-second short clips and 6-minute long dialogues, achieving 10%–30% improvements over methods like DualTalk on most metrics.

MMGait: Towards Multi-Modal Gait Recognition

MMGait constructs the most comprehensive multi-modal gait recognition benchmark dataset to date (5 sensors, 12 modalities, 725 subjects, 334K sequences) and proposes a new task of Omni-modal Gait Recognition along with a unified baseline model, OmniGait.

Mobile-VTON: High-Fidelity On-Device Virtual Try-On

Mobile-VTON is the first fully offline, on-device diffusion-based virtual try-on framework, built on the TeacherNet-GarmentNet-TryonNet (TGT) architecture. Through Feature-Guided Adversarial Distillation (FGA), the capabilities of SD3.5 Large are transferred to a lightweight student network with 415M parameters. It matches or even exceeds server-side baselines on VITON-HD and DressCode at 1024×768 resolution, with an end-to-end inference time of approximately 80 seconds (Xiaomi 17 Pro Max).

MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

MoBind utilizes hierarchical contrastive learning to align wearable IMU signals with 2D skeleton motion extracted from video. By aligning IMUs with "skeleton motion" instead of raw pixels to filter out irrelevant backgrounds, decomposing the body into parts for specific IMU pairing, and employing a three-level (token/local/global) contrastive strategy with a masked token prediction task, MoBind significantly outperforms strong baselines in cross-modal retrieval, sub-second temporal synchronization, person/part localization, and action recognition.

Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

Mocap-2-to-3 reformulates "recovering 3D motion from monocular 2D poses" as a multi-view synthesis problem: a single-view motion diffusion model is first pretrained on massive 2D data, followed by multi-view fine-tuning on limited 3D data. Combined with decoupled local pose/global displacement representations and ground pointmap constraints, it recovers full-body motion with metric absolute positions from monocular input, outperforming SOTA methods in both camera-space and world-coordinates on RICH/AIST++.

MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

MOFA-VTON enables users to control "how to style tops and bottoms" (e.g., tucked in, tucked out, or various hemline styles) using a single hand-drawn curve sketch. It converts the sketch into a "dual-region mask" for layout guidance and utilizes "Layout Adaptation blocks" to spatially align upper and lower body features at the feature level. It achieves SOTA image quality on VITON-HD and DressCode while unlocking styling diversities unattainable by traditional methods.

MoLingo: Motion-Language Alignment for Text-to-Human Motion Generation

MoLingo achieves overall SOTA performance in FID, R-Precision, and user studies for text-to-human motion generation. This is accomplished by performing masked autoregressive rectified flow on a continuous latent space, utilizing a Semantically Aligned Autoencoder (SAE) and multi-token cross-attention for text condition injection.

MotionHiFlow: Text-to-Motion via Hierarchical Flow Matching

MotionHiFlow decomposes text-to-3D human motion generation into a multi-stage flow matching process that is "coarse-to-fine and low-to-high temporal scale." It links flows across scales using a noise-consistent cross-scale transition. Combined with a dual-stream Text-Motion Diffusion Transformer (TMDiT) and joint-aware Joint RoPE, it achieves SOTA results on HumanML3D and KIT-ML (FID 0.032 / 0.135).

MotionMaster: Generalizable Text-Driven Motion Generation and Editing

MotionMaster treats human motion as a new modality integrated into the shared vocabulary of a pre-trained multimodal large language model (Qwen2.5-VL). It utilizes a 10,000-hour annotated motion dataset (MotionGB) and an FSQ discretizer that balances local joint precision with global trajectory consistency. By employing an end-to-end autoregressive model to perform both text-driven motion generation and editing simultaneously, it achieves 41.6% higher semantic consistency for multi-motion sequences and 20.8% higher body part composition compared to prior methods.
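
For readers unfamiliar with FSQ (finite scalar quantization), the core idea is to bound each latent channel and round it to a small set of integer levels, so the codebook is implicit. The sketch below shows generic FSQ under the simplifying assumption of odd level counts; it is not MotionMaster's discretizer, and the level configuration `[5, 5, 3]` is made up for illustration. The straight-through gradient used during training is omitted.

```python
import math

def fsq_quantize(z, levels):
    """Bound each channel with tanh, then round to one of levels[i] integer codes.

    Assumes every entry of `levels` is odd, so codes are symmetric around 0.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half_width = (n_levels - 1) / 2
        codes.append(round(math.tanh(value) * half_width))
    return codes

def codes_to_index(codes, levels):
    """Flatten per-channel codes into a single token id (mixed-radix encoding)."""
    index, base = 0, 1
    for code, n_levels in zip(codes, levels):
        index += (code + (n_levels - 1) // 2) * base  # shift codes to start at 0
        base *= n_levels
    return index

levels = [5, 5, 3]  # implicit codebook of 5 * 5 * 3 = 75 entries
codes = fsq_quantize([0.0, 2.0, -3.0], levels)
token_id = codes_to_index(codes, levels)
```

The resulting `token_id` is what would be appended to a language model's vocabulary as a motion token; how MotionMaster balances joint precision against trajectory consistency inside the quantizer is not covered by this sketch.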

Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment (MoTiGA)

MoTiGA addresses three primary shortcomings of LLM-based text-to-motion generation—fine-grained quantization errors, the representation mismatch between "causal LLMs" and "non-causal VQ-VAE," and the lack of human preference alignment. These are resolved through Causal Residual Quantization (Causal RVQ-VAE), time-lagged causal prediction, and Multi-level Hybrid weighted Preference Optimization (MHPO). This approach reduces the FID by 82.3% on HumanML3D and 64.7% on KIT-ML compared to other LLM-based methods.

MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

MV-Fashion utilizes an "economical" multi-view synchronous capture rig consisting of 60 Raspberry Pi RGB cameras and 8 RGB-D cameras to record 3,273 synchronous videos (72.5M frames) of 80 subjects wearing 474 outfits (754 items). It provides multi-modal annotations for each garment, including flat-lay catalogue ↔ in-the-wild worn pairs, pixel-level segmentation, SMPL-X, point clouds, size charts, fabric elasticity, and styling. This marks the first time data required for virtual try-on, size estimation, and novel view synthesis are integrated into a single dataset, with baselines provided for all three tasks.

Next-Scale Autoregressive Models for Text-to-Motion Generation

MoScale proposes a next-scale autoregressive motion generation framework that replaces traditional next-token prediction. By employing coarse-to-fine hierarchical causal generation to capture global semantic structures, and introducing cross-scale hierarchical refinement alongside intra-scale temporal refinement, it achieves SOTA results on HumanML3D and KIT-ML (Top-1 0.540, FID 0.046).

Occluded Human Body Capture with Frequency Domain Denoising Prior

3D human motion capture from monocular occluded video is reformulated as a "wavelet coefficient selection" problem. Uncertainty of occluded keypoints is characterized using Gaussian distributions, and a frequency-domain diffusion prior is utilized to select credible coefficients in the Discrete Wavelet Transform (DWT) domain, enabling consistent and periodicity-preserving motion recovery under long-term occlusion.

OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

This paper constructs the first large-scale public benchmark for skeleton-based online micro hand gesture recognition, OMG-Bench (40 classes, 13,948 instances), and proposes the HMATr framework. By utilizing hierarchical memory banks and position-aware queries, HMATr achieves end-to-end unification of detection and classification, outperforming SOTA methods by 7.6% in detection rate.

Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning

OmniME addresses text-driven human motion editing by decomposing supervision into two complementary branches: "positive supervision" (retrospective intermediate feature supervision + similarity-based motion preservation) and "negative supervision" (triplet semantic alignment). Within a diffusion framework, it simultaneously constrains "what to change" and "what to keep," reducing the Average Rank (AvgR) from 20.88 to 13.06 on MotionFix and from 29.05 to 22.77 on STANCE Adjustment.

Open the Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation

To address the poor generalization of Text-to-Motion (T2M) models on out-of-distribution text, this paper proposes an "Atomic Motion Decomposition-Recomposition" framework. It decomposes arbitrary raw text into low-level "atomic motion" descriptions across different body parts and time intervals, then learns to recompose these atomic motions into complete sequences. Using only HumanML3D for training, it significantly outperforms SOTA models on two out-of-distribution datasets (IDEA400, Mixamo).

OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data

OpenDance introduces OpenDanceSet, a 100-hour large-scale 3D dance dataset across 14 genres with multimodal annotations (music, text, 2D keypoints, trajectories) derived from internet videos. Simultaneously, it proposes OpenDanceNet, a unified framework utilizing "decoupled tokenization + multimodal masked joint prediction + inference-time re-masking refinement" to achieve high-fidelity and finely controllable 3D dance generation driven by "music + arbitrary condition combinations."

OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

The OpenFS framework is proposed, achieving multi-hand fingerspelling recognition via dual-level positional encoding + signing-hand focus loss + monotonic alignment loss for implicit signing-hand detection. It also designs a frame-wise letter-conditioned diffusion generator to synthesize OOV data, achieving SOTA on ChicagoFSWild/ChicagoFSWildPlus/FSNeo benchmarks with inference speeds over 100x faster than PoseNet.

OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

The authors identified train/val set leakage in existing Text-to-Motion (T2M) benchmarks, where models overfit rather than generalize. They constructed OpenT2M, a million-scale, physically plausible, second-level annotated, long-horizon open-source motion dataset. Accompanied by MonoFrill—a "no-frills" autoregressive model using a 2D-PRQ tokenizer that treats motion as a "time × body-part" 2D image—the work improves zero-shot R@1 on de-leaked OOD benchmarks from approximately 0.07 to 0.24.

OSMO: Open-vocabulary Self-eMOtion Tracking

This paper proposes a new task, "First-person Self-emotion Tracking": inferring the wearer's evolving emotions over time from the multimodal streams of smart glasses (speech, visual environment, dialogue text, and eye movements). It introduces the OSMO dataset (110 hours, the first and largest first-person emotion dataset with per-subject timelines), the OSMO benchmark (5 tasks), and the OSIRIS model (the first LMM to jointly process video/audio/dialogue/eye IR and use emotional history for temporal reasoning), setting new SOTA results across all metrics.

PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects

PAMotion utilizes the physical intuition that "object acceleration exposes the contact state" to design a soft physics-aware interaction loss. Combined with a coarse-to-fine two-stage conditional diffusion, it ensures text-driven full-body multi-object interaction motions are semantically aligned while eliminating issues like hand interpenetration and floating objects, achieving SOTA on HIMO and ParaHome.

ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis

The ParTY framework is proposed to significantly improve the precision of text-motion semantic alignment for individual body parts while maintaining full-body motion coherence through a Part-Guided Network and Part-aware Text Grounding (PTG). It resolves the fundamental contradiction between "part expressiveness vs. full-body coherence" found in existing holistic and part-based methods.

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

PC-Talk performs "additive deformation" on the intermediate representation of implicit keypoints. It employs a LAC module to control lip-audio alignment with speaking styles and an EMC module to decouple pure emotional deformation by "subtracting neutral expressions." This enables fine-grained, controllable, real-time (30 FPS) talking face generation for speaking styles, lip-motion amplitude, emotional intensity, and even multi-region composite emotions, achieving SOTA on HDTF and MEAD.

PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement

Starting from the Navier-Stokes equations, this work reveals through rigorous mathematical derivation that the rPPG pulse signal follows a second-order damped harmonic oscillator model. Its discrete solution is equivalent to a causal convolution operator, providing a first-principles justification for the choice of TCN architectures. The resulting PHASE-Net, with only 0.29M parameters, achieves SOTA performance across multiple datasets.
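
The link between a damped oscillator and a causal convolution can be checked numerically on a toy system. The sketch below does not reproduce the paper's derivation from the Navier-Stokes equations; it assumes a generic discrete second-order recurrence with a complex-conjugate pole pair of radius `r < 1` (a damped oscillation) and illustrative coefficients, and verifies that running the recurrence is the same as convolving the input with the system's impulse response using past samples only.

```python
import math

def simulate(u, a1, a2):
    """Run x[n] = a1*x[n-1] + a2*x[n-2] + u[n] from zero initial state."""
    x = []
    for n, u_n in enumerate(u):
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(a1 * prev1 + a2 * prev2 + u_n)
    return x

def causal_conv(u, h):
    """y[n] = sum_{k=0..n} h[k] * u[n-k]; only current and past inputs are used."""
    return [sum(h[k] * u[n - k] for k in range(n + 1)) for n in range(len(u))]

# Poles at r * exp(+/- i*theta): oscillation at theta rad/sample, decaying by r.
r, theta = 0.9, 2 * math.pi / 20
a1, a2 = 2 * r * math.cos(theta), -r * r

N = 50
impulse = [1.0] + [0.0] * (N - 1)
h = simulate(impulse, a1, a2)             # impulse response = convolution kernel
u = [math.sin(0.3 * n) for n in range(N)]  # arbitrary driving signal

direct = simulate(u, a1, a2)
via_conv = causal_conv(u, h)
max_gap = max(abs(a - b) for a, b in zip(direct, via_conv))
```

`max_gap` stays at floating-point noise level, which is the sense in which a temporal convolution with a causal kernel can represent such dynamics.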

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

PolySLGen feeds past speech and motion of a multi-person group into a LoRA-fine-tuned LLM to online generate the target participant's future speech, body motion, and a "speaking state score." By unifying multi-person non-verbal signals with a Pose Fusion module and a Social Cue Encoder, it models both speaking and listening behaviors. It significantly outperforms baselines that naively extend dyadic methods to polyadic settings in terms of motion quality, speech-motion alignment, and speaking state prediction.

Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification

Addressing the "one client = one camera, limited pose visibility" scenario in federated-by-camera person Re-identification, this paper proposes a Pose Extraction Module (PEM) to decouple features into "pose-related" and "pose-unrelated" components. By swapping pose components across identities to synthesize "pose-changed" hard positive samples, and employing Pose-guided Knowledge Distillation (PKD), Semantic Consistency Maintenance (SCM), and Compatibility Regularization (CR) for decoupling quality and global compatibility, the method compensates for the lack of pose diversity in contrastive learning, achieving SOTA results on Market1501 and MSMT17 for FedReID.

PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation

PRISM represents skeleton actions as a "weighted combination of reusable atomic motion primitives" (primitive coefficient space). It first learns this physically interpretable, view-invariant structured representation using multi-view synthetic data through a generative objective, then sequentially transfers the same representation to classification and frame-wise detection via lightweight task heads. It consistently outperforms specialized models on long-tail, multi-label, and multi-view real-world datasets.

Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models

MotionDEQ reformulates the cascaded framework of "multi-stage progressive guessing" in human motion prediction into a fixed-point solving problem within an implicit layer. This is equivalent to infinite refinement stages but requires only \(O(1)\) training memory. By injecting Euclidean equivariance into this equilibrium process and utilizing the temporal coherence of adjacent predictions to reuse the previous fixed point as a "warm-start," it achieves SOTA accuracy on Human3.6M with fewer than 300K parameters, saving more than 2x training memory compared to multi-stage competitors.
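
The fixed-point view and the benefit of a warm start can be seen on a scalar toy problem. This is a sketch under strong assumptions: the "layer" is a hand-picked contraction `f(z) = 0.5 * tanh(z) + x`, not the paper's network, the solver is plain iteration, and equivariance is ignored. All names are hypothetical.

```python
import math

def solve_fixed_point(f, z0, tol=1e-8, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol.

    Returns (approximate fixed point, number of iterations used).
    """
    z = z0
    for i in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iter

def make_layer(x):
    """Toy implicit layer for input x; its slope is at most 0.5, so it contracts."""
    return lambda z: 0.5 * math.tanh(z) + x

# Solve for one input, then for a slightly different input (a stand-in for the
# next prediction in a temporally coherent sequence).
z_a, _ = solve_fixed_point(make_layer(1.00), 0.0)
z_cold, cold_iters = solve_fixed_point(make_layer(1.01), 0.0)
z_warm, warm_iters = solve_fixed_point(make_layer(1.01), z_a)
```

Both runs reach the same equilibrium, but starting from the previous solution `z_a` needs fewer iterations than starting from zero, because adjacent inputs have nearby fixed points.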

Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification

PAD treats the frozen CLIP text encoder as a cross-domain invariant "semantic anchor." By employing an asymmetric vision-text distillation—weak text-side distillation to ensure semantic stability and strong vision-side EMA distillation to maintain plasticity—it simultaneously suppresses catastrophic forgetting and semantic drift in exemplar-free lifelong person Re-ID. It achieves an average mAP of 70.7 on seen domains and 78.6 on unseen domains, significantly outperforming previous SOTA methods.

Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds

A two-stage deep reinforcement learning (RL) framework is used to train full-body physical humanoid agents. In the first stage, agents learn "stepping to recover balance" after being pushed through motion imitation and physical balance rewards. In the second stage, the policy is fine-tuned using AdaptNet with a "hand-to-shoulder contact" heuristic. This allows agents in dense crowds to socially dissipate impact by pushing or leaning on neighbors, successfully reproducing phenomena such as force propagation, falls, and crowd crushes in a purely physical simulation.

RAM: Recover Any 3D Human Motion in-the-Wild

RAM proposes a unified multi-person 3D motion recovery framework that integrates a motion-aware semantic tracker SegFollow (based on SAM2 + adaptive Kalman filtering), a memory-enhanced temporal human mesh recovery module T-HMR, a lightweight motion predictor, and a gated combiner. It achieves SOTA zero-shot tracking stability and 3D accuracy on benchmarks such as PoseTrack and 3DPW, with inference speeds 2-3 times faster than previous methods.

Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human-Computer Interaction

This paper does not invent a new network but utilizes a specifically collected dataset of 53,300 millimeter-level RGB-depth pairs to fine-tune existing monocular depth models for near-field fingertip scenarios. By layering a "depth + motion" fusion velocity-gated state machine, the system reduces depth error from 12.3 mm to 3.84 mm (a 68% reduction) using only a standard RGB camera. It achieves a 94.4% F1-score for contact detection, enabling "blind typing" on a tabletop at 45.6 WPM with a 3.1% character error rate, approaching the performance of specialized depth hardware and commercial VR input.

RefTon: Reference Person Shot Assist Virtual Try-on

This paper proposes RefTon, a human-to-human virtual try-on framework based on Flux-Kontext. It introduces additional reference images (photos of others wearing the target garment) to provide more accurate clothing details. Through a two-stage training strategy and a rescaled position indexing mechanism, it achieves end-to-end try-on without auxiliary conditions (e.g., DensePose, segmentation masks), reaching SOTA performance on VITON-HD and DressCode.

RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised HOI Detection

RegFormer transforms weakly-supervised HOI detection from "enumerating all human-object pairs and cropping regions for classification" to "grounding human-object relations as queries on CLIP spatial feature maps and gating non-interacting pairs with interactiveness scores." Trained only with image-level annotations, it transfers directly to instance-level detection with a single backbone forward pass, achieving 38.14 mAP on HICO-DET with H-DETR, surpassing fully-supervised methods.

Region-Aware Instance Consistency Learning for Micro-Expression Recognition

This paper views a micro-expression sequence as a multi-instance set composed of "onset frame + multiple middle frames." By using a Siamese network to force the alignment of attention maps across different instances (IRC) and employing learnable facial queries to uncover neglected weak activation regions (MRD), the method completely eliminates the need for expensive apex frame annotations and outperforms state-of-the-art (SOTA) methods across four public datasets.

ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data

ReMoGen is proposed as a modular framework for real-time human interaction-to-reaction motion generation. It leverages a frozen general motion prior learned from large-scale single-person motion data, adapts to different interaction domains (human-human/human-scene) via independently trained Meta-Interaction modules, and introduces Frame-wise Segment Refinement to achieve low-latency online updates (0.047s/frame), outperforming SOTA on Inter-X and LINGO datasets.

Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation

This paper argues that the group-level assumptions of mainstream Unsupervised Domain Adaptation (UDA) are disconnected from real-world scenarios where systems serve only one new user at a time. It proposes a new paradigm, Unsupervised Personal Adaptation (UPA), and constructs a Render-Cycle consistency self-supervision signal using a fixed-parameter differentiable renderer. By rendering predicted gaze into a new image and reading back the iris position, the method uses the consistency of iris locations to backpropagate and correct gaze deviations. This approach achieves stable gains for every user in cross-dataset person-specific adaptation, significantly outperforming existing SOTAs.

RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework

This paper introduces the first RGB-Event multimodal pedestrian attribute recognition task and constructs EventPAR, the first large-scale dataset containing 100,000 paired RGB-Event frames with 6 types of emotional attributes. An asymmetric RWKV fusion framework (dual-stream RWKV encoding + OTN-RWKV event token filtering and bidirectional cross-fusion) is proposed, achieving SOTA performance across three datasets.

RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

Addressing the long-standing dilemma in 3D human motion generation between "small but clean mocap sets" and "large but noisy in-the-wild sets," RoMo employs a taxonomy-aware adaptive filtering pipeline to distill approximately 1% of high-quality motion from 125,000 hours of web videos. It constructs a large-scale dataset featuring 820,000 segments (approx. 1,238 hours), each with 5 rich text descriptions organized by a three-level taxonomy (Category → Subcategory → Atomic action). Accompanied by the Motion Toolbox for unified evaluation, models trained on RoMo achieve SOTA performance in fidelity, diversity, and fine-grained textual understanding.

rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training

rPPG-VQA proposes the first video quality assessment framework specifically for remote heart rate detection (rPPG), combining signal-level multi-method consensus SNR with scene-level MLLM interference identification, alongside a two-stage adaptive sampling strategy to filter in-the-wild videos for training set construction.

SAM 3D Body: Robust Full-Body Human Mesh Recovery

SAM 3D Body (3DB) is a SAM-style promptable single-view full-body human mesh recovery model. It utilizes a shared encoder + body/hand dual-decoder architecture based on the MHR representation, which decouples skeleton and shape. Coupled with a data engine capable of mining hard samples and producing 7 million high-quality annotations, it achieves SOTA performance on both body and hand poses in in-the-wild images.

SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

SceMoS replaces expensive 3D voxel/point cloud supervision with two lightweight 2D scene cues—DINOv2 features from Birds-Eye-View (BEV) for global semantic planning and local 2D height maps to embed surface physics directly into the motion token vocabulary. It achieves SOTA motion realism and contact accuracy on TRUMANS while reducing trainable scene encoding parameters by an order of magnitude (~4M vs. ~50M).

See Through the Noise: Improving Domain Generalization in Gaze Estimation

SeeTN attributes "poor cross-domain generalization in gaze estimation" to source domain label noise for the first time. It identifies noisy samples by aligning feature affinities with continuous label affinities via a prototype-constructed semantic manifold. By applying specific regularization to clean and noisy samples, it transfers supervision from clean to noisy samples, reducing angular error by 12–18% across four cross-domain settings without sacrificing source domain accuracy.

Seeing without Pixels: Perception from Camera Trajectories

This paper systematically promotes camera pose trajectories (6DoF pose sequences) as a standalone video perception modality for the first time. By training a lightweight Transformer encoder, CamFormer, through a contrastive learning framework, camera trajectories are mapped into a joint embedding space aligned with text. Experiments across 10 downstream tasks on 5 datasets demonstrate that camera trajectories are both lightweight and robust video signals—even outperforming video models with thousands of times higher computational costs in physical activities.

SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production

Addressing the gloss-free Text2Pose task, SignPR proposes a "structural + temporal" dual-progressive vector-quantized diffusion framework. It utilizes a structured VQVAE to decompose each frame's pose into semantic-level (global) and regional-level (hand/face/body) discrete tokens. The diffusion process first generates semantically consistent coarse poses before refining regional details. During inference, block-level causal progressive refinement is employed to ensure temporal coherence. SignPR outperforms previous T2P methods on Phoenix14T, CSL-Daily, and USTC-CSL datasets.

Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation

This paper proposes Sketch2Colab, which generates coordinated 3D motions for multi-human, multi-object interactions from storyboard sketches. It distills sketch-driven diffusion priors into a rectified flow student network and combines this with energy guidance and Continuous-Time Markov Chain (CTMC) discrete event planning, achieving SOTA constraint compliance and perceptual quality on CORE4D and InterHuman.

Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification

To address Occluded Visible-Infrared Person Re-Identification (Occluded VI-ReID), this paper proposes the SFCL framework. It uses FFT to decompose features into amplitude (encoding modality appearance) and phase (preserving identity structure), aligns modalities in the frequency domain using Optimal Transport, injects frequency structural cues back into spatial features, and employs a frequency-contrastive and semantic-consistent FAD loss. The method outperforms previous SOTA on two self-constructed occlusion datasets (Occ-SYSU-MM01 All-Search Rank-1 65.97%, +4.31%).
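
The amplitude/phase decomposition can be sketched in a few lines of NumPy. This is only an illustration of the FFT step under assumed shapes (a `(C, H, W)` feature map); the function names are hypothetical, and the Optimal Transport alignment and FAD loss are not reproduced here.

```python
import numpy as np

def split_amplitude_phase(feat):
    """Decompose a (C, H, W) feature map into per-channel FFT amplitude and phase."""
    spec = np.fft.fft2(feat, axes=(-2, -1))
    return np.abs(spec), np.angle(spec)

def merge_amplitude_phase(amplitude, phase):
    """Rebuild a spatial feature map from an amplitude and a phase spectrum."""
    spec = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spec, axes=(-2, -1)).real

rng = np.random.default_rng(0)
visible = rng.standard_normal((4, 8, 8))    # stand-in visible-light features
infrared = rng.standard_normal((4, 8, 8))   # stand-in infrared features

amp_v, pha_v = split_amplitude_phase(visible)
amp_i, pha_i = split_amplitude_phase(infrared)

# Round trip: amplitude and phase together carry all the information.
recon = merge_amplitude_phase(amp_v, pha_v)

# Swap: infrared amplitude (appearance) combined with visible phase (structure).
swapped = merge_amplitude_phase(amp_i, pha_v)
```

The round trip recovers the input up to floating-point error. The swapped map keeps the visible phase while taking on the infrared amplitude spectrum, which is the kind of separation between modality appearance and identity structure the summary describes.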

SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification

TE-VMamba leverages the SS2D state update intensity (step size \(\Delta\)) and token similarity to guide token reduction. It prunes redundant tokens that contribute minimally to the state in shallow layers based on \(\Delta\) and merges semantically similar tokens in deep layers. On Market-1501, it reduces FLOPs by over 60% while Rank-1 accuracy actually increases.
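
A toy version of the two reduction steps is sketched below, assuming one scalar \(\Delta\) per token and cosine similarity between neighboring tokens. Both helpers are hypothetical illustrations, not TE-VMamba's actual pruning and merging rules.

```python
import numpy as np

def prune_by_delta(tokens, delta, keep_ratio):
    """Keep the tokens with the largest step size, preserving scan order."""
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(np.argsort(delta)[-n_keep:])
    return tokens[keep], keep

def merge_most_similar_neighbors(tokens):
    """Replace the most cosine-similar pair of adjacent tokens with their mean."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    i = int(np.argmax(sims))
    merged = 0.5 * (tokens[i] + tokens[i + 1])
    return np.concatenate([tokens[:i], merged[None], tokens[i + 2:]], axis=0)

tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]])
delta = np.array([0.5, 0.01, 0.4, 0.3])     # assumed per-token step sizes

# Shallow-layer style reduction: drop the token with the smallest delta.
pruned, kept_idx = prune_by_delta(tokens, delta, keep_ratio=0.75)

# Deep-layer style reduction: fuse the two most similar neighbors.
merged_tokens = merge_most_similar_neighbors(tokens)
```

In this toy input, pruning removes the second token because its \(\Delta\) is smallest, while merging averages the first two tokens because they point in nearly the same direction.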

Stake the Points: Structure-Faithful Instance Unlearning

Structguard is proposed to maintain the semantic relationship structure between retained instances during the unlearning process through semantic anchors, preventing structural collapse. It achieves average improvements of 32.9%, 19.3%, and 22.5% in image classification, face recognition, and retrieval tasks, respectively.

Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

Superman unifies "3D pose perception from video" and "skeleton-based motion generation" into a conditional sequence generation problem: it first employs a vision-guided motion tokenizer (VQ-VAE + dual vision/geometry streams + hybrid codebook) to quantize continuous motion into cross-modal discrete tokens, then utilizes a single MLLM (Qwen2.5-VL-7B) to autoregressively predict these tokens. This framework performs 3D pose estimation, motion prediction, and motion in-betweening within a single model, achieving an 11–12% improvement over specialized SOTAs on Human3.6M.

SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head

SyncDreamer utilizes a Diffusion Transformer framework to generate identity-preserving, emotionally expressive avatar videos with fine-grained text control over gestures and gaze, using only a single reference image, audio, and text prompts. It locks identity through a visual adapter (with attention localization loss), converts speech rhythm/energy into expression drivers via an audio dynamics encoder, and transforms short text into actionable movement instructions through a GRPO-trained cross-modal prompt enhancer, achieving SOTA on both portrait and full-body benchmarks.

SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction

SyncMos utilizes an LLM event planner to decompose natural language instructions into temporal dependency graphs. By applying time-warping and Diffusion Posterior Sampling (DPS) as post-processing without retraining the single-agent diffusion motion model, it aligns the actions of an arbitrary number of agents (e.g., handing over objects) in time, achieving scalable multi-agent 3D scene interaction generation.

Talking Together: Synthesizing Co-Located 3D Conversations from Audio

This paper introduces the first method to generate complete facial animations for two participants co-located in the same 3D space from a single mixed audio stream. By utilizing a dual-stream diffusion architecture (shared U-Net + cross-attention), a two-stage hybrid data training strategy, LLM-driven text-to-spatial layout control, and an auxiliary gaze loss, the system achieves natural mutual gaze, head movements, and space-aware 3D animation synthesis for dyadic conversations.

Text-guided Feature Disentanglement for Cross-modal Gait Recognition

This work generates a "modality+view"-aware gait text dictionary using LLMs and leverages CLIP to use text as semantic anchors for guiding visual feature disentanglement. It decomposes gait features from LiDAR and camera modalities into "modality-specific" and "modality-shared" components, performing retrieval using only the shared features. This approach achieves new SOTA results on the SUSTech1K and FreeGait benchmarks (e.g., FreeGait 3D→2D Rank-1 increases from 43.3 to 57.9).

4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction

This paper proposes 4DSurf, a general dynamic scene surface reconstruction framework based on 2D Gaussian Splatting. By introducing Gaussian motion-induced SDF flow regularization to constrain the temporally consistent evolution of the surface and employing an overlapping segment strategy to handle large deformations, it surpasses existing SOTA methods with Chamfer distance improvements of 49% and 19% on the Hi4D and CMU Panoptic datasets, respectively.

Towards Cross-Modal Preservation, Consistency and Alignment for Privacy-Preserving Visible-Infrared Person Re-Identification

This paper introduces a new task, PP-VI-ReID (Privacy-Preserving Visible-Infrared Person Re-Identification), and proposes a PPA framework to address two major challenges: "anonymization destroying identity information" and "inconsistent anonymization distortion across modalities." The KPR module utilizes human pose priors for structure-aware precise anonymization, while the DCMA module treats anonymization perturbations as learnable stable offsets to align cross-modal features. The method significantly outperforms a modified version of SecureReID on SYSU-MM01 and RegDB, establishing a strong baseline.

Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

DeMoGen reverses "text-to-motion" generation: it utilizes energy-based diffusion models to decompose a holistic motion into several semantically interpretable motion concepts (e.g., "walking in a Z-shape" + "waving left hand") without decomposition-level ground truth. These concepts can be freely recombined to generate unseen motions. It achieves improvements across text-to-motion, composition, and multi-concept tasks on HumanML3D and MTT.

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Addressing extremely difficult spatial-temporal/numerical constraints (e.g., "passing through a 0.4m narrow gap," "walking 4 meters in exactly 6 steps"), this paper introduces a retrieval channel into the training-free Diffusion Noise Optimization (DNO) framework. It first parses the most difficult constraints via relational task analysis, retrieves reference motions from a dataset to invert them into reference noise, and finally blends random and retrieved noise using a reward-guided mask as a superior initialization. This significantly reduces constraint errors compared to vanilla DNO.
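
The final blending step can be pictured as a masked mix of two noise tensors. The sketch below assumes a binary per-frame mask and a `(frames, dims)` noise layout; `blend_noise` and the hand-written mask are hypothetical, and the reward computation that would produce the mask is not shown.

```python
import numpy as np

def blend_noise(random_noise, retrieved_noise, mask):
    """Use retrieved noise where mask is 1 and random noise where mask is 0."""
    return mask * retrieved_noise + (1.0 - mask) * random_noise

rng = np.random.default_rng(0)
frames, dims = 6, 4
random_noise = rng.standard_normal((frames, dims))
retrieved_noise = rng.standard_normal((frames, dims))  # stand-in for inverted reference motion

# Assumed mask: 1 marks frames where the retrieved reference should be trusted.
mask = np.array([1, 1, 0, 0, 1, 0], dtype=float)[:, None]

init_noise = blend_noise(random_noise, retrieved_noise, mask)
```

With a binary mask, every entry of the initialization is copied from exactly one of the two sources, so no entry is an average of the two.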

Translating Signals to Languages for sEMG-Based Activity Recognition

This paper proposes LLM-sEMG, which uses an sEMG-specific VQ-VAE to discretize continuous electromyography signals into tokens. Through a combination of "Lewis signaling games + human language inductive bias," these tokens evolve into a natural-language-like "sEMG language." Finally, a frozen pre-trained LLM with LoRA fine-tuning directly interprets this language for activity recognition, achieving accuracies of 95.14% on GRABMyo and 93.17% on NinaPro DB2, outperforming the strongest baseline STET by approximately 4 percentage points.

TriLite: Efficient WSOL with Universal Visual Features and Tri-Region Disentanglement

Using only a frozen DINOv2 ViT and a TriHead module with fewer than 800K trainable parameters, TriLite sets a new SOTA in weakly supervised object localization (WSOL). It does so by disentangling patch features into foreground, background, and ambiguous regions and introducing an adversarial background loss.

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

The authors propose the UniDex robot foundation suite—comprising a large-scale dataset across 8 dexterous hands (50K+ trajectories/9M frames), a Function-Actuator Aligned Space (FAAS), and a 3D VLA policy (UniDex-VLA). It achieves an 81% average task progress (vs. 38% for π₀) on real-world tool-use tasks and demonstrates spatial, object, and zero-shot cross-hand generalization capabilities.

Unified Number-Free Text-to-Motion Generation Via Flow Matching

UMF bridges single-person and multi-person motion datasets using a unified multi-token latent space. It establishes a "1+N" paradigm consisting of a "Pyramid Motion Flow (P-Flow)" for single-pass generation of motion priors and a "Semi-Noisy Motion Flow (S-Flow)" for iterative autoregressive generation of responses. This reaches SOTA on text-driven "number-free" multi-person generation (InterHuman FID 4.772) while performing inference approximately 5x faster than FreeMotion.

UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

This paper proposes UniLS, the first end-to-end framework that unifies speaking and listening facial expression generation. Through a two-stage training paradigm (learning intrinsic motion priors first, followed by dual-track audio fine-tuning), it simultaneously generates natural speaking and listening facial movements from dyadic audio inputs, achieving up to a 44.1% improvement in listening metrics.

Unleashing Vision-Language Semantics for Deepfake Video Detection

This paper proposes VLAForge, which independently learns diverse forgery cues and localization maps through ForgePerceiver and integrates an identity-aware Vision-Language Alignment (VLA) scoring mechanism. By unleashing the potential of cross-modal semantics from Vision-Language Models (VLMs) to enhance discriminative capabilities, the method consistently outperforms existing SOTA methods across nine datasets.

Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition

GaitMax utilizes a frozen DINOv3 Large Vision Model (LVM) to concurrently deploy a "Semantic Branch" (capturing global, order-invariant silhouettes) and a "Kinematic Branch" (tracking spatio-temporal trajectories of body parts via learnable queries). It incorporates a Conditional Decoupling Loss (CDLoss) to suppress shortcuts by de-correlating gait embeddings from textual descriptions of distractors (e.g., clothing, viewpoint) using second-order statistics. Supported by the self-constructed GCaption dataset with natural language labels, GaitMax achieves new SOTA performance across multiple cross-domain gait benchmarks.
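
One simple way to picture de-correlation through second-order statistics is a penalty on the cross-covariance between gait embeddings and distractor embeddings. The sketch below is an assumed, generic formulation; `cross_covariance_penalty` is hypothetical and is not claimed to match CDLoss.

```python
import numpy as np

def cross_covariance_penalty(gait, distractor):
    """Squared Frobenius norm of the cross-covariance between two embedding sets."""
    g = gait - gait.mean(axis=0, keepdims=True)
    d = distractor - distractor.mean(axis=0, keepdims=True)
    cov = g.T @ d / (len(g) - 1)            # (gait_dim, distractor_dim)
    return float(np.sum(cov ** 2))

rng = np.random.default_rng(0)
n = 2000
distractor = rng.standard_normal((n, 3))    # stand-in clothing/viewpoint text embeddings
mixing = rng.standard_normal((3, 4))

# Gait embeddings that leak distractor information versus ones that do not.
gait_leaky = distractor @ mixing + 0.1 * rng.standard_normal((n, 4))
gait_clean = rng.standard_normal((n, 4))

penalty_leaky = cross_covariance_penalty(gait_leaky, distractor)
penalty_clean = cross_covariance_penalty(gait_clean, distractor)
```

Embeddings that are a linear function of the distractor features receive a large penalty, while independent embeddings score near zero, so minimizing such a term discourages the gait branch from encoding clothing or viewpoint cues.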

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

ViBES is proposed as a 3D conversational agent that unifies language, speech, and body movement. Through a Modality Mixture of Experts (MoME) architecture and cross-modal attention mechanisms, it generates temporally aligned facial expressions and full-body motions while preserving the conversational intelligence of pre-trained speech LLMs, shifting the paradigm beyond viewing behavior as simple "modality translation."

View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

To address the drastic viewpoint differences between UAVs and ground cameras in Aerial-Ground Person Re-Identification (AGPReID), this paper proposes ViSA. Instead of pursuing forced "view-invariant" alignment of shared parts, it utilizes a set of Expert-driven Token Generation Modules (ETGM) to generate adaptive semantic queries. These queries are then anchored to their responsive local regions using a Dual-branch Local Fusion Module (DLFM) through graph reasoning. This simultaneously preserves view-invariant and view-specific identity cues, achieving a 10.06% mAP improvement on the CARGO cross-view protocol.

Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification

VLADR leverages fine-grained attribute knowledge from Vision-Language Models (VLMs) to enhance lifelong person re-identification. Through a two-stage training process involving Multi-granular Textual Attribute Disentanglement (MTAD) and Intra-domain Cross-modal Attribute Reinforcement (ICAR), it explicitly models cross-domain shared human attributes to achieve efficient knowledge transfer and forgetting mitigation. It outperforms the previous SOTA by 1.9%–2.2% in anti-forgetting and 2.1%–2.5% in generalization.

WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering

WildCap is proposed as a hybrid inverse rendering framework (data-driven SwitchLight delighting + model-based texel grid lighting optimization + diffusion prior sampling). It reconstructs high-quality 4K facial diffuse albedo maps from smartphone in-the-wild videos, significantly narrowing the quality gap between uncontrolled capture and professional light stage methods.