🧑 Human Understanding

📹 ICCV2025 · 47 paper notes

AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

AdaHuman is proposed as a framework that generates high-fidelity, animatable 3D human avatars from a single image via a pose-conditioned 3D joint diffusion model and a compositional 3DGS refinement module.

AJAHR: Amputated Joint Aware 3D Human Mesh Recovery

This paper presents the first 3D human mesh recovery framework for amputees. It synthesizes 1M+ amputee images (A3D), designs the BPAC-Net amputation classifier to distinguish amputation from occlusion, and employs a dual-tokenizer switching strategy to encode amputee and non-amputee pose priors separately. The method achieves substantial improvements on amputee data (MVE 16.87 lower than TokenHMR on ITW-amputee) while remaining competitive on non-amputee benchmarks.

AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

This paper proposes AR-VRM, the first method to enhance visual robot manipulation (VRM) through explicit imitation of human hand keypoints. It employs a keypoint vision-language model pretrained on large-scale human activity videos to acquire motion knowledge, and establishes correspondences between human hand keypoints and robot components via analogical reasoning.

Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

This paper presents Avat3r — the first animatable large reconstruction model (LRM) that regresses high-quality drivable 3D Gaussian head avatars from only 4 input images in a feed-forward manner. By integrating DUSt3R positional maps and Sapiens semantic features as priors, and modeling expression-driven animation via simple cross-attention, Avat3r substantially outperforms existing methods on the Ava256 and NeRSemble datasets.

Bi-Level Optimization for Self-Supervised AI-Generated Face Detection

This paper proposes BLADES, a method that employs bi-level optimization to explicitly align self-supervised pretraining with the AI-generated face detection objective. The inner loop optimizes a visual encoder on pretext tasks including EXIF classification/ranking and face manipulation detection, while the outer loop optimizes task weights to improve performance on a proxy detection task, enabling cross-generator generalization without relying on any synthetic face data.

Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation

This paper is the first to investigate the value of rear-mounted cameras on HMDs for egocentric 3D whole-body pose estimation. It proposes a Transformer-based multi-view heatmap refinement method with an uncertainty-aware masking mechanism, achieving >10% MPJPE improvement on the newly constructed Ego4View dataset.

CarGait: Cross-Attention based Re-ranking for Gait Recognition

This paper proposes CarGait, a cross-attention-based re-ranking method for gait recognition. Given the top-K retrieval results of any single-stage gait model, CarGait learns fine-grained pair-wise interactions between the probe and each candidate via cross-attention over gait strips, generates new conditioned representations, and recomputes distances for re-ranking. CarGait consistently improves Rank-1/5 accuracy across three datasets (Gait3D, GREW, OU-MVLP) and seven baseline models, with an inference speed of 6.5 ms/probe that substantially outperforms existing re-ranking methods.
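
A minimal sketch of this re-ranking step, assuming hypothetical strip-feature shapes and a generic multi-head cross-attention layer (the paper's exact strip partitioning and training objective are not reproduced):

```python
import torch
import torch.nn as nn

class StripCrossReRanker(nn.Module):
    """Sketch: cross-attention re-ranking over horizontal gait strips.

    Probe strips attend to candidate strips, yielding a conditioned
    embedding that is used to recompute the ranking distance.
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, probe_strips, cand_strips):
        # probe_strips, cand_strips: (B, S, D), S strips of dim D
        fused, _ = self.attn(probe_strips, cand_strips, cand_strips)
        return self.proj(fused).mean(dim=1)  # (B, D) conditioned embedding

def rerank(probe, candidates, model):
    # probe: (1, S, D); candidates: (K, S, D) top-K from a frozen single-stage model
    p = model(probe.expand(candidates.size(0), -1, -1), candidates)
    c = model(candidates, probe.expand(candidates.size(0), -1, -1))
    dists = torch.norm(p - c, dim=1)  # recomputed pairwise distances
    return torch.argsort(dists)       # new candidate ordering
```

Because the single-stage model stays frozen, only the small re-ranker runs per probe-candidate pair, which keeps the re-ranking overhead low.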

CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation

This work is the first to introduce causal reasoning into category-level object pose estimation (COPE). It eliminates spurious correlations induced by data bias via a front-door adjustment-based causal reasoning module, and provides unbiased categorical semantic supervision through residual knowledge distillation from the 3D foundation model ULIP-2. The method achieves 61.7% on the strict 5°2cm metric on REAL275, surpassing the state of the art by 4.7%.
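
For reference, the standard front-door adjustment that such a module instantiates estimates the effect of features X on prediction Y through a mediator M, bypassing unobserved confounders; how CleanPose parameterizes the mediator is specific to the paper and not shown here:

```latex
% Standard front-door adjustment: the causal effect of X on Y is
% identified through mediator M even under unobserved confounding.
P(Y \mid \mathrm{do}(X{=}x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid x', m)\, P(x')
```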

Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

This paper proposes BioTUCH, a framework that detects self-contact events via wrist-to-wrist bioimpedance sensing and performs contact-aware 3D arm pose refinement in conjunction with a visual pose estimator, achieving an average improvement of 11.7% in reconstruction accuracy.

Controllable and Expressive One-Shot Video Head Swapping

This paper proposes a diffusion-based multi-condition controllable video head swapping framework (SwapAnyHead) that achieves high-fidelity identity preservation, seamless background blending, and accurate cross-identity expression transfer and editing via a shape-agnostic mask strategy, a hair enhancement strategy, and an expression-aware 3DMM-driven landmark retargeting module.

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

DreamActor-M1 proposes a human image animation framework based on the DiT architecture, achieving fine-grained facial and body control through hybrid control signals comprising implicit facial representations, 3D head spheres, and 3D body skeletons. Combined with complementary appearance guidance and a progressive training strategy, the framework supports multi-scale generation ranging from portrait to full-body.

Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation

This paper proposes ViTaM-D, a vision-tactile fusion framework that achieves dynamic reconstruction of hand-object interaction for both rigid and deformable objects. The framework introduces a novel Distributed Force-aware Contact Representation (DF-Field) and a two-stage pipeline consisting of visual dynamic tracking followed by force-aware optimization. The HOT dataset is also introduced to address the evaluation gap in deformable object hand-object interaction.

DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration

This paper proposes DynFaceRestore, which reformulates blind degradation as a Gaussian deblurring problem via Dynamic Blur Level Mapping (DBLM), and achieves an optimal fidelity-quality trade-off during diffusion model sampling through a Dynamic Starting Step lookup table (DSST) and a Dynamic Guidance Scaling Adjuster (DGSA).

EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds

This paper proposes EgoAgent, a unified predictive agent model that simultaneously learns to represent egocentric visual observations, predict future world states, and generate 3D human motions within a single Transformer.

Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision

This paper proposes Fish2Mesh, a fisheye-aware Transformer model that embeds the spherical geometry of fisheye images into a Swin Transformer via an Egocentric Positional Encoding (EPE) based on equirectangular projection, enabling accurate 3D human mesh recovery from a head-mounted fisheye camera in egocentric perspective.

GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

This paper proposes GenM3, a framework that learns unified discrete motion representations via a Multi-Expert VQ-VAE (MEVQ-VAE) and employs a Multi-path Motion Transformer (MMT) to handle intra-modal variation and cross-modal alignment. By integrating 11 motion datasets (~220 hours), GenM3 achieves state-of-the-art FID of 0.035 on HumanML3D.
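
As a reference point, here is the standard vector-quantization step that a multi-expert VQ-VAE builds on, with a hypothetical per-sample expert routing (the paper's actual expert design is not reproduced):

```python
import torch

def vq_multi_expert(z, codebooks, expert_id):
    """Quantize latents with the expert codebook selected per sample.

    z:         (N, D) encoder latents
    codebooks: (E, K, D) one codebook per expert
    expert_id: (N,) long tensor routing each sample to an expert (hypothetical)
    """
    cb = codebooks[expert_id]                       # (N, K, D)
    d = torch.cdist(z.unsqueeze(1), cb).squeeze(1)  # (N, K) distances to codes
    idx = d.argmin(dim=1)                           # nearest code per latent
    z_q = cb[torch.arange(z.size(0)), idx]          # (N, D) quantized latents
    # Straight-through estimator: gradients flow back to the encoder.
    return z + (z_q - z).detach(), idx
```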

GENMO: A GENeralist Model for Human MOtion

This paper proposes GENMO, the first generalist model that unifies human motion estimation (recovering motion from video/2D keypoints) and motion generation (synthesizing motion from text/music/keyframes) within a single framework. Through a dual-mode training paradigm (regression + diffusion), GENMO achieves both precise estimation and diverse generation in a single model.

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

This paper proposes GestureHYDRA, a co-speech gesture synthesis system based on a Hybrid-Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation, capable of reliably activating semantically explicit gestures such as numerical and directional indications.

GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation

GGTalker proposes a prior-adaptation two-stage training strategy that learns generalizable audio-to-expression and expression-to-visual priors from large-scale datasets, then rapidly adapts to a specific identity. The method achieves state-of-the-art performance across rendering quality, 3D consistency, lip synchronization, and training efficiency, requiring only 20 minutes of adaptation to generate photorealistic talking-head videos at 120 FPS.

HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation

This paper proposes simultaneously predicting the 3D coordinates of both the front and back surfaces of an object and densely sampling between the two surfaces to construct ultra-dense 2D-3D correspondences. Combined with a novel Hierarchical Continuous Coordinate Encoding (HCCE), the method surpasses existing state-of-the-art approaches on all seven core BOP benchmark datasets.
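
A minimal sketch of the dense-sampling idea, assuming the network outputs per-pixel 3D coordinate maps for the visible front surface and the hidden back surface (names and shapes here are illustrative):

```python
import numpy as np

def sample_between_surfaces(front_xyz, back_xyz, mask, n_samples=8):
    """Densify 2D-3D correspondences by interpolating between the
    predicted front- and back-surface 3D coordinates.

    front_xyz, back_xyz: (H, W, 3) per-pixel object coordinates
    mask:                (H, W) bool, pixels on the object
    Returns (P, 2) pixel coords and (P, 3) matching 3D points.
    """
    ys, xs = np.nonzero(mask)
    t = np.linspace(0.0, 1.0, n_samples)[:, None, None]          # interpolation steps
    pts3d = (1 - t) * front_xyz[ys, xs] + t * back_xyz[ys, xs]   # (n, P, 3)
    pix = np.stack([xs, ys], axis=1)                             # (P, 2)
    pix = np.broadcast_to(pix, (n_samples,) + pix.shape)
    return pix.reshape(-1, 2), pts3d.reshape(-1, 3)
```

Every interpolated point lies on the viewing ray through its pixel, so all samples share that pixel as a valid 2D-3D correspondence; this is what makes the correspondences ultra-dense.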

High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

This paper proposes GLSMamba, the first pure-Mamba framework for video-based human pose estimation (VHPE). It models global dynamic context via a Global Spatiotemporal Mamba (GSM) module—featuring 6D selective space-time scanning and spatiotemporal-modulated scan merging—and captures local keypoint details via a Local Refinement Mamba (LRM) with windowed spatiotemporal scanning. The method achieves state-of-the-art performance on four benchmarks with linear computational complexity.

HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

This paper proposes the HIS-QA task, the HIS-Bench benchmark, and HIS-GPT — the first foundation model for joint 3D human-in-scene understanding. Through an Auxiliary Interaction Module (AInt) and a Layout-Trajectory Positional Encoding (LTP), HIS-GPT captures fine-grained human–scene interactions and substantially outperforms GPT-4o and other baselines across 16 sub-tasks.

HUMOTO: A 4D Dataset of Mocap Human Object Interactions

This paper presents HUMOTO, a high-fidelity 4D human-object interaction dataset comprising 735 sequences (7,875 seconds at 30fps), covering 63 precisely modeled objects with 72 articulated parts. It introduces an LLM-driven scene scripting pipeline and a multi-sensor capture system, achieving significantly superior hand pose accuracy and interaction quality compared to existing datasets.

IDFace: Face Template Protection for Efficient and Secure Identification

This paper proposes IDFace, a homomorphic encryption (HE)-based face template protection method that achieves retrieval over 1 million encrypted templates in only 126ms — incurring merely a 2× overhead compared to unprotected retrieval — through two key techniques: a near-isometric transformation (real-valued vector → ternary vector) and a space-efficient encoding scheme.
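
A minimal sketch of the real-to-ternary step, with a hypothetical thresholding rule (the paper's exact near-isometric transformation is not reproduced here):

```python
import numpy as np

def ternarize(x, tau=0.3):
    """Map a face embedding to a ternary vector in {-1, 0, 1}.

    Small coordinates are zeroed; the rest keep their sign. tau is a
    hypothetical threshold controlling how many entries survive.
    """
    x = x / np.linalg.norm(x)
    t = np.where(np.abs(x) > tau * np.abs(x).max(), np.sign(x), 0.0)
    return t.astype(np.int8)
```

Roughly speaking, ternary entries let encrypted inner products be evaluated with additions and subtractions only, which helps explain the reported efficiency under homomorphic encryption.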

ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling

imHead proposes the first large-scale implicit 3D head morphable model. Through a global-local decoupled architecture trained on a dataset of 4,000 identities, it achieves both a compact implicit representation and localized facial editing, surpassing existing methods in reconstruction accuracy and editing flexibility.

KinMo: Kinematic-Aware Human Motion Understanding and Generation

This paper proposes the KinMo framework, which decomposes human motion into six kinematic groups and their interactions as a hierarchically describable representation. An automatic annotation pipeline generates fine-grained textual descriptions at multiple granularities. Combined with hierarchical text-motion alignment (HTMA) and a coarse-to-fine motion generation strategy, KinMo significantly improves motion understanding and fine-grained motion generation.

LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition

This paper proposes LVFace, which addresses training instability of ViT in large-scale face recognition via a Progressive Cluster Optimization (PCO) strategy. The training process is decomposed into three stages — feature alignment, centroid stabilization, and boundary refinement — achieving state-of-the-art results on multiple benchmarks.

MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances

MagShield is proposed as the first method addressing magnetic disturbance in sparse inertial motion capture systems. It adopts a two-stage detect-then-correct strategy: detecting magnetic disturbances via joint analysis of multiple IMUs, and correcting orientation errors using a human motion prior network. The approach can be plugged into existing sparse-IMU motion capture systems to enhance their robustness.
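
The paper detects disturbances by jointly analyzing multiple IMUs; as a simplified, hypothetical stand-in, a per-IMU consistency check against the expected Earth field looks like this:

```python
import numpy as np

def detect_mag_disturbance(R_ws, mags, ref_field, angle_thresh_deg=15.0):
    """Flag IMUs whose magnetometer disagrees with the reference field.

    R_ws:      (N, 3, 3) sensor-to-world rotations from the filter
    mags:      (N, 3) raw magnetometer readings in the sensor frame
    ref_field: (3,) expected Earth magnetic field in the world frame
    Returns a boolean mask of disturbed IMUs.
    """
    world_mags = np.einsum('nij,nj->ni', R_ws, mags)  # readings in world frame
    cos = (world_mags @ ref_field) / (
        np.linalg.norm(world_mags, axis=1) * np.linalg.norm(ref_field))
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles > angle_thresh_deg  # per-IMU disturbance flags
```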

MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

This paper introduces the Multimodal DuetDance (MDD) dataset — the first large-scale, professional-grade duet dance dataset simultaneously integrating motion, music, and text descriptions. MDD comprises 620 minutes of motion capture data spanning 15 dance styles and over 10K fine-grained text annotations, and defines two new tasks: Text-to-Duet and Text-to-Dance Accompaniment.

Mitigating Object Hallucinations via Sentence-Level Early Intervention

This paper proposes SENTINEL, a framework that mitigates object hallucinations in MLLMs via sentence-level early intervention and in-domain preference learning. It reduces hallucination rates by over 90% on Object HalBench while maintaining or even improving general-purpose capabilities.

MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation

This paper proposes MixRI, a lightweight network (5.3M parameters) that requires only 12 reference images, establishing 2D–3D correspondences between the reference images and a query image via a multi-view feature fusion strategy. MixRI achieves pose estimation performance comparable to methods requiring hundreds of reference images across the 7 core BOP challenge datasets.

Monocular Facial Appearance Capture in the Wild

This paper proposes a method for reconstructing facial appearance attributes (diffuse albedo, specular intensity, specular roughness) from monocular head-rotation videos. By introducing an occlusion-aware split-sum approximation shading model, the method achieves studio-grade facial appearance capture quality without imposing any simplifying assumptions on the illumination environment.

NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction

This paper proposes NGD, a neural gradient-based deformation method that decomposes the Jacobian field into a frame-invariant static component and a frame-dependent dynamic component. Combined with an adaptive remeshing strategy, NGD reconstructs high-fidelity dynamic garment geometry and appearance from monocular video, significantly outperforming existing SOTA methods on challenging scenarios such as loose-fitting garments.

One-Shot Knowledge Transfer for Scalable Person Re-Identification

This paper proposes OSKT (One-Shot Knowledge Transfer), which distills teacher model knowledge into a compact intermediate representation termed a "weight chain," enabling the generation of student models of arbitrary sizes for person re-identification with a single round of computation.

OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

This paper develops the OpenAnimals open-source framework, systematically revisiting the transferability of person re-identification methods to animal re-identification. It proposes ARBase, an animal-oriented strong baseline that substantially outperforms existing person ReID methods across multiple benchmarks.

PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation

This paper proposes the PersPose framework, which addresses the inaccurate depth estimation that arises when existing methods neglect field-of-view (FOV) information. It encodes the cropped camera intrinsics as a 2D map via Perspective Encoding (PE) and centers the subject through Perspective Rotation (PR) to eliminate perspective distortion.
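
One plausible realization of such an encoding stores, per pixel, the viewing-ray direction implied by the crop's intrinsics; this is an illustrative sketch, not necessarily the paper's exact PE:

```python
import numpy as np

def perspective_encoding(K, h, w):
    """Encode crop intrinsics K (3x3) as a per-pixel ray-direction map.

    Returns an (h, w, 3) map of unit viewing rays, making the crop's
    field of view explicit to the network.
    """
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T  # back-project pixels to viewing rays
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```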

PHD: Personalized 3D Human Body Fitting with Point Diffusion

This paper proposes PHD, a personalized 3D human pose estimation paradigm that first calibrates user-specific body shape via SHAPify, then employs a shape-conditioned point diffusion model (PointDiT) as a 3D prior, and iteratively optimizes pose parameters through Point Distillation Sampling combined with 2D keypoint constraints, achieving state-of-the-art absolute pose accuracy on the EMDB dataset.

PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data

This paper proposes the PoseSyn framework, which identifies hard samples for a target pose estimator (TPE) from in-the-wild 2D pose data via an Error Extraction Module (EEM), then expands inaccurate pseudo-labels into diverse motion sequences via a Motion Synthesis Module (MSM). A human animation model subsequently renders these sequences into realistic training images with accurate 3D annotations, improving 3D pose estimation accuracy by up to 14% across multiple real-world benchmarks.

RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation

This work reformulates unseen 6D object pose estimation as a ray alignment problem, proposes an object-centric ray parameterization scheme, and employs a diffusion transformer to infer the 6D pose of a query image from multiple template images with known poses.

SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

SemGes proposes a two-stage framework that integrates semantic information at both global and fine-grained levels through semantic coherence and semantic relevance learning, generating co-speech gestures aligned with speech semantics. The method surpasses existing approaches on two benchmarks: BEAT and TED-Expressive.

Sequential Keypoint Density Estimator: An Overlooked Baseline of Skeleton-Based Video Anomaly Detection

SeeKer proposes to autoregressively factorize the joint density of skeleton sequences at the keypoint level, detecting abnormal human behaviors by predicting conditional Gaussian distributions over subsequent keypoints. It substantially outperforms existing methods on the UBnormal and MSAD-HR benchmarks.
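
A minimal sketch of the scoring rule, assuming a hypothetical predictor that outputs a diagonal-Gaussian mean and variance for the next frame's keypoints given the past:

```python
import math
import torch

def anomaly_score(keypoints, predict_gaussian):
    """Negative log-likelihood of a skeleton sequence under an
    autoregressive per-keypoint Gaussian factorization.

    keypoints:        (T, J, 2) sequence of 2D joints
    predict_gaussian: fn(past) -> (mu, var), each (J, 2); hypothetical
    Higher scores indicate more anomalous motion.
    """
    nll = 0.0
    for t in range(1, keypoints.shape[0]):
        mu, var = predict_gaussian(keypoints[:t])
        diff = keypoints[t] - mu
        # Diagonal-Gaussian NLL: 0.5 * sum[(x-mu)^2/var + log var + log 2*pi]
        nll += 0.5 * ((diff ** 2 / var).sum()
                      + torch.log(var).sum()
                      + diff.numel() * math.log(2 * math.pi))
    return nll
```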

Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator

This paper proposes SOKE, a multilingual sign language generation framework built upon pretrained language models. It discretizes continuous sign language motion into token sequences via a decoupled tokenizer, and achieves high-quality text-to-3D-avatar sign language generation across multiple languages through multi-head decoding and retrieval-augmented strategies.

SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

This paper proposes SynFER, a diffusion-model-based facial expression synthesis framework that achieves fine-grained expression generation via dual control signals — text descriptions and Facial Action Units (FAUs) — and introduces a FERAnno label calibrator to ensure annotation reliability. The effectiveness of synthetic data for FER is validated across four learning paradigms: self-supervised, supervised, zero-shot, and few-shot learning.

TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions

TriDi is proposed as the first unified diffusion model that jointly models the three-variable distribution of humans (H), objects (O), and interactions (I). A single network covers 7 conditional generation modes, outperforming dedicated unidirectional baselines across all settings.

UDC-VIT: A Real-World Video Dataset for Under-Display Cameras

This paper presents UDC-VIT, the first real-world video dataset for under-display cameras (UDC), comprising 647 video clips with 116,460 frames in total. A carefully designed dual-camera beam-splitter acquisition system achieves precise spatiotemporal alignment. With face recognition as the primary application scenario, the dataset reveals the inadequacy of synthetic datasets in simulating real-world UDC degradation.

Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning

This paper proposes the first weakly supervised paradigm for visible-infrared person re-identification (VIReID), which relies solely on intra-modality identity annotations (without cross-modal correspondence labels). A heterogeneous expert collaborative consistency learning framework is introduced to establish cross-modal identity correspondences, achieving performance close to fully supervised methods.

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

This paper proposes AVATAR, a video-level audio-visual localization benchmark, and TAVLO, a temporally-aware model that addresses the neglect of temporal dynamics in conventional AVL methods through high-resolution temporal modeling.