🧑 Human Understanding

🎞️ ECCV2024 · 54 paper notes

🔥 Top topics: Human Pose ×15 · Face & Gaze ×12 · Sentiment Analysis ×3 · Avatars ×2 · Re-Identification ×2

3D Hand Pose Estimation in Everyday Egocentric Images: By systematically investigating four practices—cropped inputs, Intrinsics-aware Positional Encoding (KPE), auxiliary supervision (hand segmentation + grasp labels), and multi-dataset joint training—this work proposes the WildHands system. Under the constraint of using only a ResNet50 backbone and a small amount of data, WildHands achieves robust 3D hand pose estimation in in-the-wild egocentric images. Its zero-shot generalization outperforms FrankMocap across all metrics and competes closely with HaMeR, which is \(10\times\) larger.
3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views: This paper proposes 3DFG-PIFu, which globally fuses multi-view features across the entire pipeline by introducing 3D Feature Grids, replacing the traditional point-wise local fusion approach. Combined with an iterative grid refinement mechanism and SDF-based SMPL-X features, it significantly outperforms state-of-the-art sparse-view human digitization methods.
3DGazeNet: Generalizing 3D Gaze Estimation with Weak-Supervision from Synthetic Views: This paper reformulates gaze estimation as dense 3D eye mesh regression and trains with weak supervision, using pseudo-labels extracted automatically from large-scale in-the-wild face images together with multi-view images synthesized by HeadGAN. It achieves up to 30% improvement over SOTA in cross-domain scenarios.
3DSA: Multi-view 3D Human Pose Estimation With 3D Space Attention Mechanisms: This paper proposes a 3D Space Attention (3DSA) module that partitions the feature volume into multiple regions via a 3D space subdivision algorithm and assigns view-based attention weights to them. This addresses the issue of unequal contributions of different views to different spatial regions in multi-view 3D human pose estimation, achieving SOTA performance on the CMU Panoptic Studio dataset.
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars: This paper proposes the first baseline system for Spoken2Sign translation with 3D Avatar output. The system translates spoken text into 3D sign language animations through a three-step pipeline (dictionary construction \(\to\) SMPLSign-X 3D pose estimation \(\to\) retrieval-connection-rendering translation). It achieves a back-translation BLEU-4 of 25.46 on Phoenix-2014T, while its 3D sign language byproducts (keypoint enhancement and multi-view understanding) significantly improve the performance of sign language understanding tasks.
AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition: This paper proposes AdaDistill, which embeds the knowledge distillation concept into the margin penalty softmax loss. By utilizing EMA-based adaptive class centers (employing simple sample-to-sample knowledge in early stages and complex sample-to-center knowledge in later stages) and a hard sample-aware mechanism, it enhances the discriminative power of lightweight face recognition models without requiring extra hyperparameters, outperforming SOTA distillation methods on challenging benchmarks such as IJB-B/C and ICCV21-MFR.
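To make the EMA-based class centers in the AdaDistill entry concrete, here is a minimal sketch of how a running center per identity could be maintained from teacher embeddings. The function names, the momentum value of 0.9, and the re-normalization step are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def update_center(centers, class_id, teacher_feature, momentum=0.9):
    """Update the running center of one class with an exponential moving average.

    `centers` maps class id -> unit-length center. The momentum value and
    the re-normalization are assumptions made for this sketch.
    """
    feature = l2_normalize(teacher_feature)
    if class_id not in centers:
        # The first sample of a class initializes its center.
        centers[class_id] = feature
    else:
        blended = [
            momentum * c + (1.0 - momentum) * f
            for c, f in zip(centers[class_id], feature)
        ]
        centers[class_id] = l2_normalize(blended)
    return centers[class_id]
```

Under this reading, a center initially coincides with a single sample and only later aggregates many samples, which is one plausible way to get sample-to-sample knowledge early in training and sample-to-center knowledge later.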
Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification: An Adaptive High-frequency Transformer (AdaFreq) is proposed. By employing frequency-domain mixup augmentation, target-aware dynamic selection of high-frequency tokens, and a feature equilibrium loss, it unifies high-frequency information (such as fur texture and contour edges) for the re-identification of diverse wildlife, outperforming existing ReID methods across 8 cross-species datasets.
ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation: The ADen framework is proposed to unify pose regression and probabilistic estimation paradigms by employing a generator to yield multiple pose hypotheses and a discriminator to score and select the best hypothesis. With only 500 adaptive samples, this approach outperforms methods requiring 500K uniform samples while achieving real-time inference.
Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences: Proposes Alignist, the first method that leverages CAD model information (SDF + SurfEmb correspondence features) to train an implicit distribution network to estimate pose distributions over \(SO(3)\). By fusing geometric and feature alignment via a product of experts, it significantly outperforms contrastive learning methods in low-data scenarios.
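The product-of-experts fusion in the Alignist entry can be illustrated on a discretized set of candidate rotations: each expert assigns a log-score to every candidate, the scores are added (multiplying the densities), and the result is renormalized. This is a generic sketch; the candidate grid, the function name `product_of_experts`, and the example scores are assumptions, not the paper's implementation.

```python
import math


def product_of_experts(log_p_shape, log_p_feat):
    """Fuse two experts' log-scores over the same candidate rotations.

    Adding log-scores multiplies the underlying densities; the softmax-style
    normalization turns the product back into a distribution.
    """
    joint = [a + b for a, b in zip(log_p_shape, log_p_feat)]
    peak = max(joint)  # subtract the maximum for numerical stability
    weights = [math.exp(j - peak) for j in joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores over three candidate rotations: the shape expert cannot
# tell candidates 0 and 1 apart, the feature expert cannot tell 1 and 2 apart.
shape_scores = [math.log(p) for p in (0.45, 0.45, 0.10)]
feat_scores = [math.log(p) for p in (0.10, 0.45, 0.45)]
fused = product_of_experts(shape_scores, feat_scores)
```

In the example, only candidate 1 is supported by both experts, so the fused distribution concentrates there even though each expert alone is ambiguous.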
Audio-Driven Talking Face Generation with Stabilized Synchronization Loss: This work proposes three improvements—AVSyncNet, stabilized synchronization loss, and a silent-lip generator—to systematically address the two core issues of SyncNet instability and lip leaking in audio-driven talking face generation, achieving SOTA performance in both lip synchronization and visual quality.
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos: This paper defines a new task, "Avatar Fingerprinting," which verifies the true identity of the driver expressing emotions in a synthetic talking-head video. It contributes NVFAIR (161 identities), the largest facial reenactment dataset to date, and proposes a baseline method based on normalized facial landmark distances and a temporal CNN. By learning appearance-agnostic facial motion signatures, the method achieves identity verification (average AUC of 0.85) and generalizes to unseen generators (AUC of 0.83).
Bridging the Gap Between Human Motion and Action Semantics via Kinematic Phrases: This paper proposes Kinematic Phrases (KP) as an intermediate representation between human motion and action semantics. KP is based on objective kinematic facts, possesses appropriate abstraction, interpretability, and generalization capabilities, and is used to construct a motion understanding system and a white-box motion generation evaluation benchmark named KPG.
Combining Generative and Geometry Priors for Wide-Angle Portrait Correction: A dual-module framework combining a StyleGAN-based generative prior (for face correction) and a geometric symmetry prior (for background line correction) is proposed, significantly improving the visual quality and quantitative metrics of wide-angle portrait distortion correction.
CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing: This paper introduces CoMo, which decomposes motion sequences into semantically explicit pose codes (e.g., "left knee slightly bent") to achieve text-based controllable motion generation and LLM-based zero-shot motion editing.
Cut Out the Middleman: Revisiting Pose-Based Gait Recognition: This paper revisits pose-based gait recognition methods and proposes the GaitHeat framework, which utilizes heatmaps instead of traditional skeleton keypoint coordinates to encode human poses. By introducing an improved preprocessing pipeline and a pose-guided heatmap alignment module, this framework significantly enhances performance and generalization capability, bringing pose-based methods close to the accuracy of silhouette-based methods for the first time.
De-confounded Gaze Estimation: This paper proposes a causal intervention-based gaze estimation framework, FSCI, which decouples gaze-related features from irrelevant features (such as identity and illumination) via feature separation. By utilizing a dynamic confounder bank to perform causal intervention on irrelevant features, FSCI achieves a 36.2% improvement over the baseline and an 11.5% improvement over the SOTA under cross-domain settings.
Diffusion Model is a Good Pose Estimator from 3D RF-Vision: Proposes mmDiff, a diffusion-based framework for millimeter-wave (mmWave) radar human pose estimation. By employing global-local radar context extraction and structural-temporal motion consistency constraints, it effectively addresses the challenges of sparse, noisy, and inconsistent radar point clouds, significantly outperforming existing SOTA methods.
EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis: Proposes EDTalk, an efficient disentanglement framework based on learnable orthogonal basis vectors, which decomposes facial dynamics into three independent latent spaces: mouth shape, head pose, and emotional expression. It simultaneously supports both video-driven and audio-driven emotional talking head generation.
Event-based Head Pose Estimation: Benchmark and Method: To address the lack of large-scale datasets and specialized methods in event-based head pose estimation (HPE), this work constructs two large-scale multi-scene event-based HPE benchmark datasets and proposes a specialized network containing two core modules: Event Spatial-Temporal Fusion (ESTF) and Event Motion Perceptual Attention (EMPA), achieving superior performance in various challenging scenarios.
Facial Affective Behavior Analysis with Instruction Tuning: This work proposes the first instruction-tuning dataset for Facial Affective Behavior Analysis (FABA), FABA-Instruct, along with an evaluation benchmark, FABA-Bench, and an efficient MLLM architecture, EmoLA. EmoLA achieves fine-grained description and recognition of emotions and Action Units (AUs) through a facial prior expert module and LoRA adaptation.
FoundPose: Unseen Object Pose Estimation with Foundation Features: FoundPose leverages a frozen DINOv2 foundation model to extract patch descriptors, establishing 2D-3D correspondences via bag-of-words template retrieval and kNN matching. It achieves zero-shot 6D pose estimation of unseen objects without any task-specific training, significantly outperforming existing RGB methods on BOP benchmarks.
FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis: The FreeMotion framework is proposed to recursively decompose the joint distribution of multi-person motions into conditional single-person motion generation through conditional probability decomposition. This achieves text-driven motion synthesis for an arbitrary number of individuals for the first time, while supporting multi-person spatial control.
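The conditional probability decomposition mentioned in the FreeMotion entry is, in its generic form, the chain rule of probability. As a sketch, assume \(x_i\) denotes the motion of person \(i\), \(c\) the text condition, and \(N\) the number of people (these symbols are introduced here for illustration and may differ from the paper's notation):

\[
p(x_1, \dots, x_N \mid c) \;=\; p(x_1 \mid c) \prod_{i=2}^{N} p(x_i \mid x_1, \dots, x_{i-1}, c)
\]

Each factor is a single-person generation problem conditioned on the text and on the people generated so far, which is why the same model can be applied recursively for any \(N\).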
Gaze Target Detection Based on Head-Local-Global Coordination: A gaze target detection method based on head-local-global tri-view coordination is proposed. By introducing a field of view (FOV)-based local view and designing global-local position and representation consistency mechanisms, the accuracy of gaze target prediction is significantly improved.
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths: This work proposes GazeXplain, which is the first to combine visual scanpath prediction with natural language explanations. Through an attention-language decoder, a semantic alignment mechanism, and cross-dataset co-training, it achieves explainable prediction of human gaze behavior.
Generalizable Facial Expression Recognition: This paper proposes the CAFE method, which learns a Sigmoid Mask on fixed CLIP face features to select expression-related features. Combined with channel separation and channel diversity loss, it achieves zero-shot generalization capabilities that significantly outperform SOTA facial expression recognition methods on multiple unseen datasets, using only a single training set.
GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence: This paper proposes GS-Pose, a method that projects 2D semantic features from a pre-trained vision foundation model (DINOv2) into 3D space, combining them with geometric features via a Transformer matching network for category-level 9D object pose estimation. Highly data-efficient, it achieves state-of-the-art performance on multiple real-world datasets with training on only 10 synthetic 3D models.
How Video Meetings Change Your Expression: Proposes FacET (Facial Explanations through Translations), an interpretable framework based on generative domain translation. By learning disentangled facial spatial features and interpretable spatiotemporal linear transformations, it automatically discovers subtle facial expression variation patterns between video conferencing (VC) and face-to-face (F2F) communication, while supporting "de-zooming" to translate VC videos into F2F styles.
HPE-Li: WiFi-Enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation: This paper proposes HPE-Li, a lightweight human pose estimation method based on WiFi signals. By constructing a multi-branch CNN using an innovative Dual Selective Kernel Attention (SKA) mechanism, it dynamically adjusts the receptive field size according to the characteristics of the input WiFi CSI data, surpassing SOTA methods on both MM-Fi and WiPose benchmarks with extremely low computational overhead.
Human Motion Forecasting in Dynamic Domain Shifts: A Homeostatic Continual Test-Time Adaptation Framework: Proposes the HoCoTTA framework, which achieves robust adaptation for human motion prediction in continuously changing target domains through multi-domain homeostasis assessment and isolated parameter optimization strategies, effectively mitigating catastrophic forgetting and error accumulation.
HUMOS: Human Motion Model Conditioned on Body Shape: This paper proposes HUMOS, a human motion generation model conditioned on body shape. It learns the correlation between body shape and motion without paired training data through cycle consistency loss and differentiable intuitive physics/dynamic stability constraints, generating physically plausible and dynamically stable human motions.
LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation: The LaPose framework is proposed to model object shape uncertainty using the Laplacian Mixture Model (LMM). Combined with a dual-stream architecture comprising a DINOv2 general 3D stream and a Convolutional specialized feature stream, it predicts the NOCS coordinate distribution. It also introduces a scale-invariant pose representation to resolve the inherent scale ambiguity in RGB-only scenarios, achieving SOTA performance on the NOCS dataset.
Large Motion Model for Unified Multi-Modal Motion Generation: This paper proposes the Large Motion Model (LMM), the first motion-centric unified multimodal motion generation foundation model. By constructing the MotionVerse benchmark containing 10 tasks, 16 datasets, and 320K sequences, designing a body-part-aware ArtAttention mechanism, and incorporating a pre-training strategy with random frame rates and masking, LMM achieves high-quality motion generation across diverse tasks.
MANIKIN: Biomechanically Accurate Neural Inverse Kinematics for Human Motion Estimation: This paper proposes MANIKIN, which accurately estimates full-body motion from sparse end-effector poses of the head and hands while ensuring biomechanical plausibility and ground non-penetration. This is achieved by embedding anatomical constraints within the SMPL parametric model and designing a neural inverse kinematics solver based on swivel angle prediction.
MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition: MIGS is proposed to unify the 3DGS parameters of multiple human identities into a single low-rank tensor via CP tensor decomposition, significantly reducing parameter size while achieving robust animation for unseen poses.
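To show why a CP decomposition shrinks the parameter count, the sketch below rebuilds a 3-way tensor from rank-\(R\) factor matrices and compares storage costs. How MIGS actually arranges identities, Gaussians, and attributes along the tensor axes is not stated in the entry, so the axis sizes and function names here are purely hypothetical.

```python
def cp_reconstruct(A, B, C):
    """Rebuild T[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r].

    A, B, C are factor matrices (lists of rows) sharing the same rank R.
    """
    rank = len(A[0])
    return [
        [
            [sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
             for k in range(len(C))]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]


def cp_param_count(dims, rank):
    """Parameters stored by the factor matrices: R * (I + J + K)."""
    return rank * sum(dims)


def dense_param_count(dims):
    """Parameters stored by the full tensor: I * J * K."""
    count = 1
    for d in dims:
        count *= d
    return count


# Hypothetical sizes: 10 identities, 1000 Gaussians, 59 attributes, rank 8.
dims = (10, 1000, 59)
print(dense_param_count(dims), cp_param_count(dims, rank=8))
```

The dense count grows with the product of the axis sizes, while the factorized count grows with their sum, so adding identities costs only one more row in a single factor matrix.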
Modeling and Driving Human Body Soundfields through Acoustic Primitives: This paper proposes a 3D human body soundfield modeling and rendering framework based on Acoustic Primitives, which attaches multiple low-order spherical harmonic soundfields to human skeletal joints. While maintaining audio quality comparable to the state-of-the-art (SOTA), it achieves a 15x acceleration and near-field sound rendering capabilities.
Motion Mamba: Efficient and Long Sequence Motion Generation: This paper proposes Motion Mamba, which is the first to introduce Selective State Space Models (Mamba) to human motion generation. Through two core components, Hierarchical Temporal Mamba (HTM) and Bidirectional Spatial Mamba (BSM), it reduces FID on HumanML3D from 0.473 to 0.281 (about 41%) while cutting inference time from 0.217s to 0.058s (roughly a 3.7x speedup).
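The before/after figures quoted in the Motion Mamba entry can be turned into relative changes with a short calculation. The sketch uses only the four numbers stated in the entry; the helper names are illustrative.

```python
def relative_reduction(before, after):
    """Fraction by which a lower-is-better metric dropped."""
    return (before - after) / before


def speedup(before, after):
    """How many times faster the new runtime is."""
    return before / after


fid_drop = relative_reduction(0.473, 0.281)
time_gain = speedup(0.217, 0.058)
print(f"FID reduction: {fid_drop:.1%}")   # FID reduction: 40.6%
print(f"Speedup: {time_gain:.2f}x")       # Speedup: 3.74x
```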
Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification: A Multi-Memory Matching (MMM) framework is proposed for unsupervised visible-infrared person re-identification. It establishes reliable cross-modal correspondences through three modules: Cross-Modal Clustering (CMC), Multi-Memory Learning and Matching (MMLM), and Soft Cluster-level Alignment loss (SCA), achieving a Rank-1 accuracy of 61.6% on SYSU-MM01 and 89.7% on RegDB.
Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding: To address the issue where human joint occlusion leads to missing edges in 2D skeleton graphs, rendering traditional graph Laplacian positional encodings ineffective, this paper proposes PerturbPE. Leveraging the Rayleigh-Schrödinger Perturbation Theory, the method repeatedly applies random perturbations and computes the average to extract the consistent part of the graph Laplacian eigenspace as the positional encoding. This approach outperforms MöbiusGCN on complete skeletons and achieves up to a 12% performance improvement in scenarios with missing edges.
Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization: A self-supervised learning benchmark is proposed to simultaneously evaluate semantic classification and pose estimation capabilities. A viewpoint trajectory regularization loss (trajectory loss) is designed to constrain local linearity in the feature space using image triplets from adjacent viewpoints. This lets the learned representations maintain semantic classification accuracy while global pose-awareness emerges, improving both in-domain and out-of-domain pose estimation by 4%.
PoseSOR: Human Pose Can Guide Our Attention: This paper introduces human pose information to the Salient Object Ranking (SOR) task for the first time. By proposing a Pose-Aware Interaction (PAI) module and a Pose-Driven Ranking (PDR) module, it models the relationship between human activities and attention shifts, significantly improving SOR performance in complex scenes and achieving SOTA results.
ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild: Proposes ReLoo, which reconstructs high-quality 3D human models dressed in loose garments from monocular in-the-wild videos through a layered neural human representation and a non-hierarchical virtual bone deformation module.
RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency: RePOSE proposes replacing traditional absolute depth supervision signals with spatio-temporal relative depth consistency constraints. This shifts 3D human pose estimation in occluded scenarios from "learning absolute depth values" to "learning the relative depth order of keypoints." With an extremely simple implementation (requiring only a few lines of code), it significantly improves the robustness and accuracy of pose estimation under occlusions.
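To illustrate what supervising relative depth order rather than absolute depth can look like, here is a minimal pairwise ranking loss over the keypoints of a single frame. It is a generic logistic ranking loss chosen for this sketch, not RePOSE's actual formulation, and it covers only the spatial (within-frame) case; the function name `depth_order_loss` is hypothetical.

```python
import math


def depth_order_loss(pred_z, gt_z):
    """Average logistic ranking loss over all keypoint pairs in one frame.

    For each pair (i, j) the ground truth only says which keypoint is
    farther away; the prediction is penalized when its depth difference
    has the wrong sign. Pairs with equal ground-truth depth are skipped.
    """
    total, pairs = 0.0, 0
    for i in range(len(pred_z)):
        for j in range(i + 1, len(pred_z)):
            # +1 if keypoint i is farther than j, -1 if closer, 0 if tied.
            order = (gt_z[i] > gt_z[j]) - (gt_z[i] < gt_z[j])
            if order == 0:
                continue
            total += math.log1p(math.exp(-order * (pred_z[i] - pred_z[j])))
            pairs += 1
    return total / pairs if pairs else 0.0
```

Because the loss depends only on depth differences, shifting every predicted depth by the same constant leaves it unchanged, which is the sense in which such a loss learns depth order rather than absolute depth values.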
ScanTalk: 3D Talking Heads from Unregistered Scans: ScanTalk is proposed, which is the first deep learning framework capable of generating audio-driven animations for 3D faces with arbitrary topology (including unregistered 3D scans). The core mechanism relies on the discretization-agnostic property of DiffusionNet to break the constraints of a fixed topology.
SCAPE: A Simple and Strong Category-Agnostic Pose Estimator: This work simplifies category-agnostic pose estimation (CAPE) to a pure self-attention feature matching problem, discarding explicit similarity matching and two-stage frameworks. It introduces a Global Keypoint Feature Perceiver (GKP) and a Keypoint Attention Refiner (KAR) to improve attention quality. On the MP-100 dataset under 1-shot and 5-shot settings, it outperforms the SOTA by 2.2 and 1.3 PCK respectively, while reducing parameter count and improving inference speed.
Spectral Subsurface Scattering for Material Classification: This paper proposes a method for material classification utilizing Spectral Sub-Surface Scattering (S4). It demonstrates that the strong spectral dependence of subsurface scattering provides highly discriminative features and designs a novel imaging setup to efficiently acquire S4 measurements via a 2D projection, eliminating the need for time-consuming hyperspectral scanning.
TELA: Text to Layer-wise 3D Clothed Human Generation: TELA proposes a layer-wise 3D clothed human representation and a progressive optimization strategy to generate garment-decoupled 3D human models from text descriptions, supporting applications such as layer-by-layer clothing generation and virtual try-on.
TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing: This paper proposes the TF-FAS framework, which enhances the cross-domain generalization capability of face anti-spoofing through fine-grained guidance of twofold semantic elements (content elements and categorical elements). Within this framework, the CEDM module explores and decouples content-related features, while the FCEM module mines fine-grained intra-class differences, achieving state-of-the-art (SOTA) performance on multiple cross-domain FAS benchmarks.
Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing: This paper proposes the MMA-FAS framework to address the problem of missing modalities in multimodal face anti-spoofing (FAS). It separates modality-invariant and modality-specific features from a frequency decomposition perspective using modality-disentangle adapters. Combined with an LBP-guided contrastive loss and an adaptive modal combination sampling strategy, it achieves SOTA performance across all missing modality scenarios.
U-COPE: Taking a Further Step to Universal 9D Category-Level Object Pose Estimation: This paper proposes U-COPE, the first category-level 9D pose estimation framework that handles both rigid and articulated objects in a unified way. By redefining rigid objects as single-part articulated objects, this work unifies the problem definition, independently extracts features for each part using Point Pair Features (PPF), and predicts key pose parameters via a universal voting strategy, achieving state-of-the-art (SOTA) performance on both synthetic and real-world datasets.
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues: Proposes UPose3D, an uncertainty-aware multi-view 3D human pose estimation method. By modeling 2D keypoint uncertainty with Normalizing Flow, utilizing a scalable cross-view point cloud projection fusion strategy, and employing a Pose Compiler module trained on synthetic data, it achieves state-of-the-art performance in Out-of-Distribution (OoD) scenarios without requiring 3D annotations, while remaining competitive with 3D-supervised methods in In-Distribution (InD) scenarios.
Upper-Body Hierarchical Graph for Skeleton Based Emotion Recognition in Assistive Driving: This paper proposes UbH-GCN for assistive driving scenarios. It utilizes upper-body skeleton sequences to construct a hierarchical graph structure (UbH-Graph) that dynamically models the relationship between joint movements and emotions. It also introduces a class-specific variation mechanism to balance the uneven data distribution, outperforming existing multimodal methods on the AIDE assistive driving dataset.
VideoClusterNet: Self-Supervised and Adaptive Face Clustering for Videos: VideoClusterNet is a fully self-supervised video face clustering method. It adapts a generic face recognition model via a self-distillation mechanism and uses a parameter-free clustering algorithm based on a learned loss metric, achieving SOTA performance in movie/TV show scenarios.
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment: The Wear-Any-Way framework is proposed, establishing a strong baseline for high-fidelity virtual try-on based on a dual U-Net diffusion model. By introducing a point control mechanism via Sparse Correspondence Alignment, it enables users to precisely manipulate wearing styles (e.g., rolling up sleeves, opening/closing coats, tucking in hems) through click-and-drag interactions, achieving state-of-the-art performance in both standard and manipulable try-on scenarios.
WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation: By leveraging the multi-view static camera infrastructure deployed in the 2022 FIFA World Cup stadiums, this work constructs WorldPose, the first large-scale multi-person global 3D pose estimation dataset. It contains approximately 2.5 million 3D poses and over 120 km of global trajectories, revealing the severe challenges that existing global pose estimation methods face in multi-person scenarios.