
🧑 Human Understanding

📷 CVPR2025 · 73 paper notes

📌 Same area in other venues: 📷 CVPR2026 (151) · 🔬 ICLR2026 (45) · 🧪 ICML2026 (5) · 🤖 AAAI2026 (20) · 🧠 NeurIPS2025 (21) · 📹 ICCV2025 (41)

🔥 Top topics: Face & Gaze ×14 · Human Pose ×14 · Avatars ×5 · Personalized Generation ×3 · Speech & Audio ×3

3D Face Reconstruction From Radar Images: For the first time, 3D face reconstruction is achieved from millimeter-wave radar images: a synthetic dataset generated with a physical radar renderer is used to train a CNN encoder to estimate BFM parameters, and a model-based autoencoder is constructed by learning a differentiable radar renderer, achieving a mean vertex-to-vertex error of 2.56 mm on synthetic data while allowing unsupervised parameter optimization during inference.
3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation: This work proposes cross-task few-shot 2D gaze estimation, which leverages a pre-trained 3D gaze model as a prior. Through a physics-based differentiable projection module (with 6 learnable screen parameters), the 3D gaze direction is projected onto 2D screen coordinates. With only 10 annotated images, this approach adapts 2D gaze estimation to unseen devices, achieving over 25% improvement on MPIIGaze/EVE/GazeCapture compared to EFE and IVGaze.
Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation: This paper presents the first systematic study of the synthetic-to-real domain gap in 3D hand pose estimation. By designing a controllable data synthesis pipeline, the authors decompose and analyze the impacts of four key factors: forearms, spectral statistics, pose distribution, and object occlusion. The study demonstrates that with proper integration of these factors, purely synthetic data can achieve accuracy on par with real data.
Any6D: Model-free 6D Pose Estimation of Novel Objects: This paper proposes the Any6D framework to estimate the 6D pose and size of novel objects from a single RGB-D anchor image. By combining InstantMesh 3D reconstruction, oriented bounding box coarse alignment, and joint size-pose refinement, Any6D achieves an ADD-S of 98.7% on HO3D, significantly outperforming GEDI's 71.9%.
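The ADD-S figure quoted for Any6D is built on a per-frame distance that is easy to state in code. The sketch below is a minimal pure-Python illustration, not Any6D's evaluation script: the function names `transform`, `add_s`, and `add` and the four-point toy model are hypothetical. Benchmarks typically turn these per-frame distances into a percentage (an area under the accuracy-threshold curve or a recall at a fixed threshold), which this sketch does not do.

```python
import math

def transform(points, R, t):
    """Apply the rigid transform x -> R x + t to a list of 3D points."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_s(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth-posed model point
    to its nearest neighbour among the predicted-pose model points."""
    gt = transform(model_points, R_gt, t_gt)
    pred = transform(model_points, R_pred, t_pred)
    return sum(min(math.dist(g, p) for p in pred) for g in gt) / len(gt)

def add(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points (no nearest-neighbour search)."""
    gt = transform(model_points, R_gt, t_gt)
    pred = transform(model_points, R_pred, t_pred)
    return sum(math.dist(g, p) for g, p in zip(gt, pred)) / len(gt)

# Toy model (hypothetical): four points that are symmetric under a 180-degree turn about z.
model = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
half_turn_z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
zero = (0.0, 0.0, 0.0)

sym_add_s = add_s(model, identity, zero, half_turn_z, zero)  # symmetric pose error is ignored
sym_add = add(model, identity, zero, half_turn_z, zero)      # the same error is fully penalised
shift_add_s = add_s(model, identity, zero, identity, (0.0, 0.0, 0.01))
```

The nearest-neighbour search is what makes the metric tolerant of symmetric objects: the half-turn gives an ADD-S of zero while plain ADD reports a large error, and a small translation is penalised by both.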

ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

Co-op: Correspondence-based Novel Object Pose Estimation: This paper proposes Co-op, a correspondence-based 6DoF pose estimation framework for novel objects. In the coarse estimation stage, a hybrid representation (patch-level classification + offset regression) is used to estimate the initial pose quickly and accurately with only 42 templates. In the refinement stage, probabilistic flow regression combined with differentiable PnP is utilized for end-to-end optimization, significantly outperforming existing methods on seven core datasets of the BOP Challenge.
ControlFace: Harnessing Facial Parametric Control for Face Rigging: Proposes ControlFace, which utilizes a dual-branch U-Net (FaceNet + denoising U-Net) combined with 3DMM rendering conditions to achieve flexible editing of facial pose, expression, and illumination without fine-tuning, while precisely preserving identity and semantic details.
CRISP: Object Pose and Shape Estimation with Test-Time Adaptation: Proposes CRISP, a category-agnostic object pose and shape estimation pipeline. The core innovations are an optimization-based corrector utilizing an active shape model and a correct-and-certify self-training strategy, which can adaptively bridge large domain gaps at test time.
CryptoFace: End-to-End Encrypted Face Recognition: This paper proposes CryptoFace, the first end-to-end Fully Homomorphic Encrypted (FHE) face recognition system. By utilizing a hybrid shallow patch CNN architecture (CryptoFaceNet), it significantly reduces the multiplicative depth, achieving encrypted inference that is 7 times faster than state-of-the-art (SOTA) FHE networks while improving verification accuracy.
D3-Human: Dynamic Disentangled Digital Human from Monocular Video: D3-Human proposes a method to reconstruct disentangled (garment + body) digital human geometry from a monocular video. By defining a Human Manifold Signed Distance Field (hmSDF), it achieves accurate garment-body segmentation of visible regions without 3D garment priors, generating a disentangled template in approximately 20 minutes and supporting virtual try-on and animation applications.
Design2GarmentCode: Turning Design Concepts to Tangible Garments Through Program Synthesis: Proposes Design2GarmentCode, the first neuro-symbolic framework that translates multi-modal design inputs (text/image/sketch) into parameterized garment drafting programs (GarmentCode DSL). This achieves a 100% simulation success rate and 88.67% user satisfaction, with the generated programs being fully editable and highly parameterizable.
Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency: This paper proposes an efficient blind face video restoration framework based on 3D-VQGAN. By designing dual spatial-temporal codebooks to record high-quality portrait features and motion residual information, along with marginal prior regularization to alleviate codebook collapse, it achieves SOTA performance on BFVR and deflickering tasks while improving inference speed by 2 to 140 times.
Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input: A unified framework, Ego4o, is proposed to achieve human motion capture and motion description generation simultaneously from multi-modal inputs of wearable devices (1-3 IMUs + egocentric images + motion descriptions), where the two tasks mutually enhance each other.
Enhancing 3D Gaze Estimation in the Wild Using Weak Supervision with Gaze Following Labels: Proposes a two-stage self-training weakly supervised framework, ST-WSGE, which leverages 2D gaze-following datasets (such as GazeFollow) to generate 3D pseudo-labels to enhance the generalization capability of 3D gaze estimation in the wild. Concurrently, a modality-agnostic Gaze Transformer (GaT) is designed to uniformly process both image and video inputs, achieving SOTA results on Gaze360, GFIE, MPIIFaceGaze, and other datasets.
ESC: Erasing Space Concept for Knowledge Deletion: This paper proposes ESC (Erasing Space Concept), which performs SVD on the feature space of the data to be forgotten and removes the principal component directions, achieving training-free, feature-level knowledge deletion. It defines the "Knowledge Deletion" task for the first time and proposes the Knowledge Retention Score to evaluate the effectiveness of feature-level unlearning.
Exploring Timeline Control for Facial Motion Generation: This paper introduces timeline control for facial motion generation for the first time, where users specify precise frame intervals for various facial actions on a multi-track timeline. Frame-level facial motion annotation is achieved with minimal effort through TICC temporal clustering, and a base-branch diffusion model is designed to decouple facial regions while preserving natural coupling, generating natural and smooth facial motions precisely aligned with the timeline.
FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video: FATE is proposed to reconstruct animatable full-head Gaussian avatars from monocular videos. By employing a sampling-based densification strategy (replacing threshold splitting), neural baking (converting discrete Gaussians into continuous UV texture maps to support editing), and a general completion framework (synthesizing the appearance of the back of the head), FATE achieves highly efficient and high-quality reconstruction with a PSNR of 28.37 dB using only 49K Gaussians.
Few-Shot Personalized Scanpath Prediction: This paper proposes the Few-Shot Personalized Scanpath Prediction (FS-PSP) task and the Subject-Embedding Network (SE-Net). By decoupling subject-embedding learning from scanpath prediction, the model can adapt to a new user using gaze data from only 1-10 images. It outperforms the runner-up by 5.9%-7.9% in the ScanMatch metric across three datasets (OSIE, COCO-FreeView, COCO-Search18), with an adaptation time of only 3.6 seconds and requiring no fine-tuning.
FreeCloth: Free-Form Generation Enhances Challenging Clothed Human Modeling: This paper proposes FreeCloth, a hybrid framework that divides the human surface into three regions: "bare", "deformed", and "generated". It models tight clothing using Linear Blend Skinning (LBS) deformation and loose clothing (skirts, dresses) using a free-form generator free from LBS constraints. It achieves State-of-the-Art (SOTA) performance on the ReSynth dataset, significantly outperforming existing methods, especially in loose clothing scenarios.
FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly: FreeUV proposes a facial UV texture recovery framework that does not require ground-truth UV texture data. By separately training a UV-to-2D network focused on realistic appearance and a 2D-to-UV network focused on structural consistency, it cross-assembles their UV-related modules into a pre-trained Stable Diffusion model during inference to achieve high-fidelity UV-to-UV texture generation.
FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images: This paper proposes FRESA, which learns a general clothed human prior to jointly infer personalized canonical shape, skinning weights, and pose-dependent deformations in a feedforward manner (18 seconds) from a few images. This achieves high-quality animatable 3D human avatar reconstruction with zero-shot generalization to mobile phone photos.
FSboard: Over 3 Million Characters of ASL Fingerspelling Collected via Smartphones: Presents FSboard—the largest ASL fingerspelling recognition dataset to date (3.2 million characters, 266 hours of video, recorded via smartphone selfie mode by 147 Deaf signers). Focused on mobile text entry scenarios, the baseline model achieves an 11.1% CER using MediaPipe + ByT5, providing a solid data foundation for fingerspelling as a mobile input method.
FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning: FSFM proposes the first self-supervised pre-training framework specialized for face security tasks. By employing the CRFR-P facial masking strategy coupled with a dual-task collaborative learning of MIM/ID, it learns "3C" representations of real faces (intra-region consistency, inter-region coherence, and local-to-global correspondence). It surpasses task-specific SOTA methods across three major tasks: Deepfake Detection, Face Anti-Spoofing, and Diffusion Forgery Detection.
GA3CE: Unconstrained 3D Gaze Estimation with Gaze-Aware 3D Context Encoding: This paper proposes the GA3CE method, which encodes the subject's 3D pose and scene object locations into a subject-centric egocentric space, and designs a direction-distance-decomposed D3 position encoding. This allows a Transformer to learn the spatial relationships between the 3D gaze direction and the scene context, reducing the 3D gaze angular error by 13%–37% under unconstrained settings.
GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior: This work proposes GaussianIP, a two-stage framework that efficiently generates identity-consistent 3D Gaussian humans from a human-centric diffusion model using Adaptive Human Distillation Sampling (AHDS). It then enhances facial and clothing texture details using mutual attention via a View-Consistent Refinement (VCR) mechanism, completing training within 40 minutes while significantly outperforming existing methods.
GCE-Pose: Global Context Enhancement for Category-Level Object Pose Estimation: GCE-Pose proposes a "completion-then-aggregation" strategy, which reconstructs partial observations into complete geometric-semantic 3D representations via a Semantic Shape Reconstruction (SSR) module, and then injects global information into local keypoint features through a Global Context Enhancement (GCE) feature fusion module. This approach significantly outperforms existing methods on HouseCat6D and NOCS-REAL275.
HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation: HiPART proposes an autoregressive generation scheme that generates hierarchical dense 2D poses (48→96 joints) from sparse 2D poses (17 joints), replacing complex temporal/visual encoders with rich skeletal context to address occlusion. It achieves SOTA performance on single-frame 3D HPE and surpasses most multi-frame methods, while requiring fewer parameters and less computation.
Homogeneous Dynamics Space for Heterogeneous Humans: This paper proposes HDyS (Homogeneous Dynamics Space). By aggregating heterogeneous human motion data from biomechanics and reinforcement learning, it trains a homogeneous latent space to unify different kinematic and dynamic representations, achieving high-quality bidirectional mapping from kinematics to dynamics and demonstrating effectiveness on downstream tasks such as inverse dynamics estimation and ground reaction force prediction.
HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification: The HSEmotion team proposed a lightweight pipeline for the ABAW-10 competition: using pre-trained EfficientNet to extract facial embeddings, combined with MLP + GLA (Generalized Logit Adjustment) + sliding window smoothing. It significantly outperformed the official baselines on all four tasks (EXPR/VA/AU/VD), among which the violence detection task achieved a macro F1 of 0.783 using ConvNeXt-T + TCN.
Human Motion Instruction Tuning: LLaMo proposes a multimodal instruction tuning framework that preserves native motion representations (rather than converting them into language tokens), enhancing the model's capability to understand and predict complex human behaviors by simultaneously processing video, motion sequences, and textual inputs.
HumanMM: Global Human Motion Recovery from Multi-shot Videos: HumanMM proposes the first framework to recover 3D human motion in the global coordinate system from multi-shot videos. By integrating a shot transition detector, enhanced SLAM, calibration-based orientation alignment, and a motion integrator, it achieves continuous motion reconstruction across shot transitions.
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation: KeyFace proposes a two-stage diffusion framework that first generates anchor frames capturing key expressions at a low frame rate, and then fills in the intermediate frames using an interpolation model. This design addresses identity drift and quality degradation in long sequences for existing audio-driven facial animation methods, while introducing support for continuous emotion (valence/arousal) modeling and animation generation for various non-speech vocalizations (NSVs) for the first time.
Learning Affine Correspondences by Integrating Geometric Constraints: This paper proposes DenseAffine, a new framework for estimating affine correspondences that integrates dense matching with geometric constraints. It employs a two-stage decoupled training strategy: first training a dense point matcher with a Sampson distance loss, and subsequently freezing the matcher to train a local affine transformation extractor using an affine Sampson distance loss. The method achieves state-of-the-art (SOTA) performance on both HPatches matching and MegaDepth pose estimation.
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues: By introducing three contextual cues—background video descriptions, historical translations, and a pseudo-vocabulary—and combining them with LoRA fine-tuning of Llama3-8B, precise continuous sign language-to-text translation is achieved, yielding an improvement of over 40% compared to the SOTA on the BOBSL dataset.
MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation: This paper proposes the Mixture of Emotion Experts (MoEE) model, which trains an individual expert network for each of the 6 basic emotions and dynamically combines them through a Soft MoE gating mechanism. In conjunction with a 150-hour professional emotional talking-head dataset and a multi-modal emotion conditioning module, MoEE achieves precise and natural control over both single and compound emotions.
MotionMap: Representing Multimodality in Human Pose Forecasting: MotionMap is proposed, introducing a new paradigm that represents the spatial distribution of motion using heatmaps. By combining t-SNE dimensionality reduction with a codebook, it achieves variable-mode forecasting and confidence quantification, yielding optimal mode coverage with minimal sampling.
MotionReFit: Dynamic Motion Blending for Versatile Motion Editing: Proposes MotionReFit, the first versatile text-guided motion editing framework that supports both spatial and temporal editing without requiring additional specifications or LLMs, achieved through MotionCutMix data augmentation, an autoregressive diffusion model, and a motion harmonizer.
NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction: NBAvatar proposes the Neural Billboard primitive, which combines learnable planar geometric primitives with deferred neural texture rendering to achieve photorealistic head avatar rendering under hand-face interaction, reducing LPIPS by 30% compared to Gaussian-based methods at megapixel resolution.
Omni-ID: Holistic Identity Representation Designed for Generative Tasks: Omni-ID proposes a holistic facial identity representation designed specifically for generative tasks. Through a few-to-many identity reconstruction training paradigm and multi-decoder objectives (Masked Transformer + Flow Matching), it encodes an arbitrary number of input images into a fixed-size structured representation, significantly outperforming ArcFace and CLIP in controllable face generation and personalized T2I tasks.
One2Any: One-Reference 6D Pose Estimation for Any Object: This paper proposes One2Any, which estimates the 6D pose of any novel object using only a single reference image. It encodes the reference pose using Reference Object Coordinates (ROC, based on the reference camera frame rather than canonical coordinates), conditionally generates dense ROC maps via VQVAE+U-Net, and restores the pose using the Umeyama algorithm. It achieves 93.7% ADD-S AUC on YCB-Video with an inference time of only 0.09 seconds.
Optimal Transport-Guided Source-Free Adaptation for Face Anti-Spoofing: The OTA framework is proposed: during the training phase, prototype representations are learned to encode the source domain distribution. During the testing phase, the prototypes are transferred to the target domain via optimal transport (OT) in a training-free or lightweight-training manner without accessing source model parameters or training data. Concurrently, geodesic mixup data augmentation is proposed to improve classifier learning in low-data scenarios.
PersonaBooth: Personalized Text-to-Motion Generation: This work defines a new task of Motion Personalization and proposes PersonaBooth, a multimodal fine-tuning method along with PerMo, a large-scale motion personalization dataset. By employing persona tokens, contrastive learning, and context-aware fusion, the method captures an individual's unique motion style from a few reference motions and generates text-driven personalized motions.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization: This paper proposes PhysMoDPO, which applies Direct Preference Optimization (DPO) to text-driven human motion generation. By integrating a Whole-Body Controller (WBC) into the training pipeline to calculate physics-based rewards and construct preference data, the generated motions satisfy both physical constraints and text instructions. Zero-shot deployment is achieved on the Unitree G1 robot.
Pose Priors from Language Models: Proposes the ProsePose framework, which leverages Large Multimodal Models (LMMs, e.g., GPT-4V) as contact priors to extract body-part contact constraints from images and convert them into optimizable loss functions, improving 3D pose estimation in close-interaction and self-contact scenarios without human contact annotations.
PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation: Proposes PoseBH, which achieves unified training across datasets with different skeleton definitions (e.g., humans, animals, hands) via non-parametric keypoint prototypes (Sinkhorn-Knopp online clustering) and cross-type self-supervision (CSS). It improves upon ViTPose++ by 11.2 AP on the APT-36K animal video dataset, demonstrating the effectiveness of cross-type knowledge transfer.
Probabilistic Prompt Distribution Learning for Animal Pose Estimation: This paper proposes PPAP (Probabilistic Prompt for Animal Pose), a multi-species animal pose estimation method based on probabilistic prompt distribution learning. By constructing multiple learnable attribute prompts for each keypoint and modeling them as Gaussian distributions, combined with a diversity loss and cross-modal fusion strategies, it achieves state-of-the-art (SOTA) performance under both supervised and zero-shot settings.
Quaffure: Real-Time Quasi-Static Neural Hair Simulation: Quaffure proposes the first physical self-supervised real-time quasi-static hair simulation method. By decomposing hair deformation into a rigid pose transformation and a learned correction, it trains a CNN decoder using an improved Cosserat elastic energy as a self-supervised loss, predicting physically plausible hair draping effects for various hairstyles, body shapes, and poses in just a few milliseconds on consumer-grade hardware.
Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation: This paper proposes the FMMP framework, which substantially outperforms state-of-the-art methods (+3.2%) in category-agnostic pose estimation (CAPE) via recurrent mining of fine-grained structure-aware (FGSA) features based on deformable attention, combined with a keypoint mixup padding strategy.
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback: VTON-IQA is proposed as a reference-free virtual try-on image quality assessment framework. It achieves image-level quality prediction aligned with human perception through a large-scale human-annotated benchmark VTON-QBench (62,688 try-on images + 431,800 annotations) and an Interleaved Cross-Attention module.
Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios: This paper proposes the first end-to-end video Transformer model for rPPG in real-world outdoor extreme lighting scenarios. It achieves robust physiological signal extraction using only an RGB camera through global interference sharing, background reference decoupling, and biological prior constraints.
RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance: RePerformer is proposed, a 3DGS-based volumetric video representation that unifies high-fidelity playback and photorealistic reperformance under novel poses through hierarchical decoupling of motion and appearance Gaussians, Morton coding parameterization, and a semantic-aware alignment module.
RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars: RGBAvatar proposes a "Reduced Gaussian Blendshapes" representation that efficiently represents animatable head avatars using only 20 learnable bases. Combined with batch-parallel rendering and a color initialization strategy, it achieves online real-time (reconstructing while capturing) head avatar reconstruction for the first time.
RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges: RUBIK proposes a structured image matching benchmark based on the nuScenes dataset. By organizing 16.5K image pairs into 33 difficulty levels using three complementary geometric difficulty criteria (overlap, scale ratio, and viewpoint difference), it systematically evaluates 14 methods. The findings reveal that even the best detector-free method (DUSt3R) succeeds on only 54.8% of the image pairs, exposing severe deficiencies of current methods under extreme geometric conditions.
SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance: This paper proposes SemGeoMo, which leverages an LLM-based automatic annotator to provide semantic guidance, combined with hierarchical geometric guidance at both affordance and joint levels. This two-stage framework achieves high-quality human-object interaction generation under dynamic contextual environments, while simultaneously outputting the corresponding textual descriptions.
Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions: This paper proposes the ShapeMove framework, which injects continuous body shape information into discretely quantized motion tokens via a Shape-Aware FSQ-VAE, and utilizes a pretrained language model to jointly predict shape parameters and motion tokens. It achieves the first end-to-end shape-aware human motion generation from natural language descriptions.
ShowMak3r++: Compositional Entertainment Video Reconstruction: This paper proposes ShowMak3r++, a compositional pipeline to reconstruct dynamic radiance fields from TV shows and web videos. The core innovations include a depth-prior-based spatio-temporal positioning module, ShotMatcher for cross-shot actor association, and an implicit face-fitting network, supporting post-production editing applications such as actor repositioning, insertion, and deletion.
SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction: Proposes SimMotionEdit, which introduces motion similarity prediction as an auxiliary task paired with a dual-module architecture of Condition Transformer + Diffusion Transformer, achieving SOTA performance in text-driven 3D human motion editing on the MotionFix dataset.
SocialGesture: Delving into Multi-Person Gesture Understanding: SocialGesture is the first large-scale dataset focusing on deictic gestures (pointing/showing/giving/reaching) in multi-person social scenarios, covering 9,889 video clips and 42,533 gesture instances. It establishes three benchmark tasks: temporal localization, classification, and VQA, systematically revealing the severe deficiencies of current models in multi-person gesture understanding.
Sonic: Shifting Focus to Global Audio Perception in Portrait Animation: The Sonic framework is proposed, establishing global audio perception as the core paradigm (rather than relying on visual motion frames). Through three modules—context-enhanced audio learning, a motion-decoupled controller, and time-aware position shift fusion—it achieves high-quality and temporally consistent audio-driven portrait animation generation.
StickMotion: Generating 3D Human Motions by Drawing a Stickman: The StickMotion framework is proposed, which uses user-hand-drawn stickman drawings as fine-grained motion control conditions, combined with text descriptions, to achieve global and local 3D human motion generation. A Multi-Condition Module (MCM) is designed to efficiently process condition combinations, saving users 51.5% of their time for expressing creative motion ideas.
Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic: To address the challenges of unsmooth action transitions and the difficulty in learning action characteristics in action-driven stochastic human motion prediction, this paper proposes two memory modules: Soft-Transition Action Bank (STAB) and Action Characteristic Bank (ACB), along with an Adaptive Attention Adjustment (AAA) strategy for feature fusion. The proposed method achieves SOTA performance on four datasets: GRAB, NTU, BABEL, and HumanAct12.
Structure-Aware Correspondence Learning for Relative Pose Estimation: Proposes a structure-aware correspondence learning method (SAC-Pose) that learns keypoints representing object structures and directly regresses 3D-3D correspondences based on inter-image structure-aware features (without explicit feature matching), significantly improving the accuracy of relative pose estimation for unseen object categories.
Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach: This paper proposes a multimodal approach for video-level ambivalence/hesitancy (A/H) recognition, integrating four modalities: scene (VideoMAE), face (EmotionEfficientNetB0), audio (EmotionWav2Vec2.0+Mamba), and text (EmotionDistilRoBERTa). Through a prototype-augmented Transformer fusion model, it achieves an average MF1 of 83.25%, with a five-model ensemble ultimately reaching 71.43% on the test set.
Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation: This paper proposes the 2BY2 dataset—the first large-scale dataset for daily pairwise object assembly (18 task classes, 517 object pairs)—and designs a two-step SE(3) pose estimation network that leverages equivariant features to achieve multi-task pairwise object assembly, achieving state-of-the-art (SOTA) performance across all tasks and demonstrating generalization capabilities through real-world robot experiments.
Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models: Proposes EED (Efficient Ensemble Defense), which generates multiple sub-models from a single base model using different pruning strategies (NIS/ERM/ASE/BNSF) and dynamically ensembles them. At 80% sparsity, it achieves 55.71% PGD robust accuracy on CIFAR-10 (close to the uncompressed baseline) with a \(1.86\times\) inference speedup.
UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation: This paper proposes UniHOPE, the first framework to unify Hand-Only Pose Estimation (HPE) and Hand-Object Pose Estimation (HOPE). It dynamically controls outputs via an object switcher, eliminates interference from irrelevant object features through grasp-aware feature fusion, and learns occlusion-invariant features using diffusion-based generative de-occlusion combined with multi-level feature enhancement.
UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing: UniPose proposes the first unified multimodal framework that utilizes LLMs to discretize 3D human pose into pose tokens that share a vocabulary with text tokens. Through a mixture-of-visual-encoders and a mixed attention mechanism, it achieves unified modeling of seven core pose tasks (comprehension, generation, and editing) across images, text, and 3D SMPL poses.
UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image: The UNOPose method and benchmark are proposed to estimate the 6DoF relative pose of unseen objects using only a single unposed RGB-D reference image. Through an \(SE(3)\)-invariant reference frame and overlap-aware matching, it achieves performance comparable to methods relying on CAD models.
VI3NR: Variance Informed Initialization for Implicit Neural Representations: VI3NR, a variance-informed initialization method for implicit neural representations (INRs) applicable to arbitrary activation functions, is derived, generalizing Xavier/Kaiming initialization to non-standard activations such as Gaussian and Sinc. By controlling the variance consistency of forward and backward propagation, stability in both directions is simultaneously satisfied using a single degree of freedom \(\sigma_p^2\), significantly improving the convergence speed and reconstruction quality of INRs.
VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction: VTON 360 is proposed, which reformulates 3D virtual try-on as a multi-view consistent 2D virtual try-on expansion problem. By combining pseudo-3D pose representation, multi-view spatial attention, and multi-view CLIP embedding, it achieves high-fidelity virtual try-on from arbitrary viewing directions.
WildAvatar: Learning In-the-Wild 3D Avatars from the Web: Proposes an automated annotation pipeline and filtering protocols to construct WildAvatar, a large-scale, in-the-wild 3D avatar creation dataset containing over 10,000 human subjects, which is over 10 times larger than previous datasets. The annotation pipeline outperforms existing SMPL annotation methods on the EMDB benchmark.
WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild: This paper proposes WiLoR, an end-to-end multi-hand reconstruction pipeline in-the-wild, featuring a real-time fully convolutional hand detector and a Transformer-based high-fidelity 3D hand reconstruction model that achieves image alignment via a multi-scale refinement module.
X-Dyna: Expressive Dynamic Human Image Animation: X-Dyna proposes a zero-shot human image animation pipeline based on diffusion models. Through a lightweight Dynamics-Adapter module, it generates realistic human and scene dynamics while maintaining appearance consistency, and introduces S-Face ControlNet to achieve identity-decoupled facial expression transfer.
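The VI3NR entry above describes generalizing Xavier/Kaiming initialization to arbitrary activations by keeping variances consistent across layers. The sketch below illustrates only the forward-pass half of that idea and is not the paper's derivation. It assumes zero-mean weights independent of the activations and reads \(\sigma_p^2\) as the target pre-activation variance; the helper names `second_moment` and `weight_variance` are hypothetical. Under those assumptions, a layer with `fan_in` inputs keeps its pre-activation variance at \(\sigma_p^2\) when \(\mathrm{Var}(w) = \sigma_p^2 / (\text{fan\_in} \cdot \mathbb{E}[\phi(z)^2])\) with \(z \sim \mathcal{N}(0, \sigma_p^2)\).

```python
import math

def second_moment(phi, sigma_p, n=20001, span=8.0):
    """Trapezoidal estimate of E[phi(z)^2] for z ~ N(0, sigma_p^2),
    integrated over [-span * sigma_p, span * sigma_p]."""
    lo = -span * sigma_p
    h = 2.0 * span * sigma_p / (n - 1)
    norm = sigma_p * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n):
        z = lo + k * h
        pdf = math.exp(-0.5 * (z / sigma_p) ** 2) / norm
        weight = 0.5 if k in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * phi(z) ** 2 * pdf
    return total * h

def weight_variance(phi, fan_in, sigma_p=1.0):
    """Weight variance that keeps the next layer's pre-activation variance at sigma_p^2
    (forward pass only; the backward-pass condition is not modelled here)."""
    return sigma_p ** 2 / (fan_in * second_moment(phi, sigma_p))

def relu(z):
    return max(z, 0.0)

def gaussian(z):
    return math.exp(-z * z)

relu_var = weight_variance(relu, fan_in=256)       # recovers Kaiming's 2 / fan_in
gauss_var = weight_variance(gaussian, fan_in=100)  # sqrt(5) / fan_in when sigma_p = 1
```

For ReLU the second moment is \(\sigma_p^2 / 2\), so the rule reduces to the familiar Kaiming value \(2/\text{fan\_in}\) regardless of \(\sigma_p\). For a non-standard activation such as the Gaussian, the required variance depends on \(\sigma_p\), which is why it becomes a tunable degree of freedom.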