🧑 Human Understanding

🧠 NeurIPS 2025 · 19 paper notes

BEDLAM2.0: Synthetic Humans and Cameras in Motion

BEDLAM2.0 is a comprehensive upgrade over BEDLAM, introducing diverse camera motions (synthetic translation/tracking/orbit + handheld/head-mounted capture), broader body shape coverage (BMI 18–41), strand-based hair, shoes, size-graded clothing, and more 3D environments. The resulting dataset comprises 27K+ sequences and 8M+ frames; models trained exclusively on this synthetic data surpass the state of the art in world-coordinate human motion estimation.

ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

This paper proposes ConceptScope, a framework that trains sparse autoencoders (SAEs) on representations from visual foundation models to automatically discover and quantify visual concept biases in datasets, categorizing discovered concepts as target, context, or bias without any manual annotation.
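The core building block, an overcomplete sparse autoencoder over frozen feature vectors, can be sketched as follows. This is a minimal NumPy illustration of the general SAE recipe (ReLU encoder, linear decoder, L1 sparsity penalty), not the paper's exact architecture; all names and sizes here are assumptions.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_weight=1e-3):
    """One forward pass of a sparse autoencoder on feature vectors x.

    x:     (batch, d_model) frozen foundation-model features
    W_enc: (d_model, d_dict) encoder weights with d_dict >> d_model,
           so each feature decomposes into a few active dictionary units
    Returns the reconstruction, the sparse codes, and the training loss.
    """
    z = np.maximum(0.0, x @ W_enc + b_enc)    # ReLU gives non-negative codes
    x_hat = z @ W_dec + b_dec                 # linear decoder
    recon = np.mean((x - x_hat) ** 2)         # reconstruction term
    sparsity = np.mean(np.abs(z))             # L1 term keeps few units active
    return x_hat, z, recon + l1_weight * sparsity

rng = np.random.default_rng(0)
d_model, d_dict, batch = 16, 64, 8
x = rng.normal(size=(batch, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
x_hat, z, loss = sae_forward(x, W_enc, np.zeros(d_dict), W_dec, np.zeros(d_model))
```

After training, each dictionary unit tends to fire for one visual concept, and per-dataset activation statistics over the codes `z` are what a framework like this can aggregate into concept-frequency profiles.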

CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals

This paper proposes the CPEP framework, which employs contrastive learning to align low-quality EMG signal representations with high-quality hand pose representations, endowing the EMG encoder with pose-awareness. CPEP is the first to achieve zero-shot recognition of unseen gestures from EMG signals, yielding a 21% improvement on in-distribution gesture classification and a 72% improvement on unseen gesture classification.
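The alignment objective behind this kind of cross-modal pre-training is typically a symmetric InfoNCE loss over paired embeddings. The sketch below is a generic NumPy version under that assumption; the function name and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(emg_emb, pose_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired EMG and pose embeddings.

    Row i of each matrix is assumed to come from the same (EMG, pose) pair,
    so the diagonal of the similarity matrix holds the positives.
    """
    # L2-normalize so the dot product is cosine similarity
    e = emg_emb / np.linalg.norm(emg_emb, axis=1, keepdims=True)
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = e @ p.T / temperature            # (batch, batch) similarities
    labels = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # cross-entropy in both directions: EMG -> pose and pose -> EMG
    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling each EMG embedding toward its paired pose embedding (and away from the rest of the batch) is what lets the EMG encoder inherit the pose space's structure, which in turn enables matching unseen gestures by nearest pose prototype.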

Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization

Cycle-Sync is a global camera pose estimation framework that extends Message Passing Least Squares (MPLS) to camera position estimation, introduces a Welsch-type robust loss and cycle-consistency weighting, and surpasses all baselines—including complete SfM pipelines with bundle adjustment (BA)—without requiring BA.
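The Welsch loss itself has a standard closed form worth seeing: it is quadratic for small residuals but saturates for large ones, so gross outlier correspondences contribute only a bounded penalty. A minimal NumPy sketch (the scale parameter `c` is a free choice here, not the paper's setting):

```python
import numpy as np

def welsch(r, c=1.0):
    """Welsch robust loss: ~r**2/2 near zero, saturating at c**2/2 for
    large residuals, so outliers cannot dominate the objective."""
    return (c ** 2 / 2.0) * (1.0 - np.exp(-(r / c) ** 2))

# Small residuals are penalized quadratically; huge ones plateau.
residuals = np.array([0.0, 0.5, 5.0, 50.0])
losses = welsch(residuals)
```

Because the loss flattens out, iteratively reweighted solvers effectively assign near-zero weight to measurements with large residuals, which is one reason such pipelines can stay accurate without a bundle-adjustment cleanup pass.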

DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces

This paper proposes DevFD—a developmental MoE architecture that models the common characteristics of real faces via a shared Real-LoRA, incrementally captures new forgery types via a sequence of orthogonal Fake-LoRAs, and mitigates catastrophic forgetting through an orthogonality loss combined with orthogonal gradient constraints. DevFD achieves state-of-the-art accuracy and the lowest forgetting rate in continual learning for face forgery detection.
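One common way to enforce that a new LoRA occupies a subspace disjoint from earlier adapters is to penalize the overlap of their down-projection row spaces. The sketch below illustrates that generic idea in NumPy; the function name and the exact form of the penalty are assumptions, not DevFD's published loss.

```python
import numpy as np

def lora_orthogonality_penalty(A_new, A_prev_list):
    """Penalize overlap between a new LoRA's subspace and earlier ones.

    Each A is a (rank, d) down-projection matrix. If the row spaces are
    orthogonal, A_new @ A_prev.T is the zero matrix, so the squared
    Frobenius norm of that product measures subspace interference.
    """
    return sum(np.linalg.norm(A_new @ A.T, "fro") ** 2 for A in A_prev_list)
```

Driving this penalty to zero means updates for a new forgery type cannot rotate into directions already used by earlier Fake-LoRAs, which is the mechanism by which orthogonality constraints limit catastrophic forgetting.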

Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge

FreeCure reveals that identity embeddings in face personalization models suppress but do not destroy the prompt control capability of the foundation model. Based on this insight, the paper proposes a training-free framework that injects attribute information from the foundation model into the personalized generation process via Foundation-Aware Self-Attention (FASA). The method substantially improves prompt consistency while preserving identity fidelity, and can be seamlessly integrated into mainstream architectures including SD, SDXL, and FLUX.

HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion

This paper models human-object interaction (HOI) generation as a Driver-Responder system, employing a lightweight Transformer-based interaction dynamics model to explicitly predict how objects respond to human actions. A residual dynamics loss is introduced during training to enforce causal consistency, while inference efficiency is preserved.

K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning

This paper proposes K-DeCore, a framework that decouples structured knowledge reasoning into two stages — task-agnostic schema filtering and task-specific query construction — and combines dual-perspective memory construction with structure-guided pseudo-data synthesis to enable effective knowledge transfer across heterogeneous SKR tasks under a fixed parameter budget.

Mechanistic Interpretability of RNNs Emulating Hidden Markov Models

A vanilla RNN is trained to reproduce the emission statistics of a Hidden Markov Model (HMM), and its internal mechanisms are reverse-engineered to reveal that the network implements discrete stochastic state transitions via noise-sustained orbital dynamics, "kick neuron" circuits, and self-induced stochastic resonance.
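The training target in this setup is the emission process of a discrete HMM, which is easy to make concrete. A minimal NumPy sampler (variable names are illustrative):

```python
import numpy as np

def sample_hmm(T_mat, E_mat, pi, length, rng):
    """Sample a hidden-state path and emission sequence from a discrete HMM.

    T_mat: (S, S) state-transition probabilities (rows sum to 1)
    E_mat: (S, O) emission probabilities per hidden state
    pi:    (S,)   initial state distribution
    """
    states, emissions = [], []
    s = rng.choice(len(pi), p=pi)
    for _ in range(length):
        states.append(s)
        emissions.append(rng.choice(E_mat.shape[1], p=E_mat[s]))
        s = rng.choice(len(T_mat), p=T_mat[s])  # stochastic state transition
    return np.array(states), np.array(emissions)
```

The interpretability question the paper asks is how a deterministic-update RNN, trained only on sequences like these, internally realizes the stochastic jumps between hidden states — the answer being noise-sustained orbits and kick-neuron circuits rather than an explicit transition matrix.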

MOSPA: Human Motion Generation Driven by Spatial Audio

This work introduces the novel task of spatial-audio-driven human motion generation, constructs the SAM dataset comprising 9+ hours, 27 scenes, and 12 subjects with paired binaural audio and motion data, and proposes the MOSPA diffusion model. By fusing audio features including MFCC, tempogram, and RMS with sound source position and motion style conditions, MOSPA achieves an FID of 7.98, substantially outperforming music/dance baselines such as EDGE (14.0) and POPDG (21.0).

OmniGaze: Reward-inspired Generalizable Gaze Estimation in the Wild

This paper proposes OmniGaze, a semi-supervised 3D gaze estimation framework that employs a reward model — fusing visual embeddings, MLLM-generated semantic gaze descriptions, and geometric direction vectors — to assess pseudo-label quality. Trained on 1.4 million unlabeled face images, OmniGaze achieves state-of-the-art performance under both within-domain and cross-domain settings across 5 datasets, and demonstrates zero-shot generalization on 4 unseen datasets.

PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space

This paper proposes PandaPose, which propagates 2D pose priors into a 3D anchor space as a unified intermediate representation. By combining joint-wise adaptive 3D anchor setting with joint-wise depth distribution estimation, PandaPose achieves robust single-frame 3D human pose lifting against occlusion and 2D pose errors.

Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection

This paper proposes a part-aware bottom-up group reasoning framework that enhances individual embeddings with pose-guided body part features and infers social groups via similarity-based association, achieving new state-of-the-art results on the NVI and Café datasets.

RAPTR: Radar-Based 3D Pose Estimation Using Transformer

This paper proposes RAPTR, the first Transformer framework for radar-based 3D human pose estimation using weak supervision (3D bounding boxes + 2D keypoint labels). Through pseudo-3D deformable attention and structured loss functions, RAPTR substantially outperforms baselines on two indoor datasets.

Some Optimizers are More Equal: Understanding the Role of Optimizers in Group Fairness

This paper presents the first systematic study on how the choice of optimization algorithm affects group fairness in deep learning. Through stochastic differential equation (SDE) analysis and two novel theorems, it demonstrates that adaptive optimizers (RMSProp/Adam) are more likely to converge to fair minima than SGD, particularly under severe data imbalance.
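The mechanism at issue is visible directly in the update rules: SGD scales every coordinate by its raw gradient, while RMSProp normalizes each coordinate by a running RMS, so directions with small gradients (e.g. those driven by a minority group) are not drowned out. A minimal NumPy comparison, with hyperparameters chosen only for illustration:

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    """Plain SGD: each coordinate moves in proportion to its raw gradient."""
    return w - lr * g

def rmsprop_step(w, g, v, lr=0.1, beta=0.9, eps=1e-8):
    """RMSProp: gradients are rescaled per-coordinate by a running RMS,
    equalizing effective step sizes across well- and poorly-represented
    directions."""
    v = beta * v + (1 - beta) * g ** 2
    return w - lr * g / (np.sqrt(v) + eps), v

# A gradient with a dominant (majority-group) and a tiny (minority-group)
# component: SGD's steps differ by 100x, RMSProp's are nearly equal.
w, g = np.zeros(2), np.array([1.0, 0.01])
dw_sgd = w - sgd_step(w, g)
w_rms, v = rmsprop_step(w, g, np.zeros(2))
dw_rms = w - w_rms
```

This per-coordinate rescaling is the intuition the paper formalizes via SDE analysis: adaptive methods are likelier to reach minima that serve under-represented groups, especially under severe imbalance.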

Switchable Token-Specific Codebook Quantization for Face Image Compression

This paper proposes a Switchable Token-Specific Codebook Quantization (STSCQ) mechanism that employs a hierarchical dynamic structure combining image-level codebook routing and token-level codebook partitioning, achieving significant improvements in reconstruction quality and recognition accuracy for face image compression at ultra-low bitrates.
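The basic operation underneath any codebook-quantized codec is nearest-codeword assignment; the "switchable token-specific" part amounts to routing each token to its own codebook before that lookup. The NumPy sketch below illustrates this structure generically — the routing rule, names, and shapes are assumptions, not STSCQ's actual design.

```python
import numpy as np

def quantize_tokens(tokens, codebooks, routing):
    """Assign each token to its nearest codeword in its routed codebook.

    tokens:    (n, d) token embeddings
    codebooks: dict mapping codebook id -> (k, d) codeword matrix
    routing:   (n,) codebook id per token (e.g. from image-level routing
               plus a token-level partition)
    """
    indices = np.empty(len(tokens), dtype=int)
    quantized = np.empty_like(tokens)
    for i, (t, r) in enumerate(zip(tokens, routing)):
        cb = codebooks[r]
        d2 = np.sum((cb - t) ** 2, axis=1)  # squared distance to each codeword
        indices[i] = np.argmin(d2)
        quantized[i] = cb[indices[i]]
    return indices, quantized
```

Only the integer `indices` (plus the routing decisions) need to be transmitted, which is why specializing small codebooks per token position can buy reconstruction quality at ultra-low bitrates.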

UnCLe: Towards Scalable Dynamic Causal Discovery in Non-Linear Temporal Systems

This paper proposes UnCLe, a scalable dynamic causal discovery method based on TCN autoencoder disentanglement and autoregressive dependency matrices. It infers time-varying causal relationships by measuring per-timestep prediction error increments following temporal perturbation, achieving state-of-the-art performance on both static and dynamic causal discovery benchmarks.
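The perturb-and-measure idea can be illustrated in isolation: score variable j's influence as the increase in next-step prediction error after perturbing j's history. This is a simplified ablation-style probe under my own assumptions, not UnCLe's actual estimator.

```python
import numpy as np

def causal_strength(predict, x, j, sigma=1.0, rng=None):
    """Score variable j's influence as the prediction-error increment
    after perturbing its past values.

    predict: callable mapping a (T, d) history to a (d,) next-step prediction
    x:       (T+1, d) time series; the last row is the prediction target
    """
    if rng is None:
        rng = np.random.default_rng(0)
    base_err = np.mean((predict(x[:-1]) - x[-1]) ** 2)
    x_pert = x[:-1].copy()
    x_pert[:, j] += rng.normal(scale=sigma, size=len(x_pert))
    pert_err = np.mean((predict(x_pert) - x[-1]) ** 2)
    return pert_err - base_err  # larger increment => stronger dependence
```

If the predictor truly ignores variable j, the increment is exactly zero, so sweeping (j, t) pairs yields a time-varying dependency score without retraining the model.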

VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image

This paper presents VASA-3D, which adapts VASA-1's 2D motion latent space to a 3D Gaussian splatting representation and leverages VASA-1-synthesized training data for single-image customization, enabling real-time generation (512×512, 75 fps) of lifelike audio-driven 3D head avatars from a single portrait image.

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

This paper proposes VimoRAG, a framework that leverages large-scale in-the-wild video databases as 2D motion priors to enhance 3D motion generation. Two core bottlenecks—human motion video retrieval and error propagation—are addressed via the Gemini-MVR retriever and the McDPO training strategy.