🧑 Human Understanding

🤖 AAAI2026 · 20 paper notes

🔥 Top topics: Face & Gaze ×7 · Human Pose ×3

AHAN: Asymmetric Hierarchical Attention Network for Identical Twin Face Verification

To address the extreme fine-grained recognition challenge of identical twin face verification, this paper proposes AHAN, a multi-stream architecture that performs multi-scale analysis of semantic facial regions via Hierarchical Cross-Attention (HCA), captures left-right facial asymmetry signatures through a Facial Asymmetry Attention Module (FAAM), and incorporates Twin-Aware Pair-Wise Cross-Attention (TA-PWCA) as a training regularizer. On the ND_TWIN dataset, AHAN improves twin verification accuracy from 88.9% to 92.3%, a gain of 3.4 percentage points.

CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning

This paper presents the first approach to leverage CLIP-extracted fine-grained facial semantic attribute embeddings for Face Template Inversion (FTI). A cross-modal feature interaction network fuses leaked templates with attribute embeddings and projects them into the StyleGAN latent space, synthesizing identity-consistent face images with richer attribute details. The method surpasses state-of-the-art methods in recognition accuracy, attribute similarity, and cross-model attack transferability.

CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

This paper proposes CoordAR, which formulates 3D-3D correspondence estimation in single-reference-view 6D pose estimation as an autoregressive generation problem over discrete tokens. Through coordinate map tokenization, modality-decoupled encoding, and an autoregressive Transformer decoder, CoordAR substantially outperforms existing single-view methods on multiple benchmarks and demonstrates strong robustness to challenging scenarios such as symmetry and occlusion.

Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

This paper proposes Facial-R1, a three-stage alignment training framework (SFT → RL → Data Synthesis) that aligns the reasoning process of VLMs with emotion recognition outcomes by treating AU and emotion labels as verifiable reward signals. The framework achieves state-of-the-art performance on 8 benchmarks and introduces the FEA-20K dataset.

GazeInterpreter: Parsing Eye Gaze to Generate Eye-Body-Coordinated Narrations

This paper proposes GazeInterpreter, an LLM-based hierarchical framework that converts raw gaze signals into textual narrations via a symbolic gaze parser, integrates them with body motion narrations to produce eye-body-coordinated descriptions, and iteratively refines outputs through a self-correction loop. The resulting narrations yield significant improvements on downstream tasks including text-driven motion generation, action prediction, and behavior summarization.

Generating Attribute-Aware Human Motions from Textual Prompt

This paper proposes AttrMoGen, a framework that decouples action semantics from human attributes (age, gender, etc.) via a Structural Causal Model (SCM)-based Causal Information Bottleneck, enabling attribute-aware human motion generation from text prompts. The authors also introduce HumanAttr, the first large-scale text-motion dataset with extensive attribute annotations.

Improving Sparse IMU-based Motion Capture with Motion Label Smoothing

This paper proposes Motion Label Smoothing, adapting classical label smoothing from classification tasks to sparse IMU-based motion capture. By incorporating skeleton-structure-aware Perlin noise as smoothed labels, the method improves accuracy across three state-of-the-art methods on four datasets in a plug-and-play manner without modifying model architectures. GlobalPose achieves a 20.41% reduction in SIP error on TotalCapture.
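For readers unfamiliar with the starting point, the sketch below shows only classical label smoothing for classification, in which a one-hot target is mixed with a uniform distribution. It does not reproduce the paper's skeleton-structure-aware Perlin noise; the function name `smooth_labels` and the value `epsilon=0.1` are illustrative assumptions.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Classical label smoothing (not the paper's motion variant).

    Each target value y becomes (1 - epsilon) * y + epsilon / K,
    where K is the number of classes.
    """
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


# Hypothetical 4-class example with the true class at index 1.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
print(smoothed)
```

The smoothed target still sums to 1, but the true class no longer carries all of the probability mass.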

KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

This paper proposes KineST, a kinematics-guided state space model that reconstructs whole-body motion from sparse HMD signals via a kinematic tree bidirectional scanning strategy and hybrid spatiotemporal representation learning, surpassing state-of-the-art methods in both accuracy and temporal consistency.

mmPred: Radar-based Human Motion Prediction in the Dark

This work is the first to introduce millimeter-wave radar into human motion prediction (HMP). It proposes mmPred, a diffusion-based framework that employs dual-domain historical motion representations (time-domain pose refinement, TPR, and frequency-domain dominant motion, FDM) and a Global Skeleton Transformer (GST) to suppress radar-specific noise and temporal inconsistency. mmPred surpasses SOTA methods by 8.6% and 22% on the mmBody and mm-Fi datasets, respectively.

Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification

To address the core problem of unreliable cross-modality associations in unsupervised visible-infrared person re-identification (USVI-ReID), this paper proposes modality-aware Jaccard distance correction and a "split-and-contrast" invariance learning strategy. By eliminating modality bias, the method enables reliable global cross-modality clustering and feature alignment, achieving state-of-the-art performance on SYSU-MM01 and RegDB.
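As background, the plain Jaccard distance that the correction builds on can be sketched as below. In re-identification clustering it is often computed over each sample's nearest-neighbor set. The modality-aware correction itself is not described here and is not reproduced; the function name and the neighbor lists are hypothetical.

```python
def jaccard_distance(neighbors_a, neighbors_b):
    """Plain Jaccard distance between two neighbor sets: 1 - |A & B| / |A | B|."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical (an assumption of this sketch).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical neighbor indices for two samples: they share 2 of 6 distinct neighbors.
d = jaccard_distance([1, 2, 3, 4], [3, 4, 5, 6])
print(d)
```

Samples with identical neighbor sets get distance 0 and samples with disjoint sets get distance 1.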

MVGD-Net: A Novel Motion-aware Video Glass Surface Detection Network

Grounded in the physical observation that objects in reflection/transmission layers move at different velocities than those in non-glass regions, this paper proposes MVGD-Net, which leverages optical flow motion cues to guide glass surface detection in videos. The framework comprises four core modules: Cross-scale Multimodal Fusion Module (CMFM), History-Guided Attention Module (HGAM), Temporal Cross-Attention Module (TCAM), and Temporal-Spatial Decoder (TSD). A large-scale dataset, MVGD-D, containing 312 videos and 19,268 frames is also introduced.

New Synthetic Goldmine: Hand Joint Angle-Driven EMG Data Generation Framework for Micro-Gesture Recognition

This paper proposes SeqEMG-GAN, a conditional adversarial generation framework driven by hand joint angle sequences. Through the joint design of an angle encoder, a two-level context encoder (featuring the novel Ang2Gist unit), a deep convolutional generator, and a multi-view discriminator, the framework synthesizes high-fidelity EMG signals from joint kinematic trajectories, enabling zero-shot generation for unseen gestures. Mixing synthetic and real data for training improves classification accuracy from 57.77% to 60.53%.

PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning

This paper proposes PA-FAS, a framework that addresses two critical bottlenecks of the SFT+RL paradigm in multimodal FAS: insufficient reasoning path diversity and reasoning shortcuts. It does so via a Reasoning Path Augmentation strategy and an answer shuffling mechanism, yielding the first unified solution that covers multimodal fusion, domain generalization, and interpretability together.

ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

This paper proposes ReAlign (Reward-guided sampling Alignment), which employs a step-aware reward model and a reward-guided sampling strategy to dynamically steer sampling trajectories toward distributions with high text-motion alignment during diffusion inference, significantly improving the generation quality of various motion generation methods without fine-tuning any diffusion model. Using MLD as a baseline, R@1 improves by 17.9% and FID improves by 58.8%.

Robust Long-term Test-Time Adaptation for 3D Human Pose Estimation through Motion Discretization

To address error accumulation in online test-time adaptation (TTA) for 3D human pose estimation, this paper proposes a framework combining motion discretization (an anchor motion set obtained via unsupervised clustering), a self-replay mechanism, and a soft reset strategy. The approach enables robust long-term continuous adaptation by leveraging subject-specific body shape and habitual motion patterns, outperforming all existing online TTA methods on Ego-Exo4D and 3DPW.

SOSControl: Enhancing Human Motion Generation through Saliency-Aware Symbolic Orientation and Timing Control

This paper proposes the Salient Orientation Symbolic (SOS) script, a programmable symbolic motion representation framework inspired by Labanotation that extracts keyframe saliency via temporally-constrained agglomerative clustering. It also introduces an SMS data augmentation strategy and a gradient-optimization-based SOSControl framework for precise control over body-part orientation and motion timing. On HumanML3D, the method achieves an SOS-Acc of 0.988 with an FID of only 3.892.
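The general idea of temporally-constrained agglomerative clustering can be illustrated with a minimal sketch in which only temporally adjacent segments may merge, so every cluster stays a contiguous span of frames. This is not the paper's procedure: the scalar per-frame feature, the mean-difference merge criterion, and the name `temporal_agglomerative` are all assumptions made for illustration.

```python
def temporal_agglomerative(values, n_segments):
    """Greedily merge adjacent segments until n_segments remain.

    Returns a list of (start, end_exclusive) frame index pairs.
    """
    if not 1 <= n_segments <= len(values):
        raise ValueError("n_segments must be between 1 and len(values)")

    # Start with one segment per frame.
    segments = [(i, i + 1) for i in range(len(values))]

    def mean(segment):
        start, end = segment
        return sum(values[start:end]) / (end - start)

    while len(segments) > n_segments:
        # Only neighboring segments are merge candidates (the temporal constraint).
        gaps = [
            abs(mean(segments[i]) - mean(segments[i + 1]))
            for i in range(len(segments) - 1)
        ]
        i = gaps.index(min(gaps))
        segments[i:i + 2] = [(segments[i][0], segments[i + 1][1])]
    return segments


# Hypothetical per-frame feature with three visibly distinct phases.
frames = [0.0, 0.1, 0.0, 5.0, 5.1, 5.0, 9.0, 9.2]
segments = temporal_agglomerative(frames, n_segments=3)
print(segments)
```

Segment boundaries produced this way are natural candidates for keyframes, since they mark where the feature changes most.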

Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction

This paper proposes ST-MoE, the first framework to combine Mixture of Experts (MoE) with bidirectional spatiotemporal Mamba for multi-person motion prediction. Four heterogeneous spatiotemporal experts flexibly capture complex spatiotemporal dependencies, achieving state-of-the-art accuracy while reducing parameter count by 41.38% and accelerating training by 3.6×.
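A generic softmax-gated mixture of experts, the building block this summary refers to, might look like the sketch below. It is not ST-MoE: the four experts here are toy scalar functions rather than spatiotemporal Mamba blocks, the gate scores are passed in directly instead of being learned, and whether the paper uses dense or sparse gating is not stated in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(x, experts, gate_scores):
    """Dense MoE: weight every expert's output by its softmax gate value."""
    weights = softmax(gate_scores)
    return sum(w * expert(x) for w, expert in zip(weights, experts))


# Four hypothetical heterogeneous experts operating on a scalar input.
experts = [
    lambda x: x,
    lambda x: 2.0 * x,
    lambda x: x * x,
    lambda x: -x,
]

# Equal gate scores give each expert a weight of 0.25.
y = mixture_of_experts(3.0, experts, gate_scores=[0.0, 0.0, 0.0, 0.0])
print(y)
```

Raising one gate score relative to the others shifts the output toward that expert, which is how a gate can adapt the mixture to each input.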

Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

This paper proposes a streaming co-speech gesture generation framework based on Rolling Diffusion, which converts arbitrary diffusion models into streaming gesture generators via a structured progressive noise schedule. It further introduces Rolling Diffusion Ladder Acceleration (RDLA) to achieve up to 4× speedup (200 FPS), comprehensively outperforming baselines on the ZEGGS and BEAT benchmarks.

Toward Gaze Target Detection in Young Autistic Children

To address the severe class imbalance in gaze target detection for autistic children, where face-directed gaze accounts for only 6.6% of samples, this paper proposes the Socially Aware Coarse-to-Fine (SACF) framework. A fine-tuned Qwen2.5-VL serves as a social-context-aware gate that routes inputs to either a socially aware or a socially agnostic expert model. Evaluated on the newly introduced AGT dataset, the framework substantially improves face gaze detection performance (Face L2 reduced by 13.9% on Sharingan; F1 improved from 0.753 to 0.761).

VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

This paper proposes VPHO, a framework for hand-object pose estimation that jointly leverages visual and physical cues. It introduces a force prediction module to learn 3D physical cues and designs a two-stage candidate pose aggregation strategy (visual-guided + physics-guided) to achieve physical plausibility while preserving visual consistency. VPHO attains state-of-the-art performance in both pose accuracy and physical plausibility on the DexYCB and HO3D benchmarks.