🧑 Human Understanding

🔬 ICLR2026 · 8 paper notes

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behaviour Analysis

This paper introduces BAH, the first multimodal dataset for Ambivalence/Hesitancy (A/H) recognition in videos, comprising 1,118 video clips (8.26 hours in total) from 224 participants across 9 Canadian provinces, annotated by behavioural-science experts, with baseline results reported at both the frame and video level.

Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics

This paper proposes the Q Avatar framework, which quantifies the transferability of source-domain models via cross-domain Bellman consistency and combines source- and target-domain Q-functions through an adaptive, hyperparameter-free weighting function. The framework enables reliable knowledge transfer in cross-domain RL with mismatched state-action spaces, guaranteeing no negative transfer regardless of source model quality or domain similarity.
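The weighting idea can be illustrated with a toy tabular sketch: score each Q-function by its Bellman residual on target-domain transitions, then mix the two with a softmax over the negative residuals, so a badly transferring source model is automatically down-weighted without any tuning knob. All names and the tabular setting here are illustrative assumptions, not the paper's actual formulation.

```python
import math

def bellman_residual(q, transitions, gamma=0.99):
    """Mean squared Bellman error of a tabular Q-function on target-domain
    transitions (s, a, r, s'). Hypothetical tabular stand-in for the paper's
    cross-domain Bellman consistency measure."""
    err = 0.0
    for (s, a, r, s2) in transitions:
        target = r + gamma * max(q[s2].values())
        err += (q[s][a] - target) ** 2
    return err / len(transitions)

def hybrid_q(q_src, q_tgt, transitions, s, a):
    """Combine source and target Q-values with weights from a softmax over
    negative Bellman residuals: consistency alone sets the mix, so there is
    no weighting hyperparameter, and an inconsistent source Q is ignored."""
    r_src = bellman_residual(q_src, transitions)
    r_tgt = bellman_residual(q_tgt, transitions)
    w_src = math.exp(-r_src) / (math.exp(-r_src) + math.exp(-r_tgt))
    return w_src * q_src[s][a] + (1.0 - w_src) * q_tgt[s][a]
```

With a source Q that badly violates the target's Bellman equation, the combined value collapses onto the target estimate, which is the intuition behind the no-negative-transfer guarantee.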

GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences

This paper proposes a Snippet paradigm that organizes gait silhouette sequences into several "snippets," each formed by randomly sampling frames from a contiguous interval. This design captures both short-range temporal context and long-range temporal dependencies, achieving 77.5% Rank-1 on Gait3D with a 2D convolution backbone, surpassing all 3D convolution methods.
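The sampling scheme is easy to sketch: partition the sequence into contiguous intervals, then draw a random subset of frames inside each one. Randomness within an interval preserves short-range context; the spread of intervals across the sequence covers long-range dependencies. The snippet count and size below are illustrative defaults, not the paper's settings.

```python
import random

def sample_snippets(seq_len, num_snippets=4, frames_per_snippet=8, seed=None):
    """Split a silhouette sequence of length seq_len into num_snippets
    contiguous intervals and randomly sample sorted frame indices in each.
    A minimal sketch of the snippet sampling idea."""
    rng = random.Random(seed)
    interval = seq_len // num_snippets
    snippets = []
    for i in range(num_snippets):
        lo = i * interval
        hi = (i + 1) * interval if i < num_snippets - 1 else seq_len
        k = min(frames_per_snippet, hi - lo)
        snippets.append(sorted(rng.sample(range(lo, hi), k)))
    return snippets
```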

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

This paper proposes TEMU-VTOFF, a Dual-DiT architecture for the Virtual Try-Off (VTOFF) task. A feature extractor and a garment generator collaborate in a division-of-labor design; Multimodal Hybrid Attention (MHA) fuses image, text, and mask signals to resolve visual ambiguity; and a DINOv2-driven garment aligner preserves high-frequency details. The method achieves state-of-the-art performance on both VITON-HD and the multi-category Dress Code benchmark.

NeuroGaze-Distill: Brain-informed Distillation and Depression-Inspired Geometric Priors for Robust Facial Emotion Recognition

This paper proposes NeuroGaze-Distill, a cross-modal distillation framework that extracts static Valence-Arousal prototypes from an EEG-trained teacher model and injects them into a purely visual student model via Proto-KD and depression-inspired geometric priors (D-Geo), improving cross-dataset robustness for facial expression recognition without requiring paired EEG-face data.
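The prototype trick is what removes the paired-data requirement: teacher (EEG) features are averaged per Valence-Arousal bin once, offline, and the student is trained against those cached prototypes rather than against per-sample EEG. A minimal list-based sketch of that idea, with hypothetical bin labels and a plain squared-distance loss standing in for the paper's Proto-KD objective:

```python
def va_prototypes(teacher_feats, va_labels):
    """Average EEG-teacher feature vectors per Valence-Arousal bin to get
    static prototypes (computed once; no EEG needed afterwards)."""
    groups = {}
    for feat, lab in zip(teacher_feats, va_labels):
        groups.setdefault(lab, []).append(feat)
    return {lab: [sum(dim) / len(dim) for dim in zip(*feats)]
            for lab, feats in groups.items()}

def proto_kd_loss(student_feat, label, protos):
    """Squared distance from a visual-student feature to the prototype of
    its VA bin; the student sees only faces plus cached prototypes."""
    proto = protos[label]
    return sum((s - p) ** 2 for s, p in zip(student_feat, proto))
```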

PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

This work constructs the PersonaX multimodal dataset (comprising LLM-inferred Big Five behavior traits, facial embeddings, and biographical metadata) and proposes a two-level analysis framework: structured independence testing combined with unstructured causal representation learning (with theoretical identifiability guarantees), revealing cross-modal causal structures.
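The structured side of the framework boils down to testing whether a trait score and a feature from another modality are statistically dependent. One common, assumption-light way to do this is a permutation test; the sketch below uses absolute covariance as the statistic purely for illustration, not as the paper's actual test.

```python
import random

def perm_independence_test(x, y, n_perm=500, seed=0):
    """Permutation test of independence between two equal-length numeric
    lists: compare observed |covariance| against its null distribution
    under shuffling of y. Returns a smoothed p-value."""
    rng = random.Random(seed)

    def cov(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

    obs = abs(cov(x, y))
    hits = 0
    for _ in range(n_perm):
        yp = y[:]
        rng.shuffle(yp)
        if abs(cov(x, yp)) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p == 0
```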

QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture

QuaMo proposes a 3D human kinematics capture method based on quaternion differential equations (QDE). By solving kinematic equations under the unit-quaternion sphere constraint \(\mathcal{S}^3\) and introducing a second-order, acceleration-augmented meta-PD controller, the method achieves discontinuity-free, low-jitter, real-time online human motion estimation, surpassing state-of-the-art methods on Human3.6M and several other benchmarks.
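The core of QDE-style integration is that orientation evolves as \(\dot{q} = \tfrac{1}{2}\, q \otimes (0, \omega)\), and each step must be projected back onto \(\mathcal{S}^3\) so the quaternion remains a valid rotation (which is what avoids discontinuities). A generic Euler sketch of that constraint handling, not the paper's solver or controller:

```python
import math

def quat_mul(p, q):
    """Hamilton product of two quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def integrate_quat(q, omega, dt):
    """One Euler step of q_dot = 0.5 * q (x) (0, omega), followed by
    renormalization onto the unit sphere S^3 so q stays a valid rotation."""
    dq = quat_mul(q, (0.0, *omega))
    q = tuple(qi + 0.5 * dt * dqi for qi, dqi in zip(q, dq))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)
```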

Visual Autoregressive Modeling for Instruction-Guided Image Editing

VAREdit reformulates instruction-guided image editing as a next-scale prediction problem. It proposes the Scale-Aligned Reference (SAR) module to resolve the scale mismatch between finest-scale conditioning and coarse target features. On EMU-Edit and PIE-Bench, the GPT-Balance score surpasses the strongest diffusion baseline by 64.9% and 45.3%, respectively, with 512×512 editing completed in only 1.2 seconds.
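The scale mismatch SAR addresses can be pictured with a much simpler operation: pooling the finest-scale reference feature map down to each coarser prediction scale, so every scale is conditioned on reference features of matching resolution. The average-pooling sketch below is a stand-in for the learned module, using nested lists in place of tensors.

```python
def align_reference(ref, target_size):
    """Average-pool a finest-scale 2D feature map (list of lists) down to
    target_size = (th, tw). Assumes the dimensions divide evenly; a toy
    illustration of matching reference features to a coarser scale."""
    h, w = len(ref), len(ref[0])
    th, tw = target_size
    fh, fw = h // th, w // tw
    out = []
    for i in range(th):
        row = []
        for j in range(tw):
            block = [ref[i*fh + di][j*fw + dj]
                     for di in range(fh) for dj in range(fw)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out
```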