🧑 Human Understanding

📷 CVPR2026 · 61 paper notes

A Two-Stage Dual-Modality Model for Facial Expression Recognition: This paper proposes a two-stage dual-modality framework for facial expression recognition. Stage I adapts a DINOv2 encoder on external datasets via padding-aware augmentation and a training-only MoE head; Stage II performs frame-level audio-visual expression classification using multi-scale facial crops, Wav2Vec 2.0 audio features, and a gated fusion module, achieving 0.5368 Macro-F1 in the ABAW 2026 competition.
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark: This paper proposes LIDMark, the first proactive forensics framework that unifies deepfake detection, tampering localization, and source tracing within a single watermarking scheme. By embedding a 152-dimensional Landmark-Identity watermark (136D facial landmarks + 16D source ID) and leveraging intrinsic/extrinsic consistency, LIDMark achieves three-in-one forensics while surpassing existing methods in both PSNR/SSIM and detection accuracy.
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video: This paper proposes AVATAR, a framework that addresses three fundamental limitations of GRPO in multimodal video reasoning—data inefficiency, advantage collapse, and uniform credit assignment—via an off-policy training architecture (hierarchical replay buffer) and a Temporal Advantage Shaping (TAS) strategy. AVATAR significantly outperforms standard GRPO on audio-visual understanding benchmarks (OmniBench +3.7, 5× sample efficiency improvement).
Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation: This paper reveals that subject-independent cross-validation in facial AU detection introduces a random noise floor of ±0.065 F1 merely from varying subject-to-fold assignments, rendering many claimed SOTA improvements statistically indistinguishable. The authors propose the Leave-One-Dataset-Out (LODO) protocol as a more stable and reliable alternative evaluation scheme.
BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy: This paper proposes a heavily regularized multimodal fusion pipeline that achieves robust video-level recognition of Ambivalence/Hesitancy (A/H) behaviors in naturalistic settings. The framework employs a heterogeneous classifier committee across four modalities — visual (SigLip2), audio (HuBERT), text (F2LLM), and statistical features — combined with PSO-based hard-voting ensemble regularized by a train-validation gap penalty, achieving Macro F1 = 0.7465 on the ABAW10 test set.
CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation: This paper proposes CIGPose, a causal intervention graph-based pose estimation framework that employs a structural causal model (SCM) to identify visual-context confounders, leverages prediction uncertainty to localize confounded keypoints and replaces their embeddings with learned context-free canonical representations, and subsequently models skeletal anatomical constraints via a hierarchical graph neural network. CIGPose achieves a new state of the art of 67.0% AP on COCO-WholeBody.
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation: This paper proposes COG, a framework that models cross-view correspondences as a confidence-aware optimal transport (OT) problem. By predicting per-point confidence scores as transport marginal constraints, COG suppresses contributions from non-overlapping regions and outliers, achieving unsupervised single-reference 6DoF novel object pose estimation on par with supervised methods.
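The confidence-as-marginal idea in the COG note can be illustrated with a small entropic optimal transport solver. This is a minimal sketch, not the paper's formulation: it assumes balanced Sinkhorn iterations, hand-set confidences in place of predicted ones, and hypothetical names (`confidence_marginal`, `sinkhorn`, `eps`, `iters`).

```python
import math

def confidence_marginal(conf):
    # Assumption: confidences are normalized to sum to 1 and used as OT marginals.
    total = sum(conf)
    return [c / total for c in conf]

def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Entropic OT via Sinkhorn scaling; returns a plan with row sums a, column sums b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: three points per view; the third point in each view is treated
# as a low-confidence outlier (hand-set value, not a network prediction).
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
a = confidence_marginal([0.9, 0.9, 0.05])
b = confidence_marginal([0.9, 0.9, 0.05])
plan = sinkhorn(cost, a, b)
row_mass = [sum(row) for row in plan]
col_mass = [sum(plan[i][j] for i in range(3)) for j in range(3)]
```

Because the marginals cap how much mass a point can send or receive, the low-confidence point contributes only about 2.7% of the total transport in this toy setup, regardless of how well it happens to match.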
E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation: This paper proposes E-3DPSM, an event-camera-based egocentric 3D human pose state machine that formulates pose estimation as a continuous-time state evolution process. It integrates bidirectional SSM temporal modeling with a learnable Kalman-style fusion module to combine direct and incremental pose predictions, achieving real-time inference at 80 Hz with a 19% reduction in MPJPE and a 2.7× improvement in temporal stability.
Editing Physiological Signals in Videos Using Latent Representations: This paper proposes PhysioLatent, a framework that encodes input facial videos into the latent space of a 3D VAE, fuses the resulting representation with target heart rate CLIP text embeddings, captures rPPG temporal coherence via AdaLN-enhanced spatiotemporal fusion layers, and employs a FiLM-modulated decoder with a fine-tuned output layer to achieve precise heart rate modification. The method attains a heart rate modulation MAE of 10 bpm while preserving visual quality at PSNR 38.96 dB / SSIM 0.98.
Efficient Onboard Spacecraft Pose Estimation with Event Cameras and Neuromorphic Hardware: This paper presents the first end-to-end 6-DoF spacecraft pose estimation system deployed on BrainChip Akida neuromorphic hardware, and explores accuracy–efficiency trade-offs among event camera representations and quantization-aware training for low-power onboard deployment.
EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR: This paper proposes EgoPoseFormer v2 (EPFv2), which achieves state-of-the-art accuracy in egocentric 3D human motion estimation on the EgoBody3M benchmark (MPJPE 4.02 cm, 15–22% improvement over its predecessor) at 0.8 ms GPU latency. The system combines an end-to-end Transformer architecture (single global query token + causal temporal attention + conditioned multi-view cross-attention) with an uncertainty-distillation-based auto-labeling system.
Face Time Traveller: Travel Through Ages Without Losing Identity: This paper proposes FaceTT, a framework that achieves high-fidelity, identity-consistent face age transformation via three core modules—face-attribute-aware prompt refinement, angular inversion, and adaptive attention control (AAC)—surpassing existing methods across multiple benchmarks.
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision: FlexAvatar introduces learnable bias sink tokens to unify training across monocular and multi-view data, resolving the entanglement between driving signals and target viewpoints, and enables the generation of complete, high-quality, animatable 3D head avatars from a single image.
A2P: From 2D Alignment to 3D Plausibility for Occlusion-Robust Two-Hand Reconstruction: The paper decouples two-hand reconstruction into 2D structural alignment and 3D spatial interaction alignment. Stage 1 employs a Fusion Alignment Encoder (FAE) to implicitly distill three 2D priors from Sapiens (keypoints, segmentation, depth), eliminating the need for the foundation model at inference (56 fps). Stage 2 maps penetrating poses to physically plausible configurations via a penetration-aware diffusion model with collision gradient guidance. On InterHand2.6M, MPJPE is reduced to 5.36 mm (surpassing SOTA 4DHands by 2.13 mm) and penetration volume is reduced by 7×.
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation: FSMC-Pose proposes a lightweight top-down framework for cattle mounting pose estimation, comprising the frequency-spatial fusion backbone CattleMountNet (which employs wavelet transform and Gaussian filtering in the SFEBlock for foreground-background separation, and multi-scale dilated convolutions in the RABlock for context aggregation) and the multiscale self-calibration head SC2Head (spatial-channel co-calibration with a self-calibration branch to correct structural displacement). The paper also introduces MOUNT-Cattle, the first dataset for cattle mounting behavior, achieving 89% AP in complex group-housing environments at extremely low computational cost (4.41 GFLOPs, 2.698M parameters).
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation: This paper proposes FSMC-Pose, a lightweight top-down framework that achieves cattle mounting pose estimation in dense and cluttered farm environments via the frequency-spatial fusion backbone CattleMountNet and the multiscale self-calibration head SC2Head, attaining 89% AP with only 2.698M parameters.
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation: FSMC-Pose presents a lightweight cattle mounting pose estimation framework tailored for dense farm environments. By combining the frequency-spatial fusion backbone CattleMountNet with the multiscale self-calibrating prediction head SC2Head, the method achieves 89% AP with only 2.698M parameters and 4.4 GFLOPs.
FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition: This paper proposes FusionAgent, an intelligent agent framework based on a multimodal large language model (MLLM) for dynamic sample-level model selection in whole-body biometric recognition. Each expert model (face recognition / gait recognition / person re-identification) is encapsulated as a callable tool. Through reinforcement fine-tuning (RFT), the agent learns to adaptively select the optimal model combination for each test sample based on its characteristics. Combined with the newly proposed ACT score fusion strategy, FusionAgent significantly outperforms existing state-of-the-art fusion methods.
HandDreamer: Zero-Shot Text to 3D Hand Model Generation: This paper presents HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. It addresses view inconsistency and geometric distortion in SDS-based optimization through MANO initialization, skeleton-guided diffusion, and a corrective hand shape loss.
HandX: Scaling Bimanual Motion and Interaction Generation: This work introduces HandX—a unified bimanual motion generation infrastructure comprising 54.2 hours of motion data and 485K fine-grained text annotations. It proposes a decoupled automatic annotation strategy (kinematic feature extraction + LLM-based description generation) and benchmarks two generation paradigms—diffusion and autoregressive—demonstrating clear data and model scaling trends.
How to Take a Memorable Picture? Empowering Users with Actionable Feedback: This paper defines a novel task of memorability feedback (MemFeed) and proposes MemCoach — a training-free, activation-steering approach for MLLMs. Via a teacher-student strategy, memorability-aware knowledge is injected into the model's activation space, enabling the MLLM to generate natural-language actionable suggestions that improve photo memorability.
HUM4D: A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture: This paper introduces the HUM4D dataset, covering complex single- and multi-person motion scenarios (rapid movements, occlusions, identity swaps), providing synchronized multi-view RGB/RGB-D sequences, accurate Vicon marker-based ground truth, and SMPL/SMPL-X parameters. Benchmark evaluations reveal significant performance degradation of state-of-the-art markerless methods under realistic conditions.
HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation: This paper reformulates single-image 3D human reconstruction as a 360° orbital video generation problem. A video diffusion model (Wan 2.1) is fine-tuned via LoRA using only 500 3D scans to generate 81-frame orbital videos, from which high-quality textured meshes are reconstructed via VGGT and Mesh Carving. The approach requires no pose annotations and surpasses existing methods in multi-view consistency and identity preservation.
IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations: This paper proposes IDperturb, a geometry-driven sampling strategy that applies angular perturbations to identity embeddings on the unit hypersphere. Without modifying the generative model, it significantly enhances intra-class diversity in synthetic face datasets and improves downstream face recognition performance.
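An angular perturbation on the unit hypersphere, as described in the IDperturb note, can be sketched as a rotation of the identity embedding by a fixed angle toward a random tangent direction. This is an illustrative sketch only: the function names, the 512-dimensional embedding, the fixed 20° angle, and the Gaussian choice of direction are assumptions, and the paper's actual sampling distribution may differ.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angular_perturb(embedding, angle_rad, rng):
    """Rotate a unit embedding by angle_rad toward a random tangent direction."""
    e = normalize(embedding)
    # Draw a random Gaussian vector and remove its component along e,
    # leaving a unit vector in the tangent space of the sphere at e.
    g = [rng.gauss(0.0, 1.0) for _ in e]
    dot = sum(a * b for a, b in zip(g, e))
    t = normalize([a - dot * b for a, b in zip(g, e)])
    # Move along the great circle spanned by e and t.
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [c * a + s * b for a, b in zip(e, t)]

def cosine(u, v):
    # Both inputs are assumed to be unit vectors.
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
identity = normalize([rng.gauss(0.0, 1.0) for _ in range(512)])  # stand-in identity embedding
variant = angular_perturb(identity, math.radians(20.0), rng)
```

Since `e` and `t` are orthonormal, the perturbed vector stays on the unit sphere and its cosine similarity to the original equals the cosine of the chosen angle, which gives direct control over how far each synthetic sample drifts from its identity center.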
LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference: This paper proposes LabanLite, a symbolic motion representation, and the LaMoGen framework, which for the first time enables LLMs to autonomously compose motion sequences through interpretable Laban symbol reasoning, surpassing conventional text-motion joint embedding methods in temporal precision and controllability.
LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics: This paper proposes the LaScA framework, which leverages large language models to generate a deterministic semantic lexicon as affective priors for handcrafted facial and acoustic features. A frozen sentence encoder produces semantic embeddings that are fused with the raw features. LaScA consistently outperforms feature-only baselines in affective dynamics prediction on the Aff-Wild2 and SEWA datasets, and matches or surpasses end-to-end deep models in terms of consistency, efficiency, and interpretability.
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction: This paper proposes LASER, a training-free framework that converts offline feed-forward reconstruction models (e.g., VGGT, π³) into streaming systems via Layer-wise Scale Alignment (LSA), achieving real-time streaming 4D reconstruction of kilometer-scale videos at 14 FPS with 6 GB peak memory on an RTX A6000.
LCA: Large-scale Codec Avatars - The Unreasonable Effectiveness of Large-scale Avatar Pretraining: LCA is the first work to apply the large-scale pretraining/post-training paradigm to 3D avatar modeling: it pretrains on 1 million in-the-wild videos to acquire broad appearance and geometry priors, then post-trains on high-quality multi-view studio data to enhance fine-grained expression fidelity, effectively breaking the inherent trade-off between generalizability and fidelity.
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision: MatchED introduces a lightweight (~21K parameter) plug-and-play module that generates crisp (single-pixel-wide) edge maps by performing one-to-one bipartite matching between predicted and GT edge pixels during training, based on spatial distance and confidence. The module can be appended to any edge detector for end-to-end training, and for the first time matches or surpasses standard post-processing methods without relying on NMS and thinning.
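The one-to-one matching between predicted and ground-truth edge pixels in the MatchED note can be illustrated on a toy patch. The cost below (Euclidean distance plus a penalty for low confidence, with a distance cutoff) and the names `match_edges`, `conf_weight`, and `max_dist` are assumptions for illustration; the brute-force search over permutations is only viable for a handful of pixels, and a practical version would use a proper assignment solver.

```python
import itertools
import math

def match_edges(preds, gts, conf_weight=1.0, max_dist=2.0):
    """preds: [(row, col, confidence)], gts: [(row, col)].
    Returns {gt_index: pred_index} minimizing the total assumed cost.
    Requires len(preds) >= len(gts)."""
    def cost(p, g):
        d = math.hypot(p[0] - g[0], p[1] - g[1])
        if d > max_dist:
            return math.inf  # too far apart to be a valid match
        return d + conf_weight * (1.0 - p[2])

    best, best_cost = {}, math.inf
    # Try every one-to-one assignment of predictions to GT pixels.
    for perm in itertools.permutations(range(len(preds)), len(gts)):
        total = sum(cost(preds[pi], gts[gi]) for gi, pi in enumerate(perm))
        if total < best_cost:
            best, best_cost = dict(enumerate(perm)), total
    return best

# A two-pixel-thick predicted edge against a one-pixel-wide GT edge.
preds = [(0, 0, 0.9), (1, 0, 0.6), (0, 1, 0.8), (1, 1, 0.5)]
gts = [(0, 0), (0, 1)]
matches = match_edges(preds, gts)
unmatched = sorted(set(range(len(preds))) - set(matches.values()))
```

Here each GT pixel claims exactly one prediction, and the two leftover predictions in the second row are left unmatched; treating such leftovers as negatives during training is one way a matching-based loss could push the output toward single-pixel-wide edges.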
Miburi: Towards Expressive Interactive Gesture Synthesis: Miburi is proposed as the first online causal framework for real-time synchronized whole-body gesture and facial expression generation, achieved by directly leveraging the internal token stream of the speech-text large model Moshi and a 2D causal Transformer.
MMGait: Towards Multi-Modal Gait Recognition: MMGait constructs the most comprehensive multi-modal gait recognition benchmark to date (5 sensors, 12 modalities, 725 subjects, 334K sequences), introduces the novel omni-modal gait recognition task, and proposes a unified baseline model, OmniGait.
Mobile-VTON: High-Fidelity On-Device Virtual Try-On: This paper proposes Mobile-VTON, the first diffusion-based virtual try-on system capable of running fully offline on mobile devices. Through a TeacherNet-GarmentNet-TryonNet (TGT) architecture and a Feature-Guided Adversarial (FGA) distillation strategy, the system achieves high-quality try-on results comparable to server-side baselines with only 415M parameters and 2.84 GB memory.
Mobile-VTON: High-Fidelity On-Device Virtual Try-On: This paper presents the first fully offline, on-device diffusion-based virtual try-on framework. Built upon a TeacherNet-GarmentNet-TryonNet (TGT) architecture, it transfers the capabilities of SD3.5 Large to a 415M-parameter lightweight student network via Feature-Guided Adversarial (FGA) distillation. The method matches or surpasses server-side baselines at 1024×768 resolution on VITON-HD and DressCode, with an end-to-end inference time of approximately 80 seconds on a Xiaomi 17 Pro Max.
MoLingo: Motion-Language Alignment for Text-to-Human Motion Generation: MoLingo achieves comprehensive state-of-the-art performance on text-to-human motion generation—across FID, R-Precision, and user studies—by combining a Semantic Alignment Encoder (SAE) with multi-token cross-attention text conditioning, performing masked autoregressive rectified flow in a continuous latent space.
OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition: This paper introduces OMG-Bench, the first large-scale public benchmark for skeleton-based online micro hand gesture recognition (40 classes, 13,948 instances), and proposes the HMATr framework, which unifies detection and classification end-to-end via a hierarchical memory bank and position-aware queries, achieving a 7.6% improvement in detection rate over the previous state of the art.
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery: This paper proposes OnlineHMR, the first online world-grounded human mesh recovery framework that simultaneously satisfies four criteria: system causality, faithfulness, temporal consistency, and efficiency. It achieves streaming camera-space HMR via sliding-window causal learning with KV-cache inference, and performs online global localization through human-centric incremental SLAM combined with EMA trajectory correction.
OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis: This paper proposes OpenFS, a framework that achieves multi-hand fingerspelling recognition with implicit signing-hand detection via dual-level positional encoding, a signing-hand focusing loss, and a monotonic alignment loss. A frame-wise letter-conditioned diffusion generator is further designed to synthesize OOV training data. OpenFS achieves state-of-the-art performance on three benchmarks (ChicagoFSWild / ChicagoFSWildPlus / FSNeo) with inference speed over 100× faster than PoseNet.
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis: This paper proposes ParTY, a framework that employs a Part-Guided Network and Part-aware Text Grounding to significantly improve text–motion semantic alignment at the body-part level while preserving whole-body motion coherence. It thereby resolves the trade-off between holistic methods, which favor global coherence, and part-decomposition methods, which favor part-level expressiveness.
PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement: Starting from the Navier-Stokes equations, this work derives through rigorous mathematical analysis that rPPG pulse signals obey a second-order damped harmonic oscillator model whose discrete solution is equivalent to a causal convolution operator, thereby providing a first-principles justification for the TCN architecture. The resulting PHASE-Net, with only 0.29M parameters, achieves state-of-the-art performance across multiple datasets.
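The claim in the PHASE-Net note that a discrete damped oscillator is equivalent to a causal convolution can be checked numerically on a generic second-order system. This sketch is not the paper's derivation: it assumes a standard two-pole recurrence parameterized by a pole radius `r` (damping) and pole angle `theta` (oscillation frequency), both hypothetical names, and compares it with convolution against the closed-form decaying-sinusoid impulse response.

```python
import math

def oscillator_recursive(u, r, theta, gain=1.0):
    """x[n] = 2 r cos(theta) x[n-1] - r^2 x[n-2] + gain * u[n], zero initial state."""
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    x = []
    for n, u_n in enumerate(u):
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        x.append(a1 * x1 + a2 * x2 + gain * u_n)
    return x

def impulse_response(length, r, theta, gain=1.0):
    # Closed-form response of the recurrence above to a unit impulse:
    # an exponentially decaying sinusoid.
    return [gain * r**k * math.sin((k + 1) * theta) / math.sin(theta)
            for k in range(length)]

def causal_conv(u, h):
    # y[n] depends only on u[n], u[n-1], ..., never on future samples.
    return [sum(h[k] * u[n - k] for k in range(min(n, len(h) - 1) + 1))
            for n in range(len(u))]

r, theta = 0.9, 0.3
u = [math.sin(0.2 * n) + (1.0 if n == 5 else 0.0) for n in range(64)]
x_rec = oscillator_recursive(u, r, theta)
x_conv = causal_conv(u, impulse_response(len(u), r, theta))
max_diff = max(abs(p - q) for p, q in zip(x_rec, x_conv))
```

The two outputs agree to floating-point precision, which is the sense in which a causal convolution with a suitably shaped kernel can represent damped harmonic dynamics, the property the note cites as motivation for a TCN-style architecture.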
RAM: Recover Any 3D Human Motion in-the-Wild: RAM proposes a unified multi-person 3D motion recovery framework integrating a motion-aware semantic tracker SegFollow (built on SAM2 with adaptive Kalman filtering), a memory-augmented temporal human mesh recovery module T-HMR, a lightweight motion predictor, and a gated combiner. It achieves state-of-the-art zero-shot tracking stability and 3D accuracy on benchmarks including PoseTrack and 3DPW, while running 2–3× faster than prior methods.
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback: This paper presents VTON-IQA, a reference-free image quality assessment framework for virtual try-on. It introduces VTON-QBench, a large-scale benchmark comprising 62,688 try-on images annotated with 431,800 human judgments, and proposes an Interleaved Cross-Attention (ICA) module to model interactions among garment, person, and try-on images, achieving image-level quality predictions that are closely aligned with human perception.
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback: This work constructs VTON-QBench (62,688 try-on images, 13,838 qualified annotators, 431,800 annotations) and proposes VTON-IQA, a reference-free image quality assessment framework that jointly models garment fidelity and person preservation via an asymmetric Interleaved Cross-Attention (ICA) module, achieving image-level quality prediction highly aligned with human perception.
RefTon: Reference Person Shot Assist Virtual Try-on: This paper proposes RefTon, a person-to-person virtual try-on framework built on Flux-Kontext. By incorporating an additional reference image — a photo of another person wearing the target garment — RefTon provides richer garment detail information. Combined with a two-stage training strategy and a rescaled position index mechanism, the framework achieves end-to-end try-on without auxiliary conditions (e.g., DensePose, segmentation masks), attaining state-of-the-art performance on VITON-HD and DressCode.
RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised HOI Detection: RegFormer proposes a lightweight relational grounding Transformer module that, under weak supervision with only image-level annotations, leverages spatially-grounded HO queries and interactiveness-aware learning to directly transfer image-level reasoning to instance-level HOI detection without additional training, achieving performance close to fully supervised methods.
ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data: This paper proposes ReMoGen, a modular framework for real-time human interaction-to-reaction motion generation. It learns a general motion prior from large-scale single-person motion data (frozen during downstream training), adapts to different interaction domains (human-human/human-scene) via independently trained Meta-Interaction modules, and achieves per-frame low-latency online updates (0.047 s/frame) through Frame-wise Segment Refinement. ReMoGen comprehensively surpasses state-of-the-art methods on the Inter-X and LINGO benchmarks.
rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training: rPPG-VQA proposes the first video quality assessment framework tailored for remote photoplethysmography (rPPG), i.e., camera-based heart rate measurement. It combines signal-level multi-method consensus SNR with scene-level MLLM disturbance recognition, and uses a two-stage adaptive sampling strategy to curate in-the-wild training data.
Seeing without Pixels: Perception from Camera Trajectories: This paper is the first to systematically elevate camera pose trajectories (6DoF pose sequences) to an independent modality for video perception. Through a contrastive learning framework, a lightweight Transformer encoder, CamFormer, is trained to map camera trajectories into a joint embedding space aligned with text. Across 10 downstream tasks on 5 datasets, the paper demonstrates that camera trajectories serve as a lightweight and robust signal for video content understanding—even surpassing video models requiring thousands of times more computation on physical activity tasks.
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation: This paper proposes Sketch2Colab, which distills a sketch-driven diffusion prior into a rectified flow student network, and combines energy guidance with continuous-time Markov chain (CTMC) discrete event planning to generate coordinated multi-human–object interaction 3D motions from storyboard sketches, achieving state-of-the-art constraint compliance and perceptual quality on CORE4D and InterHuman.
Stake the Points: Structure-Faithful Instance Unlearning: This paper proposes Structguard, which leverages semantic anchors to preserve the semantic relational structure among retained instances during the forgetting process, thereby preventing structural collapse. The method achieves average improvements of 32.9% / 19.3% / 22.5% across image classification, face recognition, and retrieval tasks.
Talking Together: Synthesizing Co-Located 3D Conversations from Audio: This work presents the first method for generating complete facial animations of two participants sharing the same 3D physical space from a single mixed audio stream. It introduces a dual-stream diffusion architecture (shared U-Net + cross-attention), a two-stage mixed-data training strategy, LLM-driven text-to-spatial-layout control, and an auxiliary eye gaze loss to synthesize natural mutual gaze, head turning, and spatially-aware dyadic 3D conversation animations.
Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach: A four-modality fusion pipeline (scene VideoMAE + face EfficientNetB0 + audio Wav2Vec2.0/Mamba + text EmotionDistilRoBERTa) is proposed. Each modality embedding is projected into a shared 128-dimensional space via a prototype-augmented Transformer fusion module and regularized with a prototype classification auxiliary loss. A 5-model ensemble achieves 71.43% Macro F1 on the final test set of the BAH corpus, substantially outperforming all unimodal baselines.
Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach: This paper presents a multimodal Ambivalence/Hesitancy (A/H) recognition approach for the 10th ABAW Competition, integrating four modalities—scene, facial, audio, and text—via a Transformer-based fusion module and a prototype-augmented classification strategy. The best single model achieves an MF1 of 83.25%, and a five-model ensemble reaches 71.43% on the final test set.
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures: TeHOR leverages text descriptions as semantic guidance and jointly optimizes the geometry and texture of 3D humans and objects via Score Distillation Sampling from pretrained diffusion models. This approach eliminates the reliance on contact information required by conventional methods, enabling accurate and semantically consistent 3D reconstruction of both contact and non-contact interactions.
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction: This paper proposes 4DSurf, a general-purpose dynamic scene surface reconstruction framework based on 2D Gaussian splatting. By introducing Gaussian motion-induced SDF flow regularization to constrain the temporally consistent evolution of surfaces, and adopting an overlapping segment partitioning strategy to handle large deformations, 4DSurf surpasses existing SOTA methods by 49% and 19% in Chamfer distance on the Hi4D and CMU Panoptic datasets, respectively.
TriLite: Efficient WSOL with Universal Visual Features and Tri-Region Disentanglement: TriLite employs a frozen DINOv2 ViT backbone with a lightweight TriHead module containing fewer than 800K trainable parameters. By disentangling patch features into foreground, background, and ambiguous regions, and introducing an adversarial background loss, the method achieves state-of-the-art WSOL performance with minimal parameter overhead.
UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos: This paper presents UniDex, a robot foundation suite comprising a large-scale dataset spanning 8 dexterous hands (50K+ trajectories / 9M frames), a Functionally-Aligned Actuator Space (FAAS), and a 3D VLA policy (UniDex-VLA). UniDex-VLA achieves 81% average task progress on real-world tool-use tasks (vs. 38% for π₀) and demonstrates spatial, object-level, and zero-shot cross-hand generalization.
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking: This paper proposes UniLS, the first end-to-end framework for unified speaking and listening facial expression generation. Through a two-stage training paradigm—first learning intrinsic motion priors without audio, then fine-tuning with dual-track audio—UniLS generates natural speaking and listening facial motions simultaneously from dual-track audio input alone, achieving up to 44.1% improvement on listening metrics.
Unleashing Vision-Language Semantics for Deepfake Video Detection: This paper proposes VLAForge, which employs a ForgePerceiver to independently learn diverse forgery cues and forgery localization maps, and integrates an identity-aware Vision-Language Alignment (VLA) scoring mechanism to unleash the cross-modal semantic potential of VLMs for enhanced deepfake video detection, achieving comprehensive state-of-the-art performance across 9 datasets.
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body: This paper proposes ViBES, a 3D conversational agent that unifies language, speech, and body motion via a Mixture of Modal Experts (MoME) architecture and cross-modal attention mechanisms. ViBES generates temporally aligned facial expressions and whole-body motions while preserving the conversational capabilities of a pretrained speech LLM, surpassing the paradigm that treats behavior as simple "modality translation."
Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification: VLADR leverages fine-grained attribute knowledge from vision-language models (VLMs) to enhance lifelong person re-identification. Through a two-stage training pipeline comprising Multi-grain Text Attribute Disentanglement (MTAD) and Inter-domain Cross-modal Attribute Reinforcement (ICAR), the framework explicitly models human body attributes shared across domains to enable effective knowledge transfer and forgetting mitigation, surpassing the state of the art by 1.9%–2.2% in anti-forgetting performance and 2.1%–2.5% in generalization.
WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering: This paper proposes WildCap, a hybrid inverse rendering framework that reconstructs high-quality 4K facial diffuse albedo maps from casual in-the-wild smartphone videos. The approach combines data-driven relighting (SwitchLight), model-based texel grid lighting optimization, and diffusion prior sampling, substantially closing the quality gap between in-the-wild capture and controlled-illumination methods.