Dynamic Emotion and Personality Profiling for Multimodal Deception Detection

Conference: ACL 2026
arXiv: 2604.17037
Code: None
Area: Multimodal Analysis / Affective Computing
Keywords: Deception Detection, Dynamic Emotion Annotation, Personality Profiling, Reliability-Weighted Fusion, Multimodal

TL;DR

The paper observes that existing deception detection datasets provide emotion and personality labels only at the participant level (all samples from one subject share identical labels). It proposes a sample-level dynamic annotation scheme and a reliability-weighted multimodal fusion framework, Rel-DDEP, reporting F1 gains of 2.53% on deception detection, 2.66% on emotion detection, and 9.30% on personality detection.

Background & Motivation

Background: Multimodal deception detection leverages textual, visual, and audio signals to identify deceptive behavior. Prior work (e.g., the MDPE dataset) incorporates personality and emotion information to assist deception detection, but provides only static, per-participant labels.

Limitations of Prior Work: The same individual's emotional and personality expressions vary considerably across contexts: a subject may display a mixture of "feigned happiness and fear of exposure" when lying, and "sadness and disgust" when answering perfunctorily. Participant-level labels collapse these distinctions, discarding contextual signals that are critical for deception detection.

Key Challenge: Personality and emotion are key cues for deception detection, yet existing annotation granularity is too coarse (participant-level rather than sample-level), causing ambiguous boundaries between deceptive and truthful samples in the feature space.

Goal: To construct a dataset with sample-level dynamic annotations for emotion (multi-label) and personality (single-label), and to design an adaptive reliability-weighted multimodal fusion framework.

Key Insight: Visualization experiments show the impact of annotation granularity directly: participant-level labels correctly classify only 32/200 samples (16.0%); sample-level single-label emotion raises this to 85/200 (42.5%); and sample-level multi-label emotion combined with single-label personality reaches 141/200 (70.5%).

Core Idea: Sample-level dynamic annotation combined with uncertainty-driven reliability-weighted fusion.

Method

Overall Architecture

The framework consists of two components:
(1) Data Annotation: multi-model multi-prompt annotation → voting and quality scoring → re-annotation by a multimodal LLM → human expert annotation, yielding the DDEP dataset.
(2) Model: the Rel-DDEP framework, comprising feature extraction (Baichuan/CLIP/Wav2vec) → uncertainty estimation (mapping to Gaussian distributions) → reliability-weighted fusion → joint prediction of deception, emotion, and personality.

Key Designs

Multi-Model Multi-Prompt Annotation Scheme:
- Function: Generates high-quality dynamic emotion and personality labels at the sample level.
- Mechanism: Multiple heterogeneous LLMs (GPT-4o, Llama 3, VideoLLaMA3, Qwen2-Audio) perform the initial annotation, each with diverse prompts (e.g., inferring emotion from the overall atmosphere vs. from specific behaviors). A voting mechanism produces initial labels, and a quality score combines inter-annotator consistency (Kappa coefficient) with uncertainty scores (entropy and self-assessed confidence): \(S_q = \alpha_1 k + \alpha_2 u_i + \alpha_3 s_c\), where \(k\) is the Kappa coefficient, \(u_i\) and \(s_c\) appear to denote the entropy-based and self-assessed confidence terms, and \(\alpha_1, \alpha_2, \alpha_3\) are weighting coefficients. Samples below the quality threshold are forwarded to a multimodal LLM for re-annotation, and any that still fail are escalated to human experts.
- Design Motivation: The multi-model multi-prompt strategy reduces single-perspective bias; quality scoring ensures annotation reliability; the three-tier pipeline (LLM → multimodal LLM → human) balances cost and quality.
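
The voting, scoring, and escalation steps of the annotation scheme can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the weights in `alphas`, the threshold of 0.7, and the choice to turn vote entropy into a "certainty" term (so that every term in \(S_q\) is higher-is-better) are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical tier names mirroring the LLM -> multimodal LLM -> human pipeline.
TIERS = ("llm_ensemble", "multimodal_llm", "human_expert")

def majority_vote(labels):
    """Return the most frequent label and the empirical vote distribution."""
    counts = Counter(labels)
    total = len(labels)
    dist = {label: c / total for label, c in counts.items()}
    winner = max(counts, key=counts.get)
    return winner, dist

def certainty_from_entropy(dist, num_classes):
    """Map vote entropy to [0, 1]: 1.0 is unanimous, 0.0 is a uniform split.

    Assumption: the entropy term enters S_q as a certainty (higher is better).
    """
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return 1.0 - entropy / math.log(num_classes)

def quality_score(kappa, certainty, self_conf, alphas=(0.4, 0.3, 0.3)):
    """S_q = a1 * k + a2 * u + a3 * s_c, with illustrative weights."""
    a1, a2, a3 = alphas
    return a1 * kappa + a2 * certainty + a3 * self_conf

def route(score, tier_index, threshold=0.7):
    """Accept the label, or name the tier that should re-annotate the sample."""
    if score >= threshold or tier_index == len(TIERS) - 1:
        return "accept"  # the human tier is treated as final
    return TIERS[tier_index + 1]

votes = ["fear", "fear", "fear", "happiness"]
label, dist = majority_vote(votes)
score = quality_score(kappa=0.85,
                      certainty=certainty_from_entropy(dist, num_classes=2),
                      self_conf=0.9)
decision = route(score, tier_index=0)
```

A sample whose annotators disagree strongly receives a low certainty term, falls below the threshold, and moves to the next tier. Cheap LLM labels are therefore kept only where they agree.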

Uncertainty Estimation and Reliability-Weighted Fusion:
- Function: Adaptively assigns fusion weights based on the reliability of each modality.
- Mechanism: The feature \(\mathbf{h}_m\) of each modality \(m\) is mapped to a high-dimensional Gaussian distribution \(\mathcal{N}(\mu_m, \sigma_m^2)\) to quantify its uncertainty. A GRU predicts the mean \(\mu_m\) and variance \(\sigma_m^2\) from the modality features. Modalities with higher reliability (lower variance) receive larger fusion weights.
- Design Motivation: Multimodal data varies in quality — audio may contain noise and video may suffer from occlusion — necessitating that more certain modalities contribute more to the final decision.
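
The idea that lower variance should translate into a larger fusion weight can be illustrated with inverse-variance weighting. This summary does not state the exact weighting function used by Rel-DDEP, so the sketch below is an assumption: it reduces each modality's predicted variance to a single scalar (for instance, the average over feature dimensions) and normalizes the resulting precisions. The names `reliability_weights` and `fuse` are hypothetical.

```python
def reliability_weights(variances, eps=1e-8):
    """Inverse-variance weights: lower variance gives a larger weight; sums to 1."""
    precisions = [1.0 / (v + eps) for v in variances]
    total = sum(precisions)
    return [p / total for p in precisions]

def fuse(means, variances):
    """Reliability-weighted sum of per-modality mean vectors.

    means:     one mean vector (list of floats) per modality, same length each
    variances: one scalar variance per modality (assumed already aggregated)
    """
    weights = reliability_weights(variances)
    dim = len(means[0])
    fused = [sum(w * mu[d] for w, mu in zip(weights, means)) for d in range(dim)]
    return fused, weights

# Toy example: text is the most certain modality, audio the noisiest.
text_mu, visual_mu, audio_mu = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
fused, weights = fuse([text_mu, visual_mu, audio_mu], [0.1, 0.2, 0.4])
```

In this toy case the text modality receives roughly four sevenths of the total weight, so a noisy audio track shifts the fused representation far less than it would under plain averaging.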

Alignment and Ranking Constraint Module:
- Function: Ensures calibration of uncertainty estimates.
- Mechanism: The alignment module enforces consistency between uncertainty estimates and actual prediction errors (samples with high uncertainty should exhibit high prediction error). The ranking constraint module ensures that uncertainty estimates reflect the relative importance of modalities in joint detection.
- Design Motivation: Uncalibrated uncertainty estimates may lead to erroneous weight allocation — a modality that is "confident but incorrect" could otherwise receive disproportionately high weight.
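
One way to picture the two calibration terms is shown below. The exact loss definitions are not given in this summary, so both functions are illustrative assumptions: the alignment term is written as a mean squared gap between predicted uncertainty and observed error, and the ranking term as a pairwise hinge penalty that fires when an item with lower error is not assigned a lower uncertainty by at least `margin`. The items can be read as samples (for alignment) or modalities (for ranking).

```python
def alignment_loss(uncertainties, errors):
    """Mean squared gap between predicted uncertainty and observed error."""
    pairs = list(zip(uncertainties, errors))
    return sum((u - e) ** 2 for u, e in pairs) / len(pairs)

def ranking_loss(uncertainties, errors, margin=0.1):
    """Pairwise hinge loss on the ordering of uncertainties.

    For every pair (i, j) where item i has the lower error, penalize the
    model unless uncertainty[i] is below uncertainty[j] by at least `margin`.
    """
    loss, num_pairs = 0.0, 0
    n = len(errors)
    for i in range(n):
        for j in range(n):
            if errors[i] < errors[j]:
                loss += max(0.0, margin + uncertainties[i] - uncertainties[j])
                num_pairs += 1
    return loss / num_pairs if num_pairs else 0.0

errors = [0.1, 0.5, 0.9]
calibrated = [0.1, 0.5, 0.9]       # uncertainty tracks error
confident_wrong = [0.9, 0.5, 0.1]  # most certain where error is highest
```

With these toy values, the calibrated estimates incur zero loss under both terms, while the "confident but incorrect" pattern is penalized by both. That is the behavior the module is meant to discourage before fusion weights are computed.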

Loss & Training

The three tasks (deception, emotion, and personality detection) are trained jointly with a weighted cross-entropy loss. Uncertainty calibration is enforced through an additional alignment loss and a ranking constraint loss.
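
Under these descriptions, the overall objective plausibly takes the following form. The task weights \(\lambda_t\) and the coefficients \(\beta\) and \(\gamma\) are notation assumed here for illustration; this summary does not report how the terms are actually balanced.

\[
\mathcal{L} = \sum_{t \in \{\text{dec},\, \text{emo},\, \text{per}\}} \lambda_t \, \mathcal{L}_{\text{CE}}^{(t)} + \beta \, \mathcal{L}_{\text{align}} + \gamma \, \mathcal{L}_{\text{rank}}
\]

Here \(\mathcal{L}_{\text{CE}}^{(t)}\) is the cross-entropy loss of task \(t\). Because emotion is annotated as multi-label, its term would typically be a sum of per-label binary cross-entropies rather than a single softmax cross-entropy.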

Key Experimental Results

Main Results

Task	Dataset	Model	Baseline F1	Rel-DDEP F1	Gain
Deception Detection	DDEP	CLB-HBB-Bai	58.30%	61.49%	+2.53%
Emotion Detection	DDEP	—	—	—	+2.66%
Personality Detection	DDEP	—	—	—	+9.30%

Ablation Study

Configuration	Deception Detection	Notes
Participant-level labels (MDPE)	~50%	Samples highly intermixed in feature space
Sample-level labels (DDEP)	~58%	Feature separability significantly improved
DDEP + Rel-DDEP	~61%	Reliability-weighted fusion yields further gains

Key Findings

- Moving from participant-level to sample-level annotation raises the number of correctly classified samples in the visualization experiment from 32/200 (16.0%) to 141/200 (70.5%) when using multi-label emotion plus single-label personality, which supports the case for dynamic annotation.
- Reliability-weighted fusion consistently outperforms simple concatenation and average fusion.
- Personality detection achieves the largest gain (+9.30%), as participant-level labels entirely disregard contextual variation.
- A Kappa score of 0.85 indicates strong inter-annotator agreement, supporting the annotation quality.

Highlights & Insights

- The comparison between sample-level and participant-level annotation is intuitive and compelling: the visualizations clearly show how annotation granularity affects feature-space separability.
- The multi-model multi-prompt annotation pipeline is a generalizable data annotation methodology, particularly suited to highly subjective annotation tasks.
- The uncertainty-driven modality fusion paradigm should transfer to other multimodal tasks in which modality quality varies.

Limitations & Future Work

- The DDEP dataset is limited in scale; generalizability requires further validation on larger corpora.
- The accuracy of LLM-based emotion/personality annotation is inherently uncertain, particularly when visual affective cues must be inferred from text.
- Predicting Gaussian parameters with GRUs for reliability estimation may yield overconfident estimates.
- Joint training across three tasks may introduce negative task interference.

Related Work & Comparison

- vs. Cai et al. (2024) MDPE: MDPE provides only participant-level labels; this work extends to sample-level dynamic annotation and empirically demonstrates its necessity.
- vs. DDPM: DDPM addresses only single-task deception detection, whereas this work performs joint three-task detection.
- vs. Standard Multimodal Fusion: Simple concatenation or attention-based fusion does not account for modality reliability; the proposed uncertainty-driven fusion is more principled.

Rating

- Novelty: ⭐⭐⭐⭐ The combination of sample-level dynamic annotation and uncertainty-driven fusion is a meaningful contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluations span two datasets, multiple feature extractor combinations, and detailed visualization analyses.
- Writing Quality: ⭐⭐⭐ The structure is sound, though certain formalizations (e.g., Theorems 1 and 2) appear somewhat forced.