
PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis

Conference: AAAI 2026
arXiv: 2512.01442
Authors: Heng Xie, Kang Zhu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Ruibo Fu, Changsheng Li
Code: Not released
Area: Audio & Speech
Keywords: Multimodal Sentiment Analysis, Personality-Sentiment Alignment, Multi-Level Fusion, Contrastive Learning, BERT, Personalized Sentiment

TL;DR

This paper is the first to introduce a pre-trained personality model into Multimodal Sentiment Analysis (MSA) for extracting personalized sentiment features. Through personality-sentiment contrastive alignment and a progressive multi-level fusion architecture (pre-fusion → cross-modal interaction → enhanced fusion), the proposed PSA-MF achieves state-of-the-art performance on CMU-MOSI and CMU-MOSEI.

Background & Motivation

Challenges in Multimodal Sentiment Analysis

With the explosive growth of social media video content, Multimodal Sentiment Analysis (MSA)—fusing textual, visual, and audio modalities for sentiment recognition—has become a critical technology in human-computer interaction, risk prediction, and related fields. Compared to unimodal analysis, MSA leverages cross-modal complementarity to improve accuracy and robustness, but faces two core challenges: how to extract effective sentiment representations from each modality, and how to effectively fuse sentiment information across modalities.

Limitations of Prior Work

Existing MSA methods extract only shallow information during feature extraction, overlooking the significant differences in emotional expression attributable to individual personality traits. Research in psychology and affective computing has demonstrated a close relationship between Big Five personality traits and emotional expression—individuals with different personality profiles exhibit markedly different emotional responses and expression styles to the same situation. Nevertheless, existing methods do not incorporate personalized information during sentiment feature encoding. In the fusion stage, prior methods directly merge features from each modality without accounting for differences in feature granularity: text features extracted by pre-trained models carry far higher information density than pre-extracted visual and audio features, and naive same-level fusion fails to reconcile the differences in information density and semantic depth across modalities.

Theoretical Basis for Personality–Emotion Interaction

Eladhari et al. proposed a personality-emotion mapping model; Zhang et al. demonstrated the mutual correlation between emotion and personality traits through multi-task learning; Mohammadi et al. found that personality plays an important role in emotion generation. These studies provide a theoretical basis for incorporating personality information into sentiment analysis. However, existing datasets typically contain only sentiment labels without personality annotations, and how to achieve personalized sentiment modeling under label-only conditions remains an open problem.

Core Problem

How can multimodal sentiment analysis (1) leverage a pre-trained personality model to extract personalized sentiment features from text and align them with the sentiment space, and (2) design a progressive multi-level fusion architecture to gradually bridge the semantic gap across modalities, thereby improving sentiment recognition performance?

Method

Overall Architecture

PSA-MF consists of three major modules: (1) feature extraction and personality-sentiment alignment; (2) multimodal pre-fusion; and (3) cross-modal interaction and enhanced fusion.

Unimodal Feature Extraction

  • Text: The first \(N\) layers of a fine-tuned BERT are used to extract sentiment embeddings \(\text{CLS}_s\); a pre-trained Personality BERT extracts personality embeddings \(\text{CLS}_p\).
  • Visual/Audio: An LSTM encodes pre-extracted facial action units (FACET, 35-dim) and acoustic features (COVAREP): \(h_m = \text{LSTM}(X_m; \theta_{\text{LSTM}_m}),\ m \in \{v, a\}\) (see the sketch below).
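
Below is a minimal PyTorch sketch of these unimodal encoders, assuming Hugging Face `transformers` for BERT; the layer split \(N\), hidden sizes, the COVAREP dimensionality, and the Personality BERT checkpoint are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class UnimodalEncoders(nn.Module):
    """Sketch of PSA-MF-style unimodal feature extraction (dims/checkpoints assumed)."""
    def __init__(self, n_text_layers=9, d_visual=35, d_audio=74, d_hidden=128):
        super().__init__()
        # Sentiment BERT: only its first N layers run here; the remaining layers
        # later serve as the multimodal pre-fusion encoder.
        self.sentiment_bert = BertModel.from_pretrained("bert-base-uncased")
        self.n_text_layers = n_text_layers
        # Personality BERT: a separately pre-trained personality model, kept frozen
        # (placeholder checkpoint -- the paper's model is not released).
        self.personality_bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.personality_bert.parameters():
            p.requires_grad = False
        # LSTM encoders over pre-extracted FACET / COVAREP sequences.
        self.lstm_v = nn.LSTM(d_visual, d_hidden, batch_first=True)
        self.lstm_a = nn.LSTM(d_audio, d_hidden, batch_first=True)

    def forward(self, input_ids, attention_mask, x_v, x_a):
        # Embeddings + first N transformer layers of the sentiment BERT.
        h = self.sentiment_bert.embeddings(input_ids)
        mask = self.sentiment_bert.get_extended_attention_mask(attention_mask, input_ids.shape)
        for layer in self.sentiment_bert.encoder.layer[: self.n_text_layers]:
            h = layer(h, attention_mask=mask)[0]
        cls_s = h[:, 0]                                              # sentiment CLS_s
        cls_p = self.personality_bert(
            input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]  # personality CLS_p
        h_v, _ = self.lstm_v(x_v)                                    # visual states h_v
        h_a, _ = self.lstm_a(x_a)                                    # audio states h_a
        return cls_s, cls_p, h, h_v, h_a                             # h reused for pre-fusion
```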

Personality-Sentiment Alignment Module

Sentiment and personality features are linearly projected into a shared space and aligned via CLIP-style contrastive learning (InfoNCE over a batch of \(N\) samples with temperature \(\tau\)):

\[\mathcal{L}_{cl} = -\log \frac{\exp(\text{sim}(T_s^i, T_p^i)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(T_s^i, T_p^j)/\tau)}\]

A composite contrastive loss incorporating cosine similarity weighting is defined as: \(\mathcal{L}_{ccl} = \text{sim}(T_s^i, T_p^i) \cdot \mathcal{L}_{cl}\)

Personalized Sentiment Constraint Loss—dynamically regulates alignment strength and constrains features to remain within the correct sentiment space:

\[\mathcal{L}_{ps} = (1 - \text{sim}(T_s^i, T_p^i)) \cdot \|W_y \cdot T_s^i - y_i\|_1\]

Total alignment loss: \(\mathcal{L}_{\text{Align}} = \mathcal{L}_{ccl} + \mathcal{L}_{ps}\)
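
A minimal PyTorch sketch of the alignment objective above, assuming cosine similarity for \(\text{sim}\), a batch-wise InfoNCE implementation of \(\mathcal{L}_{cl}\), and a hypothetical linear head `w_y` standing in for \(W_y\); the temperature and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_losses(t_s, t_p, w_y, y, tau=0.07):
    """Personality-sentiment alignment loss L_Align = L_ccl + L_ps (sketch).

    t_s, t_p : (B, d) sentiment / personality embeddings after linear projection.
    w_y      : nn.Linear(d, 1) mapping T_s to a sentiment score (stands in for W_y).
    y        : (B,) sentiment regression labels.
    """
    s = F.normalize(t_s, dim=-1)
    p = F.normalize(t_p, dim=-1)
    logits = s @ p.t() / tau                                   # (B, B) pairwise similarities
    targets = torch.arange(s.size(0), device=s.device)         # positives on the diagonal
    l_cl = F.cross_entropy(logits, targets, reduction="none")  # per-sample InfoNCE (L_cl)
    sim_pos = (s * p).sum(-1)                                  # sim(T_s^i, T_p^i)
    l_ccl = (sim_pos * l_cl).mean()                            # cosine-weighted contrastive loss
    # Personalized sentiment constraint: (1 - sim) dynamically weights an L1 penalty.
    l_ps = ((1.0 - sim_pos) * (w_y(t_s).squeeze(-1) - y).abs()).mean()
    return l_ccl + l_ps                                        # L_Align
```

When the two embeddings already agree (high similarity), the \((1-\text{sim})\) factor shrinks the label penalty, so the constraint mainly acts on samples whose personality signal pulls them away from the correct sentiment.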

Multimodal Pre-Fusion

The remaining \((12-N)\) layers of BERT serve as a multimodal encoder; the text [CLS], visual, and audio features are concatenated and fed in for preliminary cross-modal alignment:

\[\text{CLS}_m = \text{BERT}_m([\text{CLS}_s, X_v, X_a]; \theta_{\text{BERT}_m})\]

A cross-modal contrastive loss \(\mathcal{L}_{clm}\) is simultaneously applied to align the text representation with visual and audio representations.
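
A sketch of how the remaining BERT layers could serve as the pre-fusion encoder, assuming the visual/audio sequences are linearly projected to BERT's hidden width and concatenated with the text hidden states along the sequence axis; the projection layers and the omission of attention masking are simplifications.

```python
import torch
import torch.nn as nn

class PreFusion(nn.Module):
    """Sketch: the remaining (12 - N) BERT layers act as a multimodal encoder."""
    def __init__(self, bert, n_text_layers=9, d_hidden=128, d_bert=768):
        super().__init__()
        self.layers = bert.encoder.layer[n_text_layers:]    # reused upper BERT layers
        self.proj_v = nn.Linear(d_hidden, d_bert)            # map LSTM states to BERT width
        self.proj_a = nn.Linear(d_hidden, d_bert)

    def forward(self, h_text, h_v, h_a):
        # Concatenate [text tokens, visual states, audio states] along the sequence axis.
        x = torch.cat([h_text, self.proj_v(h_v), self.proj_a(h_a)], dim=1)
        for layer in self.layers:
            x = layer(x)[0]
        return x[:, 0], x                                     # pre-fused CLS_m, full sequence
```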

Cross-Modal Interaction and Enhanced Fusion

The pre-fused feature \(M_s\) serves as the query and interacts with visual/audio features via multi-head attention: \(V_t = \text{Att}_v(M_s, h_v, h_v)\), \(A_t = \text{Att}_a(M_s, h_a, h_a)\)

Serial Fusion: A linear layer merges the three cross-modal representations: \(F_s = W_p \cdot [V_t', A_t', M_s']\)

Parallel Fusion: A convolutional operation (kernel=3) compresses the stacked three-stream features: \(F_p = \text{Conv}([V_t, A_t, M_s])\)

Final prediction: \(\hat{y} = \text{Sub}([F_s, F_p])\); total loss: \(\mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{Align}} + \mathcal{L}_{clm} + \|\hat{y}_i - y_i\|_1\)
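
A sketch of the interaction and dual-stream fusion, assuming `nn.MultiheadAttention` for \(\text{Att}_v/\text{Att}_a\) with the pre-fused \(M_s\) as a single-token query, a kernel-3 1D convolution over the stacked streams for parallel fusion, and a small MLP standing in for the \(\text{Sub}\) prediction head; all streams are assumed to share one hidden width.

```python
import torch
import torch.nn as nn

class EnhancedFusion(nn.Module):
    """Sketch of cross-modal interaction + serial/parallel fusion (shapes assumed)."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.att_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.att_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_p = nn.Linear(3 * d_model, d_model)                 # serial fusion W_p
        self.conv = nn.Conv1d(3, 1, kernel_size=3, padding=1)      # parallel fusion over streams
        self.sub = nn.Sequential(nn.Linear(2 * d_model, d_model),  # stands in for "Sub"
                                 nn.ReLU(),
                                 nn.Linear(d_model, 1))

    def forward(self, m_s, h_v, h_a):
        # m_s: (B, d) pre-fused feature; h_v, h_a: (B, T, d) modality sequences.
        q = m_s.unsqueeze(1)                                        # single-token query
        v_t, _ = self.att_v(q, h_v, h_v)                            # text-queried visual stream
        a_t, _ = self.att_a(q, h_a, h_a)                            # text-queried audio stream
        v_t, a_t = v_t.squeeze(1), a_t.squeeze(1)
        f_s = self.w_p(torch.cat([v_t, a_t, m_s], dim=-1))          # serial fusion F_s
        f_p = self.conv(torch.stack([v_t, a_t, m_s], dim=1)).squeeze(1)  # parallel fusion F_p
        return self.sub(torch.cat([f_s, f_p], dim=-1)).squeeze(-1)       # sentiment prediction ŷ
```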

Key Experimental Results

Datasets: CMU-MOSI (2,199 segments) and CMU-MOSEI (23,453 segments). Training is performed on an RTX 3090 GPU.

Main Results (MOSI / MOSEI)

| Method | MOSI MAE↓ | MOSI Acc2↑ | MOSI F1↑ | MOSEI MAE↓ | MOSEI Acc2↑ | MOSEI F1↑ |
|---|---|---|---|---|---|---|
| TFN | 0.901 | 80.8 | 80.7 | 0.593 | 82.5 | 82.1 |
| MulT | 0.871 | 83.0 | 82.8 | 0.580 | 82.5 | 82.3 |
| MISA | 0.783 | 83.4 | 83.6 | 0.555 | 85.5 | 85.3 |
| HyCon | 0.713 | 85.2 | 85.1 | 0.601 | 85.4 | 85.1 |
| FGTI | 0.702 | 85.8 | 85.8 | 0.536 | 86.0 | 86.0 |
| ULMD | 0.700 | 85.82 | 85.71 | 0.531 | 85.95 | 85.91 |
| PSA-MF | 0.686 | 86.43 | 86.19 | 0.521 | 86.30 | 86.28 |

PSA-MF improves Acc2 over FGTI by 0.63% and F1 by 0.39% on MOSI; on MOSEI, MAE is reduced by 1.9% relative to ULMD.

Ablation Study (MOSI)

| Variant | MAE↓ | Corr↑ | Acc2↑ | F1↑ |
|---|---|---|---|---|
| w/o personality features | 0.711 | 0.795 | 84.60 | 84.47 |
| w/o BERT pre-fusion | 0.735 | 0.778 | 84.76 | 84.44 |
| w/o enhanced fusion | 0.806 | 0.788 | 85.21 | 84.96 |
| w/o \(\mathcal{L}_{ps}\) | 0.724 | 0.784 | 83.99 | 83.92 |
| w/o \(\mathcal{L}_{clm}\) | 0.754 | 0.784 | 85.06 | 84.92 |
| PSA-MF | 0.686 | 0.807 | 86.43 | 86.19 |

Removing the personalized sentiment constraint loss \(\mathcal{L}_{ps}\) yields the largest performance drop (Acc2 −2.44%), underscoring its critical role in balancing personality alignment with sentiment classification.

Highlights & Insights

  • First integration of a personality pre-trained model: PSA-MF is the first MSA framework to incorporate Personality BERT for extracting personalized sentiment features, addressing the neglect of personality factors in conventional methods while remaining applicable to datasets with only sentiment labels.
  • Elegant personality-sentiment alignment design: CLIP-style contrastive learning achieves semantic-level alignment, while the personalized sentiment constraint loss dynamically regulates alignment strength to prevent over-fitting to personality features at the expense of true sentiment.
  • Progressive multi-level fusion architecture: The three-stage pipeline—pre-fusion (mitigating modality heterogeneity) → cross-modal interaction (personality-driven modality-specific reconstruction) → enhanced fusion (serial + parallel dual-stream)—progressively bridges the cross-modal semantic gap.
  • Insightful layer analysis: A layer-by-layer alignment experiment across all 13 layers reveals that personality-sentiment alignment is optimal at the deeper pure-text layer (Layer 11) but degrades at the multimodal fusion layer, providing valuable design guidance.

Limitations & Future Work

  • Outdated visual/audio features: The method still relies on pre-extracted FACET (35-dim facial action units) and COVAREP features, without leveraging modern visual/audio pre-trained models such as VideoMAE or HuBERT, limiting representational capacity.
  • Limited dataset scale: Evaluation is conducted only on CMU-MOSI and CMU-MOSEI, two classic but relatively small benchmarks; generalization to larger-scale or more recent MSA datasets remains unverified.
  • Generalizability of the personality model: The Personality BERT used is trained on a YouTube-video Big Five dataset; whether its personality representations transfer to different cultural and linguistic contexts is unknown.
  • Architectural complexity: The multi-level fusion design combined with multiple loss functions introduces numerous hyperparameters, potentially increasing training and tuning costs.

Comparison with Prior Methods

  • TFN/LMF (tensor fusion): Perform only tensor-level modality fusion without considering personalized information or modality heterogeneity; PSA-MF outperforms TFN by a large margin on Acc2 (+5.6% on MOSI).
  • MulT/MISA (cross-attention): Leverage cross-modal attention for alignment but do not introduce personality information; PSA-MF surpasses MISA by roughly 3% in Acc2 on MOSI.
  • ULMD (feature disentanglement): Decouples invariant/specific representations via modality separators but requires multiple encoders/decoders and complex constraints; PSA-MF adopts a simpler architecture and achieves 0.61% higher Acc2 on MOSI.
  • FGTI (multi-granularity fusion): Enhances modality specificity through self-supervised learning but does not incorporate the personality dimension; PSA-MF achieves 0.63% higher Acc2 on MOSI.

Broader Implications

  • The personality-sentiment alignment paradigm can be generalized to other subjective evaluation tasks, such as user review analysis and mental health detection, by incorporating individual trait information into the feature space.
  • The progressive multi-level fusion design pattern (pre-fusion → interaction → enhanced fusion) offers a general paradigm for handling multimodal heterogeneity.
  • The personalized sentiment constraint loss—using \((1-\text{sim})\) as a dynamic weight—constitutes an elegant adaptive regularization technique.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to introduce a personality pre-trained model into MSA with a dedicated alignment mechanism; the contribution is clearly defined.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Ablation is comprehensive and the layer analysis is insightful; however, the datasets and visual/audio features are dated, and validation on more recent benchmarks is lacking.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with clear mathematical derivations and intuitive figures.
  • Value: ⭐⭐⭐⭐ — Personalized sentiment analysis is an important research direction; incorporating the personality dimension into MSA opens a promising new avenue.