
TEXT: Text-Routed Sparse Mixture of Experts for Multimodal Sentiment Analysis with Explanation Enhancement and Temporal Alignment

Conference: AAAI 2026 arXiv: 2512.22741 Authors: Dongning Rao, Yunbiao Zeng, Zhihua Jiang, Jujian Lv Code: fip-lab/TEXT Area: Audio & Speech Keywords: Multimodal Sentiment Analysis, Sparse Mixture of Experts, Temporal Alignment, MLLM Explanation Enhancement, Gated Fusion

TL;DR

This paper proposes TEXT, a model that leverages MLLMs to generate natural language explanations for audio and video modalities to enhance modal representations, designs a lightweight temporal alignment module combining the strengths of Mamba and temporal cross-attention, and employs text-routed sparse mixture of experts for cross-modal fusion. TEXT comprehensively outperforms prior SOTA methods and large models such as GPT-4o across four MSA benchmarks.

Background & Motivation

Problem Definition

Multimodal Sentiment Analysis (MSA) aims to jointly exploit text (transcripts), audio (tone/prosody), and visual (facial expressions) modalities from short videos to predict the speaker's sentiment polarity (positive/negative/neutral) and sentiment intensity score (continuous value). The task has broad applications in healthcare, human–computer interaction, and fraud detection. The core challenge lies in the vastly different—and sometimes contradictory—contributions of modalities to sentiment: text may convey positivity while vocal tone is negative, or facial expressions may conflict with verbal content.

Limitations of Prior Work

Existing MSA methods can be broadly categorized into representation-learning-centric approaches (e.g., ALMT, KuDA) and fusion-centric approaches (e.g., DEVA). The authors identify three critical gaps:

Underutilization of MLLM explanation capability: The power of text in the LLM era has not been fully exploited. MLLMs can generate semantic explanations for audio and video to bridge the semantic gap in non-textual modalities, yet no prior work has incorporated this into the feature alignment pipeline of MSA.

Temporal alignment strategies mismatched with MSA: Mamba (linear SSM) is designed for long videos, and temporal cross-attention (TCA) is a general-purpose module; neither is specifically optimized for the dynamic sentiment transitions in short MSA videos.

Fusion strategies ignore modality dominance: Research has shown that text almost always dominates in MSA, yet existing methods lack mechanisms to exploit this prior; both SMoE and gated fusion remain underutilized in the MSA domain.

Core Motivation

The paper motivates its approach through a concrete MOSI example in which only the text correctly identifies the sentiment polarity, while audio and video mislead the model. The prediction error of the prior best model ALMT is 0.320, whereas Qwen2.5-vl yields an error of 1.100. This demonstrates that alignment is the critical bridge between representation learning and fusion. Generating MLLM-based explanations for audio and video and aligning them with original features can effectively correct such bias. Simultaneously, routing expert activation using text dominance enables more precise cross-modal fusion. These observations motivate the two core design principles of TEXT: explanation-driven alignment and text-routed fusion.

Method

Overall Architecture

TEXT comprises six modules organized bottom-up as follows:

  • Modules ④⑤⑥ (parallel): Three unimodal feature extraction modules. Text is encoded by BERT over transcripts and explanations; audio features are extracted with Librosa; video features are extracted with OpenFace facial action units (see the sketch after this list). Each of the audio and video modules embeds an Explanation Alignment block.
  • Module ③: Temporal alignment module, which models temporal dependencies between the aligned audio and video representations.
  • Module ②: Text-routed Sparse Mixture of Experts (SMoE) module, using text features as routing keys for cross-modal interaction.
  • Module ①: Gated Fusion (GF) + MLP classifier for final sentiment prediction.
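
To make the audio front end concrete, here is a minimal sketch of frame-level feature extraction with Librosa. The paper only names the library; the MFCC feature choice and all parameters below are illustrative assumptions:

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load a waveform and return frame-level features of shape (T, n_mfcc).

    The paper only states that Librosa is used; picking MFCCs and these
    parameters is an assumption for illustration.
    """
    y, sr = librosa.load(wav_path, sr=16000)                 # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    return mfcc.T                                            # (T, n_mfcc) frame sequence
```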

Key Design 1: Two-Stage Explanation Generation

TEXT generates three components per sample—audio explanation \(e_a\), video explanation \(e_v\), and an overall comment \(c\)—via a two-stage pipeline:

Stage 1: VideoLLaMA 3, fine-tuned on the EMER-fine emotion dataset, serves as the multimodal understanding model. Given the original video and task-specific prompts, it generates raw explanations for audio, video, and overall content.

Stage 2: Qwen 3 acts as a reasoning verifier, refining and validating the raw explanations via a reasoning prompt to produce high-quality fine explanations.

The division of labor is elegant: VideoLLaMA 3 excels at multimodal perception but may produce biased descriptions, while Qwen 3's reasoning capability further calibrates the output, with the two models complementing each other to reduce cumulative error.
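The stage structure is easy to express in code. In this sketch, `videollama3_generate` and `qwen3_refine` are hypothetical stand-ins for the actual model calls, and the prompt wording is invented for illustration; only the two-stage structure follows the paper:

```python
def videollama3_generate(video_path: str, prompt: str) -> str:
    """Stage 1 stub: fine-tuned VideoLLaMA 3 call (hypothetical interface)."""
    raise NotImplementedError

def qwen3_refine(raw_explanation: str, prompt: str) -> str:
    """Stage 2 stub: Qwen 3 reasoning-verification call (hypothetical interface)."""
    raise NotImplementedError

def generate_explanations(video_path: str) -> dict:
    # Stage 1: multimodal perception -- raw explanations e_a, e_v and comment c.
    raw = {
        "audio":   videollama3_generate(video_path, "Describe the speaker's tone and prosody."),
        "video":   videollama3_generate(video_path, "Describe the facial expressions and gestures."),
        "comment": videollama3_generate(video_path, "Summarize the overall emotional content."),
    }
    # Stage 2: reasoning verification -- refine each raw explanation.
    return {k: qwen3_refine(v, "Check this explanation for consistency and refine it.")
            for k, v in raw.items()}
```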

Key Design 2: Explanation Alignment Block

The alignment objective is to bring audio/video features closer to the explanation text in semantic space. Concretely, given feature \(F\) and explanation encoding \(E\), alignment is achieved via cross-attention:

\[ca(F, E) = \text{softmax}((W_Q E)(W_K F)^T) W_V F\]

Here \(Q\) is derived from the explanation while \(K/V\) come from the original modality features, allowing the explanation to dominate attention allocation. All unimodal encodings are unified to 50 tokens plus one learnable aggregation token, yielding 51-token sequences. The aligned representations are denoted \(E_t\) (text), \(E_a\) (audio), and \(E_v\) (video).
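
A minimal PyTorch sketch of this alignment block, with queries from the explanation and keys/values from the modality features. The single-head form and the \(1/\sqrt{d}\) scaling (omitted in the formula above but standard) are assumptions:

```python
import torch
import torch.nn as nn

class ExplanationAlignment(nn.Module):
    """ca(F, E) = softmax((W_Q E)(W_K F)^T) W_V F -- single-head sketch."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # queries from explanation E
        self.w_k = nn.Linear(dim, dim, bias=False)  # keys from modality features F
        self.w_v = nn.Linear(dim, dim, bias=False)  # values from modality features F

    def forward(self, feats: torch.Tensor, expl: torch.Tensor) -> torch.Tensor:
        # feats: (B, T_f, d), expl: (B, T_e, d)
        q, k, v = self.w_q(expl), self.w_k(feats), self.w_v(feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        return attn @ v  # (B, T_e, d): modality features re-expressed under the explanation
```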

Key Design 3: Temporal Alignment Block

This constitutes the primary technical innovation of the paper. The authors design a lightweight temporal alignment block that relies on neither cross-attention nor state-space modeling, is simpler than both Mamba and TCA, yet integrates their respective advantages. The core computation is:

\[\text{left} = E_a \oplus L(\text{Conv1d}(\text{LN}(E_a)) \otimes \sigma(\text{LN}(E_v)))\]

\[\text{right} = E_v \oplus L(\text{Conv1d}(\text{LN}(E_v)) \otimes \sigma(\text{LN}(E_a)))\]

\[E_{av} = \text{concat}(\text{left}, \text{right})\]

Key design choices: (1) Conv1d captures local temporal patterns (analogous to Mamba's sequence modeling); (2) SiLU-gated element-wise multiplication enables selective cross-modal interaction (analogous to attention weighting); (3) residual connections preserve original information; (4) the symmetric structure treats the audio→video and video→audio information flows equally. This design avoids the complex recurrence of SSMs and the quadratic complexity of cross-attention while retaining temporal modeling capability.
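
A PyTorch sketch of the block under these design choices, assuming channel-last inputs of shape (B, T, d); the kernel size and the concatenation axis are guesses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAlignment(nn.Module):
    """left = E_a + L(Conv1d(LN(E_a)) * SiLU(LN(E_v))), and symmetrically for right."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.conv_a = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.conv_v = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.proj_a = nn.Linear(dim, dim)
        self.proj_v = nn.Linear(dim, dim)

    def _branch(self, x, gate, norm_x, norm_g, conv, proj):
        # Conv1d expects (B, d, T): capture local temporal patterns in x,
        # gate element-wise with the other modality (SiLU), then project.
        h = conv(norm_x(x).transpose(1, 2)).transpose(1, 2)
        return x + proj(h * F.silu(norm_g(gate)))        # residual keeps original info

    def forward(self, e_a: torch.Tensor, e_v: torch.Tensor) -> torch.Tensor:
        left = self._branch(e_a, e_v, self.norm_a, self.norm_v, self.conv_a, self.proj_a)
        right = self._branch(e_v, e_a, self.norm_v, self.norm_a, self.conv_v, self.proj_v)
        return torch.cat([left, right], dim=1)           # E_av: concat along the token axis
```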

Key Design 4: Text-Routed SMoE

Exploiting text dominance in MSA, TEXT uses text features \(E_t\) as routing keys to determine which experts are activated to process the temporally aligned audio-visual embeddings \(E_{av}\), formalized as \(\text{SMoE}(E_t, E_{av})\). Intuitively, sentiment-laden keywords in text (e.g., "disappointing," "excellent") activate experts corresponding to the relevant sentiment themes, endowing the expert network with topic sensitivity.
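
A sketch of what text-routed sparse expert selection could look like: routing logits come from the pooled text representation only, while the experts transform \(E_{av}\). The top-k dispatch and the expert architecture are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TextRoutedSMoE(nn.Module):
    """Experts process E_av; which experts fire is decided by E_t (top-k sketch)."""

    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)          # routing scores from text only
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, e_t: torch.Tensor, e_av: torch.Tensor) -> torch.Tensor:
        # e_t: (B, T_t, d) text tokens; e_av: (B, T_av, d) aligned audio-visual tokens
        logits = self.router(e_t.mean(dim=1))                  # (B, n_experts): text decides
        weights, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize active experts
        out = torch.zeros_like(e_av)
        for b in range(e_av.size(0)):                          # naive per-sample dispatch
            for w, i in zip(weights[b], idx[b]):
                out[b] = out[b] + w * self.experts[int(i)](e_av[b])
        return out
```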

Loss & Training

Since MSA is framed here as regression, the primary optimization objective of TEXT is the MSE loss between the predicted sentiment intensity \(\hat{y}\) and the ground-truth annotation. The output of the gated fusion classifier is:

\[\hat{y} = L(\sigma(\text{SMoE}(E_t, E_{av})))\]

where \(\sigma\) denotes sigmoid gating and \(L\) is a linear layer.
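
Read literally, the head can be sketched as follows; the token pooling and the exact form of the gate are assumptions:

```python
import torch
import torch.nn as nn

class GatedFusionHead(nn.Module):
    """y_hat = L(sigmoid-gated SMoE output) -- a minimal reading of the equation above."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, smoe_out: torch.Tensor) -> torch.Tensor:
        gated = smoe_out * torch.sigmoid(self.gate(smoe_out))  # sigmoid gating
        return self.out(gated.mean(dim=1)).squeeze(-1)         # pool tokens -> scalar intensity

# Training objective: plain MSE between predicted and annotated intensity, e.g.
# loss = nn.functional.mse_loss(y_hat, y_true)
```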

Key Experimental Results

Table 1: Main Comparison on MOSI and MOSEI

| Model | MOSI Acc-2 | MOSI MAE↓ | MOSI Corr | MOSEI Acc-2 | MOSEI MAE↓ | MOSEI Corr |
|---|---|---|---|---|---|---|
| ALMT | 83.10/85.23 | 0.716 | 0.773 | 82.39/85.87 | 0.542 | 0.767 |
| KuDA | 84.40/86.43 | 0.705 | 0.795 | 83.26/86.46 | 0.529 | 0.776 |
| DEVA | 84.40/86.29 | 0.730 | 0.787 | 83.26/86.13 | 0.541 | 0.769 |
| GPT-4o | 85.71/86.74 | 0.682 | 0.823 | 84.77/86.08 | 0.637 | 0.744 |
| Qwen2.5-vl | 83.09/83.38 | 1.129 | 0.677 | 84.14/84.59 | 1.007 | 0.587 |
| TEXT | 86.44/88.72 | 0.666 | 0.829 | 85.02/86.57 | 0.528 | 0.786 |

TEXT achieves 88.72% Acc-2 on MOSI (approximately 2% above GPT-4o) and reduces MAE to 0.666; on MOSEI, MAE drops to 0.528, outperforming all baselines. On CH-SIMS, MAE decreases from the second-best 0.408 (KuDA) to 0.353, a reduction of 13.5%.

Ablation Study

Table 2: Ablation on MOSEI

| Setting | Acc-2 | Acc-7 | MAE↓ | Corr |
|---|---|---|---|---|
| TEXT (full) | 85.02/86.57 | 52.29 | 0.528 | 0.786 |
| w/o explanations | 83.60/86.02 | 50.35 | 0.569 | 0.776 |
| Text only (w/ explanations) | 83.49/86.43 | 52.84 | 0.535 | 0.771 |
| EA → Linear | 84.25/86.57 | 48.21 | 0.577 | 0.762 |
| TA → Concat | 83.77/85.42 | 48.40 | 0.580 | 0.749 |
| TA → Mamba | 84.80/86.41 | 50.65 | 0.562 | 0.780 |
| TA → TCA | 83.41/86.43 | 51.38 | 0.565 | 0.781 |
| SMoE → Transformer | 83.73/85.33 | 50.29 | 0.573 | 0.769 |
| w/o Gated Fusion | 84.40/86.35 | 49.07 | 0.571 | 0.780 |

Replacing temporal alignment with simple concatenation increases MAE from 0.528 to 0.580 (the largest degradation), confirming that temporal alignment is the key factor in MAE improvement. Removing explanations causes an approximately 2% overall decline, and SMoE contributes comparably to explanations.

Highlights & Insights

  1. Using MLLMs for data augmentation rather than end-to-end inference: Unlike directly applying GPT-4o for MSA, TEXT cleverly uses MLLMs to generate explanation text as auxiliary signals, which are then encoded by BERT and used in feature alignment. This leverages the semantic understanding capability of MLLMs while avoiding their instability on regression tasks (Qwen2.5-vl MAE as high as 1.129).
  2. Minimalist design of the temporal alignment module: Using only Conv1d, linear layers, and gating—without attention or SSM—the module outperforms both Mamba and TCA in ablation studies. This suggests that for short-video MSA, simple local temporal convolution combined with cross-modal gating is sufficient.
  3. Empirical finding of text dominance: Ablation results show that using text alone (with explanations) achieves slightly higher Acc-7 than the full model (52.84% vs. 52.29%), indicating that audio and video primarily contribute to regression precision (MAE) rather than classification accuracy.
  4. Highly convincing qualitative case study: On a specific sample, TEXT's prediction error is only 0.010 (ground truth 1.400 vs. prediction 1.390), while GPT-4o's error is 0.600 and Qwen2.5-vl's is 1.100. Removing explanations causes the audio branch error to surge from 0.380 to 1.440, clearly validating the value of explanation alignment.

Limitations & Future Work

  1. Dependence on cascaded MLLMs: Explanation generation requires two stages involving VideoLLaMA 3 and Qwen 3, introducing risks of cumulative error and substantial inference overhead—a limitation the authors themselves acknowledge.
  2. Coverage limited to Chinese and English: Explanation quality depends on the linguistic capabilities of the MLLMs; the paper only validates on Chinese and English datasets, and generalization to other languages remains unknown.
  3. Risk of data contamination from MLLMs: The paper notes that MLLMs may have memorized portions of certain datasets (e.g., GPT-4o performs unusually well on English datasets but degrades substantially on Chinese ones), which affects the fairness of baseline comparisons.
  4. Absent computational cost analysis: The paper does not report the time and computational resource requirements for the MLLM explanation generation stage, leaving actual deployment costs unclear.
Related Work & Connections

  • ALMT (Zhang et al., 2023): Learns language-guided representations that suppress irrelevant/conflicting audio-visual information and unifies modalities via Transformer. TEXT's explanation alignment can be viewed as an enhanced version of ALMT.
  • KuDA (Feng et al., 2024): Proposes dominant-modality enhancement. TEXT inherits the text-dominance concept but exploits it more precisely via SMoE routing.
  • DEVA (Wu et al., 2025): Text-guided progressive fusion with sentiment description generation. Conceptually similar to TEXT's explanation enhancement but differs in implementation.
  • Mamba (Gu & Dao, 2024): Linear SSM model. TEXT's temporal alignment block can be viewed as a simplified variant of Mamba's convolutional pathway.
  • Insights: The paradigm of using MLLMs as "semantic bridges" rather than direct predictors is generalizable to other multimodal regression tasks (e.g., emotional intensity estimation, pain assessment). The success of the lightweight temporal module also suggests that complex sequence modeling is unnecessary for short-sequence scenarios.

Rating

| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |

The combination of explanation alignment and text-routed SMoE is novel and effective, with thorough and convincing ablation studies. Points are deducted for: frequent formula rendering issues in the writing, the absence of a discussion on the computational cost of MLLM cascading, and the observation that the text-only variant already approaches full-model performance on certain metrics, implying limited headroom for multimodal fusion gains.