Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection

Conference: AAAI 2026 · arXiv: 2511.06328 · Code: To be confirmed · Area: Audio & Speech
Keywords: multimodal sentiment analysis, dynamic modality selection, graph convolutional network, capsule network, cross-modal attention, sequence compression

TL;DR

This paper proposes the MODS framework for multimodal sentiment analysis (MSA): it eliminates redundancy in non-linguistic modalities via Graph-based Dynamic Compression (GDC), and introduces a Dynamic Primary Modality Selector (MSelector) together with Primary-modality-Centric Cross-Attention (PCCA) to select the dominant modality adaptively for each sample.

Background & Motivation

  • In MSA, different modalities contribute unevenly to sentiment prediction; language typically carries the highest information density and serves as the default primary modality.
  • Existing methods fix language as the primary modality (e.g., TCSP, ALMT), failing to accommodate samples where non-linguistic modalities are more discriminative.
  • Although HCT-DMG proposes dynamic selection, it only supports batch-level selection (due to asynchronous sequence constraints) and neglects sequence redundancy in non-linguistic modalities.
  • Audio and visual sequences have far lower information density than text; using them directly as the primary modality introduces noise.

Core Problem

How can the strongest modality be dynamically selected as the primary modality at the sample level, while simultaneously addressing feature quality degradation caused by sequence redundancy in non-linguistic modalities?

Method

Overall Architecture

MODS = GDC (graph-based compression module) + MSelector (primary modality selector) + PCCA (primary-modality-centric cross-attention).

Key Design 1: Graph-based Dynamic Compression (GDC)

A Capsule Network compresses long audio/visual sequences into graph nodes of the same length as the text sequence:

\[\text{Caps}_m^{i,j} = W_m^{i,j} H_m^i\]

Dynamic routing iteratively updates routing coefficients \(r_m^{i,j}\); noisy or redundant capsules automatically receive low weights, yielding high-quality nodes \(N_m^j = \sum_i r_m^{i,j} \, \text{Caps}_m^{i,j}\).
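
A minimal PyTorch sketch of this capsule-style compression; the class name, initialization, and the normalization used in place of a squash function are assumptions for illustration, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CapsuleCompressor(nn.Module):
    """Compress a length-T_in sequence into T_out graph nodes via dynamic routing.

    Illustrative sketch of GDC's first stage; hyperparameters are assumptions.
    """
    def __init__(self, seq_len_in, seq_len_out, dim, n_iters=3):
        super().__init__()
        # One projection per (input step i, output node j): Caps^{i,j} = W^{i,j} H^i,
        # stored as a single weight tensor for efficiency.
        self.W = nn.Parameter(0.01 * torch.randn(seq_len_in, seq_len_out, dim, dim))
        self.n_iters = n_iters

    def forward(self, H):                               # H: (B, T_in, d)
        B = H.size(0)
        # Prediction capsules: (B, T_in, T_out, d)
        caps = torch.einsum('btd,tjde->btje', H, self.W)
        # Routing logits b^{i,j}, initialized to zero.
        b = torch.zeros(B, caps.size(1), caps.size(2), device=H.device)
        for _ in range(self.n_iters):
            r = F.softmax(b, dim=2)                     # routing coefficients r^{i,j}
            # N^j = sum_i r^{i,j} Caps^{i,j}
            nodes = (r.unsqueeze(-1) * caps).sum(dim=1)
            nodes = F.normalize(nodes, dim=-1)          # stand-in for the squash nonlinearity
            # Agreement update: capsules aligned with a node get larger weights,
            # so redundant/noisy capsules are routed down automatically.
            b = b + torch.einsum('btje,bje->btj', caps, nodes)
        return nodes                                    # (B, T_out, d)
```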

Self-attention is then applied to construct edge weights, followed by a GCN to learn graph representations:

\[H_m^l = \text{ReLU}(D_m^{-1/2} E_m D_m^{-1/2} H_m^{l-1} W_m^l + b_m^l)\]

After compression, \(H_a, H_v \in \mathbb{R}^{T_l \times d}\), aligned in length with the language sequence.
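
The graph stage might look like the following sketch under the same assumptions: self-attention scores serve as the adjacency \(E_m\), which is symmetrically normalized before the GCN update.

```python
import torch.nn as nn
import torch.nn.functional as F

class GraphLayer(nn.Module):
    """One GCN layer over capsule nodes with attention-derived edges (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.lin = nn.Linear(dim, dim)   # plays the role of W^l and b^l

    def forward(self, N):                # N: (B, T, d) graph nodes
        # Edge weights E via scaled dot-product self-attention.
        E = F.softmax(self.q(N) @ self.k(N).transpose(1, 2) / N.size(-1) ** 0.5, dim=-1)
        # Symmetric normalization D^{-1/2} E D^{-1/2}.
        d_inv = E.sum(dim=-1).clamp(min=1e-6).pow(-0.5)     # (B, T)
        E_hat = d_inv.unsqueeze(-1) * E * d_inv.unsqueeze(1)
        # H^l = ReLU(E_hat H^{l-1} W^l + b^l)
        return F.relu(self.lin(E_hat @ N))
```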

Key Design 2: Primary Modality Selector (MSelector)

Attention-based aggregation is applied to each modality to obtain a vector \(h_m\); the concatenated representation is passed through an MLP followed by softmax to produce three weights:

\[w = \text{softmax}(\text{MLP}(\text{concat}(h_a, h_l, h_v))), \quad p = \arg\max(w_a, w_l, w_v)\]

The modality with the highest weight is selected as the primary modality \(p\); each modality's features are scaled by its corresponding weight before being forwarded to subsequent modules. This achieves sample-level dynamic selection.
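
A plausible PyTorch reading of MSelector follows; the shared scoring layer and MLP widths are illustrative choices, and scaling by the softmax weights (rather than the hard argmax) is what keeps gradients flowing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSelector(nn.Module):
    """Sample-level primary modality selection (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # attention-pooling scorer (assumed shared)
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def pool(self, H):                               # H: (B, T, d) -> (B, d)
        a = F.softmax(self.score(H), dim=1)          # per-step attention weights
        return (a * H).sum(dim=1)

    def forward(self, H_a, H_l, H_v):
        h = torch.cat([self.pool(H_a), self.pool(H_l), self.pool(H_v)], dim=-1)
        w = F.softmax(self.mlp(h), dim=-1)           # (B, 3) modality weights
        p = w.argmax(dim=-1)                         # primary-modality index per sample
        # Scale each stream by its (differentiable) weight before fusion.
        H_a = w[:, 0, None, None] * H_a
        H_l = w[:, 1, None, None] * H_l
        H_v = w[:, 2, None, None] * H_v
        return (H_a, H_l, H_v), p
```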

Key Design 3: Primary-modality-Centric Cross-Attention (PCCA)

Multi-layer iterative enhancement, where each layer consists of:

1. Two cross-attention operations \(CA_{a \to p}\): auxiliary-modality information flows toward the primary modality.
2. One self-attention \(SA_p\): self-enhancement of the primary modality.
3. Fusion: \(H_p^{[i+1]} = H_{p_{update}}^{[i]} + \sum_{a} H_{a \to p}^{[i]}\)
4. Reverse cross-attention \(CA_{p \to a}\): enhanced primary-modality information is propagated back to the auxiliary modalities.

In the final layer, only \(CA_{a \to p}\) is retained; the output \(H_p\) is used for sentiment regression.
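
One PCCA layer might be sketched as below, with `H_p` as the primary stream and `H_a`/`H_b` as the two auxiliaries; the residual connections on the reverse path are an assumption of this sketch:

```python
import torch.nn as nn

class PCCALayer(nn.Module):
    """One primary-modality-centric cross-attention layer (sketch)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.ca_a2p = nn.MultiheadAttention(dim, heads, batch_first=True)  # aux a -> primary
        self.ca_b2p = nn.MultiheadAttention(dim, heads, batch_first=True)  # aux b -> primary
        self.sa_p = nn.MultiheadAttention(dim, heads, batch_first=True)    # primary self-attn
        self.ca_p2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # primary -> aux a
        self.ca_p2b = nn.MultiheadAttention(dim, heads, batch_first=True)  # primary -> aux b

    def forward(self, H_p, H_a, H_b):
        # 1. Auxiliary information flows into the primary modality
        #    (query = primary, key/value = auxiliary).
        a2p, _ = self.ca_a2p(H_p, H_a, H_a)
        b2p, _ = self.ca_b2p(H_p, H_b, H_b)
        # 2. Self-enhancement of the primary modality.
        p_up, _ = self.sa_p(H_p, H_p, H_p)
        # 3. Fusion: H_p^{[i+1]} = H_p_update + sum of aux-to-primary streams.
        H_p = p_up + a2p + b2p
        # 4. Reverse flow: the enhanced primary refines the auxiliaries.
        H_a = self.ca_p2a(H_a, H_p, H_p)[0] + H_a
        H_b = self.ca_p2b(H_b, H_p, H_p)[0] + H_b
        return H_p, H_a, H_b
```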

Loss & Training

\[\mathcal{L}_{task} = \mathcal{L}_{reg} + \alpha \mathcal{L}_{NCE}\]

An InfoNCE loss reconstructs individual unimodal features from the fused representation, stabilizing primary modality selection.
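
A hedged sketch of the objective, assuming an L1 regression term (consistent with the reported MAE metric) and a batch-contrastive InfoNCE that ties the fused representation to its own unimodal summaries; `alpha` and `tau` are illustrative values:

```python
import torch
import torch.nn.functional as F

def task_loss(pred, target, fused, unimodal, alpha=0.1, tau=0.07):
    """L_task = L_reg + alpha * L_NCE (illustrative sketch).

    fused:    (B, d) pooled fused representation.
    unimodal: list of three (B, d) unimodal summaries to reconstruct.
    """
    l_reg = F.l1_loss(pred, target)      # regression term (MAE assumed)
    l_nce = 0.0
    for u in unimodal:
        # InfoNCE: each fused vector should identify its own unimodal
        # counterpart among the other samples in the batch.
        logits = F.normalize(fused, dim=-1) @ F.normalize(u, dim=-1).T / tau
        labels = torch.arange(fused.size(0), device=fused.device)
        l_nce = l_nce + F.cross_entropy(logits, labels)
    return l_reg + alpha * l_nce
```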

Key Experimental Results

| Method | MOSI MAE↓ | MOSI Acc-7↑ | MOSI Acc-2↑ | MOSEI Acc-2↑ | SIMS Acc-5↑ |
|---|---|---|---|---|---|
| Self-MM | 0.708 | 46.67 | 83.44/85.46 | 83.76/85.15 | 41.53 |
| MMIM | 0.718 | 46.64 | 83.38/85.82 | 82.08/85.14 | - |
| DTN | 0.716 | 47.5 | -/85.1 | -/85.5 | 44.26 |
| MODS | 0.688 | 49.27 | 83.53/85.83 | 84.52/85.88 | 45.51 |
  • Achieves state-of-the-art performance across four benchmarks (MOSI, MOSEI, SIMS, SIMSv2).
  • SIMS Acc-5: 45.51% (vs. DTN 44.26%); SIMSv2 Acc-5: 55.51% (vs. DTN 53.71%).
  • Ablation: removing GDC drops MOSI Acc-7 from 49.27 to 45.34 (−3.93); fixing any single modality as the primary modality results in a 3–4 point decline.
  • Case studies demonstrate that language is selected when it conveys positive sentiment while audio/visual signals are negative, and non-linguistic modalities are selected when language is neutral but audio/visual signals are positive.

Highlights & Insights

  • The first MSA method to achieve sample-level dynamic primary modality selection (as opposed to batch-level).
  • The GDC design of constructing graph nodes via a capsule network is elegant: dynamic routing automatically filters redundancy and noise.
  • PCCA uses the primary modality as a bridge for information flow, preventing noise propagation arising from direct interaction between auxiliary modalities.
  • Significant improvements over fixed-primary-modality methods on modality-balanced datasets such as SIMS/SIMSv2 validate the value of dynamic selection.

Limitations & Future Work

  • The argmax operation in MSelector is non-differentiable; training relies on softmax weights as an approximation, which may yield insufficiently sharp selections.
  • Validation is limited to three-modality scenarios; extending MSelector to more modalities requires redesign.
  • GDC compresses audio/visual sequences to match the text length, which is a rigid choice and may not represent the optimal compression ratio for all samples.
  • Pre-trained multimodal backbones (e.g., CLIP, Whisper) are not explored; only conventional feature extractors are employed.

Comparison with Related Methods

| Dimension | MODS | HCT-DMG | PaSE | ALMT |
|---|---|---|---|---|
| Primary Modality Selection | Sample-level dynamic | Batch-level dynamic | None (uniform) | Fixed (language) |
| Sequence Compression | GDC (Capsule + GCN) | None | None | None |
| Fusion Strategy | PCCA (primary-modality-centric) | Hierarchical | Prototype gating | Text-centric attention |
| Core Problem | Modality selection + redundancy | Modality selection | Modality competition | Modality interaction |

  • Applying capsule-network dynamic routing to sequence compression is a paradigm worth watching; it preserves salient information more effectively than pooling.
  • The "primary-modality-centric" fusion paradigm avoids noisy cross-propagation between weak modalities and is particularly effective when input quality varies across modalities.
  • Dynamic primary modality selection is transferable to multimodal LLM settings for handling input modalities of heterogeneous quality.

Rating

⭐⭐⭐⭐ — The combination of sample-level dynamic selection and graph-based compression is well-motivated and effective, though the differentiability of the core module and its scalability warrant further investigation.