Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach¶
Conference: CVPR 2026
arXiv: 2603.12848
Code: LEYA-HSE/ABAW10-BAH
Area: Human Understanding
Keywords: Ambivalence/Hesitancy recognition, multimodal fusion, prototype learning, affective computing, ABAW competition
TL;DR¶
This paper presents a multimodal Ambivalence/Hesitancy (A/H) recognition approach for the 10th ABAW Competition, integrating four modalities—scene, facial, audio, and text—via a Transformer-based fusion module and a prototype-augmented classification strategy. The best single model reaches an average MF1 of 83.25% on the validation set, and a five-model ensemble achieves 71.43% MF1 on the final test set.
Background & Motivation¶
Ambivalence/Hesitancy (A/H) recognition is a challenging task in affective computing, closely related to decision uncertainty, resistance, and fluctuating motivation for behavioral change. The core difficulties of A/H include:
Cross-modal inconsistency: A/H states often manifest as contradictions across modalities—what a person says, how they say it, and their facial expression may be inconsistent.
Fine-grained behavioral signals: Unlike basic emotions (e.g., happiness, surprise), A/H is more subtle and requires comprehensive multimodal modeling.
Text dominance but insufficiency: Prior work shows text is the strongest single-modal cue, yet text alone cannot capture the full manifestation of A/H.
Key Insight: Building upon prior work that primarily uses facial, audio, and text modalities, this paper additionally incorporates scene information and designs a Transformer-based fusion module combined with a prototype-augmented classification objective, performing fusion over modality-level embeddings rather than naive concatenation.
Method¶
Overall Architecture¶
A four-stage pipeline: (1) training a dedicated encoder per modality independently; (2) extracting fixed-dimensional modality embeddings; (3) projecting them into a shared latent space; (4) fusing them with a Transformer module that models cross-modal dependencies and produces the final A/H prediction.
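To make the stage hand-off concrete, here is a minimal sketch of how such a pipeline could be wired at inference time; the encoder/fusion interfaces, names, and the `out_dim` attribute are illustrative assumptions, not the authors' code.

```python
import torch

def extract_embeddings(sample: dict, encoders: dict) -> dict:
    # Stages 1-2: each independently trained encoder maps its raw input
    # (scene frames, face crops, waveform, transcript) to one fixed-size vector.
    with torch.no_grad():
        return {m: enc(sample[m]) for m, enc in encoders.items() if m in sample}

def predict(sample, encoders, fusion_model, modalities=("scene", "face", "audio", "text")):
    embs = extract_embeddings(sample, encoders)
    # Zero-fill absent modalities and record their presence in a binary mask
    # (out_dim is a hypothetical attribute giving each encoder's embedding size).
    x_list = [embs.get(m, torch.zeros(1, encoders[m].out_dim)) for m in modalities]
    mask = torch.tensor([[m in embs for m in modalities]], dtype=torch.bool)
    # Stages 3-4: project into the shared space, fuse with the Transformer, classify.
    logits, _ = fusion_model(x_list, mask)
    return logits.softmax(dim=-1)
```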
Key Designs¶
- Scene Model (VideoMAE): Employs the VideoMAE architecture (ViT-based, pretrained on Kinetics-400), uniformly sampling 16 frames per video and partitioning them into \(2 \times 16 \times 16\) spatiotemporal patches via tubelet embedding. A Transformer encoder models spatiotemporal dependencies, and the scene embedding \(h_s = \frac{1}{N}\sum_{i=1}^N z_i\) is obtained via global average pooling. Trained for 15 epochs with LR=2e-5 and label smoothing 0.1.
- Facial Model (EmotionEfficientNetB0): YOLO face detection → largest-box selection → cropping to 224×224 → EmotionEfficientNetB0 (fine-tuned on AffectNet+) for frame-level emotion embeddings. The key design is statistical pooling for aggregation: \(\mu = \frac{1}{F}\sum_f e_f\), \(\sigma = \sqrt{\frac{1}{F}\sum_f (e_f - \mu)^2}\), with the final video-level facial representation formed by concatenating \([\mu; \sigma]\). This preserves inter-frame variability, which is valuable for capturing emotional fluctuations in A/H (see the pooling sketch after this list).
- Audio Model (EmotionWav2Vec2.0 + Mamba): Audio resampled to 16 kHz → pretrained EmotionWav2Vec2.0 (fine-tuned on MSP-Podcast for emotion) extracts feature sequences of \(T_a \times 1024\) → a Mamba encoder models temporal dependencies → mean pooling yields a compact embedding. Key choices: layer-10 features + Mamba (outperforms a Transformer encoder), hidden size 256, feedforward size 512, Mamba state size 8, convolution kernel 4 (see the Mamba sketch after this list).
- Text Model: Multiple strategies are evaluated: TF-IDF with traditional classifiers (Logistic Regression, CatBoost) and fine-tuned Transformers (EmotionDistilRoBERTa, EmotionTextClassifier). The best configuration is fine-tuned EmotionDistilRoBERTa with an MLP classification head, achieving 70.02% average MF1.
- Modality Fusion Model: Each modality embedding \(x_m\) is mapped to a shared space via a modality-specific projector (Linear + LayerNorm + GELU + Dropout): \(u_m = \phi_m(x_m)\). These are stacked into a matrix \(U = [u_1; \ldots; u_M]\), augmented with learnable modality embeddings \(E_{\text{mod}}\), processed by \(L=6\) Transformer encoder layers, and finally mean-pooled with masking to obtain the fused representation \(z_{\text{fused}}\). Missing modalities are handled via binary modality masks (a fusion sketch follows this list).
- Prototype-Augmented Variant: Maintains \(K=16\) learnable prototypes \(\{p_{c,k}\}\) per class \(c\) and computes a log-sum-exp similarity between the fused representation and the prototypes: \(\hat{y}_c^{\text{proto}} = \log \sum_{k=1}^K \exp\left(\frac{\tilde{z}_{\text{fused}}^\top \tilde{p}_{c,k}}{\tau}\right)\). The prototype head serves as an auxiliary training loss (not used for final prediction). The total loss is \(\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_{\text{proto}} \mathcal{L}_{\text{proto}} + \lambda_{\text{div}} \mathcal{L}_{\text{div}}\), with \(\lambda_{\text{proto}}=0.2\).
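A minimal sketch of the statistical pooling used in the facial branch, assuming a tensor of per-frame embeddings; the function name and shapes are illustrative.

```python
import torch

def statistical_pooling(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """frame_embeddings: (F, D) per-frame facial emotion embeddings.
    Returns a (2*D,) video-level vector [mu; sigma], so inter-frame
    variability (the sigma half) is preserved alongside the average expression."""
    mu = frame_embeddings.mean(dim=0)
    sigma = frame_embeddings.std(dim=0, unbiased=False)  # population std, matching the 1/F formula
    return torch.cat([mu, sigma], dim=0)
```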
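For the audio branch, the description maps naturally onto a Wav2Vec2 feature extractor followed by a Mamba block. The sketch below assumes the `transformers` and `mamba_ssm` packages; the checkpoint name is a generic placeholder (the paper uses an emotion-tuned Wav2Vec2.0), and reading "feedforward size 512" as `expand=2` on a 256-dimensional model is an assumption.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model
from mamba_ssm import Mamba  # assumes the mamba_ssm package is installed

class AudioEncoder(nn.Module):
    def __init__(self, w2v_name="facebook/wav2vec2-large", layer=10):
        super().__init__()
        # Placeholder checkpoint; the paper uses EmotionWav2Vec2.0 (MSP-Podcast fine-tuned).
        self.w2v = Wav2Vec2Model.from_pretrained(w2v_name)
        self.layer = layer
        self.proj = nn.Linear(1024, 256)                    # T_a x 1024 -> T_a x 256
        self.mamba = Mamba(d_model=256, d_state=8, d_conv=4, expand=2)

    def forward(self, waveform_16k: torch.Tensor) -> torch.Tensor:
        # hidden_states[10]: intermediate layer-10 features, shape (B, T_a, 1024).
        out = self.w2v(waveform_16k, output_hidden_states=True)
        feats = self.proj(out.hidden_states[self.layer])
        feats = self.mamba(feats)                           # temporal modeling over T_a steps
        return feats.mean(dim=1)                            # mean pooling -> (B, 256) embedding
```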
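Finally, a minimal sketch of the Transformer fusion module with the prototype head, following the formulas above; the hidden width, head count, dropout, temperature, class count, and the L2 normalization suggested by the tilde notation are placeholders rather than the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModel(nn.Module):
    def __init__(self, in_dims, d=256, n_classes=2, n_layers=6, n_heads=8, K=16, tau=0.1):
        super().__init__()
        # Modality-specific projectors: Linear + LayerNorm + GELU + Dropout.
        self.projectors = nn.ModuleList([
            nn.Sequential(nn.Linear(din, d), nn.LayerNorm(d), nn.GELU(), nn.Dropout(0.1))
            for din in in_dims
        ])
        self.mod_emb = nn.Parameter(torch.zeros(len(in_dims), d))     # learnable modality embeddings
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d, n_classes)                     # main prediction head
        self.prototypes = nn.Parameter(torch.randn(n_classes, K, d))  # K prototypes per class
        self.tau = tau

    def forward(self, x_list, mask):
        # x_list: one (B, d_m) embedding per modality; mask: (B, M) bool, True = present.
        U = torch.stack([proj(x) for proj, x in zip(self.projectors, x_list)], dim=1)
        U = U + self.mod_emb                                  # add modality identity
        H = self.encoder(U, src_key_padding_mask=~mask)       # attention ignores missing modalities
        # Masked mean pooling over the modality axis.
        w = mask.unsqueeze(-1).float()
        z = (H * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)
        logits = self.classifier(z)                           # used for the final prediction
        # Auxiliary prototype logits: log-sum-exp similarity to the K prototypes of each class.
        z_n = F.normalize(z, dim=-1)
        p_n = F.normalize(self.prototypes, dim=-1)
        sim = torch.einsum("bd,ckd->bck", z_n, p_n) / self.tau
        proto_logits = torch.logsumexp(sim, dim=-1)
        return logits, proto_logits
```

During training, the prototype logits drive the auxiliary \(\mathcal{L}_{\text{proto}}\) term (weighted by \(\lambda_{\text{proto}}=0.2\)) alongside a diversity penalty over prototypes, while the final prediction comes from the main classifier alone.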
Loss & Training¶
The fusion model uses RMSprop (LR=9.44e-5) with cosine learning rate scheduling, label smoothing 0.02, and gradient clipping at 0.5. Each configuration is trained with 5 fixed random seeds (42, 2025, 7777, 12345, 31415), and the configuration with the highest average MF1 is selected. Final ensemble averages class probabilities across the 5 seed models.
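A minimal sketch of the seed-averaged training and ensembling protocol described above; the optimizer, scheduler, and clipping calls are standard PyTorch APIs, while the data handling is a placeholder, the clipping is assumed to be norm-based, and a cross-entropy term stands in for the unspecified prototype/diversity losses.

```python
import torch

SEEDS = [42, 2025, 7777, 12345, 31415]

def train_one_seed(build_model, train_loader, epochs, seed):
    torch.manual_seed(seed)
    model = build_model()
    opt = torch.optim.RMSprop(model.parameters(), lr=9.44e-5)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    ce = torch.nn.CrossEntropyLoss(label_smoothing=0.02)
    model.train()
    for _ in range(epochs):
        for x_list, mask, y in train_loader:
            logits, proto_logits = model(x_list, mask)
            # lambda_proto = 0.2; the diversity term is omitted in this sketch.
            loss = ce(logits, y) + 0.2 * ce(proto_logits, y)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # gradient clipping at 0.5
            opt.step()
        sched.step()
    return model

def ensemble_predict(models, x_list, mask):
    # Final ensemble: average class probabilities over the five seed models.
    probs = [m(x_list, mask)[0].softmax(dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)
```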
Key Experimental Results¶
Main Results¶
| Model | Modality | Avg MF1 (val) | Test MF1 |
|---|---|---|---|
| EmotionEfficientNetB0 | Face | 62.67% | - |
| VideoMAE | Scene | 61.96% | - |
| EmotionWav2Vec2.0+Mamba | Audio | 69.03% | - |
| EmotionDistilRoBERTa | Text | 70.02% | - |
| Four-modal fusion (w/o prototype) | All | 82.66% | 68.32% |
| Four-modal fusion (prototype-augmented) | All | 83.25% | 65.21% |
| Five-model ensemble (w/o prototype) | All | 81.29% | 70.17% |
| Five-model ensemble (prototype-augmented) | All | 81.89% | 71.43% |
Ablation Study¶
| Modality Combination | Avg MF1 (val) | Notes |
|---|---|---|
| Scene + Text | 80.39% | Strongest two-modal |
| Face + Scene + Text | 78.77% | Strongest three-modal |
| Audio + Text | 69.02% | Limited audio–text complementarity |
| Face + Audio | 67.40% | Visual+audio inferior to text |
| Face + Text | 63.24% | Weaker combination |
| All four modalities | 82.66% | Best overall |
Key Findings¶
- Text consistently serves as the strongest single modality (70.02%), yet scene—despite being individually weak (61.96%)—provides the strongest complementary contribution in fusion (Scene+Text=80.39%).
- Prototype augmentation yields a notable improvement on the validation set (83.25% vs. 82.66%), but the single model underperforms on the test set (65.21% vs. 68.32%), indicating overfitting risk.
- Ensemble is critical for generalization: the 5-model ensemble raises test set performance from 65–68% to 70–71%.
- Multimodal fusion (82.66%) substantially outperforms the best single modality (70.02%) by 12.64 percentage points.
- The Mamba temporal encoder outperforms Transformers for audio modeling.
Highlights & Insights¶
- Value of the scene modality: Prior A/H work overlooks scene information; this paper demonstrates that scene is the most important complementary modality.
- Prototype augmentation as regularization: The prototype head serves as an auxiliary loss rather than the primary predictor, providing structured regularization while preserving the flexibility of the main classifier.
- Stability-oriented hyperparameter search: Each configuration is evaluated across 5 fixed random seeds to reduce selection bias.
- Missing modality handling: Binary modality masks allow the fusion model to gracefully handle partial modality absence.
Limitations & Future Work¶
- A large gap exists between validation and test set performance (83.25% vs. 65.21%), indicating that generalization requires further improvement.
- Temporal interactions across modalities are not modeled (fusion is performed solely at the video-level embedding stage).
- The BAH corpus is relatively small (1,427 videos), which limits how thoroughly the models can be trained.
Related Work & Insights¶
- vs. González-González et al.: The baseline work establishes text as the strongest single modality; this paper builds on it by introducing scene information and improving the fusion stage.
- vs. Savchenko & Savchenko: Their approach relies on a more lightweight feature pipeline; the Transformer fusion used here models cross-modal interactions more thoroughly.
- vs. Hallmen et al.: They fuse three modalities; this paper adds the scene modality and adopts a more advanced fusion strategy.
Rating¶
- Novelty: ⭐⭐⭐ Individual components (VideoMAE, Mamba, prototype learning) are not new; the contribution lies in their systematic combination and the introduction of the scene modality.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation covering all modality combinations, with comparisons across multiple encoders and fusion strategies.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, detailed experimental setup descriptions, and high reproducibility.
- Value: ⭐⭐⭐ A competition solution with limited methodological novelty, though the experimental findings offer useful reference value.