Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach¶
Conference: CVPR 2026
arXiv: 2603.12848
Code: LEYA-HSE/ABAW10-BAH
Area: Human Understanding
Keywords: Ambivalence/Hesitancy recognition, multimodal fusion, prototype learning, affective computing, ABAW competition
TL;DR¶
This paper presents a multimodal Ambivalence/Hesitancy (A/H) recognition approach for the 10th ABAW Competition, integrating four modalities—scene, facial, audio, and text—via a Transformer-based fusion module and a prototype-augmented classification strategy. The best single model reaches an average MF1 of 83.25% on the validation set, and a five-model ensemble achieves 71.43% MF1 on the final test set.
Background & Motivation¶
Ambivalence/Hesitancy (A/H) recognition is a challenging task in affective computing, closely related to decision uncertainty, resistance, and fluctuating motivation for behavioral change. The core difficulties of A/H include:
Cross-modal inconsistency: A/H states often manifest as contradictions across modalities—what a person says, how they say it, and their facial expression may be inconsistent.
Fine-grained behavioral signals: Unlike basic emotions (e.g., happiness, surprise), A/H is more subtle and requires comprehensive multimodal modeling.
Text dominance but insufficiency: Prior work shows text is the strongest single-modal cue, yet text alone cannot capture the full manifestation of A/H.
Key Insight: Building upon prior work that primarily uses facial, audio, and text modalities, this paper additionally incorporates scene information and designs a Transformer-based fusion module combined with a prototype-augmented classification objective, performing fusion over modality-level embeddings rather than naive concatenation.
Method¶
Overall Architecture¶
A four-stage pipeline: (1) training a dedicated encoder per modality independently; (2) extracting fixed-dimensional modality embeddings; (3) projecting them into a shared latent space; (4) fusing them with a Transformer module that models cross-modal dependencies and produces the final A/H prediction.
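To make the stage hand-off concrete, here is a minimal sketch of how such a pipeline could be wired at inference time; the encoder/fusion interfaces, names, and the `out_dim` attribute are illustrative assumptions, not the authors' code.

```python
import torch

def extract_embeddings(sample: dict, encoders: dict) -> dict:
    # Stages 1-2: each independently trained encoder maps its raw input
    # (scene frames, face crops, waveform, transcript) to one fixed-size vector.
    with torch.no_grad():
        return {m: enc(sample[m]) for m, enc in encoders.items() if m in sample}

def predict(sample, encoders, fusion_model, modalities=("scene", "face", "audio", "text")):
    embs = extract_embeddings(sample, encoders)
    # Zero-fill absent modalities and record their presence in a binary mask
    # (out_dim is a hypothetical attribute giving each encoder's embedding size).
    x_list = [embs.get(m, torch.zeros(1, encoders[m].out_dim)) for m in modalities]
    mask = torch.tensor([[m in embs for m in modalities]], dtype=torch.bool)
    # Stages 3-4: project into the shared space, fuse with the Transformer, classify.
    logits, _ = fusion_model(x_list, mask)
    return logits.softmax(dim=-1)
```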
Key Designs¶
- Scene Model (VideoMAE): Employs the VideoMAE architecture (ViT-based, pretrained on Kinetics-400), uniformly sampling 16 frames per video and partitioning them into \(2 \times 16 \times 16\) spatiotemporal patches via tubelet embedding. A Transformer encoder models spatiotemporal dependencies, and the scene embedding \(h_s = \frac{1}{N}\sum_{i=1}^N z_i\) is obtained via global average pooling. Trained for 15 epochs with LR=2e-5 and label smoothing 0.1.
- Facial Model (EmotionEfficientNetB0): YOLO face detection → largest-box selection → cropping to 224×224 → EmotionEfficientNetB0 (fine-tuned on AffectNet+) for frame-level emotion embeddings. The key design is statistical pooling for aggregation: \(\mu = \frac{1}{F}\sum_f e_f\), \(\sigma = \sqrt{\frac{1}{F}\sum_f (e_f - \mu)^2}\), with the final video-level facial representation formed by concatenating \([\mu; \sigma]\). This preserves inter-frame variability, which is valuable for capturing emotional fluctuations in A/H (see the pooling sketch after this list).
- Audio Model (EmotionWav2Vec2.0 + Mamba): Audio resampled to 16 kHz → pretrained EmotionWav2Vec2.0 (fine-tuned on MSP-Podcast for emotion) extracts feature sequences of \(T_a \times 1024\) → a Mamba encoder models temporal dependencies → mean pooling yields a compact embedding. Key choices: layer-10 features + Mamba (outperforms a Transformer encoder), hidden size 256, feedforward size 512, Mamba state size 8, convolution kernel 4 (see the Mamba sketch after this list).
- Text Model: Multiple strategies are evaluated: TF-IDF with traditional classifiers (Logistic Regression, CatBoost) and fine-tuned Transformers (EmotionDistilRoBERTa, EmotionTextClassifier). The best configuration is fine-tuned EmotionDistilRoBERTa with an MLP classification head, achieving 70.02% average MF1.
- Modality Fusion Model: Each modality embedding \(x_m\) is mapped to a shared space via a modality-specific projector (Linear + LayerNorm + GELU + Dropout): \(u_m = \phi_m(x_m)\). These are stacked into a matrix \(U = [u_1; \ldots; u_M]\), augmented with learnable modality embeddings \(E_{\text{mod}}\), processed by \(L=6\) Transformer encoder layers, and finally mean-pooled with masking to obtain the fused representation \(z_{\text{fused}}\). Missing modalities are handled via binary modality masks (a fusion sketch follows this list).
- Prototype-Augmented Variant: Maintains \(K=16\) learnable prototypes \(\{p_{c,k}\}\) per class \(c\) and computes a log-sum-exp similarity between the fused representation and the prototypes: \(\hat{y}_c^{\text{proto}} = \log \sum_{k=1}^K \exp\left(\frac{\tilde{z}_{\text{fused}}^\top \tilde{p}_{c,k}}{\tau}\right)\). The prototype head serves as an auxiliary training loss (not used for final prediction). The total loss is \(\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_{\text{proto}} \mathcal{L}_{\text{proto}} + \lambda_{\text{div}} \mathcal{L}_{\text{div}}\), with \(\lambda_{\text{proto}}=0.2\).
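A minimal sketch of the statistical pooling used in the facial branch, assuming a tensor of per-frame embeddings; the function name and shapes are illustrative.

```python
import torch

def statistical_pooling(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """frame_embeddings: (F, D) per-frame facial emotion embeddings.
    Returns a (2*D,) video-level vector [mu; sigma], so inter-frame
    variability (the sigma half) is preserved alongside the average expression."""
    mu = frame_embeddings.mean(dim=0)
    sigma = frame_embeddings.std(dim=0, unbiased=False)  # population std, matching the 1/F formula
    return torch.cat([mu, sigma], dim=0)
```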
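For the audio branch, the description maps naturally onto a Wav2Vec2 feature extractor followed by a Mamba block. The sketch below assumes the `transformers` and `mamba_ssm` packages; the checkpoint name is a generic placeholder (the paper uses an emotion-tuned Wav2Vec2.0), and reading "feedforward size 512" as `expand=2` on a 256-dimensional model is an assumption.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model
from mamba_ssm import Mamba  # assumes the mamba_ssm package is installed

class AudioEncoder(nn.Module):
    def __init__(self, w2v_name="facebook/wav2vec2-large", layer=10):
        super().__init__()
        # Placeholder checkpoint; the paper uses EmotionWav2Vec2.0 (MSP-Podcast fine-tuned).
        self.w2v = Wav2Vec2Model.from_pretrained(w2v_name)
        self.layer = layer
        self.proj = nn.Linear(1024, 256)                    # T_a x 1024 -> T_a x 256
        self.mamba = Mamba(d_model=256, d_state=8, d_conv=4, expand=2)

    def forward(self, waveform_16k: torch.Tensor) -> torch.Tensor:
        # hidden_states[10]: intermediate layer-10 features, shape (B, T_a, 1024).
        out = self.w2v(waveform_16k, output_hidden_states=True)
        feats = self.proj(out.hidden_states[self.layer])
        feats = self.mamba(feats)                           # temporal modeling over T_a steps
        return feats.mean(dim=1)                            # mean pooling -> (B, 256) embedding
```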
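Finally, a minimal sketch of the Transformer fusion module with the prototype head, following the formulas above; the hidden width, head count, dropout, temperature, class count, and the L2 normalization suggested by the tilde notation are placeholders rather than the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModel(nn.Module):
    def __init__(self, in_dims, d=256, n_classes=2, n_layers=6, n_heads=8, K=16, tau=0.1):
        super().__init__()
        # Modality-specific projectors: Linear + LayerNorm + GELU + Dropout.
        self.projectors = nn.ModuleList([
            nn.Sequential(nn.Linear(din, d), nn.LayerNorm(d), nn.GELU(), nn.Dropout(0.1))
            for din in in_dims
        ])
        self.mod_emb = nn.Parameter(torch.zeros(len(in_dims), d))     # learnable modality embeddings
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d, n_classes)                     # main prediction head
        self.prototypes = nn.Parameter(torch.randn(n_classes, K, d))  # K prototypes per class
        self.tau = tau

    def forward(self, x_list, mask):
        # x_list: one (B, d_m) embedding per modality; mask: (B, M) bool, True = present.
        U = torch.stack([proj(x) for proj, x in zip(self.projectors, x_list)], dim=1)
        U = U + self.mod_emb                                  # add modality identity
        H = self.encoder(U, src_key_padding_mask=~mask)       # attention ignores missing modalities
        # Masked mean pooling over the modality axis.
        w = mask.unsqueeze(-1).float()
        z = (H * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)
        logits = self.classifier(z)                           # used for the final prediction
        # Auxiliary prototype logits: log-sum-exp similarity to the K prototypes of each class.
        z_n = F.normalize(z, dim=-1)
        p_n = F.normalize(self.prototypes, dim=-1)
        sim = torch.einsum("bd,ckd->bck", z_n, p_n) / self.tau
        proto_logits = torch.logsumexp(sim, dim=-1)
        return logits, proto_logits
```

During training, the prototype logits drive the auxiliary \(\mathcal{L}_{\text{proto}}\) term (weighted by \(\lambda_{\text{proto}}=0.2\)) alongside a diversity penalty over prototypes, while the final prediction comes from the main classifier alone.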
Loss & Training¶
The fusion model uses RMSprop (LR=9.44e-5) with cosine learning rate scheduling, label smoothing 0.02, and gradient clipping at 0.5. Each configuration is trained with 5 fixed random seeds (42, 2025, 7777, 12345, 31415), and the configuration with the highest average MF1 is selected. Final ensemble averages class probabilities across the 5 seed models.
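A minimal sketch of the seed-averaged training and ensembling protocol described above; the optimizer, scheduler, and clipping calls are standard PyTorch APIs, while the data handling is a placeholder, the clipping is assumed to be norm-based, and a cross-entropy term stands in for the unspecified prototype/diversity losses.

```python
import torch

SEEDS = [42, 2025, 7777, 12345, 31415]

def train_one_seed(build_model, train_loader, epochs, seed):
    torch.manual_seed(seed)
    model = build_model()
    opt = torch.optim.RMSprop(model.parameters(), lr=9.44e-5)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    ce = torch.nn.CrossEntropyLoss(label_smoothing=0.02)
    model.train()
    for _ in range(epochs):
        for x_list, mask, y in train_loader:
            logits, proto_logits = model(x_list, mask)
            # lambda_proto = 0.2; the diversity term is omitted in this sketch.
            loss = ce(logits, y) + 0.2 * ce(proto_logits, y)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # gradient clipping at 0.5
            opt.step()
        sched.step()
    return model

def ensemble_predict(models, x_list, mask):
    # Final ensemble: average class probabilities over the five seed models.
    probs = [m(x_list, mask)[0].softmax(dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)
```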
Key Experimental Results¶
Main Results¶
| Model | Modality | Avg MF1 (val) | Test MF1 |
|---|---|---|---|
| EmotionEfficientNetB0 | Face | 62.67% | - |
| VideoMAE | Scene | 61.96% | - |
| EmotionWav2Vec2.0+Mamba | Audio | 69.03% | - |
| EmotionDistilRoBERTa | Text | 70.02% | - |
| Four-modal fusion (w/o prototype) | All | 82.66% | 68.32% |
| Four-modal fusion (prototype-augmented) | All | 83.25% | 65.21% |
| Five-model ensemble (w/o prototype) | All | 81.29% | 70.17% |
| Five-model ensemble (prototype-augmented) | All | 81.89% | 71.43% |
Ablation Study¶
| Modality Combination | Avg MF1 (val) | Notes |
|---|---|---|
| Scene + Text | 80.39% | Strongest two-modal |
| Face + Scene + Text | 78.77% | Strongest three-modal |
| Audio + Text | 69.02% | Limited audio–text complementarity |
| Face + Audio | 67.40% | Visual+audio inferior to text |
| Face + Text | 63.24% | Weaker combination |
| All four modalities | 82.66% | Best overall |
Key Findings¶
- Text consistently serves as the strongest single modality (70.02%), yet scene—despite being individually weak (61.96%)—provides the strongest complementary contribution in fusion (Scene+Text=80.39%).
- Prototype augmentation yields a notable improvement on the validation set (83.25% vs. 82.66%), but the single model underperforms on the test set (65.21% vs. 68.32%), indicating overfitting risk.
- Ensemble is critical for generalization: the 5-model ensemble raises test set performance from 65–68% to 70–71%.
- Multimodal fusion (82.66%) substantially outperforms the best single modality (70.02%) by 12.64 percentage points.
- The Mamba temporal encoder outperforms Transformers for audio modeling.
Highlights & Insights¶
- Value of the scene modality: Prior A/H work overlooks scene information; this paper demonstrates that scene is the most important complementary modality.
- Prototype augmentation as regularization: The prototype head serves as an auxiliary loss rather than the primary predictor, providing structured regularization while preserving the flexibility of the main classifier.
- Stability-oriented hyperparameter search: Each configuration is evaluated across 5 fixed random seeds to reduce selection bias.
- Missing modality handling: Binary modality masks allow the fusion model to gracefully handle partial modality absence.
Limitations & Future Work¶
- A large gap exists between validation and test set performance (83.25% vs. 65.21%), indicating that generalization requires further improvement.
- Temporal interactions across modalities are not modeled (fusion is performed solely at the video-level embedding stage).
- The BAH corpus is relatively small (1,427 videos), which limits how thoroughly the models can be trained.
Related Work & Insights¶
- vs. González-González et al.: The baseline work establishes text as the strongest single modality; this paper builds on it by introducing scene information and improving the fusion stage.
- vs. Savchenko & Savchenko: Their approach relies on a more lightweight feature pipeline; the Transformer fusion used here models cross-modal interactions more thoroughly.
- vs. Hallmen et al.: They fuse three modalities; this paper adds the scene modality and adopts a more advanced fusion strategy.
Rating¶
- Novelty: ⭐⭐⭐ Individual components (VideoMAE, Mamba, prototype learning) are not new; the contribution lies in their systematic combination and the introduction of the scene modality.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation covering all modality combinations, with comparisons across multiple encoders and fusion strategies.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, detailed experimental setup descriptions, and high reproducibility.
- Value: ⭐⭐⭐ A competition solution with limited methodological novelty, though the experimental findings offer useful reference value.