Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach
Conference: CVPR 2026 (ABAW Workshop)
arXiv: 2603.12848
Code: GitHub
Area: Affective Computing / Multimodal Fusion
Keywords: Ambivalence/Hesitancy Recognition, Multimodal Fusion, Prototype-augmented, Mamba, Transformer Fusion
TL;DR
This paper proposes a multimodal approach to video-level ambivalence/hesitancy (A/H) recognition that integrates four modalities: scene (VideoMAE), face (EmotionEfficientNetB0), audio (EmotionWav2Vec2.0+Mamba), and text (EmotionDistilRoBERTa). The prototype-augmented Transformer fusion model reaches an average MF1 of 83.25%, and a five-model ensemble of that configuration reaches 71.43% MF1 on the final test set.
Background & Motivation
Background: Affective computing aims to empower intelligent systems with the capability to perceive human emotions. The 10th ABAW Competition introduces the video-level Ambivalence/Hesitancy (A/H) recognition task—determining whether a video contains ambivalent or hesitant behaviors.
Limitations of Prior Work: Unlike basic emotions, A/H is characterized by cross-modal inconsistencies (verbal content, tone, and facial expressions may conflict), so it is difficult to capture from a single modality. Prior work mostly relies on simple fusion strategies and does not fully model cross-modal interactions.
Key Challenge: A/H signals are subtle and context-dependent. They require cooperative multimodal understanding to detect contradictory "actions mismatching words" cues, but cross-modal fusion easily fails when presented with conflicting evidence across modalities.
Goal: How to effectively fuse four complementary modalities—scene, face, audio, and text—to recognize A/H behaviors in videos?
Key Insight: An independent expert encoder is first trained for each modality to extract a compact representation. A Transformer fusion module then models interactions among the modality tokens, and a prototype classification objective is added to improve generalization.
Core Idea: Quad-modal expert encoders + Transformer cross-modal fusion + prototype-augmented classification + multi-seed ensemble.
Method
Overall Architecture
A four-stage pipeline: (1) independent encoders for each modality extract embeddings; (2) projection into a shared latent space; (3) a Transformer encoder fuses modality tokens; (4) a classification head + an optional prototype head predict A/H.
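The token handling around the fusion encoder can be sketched at the shape level. This is a minimal NumPy illustration, not the authors' code: the embedding sizes in `DIMS`, the shared width `D_MODEL`, the random linear maps standing in for the per-modality projections, and the helper names `build_tokens` and `masked_mean_pool` are all assumptions. The Transformer encoder of stage (3) is deliberately left out, so the tokens reach the pooling step unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embedding sizes; the source does not state them.
DIMS = {"scene": 768, "face": 2560, "audio": 256, "text": 768}
D_MODEL = 128  # assumed width of the shared latent space

# One linear projection per modality (stand-in for the learned projection)
# and one modality embedding per token.
proj = {m: rng.normal(0.0, 0.02, size=(d, D_MODEL)) for m, d in DIMS.items()}
modality_emb = {m: rng.normal(0.0, 0.02, size=D_MODEL) for m in DIMS}


def build_tokens(features):
    """Project the available modalities into the shared space.

    `features` maps modality name -> 1-D embedding, or None if missing.
    Returns (tokens [4, D_MODEL], mask [4]); mask is False for missing ones.
    """
    tokens = np.zeros((len(DIMS), D_MODEL))
    mask = np.zeros(len(DIMS), dtype=bool)
    for i, m in enumerate(DIMS):
        x = features.get(m)
        if x is not None:
            tokens[i] = x @ proj[m] + modality_emb[m]
            mask[i] = True
    return tokens, mask


def masked_mean_pool(tokens, mask):
    """Average only the tokens whose mask entry is True."""
    weights = mask.astype(tokens.dtype)[:, None]
    return (tokens * weights).sum(axis=0) / max(weights.sum(), 1.0)


features = {m: rng.normal(size=d) for m, d in DIMS.items()}
features["audio"] = None  # simulate a video without usable audio
tokens, mask = build_tokens(features)
pooled = masked_mean_pool(tokens, mask)  # a fusion encoder would run before this
```

Keeping an explicit mask means a missing modality contributes nothing to the pooled vector instead of pulling it toward zero.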
Key Designs
- Scene Modality (VideoMAE)
  - Function: Capture behavioral dynamics and uncertainty cues in videos.
  - Mechanism: Uniform sampling of 16 frames \(\rightarrow\) tubelet embedding \(\rightarrow\) Transformer encoder \(\rightarrow\) global average pooling to obtain the scene embedding \(h_s\).
  - Design Motivation: Pre-trained on Kinetics-400, VideoMAE can capture spatiotemporal dependencies, which is suitable for analyzing behavioral patterns.
- Facial Modality (EmotionEfficientNetB0 + Statistical Pooling)
  - Function: Extract frame-level facial emotion embeddings and aggregate them into video-level representations.
  - Mechanism: YOLO face detection \(\rightarrow\) EfficientNetB0 (AffectNet + fine-tuning) to extract frame-level emotional embeddings \(\rightarrow\) statistical pooling \([\mu; \sigma]\) to obtain video-level representations.
  - Design Motivation: The mean captures the dominant emotional state, while the standard deviation captures emotional fluctuations. Such instability is a key characteristic of A/H.
- Audio Modality (EmotionWav2Vec2.0 + Mamba Encoder)
  - Function: Extract emotional prosody features from speech and model temporal dependencies.
  - Mechanism: EmotionWav2Vec2.0 extracts acoustic embedding sequences \(\rightarrow\) Mamba encoder models temporal dependencies \(\rightarrow\) mean pooling \(\rightarrow\) linear layer.
  - Design Motivation: The linear complexity of Mamba is suitable for handling variable-length audio sequences, and the state-space model can capture temporal patterns such as hesitation and pauses in speech.
- Text Modality (EmotionDistilRoBERTa Fine-tuning)
  - Function: Extract linguistic hesitation cues from transcribed text.
  - Mechanism: Direct fine-tuning of EmotionDistilRoBERTa \(\rightarrow\) MLP classification head.
  - Design Motivation: Text is the strongest single-modal cue (70.02% MF1), as hesitation and ambivalence are frequently expressed through wording.
- Prototype-augmented Transformer Fusion Model
  - Function: Fuse the four modality embeddings and enhance generalization through prototype classification.
  - Mechanism: Each modality embedding \(x_m\) is projected into a shared space, \(u_m = \phi_m(x_m)\), summed with a modality embedding, and fed into a 6-layer Transformer \(\rightarrow\) masked mean pooling \(\rightarrow\) main classification head + prototype classification head.
  - Prototype Head: 16 learnable prototypes per class. Class scores are computed from the cosine similarity between the \(\ell_2\)-normalized embedding and each prototype, aggregated per class with log-sum-exp.
  - Design Motivation: The prototype auxiliary loss is intended to provide smoother gradient signals during training and to reduce overfitting to hard labels.
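The prototype head's scoring rule can be illustrated with a short NumPy sketch. It is an assumption-laden reading of the description above rather than the authors' implementation: the `temperature` scaling, the embedding width `dim`, and the function names are hypothetical, and the prototypes are random here instead of learned.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def prototype_logits(z, prototypes, temperature=0.1):
    """Class scores from per-class prototypes.

    z:          [D] fused embedding
    prototypes: [C, K, D], K prototypes per class (K = 16 in the source)
    Returns [C] scores: log-sum-exp over each class's K cosine similarities.
    """
    z = l2_normalize(z)
    p = l2_normalize(prototypes)
    sims = p @ z / temperature               # [C, K] scaled cosine similarities
    m = sims.max(axis=1, keepdims=True)       # subtract the max for stability
    return (m + np.log(np.exp(sims - m).sum(axis=1, keepdims=True))).squeeze(1)


rng = np.random.default_rng(0)
num_classes, k, dim = 2, 16, 128  # dim is an assumed embedding width
prototypes = rng.normal(size=(num_classes, k, dim))
z = prototypes[1, 3] + 0.05 * rng.normal(size=dim)  # lies near a class-1 prototype
scores = prototype_logits(z, prototypes)
```

Log-sum-exp behaves like a soft maximum, so a sample only needs to sit close to one of a class's prototypes to score highly for that class.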
Loss & Training
\(\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_{\text{proto}} \mathcal{L}_{\text{proto}} + \lambda_{\text{div}} \mathcal{L}_{\text{div}}\), where \(\lambda_{\text{proto}}=0.2\) and \(\lambda_{\text{div}}=0\) (the diversity regularization term is disabled), so the objective reduces to \(\mathcal{L}_{\text{cls}} + 0.2\,\mathcal{L}_{\text{proto}}\). The RMSprop optimizer is used, and training is repeated over 5 fixed seeds, with results averaged, to reduce sensitivity to initialization.
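As a worked example of the weighted objective, the sketch below combines the three terms in plain Python. The source does not say which loss functions are used, so treating both \(\mathcal{L}_{\text{cls}}\) and \(\mathcal{L}_{\text{proto}}\) as cross-entropy is an assumption, and `total_loss` and its arguments are hypothetical names.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def total_loss(cls_logits, proto_logits, target,
               lambda_proto=0.2, lambda_div=0.0, div_term=0.0):
    """L = L_cls + lambda_proto * L_proto + lambda_div * L_div."""
    l_cls = cross_entropy(cls_logits, target)
    l_proto = cross_entropy(proto_logits, target)  # assumed form of L_proto
    return l_cls + lambda_proto * l_proto + lambda_div * div_term


loss = total_loss(cls_logits=[0.3, 1.2], proto_logits=[2.0, 2.5], target=1)
```

With `lambda_div=0.0` the diversity term drops out whatever value it takes, which matches the disabled setting reported above.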
Key Experimental Results
Main Results (BAH Corpus)
| Model Configuration | Modality | Average MF1 | Final Test MF1 |
|---|---|---|---|
| EmotionDistilRoBERTa | Text | 70.02% | — |
| EmotionWav2Vec2.0+Mamba | Audio | 69.03% | — |
| Four-modal Fusion | All | 82.66% | 68.32% |
| Four-modal Fusion + Prototype | All | 83.25% | 65.21% |
| 5-Model Ensemble | All | 81.29% | 70.17% |
| 5-Model Ensemble + Prototype | All | 81.89% | 71.43% |
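For readers unfamiliar with the metric, the sketch below computes a macro-averaged F1 in plain Python. It assumes MF1 means the unweighted mean of the per-class F1 scores, since the summary does not expand the abbreviation, and the labels are made-up toy data rather than BAH predictions.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating that class as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: 1 = A/H present, 0 = A/H absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
score = macro_f1(y_true, y_pred)
```

Because each class counts equally, a model that always predicts the majority class scores poorly on this metric even when its accuracy looks acceptable.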
Ablation Study (Modality Combinations)
| Modality Combination | Average MF1 | Description |
|---|---|---|
| Scene + Text | 80.39% | Strongest bi-modal combination |
| Face + Scene + Text | 78.77% | Strongest tri-modal combination |
| Face + Audio | 67.40% | Poor performance without text |
| Full Quad-modal | 82.66% | All four modalities, without the prototype head |
Key Findings
- Text is the strongest single modality (~70% MF1), followed by audio (69%), while face and scene are individually weaker (~62%).
- Multimodal fusion clearly outperforms every single modality: the best fusion model (83.25% average MF1) exceeds the best single modality (text, 70.02%) by about 13 percentage points.
- Prototype augmentation improves validation performance (average MF1 from 82.66% to 83.25%) but hurts the test generalization of an individual model (test MF1 from 68.32% to 65.21%). Its benefit shows up on the test set only after ensembling (70.17% to 71.43%).
- Scene + Text is the strongest bi-modal combination (80.39%), indicating that behavioral dynamics and linguistic cues are highly complementary.
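The differences behind these findings can be recomputed directly from the Main Results table. The snippet only does the subtraction; the dictionary keys and the `delta` helper are illustrative names.

```python
# MF1 values (percent) copied from the Main Results table.
results = {
    "text_only":      {"avg": 70.02},
    "fusion":         {"avg": 82.66, "test": 68.32},
    "fusion_proto":   {"avg": 83.25, "test": 65.21},
    "ensemble":       {"avg": 81.29, "test": 70.17},
    "ensemble_proto": {"avg": 81.89, "test": 71.43},
}


def delta(a, b, split):
    """Difference a - b in percentage points, rounded to two decimals."""
    return round(results[a][split] - results[b][split], 2)


# Best fusion model vs. best single modality (average MF1).
fusion_gain = delta("fusion_proto", "text_only", "avg")
# Effect of the prototype head on a single model (test MF1).
proto_single_test = delta("fusion_proto", "fusion", "test")
# Effect of the prototype head on the five-model ensemble (test MF1).
proto_ensemble_test = delta("ensemble_proto", "ensemble", "test")
```

The sign flip between the last two values is the point of the third finding: the prototype head costs about three points for a single model on the test set but gains about one point once five models are ensembled.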
Highlights & Insights
- Text Modality Dominates A/H Recognition: The results support the central role of linguistic content in ambivalence/hesitancy detection, since hesitation is often expressed most directly through wording.
- Prototype Enhancement + Ensemble Strategy: Prototype classification as an auxiliary loss provides a regularization effect during training, yet an ensemble is required to compensate for single-model instability.
- Mamba for Audio Temporal Modeling: Mamba's complexity is linear in sequence length, which suits variable-length audio. This makes it more efficient than a Transformer, whose self-attention is quadratic in sequence length, and it is reported to perform better on the audio modality.
- Statistical Pooling Captures Emotion Fluctuations: The facial modality utilizes \([\mu; \sigma]\) rather than just \(\mu\). The standard deviation encodes the degree of emotional fluctuation—a key signal in ambivalent behavior.
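A small NumPy sketch shows why the \([\mu; \sigma]\) vector separates a steady face from a wavering one even when their mean embeddings coincide. The two-dimensional "embeddings" are toy values, and the use of the population standard deviation is an assumption; the source does not specify the estimator.

```python
import numpy as np


def statistical_pooling(frame_embeddings):
    """Concatenate per-dimension mean and standard deviation over frames.

    frame_embeddings: [T, D] array, one row per detected face crop.
    Returns a [2 * D] video-level vector [mu; sigma].
    """
    mu = frame_embeddings.mean(axis=0)
    sigma = frame_embeddings.std(axis=0)  # population std; an assumption
    return np.concatenate([mu, sigma])


steady = np.tile([0.8, 0.1], (6, 1))               # same expression every frame
wavering = np.array([[0.9, 0.0], [0.7, 0.2]] * 3)  # expression alternates

steady_vec = statistical_pooling(steady)      # sigma half is (near) zero
wavering_vec = statistical_pooling(wavering)  # same mean, non-zero sigma
```

Mean pooling alone would map both clips to the same vector; the \(\sigma\) half is what carries the fluctuation signal.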
Limitations & Future Work
- The BAH corpus is small (1427 videos), which limits the generalization ability of deep models.
- The text modality relies on automatic transcription quality; ASR errors in real-world scenarios may impact performance.
- Cross-modal inconsistency is not modeled explicitly: although a core characteristic of A/H is the mismatch between what is said and the accompanying facial expression, the current fusion approach focuses on alignment rather than contrast.
- The scene modality only samples 16 frames, which may miss crucial moments of hesitation.
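To make the 16-frame concern concrete, the sketch below computes evenly spaced frame indices and the time gap between them. The clip length and frame rate are hypothetical, and `uniform_frame_indices` is an illustrative helper; the source does not give the actual sampling code.

```python
import numpy as np


def uniform_frame_indices(num_frames, num_samples=16):
    """Evenly spaced frame indices covering the whole clip."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(int)


# Hypothetical 60-second clip at 30 fps (1800 frames).
indices = uniform_frame_indices(num_frames=1800)
gap_seconds = float(np.diff(indices).mean()) / 30.0
```

For this hypothetical clip the sampled frames sit roughly four seconds apart, so a brief hesitation that falls between two samples would not reach the scene encoder at all.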
Related Work & Insights
- vs González-González et al. (baseline): The baseline uses simple concatenation fusion, whereas this work employs Transformer fusion + prototype-augmentation to better capture cross-modal interactions.
- vs Savchenko & Savchenko: Their best results came from a text + face combination, while this work demonstrates that quad-modal fusion and scene information can yield further improvements.
- vs Hallmen et al.: Their tri-modal fusion uses an MLP, whereas this work uses a Transformer fusion module, which can model interactions between modality tokens through self-attention.
Rating
- Novelty: ⭐⭐⭐ No major innovations at the component level (combinations of existing modules), but the prototype-augmented fusion and the quad-modal design for the A/H task offer some novelty.
- Experimental Thoroughness: ⭐⭐⭐⭐ Exhaustive ablation experiments cover all modality combinations and model variants.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed experimental descriptions and good reproducibility.
- Value: ⭐⭐⭐ A competition solution paper; the generalizability of the method is limited, but it provides a valuable baseline for A/H recognition.