CVPR 2025 (Workshop) Human Understanding Facial Expression Recognition Valence-Arousal Estimation Action Unit Detection Violence Detection EfficientNet MLP

HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification

Conference: CVPR 2025 (Workshop)
arXiv: 2603.12693
Code: EmotiEffLib / VD
Area: Facial Expression Recognition / Affective Computing
Keywords: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection, Violence Detection, EfficientNet, MLP

TL;DR

The HSEmotion team proposes a lightweight pipeline for the ABAW-10 competition: pre-trained encoders (e.g., EfficientNet-based EmotiEffNet) extract facial embeddings, which are fed to an MLP and refined with GLA (Generalized Logit Adjustment) and sliding window smoothing. The pipeline outperforms the official baselines by a wide margin on all four tasks (EXPR/VA/AU/VD); on violence detection, ConvNeXt-T + TCN reaches a macro F1 of 0.783.

Background & Motivation

Background: The ABAW (Affective Behavior Analysis in-the-wild) competition is a mainstream benchmark in affective computing. The 10th edition comprises four tasks: Facial Expression Recognition (EXPR), Valence-Arousal Estimation (VA), Action Unit Detection (AU), and Fine-Grained Violence Detection (VD).

Key Challenge: In-the-wild data suffers from occlusion, head pose and illumination variations, domain shift, label noise, and class imbalance. Existing methods typically rely on complex temporal modeling (Transformers/TCNs) and multimodal fusion, which makes them computationally expensive.

Key Insight: Rather than pursuing architectural complexity, the authors combine a high-quality pre-trained encoder, a simple MLP, and post-processing techniques (GLA, confidence filtering, sliding window smoothing) into a "simple but effective" pipeline.

Mechanism: Pre-trained models already possess powerful feature extraction capabilities; the key lies in how to efficiently utilize these features and address class imbalance and inter-frame noise issues.

Method

Facial Expression Recognition (EXPR)

Feature Extraction: Face embeddings are extracted using EmotiEffNet-B0 (EfficientNet pre-trained on AffectNet).
MLP Classifier: A single-hidden-layer MLP is trained with a class-weighted softmax loss to handle class imbalance.
GLA (Generalized Logit Adjustment): Per-class bias \(b_y^*\) is searched on the validation set to maximize the F1 score, effectively correcting class prior bias.
Confidence Filtering: If the maximum predicted probability of the pre-trained model exceeds a threshold \(p_0 \in [0.8, 0.9]\), its prediction is adopted directly; otherwise, the MLP is used for classification.
Temporal Smoothing: A sliding window is applied to average the probabilities of adjacent frames, eliminating frame-level noise.
Optional Audio Fusion: Features from wav2vec 2.0 are extracted and utilized to train a separate MLP, which is then fused with the visual branch via weighted summation.
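The post-processing steps listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the coordinate-wise grid search in `search_bias`, the function names, the default `p0=0.85` and `window=5`, and the assumption that the pre-trained model and the MLP share the same class order are all choices made here for illustration.

```python
def argmax(values):
    # Index of the largest value (first one on ties).
    return max(range(len(values)), key=values.__getitem__)

def macro_f1(y_true, y_pred, n_classes):
    # Unweighted mean of per-class F1 scores.
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

def search_bias(val_logits, val_labels, n_classes, grid):
    """Assumed search strategy: tune one class bias at a time over `grid`,
    keeping a candidate only if it strictly improves validation macro F1."""
    bias = [0.0] * n_classes

    def score(b):
        preds = [argmax([z + bc for z, bc in zip(row, b)]) for row in val_logits]
        return macro_f1(val_labels, preds, n_classes)

    best = score(bias)
    for c in range(n_classes):
        for cand in grid:
            trial = bias[:c] + [cand] + bias[c + 1:]
            s = score(trial)
            if s > best:
                best, bias = s, trial
    return bias, best

def filter_by_confidence(pretrained_probs, mlp_probs, p0=0.85):
    # Keep the pre-trained model's output when it is confident, else use the MLP's.
    return [pre if max(pre) > p0 else mlp
            for pre, mlp in zip(pretrained_probs, mlp_probs)]

def smooth(frame_probs, window=5):
    # Centered moving average over frames; the window shrinks at video boundaries.
    half = window // 2
    n_frames, n_classes = len(frame_probs), len(frame_probs[0])
    out = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        out.append([sum(frame_probs[i][c] for i in range(lo, hi)) / (hi - lo)
                    for c in range(n_classes)])
    return out
```

In this sketch the biases are added to the logits before the arg-max, so a negative bias on a frequent class shifts borderline frames toward rarer classes. The order in which filtering and smoothing are chained is not specified in the summary above and is left to the caller.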

Valence-Arousal Estimation (VA)

A pre-trained MT-DDAMFN model is used to extract embeddings, followed by an MLP with no hidden layer (i.e., a single linear layer) for regression.
The loss function combines MSE and Concordance Correlation Coefficient (CCC).
Sliding window smoothing is similarly applied.
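As a rough illustration of the loss described above, the following sketch computes the Concordance Correlation Coefficient and combines it with MSE. The summary does not state how the two terms are weighted, so the form `MSE + alpha * (1 - CCC)` and the parameter `alpha` are assumptions made here, as are the function names.

```python
def mse(pred, target):
    # Mean squared error between two equally long sequences.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def ccc(pred, target):
    # Concordance Correlation Coefficient:
    # 2 * cov / (var_pred + var_target + (mean_pred - mean_target)^2)
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    vt = sum((t - mt) ** 2 for t in target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target)) / n
    denom = vp + vt + (mp - mt) ** 2
    return 2 * cov / denom if denom else 0.0

def va_loss(pred, target, alpha=1.0):
    # Hypothetical combination for one dimension (valence or arousal).
    return mse(pred, target) + alpha * (1.0 - ccc(pred, target))
```

Unlike Pearson correlation, CCC also penalizes a constant offset: predictions `[0, 1, 2]` against targets `[1, 2, 3]` are perfectly correlated but only reach a CCC of 4/7.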

Action Unit Detection (AU)

Multi-label classification for 12 AUs is performed using an MLP with a sigmoid output.
Weighted BCEWithLogitsLoss is employed, where positive weights are computed based on class frequencies.
Innovation: Predictions are blended from two separate MLPs, one trained on the encoder's embeddings and the other on its logits.
Per-AU optimal thresholds are searched instead of using a uniform threshold of 0.5.
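The AU-specific steps above can be sketched as follows. The exact weight formula is not given in the summary; the `n_negative / n_positive` ratio used here is the convention suggested in PyTorch's `BCEWithLogitsLoss` documentation for `pos_weight` and is an assumption, as are the equal blending weight `w=0.5`, the threshold grid, and all function names.

```python
def positive_weights(labels):
    # labels: list of frames, each a list of 0/1 values (one per AU).
    # Assumed formula: n_negative / n_positive per AU.
    n_aus = len(labels[0])
    weights = []
    for a in range(n_aus):
        pos = sum(frame[a] for frame in labels)
        neg = len(labels) - pos
        weights.append(neg / pos if pos else 1.0)
    return weights

def blend(probs_emb, probs_logit, w=0.5):
    # Weighted average of the two MLPs' per-AU probabilities (w is hypothetical).
    return [[w * pe + (1.0 - w) * pl for pe, pl in zip(fe, fl)]
            for fe, fl in zip(probs_emb, probs_logit)]

def f1_binary(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def search_thresholds(val_probs, val_labels, grid):
    # For each AU independently, pick the grid threshold with the best validation F1.
    thresholds = []
    for a in range(len(val_probs[0])):
        y_true = [frame[a] for frame in val_labels]
        best_thr, best_f1 = grid[0], -1.0
        for thr in grid:
            y_pred = [1 if frame[a] >= thr else 0 for frame in val_probs]
            score = f1_binary(y_true, y_pred)
            if score > best_f1:
                best_thr, best_f1 = thr, score
        thresholds.append(best_thr)
    return thresholds
```

Because each AU has its own positive rate, the best threshold can sit well below 0.5 for rare AUs, which is the motivation for searching per AU rather than sharing one value.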

Violence Detection (VD)

Best single-stream model: ConvNeXt-T (pre-trained on ImageNet-1K) extracts 768-d frame features, followed by a 5-layer dilated TCN.
Multimodal variant: MediaPipe Pose skeletal features (406-d compressed to 256-d) are incorporated, followed by cross-attention fusion and a BiLSTM.
Training is conducted using AdamW + OneCycleLR + TrivialAugmentWide, with a positive class weight of 1.15.
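To give a feel for how much temporal context a 5-layer dilated TCN can cover, the sketch below computes its receptive field. The summary states only the depth; the kernel size of 3, the doubling dilation schedule, and one convolution per layer are assumptions chosen for illustration, not details confirmed by the source.

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_layer=1):
    # Each dilated convolution widens the receptive field by (kernel_size - 1) * dilation frames.
    return 1 + convs_per_layer * (kernel_size - 1) * sum(dilations)

# Assumed schedule: dilation doubles at every layer (1, 2, 4, 8, 16).
assumed_dilations = [2 ** i for i in range(5)]
frames_seen = tcn_receptive_field(kernel_size=3, dilations=assumed_dilations)
print(frames_seen)  # 63 under these assumptions
```

With the same kernel size but no dilation, five layers would see only 11 frames, which is why dilation is a cheap way to extend temporal context on top of per-frame features.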

Key Experimental Results

EXPR Classification (AffWild2 Validation Set)

Method	F1-score (%)	Accuracy (%)
Baseline VGGFACE	25.0	-
EmotiEffNet, GLA, sliding window	44.85	55.41
EmotiEffNet, GLA, filtering + sliding window	45.79	55.69
EmotiEffNet + wav2vec, GLA, filtering + sliding window	47.40	57.98
Comparison: CLIP+TCN [68]	46.51	-

VA Estimation (AffWild2 Validation Set)

Method	CCC_V	CCC_A	\(P_{VA}\)
Baseline ResNet-50	0.24	0.20	0.22
MT-DDAMFN, MLP, sliding window	0.510	0.615	0.562
Comparison: CLIP+TCN [68]	0.562	0.612	0.587
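
Assuming the usual ABAW definition of the VA metric as the mean of the two CCC values, which the rows above are consistent with, the last column can be checked by hand:

\[
P_{VA} = \frac{\mathrm{CCC}_V + \mathrm{CCC}_A}{2}, \qquad
\frac{0.510 + 0.615}{2} = 0.5625, \qquad
\frac{0.562 + 0.612}{2} = 0.587
\]

The first value matches the reported 0.562 up to rounding. The proposed pipeline is slightly ahead on arousal (0.615 vs. 0.612), so its lower \(P_{VA}\) comes from valence (0.510 vs. 0.562).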

AU Detection (AffWild2 Validation Set)

Method	F1-score (%)
Baseline VGGFACE	39.0
EmotiEffNet, logits+embeddings, sliding window, optimal threshold	54.7
Comparison: CLIP+TCN [68]	58.0

Violence Detection (DVD Validation Set)

Method	F1_V	F1_NV	Macro F1
Baseline ResNet-50 + BiLSTM	0.56	0.71	0.640
ConvNeXt-T + TCN	0.738	0.828	0.783
ConvNeXt-T + Skel. attn + BiLSTM	0.715	0.828	0.772

Key Findings

A 2D pre-trained encoder + simple temporal head consistently outperforms 3D video backbones (e.g., SlowFast, VideoMAE).
Two-stream optical flow fusion performs worse than pure RGB ConvNeXt-T.
GLA substantially mitigates class imbalance: F1 rises from 38.68 to 41.40 (+2.72 points).
Confidence filtering and sliding window smoothing each contribute roughly 1-2 points of F1.
Per-AU threshold searching consistently yields a gain of 0.2-0.5 points over a uniform threshold of 0.5.

Highlights & Insights

Extreme Engineering Simplicity: The entire EXPR pipeline consists only of a pre-trained encoder, a single-hidden-layer MLP, and three post-processing techniques, yet achieves near-SOTA performance.
Effective Application of GLA: Migrating post-hoc logit adjustment from general classification to affective recognition is straightforward and highly effective.
Intuitive Confidence Filtering: Pre-trained models are already accurate on high-confidence samples, so the extra classifier is needed only for low-confidence ones.
Systematic Ablation for the VD Task: Extensive combinations of backbones, temporal heads, and multimodal options are tested, yielding clear conclusions.

Limitations & Future Work

Methodology innovation is relatively limited, primarily representing an engineering assembly of existing techniques (EfficientNet + MLP + GLA + smoothing).
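Gaps remain between the proposed solutions and the strongest prior work, CLIP+TCN: on the validation sets, VA reaches \(P_{VA}\) 0.562 vs. 0.587 and AU detection reaches 54.7 vs. 58.0 F1 (about 3.3 points behind). On EXPR, the visual-only pipeline (45.79) trails CLIP+TCN (46.51), although the audio-visual variant (47.40) exceeds it.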
VA estimation utilizes only single frames + simple smoothing, failing to fully leverage temporal dependencies.
The integration of the audio modality is relatively primitive (simple weighted fusion), without exploring more sophisticated fusion mechanisms like cross-attention.
Violence detection is evaluated only on the DVD dataset; its generalizability remains to be verified.

Rating

Novelty: ⭐⭐⭐ Limited methodological innovation; mainly a combination of mature techniques.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Detailed ablations across all four tasks, with extensive architectural comparisons for VD.
Writing Quality: ⭐⭐⭐⭐ Well-structured, though tending towards a technical report style.
Value: ⭐⭐⭐⭐ Holds practical engineering reference value as a competition solution, demonstrating the upper bound of "simple methods."