
Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Conference: AAAI 2026 | arXiv: 2511.10254 | Code: https://github.com/RobitsG/Facial-R1 | Area: Human Understanding | Keywords: Facial Emotion Analysis, Reinforcement Learning, Action Unit, Vision-Language Model, GRPO

TL;DR

This paper proposes Facial-R1, a three-stage alignment training framework (SFT → RL → Data Synthesis) that aligns the reasoning process of VLMs with emotion recognition outcomes by treating AU and emotion labels as verifiable reward signals. The framework achieves state-of-the-art performance on 8 benchmarks and introduces the FEA-20K dataset.

Background & Motivation

Facial Emotion Analysis (FEA) extends conventional Facial Expression Recognition (FER) by not only predicting emotion labels but also identifying facial Action Units (AUs) and generating interpretable emotion reasoning grounded in AUs. Recent vision-language models (VLMs) such as LLaVA and InternVL have been introduced to FEA tasks with promising results.

However, two core limitations persist in existing approaches:

Reasoning Hallucination: VLMs lack domain-specific priors in emotion understanding, making them prone to generating plausible-sounding yet factually incorrect emotion explanations—such as omitting critical facial features or misidentifying AUs.

Reasoning–Recognition Misalignment: Even when a model correctly identifies emotion cues during reasoning, the final emotion label may contradict the reasoning conclusion, due to the absence of an inherent causal link between the reasoning chain and the predicted label.

Prior methods (e.g., ExpLLM, FABA) have attempted to address these issues via fine-grained instruction tuning, but high-quality emotion reasoning data is difficult to collect at scale, and overly rigid instruction tuning constrains the flexible reasoning capacity of VLMs.

The core idea of this paper is to use verifiable emotion factors (AUs and emotion labels) as reinforcement learning reward signals, rather than prescribing specific reasoning paths, allowing the model to naturally develop flexible reasoning patterns during training—thereby simultaneously addressing hallucination and reasoning–recognition misalignment.

Method

Overall Architecture

Facial-R1 adopts a three-stage progressive training pipeline (sketched below):

  • Stage 1: SFT (Supervised Fine-Tuning) — Establishes foundational emotion reasoning capability using a small set of high-quality samples.
  • Stage 2: RL (Reinforcement Learning) — Aligns reasoning with recognition using emotion factors as reward signals.
  • Stage 3: Data Synthesis — Iteratively expands training data to enable self-improvement.
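
The loop below is a schematic of how the three stages compose; the function names are hypothetical placeholders (the paper does not release this driver code), their bodies are stubbed out, and the number of rounds and the way synthetic data is folded back into training are assumptions made here for illustration.

```python
# Hypothetical, stubbed driver for the three-stage pipeline (all names are placeholders).

def supervised_finetune(model, samples):
    # Stage 1 (and data fold-in): fine-tune on (image, reasoning, label) samples.
    return model  # stub

def grpo_train(model, image_pool):
    # Stage 2: GRPO with AU / accuracy / format rewards (see Key Designs below).
    return model  # stub

def synthesize_and_filter(model, image_pool):
    # Stage 3: generate reasoning chains; keep only triple-checked, human-reviewed samples.
    return []  # stub

def train_facial_r1(base_vlm, seed_samples, image_pool, rounds=2):
    model = supervised_finetune(base_vlm, seed_samples)       # ~300 high-quality seeds
    for _ in range(rounds):                                   # iterative self-improvement
        model = grpo_train(model, image_pool)
        synthetic = synthesize_and_filter(model, image_pool)
        model = supervised_finetune(model, synthetic)         # fold synthetic data back in
    return model
```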

Key Designs

  1. Minimal Supervised Fine-Tuning (SFT):

    • Function: Fine-tunes the model on only 300 high-quality emotion analysis samples generated by GPT-4o-mini.
    • Mechanism: AU definitions and other emotion domain knowledge are embedded in the instructions, enabling the VLM to establish basic reasoning associations between facial expressions and emotions.
    • Design Motivation: Eliminates reasoning hallucinations at minimal initialization cost, circumventing the bottleneck of large-scale annotation.
  2. Reinforcement Learning with Verifiable Rewards (RL via GRPO):

    • Function: Applies the GRPO algorithm with three reward components to guide model training.
    • Mechanism: Composite reward \(R = R_{AU} + R_{acc} + R_{format}\)
      • AU Reward \(R_{AU}\): Measures the F1 score between predicted AUs and ground truth, encouraging reasoning grounded in observable facial features and mitigating reward sparsity.
      • Accuracy Reward \(R_{acc}\): Assigns 1 if the predicted emotion label is correct and 0 otherwise, directly aligning reasoning with recognition.
      • Format Reward \(R_{format}\): Requires outputs to use <think> and <answer> tags, standardizing the reasoning structure.
    • Design Motivation: Unlike SFT, the RL stage imposes no constraints on specific reasoning paths; it only requires the model to attend to two verifiable emotion factors (AUs and emotion labels), thereby enhancing flexibility and robustness (a minimal reward sketch follows this list).
  3. Iterative Data Synthesis:

    • Function: Leverages the model trained in the preceding two stages to automatically generate large-scale emotion reasoning data.
    • Mechanism: Questions and ground-truth labels from datasets such as FABA-Instruct are used to construct instructions; the trained VLM generates reasoning chains, which are then filtered through a triple-check (AU / emotion / format) and human review to ensure quality.
    • Design Motivation: Bypasses the bottleneck of manual annotation by iteratively expanding training data across multiple rounds, ultimately yielding the FEA-20K dataset comprising 17,737 training samples and 1,688 test samples.
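
To make the reward design concrete, the following is a minimal sketch of the composite reward \(R = R_{AU} + R_{acc} + R_{format}\). It assumes the response wraps reasoning in <think>…</think> and the final label in <answer>…</answer>, that AUs appear in the reasoning as codes such as "AU12", and that the three terms are summed with equal weight; the paper's exact parsing and weighting may differ.

```python
import re

def au_f1(pred_aus, gold_aus):
    """Set-level F1 between predicted and ground-truth AU codes, e.g. {"AU4", "AU7"}."""
    pred, gold = set(pred_aus), set(gold_aus)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def composite_reward(output, gold_aus, gold_emotion):
    """Composite reward R = R_AU + R_acc + R_format for one sampled response."""
    # Format reward: the response must contain well-formed <think> and <answer> tags.
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    r_format = 1.0 if (think and answer) else 0.0

    # AU reward: F1 between AU codes mentioned in the reasoning and the ground truth,
    # grounding the reasoning in observable facial features.
    pred_aus = re.findall(r"AU\d+", think.group(1)) if think else []
    r_au = au_f1(pred_aus, gold_aus)

    # Accuracy reward: 1 if the predicted emotion label matches the ground truth.
    pred_emotion = answer.group(1).strip().lower() if answer else ""
    r_acc = 1.0 if pred_emotion == gold_emotion.lower() else 0.0

    return r_au + r_acc + r_format
```

In GRPO, a reward of this form would be computed for every response in a sampled group before the intra-group advantage normalization described under Loss & Training.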

Loss & Training

  • The SFT stage uses standard cross-entropy loss.
  • The RL stage employs the GRPO algorithm, optimizing the policy via the intra-group relative advantage \(A_i = (R_i - \text{mean}(\{R_j\}))/\text{std}(\{R_j\})\), where the mean and standard deviation are computed over the group of responses sampled for the same input (see the sketch after this list).
  • The data synthesis stage incorporates a reflection mechanism: when the model's initial reasoning is incorrect, it is guided to self-correct before regenerating the response.
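
A minimal sketch of the intra-group advantage computation, assuming rewards come from a composite function like the one above; the epsilon term guarding against zero variance is an implementation assumption not stated in this summary.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """A_i = (R_i - mean(R)) / (std(R) + eps), normalized within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: composite rewards for four responses sampled for the same image/question.
print(group_relative_advantages([3.0, 2.0, 1.0, 0.5]))
```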

Key Experimental Results

Main Results

AU Recognition (F1↑):

| Dataset | Metric | Facial-R1 | Prev. SOTA | Gain |
|---|---|---|---|---|
| DISFA | F1 | 73.1 | 72.9 (Face-LLaVA) | +0.2 |
| BP4D | F1 | 67.4 | 69.3 (Norface) | -1.9 |
| RAF-AU | F1 | 70.2 | 69.5 (Exp-BLIP) | +0.7 |
| FABA-Instruct | F1 | 68.3 | 61.9 (FMAE) | +6.4 |

Emotion Recognition:

  • RAF-DB: Facial-R1 ranks first across all 7 emotion categories, substantially outperforming GPT-4o (62.7% Acc).
  • AffectNet: Facial-R1 achieves 65.2% accuracy (8-class), with top performance on happiness, sadness, anger, surprise, and fear.

Ablation Study

| Configuration | Key Metric | Notes |
|---|---|---|
| Qwen2.5-VL (zero-shot) | 22.1 F1 (DISFA) | Baseline; lacks emotion priors |
| + SFT only (300 samples) | Significant improvement | Eliminates hallucinations; establishes foundational capability |
| + SFT + RL | Further improvement | Aligns reasoning with recognition |
| + SFT + RL + Data Synthesis | 73.1 F1 (DISFA) | Full framework; comprehensive SOTA |
| SFT with large-scale data only | Below full framework | Excessively constrains reasoning flexibility |

Key Findings

  • As few as 300 SFT samples suffice to effectively eliminate reasoning hallucinations and establish foundational emotion reasoning capability.
  • The AU reward in the RL stage is most effective at reducing spurious reasoning, while the emotion accuracy reward is most effective at eliminating reasoning–recognition misalignment.
  • Data synthesis realizes a "data flywheel" effect without the need for large-scale manual annotation.
  • A gain of 6.4 F1 points on FABA-Instruct demonstrates the method's clear advantage in complex reasoning scenarios.

Highlights & Insights

  • Extremely low initialization cost: The entire training pipeline can be bootstrapped with only 300 GPT-4o-mini-generated samples and weak AU/emotion labels.
  • Transferring the GRPO paradigm from DeepSeek-R1 to expression analysis: The verifiable reward mechanism originally designed for mathematical reasoning is extended to verifiable emotion factors in FEA.
  • Trade-off between reasoning flexibility and path constraints: RL is better suited than SFT for emotion reasoning, as emotional expression is highly individualized and should not be forced into a uniform reasoning path.
  • Visualizations show that Facial-R1 accurately detects multiple AUs and derives emotion labels through coherent reasoning, whereas baseline VLMs frequently misidentify AUs.

Limitations & Future Work

  • The current data synthesis pipeline relies on images sourced from FABA-Instruct, limiting sample diversity.
  • Performance on BP4D does not surpass Norface (67.4 vs. 69.3), indicating room for improvement in generalizing to laboratory-controlled settings.
  • The framework only supports discrete emotion classification and does not account for continuous affect dimensions (valence–arousal) or compound emotions.
  • Inference speed is constrained by the autoregressive generation of VLMs, posing challenges for real-time deployment.
  • The verifiable-reward RL paradigm from DeepSeek-R1/GRPO proves effective in domain-specific settings such as emotion analysis.
  • The three-stage paradigm of "minimal supervision + RL + data synthesis" is potentially generalizable to other vision tasks requiring interpretable reasoning.
  • AUs, as an intermediate representation of emotion, provide a structured reasoning pathway transferable to other affective computing tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ — Applying GRPO to expression analysis is innovative, though the three-stage framework is not entirely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Evaluated across 8 benchmarks with multi-task assessment and complete ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Logic is clear; motivation and methodology are naturally connected.
  • Value: ⭐⭐⭐⭐ — The FEA-20K dataset and low-cost training paradigm offer practical utility.