# LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics

**Conference:** CVPR 2026 · **arXiv:** 2604.07193 · **Code:** None · **Area:** Affective Computing · **Keywords:** affective modeling, language models, semantic priors, Valence-Arousal, preference learning
## TL;DR
This paper proposes the LaScA framework, which leverages large language models to generate a deterministic semantic lexicon as affective priors for handcrafted facial and acoustic features. A frozen sentence encoder produces semantic embeddings that are fused with the raw features. LaScA consistently outperforms feature-only baselines in affective dynamics prediction on the Aff-Wild2 and SEWA datasets, and matches or surpasses end-to-end deep models in terms of consistency, efficiency, and interpretability.
## Background & Motivation
Modeling affective behavior in-the-wild is a central challenge in affective computing. Existing approaches suffer from the following issues:
Opacity of end-to-end deep models: CNN/RNN/Transformer architectures learn high-dimensional latent representations directly from visual and audio streams, but signal extraction and affective reasoning are entangled within opaque embeddings, making it difficult to analyze how specific behavioral cues influence predictions.
Lack of contextual abstraction in handcrafted features: Facial geometry features and acoustic descriptors are compact, efficient, and grounded in domain knowledge, yet they fail to capture higher-level semantic relationships that affect emotional perception—for instance, the same facial action unit may carry different affective meanings in different contexts.
High annotation noise: Affective annotations in naturalistic settings are highly subjective and culturally variable; predicting the direction of change is more reliable than predicting absolute values.
The core insight of LaScA is that handcrafted features provide a strong representational foundation but require contextual semantic enrichment from language models—rather than being replaced by deep embeddings.
## Method

### Overall Architecture
The complete LaScA pipeline proceeds as follows:

1. Extract handcrafted facial features (58 blendshape coefficients) and acoustic features (15 MFCCs).
2. Perform per-sample saliency estimation via Otsu thresholding to select active features.
3. Assemble semantic descriptions of the active features into a structured template.
4. Generate semantic embeddings with a frozen sentence encoder.
5. Fuse features and embeddings, then apply a preference learner to predict the direction of affective change.
### Key Designs
- **Affect-Aware Semantic Lexicon**
    - Function: generates a fixed textual description for each handcrafted feature (e.g., blendshape coefficients, MFCCs).
    - Mechanism: ChatGPT 5.2 is prompted once, in the role of an "affective computing researcher," to generate affective descriptions for all features, stored as a fixed mapping \(\mathcal{L} = \{(f_i, \ell_i)\}_{i=1}^d\).
    - Design Motivation: the lexicon is constructed only once, eliminating LLM stochasticity and computational overhead at inference time while ensuring reproducibility.
    - Ablation Evidence: the LLM-generated lexicon improves valence performance by approximately 2% over a lexicon that uses only feature names.
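Because the lexicon is built once offline, inference reduces to reading a fixed feature-to-description mapping. A minimal sketch of that idea, where the feature names and descriptions are illustrative placeholders rather than the paper's actual lexicon:

```python
# Deterministic lexicon sketch: the LLM is queried only once, offline;
# at inference time the system just reads this fixed mapping.
# Entries below are illustrative placeholders, not the paper's lexicon.
LEXICON = {
    "brow_lowerer": "lowered brows, often associated with negative valence or concentration",
    "lip_corner_puller": "raised lip corners, typically signalling positive valence",
    "mfcc_1": "broad spectral energy, loosely tracking vocal effort and arousal",
}

def describe(feature_name: str) -> str:
    """Look up the fixed affective description for one feature."""
    return LEXICON.get(feature_name, feature_name)  # fall back to the raw name
```

Keeping the mapping frozen is what makes inference reproducible: no LLM call, and the same feature always yields the same description.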
-
- **Per-Sample Saliency Estimation (Otsu Thresholding)**
    - Function: for each time slice, normalized feature values are sorted and binarized into salient/non-salient groups via Otsu's method.
    - Mechanism: unsupervised partitioning that maximizes inter-class variance, yielding a binary mask \(\mathbf{m}_t \in \{0,1\}^d\).
    - Design Motivation: adaptively selects the dominant behavioral cues under strong individual variability.
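Otsu's method applied to one time slice's feature vector can be sketched as follows; the bin count and min-max normalization are assumptions of this sketch, not details given in the paper:

```python
import numpy as np

def otsu_mask(x, bins=32):
    """Binarize one time slice's feature vector into salient (1) / non-salient (0)
    by picking the threshold that maximizes inter-class variance (Otsu's method)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)          # normalize to [0, 1]
    hist, edges = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()                                    # bin probabilities
    centers = (edges[:-1] + edges[1:]) / 2.0
    best_t, best_var = 0.0, -1.0
    for k in range(1, bins):                                 # candidate split points
        w0, w1 = p[:k].sum(), p[k:].sum()                    # class weights
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0               # class means
        mu1 = (p[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2             # inter-class variance
        if var_between > best_var:
            best_var, best_t = var_between, edges[k]
    return (x >= best_t).astype(int)                         # binary mask m_t
```

Because the threshold is recomputed per sample, the mask adapts to each person's activation range instead of relying on a global cutoff.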
- **Semantic Encoding and Fusion**
    - Function: descriptions of the active features are inserted into a structured template and encoded by a frozen sentence Transformer into a semantic embedding \(\mathbf{s}_t\).
    - Mechanism: the semantic embedding captures contextual relationships among the active behavioral cues.
    - Fusion: simple concatenation, \(\mathbf{z}_t = [\mathbf{x}_t \| \mathbf{s}_t]\).
    - Encoders: five sentence encoders are evaluated: MPNet, QAMPNet, DistilRoBERTa, MiniLM, and DistilBERT.
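Template assembly and fusion are both lightweight operations. A sketch under assumptions: the template wording below is invented for illustration (the paper does not publish its exact template), and the sentence encoder is left abstract:

```python
import numpy as np

def assemble_prompt(feature_names, mask, lexicon):
    """Insert descriptions of the active (mask == 1) features into a template.
    The template wording here is an illustrative assumption."""
    active = [lexicon.get(f, f) for f, m in zip(feature_names, mask) if m]
    return "Active behavioural cues: " + "; ".join(active) + "."

def fuse(x_t, s_t):
    """Fusion is plain concatenation: z_t = [x_t || s_t]."""
    return np.concatenate([x_t, s_t])
```

The prompt string would then be passed to the frozen sentence encoder to obtain \(\mathbf{s}_t\), so the trainable part of the model never touches the text itself.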
- **Preference Learner**
    - Function: predicts whether affect increases or decreases between consecutive temporal windows.
    - Mechanism: preference pairs \((x_t, x_{t+1})\) are constructed and retained only when the relative change exceeds a threshold \(\tau\); the embedding difference \(\Delta\mathbf{z}\) is passed through a two-layer MLP with a sigmoid output to predict the direction.
    - Design Motivation: relative prediction is more robust than absolute prediction, mitigating annotation noise.
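Pair construction with the relative-change filter can be sketched as follows; the precise relative-change formula is an assumption of this sketch:

```python
import numpy as np

def preference_pairs(z, y, tau=0.20):
    """Build (Δz, direction) training examples from consecutive windows,
    keeping a pair only when the relative label change exceeds tau.
    The relative-change formula used here is an assumption."""
    deltas, directions = [], []
    for t in range(len(y) - 1):
        rel_change = abs(y[t + 1] - y[t]) / (abs(y[t]) + 1e-8)
        if rel_change > tau:
            deltas.append(z[t + 1] - z[t])                  # embedding difference Δz
            directions.append(1 if y[t + 1] > y[t] else 0)  # 1 = affect increased
    return np.array(deltas), np.array(directions)
```

Discarding near-constant pairs is what buys robustness: the learner only ever sees transitions whose direction annotators are likely to agree on.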
### Loss & Training
- Binary cross-entropy loss
- Adam optimizer, up to 25 iterations
- L2 regularization with \(\alpha = 1\); early stopping after 3 non-improving iterations
- Two temporal window sizes: 3s and 5s
- Two relative thresholds: 10% and 20%
- 15-fold cross-validation (SEWA) / 15 random seeds (Aff-Wild2)
- Only 129–230K trainable parameters (MLP head), making the model extremely lightweight
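Under the stated configuration (binary cross-entropy, L2 with \(\alpha = 1\), up to 25 iterations), the direction head can be sketched as a small numpy MLP. Plain gradient descent stands in for Adam here, and the hidden width and learning rate are assumptions:

```python
import numpy as np

def train_direction_mlp(dz, y, hidden=32, lr=1e-2, epochs=25, alpha=1.0, seed=0):
    """Two-layer MLP with sigmoid output, trained with BCE + L2 regularization.
    Plain gradient descent sketch; the paper uses Adam. Hidden width is assumed."""
    rng = np.random.default_rng(seed)
    d = dz.shape[1]
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(dz @ W1 + b1)                          # hidden layer
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))           # sigmoid direction prob
        g = (p - y[:, None]) / len(y)                      # BCE gradient w.r.t. logits
        gW2 = h.T @ g + alpha * W2                         # L2-regularized gradients
        gh = (g @ W2.T) * (1.0 - h ** 2)                   # backprop through tanh
        gW1 = dz.T @ gh + alpha * W1
        W2 -= lr * gW2; b2 -= lr * g.sum(0)
        W1 -= lr * gW1; b1 -= lr * gh.sum(0)
    return lambda x: (1.0 / (1.0 + np.exp(-(np.tanh(x @ W1 + b1) @ W2 + b2)))).ravel()
```

With a few hundred input dimensions and a narrow hidden layer, the parameter count stays in the low hundreds of thousands, consistent with the 129–230K figure reported above.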
## Key Experimental Results

### Main Results (Aff-Wild2, Comparison with SOTA, 5s/20% Configuration)
| Modality | Method | Arousal | Valence |
|---|---|---|---|
| Visual | VGGFace2 | 0.71 | 0.72 |
| Visual | SwinFace | 0.74 | 0.73 |
| Visual | MAE-Face | 0.72 | 0.71 |
| Visual | LaScA | 0.74 | 0.74 |
| Audio | Wav2Vec2 | 0.71 | 0.60 |
| Audio | MAE-Audio | 0.69 | 0.60 |
| Audio | LaScA | 0.72 | 0.58 |
| Multimodal | HiCMAE | 0.75 | 0.63 |
| Multimodal | MMA-DFER | 0.75 | 0.63 |
| Multimodal | LaScA | 0.74 | 0.61 |
### Best Visual-Modality Results on SEWA DB (5s/20%)
| Method | Arousal | Valence |
|---|---|---|
| SwinFace | 0.71 | 0.82 |
| MAE-Face | 0.70 | 0.81 |
| LaScA | 0.70 | 0.83 |
### Ablation Study
| Configuration | Arousal (5s/20%) | Valence (5s/20%) | Note |
|---|---|---|---|
| Features only | 0.55 | 0.52 | Weakest baseline |
| Sentence Transformer only | 0.60 | 0.67 | Semantics alone are informative |
| Fusion (F) | 0.74 | 0.74 | Best overall |
| Feature-based lexicon | 0.74 | 0.61 | Feature names as descriptions |
| LLM-based lexicon | 0.74 | 0.63 | LLM descriptions are superior |
## Key Findings
- Fusion is consistently effective: Across visual, audio, and multimodal settings, the fused variant always outperforms both feature-only and text-only variants.
- Larger gains on SEWA: In conversational interaction scenarios, semantic context provides greater compensatory benefit (feature-only performance is near chance at ~50%).
- 5s window > 3s window: Longer temporal context benefits affective modeling.
- Arousal gains > Valence gains: Semantic priors are more helpful for modeling affective intensity.
- Encoder choice has limited impact: After fusion, performance differences across sentence encoders are small, indicating that the fusion strategy matters more than the choice of encoder.
## Highlights & Insights
- "Augment, don't replace" paradigm: A clear departure from end-to-end black-box models, preserving interpretability.
- Elegant design of the deterministic lexicon: The LLM is used only once offline; inference is fully deterministic, efficient, and reproducible.
- Extremely lightweight: Only 129–230K trainable parameters; inference takes 80–140ms per sample on a laptop GPU, making it suitable for real-time deployment.
- Cross-dataset consistency: Effective on both laboratory-grade (SEWA) and in-the-wild (Aff-Wild2) datasets.
- Otsu thresholding for feature selection: A simple yet effective unsupervised saliency estimator that avoids the additional complexity of learned gating.
## Limitations & Future Work
- All encoders are fully frozen; selective fine-tuning may yield further performance gains.
- SEWA experiments are limited to pre-extracted acoustic features (no access to raw audio), precluding evaluation of end-to-end audio models.
- The lexicon is fixed; cross-cultural and multilingual scenarios would require adaptive lexicons.
- Only local temporal dynamics between adjacent windows are modeled; long-range temporal modeling (e.g., sequence encoders, temporal attention) is absent.
- The framework has not been extended to discrete emotion categories or higher-dimensional affective representations.
## Related Work & Insights
- Affective dynamics prediction paradigm: Using relative direction (increase/decrease) instead of absolute value prediction mitigates annotation noise.
- LLMs as semantic priors: Unlike approaches that integrate LLMs directly into end-to-end architectures, LaScA distills LLM knowledge into a fixed lexicon.
- Sentence Transformers: General-purpose sentence encodings provide plug-and-play semantic representations for downstream tasks.
- Insight: The "small model + large-model lexicon" hybrid paradigm is worth broader adoption—it leverages LLM knowledge without incurring its computational cost.
## Rating
- Novelty: ⭐⭐⭐⭐ — LLM semantic lexicon combined with handcrafted feature fusion for affective modeling is a novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluation spans multiple datasets, modalities, and encoders.
- Writing Quality: ⭐⭐⭐⭐ — Structure is clear, though the abundance of tables slightly impedes reading flow.
- Value: ⭐⭐⭐⭐ — Provides an efficient and practical solution for interpretable affective computing.