Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition¶
Conference: AAAI 2026 arXiv: 2512.17946 Code: https://github.com/ZoeyHuang-paper/MoFi Area: Music Emotion Recognition / Symbolic Music Understanding Keywords: symbolic music emotion recognition, MIDIBERT, mode injection, FiLM, music psychology
TL;DR¶
The MoGE diagnostic stage systematically shows that MIDIBERT fails to encode mode–emotion associations. The proposed MoFi injection framework uses FiLM conditioning to inject major/minor priors into Layer 1 of MIDIBERT, identified by probing as the layer carrying the least emotional information. This yields 75.2% accuracy on EMOPIA (+11.8 points over the MIDIBERT baseline) and 59.1% on VGMIDI (+11.8 points), with F1 gains of 12.3 and 15.5 points, respectively.
Background & Motivation¶
Symbolic Music Emotion Recognition (SMER) is a core task in symbolic music understanding. Transformer-based pretrained models such as MIDIBERT excel at semantic comprehension but exhibit fundamental deficiencies in emotion recognition:
| Problem | Cause | Impact |
|---|---|---|
| Missing mode–emotion association | MLM pretraining focuses on token reconstruction without explicit incentive to learn mode | The core music-psychological principle of major (happy) / minor (sad) is not encoded |
| Extremely small SMER datasets | EMOPIA has only 1,087 clips; VGMIDI has only 200 tracks | Fine-tuning cannot automatically learn emotional features from limited data |
| Black-box fine-tuning | Fine-tuning proceeds without diagnosing which layer lacks which knowledge | Injection strategies lack specificity and may be applied at the wrong layer |
Music psychology basis: Empirical studies consistently demonstrate that major mode → high-valence positive emotions (happiness, brightness) and minor mode → low-valence negative emotions (sadness, melancholy). This association is relatively invariant to key: C major and G major convey similar emotions, as the determining factor is interval structure rather than absolute pitch. This regularity has been repeatedly validated by Kastner & Crowder, Gerardi & Gerken, Dalla Bella, and others.
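The key-invariance argument can be made concrete with a few lines of Python (an illustration, not code from the paper): building a major scale from any tonic produces the same interval pattern, which is why C major and G major share the same modal character.

```python
# Illustration only: the major scale's interval pattern is independent
# of the tonic, so mode is determined by interval structure, not pitch.

MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]  # whole/half steps of the major scale

def scale(tonic: int) -> list[int]:
    """Build a one-octave major scale upward from a MIDI tonic pitch."""
    pitches = [tonic]
    for step in MAJOR_STEPS:
        pitches.append(pitches[-1] + step)
    return pitches

def intervals(pitches: list[int]) -> list[int]:
    """Successive semitone intervals of a pitch sequence."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

c_major = scale(60)  # C4
g_major = scale(67)  # G4
# Different absolute pitches, identical interval structure:
assert intervals(c_major) == intervals(g_major) == MAJOR_STEPS
```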
Method¶
Overall Architecture¶
A two-stage approach: (1) MoGE Diagnosis—systematically identifies MIDIBERT's knowledge gaps and optimal injection location via data augmentation experiments and layer-wise probing; (2) MoFi Injection—injects mode priors into the identified target layer via a FiLM conditioning module.
Key Designs¶
- MoGE Diagnosis — Data Augmentation Experiment
- Mode-preserving pitch transposition is applied to EMOPIA (all notes shifted uniformly within a single octave, preserving interval structure → mode unchanged).
- Result: accuracy improves from 67.5% (original) to 72.3% (+4.8 points) with augmentation.
- Conclusion: MIDIBERT does not encode mode–emotion associations; if it did, mode-preserving transposition would add no new information and augmentation would yield no significant gain.
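A minimal sketch of such mode-preserving augmentation (a hypothetical helper, not the paper's implementation; the piano-range bounds are an assumption):

```python
import random

def transpose_mode_preserving(pitches, max_shift=11, low=21, high=108, rng=None):
    """Shift all MIDI pitches by one common offset of less than an octave.

    A uniform shift preserves every interval, hence the mode; only the
    key changes. `low`/`high` bound the usable piano range (assumption).
    """
    rng = rng or random.Random()
    shift = rng.randint(-max_shift, max_shift)
    # Clamp the shift so no note leaves the valid pitch range.
    shift = max(low - min(pitches), min(high - max(pitches), shift))
    return [p + shift for p in pitches]

original = [60, 64, 67, 72]
shifted = transpose_mode_preserving(original, rng=random.Random(7))
# Interval structure (and therefore mode) is unchanged:
assert [b - a for a, b in zip(shifted, shifted[1:])] == [4, 3, 5]
```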
- MoGE Diagnosis — Layer-wise Probing
- All 12 layers of MIDIBERT are frozen; a trainable self-attention head and classifier are attached at each layer and fine-tuned as a probe of that layer's representations.
- Finding: middle layers yield the best performance; the bottom layer (Layer 1) contains the weakest emotional information; upper layers are specialized for MLM.
- Conclusion: Layer 1 is the optimal target for knowledge injection.
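The intuition behind layer-wise probing can be shown with a toy example (pure NumPy logistic-regression probes, not the paper's attention-head probe): freeze the per-layer features, train a small classifier on each, and compare readout accuracy — the layer whose features are easiest to read out is where the information lives.

```python
import numpy as np

def train_probe(X, y, lr=0.5, epochs=300):
    """Fit a tiny logistic-regression probe on frozen features X."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad = p - y                            # dLoss/dlogit for BCE
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y):
    w, b = train_probe(X, y)
    preds = (X @ w + b) > 0
    return (preds == y).mean()

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
# Hypothetical frozen activations: one "layer" encodes the label, one is noise.
informative = np.column_stack([y + 0.3 * rng.normal(size=200),
                               rng.normal(size=200)])
noise = rng.normal(size=(200, 2))
# The informative layer is far easier to read out, mirroring how probing
# locates which MIDIBERT layer carries (or lacks) emotional information.
assert probe_accuracy(informative, y) > probe_accuracy(noise, y)
```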
- Mode Extraction
- The Krumhansl-Kessler (K-K) algorithm is used to automatically extract mode from MIDI (cognitively motivated, fitting human tonal perception).
- Only binary major/minor classification is adopted (simplified to reduce noise; rare modes such as Dorian/Lydian have insufficient samples and inconsistent emotional characteristics).
- Output is a one-hot vector.
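A compact sketch of K-K-style binary mode extraction (the profile values are Krumhansl's published probe-tone ratings; the function name, one-hot convention, and duration weighting are assumptions about how the paper applies the algorithm):

```python
import numpy as np

# Krumhansl–Kessler key profiles from probe-tone experiments.
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_mode(pitches, durations=None):
    """Binary major/minor estimate via K-K profile correlation.

    Correlates the (optionally duration-weighted) pitch-class histogram
    against all 24 rotated profiles and returns the winning mode as a
    one-hot [major, minor] pair.
    """
    pitches = np.asarray(pitches)
    durations = np.ones(len(pitches)) if durations is None else np.asarray(durations)
    hist = np.zeros(12)
    np.add.at(hist, pitches % 12, durations)
    best_r, best_mode = -np.inf, 0
    for tonic in range(12):
        for mode, profile in enumerate([KK_MAJOR, KK_MINOR]):
            r = np.corrcoef(hist, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best_r, best_mode = r, mode
    return [1, 0] if best_mode == 0 else [0, 1]

# C major arpeggio -> major; C minor arpeggio -> minor.
assert estimate_mode([60, 64, 67, 72, 64, 60]) == [1, 0]
assert estimate_mode([60, 63, 67, 72, 63, 60]) == [0, 1]
```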
- MoFi — FiLM Injection
- A FiLM conditioning module is inserted between the Compound Word embedding layer and the first Transformer layer.
- Mode one-hot vector \(c\) → parameter generation network \(f_\text{cond}\) → scale factor \(\gamma\) and shift factor \(\beta\).
- Affine transformation: \(\text{FiLM}(x, c) = \gamma \odot x + \beta\)
- Initialization with \(\gamma=1, \beta=0\) ensures training stability, allowing the model to gradually incorporate mode information starting from pretrained representations.
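A minimal NumPy sketch of the FiLM step under stated assumptions (768-d embeddings, 2-d one-hot condition, and a single linear layer per generated parameter standing in for \(f_\text{cond}\); not the paper's code):

```python
import numpy as np

class FiLM:
    """Minimal FiLM conditioning sketch (shapes and names are assumptions)."""

    def __init__(self, cond_dim=2, hidden_dim=768):
        # Zero-initialized generators => gamma = 1, beta = 0 at the start
        # of training, so the module begins as an identity map.
        self.W_gamma = np.zeros((cond_dim, hidden_dim))
        self.W_beta = np.zeros((cond_dim, hidden_dim))

    def __call__(self, x, c):
        gamma = 1.0 + c @ self.W_gamma  # residual form keeps gamma near 1
        beta = c @ self.W_beta
        return gamma * x + beta         # FiLM(x, c) = gamma ⊙ x + beta

film = FiLM()
x = np.ones((4, 768))              # 4 token embeddings
c = np.array([1.0, 0.0])           # one-hot: major
assert np.allclose(film(x, c), x)  # identity at initialization
```

The identity-at-init property is what lets fine-tuning start from the unmodified pretrained representations and fold mode information in gradually.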
Loss & Training¶
Cross-entropy loss over Russell 4Q four-class classification (HVHA/LVHA/LVLA/HVLA, where H/L = high/low, V = valence, A = arousal). MIDIBERT: 12 layers / 12 heads / 768 hidden dimensions / 111M parameters. Batch size 16 (EMOPIA) / 8 (VGMIDI); fine-tuned on a single RTX 3090 GPU for ≤20 epochs with early stopping (patience = 3); total training time <30 minutes.
Key Experimental Results¶
Main Results: Comparison with Existing Symbolic Music Models¶
| Method | Type | EMOPIA Acc↑ | EMOPIA F1↑ | VGMIDI Acc↑ | VGMIDI F1↑ |
|---|---|---|---|---|---|
| SVM | Traditional ML | 0.477 | 0.476 | 0.451 | 0.377 |
| LSTM-Attn | RNN | 0.647 | 0.563 | 0.417 | 0.260 |
| MIDIGPT | GPT | 0.587 | 0.572 | 0.538 | 0.505 |
| MT-MIDIBERT | Multi-task | 0.676 | 0.664 | 0.498 | 0.453 |
| BiLMA | Transformer | 0.708 | 0.631 | 0.572 | 0.478 |
| MIDIBERT (baseline) | Pretrained | 0.634 | 0.628 | 0.473 | 0.432 |
| MoFi (Ours) | Prior Injection | 0.752 | 0.751 | 0.591 | 0.587 |
Ablation Study¶
| Configuration | EMOPIA Acc↑ | VGMIDI Acc↑ | VGMIDI F1↑ | Notes |
|---|---|---|---|---|
| Full MoFi | 0.752 | 0.591 | 0.587 | Mode injection at Layer 1 |
| w/o mode injection | 0.716 | 0.500 | 0.365 | No FiLM module |
| Inject at Layer 6 (middle) | 0.734 | 0.552 | 0.513 | Suboptimal position |
| Inject at last layer | 0.721 | 0.528 | 0.489 | Further degraded |
| Data augmentation only, no injection | 0.723 | — | — | Augmentation helps but is insufficient |
Key Findings¶
- Mode injection yields a more pronounced effect on VGMIDI (F1: 0.365→0.587, +60.8%), as smaller datasets rely more heavily on prior knowledge to compensate for data scarcity.
- Layer 1 injection significantly outperforms middle and final layer injection, validating the effectiveness of layer-wise probing diagnosis.
- F1 and accuracy are closely aligned (0.752 vs. 0.751), indicating balanced four-quadrant classification without severe class bias.
- Data augmentation alone improves accuracy to 72.3%; MoFi further raises it to 75.2%, demonstrating complementarity.
Highlights & Insights¶
- Diagnose-then-inject as a general paradigm: systematically identifying knowledge gaps in pretrained models before targeted domain prior injection, transferable to other domains.
- FiLM injection introduces minimal parameters (only two additional linear layers), making it parameter-efficient.
- Using only binary mode (major/minor) effectively reduces noise and aligns with the requirements of four-class classification.
- An innovative integration of music psychology theory with deep learning, yielding strong interpretability.
- The \(\gamma=1, \beta=0\) initialization ensures a smooth transition from pretrained representations.
Limitations & Future Work¶
- Only binary major/minor mode is considered, ignoring emotional distinctions of intermediate modes such as Dorian, Lydian, and Mixolydian.
- Validation is limited to MIDIBERT; other symbolic music pretrained models such as MusicBERT and PopMAG are not evaluated.
- VGMIDI contains only 200 tracks, potentially leading to high variance in results.
- The Russell 4Q emotion taxonomy has coarse granularity; continuous valence-arousal regression is not explored.
- Only single-instrument (piano) settings are addressed; mode extraction and injection for multi-instrument arrangements remain unverified.
Related Work & Insights¶
| Direction | Representative Methods | Difference from This Work |
|---|---|---|
| Traditional SMER | SVM + handcrafted features | Cannot capture long-range temporal dependencies; heavy feature engineering |
| Pretrained models | MIDIBERT, MusicBERT | MLM objective limited to token-level reconstruction; lacks emotional priors |
| Multi-task learning | MT-MIDIBERT | Implicitly enhances via auxiliary tasks; does not introduce music-theoretic priors |
| Conditioning techniques | FiLM (computer vision) | This work is the first to apply FiLM to symbolic music emotion recognition |
Rating¶
- Novelty: ⭐⭐⭐⭐ Innovative integration of music psychology and deep learning; the diagnose-then-inject paradigm is original.
- Experimental Thoroughness: ⭐⭐⭐ Datasets are small (EMOPIA 1,087; VGMIDI 200), but ablation and diagnostic experiments are systematic.
- Writing Quality: ⭐⭐⭐⭐ Logical progression from diagnosis to solution; reasoning is clear and well-structured.
- Value: ⭐⭐⭐⭐ Provides an interpretable, theory-driven approach for symbolic music understanding with generalizable paradigm.