Empaths at SemEval-2025 Task 11: Retrieval-Augmented Approach to Perceived Emotions Prediction¶
Conference: ACL 2025 (SemEval Workshop)
arXiv: 2506.04409
Code: None
Area: Information Retrieval
Keywords: Emotion Detection, RAG, Multilingual, LLM Ensembling, Multi-label Classification
TL;DR¶
This paper proposes the EmoRAG system, which combines a retrieval-augmented generation (RAG) pipeline with multi-LLM ensemble aggregation. Without any additional training, it achieves competitive results across 28 languages on the SemEval-2025 Task 11 multi-label emotion detection task, with an average F1-micro score of 0.638.
Background & Motivation¶
SemEval-2025 Task 11 focuses on perceived emotion detection: determining which emotions most readers would infer the speaker is experiencing from a given text (joy, sadness, fear, anger, surprise, disgust, or neutral). The task analyzes neither the emotions evoked in the reader nor the speaker's true emotions; it targets the socially shared interpretation of the emotions a text expresses.
Key Challenges:
- Multilingual Coverage: Covers 28 languages, including many low-resource languages (Hausa, Kinyarwanda, Emakhuwa, isiZulu, etc.).
- Multi-label Task: Each text segment may contain multiple emotions simultaneously.
- Cultural Differences: Significant differences exist in emotion expression and interpretation across different cultural backgrounds.
Traditional methods (finetuning pre-trained Transformers + linear classification heads) perform well in monolingual scenarios, but cross-lingual generalization faces additional challenges from linguistic variability and cultural differences.
Mechanism of EmoRAG: Because RAG requires no training, the annotated training data can serve directly as a retrieval corpus. At inference time the model consults relevant labeled emotion examples, which is intended to improve cross-lingual and cross-cultural robustness.
Method¶
Overall Architecture¶
EmoRAG consists of four components connected in series: Database → Retriever → Generators (LLM Ensemble) → Aggregation Model.
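The first two stages (database and retriever) and the hand-off to the generators can be sketched as follows. This is a minimal illustration, not the authors' implementation: the character n-gram Dice score is a simple stand-in for the n-gram retriever, and the function names, the toy `database`, and the prompt wording are all assumptions made for the example.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Multiset of character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))

def ngram_overlap(a, b, n=3):
    """Dice coefficient over character n-gram multisets (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((ga & gb).values())
    return 2 * shared / (sum(ga.values()) + sum(gb.values()))

def retrieve(query, database, k):
    """Return the k labeled examples with the highest n-gram overlap."""
    ranked = sorted(database, key=lambda ex: ngram_overlap(query, ex["text"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, examples):
    """Format retrieved examples as few-shot demonstrations (English instructions)."""
    lines = [
        "List every emotion most readers would perceive in the text.",
        "Allowed labels: joy, sadness, fear, anger, surprise, disgust.",
        "",
    ]
    for ex in examples:
        lines.append(f"Text: {ex['text']}")
        lines.append(f"Emotions: {', '.join(ex['labels']) or 'neutral'}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Emotions:")
    return "\n".join(lines)

# Hypothetical annotated training data acting as the retrieval database.
database = [
    {"text": "I can't stop smiling today", "labels": ["joy"]},
    {"text": "I am so scared of the dark", "labels": ["fear"]},
    {"text": "This is disgusting and it makes me angry", "labels": ["anger", "disgust"]},
]

query = "I am scared of spiders"
shots = retrieve(query, database, k=2)  # the summary reports K=30 or K=100
prompt = build_prompt(query, shots)     # sent to each LLM in the ensemble
```

In this sketch the same `prompt` would go to every generator, and the per-model label sets would then be passed to the aggregation step.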
Key Designs¶
- Database:
    - Built directly using annotated training data.
    - Based on the BRIGHTER dataset (multi-label emotion annotations in 28 languages) and the EthioEmo dataset (4 Ethiopian languages).
- Retriever:
    - Comparison of two retrievers:
        - n-gram Retriever (from the LangChain module): Hypothesized to be better for low-resource languages because it relies on surface text features.
        - BGE-M3 Sentence Embedding Retriever: A multilingual embedding model.
    - K-value setting: K=30 for low-resource languages (due to high token consumption), K=100 for high-resource languages.
    - Retrieved samples are used as few-shot prompts for the LLMs.
- Generators (Decoder Models):
    - Uses an ensemble of four LLMs:
        - Llama-3.1-70B
        - Qwen2.5-72B-Instruct
        - gpt-4o-mini
        - gemma-2-27b-it
    - System prompts are all in English (experiments showed that English prompts outperform target language prompts).
    - Each model independently outputs emotion predictions.
- Aggregation Strategies:
    - Single Model: Direct use of a single model's output.
    - Majority Vote: The prediction for each label is determined by a majority vote of all models.
    - Macro/Micro Weighted Vote: Each model's vote is weighted by its overall F1-macro or F1-micro score on the development set.
    - Label-F1 Weighted Vote: Votes are weighted independently for each label, using each model's development-set F1 score on that label.
    - GPT-4o Aggregation: Feeds all model results and few-shot examples into gpt-4o-mini for aggregation.
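The difference between plain majority voting and Label-F1 weighted voting can be illustrated with a small sketch. The decision rule used here (accept a label when the F1-weighted "yes" votes exceed half of the total weight for that label) is an assumption, since the summary does not state the exact threshold; the model names and F1 values are made up for the example.

```python
def majority_vote(predictions, labels):
    """Keep a label if more than half of the models predicted it."""
    n_models = len(predictions)
    return [
        label for label in labels
        if sum(label in pred for pred in predictions.values()) > n_models / 2
    ]

def label_f1_weighted_vote(predictions, label_f1, labels):
    """Keep a label if the models voting for it hold more than half of the
    total per-label weight, where each weight is that model's dev-set F1
    on that specific label (assumed decision rule)."""
    chosen = []
    for label in labels:
        total = sum(label_f1[model][label] for model in predictions)
        yes = sum(label_f1[model][label]
                  for model, pred in predictions.items() if label in pred)
        if total > 0 and yes > total / 2:
            chosen.append(label)
    return chosen

labels = ["joy", "anger"]

# Hypothetical per-label dev-set F1 scores for three models.
label_f1 = {
    "model_a": {"joy": 0.9, "anger": 0.4},
    "model_b": {"joy": 0.5, "anger": 0.8},
    "model_c": {"joy": 0.3, "anger": 0.7},
}

# Hypothetical predictions for one text.
predictions = {
    "model_a": {"joy"},
    "model_b": {"anger"},
    "model_c": set(),
}

plain = majority_vote(predictions, labels)
weighted = label_f1_weighted_vote(predictions, label_f1, labels)
```

Here `plain` is empty because no label gets two of three votes, while `weighted` keeps `joy`: the single model voting for it is also the one that is most reliable on that label.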
Loss & Training¶
- Training-Free: No model finetuning is performed; it relies entirely on retrieval and generation during inference.
- The optimal aggregation strategy and retriever configuration are determined solely on the dev set.
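Since nothing is trained, the only "fitting" step is picking a configuration on the dev set. A minimal sketch of that selection, assuming F1-micro is the selection metric and using hypothetical configuration names and toy predictions:

```python
def f1_micro(gold, pred):
    """Micro-averaged F1 over multi-label predictions given as sets of labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present
        fp += len(p - g)   # labels predicted but absent
        fn += len(g - p)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_config(dev_gold, dev_outputs):
    """Score every candidate configuration on the dev set and return the best."""
    scores = {name: f1_micro(dev_gold, preds) for name, preds in dev_outputs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy dev set for one language: gold label sets per text.
dev_gold = [{"joy"}, {"fear", "sadness"}, set()]

# Hypothetical outputs of two retriever/aggregation configurations.
dev_outputs = {
    "embedding+majority_vote": [{"joy"}, {"fear"}, {"anger"}],
    "ngram+label_f1_vote": [{"joy"}, {"fear", "sadness"}, set()],
}

best, scores = select_config(dev_gold, dev_outputs)
```

Running this selection separately per language would yield a per-language choice of retriever and aggregation strategy, which is how the "Best Model" column in the results table can differ across languages.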
Key Experimental Results¶
Main Results — Performance on Representative Languages in Test Set (Table)¶
| Language | Best Model | Dev F1-micro | Dev F1-macro | Test F1-micro | Test F1-macro |
|---|---|---|---|---|---|
| English | L-F1 Vote | 0.821 | 0.818 | 0.807 | 0.789 |
| Spanish | L-F1 Vote | 0.813 | 0.809 | 0.820 | 0.817 |
| Russian | L-F1 Vote | 0.880 | 0.880 | 0.883 | 0.879 |
| Hindi | L-F1 Vote | 0.842 | 0.849 | 0.866 | 0.866 |
| Marathi | L-F1 Vote | 0.943 | 0.947 | 0.856 | 0.864 |
| Hausa | L-F1 Vote | 0.735 | 0.731 | 0.704 | 0.695 |
| Swahili | L-F1 Vote | 0.440 | 0.409 | 0.430 | 0.386 |
| Emakhuwa | gpt-4o-mini | 0.300 | 0.211 | 0.256 | 0.216 |
Average Performance of Each Model on Dev Set (Table)¶
| Model/Strategy | F1-micro | F1-macro |
|---|---|---|
| llama-3.1-70b | 0.563 | 0.515 |
| qwen2.5-72b | 0.590 | 0.556 |
| gpt-4o-mini | 0.631 | 0.590 |
| gpt-4o-mini + n-gram | 0.641 | 0.601 |
| gemma-2-27b | 0.617 | 0.576 |
| majority_vote | 0.661 | 0.617 |
| majority_vote_by_label_f1 | 0.678 | 0.634 |
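For reading the two score columns, the standard definitions of the metrics are given here as an assumption, since the summary does not restate them. With \(TP_l\), \(FP_l\), and \(FN_l\) the true positives, false positives, and false negatives for emotion label \(l\) in the label set \(\mathcal{L}\):

\[
F1_{\text{micro}} = \frac{2 \sum_{l \in \mathcal{L}} TP_l}{2 \sum_{l \in \mathcal{L}} TP_l + \sum_{l \in \mathcal{L}} FP_l + \sum_{l \in \mathcal{L}} FN_l},
\qquad
F1_{\text{macro}} = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \frac{2\, TP_l}{2\, TP_l + FP_l + FN_l}
\]

F1-micro pools counts over all labels, so frequent emotions dominate it, whereas F1-macro gives every label equal weight. A model that does poorly on rare emotions will therefore tend to show a lower F1-macro than F1-micro, which matches the pattern in the table above.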
Key Findings¶
- Label-F1 Weighted Vote is the optimal aggregation strategy for the vast majority of languages (the best choice for 21 out of 28 languages).
- High-resource languages perform well (Russian 0.883, Hindi 0.866), while performance on low-resource languages varies significantly (Emakhuwa is only 0.256, Tigrinya is 0.260).
- The n-gram retriever is more effective for some languages, most of them low-resource: Oromo, Sundanese, and Kinyarwanda selected the n-gram configuration, as did Mandarin.
- gpt-4o-mini is the strongest single model (F1-micro 0.631, or 0.641 with the n-gram retriever), but Label-F1 weighted voting (0.678) outperforms every single model on the dev set.
- English system prompts perform better than target-language prompts, possibly because LLMs have stronger English instruction-following capabilities.
- Some languages show a large disparity between dev and test performance (German: dev 0.745 \(\rightarrow\) test 0.269; Brazilian Portuguese: dev 0.766 \(\rightarrow\) test 0.481), suggesting distribution shift issues.
Highlights & Insights¶
- Zero-Training Paradigm: The RAG + LLM ensemble completely bypasses finetuning, making it attractive for resource-constrained scenarios.
- Label-F1 Weighting Strategy is more fine-grained than simple majority voting or global weighting: different models perform optimally for different emotion labels.
- The choice of n-gram vs. embedding retriever points to a difference between low-resource and high-resource languages: embedding quality for low-resource languages may be poor, which would make surface-level feature matching more reliable there.
- Strong complementarity among multiple LLMs: individual models excel on different languages.
Limitations & Future Work¶
- Only covers 6 basic emotions plus neutral; finer-grained emotion classification was not tested.
- Insufficient handling of highly imbalanced class distributions and significant distribution shifts (e.g., German, Portuguese).
- The configuration of K values (30 vs. 100) is somewhat coarse, with no systematic hyperparameter search conducted.
- High inference cost (4 large models + retrieval), making it unsuitable for low-latency scenarios.
- No direct comparison with finetuning baselines, making it difficult to judge whether the RAG approach is truly superior to finetuning.
Related Work & Insights¶
- The limitations of traditional methods (finetuning Transformers + classification heads) on multilingual emotion tasks serve as the starting point of this work.
- The success of RAG in knowledge-intensive NLP tasks is extended to the field of emotion classification.
- The BRIGHTER dataset covers emotion annotations for 28 languages for the first time, providing a foundation for multilingual emotion research.
- EmoRAG echoes other LLM-based approaches to cross-lingual emotion detection, but emphasizes practicality and scalability.
Rating¶
- Novelty: 5/10 — The combination of RAG and LLM ensembles is an application of mature paradigms and lacks technological innovation.
- Experimental Thoroughness: 7/10 — Comprehensive coverage of 28 languages, but lacks a comparison with finetuning baselines.
- Writing Quality: 6/10 — The content is clearly organized but short, lacking in-depth analysis.
- Value: 6/10 — As a SemEval system description paper, it demonstrates the feasibility of RAG in multilingual emotion analysis.