EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis
Conference: CVPR 2026 arXiv: 2511.12554 Code: Unavailable Area: Multimodal VLM / Visual Emotion Analysis Keywords: Visual Emotion Analysis, Emotion Dataset, B-A-S Triplet, Dimensional Emotion Space, Interpretable Model
TL;DR
This paper proposes EmoVerse, a 219K-scale visual emotion dataset that achieves word-level and subject-level emotion attribution via knowledge graph-inspired Background-Attribute-Subject triplets. It provides dual emotion annotations in both discrete CES and continuous 1024-dimensional DES spaces, accompanied by a multi-stage annotation validation pipeline and an interpretable emotion model based on Qwen2.5-VL.
Background & Motivation
Background: Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Existing datasets (FI 23K, EmoSet 118K, EmoArt 130K) are coarse-grained, providing only single discrete emotion labels at the image level.
Limitations of Prior Work: limited scale; insufficient annotation reliability; no interpretable emotion grounding, so it remains unknown which visual elements evoke specific emotions; and only discrete categorical labels, which cannot express mixed emotions or intensity variation.
Key Challenge: Emotions are inherently continuous, multi-dimensional, and subjective, yet existing annotation schemes rely on discrete simplifications, constraining the depth of model understanding.
Goal: To construct a dataset with fine-grained interpretable annotations, dual-space representations, and large-scale diversity, along with a companion interpretable model.
Key Insight: Drawing inspiration from knowledge graph triplets, the paper decomposes image emotions into three semantic components — Background, Attribute, and Subject — each grounded to specific visual regions.
Core Idea: By adopting B-A-S triplets and CES+DES dual-space annotations, the paper upgrades visual emotion analysis from single-label classification to multi-level interpretable attribution.
Method
Overall Architecture
Three major components: (1) the EmoVerse dataset with 219K images and multi-level annotations, (2) an annotation validation pipeline (multi-VLM cross-validation + Critic Agent), and (3) an interpretable emotion model (fine-tuned Qwen2.5-VL).
Key Designs
- B-A-S Triplet Annotation:
- Function: Decomposes image emotion into three components — Background (scene), Attribute (atmospheric properties), and Subject (salient objects).
- Mechanism: Each B-A-S element is localized via Grounding DINO and segmented via SAM, grounding it to specific pixel regions and enabling word-level and subject-level emotion attribution.
- Design Motivation: Rather than merely knowing "this image evokes happiness," the model can further identify "which visual elements evoke happiness."
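The triplet-plus-grounding structure above can be sketched as a small data model. This is a minimal illustration, not the paper's released schema: the class and field names (`GroundedElement`, `mask_area`, `words`) are assumptions, and the grounding boxes that Grounding DINO and SAM would produce are replaced by hand-written placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedElement:
    """One B-A-S component grounded to a pixel region.

    In the paper the region comes from Grounding DINO (box) + SAM (mask);
    here the mask is reduced to a placeholder pixel count.
    """
    text: str                          # word-level annotation, e.g. "beach"
    box: Tuple[int, int, int, int]     # (x1, y1, x2, y2) bounding box
    mask_area: int = 0                 # number of mask pixels (placeholder)

@dataclass
class BASTriplet:
    """Minimal Background-Attribute-Subject unit for one image."""
    background: GroundedElement
    attribute: GroundedElement
    subject: GroundedElement

    def words(self) -> List[str]:
        """Word-level attribution: every grounded token in the triplet."""
        return [self.background.text, self.attribute.text, self.subject.text]

triplet = BASTriplet(
    background=GroundedElement("sunny beach", (0, 0, 640, 300)),
    attribute=GroundedElement("warm light", (0, 0, 640, 480)),
    subject=GroundedElement("smiling child", (200, 150, 400, 460)),
)
print(triplet.words())   # ['sunny beach', 'warm light', 'smiling child']
```

The point of the structure is that each word in the annotation carries its own region, so an emotion prediction can be traced back to a specific pixel area rather than to the image as a whole.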
- CES+DES Dual-Space Annotation:
- Function: Each image is simultaneously annotated with discrete emotion categories and continuous emotion vectors.
- Mechanism: CES adopts the Mikels 8-class model with confidence scores; DES projects emotions into a 1024-dimensional continuous emotion space via an interpretable model.
- Design Motivation: CES is intuitive and interpretable, suitable for classification tasks; DES supports emotion intensity estimation and smooth interpolation.
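A minimal sketch of how the two spaces complement each other: a discrete CES label with a confidence score, next to distance measurement in the continuous DES space. The helper names (`ces_label`, `des_distance`) are hypothetical, and 4-dimensional vectors stand in for the paper's 1024-dimensional DES embeddings.

```python
import math

# Mikels 8-class taxonomy: the discrete CES space used in the paper.
MIKELS = ["amusement", "awe", "contentment", "excitement",
          "anger", "disgust", "fear", "sadness"]

def ces_label(scores):
    """Pick the discrete CES label and its confidence from per-class scores."""
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total

def des_distance(u, v):
    """Cosine distance in the continuous DES space (4-D here for brevity;
    the paper uses 1024 dimensions)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

label, conf = ces_label({"amusement": 6, "awe": 2, "sadness": 2})
print(label, conf)                                   # amusement 0.6
print(des_distance([1, 0, 0, 0], [0, 1, 0, 0]))      # 1.0
```

Only the continuous side supports graded operations such as distances and intensity comparisons; the discrete side keeps the annotation human-readable.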
- Multi-Stage Annotation Validation Pipeline:
- Function: Automated high-quality annotation with minimal human intervention.
- Mechanism: Three stages — independent dual-VLM annotation by Gemini 2.5 and GPT-4o; contrastive calibration of emotion labels via EmoViT; consistency verification by a Critic Agent using Chain-of-Thought reasoning.
- Design Motivation: Given the high subjectivity of emotion annotation, multi-model cross-validation combined with CoT reasoning substantially improves reliability.
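The acceptance logic of such a pipeline can be reduced to a toy consensus rule. This is a deliberately simplified sketch: the paper's actual pipeline (dual-VLM annotation, EmoViT contrastive calibration, CoT critic) is richer than a majority vote, and the function below is hypothetical.

```python
from collections import Counter

def cross_validate(labels, critic_accepts):
    """Toy consensus rule for multi-stage annotation validation.

    labels: emotion labels proposed by the independent annotators
            (e.g. two VLMs plus a calibration model)
    critic_accepts: verdict of the critic agent on the consensus
    Returns the accepted label, or None to route the image to re-annotation.
    """
    label, votes = Counter(labels).most_common(1)[0]
    if votes >= 2 and critic_accepts:
        return label            # agreement confirmed by the critic
    return None                 # disagreement or critic veto

print(cross_validate(["awe", "awe", "excitement"], True))   # awe
print(cross_validate(["awe", "fear", "sadness"], True))     # None
```

The design intuition survives the simplification: no single model's judgment is trusted for a subjective label; agreement plus an independent verification step gates every annotation.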
- Interpretable Emotion Model:
- Function: Fine-tuned on Qwen2.5-VL-3B to output DES embeddings and textual attribution explanations.
- Mechanism: Two-round fine-tuning — first enhancing attribution capability using attribute annotations, then improving classification stability using category labels.
- Design Motivation: End-to-end mapping of visual cues to a continuous emotion space, with attribution explanations making predictions traceable.
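The two-round schedule can be expressed as a training skeleton. Everything here is a hedged sketch: `train_step` is a hypothetical update function, the data is a stand-in, and only the ordering (attribution data first, category labels second) reflects the paper's description.

```python
def two_round_finetune(model, attribution_data, category_data, train_step):
    """Sketch of the two-round fine-tuning described for Qwen2.5-VL-3B:
    round 1 uses attribute annotations to build attribution capability,
    round 2 uses discrete category labels to stabilize classification."""
    for batch in attribution_data:           # round 1: attribution
        model = train_step(model, batch, objective="attribution")
    for batch in category_data:              # round 2: classification
        model = train_step(model, batch, objective="classification")
    return model

# Toy stand-in: the "model" is just a log of (objective, batch) updates,
# which makes the round ordering visible.
log = two_round_finetune(
    model=[],
    attribution_data=["a1", "a2"],
    category_data=["c1"],
    train_step=lambda m, b, objective: m + [(objective, b)],
)
print(log)
# [('attribution', 'a1'), ('attribution', 'a2'), ('classification', 'c1')]
```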
Dataset Construction
- Three sources: integration of existing datasets (EmoSet + EmoArt + Flickr30K), B-A-S-driven web image collection, and AIGC images (~25K, 12.17%).
- Total scale: 219K images, each annotated with B-A-S triplets, CES 8-class labels, confidence scores, 1024-dimensional DES vectors, and subject-level bounding boxes and masks.
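Putting the annotation layers together, a per-image record might look like the following. The field names are illustrative assumptions, not the released EmoVerse schema; only the set of annotation types comes from the paper.

```python
# Hypothetical record illustrating the per-image annotation layers listed
# above; key names are invented for illustration.
record = {
    "image_id": "ev_000001",
    "bas": {                                   # B-A-S triplet (word level)
        "background": "rainy street",
        "attribute": "gloomy",
        "subject": "lone pedestrian",
    },
    "ces": {"label": "sadness", "confidence": 0.82},  # Mikels 8-class + score
    "des": [0.0] * 1024,                       # 1024-D continuous vector
    "subject_box": [120, 80, 360, 420],        # subject-level bounding box
    "subject_mask": None,                      # SAM mask (placeholder)
}
print(sorted(record.keys()))
```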
Key Experimental Results
Dataset Scale Comparison
| Dataset | Scale | Category Annotation | Description | Word-Level Annotation | Confidence | Subject-Level Annotation |
|---|---|---|---|---|---|---|
| FI | 23K | 2 classes | No | No | No | No |
| ArtEmis | 80K | 8 classes | Yes | No | No | No |
| EmoSet | 118K | 8 classes | No | No | No | No |
| EmoArt | 130K | 12 classes | Yes | No | No | No |
| EmoVerse | 219K | 8 classes + DES | Yes | Yes | Yes | Yes |
Ablation Study
| Component | Effect | Note |
|---|---|---|
| B-A-S Triplet | Effective | Provides word-level and subject-level attribution |
| DES Space | Effective | Supports continuous emotion representation |
| AIGC Data | Effectively fills long-tail | 12% generated images cover rare emotions |
| Multi-Stage Validation | High consistency | Three-model cross-validation + CoT reasoning |
Key Findings
- B-A-S triplets enable the model not only to identify the emotion type but also to explain why it arises, improving interpretability.
- DES space supports emotion interpolation and distance measurement, capabilities unavailable with discrete labels.
- AIGC data effectively fills the long tail, covering rare emotional scenarios underrepresented in real images.
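The interpolation capability mentioned above is trivial in a continuous space and undefined for discrete labels. A minimal sketch (2-D vectors stand in for the 1024-D DES space; the helper name is hypothetical):

```python
def des_lerp(u, v, t):
    """Linear interpolation between two DES vectors: t=0 gives u, t=1
    gives v. With discrete labels there is no meaningful point
    'halfway between awe and sadness'."""
    return [(1 - t) * a + t * b for a, b in zip(u, v)]

print(des_lerp([1.0, 0.0], [0.0, 1.0], 0.5))   # [0.5, 0.5]
```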
Highlights & Insights
- B-A-S triplets draw on knowledge graph principles, combining emotion analysis with structured knowledge representation — the concept of a minimal emotion knowledge unit is generalizable to other subjective attributes such as aesthetic quality.
- CES+DES dual-space design: discrete representations facilitate human understanding, while continuous representations are more amenable to machine processing; the two complement each other.
- B-A-S-driven data collection creates a positive feedback loop from annotation to retrieval to annotation.
- The annotation validation pipeline constitutes an effective paradigm for large-scale subjective annotation.
Limitations & Future Work
- Emotional subjectivity remains a fundamental challenge; cross-cultural differences are difficult to fully eliminate.
- The Mikels 8-class taxonomy may not cover compound emotions (e.g., nostalgia, bittersweetness).
- Individual dimensions of the 1024-dimensional DES space lack clear semantic interpretability.
- AIGC data (12%) may introduce biases from generative models.
- The interpretable model is based on a 3B-parameter backbone, limiting its reasoning capacity.
Related Work & Insights
- vs. EmoSet: EmoSet provides auxiliary attributes but lacks structured decomposition; EmoVerse's B-A-S triplets offer a more systematic and interpretable framework.
- vs. ArtEmis: ArtEmis includes descriptions but lacks visual grounding; EmoVerse achieves subject-level grounding via Grounding DINO + SAM.
- Extending grounding from object localization to emotion attribution represents a novel application scenario for grounding techniques.
Rating
- Novelty: ⭐⭐⭐⭐ B-A-S triplets and dual-space annotation constitute novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons with multiple datasets; annotation quality is rigorously validated.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with rich illustrations.
- Value: ⭐⭐⭐⭐ A 219K interpretable emotion dataset of significant value to the VEA community.