EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

Conference: CVPR 2026 | arXiv: 2511.12554 | Code: Unavailable | Area: Multimodal VLM / Visual Emotion Analysis | Keywords: Visual Emotion Analysis, Emotion Dataset, B-A-S Triplet, Dimensional Emotion Space, Interpretable Model

TL;DR

This paper proposes EmoVerse, a 219K-image visual emotion dataset that achieves word-level and subject-level emotion attribution via knowledge-graph-inspired Background-Attribute-Subject (B-A-S) triplets. Each image carries dual emotion annotations in both the discrete Categorical Emotion States (CES) space and a continuous 1024-dimensional Dimensional Emotion Space (DES), accompanied by a multi-stage annotation validation pipeline and an interpretable emotion model based on Qwen2.5-VL.

Background & Motivation

Background: Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Existing datasets (FI 23K, EmoSet 118K, EmoArt 130K) are coarse-grained, providing only single discrete emotion labels at the image level.

Limitations of Prior Work: limited scale; insufficient annotation reliability; no interpretable emotion grounding, so it remains unknown which visual elements evoke a given emotion; and only discrete categorical labels, which cannot express mixed emotions or intensity variations.

Key Challenge: Emotions are inherently continuous, multi-dimensional, and subjective, yet existing annotation schemes rely on discrete simplifications, constraining the depth of model understanding.

Goal: To construct a dataset with fine-grained interpretable annotations, dual-space representations, and large-scale diversity, along with a companion interpretable model.

Key Insight: Drawing inspiration from knowledge graph triplets, the paper decomposes image emotions into three semantic components — Background, Attribute, and Subject — each grounded to specific visual regions.

Core Idea: By adopting B-A-S triplets and CES+DES dual-space annotations, the paper upgrades visual emotion analysis from single-label classification to multi-level interpretable attribution.

Method

Overall Architecture

Three major components: (1) the EmoVerse dataset with 219K images and multi-level annotations, (2) an annotation validation pipeline (multi-VLM cross-validation + Critic Agent), and (3) an interpretable emotion model (fine-tuned Qwen2.5-VL).

Key Designs

  1. B-A-S Triplet Annotation:

    • Function: Decomposes image emotion into three components — Background (scene), Attribute (atmospheric properties), and Subject (salient objects).
    • Mechanism: Each B-A-S element is localized via Grounding DINO and segmented via SAM, grounding it to specific pixel regions and enabling word-level and subject-level emotion attribution (a minimal grounding sketch follows this list).
    • Design Motivation: Rather than merely knowing "this image evokes happiness," the model can further identify "which visual elements evoke happiness."
  2. CES+DES Dual-Space Annotation:

    • Function: Each image is simultaneously annotated with discrete emotion categories and continuous emotion vectors.
    • Mechanism: CES adopts the Mikels 8-class model with confidence scores; DES projects emotions into a 1024-dimensional continuous emotion space via an interpretable model.
    • Design Motivation: CES is intuitive and interpretable, suitable for classification tasks; DES supports emotion intensity estimation and smooth interpolation.
  3. Multi-Stage Annotation Validation Pipeline:

    • Function: Automated high-quality annotation with minimal human intervention.
    • Mechanism: Three stages: (1) independent dual-VLM annotation by Gemini 2.5 and GPT-4o; (2) contrastive calibration of emotion labels via EmoViT; (3) consistency verification by a Critic Agent using Chain-of-Thought reasoning (a consensus sketch also follows this list).
    • Design Motivation: Given the high subjectivity of emotion annotation, multi-model cross-validation combined with CoT reasoning substantially improves reliability.
  4. Interpretable Emotion Model:

    • Function: Fine-tuned on Qwen2.5-VL-3B to output DES embeddings and textual attribution explanations.
    • Mechanism: Two-round fine-tuning — first enhancing attribution capability using attribute annotations, then improving classification stability using category labels.
    • Design Motivation: End-to-end mapping of visual cues to a continuous emotion space, with attribution explanations making predictions traceable.
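
No code has been released, so the following is only a minimal sketch of how a B-A-S element could be grounded to pixel regions, as described in design 1 above. The helpers detect_box and segment_box are hypothetical stand-ins for Grounding DINO and SAM calls, not the authors' pipeline.

```python
from dataclasses import dataclass

# Hypothetical helpers standing in for Grounding DINO (text-conditioned
# detection) and SAM (box-prompted segmentation); placeholders only,
# not the authors' released code.
def detect_box(image, phrase: str):
    """Return the highest-scoring (x0, y0, x1, y1) box for `phrase`, or None."""
    raise NotImplementedError  # e.g. wrap a Grounding DINO checkpoint here

def segment_box(image, box):
    """Return a binary pixel mask for the region inside `box`."""
    raise NotImplementedError  # e.g. wrap a SAM checkpoint here

@dataclass
class GroundedElement:
    role: str                   # "background" | "attribute" | "subject"
    phrase: str                 # word-level annotation, e.g. "golden beach"
    box: tuple | None = None
    mask: object | None = None

def ground_bas_triplet(image, background: str, attribute: str, subject: str):
    """Ground each B-A-S component so emotions can be attributed to regions."""
    elements = [
        GroundedElement("background", background),
        GroundedElement("attribute", attribute),
        GroundedElement("subject", subject),
    ]
    for el in elements:
        el.box = detect_box(image, el.phrase)       # word -> image region
        if el.box is not None:
            el.mask = segment_box(image, el.box)    # region -> pixel mask
    return elements
```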
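Likewise, the three-stage validation of design 3 can be pictured as a small consensus loop. The callables annotate_with_vlm, calibrate_with_emovit, and critic_accepts are assumed placeholders for the Gemini 2.5 / GPT-4o annotators, the EmoViT calibrator, and the Chain-of-Thought Critic Agent; this is a sketch of the control flow, not the paper's implementation.

```python
# Mikels' eight emotion categories used by the CES annotation.
MIKELS_8 = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]

def validate_annotation(image, annotate_with_vlm, calibrate_with_emovit, critic_accepts):
    """Toy consensus loop mirroring the paper's three validation stages."""
    # Stage 1: two VLMs label the image independently.
    label_a = annotate_with_vlm(image, model="gemini-2.5")
    label_b = annotate_with_vlm(image, model="gpt-4o")

    # Stage 2: contrastive calibration of the candidate labels via EmoViT.
    calibrated = calibrate_with_emovit(image, candidates=[label_a, label_b])
    assert calibrated in MIKELS_8

    # Stage 3: on disagreement, a Critic Agent reasons step by step over the
    # image and both rationales before accepting or discarding the label.
    if label_a == label_b == calibrated:
        return calibrated
    return calibrated if critic_accepts(image, label_a, label_b, calibrated) else None
```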

Dataset Construction

  • Three sources: integration of existing datasets (EmoSet + EmoArt + Flickr30K), B-A-S-driven web image collection, and AIGC images (~25K, 12.17%).
  • Total scale: 219K images, each annotated with B-A-S triplets, CES 8-class labels, confidence scores, 1024-dimensional DES vectors, and subject-level bounding boxes and masks (an illustrative record layout is sketched below).
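
To make the annotation layout concrete, here is an assumed per-image record; the field names are illustrative and not the dataset's released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BASTriplet:
    background: str        # scene, e.g. "stormy coastline"
    attribute: str         # atmospheric property, e.g. "dim, cold light"
    subject: str           # salient object, e.g. "lone fisherman"

@dataclass
class EmoVerseRecord:
    image_path: str
    triplet: BASTriplet                       # word-level B-A-S annotation
    ces_label: str                            # one of the Mikels 8 categories
    ces_confidence: float                     # confidence attached to the CES label
    des_vector: List[float]                   # 1024-dim continuous emotion embedding
    subject_box: Tuple[int, int, int, int]    # subject-level bounding box
    subject_mask_path: str                    # path to the segmentation mask
```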

Key Experimental Results

Dataset Scale Comparison

| Dataset  | Scale | Category Annotation | Description | Word-Level Annotation | Confidence | Subject-Level Annotation |
|----------|-------|---------------------|-------------|-----------------------|------------|--------------------------|
| FI       | 23K   | 2 classes           | No          | No                    | No         | No                       |
| Artemis  | 80K   | 8 classes           | Yes         | No                    | No         | No                       |
| EmoSet   | 118K  | 8 classes           | No          | No                    | No         | No                       |
| EmoArt   | 130K  | 12 classes          | Yes         | No                    | No         | No                       |
| EmoVerse | 219K  | 8 classes + DES     | Yes         | Yes                   | Yes        | Yes                      |

Ablation Study

| Component              | Effect                          | Note                                               |
|------------------------|---------------------------------|----------------------------------------------------|
| B-A-S Triplet          | Effective                       | Provides word-level and subject-level attribution  |
| DES Space              | Effective                       | Supports continuous emotion representation         |
| AIGC Data              | Effectively fills the long tail | 12% generated images cover rare emotions           |
| Multi-Stage Validation | High consistency                | Three-model cross-validation + CoT reasoning       |

Key Findings

  • B-A-S triplets enable the model not only to identify the emotion type but also to explain why it arises, improving interpretability.
  • The DES space supports emotion interpolation and distance measurement, capabilities unavailable with discrete labels (see the vector sketch after this list).
  • AIGC data effectively fills the long tail, covering rare emotional scenarios underrepresented in real images.
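
As a concrete illustration of the second finding, 1024-dimensional DES vectors admit plain vector arithmetic. A minimal NumPy sketch, assuming roughly unit-normalized embeddings (the stand-in vectors below are random, not real DES outputs):

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Emotional dissimilarity between two DES embeddings."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def interpolate(u: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Smoothly blend two emotions (alpha=0 -> u, alpha=1 -> v)."""
    w = (1.0 - alpha) * u + alpha * v
    return w / np.linalg.norm(w)

# Example: a point "between" contentment and sadness, something a single
# discrete label cannot express.
contentment = np.random.default_rng(0).normal(size=1024)  # stand-in embedding
sadness = np.random.default_rng(1).normal(size=1024)      # stand-in embedding
bittersweet = interpolate(contentment, sadness, alpha=0.5)
print(cosine_distance(bittersweet, contentment), cosine_distance(bittersweet, sadness))
```

The blended vector sits roughly equidistant from its two endpoints, which is exactly the kind of mixed, graded emotion a single discrete label cannot represent.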

Highlights & Insights

  • B-A-S triplets draw on knowledge graph principles, combining emotion analysis with structured knowledge representation — the concept of a minimal emotion knowledge unit is generalizable to other subjective attributes such as aesthetic quality.
  • CES+DES dual-space design: discrete representations facilitate human understanding, while continuous representations are more amenable to machine processing; the two complement each other.
  • B-A-S-driven data collection creates a positive feedback loop from annotation to retrieval to annotation.
  • The annotation validation pipeline constitutes an effective paradigm for large-scale subjective annotation.

Limitations & Future Work

  • Emotional subjectivity remains a fundamental challenge; cross-cultural differences are difficult to fully eliminate.
  • The Mikels 8-class taxonomy may not cover compound emotions (e.g., nostalgia, bittersweetness).
  • Individual dimensions of the 1024-dimensional DES space lack clear semantic interpretability.
  • AIGC data (12%) may introduce biases from generative models.
  • The interpretable model is based on a 3B-parameter backbone, limiting its reasoning capacity.

Comparison with Related Work

  • vs. EmoSet: EmoSet provides auxiliary attributes but lacks structured decomposition; EmoVerse's B-A-S triplets offer a more systematic and interpretable framework.
  • vs. Artemis: Artemis includes descriptions but lacks visual grounding; EmoVerse achieves subject-level grounding via Grounding DINO + SAM.
  • Extending grounding from object localization to emotion attribution represents a novel application scenario for grounding techniques.

Rating

  • Novelty: ⭐⭐⭐⭐ B-A-S triplets and dual-space annotation constitute novel contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons with multiple datasets; annotation quality is rigorously validated.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with rich illustrations.
  • Value: ⭐⭐⭐⭐ A 219K interpretable emotion dataset of significant value to the VEA community.