EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

Conference: CVPR 2026 | arXiv: 2511.12554 | Code: Unavailable | Area: Multimodal VLM / Visual Emotion Analysis | Keywords: Visual Emotion Analysis, Emotion Dataset, B-A-S Triplet, Dimensional Emotion Space, Interpretable Model

TL;DR

This paper proposes EmoVerse, a 219K-image visual emotion dataset that achieves word-level and subject-level emotion attribution via knowledge-graph-inspired Background-Attribute-Subject (B-A-S) triplets. Each image carries dual emotion annotations in both the discrete Categorical Emotion States (CES) space and a continuous 1024-dimensional Dimensional Emotion Space (DES), accompanied by a multi-stage annotation validation pipeline and an interpretable emotion model based on Qwen2.5-VL.

Background & Motivation

Background: Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Existing datasets (FI 23K, EmoSet 118K, EmoArt 130K) are coarse-grained, providing only single discrete emotion labels at the image level.

Limitations of Prior Work: limited scale; insufficient annotation reliability; no interpretable emotion grounding, so it remains unknown which visual elements evoke a given emotion; and only discrete categorical labels, which cannot express mixed emotions or intensity variations.

Key Challenge: Emotions are inherently continuous, multi-dimensional, and subjective, yet existing annotation schemes rely on discrete simplifications, constraining the depth of model understanding.

Goal: To construct a dataset with fine-grained interpretable annotations, dual-space representations, and large-scale diversity, along with a companion interpretable model.

Key Insight: Drawing inspiration from knowledge graph triplets, the paper decomposes image emotions into three semantic components — Background, Attribute, and Subject — each grounded to specific visual regions.

Core Idea: By adopting B-A-S triplets and CES+DES dual-space annotations, the paper upgrades visual emotion analysis from single-label classification to multi-level interpretable attribution.

Method

Overall Architecture

Three major components: (1) the EmoVerse dataset with 219K images and multi-level annotations, (2) an annotation validation pipeline (multi-VLM cross-validation + Critic Agent), and (3) an interpretable emotion model (fine-tuned Qwen2.5-VL).

Key Designs

  1. B-A-S Triplet Annotation:

    • Function: Decomposes image emotion into three components — Background (scene), Attribute (atmospheric properties), and Subject (salient objects).
    • Mechanism: Each B-A-S element is localized via Grounding DINO and segmented via SAM, grounding it to specific pixel regions and enabling word-level and subject-level emotion attribution (a minimal grounding sketch follows this list).
    • Design Motivation: Rather than merely knowing "this image evokes happiness," the model can further identify "which visual elements evoke happiness."
  2. CES+DES Dual-Space Annotation:

    • Function: Each image is simultaneously annotated with discrete emotion categories and continuous emotion vectors.
    • Mechanism: CES adopts the Mikels 8-class model with confidence scores; DES projects emotions into a 1024-dimensional continuous emotion space via an interpretable model.
    • Design Motivation: CES is intuitive and interpretable, suitable for classification tasks; DES supports emotion intensity estimation and smooth interpolation.
  3. Multi-Stage Annotation Validation Pipeline:

    • Function: Automated high-quality annotation with minimal human intervention.
    • Mechanism: Three stages: (1) independent dual-VLM annotation by Gemini 2.5 and GPT-4o; (2) contrastive calibration of emotion labels via EmoViT; (3) consistency verification by a Critic Agent using Chain-of-Thought reasoning (a consensus sketch also follows this list).
    • Design Motivation: Given the high subjectivity of emotion annotation, multi-model cross-validation combined with CoT reasoning substantially improves reliability.
  4. Interpretable Emotion Model:

    • Function: Fine-tuned on Qwen2.5-VL-3B to output DES embeddings and textual attribution explanations.
    • Mechanism: Two-round fine-tuning — first enhancing attribution capability using attribute annotations, then improving classification stability using category labels.
    • Design Motivation: End-to-end mapping of visual cues to a continuous emotion space, with attribution explanations making predictions traceable.
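
No code has been released, so the following is only a minimal sketch of how a B-A-S element could be grounded to pixel regions, as described in design 1 above. The helpers detect_box and segment_box are hypothetical stand-ins for Grounding DINO and SAM calls, not the authors' pipeline.

```python
from dataclasses import dataclass

# Hypothetical helpers standing in for Grounding DINO (text-conditioned
# detection) and SAM (box-prompted segmentation); placeholders only,
# not the authors' released code.
def detect_box(image, phrase: str):
    """Return the highest-scoring (x0, y0, x1, y1) box for `phrase`, or None."""
    raise NotImplementedError  # e.g. wrap a Grounding DINO checkpoint here

def segment_box(image, box):
    """Return a binary pixel mask for the region inside `box`."""
    raise NotImplementedError  # e.g. wrap a SAM checkpoint here

@dataclass
class GroundedElement:
    role: str                   # "background" | "attribute" | "subject"
    phrase: str                 # word-level annotation, e.g. "golden beach"
    box: tuple | None = None
    mask: object | None = None

def ground_bas_triplet(image, background: str, attribute: str, subject: str):
    """Ground each B-A-S component so emotions can be attributed to regions."""
    elements = [
        GroundedElement("background", background),
        GroundedElement("attribute", attribute),
        GroundedElement("subject", subject),
    ]
    for el in elements:
        el.box = detect_box(image, el.phrase)       # word -> image region
        if el.box is not None:
            el.mask = segment_box(image, el.box)    # region -> pixel mask
    return elements
```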
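Likewise, the three-stage validation of design 3 can be pictured as a small consensus loop. The callables annotate_with_vlm, calibrate_with_emovit, and critic_accepts are assumed placeholders for the Gemini 2.5 / GPT-4o annotators, the EmoViT calibrator, and the Chain-of-Thought Critic Agent; this is a sketch of the control flow, not the paper's implementation.

```python
# Mikels' eight emotion categories used by the CES annotation.
MIKELS_8 = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]

def validate_annotation(image, annotate_with_vlm, calibrate_with_emovit, critic_accepts):
    """Toy consensus loop mirroring the paper's three validation stages."""
    # Stage 1: two VLMs label the image independently.
    label_a = annotate_with_vlm(image, model="gemini-2.5")
    label_b = annotate_with_vlm(image, model="gpt-4o")

    # Stage 2: contrastive calibration of the candidate labels via EmoViT.
    calibrated = calibrate_with_emovit(image, candidates=[label_a, label_b])
    assert calibrated in MIKELS_8

    # Stage 3: on disagreement, a Critic Agent reasons step by step over the
    # image and both rationales before accepting or discarding the label.
    if label_a == label_b == calibrated:
        return calibrated
    return calibrated if critic_accepts(image, label_a, label_b, calibrated) else None
```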

Dataset Construction

  • Three sources: integration of existing datasets (EmoSet + EmoArt + Flickr30K), B-A-S-driven web image collection, and AIGC images (~25K, 12.17%).
  • Total scale: 219K images, each annotated with B-A-S triplets, CES 8-class labels, confidence scores, 1024-dimensional DES vectors, and subject-level bounding boxes and masks (an illustrative record layout is sketched below).
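
To make the annotation layout concrete, here is an assumed per-image record; the field names are illustrative and not the dataset's released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BASTriplet:
    background: str        # scene, e.g. "stormy coastline"
    attribute: str         # atmospheric property, e.g. "dim, cold light"
    subject: str           # salient object, e.g. "lone fisherman"

@dataclass
class EmoVerseRecord:
    image_path: str
    triplet: BASTriplet                       # word-level B-A-S annotation
    ces_label: str                            # one of the Mikels 8 categories
    ces_confidence: float                     # confidence attached to the CES label
    des_vector: List[float]                   # 1024-dim continuous emotion embedding
    subject_box: Tuple[int, int, int, int]    # subject-level bounding box
    subject_mask_path: str                    # path to the segmentation mask
```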

Key Experimental Results

Dataset Scale Comparison

| Dataset  | Scale | Category Annotation | Description | Word-Level Annotation | Confidence | Subject-Level Annotation |
|----------|-------|---------------------|-------------|-----------------------|------------|--------------------------|
| FI       | 23K   | 2 classes           | No          | No                    | No         | No                       |
| Artemis  | 80K   | 8 classes           | Yes         | No                    | No         | No                       |
| EmoSet   | 118K  | 8 classes           | No          | No                    | No         | No                       |
| EmoArt   | 130K  | 12 classes          | Yes         | No                    | No         | No                       |
| EmoVerse | 219K  | 8 classes + DES     | Yes         | Yes                   | Yes        | Yes                      |

Ablation Study

| Component              | Effect                          | Note                                               |
|------------------------|---------------------------------|----------------------------------------------------|
| B-A-S Triplet          | Effective                       | Provides word-level and subject-level attribution  |
| DES Space              | Effective                       | Supports continuous emotion representation         |
| AIGC Data              | Effectively fills the long tail | 12% generated images cover rare emotions           |
| Multi-Stage Validation | High consistency                | Three-model cross-validation + CoT reasoning       |

Key Findings

  • B-A-S triplets enable the model not only to identify the emotion type but also to explain why it arises, improving interpretability.
  • The DES space supports emotion interpolation and distance measurement, capabilities unavailable with discrete labels (see the vector sketch after this list).
  • AIGC data effectively fills the long tail, covering rare emotional scenarios underrepresented in real images.
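
As a concrete illustration of the second finding, 1024-dimensional DES vectors admit plain vector arithmetic. A minimal NumPy sketch, assuming roughly unit-normalized embeddings (the stand-in vectors below are random, not real DES outputs):

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Emotional dissimilarity between two DES embeddings."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def interpolate(u: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Smoothly blend two emotions (alpha=0 -> u, alpha=1 -> v)."""
    w = (1.0 - alpha) * u + alpha * v
    return w / np.linalg.norm(w)

# Example: a point "between" contentment and sadness, something a single
# discrete label cannot express.
contentment = np.random.default_rng(0).normal(size=1024)  # stand-in embedding
sadness = np.random.default_rng(1).normal(size=1024)      # stand-in embedding
bittersweet = interpolate(contentment, sadness, alpha=0.5)
print(cosine_distance(bittersweet, contentment), cosine_distance(bittersweet, sadness))
```

The blended vector sits roughly equidistant from its two endpoints, which is exactly the kind of mixed, graded emotion a single discrete label cannot represent.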

Highlights & Insights

  • B-A-S triplets draw on knowledge graph principles, combining emotion analysis with structured knowledge representation — the concept of a minimal emotion knowledge unit is generalizable to other subjective attributes such as aesthetic quality.
  • CES+DES dual-space design: discrete representations facilitate human understanding, while continuous representations are more amenable to machine processing; the two complement each other.
  • B-A-S-driven data collection creates a positive feedback loop from annotation to retrieval to annotation.
  • The annotation validation pipeline constitutes an effective paradigm for large-scale subjective annotation.

Limitations & Future Work

  • Emotional subjectivity remains a fundamental challenge; cross-cultural differences are difficult to fully eliminate.
  • The Mikels 8-class taxonomy may not cover compound emotions (e.g., nostalgia, bittersweetness).
  • Individual dimensions of the 1024-dimensional DES space lack clear semantic interpretability.
  • AIGC data (12%) may introduce biases from generative models.
  • The interpretable model is based on a 3B-parameter backbone, limiting its reasoning capacity.

Comparison with Related Work

  • vs. EmoSet: EmoSet provides auxiliary attributes but lacks structured decomposition; EmoVerse's B-A-S triplets offer a more systematic and interpretable framework.
  • vs. Artemis: Artemis includes descriptions but lacks visual grounding; EmoVerse achieves subject-level grounding via Grounding DINO + SAM.
  • Extending grounding from object localization to emotion attribution represents a novel application scenario for grounding techniques.

Rating

  • Novelty: ⭐⭐⭐⭐ B-A-S triplets and dual-space annotation constitute novel contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons with multiple datasets; annotation quality is rigorously validated.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with rich illustrations.
  • Value: ⭐⭐⭐⭐ A 219K interpretable emotion dataset of significant value to the VEA community.