EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis
Conference: CVPR 2026 | arXiv: 2511.12554 | Code: To be confirmed | Area: Multimodal VLM | Keywords: Visual Emotion Analysis, Emotion Representation Dataset, Knowledge Graph, Interpretability, Multimodal Large Language Models
TL;DR
This paper introduces EmoVerse, the first large-scale interpretable visual emotion dataset (219K+ images) covering both CES (categorical emotion states; here the Mikels eight-class taxonomy) and DES (dimensional emotion space; here a 1024-dimensional continuous representation). It proposes a B-A-S (Background-Attribute-Subject) triplet knowledge-graph annotation scheme and an Annotation & Verification Pipeline (Gemini/GPT-4o + EmoViT + CoT Critic Agent), and fine-tunes Qwen2.5-VL-3B to perform 1024-dimensional DES projection and emotion attribution explanation.
Background & Motivation
- Background: Visual Emotion Analysis (VEA) aims to predict viewers' emotional responses from images. Existing datasets (FI, EmoSet, Instagram, etc.) predominantly adopt either discrete emotion categories (the Mikels eight-class taxonomy) or the coarse three-dimensional VAD space, resulting in limited annotation dimensions.
- Limitations of Prior Work: (1) Large-scale open-source interpretable emotion datasets are lacking — existing datasets only provide emotion category labels without explaining "why a particular emotion is evoked"; (2) discrete emotion labels (CES) fail to capture fine-grained emotional variation, and datasets with continuous representations (DES) are almost nonexistent; (3) subject-level instance localization is absent — it remains unclear which subject in an image triggers which emotion.
- Key Challenge: The VEA field urgently needs interpretability and fine-grained annotations, yet manual annotation is prohibitively expensive (1024-dimensional continuous space cannot be annotated by hand), and traditional crowdsourcing cannot cover all four dimensions: word-level, subject-level, CES, and DES.
- Goal: To construct a visual emotion dataset that simultaneously covers CES and DES, provides interpretable annotations, and is sufficiently large in scale.
- Key Insight: Leveraging MLLMs (Gemini 2.5, GPT-4o) for automatic annotation, coupled with a multi-round verification pipeline to ensure quality, and introducing a knowledge graph to structure emotion attribution.
- Core Idea: The B-A-S triplet decomposes emotional attribution into Background (scene context), Attribute (visual properties such as color and lighting), and Subject (key objects), combined with Grounding DINO + SAM for subject localization, and an MLLM-based annotation–verification–correction loop for complete pipeline closure.
Method
Overall Architecture
EmoVerse is constructed in four stages: (1) data collection and cleaning; (2) B-A-S triplet annotation with CES/DES generation; (3) multi-round verification and correction (Annotation & Verification Pipeline); and (4) subject instance localization (Grounding DINO + SAM). Qwen2.5-VL-3B is subsequently fine-tuned on this dataset as an interpretable emotion analysis model.
Key Designs
- B-A-S (Background-Attribute-Subject) Triplet Annotation:
  - Function: Decomposes image emotion attribution into a three-dimensional knowledge graph structure.
  - Mechanism: Inspired by knowledge graphs, each image is annotated with a \((B, A, S)\) triplet — \(B\) describes the scene background (e.g., "a coastline during a storm"), \(A\) describes visual attributes (e.g., "dim lighting, cool color palette"), and \(S\) describes the key subject (e.g., "a person standing alone"). The three components jointly explain the triggers of emotional response.
  - Design Motivation: Conventional annotation assigns only a single emotion label (e.g., "sadness") without rationale. The B-A-S triplet makes emotion attribution traceable and supports downstream interpretable analysis; a sketch of such a record is shown below.
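To make the B-A-S schema concrete, the following is a minimal sketch of what a single annotation record could look like; the field names and example values are illustrative assumptions, not EmoVerse's published schema.

```python
# Hypothetical EmoVerse-style record; all field names are assumptions.
record = {
    "image_id": "example_000001",
    "ces": "sadness",                      # Mikels eight-class discrete label
    "des": [0.0] * 1024,                   # 1024-dim continuous emotion vector
    "bas": {
        "background": "a coastline during a storm",       # B: scene context
        "attribute": "dim lighting, cool color palette",  # A: visual properties
        "subject": "a person standing alone",             # S: key object
    },
    "subject_bbox": [412, 180, 655, 720],       # xyxy box from Grounding DINO
    "subject_mask": "masks/example_000001.png", # pixel mask from SAM
}
```

Keeping the rationale in a structured, machine-parseable form like this is what makes automatic annotation and verification at 219K-image scale tractable.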
- Hybrid Data Sources:
  - Function: Collects 219K+ images from multiple sources to ensure emotional diversity.
  - Mechanism: (a) existing emotion datasets: EmoSet, EmoArt; (b) general-purpose datasets: Flickr30k (natural scenes); (c) web crawling: images retrieved via emotion-specific keywords; (d) AIGC generation: approximately 25K images generated by the Seedream model using emotion-conditioned prompts to supplement underrepresented emotion categories.
  - Design Motivation: Single-source datasets exhibit distributional bias (e.g., EmoSet is dominated by natural images), while AIGC generation can precisely augment the long-tail emotion distribution; a hypothetical prompt-construction sketch follows below.
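The paper's generation prompts are not published, so the snippet below is a hypothetical sketch of how emotion-conditioned prompts for the AIGC branch might be assembled from B-A-S-style components; the template wording is an assumption.

```python
# Hypothetical emotion-conditioned prompt template for long-tail augmentation.
LONG_TAIL = ["disgust", "fear"]  # underrepresented categories per the paper

TEMPLATE = (
    "A photorealistic image that evokes {emotion}: "
    "{background}, {attribute}, featuring {subject}."
)

prompt = TEMPLATE.format(
    emotion="fear",
    background="an abandoned hospital corridor at night",
    attribute="harsh flickering light, desaturated colors",
    subject="a half-open door at the far end",
)
# `prompt` would then be passed to the text-to-image model (Seedream in the paper).
print(prompt)
```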
- Annotation & Verification Pipeline:
  - Function: Leverages MLLMs for automatic annotation and ensures annotation quality through multi-round verification.
  - Mechanism: (a) Initial Annotation: Gemini 2.5 and GPT-4o independently generate emotion annotations (CES category + DES vector + B-A-S triplet) for each image; (b) Emotion Verification: a pre-trained EmoViT (emotion classification expert model) cross-validates CES label consistency; (c) CoT Critic Agent: applies Chain-of-Thought critical review to initial annotations, classifying each annotation as valid (retained), revisable (returned for re-annotation), or discarded; (d) Human Spot-Check: sampled manual verification is applied to the Critic Agent outputs.
  - Design Motivation: Noise in pure MLLM annotation is non-negligible (especially in the 1024-dimensional DES space); multi-round verification combining expert-model cross-checking and CoT critical review effectively reduces annotation noise. A control-flow sketch of this loop is given below.
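Since the pipeline implementation is not public, the following is a control-flow sketch of the annotate-verify-correct loop; the three helpers are stub stand-ins for the paper's components (MLLM annotators, EmoViT, CoT Critic Agent), and the retry budget is an assumption.

```python
# Control-flow sketch of the Annotation & Verification Pipeline.
# All helpers are hypothetical stand-ins; swap in real model calls to use it.
MAX_ROUNDS = 3  # retry budget (assumption; not specified in the paper)

def annotate_with_mllm(image):
    # Stand-in: would call Gemini 2.5 / GPT-4o for CES + DES + B-A-S.
    return {"ces": "sadness", "des": [0.0] * 1024,
            "bas": {"background": "...", "attribute": "...", "subject": "..."}}

def emovit_label(image):
    # Stand-in: would run the pre-trained EmoViT expert classifier.
    return "sadness"

def cot_critic(image, ann):
    # Stand-in: Chain-of-Thought review -> "valid" / "revisable" / "discard".
    return "valid"

def build_annotation(image):
    for _ in range(MAX_ROUNDS):
        ann = annotate_with_mllm(image)
        if emovit_label(image) != ann["ces"]:  # expert-model cross-check on CES
            continue                           # disagreement -> re-annotate
        verdict = cot_critic(image, ann)
        if verdict == "valid":
            return ann                         # retained (human spot-check follows)
        if verdict == "discard":
            return None                        # dropped from the dataset
        # "revisable": fall through and re-annotate in the next round
    return None

ann = build_annotation(image=None)  # with real models, pass the actual image
```

In this sketch a disagreement with the expert model simply triggers re-annotation; the paper's actual arbitration rules are not detailed (see Limitations).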
- Subject-Level Instance Localization:
  - Function: Provides bounding boxes and segmentation masks for the Subject component in B-A-S.
  - Mechanism: (a) the Subject text description from the B-A-S triplet is fed into Grounding DINO to obtain bounding boxes; (b) SAM (Segment Anything Model) generates pixel-level segmentation masks conditioned on the bounding-box prompt.
  - Design Motivation: Subject-level instance localization enables the model to learn "which region or object in an image evokes which emotion," supporting local emotion attribution. A sketch of the text-to-box-to-mask stage follows below.
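As an illustration of this stage, here is a sketch using the Hugging Face transformers ports of Grounding DINO and SAM; the checkpoint names and single-subject handling are assumptions (the paper does not specify its configuration), and detection thresholds are left at library defaults.

```python
# Sketch: B-A-S Subject text -> bounding box (Grounding DINO) -> mask (SAM).
import torch
from PIL import Image
from transformers import (
    AutoModelForZeroShotObjectDetection,
    AutoProcessor,
    SamModel,
    SamProcessor,
)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
subject_text = "a person standing alone."  # Grounding DINO expects lowercase + "."

# 1) Text-conditioned detection: subject description -> bounding boxes.
det_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
det_model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base"
)
det_inputs = det_proc(images=image, text=subject_text, return_tensors="pt")
with torch.no_grad():
    det_out = det_model(**det_inputs)
result = det_proc.post_process_grounded_object_detection(
    det_out, det_inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]
box = result["boxes"][0].tolist()  # top box, xyxy pixels (assumes >= 1 detection)

# 2) Box-prompted segmentation: bounding box -> pixel-level mask.
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam_model = SamModel.from_pretrained("facebook/sam-vit-base")
sam_inputs = sam_proc(image, input_boxes=[[box]], return_tensors="pt")
with torch.no_grad():
    sam_out = sam_model(**sam_inputs)
masks = sam_proc.image_processor.post_process_masks(
    sam_out.pred_masks,
    sam_inputs["original_sizes"],
    sam_inputs["reshaped_input_sizes"],
)  # list with one boolean tensor of shape (num_boxes, num_proposals, H, W)
```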
- Interpretable Emotion Model (Fine-tuning Qwen2.5-VL-3B):
  - Function: Fine-tunes a multimodal model on EmoVerse to achieve CES classification, DES projection, and emotion attribution text generation.
  - Mechanism: Two-stage fine-tuning — the first stage trains CES/DES prediction; the second stage trains B-A-S-based emotion attribution explanation generation. DES is realized through a 1024-dimensional linear projection head. Training employs cross-entropy loss (CE Loss).
  - Design Motivation: End-to-end multi-task training equips the model with both emotion prediction capability and interpretability. A minimal sketch of a DES projection head is shown below.
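The fine-tuning code is not released; below is a minimal sketch of one plausible realization of a 1024-dimensional linear DES projection head on top of a decoder's hidden states. The pooling strategy, tanh activation, and hidden size (2048, matching Qwen2.5-3B's language backbone) are assumptions, not the paper's stated design.

```python
import torch
import torch.nn as nn

class DESHead(nn.Module):
    """Hypothetical 1024-dim DES projection head for a VLM backbone.

    Pooling and activation choices here are assumptions; read `hidden_dim`
    off the actual backbone config rather than hard-coding it.
    """

    def __init__(self, hidden_dim: int = 2048, des_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, des_dim)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        pooled = last_hidden_state[:, -1, :]  # last-token pooling (assumption)
        return torch.tanh(self.proj(pooled))  # bounded continuous DES vector

head = DESHead()
hidden = torch.randn(2, 77, 2048)  # (batch, seq_len, hidden_dim) dummy features
des = head(hidden)                 # -> shape (2, 1024)
```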
Dataset Statistics
- Total size: 219K+ images
- CES coverage: Mikels 8-class (amusement, awe, contentment, excitement, anger, disgust, fear, sadness)
- DES dimensionality: 1024-dimensional continuous emotion space
- Annotation dimensions: Full coverage of word-level (emotion words), subject-level (subject localization), CES, and DES
- AIGC-generated: ~25K images (Seedream)
Key Experimental Results
Dataset Comparison
| Dataset | # Images | CES | DES | Interpretable Annotation | Subject Localization |
|---|---|---|---|---|---|
| FI | 23K | ✓ | ✗ | ✗ | ✗ |
| Instagram | 42K | ✓ | ✗ | ✗ | ✗ |
| EmoSet | 118K | ✓ | ✗ | Partial | ✗ |
| EmoArt | 80K | ✓ | ✗ | ✗ | ✗ |
| EmoVerse | 219K+ | ✓ | ✓ | ✓ (B-A-S) | ✓ (bbox+mask) |
Key Findings
- EmoVerse is the first dataset covering both CES and DES: All existing datasets lack DES annotations.
- B-A-S triplets enhance interpretability: Ablation studies show that incorporating B-A-S improves both emotion classification accuracy and the quality of attribution text.
- AIGC data effectively supplements the long tail: Removing Seedream-generated data leads to a notable performance drop on underrepresented emotion categories (disgust, fear).
- Annotation pipeline effectively reduces noise: The CoT Critic Agent filters approximately 15–20% of low-quality annotations; human spot-checks confirm pipeline output accuracy exceeds 90%.
- Fine-tuned Qwen2.5-VL-3B performance: On the EmoVerse test set, both CES classification accuracy and DES projection correlation outperform baseline methods.
Highlights & Insights
- The B-A-S triplet is the core contribution: Structuring emotion attribution as knowledge-graph-style triplets is both amenable to automatic annotation and supports downstream reasoning — more principled than free-form textual descriptions.
- Multi-round verification via MLLM + expert model + CoT Critic: This pipeline offers a reusable paradigm for constructing large-scale annotated datasets with MLLMs — not merely "annotating with GPT," but a closed-loop quality assurance process.
- AIGC for long-tail augmentation: Generating emotion-specific images on demand using generative models is an elegant data augmentation strategy, more effective than naive oversampling.
- Subject-level localization: The Grounding DINO + SAM combination advances emotion analysis from image-level to region-level, opening a new direction for local emotion attribution.
Limitations & Future Work
- The annotation quality of the 1024-dimensional DES relies entirely on MLLMs, lacking a human-verified gold standard — MLLMs may exhibit systematic biases in understanding the continuous emotion space.
- The Mikels 8-class CES taxonomy is relatively coarse-grained, omitting common emotion categories such as surprise and neutral.
- AIGC-generated images may carry style biases inherent to the generative model, creating a domain gap relative to real-world images in emotional expression.
- Qwen2.5-VL-3B is a relatively small model; the performance of larger models (7B/72B) remains unexplored.
- Evaluation is conducted solely on EmoVerse's own test set, without cross-dataset generalization experiments on external benchmarks such as FI and EmoSet.
- The threshold selection for the CoT Critic Agent's judgment criteria is insufficiently discussed.
Related Work & Insights
- vs. EmoSet: EmoSet is currently the largest visual emotion dataset (118K), but contains only CES without DES or subject localization. EmoVerse comprehensively surpasses it in scale and annotation dimensions.
- vs. EmotionCLIP: EmotionCLIP performs zero-shot emotion classification via contrastive learning but provides no interpretable attribution. The B-A-S triplets in EmoVerse directly support explanation generation.
- vs. SentiCap / ArtEmis: These works provide emotion-descriptive text for images but lack structured annotations (B-A-S) and subject localization.
- vs. Grounding DINO + SAM combination: EmoVerse demonstrates the effectiveness of the "text description → visual grounding" pipeline in the context of emotion analysis, with potential generalizability to other subjective perception tasks.
Rating
- Novelty: ⭐⭐⭐⭐ The B-A-S triplet and dual CES+DES representation system are pioneering; the multi-round MLLM verification pipeline has methodological value.
- Experimental Thoroughness: ⭐⭐⭐ Cross-dataset generalization experiments and validation with larger models are lacking.
- Writing Quality: ⭐⭐⭐⭐ The dataset construction process is described in detail, with clear motivation for each pipeline stage.
- Value: ⭐⭐⭐⭐ Fills the gap in interpretable visual emotion analysis datasets; the B-A-S annotation scheme and verification pipeline offer strong reusability.