MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation¶

Conference: CVPR 2025
arXiv: 2411.17945
Code: https://sankalpsinha-cmos.github.io/MARVEL/
Area: 3D Vision
Keywords: Text-to-3D, 3D Annotation, Dataset, Multi-Level Description, Stable Diffusion

TL;DR¶

This work constructs MARVEL-40M+, a large-scale 3D description dataset featuring 8.9 million 3D assets and over 40 million multi-level text annotations. A multi-stage automated annotation pipeline (InternVL2 + Qwen2.5) generates five levels of annotation, ranging from detailed narratives to concise tags. The dataset is then used to fine-tune Stable Diffusion 3.5, enabling high-fidelity text-to-3D generation within 15 seconds.

Background & Motivation¶

While text-to-3D (TT3D) content generation is in high demand across gaming, AR/VR, and film production, the development of this field is severely constrained by the lack of high-quality, aligned 3D-text asset pairs:

Insufficient Scale of Existing Datasets: Cap3D contains only around 1 million assets, and 3DTopia contains about 360,000, failing to cover the immense diversity of 3D models.
Low Annotation Quality: Descriptions generated by single-view VLMs (e.g., BLIP, LLaVA) are often contradictory or inconsistent, and they lack the detailed information required for fine-grained 3D reconstruction.
Poor Scalability: Methods like Cap3D and CLAY rely on proprietary models such as GPT-4, which are expensive and difficult to deploy at scale.
Lack of Domain-Specific Descriptions: Datasets like Objaverse contain diverse models ranging from characters and creatures to historical artifacts, which require domain expertise for accurate annotation.
Single-Granularity Annotations: Existing approaches offer only single-level labels and cannot adapt to the differing needs of detailed reconstruction and rapid modeling.

Key Motivation: Leverage open-source multi-view VLMs and LLMs to construct an automated and scalable 3D annotation pipeline, generating multi-level annotations to simultaneously serve both detailed reconstruction and rapid prototyping.

Method¶

Overall Architecture¶

MARVEL is a multi-stage 3D asset annotation pipeline that takes four-view renders of a 3D model and optional human metadata as input, and outputs five levels of text descriptions. The downstream application, MARVEL-FX3D, is a two-stage TT3D pipeline: it first uses a fine-tuned SD3.5 to generate an image, and then utilizes a pre-trained SF3D to convert it into a textured mesh, completing the entire process in under 15 seconds.
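The two-stage flow of MARVEL-FX3D can be sketched as a simple composition of stages. This is a minimal illustration, not the authors' code: the three stage callables and the `TwoStageTT3D` class are hypothetical placeholders, and the stubs at the bottom stand in for the real models (fine-tuned SD3.5, DIS background removal, and SF3D) so the sketch runs without any weights.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; none of these are real library APIs.
TextToImage = Callable[[str], bytes]        # prompt -> image bytes
RemoveBackground = Callable[[bytes], bytes]  # image -> foreground-only image
ImageToMesh = Callable[[bytes], bytes]       # image -> textured mesh bytes


@dataclass
class TwoStageTT3D:
    text_to_image: TextToImage            # stage 1 (fine-tuned SD3.5 in the paper)
    remove_background: RemoveBackground   # stage 2 preprocessing (DIS in the paper)
    image_to_mesh: ImageToMesh            # stage 2 reconstruction (SF3D in the paper)

    def generate(self, prompt: str) -> bytes:
        """Run text -> image -> foreground -> mesh in sequence."""
        image = self.text_to_image(prompt)
        foreground = self.remove_background(image)
        return self.image_to_mesh(foreground)


# Stub stages that only tag their input, to show the order of operations.
pipeline = TwoStageTT3D(
    text_to_image=lambda prompt: f"image({prompt})".encode(),
    remove_background=lambda image: b"fg:" + image,
    image_to_mesh=lambda image: b"mesh:" + image,
)
mesh = pipeline.generate("a wooden chair")
```

Keeping each stage behind a plain callable means either stage could be swapped for another model without touching the orchestration logic.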

Key Designs¶

Multi-View VLM-Based Dense Description Generation:
- Function: Generates unified, dense text descriptions from four standard viewpoints (front/back/left/right) of a 3D model.
- Mechanism: Uses InternVL2-40B as the multi-view VLM, taking four \(512 \times 512\) renders along with metadata-enhanced prompts as input to directly output dense descriptions covering five key aspects: (1) structural decomposition and object recognition, (2) geometric properties and symmetry, (3) surface texture and material, (4) color mapping and transition, and (5) environmental context and spatial relationships.
- Design Motivation: This avoids the "describe-then-aggregate" workflow of methods like Cap3D (reducing information loss and conflict). InternVL2-40B delivers performance close to GPT-4o at a significantly lower cost.
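To make the single-pass, multi-view setup concrete, the sketch below assembles one request from four renders and a prompt that lists the five aspects. The prompt wording, the `VLMRequest` container, and `build_request` are illustrative assumptions; the summary does not give the actual prompt used with InternVL2-40B.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

VIEWS = ("front", "back", "left", "right")

# The five aspects named in the mechanism description.
ASPECTS = (
    "structural decomposition and object recognition",
    "geometric properties and symmetry",
    "surface texture and material",
    "color mapping and transition",
    "environmental context and spatial relationships",
)


@dataclass
class VLMRequest:
    image_paths: List[str]  # one render per view, in VIEWS order
    prompt: str


def build_request(image_paths: Sequence[str], metadata: Optional[str] = None) -> VLMRequest:
    """Bundle four renders with one prompt, so the VLM describes all views jointly."""
    if len(image_paths) != len(VIEWS):
        raise ValueError(f"expected {len(VIEWS)} renders, got {len(image_paths)}")
    lines = [
        "The images show one 3D object from these views: " + ", ".join(VIEWS) + ".",
        "Write a single dense description covering:",
    ]
    lines += [f"{i}. {aspect}" for i, aspect in enumerate(ASPECTS, start=1)]
    if metadata:
        # Hypothetical phrasing for the metadata-enhanced part of the prompt.
        lines.append(f"Known metadata about the object: {metadata}")
    return VLMRequest(image_paths=list(image_paths), prompt="\n".join(lines))


request = build_request(["f.png", "b.png", "l.png", "r.png"], "Monument to Dante")
```

Because all four views travel in one request, there is no per-view caption to reconcile afterwards, which is the point of avoiding the describe-then-aggregate workflow.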
Human Metadata Fusion and Filtering:
- Function: Injects human-generated metadata from original source datasets into the annotation pipeline to reduce VLM hallucinations and introduce domain-specific information.
- Mechanism: Extracts user-generated metadata (such as names, tags, and descriptions) from datasets like Objaverse, and uses Mistral-Nemo-Instruct-2407 to filter out random, redundant, and sensitive content. The clean information relevant to 3D attributes is then injected into the VLM prompt. For example, metadata helps identify domain-specific entities like the "Monument to Dante" that a VLM cannot infer solely through visual reasoning.
- Design Motivation: Vision-only VLMs are prone to hallucination on complex 3D scenes (due to the 2D-3D domain gap). Human metadata provides strong domain priors, but the raw metadata is noisy and must be filtered first.
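The filter-then-inject step might look roughly like the following. This is a sketch under stated assumptions: `LLM` is a hypothetical prompt-in, text-out callable standing in for Mistral-Nemo-Instruct-2407, the instruction text is invented for illustration, and the sample tags are made up.

```python
from typing import Callable, Mapping

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns generated text

# Illustrative instruction; the actual filtering prompt is not given in the summary.
FILTER_INSTRUCTION = (
    "Keep only information relevant to the 3D object's identity or appearance. "
    "Remove random strings, redundant text, and sensitive content. "
    "Reply with the cleaned text only, or NONE if nothing is usable."
)


def filter_metadata(raw: Mapping[str, str], llm: LLM) -> str:
    """Ask the LLM to clean user-provided fields; return '' if nothing survives."""
    fields = [f"{key}: {value.strip()}" for key, value in raw.items() if value and value.strip()]
    if not fields:
        return ""  # nothing to filter, so skip the LLM call entirely
    reply = llm(FILTER_INSTRUCTION + "\n\n" + "\n".join(fields)).strip()
    return "" if reply.upper() == "NONE" else reply


def inject(base_prompt: str, cleaned: str) -> str:
    """Append cleaned metadata to the VLM prompt, or leave the prompt unchanged."""
    if not cleaned:
        return base_prompt
    return f"{base_prompt}\nKnown metadata about the object: {cleaned}"


# Stub LLM so the sketch runs offline; a real deployment would call the filter model.
fake_llm = lambda prompt: "Monument to Dante, marble statue"
raw = {"name": "Monument to Dante", "tags": "statue, marble, xX_123_Xx", "description": "  "}
cleaned = filter_metadata(raw, fake_llm)
vlm_prompt = inject("Describe the object.", cleaned)
```

Returning an empty string when nothing usable remains lets the annotation step fall back to purely visual description, which matches the "optional metadata" input described in the overall architecture.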
Multi-Level Visual Elaboration:
- Function: Progressively compresses the dense descriptions into five distinct levels to adapt to different downstream task requirements.
- Mechanism: Uses Qwen2.5-72B to execute a hierarchical prompting strategy, sequentially specifying which aspects to retain or compress: Level 1 (150-200 words, comprehensive description) → Level 2 (100-150 words, key structure and geometry) → Level 3 (50-100 words, functional semantics) → Level 4 (~30 words, brief summary) → Level 5 (10-20 words, list of semantic tags). Each level gradually compresses aspects like texture, color, and geometry.
- Design Motivation: Directly instructing compression can constrain the model's creative capacity (consistent with findings in recent studies). Hierarchical prompting strikes a balance between detail and brevity. Additionally, different use cases demand different granularities (e.g., fine-grained reconstruction vs. rapid modeling).
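The hierarchical prompting strategy can be sketched as a loop over the five levels. The sketch assumes each level is derived from the previous level's output, which the arrows in the level list suggest but the summary does not state outright. `LLM` is a hypothetical prompt-in, text-out callable standing in for Qwen2.5-72B, and the prompt template is invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]  # hypothetical: prompt in, generated text out

# (level, word budget, focus) following the level list in the mechanism description.
LEVELS: List[Tuple[int, str, str]] = [
    (1, "150-200 words", "a comprehensive description"),
    (2, "100-150 words", "key structure and geometry"),
    (3, "50-100 words", "functional semantics"),
    (4, "about 30 words", "a brief summary"),
    (5, "10-20 words", "a list of semantic tags"),
]


def elaborate(dense_description: str, llm: LLM) -> Dict[int, str]:
    """Produce all five levels, feeding each level's output into the next prompt."""
    outputs: Dict[int, str] = {}
    current = dense_description
    for number, budget, focus in LEVELS:
        prompt = f"Rewrite the text below in {budget}, keeping {focus}.\n\n{current}"
        current = llm(prompt).strip()
        outputs[number] = current
    return outputs


# Stub LLM that records prompts, so the chaining can be inspected offline.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary-{len(calls)}"


levels = elaborate("dense text", fake_llm)
```

With the stub, `calls[1]` ends with the Level 1 output, which shows how detail dropped at one level cannot reappear at a later one under this chaining assumption.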

Loss & Training¶

MARVEL-FX3D Stage 1: Fine-tunes Stable Diffusion 3.5 using LoRA (rank=4, alpha=4) on a training set of 798K assets from Objaverse for 5 epochs in half precision, with a batch size of 8 on a single H100 GPU.
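As a rough guide to what rank=4 and alpha=4 imply, the sketch below works through the arithmetic under the standard LoRA formulation, where a frozen weight W is updated as W + (alpha / rank) * B A with B of shape (d_out, rank) and A of shape (rank, d_in). The layer size of 2048 is a hypothetical value chosen for illustration, not a dimension taken from SD3.5.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank) for one adapted matrix."""
    return rank * (d_in + d_out)


RANK, ALPHA = 4, 4            # values reported for the Stage 1 fine-tuning
D = 2048                      # hypothetical square projection size, for illustration only

scale = lora_scale(RANK, ALPHA)                   # 1.0: the update is applied unscaled
adapter = lora_trainable_params(D, D, RANK)       # 16384 trainable parameters
full = D * D                                      # 4194304 parameters in the frozen matrix
fraction = adapter / full                         # 0.00390625, i.e. 1/256 of the matrix
```

Setting alpha equal to the rank gives a scale of 1, and a rank this small keeps the trainable fraction per adapted matrix well under one percent for layers of this size.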
Stage 2: Uses DIS to remove the background, then feeds the image into pre-trained SF3D to generate the textured mesh (5 seconds).
Annotation Pipeline Throughput: ~24,000 samples per day.
Ethical Filtering Stage: Employs Qwen2.5-14B to remove offensive or nonsensical content and overly specific personally identifiable information.

Key Experimental Results¶

Main Results¶

Dataset	Avg. Length	MTLD↑	Unigram↑	GPT-4 Win Rate↑	Human Win Rate↑
Cap3D	16 words	39.71	15,189	14.55%	9.50%
3DTopia	29 words	41.43	10,329	10.80%	14.00%
Kabra	5 words	25.85	3,862	2.24%	3.10%
MARVEL (L4)	44 words	47.43	27,659	72.41%	73.40%

Annotation Accuracy¶

Method	Avg. Length (words)	GPT-4 Accuracy	Human Accuracy
Cap3D	16	76.00%	72.80%
Kabra	5	83.40%	78.20%
MARVEL (L1)	170	84.70%	82.80%

TT3D Generation (Human Evaluation)¶

Method	Time	Prompt Fidelity	Overall Preference
Shap-E	5s	2.65	2.41
DreamFusion	30min	4.22	4.09
Lucid-Dreamer	45min	6.62	6.59
MARVEL-FX3D	15s	7.71	6.94

Ablation Study¶

Level-to-Level	Semantic Similarity	Compression Ratio
L1→L2	0.91	0.30
L2→L3	0.92	0.27
L3→L4	0.88	0.47
L4→L5	0.72	0.22

Key Findings¶

MARVEL annotations outperform existing methods across all metrics, showing 83% higher language diversity (MTLD) and a 7x larger vocabulary (unigram count) than Kabra.
Although Level 1 descriptions average 170 words (34x the length of Kabra's), they maintain a high accuracy of 84.7% under GPT-4 evaluation.
MARVEL-FX3D generates results within 15 seconds, which is 180x faster than Lucid-Dreamer while achieving higher prompt fidelity.
Fine-tuning SD3.5 on MARVEL annotations yields significant improvements over the Cap3D-fine-tuned version across all metrics, demonstrating the decisive impact of data quality.
Semantic retention remains high across L1–L4 transitions (0.88–0.92), dropping to 0.72 at L5 due to the transition into tag formats.
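The headline ratios in these findings can be re-derived from the tables above. The short check below uses only the reported table values; the dictionary layout and variable names are just for illustration.

```python
# Values copied from the Main Results, Annotation Accuracy, and TT3D tables.
mtld = {"Kabra": 25.85, "MARVEL (L4)": 47.43}
unigrams = {"Kabra": 3862, "MARVEL (L4)": 27659}
avg_words = {"Kabra": 5, "MARVEL (L1)": 170}
seconds = {"Lucid-Dreamer": 45 * 60, "MARVEL-FX3D": 15}

# Relative gain in lexical diversity (MTLD) over Kabra: about 0.83, i.e. 83% higher.
diversity_gain = mtld["MARVEL (L4)"] / mtld["Kabra"] - 1

# Vocabulary size ratio: a little over 7x.
vocab_ratio = unigrams["MARVEL (L4)"] / unigrams["Kabra"]

# Average description length ratio: 170 / 5 = 34x.
length_ratio = avg_words["MARVEL (L1)"] / avg_words["Kabra"]

# Generation time ratio: 2700 s / 15 s = 180x.
speedup = seconds["Lucid-Dreamer"] / seconds["MARVEL-FX3D"]
```

Note that the diversity and vocabulary figures compare Level 4 annotations against Kabra, while the length figure uses Level 1, so the ratios describe different levels of the same dataset.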

Highlights & Insights¶

Exceptionally High Engineering Value: With 40M+ annotations covering 8.9M 3D assets, this is the largest 3D text annotation dataset to date, providing a fundamental contribution to future 3D foundation model training.
Open-Source Solution Rivaling GPT-4: The pipeline is built entirely with open-source models like InternVL2 and Qwen2.5, which keeps costs manageable and makes the pipeline reproducible.
Insights on Metadata Fusion: Human-annotated metadata should not be discarded (as done in Cap3D) but rather filtered and injected as domain priors. This is critical for identifying complex or domain-specific entities like "lunar craters" or the "Monument to Dante."
Multi-Level Annotation Structure: The five-level annotation design elegantly addresses the challenge of a "one-size-fits-all" annotation failing to accommodate disparate downstream tasks.

Limitations & Future Work¶

VLMs and LLMs face limitations in numerical precision and directional spatial understanding, which still leads to errors when describing scenes with multi-object occlusion.
InternVL2 might misinterpret side views of extremely thin objects as separate entities.
When metadata is absent, descriptions of complex 3D structures (e.g., highly fragmented geometry inside architectural models) tend to be overly generic.
MARVEL-FX3D sometimes generates flat 3D objects due to potential depth ambiguity issues.
The pipeline is resource-intensive, requiring high-end GPUs (H100 + A6000), which may present deployment challenges for small-to-medium teams.

Core Difference from Cap3D: Cap3D captions individual views with BLIP and aggregates the captions using GPT-4, which easily leads to contradictory descriptions. MARVEL employs a multi-view VLM to directly output a single consistent description.
Difference from CLAY: CLAY uses GPT-4 for multi-view annotation, which is costly and prevents open-source reproduction. In contrast, MARVEL is built entirely on open-source models.
Insights: The quality of 3D annotations is a decisive factor in downstream TT3D generation quality. The paradigm of multi-level annotation combined with metadata fusion is a generalizable framework for building other cross-modal datasets.

Rating¶

Novelty: ⭐⭐⭐⭐ While the hierarchical annotation structure and metadata fusion ideas are highly novel, the VLM+LLM annotation pipeline itself is somewhat straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering linguistic evaluations, text-image alignment, annotation accuracy, downstream TT3D performance, ablations, and human evaluations.
Writing Quality: ⭐⭐⭐⭐ The structure is clear and the tables are rich, though the paper is quite long with occasional redundancies.
Value: ⭐⭐⭐⭐⭐ The dataset sets a new benchmark in both scale and quality, making a substantial foundational contribution to 3D foundation model research.