MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Conference: CVPR 2025
arXiv: 2411.17945
Code: https://sankalpsinha-cmos.github.io/MARVEL/
Area: 3D Vision
Keywords: Text-to-3D, 3D Annotation, Dataset, Multi-Level Description, Stable Diffusion

TL;DR

This work constructs MARVEL-40M+, a large-scale 3D description dataset featuring 8.9 million 3D assets and over 40 million multi-level text annotations. Through a multi-stage automated annotation pipeline (InternVL2 + Qwen2.5), it generates five levels of annotation ranging from detailed narratives to concise tags. Leveraging this dataset, Stable Diffusion 3.5 is fine-tuned to achieve high-fidelity text-to-3D generation within 15 seconds.

Background & Motivation

While text-to-3D (TT3D) content generation is in high demand across gaming, AR/VR, and film production, the development of this field is severely constrained by the lack of high-quality, aligned 3D-text asset pairs:

  • Insufficient Scale of Existing Datasets: Cap3D contains only around 1 million assets, and 3DTopia contains about 360,000, failing to cover the immense diversity of 3D models.
  • Low Annotation Quality: Descriptions generated by single-view VLMs (e.g., BLIP, LLaVA) often suffer from contradictions or inconsistency, and lack the detailed information required for fine-grained 3D reconstruction.
  • Poor Scalability: Methods like Cap3D and CLAY rely on proprietary models such as GPT-4, which are highly expensive and difficult to deploy at scale.
  • Lack of Domain-Specific Descriptions: Datasets like Objaverse contain diverse models ranging from characters and creatures to historical artifacts, which require domain expertise for accurate annotation.
  • Single-Granularity Annotations: Existing approaches only offer single-level labels, failing to flexibly adapt to varying needs of detailed reconstruction versus rapid modeling.

Key Motivation: Leverage open-source multi-view VLMs and LLMs to construct an automated and scalable 3D annotation pipeline, generating multi-level annotations to simultaneously serve both detailed reconstruction and rapid prototyping.

Method

Overall Architecture

MARVEL is a multi-stage 3D asset annotation pipeline that takes four-view renders of a 3D model and optional human metadata as input, and outputs five levels of text descriptions. The downstream application, MARVEL-FX3D, is a two-stage TT3D pipeline: it first uses a fine-tuned SD3.5 to generate an image, and then utilizes a pre-trained SF3D to convert it into a textured mesh, completing the entire process in under 15 seconds.
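The two-stage structure of MARVEL-FX3D can be pictured as a simple composition of three steps: text-to-image, background removal (the DIS step mentioned under Loss & Training), and image-to-mesh. The sketch below only illustrates that data flow. The function name `text_to_3d` and its three callable parameters are hypothetical placeholders, not the interfaces of SD3.5, DIS, or SF3D.

```python
from typing import Any, Callable


def text_to_3d(
    prompt: str,
    text_to_image: Callable[[str], Any],      # stands in for the fine-tuned SD3.5
    remove_background: Callable[[Any], Any],  # stands in for DIS
    image_to_mesh: Callable[[Any], Any],      # stands in for pre-trained SF3D
) -> Any:
    """Chain the stages: prompt -> image -> foreground image -> textured mesh."""
    image = text_to_image(prompt)
    foreground = remove_background(image)
    return image_to_mesh(foreground)


# Usage with dummy stages that just record the order of operations.
mesh = text_to_3d(
    "a chair",
    lambda p: f"img({p})",
    lambda i: f"fg({i})",
    lambda f: f"mesh({f})",
)
```

Passing the stages as callables keeps the sketch independent of any particular model library.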

Key Designs

  1. Multi-View VLM-Based Dense Description Generation:

    • Function: Generates unified, dense text descriptions from four standard viewpoints (front/back/left/right) of a 3D model.
    • Mechanism: Uses InternVL2-40B as the multi-view VLM, taking four \(512 \times 512\) renders along with metadata-enhanced prompts as input to directly output dense descriptions covering five key aspects: (1) structural decomposition and object recognition, (2) geometric properties and symmetry, (3) surface texture and material, (4) color mapping and transition, and (5) environmental context and spatial relationships.
    • Design Motivation: This avoids the "describe-then-aggregate" workflow of methods like Cap3D (reducing information loss and conflict). InternVL2-40B delivers performance close to GPT-4o at a significantly lower cost.
  2. Human Metadata Fusion and Filtering:

    • Function: Injects human-generated metadata from original source datasets into the annotation pipeline to reduce VLM hallucinations and introduce domain-specific information.
    • Mechanism: Extracts user-generated metadata (such as names, tags, and descriptions) from datasets like Objaverse, and uses Mistral-Nemo-Instruct-2407 to filter out random, redundant, and sensitive content. The clean information relevant to 3D attributes is then injected into the VLM prompt. For example, metadata helps identify domain-specific entities like the "Monument to Dante" that a VLM cannot infer solely through visual reasoning.
    • Design Motivation: Pure visual VLMs are prone to hallucinating when facing complex 3D scenes (due to 2D-3D domain gaps). Human metadata provides strong domain priors, though the raw metadata itself contains noise and must be filtered.
  3. Multi-Level Visual Elaboration:

    • Function: Progressively compresses the dense descriptions into five distinct levels to adapt to different downstream task requirements.
    • Mechanism: Uses Qwen2.5-72B to execute a hierarchical prompting strategy, sequentially specifying which aspects to retain or compress: Level 1 (150-200 words, comprehensive description) → Level 2 (100-150 words, key structure and geometry) → Level 3 (50-100 words, functional semantics) → Level 4 (~30 words, brief summary) → Level 5 (10-20 words, list of semantic tags). Each level gradually compresses aspects like texture, color, and geometry.
    • Design Motivation: Directly instructing compression can constrain the model's creative capacity (consistent with findings in recent studies). Hierarchical prompting strikes a balance between detail and brevity. Additionally, different use cases demand different granularities (e.g., fine-grained reconstruction vs. rapid modeling).
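The hierarchical prompting in design 3 can be sketched as a chain, assuming each level is derived from the one before it. This is a minimal illustration rather than the authors' implementation: the word budgets and focus labels are taken from the level list above, while `LevelSpec`, `build_level_prompt`, `elaborate`, the prompt wording, and the `llm` callable (standing in for Qwen2.5-72B) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LevelSpec:
    level: int
    word_budget: str  # target length as stated in the summary above
    focus: str        # what the level is meant to retain


# Budgets and focus areas copied from the five levels listed above.
LEVELS: List[LevelSpec] = [
    LevelSpec(1, "150-200 words", "comprehensive description"),
    LevelSpec(2, "100-150 words", "key structure and geometry"),
    LevelSpec(3, "50-100 words", "functional semantics"),
    LevelSpec(4, "about 30 words", "brief summary"),
    LevelSpec(5, "10-20 words", "list of semantic tags"),
]


def build_level_prompt(spec: LevelSpec, source_text: str) -> str:
    """Hypothetical prompt wording; the paper's actual prompts are not reproduced."""
    return (
        f"Rewrite the description below as Level {spec.level} "
        f"({spec.word_budget}, focus: {spec.focus}).\n\n{source_text}"
    )


def elaborate(dense_description: str, llm: Callable[[str], str]) -> Dict[int, str]:
    """Produce all five levels, feeding each level's output into the next prompt."""
    outputs: Dict[int, str] = {}
    current = dense_description
    for spec in LEVELS:
        current = llm(build_level_prompt(spec, current))
        outputs[spec.level] = current
    return outputs
```

Any text-generation function with a `str -> str` signature can be passed as `llm`, which makes the chain easy to test with a stub before connecting a real model.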

Loss & Training

  • MARVEL-FX3D Stage 1: Fine-tunes Stable Diffusion 3.5 using LoRA (rank=4, alpha=4) on a training set of 798K assets from Objaverse for 5 epochs with half-precision, a batch size of 8, on a single H100 GPU.
  • Stage 2: Uses DIS to remove the background, then feeds the image into pre-trained SF3D to generate the textured mesh (5 seconds).
  • Annotation Pipeline Throughput: ~24,000 samples per day.
  • Ethical Filtering Stage: Employs Qwen2.5-14B to remove offensive or nonsensical content and overly specific personally identifiable information.
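For reference, the LoRA setting quoted for Stage 1 (rank 4, alpha 4) gives a scaling factor of alpha / rank = 1 under the standard LoRA formulation. The NumPy sketch below illustrates that update, W + (alpha / rank) · BA, on a single weight matrix. The 64 × 64 layer size and all variable names are arbitrary choices for illustration and are not taken from SD3.5 or the paper's training code.

```python
import numpy as np

RANK, ALPHA = 4, 4      # values quoted above
SCALE = ALPHA / RANK    # standard LoRA scaling factor


def lora_forward(x, w, a, b, scale=SCALE):
    """Apply the frozen weight w plus the low-rank update scale * (b @ a)."""
    return x @ (w + scale * (b @ a)).T


d_out, d_in = 64, 64    # illustrative layer size only
rng = np.random.default_rng(0)
w = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
a = rng.normal(size=(RANK, d_in)) * 0.01   # trainable down-projection
b = np.zeros((d_out, RANK))                # trainable up-projection, zero-initialised

x = rng.normal(size=(2, d_in))
y = lora_forward(x, w, a, b)

full_params = w.size             # parameters in the frozen matrix
lora_params = a.size + b.size    # parameters actually trained
print(SCALE, full_params, lora_params)
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `RANK * (d_in + d_out)` parameters are trained per adapted matrix.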

Key Experimental Results

Main Results

| Dataset | Avg. Length (words) | MTLD↑ | Unigram↑ | GPT-4 Win Rate↑ | Human Win Rate↑ |
|---|---|---|---|---|---|
| Cap3D | 16 | 39.71 | 15,189 | 14.55% | 9.50% |
| 3DTopia | 29 | 41.43 | 10,329 | 10.80% | 14.00% |
| Kabra | 5 | 25.85 | 3,862 | 2.24% | 3.10% |
| MARVEL (L4) | 44 | 47.43 | 27,659 | 72.41% | 73.40% |

Annotation Accuracy

| Method | Avg. Length (words) | GPT-4 Accuracy | Human Accuracy |
|---|---|---|---|
| Cap3D | 16 | 76.00% | 72.80% |
| Kabra | 5 | 83.40% | 78.20% |
| MARVEL (L1) | 170 | 84.70% | 82.80% |

TT3D Generation (Human Evaluation)

| Method | Time | Prompt Fidelity | Overall Preference |
|---|---|---|---|
| Shap-E | 5 s | 2.65 | 2.41 |
| DreamFusion | 30 min | 4.22 | 4.09 |
| Lucid-Dreamer | 45 min | 6.62 | 6.59 |
| MARVEL-FX3D | 15 s | 7.71 | 6.94 |

Ablation Study

| Level-to-Level | Semantic Similarity | Compression Ratio |
|---|---|---|
| L1→L2 | 0.91 | 0.30 |
| L2→L3 | 0.92 | 0.27 |
| L3→L4 | 0.88 | 0.47 |
| L4→L5 | 0.72 | 0.22 |

Key Findings

  • MARVEL annotations outperform existing methods across all metrics: compared with Kabra, the MTLD language-diversity score is 83% higher (47.43 vs. 25.85) and the unigram vocabulary is about 7x larger (27,659 vs. 3,862).
  • Although Level 1 descriptions average 170 words (34x the length of Kabra's), they maintain a high accuracy of 84.7% under GPT-4 evaluation.
  • MARVEL-FX3D generates results within 15 seconds, which is 180x faster than Lucid-Dreamer while achieving higher prompt fidelity.
  • Finetuning SD3.5 on MARVEL annotations yields significant improvements over the Cap3D-finetuned version across all metrics, demonstrating the decisive impact of data quality.
  • Semantic retention remains high across L1–L4 transitions (0.88–0.92), dropping to 0.72 at L5 due to the transition into tag formats.
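The ratios quoted in these findings follow directly from the result tables. The short check below recomputes them from the tabulated values only; the dictionary keys and variable names are illustrative.

```python
# Values copied from the result tables above.
mtld = {"Kabra": 25.85, "MARVEL_L4": 47.43}
unigrams = {"Kabra": 3862, "MARVEL_L4": 27659}
avg_words = {"Kabra": 5, "MARVEL_L1": 170}
seconds = {"Lucid-Dreamer": 45 * 60, "MARVEL-FX3D": 15}

# Relative gain in MTLD language diversity over Kabra, in percent.
diversity_gain_pct = (mtld["MARVEL_L4"] / mtld["Kabra"] - 1) * 100
# Vocabulary size, description length, and generation time as plain ratios.
vocab_ratio = unigrams["MARVEL_L4"] / unigrams["Kabra"]
length_ratio = avg_words["MARVEL_L1"] / avg_words["Kabra"]
speedup = seconds["Lucid-Dreamer"] / seconds["MARVEL-FX3D"]

print(f"MTLD gain over Kabra: {diversity_gain_pct:.1f}%")
print(f"Vocabulary ratio:     {vocab_ratio:.2f}x")
print(f"Length ratio (L1):    {length_ratio:.0f}x")
print(f"Speedup:              {speedup:.0f}x")
```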

Highlights & Insights

  • Exceptionally High Engineering Value: With 40M+ annotations covering 8.9M 3D assets, this is the largest 3D text annotation dataset to date, providing a fundamental contribution to future 3D foundation model training.
  • Open-Source Solution Rivaling GPT-4: The pipeline is built entirely with open-source models like InternVL2 and Qwen2.5, ensuring manageable costs and seamless reproducibility.
  • Insights on Metadata Fusion: Human-annotated metadata should not be discarded (as done in Cap3D) but rather filtered and injected as domain priors. This is critical for identifying complex or domain-specific entities like "lunar craters" or the "Monument to Dante."
  • Multi-Level Annotation Structure: The five-level annotation design elegantly addresses the challenge of a "one-size-fits-all" annotation failing to accommodate disparate downstream tasks.

Limitations & Future Work

  • VLMs and LLMs face limitations in numerical precision and directional spatial understanding, still leading to errors when describing scenes with multi-object occlusion.
  • InternVL2 may misinterpret side views of extremely thin objects as separate entities.
  • When metadata is absent, descriptions of complex 3D structures (e.g., highly fragmented geometry inside architectural models) tend to be overly generic.
  • MARVEL-FX3D sometimes generates flat 3D objects due to potential depth ambiguity issues.
  • The pipeline is resource-intensive, requiring high-end GPUs (H100 + A6000), which may present deployment challenges for small-to-medium teams.

Related Work & Insights

  • Core Difference from Cap3D: Cap3D captions each view separately with BLIP and aggregates the captions using GPT-4, which easily leads to contradictory descriptions. MARVEL employs a multi-view VLM to directly output a single consistent description.
  • Difference from CLAY: CLAY uses GPT-4 for multi-view annotation, which is extremely costly and prevents open-source reproducibility. In contrast, MARVEL is built entirely on open-source models.
  • Insights: The quality of 3D annotations is a decisive factor in downstream TT3D generation quality. The paradigm of multi-level annotation combined with metadata fusion is a generalizable framework for building other cross-modal datasets.

Rating

  • Novelty: ⭐⭐⭐⭐ While the hierarchical annotation structure and metadata fusion ideas are highly novel, the VLM+LLM annotation pipeline itself is somewhat straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering linguistic evaluations, text-image alignment, annotation accuracy, downstream TT3D performance, ablations, and human evaluations.
  • Writing Quality: ⭐⭐⭐⭐ The structure is clear and the tables are rich, though the paper is quite long with occasional redundancies.
  • Value: ⭐⭐⭐⭐⭐ The dataset sets a new benchmark in both scale and quality, making a substantial foundational contribution to 3D foundation model research.