MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
Conference: CVPR 2025
arXiv: 2411.17945
Code: https://sankalpsinha-cmos.github.io/MARVEL/
Area: 3D Vision
Keywords: Text-to-3D, 3D Annotation, Dataset, Multi-Level Description, Stable Diffusion
TL;DR
This work constructs MARVEL-40M+, a large-scale 3D description dataset featuring 8.9 million 3D assets and over 40 million multi-level text annotations. Through a multi-stage automated annotation pipeline (InternVL2 + Qwen2.5), it generates five levels of annotation ranging from detailed narratives to concise tags. Leveraging this dataset, Stable Diffusion 3.5 is fine-tuned to achieve high-fidelity text-to-3D generation within 15 seconds.
Background & Motivation
While text-to-3D (TT3D) content generation is in high demand across gaming, AR/VR, and film production, the development of this field is severely constrained by the lack of high-quality, aligned 3D-text asset pairs:
- Insufficient Scale of Existing Datasets: Cap3D contains only around 1 million assets, and 3DTopia contains about 360,000, failing to cover the immense diversity of 3D models.
- Low Annotation Quality: Descriptions generated by single-view VLMs (e.g., BLIP, LLaVA) often suffer from contradictions or inconsistency, and lack the detailed information required for fine-grained 3D reconstruction.
- Poor Scalability: Methods like Cap3D and CLAY rely on proprietary models such as GPT-4, which are expensive and difficult to deploy at scale.
- Lack of Domain-Specific Descriptions: Datasets like Objaverse contain diverse models ranging from characters and creatures to historical artifacts, which require domain expertise for accurate annotation.
- Single-Granularity Annotations: Existing approaches offer only single-level labels and cannot adapt to the differing needs of detailed reconstruction and rapid modeling.
Key Motivation: Leverage open-source multi-view VLMs and LLMs to construct an automated and scalable 3D annotation pipeline, generating multi-level annotations to simultaneously serve both detailed reconstruction and rapid prototyping.
Method
Overall Architecture
MARVEL is a multi-stage 3D asset annotation pipeline that takes four-view renders of a 3D model and optional human metadata as input, and outputs five levels of text descriptions. The downstream application, MARVEL-FX3D, is a two-stage TT3D pipeline: it first uses a fine-tuned SD3.5 to generate an image, and then utilizes a pre-trained SF3D to convert it into a textured mesh, completing the entire process in under 15 seconds.
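The data flow of the annotation pipeline can be sketched as below. This is a minimal illustration, not the authors' code: `Asset`, `annotate`, and the three callables (`filter_metadata`, `describe`, `compress`) are hypothetical names standing in for the metadata-filtering LLM, the multi-view VLM, and the compression LLM. Deriving each level from the previous one is also an assumption based on the "Level 1 → Level 2 → ..." description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

VIEW_NAMES = ("front", "back", "left", "right")


@dataclass
class Asset:
    asset_id: str
    renders: Dict[str, str]          # view name -> path of a rendered image
    metadata: Optional[str] = None   # raw, possibly noisy user metadata


def annotate(
    asset: Asset,
    filter_metadata: Callable[[str], str],
    describe: Callable[[List[str], str], str],
    compress: Callable[[str, int], str],
    num_levels: int = 5,
) -> Dict[int, str]:
    """Run the three annotation stages and return {level: text}."""
    missing = [v for v in VIEW_NAMES if v not in asset.renders]
    if missing:
        raise ValueError(f"missing views: {missing}")

    # Stage 1: clean the optional human metadata (hypothetical callable).
    hints = filter_metadata(asset.metadata) if asset.metadata else ""

    # Stage 2: a single multi-view call yields one dense description.
    images = [asset.renders[v] for v in VIEW_NAMES]
    dense = describe(images, hints)

    # Stage 3: derive each level from the previous text (assumed chaining).
    levels: Dict[int, str] = {}
    text = dense
    for level in range(1, num_levels + 1):
        text = compress(text, level)
        levels[level] = text
    return levels
```

In practice each callable would wrap a model call; passing them in as arguments keeps the control flow testable with simple stubs.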
Key Designs
- Multi-View VLM-Based Dense Description Generation:
  - Function: Generates unified, dense text descriptions from four standard viewpoints (front/back/left/right) of a 3D model.
  - Mechanism: Uses InternVL2-40B as the multi-view VLM, taking four \(512 \times 512\) renders along with metadata-enhanced prompts as input to directly output dense descriptions covering five key aspects: (1) structural decomposition and object recognition, (2) geometric properties and symmetry, (3) surface texture and material, (4) color mapping and transition, and (5) environmental context and spatial relationships.
  - Design Motivation: This avoids the "describe-then-aggregate" workflow of methods like Cap3D (reducing information loss and conflict). InternVL2-40B delivers performance close to GPT-4o at a significantly lower cost.
- Human Metadata Fusion and Filtering:
  - Function: Injects human-generated metadata from original source datasets into the annotation pipeline to reduce VLM hallucinations and introduce domain-specific information.
  - Mechanism: Extracts user-generated metadata (such as names, tags, and descriptions) from datasets like Objaverse, and uses Mistral-Nemo-Instruct-2407 to filter out random, redundant, and sensitive content. The clean information relevant to 3D attributes is then injected into the VLM prompt. For example, metadata helps identify domain-specific entities like the "Monument to Dante" that a VLM cannot infer solely through visual reasoning.
  - Design Motivation: Pure visual VLMs are prone to hallucinating when facing complex 3D scenes (due to 2D-3D domain gaps). Human metadata provides strong domain priors, though the raw metadata itself contains noise and must be filtered.
- Multi-Level Visual Elaboration:
  - Function: Progressively compresses the dense descriptions into five distinct levels to adapt to different downstream task requirements.
  - Mechanism: Uses Qwen2.5-72B to execute a hierarchical prompting strategy, sequentially specifying which aspects to retain or compress: Level 1 (150-200 words, comprehensive description) → Level 2 (100-150 words, key structure and geometry) → Level 3 (50-100 words, functional semantics) → Level 4 (~30 words, brief summary) → Level 5 (10-20 words, list of semantic tags). Each level gradually compresses aspects like texture, color, and geometry.
  - Design Motivation: Directly instructing compression can constrain the model's creative capacity (consistent with findings in recent studies). Hierarchical prompting strikes a balance between detail and brevity. Additionally, different use cases demand different granularities (e.g., fine-grained reconstruction vs. rapid modeling).
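The five-level word budgets listed above can be captured in a small specification table. The sketch below is illustrative only: `LevelSpec`, `build_prompt`, and `within_budget` are hypothetical helpers, the prompt wording is invented for the example, and the 20-40 word bounds for Level 4 are an assumed reading of "~30 words".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelSpec:
    level: int
    min_words: int
    max_words: int
    focus: str


LEVELS = (
    LevelSpec(1, 150, 200, "comprehensive description"),
    LevelSpec(2, 100, 150, "key structure and geometry"),
    LevelSpec(3, 50, 100, "functional semantics"),
    LevelSpec(4, 20, 40, "brief summary"),  # "~30 words" in the text; bounds assumed
    LevelSpec(5, 10, 20, "list of semantic tags"),
)


def build_prompt(spec: LevelSpec, previous_text: str) -> str:
    """Compose a compression instruction for one level (wording is hypothetical)."""
    return (
        f"Rewrite the description below as a Level {spec.level} annotation "
        f"({spec.focus}) in {spec.min_words}-{spec.max_words} words.\n\n"
        f"{previous_text}"
    )


def within_budget(spec: LevelSpec, text: str) -> bool:
    """Check a generated annotation against the level's word budget."""
    n_words = len(text.split())
    return spec.min_words <= n_words <= spec.max_words
```

A check like `within_budget` could be used to reject or regenerate outputs that overshoot a level's budget, although the source does not say whether the pipeline enforces lengths this way.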
Loss & Training
- MARVEL-FX3D Stage 1: Fine-tunes Stable Diffusion 3.5 with LoRA (rank=4, alpha=4) on a training set of 798K Objaverse assets for 5 epochs, in half precision, with a batch size of 8 on a single H100 GPU.
- Stage 2: Uses DIS to remove the background, then feeds the image into pre-trained SF3D to generate the textured mesh (5 seconds).
- Annotation Pipeline Throughput: ~24,000 samples per day.
- Ethical Filtering Stage: Employs Qwen2.5-14B to remove offensive or nonsensical content and overly specific personally identifiable information.
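A back-of-the-envelope calculation helps put these settings in perspective. The sketch below assumes one training sample per asset per epoch, no gradient accumulation, that "798K" means exactly 798,000, and that the quoted throughput refers to a single pipeline instance; none of these is stated in the source. The LoRA scale uses the standard `alpha / rank` convention.

```python
import math


def optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps, assuming no gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def annotation_days(num_assets: int, samples_per_day: int) -> float:
    """Days needed to annotate all assets at a fixed daily throughput."""
    return num_assets / samples_per_day


steps = optimizer_steps(798_000, batch_size=8, epochs=5)  # 498,750 steps
scale = lora_scale(alpha=4, rank=4)                       # 1.0
days = annotation_days(8_900_000, 24_000)                 # roughly 371 days

print(f"steps={steps}, lora_scale={scale}, days={days:.0f}")
```

Under these assumptions, a single pipeline instance would need about a year to cover 8.9M assets, which suggests the annotation was run in parallel across several instances.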
Key Experimental Results
Main Results
| Dataset | Avg. Length | MTLD↑ | Unigram↑ | GPT-4 Win Rate↑ | Human Win Rate↑ |
|---|---|---|---|---|---|
| Cap3D | 16 words | 39.71 | 15,189 | 14.55% | 9.50% |
| 3DTopia | 29 words | 41.43 | 10,329 | 10.80% | 14.00% |
| Kabra | 5 words | 25.85 | 3,862 | 2.24% | 3.10% |
| MARVEL (L4) | 44 words | 47.43 | 27,659 | 72.41% | 73.40% |
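For readers unfamiliar with the MTLD column, the following is a minimal sketch of the Measure of Textual Lexical Diversity as it is commonly defined: the text is split into "factors", each ending when the running type-token ratio falls to a threshold (0.72 by convention), and the score is the token count divided by the number of factors, averaged over a forward and a backward pass. Whether the paper used exactly this variant, tokenization, or threshold is an assumption; returning the token count when no factor completes is a choice made here to avoid division by zero.

```python
from typing import Sequence


def _mtld_one_direction(tokens: Sequence[str], threshold: float) -> float:
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        # A factor is complete once the running TTR reaches the threshold.
        if len(types) / count <= threshold:
            factors += 1.0
            types = set()
            count = 0
    if count > 0:
        # Credit the leftover segment as a partial factor.
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    if factors == 0:
        return float(len(tokens))  # fallback chosen for this sketch
    return len(tokens) / factors


def mtld(tokens: Sequence[str], threshold: float = 0.72) -> float:
    """Average of forward and backward MTLD over a token sequence."""
    if not tokens:
        return 0.0
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(list(reversed(tokens)), threshold)
    return (forward + backward) / 2.0
```

Higher values mean the text sustains a varied vocabulary for longer stretches, which is why longer and richer captions tend to score higher than short, repetitive ones.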
Annotation Accuracy
| Method | Avg. Length (words) | GPT-4 Accuracy | Human Accuracy |
|---|---|---|---|
| Cap3D | 16 | 76.00% | 72.80% |
| Kabra | 5 | 83.40% | 78.20% |
| MARVEL (L1) | 170 | 84.70% | 82.80% |
TT3D Generation (Human Evaluation)
| Method | Time | Prompt Fidelity | Overall Preference |
|---|---|---|---|
| Shap-E | 5s | 2.65 | 2.41 |
| DreamFusion | 30min | 4.22 | 4.09 |
| Lucid-Dreamer | 45min | 6.62 | 6.59 |
| MARVEL-FX3D | 15s | 7.71 | 6.94 |
Ablation Study
| Level-to-Level | Semantic Similarity | Compression Ratio |
|---|---|---|
| L1→L2 | 0.91 | 0.30 |
| L2→L3 | 0.92 | 0.27 |
| L3→L4 | 0.88 | 0.47 |
| L4→L5 | 0.72 | 0.22 |
Key Findings
- MARVEL annotations outperform existing datasets on every metric in the main results table, with an 83% higher MTLD score and a 7x larger unigram vocabulary than Kabra.
- Although Level 1 descriptions average 170 words (34x the length of Kabra's), they maintain an accuracy of 84.7% under GPT-4 evaluation.
- MARVEL-FX3D generates results within 15 seconds, which is 180x faster than Lucid-Dreamer while achieving higher prompt fidelity.
- Fine-tuning SD3.5 on MARVEL annotations yields significant improvements over the Cap3D-finetuned version across all metrics, demonstrating the decisive impact of data quality.
- Semantic retention remains high across L1–L4 transitions (0.88–0.92), dropping to 0.72 at L5 due to the transition into tag formats.
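The headline ratios in these findings follow directly from the table values. The short check below recomputes them; the variable names are illustrative, and the inputs are copied from the tables in this summary.

```python
def relative_gain(new: float, base: float) -> float:
    """Fractional improvement of `new` over `base`."""
    return (new - base) / base


# Values copied from the result tables (MARVEL vs. Kabra, and timing rows).
mtld_gain = relative_gain(47.43, 25.85)   # lexical diversity gain
vocab_ratio = 27_659 / 3_862              # unigram vocabulary ratio
length_ratio = 170 / 5                    # Level 1 vs. Kabra average length
speedup = (45 * 60) / 15                  # Lucid-Dreamer 45 min vs. 15 s

print(f"MTLD gain:    {mtld_gain:.1%}")
print(f"Vocab ratio:  {vocab_ratio:.1f}x")
print(f"Length ratio: {length_ratio:.0f}x")
print(f"Speedup:      {speedup:.0f}x")
```

The results are consistent with the quoted figures of roughly 83%, 7x, 34x, and 180x.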
Highlights & Insights
- Exceptionally High Engineering Value: With 40M+ annotations covering 8.9M 3D assets, this is the largest 3D text annotation dataset to date, providing a fundamental contribution to future 3D foundation model training.
- Open-Source Solution Rivaling GPT-4: The pipeline is built entirely with open-source models like InternVL2 and Qwen2.5, ensuring manageable costs and seamless reproducibility.
- Insights on Metadata Fusion: Human-annotated metadata should not be discarded (as done in Cap3D) but rather filtered and injected as domain priors. This is critical for identifying complex or domain-specific entities like "lunar craters" or the "Monument to Dante."
- Multi-Level Annotation Structure: The five-level annotation design elegantly addresses the challenge of a "one-size-fits-all" annotation failing to accommodate disparate downstream tasks.
Limitations & Future Work
- VLMs and LLMs face limitations in numerical precision and directional spatial understanding, still leading to errors when describing scenes with multi-object occlusion.
- InternVL2 might misinterpret side views of extremely thin objects as separate entities.
- When metadata is absent, descriptions of complex 3D structures (e.g., highly fragmented geometry inside architectural models) tend to generalize too much.
- MARVEL-FX3D sometimes generates flat 3D objects due to potential depth ambiguity issues.
- The pipeline is resource-intensive, requiring high-end GPUs (H100 + A6000), which may present deployment challenges for small-to-medium teams.
Related Work & Insights
- Core Difference from Cap3D: Cap3D captions each view separately with BLIP and then aggregates the captions using GPT-4, which easily leads to contradictory descriptions. MARVEL employs a multi-view VLM to output a single consistent description directly.
- Difference from CLAY: CLAY utilizes GPT-4 for multi-view annotations, yielding extremely high costs and preventing open-source reproducibility. In contrast, MARVEL is entirely open-source.
- Insights: 3D annotation quality is a decisive factor in downstream TT3D generation quality. Multi-level annotation combined with metadata fusion is a paradigm that could generalize to building other cross-modal datasets.
Rating
- Novelty: ⭐⭐⭐⭐ While the hierarchical annotation structure and metadata fusion ideas are highly novel, the VLM+LLM annotation pipeline itself is somewhat straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering linguistic evaluations, text-image alignment, annotation accuracy, downstream TT3D performance, ablations, and human evaluations.
- Writing Quality: ⭐⭐⭐⭐ The structure is clear and the tables are rich, though the paper is quite long with occasional redundancies.
- Value: ⭐⭐⭐⭐⭐ The dataset sets a new benchmark in both scale and quality, making a substantial foundational contribution to 3D foundation model research.