ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

Conference: ICML 2026
arXiv: 2512.14099
Code: TBD
Area: Image Generation / Diffusion Models
Keywords: Discrete Diffusion, Multi-view Generation, Visual Tokenization, Mask Prediction

TL;DR

ViewMask-1-to-3 uses visual tokenization and a discrete diffusion model to recast multi-view generation as discrete sequence prediction. A simple random masking strategy combined with bidirectional self-attention induces cross-view consistency without explicit 3D priors, and the model ranks above the continuous diffusion baselines on most reported metrics.

Background & Motivation

Background: Multi-view generation has long been dominated by continuous diffusion methods, including geometry-aware approaches built on 3D representations (NeRF, 3D Gaussians), camera-conditioned diffusion models, and image-editing-style multi-view generators. These methods rely on explicit 3D priors or complex cross-view synchronization mechanisms.

Limitations of Prior Work: Continuous diffusion methods require explicit camera parameters or fine-grained geometric constraints to ensure view consistency, and per-view independent generation is prone to inconsistencies in details and textures. Text-to-multi-view tasks require using a T2I model to generate a reference image before multi-view expansion, resulting in a tedious pipeline.

Key Challenge: A trade-off exists between geometric consistency and generation flexibility—strengthening 3D constraints improves consistency but limits diversity; pure 2D methods are flexible but struggle to naturally encode cross-view relationships.

Goal: To explore the potential of discrete diffusion models in multi-view generation and establish a unified vision-language framework.

Key Insight: Discrete diffusion (masked token prediction) has proven effective in multimodal understanding and generation (e.g., LLaDA). Advantages include faster inference (parallel decoding), natural fusion of text and visual tokens, and alignment with LLMs.

Core Idea: Multi-view generation is reformulated as a discrete sequence modeling problem. Each viewpoint is represented as a sequence of visual tokens generated by MAGVIT-v2. Through masked diffusion, tokens are predicted iteratively, utilizing simple random masking and bidirectional self-attention to naturally induce cross-view consistency.

Method

Overall Architecture

Training proceeds in three stages: (1) pre-training for multimodal alignment (image-caption alignment); (2) Image-to-Multi-View (I2MV), conditioned on a reference view; (3) Text-to-Multi-View (T2MV), conditioned on text descriptions. Inference starts from fully masked target views and gradually recovers the complete token sequence through iterative prediction and re-masking.
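The inference procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_predictor` is a hypothetical stand-in for the trained denoiser and returns random probabilities, `MASK_ID` is an assumed placeholder id, and the exact form of the cosine schedule and the default of 20 steps are choices made for this example.

```python
import math
import numpy as np

MASK_ID = -1  # hypothetical placeholder id for the [MASK] token


def cosine_schedule(t: float) -> float:
    """Assumed schedule: fraction of tokens still masked at progress t in [0, 1]."""
    return math.cos(0.5 * math.pi * t)


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the denoiser: random per-position token probabilities."""
    logits = rng.normal(size=(tokens.shape[0], vocab_size))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)


def iterative_decode(num_tokens, vocab_size, steps=20, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK_ID, dtype=np.int64)  # start fully masked
    for step in range(1, steps + 1):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        probs = toy_predictor(tokens, vocab_size, rng)
        pred = probs.argmax(axis=1)
        # Tokens committed in earlier steps get infinite confidence,
        # so they are never re-masked.
        conf = np.where(masked, probs.max(axis=1), np.inf)
        # Fill every masked position with its current prediction.
        tokens = np.where(masked, pred, tokens)
        # Re-mask the lowest-confidence positions according to the schedule.
        n_remask = int(math.floor(cosine_schedule(step / steps) * num_tokens))
        if n_remask > 0:
            tokens[np.argsort(conf)[:n_remask]] = MASK_ID
    return tokens


decoded = iterative_decode(num_tokens=64, vocab_size=16)
```

All positions are predicted in parallel at every step; only the number of tokens kept is controlled by the schedule, so swapping in a linear or quadratic function changes how quickly tokens are committed.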

Key Designs

  1. Visual Tokenization and Sequence Fusion:

    • Function: Unify multi-view images into a discrete token sequence.
    • Mechanism: MAGVIT-v2 encodes each image \(I_i \in \mathbb{R}^{H \times W \times 3}\) into a sequence of tokens of length \(L\), with a vocabulary size \(|\mathcal{V}| = 2^{18}\). Cross-view sequences use special tokens ([SOI], [EOI]) to separate viewpoints. In I2MV, one reference view and three generated viewpoints are encoded.
    • Design Motivation: A unified sequence format minimizes structural differences between I2MV and T2MV. Compared to per-view generation, serializing all views into one sequence lets bidirectional self-attention carry information across views.
  2. Random Masking and Iterative Denoising:

    • Function: Achieve controllable and parallel multi-view generation via masked token prediction.
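    • Mechanism: During training, a mask ratio \(r \sim \text{Uniform}(0,1)\) is sampled and target tokens are randomly replaced with [MASK] at that ratio. The training objective is the cross-entropy over masked positions, \(\mathcal{L}_{CE} = -\sum_{i=1}^{3}\sum_{j \in \mathcal{M}_i}\log P(v_j^{(i)} \mid s_{\setminus\mathcal{M}})\), where \(\mathcal{M}_i\) is the set of masked positions in target view \(i\), \(v_j^{(i)}\) is the ground-truth token at position \(j\) of that view, and \(s_{\setminus\mathcal{M}}\) is the full sequence with the masked tokens hidden. At inference, iterative denoising uses confidence-based re-masking: a cosine, linear, or quadratic schedule determines how many tokens are re-masked at each step, with low-confidence tokens re-masked and high-confidence tokens retained.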
    • Design Motivation: Simple random masking combined with bidirectional self-attention naturally induces cross-view consistency (the model must use tokens from unmasked viewpoints to predict missing tokens in the current viewpoint) without requiring explicit geometric priors.
  3. Three-Stage Training Strategy:

    • Function: Transition from weak supervision (image-text alignment) to strong supervision (multi-view consistency).
    • Mechanism: Stage 1 involves pre-training on 1.2M image-text pairs; Stage 2 fine-tunes I2MV on 180K 3D objects (Objaverse + HSSD, rendering 8-frame orbital sequences at 30° elevation); Stage 3 trains T2MV on Objaverse enhanced with Cap3D descriptions to generate 4 viewpoints.
    • Design Motivation: Pre-training ensures basic token understanding; I2MV explicitly learns geometric constraints; T2MV introduces text conditioning to extend the application range.
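The sequence layout and masked training objective from the designs above can be illustrated with a small sketch. It assumes a toy vocabulary instead of the \(2^{18}\)-entry codebook, hypothetical ids for [SOI], [EOI], and [MASK] placed after the visual codebook, and independent per-token masking with probability \(r\); the source does not specify these details, and the helper names are made up for this example.

```python
import numpy as np


def special_ids(vocab_size):
    """Hypothetical ids for [SOI], [EOI], [MASK], placed after the visual codebook."""
    return vocab_size, vocab_size + 1, vocab_size + 2


def build_sequence(views, vocab_size):
    """Lay out views as [SOI] tokens [EOI] ... and record which view owns each position."""
    soi, eoi, _ = special_ids(vocab_size)
    seq, owner = [], []
    for idx, view in enumerate(views):
        seq += [soi, *view.tolist(), eoi]
        owner += [-1, *([idx] * len(view)), -1]  # -1 marks special tokens
    return np.array(seq), np.array(owner)


def random_mask(seq, owner, vocab_size, rng, ref_view=0):
    """Sample r ~ Uniform(0, 1) and mask each target-view token with probability r."""
    _, _, mask_id = special_ids(vocab_size)
    r = rng.uniform(0.0, 1.0)
    is_target = (owner >= 0) & (owner != ref_view)  # reference view stays visible
    masked = is_target & (rng.random(seq.shape[0]) < r)
    return np.where(masked, mask_id, seq), masked


def masked_cross_entropy(log_probs, targets, masked):
    """Sum of -log P(true token) over masked positions only."""
    idx = np.flatnonzero(masked)
    return float(-log_probs[idx, targets[idx]].sum())


# Toy usage: one reference view plus three target views, 8 tokens each.
rng = np.random.default_rng(0)
vocab_size, tokens_per_view = 32, 8
views = [rng.integers(0, vocab_size, size=tokens_per_view) for _ in range(4)]
seq, owner = build_sequence(views, vocab_size)
noisy, masked = random_mask(seq, owner, vocab_size, rng)
# A uniform predictor stands in for the model's output distribution.
uniform_log_probs = np.full((seq.shape[0], vocab_size), -np.log(vocab_size))
loss = masked_cross_entropy(uniform_log_probs, seq, masked)
```

Because the reference view and the separator tokens are never masked, any prediction for a hidden token has to draw on visible tokens from other views, which is the mechanism the design motivation credits for cross-view consistency.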

Key Experimental Results

Main Results

| Method | Architecture | GSO PSNR↑ | GSO SSIM↑ | 3D-FUTURE PSNR↑ | 3D-FUTURE SSIM↑ | Average Rank |
|---|---|---|---|---|---|---|
| Zero-1-to-3 | 2D Continuous Diffusion | 18.82 | 0.8294 | 17.05 | 0.8163 | 5.2 |
| Zero-1-to-3 XL | 2D Continuous Diffusion | 19.68 | 0.8381 | 18.47 | 0.8337 | 3.0 |
| ViVid-1-to-3 | 2D Continuous Diffusion | 19.80 | 0.8566 | 18.32 | 0.8437 | 3.3 |
| Ours | 2D Discrete Diffusion | 20.61 | 0.8561 | 19.99 | 0.8650 | 1.3 |

3D Reconstruction Consistency

| Method | GSO CD↓ | GSO IoU↑ | 3D-FUTURE CD↓ | 3D-FUTURE IoU↑ |
|---|---|---|---|---|
| Zero-1-to-3 | 0.0163 | 0.5665 | 0.0113 | 0.5005 |
| ViVid-1-to-3 | 0.0163 | 0.5841 | 0.0105 | 0.5246 |
| Ours | 0.0149 | 0.5847 | 0.0106 | 0.5315 |

Key Findings

  • Masking Schedule Strategy: Cosine (PSNR 18.10) outperforms linear and quadratic, which tend to produce hallucinations.
  • Generalization to High Viewpoint Counts: The model generalizes effectively even beyond the training token budget (8 viewpoints), validating the robustness of discrete sequence modeling.
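  • Gain: A 10.6% IoU improvement on 3D-FUTURE over the strongest continuous baseline is reported. Among the baselines listed in the 3D reconstruction table above, the gap is smaller: 0.5315 vs. 0.5246 for ViVid-1-to-3, about 1.3% relative.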
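The relative gaps implied by the 3D reconstruction table can be checked with a few lines of arithmetic. The sketch below uses only values copied from that table; the helper `relative_change` is a hypothetical name introduced for this example.

```python
def relative_change(ours: float, baseline: float) -> float:
    """Signed change of `ours` relative to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Values copied from the 3D reconstruction consistency table.
iou_3dfuture = {"Zero-1-to-3": 0.5005, "ViVid-1-to-3": 0.5246}
cd_gso = {"Zero-1-to-3": 0.0163, "ViVid-1-to-3": 0.0163}
ours_iou_3dfuture, ours_cd_gso = 0.5315, 0.0149

for name, value in iou_3dfuture.items():
    print(f"3D-FUTURE IoU vs {name}: {relative_change(ours_iou_3dfuture, value):+.1f}%")
for name, value in cd_gso.items():
    # Chamfer distance is lower-is-better, so a negative change is an improvement.
    print(f"GSO CD vs {name}: {relative_change(ours_cd_gso, value):+.1f}%")
```

Keeping the sign makes the direction explicit for lower-is-better metrics such as Chamfer distance.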

Highlights & Insights

  • Elegance of Paradigm Shift: Reframing multi-view generation as a discrete sequence prediction problem eliminates the need for explicit 3D geometric constraints. Simple random masking + bidirectional attention is sufficient to induce consistency.
  • Unified Multimodal Framework: A shared token embedding space naturally fuses text, reference images, and multiple target viewpoints, providing a seamless unification of I2MV and T2MV.
  • Breakthrough over Continuous Diffusion: The model reaches an average rank of 1.3, compared with 3.0 for Zero-1-to-3 XL, the best-ranked continuous baseline in the main results table.
  • Balance of Inference Efficiency and Quality: Masked diffusion predicts tokens in parallel, and 20 iteration steps are sufficient to reach state-of-the-art results among the compared methods.

Limitations & Future Work

  • Fixed elevation training data may limit generalization to extreme viewpoints.
  • Resolution bottleneck—currently validated only at 256×256.
  • Text condition richness in the T2MV task is limited—sourced only from Cap3D.
  • Improvements: Multi-resolution tokenization; increased data diversity; exploration of more refined confidence estimation; integration with 3D priors.

Comparison with Related Work

  • vs Zero-1-to-3 / Zero-1-to-3 XL: These use 2D continuous diffusion with camera parameter encoding and per-view independent generation; ViewMask serializes discrete tokens from all views into one sequence, so multi-view relationships are encoded through attention.
  • vs TRELLIS / 3D-aware methods: Explicit 3D representations are computationally expensive; ViewMask achieves superior geometric consistency through simple masking and attention without explicit 3D modeling.
  • vs Discrete Diffusion Baselines (LLaDA-V): This work represents the first systematic application of these concepts to multi-view generation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐