Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens¶

Conference: CVPR 2026
arXiv: 2603.19232
Code: GitHub
Area: Multimodal VLM
Keywords: discrete diffusion model, high-dimensional representation token, visual generation, dimension-wise quantization, unified multimodal

TL;DR¶

This paper proposes CubiD, the first model to perform discrete diffusion generation over high-dimensional representation tokens (768-dim). By conducting fine-grained mask prediction over an $h \times w \times d$ cubic tensor, CubiD achieves high-quality image generation while preserving visual understanding capability.

Background & Motivation¶

Demand for unified multimodal modeling: Language models use the same semantic tokens for both understanding and generation. Visual models remain fragmented: understanding relies on high-dimensional semantic features, while generation relies on low-dimensional compressed tokens (8–32 dimensions), which impedes unified architectures.

Reconstruction advantage of high-dimensional representations: Recent work (e.g., RAE) demonstrates that pretrained features of 768–1024 dimensions enable high-quality reconstruction, yet discrete generative modeling over such representations poses fundamental challenges.

Failure of vector quantization in high-dimensional spaces: Conventional VQ suffers from the curse of dimensionality: data sparsity renders clustering ineffective, the codebook size must grow exponentially, and quantization-induced feature shift severely degrades semantic information.

Feasibility of dimension-wise quantization: Quantizing each dimension independently circumvents the difficulty of joint quantization. As a training-free method, it can be directly applied to frozen pretrained features; however, generative modeling remains a bottleneck.

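Limitations of existing generative approaches: Autoregressive generation requires $O(hwd)$ sequential steps (one per element of the tensor), which is infeasible; standard discrete diffusion treats each spatial position as an atomic unit and therefore cannot model the dependencies among dimensions within a position.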

Core insight: The $h \times w \times d$ tensor has a natural multi-dimensional structure that can break the atomicity constraint of spatial positions, enabling flexible operations across the full three-dimensional space.

Method¶

Overall Architecture¶

CubiD consists of two stages: (1) high-dimensional token discretization, in which features are extracted with frozen pretrained encoders (DINOv2 or SigLIP2) and then quantized dimension-wise; and (2) Cubic Discrete Diffusion, which performs fine-grained mask modeling and iterative generation over the discrete $h \times w \times d$ tensor.

Key Designs¶

Dimension-wise Quantization¶

Each continuous value $z_{x,y,i}$ (dimension $i$ of the feature at spatial position $(x, y)$) is independently quantized into $L$ discrete levels: $q_{x,y,i} = \text{Quantize}(z_{x,y,i}; L)$. Using $L=8$ for DINOv2 and $L=16$ for SigLIP2 achieves reconstruction quality on par with continuous features. On LLaVA understanding benchmarks, dimension-wise quantization (DQ) incurs nearly no degradation (GQA: 63.1 vs. 63.2), whereas vector quantization (VQ) degrades severely (54.9).
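As a rough illustration of the idea, the sketch below quantizes every scalar of a feature tensor independently. The summary does not specify the exact quantizer, so the uniform bins over a clipped range $[-1, 1]$, the bin-centre dequantization, and the names `quantize` and `dequantize` are all assumptions made for this example.

```python
import numpy as np

def quantize(z, L, lo=-1.0, hi=1.0):
    """Map each scalar in z independently to an integer level in [0, L-1]."""
    z = np.clip(z, lo, hi)
    step = (hi - lo) / L                       # width of one bin (assumed uniform)
    q = np.floor((z - lo) / step).astype(np.int64)
    return np.minimum(q, L - 1)                # z == hi falls into the last bin

def dequantize(q, L, lo=-1.0, hi=1.0):
    """Return the centre of each bin as the reconstructed value."""
    step = (hi - lo) / L
    return lo + (q + 0.5) * step

rng = np.random.default_rng(0)
h, w, d, L = 4, 4, 768, 8                      # toy grid; d and L follow the DINOv2 setting
z = rng.uniform(-1.0, 1.0, size=(h, w, d))     # random stand-in for frozen encoder features
q = quantize(z, L)                             # integer tensor of shape (h, w, d)
z_hat = dequantize(q, L)
max_err = np.abs(z - z_hat).max()              # bounded by half a bin width
```

Because no codebook is learned, the procedure is training-free, and the reconstruction error per scalar is at most half a bin width under these assumptions.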

Fine-grained Cubic Masking¶

Unlike MaskGIT, which masks entire spatial positions, CubiD independently masks arbitrary elements of the $h \times w \times d$ tensor. During training, the masking ratio is sampled from a truncated Gaussian distribution, $r \sim \text{TruncNorm}(\mu=1.0, \sigma=0.10, [0,1])$, and the selected elements are replaced with a learnable [MASK] token. The model is trained with a cross-entropy loss to predict the masked tokens from the unmasked ones: $$\mathcal{L} = -\mathbb{E}\left[\sum_{i \in \mathbf{M}} \log p(q_i \mid \mathbf{q}_{\bar{\mathbf{M}}})\right]$$ where $\mathbf{M}$ is the set of masked elements and $\mathbf{q}_{\bar{\mathbf{M}}}$ denotes the unmasked elements.
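The training-time masking and loss can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the rejection sampler, the helper names, and the choice to average (rather than sum) the loss over masked elements are assumptions.

```python
import numpy as np

def sample_mask_ratio(rng, mu=1.0, sigma=0.10, lo=0.0, hi=1.0):
    """Rejection-sample r ~ TruncNorm(mu, sigma, [lo, hi])."""
    while True:
        r = rng.normal(mu, sigma)
        if lo <= r <= hi:
            return r

def cubic_mask(rng, shape, ratio):
    """Boolean mask over the whole h x w x d tensor; True marks a masked element."""
    n = int(np.prod(shape))
    k = max(1, int(round(ratio * n)))          # mask at least one element
    idx = rng.choice(n, size=k, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask.reshape(shape)

def masked_cross_entropy(logits, targets, mask):
    """Average of -log p(q_i | unmasked) over masked elements; logits: (h, w, d, L)."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_p, targets[..., None], axis=-1)[..., 0]
    return -picked[mask].mean()

rng = np.random.default_rng(0)
h, w, d, L = 4, 4, 16, 8                       # toy sizes
q = rng.integers(0, L, size=(h, w, d))         # stand-in for quantized features
r = sample_mask_ratio(rng)
mask = cubic_mask(rng, q.shape, r)
uniform_logits = np.zeros((h, w, d, L))        # a predictor that knows nothing
loss = masked_cross_entropy(uniform_logits, q, mask)
```

With $\mu = 1.0$ the sampled ratios concentrate near 1, so most training examples are heavily masked; for the uniform predictor above the loss equals $\log L$.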

Model Architecture¶

A standard bidirectional-attention Transformer is adopted. At each spatial position, the $d$ discrete values are dequantized and concatenated into one $d$-dimensional vector that is fed to the Transformer as a single token, so the sequence length stays fixed at $h \times w$ regardless of feature dimensionality. An MLP prediction head outputs $d \times L$ logits per position, i.e., an $L$-way distribution for each of the $d$ dimensions.
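The shape bookkeeping described above can be traced with a toy example. Everything model-related here is a placeholder: a single random linear layer stands in for the Transformer, `width` is a hypothetical hidden size, and representing the learnable [MASK] token as one placeholder value per dimension is an assumption, since the summary does not say how masked elements enter the input vector.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, d, L, width = 4, 4, 32, 8, 64            # toy sizes; `width` is hypothetical

q = rng.integers(0, L, size=(h, w, d))         # discrete tensor from dimension-wise quantization
mask = rng.random((h, w, d)) < 0.5             # True marks a masked element

# Dequantize to bin centres in [-1, 1] (assumed uniform quantizer).
values = -1.0 + (q + 0.5) * (2.0 / L)

# Masked elements take a per-dimension placeholder (zeros stand in for learned parameters).
mask_value = np.zeros(d)
x = np.where(mask, mask_value, values)         # (h, w, d)

# One token per spatial position: the sequence length is h*w, independent of d.
tokens = x.reshape(h * w, d)

# Stand-in for the bidirectional Transformer: one random linear map plus tanh.
W_in = rng.normal(size=(d, width)) / np.sqrt(d)
hidden = np.tanh(tokens @ W_in)                # (h*w, width)

# Prediction head: d*L logits per position, viewed as an L-way distribution per dimension.
W_out = rng.normal(size=(width, d * L)) / np.sqrt(width)
logits = (hidden @ W_out).reshape(h, w, d, L)
pred = logits.argmax(axis=-1)                  # (h, w, d) predicted levels
```

The point of the sketch is only that attention cost depends on $h \times w$, while the per-dimension predictions are recovered by reshaping the head output.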

Loss & Training¶

The cross-entropy loss from the masking objective (Eq. 3) is computed over masked elements only. Inference employs iterative unmasking with a cosine schedule, completing generation in a fixed number of steps $T$.
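One way to picture iterative unmasking with a cosine schedule is the MaskGIT-style loop below. It is a sketch under assumptions: the summary does not state how elements are chosen for unmasking, so the confidence-based selection, the function name `cosine_unmask`, and the uniform `predict` stand-in for the trained model are illustrative only.

```python
import numpy as np

def cosine_unmask(predict, shape, L, T, rng):
    """Fill an all-masked tensor in T steps, revealing the most confident samples first."""
    n = int(np.prod(shape))
    q = np.zeros(n, dtype=np.int64)
    masked = np.ones(n, dtype=bool)
    for t in range(1, T + 1):
        probs = predict(q.reshape(shape), masked.reshape(shape)).reshape(n, L)
        # Sample one level per element by inverting the CDF.
        cdf = probs.cumsum(axis=-1)
        u = rng.random((n, 1))
        sampled = np.minimum((u > cdf).sum(axis=-1), L - 1)
        conf = probs[np.arange(n), sampled]
        conf[~masked] = -np.inf                # decided elements are never re-picked
        # Cosine schedule: how many elements should still be masked after step t.
        keep_masked = max(0, int(np.floor(n * np.cos(0.5 * np.pi * t / T))))
        n_reveal = int(masked.sum()) - keep_masked
        if n_reveal > 0:
            reveal = np.argsort(-conf)[:n_reveal]
            q[reveal] = sampled[reveal]
            masked[reveal] = False
    return q.reshape(shape), masked.reshape(shape)

rng = np.random.default_rng(0)
shape, L, T = (4, 4, 16), 8, 16                # toy sizes

def uniform_predict(q, masked):
    """Hypothetical stand-in for the trained model: uniform over the L levels."""
    return np.full(q.shape + (L,), 1.0 / L)

q_gen, still_masked = cosine_unmask(uniform_predict, shape, L, T, rng)
```

The schedule reveals few elements in early steps and many in late ones, and at $t = T$ nothing remains masked, so generation always finishes in exactly $T$ model calls.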

Key Experimental Results¶

Main Results: ImageNet 256×256 Generation¶

| Method | Dim | Params | gFID↓ (w/o cfg) | IS↑ | gFID↓ (w/ cfg) |
|---|---|---|---|---|---|
| MaskGIT | 16 | 227M | 6.18 | 182.1 | 4.02 |
| CubiD-L (Ours) | 768 | 946M | 5.25 | — | — |
| CubiD-XXL (Ours) | 768 | 3.7B | 4.68 | — | 1.88 |

Ablation Study¶

| Ablation | Setting | gFID↓ |
|---|---|---|
| Masking strategy | Per-dim / Per-spatial / Per-element | 120.03 / 22.22 / 5.33 |
| Mask token | Fixed / Random / Learned | 5.56 / 56.38 / 5.33 |
| Model scale | 946M / 1.4B / 3.7B | 5.25 / 4.91 / 4.68 |
| Inference steps | 64 / 256 / 512 | 9.14 / 5.33 / 5.25 |
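The three masking strategies in the first ablation row can be contrasted with a small sketch. The reading used here is an assumption based on the names alone: "per-spatial" masks all $d$ dimensions of a position together (as MaskGIT does), "per-dim" masks a whole dimension at every position, and "per-element" masks each entry independently. The function `make_mask` is hypothetical.

```python
import numpy as np

def make_mask(rng, shape, ratio, strategy):
    """Return a boolean (h, w, d) mask; True marks a masked element."""
    h, w, d = shape
    if strategy == "per-element":              # any entry of the h x w x d tensor
        return rng.random((h, w, d)) < ratio
    if strategy == "per-spatial":              # whole positions: all d dimensions together
        return np.broadcast_to(rng.random((h, w, 1)) < ratio, shape).copy()
    if strategy == "per-dim":                  # whole dimensions: same channel at every position
        return np.broadcast_to(rng.random((1, 1, d)) < ratio, shape).copy()
    raise ValueError(f"unknown strategy: {strategy}")

rng = np.random.default_rng(0)
shape = (4, 4, 16)                             # toy sizes
masks = {s: make_mask(rng, shape, 0.5, s)
         for s in ("per-element", "per-spatial", "per-dim")}
```

Under this reading, only per-element masking exposes the model to partially observed positions and partially observed dimensions at the same time.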

Key Findings¶

Element-wise masking is critical: per-dim masking fails completely (gFID 120.03) and per-spatial masking produces blurry results (gFID 22.22), versus gFID 5.33 for per-element masking. This indicates that intra- and inter-position dependencies within high-dimensional tokens are inseparable and must be modeled jointly.
Dimension-wise quantization preserves understanding capability: DQ achieves near-identical performance to continuous features across four LLaVA benchmarks.
The model exhibits favorable scaling behavior: gFID improves from 5.25 to 4.91 to 4.68 as parameters grow from 946M to 1.4B to 3.7B.
The approach generalizes across encoders: both DINOv2 (gFID=5.25) and SigLIP2 (gFID=5.87) are effective.

Highlights & Insights¶

First work to achieve discrete generation over high-dimensional representation tokens, bridging unified representations for understanding and generation.
The fine-grained cubic masking design is elegant, transforming the intractable $O(hwd)$-step problem into parallel iterative generation completed in a fixed $T$ steps.
Experiments verify that discretized high-dimensional tokens can simultaneously serve both understanding and generation tasks.
Ablation studies cover masking granularity, mask-token type, model scale, and inference steps; masking granularity has by far the largest effect on gFID.

Limitations & Future Work¶

Validation is currently limited to class-conditional generation on ImageNet; text-guided generation has not been explored.
An external decoder (from RAE) is required to reconstruct images from representations.
Inference still requires hundreds of steps, leaving room for efficiency improvements.
An in-depth comparison with state-of-the-art continuous diffusion models (e.g., DiT) in terms of FID is lacking.

Related Work & Insights¶

The key distinction from MaskGIT lies in masking granularity: CubiD masks individual elements (single dimensions within a position) rather than whole spatial positions.
CubiD is complementary to RAE: RAE generates high-dimensional representations via continuous diffusion, while CubiD uses discrete diffusion.
The essential difference from low-dimensional discrete generation methods such as TiTok is that CubiD operates directly in the native dimensionality of pretrained features, preserving semantic integrity.
This work lays the groundwork for unified multimodal architectures where a single set of discrete tokens serves both understanding and generation.
The success of dimension-wise quantization offers an important insight for the VQ-VAE community—joint quantization is not necessary in high-dimensional spaces.

Rating¶

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐