
PixelDiT: Pixel Diffusion Transformers for Image Generation

  • Conference: CVPR 2026
  • arXiv: 2511.20645
  • Code: https://github.com/
  • Area: Image Generation
  • Keywords: Pixel diffusion, dual-level Transformer, end-to-end generation, pixel modeling, text-to-image

TL;DR

PixelDiT proposes a fully Transformer-based, dual-level pixel-space diffusion model: a patch-level DiT captures global semantics while a pixel-level DiT refines textural details. It achieves an FID of 1.61 on ImageNet 256×256 without any VAE and supports direct text-to-image training at 1024×1024 resolution in pixel space.

Background & Motivation

  1. Background: Latent-space diffusion is the dominant paradigm for DiT-based models; however, reliance on pretrained autoencoders introduces lossy reconstruction, limiting sampling fidelity and precluding joint optimization.
  2. Limitations of Prior Work: Pixel-space diffusion faces a fundamental tension: models must handle global semantics and high-frequency details simultaneously. Aggressive patchification discards fine details, whereas small patches produce long token sequences whose attention cost explodes.
  3. Key Challenge: No efficient pixel modeling mechanism exists that can jointly capture global semantics and perform per-pixel updates.
  4. Goal: Design a pure Transformer pixel-space diffusion model with explicitly structured pixel modeling.
  5. Key Insight: Decouple semantic learning from pixel-level updating into two hierarchical levels, each processed by Transformers operating at different granularities.
  6. Core Idea: A patch-level pathway performs long-range semantic attention (coarse granularity), while a pixel-level pathway performs dense per-pixel modeling (fine granularity), connected via pixel-wise AdaLN and token compaction.

Method

Overall Architecture

The dual-level architecture comprises a patch-level DiT that processes short token sequences with an aggressive patch size to capture global layout, and a pixel-level DiT (PiT blocks) that refines textures at pixel granularity. Pixel-wise AdaLN modulates each pixel token with semantic tokens; pixel token compaction compresses pixel tokens before global attention and decompresses them afterwards.
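
To make the scale concrete, here is the token bookkeeping implied by the dual-level split. The patch size of 16 is an illustrative assumption (the paper only says the patch-level DiT uses an aggressive patch size); the pixel-token count matches the 65,536-token figure quoted under Key Designs below.

```python
# Token bookkeeping for the dual-level split at 256x256.
# The patch size p = 16 is an illustrative assumption, not a number
# taken from the paper.
H = W = 256
p = 16

num_patch_tokens = (H // p) * (W // p)  # 16 * 16 = 256 tokens -> cheap global attention
num_pixel_tokens = H * W                # 65,536 tokens -> needs token compaction

print(num_patch_tokens, num_pixel_tokens)  # 256 65536
```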

Key Designs

  1. Pixel-wise AdaLN Modulation
     • Function: Injects patch-level semantic information into the processing of each individual pixel token.
     • Mechanism: Unlike standard AdaLN, which conditions on a single global signal (e.g., timestep), PixelDiT uses patch-level semantic tokens to generate independent modulation parameters for each pixel token. Each pixel token receives spatially corresponding scale and shift values derived from its associated semantic token (see the sketch below).
     • Design Motivation: A global condition treats all pixels uniformly, yet different spatial locations require distinct semantic guidance. Pixel-level modulation enables spatially adaptive conditioning.
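
A minimal PyTorch sketch of pixel-wise AdaLN under our own assumptions: the module name, the dimensions pix_dim/sem_dim, and the nearest-neighbor broadcast of each patch's modulation to the pixels it covers are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelwiseAdaLN(nn.Module):
    """Per-pixel scale/shift predicted from patch-level semantic tokens.

    Sketch only: each pixel token is normalized, then modulated by the
    (scale, shift) of the semantic token whose patch covers that pixel.
    """
    def __init__(self, pix_dim: int, sem_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(pix_dim, elementwise_affine=False)
        self.to_mod = nn.Linear(sem_dim, 2 * pix_dim)

    def forward(self, pix, sem, hw, patch):
        # pix: (B, H*W, Cp) pixel tokens, raster order
        # sem: (B, (H//patch)*(W//patch), Cs) semantic tokens, raster order
        B, _, Cp = pix.shape
        H, W = hw
        mod = self.to_mod(sem)                               # (B, Np, 2*Cp)
        mod = mod.transpose(1, 2).reshape(B, 2 * Cp, H // patch, W // patch)
        mod = F.interpolate(mod, scale_factor=patch, mode="nearest")
        mod = mod.reshape(B, 2 * Cp, H * W).transpose(1, 2)  # (B, H*W, 2*Cp)
        scale, shift = mod.chunk(2, dim=-1)
        return self.norm(pix) * (1 + scale) + shift          # spatially adaptive AdaLN
```

Nearest-neighbor upsampling simply copies a patch token's modulation to all p×p pixels beneath it; a learned upsampler would also fit the paper's description.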

  2. Pixel Token Compaction
     • Function: Makes global attention over per-pixel tokens computationally tractable while preserving full spatial resolution.
     • Mechanism: Before global attention, each pixel token is projected to a lower-dimensional representation via a linear projection; after attention, it is decompressed back to its original dimensionality. This lets the pixel-level pathway attend globally over the full-resolution token sequence without the cost exploding (see the sketch below).
     • Design Motivation: The number of pixel-level tokens is prohibitively large (e.g., 65,536 tokens at 256×256 resolution), making full attention at the original width infeasible. Compaction reduces channel dimensionality rather than spatial count, thereby preserving spatial resolution.
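
A sketch of compacted global attention, again under assumptions: the class name, the compression ratio r, and the use of a plain nn.MultiheadAttention are ours; the key point is that the token count N stays fixed while the per-token width shrinks.

```python
import torch.nn as nn

class CompactedSelfAttention(nn.Module):
    """Global attention over all pixel tokens with channel compaction.

    Compress C -> C/r before attention, decompress after: the sequence
    length (spatial resolution) is untouched; only the per-token width
    that attention operates on is reduced.
    """
    def __init__(self, dim: int, r: int = 4, num_heads: int = 4):
        super().__init__()
        assert (dim // r) % num_heads == 0
        self.down = nn.Linear(dim, dim // r)   # compact: C -> C/r
        self.attn = nn.MultiheadAttention(dim // r, num_heads, batch_first=True)
        self.up = nn.Linear(dim // r, dim)     # decompress: C/r -> C

    def forward(self, x):                      # x: (B, N, C), N = H*W
        z = self.down(x)
        z, _ = self.attn(z, z, z, need_weights=False)
        return x + self.up(z)                  # residual at full width
```

Note that attention remains quadratic in N; compaction shrinks the constant factor (per-token width and projection cost) rather than the sequence length, which is exactly the trade the paper describes: resolution is preserved, width is not.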

  3. Dual-Level Pathway Fusion
     • Function: Architecturally separates semantic learning from texture refinement.
     • Mechanism: The patch-level pathway consists of \(N\) enhanced DiT blocks using RMSNorm and 2D RoPE. The pixel-level pathway's PiT blocks receive patch-level outputs as semantic conditioning and produce the final per-pixel velocity predictions via pixel-wise AdaLN and compacted attention (a composition sketch follows below).
     • Design Motivation: Concentrating the majority of semantic reasoning on a low-resolution grid relieves the pixel-level pathway and accelerates learning.
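
Putting the two mechanisms together, one PiT block might look like the following, reusing the two sketches above; the residual wiring and the pointwise MLP are standard Transformer conventions, assumed rather than taken from the paper.

```python
import torch.nn as nn

class PiTBlock(nn.Module):
    """One pixel-level block: semantic conditioning via pixel-wise AdaLN,
    then compacted global attention, then a pointwise MLP. Sketch only."""
    def __init__(self, pix_dim: int, sem_dim: int, r: int = 4):
        super().__init__()
        self.adaln = PixelwiseAdaLN(pix_dim, sem_dim)      # from the sketch above
        self.attn = CompactedSelfAttention(pix_dim, r)     # from the sketch above
        self.mlp = nn.Sequential(
            nn.LayerNorm(pix_dim),
            nn.Linear(pix_dim, 4 * pix_dim),
            nn.GELU(),
            nn.Linear(4 * pix_dim, pix_dim),
        )

    def forward(self, pix, sem, hw, patch):
        # sem: output of the patch-level DiT blocks (semantic conditioning)
        pix = self.attn(self.adaln(pix, sem, hw, patch))   # attn adds its own residual
        return pix + self.mlp(pix)
```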

Loss & Training

Training uses a standard conditional flow matching loss applied directly in pixel space (a sketch follows below). For text-to-image generation, multi-modal DiT blocks are used.
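
A sketch of the training objective, assuming the common linear-interpolation (rectified-flow) instantiation of conditional flow matching; the model signature model(x_t, t, cond) is our placeholder.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching directly on pixels (sketch).

    Straight path x_t = (1 - t) * x0 + t * x1 with noise x0 ~ N(0, I);
    the target velocity along this path is the constant x1 - x0.
    """
    x0 = torch.randn_like(x1)                    # noise endpoint, pixel space
    t = torch.rand(x1.shape[0], device=x1.device)
    tb = t.view(-1, 1, 1, 1)
    xt = (1 - tb) * x0 + tb * x1                 # point on the straight path
    v_target = x1 - x0
    v_pred = model(xt, t, cond)                  # per-pixel velocity prediction
    return ((v_pred - v_target) ** 2).mean()
```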

Key Experimental Results

Main Results

| Method | Type | FID↓ (256) | FID↓ (512) | Notes |
|---|---|---|---|---|
| PixelDiT | Pixel | 1.61 | 1.81 | Pixel-space SOTA |
| DeCo | Pixel | 1.62 | 2.22 | Frequency decoupling |
| DiT-XL/2 | Latent | 2.27 | — | Requires VAE |
| PixelFlow | Pixel | — | — | Hierarchical method |

Ablation Study

| Configuration | Key Result | Notes |
|---|---|---|
| Pixel-wise AdaLN | Outperforms global AdaLN | Spatially adaptive modulation is effective |
| Token Compaction | Outperforms no compaction | Enables global attention at full resolution |
| Dual-level vs. single-level | Dual-level substantially better | Decoupled design is critical |

Key Findings

  • PixelDiT achieves the lowest FID among pixel-space models, demonstrating that a pure Transformer architecture can operate efficiently in pixel space.
  • Pixel-space models naturally avoid VAE reconstruction artifacts in image editing tasks, yielding better background preservation.
  • The model can be trained directly for T2I generation at 1024 resolution in pixel space, achieving 0.74 on GenEval and 83.5 on DPG-Bench.

Highlights & Insights

  • Fully end-to-end: A VAE-free, pure-Transformer architecture yields one of the simplest possible generative pipelines.
  • Token Compaction is a pragmatic engineering innovation: compressing in the channel dimension rather than reducing spatial tokens preserves full spatial resolution.
  • The work demonstrates that pixel-space diffusion can approach, and on several metrics surpass, latent-space diffusion.

Limitations & Future Work

  • Training cost remains higher than that of LDM-based approaches.
  • Text-to-image benchmark scores are slightly below the best LDM-based models (e.g., FLUX).
  • Future work may incorporate more advanced training techniques to further close this gap.
  • vs. DeCo: DeCo employs an attention-free linear decoder, whereas PixelDiT uses attention-equipped PiT blocks. The two approaches share a similar motivation but differ substantially in implementation.
  • vs. PixNerd: PixNerd uses neural field layers to predict pixel-level velocity; PixelDiT adopts a pure Transformer design, making it more architecturally standard.

Rating

  • Novelty: ⭐⭐⭐⭐ — Dual-level pixel Transformer design is novel, though concurrent with DeCo.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Validated across ImageNet, T2I, and editing tasks.
  • Writing Quality: ⭐⭐⭐⭐ — Architecture description is detailed and clear.
  • Value: ⭐⭐⭐⭐ — Revives pixel-space diffusion as a viable paradigm.