# PixelDiT: Pixel Diffusion Transformers for Image Generation

- Conference: CVPR 2026
- arXiv: 2511.20645
- Code: https://github.com/
- Area: Image Generation
- Keywords: Pixel diffusion, dual-level Transformer, end-to-end generation, pixel modeling, text-to-image
## TL;DR
PixelDiT proposes a fully Transformer-based, dual-level pixel-space diffusion model: a patch-level DiT captures global semantics while a pixel-level DiT refines textural details. It achieves an FID of 1.61 on ImageNet 256×256 without any VAE and enables direct text-to-image training at 1024×1024 resolution in pixel space.
## Background & Motivation
- Background: Latent-space diffusion is the dominant paradigm for DiT-based models; however, reliance on pretrained autoencoders introduces lossy reconstruction, limiting sampling fidelity and precluding joint optimization.
- Limitations of Prior Work: Pixel-space diffusion faces a fundamental tension: models must simultaneously handle global semantics and high-frequency details. Aggressive patchification discards fine detail, whereas small patches produce long token sequences whose attention cost grows quadratically.
- Key Challenge: No efficient pixel modeling mechanism exists that can jointly capture global semantics and perform per-pixel updates.
- Goal: Design a pure Transformer pixel-space diffusion model with explicitly structured pixel modeling.
- Key Insight: Decouple semantic learning from pixel-level updating into two hierarchical levels, each processed by Transformers operating at different granularities.
- Core Idea: A patch-level pathway performs long-range semantic attention (coarse granularity), while a pixel-level pathway performs dense per-pixel modeling (fine granularity), connected via pixel-wise AdaLN and token compaction.
## Method

### Overall Architecture
The dual-level architecture comprises a patch-level DiT that processes short token sequences with an aggressive patch size to capture global layout, and a pixel-level DiT (PiT blocks) that refines textures at pixel granularity. Pixel-wise AdaLN modulates each pixel token with semantic tokens; pixel token compaction compresses pixel tokens before global attention and decompresses them afterwards.
### Key Designs
- Pixel-wise AdaLN Modulation
    - Function: Injects patch-level semantic information into the processing of each individual pixel token.
    - Mechanism: Unlike standard AdaLN, which conditions on a single global signal (e.g., the timestep), PixelDiT uses patch-level semantic tokens to generate independent modulation parameters for each pixel token. Each pixel token receives spatially corresponding scale and shift values derived from its associated semantic token (see the first sketch after this list).
    - Design Motivation: A global condition treats all pixels uniformly, yet different spatial locations require distinct semantic guidance. Pixel-level modulation enables spatially adaptive conditioning.
- Pixel Token Compaction
    - Function: Makes global attention over per-pixel tokens computationally tractable while preserving full spatial resolution.
    - Mechanism: Before global attention, each pixel token is projected to a lower-dimensional representation via a linear projection; after attention, it is decompressed back to its original dimensionality (see the second sketch after this list). This lets the pixel-level pathway attend globally over the full-resolution token sequence at manageable cost.
    - Design Motivation: The number of pixel-level tokens is prohibitively large (e.g., 65,536 tokens at 256×256 resolution), making full attention at the original width infeasible. Compaction reduces channel dimensionality rather than spatial count, thereby preserving spatial resolution.
- Dual-Level Pathway Fusion
    - Function: Architecturally separates semantic learning from texture refinement.
    - Mechanism: The patch-level pathway consists of \(N\) enhanced DiT blocks using RMSNorm and 2D RoPE. The pixel-level pathway's PiT blocks receive the patch-level outputs as semantic conditioning and produce the final per-pixel velocity predictions via pixel-wise AdaLN and compaction attention.
    - Design Motivation: Concentrating the majority of semantic reasoning on a low-resolution grid alleviates the burden on the pixel-level pathway and accelerates learning.
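The first two mechanisms are easiest to see in code. Below is a minimal PyTorch sketch of pixel-wise AdaLN; the names (`PixelWiseAdaLN`, `pix_dim`, `sem_dim`, `patch`) and exact shapes are assumptions for illustration, not the authors' implementation. Each pixel token is modulated by scale/shift values predicted from the semantic token of the patch it belongs to, instead of from a single global vector.

```python
# Hypothetical sketch of pixel-wise AdaLN (not the authors' code).
import torch
import torch.nn as nn


class PixelWiseAdaLN(nn.Module):
    """Modulate each pixel token with scale/shift derived from its patch's
    semantic token, rather than one global conditioning vector."""

    def __init__(self, pix_dim: int, sem_dim: int, patch: int):
        super().__init__()
        self.patch = patch
        self.norm = nn.LayerNorm(pix_dim, elementwise_affine=False)
        # Each semantic token predicts (scale, shift) for its patch**2 pixel tokens.
        self.to_mod = nn.Linear(sem_dim, 2 * pix_dim)

    def forward(self, pix: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # pix: (B, H*W, pix_dim) pixel tokens; sem: (B, Np, sem_dim) semantic
        # tokens, with Np = (H // patch) * (W // patch).
        scale, shift = self.to_mod(sem).chunk(2, dim=-1)         # (B, Np, pix_dim) each
        # Broadcast each patch's modulation to its patch**2 pixels. This assumes
        # a patch-major token order (the pixels of one patch are contiguous in
        # the sequence); a raster-ordered layout would need an unfold/reshape.
        scale = scale.repeat_interleave(self.patch ** 2, dim=1)  # (B, H*W, pix_dim)
        shift = shift.repeat_interleave(self.patch ** 2, dim=1)
        return self.norm(pix) * (1 + scale) + shift
```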
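And a matching sketch of pixel token compaction, again with hypothetical names (`CompactionAttention`, `compact_dim`): channels are compressed before global self-attention and decompressed afterwards, so the token count, and hence spatial resolution, never changes. The residual connection is an assumption for illustration.

```python
# Hypothetical sketch of pixel token compaction (not the authors' code).
import torch
import torch.nn as nn


class CompactionAttention(nn.Module):
    """Global self-attention over all pixel tokens, run at a compressed
    channel width so full spatial resolution is preserved."""

    def __init__(self, dim: int, compact_dim: int, heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, compact_dim)  # compress channels before attention
        self.attn = nn.MultiheadAttention(compact_dim, heads, batch_first=True)
        self.up = nn.Linear(compact_dim, dim)    # decompress back afterwards

    def forward(self, pix: torch.Tensor) -> torch.Tensor:
        # pix: (B, N, dim) with N = H*W pixel tokens (e.g., N = 65,536 at 256x256)
        z = self.down(pix)
        z, _ = self.attn(z, z, z, need_weights=False)
        return pix + self.up(z)                  # residual at the original width
```

Note that attention still scales quadratically in the token count N; the compressed width is what keeps the projection and per-pair score costs affordable.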
### Loss & Training

Training uses the standard conditional flow matching loss, applied directly in pixel space; multi-modal DiT blocks handle conditioning for text-to-image generation.
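For concreteness, here is a minimal sketch of that objective under the common rectified-flow parameterization (linear interpolation path, constant velocity target); the `model` signature and conditioning argument are placeholders, not the paper's interface.

```python
# Hypothetical sketch of conditional flow matching in pixel space.
import torch


def flow_matching_loss(model, x0, cond):
    """x0: clean images (B, C, H, W); model predicts velocity v(x_t, t, cond)."""
    b = x0.size(0)
    x1 = torch.randn_like(x0)                       # Gaussian noise endpoint
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                     # linear interpolation path
    target = x1 - x0                                # constant velocity along the path
    v = model(x_t, t.flatten(), cond)               # per-pixel velocity prediction
    return torch.mean((v - target) ** 2)
```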
## Key Experimental Results

### Main Results
| Method | Type | FID↓ (256×256) | FID↓ (512×512) | Notes |
|---|---|---|---|---|
| PixelDiT | Pixel | 1.61 | 1.81 | Pixel-space SOTA |
| DeCo | Pixel | 1.62 | 2.22 | Frequency decoupling |
| DiT-XL/2 | Latent | 2.27 | — | Requires VAE |
| PixelFlow | Pixel | — | — | Hierarchical method |
### Ablation Study

| Configuration | Result | Takeaway |
|---|---|---|
| Pixel-wise AdaLN | Outperforms global AdaLN | Spatially adaptive modulation is effective |
| Token Compaction | Outperforms no compaction | Enables global attention |
| Dual-level vs. single-level | Dual-level substantially better | Decoupled design is critical |
### Key Findings
- PixelDiT achieves the lowest FID among pixel-space models, demonstrating that a pure Transformer architecture can operate efficiently in pixel space.
- Pixel-space models naturally avoid VAE reconstruction artifacts in image editing tasks, yielding better background preservation.
- The model can be trained directly for T2I generation at 1024×1024 resolution in pixel space, achieving 0.74 on GenEval and 83.5 on DPG-Bench.
## Highlights & Insights
- Fully end-to-end: a VAE-free, pure Transformer architecture yields an exceptionally simple generative pipeline.
- Token Compaction is a pragmatic engineering innovation: compressing in the channel dimension rather than reducing spatial tokens preserves full spatial resolution.
- The work demonstrates that pixel-space diffusion can approach, and on some benchmarks surpass, latent-space diffusion.
## Limitations & Future Work
- Training cost remains higher than that of LDM-based approaches.
- Text-to-image benchmark scores are slightly below the best LDM-based models (e.g., FLUX).
- Future work may incorporate more advanced training techniques to further close this gap.
## Related Work & Insights
- vs. DeCo: DeCo employs an attention-free linear decoder, whereas PixelDiT uses attention-equipped PiT blocks. The two approaches share a similar motivation but differ substantially in implementation.
- vs. PixNerd: PixNerd uses neural field layers to predict pixel-level velocity; PixelDiT adopts a pure Transformer design, making it more architecturally standard.
## Rating
- Novelty: ⭐⭐⭐⭐ — Dual-level pixel Transformer design is novel, though concurrent with DeCo.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Validated across ImageNet, T2I, and editing tasks.
- Writing Quality: ⭐⭐⭐⭐ — Architecture description is detailed and clear.
- Value: ⭐⭐⭐⭐ — Revives pixel-space diffusion as a viable paradigm.