DiP: Taming Diffusion Models in Pixel Space

Conference: CVPR 2026
arXiv: 2511.18822
Code: GitHub
Area: Image Generation / Pixel-space Diffusion
Keywords: Pixel-space Diffusion, Patch Detailer Head, Global-Local Decoupling, End-to-End Generation, Efficient Inference

TL;DR

The paper proposes DiP, an efficient pixel-space diffusion framework. It pairs a DiT backbone that models global structure over large patches with a lightweight Patch Detailer Head that recovers local details, achieving computational efficiency comparable to LDMs without requiring a VAE and reaching 1.79 FID on ImageNet 256×256.

Background & Motivation

Background: Latent Diffusion Models (LDMs), which run diffusion in a VAE-compressed latent space, have become the de facto standard, but the VAE introduces information loss and precludes end-to-end training. Pixel-space diffusion models preserve the full signal but suffer from high computational cost.

Limitations of Prior Work: (a) The VAE in LDMs acts as an information bottleneck, introducing reconstruction artifacts and limiting the upper bound of image fidelity; (b) Existing pixel-space models (e.g., PixelFlow, SiD) use small patches (2×2 or 4×4), where sequence length grows quadratically with resolution, making training and inference infeasible.

Key Challenge: Pixel-space models face a quality-efficiency dilemma: small patches retain detail but cause sequence explosion, while large patches are efficient but lose high-frequency information, since the patch embedding compresses the rich spatial content of each patch into a single token before DiT's self-attention ever sees it.
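
To make the dilemma concrete: with \(N = (H \times W)/P^2\) tokens, a 256×256 image gives \(N = 16{,}384\) at \(P=2\) but only \(N = 256\) at \(P=16\); since self-attention cost grows with \(N^2\), the large-patch setting is roughly 4,096× cheaper at the attention level.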

Goal: To achieve efficiency comparable to LDMs in pixel space while avoiding VAE information loss and retaining the advantages of end-to-end training.

Key Insight: Decoupling global structure modeling from local detail recovery—using DiT with large patches (16×16) for efficient global modeling and a lightweight CNN head for local detail restoration.

Core Idea: A DiT backbone operates on large patches to maintain efficiency, while a co-trained convolutional U-Net Patch Detailer Head injects local inductive biases, adding only 0.3% more parameters.

Method

Overall Architecture

Given a noisy image \(x_t \in \mathbb{R}^{H \times W \times 3}\), it is divided into \(N = (H \times W)/P^2\) large patches (\(P=16\)). The DiT backbone processes the patch sequence to output global features \(S_{\text{global}} \in \mathbb{R}^{N \times D}\). The Patch Detailer Head processes each patch independently and in parallel: it receives the corresponding global feature \(s_i\) and the original noisy pixel patch \(p_i\) to predict the noise component \(\epsilon_i\).
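
To make the shapes concrete, below is a minimal patchify sketch (PyTorch; a hypothetical stand-in for illustration, not the authors' code) showing how a 256×256 image becomes the 256-token sequence the backbone consumes:

```python
import torch

P = 16                                       # DiP's large patch size

def patchify(x: torch.Tensor) -> torch.Tensor:
    """(B, 3, H, W) -> (B, N, 3*P*P) with N = H*W / P^2."""
    B, C, H, W = x.shape
    x = x.unfold(2, P, P).unfold(3, P, P)    # (B, C, H/P, W/P, P, P)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

x_t = torch.randn(2, 3, 256, 256)            # a batch of noisy images
tokens = patchify(x_t)
print(tokens.shape)                          # torch.Size([2, 256, 768]):
                                             # 256 tokens, matching the sequence
                                             # length of a latent-space DiT
```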

Key Designs

  1. Global Structure Modeling (DiT Backbone):

    • Function: Uses large patches with \(P=16\) to model the global layout and semantic content of the image.
    • Mechanism: Divides a 256×256 image into 256 tokens (aligned with the sequence length of LDMs in latent space), capturing long-range dependencies through DiT block self-attention to output context-aware features.
    • Design Motivation: Large patches dramatically reduce sequence length, aligning computational complexity with LDMs. Single-image overfitting experiments (Fig. 3) verify that a DiT-only model captures the global layout and tones but fails to render fine textures and sharp edges, an inherent limitation of lacking local inductive bias.
  2. Patch Detailer Head (Lightweight U-Net):

    • Function: Recovers high-frequency details for each large patch.
    • Mechanism: A shallow convolutional U-Net (4 downsampling + 4 upsampling stages), where each block contains Conv+SiLU+Pooling. The global feature \(s_i \in \mathbb{R}^{D \times 1 \times 1}\) is concatenated channel-wise with the bottleneck features to guide local refinement (see the sketch after this list).
    • Design Motivation: The natural inductive biases of convolution (locality, translation equivariance) suit denoising of local textures and edges. Experiments comparing four head architectures (Standard MLP with no spatial bias, Coord-based MLP in the NeRF style, Intra-Patch Attention, and Convolutional U-Net) showed the U-Net performs best while adding the fewest parameters (only a 0.3% increase in total).
  3. Post-hoc Refinement:

    • Function: Attaches the Head after the final layer of the DiT, treating refinement as a post-hoc step.
    • Mechanism: Three placement strategies were compared: post-hoc, intermediate injection, and hybrid. All were effective, but post-hoc performed best.
    • Design Motivation: Treating DiT as a black-box backbone avoids modifying its internal structure, maximizing simplicity and allowing the use of pre-trained DiT weights.
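
Below is a minimal sketch of such a head, assuming PyTorch; the 4+4 stages, Conv+SiLU+Pool blocks, bottleneck concatenation of \(s_i\), and the kernel sizes given under Loss & Training follow the paper, while the channel width and skip wiring are assumptions:

```python
import torch
import torch.nn as nn

class PatchDetailerHead(nn.Module):
    """Sketch of the detailer head: 4 down + 4 up stages over a 16x16 patch,
    with the global feature s_i concatenated at the 1x1 bottleneck.
    Channel widths and skip wiring are assumed, not taken from the paper."""

    def __init__(self, d_global: int, width: int = 64):
        super().__init__()
        chs = [3, width, width, width, width]
        self.down = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, padding=1),  # k=3, p=1
                          nn.SiLU(),
                          nn.AvgPool2d(2))
            for i in range(4)                              # 16 -> 8 -> 4 -> 2 -> 1
        ])
        self.mid = nn.Conv2d(width + d_global, width, 1)   # fuse s_i at the bottleneck
        self.up = nn.ModuleList([
            nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                          nn.Conv2d(2 * width, width, 3, padding=1),
                          nn.SiLU())
            for _ in range(4)                              # 1 -> 2 -> 4 -> 8 -> 16
        ])
        self.out = nn.Conv2d(width, 3, 1)                  # kernel=1 final layer

    def forward(self, p: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # p: (B*N, 3, 16, 16) noisy pixel patch; s: (B*N, D, 1, 1) global feature
        skips, h = [], p
        for blk in self.down:
            h = blk(h)
            skips.append(h)
        h = self.mid(torch.cat([h, s], dim=1))             # (B*N, width, 1, 1)
        for blk in self.up:
            h = blk(torch.cat([h, skips.pop()], dim=1))    # U-Net skip connections
        return self.out(h)                                 # per-patch noise prediction

head = PatchDetailerHead(d_global=1152)        # 1152 = DiT-XL hidden size
p = torch.randn(8, 3, 16, 16)                  # 8 patches, processed in parallel
s = torch.randn(8, 1152, 1, 1)                 # their global features from the DiT
print(head(p, s).shape)                        # torch.Size([8, 3, 16, 16])
```

Note that with \(P=16\), four pooling stages shrink the patch to exactly 1×1 at the bottleneck, which is where the \(D \times 1 \times 1\) global feature slots in via channel concatenation.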

Loss & Training

  • Supports DDPM noise prediction and Flow Matching frameworks.
  • Uses DDT (a DiT variant) as the backbone with the AdamW optimizer.
  • EMA decay of 0.9999, batch size 256.
  • Patch Detailer Head uses kernel=3, padding=1 for intermediate layers, and kernel=1 for the final layer.
  • Uses an Euler-100 sampler.
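
As a rough illustration of the flow-matching side, here is a generic sketch (PyTorch); the linear interpolation path, velocity parameterization, and plain MSE loss are common choices assumed here, and `model` stands in for the full DiP (backbone + head), so this is not necessarily the paper's exact formulation:

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Generic flow-matching objective on pixels (assumed formulation)."""
    b = x0.size(0)
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # t ~ U[0, 1]
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps          # linear noising path
    v_target = eps - x0                   # velocity target d x_t / d t
    v_pred = model(x_t, t.view(b))        # DiP predicts the velocity field
    return (v_pred - v_target).pow(2).mean()

@torch.no_grad()
def sample_euler(model, shape, steps: int = 100, device: str = "cpu"):
    """Euler-100 sampling: integrate the learned velocity from t=1 (noise) to t=0."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        x = x + (ts[i + 1] - ts[i]) * model(x, ts[i].expand(shape[0]))
    return x
```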

Key Experimental Results

Main Results

| Method | Type | FID↓ | sFID↓ | IS↑ | Prec.↑ | Rec.↑ | Latency | Params |
|---|---|---|---|---|---|---|---|---|
| DiT-XL (LDM) | Latent | 2.27 | 4.60 | 278.2 | 0.83 | 0.57 | 2.09s | 675M+86M |
| SiT-XL (LDM) | Latent | 2.06 | 4.50 | 270.3 | 0.82 | 0.59 | 2.09s | 675M+86M |
| PixelFlow-XL/4 | Pixel | 1.98 | 5.83 | 282.1 | 0.81 | 0.60 | 7.50s | 677M |
| VDM++ | Pixel | 2.12 | - | 278.1 | - | - | - | 2.46B |
| DiP-XL/16 (600ep) | Pixel | 1.79 | 4.59 | 281.9 | 0.80 | 0.63 | 0.92s | 631M |

Ablation Study (Patch Detailer Head Architecture)

| Architecture | FID↓ | sFID↓ | IS↑ | Training Cost | Latency |
|---|---|---|---|---|---|
| DiT-only (629M) | 5.28 | 6.56 | 243.8 | 84×8 GPU h | 0.88s |
| + Standard MLP | 6.92 | 7.27 | 210.9 | 93×8 GPU h | 0.91s |
| + Coord-based MLP | 2.20 | 4.49 | 284.6 | 123×8 GPU h | 0.95s |
| + Intra-Patch Attn | 2.98 | 5.16 | 275.0 | 96×8 GPU h | 0.94s |
| + Conv U-Net (Ours) | 2.16 | 4.79 | 276.8 | 87×8 GPU h | 0.92s |

| Expand DiT-only vs. Add Head | FID↓ | Params | Training Cost | Latency |
|---|---|---|---|---|
| DiT-only (1536 hidden dim) | 2.83 | 1.1B | 149×8 GPU h | 1.49s |
| DiT-XL + Conv U-Net | 2.16 | 631M | 87×8 GPU h | 0.92s |

Key Findings

  • More efficient than scaling: Adding the Head at a 0.3% parameter cost beats scaling DiT to 1.1B parameters (2.16 vs. 2.83 FID) and is 38% faster.
  • Comparison with PixelFlow: Among pixel-space methods, DiP's inference latency is only 0.92s vs 7.50s (8× faster), with better FID.
  • Value of Local Inductive Bias: The standard MLP head actually hurt (FID 6.92 vs. 5.28 for DiT-only), indicating that simple intra-patch transformations are insufficient and that convolutional spatial priors are necessary.
  • t-SNE Validation: After adding the Head, intra-class aggregation in the feature space is tighter, and inter-class separation is clearer.

Highlights & Insights

  • Elegant Design Philosophy: The global-local decoupling principle is simple yet effective; adding just 0.3% parameters solves the core bottleneck of pixel-space diffusion models.
  • Efficiency-Quality Pareto Optimality: Reaches a new Pareto frontier in the FID-latency space (Fig.2).
  • End-to-End Advantages: No VAE pre-training required, avoiding information bottlenecks and the defects of non-end-to-end training.

Limitations & Future Work

  • Currently only validated on ImageNet 256×256; higher resolutions (512+) and text-guided generation remain for exploration.
  • The Patch Detailer Head processes each patch independently, which may risk inconsistencies at patch boundaries.
  • Comparison with the latest LDM methods (e.g., FLUX) on text-to-image tasks is not yet sufficient.
  • Difference from PixelNerd: PixelNerd is tightly coupled with NeRF rendering mechanisms, limiting architectural exploration; DiP proposes a more general design principle.
  • Difference from JiT: JiT models high-dimensional pixel data by predicting clean images; DiP maintains efficiency via global-local decoupling.
  • Insight: The logic of large patches + local refinement may be applicable to other tasks requiring efficient processing of high-resolution inputs.

Rating

  • Novelty: ⭐⭐⭐⭐ Global-local decoupling is simple and effective, though not conceptually complex.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely systematic, including architecture comparisons (4 heads), placement strategies (3 types), scale-up comparisons, and multiple training budgets.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is well-validated (single-image overfitting experiment is very persuasive), with clear charts.
  • Value: ⭐⭐⭐⭐⭐ Provides a practical and efficient solution for pixel-space diffusion, potentially driving the development of VAE-free generation.