Pyramidal Patchification Flow for Visual Generation

  • Conference: ICLR 2026
  • arXiv: 2506.23543
  • Code: GitHub
  • Area: Diffusion Model Acceleration / Image Generation
  • Keywords: Pyramidal patchification, flow matching, DiT, inference acceleration, variable token count

TL;DR

This paper proposes Pyramidal Patchification Flow (PPFlow), which employs larger patches at high-noise timesteps and smaller patches at low-noise timesteps, achieving 1.6–2.0× denoising speedup while preserving generation quality, without requiring any re-noising tricks.

Background & Motivation

  • DiT applies a fixed patch size (typically \(2\times2\)) across all timesteps, resulting in wasteful computation at high-noise timesteps.
  • Limitations of existing acceleration methods:
    • Step reduction (DDIM, distillation): sacrifices generation quality.
    • Per-step cost reduction (quantization, pruning): limited potential gains.
    • Pyramidal/cascaded generation (Pyramidal Flow): introduces "jump points" requiring complex re-noising tricks.
  • Core observation: Spatial details are unimportant at high noise levels and can be represented with fewer tokens.

Method

Pyramidal Patchification Scheme

The timestep range is divided into multiple stages, each using a different patch size.

Three-stage example:

  • \([0, t_{s_1})\): patch size \(4\times4\) (high noise, \(L = (I/4)^2\) tokens)
  • \([t_{s_1}, t_{s_2})\): patch size \(4\times2\)
  • \([t_{s_2}, 1]\): patch size \(2\times2\) (low noise, standard DiT)
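
As a rough illustration, a sampler might select the patch size per step as in the sketch below; the split points (1/3, 2/3) and the "t = 0 is pure noise" convention are placeholder assumptions, not the paper's exact values:

```python
# Hypothetical three-stage schedule on flow time t in [0, 1], with t = 0 as pure noise.
# The split points stand in for the paper's t_s1 and t_s2.
STAGES = [
    (0.0, 1 / 3, (4, 4)),    # high noise: 4x4 patches, fewest tokens
    (1 / 3, 2 / 3, (4, 2)),  # intermediate: 4x2 patches
    (2 / 3, 1.0, (2, 2)),    # low noise: standard DiT 2x2 patches
]

def patch_size_for(t: float) -> tuple[int, int]:
    """Return the (patch_h, patch_w) used at flow time t."""
    for lo, hi, patch in STAGES:
        if lo <= t < hi or (t == 1.0 and hi == 1.0):
            return patch
    raise ValueError(f"t={t!r} is outside [0, 1]")
```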

Parameter Sharing Strategy

  • Patchify/Unpatchify: each stage has an independent linear projection matrix \(\mathbf{W}_{s_i} \in \mathbb{R}^{d \times d_{s_i}}\).
  • DiT blocks: parameters are shared across all stages.
  • Key point: The cost of Patchify is independent of patch size (\(L_s \times d_s \times d = I^2 C d\)), while the cost of DiT blocks scales with token count.
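
A rough PyTorch sketch of this sharing scheme: one linear patchify/unpatchify pair per stage around a single shared block stack. The module names, shapes, and the plain TransformerEncoderLayer stand-in for a DiT block are illustrative assumptions, not the authors' implementation:

```python
import torch.nn as nn

class PPFlowBackbone(nn.Module):
    """Per-stage patchify/unpatchify projections around shared transformer blocks (sketch)."""

    def __init__(self, latent_ch=4, hidden=1152, depth=28,
                 stage_patches=((4, 4), (4, 2), (2, 2))):
        super().__init__()
        # One linear pair per stage: the patch dimension d_s = p_h * p_w * C differs by stage.
        self.patchify = nn.ModuleList(
            nn.Linear(ph * pw * latent_ch, hidden) for ph, pw in stage_patches
        )
        self.unpatchify = nn.ModuleList(
            nn.Linear(hidden, ph * pw * latent_ch) for ph, pw in stage_patches
        )
        # The transformer blocks are shared by every stage (plain encoder layers as stand-ins).
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, nhead=16, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, patches, stage: int):
        # patches: (B, L_s, p_h * p_w * C), already extracted for this stage's patch size.
        x = self.patchify[stage](patches)
        for blk in self.blocks:
            x = blk(x)
        return self.unpatchify[stage](x)
```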

Computational Complexity

Complexity per DiT block: \(\mathcal{O}(L_s^2 d + L_s d^2)\) (attention plus MLP), where \(L_s\) is the token count at stage \(s\) and \(d\) is the hidden dimension.

In DiT-XL/2, approximately 99.8% of FLOPs reside in the DiT blocks, so reducing the token count directly reduces computation:

  • Two-stage PPFlow: 37.8% FLOPs reduction
  • Three-stage PPFlow: 50.6% FLOPs reduction
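
A back-of-the-envelope comparison under standard transformer FLOP accounting (roughly \(4Ld^2\) for attention projections, \(2L^2d\) for the attention itself, \(8Ld^2\) for a 4×-expansion MLP). The 32×32 latent grid assumes a 256×256 image with an 8× VAE, and all constants are approximate:

```python
def dit_block_flops(tokens: int, d: int = 1152) -> int:
    """Approximate multiply-accumulates in one DiT block (constants are rough)."""
    attn_proj = 4 * tokens * d * d       # Q, K, V and output projections
    attn_core = 2 * tokens * tokens * d  # attention scores + weighted value sum
    mlp = 8 * tokens * d * d             # two linears with 4x hidden expansion
    return attn_proj + attn_core + mlp

base = dit_block_flops((32 // 2) * (32 // 2))  # 2x2 patches on a 32x32 latent grid
for ph, pw in [(2, 2), (4, 2), (4, 4)]:
    L = (32 // ph) * (32 // pw)
    print(f"{ph}x{pw}: {L:4d} tokens, ~{dit_block_flops(L) / base:.2f}x of the 2x2 block cost")
```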

Initialization from Pretrained DiT

Patchify initialization (averaging): extends the \(2\times2\) patch projection to \(4\times4\):

\[\mathbf{W}_2 = \frac{1}{4}[\mathbf{W}, \mathbf{W}, \mathbf{W}, \mathbf{W}]\]

Unpatchify initialization (replication):

\[\mathbf{W}_2^u = [(\mathbf{W}^u)^\top, (\mathbf{W}^u)^\top, (\mathbf{W}^u)^\top, (\mathbf{W}^u)^\top]^\top\]
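
A sketch of both initializations, assuming nn.Linear-style weight layouts (patchify weight of shape \((d, 4C)\) for a \(2\times2\) patch, unpatchify weight of shape \((4C, d)\)) and glossing over how the four \(2\times2\) sub-patches are ordered inside a \(4\times4\) patch:

```python
import torch

def init_patchify_4x4(w_2x2: torch.Tensor) -> torch.Tensor:
    """Averaging init: W_4x4 = (1/4) [W, W, W, W] along the input dimension.

    w_2x2: (d, 4C) pretrained 2x2 patchify weight -> (d, 16C), so a 4x4 patch
    is initially projected as the average of its four 2x2 sub-patch projections.
    """
    return torch.cat([w_2x2] * 4, dim=1) / 4.0

def init_unpatchify_4x4(wu_2x2: torch.Tensor) -> torch.Tensor:
    """Replication init: stack the 2x2 unpatchify weight four times.

    wu_2x2: (4C, d) pretrained 2x2 unpatchify weight -> (16C, d), so every
    2x2 sub-patch of a 4x4 patch is initially decoded with the same weights.
    """
    return torch.cat([wu_2x2] * 4, dim=0)
```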

Key Differences from Pyramidal Flow

| Property | PPFlow | Pyramidal Flow |
|---|---|---|
| Operating resolution | Full-resolution latent space | Pyramidal (multi-resolution) |
| Continuity equation | Satisfied | Not satisfied |
| Jump points | None | Present (requires re-noising tricks) |
| Inference pipeline | Identical to standard DiT | Requires special handling |

Key Experimental Results

Training from Scratch (ImageNet 256×256, SiT-B)

| Method | Training Steps | FID-50K (↓) | IS (↑) | Test FLOPs (%) | Speedup |
|---|---|---|---|---|---|
| SiT-B/2 | 7M | 4.46 | - | 100% | 1.00× |
| PPF-B-2 | 7M | 3.83 | - | ~62% | 1.61× |
| PPF-B-3 | 7M | 4.43 | - | ~49% | 2.04× |

Fine-tuning from Pretrained SiT-XL/2

| Method | Additional Training FLOPs | FID-50K (↓) | Test Speedup |
|---|---|---|---|
| SiT-XL/2 | Baseline | ~2.06 | 1.00× |
| PPF-XL-2 | +8.9% | Comparable | 1.60× |
| PPF-XL-3 | +7.1% | Comparable | 2.02× |

Text-to-Image (Based on FLUX.1-dev)

| Resolution | Speedup | GenEval | DPG-bench |
|---|---|---|---|
| 512×512 | 1.61× | Comparable | Comparable |
| 1024×1024 | 1.76× | Comparable | Comparable |
| 2048×2048 | 1.86× | Comparable | Comparable |

Key Findings

  1. Two-stage and three-stage PPFlow achieve 1.6× and 2.0× inference speedup, respectively, with quality preserved.
  2. Fine-tuning from a pretrained model requires only ~8% additional training cost.
  3. PPF-B-2 trained from scratch even surpasses the SiT-B/2 baseline (FID: 3.83 vs. 4.46).
  4. Speedup is more pronounced at higher resolutions (1.86× at 2048), as the large-patch stage reduces tokens more aggressively.
  5. Stage-aware CFG scheduling (e.g., [1.5, 3.5, 4.0]) further improves generation quality.
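
A minimal sketch of stage-aware CFG, assuming the [1.5, 3.5, 4.0] scales map to the high-, mid-, and low-noise stages in that order (both the mapping and the split points are assumptions for illustration):

```python
def cfg_scale_for(t: float, splits=(1 / 3, 2 / 3), scales=(1.5, 3.5, 4.0)) -> float:
    """Pick the guidance scale for the stage containing flow time t (t = 0 is pure noise)."""
    if t < splits[0]:
        return scales[0]
    if t < splits[1]:
        return scales[1]
    return scales[2]

def guided_velocity(v_cond, v_uncond, t: float):
    """Standard classifier-free guidance, with the scale chosen per stage."""
    return v_uncond + cfg_scale_for(t) * (v_cond - v_uncond)
```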

Highlights & Insights

  1. Minimalist design: only the linear projections in Patchify/Unpatchify are modified; DiT blocks are fully shared across stages.
  2. No re-noising tricks: the method always operates in the full-resolution latent space, eliminating the complexity inherent in Pyramidal Flow.
  3. Training–inference consistency: each patch size is trained only at its corresponding timesteps (unlike FlexiDiT/Lumina-Video, which train all patch sizes across all timesteps).
  4. Patch n' Pack: variable-length token packing is exploited to reduce training FLOPs.

Limitations & Future Work

  • More aggressive pyramidal schemes with more than three stages remain underexplored.
  • The selection of patch sizes and timestep split points is largely heuristic.
  • At small resolutions (e.g., 256×256), large patches may discard excessive spatial information.
  • Stage-aware CFG scheduling expands the hyperparameter search space.

Related Work

  • Inference acceleration: DDIM, Progressive Distillation, Consistency Models
  • Pyramidal/cascaded generation: Pyramidal Flow, PixelFlow, Cascaded Diffusion
  • Variable patch size: FlexiViT, FlexiDiT, Lumina-Video
  • DiT architectures: DiT, SiT, FLUX

Rating

  • Novelty: ⭐⭐⭐⭐ — The idea is straightforward, yet the full-resolution operation that distinguishes PPFlow from Pyramidal Flow constitutes a key innovation.
  • Technical Depth: ⭐⭐⭐ — The method is intuitive and the theoretical analysis is relatively simple.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Validated via both from-scratch training and pretrained fine-tuning, covering class-conditional and text-to-image settings.
  • Practical Value: ⭐⭐⭐⭐⭐ — Plug-and-play; significant speedup achievable at low fine-tuning cost.