Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems

Conference: CVPR 2025
arXiv: 2603.13069
Code: None
Area: Image Generation / Diffusion Model Theory
Keywords: Diffusion Models, Fractal Geometry, PIFS, DDIM, Noise Schedule, Theoretical Analysis

TL;DR

Proves that the DDIM deterministic reverse chain is a Partitioned Iterated Function System (PIFS), deriving three computationally accessible geometric quantities (contraction threshold \(L_t^*\), expansion function \(f_t(\lambda)\), and global expansion threshold \(\lambda^{**}\)) that require no model evaluation. Based on this, it theoretically explains four existing empirical design choices (cosine offset, resolution logSNR shift, Min-SNR weighting, and Align Your Steps).

Background & Motivation

Background: The theoretical foundation of diffusion models is established on SDEs/ODEs, providing global guarantees for distribution convergence. However, the continuous perspective treats the score network as a black box, failing to explain two core phenomena: (a) Why does the denoising chain assemble global spatial context at high noise levels and synthesize local details at low noise levels? (b) Why is self-attention so effective?

Limitations of Prior Work: Many design choices in diffusion models remain empirical. Why does the offset \(s_{off}=0.008\) work well in the cosine schedule? Why is Min-SNR weighting effective? A unified theoretical framework for understanding and predicting these designs is lacking.

Key Challenge: The existing theory is elegant but offers little structural insight: SDE theory tells us that the chain converges, but not how it converges.

Goal: To provide a unified design language for understanding and optimizing schedules, architectures, and training objectives of diffusion models.

Key Insight: In 1988, Barnsley proposed that natural images exhibit local self-similarity and can therefore be compressed using Partitioned Iterated Function Systems (PIFS). This work finds that the DDIM reverse chain is precisely a PIFS: each denoising step is a partitioned contractive mapping.

Core Idea: The denoising chain of a diffusion model is a PIFS, whose fractal geometry fully characterizes the two-stage structure of denoising dynamics.

Method

Overall Architecture

The single-step DDIM operator \(\Phi_t(x) = \frac{\sqrt{\bar\alpha_{t-1}}}{\sqrt{\bar\alpha_t}} x + b_t \hat\varepsilon_\theta(x, t)\) is treated as one step of a PIFS. The core of the analysis is the contraction/expansion behavior of its Jacobian, specifically the interaction between the diagonal blocks (intra-patch dynamics) and the cross-patch blocks (attention coupling).
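The coefficient \(b_t\) is not spelled out in this summary. As a worked reference, and assuming the standard deterministic DDIM update (\(\eta = 0\)), the operator expands as follows; the shorthand \(a_t\) is introduced here for illustration only and is not notation taken from the paper.

\[
x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\frac{x_t - \sqrt{1-\bar\alpha_t}\,\hat\varepsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}}\,\hat\varepsilon_\theta(x_t,t) = a_t\,x_t + b_t\,\hat\varepsilon_\theta(x_t,t)
\]

\[
a_t = \sqrt{\bar\alpha_{t-1}/\bar\alpha_t} > 1, \qquad b_t = \sqrt{1-\bar\alpha_{t-1}} - a_t\sqrt{1-\bar\alpha_t} < 0, \qquad D\Phi_t(x) = a_t I + b_t\,\partial_x \hat\varepsilon_\theta(x,t)
\]

Under this reading, along a direction in which \(\partial_x \hat\varepsilon_\theta\) acts as multiplication by a scalar \(\mu \ge 0\), the step multiplier is \(a_t - |b_t|\mu\). It drops below 1 exactly when \(\mu > (a_t - 1)/|b_t|\), which is the quantity that appears as \(L_t^*\) in Section 3.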

Key Designs

Contraction Structure (Section 3):
- Derives two contraction conditions: (EC) Euclidean Contraction and (PC) Patchwise Maximum Norm Contraction.
- Contraction threshold \(L_t^* = (\sqrt{\bar\alpha_{t-1}/\bar\alpha_t} - 1) / |b_t|\)—solely determined by the noise schedule, independent of the data or model.
- Score-matching training is the diffusion analog of Barnsley's collage theorem error minimization.
- \(L_2\)-\(W_1\) bridge: the training loss controls the Wasserstein distance of the PIFS attractor.

Two-Stage Structure (Section 4):
- Regime I (High Noise): Diffuse attention maintains strong cross-patch coupling (large \(\delta_t^{cross}\)), where the learned "directional inhibition field" \(S_{k,t}\) holds diagonal blocks below the expansion threshold \(\rightarrow\) global context assembly.
- Regime II (Low Noise): Attention localizes, and inhibition is released patch-by-patch in order of variance \(\rightarrow\) local detail synthesis.
- Why self-attention is effective: It precisely controls \(\delta_t^{cross}\) (via upper-bounding softmax weights), making it a natural primitive for PIFS contraction.
- The two-stage transition aligns with the spontaneous symmetry breaking reported by Raya & Ambrogioni (2023).

Attractor Geometry (Section 5):
- The Kaplan-Yorke dimension of the PIFS attractor is determined by the discrete Moran S-equation: \(\prod_t f_t(\lambda^{**}) = 1\).
- A patch direction contributes to sample diversity \(\iff\) its leading variance exceeds \(\lambda^{**}\).

Three Design Guidelines (Section 6):
- Guideline 1: Maximize the weakest-link contraction threshold \(\min_t L_t^*\) (inject noise early, raising \(v_1\)).
- Guideline 2: Balance the Lyapunov contribution of each step, i.e., minimize \(\text{Var}_t(\Delta d_t)\); this approximately recovers the Information Capacity Criterion.
- Guideline 3: Balance the workload across sampling steps by concentrating steps where \(L_t^*\) is smallest.

Theoretical Explanation of Four Empirical Designs

Empirical Design	Corresponding Guideline	PIFS Explanation
Cosine offset \(s_{off}=0.008\)	Guideline 1	Increases \(L_1^*\) from \(7.9 \times 10^{-4}\) to \(3.2 \times 10^{-3}\) (4x), enhancing the contraction margin of the weakest step
Resolution logSNR shift	Prerequisite for Guideline 1	The schedule must cover the logSNR range of detail patch transitions
Min-SNR weighting	Guideline 2	Balances the information gain of each step, equivalent to balancing KY dimension growth
Align Your Steps	Guideline 3	Concentrates sampling steps where the geometric contribution is maximized
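For readers unfamiliar with two of the designs in the table, the sketch below shows the forms they are commonly given: the Min-SNR-\(\gamma\) loss weight for an \(\varepsilon\)-prediction loss, and the resolution-dependent logSNR shift relative to a 64-pixel base. The function names, the default \(\gamma = 5\), and the base resolution of 64 are assumptions for illustration, not details taken from this paper.

```python
import numpy as np

def snr(alpha_bar):
    """Signal-to-noise ratio of the forward process at each timestep."""
    alpha_bar = np.asarray(alpha_bar, dtype=float)
    return alpha_bar / (1.0 - alpha_bar)

def min_snr_weight(alpha_bar, gamma=5.0):
    """Min-SNR-gamma weight for an epsilon-prediction loss: min(SNR, gamma) / SNR.

    Steps with SNR <= gamma keep weight 1; cleaner (high-SNR) steps are down-weighted.
    """
    s = snr(alpha_bar)
    return np.minimum(s, gamma) / s

def shifted_logsnr(logsnr, resolution, base_resolution=64):
    """Shift a logSNR schedule designed at base_resolution to another resolution.

    Higher resolutions get a lower logSNR (more noise) at the same timestep.
    """
    return logsnr + 2.0 * np.log(base_resolution / resolution)

# Illustrative values only (not data from the paper).
alpha_bar_demo = np.array([0.99, 0.5, 0.01])
weights_demo = min_snr_weight(alpha_bar_demo)
shift_256 = shifted_logsnr(0.0, resolution=256)
```

Both helpers depend only on the schedule, which is why they can be compared against schedule-only quantities such as \(L_t^*\).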

Key Experimental Results

Schedule Comparison

Schedule	Steps	Average \(L_t^*\)	CV(\(L_t^*\))	Finest Step \(L_t^*\)
Linear (DDPM)	1000	0.805	0.341	0.00500
Cosine (\(s_{off}=0\))	1000	0.637	0.483	0.00079
Cosine (\(s_{off}=0.008\))	1000	0.641	0.474	0.00321
50-step DDIM	50	0.637	0.483	0.01571
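Because \(L_t^*\) depends only on the noise schedule, the finest-step column can be recomputed without any model. The sketch below is a minimal illustration, not the authors' code: it assumes the standard DDIM (\(\eta = 0\)) coefficient \(b_t = \sqrt{1-\bar\alpha_{t-1}} - \sqrt{\bar\alpha_{t-1}/\bar\alpha_t}\,\sqrt{1-\bar\alpha_t}\), the usual DDPM linear betas from \(10^{-4}\) to \(0.02\), and the cosine schedule without beta clipping. All function names are hypothetical.

```python
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative alpha_bar_0..alpha_bar_T for the DDPM linear beta schedule (assumed endpoints)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def cosine_alpha_bar(T, s_off=0.008):
    """Cumulative alpha_bar_0..alpha_bar_T for the cosine schedule with offset s_off (no clipping)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s_off) / (1.0 + s_off) * np.pi / 2.0) ** 2
    return f / f[0]

def contraction_threshold(alpha_bar):
    """L_t^* = (sqrt(alpha_bar_{t-1} / alpha_bar_t) - 1) / |b_t| for each consecutive pair.

    Entry 0 of the result is the finest step (t = 1).
    """
    prev, cur = alpha_bar[:-1], alpha_bar[1:]
    a = np.sqrt(prev / cur)
    b = np.sqrt(1.0 - prev) - a * np.sqrt(1.0 - cur)  # assumed DDIM (eta = 0) coefficient
    return (a - 1.0) / np.abs(b)

L_linear = contraction_threshold(linear_alpha_bar(1000))
L_cos_0 = contraction_threshold(cosine_alpha_bar(1000, s_off=0.0))
L_cos_8 = contraction_threshold(cosine_alpha_bar(1000, s_off=0.008))
# 50 sampling steps: keep every 20th timestep of the zero-offset cosine schedule.
L_ddim_50 = contraction_threshold(cosine_alpha_bar(1000, s_off=0.0)[::20])

for name, L in [("linear", L_linear), ("cosine s_off=0", L_cos_0),
                ("cosine s_off=0.008", L_cos_8), ("50-step", L_ddim_50)]:
    print(f"{name:>20s}  finest step L* = {L[0]:.5f}  mean = {L.mean():.3f}")
```

Under these assumptions the finest-step values come out at roughly 0.0050, 0.00079, and 0.0032 for the three 1000-step schedules. Subsampling the zero-offset cosine schedule every 20 steps gives about 0.0157, which suggests, but does not confirm, that the 50-step row uses that base schedule.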

Information Gain Balancing

Schedule	CV(\(IG_t\))	CV(\(|\Delta d_t|\))	Spearman \(\rho(IG, \Delta d)\)
Linear	1.107	0.836	0.9999
Cosine	0.867	0.570	0.9998
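The two statistics in this table can be sketched in a few lines. The example below is illustrative only: the arrays are synthetic stand-ins rather than the paper's \(IG_t\) or \(\Delta d_t\) values, and the helper names are hypothetical. It also shows that Spearman's \(\rho\) measures monotone (rank) agreement, so \(\rho \approx 1\) is consistent with proportionality but does not establish it on its own.

```python
import numpy as np

def coefficient_of_variation(x):
    """Population standard deviation divided by the mean; 0 means perfectly balanced steps."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()

def spearman_rho(x, y):
    """Spearman rank correlation via Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Synthetic per-step sequences: y is a monotone but non-proportional function of x.
x = np.linspace(0.1, 2.0, 50)
y = x ** 3

rho_rank = spearman_rho(x, y)          # rank agreement is perfect
rho_linear = np.corrcoef(x, y)[0, 1]   # linear agreement is not
cv_x, cv_y = coefficient_of_variation(x), coefficient_of_variation(y)
```

A lower CV corresponds to a more evenly spread per-step contribution, which is the sense in which the cosine row is better balanced than the linear row.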

Key Findings

\(L_t^*\) is smallest at \(t=1\) (finest step): \(L_t^* \approx \frac{1}{2}\sqrt{v_t}\), indicating that detail synthesis is the most constrained stage.
All \(8 \times 8\) patches on CIFAR-10 are expansion-forced throughout the entire 1000-step chain: The leading eigenvalue far exceeds \(\lambda^*(t) \approx 1.002\).
IG and KY dimension growth are near-perfectly rank-correlated: Spearman \(\rho > 0.999\) for both schedules, consistent with the predicted proportionality and validating the tightness of the theoretical CS inequality.
The linear schedule balances \(L_t^*\) better (lower CV) but balances IG worse; the cosine schedule is the opposite: No single schedule is optimal for both.
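The finest-step approximation \(L_1^* \approx \frac{1}{2}\sqrt{v_1}\) can be checked with a short derivation. This sketch assumes \(v_t = 1 - \bar\alpha_t\) (the noise variance, which this summary does not define explicitly), \(\bar\alpha_0 = 1\), and the standard DDIM (\(\eta = 0\)) coefficient \(b_t = \sqrt{1-\bar\alpha_{t-1}} - \sqrt{\bar\alpha_{t-1}/\bar\alpha_t}\,\sqrt{1-\bar\alpha_t}\).

\[
\sqrt{\bar\alpha_0/\bar\alpha_1} = \frac{1}{\sqrt{1 - v_1}}, \qquad b_1 = -\frac{\sqrt{v_1}}{\sqrt{1 - v_1}}
\]

\[
L_1^* = \frac{\sqrt{\bar\alpha_0/\bar\alpha_1} - 1}{|b_1|} = \frac{1 - \sqrt{1 - v_1}}{\sqrt{v_1}} = \tfrac{1}{2}\sqrt{v_1} + O\!\left(v_1^{3/2}\right)
\]

As a sanity check under the usual DDPM choice \(\beta_1 = 10^{-4}\), \(v_1 = 10^{-4}\) gives \(L_1^* \approx 0.005\), which agrees with the Linear (DDPM) row of the schedule comparison table.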

Highlights & Insights

Unifying fractal image compression from 1988 with diffusion models from 2020: A deep mathematical connection where Barnsley's self-similar structure drives the success of diffusion models. Score-matching is collage theorem error minimization—not an analogy, but a mathematical identity.
Three model-evaluation-free geometric quantities form the "design language": \(L_t^*\), \(f_t(\lambda)\), and \(\lambda^{**}\) are entirely determined by the schedule and data covariance spectrum. The behavior of schedules can be predicted before training any model.
Extremely clear PIFS explanation of the two stages: Regime I's "inhibition field" maintains contraction \(\rightarrow\) global structure assembly; Regime II's inhibition release \(\rightarrow\) detail emergence. This is an inevitable mathematical consequence, not a post-hoc description.
Structural proof for the necessity of self-attention: It controls the cross-patch coupling \(\delta_t^{cross}\), serving as the natural contraction primitive required by PIFS.

Limitations & Future Work

Analysis limited to DDIM (deterministic sampling): The PIFS structure under DDPM (stochastic sampling) remains an open problem.
Experiments primarily based on CIFAR-10 analysis: Not yet validated on high-resolution datasets (ImageNet 256/512).
Gaussian Assumption: The attractor dimension analysis relies on a block-diagonal Gaussian assumption.
Practical training effects of the PIFS regularizer are not fully validated: Theoretically, it can widen the contraction margin, but when the extra computational cost becomes worthwhile remains unclear.

Related Work & Insights

vs Raya & Ambrogioni (2023): They discovered two-stage behavior and symmetry breaking, whereas this work provides a precise geometric mechanistic explanation.
vs Kingma et al. (2021) Information Capacity Criterion: This work derives equivalent conclusions from a completely different perspective (equalizing KY dimension growth), offering a deeper understanding.
Insights: The research paradigm of working backward from existing empirical designs to the theory that makes them optimal is highly inspiring: it seeks to understand why existing practices work rather than proposing new methods.

Rating

Novelty: ⭐⭐⭐⭐⭐ The unification of diffusion models and fractal geometry is a brand-new perspective, with profound mathematical contributions.
Experimental Thoroughness: ⭐⭐⭐ Primarily theoretical, with empirical validation concentrated on CIFAR-10 analysis.
Writing Quality: ⭐⭐⭐⭐ Rigorous mathematical derivations, but with a high entry barrier for non-theoretical readers.
Value: ⭐⭐⭐⭐⭐ Provides a unified theoretical design language for diffusion models, explaining multiple empirical designs.