# Neural Entropy
**Conference:** NeurIPS 2025 · **arXiv:** 2409.03817 · **Code:** None · **Area:** Generative Models / Information Theory · **Keywords:** Diffusion Models, Information Theory, Entropy, Data Compression, Neural Networks
## TL;DR
Through the lens of diffusion models, this paper connects deep learning to information theory. It introduces a "neural entropy" measure that quantifies how much information a neural network stores during the diffusion process, and shows that image diffusion models compress structured data with remarkably high efficiency.
## Background & Motivation

### Limitations of Prior Work
Diffusion models work by converting noise into structured data; at their core, they "recover" the information that was erased when the data was diffused into noise. During training, this information is stored in the neural network's parameters. However, a systematic theoretical framework for quantifying how much information is stored has been lacking.
Key questions:

1. How much information does a diffusion model contain, and how can it be measured?
2. How efficient is a neural network as an information storage medium?
3. How does the diffusion process itself, not merely the data distribution, affect the amount of stored information?
These questions carry not only theoretical significance but also practical implications for understanding the compression capacity, generalization ability, and training optimization of generative models.
## Method

### Overall Architecture
The authors establish a rigorous correspondence between diffusion models and information theory:
- Forward diffusion: Data → Noise, with information progressively erased
- Information storage: During training, erased information is transferred into the neural network parameters
- Reverse generation: Noise → Data, with the neural network releasing stored information to reconstruct structure
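To make this correspondence concrete, the sketch below implements a standard variance-preserving forward noising step. This is an illustrative assumption; the paper's exact parameterization is not reproduced here.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).

    As abar_t -> 0 with growing t, x_t approaches pure Gaussian noise,
    i.e. the structure in x_0 is progressively erased.
    """
    a = alpha_bar[t]
    eps = np.random.randn(*x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

# Linear beta schedule (illustrative values, not the paper's).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.random.rand(28, 28)                 # stand-in for a data sample
xT = forward_diffuse(x0, T - 1, alpha_bar)  # nearly pure noise
```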
### Key Designs
**Definition of Neural Entropy**

- Neural entropy is identified with the total entropy produced by the diffusion process.
- It is a function not only of the data distribution but also of the diffusion process itself.
- Mathematically, neural entropy \(S_\text{neural}\) can be computed from the score function of the diffusion process:
\[
S_\text{neural} = \frac{1}{2} \int_0^T g(t)^2 \, \mathbb{E}_{x \sim p_t}\!\left[ \big\lVert \nabla_x \log p_t(x) \big\rVert^2 \right] \mathrm{d}t
\]

where \(T\) is the diffusion time, \(p_t\) is the marginal distribution at time \(t\), and \(g(t)\) is the diffusion coefficient of the forward SDE.
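A minimal Monte Carlo sketch of this integral, assuming access to a trained score model. Here `score_fn`, `sample_xt`, and `g` are hypothetical stand-ins for the learned score \(s_\theta \approx \nabla_x \log p_t\), a sampler for the forward marginals, and the diffusion coefficient.

```python
import numpy as np

def estimate_neural_entropy(score_fn, sample_xt, g, T, n_times=50, n_samples=256):
    """Monte Carlo estimate of (1/2) * int_0^T g(t)^2 * E[||score||^2] dt.

    score_fn(x, t): learned approximation to grad_x log p_t(x)   (hypothetical)
    sample_xt(t, n): draws n samples from the forward marginal p_t (hypothetical)
    g(t): diffusion coefficient of the forward SDE
    """
    ts = np.linspace(0.0, T, n_times)
    rates = []
    for t in ts:
        x = sample_xt(t, n_samples)          # (n_samples, dim)
        s = score_fn(x, t)                   # scores at those samples
        rates.append(0.5 * g(t) ** 2 * np.mean(np.sum(s ** 2, axis=-1)))
    return np.trapz(rates, ts)               # quadrature over diffusion time
```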
**Relationship to Classical Entropy**

- Neural entropy provides a finer-grained information measure than Shannon entropy.
- It captures the structured information in data rather than mere statistical randomness.
- In limiting cases, neural entropy reduces to classical information-theoretic quantities.
**Dependence on the Diffusion Process**

- Different noise schedules give rise to different neural entropies.
- This implies that how information is "encoded" depends on the choice of diffusion process.
- It offers an information-theoretic perspective on hyperparameter selection for diffusion models.
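For reference, two of the schedule shapes compared in the experiments below can be written as follows. The cosine form uses the Nichol & Dhariwal parameterization; this is an assumption, since the paper's exact settings are not restated here.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_min=1e-4, beta_max=2e-2):
    """Cumulative signal retention abar_t for a linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """Cosine schedule: abar(t) proportional to cos^2(((t/T + s)/(1 + s)) * pi/2)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f[1:] / f[0]

# The schedules erase information at different rates over diffusion time,
# which is why they yield different neural entropies (see tables below).
```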
### Loss & Training
- Standard diffusion model training via denoising score matching
- Neural entropy is indirectly measured through the convergence behavior of the training loss
- Neural entropy variations are analyzed across different datasets and diffusion configurations
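The training objective can be sketched as a standard DDPM-style epsilon-prediction loss. This form is assumed here, since no code is released; minimizing it matches the score of \(p_t\) up to a time-dependent weighting.

```python
import torch
import torch.nn.functional as F

def dsm_loss(model, x0, alpha_bar):
    """Denoising score matching in epsilon-prediction form.

    model(x_t, t): network predicting the noise added at step t (hypothetical API)
    alpha_bar:     1-D tensor of cumulative products abar_t
    """
    b = x0.shape[0]
    t = torch.randint(0, alpha_bar.numel(), (b,), device=x0.device)
    a = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward-diffused sample
    return F.mse_loss(model(xt, t), eps)
```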
## Key Experimental Results

### Neural Entropy Measurements
Neural entropy measurements across different image datasets:
| Dataset | # Images | Resolution | Neural Entropy (nats/image) | Neural Entropy (nats/pixel) | Compression Ratio |
|---|---|---|---|---|---|
| MNIST | 60,000 | 28×28 | 142.3 | 0.182 | 43.8× |
| CIFAR-10 | 50,000 | 32×32 | 1,847.5 | 0.601 | 13.3× |
| CelebA | 202,599 | 64×64 | 5,234.8 | 1.278 | 6.25× |
| LSUN-Bedroom | 3,033,042 | 256×256 | 28,471.2 | 0.434 | 18.4× |
### Effect of Noise Schedule on Neural Entropy
| Noise Schedule | CIFAR-10 Neural Entropy | Training Steps | FID |
|---|---|---|---|
| Linear | 1,847.5 | 800K | 3.21 |
| Cosine | 1,692.1 | 800K | 2.94 |
| Sigmoid | 1,731.8 | 800K | 3.08 |
| VP-SDE | 1,804.3 | 800K | 3.15 |
### Ablation Study
| Analysis Dimension | Finding |
|---|---|
| Dataset size vs. neural entropy | Approximately log-linear relationship: \(S \propto \log N\) |
| Resolution vs. neural entropy | Sublinear growth: slower than \(O(d)\) in the data dimension \(d\) |
| Model capacity vs. neural entropy | A saturation point exists beyond which neural entropy no longer increases significantly |
| Data diversity vs. neural entropy | More data categories correspond to higher neural entropy |
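The log-linear relationship in the first row can be sanity-checked with a one-line fit. The \((N, S)\) pairs below are hypothetical placeholders, not measurements from the paper.

```python
import numpy as np

# Hypothetical (dataset size, neural entropy) pairs, for illustration only.
N = np.array([1e3, 1e4, 1e5, 1e6])
S = np.array([980.0, 1450.0, 1910.0, 2380.0])

# Fit S = a * log(N) + b; a stable slope `a` across decades of N
# is what the "approximately log-linear" finding above refers to.
a, b = np.polyfit(np.log(N), S, 1)
print(f"S ~ {a:.1f} * log(N) + {b:.1f}")
```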
### Key Findings
- Exceptionally high compression efficiency: Diffusion models store far less information per image than the raw pixel data requires
- Distinctiveness of structured information: Highly structured data (e.g., human faces) exhibits lower per-pixel neural entropy
- Effect of the diffusion process: The cosine schedule outperforms the linear schedule in information utilization efficiency
- Scaling behavior: As dataset size grows, the average per-sample neural entropy increases slowly, indicating that the model learns shared structure
## Highlights & Insights
- Theoretical contribution: This work establishes, for the first time, a rigorous measure of information content in diffusion models, bridging deep learning and statistical physics
- Practical implications: Neural entropy measurements can guide the design and optimization of diffusion models
- Information compression perspective: Reveals the essential nature of generative models as "information compressors"
- Interdisciplinary value: Connects three fields — information theory, statistical mechanics, and deep learning
## Limitations & Future Work
- Experiments are conducted primarily on simple image datasets; validation on large-scale, high-resolution models is insufficient
- Computing neural entropy requires a trained model, incurring high computational cost
- Theoretical analysis is mainly applicable to continuous diffusion models; extension to discrete diffusion models remains to be explored
- The quantitative relationship between neural entropy and model generalization is not discussed
## Related Work & Insights
- Diffusion model theory (Song et al., 2021; Kingma et al., 2021) → Provides the theoretical foundation of score matching and SDEs
- Information bottleneck theory (Tishby et al., 2000) → Measures information in neural networks from a different perspective
- Minimum description length (Rissanen, 1978) → Classical theory of data compression and model selection
- Statistical physics and deep learning → This work embodies a profound connection between these two fields
## Rating
- Novelty: ★★★★★ — First rigorous definition and measurement of information content in diffusion models
- Theoretical Depth: ★★★★★ — Establishes an elegant connection between information theory and diffusion models
- Experimental Thoroughness: ★★★☆☆ — Experimental scale is limited; primarily a proof of concept
- Writing Quality: ★★★★☆ — Theoretical exposition is clear, but requires substantial background in information theory