Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning
Conference: NeurIPS 2025 | arXiv: 2505.12387 | Code: None | Area: Optimization Theory / Representation Learning | Keywords: Entropic forces, SGD dynamics, parameter symmetry, Platonic representation hypothesis, gradient balance
TL;DR
This paper establishes a "neural thermodynamics" framework, proving that emergent entropic forces, arising from the stochasticity and discrete-time updates of SGD, systematically break continuous parameter symmetries while preserving discrete ones. The result is a gradient balance phenomenon analogous to thermodynamic equipartition, which (a) provides the first theoretical proof of the Platonic Representation Hypothesis (that different models learn similar representations) and (b) reconciles the seemingly contradictory observations of sharpness-seeking and flatness-seeking behavior in deep learning optimization.
Background & Motivation
Background: Deep learning and large language models continue to exhibit surprising emergent phenomena — models trained with different architectures and different data converge to approximately the same internal representations (the Platonic Representation Hypothesis, proposed by Huh et al. 2024 but supported only by empirical observation). SGD simultaneously exhibits flatness-seeking behavior (implicit regularization, loss landscape preference) and, in certain experiments, sharpness-seeking behavior. These phenomena lack a unified theoretical explanation.
Limitations of Prior Work: (1) The Platonic Representation Hypothesis, despite broad empirical attention, has received no rigorous mathematical proof or mechanistic explanation. (2) The debate over whether SGD finds flat or sharp minima has persisted for years — methods such as SAM explicitly optimize for flatness, yet SGD empirically converges to sharp regions in some settings, leaving both sides with evidence but no unifying framework. (3) Information-theoretic and statistical-mechanics analyses have largely relied on overly strong assumptions (e.g., approximating SGD as continuous-time Langevin dynamics), neglecting discrete-time effects — which this paper shows are precisely the source of the key entropic forces.
Key Challenge: Neural network parameter spaces contain a large number of symmetry-induced equivalent configurations — for example, multiplying one layer's weights by \(\alpha\) and dividing the next layer's weights by \(\alpha\) leaves the network's function unchanged (rescaling symmetry). How SGD selects a particular configuration from these equivalence classes during training, and why this selection is consistent across models, is currently not understood theoretically.
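Both kinds of symmetry-induced equivalence are easy to check numerically. The sketch below is our illustration (not code from the paper): it verifies that rescaling adjacent layers and permuting hidden neurons leave a two-layer ReLU network's output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # first layer:  8 -> 16
W2 = rng.normal(size=(4, 16))   # second layer: 16 -> 4
x = rng.normal(size=(8,))

def f(W1, W2, x):
    """Two-layer ReLU network."""
    return W2 @ np.maximum(W1 @ x, 0.0)

y = f(W1, W2, x)

# Continuous rescaling symmetry: W1 -> a*W1, W2 -> W2/a for any a > 0.
a = 3.7
assert np.allclose(f(a * W1, W2 / a, x), y)

# Discrete permutation symmetry: relabel hidden units consistently
# (permute rows of W1 and the matching columns of W2).
perm = rng.permutation(16)
assert np.allclose(f(W1[perm], W2[:, perm], x), y)
```

Rescaling works here because ReLU is positively homogeneous, \(\mathrm{relu}(\alpha z) = \alpha\,\mathrm{relu}(z)\) for \(\alpha > 0\), so the symmetry forms a continuous one-parameter family; the permutation symmetry, by contrast, has only finitely many elements.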
Goal: (1) Establish a rigorous entropic force theory describing SGD dynamics under discrete time steps and finite learning rates. (2) Explain through symmetry-breaking mechanisms why different models converge to similar representations. (3) Provide a unified explanation of SGD's sharp/flat behavior across different directions of parameter space.
Key Insight: The authors draw from statistical physics, treating SGD's stochastic gradient noise as thermal fluctuations and the symmetries of parameter space as degrees of freedom. The core insight is that under discrete time steps (rather than the continuous-time approximation), the "forces" generated by SGD include not only the gradient force (\(-\nabla L\)) but also an emergent entropic force arising from the geometric structure of parameter space. This force originates from differences in the "volume" of different regions of parameter space — analogous to the free energy \(F = E - TS\) in statistical mechanics.
Core Idea: The discrete stochasticity of SGD generates entropic forces that selectively break continuous symmetries while preserving discrete ones, pushing different models toward the same "typical" region of parameter space, thereby causing representation convergence.
Method
Overall Architecture
This is a theory-driven work and proposes no new algorithm. The logical chain is: (1) define an "entropic loss landscape" — augmenting the standard loss with an entropic term determined by the volume factor of parameter space; (2) derive effective dynamical equations for SGD on this landscape and prove the mathematical form of the entropic force; (3) analyze the differential effects of the entropic force on two classes of parameter symmetries (continuous vs. discrete); (4) derive the gradient balance theorem from symmetry breaking; (5) apply gradient balance to two core problems — representation alignment and sharp/flat selection.
Key Designs
- Entropic Loss Landscape:
- Function: Explicitly encodes the implicit regularization effect of SGD as a correction to the loss landscape.
- Mechanism: For networks with parameter symmetries, many distinct parameter configurations \(\theta\) realize the same input-output mapping \(f_\theta\). The standard loss \(\mathcal{L}(\theta)\) measures only mapping quality, whereas the entropic loss also accounts for the volume of the parameter region realizing each mapping: \(\mathcal{L}_{\text{entropic}}(\theta) = \mathcal{L}(\theta) - T \cdot S(\theta)\), where \(T\) is an effective temperature set by the learning rate and noise strength, and \(S(\theta)\) is an entropy term (the log-volume, i.e. the log number of microstates, of the corresponding region of parameter space). Larger-volume regions exert a stronger attraction on SGD, because a random walk finds large regions more readily.
- Design Motivation: Under the continuous-time approximation (Langevin dynamics), the noise effects of SGD can be exactly canceled (detailed balance), but this cancellation fails under discrete time steps — discreteness renders the entropy term non-negligible, making it the key driver of emergent phenomena.
- Symmetry-Breaking Mechanism (Continuous vs. Discrete):
- Function: Explains why different models learn similar representations and which parameter degrees of freedom are constrained.
- Mechanism: Neural network parameter spaces contain two classes of symmetries. Continuous symmetries, such as inter-layer weight rescaling (\(W_1 \to \alpha W_1,\ W_2 \to W_2/\alpha\) for any \(\alpha > 0\)), form a continuous equivalence class; discrete symmetries, such as neuron permutation (swapping all connection weights of two hidden neurons leaves the network unchanged), admit only finitely many equivalent configurations. The authors prove that the entropic force breaks continuous symmetries, selecting within the rescaling equivalence class the specific \(\alpha\) of maximum volume/entropy, while preserving discrete ones: no particular neuron permutation is spontaneously selected. Breaking a continuous symmetry means selecting a "typical" parameter configuration within a functional equivalence class; if different models are pushed by the same entropic force toward the same typical configuration, representation convergence follows naturally.
- Design Motivation: This "selective breaking" mechanism is central to the theory — it explains why representation alignment occurs along specific normalization dimensions rather than all dimensions.
- Gradient Balance Theorem (Gradient Equipartition):
- Function: Provides observable, verifiable predictions that ground the abstract theory in measurable quantities.
- Mechanism: By analogy with the thermodynamic equipartition theorem (each quadratic degree of freedom carries \(kT/2\) energy on average), the authors derive that at the stationary state of SGD training, if two layers \(W_1, W_2\) share a rescaling symmetry \(W_1 \to \alpha W_1,\ W_2 \to W_2/\alpha\), then their gradient norms balance: \(\|\nabla_{W_1}\mathcal{L}\|^2 \approx \|\nabla_{W_2}\mathcal{L}\|^2\), so the adjacent-layer gradient-norm ratio approaches 1. More generally, for any continuous symmetry, the gradient component along the symmetry direction vanishes at equilibrium; this is "gradient equipartition."
- Design Motivation: Falsifiability of the theory — if gradient balance fails empirically, the entire entropic force framework requires revision. The authors validate this prediction across multiple architectures.
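The symmetry-breaking and gradient-balance claims can be illustrated on a minimal toy model. The sketch below is our own construction, not the paper's experiment: a scalar two-layer "network" \(f(x) = w_2 w_1 x\) fit with label-noise SGD. Under continuous-time gradient flow the quantity \(C = w_1^2 - w_2^2\) is exactly conserved along the rescaling direction, but discrete noisy updates shrink it, driving the two layers toward balanced norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar two-layer model w2 * w1, fit so that w1 * w2 = 1.
# Gradient flow conserves C = w1**2 - w2**2 (the charge of the
# rescaling symmetry w1 -> a*w1, w2 -> w2/a). Discrete SGD with
# noisy targets breaks the symmetry and drives C toward 0.
w1, w2 = 3.0, 1.0 / 3.0          # zero loss, but badly unbalanced
lr, sigma, steps = 0.01, 1.0, 20_000

C0 = w1**2 - w2**2
for _ in range(steps):
    y = 1.0 + sigma * rng.normal()       # noisy target around 1
    r = w1 * w2 - y                      # residual
    g1, g2 = 2 * r * w2, 2 * r * w1      # gradients of (w1*w2 - y)**2
    w1, w2 = w1 - lr * g1, w2 - lr * g2  # simultaneous discrete update

C = w1**2 - w2**2
print(f"|C| shrank from {abs(C0):.2f} to {abs(C):.4f}")
assert abs(C) < 0.1 * abs(C0)
```

In this scalar case one can check directly that a discrete step gives \(\Delta C = -4\eta^2 r^2 C\): the first-order (gradient-flow) contribution cancels exactly, and the entire balancing drift comes from the \(\eta^2\) discreteness term, matching the paper's claim that the entropic force is invisible in the continuous-time limit.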
Loss & Training
This paper proposes no new training method. The core contribution is the proof that standard SGD training implicitly optimizes \(\mathcal{L}_{\text{entropic}} = \mathcal{L}_{\text{standard}} - T \cdot S(\theta)\), where \(T\) is jointly determined by the learning rate \(\eta\) and the gradient noise covariance \(\Sigma\). As \(\eta \to 0\) we have \(T \to 0\): the entropic force vanishes and the theory reduces to deterministic gradient flow.
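For intuition on why discreteness generates a correction term at all, recall the standard backward-error-analysis result for plain gradient descent (a known result, e.g. Barrett & Dherin's implicit gradient regularization; this is our gloss, not the paper's derivation): a single step \(\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)\) follows, to \(O(\eta^2)\), the gradient flow of a modified loss

```latex
\tilde{\mathcal{L}}(\theta)
  = \mathcal{L}(\theta)
  + \frac{\eta}{4}\,\bigl\|\nabla \mathcal{L}(\theta)\bigr\|^{2}.
```

With stochastic minibatch gradients the correction additionally involves the noise covariance \(\Sigma\), consistent with the paper's effective temperature \(T\) depending jointly on \(\eta\) and \(\Sigma\); the precise form of the entropic term \(T \cdot S(\theta)\) is the paper's contribution, not this generic expansion.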
Key Experimental Results
Main Results
| Verification Target | Model/Setting | Theoretical Prediction | Experimental Observation | Consistency |
|---|---|---|---|---|
| Gradient balance | MLP, CNN, ResNet, Transformer | Adjacent-layer gradient norm ratio → 1.0 | ratio ≈ 1.0 | ✓ |
| Continuous symmetry breaking | 2-layer linear network rescaling | Weight norm converges to theoretical value | Matches theory | ✓ |
| Discrete symmetry preservation | Neuron permutation symmetry | Permutation not spontaneously broken | No spontaneous permutation selection | ✓ |
| Representation alignment | Different initializations, same architecture | CKA similarity approaches 1 | CKA > 0.9 | ✓ |
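The CKA similarity used in the alignment row above is straightforward to compute. A minimal linear-CKA sketch (our code, following the standard definition of Kornblith et al. 2019):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n x d1) and Y (n x d2),
    rows = the same n inputs, columns = features."""
    X = X - X.mean(axis=0)           # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro")
                   * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))

# CKA is invariant to orthogonal transforms and isotropic scaling --
# exactly the nuisance transforms that differ between independently
# trained models.
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
assert np.isclose(linear_cka(X, X), 1.0)
assert np.isclose(linear_cka(X, 2.5 * X @ Q), 1.0)
```

Because CKA already quotients out rotations and rescalings, a CKA near 1 between independently trained models is a statement about shared representational geometry, which is what the Platonic Representation Hypothesis concerns.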
Ablation Study
| Learning Rate \(\eta\) | Gradient Balance Strength | Representation Alignment (CKA) | Notes |
|---|---|---|---|
| Large (\(\eta = 0.1\)) | Strong | High | Strong entropic force, complete symmetry breaking |
| Medium (\(\eta = 0.01\)) | Moderate | Moderate | Intermediate entropic force effect |
| Small (\(\eta = 0.001\)) | Weak | Low | Weak entropic force, approaches continuous-time limit |
| Continuous-time limit (\(\eta \to 0\)) | Absent | Random | Entropic force vanishes, theory degenerates |
Key Findings
- Gradient balance is a universal phenomenon: Gradient norm balance between adjacent layers is observed across MLP, CNN, ResNet, and Transformer architectures, supporting the "neural thermodynamics" analogy. This is not an architectural coincidence, but a universal consequence of SGD combined with symmetry.
- Learning rate is the control knob: A larger learning rate yields a higher effective temperature \(T\), stronger entropic forces, and more pronounced representation alignment. This explains the empirical observation that models trained with larger learning rates tend to learn more "general" representations, and further implies that learning rate scheduling is essentially controlling the "exploration temperature" of parameter space.
- Unified explanation for sharp vs. flat: Along continuous symmetry directions (e.g., rescaling directions), SGD is pushed by entropic forces toward high-volume regions (i.e., "flat" directions); along functionally non-equivalent directions (i.e., directions that change network outputs), SGD is pushed by gradient forces toward low-loss regions (which may be "sharp"). The two are not contradictory — they operate in different subspaces of parameter space.
- Theoretical support for the Platonic Representation Hypothesis: If all SGD-trained models are pushed by the same entropic forces toward the representative of the "maximum-volume" equivalence class in parameter space, then different models learning similar representations is an inevitable consequence of symmetry breaking.
Highlights & Insights
- Exceptional theoretical elegance: The framework of statistical mechanics unifies multiple seemingly unrelated emergent phenomena in deep learning (representation alignment, sharp/flat behavior, effects of batch normalization, etc.). A theory that explains multiple phenomena within a single framework has high academic value and extensibility.
- First theoretical proof of the Platonic Representation Hypothesis: The empirical observation that "different models learn similar representations" is elevated from conjecture to mathematical theorem — the logical core being: SGD + parameter symmetry → entropic force → selective symmetry breaking → all models converge to the same typical configuration.
- Discreteness as a key theoretical insight: Much prior work analyzes SGD via continuous-time approximation; this paper shows that doing so discards the most important term — the entropic force. This serves as a reminder to theorists that discrete-time SGD is not equivalent to Langevin dynamics, with the difference manifesting at the level of emergent phenomena.
- Transferable symmetry analysis framework: Identify parameter space symmetries → analyze the effect of entropic forces on different symmetries → predict training behavior. This analytical paradigm can be applied to novel architecture design (e.g., deliberately introducing or breaking symmetries to control the direction of representation learning).
Limitations & Future Work
- Theoretical derivation relies on second-order Taylor expansion: Expanding parameter updates to second order may be insufficiently precise for very deep networks or very large learning rates.
- Experimental validation at limited scale: Predictions such as gradient balance are validated primarily on relatively small models (MLPs, small ResNets); large-scale validation on LLM-scale models is lacking — yet the Platonic Representation Hypothesis was originally observed precisely in large models.
- Only SGD and simple variants are analyzed: Adaptive optimizers such as Adam and AdaGrad alter the geometry of parameter space through preconditioning, potentially changing the form and effects of entropic forces entirely; this paper does not address them.
- Practical implications of permutation symmetry are unclear: The theory predicts that discrete symmetries are preserved (different models have different neuron permutations), but in practice, aligning representations across different models requires resolving permutation correspondences — how to translate this theoretical insight into better model stitching methods remains unclear.
Related Work & Insights
- vs. Platonic Representation Hypothesis (Huh et al., 2024): That work originally proposed the hypothesis and provided empirical evidence; the present paper provides a mathematical proof based on entropic forces and symmetry breaking, representing a significant theoretical advance in this direction.
- vs. SAM (Sharpness-Aware Minimization): SAM explicitly seeks flat minima; this paper proves that SGD itself exhibits this tendency (via entropic forces), but only along continuous symmetry directions. This suggests that SAM and entropic forces may be doing something similar — pushing toward flatness along symmetry directions.
- vs. Gromov (2024) and related symmetry analysis work: Prior symmetry analyses largely focus on specific symmetries (e.g., scale invariance introduced by BatchNorm); the present paper provides a unified framework for analyzing the fate of arbitrary parameter symmetries.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Unifying multiple emergent deep learning phenomena through a statistical mechanics perspective; theoretical originality is exceptional.
- Experimental Thoroughness: ⭐⭐⭐ Experiments serve primarily as theoretical validation; large-scale model verification is lacking.
- Writing Quality: ⭐⭐⭐⭐ Theoretical exposition is rigorous, but demands substantial background in statistical mechanics; readers without this background will require additional effort.
- Value: ⭐⭐⭐⭐⭐ Provides a fundamentally new theoretical framework for understanding deep learning; offers the first proof of the Platonic Representation Hypothesis; likely to have lasting impact.