Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers¶
Conference: ICLR 2026
arXiv: 2509.22445
Code: None
Area: Model Compression / Learning Theory
Keywords: Minimum Description Length, Kolmogorov Complexity, Transformer, Computational Universality, Variational Objectives
TL;DR¶
Starting from Kolmogorov complexity theory, this paper proposes a theoretical framework for "asymptotically optimal description length objectives." It proves the existence of such objectives for Transformers through a new proof of their computational universality and conducts empirical validation via a differentiable variational objective based on adaptive Gaussian mixture priors, revealing significant optimization challenges.
Background & Motivation¶
The Minimum Description Length (MDL) principle is a formalized framework of Occam's Razor in machine learning: the best model is the one that provides the shortest description of the data. MDL has deep theoretical foundations in statistical model selection, regularization, and generalization theory. However, applying the MDL principle to deep neural networks like Transformers faces fundamental difficulties—a lack of a principled, universal measure of model complexity.
Traditional MDL methods (e.g., BIC, two-part codes) rely on simple complexity measures like parameter counts, which are clearly inapplicable to over-parameterized deep networks—a Transformer with billions of parameters might only require very few "effective parameters" to solve a specific task. Existing deep learning compression methods (e.g., pruning, quantization), while effective in practice, lack theoretical connections to information-theoretic optimality.
The Core Idea of this paper is to utilize Kolmogorov complexity (the shortest program length that describes a string—a measure that is uncomputable but theoretically optimal) to define a class of "asymptotically optimal" description length objectives. Specifically, if the minimizer of an objective function achieves the optimal compression of any dataset (up to an additive constant), it is considered asymptotically optimal. The Core Problem is: Do such objectives exist for Transformers, and if so, are they computable?
Method¶
Overall Architecture¶
The paper follows a theoretical path of "define first, prove existence, then construct": it rigorously formulates the mathematical definition of "asymptotically optimal description length objectives," proves their existence for Transformers leveraging computational universality, and finally implements a differentiable variational objective for empirical testing. The first two steps comprise a pure theoretical existence proof, while the third translates abstract guarantees into a specific loss function optimizable via gradient descent.
Key Designs¶
1. Definition of Asymptotically Optimal Description Length Objectives: Formulating "Optimal Compression" as an Optimizable Goal
The difficulty of MDL lies in the lack of a clean definition for "complexity." The paper uses Kolmogorov complexity as an anchor. It defines the optimization target as a class of universal two-part codes: given a family of parameterized models \(\{f_\theta : \theta \in \Theta\}\) and a description length function \(L(\theta, x)\) (the bits required to describe data \(x\) using parameters \(\theta\)), \(L\) is asymptotically optimal if its minimizer \(\theta^* = \arg\min_\theta L(\theta, x)\) satisfies—for all data \(x\)—that \(L(\theta^*, x)\) approaches the Kolmogorov complexity \(K(x)\) as the model's resource upper bound grows, within an additive constant independent of \(x\). The paper proves that such "universal" codes form an equivalence class, meaning optimality is independent of the specific prior choice. This transforms the uncomputable task of finding the shortest description into a theoretically operable task of minimizing an objective function.
2. New Proof of Transformer Computational Universality: Bridging Existence with Turing Machines
To answer the existence question, it is necessary to prove that Transformers possess computational universality (i.e., they can simulate any Turing machine in finite steps). The paper provides a new constructive proof showing how a Transformer encoder simulates the computation of a universal prefix Turing machine, providing more precise resource bounds than prior work. With universality established, the existence proof follows: since Transformers can simulate any computable compression algorithm, they can simulate the algorithm achieving optimal compression. Thus, there must exist a family of description length objectives whose minimizers reach asymptotically optimal compression in the limit.
3. Variational Objective with Adaptive Gaussian Mixture Priors: Translating Uncomputable Guarantees into Differentiable Loss
While existence is proven, the proof does not provide a computational method. The paper defines an adaptive Gaussian mixture distribution as a prior \(p(\theta)\) over the Transformer weight space to materialize the abstract goal. This prior consists of multiple Gaussian components where mixture weights and variance parameters are optimized simultaneously, making it more flexible than a single Gaussian. Under this prior, the total description length decomposes into two parts—bits to encode model parameters (determined by the prior) and bits to describe data (determined by the likelihood). The objective becomes differentiable via variational inference. The paper proves that if the prior is sufficiently flexible and the optimizer reaches a global optimum, this variational objective is asymptotically optimal; however, this "if" prefigures the optimization hurdles exposed in experiments.
Loss & Training¶
The variational description length objective adopts a two-part code form:
Where the prior \(p(\theta)\) is the adaptive Gaussian mixture, and the likelihood \(p(x|\theta)\) is the predictive distribution of the Transformer. Minimizing this total length essentially balances "data fitting" and "model simplicity": the first term penalizes parameter encoding costs, forcing model simplicity, while the second ensures the model explains the data.
Key Experimental Results¶
Main Results¶
Empirical analysis was conducted on algorithmic tasks to verify if the description length objective selects simple, well-generalizing solutions:
| Metric | Variational MDL Objective | Standard Cross-Entropy | Note |
|---|---|---|---|
| Solution Complexity | Low | High | MDL objective prefers simple solutions |
| Generalization | Strong | Weak | Low-complexity solutions generalize better |
| Optimization Difficulty | High (Random init fails) | Low | Key Finding |
Ablation Study¶
| Configuration | Key Metric | Note |
|---|---|---|
| Gaussian Mixture vs. Single Gaussian | Mixture is more flexible | Adaptive mixtures capture weight distributions better |
| Different Prior Weights | Affects Compression-Fit balance | Requires careful tuning |
| Starting from Good Init | MDL objective finds good solutions | Initialization quality is critical |
| Starting from Random Init | Standard optimizers fail | Highlights the core optimization challenge |
Key Findings¶
- Existence proof is constructive: The existence of asymptotically optimal description length objectives for Transformers provides a theoretical foundation for MDL in deep learning.
- Variational objectives select low-complexity solutions: On algorithmic tasks, the MDL objective prefers simpler, better-generalizing solutions compared to standard training.
- Critical negative result: Standard optimizers (e.g., Adam) cannot reliably find these low-complexity solutions from random initialization. This implies that even if optimal objectives exist theoretically, optimization barriers remain the core bottleneck in practice.
- This shifts the research focus from "designing better objectives" toward "designing better optimization algorithms."
Highlights & Insights¶
- Theoretical bridge between Kolmogorov complexity and deep learning: An ambitious work attempting to provide an information-theoretic explanation for deep learning generalization.
- New proof of Transformer universality: Provides precise resource bounds, serving as an independent theoretical contribution.
- Honest reporting of negative results: The paper does not shy away from the fact that standard optimizers fail to find theoretical optima, identifying this as a core discovery for future research.
- Modernizing the MDL principle: Revisits a classical principle in the deep learning era using modern tools (variational inference + Transformer universality).
- Theoretical value outweighs practical utility: The value lies in providing a theoretical framework rather than immediate, "out-of-the-box" algorithms.
Limitations & Future Work¶
- Optimization Gap is the core constraint: While the existence of optimal solutions is proven, the computational problem of finding them remains unsolved. This is a central open problem in the MDL-deep learning intersection.
- Limited experimental scale: Validated only on simple algorithmic tasks; a large gap remains to real-world NLP/CV tasks. Whether asymptotic guarantees manifest at finite model scales is unknown.
- Tightness of variational bounds: The variational objective is only an upper bound on true MDL; the tightness of this bound directly impacts performance.
- Connection to practical compression: The framework is not yet linked to practical methods like quantization or knowledge distillation.
- Future Directions: Combining neural pruning/sparsification with description length optimization; developing specialized optimizers for MDL objectives (e.g., simulated annealing, evolutionary strategies); extending the framework to autoregressive LLMs.
Related Work & Insights¶
- MDL Principle (Rissanen, Grünwald): This work provides one of the most systematic theoretical analyses of MDL in deep learning.
- Transformer Expressivity (Pérez et al., Yun et al.): The new universality proof advances this sub-field.
- Kolmogorov Complexity (Solomonoff, Kolmogorov, Chaitin): Used here as the benchmark for asymptotic optimality.
- Compression and Generalization (PAC-Bayes, Information Bottleneck): This work supports the deep connection that "a good model is a good compressor" from a new perspective.
- Insight: For understanding LLM generalization, the information-theoretic/compression perspective may offer deeper insights than pure statistical learning theory. The view that "LLMs are essentially data compressors" receives formal theoretical support here.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐
Related Papers¶
- [ICLR 2026] Lookup multivariate Kolmogorov-Arnold Networks
- [ICLR 2026] Faster Vision Transformers with Adaptive Patches
- [ICLR 2026] Textual Equilibrium Propagation for Deep Compound AI Systems
- [ICLR 2026] Ensembling Pruned Attention Heads for Uncertainty-Aware Efficient Transformers
- [ICLR 2026] Compute-Optimal Quantization-Aware Training