
Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Conference: ICLR 2026 · arXiv: 2509.22445 · Code: None · Area: Model Compression / Learning Theory · Keywords: Minimum Description Length, Kolmogorov Complexity, Transformer, Computational Universality, Variational Objective

TL;DR

Starting from Kolmogorov complexity theory, this paper proposes a theoretical framework of "asymptotically optimal description length objectives," proves that such objectives exist for Transformers via a novel proof of their computational universality, and probes the framework empirically with a differentiable variational objective based on an adaptive Gaussian mixture prior, revealing significant optimization challenges.

Background & Motivation

The Minimum Description Length (MDL) principle is the formal instantiation of Occam's razor in machine learning: the best model is the one that provides the shortest description of the data. MDL has deep theoretical foundations in statistical model selection, regularization, and generalization theory. However, applying MDL to deep neural networks such as Transformers faces a fundamental difficulty — the lack of a principled, universal measure of model complexity.

Classical MDL methods (e.g., BIC, two-part coding) rely on simple complexity proxies such as parameter count, which are clearly inadequate for overparameterized deep networks — a Transformer with billions of parameters may require very few "effective parameters" to solve a specific task. Existing deep learning compression methods (e.g., pruning, quantization) are practically effective but lack theoretical connections to information-theoretic optimality.

The core idea of this paper is to use Kolmogorov complexity — the length of the shortest program that describes the data, an uncomputable yet theoretically optimal complexity measure — to define a class of "asymptotically optimal" description length objectives. Specifically, an objective is asymptotically optimal if its minimizer achieves optimal compression of any dataset up to an additive constant. The key questions are: does such an objective exist for Transformers, and if so, is it computable?

Method

Overall Architecture

The paper's technical approach proceeds in three steps: (1) formalize the mathematical notion of asymptotically optimal description length objectives; (2) prove that Transformers are computationally universal, thereby establishing the existence of asymptotically optimal objectives; (3) construct a concrete, differentiable variational objective and conduct empirical analysis.

Key Designs

  1. Definition of Asymptotically Optimal Description Length Objectives: Given a parameterized model family \(\{f_\theta : \theta \in \Theta\}\) and a description length function \(L(\theta, x)\) (the number of bits needed to describe data \(x\) using parameters \(\theta\)), the objective \(L\) is called asymptotically optimal if its minimizer \(\theta^* = \arg\min_\theta L(\theta, x)\) satisfies: for all data \(x\), as the model's resource bound grows, \(L(\theta^*, x)\) matches the Kolmogorov complexity \(K(x)\) up to an additive constant independent of \(x\) (a formal statement is sketched after this list). In other words, minimizing such an objective amounts to finding an approximately shortest description of the data.

  2. Computational Universality of Transformers (Novel Proof): To prove the existence of an asymptotically optimal objective for Transformers, it is first necessary to establish that Transformers are computationally universal — i.e., that a Transformer can simulate the computation of an arbitrary Turing machine. The paper provides a new constructive proof showing how a Transformer of finite depth and width can simulate the computation of a universal Turing machine. This result is itself a significant theoretical contribution, as it provides tighter resource bounds than prior work. Building on this, the paper proves that since Transformers can simulate any compression algorithm, there exists a description length objective whose minimizer achieves asymptotically optimal compression.

  3. Adaptive Gaussian Mixture Prior Variational Objective: To translate the theoretical results into practice, the paper constructs a computable, differentiable variational objective. Specifically, an adaptive Gaussian mixture distribution is defined as a prior over the Transformer's weight space:

     • The prior \(p(\theta)\) is a mixture of multiple Gaussian components, with mixture weights and variance parameters jointly optimized.

     • The description length comprises two terms: the number of bits to encode the model parameters (determined by the prior) plus the number of bits to describe the data given the model (determined by the likelihood).

     • The objective is made differentiable via variational inference and can be optimized using standard gradient descent.

The paper proves that under appropriate conditions, this variational objective is asymptotically optimal, provided the prior is sufficiently flexible and the optimizer reaches the global optimum.
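In the notation of item 1, one convenient way to write the asymptotic-optimality requirement (my paraphrase; the paper's exact quantifiers and resource-bound parameterization may differ) is that there exists a constant \(c\), independent of the data \(x\), such that for a sufficiently large model resource bound

\[\min_{\theta \in \Theta} L(\theta, x) \;\le\; K(x) + c \qquad \text{for all } x.\]

Because no computable description method can undercut \(K(x)\) by more than a constant, a minimizer of such an objective compresses every dataset essentially as well as its shortest program.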

Loss & Training

The variational description length objective takes the form of a two-part code:

\[L(\theta, x) = \underbrace{-\log p(\theta)}_{\text{model complexity}} + \underbrace{-\log p(x | \theta)}_{\text{data fit}}\]

where the prior \(p(\theta)\) is based on an adaptive Gaussian mixture and \(p(x|\theta)\) is the Transformer's predictive distribution under parameters \(\theta\). The training objective is to minimize this total description length, which essentially seeks the optimal trade-off between data fitting capacity and model parsimony.
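To make the two-part objective concrete, below is a minimal PyTorch sketch of how a description-length loss with an adaptive Gaussian mixture prior over the weights could be implemented. This is not the paper's code (no repository is released): the names `GaussianMixturePrior` and `description_length`, the per-scalar diagonal mixture, the nat-based (rather than bit-based) accounting, the point-estimate weights without a variational posterior term, and the `beta` trade-off knob are all illustrative assumptions.

```python
# Minimal sketch (not the paper's released code) of a two-part
# description-length objective with an adaptive Gaussian mixture prior.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianMixturePrior(nn.Module):
    """Adaptive mixture-of-Gaussians prior p(theta) over model weights.

    The mixture logits, means, and log-variances are themselves learnable,
    which is one way to realize an "adaptive" prior.
    """

    def __init__(self, num_components: int = 4):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_components))
        self.means = nn.Parameter(torch.linspace(-0.5, 0.5, num_components))
        self.log_vars = nn.Parameter(torch.zeros(num_components))

    def neg_log_prob(self, weights: torch.Tensor) -> torch.Tensor:
        """Return -log p(weights) in nats, summed over all scalar weights."""
        w = weights.reshape(-1, 1)                        # (num_weights, 1)
        log_pi = F.log_softmax(self.logits, dim=0)        # (K,)
        var = self.log_vars.exp()
        # log N(w; mu_k, sigma_k^2) for every (weight, component) pair
        log_norm = -0.5 * (math.log(2 * math.pi) + self.log_vars
                           + (w - self.means) ** 2 / var)  # (num_weights, K)
        # log p(w) = logsumexp_k [ log pi_k + log N(w; mu_k, sigma_k^2) ]
        return -torch.logsumexp(log_pi + log_norm, dim=1).sum()


def description_length(model: nn.Module,
                       prior: GaussianMixturePrior,
                       logits: torch.Tensor,
                       targets: torch.Tensor,
                       beta: float = 1.0) -> torch.Tensor:
    """Two-part code L(theta, x) = -log p(theta) - log p(x | theta).

    `beta` is an illustrative knob weighting the complexity term; it is not
    taken from the paper.
    """
    flat = torch.cat([p.reshape(-1) for p in model.parameters()])
    model_nats = prior.neg_log_prob(flat)                          # model complexity
    data_nats = F.cross_entropy(logits, targets, reduction="sum")  # data fit
    return beta * model_nats + data_nats
```

In a training loop, the complexity term would be counted once for the whole dataset (or rescaled per minibatch), since the parameters are encoded only once in a two-part code.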

Key Experimental Results

Main Results

Empirical analysis is conducted on algorithmic tasks to verify whether the description length objective selects simple, well-generalizing solutions:

| Metric | Variational MDL Objective | Standard Cross-Entropy | Notes |
| --- | --- | --- | --- |
| Solution complexity | Low | High | MDL objective favors simpler solutions |
| Generalization | Strong | Weak | Low-complexity solutions generalize better |
| Optimization difficulty | High (fails from random init.) | Low | Key finding |

Ablation Study

| Configuration | Key Observation | Notes |
| --- | --- | --- |
| Gaussian mixture vs. single Gaussian prior | Mixture is more flexible | Adaptive mixture better captures weight distributions |
| Different prior weights | Affects compression–fit trade-off | Requires careful tuning |
| Starting from good initialization | MDL objective finds good solutions | Initialization quality is critical |
| Starting from random initialization | Standard optimizers fail | Exposes the core optimization challenge |

Key Findings

  • The existence proof is affirmative: Asymptotically optimal description length objectives do exist for Transformers, providing a theoretical foundation for applying the MDL principle in deep learning.
  • The variational objective does select low-complexity solutions: On algorithmic tasks, the MDL objective favors simpler, better-generalizing solutions compared to standard training.
  • A critical negative result: Standard optimizers (e.g., Adam) starting from random initialization cannot reliably find these low-complexity solutions. This implies that even when the optimal objective theoretically exists, the optimization barrier remains a central bottleneck in practice.
  • This finding shifts the research focus from "designing better objective functions" to "designing better optimization algorithms."

Highlights & Insights

  • Building a theoretical bridge between Kolmogorov complexity and deep learning: This is a highly ambitious theoretical work that seeks to provide a deep information-theoretic explanation for the generalization capacity of deep learning models.
  • Novel proof of Transformer computational universality: The tighter resource bounds constitute an independently valuable theoretical contribution.
  • Honest reporting of negative results: The paper does not shy away from the fact that standard optimizers fail to find theoretically optimal solutions; instead, it foregrounds this as a central finding and points toward future research directions. This academic honesty is commendable.
  • Modernization of the MDL principle: Revisiting MDL in the deep learning era and furnishing new theoretical understanding via modern tools (variational inference and Transformer universality).
  • Greater theoretical than practical significance: The current value lies primarily in providing a theoretical framework and guiding research directions, rather than in directly deployable algorithms.

Limitations & Future Work

  • The optimization gap is the central limitation: The theory establishes the existence of optimal solutions, but the computational problem of finding them remains unsolved — an open problem central to the intersection of MDL and deep learning.
  • Limited experimental scale: Validation is restricted to simple algorithmic tasks, far from real-world NLP/CV settings. Whether the asymptotic guarantees of the theory manifest as empirical advantages at finite model scales is unknown.
  • Tightness of the variational bound: The variational objective is merely an upper bound on the true MDL; the looseness of this bound directly affects practical performance.
  • Relationship to practical compression methods: The paper does not connect the theoretical framework to practical compression techniques such as quantization or knowledge distillation, leaving a gap between theory and practice.
  • Potential directions: Integrating neural network pruning/sparsification with description length optimization; developing optimization algorithms specifically tailored to MDL objectives (e.g., simulated annealing, evolutionary strategies); extending the framework to sequence modeling and autoregressive LLMs.

Related Work & Connections

  • Minimum Description Length principle (Rissanen, Grünwald): This paper is among the most systematic theoretical analyses of MDL in the context of deep learning.
  • Computational expressiveness of Transformers (Pérez et al., Yun et al.): The novel universality proof advances this line of research.
  • Kolmogorov Complexity (Solomonoff, Kolmogorov, Chaitin): The uncomputable ultimate complexity measure, used here as the benchmark for asymptotic optimality.
  • Connection between compression and generalization (PAC-Bayes, Information Bottleneck): This paper corroborates, from a different angle, the deep connection that "a good model is a good compressor."
  • Insights: An information-theoretic, compression-based perspective may offer deeper insight into understanding and improving the generalization of LLMs than purely statistical learning theory. The view that "LLMs are fundamentally data compressors" receives formal theoretical support here.

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐