Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training¶
Conference: NeurIPS 2025 | arXiv: 2508.15390 | Code: github.com/Chung-Kim/vocab-imbalance | Area: LLM Evaluation | Keywords: tokenization, vocabulary size, frequency imbalance, Kolmogorov complexity, language model scaling
TL;DR¶
Through controlled experiments, this paper reveals the mechanism by which larger vocabularies improve language model performance: expanding the vocabulary reduces the Kolmogorov complexity of the tokenized text and amplifies token frequency imbalance; the model exploits this imbalance to substantially lower the loss on high-frequency tokens, which drives down global cross-entropy and improves downstream task performance.
Background & Motivation¶
- Background: It has been empirically observed that enlarging the BPE vocabulary (e.g., from 24K to 196K) consistently reduces language model perplexity and improves downstream accuracy. However, the underlying mechanism behind this benefit has remained unclear.
- Limitations of Prior Work: Intuitively, a larger vocabulary encodes more frequent words as single tokens, bringing the unigram model closer to optimal. Yet this explanation does not transfer directly to neural language models, because: (1) language models perform context-conditioned prediction rather than unigram modeling; (2) the conditional probability of rare tokens is far lower than their marginal probability, making mispredictions costly; and (3) rare tokens nonetheless contribute little to the total loss.
- Key Challenge: When the vocabulary is enlarged, it is unclear how model capacity is redistributed between high-frequency and low-frequency tokens, and what effect this redistribution has on global loss and downstream performance.
Method¶
Overall Architecture¶
Controlled experimental design: Data volume, computation, and optimization strategy are held fixed; only vocabulary size is varied (24K → 49K → 98K → 196K). Models of 85M parameters are trained on FineWeb-Edu and OpenWebText.
Analysis pipeline:

1. Quantify the complexity of tokenized text via Kolmogorov complexity
2. Disentangle vocabulary frequency imbalance from tokenization efficiency
3. Decompose token-level loss to identify the source of gains
4. Verify consistency across datasets and model scales
5. Analyze high-frequency token overlap to explain downstream transfer
Key Designs¶
Upper bound on Kolmogorov complexity: For the bit string \(X^N\) produced by BPE tokenization:

\[K(X^N) \le N \cdot H(p) + V \log_2 N + O(1)\]

where \(N\) is the total number of tokens, \(H(p)\) is the unigram Shannon entropy, \(V\) is the vocabulary size, and the \(V\log_2 N\) term encodes the token frequency table. Given the scale of modern training data (billions of tokens), the term \(N \cdot H(p)\) dominates, so \(H(p)\) is used as a proxy for complexity.
Normalized Compression Ratio (NCR): \(\text{NCR}(x;C) = |C(x)|/|x|\), where \(|C(x)| \approx N \cdot H(p)\).
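As a concrete illustration, here is a minimal Python sketch of the two quantities above computed from a list of token ids; the function names are invented for illustration, and measuring \(|C(x)|\) in bytes against the raw text size is an assumption, not the authors' code:

```python
import math
from collections import Counter

def complexity_upper_bound_bits(token_ids):
    """Upper bound on K(X^N) in bits: N * H(p) + V * log2(N)."""
    counts = Counter(token_ids)
    N = len(token_ids)
    V = len(counts)
    # Unigram Shannon entropy H(p), in bits per token
    H = -sum((c / N) * math.log2(c / N) for c in counts.values())
    return N * H + V * math.log2(N)

def normalized_compression_ratio(token_ids, raw_text_bytes):
    """NCR(x; C) = |C(x)| / |x|, with |C(x)| approximated by N * H(p)."""
    counts = Counter(token_ids)
    N = len(token_ids)
    H = -sum((c / N) * math.log2(c / N) for c in counts.values())
    return (N * H / 8) / raw_text_bytes  # convert bits to bytes before dividing
```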
Token-level loss decomposition metrics:

- Total Loss: \(\text{Total Loss}(v) = \sum_{t \in N}\sum_{i=1}^{|t|}\mathbb{1}(v=t_i)\,[-\ln p(t_i\mid t_{<i})]\)
- Per-token Mean Loss: \(\mu(\ell_v) = \text{Total Loss}(v) / T_v(N)\)
- Global Cross-Entropy Loss: \(\text{Global CE Loss} = \sum_v \frac{T_v(N)}{T(N)}\, \mu(\ell_v)\)
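A minimal sketch of this decomposition, assuming the per-position negative log-likelihoods \(-\ln p(t_i\mid t_{<i})\) have already been collected (the helper name is illustrative, not the paper's code):

```python
from collections import defaultdict

def decompose_losses(token_ids, nll_per_position):
    """Aggregate per-position NLLs into Total Loss(v), per-token mean loss,
    and global cross-entropy, following the definitions above."""
    total_loss = defaultdict(float)  # Total Loss(v)
    count = defaultdict(int)         # T_v(N)
    for v, nll in zip(token_ids, nll_per_position):
        total_loss[v] += nll
        count[v] += 1
    T = len(token_ids)               # T(N)
    mean_loss = {v: total_loss[v] / count[v] for v in total_loss}  # mu(l_v)
    global_ce = sum(count[v] / T * mean_loss[v] for v in mean_loss)
    # By construction, global_ce equals the plain average of nll_per_position.
    return mean_loss, global_ce
```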
Loss & Training¶
- Model: 85M non-embedding parameters, Pre-LN Transformer
- Optimizer: AdamW (\(\beta_1=0.9, \beta_2=0.95, \epsilon=10^{-8}\))
- Learning rate: \(6 \times 10^{-4}\), cosine decay, warmup over 350M tokens
- Weight decay: 0.1; gradient clipping: 1.0
- Training data: ~40B characters (~7.5B tokens at 49K vocabulary)
- Each experiment repeated over 5 random seeds
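For concreteness, the optimizer and schedule listed above could be wired up as follows in PyTorch; the placeholder model, the batch size in tokens, and the decay-to-zero endpoint are assumptions, not details taken from the paper:

```python
import math
import torch

# Placeholder model and batch size; only the optimizer/schedule hyperparameters
# below are taken from the setup described above.
model = torch.nn.Linear(512, 512)                 # stand-in for the 85M Pre-LN Transformer
tokens_per_step = 512 * 1024                      # assumed tokens per optimizer step
warmup_steps = 350_000_000 // tokens_per_step     # warmup over ~350M tokens
total_steps = 7_500_000_000 // tokens_per_step    # ~7.5B tokens at the 49K vocabulary

optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

def lr_lambda(step):
    # Linear warmup, then cosine decay (assumed to decay toward zero).
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, gradients are clipped via
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0).
```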
Key Experimental Results¶
Main Results¶
Kolmogorov complexity decreases as vocabulary size increases (FineWeb-Edu, 45.97B bytes of raw text):
| Vocabulary Size | \(K(X^N)\) (bytes) | NCR |
|---|---|---|
| 24K | 10.74B | 0.234 |
| 49K | 10.43B | 0.227 |
| 98K | 10.23B | 0.223 |
| 196K | 10.16B | 0.221 |
Token-level loss decomposition (85M model, 10B tokens, FineWeb-Edu):
| Vocabulary | Top-2500 High-Freq Mean Loss (nats) | Bottom-20000 Low-Freq Mean Loss (nats) | Global CE Loss (nats) | High-Freq Loss Share |
|---|---|---|---|---|
| 24K | ~3.85 | ~11.18 | 3.179 | ~75% |
| 49K | ~3.80 | ~11.90 | 3.163 | — |
| 98K | ~3.77 | ~12.60 | 3.148 | — |
| 196K | ~3.75 | ~13.40 | 3.136 | ~75% |
Tokenization efficiency is already saturated: At 24K vocabulary, over 95% of the top-2500 high-frequency words are already encoded as single tokens. Further vocabulary expansion primarily amplifies frequency imbalance (JSD increases monotonically).
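One way to quantify the frequency imbalance referred to here is the Jensen-Shannon divergence between the empirical unigram token distribution and the uniform distribution over the vocabulary; whether this matches the paper's exact JSD definition is an assumption, and the sketch below is illustrative only:

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions given as dicts."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def frequency_imbalance(token_ids, vocab_size):
    """JSD between the empirical unigram distribution and the uniform distribution."""
    counts = Counter(token_ids)
    N = len(token_ids)
    empirical = {v: c / N for v, c in counts.items()}
    uniform = {v: 1.0 / vocab_size for v in range(vocab_size)}
    return js_divergence(empirical, uniform)
```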
Ablation Study¶
Cross-scale validation:

- 85M + 30B tokens: high-frequency token loss 3.845 → 3.742 nats; global CE 2.991 → 2.941 (same trend)
- 450M + 10B tokens: high-frequency token loss 3.770 → 3.675 nats; global CE 2.989 → 2.888 (same trend)
Parameter scaling vs. vocabulary scaling (Pythia series):
| Model | Top-2500 High-Freq Mean Loss | Global CE | High-Freq Loss Share |
|---|---|---|---|
| Pythia-160M | 4.48 | 4.32 | ~48% |
| Pythia-1B | 3.76 | 2.51 | ~70% |
| Pythia-6.9B | 3.38 | 2.26 | ~70% |
Key difference: parameter scaling does not exacerbate low-frequency token loss, whereas vocabulary scaling does.
SuperBPE validation: Among cross-whitespace BPE variants, the variant with the lowest complexity (NCR=0.218) achieves the best downstream performance (Avg=43.8%), while the BPE baseline (NCR=0.222) performs worst (Avg=39.8%).
Key Findings¶
- Vocabulary expansion beyond 24K primarily exploits frequency imbalance rather than improvements in tokenization efficiency.
- High-frequency tokens overlap substantially (~75%) between pre-training and downstream tasks, explaining the causal chain from loss reduction to performance gains.
- The top-2500 high-frequency tokens cover 76–78% of tokens in ARC, HellaSwag, and SciQ (see the coverage sketch after this list).
- Parameter scaling and vocabulary scaling follow the same path: both reduce global CE by lowering uncertainty over high-frequency tokens.
- It is possible to reduce complexity without exacerbating frequency imbalance (as demonstrated by SuperBPE).
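A minimal sketch of the coverage statistic mentioned above, assuming the pre-training corpus and task texts have been tokenized with the same tokenizer (the helper name is illustrative):

```python
from collections import Counter

def topk_coverage(pretrain_token_ids, task_token_ids, k=2500):
    """Fraction of downstream-task tokens that fall in the top-k most frequent
    pre-training tokens (the 76-78% coverage figures reported above)."""
    top_k = {v for v, _ in Counter(pretrain_token_ids).most_common(k)}
    return sum(1 for t in task_token_ids if t in top_k) / len(task_token_ids)
```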
Highlights & Insights¶
- The paper reframes "bigger vocabularies help" as "lowering the complexity of tokenized text helps", providing a unified theoretical perspective.
- Kolmogorov complexity serves as a principled tool for tokenizer design, offering stronger theoretical grounding than heuristic metrics.
- The finding that tokenization efficiency saturates at 24K challenges the common intuition that larger vocabularies yield better token-level representations.
- The data-centric insight that frequency imbalance is a feature rather than a bug is relevant to all models using BPE.
- The SuperBPE analysis reveals that complexity can be reduced without amplifying frequency imbalance.
Limitations & Future Work¶
- Model scale is modest (85M / 450M); full consistency at 7B+ scale has not been verified.
- Only BPE tokenizers are evaluated; other approaches such as the Unigram LM tokenizer (e.g., as implemented in SentencePiece) are not examined.
- The Kolmogorov complexity upper bound (unigram entropy approximation) is loose and may not precisely characterize all scenarios.
- In machine translation, vocabulary frequency imbalance is reportedly harmful (due to limited source/target vocabulary overlap); the proposed explanation remains incomplete for this setting.
- No practical guidelines are proposed for selecting the optimal vocabulary size (i.e., when does further expansion cease to be worthwhile?).
Related Work & Insights¶
- Rajaraman et al. (2025) theoretically predict the benefits of large vocabularies from a unigram perspective; the present work extends this analysis to conditional language models.
- Tao et al. (2024) identify a log-linear relationship between vocabulary size and loss; the present work explains the underlying mechanism.
- Cross-whitespace merging strategies in SuperBPE and Boundless BPE are consistent with the paper's findings: reducing complexity is the key factor.
- The findings directly inform tokenizer–model co-design: Kolmogorov complexity can serve as a principled selection criterion.
- Deduplication can be reinterpreted through the lens of frequency imbalance—an interesting direction for future work.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First systematic investigation of the underlying mechanism behind vocabulary expansion benefits
- Experimental Design: ⭐⭐⭐⭐⭐ — Rigorous controlled variables; validated across multiple datasets and scales
- Theoretical Depth: ⭐⭐⭐⭐ — The Kolmogorov complexity framework is well-motivated and convincing
- Practical Value: ⭐⭐⭐⭐ — Directly applicable to tokenizer design
- Overall: 9.0/10