Beyond Text Compression: Evaluating Tokenizers Across Scales

Conference: ACL 2025
arXiv: 2506.03101
Code: None
Area: LLM Efficiency / Model Compression / Tokenizer Evaluation
Keywords: tokenizer evaluation, scaling consistency, Zipf's law, multilingual, text compression

TL;DR

This paper systematically evaluates six tokenizers on models with 350M and 2.7B parameters. It finds that tokenizer choice has very little effect on English tasks but a significant, scale-consistent effect on multilingual tasks such as machine translation. The paper also proposes a family of intrinsic evaluation metrics based on Zipf's law, which it reports as better predictors of downstream performance in multilingual scenarios than text compression rate alone.

Background & Motivation

Background: Tokenizer selection is a foundational decision in LLM training, yet in practice many models simply reuse an existing tokenizer (e.g., Phi-3-mini reusing the Llama 2 tokenizer) with little systematic justification.

Limitations of Prior Work:

  • Extrinsic evaluation is too expensive: Evaluating tokenizer quality requires training a full model, which prevents rapid iteration.
  • Compression rate is unreliable: Text compression rate is frequently used as a proxy for tokenizer quality, but recent studies have questioned its robustness.
  • Insufficient multilingual evaluation: Existing evaluations are mostly limited to monolingual or classification tasks.

Key Challenge: Low-cost intrinsic metrics are needed to predict tokenizer impacts on downstream tasks, but existing metrics (primarily compression rate) exhibit poor predictive capability in multilingual and generative tasks.

Goal: Answer two questions. (1) Does tokenizer-induced variance scale consistently across model sizes? (2) Which intrinsic metrics can reliably predict downstream performance?

Key Insight: Scaling consistency can be exploited: if tokenizer-induced variance truly matters, it should be observable in small models and persist in larger ones.

Core Idea: Use performance rankings from 350M models to predict rankings at 2.7B (saving roughly 85% of compute), and introduce a family of intrinsic metrics based on Zipf's law to complement the single compression rate metric.

Method

Overall Architecture

6 tokenizers (4 English-centric + 2 multilingual) \(\times\) 2 model scales (350M/2.7B) \(\rightarrow\) Evaluation on 3 categories of downstream tasks (multiple-choice questions/summarization/translation) \(\rightarrow\) Correlation analysis with 5 intrinsic metrics.

Key Designs

  1. Scaling Consistency Experimental Design:

    • Function: Isolates the impact of tokenizers—all models share the identical architecture, data, and training configurations, with the tokenizer being the sole variable.
    • Mechanism: Train 12 models (6 tokenizers \(\times\) 2 scales) on the same dataset (FineWeb 100B tokens) and compare the consistency of performance rankings across scales using Kendall's \(\tau\).
    • Design Motivation: Training a small model (350M) costs only ~15% of what a large model (2.7B) costs, so if rankings are consistent across scales, tokenizers can be evaluated and selected cheaply on the small model.
  2. Intrinsic Metrics based on Zipf's Law:

    • Function: Proposes 4 new metrics to complement the text compression rate.
    • Cardinality (Unique Token Count): The number of unique token types produced after tokenization, reflecting vocabulary coverage and dependency on byte-level fallback.
    • Rank-frequency AUC: Area under the log-log rank-frequency curve (integrated using Simpson's rule).
    • Slope: Slope of a linear fit to the log-log rank-frequency curve; an ideal Zipfian distribution has a slope of -1.
    • Power Law: Mean absolute error (MAE) of that linear fit, measuring how far the token distribution deviates from Zipf's law.
    • Design Motivation: Word frequencies in natural language approximately follow Zipf's law, so the working hypothesis is that a token distribution closer to Zipfian is easier for a language model to learn.
  3. Two-Stage Prediction Framework:

    • Function: Combines multiple intrinsic metrics into a reliable evaluation framework.
    • Mechanism: Independent ranking using individual metrics first, followed by an aggregated ranking via combinations (such as multi-metric voting or weighting).
    • Difference from traditional methods: Instead of relying on a single compression rate metric, it captures multiple dimensions of tokenizer behavior.
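
The four Zipf-based metrics listed in item 2 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `zipf_metrics` and `simpson_irregular` are hypothetical, raw counts (rather than relative frequencies) are assumed for the frequency axis, natural logarithms are assumed, and an ordinary least-squares line is assumed for the slope and MAE.

```python
import math
from collections import Counter


def simpson_irregular(x, y):
    """Composite Simpson's rule for samples at irregularly spaced x."""
    n = len(x) - 1  # number of intervals
    total = 0.0
    paired = n - (n % 2)  # Simpson consumes intervals two at a time
    for i in range(0, paired, 2):
        h0 = x[i + 1] - x[i]
        h1 = x[i + 2] - x[i + 1]
        total += (h0 + h1) / 6.0 * (
            (2.0 - h1 / h0) * y[i]
            + (h0 + h1) ** 2 / (h0 * h1) * y[i + 1]
            + (2.0 - h0 / h1) * y[i + 2]
        )
    if n % 2 == 1:
        # One interval is left over: fall back to the trapezoidal rule.
        total += 0.5 * (x[n] - x[n - 1]) * (y[n] + y[n - 1])
    return total


def zipf_metrics(token_ids):
    """Cardinality, AUC, slope, and fit MAE of the log-log rank-frequency curve."""
    counts = sorted(Counter(token_ids).values(), reverse=True)
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two distinct token types")
    log_rank = [math.log(r) for r in range(1, n + 1)]
    log_freq = [math.log(c) for c in counts]  # assumption: raw counts

    # Ordinary least-squares line through (log_rank, log_freq).
    mean_x = sum(log_rank) / n
    mean_y = sum(log_freq) / n
    sxx = sum((a - mean_x) ** 2 for a in log_rank)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_rank, log_freq))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    mae = sum(
        abs(b - (slope * a + intercept)) for a, b in zip(log_rank, log_freq)
    ) / n

    return {
        "cardinality": n,
        "auc": simpson_irregular(log_rank, log_freq),
        "slope": slope,
        "power_law_mae": mae,
    }


# Toy input: three token types with counts 12, 6, and 4.
metrics = zipf_metrics([0] * 12 + [1] * 6 + [2] * 4)
print(metrics)
```

For the toy counts 12, 6, and 4, the three points lie exactly on a line of slope -1 in log-log space, so the fitted slope is -1 and the MAE is 0 up to floating-point error. In practice `token_ids` would be the output of a tokenizer run over an evaluation corpus.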

Loss & Training

  • Decoder-only Transformer based on GPT-3 configurations.
  • 350M parameters: 24 layers, 1024 dimensions, 16 heads; 2.7B parameters: 32 layers, 2560 dimensions, 32 heads.
  • Training data: A subset of FineWeb with 100B GPT-2 tokens (predominantly English).
  • Fixed batch size of 2M tokens for all models.
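
As a rough sanity check on the training budget, the figures above can be turned into a step count and a relative compute estimate. This is a back-of-the-envelope sketch: the `6 * N * D` training-cost rule of thumb is an assumption taken from the scaling-law literature rather than from this paper, nominal parameter counts are used, and the batch is assumed to be exactly 2,000,000 tokens.

```python
TOKENS = 100e9        # training tokens, from the list above
BATCH_TOKENS = 2e6    # assumption: "2M" means exactly 2,000,000 tokens
PARAMS = {"350M": 350e6, "2.7B": 2.7e9}  # nominal parameter counts

# Number of optimizer steps needed to see every training token once.
steps = TOKENS / BATCH_TOKENS


def train_flops(n_params, n_tokens):
    """Rough training cost under the assumed C ~ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


flops = {name: train_flops(n, TOKENS) for name, n in PARAMS.items()}
ratio = flops["350M"] / flops["2.7B"]

print(f"optimizer steps: {steps:,.0f}")
print(f"350M / 2.7B compute ratio: {ratio:.2f}")
```

Under these assumptions each model takes about 50,000 optimizer steps, and the small model costs about 0.13 of the large one, which is in the same range as the ~15% figure quoted in the design motivation.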

Key Experimental Results

Main Results (Multiple-choice benchmarks)

| Tokenizer | Type | 350M Avg(R) | 2.7B Avg(R) | English Impact |
|---|---|---|---|---|
| Phi-3-mini | English | 48.0 | 54.7 | Minimal |
| GPT-2 | English | 49.0 | 55.3 | Minimal |
| GPT-NeoX | English | 48.8 | 55.6 | Minimal |
| Falcon | English | 48.7 | 56.3 | Minimal |
| tiktoken | Multilingual | 48.9 | 55.9 | Minimal |
| Aya 23 | Multilingual | 49.2 | 56.0 | Minimal |

Machine Translation (MetricX ↓ = lower is better):

| Tokenizer | 350M MT Avg | 2.7B MT Avg | Rank Change |
|---|---|---|---|
| Aya 23 | 8.7 | 6.8 | Consistently 1st |
| GPT-2 | 14.5 | 9.6 | |
| GPT-NeoX | 11.3 | 8.7 | |
| Phi-3-mini | 10.0 | 7.2 | |
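
Ranking consistency across scales is measured with Kendall's \(\tau\). The sketch below computes the simple tau-a variant (no tie correction) on the four machine translation rows listed above. The function name `kendall_tau` is hypothetical, and which \(\tau\) variant the paper uses is not stated in this summary.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# MetricX averages copied from the machine translation rows above
# (lower is better).
mt_350m = {"Aya 23": 8.7, "GPT-2": 14.5, "GPT-NeoX": 11.3, "Phi-3-mini": 10.0}
mt_2_7b = {"Aya 23": 6.8, "GPT-2": 9.6, "GPT-NeoX": 8.7, "Phi-3-mini": 7.2}

names = sorted(mt_350m)
tau = kendall_tau([mt_350m[k] for k in names], [mt_2_7b[k] for k in names])
print(f"Kendall tau across scales: {tau}")
```

On these four rows the ordering is identical at both scales, so the sketch yields \(\tau = 1.0\). The \(\tau = 0.87\) reported in the findings presumably covers the paper's full set of tokenizers and translation results, which this summary does not list.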

Spearman Correlation between Intrinsic Metrics and Downstream Performance

| Metric | Multiple-Choice | Summarization | Machine Translation |
|---|---|---|---|
| compression | -0.59 | -0.09 | 0.77** |
| cardinality | 0.29 | -0.09 | -0.79 |
| auc | 0.19 | 0.14 | 0.77** |
| power law | 0.0 | 0.14 | 0.78** |
| slope | 0.0 | -0.43 | -0.44 |
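
Each entry in the table above is a Spearman rank correlation between one intrinsic metric and one downstream score across tokenizers. A minimal stdlib sketch of that computation follows. The helper names `average_ranks` and `spearman_rho` are hypothetical, and the input numbers are made-up placeholders that are NOT taken from the paper, since the per-tokenizer metric values are not listed in this summary.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT from the paper:
# one intrinsic metric and one downstream score per tokenizer.
toy_metric = [0.9, 0.7, 0.4, 0.2]
toy_score = [6.0, 7.5, 7.0, 9.0]
rho = spearman_rho(toy_metric, toy_score)
print(f"Spearman rho: {rho:.2f}")
```

When reading the machine translation column, keep in mind that MetricX is an error score (lower is better), so the sign of a correlation has to be interpreted against that direction.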

Key Findings

  • Tokenizer selection has virtually no impact on English tasks: Performance differences among the six tokenizers on multiple-choice and summarization tasks are minimal, and the rank order is not consistent across scales.
  • Significant and scale-consistent impact on multilingual tasks: In machine translation, Kendall's \(\tau = 0.87\) is highly significant, showing strong consistency in ranking between 350M and 2.7B scales.
  • At the 350M scale, Aya 23 (a multilingual tokenizer) achieves a better average MetricX score (8.7) than GPT-2 at the 2.7B scale (9.6), suggesting that a better-suited tokenizer can offset a roughly 8x gap in parameter count.
  • Cardinality has the largest absolute correlation with machine translation performance (\(\rho = -0.79\)), narrowly ahead of power law (0.78), AUC (0.77), and compression rate (0.77).
  • Compression rate has almost no predictive power for summarization, the English generative task (\(\rho = -0.09\)), challenging the common belief that higher compression rates yield better performance.

Highlights & Insights

  • The finding that "tokenizer selection does not matter for English" is practically useful: It suggests that, at least at the scales tested, developers can use multilingual tokenizers in English scenarios without a measurable performance penalty, offering empirical support for building universal tokenizers.
  • The unique perspective of evaluating tokenizers through Zipf's law is highly inspiring: Integrating linguistic statistical regularities into tokenizer evaluation provides a solid theoretical foundation beyond simple compression ratios, which can be transferred to vocabulary evaluation in other sequence-modeling contexts (such as vector quantization codebooks).
  • The experimental design is exceptionally clean: With variables rigorously controlled, the 12 models differ only in their tokenizers, establishing a great paradigm for tokenizer research.

Limitations & Future Work

  • Training data is predominantly English-centric: FineWeb is primarily English data; the strengths of multilingual tokenizers might be more pronounced when trained on multilingual pre-training corpora.
  • Evaluations are limited to 350M and 2.7B scales: It remains uncertain whether the same scaling consistency persists at scales of 7B+.
  • Zipfian metrics are ineffective for English: Because all tokenizers yield similar Zipfian statistics on English text, the metrics lose discriminative power there.
  • Training efficiency is not accounted for: While very large vocabularies (e.g., Aya 23's 256k) offer excellent translation performance, they also increase training and inference cost.

Comparison with Related Work

  • vs Goldman et al. (2024): Goldman et al. suggest that compression rate strongly predicts English generation performance, a conclusion that the larger-scale experiments in this study do not support.
  • vs Ali et al. (2024): While Ali et al.'s multilingual tokenizer evaluation was restricted to classification tasks, this study extends the evaluation to generative tasks, revealing a much stronger impact.
  • vs Schmidt et al. (2024): Both share concerns regarding the reliability of compression rate; this work further introduces alternative metrics to address this.

Rating

  • Novelty: ⭐⭐⭐⭐ Zipf's law metrics offer a fresh perspective, and the scaling consistency experimental design is meticulously crafted.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely systematic and comprehensive, covering 6 tokenizers \(\times\) 2 scales \(\times\) 3 task types.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, rigorous statistical analysis, and informative tables.
  • Value: ⭐⭐⭐⭐ Offers direct, practical guidance for LLM developers choosing a tokenizer.