Byte Latent Transformer: Patches Scale Better Than Tokens
Conference: ACL 2025
arXiv: 2412.09871
Code: https://github.com/facebookresearch/blt
Area: Others
Keywords: byte-level LLM, tokenizer-free, dynamic patching, entropy-based segmentation, scaling laws
TL;DR
Proposes Byte Latent Transformer (BLT), a tokenizer-free byte-level LLM architecture that groups bytes into variable-length patches using entropy-based dynamic segmentation. It is the first byte-level architecture to match token-based models at scale (up to 8B parameters), and it opens a new scaling axis: patch size and model size can be increased together while inference cost stays fixed.
Background & Motivation
Background: Almost all modern LLMs use a tokenizer such as BPE to convert byte sequences into tokens from a fixed vocabulary. Tokenization has become the default choice, yet it is the one heuristic preprocessing step in the training pipeline that is not learned end to end.
Limitations of Prior Work: Fixed-vocabulary tokenization has several defects: (1) the same word can be split into different tokens in different contexts, causing inconsistency; (2) extreme sensitivity to input noise (typos, casing variations); (3) lack of orthographic knowledge (the model does not know which characters make up a word); (4) multilingual inequity (longer token sequences and higher computational cost for low-resource languages); and (5) domain/modality sensitivity. Prior byte-level models (such as MegaByte) were held back by the compute cost of very long byte sequences and were not competitive at scale.
Key Challenge: Byte-level modeling removes the tokenizer-related problems listed above, but running a large Transformer directly on bytes is expensive. The cost is dominated by the large FFN layers rather than attention, so it grows linearly with sequence length, and byte sequences are several times longer than token sequences. The key observation is that most byte predictions are easy (e.g., completing a word once its first few bytes are known) and do not need the full capacity of a large Transformer.
Goal: Train LLMs efficiently at the byte level, retaining the advantages of byte-level modeling while matching token-based models in both efficiency and performance.
Key Insight: Allocate computation dynamically according to the entropy of the next-byte prediction: more computation where information density is high, less where it is low. This is more principled than BPE's compression heuristics.
Core Idea: Use the entropy estimates of a small byte-level language model to dynamically segment bytes into variable-length patches. The large Transformer operates only at the patch level, while lightweight models handle intra-patch bytes.
Method
Overall Architecture
BLT consists of three components: (1) Local Encoder—a lightweight Transformer (layers \(l_\mathcal{E} \ll l_\mathcal{G}\)) that encodes input bytes into patch representations by pooling bytes into patches using cross-attention; (2) Latent Global Transformer—a large autoregressive Transformer that operates on patch representations and consumes the vast majority of FLOPs; (3) Local Decoder—a lightweight Transformer that decodes patch representations back into byte sequences. Pipeline: Input bytes \(\rightarrow\) dynamic grouping via entropy patching \(\rightarrow\) Local Encoder encoding to patches \(\rightarrow\) Global Transformer processing \(\rightarrow\) Local Decoder decoding to bytes.
Key Designs
- Entropy Patching (Entropy-based Dynamic Grouping):
  - Function: Dynamically allocate computational resources according to data complexity.
  - Mechanism: Train a 100M-parameter byte-level language model to compute the next-byte entropy at each byte position: \(H(x_i) = -\sum_{v \in \mathcal{V}} p_e(x_i=v|\mathbf{x}_{<i}) \log p_e(x_i=v|\mathbf{x}_{<i})\). A new patch starts when the entropy exceeds a global threshold \(\theta_g\). For instance, in "George R.R. Martin", the entropy at "G" is high (the small model is uncertain which byte comes next), so "G" starts a new patch, while the bytes of "eorge" have low entropy and stay in the same patch. A second criterion, the approximate monotonicity constraint, instead starts a patch when the entropy jumps relative to the previous byte: \(H(x_i) - H(x_{i-1}) > \theta_r\). The average patch size can be tuned by adjusting these thresholds.
  - Design Motivation: BPE compresses based on frequency statistics, which is not necessarily aligned with prediction difficulty. Entropy patching lets the model spend global Transformer steps on hard-to-predict positions (e.g., the first byte of a new word or sentence) and handle easy positions (e.g., inside words) with the lightweight local models.
- Hash N-gram Embeddings:
  - Function: Inject local contextual information into byte positions.
  - Mechanism: For each byte position \(i\), construct 3-gram to 8-gram byte spans and map them to a 500K-sized embedding table using a polynomial rolling hash, adding them to the byte embedding: \(e_i = x_i + \sum_{n=3}^{8} E_n^{hash}(\text{Hash}(g_{i,n}))\).
  - Design Motivation: A single byte (0-255) contains very little information. N-gram embeddings allow each position to "see" patterns of the preceding bytes (common prefixes, suffixes), compensating for the lack of subword information in byte-level models.
- Encoder-Decoder Cross-Attention:
  - Function: Efficiently transfer information between byte and patch representations.
  - Mechanism: In the Encoder, patches serve as queries and bytes as keys/values (Perceiver-style), pooling byte information into patches. In the Decoder, this is reversed: bytes act as queries and patches as keys/values. Queries are initialized via max-pooling. Each patch query only attends to bytes within its corresponding patch. The patch dimension \(h_\mathcal{G}\) is formed by concatenating multiple heads of dimension \(h_\mathcal{E}\).
  - Design Motivation: Cross-attention is more efficient than global self-attention and naturally fits the byte-to-patch scale transformation. Masking strategies ensure causal compliance.
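The first two designs above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the entropy values are hand-picked toy numbers rather than outputs of a real byte-level model, and the hash base (257) and all function names are assumptions made for the example. Only the 3-to-8 n-gram range and the 500K table size come from the description above.

```python
import math

def next_byte_entropy(probs):
    """Shannon entropy (in nats) of one next-byte distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def patch_starts_global(entropies, theta_g):
    """Start a patch at position 0 and wherever entropy exceeds theta_g."""
    return [i for i, h in enumerate(entropies) if i == 0 or h > theta_g]

def patch_starts_relative(entropies, theta_r):
    """Start a patch at position 0 and wherever entropy rises by more than
    theta_r relative to the previous byte."""
    return [i for i in range(len(entropies))
            if i == 0 or entropies[i] - entropies[i - 1] > theta_r]

def split_into_patches(data, starts):
    """Cut a bytes object at the given patch start indices."""
    bounds = list(starts) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]

def hash_ngram_ids(data, i, table_size=500_000, base=257):
    """Bucket ids of the n-grams (n = 3..8) that end at byte position i.
    n-grams that would start before the sequence begins are skipped.
    The base is a hypothetical choice for this sketch."""
    ids = {}
    for n in range(3, 9):
        if i + 1 < n:
            continue
        h = 0
        for b in data[i + 1 - n : i + 1]:
            h = (h * base + b) % table_size  # polynomial rolling hash
        ids[n] = h
    return ids

data = b"George R.R. Martin"
# Toy per-byte entropies (hypothetical): high at "G", "R" and "M".
entropies = [3.1, 0.4, 0.3, 0.2, 0.1, 0.1, 0.9,
             2.8, 1.0, 1.2, 0.6, 0.7,
             3.0, 0.5, 0.3, 0.2, 0.1, 0.1]

starts_global = patch_starts_global(entropies, theta_g=2.0)
starts_relative = patch_starts_relative(entropies, theta_r=1.5)
patches = split_into_patches(data, starts_global)
print(patches)
print(hash_ngram_ids(data, 5))
```

With these toy values both criteria place patch starts at "G", "R", and "M". In a full model, each bucket id would index the embedding table for its n-gram length, and the looked-up vectors would be summed with the byte embedding.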
Loss & Training
Standard byte-level autoregressive cross-entropy loss is employed. The Local Decoder outputs 256-dimensional logits (corresponding to the byte vocabulary size). The model is optimized using AdamW (\(\beta_1=0.9, \beta_2=0.95\)), a learning rate of 4e-4 with cosine decay to 0, a 2000-step warmup, and weight decay of 0.1. Scaling laws are studied on Llama 2 data (2T tokens), and full training is conducted on the BLT-1T high-quality dataset. The batch size is held at 16M bytes/batch, avoiding padding by packing patches.
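The learning-rate schedule described above can be sketched as follows. The peak rate (4e-4), the 2000-step warmup, and the cosine decay to 0 come from the text; the linear shape of the warmup and the total step count are assumptions made for illustration.

```python
import math

def learning_rate(step, total_steps, peak_lr=4e-4, warmup_steps=2000):
    """Warmup to peak_lr, then cosine decay to 0 at total_steps.
    The linear warmup shape is an assumption; the source only gives its length."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

TOTAL_STEPS = 100_000  # hypothetical training length
for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, learning_rate(step, TOTAL_STEPS))
```

The rate rises to its peak at step 2000, passes through half the peak midway through the decay phase, and reaches 0 at the final step.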
Key Experimental Results
Main Results (Downstream Task Evaluation of 8B Model)
| Model | Arc-E | Arc-C | HellaSwag | PIQA | MMLU | MBPP | HumanEval | Average |
|---|---|---|---|---|---|---|---|---|
| Llama 3 (1T tokens) | 77.6 | 53.3 | 79.1 | 80.7 | 58.1 | 40.2 | 31.1 | 60.0 |
| BLT-Space (6T bytes) | 75.4 | 49.8 | 79.6 | 81.1 | 54.8 | 37.6 | 27.4 | 58.0 |
| BLT-Entropy (4.5T bytes) | 79.6 | 52.1 | 80.6 | 80.6 | 57.4 | 41.8 | 35.4 | 61.1 |
(FLOP-matched. BLT-Entropy outperforms Llama 3 on 4 of 7 tasks and on average, and trails by at most 1.2 points on Arc-C, PIQA, and MMLU.)
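As a quick arithmetic check on the table, the snippet below recomputes the Average column and the per-task comparison between BLT-Entropy and Llama 3. The scores are copied from the rows above; the variable names are only for this example.

```python
# Task order: Arc-E, Arc-C, HellaSwag, PIQA, MMLU, MBPP, HumanEval
scores = {
    "Llama 3":     [77.6, 53.3, 79.1, 80.7, 58.1, 40.2, 31.1],
    "BLT-Space":   [75.4, 49.8, 79.6, 81.1, 54.8, 37.6, 27.4],
    "BLT-Entropy": [79.6, 52.1, 80.6, 80.6, 57.4, 41.8, 35.4],
}

# Mean over the seven tasks, rounded to one decimal as in the table.
averages = {name: round(sum(v) / len(v), 1) for name, v in scores.items()}

pairs = list(zip(scores["BLT-Entropy"], scores["Llama 3"]))
wins = sum(blt > llama for blt, llama in pairs)
largest_deficit = round(max(llama - blt for blt, llama in pairs), 1)

print(averages)
print(wins, largest_deficit)
```

The recomputed means agree with the Average column, and the task-level comparison shows BLT-Entropy ahead on four tasks and slightly behind on the other three.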
Robustness and Character-level Tasks
| Task | Llama 3 (1T) | Llama 3.1 (16T) | BLT (1T) |
|---|---|---|---|
| HellaSwag Noise Avg | 56.9 | 64.3 | 64.3 |
| CUTE Character Understanding | 27.5 | 20.0 | 54.1 |
| Spelling | 1.1 | - | 99.9 |
| Spelling Inverse | 30.1 | 3.6 | 99.9 |
| Contains Char | 0.0 | 0.0 | 55.9 |
| Substitute Char | 0.4 | 1.2 | 48.7 |
Key Findings
- BLT-Entropy outperforms Llama 3 in average performance given equivalent training FLOPs (61.1 vs 60.0), with significant improvements in coding tasks (HumanEval 35.4 vs 31.1).
- Character-level understanding vastly outperforms token models: Spelling accuracy achieves 99.9% vs 1.1%, and character inclusion checks reach 55.9% vs 0.0%. Token-based models fundamentally lack access to individual characters inside tokens.
- Noise robustness improves by about 7 percentage points (HellaSwag noise average 64.3 vs 56.9), matching Llama 3.1 trained on 16x more data. This indicates that byte-level awareness is not something more data easily compensates for.
- Scaling under fixed inference FLOPs: BLT can increase patch size and model parameters together while keeping inference cost constant. The patch-size-8 model overtakes the BPE model once training data reaches approximately 2.5x the compute-optimal amount.
- Significant improvements in low-resource language translation: Armenian 1.7 \(\rightarrow\) 6.3, Georgian 1.7 \(\rightarrow\) 7.4, Bengali 4.7 \(\rightarrow\) 12.7 (BLEU), showing an especially pronounced advantage for non-Latin alphabets.
- BLT-Space (patch size 6) performs slightly below Llama 3 but saves approximately 30% in inference FLOPs, offering a flexible trade-off between performance and efficiency.
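The fixed-inference-FLOPs finding can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes the common approximation of about 2 forward FLOPs per parameter per step and counts only the global Transformer, which runs once per patch. Local encoder/decoder and attention costs are ignored, and the model sizes are hypothetical, not taken from the paper.

```python
def global_flops_per_byte(n_params, avg_patch_size):
    """Approximate forward FLOPs per byte spent in the global model.
    Assumes ~2 FLOPs per parameter per step and one step per patch."""
    return 2.0 * n_params / avg_patch_size

def params_at_same_cost(n_params, old_patch_size, new_patch_size):
    """Global-model size that keeps per-byte FLOPs unchanged after a
    change in average patch size (under the same approximation)."""
    return n_params * new_patch_size / old_patch_size

small = global_flops_per_byte(4e9, avg_patch_size=4)   # hypothetical 4B model
large = global_flops_per_byte(8e9, avg_patch_size=8)   # hypothetical 8B model
budget_params = params_at_same_cost(4e9, 4, 8)

print(small, large, budget_params)
```

Under these assumptions, doubling the average patch size leaves room for a global model twice as large at the same per-byte cost, which is the trade-off the scaling experiments exploit.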
Highlights & Insights
- Entropy patching elegantly solves computation allocation: It uses the uncertainty of a small model to guide where the large model spends computation, so easy bytes are handled cheaply by the local models while hard bytes trigger a full global step. Compute is spent where it matters most.
- A new scaling dimension: Scaling the vocabulary of token-based models is limited by embedding layer growth. BLT's patch size can be increased arbitrarily without affecting the parameter count, unlocking a unique path of "larger model + larger patch = better performance + constant inference cost." As the model size scales up, the FLOP ratio of the Local Encoder/Decoder gradually shrinks, making the advantages of larger patch sizes even more prominent.
- The "free lunch" of byte-level modeling: Capabilities like spelling, character manipulation, noise robustness, and multilingual processing—which token models require massive amounts of data to partially recover—are naturally built-in with BLT.
Limitations & Future Work
- The entropy model introduces preprocessing overhead (though this can be optimized using smaller models or lookup tables).
- Inference requires step-by-step determination of patch boundaries (incremental patching), presenting higher engineering complexity than BPE.
- It has only been validated up to the 8B scale; whether the trend persists at 70B+ sizes requires verification.
- BLT-Space performs worse than Llama 3 under equivalent FLOPs, which suggests that overly large patches can lose information; the optimal patch size likely needs to be tuned to the model scale.
- Exploration in instruction tuning and RLHF scenarios is currently lacking.
Related Work & Insights
- vs MegaByte (Yu et al., 2023): MegaByte uses fixed-stride grouping without considering complexity and lacks n-gram embeddings and cross-attention. Each innovation in BLT directly addresses a specific limitation of MegaByte.
- vs SpaceByte (Slagle, 2024): Grouping by space is better than a fixed stride but cannot handle non-space-segmented languages (such as Chinese or Japanese), nor can it adjust the patch size. BLT's entropy patching serves as a universally applicable, adjustable alternative.
- vs Llama 3: BLT matches performance while providing extra leverage for inference efficiency and character-level capabilities. It exhibits a superior long-term scaling trend—the most compelling evidence of its potential.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Demonstrates for the first time at scale that byte-level models can match token-based models. Entropy patching and patch scaling are highly original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Provides a comprehensive evaluation with full scaling laws from 400M to 8B, downstream tasks, robustness, multilingual capabilities, and detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐⭐ Logically clear, with transparent and reproducible FLOP calculations and excellent visualization design.
- Value: ⭐⭐⭐⭐⭐ Carries the potential to shift the preprocessing paradigm of LLMs, fundamentally resolving several long-standing pain points of tokenization.