
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Conference: ICLR 2026 · arXiv: 2603.03583 · Code: Not released · Area: NLP / tokenizer-free LM (categorized under segmentation track)
Keywords: byte-level LM, tokenizer-free, coding rate, hierarchical architecture, self-tokenization

TL;DR

This paper proposes ByteFlow Net, a tokenizer-free hierarchical byte-level language model that leverages an information-theoretic coding rate to adaptively compress raw byte streams into semantic units, outperforming BPE baselines and existing byte-level architectures on both pretraining loss and downstream tasks.

Background & Motivation

Starting Point

Background:

  1. Modern LLMs rely on BPE tokenizers whose granularity is fixed once trained.
  2. Fixed tokenization leads to brittle behavior in counting, arithmetic, structured data, and multilingual settings.
  3. Tokenization is the only non-learnable stage in the pipeline, breaking end-to-end modeling.
  4. Existing tokenizer-free approaches are either pure byte-level models (computationally expensive due to long sequences) or heuristic chunking methods (fixed stride or whitespace boundaries, which carry a strong inductive bias).
  5. Dynamic chunking methods (e.g., BLT, which uses entropy thresholds) require multi-stage training and are not truly end-to-end.
  6. A principled approach for dynamically allocating FLOPs across positions is lacking.

Method

Overall Architecture

Five-stage hierarchical pipeline: byte embedding → Local Encoder (byte-level contextualization) → Downsampling (coding-rate chunking) → Global Transformer (high-level abstract modeling) → Upsampling + Decoder (byte-level prediction).
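
As a shape-level walkthrough of this pipeline, here is a minimal PyTorch sketch; the dimensions and the random stand-in tensors are hypothetical, purely to make the data flow concrete:

```python
import torch

# Hypothetical sizes for illustration only (not from the paper).
B, T, K = 2, 8192, 1024                 # batch, byte-sequence length, chunks (K << T)
d_local, d_global, V = 256, 1024, 256   # local/global widths; V = 256-byte vocabulary

x = torch.randint(0, V, (B, T))                    # raw byte IDs
h = torch.randn(B, T, d_local)                     # Local Encoder output (SWA + Canon)
idx = torch.randperm(T)[:K].sort().values          # stand-in for coding-rate top-K boundaries
z = h[:, idx] @ torch.randn(d_local, d_global)     # Downsampling: (B, K, d_global)
g = z                                              # Global Transformer: full attention over K
s = h + torch.randn(B, T, d_local)                 # Upsampling + large residual, back to length T
logits = s @ torch.randn(d_local, V)               # Decoder head: next-byte prediction (B, T, V)
```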

Key Design 1: Local Encoder (Shallow / Narrow)

  • A multi-layer compact Transformer using sliding window attention (SWA) + Canon Layer.
  • SWA reduces complexity from \(O(T^2)\) to \(O(T \cdot w_{local})\).
  • Canon Layer (\(\approx\) causal conv1d with kernel=4): \(\text{Canon}(h_t) = w_0 \odot h_t + w_1 \odot h_{t-1} + w_2 \odot h_{t-2} + w_3 \odot h_{t-3}\), promoting local token mixing (see the sketch after this list).
  • Motivation for Canon Layer: SWA alone requires \(T/w_{local}\) layers to ensure global connectivity; the Canon Layer compensates for this at negligible cost.
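
A minimal PyTorch sketch of a Canon-style mixer implementing the formula above. Treating each \(w_k\) as a learned elementwise weight vector is an assumption about the parameterization; the notes only give the \(\odot\) form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    """Causal conv1d-like mixer over the current and previous 3 positions (kernel = 4)."""
    def __init__(self, d_model: int, kernel: int = 4):
        super().__init__()
        # One elementwise weight vector per tap: w_0 .. w_{kernel-1}.
        self.w = nn.Parameter(torch.randn(kernel, d_model) * 0.02)
        self.kernel = kernel

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, d_model)
        out = self.w[0] * h
        for k in range(1, self.kernel):
            # Shift right by k positions (causal); zero-pad the front.
            shifted = F.pad(h, (0, 0, k, 0))[:, :-k, :]
            out = out + self.w[k] * shifted
        return out

layer = CanonLayer(d_model=256)
y = layer(torch.randn(2, 16, 256))   # (batch, T, d_model) in, same shape out
```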

Key Design 2: Coding-Rate Chunking (Core Innovation)

  • Objective: Select \(K \ll T\) positions from \(T\) byte positions as high-level tokens.
  • Coding rate: \(R_\varepsilon(h_{1:T}) = \frac{1}{2} \log \det(I + \frac{d_{local}}{\varepsilon^2} h_{1:T} h_{1:T}^\top)\)
    • High coding rate → representations span diverse directions in feature space → high information content → should serve as segmentation boundaries.
  • Marginal coding rate \(\Delta R_t = R_\varepsilon(h_{1:t}) - R_\varepsilon(h_{1:t-1})\): measures the information gain of the \(t\)-th byte.
  • Top-K selection (rather than a global threshold): maintains fixed length \(K\), preserves a static computation graph, and avoids OOM and ragged tensor issues from dynamic-length outputs.
  • Design Motivation: Segmentation decisions are cast as an online information-theoretic optimization problem: positions are promoted to higher computational levels based on their information encoding cost (a minimal sketch follows this list).
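
A naive PyTorch sketch of the marginal coding rate and Top-K boundary selection, following the formula above (the \(\varepsilon\) value is an arbitrary placeholder). It recomputes a log-det per prefix, far too slow for real training, where an incremental or approximate update would be needed, and uses Sylvester's determinant identity to stay in \(d \times d\) rather than \(t \times t\):

```python
import torch

def marginal_coding_rates(h: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """ΔR_t = R_eps(h_{1:t}) - R_eps(h_{1:t-1}) for every byte position.

    h: (T, d) local-encoder states. Naive O(T · d^3) illustration only.
    """
    T, d = h.shape
    c = d / (eps ** 2)                  # the d_local / eps^2 coefficient
    eye = torch.eye(d, dtype=h.dtype, device=h.device)
    rates = torch.empty(T, dtype=h.dtype, device=h.device)
    prev = torch.zeros((), dtype=h.dtype, device=h.device)
    for t in range(1, T + 1):
        ht = h[:t]                      # prefix h_{1:t}, shape (t, d)
        # Sylvester: logdet(I_t + c·H Hᵀ) = logdet(I_d + c·Hᵀ H)
        r = 0.5 * torch.logdet(eye + c * (ht.T @ ht))
        rates[t - 1] = r - prev
        prev = r
    return rates

def topk_boundaries(h: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the K positions with the highest marginal coding rate as chunk boundaries."""
    return torch.topk(marginal_coding_rates(h), k).indices.sort().values
```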

Key Design 3: Global Transformer (Deep / Wide)

  • Performs full causal attention over the compressed sequence of length \(K\): \(g_{1:K} = \text{Transformer}_{global}(z_{1:K})\)
  • Since \(K \ll T\), a deeper and wider architecture can be employed to concentrate compute on high-level reasoning.
  • FLOPs for \(G\) global layers \(\approx O(G(K^2 d_{global} + K d_{global}^2))\), covering the attention and MLP terms; a rough worked comparison follows.
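
For a rough sense of scale (hypothetical numbers, not from the paper): compressing \(T = 8192\) bytes to \(K = 1024\) chunks (8\(\times\)) shrinks the quadratic attention term by \((T/K)^2 = 64\times\) relative to full byte-level attention, which is what makes a deeper and wider global stack affordable.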

Key Design 4: Upsampling + Decoder

  • Multilinear reconstruction: each byte position selects a corresponding projection matrix \(W_{bin(t)}\) based on its chunk and bin assignment.
  • Large residual connection \(s_t = h_t + \tilde{s}_t\): fuses local-encoder features with the upsampled global information (see the sketch after this list).
  • Decoder is symmetric to the Local Encoder: SWA + Canon, performing next-byte prediction.
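
A minimal PyTorch sketch of the multilinear upsampling step plus the large residual. Defining \(bin(t)\) as the byte's offset within its chunk, clamped to a fixed bin count, is an assumption; the notes do not pin it down:

```python
import torch
import torch.nn as nn

class MultilinearUpsampler(nn.Module):
    """Map chunk-level states back to byte positions via per-bin projections W_{bin(t)}."""
    def __init__(self, d_global: int, d_local: int, num_bins: int = 4):
        super().__init__()
        # One projection matrix per bin.
        self.proj = nn.Parameter(torch.randn(num_bins, d_global, d_local) * 0.02)
        self.num_bins = num_bins

    def forward(self, g: torch.Tensor, chunk_id: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # g: (K, d_global) global states; chunk_id: (T,) non-decreasing chunk index per byte;
        # h: (T, d_local) local-encoder states for the large residual s_t = h_t + s̃_t.
        pos = torch.arange(chunk_id.shape[0], device=chunk_id.device)
        first = torch.searchsorted(chunk_id, chunk_id)        # first byte index of each byte's chunk
        bins = (pos - first).clamp(max=self.num_bins - 1)     # bin(t): offset within chunk (assumed)
        W = self.proj[bins]                                   # (T, d_global, d_local)
        s_tilde = torch.einsum('td,tdo->to', g[chunk_id], W)  # per-position projection
        return h + s_tilde                                    # large residual connection
```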

Loss & Training

  • Standard cross-entropy loss, normalized per byte and reported as Bits-Per-Byte (BPB); the conversion is shown below.
  • AdamW optimizer, lr=1e-3, gradient clipping=1.0, batch size 8, sequence length 8192→3200→8192.
  • Pretrained on FineWeb-Edu-100B (~500B bytes).
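
For reference, BPB is just the per-byte cross-entropy with the logarithm rebased to bits; a one-line conversion:

```python
import math

def bits_per_byte(ce_nats_per_byte: float) -> float:
    """Bits-per-byte from mean per-byte cross-entropy in nats (as PyTorch reports it)."""
    return ce_nats_per_byte / math.log(2)

# e.g., a per-byte CE of 0.4713 nats ≈ 0.68 BPB, matching the best ablation entry below.
```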

Key Experimental Results

Main Results

| Model (1.3B, 500B tokens) | HellaSwag | WinoGrande | BoolQ | PIQA  | ARC-e | ARC-c | Avg   |
|---------------------------|-----------|------------|-------|-------|-------|-------|-------|
| LLaMA (BPE)               | 54.12     | 53.74      | 73.26 | 70.43 | 72.38 | 36.95 | 60.15 |
| MambaByte                 | 49.21     | 52.97      | 72.48 | 69.67 | 71.53 | 36.42 | 58.71 |
| SpaceByte                 | 48.76     | 53.15      | 72.04 | 69.18 | 71.12 | 36.05 | 58.38 |
| AU-Net                    | 50.34     | 54.12      | 73.85 | 74.87 | 72.91 | 37.43 | 60.59 |
| ByteFlow Net              | 55.42     | 56.93      | 76.48 | 74.25 | 75.87 | 40.36 | 63.19 |

Ablation Study

| Chunking Strategy        | BPB ↓ | Avg Score ↑ | Notes                             |
|--------------------------|-------|-------------|-----------------------------------|
| Fixed stride             | 0.75  | 57.2        | MegaByte-style                    |
| Whitespace tokenization  | 0.73  | 58.4        | SpaceByte-style                   |
| Cosine similarity        | 0.71  | 60.1        | H-Net-style                       |
| Coding rate (ByteFlow)   | 0.68  | 63.2        | Information-theoretically optimal |

Key Findings

  • ByteFlow at the 1.3B scale surpasses not only all byte-level methods but also the BPE baseline LLaMA (+3.04 average score).
  • Superior scaling behavior: the gain from 600M→1.3B is larger than that of competing methods.
  • Coding-rate chunking substantially outperforms heuristic chunking in ablations (+3–6 points), confirming that information-theoretically driven segmentation produces better semantic units.
  • The Top-K selection strategy ensures consistent memory allocation during training, avoiding OOM issues associated with dynamic chunking methods.

Highlights & Insights

  • Principled chunking: Coding rate is a theoretically grounded information measure — segmentation boundaries are determined not by intuition but by information-theoretic necessity.
  • Fully end-to-end: No pretrained tokenizer or entropy model is required; the entire pipeline from bytes to predictions is unified.
  • Computational efficiency: The hierarchical design directs the majority of FLOPs to the Global Transformer operating on compressed representations, keeping byte-level processing lightweight.
  • Manifold preservation: The authors show that the coding-rate objective uniquely preserves the geometric structure of the latent manifold, avoiding the fragmentation observed in other chunking methods.

Limitations & Future Work

  • Validation is limited to academic scale (≤1.3B parameters, 500B tokens); whether the gap with industrial-scale BPE LLMs narrows at larger scale remains unknown.
  • Pretraining is conducted solely on FineWeb-Edu; performance on multilingual, code, and structured data settings has not been verified.
  • Coding-rate computation involves log-det operations (\(O(T^3)\) or approximations); actual training overhead and approximation error are not rigorously quantified.
  • Downstream evaluation covers only 6 zero-shot tasks; comprehensive benchmarks such as MMLU and GSM8K are absent.
  • A fixed Top-K chunking rate may not be optimal for all inputs — information-dense passages and redundant passages may benefit from different compression ratios.

Comparison with Related Work

  • vs. MegaByte / SpaceByte / AU-Net: These methods rely on heuristic chunking via fixed stride or whitespace boundaries; ByteFlow's coding-rate chunking is the first principled alternative.
  • vs. BLT: BLT employs a pretrained entropy model for dynamic chunking with a global threshold, constituting a two-stage, non-end-to-end approach; ByteFlow is fully end-to-end.
  • vs. H-Net (concurrent work): H-Net uses cosine similarity for adaptive chunking; ByteFlow uses information-theoretic coding rate, better preserving latent manifold geometry.
  • vs. MambaByte: A purely byte-level SSM without hierarchical structure; ByteFlow's hierarchical design achieves superior efficiency.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The coding-rate-driven self-segmentation scheme is elegant both theoretically and practically.
  • Experimental Thoroughness: ⭐⭐⭐ Scale is limited (≤1.3B); downstream benchmarks are narrow; larger-scale validation is anticipated.
  • Writing Quality: ⭐⭐⭐⭐ Clear and systematic, with well-articulated theoretical motivation.
  • Value: ⭐⭐⭐⭐ A significant advance in tokenizer-free LM research, with potential to reshape foundational paradigms in language modeling.