Broken Tokens: Your Language Model Can Secretly Handle Non-Canonical Tokenization¶
Conference: NeurIPS 2025 | arXiv: 2506.19004 | Code: Available | Area: LLM Pretraining | Keywords: non-canonical tokenization, character-level, robustness, embedding space, vocabulary attack
TL;DR¶
This paper reveals that LLMs can secretly handle non-canonical tokenizations (e.g., splitting "Hello" into "He"+"llo" instead of the canonical whole-word token): even when the input token sequence differs from anything seen in training, models exhibit surprising robustness. This robustness appears to stem from sub-word embeddings recombining, approximately linearly, into the corresponding whole-word representation within the first few Transformer layers.
Background & Motivation¶
Background: Modern LLMs employ sub-word tokenizers such as BPE/WordPiece, with identical canonical tokenization used during both training and inference. However, adversarial attacks, multilingual mixing, and OCR noise may produce non-canonical tokenizations.
Limitations of Prior Work:
- It is assumed that LLMs can only process canonical token sequences seen during training.
- Non-canonical tokenizations (e.g., decomposing a word into finer fragments) are believed to cause catastrophic performance degradation.
- A systematic understanding of LLM robustness to tokenization variations is lacking.
Key Challenge: Are LLMs truly this fragile—requiring precisely canonical token sequences to function? If not, where does this robustness originate?
Goal: Systematically test and explain LLM behavior under non-canonical tokenizations.
Approach: Randomly split canonical tokens into sub-tokens under various strategies, evaluate model performance across diverse tasks, and examine the geometric structure of the embedding space to explain where the robustness comes from.
Core Idea: The LLM embedding space exhibits "sub-word linear additivity"—the embedding sequence of split tokens can approximately reconstruct the representation of the canonical token after a few Transformer layers.
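Stated loosely (the notation below is a paraphrase, not the paper's own): writing \(h^{(k)}(\cdot)\) for the hidden state after \(k\) Transformer layers and \(\mathrm{pool}\) for some aggregation over sub-token positions, the claim is that

\[ \cos\Big( h^{(k)}(\text{canonical token}),\; \mathrm{pool}\big(h^{(k)}(\text{sub-tokens})\big) \Big) \approx 1 \quad \text{already for small } k, \]

i.e., the split sequence is "repaired" into approximately the canonical representation within a few layers.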
Method¶
Overall Architecture¶
Experimental design: Canonical tokenized input → split into non-canonical token sequences under different strategies → fed into an unmodified LLM → measure the degree of output quality degradation. Analysis: Examine the representational distance between split tokens and original tokens in the embedding space.
Key Designs¶
- Non-Canonical Tokenization Strategies:
- Function: Systematically generate different types of non-canonical tokenizations.
- Mechanism: (a) Random split—split each token into two sub-tokens at a random position; (b) Character-level—decompose all words to individual characters; (c) Maximal/minimal sub-word—split using different greedy strategies.
- Design Motivation: Different splitting strategies test varying degrees of deviation, ranging from mild to extreme.
- Embedding Space Analysis:
- Function: Explain the source of robustness.
- Mechanism: Measure the cosine similarity between the hidden states of split sub-tokens after \(k\) Transformer layers and those of canonical tokens. High similarity is observed as early as intermediate layers.
- Design Motivation: If a split sequence can be "repaired" back to the canonical representation within a few layers, this explains why the final output is largely unaffected.
- Cross-Task Evaluation:
- Function: Validate robustness across diverse NLP tasks.
- Mechanism: Evaluate on tasks including text completion, QA, classification, and translation, using models such as GPT-2, LLaMA, and Mistral.
- Design Motivation: Confirm that robustness is not an artifact of a specific task or model.
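The splitting strategies above can be sketched in a few lines of plain Python. This is a minimal illustration operating on raw strings; the paper splits actual BPE tokens, and the function names and the 50% split probability here are illustrative choices, not the paper's API:

```python
import random

def random_split(token, rng):
    """Split a token into two sub-tokens at a random interior position."""
    if len(token) < 2:
        return [token]
    pos = rng.randrange(1, len(token))
    return [token[:pos], token[pos:]]

def char_split(token):
    """Decompose a token into individual characters (the extreme case)."""
    return list(token)

def non_canonical(tokens, strategy, split_prob=0.5, seed=0):
    """Apply a splitting strategy to (a fraction of) a token sequence.

    The underlying text is unchanged; only the token boundaries move.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if strategy == "char":
            out.extend(char_split(tok))
        elif strategy == "random" and rng.random() < split_prob:
            out.extend(random_split(tok, rng))
        else:
            out.append(tok)
    return out

print(non_canonical(["Hello", "world"], "char"))
# ['H', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
```

Note that every strategy preserves the decoded text exactly, which is what makes these inputs "non-canonical" rather than corrupted.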
Loss & Training¶
- No training is required—this is a purely inference-time analysis.
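Given per-layer hidden states for both the canonical and the split runs of the same text (e.g., via `output_hidden_states=True` in Hugging Face Transformers), the alignment measurement reduces to a per-layer cosine similarity. A minimal sketch, where mean-pooling over sub-token states is my assumption for illustration (the paper may instead compare against the last sub-token's state):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as plain Python lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment(canonical_state, sub_token_states):
    """Alignment at one layer: cosine between the canonical token's hidden
    state and the mean-pooled hidden states of its sub-tokens.

    Mean pooling is an illustrative choice, not necessarily the paper's.
    """
    n = len(sub_token_states)
    dim = len(canonical_state)
    pooled = [sum(s[i] for s in sub_token_states) / n for i in range(dim)]
    return cosine(canonical_state, pooled)
```

Running this at each layer \(k\) would produce the similarity-vs-depth curve reported in the ablation table below (rising from ~0.6 at the input embeddings toward ~0.95 at the final layer).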
Key Experimental Results¶
Main Results¶
Performance retention rates of various models under non-canonical tokenization:
| Model | Canonical | Random Split (50%) | Character-Level | Retention Rate |
|---|---|---|---|---|
| GPT-2 | 100% | ~85–90% | ~70–80% | High |
| LLaMA-7B | 100% | ~90–95% | ~75–85% | Higher |
| Mistral-7B | 100% | ~90–95% | ~80–85% | Higher |
Ablation Study: Embedding Space Alignment¶
| Layer Depth | Cosine Similarity: Split Token vs. Canonical Token |
|---|---|
| Layer 0 (input embedding) | ~0.6 |
| Layer 4 | ~0.85 |
| Layer 8 | ~0.92 |
| Final layer | ~0.95 |
Key Findings¶
- LLMs exhibit surprising robustness to non-canonical tokenizations: Randomly splitting 50% of tokens leads to only ~5–15% performance degradation.
- Larger models are more robust: LLaMA-7B retains more performance than GPT-2 (124M).
- Embedding repair occurs in early layers: As few as 4–8 Transformer layers suffice for the representations of split tokens to align closely with canonical representations.
- Extreme character-level splitting remains acceptable: Even when fully decomposed to the character level, models can still complete most tasks.
- Security implication: Adversarial attacks based on tokenization manipulation may be less effective than previously assumed.
Highlights & Insights¶
- "Sub-word linear additivity in the embedding space" is a theoretically valuable finding—it suggests that the early layers of a Transformer act as an implicit "re-tokenization" mechanism.
- This has direct practical implications for robust LLM deployment: even when the tokenizer makes errors (e.g., due to OCR noise or malicious input), the model will not fail catastrophically.
- It prompts a reassessment of adversarial robustness research: many token-manipulation-based attacks may have been overestimated.
Limitations & Future Work¶
- Only token splitting (decomposing larger tokens into smaller ones) is tested; token merging (combining adjacent tokens) is not examined.
- Tasks with stronger dependencies on precise tokenization, such as mathematical reasoning, may behave differently.
- The embedding alignment analysis is observational and lacks a rigorous theoretical explanation.
- The effect of non-canonical tokenization on syntax-sensitive tasks such as code generation is not evaluated.
- Analysis is limited to English and Latin scripts; robustness for non-Latin scripts (e.g., Chinese, Arabic) may differ.
Related Work & Insights¶
- vs. adversarial NLP attack research: Such work assumes that tokenization manipulation can effectively attack LLMs; this paper challenges that assumption.
- vs. character-level LLMs: Models such as ByT5 are trained from scratch at the character level; this paper demonstrates that standard BPE-trained models can "secretly" also process character-level input.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ A counter-intuitive finding that challenges the assumption that tokenization is critical to LLM functioning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple models, tasks, splitting strategies, and embedding analyses.
- Writing Quality: ⭐⭐⭐⭐ Clear and intuitive.
- Value: ⭐⭐⭐⭐ Important implications for LLM robustness and security research.