SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook¶
Conference: ICLR 2026 arXiv: 2503.06764 Area: LLM Pretraining Keywords: image tokenizer, hierarchical codebook, semantic guidance, unified understanding + generation, SGHC, MLLM
TL;DR¶
This paper proposes SemHiTok — a tokenizer that unifies visual understanding and generation via a Semantic-Guided Hierarchical Codebook (SGHC): pixel sub-codebooks are constructed on top of a pretrained semantic codebook, with structure and training fully decoupled (stage-wise optimization) to avoid the semantic–pixel conflict in joint training. Under the LLaVA setting, SemHiTok achieves state-of-the-art performance in both understanding and reconstruction among discrete tokenizers.
Background & Motivation¶
Background: Unified MLLMs require tokenizers that simultaneously support understanding (high-level semantics) and generation (low-level pixels).
Limitations of Prior Work:
- CLIP-based methods → strong semantics but poor pixel fidelity; VQGAN-based methods → good pixel reconstruction but weak semantics.
- Joint-training approaches → sub-optimal trade-offs: VILA-U mixes semantic and reconstruction losses; the SDE encoder decouples the encoders but still mixes codebooks.
- Dual-encoder designs (e.g., Janus) → token count doubles or the vocabulary explodes → inefficient.
- TokenFlow → shared mapping, but the joint-training conflict still degrades performance.
Key Insight: The observation that patches sharing the same semantic code exhibit similar pixel distributions motivates building sub-codebooks under each semantic code, achieving full decoupling in both structure and training.
Method¶
Overall Architecture¶
A semantic branch (VQKD aligned with SigLIP) produces the fixed semantic codebook \(C_\text{sem}\); a pixel branch (a ViT encoder) learns the pixel codebook \(C_\text{pix}\). Semantic and pixel tokens are concatenated along the channel dimension to form a unified discrete representation.
Key Designs¶
- Semantic Codebook: SigLIP features → EMA vector quantization → cosine + L1 distillation → frozen after training.
- SGHC: \(C_\text{pix} = \{C_\text{pix}^1, \ldots, C_\text{pix}^K\}\) (\(K\) semantic codes × \(m\) sub-codes each); for each patch, the semantic branch first assigns a semantic index \(k\), then the \(k\)-th sub-codebook quantizes that patch's pixel features (see the sketch after this list).
- Stage-wise Training: Stage 1 trains the semantic branch (VQKD) and freezes it; Stage 2 trains the pixel branch (L1 + perceptual + GAN losses); the two stages are conflict-free.
- Unified MLLM Integration: tokens are flattened as \(h = i \times m + j\), where \(i\) is the semantic index and \(j\) the sub-code index; a Dual-MLP adapter projects semantic and pixel tokens separately, then concatenates them before feeding the LLM.
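To make the two-level lookup concrete, below is a minimal PyTorch sketch of SGHC quantization and the flattened token id \(h = i \times m + j\). This is an illustration, not the authors' code: the shapes, the function name `sghc_quantize`, and the nearest-neighbor details are assumptions, and the VQKD/ViT encoders, EMA codebook updates, and straight-through gradients are omitted.

```python
# Minimal sketch of SGHC two-level quantization (illustrative, assumed shapes).
import torch

K, m, d_sem, d_pix = 24_576, 8, 32, 32           # codebook sizes / code dims (assumed)
sem_codebook = torch.randn(K, d_sem)             # frozen semantic codebook C_sem
pix_subcodebooks = torch.randn(K, m, d_pix)      # one m-entry pixel sub-codebook per semantic code

def sghc_quantize(z_sem: torch.Tensor, z_pix: torch.Tensor):
    """z_sem: (N, d_sem) semantic features; z_pix: (N, d_pix) pixel features per patch."""
    # Level 1: nearest semantic code for each patch -> semantic index i
    i = torch.cdist(z_sem, sem_codebook).argmin(dim=1)            # (N,)
    # Level 2: quantize pixel features within the i-th sub-codebook -> sub-code index j
    subs = pix_subcodebooks[i]                                    # (N, m, d_pix)
    j = (z_pix.unsqueeze(1) - subs).pow(2).sum(-1).argmin(dim=1)  # (N,)
    h = i * m + j                                                 # flattened unified token id
    return sem_codebook[i], subs[torch.arange(len(j)), j], h
```

In the unified MLLM, the two returned code embeddings would each pass through their own MLP (the Dual-MLP adapter) before channel-wise concatenation.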
Loss & Training¶
- SigLIP is frozen; \(K = 24{,}576\) semantic codes with \(m = 8\) sub-codes each → total codebook size \(K \times m = 196{,}608\); base LLM: Qwen2.5-7B-Instruct.
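The stage-wise objectives can be sketched as follows. This is hedged: the exact cosine + L1 distillation pairing in Stage 1 and the assignment of \(\lambda_1 / \lambda_2 / \lambda_3\) to the L1 / perceptual / adversarial terms in Stage 2 are assumptions based only on the loss families named above.

```python
# Hedged sketch of the two training stages' objectives, not the paper's exact code.
# Assumptions: Stage 1 distills quantized features toward frozen SigLIP features;
# Stage 2 pairs lambda_1/2/3 with the L1 / perceptual / adversarial terms in that order.
import torch
import torch.nn.functional as F

def stage1_distill_loss(z_q, z_siglip):
    """Stage 1: align the (to-be-frozen) semantic codebook with SigLIP features."""
    cos = 1.0 - F.cosine_similarity(z_q, z_siglip, dim=-1).mean()
    return cos + F.l1_loss(z_q, z_siglip)

def stage2_pixel_loss(x, x_rec, disc_logits_fake, perceptual_fn,
                      lam1=1.0, lam2=1.0, lam3=0.1):  # weights are placeholders
    """Stage 2: reconstruction objective for the pixel branch (semantic branch frozen)."""
    l1 = F.l1_loss(x_rec, x)                 # pixel-space L1
    perc = perceptual_fn(x_rec, x)           # e.g. an LPIPS-style perceptual distance
    g_adv = -disc_logits_fake.mean()         # non-saturating generator loss
    return lam1 * l1 + lam2 * perc + lam3 * g_adv
```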
Key Experimental Results¶
Reconstruction (Table 1, ImageNet-50k)¶
| Method | Type | Codebook Size | rFID↓ |
|---|---|---|---|
| LlamaGen | Only Recon | 16,384 | 2.19 |
| IBQ | Only Recon | 262,144 | 1.00 |
| VILA-U | Unified | 16,384 | 1.80 |
| TokenFlow | Unified | 32,768 | 1.37 |
| SemHiTok | Unified | 196,608 | 1.16 |
| SemHiTok-384 | Unified | 196,608 | 0.66 |
Understanding (Table 2, LLaVA-v1.5)¶
| Model | Resolution | POPE | MME-P | SEED | GQA |
|---|---|---|---|---|---|
| SigLIP (continuous) | 256 | 83.8 | 1481 | 65.3 | 61.9 |
| VILA-U | 256 | 81.6 | 1312 | 56.9 | 55.3 |
| SemHiTok | 256 | 82.5 | 1356 | 62.9 | 60.3 |
| SemHiTok-384 | 384 | 86.3 | 1466 | 64.1 | 62.3 |
Key Findings¶
- Achieves state-of-the-art understanding among discrete tokenizers, approaching or partially surpassing continuous SigLIP.
- rFID of 1.16 (256 px) / 0.66 (384 px) → state-of-the-art reconstruction among unified tokenizers.
- POPE 82.5 vs. VILA-U 81.6 (+0.9); SEED 62.9 vs. 56.9 (+6.0).
- Total codebook size \(K \times m = 196\text{K}\) is comparable to LLM text vocabulary size (Qwen2 ~150K) → no vocabulary explosion.
Highlights & Insights¶
- SGHC Design: the observation that patches sharing a semantic code have similar pixel distributions motivates sub-codebook refinement; an elegant and principled design.
- Stage-wise Training: Completely eliminates the semantic–pixel conflict, yielding a better trade-off — a key contribution.
- No Token Explosion: The flattened codebook size (196K) is compatible with existing LLM vocabularies, enabling seamless integration.
- Non-interfering Extension: Pixel branch training does not affect the frozen semantic codebook, preserving understanding capability.
Limitations & Future Work¶
- Validation is limited to 256/384 resolutions; scalability to higher resolutions remains untested.
- The sub-codebook size \(m=8\) is fixed; adaptive \(m\) selection is unexplored.
- Only Qwen2.5-7B and Vicuna-7B are evaluated; larger LLMs await investigation.
- Evaluation of generation quality (MJHQ/GenEval) is limited in scope.
- The effect of semantic codebook size \(K\) on performance lacks thorough ablation.
- Pixel sub-spaces under certain semantic codes may be data-insufficient, potentially causing sub-codebook underfitting.
- Sensitivity analysis of loss weights (\(\lambda_1 / \lambda_2 / \lambda_3\)) in the training strategy is limited.
Supplementary Unified MLLM Results¶
- Outperforms prior unified discrete MLLMs on both understanding and generation tasks.
- Achieves state-of-the-art on most benchmarks in the Und. & Gen. Discrete category.
- Performance is comparable to some continuous-tokenizer (Only Und.) baselines.
Related Work & Insights¶
- VILA-U joint loss → sub-optimal → SemHiTok resolves this via stage-wise training.
- TokenFlow shared mapping → joint training conflict remains → SemHiTok achieves full decoupling.
- VQKD semantic codebook → SemHiTok extends it with a pixel layer.
- Insight: A hierarchical structure (semantics → pixels) may represent the optimal paradigm for unified visual tokenizers.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First proposal of SGHC + stage-wise training
- Technical Depth: ⭐⭐⭐⭐ Simple, effective, and well-motivated
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers reconstruction, understanding, and generation
- Practicality: ⭐⭐⭐⭐⭐ Directly integrable into existing MLLMs for unified understanding + generation
- Overall: ⭐⭐⭐⭐⭐ An elegant solution for unified visual tokenization