SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook

Conference: ICLR 2026 · arXiv: 2503.06764 · Area: LLM Pretraining · Keywords: image tokenizer, hierarchical codebook, semantic guidance, unified understanding + generation, SGHC, MLLM

TL;DR

This paper proposes SemHiTok — a tokenizer that unifies visual understanding and generation via a Semantic-Guided Hierarchical Codebook (SGHC): pixel sub-codebooks are constructed on top of a pretrained semantic codebook, with structure and training fully decoupled (stage-wise optimization) to avoid the semantic–pixel conflict in joint training. Under the LLaVA setting, SemHiTok achieves state-of-the-art performance in both understanding and reconstruction among discrete tokenizers.

Background & Motivation

Background: Unified MLLMs require tokenizers that simultaneously support understanding (high-level semantics) and generation (low-level pixels).

Limitations of Prior Work:

  • CLIP-based methods → strong semantics but poor pixel fidelity; VQGAN-based methods → good pixel reconstruction but weak semantics.
  • Joint training conflicts: VILA-U mixes semantic and pixel losses → sub-optimal trade-off; the SDE encoder decouples the encoders but still mixes codebooks.
  • Dual-encoder designs (Janus) → the token count doubles or the vocabulary explodes → inefficient.
  • TokenFlow uses a shared mapping, but joint training still degrades performance.

Key Insight: The observation that patches sharing the same semantic code exhibit similar pixel distributions motivates building sub-codebooks under each semantic code, achieving full decoupling in both structure and training.

Method

Overall Architecture

A semantic branch (VQKD aligned with SigLIP) produces fixed \(C_\text{sem}\); a pixel branch (ViT) learns \(C_\text{pix}\). Semantic and pixel tokens are concatenated along the channel dimension to form a unified discrete representation.
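A minimal PyTorch sketch of this two-level lookup follows; every dimension and name is an illustrative assumption (random tensors stand in for the trained VQKD and ViT branches), not the paper's implementation.

```python
import torch

K, m = 24_576, 8        # semantic codes, pixel sub-codes per semantic code
d_sem, d_pix = 32, 8    # per-code channel dims (assumed for illustration)

C_sem = torch.randn(K, d_sem)      # frozen semantic codebook (stage 1)
C_pix = torch.randn(K, m, d_pix)   # K pixel sub-codebooks of m entries (stage 2)

def quantize(z_sem: torch.Tensor, z_pix: torch.Tensor):
    """z_sem: (N, d_sem) semantic features; z_pix: (N, d_pix) pixel features."""
    # 1) nearest semantic code selects the sub-codebook for each patch
    i = torch.cdist(z_sem, C_sem).argmin(dim=-1)                # (N,)
    # 2) nearest entry within the selected sub-codebook
    sub = C_pix[i]                                              # (N, m, d_pix)
    j = (sub - z_pix.unsqueeze(1)).norm(dim=-1).argmin(dim=-1)  # (N,)
    e_pix = sub[torch.arange(len(j)), j]                        # (N, d_pix)
    # unified token: concatenate semantic and pixel code vectors channel-wise
    return torch.cat([C_sem[i], e_pix], dim=-1), i, j

tokens, i, j = quantize(torch.randn(196, d_sem), torch.randn(196, d_pix))
print(tokens.shape)  # torch.Size([196, 40]) → d_sem + d_pix channels per patch
```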

Key Designs

  1. Semantic Codebook: SigLIP → EMA vector quantization → cosine + L1 distillation → frozen after training.

  2. SGHC: \(C_\text{pix} = \{C_\text{pix}^1, \ldots, C_\text{pix}^K\}\) (\(K\) semantic codes × \(m\) sub-codes); for patch \(i\), the semantic branch first assigns index \(k\), then the \(k\)-th sub-codebook quantizes the pixel features.

  3. Stage-wise Training: Stage 1 trains the semantic branch (VQKD) and freezes it; Stage 2 trains the pixel branch (L1 + perceptual + GAN losses) — the two stages are conflict-free.

  4. Unified MLLM Integration: Tokens are flattened as \(h = i \times m + j\); a Dual-MLP adapter projects semantic and pixel tokens separately, then concatenates them before feeding into the LLM (see the sketch below).
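Item 4's index flattening and adapter can be sketched as follows; the hidden size 3584 matches Qwen2.5-7B-Instruct, but the MLP widths and the half/half channel split are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

K, m, d_sem, d_pix, d_llm = 24_576, 8, 32, 8, 3584  # d_llm: Qwen2.5-7B hidden size

def flatten_id(i: torch.Tensor, j: torch.Tensor) -> torch.Tensor:
    # h = i * m + j gives a unique id in [0, K*m), a 196,608-entry vocabulary
    return i * m + j

class DualMLP(nn.Module):
    """Project semantic and pixel code vectors separately, then concatenate."""
    def __init__(self):
        super().__init__()
        half = d_llm // 2  # assumed split; the paper's adapter widths may differ
        self.sem = nn.Sequential(nn.Linear(d_sem, half), nn.GELU(), nn.Linear(half, half))
        self.pix = nn.Sequential(nn.Linear(d_pix, half), nn.GELU(), nn.Linear(half, half))

    def forward(self, e_sem, e_pix):  # (N, d_sem), (N, d_pix)
        return torch.cat([self.sem(e_sem), self.pix(e_pix)], dim=-1)  # (N, d_llm)
```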

Loss & Training

  • SigLIP frozen; \(K = 24{,}576\) semantic codes × \(m = 8\) sub-codes → total codebook size 196,608; base model: Qwen2.5-7B-Instruct.
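A hedged sketch of the two stages' objectives described under Key Designs; the non-saturating GAN form, the perceptual callable lpips_fn, and the default \(\lambda\) weights are assumptions, since the paper's exact values are not reproduced here.

```python
import torch.nn.functional as F

def stage1_distill_loss(z_q, t_siglip):
    """Stage 1: cosine + L1 distillation of quantized features to SigLIP targets."""
    cos = 1.0 - F.cosine_similarity(z_q, t_siglip, dim=-1).mean()
    return cos + F.l1_loss(z_q, t_siglip)

def stage2_pixel_loss(x_rec, x, lpips_fn, d_fake, lam1=1.0, lam2=1.0, lam3=0.1):
    """Stage 2: L1 + perceptual + GAN losses on the pixel branch (weights assumed)."""
    rec = F.l1_loss(x_rec, x)              # pixel-space L1
    per = lpips_fn(x_rec, x).mean()        # perceptual loss (any LPIPS-style callable)
    gan = F.softplus(-d_fake).mean()       # non-saturating generator loss (assumed form)
    return lam1 * rec + lam2 * per + lam3 * gan
```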

Key Experimental Results

Reconstruction (Table 1, ImageNet-50k)

Method         Type         Codebook Size   rFID↓
LlamaGen       Only Recon   16,384          2.19
IBQ            Only Recon   262,144         1.00
VILA-U         Unified      16,384          1.80
TokenFlow      Unified      32,768          1.37
SemHiTok       Unified      196,608         1.16
SemHiTok-384   Unified      196,608         0.66

Understanding (Table 2, LLaVA-v1.5)

Model                 Resolution   POPE   MME-P   SEED   GQA
SigLIP (continuous)   256          83.8   1481    65.3   61.9
VILA-U                256          81.6   1312    56.9   55.3
SemHiTok              256          82.5   1356    62.9   60.3
SemHiTok-384          384          86.3   1466    64.1   62.3

Key Findings

  • Achieves state-of-the-art understanding among discrete tokenizers, approaching or partially surpassing continuous SigLIP.
  • rFID of 1.16 / 0.66 → state-of-the-art reconstruction among unified tokenizers.
  • POPE 82.5 vs. VILA-U 81.6 (+0.9); SEED 62.9 vs. 56.9 (+6.0).
  • Total codebook size \(K \times m = 196\text{K}\) is comparable to LLM text vocabulary size (Qwen2 ~150K) → no vocabulary explosion.

Highlights & Insights

  • SGHC Design: The observation that same-semantic patches share similar pixels motivates sub-codebook refinement — an elegant and principled design.
  • Stage-wise Training: Completely eliminates the semantic–pixel conflict, yielding a better trade-off — a key contribution.
  • No Token Explosion: The flattened codebook size (196K) is compatible with existing LLM vocabularies, enabling seamless integration.
  • Non-interfering Extension: Pixel branch training does not affect the frozen semantic codebook, preserving understanding capability.

Limitations & Future Work

  • Validation is limited to 256/384 resolutions; scalability to higher resolutions remains untested.
  • The sub-codebook size \(m=8\) is fixed; adaptive \(m\) selection is unexplored.
  • Only Qwen2.5-7B and Vicuna-7B are evaluated; larger LLMs await investigation.
  • Evaluation of generation quality (MJHQ/GenEval) is limited in scope.
  • The effect of semantic codebook size \(K\) on performance lacks thorough ablation.
  • Pixel sub-spaces under certain semantic codes may be data-insufficient, potentially causing sub-codebook underfitting.
  • Sensitivity analysis of loss weights (\(\lambda_1 / \lambda_2 / \lambda_3\)) in the training strategy is limited.

Supplementary Unified MLLM Results

  • Outperforms prior unified discrete MLLMs on both understanding and generation tasks.
  • Achieves state-of-the-art on most benchmarks in the Und. & Gen. Discrete category.
  • Performance is comparable to some continuous-tokenizer (Only Und.) baselines.
  • VILA-U joint loss → sub-optimal → SemHiTok resolves this via stage-wise training.
  • TokenFlow shared mapping → joint training conflict remains → SemHiTok achieves full decoupling.
  • VQKD semantic codebook → SemHiTok extends it with a pixel layer.
  • Insight: A hierarchical structure (semantics → pixels) may represent the optimal paradigm for unified visual tokenizers.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First proposal of SGHC + stage-wise training
  • Technical Depth: ⭐⭐⭐⭐ Simple, effective, and well-motivated
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers reconstruction, understanding, and generation
  • Practicality: ⭐⭐⭐⭐⭐ Directly integrable into existing MLLMs for unified understanding + generation
  • Overall: ⭐⭐⭐⭐⭐ An elegant solution for unified visual tokenization