Switchable Token-Specific Codebook Quantization for Face Image Compression

Conference: NeurIPS 2025 arXiv: 2510.22943 Code: Not available Area: Human Understanding Keywords: face image compression, vector quantization, codebook learning, low bitrate, face recognition

TL;DR

This paper proposes a Switchable Token-Specific Codebook Quantization (STSCQ) mechanism that employs a hierarchical dynamic structure combining image-level codebook routing and token-level codebook partitioning, achieving significant improvements in reconstruction quality and recognition accuracy for face image compression at ultra-low bitrates.

Background & Motivation

With the explosive growth of image data generated by smart devices, lossy compression inevitably degrades visual quality and machine perception performance (e.g., face recognition) while reducing storage costs. Existing codebook-based compression methods (VQ-VAE, TiTok, etc.) face a fundamental bottleneck:

Limitations of globally shared codebooks: All tokens share a single codebook, which must be large enough to cover the diverse features of all images. Reducing the bitrate requires shrinking the codebook size or the number of tokens, both of which lead to severe quality degradation.

Intra-class correlations are neglected: Face images exhibit clear clustering patterns along attributes such as gender, age, and ethnicity; images with similar attributes share similar feature distributions, yet global codebooks fail to exploit this prior.

Token-level semantic differences are neglected: Different tokens implicitly or explicitly encode distinct semantic information (e.g., eye regions vs. nasal regions). Forcing all tokens to share a single codebook increases learning difficulty and leads to uneven codebook utilization.

Core Problem: Can the codebook structure be reorganized to decompose the global quantization problem into smaller, more tractable sub-problems?

Method

Overall Architecture

Building upon a standard latent-space model (encoder → quantizer → decoder), the proposed method replaces the static global codebook with a hierarchical switchable token-specific codebook. A routing module first selects an appropriate codebook group for each image; within the selected group, an independent sub-codebook is assigned to each token for quantization.

Key Designs

  1. Switchable Codebook Quantization (SCQ)

The original codebook \(\mathcal{C}_{orig} \in \mathbb{R}^{N \times d}\) is replaced by \(M\) smaller learnable codebooks \(\{\mathcal{C}^i \in \mathbb{R}^{K \times d}\}_{i=1}^M\), where \(K = N/2^s\) for a shrinkage exponent \(s\). The per-image storage cost is reduced from \(T \times \lceil\log_2 N\rceil\) bits to \(T \times \lceil\log_2 K\rceil + \lceil\log_2 M\rceil\) bits, since only one codebook index per image must be signaled in addition to the per-token entries.

For example, 256 tokens with a 4096-entry codebook require 3072 bits; replacing it with 256 codebooks of 256 entries each requires only 2056 bits — a 33% reduction. Since the multiplicative bitwidth savings far outweigh the additive routing overhead, a larger total codebook capacity is achieved at lower bpp.
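The arithmetic in this example can be verified directly. The helper names below are ours, not from the paper; they simply evaluate the two bit-cost formulas above:

```python
import math

def global_bits(T: int, N: int) -> int:
    """Bits to index T tokens against a single N-entry codebook."""
    return T * math.ceil(math.log2(N))

def switchable_bits(T: int, K: int, M: int) -> int:
    """Bits with M switchable K-entry codebooks:
    per-token indices plus one routing index per image."""
    return T * math.ceil(math.log2(K)) + math.ceil(math.log2(M))

baseline = global_bits(256, 4096)          # 256 * 12 = 3072 bits
switched = switchable_bits(256, 256, 256)  # 256 * 8 + 8 = 2056 bits
print(baseline, switched)                  # the ~33% reduction quoted above
```

Note that total codebook capacity grows to \(M \times K = 65{,}536\) entries even as the transmitted bits shrink, which is the crux of the method's capacity-vs-rate argument.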

  2. Codebook Routing Mechanism

A differentiable routing network \(G_\theta\) is designed for codebook selection. During training, the per-codebook scores \(g_\theta^i\) are produced by the network (so gradients can flow), and the selected codebook is their argmax:

\(G_\theta(\mathbf{z}_e) = \arg\max_{i \in \{1,...,M\}} g_\theta^i(\mathbf{z}_e)\)

To ensure all codebooks are sufficiently utilized and to prevent collapse, three auxiliary losses are introduced:

  • Entropy maximization loss \(\mathcal{L}_{ent}\): maximizes the entropy of the codebook selection distribution within a batch, preventing bias toward a small subset of codebooks.
  • Decision clarity loss \(\mathcal{L}_{dec}\): reduces ambiguity in routing predictions by concentrating probability mass on the optimal codebook.
  • Quantization guidance loss \(\mathcal{L}_{qua}\): guides the router to select the codebook that yields lower quantization error.

Total routing loss: \(\mathcal{L}_{router} = \mathcal{L}_{qua} + \lambda_1\mathcal{L}_{ent} + \lambda_2\mathcal{L}_{dec}\)

The learnable router \(G_\theta\) is used during training, while a naive nearest-neighbor search \(G_{naive}\) is employed at inference to ensure quantization fidelity.
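One plausible NumPy sketch of the three auxiliary losses follows; the exact formulations in the paper may differ, and the weighting defaults here are placeholders of our own choosing:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def routing_losses(logits, quant_err, lam1=0.1, lam2=0.1):
    """logits: (B, M) router scores; quant_err: (B, M) per-codebook quantization error.
    Returns L_qua + lam1 * L_ent + lam2 * L_dec as described above."""
    p = softmax(logits)                            # (B, M) selection probabilities
    # L_ent: maximize entropy of the batch-average selection distribution
    # (minimizing negative entropy pushes usage toward uniform across codebooks).
    p_bar = p.mean(axis=0)
    l_ent = np.sum(p_bar * np.log(p_bar + 1e-9))
    # L_dec: minimize per-sample entropy so each routing decision is confident.
    l_dec = -np.mean(np.sum(p * np.log(p + 1e-9), axis=1))
    # L_qua: cross-entropy toward the codebook with the lowest quantization error.
    target = quant_err.argmin(axis=1)
    l_qua = -np.mean(np.log(p[np.arange(len(p)), target] + 1e-9))
    return l_qua + lam1 * l_ent + lam2 * l_dec
```

The batch-level entropy term and the per-sample clarity term pull in opposite directions by design: usage should be balanced across the batch while each individual decision stays sharp.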

  3. Token-Specific Codebook Quantization (TSC)

Each codebook group is further decomposed into token-level sub-codebooks:

\(\mathcal{C}_{tsc} = [\mathcal{C}_1 \oplus \mathcal{C}_2 \oplus \cdots \oplus \mathcal{C}_T] \in \mathbb{R}^{T \times K \times d}\)

Each sub-codebook \(\mathcal{C}_t\) independently learns the feature distribution of the \(t\)-th token. Although the total codebook size increases (\(T \times K\) vs. \(K\)), the per-token bitwidth remains unchanged (\(b = \lceil\log_2 K\rceil\)). Token-specific sub-codebooks achieve higher sampling density within each token's feature subspace, directly improving reconstruction fidelity.
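A minimal NumPy sketch of the per-token nearest-neighbor lookup, with shapes following the definitions above (the function name is ours):

```python
import numpy as np

def tsc_quantize(z, codebooks):
    """z: (T, d) token latents; codebooks: (T, K, d), one sub-codebook per token.
    Returns per-token indices (b = ceil(log2 K) bits each) and quantized latents."""
    # Squared distance from each token to the K entries of *its own* sub-codebook.
    d2 = ((z[:, None, :] - codebooks) ** 2).sum(axis=-1)  # (T, K)
    idx = d2.argmin(axis=1)                               # (T,)
    zq = codebooks[np.arange(len(z)), idx]                # (T, d)
    return idx, zq
```

Because token \(t\) only ever competes against \(\mathcal{C}_t\), the index cost stays at \(\lceil\log_2 K\rceil\) bits per token even though total capacity is \(T \times K\).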

Loss & Training

A three-stage progressive training paradigm is adopted:

  • Stage 1 (100K steps): The encoder and decoder are frozen; only the switchable token-shared codebook and routing network are trained. \(\mathcal{L}_{Stage1} = \|\mathbf{z}_e - \text{Quant}_{\mathcal{C}^i}(\mathbf{z}_e)\|_2^2 + \mathcal{L}_{router}\)
  • Stage 2 (400K steps): The token-specific codebook is initialized from the Stage 1 codebook. The encoder and decoder remain frozen; only the token-specific codebook and routing network are trained.
  • Stage 3 (100K steps): The codebook is frozen; only the decoder is fine-tuned to adapt to the updated codebook representations. An ArcFace identity loss is incorporated to preserve facial semantic consistency: \(\mathcal{L}_{Stage3} = \|x - \hat{x}\|_2^2 + \lambda_p\mathcal{L}_{per} + \lambda_f\mathcal{L}_{face}\)
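The freeze/train schedule above can be summarized as a simple step-to-modules mapping; the module names here are our own shorthand, not identifiers from the paper:

```python
def trainable_modules(step: int) -> set:
    """Which components receive gradients at a given step,
    following the 100K / 400K / 100K schedule described above."""
    if step < 100_000:    # Stage 1: switchable token-shared codebook + router
        return {"codebook_shared", "router"}
    if step < 500_000:    # Stage 2: token-specific codebooks + router
        return {"codebook_token_specific", "router"}
    return {"decoder"}    # Stage 3: decoder fine-tuning with identity loss
```

In a framework like PyTorch this mapping would drive `requires_grad_` toggling on the corresponding parameter groups at stage boundaries.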

Key Experimental Results

Main Results

| Method | Model Type | #Tokens | MeanAcc (%) | IDS | bpp |
|---|---|---|---|---|---|
| JPEG2000 | / | / | 56.98 | 0.031 | 0.010 |
| JPEG2000 | / | / | 85.64 | 0.355 | 0.050 |
| CodeFormer | 2D | 256 | 89.99 | 0.621 | 0.039 |
| MaskGit-VQGAN | 2D | 256 | 90.70 | 0.631 | 0.047 |
| TiTok-S | 1D | 128 | 87.56 | 0.576 | 0.023 |
| TiTok-L | 1D | 32 | 65.07 | 0.181 | 0.006 |
| Ours (MaskGit) | 2D | 256 | 93.51 (+2.81) | 0.666 | 0.047 |
| Ours (TiTok-S) | 1D | 128 | 91.66 (+4.10) | 0.612 | 0.023 |
| Ours (TiTok-L) | 1D | 32 | 73.13 (+8.06) | 0.258 | 0.006 |

Ablation Study

| Configuration | Token-Shared | Token-Specific | Search Strategy | MeanAcc (%) | IDS | bpp |
|---|---|---|---|---|---|---|
| Original single codebook | - | - | - | 88.11 | 0.536 | 0.020 |
| + Switchable (routing) | ✓ | - | CR | 88.24 | 0.541 | 0.020 |
| + Token-specific (NN) | - | ✓ | NN | 89.28 | 0.570 | 0.020 |
| + Token-specific (routing) | - | ✓ | CR | 89.89 | 0.574 | 0.020 |

Codebook utilization: the globally shared codebook achieves an average utilization of 54.17% (std 14.71), while the proposed method reaches 74.02% (std 9.14), an improvement of roughly 20 percentage points.

Key Findings

  1. At the same bpp, accuracy on TiTok-S improves from 87.56% to 91.66% (+4.10 pp).
  2. At the same accuracy level, bpp on TiTok-S decreases from 0.0234 to 0.0157 (−32.9%).
  3. Under the routing inference mode, both inference latency and storage overhead can be substantially reduced by loading only the selected codebook group.
  4. Token-specific codebooks increase average codebook utilization by roughly 20 percentage points, effectively alleviating the uneven utilization problem.

Highlights & Insights

  1. Problem decomposition: The global codebook is decomposed into a hierarchical image-level × token-level structure, reducing quantization difficulty by dividing it into smaller sub-problems.
  2. Plug-and-play: The method integrates seamlessly into any codebook-based compression framework (VQGAN, TiTok, etc.).
  3. Routing mechanism inspired by MoE: The routing network design draws on Mixture-of-Experts; three auxiliary losses effectively prevent codebook collapse.
  4. Additive vs. multiplicative bit allocation: The routing overhead of multiple codebooks is a one-off additive cost per image (\(\lceil\log_2 M\rceil\) bits), whereas the per-token savings scale with the token count (\(T \times s\) bits in total), so the total bitstream shrinks whenever \(T \times s > \lceil\log_2 M\rceil\).

Limitations & Future Work

  • Performance is highly dependent on the quality of the base autoencoder; no specific improvements are made to the encoder or decoder.
  • Validation is limited to face images; generalizability to general image compression remains uncertain.
  • The storage overhead introduced by multiple codebooks requires routing optimization at inference to be mitigated.
  • Stage 2 requires 400K training steps, constituting the majority of the total training time.

Related Work

  • TiTok: Compresses images into 1D token sequences; the proposed method improves upon its codebook structure.
  • VQGAN: A classical codebook-based image compression and generation framework.
  • MoE routing mechanism: The Mixture-of-Experts routing paradigm inspires the design of the codebook routing network.

Rating

  • Novelty: ⭐⭐⭐⭐☆ — The hierarchical codebook decomposition idea is clear but not groundbreaking.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ — Comprehensive multi-baseline and multi-configuration comparisons, though limited to the face domain.
  • Writing Quality: ⭐⭐⭐⭐☆ — Method description is thorough with consistent notation.
  • Value: ⭐⭐⭐⭐☆ — Plug-and-play nature confers strong practical utility.