RDVQ: Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
Conference: CVPR 2026
arXiv: 2604.10546
Code: https://github.com/CVL-UESTC/RDVQ
Area: Image Compression / Restoration
Keywords: Vector Quantization, Rate-Distortion Optimization, Generative Image Compression, Entropy Model, Differentiable Relaxation
TL;DR
RDVQ introduces a differentiable relaxation over the codebook distribution, enabling end-to-end joint rate-distortion optimization for VQ-based image compression for the first time. At extremely low bitrates, the method achieves superior or competitive perceptual quality with less than 20% of the parameters of prior approaches.
Background & Motivation
Background: Learned image compression predominantly relies on scalar quantization (SQ), where differentiable approximations (e.g., additive noise or the straight-through estimator, STE) allow gradients to propagate back to the encoder, enabling end-to-end rate-distortion optimization. Vector quantization (VQ) preserves richer structural information and perceptual quality, making it particularly suitable for extremely low bitrates.
Limitations of Prior Work: The discrete nearest-neighbor assignment in VQ blocks gradient flow from the rate loss to the encoder. The encoder-induced implicit prior distribution cannot be directly optimized by the rate objective, leading to a fundamental decoupling between representation learning and the entropy model.
Key Challenge: While VQ offers advantages in reconstruction quality, it cannot support end-to-end joint rate-distortion optimization the way SQ can; bitrate control must instead rely on heuristics such as codebook-size adjustment and selective transmission.
Goal: Restore a differentiable gradient path from the rate objective to the encoder in VQ-based compression, achieving true end-to-end rate-distortion optimization.
Key Insight: Replace hard nearest-neighbor assignment with a distance-aware soft distribution, used exclusively in the rate estimation branch, while reconstruction continues to use standard hard quantization.
Core Idea: During training, a softmax-relaxed codebook distribution is used to estimate the rate, enabling rate gradients to flow to the encoder; at inference, the system reverts to standard hard VQ, maintaining full compatibility.
Method
Overall Architecture
The analysis transform \(g_a\) extracts multi-scale features → flattened into a sequence → the VQ module produces hard-quantized embeddings (for reconstruction), discrete indices (for coding), and a relaxed distribution (used only for rate estimation during training) → the synthesis transform \(g_s\) reconstructs the image. The entropy model is a Masked Transformer that autoregressively predicts conditional probabilities over codebook indices; during training, the rate loss is a cross-entropy between these predictions and the relaxed distribution.
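A minimal PyTorch-style sketch of this dataflow, under assumed shapes and interfaces: `g_a`, `g_s`, and `entropy_model` are placeholders, the flattening into a sequence is folded into `g_a`, and the straight-through trick on the reconstruction path is standard VQ practice rather than something confirmed by the summary above.

```python
import torch
import torch.nn.functional as F

def rdvq_forward(x, g_a, g_s, codebook, entropy_model, tau=1.0):
    """One hypothetical training step: analysis -> VQ -> synthesis + rate.

    x:        (B, 3, H, W) input image
    codebook: (K, C) learned codewords
    g_a is assumed to return flattened multi-scale features of shape (B, L, C).
    """
    z = g_a(x)                                   # multi-scale features as a sequence
    d = torch.cdist(z, codebook.unsqueeze(0))    # (B, L, K) distance to every codeword
    idx = d.argmin(dim=-1)                       # discrete indices (what gets entropy-coded)
    z_hard = codebook[idx]                       # hard-quantized embeddings (reconstruction path)
    z_hard = z + (z_hard - z).detach()           # straight-through so distortion reaches g_a (assumed)
    p_soft = F.softmax(-d / tau, dim=-1)         # relaxed codebook distribution (rate path only)
    q = entropy_model(idx)                       # (B, L, K) autoregressive conditionals (assumed interface)
    rate = -(p_soft * torch.log2(q + 1e-9)).sum(-1).mean()  # cross-entropy rate, bits per token
    x_hat = g_s(z_hard)                          # reconstruction from hard quantization
    return x_hat, idx, rate
```

At inference, the `p_soft` branch is dropped entirely: the indices `idx` are entropy-coded under the model's conditionals, exactly as in standard hard VQ.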
Key Designs
- Differentiable Soft Relaxation:
  - Function: Restores the gradient path from the rate objective to the encoder.
  - Mechanism: The distance \(d_{b,l,k}\) between the encoder output and each codeword is computed, and a temperature-scaled softmax yields the relaxed distribution \(p_{\text{soft}}(b,l,k) = \text{softmax}_k(-d_{b,l,k}/\tau)\). During training, the rate objective is computed as a cross-entropy over this continuous distribution, while reconstruction still uses hard quantization (this is the \(p_{\text{soft}}\) rate branch in the pipeline sketch above).
  - Design Motivation: Introducing relaxation only in the rate estimation branch leaves the reconstruction and inference pipelines unchanged, ensuring training–inference consistency.
- Dependency-Aware Autoregressive Entropy Model:
  - Function: Accurately models the conditional probability distribution over codebook indices.
  - Mechanism: Multi-scale features are organized spatially and hierarchically into a unified sequence; a dependency-aware ordering vector \(o\) is constructed, and masked attention with \(M = (o > o^\top)\) enables autoregressive factorization under parallel training (see the first sketch after this list). Coarse scales are encoded first, and fine scales are conditioned on coarse scales.
  - Design Motivation: The multi-scale structure naturally exhibits hierarchical dependencies; dependency-aware ordering captures these relationships more effectively than simple raster-scan ordering.
- Test-Time Bitrate Adjustment:
  - Function: Enables bitrate control within a limited range without retraining.
  - Mechanism: Only a prefix of the index sequence is transmitted, and the autoregressive entropy model completes the remaining indices (see the second sketch after this list). Joint rate-distortion optimization renders the latent space highly predictable, so quality degradation from prefix completion is graceful.
  - Design Motivation: Practical deployment requires flexible bitrate control; prefix transmission combined with autoregressive completion provides an elegant solution.
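A minimal sketch of the dependency-aware masking referenced above, assuming per-token scale indices serve as the ordering vector \(o\) (the paper's exact construction of \(o\) may differ):

```python
import torch
import torch.nn.functional as F

def dependency_mask(o: torch.Tensor) -> torch.Tensor:
    """Dependency-aware attention mask M = (o > o^T): token i may attend to
    token j only if o[i] > o[j], i.e. j belongs to a strictly earlier
    (coarser) scale in the encoding order."""
    return o.unsqueeze(-1) > o.unsqueeze(-2)   # (L, L) boolean, True = may attend

# Toy ordering: 1 token at scale 0, 4 at scale 1, 9 at scale 2 (coarse-to-fine).
o = torch.tensor([0] + [1] * 4 + [2] * 9)
M = dependency_mask(o)
# Allow self-attention so the very first token has a non-empty attention row
# (an assumption to keep the example well-defined; a start token would also work).
M = M | torch.eye(o.numel(), dtype=torch.bool)

# Parallel training: one masked attention call realizes the autoregressive
# factorization over the whole sequence at once.
L, D = o.numel(), 32
q = torch.randn(1, 1, L, D)
k = torch.randn(1, 1, L, D)
v = torch.randn(1, 1, L, D)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=M)  # (1, 1, L, D)
```

Note that \(M\) is strict: tokens within the same scale cannot see each other, which is what allows all tokens of a scale to be predicted in parallel.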
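A hedged sketch of the prefix-completion mechanism; the `entropy_model` interface (returning per-position conditional probabilities for a partial sequence) is an assumption for illustration:

```python
import torch

@torch.no_grad()
def complete_from_prefix(entropy_model, prefix: torch.Tensor, total_len: int):
    """Test-time bitrate adjustment: only `prefix` indices are decoded from
    the bitstream; the autoregressive entropy model fills in the rest.

    prefix: (B, L_tx) transmitted indices (coarse scales come first, so the
            prefix always covers the most informative tokens).
    """
    seq = prefix
    while seq.shape[1] < total_len:
        probs = entropy_model(seq)           # assumed to return (B, cur_len, K)
        nxt = probs[:, -1].argmax(dim=-1)    # most likely next index (greedy completion)
        seq = torch.cat([seq, nxt.unsqueeze(1)], dim=1)
    return seq                               # full (B, total_len) sequence for g_s
```

Shorter prefixes mean lower bitrate; because joint rate-distortion training makes the latents predictable, greedy completion degrades quality gracefully.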
Loss & Training
Three-stage training: (1) pretrain the autoencoder and codebook (reconstruction loss); (2) pretrain the entropy model (rate objective); (3) jointly fine-tune the full model (rate + distortion), followed by adaptation on high-resolution data. The loss combines a GAN adversarial loss, an LPIPS perceptual loss, and the relaxed cross-entropy rate loss (a hedged sketch of the joint objective follows).
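A sketch of what the stage-3 objective might look like; the weights `lam_adv` and `lam_rate` and the non-saturating GAN formulation are placeholders, not the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def stage3_loss(x, x_hat, rate_bits, lpips_fn, disc,
                lam_adv=0.1, lam_rate=1.0):
    """Joint rate-distortion objective: perceptual distortion + GAN + rate.

    rate_bits: differentiable cross-entropy rate estimate from the relaxed branch.
    lpips_fn:  a perceptual distance, e.g. from the `lpips` package.
    disc:      discriminator; a non-saturating generator loss is assumed here.
    """
    d_perc = lpips_fn(x_hat, x).mean()        # LPIPS perceptual loss
    g_adv = F.softplus(-disc(x_hat)).mean()   # generator-side GAN loss (assumed form)
    return d_perc + lam_adv * g_adv + lam_rate * rate_bits
```

Because `rate_bits` is computed from the soft distribution, this single scalar backpropagates through both the entropy model and the encoder, which is exactly the coupling that hard VQ forbids.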
Key Experimental Results
Main Results
| Dataset | Metric | RDVQ | RDEIC | BD-Rate |
|---|---|---|---|---|
| DIV2K-val | DISTS | Best | 2nd | -75.71% |
| DIV2K-val | LPIPS | Best | 2nd | -37.63% |
| Kodak | DISTS | SOTA | - | - |
| CLIC2020 | CLIPIQA | SOTA | - | - |

Negative BD-rate values indicate bitrate savings at matched quality.
Ablation Study
| Configuration | bpp | DISTS ↓ | LPIPS ↓ | FID ↓ |
|---|---|---|---|---|
| RDVQ (full) | 0.0247 | 0.1005 | 0.2321 | 19.96 |
| w/o Relaxation | 0.0464 | 0.2147 | 0.5031 | 86.93 |
| K-means VQ | 0.0247 | 0.1253 | 0.2831 | 28.08 |
Key Findings
- Removing the differentiable relaxation causes a sharp performance drop; even at higher bitrates, the degraded variant falls far short of the full model, confirming that relaxation is the cornerstone of end-to-end rate-distortion optimization.
- K-means-based bitrate control yields noticeably inferior quality at the same bitrate compared to RDVQ, demonstrating that heuristic methods cannot eliminate redundancy in the index distribution.
- As the bitrate decreases, encoder features become progressively smoother and codebook utilization more concentrated, indicating that the model autonomously learns an adaptive compression strategy.
Highlights & Insights
- Elegant Separation of Relaxation: The relaxation is applied exclusively in the rate estimation branch during training; the reconstruction path always uses hard quantization, requiring no modification at inference. This dual-path design simultaneously resolves the gradient issue and maintains deployment compatibility.
- Unified Perspective on Image Tokenization and Compression: Existing VQ tokenizers can be converted into compression models by introducing entropy constraints, and conversely, compression objectives can improve tokenizer efficiency.
Limitations & Future Work
- Test-time bitrate adjustment is limited in range (0.02–0.32 bpp); quality degrades noticeably outside this range.
- At 251.9M parameters, the model is significantly smaller than baselines but cannot be considered lightweight.
- Future work may explore applying this framework to entropy-aware training of visual tokenizers.
Related Work & Insights
- vs. OSCAR/RDEIC: These diffusion- or large-model-prior-based methods require substantially more parameters; RDVQ is trained from scratch with less than 20% of their parameter count.
- vs. DLF: This dual-branch SQ+VQ hybrid approach fundamentally cannot perform rate-distortion optimization on the VQ branch.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to achieve end-to-end rate-distortion optimization for VQ, with a clear theoretical contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, multiple metrics, with comprehensive ablations and analyses.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is precise and mathematical derivations are clearly presented.
- Value: ⭐⭐⭐⭐⭐ Significant implications for both VQ-based compression and image tokenization.