RDVQ: Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
Conference: CVPR 2026
arXiv: 2604.10546
Code: https://github.com/CVL-UESTC/RDVQ
Area: Image Compression / Restoration
Keywords: Vector Quantization, Rate-Distortion Optimization, Generative Image Compression, Entropy Model, Differentiable Relaxation
TL;DR
RDVQ introduces a differentiable relaxation over the codebook distribution, enabling end-to-end joint rate-distortion optimization for VQ-based image compression for the first time. At extremely low bitrates, the method achieves superior or competitive perceptual quality with less than 20% of the parameters of prior approaches.
Background & Motivation
Background: Learned image compression predominantly relies on scalar quantization (SQ), where differentiable approximations (e.g., additive noise or the straight-through estimator, STE) allow gradients to propagate back to the encoder, enabling end-to-end rate-distortion optimization. Vector quantization (VQ) preserves richer structural information and perceptual quality, making it particularly suitable for extremely low bitrates.
Limitations of Prior Work: The discrete nearest-neighbor assignment in VQ blocks gradient flow from the rate loss to the encoder. The encoder-induced implicit prior distribution cannot be directly optimized by the rate objective, leading to a fundamental decoupling between representation learning and the entropy model.
Key Challenge: While VQ offers advantages in reconstruction quality, it cannot support end-to-end joint rate-distortion optimization the way SQ can; bitrate control must instead rely on heuristics such as codebook-size adjustment and selective transmission.
Goal: Restore a differentiable gradient path from the rate objective to the encoder in VQ-based compression, achieving true end-to-end rate-distortion optimization.
Key Insight: Replace hard nearest-neighbor assignment with a distance-aware soft distribution, used exclusively in the rate estimation branch, while reconstruction continues to use standard hard quantization.
Core Idea: During training, a softmax-relaxed codebook distribution is used to estimate the rate, enabling rate gradients to flow to the encoder; at inference, the system reverts to standard hard VQ, maintaining full compatibility.
Method
Overall Architecture
The analysis transform \(g_a\) extracts multi-scale features → flattened into a sequence → the VQ module produces hard-quantized embeddings (for reconstruction), discrete indices (for coding), and a relaxed distribution (used only for rate estimation during training) → the synthesis transform \(g_s\) reconstructs the image. The entropy model is a Masked Transformer that autoregressively predicts conditional probabilities over codebook indices; during training, the rate loss is a cross-entropy between these predictions and the relaxed distribution.
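A minimal PyTorch-style sketch of this dataflow, under assumed shapes and interfaces: `g_a`, `g_s`, and `entropy_model` are placeholders, the flattening into a sequence is folded into `g_a`, and the straight-through trick on the reconstruction path is standard VQ practice rather than something confirmed by the summary above.

```python
import torch
import torch.nn.functional as F

def rdvq_forward(x, g_a, g_s, codebook, entropy_model, tau=1.0):
    """One hypothetical training step: analysis -> VQ -> synthesis + rate.

    x:        (B, 3, H, W) input image
    codebook: (K, C) learned codewords
    g_a is assumed to return flattened multi-scale features of shape (B, L, C).
    """
    z = g_a(x)                                   # multi-scale features as a sequence
    d = torch.cdist(z, codebook.unsqueeze(0))    # (B, L, K) distance to every codeword
    idx = d.argmin(dim=-1)                       # discrete indices (what gets entropy-coded)
    z_hard = codebook[idx]                       # hard-quantized embeddings (reconstruction path)
    z_hard = z + (z_hard - z).detach()           # straight-through so distortion reaches g_a (assumed)
    p_soft = F.softmax(-d / tau, dim=-1)         # relaxed codebook distribution (rate path only)
    q = entropy_model(idx)                       # (B, L, K) autoregressive conditionals (assumed interface)
    rate = -(p_soft * torch.log2(q + 1e-9)).sum(-1).mean()  # cross-entropy rate, bits per token
    x_hat = g_s(z_hard)                          # reconstruction from hard quantization
    return x_hat, idx, rate
```

At inference, the `p_soft` branch is dropped entirely: the indices `idx` are entropy-coded under the model's conditionals, exactly as in standard hard VQ.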
Key Designs
- Differentiable Soft Relaxation:
  - Function: Restores the gradient path from the rate objective to the encoder.
  - Mechanism: The distance \(d_{b,l,k}\) between the encoder output and each codeword is computed, and a temperature-scaled softmax yields the relaxed distribution \(p_{\text{soft}}(b,l,k) = \text{softmax}_k(-d_{b,l,k}/\tau)\). During training, the rate objective is computed as a cross-entropy over this continuous distribution, while reconstruction still uses hard quantization (this is the \(p_{\text{soft}}\) rate branch in the pipeline sketch above).
  - Design Motivation: Introducing relaxation only in the rate estimation branch leaves the reconstruction and inference pipelines unchanged, ensuring training–inference consistency.
- Dependency-Aware Autoregressive Entropy Model:
  - Function: Accurately models the conditional probability distribution over codebook indices.
  - Mechanism: Multi-scale features are organized spatially and hierarchically into a unified sequence; a dependency-aware ordering vector \(o\) is constructed, and masked attention with \(M = (o > o^\top)\) enables autoregressive factorization under parallel training (see the first sketch after this list). Coarse scales are encoded first, and fine scales are conditioned on coarse scales.
  - Design Motivation: The multi-scale structure naturally exhibits hierarchical dependencies; dependency-aware ordering captures these relationships more effectively than simple raster-scan ordering.
- Test-Time Bitrate Adjustment:
  - Function: Enables bitrate control within a limited range without retraining.
  - Mechanism: Only a prefix of the index sequence is transmitted, and the autoregressive entropy model completes the remaining indices (see the second sketch after this list). Joint rate-distortion optimization renders the latent space highly predictable, so quality degradation from prefix completion is graceful.
  - Design Motivation: Practical deployment requires flexible bitrate control; prefix transmission combined with autoregressive completion provides an elegant solution.
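A minimal sketch of the dependency-aware masking referenced above, assuming per-token scale indices serve as the ordering vector \(o\) (the paper's exact construction of \(o\) may differ):

```python
import torch
import torch.nn.functional as F

def dependency_mask(o: torch.Tensor) -> torch.Tensor:
    """Dependency-aware attention mask M = (o > o^T): token i may attend to
    token j only if o[i] > o[j], i.e. j belongs to a strictly earlier
    (coarser) scale in the encoding order."""
    return o.unsqueeze(-1) > o.unsqueeze(-2)   # (L, L) boolean, True = may attend

# Toy ordering: 1 token at scale 0, 4 at scale 1, 9 at scale 2 (coarse-to-fine).
o = torch.tensor([0] + [1] * 4 + [2] * 9)
M = dependency_mask(o)
# Allow self-attention so the very first token has a non-empty attention row
# (an assumption to keep the example well-defined; a start token would also work).
M = M | torch.eye(o.numel(), dtype=torch.bool)

# Parallel training: one masked attention call realizes the autoregressive
# factorization over the whole sequence at once.
L, D = o.numel(), 32
q = torch.randn(1, 1, L, D)
k = torch.randn(1, 1, L, D)
v = torch.randn(1, 1, L, D)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=M)  # (1, 1, L, D)
```

Note that \(M\) is strict: tokens within the same scale cannot see each other, which is what allows all tokens of a scale to be predicted in parallel.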
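A hedged sketch of the prefix-completion mechanism; the `entropy_model` interface (returning per-position conditional probabilities for a partial sequence) is an assumption for illustration:

```python
import torch

@torch.no_grad()
def complete_from_prefix(entropy_model, prefix: torch.Tensor, total_len: int):
    """Test-time bitrate adjustment: only `prefix` indices are decoded from
    the bitstream; the autoregressive entropy model fills in the rest.

    prefix: (B, L_tx) transmitted indices (coarse scales come first, so the
            prefix always covers the most informative tokens).
    """
    seq = prefix
    while seq.shape[1] < total_len:
        probs = entropy_model(seq)           # assumed to return (B, cur_len, K)
        nxt = probs[:, -1].argmax(dim=-1)    # most likely next index (greedy completion)
        seq = torch.cat([seq, nxt.unsqueeze(1)], dim=1)
    return seq                               # full (B, total_len) sequence for g_s
```

Shorter prefixes mean lower bitrate; because joint rate-distortion training makes the latents predictable, greedy completion degrades quality gracefully.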
Loss & Training
Three-stage training: (1) pretrain the autoencoder and codebook (reconstruction loss); (2) pretrain the entropy model (rate objective); (3) jointly fine-tune the full model (rate + distortion), followed by adaptation on high-resolution data. The loss combines a GAN adversarial loss, an LPIPS perceptual loss, and the relaxed cross-entropy rate loss (a hedged sketch of the joint objective follows).
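A sketch of what the stage-3 objective might look like; the weights `lam_adv` and `lam_rate` and the non-saturating GAN formulation are placeholders, not the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def stage3_loss(x, x_hat, rate_bits, lpips_fn, disc,
                lam_adv=0.1, lam_rate=1.0):
    """Joint rate-distortion objective: perceptual distortion + GAN + rate.

    rate_bits: differentiable cross-entropy rate estimate from the relaxed branch.
    lpips_fn:  a perceptual distance, e.g. from the `lpips` package.
    disc:      discriminator; a non-saturating generator loss is assumed here.
    """
    d_perc = lpips_fn(x_hat, x).mean()        # LPIPS perceptual loss
    g_adv = F.softplus(-disc(x_hat)).mean()   # generator-side GAN loss (assumed form)
    return d_perc + lam_adv * g_adv + lam_rate * rate_bits
```

Because `rate_bits` is computed from the soft distribution, this single scalar backpropagates through both the entropy model and the encoder, which is exactly the coupling that hard VQ forbids.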
Key Experimental Results
Main Results
| Dataset | Metric | RDVQ | RDEIC | BD-Rate |
|---|---|---|---|---|
| DIV2K-val | DISTS | Best | 2nd | -75.71% |
| DIV2K-val | LPIPS | Best | 2nd | -37.63% |
| Kodak | DISTS | SOTA | - | - |
| CLIC2020 | CLIPIQA | SOTA | - | - |

Negative BD-rate values indicate bitrate savings at matched quality.
Ablation Study
| Configuration | bpp | DISTS ↓ | LPIPS ↓ | FID ↓ |
|---|---|---|---|---|
| RDVQ (full) | 0.0247 | 0.1005 | 0.2321 | 19.96 |
| w/o Relaxation | 0.0464 | 0.2147 | 0.5031 | 86.93 |
| K-means VQ | 0.0247 | 0.1253 | 0.2831 | 28.08 |
Key Findings
- Removing the differentiable relaxation causes a sharp performance drop; even at higher bitrates, the degraded variant falls far short of the full model, confirming that relaxation is the cornerstone of end-to-end rate-distortion optimization.
- K-means-based bitrate control yields noticeably inferior quality at the same bitrate compared to RDVQ, demonstrating that heuristic methods cannot eliminate redundancy in the index distribution.
- As the bitrate decreases, encoder features become progressively smoother and codebook utilization more concentrated, indicating that the model autonomously learns an adaptive compression strategy.
Highlights & Insights
- Elegant Separation of Relaxation: The relaxation is applied exclusively in the rate estimation branch during training; the reconstruction path always uses hard quantization, requiring no modification at inference. This dual-path design simultaneously resolves the gradient issue and maintains deployment compatibility.
- Unified Perspective on Image Tokenization and Compression: Existing VQ tokenizers can be converted into compression models by introducing entropy constraints, and conversely, compression objectives can improve tokenizer efficiency.
Limitations & Future Work
- Test-time bitrate adjustment is limited in range (0.02–0.32 bpp); quality degrades noticeably outside this range.
- At 251.9M parameters, the model is significantly smaller than baselines but cannot be considered lightweight.
- Future work may explore applying this framework to entropy-aware training of visual tokenizers.
Related Work & Insights
- vs. OSCAR/RDEIC: These diffusion- or large-model-prior-based methods require substantially more parameters; RDVQ is trained from scratch with less than 20% of their parameter count.
- vs. DLF: This dual-branch SQ+VQ hybrid approach fundamentally cannot perform rate-distortion optimization on the VQ branch.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to achieve end-to-end rate-distortion optimization for VQ, with a clear theoretical contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, multiple metrics, with comprehensive ablations and analyses.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is precise and mathematical derivations are clearly presented.
- Value: ⭐⭐⭐⭐⭐ Significant implications for both VQ-based compression and image tokenization.