# QVGen: Pushing the Limit of Quantized Video Generative Models
Conference: ICLR 2026 · arXiv: 2505.11497 · Code: https://github.com/ModelTC/QVGen · Area: Video Generation · Keywords: Video Diffusion Models, Quantization-Aware Training, Low-Bit Quantization, Rank Decay Strategy, Auxiliary Modules
## TL;DR
This paper proposes QVGen, a quantization-aware training (QAT) framework for video diffusion models. It introduces auxiliary modules to reduce gradient norms and improve convergence, and designs a rank decay strategy to progressively eliminate the inference overhead of auxiliary modules during training. QVGen is the first method to achieve near full-precision video generation quality under 4-bit quantization.
## Background & Motivation
- Background: Video diffusion models (e.g., CogVideoX, Wan) can generate high-quality videos but demand enormous computation and memory — Wan 14B requires over 30 minutes and 50 GB of VRAM on a single H100 to generate a 10-second 720p video. Model quantization is an effective compression approach: 4-bit quantization can achieve approximately 3× speedup and 4× model size reduction.
- Limitations of Prior Work: Directly transferring quantization methods from image diffusion models to video diffusion models yields poor results. Existing QAT methods (e.g., Q-DM, EfficientDM, LSQ) suffer severe quality degradation under 4-bit video quantization.
- Key Challenge: Quantized video models exhibit severe convergence difficulties during QAT, which the paper traces to large gradient norms.
## Method
### Overall Architecture
QVGen consists of two core components:

1. Auxiliary Module \(\Phi\): attached to quantized linear layers to compensate for quantization error and reduce gradient norms, improving convergence.
2. Rank Decay Strategy: progressively eliminates \(\Phi\) via SVD and rank regularization, ensuring no additional inference overhead at deployment.
### Key Design 1: Auxiliary Modules for Improved Convergence
Theoretical Analysis: Based on regret analysis, the average regret admits an upper bound of the standard online-gradient-descent form (with feasible-set diameter \(D\) and step size \(\eta\)):

$$\frac{1}{T}\sum_{t=1}^{T}\big(f_t(\boldsymbol{\theta}_t) - f_t(\boldsymbol{\theta}^\star)\big) \le \frac{D^2}{2\eta T} + \frac{\eta}{2T}\sum_{t=1}^{T}\|\mathbf{g}_t\|_2^2$$

When the number of training steps \(T\) is sufficiently large, the first term becomes negligible. Thus, minimizing the gradient norm \(\|\mathbf{g}_t\|_2\) is the key to improving QAT convergence.
With the auxiliary module \(\Phi\), the forward computation of the quantized linear layer becomes:

$$\mathbf{Y} = \mathcal{Q}_b(\mathbf{W})\,\mathcal{Q}_b(\mathbf{X}) + \Phi(\mathcal{Q}_b(\mathbf{X}))$$
where \(\Phi(\mathcal{Q}_b(\mathbf{X})) = \mathbf{W}_\Phi \mathcal{Q}_b(\mathbf{X})\), and \(\mathbf{W}_\Phi\) is initialized as the weight quantization error \(\mathbf{W} - \mathcal{Q}_b(\mathbf{W})\).
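A minimal PyTorch sketch of such a layer, assuming a symmetric uniform quantizer with a straight-through estimator for \(\mathcal{Q}_b\) (class and function names are illustrative, not the paper's code; bias is omitted):

```python
import torch
import torch.nn as nn

def quantize_b(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform b-bit quantizer Q_b with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    xq = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (xq - x).detach()  # forward: quantized value; backward: identity gradient

class AuxQuantLinear(nn.Module):
    """Quantized linear layer with auxiliary module: Y = Q_b(W) Q_b(X) + W_phi Q_b(X)."""
    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        self.bits = bits
        self.weight = nn.Parameter(linear.weight.detach().clone())
        with torch.no_grad():
            # W_phi is initialized to the weight quantization error W - Q_b(W).
            w_err = self.weight - quantize_b(self.weight, bits)
        self.w_phi = nn.Parameter(w_err)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xq = quantize_b(x, self.bits)
        wq = quantize_b(self.weight, self.bits)
        return xq @ wq.t() + xq @ self.w_phi.t()  # quantized path + compensation Phi
```

With this initialization, the layer's output at step 0 equals \(\mathcal{Q}_b(\mathbf{W})\mathcal{Q}_b(\mathbf{X}) + (\mathbf{W} - \mathcal{Q}_b(\mathbf{W}))\mathcal{Q}_b(\mathbf{X}) = \mathbf{W}\mathcal{Q}_b(\mathbf{X})\), i.e. the weight quantization error is fully compensated from the start.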
### Key Design 2: Rank Decay Strategy
\(\Phi\) introduces additional full-precision matrix multiplication overhead at inference and must be progressively removed during training.
Key Observation: SVD analysis of \(\mathbf{W}_\Phi\) reveals that the proportion of small singular values grows from 73% (step 0) to 99% (step 2K) as training progresses, indicating that an increasing fraction of components contribute minimally.
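This observation can be reproduced directly; a small sketch (the relative threshold defining a "small" singular value is an assumption, since the exact cutoff is not given here):

```python
import torch

def small_singular_fraction(w_phi: torch.Tensor, rel_thresh: float = 0.01) -> float:
    """Fraction of singular values of W_phi below rel_thresh * largest singular value."""
    s = torch.linalg.svdvals(w_phi)  # returned in descending order
    return (s < rel_thresh * s[0]).float().mean().item()
```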
Procedure:

1. Perform SVD on \(\mathbf{W}_\Phi\): \(\mathbf{W}_\Phi = \sum_{s=1}^d \sigma_s \mathbf{u}_s \mathbf{v}_s^\top\).
2. Rewrite \(\Phi\) in low-rank form: \(\Phi(\mathcal{Q}_b(\mathbf{X})) = \mathbf{L}\mathbf{R}\,\mathcal{Q}_b(\mathbf{X})\).
3. Apply the rank regularization \(\boldsymbol{\gamma}\):

$$\Phi(\mathcal{Q}_b(\mathbf{X})) = (\boldsymbol{\gamma} \odot \mathbf{L})\,\mathbf{R}\,\mathcal{Q}_b(\mathbf{X})$$

where \(\boldsymbol{\gamma} = \mathrm{concat}([1]_{n \times (1-\lambda)r}, [u]_{n \times \lambda r})\) and \(u\) decays from 1 to 0 via cosine annealing (see the sketch after this list).
- Once \(u\) reaches 0, truncate low-contribution components, reducing the rank from \(r\) to \((1-\lambda)r\).
- Repeat until \(r=0\), fully eliminating \(\Phi\).
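A sketch of one decay round under the definitions above, assuming \(\boldsymbol{\gamma}\) scales each of the last \(\lambda r\) rank components by \(u\) and a fixed-length cosine schedule (schedule constants are illustrative):

```python
import math
import torch
import torch.nn as nn

class RankDecayPhi(nn.Module):
    """Low-rank Phi: (gamma ⊙ L) R Q_b(X), annealing the last lam*r components to zero."""
    def __init__(self, w_phi: torch.Tensor, lam: float = 0.5, anneal_steps: int = 1000):
        super().__init__()
        U, S, Vh = torch.linalg.svd(w_phi, full_matrices=False)
        self.L = nn.Parameter(U * S)                    # n x r, singular values absorbed into L
        self.R = nn.Parameter(Vh)                       # r x d
        self.keep = S.shape[0] - int(lam * S.shape[0])  # rank kept after this round
        self.anneal_steps = anneal_steps
        self.step = 0

    def gamma(self) -> torch.Tensor:
        # u decays from 1 to 0 via cosine annealing over the current round.
        u = 0.5 * (1.0 + math.cos(math.pi * min(self.step / self.anneal_steps, 1.0)))
        g = torch.ones(self.L.shape[1], device=self.L.device)
        g[self.keep:] = u                               # down-weight low-contribution components
        return g

    def forward(self, xq: torch.Tensor) -> torch.Tensor:
        self.step += 1                                  # training-step counter (sketch only)
        return xq @ self.R.t() @ (self.gamma() * self.L).t()

    def truncate(self) -> None:
        """Once u reaches 0, drop the annealed components: rank r -> (1 - lam) * r."""
        self.L = nn.Parameter(self.L[:, :self.keep].detach().clone())
        self.R = nn.Parameter(self.R[:self.keep, :].detach().clone())
```

Repeating `truncate()` and re-annealing drives the rank to zero, at which point \(\Phi\) disappears from the deployed model entirely.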
### Loss & Training
A knowledge distillation (KD) objective is adopted with the full-precision model as teacher: the quantized student is trained to match the teacher's denoising output at each timestep.
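A minimal sketch of one distillation step, assuming an MSE objective between the teacher's and student's denoising outputs and a generic `model(x_t, t, cond)` call signature (both are assumptions, not the paper's exact interface):

```python
import torch
import torch.nn.functional as F

def kd_step(student, teacher, x_t, t, cond, optimizer) -> float:
    """One QAT step: match the quantized student to the full-precision teacher."""
    with torch.no_grad():
        target = teacher(x_t, t, cond)   # full-precision teacher output
    pred = student(x_t, t, cond)         # quantized student output
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```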
## Key Experimental Results
### Main Results
Results on VBench:
| Method | Bits (W/A) | Imaging Quality↑ | Dynamic Degree↑ | Scene Consistency↑ |
|---|---|---|---|---|
| CogVideoX-2B Full Precision | 16/16 | 59.15 | 67.78 | 36.24 |
| SVDQuant (PTQ) | 4/6 | 58.27 | 40.83 | 27.69 |
| Q-DM (QAT) | 4/4 | 54.96 | 48.61 | 28.02 |
| QVGen (Ours) | 4/4 | 60.16 | 67.22 | 31.42 |
| QVGen (Ours) | 3/3 | 58.36 | 53.89 | 23.85 |
3-bit QVGen surpasses Q-DM by +25.28 on Dynamic Degree and +8.43 on Scene Consistency.
### Ablation Study
| Configuration | FID↓ |
|---|---|
| Plain QAT (no auxiliary module) | Poor (baseline) |
| Auxiliary module, all parameters decayed directly | Suboptimal |
| Auxiliary module + rank decay (\(\lambda=1/2\)) | Best |
### Key Findings
- QVGen is the first video QAT method to achieve full-precision-comparable quality under 4-bit quantization.
- The framework generalizes across both CogVideoX and Wan video model families.
- When applied to Wan 14B (one of the largest open-source video models), negligible performance loss is observed on VBench-2.0.
- Gradient norm analysis confirms that \(\|\mathbf{g}_t\|_2\) in QVGen consistently remains lower than in Q-DM (a measurement sketch follows this list).
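A measurement like the one behind this finding can be sketched as follows (the global-norm definition is an assumption; the paper may report per-layer norms):

```python
import torch

def grad_norm(model: torch.nn.Module) -> float:
    """Global L2 norm of all parameter gradients, i.e. ||g_t||_2 after loss.backward()."""
    total = sum(p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None)
    return torch.sqrt(total).item()
```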
## Highlights & Insights
- First work to theoretically analyze convergence in video QAT, establishing the relationship between gradient norms and convergence behavior.
- The rank decay strategy is elegantly designed, exploiting the natural shrinkage of singular values observed during training.
- Significantly outperforms all baselines at the extreme low-bit regimes of 3-bit and 4-bit quantization.
## Limitations & Future Work
- Training cost is high (Wan 14B requires 32×H100 GPUs for 16 epochs).
- A full-precision teacher model is required for knowledge distillation.
- Current validation is limited to linear layer quantization; other components such as attention mechanisms are not yet addressed.
## Related Work & Insights
- PTQ Methods: Post-training quantization approaches such as ViDiT-Q and SVDQuant show limited effectiveness at very low bit-widths.
- QAT Methods: Quantization-aware training methods including Q-DM, EfficientDM, and LSQ face convergence difficulties on video models.
- Model Compression: low-rank decomposition and pruning are alternative compression routes to quantization.
## Rating
- Novelty: ⭐⭐⭐⭐ — The combination of auxiliary modules and rank decay strategy is novel.
- Theory: ⭐⭐⭐⭐ — Convergence analysis is rigorous and well-grounded.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 4 state-of-the-art video models ranging from 1.3B to 14B parameters.
- Value: ⭐⭐⭐⭐⭐ — Directly addresses a critical bottleneck in deploying video generative models.