IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models
Conference: ICML 2025
arXiv: 2410.21759
Code: csguoh/IntLoRA
Area: Image Generation
Keywords: LoRA, network quantization, diffusion models, low-rank adaptation, integer arithmetic, inference acceleration
TL;DR
IntLoRA fine-tunes quantized diffusion models with integer low-rank parameters. Because the merged weights are already quantized, no additional post-training quantization (PTQ) is needed, which keeps both training and inference efficient.
Background & Motivation
Large-scale text-to-image diffusion models (e.g., Stable Diffusion, SDXL, FLUX) perform exceptionally well in personalized generation tasks. However, full-parameter fine-tuning is limited by memory on consumer GPUs. The combination of LoRA and quantization techniques (such as QLoRA) allows direct fine-tuning on quantized weights, lowering training costs.
Core Problem: Existing methods use FP16 low-rank parameters during training. During merging, quantized pre-trained weights must be converted back to FP16, and another PTQ step is required during deployment. This workflow has two major drawbacks:
- Process Redundancy: The training \(\rightarrow\) dequantization \(\rightarrow\) merging \(\rightarrow\) re-quantization pipeline adds an extra PTQ step, increasing deployment complexity.
- Severe Performance Degradation: PTQ at low bit-widths (e.g., 4-bit) causes significant quality loss. In the DreamBooth results, the DINO score falls from 0.48 for FP16 LoRA to 0.21 for QLoRA at W4A8.
The root cause is the arithmetic type mismatch between pre-trained weights (INT) and adapter weights (FP16), forcing the merged weights back to the floating-point domain.
Method
The core idea of IntLoRA is to make low-rank adaptation parameters operate in the integer domain, so that the merged weights are naturally in a quantized format. This consists of three key techniques:
1. Adapter-Quantization Separation (AQS)
Original LoRA initializes the product \(\mathbf{AB}\) to zero so that fine-tuning starts exactly at the pre-trained weights. However, an all-zero tensor is unfriendly to quantization (a scale factor \(s=0\) leads to division by zero). AQS introduces an auxiliary matrix \(\mathbf{R}\) to decouple the gradient path from quantization:
\[
\mathbf{W}' = \mathcal{Q}[\mathbf{W} - \text{sg}(\mathbf{R})] + \text{sg}(\mathbf{R}) + \mathbf{AB}
\]
where \(\mathcal{Q}\) is the weight quantizer and \(\text{sg}(\cdot)\) denotes the stop-gradient operation. \(\mathbf{AB}\) is still initialized to zero to preserve the original LoRA gradient, while \(\text{sg}(\mathbf{R}) + \mathbf{AB}\) provides a non-zero distribution that can be quantized. \(\mathbf{R}\) can be regenerated online from distribution statistics and a fixed random seed, so it needs no extra storage.
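The following NumPy sketch illustrates the AQS idea under stated assumptions; it is not the authors' implementation. It assumes the adapted weight has the form \(\mathcal{Q}[\mathbf{W} - \text{sg}(\mathbf{R})] + \text{sg}(\mathbf{R}) + \mathbf{AB}\), uses a simple min-max affine quantizer, and draws \(\mathbf{R}\) from a Gaussian. The helper names (`quantize_affine`, `make_aux`) and all sizes are hypothetical.

```python
import numpy as np

def quantize_affine(x, n_bits=8):
    """Min-max uniform affine quantization: integer codes, scale s, zero point z."""
    x_min, x_max = float(x.min()), float(x.max())
    s = (x_max - x_min) / (2 ** n_bits - 1)
    if s == 0.0:
        raise ZeroDivisionError("constant tensor: scale factor is zero")
    z = np.round(-x_min / s)
    codes = np.clip(np.round(x / s) + z, 0, 2 ** n_bits - 1).astype(np.int32)
    return codes, s, z

def dequantize(codes, s, z):
    return s * (codes - z)

def make_aux(shape, sigma, seed):
    """Regenerate the auxiliary matrix R from a fixed seed (nothing is stored)."""
    return np.random.default_rng(seed).normal(0.0, sigma, size=shape)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(64, 64))   # stand-in for a pre-trained weight
A = rng.normal(0.0, 0.01, size=(64, 4))    # LoRA factors; B = 0 so AB = 0 at init
B = np.zeros((4, 64))

R = make_aux(W.shape, sigma=0.02, seed=123)

# Quantizing the all-zero product AB on its own is ill-defined (scale is zero).
try:
    quantize_affine(A @ B)
    zero_init_ok = True
except ZeroDivisionError:
    zero_init_ok = False

# AQS: quantize W - R, then add R + AB back. In an autograd framework R would
# sit under stop-gradient, so only A and B would receive gradients.
codes, s, z = quantize_affine(W - R)
W_init = dequantize(codes, s, z) + R + A @ B

max_err = np.abs(W_init - W).max()
print("zero-init quantizable:", zero_init_ok)
print("max |W_init - W|:", max_err, "quantization step:", s)
```

At initialization the reconstruction differs from \(\mathbf{W}\) only by the rounding error of the quantizer, and calling `make_aux` again with the same seed reproduces \(\mathbf{R}\) exactly.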
2. Multiplicative Low-rank Adaptation (MLA)
Original LoRA adopts an additive form \(\mathbf{W} + \mathbf{AB}\). When both are quantized independently, they cannot be directly merged (they would have to share a quantizer, restricting the parameter space). MLA rewrites the addition into an equivalent multiplicative form:
The adapter term and pre-trained term can use independent quantizers, eliminating the constraint of sharing quantization parameters.
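The rewrite can be checked with elementary algebra. As a sketch, assume the quantized pre-trained weight is stored as \(s \cdot (\mathbf{W}_{\text{round}} - z)\) with scale \(s\), zero point \(z\), and integer codes \(\mathbf{W}_{\text{round}}\). These symbols are introduced here for illustration, and the entries of \(\mathbf{W}_{\text{round}} - z\) are assumed to be non-zero so that the element-wise division is defined:

\[
\begin{aligned}
\mathbf{W}' &= s \cdot (\mathbf{W}_{\text{round}} - z) + \text{sg}(\mathbf{R}) + \mathbf{AB} \\
&= \left[ s \cdot \mathbf{1} + \frac{\text{sg}(\mathbf{R}) + \mathbf{AB}}{\mathbf{W}_{\text{round}} - z} \right] \odot (\mathbf{W}_{\text{round}} - z)
\end{aligned}
\]

Here \(\mathbf{1}\) is the all-ones matrix, the fraction is element-wise, and \(\odot\) is the Hadamard product. Under these assumptions the bracketed factor plays the role of the adapter term and \(\mathbf{W}_{\text{round}} - z\) the pre-trained term, so each can be given its own quantizer and the merge becomes an element-wise product.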
3. Variance Matching Control (VMC)
The variance \(\sigma_\mathbf{R}\) of the auxiliary matrix \(\mathbf{R}\) faces a dilemma: too large and the original \(\mathbf{W}\) cannot be reconstructed after quantization; too small and the adapter term distribution is not concentrated enough around zero. VMC aligns the variance ratio:
The scalar \(\alpha\) acts as a fine-tuning exponent to balance quantization difficulty with information preservation.
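The trade-off can be illustrated numerically. The sketch below is a toy experiment, not the paper's procedure: it assumes \(\mathbf{R}\) is rescaled by a ratio of standard deviations raised to the power \(\alpha\) (the form `(sigma_W / sigma_R0) ** alpha` is an assumption made for illustration), and it uses the correlation between \(\mathbf{W}\) and \(\mathbf{W} - \mathbf{R}\) as a proxy for how much of \(\mathbf{W}\) survives in the tensor that gets quantized. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=100_000)   # stand-in for pre-trained weights
R0 = rng.normal(0.0, 1.0, size=W.shape)   # unscaled auxiliary sample

def scaled_aux(W, R0, alpha):
    """Assumed VMC-style rescaling: R = (sigma_W / sigma_R0) ** alpha * R0."""
    r = W.std() / R0.std()
    return (r ** alpha) * R0

def correlation(a, b):
    return float(np.corrcoef(a, b)[0, 1])

results = {}
for alpha in (0.5, 1.0, 1.5):
    R = scaled_aux(W, R0, alpha)
    results[alpha] = {
        "sigma_R": float(R.std()),
        # Proxy for how much of W is still present in the quantized tensor W - R.
        "corr_W_vs_W_minus_R": correlation(W, W - R),
    }
    print(alpha, results[alpha])
```

In this toy setting the ratio is below one, so a larger \(\alpha\) shrinks \(\mathbf{R}\): the correlation with \(\mathbf{W}\) rises (easier reconstruction) while the auxiliary term contributes less spread. With \(\alpha = 1\) the two standard deviations match exactly.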
Two Implementation Versions
IntLoRA_MUL (Integer Multiplication): Applies uniform affine quantization to the adapter term, and merging is completed via integer Hadamard product:
IntLoRA_SHIFT (Shift): Applies \(\log_2\) quantization to the adapter term, completing adaptation via shift operations:
During training, the Straight-Through Estimator (STE) is used to pass gradients through the non-differentiable rounding in the quantizers.
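The two merge styles can be sketched in NumPy as follows. This is an illustrative toy, not the paper's merge formula: `W_int` stands in for the zero-point-subtracted integer codes of the pre-trained weight, `U` for a floating-point adapter term, and the floating-point scales (`s_u`, `2**k_min`) are assumed to be kept aside and applied once at the end. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W_int = rng.integers(-8, 8, size=(4, 4), dtype=np.int64)   # stand-in for W_round - z
U = rng.normal(0.0, 0.05, size=(4, 4))                     # stand-in adapter term (float)
U[U == 0.0] = 1e-3                                         # keep log2 defined

# --- MUL variant: uniform affine quantization + integer Hadamard product ---
n_bits = 8
s_u = (U.max() - U.min()) / (2 ** n_bits - 1)
z_u = int(np.round(-U.min() / s_u))
U_int = np.clip(np.round(U / s_u) + z_u, 0, 2 ** n_bits - 1).astype(np.int64)
merged_mul = (U_int - z_u) * W_int          # integer-only merge; scale s_u kept aside

# --- SHIFT variant: log2 quantization + bit shifts ---
sign = np.sign(U).astype(np.int64)
k = np.round(np.log2(np.abs(U))).astype(np.int64)   # |U| is approximated by 2**k
k_min = int(k.min())
shift = k - k_min                                   # non-negative shift amounts
merged_shift = sign * np.left_shift(W_int, shift)   # integer-only merge; scale 2**k_min kept aside

# Both merges approximate the float product U * W_int once the scale is applied.
ref = U * W_int
err_mul = np.abs(s_u * merged_mul - ref).max()
err_shift = np.abs((2.0 ** k_min) * merged_shift - ref).max()
print("MUL error:", err_mul, "SHIFT error:", err_shift)
```

The shift variant replaces each multiplication with a bit shift at the cost of coarser, power-of-two multipliers, which is consistent with the accuracy gap between the two versions noted under the limitations.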
Key Experimental Results
Subject-Driven Generation (DreamBooth, SD v1.5)
| Method | Bit-width | DINO↑ | CLIP-I↑ | CLIP-T↑ |
|---|---|---|---|---|
| LoRA (FP16) | W16A16 | 0.4828 | 0.6968 | 0.2954 |
| QLoRA | W8A8 | 0.4153 | 0.6661 | 0.2824 |
| IR-QLoRA | W8A8 | 0.4070 | 0.6630 | 0.2841 |
| IntLoRA_MUL | W8A8 | 0.4498 | 0.6882 | 0.2858 |
| QLoRA | W4A8 | 0.2136 | 0.6134 | 0.2510 |
| QA-LoRA | W4A8 | 0.4127 | 0.6897 | 0.2700 |
| IntLoRA_MUL | W4A8 | 0.4242 | 0.6913 | 0.2710 |
Controllable Generation FID↓ (ControlNet)
| Method | 8-bit S2I | 8-bit L2F | 4-bit S2I | 4-bit L2F |
|---|---|---|---|---|
| LoRA (FP16) | 31.39 | 37.50 | 31.39 | 37.50 |
| QLoRA | 31.09 | 38.88 | 71.75 | 117.37 |
| IR-QLoRA | 31.81 | 36.30 | 35.83 | 39.63 |
| IntLoRA_MUL | 31.08 | 37.52 | 30.87 | 33.62 |
Training and Inference Efficiency (SD v1.5, RTX 3090)
| Method | Bit-width | Training Speed | Model Size | Requires PTQ |
|---|---|---|---|---|
| LoRA | W32A32 | 0.68s/img | 7700MB | ✔ |
| QLoRA | W8A8 | 0.85s/img | 1925MB | ✔ |
| IntLoRA_MUL | W8A8 | 0.87s/img | 1925MB | ✘ |
| QLoRA | W4A8 | 0.85s/img | 963MB | ✔ |
IntLoRA achieves a training speed comparable to QLoRA, but eliminates the PTQ step during inference, directly yielding quantized weights.
Highlights & Insights
- Elimination of Inference PTQ: The core contribution is converting adaptation parameters into integer operations, making the merged weights naturally quantized and enabling end-to-end, PTQ-free deployment.
- Mathematical Equivalence of MLA: The reconstruction from addition to multiplication preserves mathematical equivalence while decoupling the quantizer constraints between pre-training and adaptation.
- AQS Gradient-Quantization Decoupling: Uses stop-gradient to reconcile two conflicting requirements: learning needs a zero-initialized adapter, while quantization needs a non-zero distribution.
- Solid Theoretical VMC Analysis: Derives the optimal auxiliary matrix from the variance-correlation coefficient trade-off with theoretical rigor.
- Zero Storage Overhead for Auxiliary Matrix: \(\mathbf{R}\) is generated online using fixed seeds, avoiding any increase in model storage.
- Significant Advantage in 4-bit Scenarios: Under W4A8, QLoRA's DINO score drops to 0.2136 while IntLoRA_MUL reaches 0.4242, roughly twice as high.
Limitations & Future Work
- Evaluation Limited to Diffusion Models: The method has not been verified on LLMs; though technically transferable, generalizability remains to be studied.
- No Training Acceleration: STE and auxiliary matrices introduce additional computations during training, so the training speed does not exceed QLoRA.
- Hyperparameter Search for \(\alpha\): The exponent \(\alpha\) in VMC requires searching for different tasks, lacking automation.
- Activation Quantization Still Uses PTQ: The paper focuses on weight quantization, while activation quantization still relies on traditional schemes without a unified solution.
- IntLoRA_SHIFT Performs Worse than IntLoRA_MUL: The accuracy of \(\log_2\) quantization is limited, meaning the practical advantages mainly stem from the MUL version.
Related Work & Insights
- QLoRA / IR-QLoRA: Applies LoRA on quantized weights but still requires PTQ, which is the direct target for improvement in this work.
- QA-LoRA: Shares parameters through group quantization, sacrificing adaptation capacity.
- EfficientDM: Uses LoRA for diffusion model QAT, but incurs heavy training overhead.
- Insights: The concept of integer-domain adaptation can be extended to other PEFT methods (such as Adapters) and to mixed-precision strategies.
Rating
- Novelty: ⭐⭐⭐⭐ The three-part design (addition-to-multiplication reconstruction, AQS decoupling, and VMC adjustment) is highly elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers three tasks: subject generation, controllable generation, and style customization, along with ablation analyses.
- Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivations, intuitive diagrams, and well-defined problem formulations.
- Value: ⭐⭐⭐⭐ Addresses real-world pain points when deploying quantized LoRA, showing outstanding advantages in 4-bit scenarios.