Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

Conference: ICLR 2026
arXiv: 2505.13430
Code: GitHub
Area: Model Compression / Efficient Fine-Tuning
Keywords: Zeroth-order optimization, quantized model fine-tuning, memory-efficient training, quantization scaling factors, gradient variance

TL;DR

This paper proposes QZO, a method that estimates gradients via zeroth-order perturbations applied to quantization scaling factors (rather than discrete weights), and stabilizes training with directional derivative clipping (DDC). QZO enables memory-efficient fine-tuning of 4-bit/2-bit LLMs, cutting total memory by more than 18× relative to 16-bit full fine-tuning.

Background & Motivation

Background: Fine-tuning LLMs requires storing weights, gradients, optimizer states, and activations. For a typical 7B model in bfloat16 trained with AdamW, the first three alone take about 56 GB (14 GB of weights, 14 GB of gradients, and 28 GB of optimizer states). Existing approaches compress individual components: LoRA reduces trainable parameters, GaLore compresses optimizer states, and MeZO eliminates gradient storage via zeroth-order optimization.

Limitations of Prior Work: These methods address only part of the memory problem. Weights themselves remain a significant bottleneck — a 7B model in bfloat16 requires 14 GB — and even with MeZO eliminating gradients, 14 GB must still be allocated for weights. The most direct remedy is weight quantization (e.g., int4 requires only ~3.5 GB), but quantized weights are discrete and cannot be directly perturbed in zeroth-order schemes.
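The memory figures quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below is illustrative only: the per-parameter byte counts (2 bytes for bfloat16 weights and gradients, two AdamW moments at 2 bytes each, 0.5 byte for int4) are assumptions chosen to match the round numbers in the text, and activations and quantization metadata are ignored.

```python
GB = 1e9  # decimal gigabytes, matching the round figures in the text


def training_memory_gb(n_params, weight_bytes, grad_bytes, optim_bytes):
    """Return (weights, gradients, optimizer states, total) in GB."""
    weights = n_params * weight_bytes / GB
    grads = n_params * grad_bytes / GB
    optim = n_params * optim_bytes / GB
    return weights, grads, optim, weights + grads + optim


n = 7e9  # 7B parameters

# bfloat16 weights and gradients (2 bytes each) plus two AdamW moments
# (assumed to be 2 bytes each, i.e. 4 bytes per parameter in total).
adamw = training_memory_gb(n, 2, 2, 4)

# Zeroth-order training in bfloat16: no gradients, no optimizer states.
mezo = training_memory_gb(n, 2, 0, 0)

# Zeroth-order training on int4 weights (0.5 byte per parameter).
int4_zo = training_memory_gb(n, 0.5, 0, 0)

for name, (w, g, o, total) in [("AdamW bf16", adamw),
                               ("ZO bf16", mezo),
                               ("ZO int4", int4_zo)]:
    print(f"{name:>10}: weights={w:.1f} grads={g:.1f} optim={o:.1f} total={total:.1f} GB")
```

Under these assumptions the weights are the only term left once gradients and optimizer states are removed, which is why quantizing them is the remaining lever.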

Key Challenge: Zeroth-order optimization requires perturbations in a continuous space, yet quantized weights are discrete; the estimated gradients are continuous and thus cannot directly update discrete weights without a dequantize–requantize cycle.

Goal: Enable zeroth-order optimization on quantized models while maximizing memory compression across weights, gradients, and optimizer states simultaneously.

Key Insight: Quantization can be expressed as \(w = \Delta \cdot \bar{w}\), where \(\Delta\) is a continuous scaling factor and \(\bar{w}\) is a discrete integer. Perturbations can be applied to the continuous \(\Delta\) while keeping \(\bar{w}\) fixed.
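To make the factorization concrete, here is a minimal sketch of group-wise dequantization in plain Python. The integer codes, the group size, and the helper name `dequantize` are hypothetical, and zero points (which some quantization schemes also use) are omitted for brevity. The point is only that the scales can be nudged by a small continuous amount while the integer codes stay untouched.

```python
def dequantize(scales, int_weights, group_size):
    """Compute w[i] = scales[i // group_size] * int_weights[i]."""
    return [scales[i // group_size] * q for i, q in enumerate(int_weights)]


int_weights = [-3, 1, 7, -8, 2, 0, -1, 5]  # fixed signed 4-bit codes (made up)
scales = [0.02, 0.05]                       # one continuous scale per group of 4

w = dequantize(scales, int_weights, group_size=4)

# Perturb only the continuous scales; the integer codes are not modified.
eps = 1e-3
z = [1.0, -1.0]  # a fixed perturbation direction, for illustration
scales_plus = [s + eps * zi for s, zi in zip(scales, z)]
w_plus = dequantize(scales_plus, int_weights, group_size=4)

print(w)
print(w_plus)
```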

Core Idea: Perturb continuous quantization scaling factors for zeroth-order gradient estimation, and control gradient variance via directional derivative clipping.

Method

Overall Architecture

QZO = Q-SPSA (quantized zeroth-order gradient estimation) + DDC (directional derivative clipping). The quantized integer weights \(\bar{\theta}\) remain fixed; only the continuous scaling factors \(\Delta\) are updated. Two forward passes suffice for gradient estimation — no backpropagation, no gradient storage, and no optimizer states are required.

Key Designs

Q-SPSA (Quantized Simultaneous Perturbation Stochastic Approximation):
- Function: Extends SPSA to quantized models by perturbing continuous scaling factors rather than discrete weights.
- Mechanism: \(\hat{\nabla}_{\Delta}\mathcal{L} = \frac{\mathcal{L}((\Delta+\epsilon z)\odot\bar{\theta}) - \mathcal{L}((\Delta-\epsilon z)\odot\bar{\theta})}{2\epsilon}z\), where \(z \sim \mathcal{N}(0, I_d)\)
- Design Motivation: \(\Delta\) is continuous and thus naturally amenable to perturbation and gradient-based updates. The dequantization \(w = \Delta \cdot \bar{w}\) is consistent with standard forward passes, requiring no modification to inference code. The approach applies to both scalar-based (GPTQ) and codebook-based (AQLM) quantization schemes.
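The Q-SPSA estimator above can be sketched on a toy problem. This is not the authors' implementation: the "model" is a three-weight linear map with a squared-error loss, each weight has its own scale, and all names (`loss`, `q_spsa`) and values are hypothetical. Because the toy loss is quadratic in the scales, the central difference equals the exact directional derivative along \(z\), which gives a convenient sanity check.

```python
import random


def loss(scales, codes, x, y):
    """Squared error of a toy linear model whose weights are scales * codes."""
    pred = sum(s * q * xi for s, q, xi in zip(scales, codes, x))
    return (pred - y) ** 2


def q_spsa(scales, codes, x, y, eps, rng):
    """Two-forward-pass estimate; returns the directional derivative d and z."""
    z = [rng.gauss(0.0, 1.0) for _ in scales]
    plus = [s + eps * zi for s, zi in zip(scales, z)]
    minus = [s - eps * zi for s, zi in zip(scales, z)]
    d = (loss(plus, codes, x, y) - loss(minus, codes, x, y)) / (2 * eps)
    return d, z


codes = [3, -2, 5]          # fixed integer weights (made up)
scales = [0.1, 0.2, 0.05]   # continuous scaling factors
x, y = [1.0, 2.0, -1.0], 0.5

rng = random.Random(0)
d, z = q_spsa(scales, codes, x, y, eps=1e-3, rng=rng)
grad_estimate = [d * zi for zi in z]  # the estimate is d * z

# Analytic gradient w.r.t. the scales, used only to check the estimator.
residual = sum(s * q * xi for s, q, xi in zip(scales, codes, x)) - y
true_grad = [2 * residual * q * xi for q, xi in zip(codes, x)]
directional = sum(g * zi for g, zi in zip(true_grad, z))

print(d, directional)  # the two values agree up to floating-point error
```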
DDC (Directional Derivative Clipping):
- Function: Clips the scalar directional derivative \(d\) in the zeroth-order gradient estimate to stabilize training.
- Mechanism: \(d' = \text{clip}(d, -C, C)\), yielding a clipped gradient estimate \(\hat{\nabla} = d' \cdot z\).
- Design Motivation: Zeroth-order gradient estimates suffer from high variance (a known issue also present in MeZO). Theorem 1 proves that clipping preserves unbiasedness while reducing variance, since \(d'^2 \leq d^2\).
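As a small illustration of the clipping rule, the helper below (a hypothetical name, not taken from the paper's code) clamps the scalar directional derivative to \([-C, C]\). Values inside the range pass through unchanged, and the clipped magnitude never exceeds the original one, which is the \(d'^2 \leq d^2\) property the variance argument relies on.

```python
def clip_directional_derivative(d, c):
    """Clamp the scalar directional derivative d to the interval [-c, c]."""
    return max(-c, min(c, d))


C = 100.0
samples = [350.0, -12.5, -1e4, 99.9]  # made-up values, including two spikes
clipped = [clip_directional_derivative(d, C) for d in samples]

for d, dc in zip(samples, clipped):
    # The clipped value keeps the sign of d and never grows in magnitude.
    print(f"d={d:>10.1f}  d'={dc:>7.1f}  d'^2 <= d^2: {dc * dc <= d * d}")
```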
Memory Seed Trick:
- Function: Reproduces the perturbation vector \(z\) on-the-fly via a random seed, avoiding explicit storage.
- Mechanism: Identical to MeZO — a seed index replaces direct storage of \(z\).
- Design Motivation: Storing \(z\) would have the same memory footprint as the model itself, negating the savings.
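The seed trick can be sketched as follows, using Python's `random.Random` as a stand-in for whatever seeded generator the training framework provides (an assumption; the helper name `perturb_in_place` is also hypothetical). The same seed regenerates the same \(z\) each time, so the scales can be moved to \(\Delta + \epsilon z\), then to \(\Delta - \epsilon z\), and then restored, without ever holding \(z\) in memory as a full vector.

```python
import random


def perturb_in_place(scales, seed, step):
    """Apply scales[i] += step * z[i], regenerating z from `seed` on the fly."""
    rng = random.Random(seed)
    for i in range(len(scales)):
        scales[i] += step * rng.gauss(0.0, 1.0)


scales = [0.02, 0.05, 0.01]
original = list(scales)  # kept only to verify the round trip below

eps, seed = 1e-3, 1234
perturb_in_place(scales, seed, +eps)      # now at Delta + eps * z
perturb_in_place(scales, seed, -2 * eps)  # now at Delta - eps * z
perturb_in_place(scales, seed, +eps)      # back at Delta

max_drift = max(abs(a - b) for a, b in zip(scales, original))
print(max_drift)  # tiny floating-point drift only
```

In a real run only the integer seed needs to be remembered between the two forward passes and the update.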

Loss & Training

ZO-SGD update: \(\Delta_{t+1} = \max(\Delta_t - \eta \cdot d' \cdot z,\ 0)\) (enforcing non-negativity of scaling factors).
Learning rate \(10^{-7}\), perturbation scale \(\epsilon = 10^{-3}\), clipping threshold \(C = 100\).
Optional: joint update of \(\Delta\) via Q-SPSA and non-quantized parameters via SPSA.
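Putting the pieces together, the update rule above can be sketched as a short loop on a toy quadratic problem. This is a sketch under stated assumptions, not the paper's training code: the model, data, and function names are made up, and the learning rate used here (\(10^{-4}\)) is a toy value chosen for this tiny problem rather than the \(10^{-7}\) reported for LLMs.

```python
import random


def loss(scales, codes, x, y):
    """Squared error of a toy linear model whose weights are scales * codes."""
    pred = sum(s * q * xi for s, q, xi in zip(scales, codes, x))
    return (pred - y) ** 2


def qzo_step(scales, codes, x, y, lr, eps, clip, rng):
    """One ZO-SGD step on the scales: two forward passes, clip, update, project."""
    z = [rng.gauss(0.0, 1.0) for _ in scales]
    plus = [s + eps * zi for s, zi in zip(scales, z)]
    minus = [s - eps * zi for s, zi in zip(scales, z)]
    d = (loss(plus, codes, x, y) - loss(minus, codes, x, y)) / (2 * eps)
    d_clipped = max(-clip, min(clip, d))  # directional derivative clipping
    # max(..., 0) keeps every scaling factor non-negative.
    return [max(s - lr * d_clipped * zi, 0.0) for s, zi in zip(scales, z)]


codes = [3, -2, 5]         # fixed integer weights (made up)
scales = [0.5, 0.6, 0.4]   # initial scaling factors
x, y = [1.0, 2.0, -1.0], -2.81

rng = random.Random(0)
initial_loss = loss(scales, codes, x, y)
for _ in range(1000):
    scales = qzo_step(scales, codes, x, y, lr=1e-4, eps=1e-3, clip=100.0, rng=rng)
final_loss = loss(scales, codes, x, y)

print(initial_loss, final_loss)
```

No gradient buffer or optimizer state appears anywhere in the loop; the only persistent state is the list of scales.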

Key Experimental Results

Main Results

4-bit GPTQ quantized models are evaluated on SST-2, RTE, CB, BoolQ, and SQuAD; the table below gives a qualitative summary for SST-2, RTE, and SQuAD:

Method	Precision	Memory	SST-2	RTE	SQuAD
Zero-Shot	16-bit	14 GB	baseline	baseline	baseline
Fine-tuning + AdamW	16-bit	56 GB	upper bound	upper bound	upper bound
MeZO	16-bit	14 GB	strong	strong	strong
QZO	4-bit	<3 GB	near MeZO	near MeZO	near MeZO

Extreme Quantization Experiment (2-bit AQLM, Llama-2-13B)

Configuration	Memory	Performance	Notes
Zero-Shot-Q (2-bit)	~5 GB	baseline	post-quantization zero-shot
QZO (2-bit)	~5 GB	significantly above baseline	effective under extreme compression
MeZO (16-bit)	26 GB	reference	requires about 5× the memory of QZO

Key Findings

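QZO achieves fine-tuning performance close to 16-bit MeZO (14 GB) using less than 3 GB. That is roughly a 5× reduction relative to MeZO and more than 18× relative to 16-bit fine-tuning with AdamW (56 GB).
QZO still improves significantly over the quantized zero-shot baseline under extreme 2-bit quantization.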
DDC is critical for training stability; without it, loss spikes occur frequently.
The clipping threshold \(C\) yields stable performance over a wide range (50–200).
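The memory ratios behind these findings can be checked directly from the figures in the two tables. The snippet below is plain arithmetic; the constant names are made up, and because the 4-bit QZO figure is reported as "<3 GB" and the 2-bit figure as "~5 GB", the first two ratios are lower bounds and the third is approximate.

```python
# Memory figures as quoted in the tables above (GB).
ADAMW_16BIT = 56
MEZO_16BIT = 14
QZO_4BIT_MAX = 3        # reported as "<3 GB", so the ratios are lower bounds
MEZO_16BIT_13B = 26
QZO_2BIT_13B = 5        # reported as "~5 GB"


def reduction(baseline_gb, method_gb):
    """How many times smaller the method's footprint is than the baseline's."""
    return baseline_gb / method_gb


vs_adamw = reduction(ADAMW_16BIT, QZO_4BIT_MAX)
vs_mezo = reduction(MEZO_16BIT, QZO_4BIT_MAX)
vs_mezo_13b = reduction(MEZO_16BIT_13B, QZO_2BIT_13B)

print(f"{vs_adamw:.1f}x {vs_mezo:.1f}x {vs_mezo_13b:.1f}x")  # prints: 18.7x 4.7x 5.2x
```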

Highlights & Insights

Unified framework for extreme compression: QZO eliminates gradient and optimizer-state storage and compresses the weights, saving memory along all three axes at once. The resulting 18× reduction enables fine-tuning of 13B models on a 24 GB GPU.
Perturbing scaling factors rather than weights: This avoids the cumbersome dequantize → perturb → requantize pipeline. The key insight is the factorization \(w = \Delta \cdot \bar{w}\), which exposes a continuous component amenable to perturbation.
Theoretical guarantee for DDC: The paper proves that clipping preserves unbiasedness while reducing variance, a clean theoretical result that rests on \(d'^2 \leq d^2\).

Limitations & Future Work

Zeroth-order optimization converges slowly, requiring significantly more update steps (~20k) compared to first-order methods (a few hundred).
Evaluation is limited to NLU tasks (classification and QA); effectiveness on generative tasks such as instruction following remains unexplored.
Only scaling factors can be fine-tuned (limited granularity); unlike LoRA, QZO cannot learn new low-rank parameters.
Comparison with LoRA-based quantized fine-tuning (e.g., QLoRA) is absent.

vs. MeZO: MeZO applies zeroth-order optimization to full-precision models; QZO applies it to quantized models, achieving comparable performance at 1/5 the memory footprint.
vs. QLoRA: QLoRA combines quantization with LoRA-based fine-tuning but still requires gradient storage. QZO eliminates gradient storage entirely, achieving lower memory at a potential cost in expressiveness.
vs. ZO-signSGD: Prior quantized zeroth-order work requires quantizing the perturbation noise and applying sign SGD on discrete weights; QZO is more efficient and flexible.

Rating

Novelty: ⭐⭐⭐⭐ — The idea of perturbing quantization scaling factors is intuitive and effective.
Experimental Thoroughness: ⭐⭐⭐ — Limited dataset variety; comparison with QLoRA is missing.
Writing Quality: ⭐⭐⭐⭐ — Method description is clear and theoretical proofs are complete.
Value: ⭐⭐⭐⭐ — Directly practical for extremely resource-constrained deployment scenarios.