Memory-Efficient 4-bit Preconditioned Stochastic Optimization¶
- Conference: ICCV 2025
- arXiv: 2412.10663
- Code: To be confirmed
- Area: Optimization / Preconditioned Stochastic Optimization
- Keywords: Shampoo, quantization, Cholesky decomposition, error feedback, low-precision optimizer, memory-efficient training
TL;DR¶
This paper proposes a 4-bit quantization scheme based on Cholesky decomposition and error feedback, compressing the preconditioner matrices of the Shampoo optimizer to 4-bit precision. The approach substantially reduces GPU memory consumption while preserving training performance close to 32-bit Shampoo, with convergence guarantees provided for both smooth and non-smooth settings.
Background & Motivation¶
Preconditioned stochastic optimization algorithms (e.g., Shampoo) exploit non-diagonal preconditioner matrices to capture inter-parameter correlations, achieving superior convergence rates and empirical performance over first-order optimizers (SGD, Adam, etc.). However, Shampoo requires storing four preconditioner matrices \((L_k, R_k, L_k^{-1/4}, R_k^{-1/4})\), incurring substantial GPU memory overhead that severely limits its scalability in large-scale model training.
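To make the overhead concrete (a back-of-envelope illustration, not a figure from the paper): for a weight matrix \(W \in \mathbb{R}^{m \times n}\), Shampoo keeps two \(m \times m\) and two \(n \times n\) matrices in 32-bit precision, i.e. \(8(m^2 + n^2)\) bytes of preconditioner state. A single \(4096 \times 4096\) layer thus carries \(8 \cdot 2 \cdot 4096^2 \approx 256\) MB of preconditioner state against only \(\approx 64\) MB of weights, a 4x overhead before any first-order state is counted.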
A straightforward solution is to quantize the preconditioner matrices from 32-bit to 4-bit, but naive quantization causes significant information loss and performance degradation. For example, on ViT-Small/CIFAR-100, 32-bit Shampoo achieves 77.95% test accuracy, whereas direct 4-bit quantization yields only 74.56%, a gap of 3.39 percentage points. The core challenge is therefore to compress preconditioner matrices to 4-bit ultra-low precision while preserving their spectral properties and avoiding any degradation of optimization quality.
Method¶
Overall Architecture¶
The proposed 4-bit Shampoo comprises three techniques, each building on the previous one:
- Baseline: Direct Quantization (VQ) — block-wise 4-bit quantization applied directly to preconditioner matrices
- Core Contribution 1: Cholesky Quantization (CQ) — Cholesky decomposition performed prior to quantizing the lower-triangular factor
- Core Contribution 2: Compensated Cholesky Quantization (CQ+EF) — error feedback introduced on top of Cholesky quantization
Key Design 1: Cholesky Quantization (CQ)¶
Rather than directly quantizing the symmetric positive definite preconditioner matrices \(L_k\) and \(R_k\), the method first applies Cholesky decomposition, \(L_k = C_k^L (C_k^L)^T\), where \(C_k^L\) is lower-triangular (and analogously for \(R_k\)).
The Cholesky factor is then quantized as \(\bar{C}_k^L = \mathcal{Q}(C_k^L)\). During recovery, the preconditioner is reconstructed via \(\hat{L}_k = \mathcal{D}(\bar{C}_k^L)\,\mathcal{D}(\bar{C}_k^L)^T\).
Two key advantages:
- Memory reduction by half: The Cholesky factor is lower-triangular, requiring approximately half the storage of a full matrix.
- Spectral preservation: Matrices recovered from quantized Cholesky factors automatically remain symmetric positive definite (SPD), with inverse 1/4-roots closer to those of the 32-bit counterpart. Quantitative validation on synthetic PD matrices shows NRE reduced from 46.14 (direct quantization) to 9.19 (Cholesky quantization), and AE from 27.19 to 9.20.
Quantization strategy: Diagonal elements are retained at 32-bit precision for numerical stability; off-diagonal elements are quantized to 4-bit using block-wise linear-2 mapping.
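A minimal NumPy sketch of Cholesky quantization as described above. The block size, the symmetric 4-bit code in \([-7, 7]\), and all function names are illustrative assumptions; the paper's actual block-wise linear-2 mapping may differ.

```python
import numpy as np

def quantize_4bit(x, block_size=64):
    """Block-wise 4-bit linear quantization: each block keeps one fp32
    scale plus signed 4-bit codes in [-7, 7]. (Illustrative stand-in
    for the paper's block-wise linear-2 mapping.)"""
    flat = x.astype(np.float32).ravel()
    pad = (-flat.size) % block_size
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.float32)])
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales, x.shape, pad

def dequantize_4bit(codes, scales, shape, pad):
    flat = (codes.astype(np.float32) * scales).ravel()
    return (flat[:-pad] if pad else flat).reshape(shape)

def cq_quantize(L):
    """CQ: factor the SPD preconditioner L = C C^T, quantize the strictly
    lower-triangular part of C to 4-bit, keep the diagonal in fp32.
    (In practice only the lower triangle is stored, halving memory.)"""
    C = np.linalg.cholesky(L)
    diag = np.diag(C).copy()                       # fp32 for stability
    codes = quantize_4bit(C - np.diag(diag))
    return codes, diag

def cq_dequantize(codes, diag):
    """Recover an SPD approximation: C_hat has a positive fp32 diagonal,
    so C_hat @ C_hat.T is symmetric positive definite by construction."""
    C_hat = dequantize_4bit(*codes) + np.diag(diag)
    return C_hat @ C_hat.T
```

Note how the SPD property comes for free: whatever error the 4-bit codes introduce, the product \(\hat{C}\hat{C}^T\) is symmetric, and the exact fp32 diagonal keeps all pivots positive.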
Key Design 2: Error Feedback (EF)¶
Inspired by error feedback in gradient compression for distributed training, the method adapts this mechanism to preconditioner compression:
- A 4-bit error state \(\bar{E}_k^L\) is maintained to record the quantization error of the Cholesky factor; below, \(E_k^L = \mathcal{D}(\bar{E}_k^L)\) denotes its dequantized value.
- Prior to each quantization step, the previous error compensates the factor: \(\bar{C}_k^L = \mathcal{Q}(C_k^L + E_{k-1}^L)\)
- The error state is updated via an exponential moving average: \(E_k^L = \beta_e E_{k-1}^L + (1-\beta_e)\bigl(C_k^L + E_{k-1}^L - \mathcal{D}(\bar{C}_k^L)\bigr)\)
Key storage trick: The Cholesky factor occupies the lower-triangular portion and the error state (strictly lower-triangular, with zero diagonal) occupies the upper-triangular portion of the same matrix, incurring no additional memory overhead.
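A sketch of one compensated step matching the formulas above, reusing the helpers from the CQ sketch; \(\beta_e = 0.9\), the fp32 error state, and the function name are assumptions rather than the paper's defaults.

```python
def cq_ef_step(C_new, E_prev, beta_e=0.9):
    """One compensated Cholesky quantization step (CQ+EF).

    C_new : fresh Cholesky factor (lower-triangular, fp32)
    E_prev: previous error state, strictly lower-triangular. The paper
            stores it in 4-bit; it is kept in fp32 here for brevity.
    """
    comp = C_new + E_prev                          # inject past quantization error
    diag = np.diag(comp).copy()
    codes = quantize_4bit(comp - np.diag(diag))
    C_hat = dequantize_4bit(*codes) + np.diag(diag)
    # EMA of the residual; the diagonal residual is exactly zero (the
    # diagonal is stored in fp32), so E_new stays strictly lower-triangular
    # and its transpose fits in the unused upper triangle of the C buffer.
    E_new = beta_e * E_prev + (1.0 - beta_e) * (comp - C_hat)
    return codes, diag, E_new
```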
Loss & Training¶
Model updates follow the standard Shampoo procedure, using the quantized inverse 1/4-roots to precondition the gradients: \(W_{k+1} = W_k - \eta_k\, L_k^{-1/4} G_k R_k^{-1/4}\), where \(G_k\) is the gradient at \(W_k\).
The learning rate is scaled via the grafting trick, and both SGDM and AdamW are supported as base optimizers.
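Putting the pieces together, a single parameter update might look like the sketch below. The state layout, the SGD-flavor grafting, and the assumption that the inverse 1/4-roots are themselves held in CQ form are simplifications, not the paper's exact schedule.

```python
def shampoo_4bit_update(W, G, state, lr):
    """One preconditioned step W <- W - lr * L^{-1/4} G R^{-1/4}, with the
    inverse 1/4-roots dequantized on use from their 4-bit CQ form."""
    L_inv4 = cq_dequantize(*state["L_inv4"])       # m x m
    R_inv4 = cq_dequantize(*state["R_inv4"])       # n x n
    direction = L_inv4 @ G @ R_inv4
    # Grafting (SGD flavor shown): rescale the preconditioned direction to
    # the norm of the raw gradient so the base learning rate transfers.
    graft = np.linalg.norm(G) / (np.linalg.norm(direction) + 1e-12)
    return W - lr * graft * direction
```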
Convergence Theory¶
- Smooth non-convex setting: 4-bit Shampoo achieves the optimal convergence rate \(\mathcal{O}(1/\sqrt{T})\).
- Non-smooth non-convex setting (e.g., ReLU networks): Global convergence to a stationary point set is proven — the first convergence guarantee for preconditioned gradient descent in the non-smooth regime.
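Schematically, the smooth non-convex guarantee takes the standard form (stated loosely here; exact constants and assumptions are in the paper):

\[
\min_{1 \le k \le T} \; \mathbb{E}\bigl[\lVert \nabla f(W_k) \rVert^2\bigr] \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right).
\]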
Key Experimental Results¶
Main Results: CIFAR-100 Test Accuracy (%) / Peak Memory (MB)¶
| Model | Optimizer | Accuracy | Memory |
|---|---|---|---|
| VGG-19 | SGDM | 74.43 | 597.3 |
| VGG-19 | SGDM + 32-bit Shampoo | 75.02 | 1065.2 |
| VGG-19 | SGDM + 4-bit VQ | 74.36 | 662.2 |
| VGG-19 | SGDM + 4-bit CQ+EF | 75.21 | 662.2 |
| ViT-Small | AdamW | 73.00 | 2930.0 |
| ViT-Small | AdamW + 32-bit Shampoo | 77.95 | 3448.9 |
| ViT-Small | AdamW + 4-bit VQ | 74.56 | 3001.7 |
| ViT-Small | AdamW + 4-bit CQ+EF | 77.74 | 3001.7 |
ImageNet Test Accuracy (%) / Peak Memory (MB)¶
| Model | Optimizer | Accuracy | Memory |
|---|---|---|---|
| ResNet-50 | AdamW Base | 77.56 | 11356 |
| ResNet-50 | 32-bit Shampoo | 78.06 | 11986 |
| ResNet-50 | 4-bit CQ+EF | 78.00 | 11445 |
| ViT-Base | AdamW Base | 72.59 | 11840 |
| ViT-Base | 32-bit Shampoo | 75.47 | 13319 |
| ViT-Base | 4-bit CQ+EF | 75.01 | 12052 |
LLM Pre-training: LLaMA on C4 Perplexity (PPL, lower is better)¶
| Model | Optimizer | PPL | Memory (GB) |
|---|---|---|---|
| LLaMA-130M | AdamW Base | 27.32 | 45.9 |
| LLaMA-130M | 32-bit Shampoo | 26.93 | 48.2 |
| LLaMA-130M | 4-bit CQ+EF | 26.98 | 46.2 |
Ablation Study: Incremental Component Analysis (Test Accuracy, %)¶
| Method | ResNet-34/CIFAR-100 | ViT-Small/CIFAR-100 |
|---|---|---|
| 4-bit VQ | 79.45 | 74.56 |
| 4-bit CQ | 80.27 (+0.82) | 77.34 (+2.78) |
| 4-bit CQ+EF | 80.52 (+1.07) | 77.74 (+3.18) |
Spectral Preservation Metrics (NRE / AE, lower is better)¶
| Method | Synthetic NRE | Synthetic AE | Epoch100 NRE | Epoch100 AE |
|---|---|---|---|---|
| Direct Quantization VQ | 46.14 | 27.19 | 25.71 | 18.51 |
| Cholesky CQ | 9.19 | 9.20 | 4.85 | 4.85 |
Key Findings¶
- CQ+EF reduces the 4-bit accuracy gap on ViT-Small/CIFAR-100 from 3.39 to 0.21 percentage points.
- On ImageNet/ViT-Base, 4-bit CQ+EF is only 0.46 points below 32-bit Shampoo while saving 1267 MB of memory.
- The computational overhead of Cholesky decomposition is negligible: ResNet-50/ImageNet training takes 1899 min, vs. 1883 min for VQ.
Highlights & Insights¶
- Novel quantization target: Rather than directly quantizing preconditioner matrices, the method quantizes their Cholesky factors — the first such attempt. The triangular structure and SPD recovery property jointly address both memory reduction and information preservation.
- Elegant combination of error feedback and triangular structure: The error state can be stored in the upper-triangular portion of the same matrix at zero additional memory cost.
- First convergence proof in the non-smooth setting: Theoretical guarantees are provided for preconditioned gradient methods with non-smooth activations such as ReLU.
- Comprehensive validation: Experiments span diverse architectures including VGG, ResNet, Swin, ViT, and LLaMA.
Limitations & Future Work¶
- Cholesky decomposition requires strictly positive definite preconditioner matrices, necessitating \(\epsilon I\) regularization.
- Diagonal elements are retained at 32-bit, resulting in an overall compression ratio below pure 4-bit.
- Validation is limited to image classification and LLM pre-training; extension to detection and segmentation tasks has not been explored.
- Error feedback introduces an additional hyperparameter \(\beta_e\), with insufficient sensitivity analysis provided.
Related Work & Insights¶
- Preconditioned optimizers: Shampoo, K-FAC, K-BFGS, AdaBK
- Optimizer quantization: 4-bit Adam and related methods, which target only first-order optimizer states
- Error feedback: Originally proposed for distributed gradient compression; this work is the first to apply it to preconditioner matrix quantization compensation
Rating¶
| Dimension | Score (1–10) |
|---|---|
| Novelty | 8 |
| Theoretical Depth | 8 |
| Experimental Thoroughness | 8 |
| Value | 8 |
| Writing Quality | 7 |
| Overall | 8 |