Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis
Conference: ACL 2025
arXiv: 2505.14742
Code: https://github.com/Little0o0/Quaff.git
Area: Model Compression / LLM Efficiency
Keywords: quantization, PEFT, activation outlier, weight-activation quantization, fine-tuning
TL;DR
This paper proposes the Outlier Spatial Stability Hypothesis (OSSH), which states that the spatial locations of activation outlier channels remain stable during fine-tuning, and designs the Quaff framework around it. By handling only a few persistent outlier channels with targeted momentum scaling, Quaff achieves a 1.73× latency reduction and a 30% memory saving relative to the FP32 baseline, while also improving accuracy on GPQA by 0.6%.
Background & Motivation
Background: PEFT (such as LoRA) reduces the number of trainable parameters, but the computational and memory overhead of fine-tuning billion-scale models remains substantial. Quantization is a primary means to improve efficiency: weight-only quantization (WOQ) compresses only weights but introduces mixed-precision computation bottlenecks; weight-activation quantization (WAQ) compresses both to INT8, utilizing integer arithmetic to achieve a 4× speedup.
Limitations of Prior Work: LLMs exhibit "emergent" channel-level outliers where certain channel activation values are 100× larger than the average. Existing solutions face a trilemma: (1) static scaling predefines scaling factors on calibration data, but shifts in activation distribution during fine-tuning cause mismatch; (2) dynamic scaling adjusts in real-time but requires storing full-precision weights and repeated re-quantization, incurring excessive memory and computational costs; (3) rotation transformations replace scaling but introduce computational inefficiencies.
Key Challenge: Channel scaling couples the quantization of weights and activations—the scaled weights \(\hat{W} = sW\) depend on real-time activation statistics, rendering independent quantization impossible.
Goal: Decouple the dependence between weight and activation quantization, making INT8 WAQ both efficient and accurate in fine-tuning scenarios.
Key Insight: It is observed that the spatial locations of outlier channels remain stable during fine-tuning (OSSH). Therefore, one only needs to pre-identify these channels and target them for processing.
Core Idea: Leverage the spatial stability of activation outlier channels to dynamically scale only a small fraction (<5%) of stable channels, thereby decoupling the weight-activation quantization dependency.
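To make the outlier problem concrete, here is a minimal sketch of why one large channel hurts INT8 quantization and why scaling that channel helps. It assumes symmetric per-tensor quantization with a single step size; the helper names (`quantize_int8`, `max_abs_error`) and the numbers are illustrative and are not taken from the paper. Only the activation side is shown; in channel scaling the factor `s` is folded into the matching weight row.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, step size)."""
    max_abs = max(abs(v) for v in values)
    step = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / step))) for v in values]
    return codes, step


def max_abs_error(values):
    """Largest round-trip error after quantizing and dequantizing `values`."""
    codes, step = quantize_int8(values)
    return max(abs(v - c * step) for v, c in zip(values, codes))


# One activation row: seven ordinary channels and one outlier roughly 100x larger.
row = [0.9, -1.1, 0.4, 1.0, -0.7, 0.3, -1.2, 100.0]

# The outlier sets the step size for every channel, so small values lose precision.
err_with_outlier = max_abs_error(row)

# Dividing only the outlier channel by a scaling factor s brings it into range.
s = 100.0 / 1.2  # illustrative choice, not the paper's formula
scaled_row = row[:-1] + [row[-1] / s]
err_scaled = max_abs_error(scaled_row)

print(f"max error with outlier: {err_with_outlier:.4f}")
print(f"max error after scaling: {err_scaled:.4f}")
```

The catch, as the Key Challenge above notes, is that `s` depends on the current activations, so the scaled weights would have to be re-quantized whenever `s` changes.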
Method
Overall Architecture
Preprocessing phase: Identify outlier channels \(O\) using calibration data \(\rightarrow\) Quantize frozen weights \(W_{int}\) + Keep full-precision weights \(W_O\) for outlier channels \(\rightarrow\) Fine-tuning phase: Inject PEFT parameters \(\theta\) \(\rightarrow\) Run INT8 forward propagation with momentum scaling on outlier channel activations at each step.
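The first preprocessing step can be sketched as follows. This note does not state the exact criterion used to pick outlier channels, so the sketch assumes a simple one: rank channels by peak absolute activation over the calibration data and keep the top fraction. The function name `find_outlier_channels` and the `budget` parameter are hypothetical.

```python
def find_outlier_channels(calib_activations, budget=0.05):
    """Return sorted indices of the channels with the largest peak magnitude.

    calib_activations: list of rows (one per token), each a list of channel values.
    budget: fraction of channels to flag as outliers (assumed selection rule).
    """
    num_channels = len(calib_activations[0])
    # Peak absolute value seen in each channel across all calibration rows.
    peak = [max(abs(row[c]) for row in calib_activations) for c in range(num_channels)]
    k = max(1, int(num_channels * budget))
    ranked = sorted(range(num_channels), key=lambda c: peak[c], reverse=True)
    return sorted(ranked[:k])


# Toy calibration batch: 4 tokens, 20 channels, channel 7 carries an outlier.
calib = [[0.5 * ((t + c) % 3 - 1) for c in range(20)] for t in range(4)]
for row in calib:
    row[7] = 60.0

outlier_set = find_outlier_channels(calib, budget=0.05)
print(outlier_set)
```

Under OSSH this set is computed once and reused for the whole fine-tuning run, so the weight rows outside it can be quantized ahead of time.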
Key Designs
- Outlier Spatial Stability Hypothesis (OSSH):
    - Function: Propose and validate the hypothesis that the spatial locations of activation outlier channels remain stable during fine-tuning.
    - Mechanism: A set of outlier channels fixed before fine-tuning, covering 5% of channels, achieves a hit rate of \(>90\%\) throughout fine-tuning.
    - Design Motivation: If outlier channel locations are stable, they can be identified beforehand, avoiding global runtime scans. The stability is attributed to the preservation of pre-trained features: outlier channels encode high-level semantic primitives, and fine-tuning only performs task adaptation, which does not alter them.
    - Comparison with Prior Work: Prior works only observed channel stability in inference scenarios (fixed models); OSSH extends this to fine-tuning scenarios (where distributions shift).
- Decoupled WAQ Formulation:
    - Function: Decompose channel scaling \(Y = \hat{X}(sW)\) into static and dynamic components.
    - Mechanism: \(Y = \hat{X}W + \hat{X}_{:,O}(s_O - 1)W_O\), where the main term \(\hat{X}W\) can be computed in INT8 with pre-quantized weights, and the compensation term \(\hat{X}_{:,O}(s_O - 1)W_O\) (abbreviated \(\hat{x}\hat{w}\)) involves only a small number of outlier channels.
    - Since \((s-1)\) is zero for non-outlier channels, the compensation term is highly sparse, keeping the computational overhead \(<5\%\).
- Targeted Momentum Scaling:
    - Function: Compute scaling factors only for outlier channels, utilizing a momentum mechanism for smooth updates.
    - Mechanism: \(s_t = \gamma s_{t-1} + (1-\gamma)\beta\), where \(\beta_i = \max(1, \sqrt{\max(|X_{:,i}|)/\max(|W_i|)})\) for outlier channels, and \(\beta_i = 1\) for non-outlier channels.
    - The momentum parameter \(\gamma\) controls the update inertia, preventing overreaction to instantaneous activation fluctuations.
    - Compared with dynamic scaling, this reduces recomputation and memory overhead by 99%.
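The decoupled formulation and the momentum update can be tried together in a small full-precision sketch. Several things here are assumptions rather than details from the paper: \(\hat{X}\) is taken to be \(X\) divided channel-wise by \(s\) (SmoothQuant-style), \(W_i\) is read as the weight row for input channel \(i\), INT8 quantization is left out so the algebra can be checked exactly, and the names `momentum_update` and `decoupled_forward` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def momentum_update(s_prev, x, w, outliers, gamma=0.9):
    """s_t = gamma * s_{t-1} + (1 - gamma) * beta, applied to outlier channels only.

    Non-outlier channels have beta = 1 and start at s = 1, so they stay at 1
    and are simply left untouched here.
    """
    s_new = list(s_prev)
    for i in outliers:
        x_max = max(abs(row[i]) for row in x)
        w_max = max(abs(v) for v in w[i])
        beta = max(1.0, math.sqrt(x_max / w_max))
        s_new[i] = gamma * s_prev[i] + (1 - gamma) * beta
    return s_new


def decoupled_forward(x, w, s, outliers):
    """Y = X_hat W + X_hat[:, O] (s_O - 1) W_O with X_hat = X / s."""
    x_hat = [[v / s[i] for i, v in enumerate(row)] for row in x]
    main = matmul(x_hat, w)  # W does not depend on s, so it can be quantized once
    if not outliers:
        return main
    x_o = [[row[i] for i in outliers] for row in x_hat]
    w_o = [[(s[i] - 1.0) * v for v in w[i]] for i in outliers]
    comp = matmul(x_o, w_o)  # small: only the outlier channels take part
    return [[m + c for m, c in zip(mr, cr)] for mr, cr in zip(main, comp)]


x = [[0.5, -40.0, 1.0], [-0.2, 64.0, 0.3]]      # 2 tokens, 3 input channels
w = [[0.1, -0.2], [0.25, 0.05], [-0.3, 0.4]]    # 3 input channels, 2 outputs
outliers = [1]

s = momentum_update([1.0, 1.0, 1.0], x, w, outliers, gamma=0.9)
y = decoupled_forward(x, w, s, outliers)
y_ref = matmul(x, w)
print(s)
print(y, y_ref)
```

Without quantization the decoupled output matches the plain product `matmul(x, w)` up to floating-point error, which is the sanity check that scaling and compensation cancel. In the quantized setting the benefit comes from the main term using weights that never need re-quantization.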
Loss & Training
- Use standard STE (Straight-Through Estimator) for backpropagation.
- Compatible with four PEFT methods: LoRA, Prompt Tuning, P-tuning, and IA3.
- Outlier channel budget is controlled under 5% and adaptively allocated across layers (fewer for q_proj, more for down_proj).
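For readers unfamiliar with the STE mentioned above, the idea can be shown without an autograd framework. Rounding has zero gradient almost everywhere, so the STE treats it as the identity in the backward pass. This is a generic illustration with hypothetical function names; where exactly Quaff applies the estimator is not described in this note.

```python
def fake_quant_forward(x, step):
    """Forward pass: snap each value to the INT8 grid, then map back to floats."""
    return [max(-127, min(127, round(v / step))) * step for v in x]


def fake_quant_backward(grad_output):
    """Backward pass under the STE: rounding is treated as the identity,
    so the incoming gradient is passed through unchanged."""
    return list(grad_output)


x = [0.26, -0.74, 1.49]
y = fake_quant_forward(x, step=0.5)
grad_x = fake_quant_backward([0.1, -0.2, 0.3])
print(y)
print(grad_x)
```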
Key Experimental Results
Main Results: GPQA Inference Benchmark (Phi-3 + LoRA)
| Method | Precision | Latency Ratio (vs FP32)↓ | Memory Ratio (vs FP32)↓ | GPQA Acc |
|---|---|---|---|---|
| FP32 | FP32 | 1.0× | 1.0× | Baseline |
| SmoothQuant | W8A8 | ~0.65× | ~0.7× | Below Quaff |
| QuaRot | W8A8 | ~0.75× | ~0.75× | Below Quaff |
| LLM.int8() | W8A8 | Slower than FP32 | ~0.85× | Below Quaff |
| Quaff | W8A8 | 0.58× | 0.70× | FP32+0.6% |
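The ratios in the Quaff row line up with the headline numbers if a "1.73× latency reduction" is read as a 1.73× speedup and a "30% memory saving" as a 30% smaller footprint. The short check below makes that reading explicit; the interpretation is an assumption, and the function names are illustrative.

```python
def speedup_to_latency_ratio(speedup):
    """A k-times speedup means each step takes 1/k of the baseline time."""
    return 1.0 / speedup


def saving_to_memory_ratio(saving_fraction):
    """Saving a fraction f of memory leaves 1 - f of the baseline footprint."""
    return 1.0 - saving_fraction


latency_ratio = speedup_to_latency_ratio(1.73)
memory_ratio = saving_to_memory_ratio(0.30)
print(round(latency_ratio, 2), round(memory_ratio, 2))
```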
Ablation Study
| Configuration | Performance | Note |
|---|---|---|
| Quaff Complete | Best | Targeted Momentum Scaling + OSSH |
| w/o Momentum (Direct Scaling) | Obvious drop | Instantaneous fluctuations lead to instability |
| Static Scaling (SmoothQuant) | Drops by 2.1% | Distribution shift during fine-tuning causes mismatch |
| Global Dynamic Scaling | Similar but slower | Requires storing full-precision weights |
| Different OSSH Channel Ratios | 5% is best | Too few leads to insufficient coverage, too many increases overhead |
Key Findings
- Thorough validation of OSSH: The 5% pre-defined channels maintain a hit rate of \(>90\%\) across fine-tuning iterations, and this is consistent across different models (LLaMA-2, Phi-3, OPT).
- Feasible on consumer GPUs: Successfully ran LLaMA-2-7B fine-tuning on an RTX 2080 Super (8GB).
- Cross-PEFT compatibility: Quaff is effective across four methods: LoRA, Prompt Tuning, P-tuning, and IA3.
- \((s-1)\) scaling is more stable than \(s\): It reduces weight sensitivity to the scaling factor, improving quantization stability.
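The hit rate quoted in the first finding is not defined in this note. One plausible reading, used in the sketch below, is the fraction of outlier channels observed at a given fine-tuning step that already belong to the predefined set. Both the definition and the channel indices are illustrative assumptions.

```python
def hit_rate(predefined, observed):
    """Fraction of observed outlier channels that fall inside the predefined set."""
    if not observed:
        return 1.0  # nothing to miss
    predefined = set(predefined)
    return sum(1 for c in observed if c in predefined) / len(observed)


predefined_set = [3, 17, 42, 99]

# Outlier channels observed at three hypothetical fine-tuning steps.
observed_per_step = [[3, 17, 42], [3, 17, 42, 99], [3, 17, 58, 99]]
rates = [hit_rate(predefined_set, obs) for obs in observed_per_step]
print(rates)
```

A hit rate that stays high across steps is what would justify fixing the outlier set once during preprocessing.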
Highlights & Insights
- Theoretical elegance of OSSH: Starting from the simple observation that "outlier channel locations remain invariant", the method removes the weight-activation coupling bottleneck in WAQ. The hypothesis is both intuitive and backed by experimental evidence.
- Ingenuity of the formulation decomposition: Decomposing \(\hat{X}(sW)\) into \(\hat{X}W + \hat{x}\hat{w}\), where \(\hat{x}\hat{w}\) denotes the compensation term \(\hat{X}_{:,O}(s_O - 1)W_O\), and exploiting the sparsity of that term allows handling outliers almost for free. This algebraic transformation can be transferred to other scenarios requiring dynamic-static separation.
- Practical deployment value: Enables fine-tuning on consumer-grade GPUs, contributing directly to "democratizing LLMs."
- The \((s-1)\) trick: For non-outlier channels, \(s_i = 1\) makes \((s-1)_i = 0\), naturally achieving sparsification—a highly practical trick.
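The decomposition and the \((s-1)\) trick can be written out step by step. This is a reconstruction from the formulas in this note rather than a quotation from the paper; it writes the scaling vector as a diagonal matrix \(\mathrm{diag}(s)\) and assumes \(\hat{X} = X\,\mathrm{diag}(s)^{-1}\), as in SmoothQuant-style channel scaling:

\[
\begin{aligned}
Y &= \hat{X}\,\bigl(\mathrm{diag}(s)\,W\bigr) \\
  &= \hat{X}W + \hat{X}\,\bigl(\mathrm{diag}(s) - I\bigr)\,W \\
  &= \hat{X}W + \hat{X}_{:,O}\,\bigl(\mathrm{diag}(s_O) - I\bigr)\,W_O .
\end{aligned}
\]

The last line uses \(s_i = 1\) for every channel \(i \notin O\), so the corresponding diagonal entries of \(\mathrm{diag}(s) - I\) vanish and only the columns of \(\hat{X}\) and rows of \(W\) indexed by \(O\) survive.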
Limitations & Future Work
- Only INT8 quantization is validated; whether OSSH still holds in more extreme low-bit (INT4/INT2) scenarios remains unexplored.
- The impact of calibration data selection on outlier channel identification is not fully discussed.
- The momentum parameter \(\gamma\) requires manual tuning, and no adaptive adjustment scheme is provided.
- The theoretical explanation of OSSH is quite intuitive ("feature preservation + semantic consistency") and lacks rigorous mathematical proof.
Related Work & Insights
- vs SmoothQuant (Xiao et al., 2023): SmoothQuant uses static channel scaling, which is effective at inference but fails during fine-tuning. Quaff achieves "targeted dynamic scaling" via OSSH.
- vs LLM.int8() (Dettmers et al., 2022): Keeps outlier channels in full precision, but introduces mixed-precision overhead. Quaff is more efficient through the compensation term approach.
- vs QuaRot (Ashkboos et al., 2024): Eliminates outliers using rotation instead of scaling, but calculating rotation matrices introduces extra overhead. Quaff is lighter by only processing 5% of channels.
Rating
- Novelty: ⭐⭐⭐⭐⭐ OSSH is novel and convincing, and the formulation decoupling is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10 benchmarks, 3 models, and 4 PEFT methods, including deployment validation on consumer GPUs.
- Writing Quality: ⭐⭐⭐⭐ Clear logic progressing from problem \(\rightarrow\) hypothesis \(\rightarrow\) method \(\rightarrow\) validation.
- Value: ⭐⭐⭐⭐⭐ Balances theoretical contribution with practical deployment value, and is highly significant for fine-tuning on consumer GPUs.