Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis

Conference: ACL 2025
arXiv: 2505.14742
Code: https://github.com/Little0o0/Quaff.git
Area: Model Compression / LLM Efficiency
Keywords: quantization, PEFT, activation outlier, weight-activation quantization, fine-tuning

TL;DR

This paper proposes the Outlier Spatial Stability Hypothesis (OSSH), which states that the spatial locations of activation outlier channels remain stable during fine-tuning, and builds the Quaff framework on it. By handling only a few persistent outlier channels with targeted momentum scaling, Quaff achieves a 1.73× latency reduction and a 30% memory saving, while also improving accuracy on GPQA by 0.6%.

Background & Motivation

Background: PEFT (such as LoRA) reduces the number of trainable parameters, but the computational and memory overhead of fine-tuning billion-scale models remains substantial. Quantization is a primary means to improve efficiency: weight-only quantization (WOQ) compresses only weights but introduces mixed-precision computation bottlenecks; weight-activation quantization (WAQ) compresses both to INT8, utilizing integer arithmetic to achieve a 4× speedup.

Limitations of Prior Work: LLMs exhibit "emergent" channel-level outliers where certain channel activation values are 100× larger than the average. Existing solutions face a trilemma: (1) static scaling predefines scaling factors on calibration data, but shifts in activation distribution during fine-tuning cause mismatch; (2) dynamic scaling adjusts in real-time but requires storing full-precision weights and repeated re-quantization, incurring excessive memory and computational costs; (3) rotation transformations replace scaling but introduce computational inefficiencies.

Key Challenge: Channel scaling couples the quantization of weights and activations—the scaled weights \(\hat{W} = sW\) depend on real-time activation statistics, rendering independent quantization impossible.
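
The coupling can be seen in a small NumPy sketch. This is an illustration under stated assumptions, not the authors' code: the helper name `channel_scales` is hypothetical, the scaling rule is borrowed from the \(\beta_i\) formula given later in this summary, and float arithmetic stands in for INT8.

```python
import numpy as np

def channel_scales(X, W):
    """Per-input-channel factors s_i = max(1, sqrt(max|X[:, i]| / max|W[i, :]|))."""
    act_max = np.abs(X).max(axis=0)   # one statistic per activation channel
    w_max = np.abs(W).max(axis=1)     # one statistic per weight row
    return np.maximum(1.0, np.sqrt(act_max / w_max))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X[:, 3] *= 100.0                      # channel 3 plays the role of an outlier
W = rng.normal(size=(16, 4))

s = channel_scales(X, W)
X_hat = X / s                         # scaled activations
W_hat = s[:, None] * W                # scaled weights

# In full precision the scaling cancels out exactly.
product_unchanged = np.allclose(X_hat @ W_hat, X @ W)

# A later batch with a larger outlier yields a different s, and therefore
# different scaled weights: W_hat would have to be re-quantized.
X_new = X.copy()
X_new[:, 3] *= 2.0
s_new = channel_scales(X_new, W)
weights_changed = not np.allclose(s_new[:, None] * W, W_hat)
```

Because `W_hat` is a function of the current batch, any scheme that scales every channel dynamically has to keep the full-precision `W` around and re-quantize it whenever `s` changes.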

Goal: Decouple the dependence between weight and activation quantization, making INT8 WAQ both efficient and accurate in fine-tuning scenarios.

Key Insight: It is observed that the spatial locations of outlier channels remain stable during fine-tuning (OSSH). Therefore, one only needs to pre-identify these channels and target them for processing.

Core Idea: Leverage the spatial stability of activation outlier channels to dynamically scale only a small fraction (<5%) of stable channels, thereby decoupling the weight-activation quantization dependency.

Method

Overall Architecture

Preprocessing phase: Identify outlier channels \(O\) using calibration data \(\rightarrow\) Quantize frozen weights \(W_{int}\) + Keep full-precision weights \(W_O\) for outlier channels \(\rightarrow\) Fine-tuning phase: Inject PEFT parameters \(\theta\) \(\rightarrow\) Run INT8 forward propagation with momentum scaling on outlier channel activations at each step.

Key Designs

  1. Outlier Spatial Stability Hypothesis (OSSH):

    • Function: Propose and validate the hypothesis that the spatial locations of activation outlier channels remain stable during fine-tuning.
    • Mechanism: A pre-defined set of 5% outlier channels can achieve a hit rate of \(>90\%\) during the fine-tuning process.
    • Design Motivation: If outlier channel locations are stable, they can be identified beforehand, avoiding global runtime scans. This naturally stems from the preservation of pre-trained features—outlier channels encode high-level semantic primitives, and fine-tuning only performs task adaptation, which does not alter them.
    • Comparison with Prior Work: Prior works only observed channel stability in inference scenarios (fixed models); OSSH extends this to fine-tuning scenarios (where distributions shift).
  2. Decoupled WAQ Formulation:

    • Function: Decompose channel scaling \(Y = \hat{X}(sW)\) into static and dynamic components.
    • Mechanism: \(Y = \hat{X}W + \hat{X}_{:,O}(s_O - 1)W_O\), where the main term \(\hat{X}W\) can be computed in INT8 with pre-quantized weights, and the compensation term \(\hat{X}_{:,O}(s_O - 1)W_O\) (abbreviated \(\hat{x}\hat{w}\)) involves only the small set of outlier channels \(O\).
    • Since \((s-1)\) is zero for non-outlier channels, the compensation term is highly sparse, keeping the computational overhead \(<5\%\).
  3. Targeted Momentum Scaling:

    • Function: Compute scaling factors only for outlier channels, utilizing a momentum mechanism for smooth updates.
    • Mechanism: \(s_t = \gamma s_{t-1} + (1-\gamma)\beta\), where \(\beta_i = \max(1, \sqrt{\max(|X_{:,i}|)/\max(|W_i|)})\) for outlier channels, and \(\beta_i = 1\) for non-outlier channels.
    • The momentum parameter \(\gamma\) controls the update inertia, preventing overreaction to instantaneous activation fluctuations.
    • Compared with dynamic scaling, this reduces recomputation and memory overhead by 99%.
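
The decomposition and the momentum update can be checked numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation: float arithmetic replaces the INT8 kernels, and the function names, `gamma = 0.9`, the all-ones initialization of `s`, and the outlier indices are assumptions made for the example. It relies on \(s_i = 1\) outside \(O\), so that \(sW = W + (s-1)W\) with the second term nonzero only on the outlier rows.

```python
import numpy as np

def momentum_update(s_prev, X, W, outliers, gamma=0.9):
    """s_t = gamma * s_{t-1} + (1 - gamma) * beta, with beta = 1 off the outlier set."""
    beta = np.ones(W.shape[0])
    act_max = np.abs(X[:, outliers]).max(axis=0)
    w_max = np.abs(W[outliers]).max(axis=1)
    beta[outliers] = np.maximum(1.0, np.sqrt(act_max / w_max))
    return gamma * s_prev + (1.0 - gamma) * beta

def decoupled_matmul(X, W, s, outliers):
    """Evaluate (X / s) @ (s * W) as a static term plus a sparse compensation term."""
    X_hat = X / s
    main = X_hat @ W                  # W is untouched, so it could be quantized once
    comp = X_hat[:, outliers] @ ((s[outliers] - 1.0)[:, None] * W[outliers])
    return main + comp

rng = np.random.default_rng(0)
outliers = np.array([5, 17])          # hypothetical pre-identified outlier channels
X = rng.normal(size=(8, 32))
X[:, outliers] *= 100.0
W = rng.normal(size=(32, 4))

s = np.ones(32)                       # assumed initialization
for _ in range(3):                    # a few fine-tuning steps on the same batch
    s = momentum_update(s, X, W, outliers)

Y = decoupled_matmul(X, W, s, outliers)
```

Only the rows `W[outliers]` are read in full precision, and only `len(outliers)` scaling factors are updated per step; the remaining entries of `s` stay at 1.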

Loss & Training

  • Use standard STE (Straight-Through Estimator) for backpropagation.
  • Compatible with four PEFT methods: LoRA, Prompt Tuning, P-tuning, and IA3.
  • Outlier channel budget is controlled under 5% and adaptively allocated across layers (fewer for q_proj, more for down_proj).
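
For readers unfamiliar with STE, the following is a generic NumPy illustration of the idea, not Quaff's actual backward pass. The symmetric per-tensor quantizer and all function names are assumptions chosen for brevity.

```python
import numpy as np

def fake_quant_int8(x):
    """Symmetric per-tensor INT8 quantize-dequantize (assumes x is not all zeros)."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def forward(x, w):
    """Linear layer that sees quantized activations."""
    return fake_quant_int8(x) @ w

def backward_ste(grad_y, w):
    """Gradient w.r.t. x. round() has zero gradient almost everywhere, so STE
    treats the quantizer as the identity and passes the gradient straight through."""
    return grad_y @ w.T

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 3))

y = forward(x, w)
grad_y = np.ones_like(y)              # stand-in for an upstream gradient
grad_x = backward_ste(grad_y, w)
quant_error = np.abs(fake_quant_int8(x) - x).max()
```

The forward pass carries the rounding error (`quant_error` is at most half a quantization step), while the backward pass ignores the rounding entirely.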

Key Experimental Results

Main Results: GPQA Inference Benchmark (Phi-3 + LoRA)

| Method | Precision | Latency Ratio (vs FP32) ↓ | Memory Ratio (vs FP32) ↓ | GPQA Acc |
|---|---|---|---|---|
| FP32 | FP32 | 1.0× | 1.0× | Baseline |
| SmoothQuant | W8A8 | ~0.65× | ~0.7× | Below Quaff |
| QuaRot | W8A8 | ~0.75× | ~0.75× | Below Quaff |
| LLM.int8() | W8A8 | Slower than FP32 | ~0.85× | Below Quaff |
| Quaff | W8A8 | 0.58× | 0.70× | FP32 + 0.6% |

Ablation Study

| Configuration | Performance | Note |
|---|---|---|
| Quaff (complete) | Best | Targeted Momentum Scaling + OSSH |
| w/o Momentum (Direct Scaling) | Obvious drop | Instantaneous fluctuations lead to instability |
| Static Scaling (SmoothQuant) | Drops by 2.1% | Distribution shift during fine-tuning causes mismatch |
| Global Dynamic Scaling | Similar but slower | Requires storing full-precision weights |
| Different OSSH Channel Ratios | 5% is best | Too few leads to insufficient coverage, too many increases overhead |

Key Findings

  • Thorough validation of OSSH: The 5% pre-defined channels maintain a hit rate of \(>90\%\) across fine-tuning iterations, and this is consistent across different models (LLaMA-2, Phi-3, OPT).
  • Feasible on consumer GPUs: Successfully ran LLaMA-2-7B fine-tuning on an RTX 2080 Super (8GB).
  • Cross-PEFT compatibility: Quaff is effective across four methods: LoRA, Prompt Tuning, P-tuning, and IA3.
  • \((s-1)\) scaling is more stable than \(s\): It reduces weight sensitivity to the scaling factor, improving quantization stability.
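
One way to measure such a hit rate is sketched below. The definition used here is an assumption for illustration and may differ from the paper's exact metric: it is the fraction of the current top-5% channels, ranked by maximum absolute activation, that fall inside the set chosen on calibration data. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def top_outlier_channels(X, ratio=0.05):
    """Indices of the `ratio` fraction of channels with the largest max |activation|."""
    k = max(1, int(round(ratio * X.shape[1])))
    magnitude = np.abs(X).max(axis=0)
    return np.argsort(magnitude)[-k:]

def hit_rate(predefined, X, ratio=0.05):
    """Share of the current outlier channels that were predicted in advance."""
    current = top_outlier_channels(X, ratio)
    return float(np.isin(current, predefined).mean())

rng = np.random.default_rng(0)
true_outliers = [7, 21]

# Calibration batch: channels 7 and 21 carry large activations.
X_calib = rng.normal(size=(16, 40))
X_calib[:, true_outliers] *= 100.0
predefined = top_outlier_channels(X_calib)

# Later fine-tuning step: magnitudes shift, locations do not.
X_step = rng.normal(size=(16, 40))
X_step[:, true_outliers] *= 60.0
rate = hit_rate(predefined, X_step)
```

In this synthetic setting the outlier locations are fixed by construction, so the measured rate is 1.0; on a real model the same function would be evaluated per layer and per step.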

Highlights & Insights

  • Elegance of OSSH: By observing the simple property that "outlier channel locations remain invariant", the method removes the weight-activation coupling bottleneck in WAQ. The hypothesis is both intuitive and backed by experimental evidence.
  • Ingenuity of the formulation decomposition: Decomposing \(\hat{X}(sW)\) into \(\hat{X}W + \hat{x}\hat{w}\), where \(\hat{x}\hat{w}\) denotes the outlier-only compensation term \(\hat{X}_{:,O}(s_O - 1)W_O\), and exploiting the sparsity of that term allows outliers to be handled almost for free. This algebraic transformation can be transferred to other scenarios requiring dynamic-static separation.
  • Practical deployment value: Enables fine-tuning on consumer-grade GPUs, contributing directly to "democratizing LLMs."
  • The \((s-1)\) trick: For non-outlier channels, \(s_i = 1\) makes \((s-1)_i = 0\), naturally achieving sparsification—a highly practical trick.

Limitations & Future Work

  • Only INT8 quantization is validated; whether OSSH still holds in more extreme low-bit (INT4/INT2) scenarios remains unexplored.
  • The impact of calibration data selection on outlier channel identification is not fully discussed.
  • The momentum parameter \(\gamma\) requires manual tuning, and no adaptive adjustment scheme is provided.
  • The theoretical explanation of OSSH remains intuitive ("feature preservation + semantic consistency") and lacks a rigorous mathematical proof.

Comparison with Related Work

  • vs SmoothQuant (Xiao et al., 2023): SmoothQuant uses static channel scaling, which is effective at inference but fails during fine-tuning. Quaff achieves "targeted dynamic scaling" via OSSH.
  • vs LLM.int8() (Dettmers et al., 2022): Keeps outlier channels in full precision, but introduces mixed-precision overhead. Quaff is more efficient through the compensation term approach.
  • vs QuaRot (Ashkboos et al., 2024): Eliminates outliers using rotation instead of scaling, but calculating rotation matrices introduces extra overhead. Quaff is lighter by only processing 5% of channels.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The OSSH hypothesis is novel and convincing, and the formulation decoupling is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10 benchmarks, 3 models, and 4 PEFT methods, including deployment validation on consumer GPUs.
  • Writing Quality: ⭐⭐⭐⭐ Clear logic progressing from problem \(\rightarrow\) hypothesis \(\rightarrow\) method \(\rightarrow\) validation.
  • Value: ⭐⭐⭐⭐⭐ Balances theoretical contribution with practical deployment value; highly significant for fine-tuning on consumer GPUs.