FloE: On-the-Fly MoE Inference on Memory-constrained GPU¶
Conference: ICML 2025
arXiv: 2505.05950
Code: To be confirmed
Area: Model Compression
Keywords: MoE Inference, Expert Offloading, Activation Sparsity, Ultra-low-bit Quantization, Memory-constrained GPU, Inference Acceleration
TL;DR¶
Proposes FloE, an on-the-fly MoE inference system tailored for consumer-grade GPUs. It combines intra-expert hybrid compression (contextual sparsification plus ultra-low-bit quantization) with dual predictors so that expert transmission can be pipelined with computation. FloE runs Mixtral-8×7B with as little as 11GB of VRAM and, on a single RTX 3090, achieves a 48.7x speedup over DeepSpeed-MII with only 4.4%--7.6% performance degradation.
Background & Motivation¶
MoE models (e.g., DeepSeek-R1, Mixtral) reduce inference computation through sparse activation, but all experts must still be held in memory, including those that are inactive for the current token. Mixtral-8×7B in FP16 requires 94GB of VRAM, 70% of which is "wasted" on inactive experts.
Expert offloading, which stores expert parameters in CPU memory and loads them to the GPU on demand, is a natural solution. However, it faces three core bottlenecks:
- PCIe 4.0 bandwidth is only 32GB/s, far below the GPU's internal HBM bandwidth (300GB/s).
- A single Mixtral expert has 300MB of FP16 parameters, taking ~15ms to transfer but only ~5ms to compute.
- Prior works adopt uniform ultra-low-bit quantization to reduce transmission overhead, which severely degrades generation quality.
Core Problem: How to hide expert I/O overhead within model computation to achieve on-the-fly inference without significantly degrading performance?
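As a back-of-the-envelope check on the transfer and compute times quoted above, the sketch below estimates how much an expert's transfer would have to shrink to fit inside its compute time. It assumes transfer time scales linearly with the number of bytes moved, which ignores fixed launch and packing overheads. The function name is illustrative and does not come from the paper.

```python
def required_compression(transfer_ms: float, compute_ms: float) -> float:
    """Smallest size reduction that lets a transfer finish within compute time.

    Assumption: transfer time is proportional to the number of bytes moved.
    """
    return transfer_ms / compute_ms


# Figures quoted above for one FP16 Mixtral expert.
transfer_ms = 15.0
compute_ms = 5.0

ratio = required_compression(transfer_ms, compute_ms)
print(f"transfer must shrink by at least {ratio:.1f}x to overlap with compute")
```

Under this simple model, any compression beyond roughly 3x leaves headroom for prediction errors and packing costs.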
Method¶
Overall Architecture¶
FloE consists of three core components: hybrid expert compression, dual sparsity predictors, and system-level co-optimization.
1. Hybrid Expert Compression¶
Key Discovery: The three projection matrices of an MoE expert (gate, up, and down) show different sensitivities to compression.
Contextual Sparsification — for gate and down projections¶
- Observation: MoE experts exhibit significant internal activation sparsity: the absolute values of the up projection outputs are heavily concentrated near zero.
- Theoretical Proof: Pruning according to the magnitude of the down projection input yields the smallest error, followed by the up projection, whereas the gate projection is the most sensitive.
- Design Choice: Generate a mask based on the output magnitude of the up projection, applying channel-level sparsification to the gate and down projections.
- At 90% sparsity, the perplexity of Mixtral increases by only about 0.5%.
Sparsified forward propagation: \(\mathbf{a}^S(x) = (\text{SiLU}(\mathbf{x}\mathbf{W}^{\text{gate}}) \odot S_t(\mathbf{x}\mathbf{W}^{\text{up}})) \mathbf{W}^{\text{down}}\), where \(S_t(\cdot)\) zeroes every entry whose magnitude falls below the threshold \(t\).
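The sparsified forward pass can be illustrated with a small NumPy sketch. It treats \(S_t\) as a magnitude threshold (entries with absolute value below \(t\) are set to zero) and shows that computing only the active gate columns and down rows gives the same output as masking a dense computation. The toy dimensions, the quantile-based choice of \(t\), and all function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def threshold(v, t):
    """S_t: keep entries with |v| >= t, zero the rest."""
    return np.where(np.abs(v) >= t, v, 0.0)


def expert_forward_masked(x, w_gate, w_up, w_down, t):
    """Reference: dense compute with the thresholded up-projection output."""
    return (silu(x @ w_gate) * threshold(x @ w_up, t)) @ w_down


def expert_forward_sparse(x, w_gate, w_up, w_down, t):
    """Same result, but touching only the active gate columns and down rows."""
    up = x @ w_up
    active = np.flatnonzero(np.abs(up) >= t)
    gate_active = silu(x @ w_gate[:, active])
    return (gate_active * up[active]) @ w_down[active, :]


rng = np.random.default_rng(0)
d_model, d_ff = 16, 64  # toy sizes, not Mixtral's
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

up_mag = np.abs(x @ w_up)
t = np.quantile(up_mag, 0.9)  # threshold giving roughly 90% sparsity
n_active = int((up_mag >= t).sum())

dense = expert_forward_masked(x, w_gate, w_up, w_down, t)
sparse = expert_forward_sparse(x, w_gate, w_up, w_down, t)
print(n_active, np.allclose(dense, sparse))
```

Because the mask depends only on the up projection output, the gate and down channels that are masked out never need to be loaded, which is what makes channel-level offloading possible.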
Ultra-low-bit Quantization — for up projection¶
- Observation: The up projection is the least sensitive to quantization (under INT2, its perplexity degradation is only 46% of gate's and 27% of down's).
- Analysis: The MLP can be viewed as a key-value memory model, where up/gate acts as keys to select activated values (down). As a key, up can tolerate more quantization noise.
- Quantize the up projection using HQQ INT2.
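To give a feel for what 2-bit storage means, the sketch below applies plain asymmetric min-max quantization over groups of weights. This is deliberately not HQQ, which additionally optimizes the quantization parameters; the simpler scheme is swapped in only to show the four-level code layout and the resulting error bound. The group size, function names, and toy matrix are assumptions.

```python
import numpy as np


def quantize_int2(w, group_size=64):
    """Asymmetric min-max 2-bit quantization over groups of consecutive weights."""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 3.0              # 2 bits give 4 levels: 0..3
    scale = np.where(scale == 0, 1.0, scale)   # guard against constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, 3).astype(np.uint8)
    return codes, scale, w_min


def dequantize_int2(codes, scale, w_min, shape):
    return (codes * scale + w_min).reshape(shape)


rng = np.random.default_rng(0)
w_up = rng.standard_normal((32, 128)).astype(np.float32)  # toy stand-in matrix

codes, scale, w_min = quantize_int2(w_up)
w_hat = dequantize_int2(codes, scale, w_min, w_up.shape)

# Rounding to the nearest level bounds the error by half a step per group.
err = np.abs(w_up - w_hat).reshape(-1, 64)
print(bool(np.all(err <= scale / 2 + 1e-4)))
```

The codes are held one per byte here for readability; an actual 2-bit format would pack four codes into each byte and store the per-group scale and offset alongside them.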
Final Compression Effect: 9.3x parameter compression per expert.
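A rough estimate shows how a ratio of this size can arise from combining the two techniques. The sketch assumes 90% channel sparsity on the gate and down projections, a 2-bit up projection, and FP16 as the baseline, and it ignores quantization scales, offsets, and index metadata. It is an illustrative calculation, not the paper's own accounting.

```python
def expert_compression_ratio(sparsity: float, quant_bits: int, fp_bits: int = 16) -> float:
    """Ratio of original to compressed expert size, in units of one FP16 matrix.

    Assumptions: gate and down keep a (1 - sparsity) fraction of channels at
    full precision; up is kept whole at quant_bits per weight; all metadata
    (scales, offsets, channel indices) is ignored.
    """
    gate = down = 1.0 - sparsity
    up = quant_bits / fp_bits
    return 3.0 / (gate + down + up)


estimate = expert_compression_ratio(0.9, 2)
print(f"{estimate:.2f}x")
```

Under these assumptions the estimate comes out at about 9.2x, in the same range as the reported figure.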
2. Dual Sparsity Predictors¶
Inter-expert Predictor (learning-based)¶
- Uses the current layer's hidden states to predict which experts should be activated in the next layer.
- The cosine similarity between hidden states of adjacent layers stays above 0.95 (except for the first layer), which makes this look-ahead prediction feasible.
- Employs a single-layer MLP (32K parameters) for shallow layers, and a two-layer MLP (2M parameters) for deep layers.
- Average accuracy: 0.88.
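A minimal sketch of the shallow-layer variant is shown below: a single linear layer maps the current hidden state to one score per next-layer expert, and the two highest-scoring experts are prefetched. It uses Mixtral's hidden size of 4096 and 8 experts per layer, for which a bias-free linear layer has 4096 × 8 = 32,768 weights, in line with the 32K figure. The weights here are random stand-ins, and the function name and the bias-free layout are assumptions.

```python
import numpy as np

HIDDEN_DIM = 4096   # Mixtral hidden size
NUM_EXPERTS = 8
TOP_K = 2


def predict_next_experts(hidden, weight, top_k=TOP_K):
    """Score each next-layer expert from the current hidden state; return top-k ids."""
    logits = hidden @ weight                  # shape: (NUM_EXPERTS,)
    return np.argsort(logits)[::-1][:top_k]


rng = np.random.default_rng(0)
# Untrained stand-in weights; a real predictor would be trained on observed routing.
weight = rng.standard_normal((HIDDEN_DIM, NUM_EXPERTS)).astype(np.float32) * 0.02
hidden = rng.standard_normal(HIDDEN_DIM).astype(np.float32)

experts = predict_next_experts(hidden, weight)
print(weight.size, experts.tolist())
```

A misprediction only costs an on-demand load of the correct expert, so the predictor affects latency rather than correctness.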
Intra-expert Predictor (reuse-based)¶
- Multiplies the current layer's hidden states with the next layer's up projection matrix (reused) to estimate the sparsity distribution.
- No extra parameters, zero additional memory overhead.
- Average recall: 0.95.
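The reuse-based idea can be sketched as follows: the current hidden state is multiplied by the next layer's up projection, the largest-magnitude channels are taken as the predicted active set, and recall is measured against the set obtained from the true next-layer input. The toy dimensions, the noise model standing in for the gap between adjacent layers, and the helper names are all assumptions.

```python
import numpy as np


def top_channels(hidden, w_up, keep):
    """Indices of the `keep` largest-magnitude up-projection outputs."""
    mags = np.abs(hidden @ w_up)
    return set(np.argsort(mags)[-keep:].tolist())


def recall(predicted, actual):
    """Fraction of truly active channels that were predicted."""
    return len(predicted & actual) / len(actual)


rng = np.random.default_rng(0)
d_model, d_ff, keep = 64, 256, 26   # toy sizes; keep about 10% of channels
w_up_next = rng.standard_normal((d_model, d_ff))

h_next = rng.standard_normal(d_model)                  # true next-layer input
h_curr = h_next + 0.1 * rng.standard_normal(d_model)   # similar current-layer state

predicted = top_channels(h_curr, w_up_next, keep)
actual = top_channels(h_next, w_up_next, keep)
print(f"recall = {recall(predicted, actual):.2f}")
```

Since the up projection is already needed for the expert's own computation, this prediction adds no parameters beyond what is already stored.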
3. System Co-optimization¶
- Efficient Sparse GEMV Kernel: Column-major storage + selective column loading implemented via Triton.
- Compact Asynchronous Transmission: Column-aligned storage of gate and down projections, AVX-512 instructions + multi-threaded packing + multi-stream asynchronous transmission.
- Achieves 88% of peak transmission bandwidth, which is 12.6x faster than PyTorch's native implementation.
Key Experimental Results¶
End-to-End Inference Speed (RTX 3090, 12GB VRAM)¶
| Method | Speed relative to Mixtral-GPU | Speedup vs. DeepSpeed-MII |
|---|---|---|
| DeepSpeed-MII | Extremely slow (FP16 Offloading) | 1x |
| Mixtral-Offloading | - | 18.7x |
| Fiddler | - | 15.5x |
| FloE | 91% of Mixtral-GPU | 48.7x |
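
Since every speedup in the table is expressed against the DeepSpeed-MII baseline (1x), the pairwise gap between FloE and the other offloading systems follows by division. The short sketch below derives those ratios from the table values; the variable names are illustrative.

```python
speedup_vs_deepspeed = {
    "Mixtral-Offloading": 18.7,
    "Fiddler": 15.5,
    "FloE": 48.7,
}

floe = speedup_vs_deepspeed["FloE"]
# Ratio of FloE's speedup to each other system's speedup.
relative = {
    name: floe / s for name, s in speedup_vs_deepspeed.items() if name != "FloE"
}

for name, r in relative.items():
    print(f"FloE vs {name}: {r:.2f}x")
```

This works out to roughly 2.6x over Mixtral-Offloading and 3.1x over Fiddler.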
Deployment Capabilities¶
| Metric | Value |
|---|---|
| Minimum VRAM Requirement | 11GB |
| Memory Footprint Compression | 8.5x |
| Per-expert Parameter Compression | 9.3x |
| Performance Degradation | 4.4%--7.6% |
Downstream Tasks (Average Accuracy across 7 Zero/Few-shot Tasks)¶
- FloE-W^up (sparsification only) outperforms CATS by 2.8% at 80% sparsity and by 9.8% at 90% sparsity.
- FloE (sparsified + quantized) still outperforms HQQ INT3 and CHESS.
- Sparsification and quantization errors are mostly independent and additive.
Sparse Kernel Speedup (RTX 3090 Single Expert)¶
| Sparsity | Speedup |
|---|---|
| 50% | 1.43x |
| 70% | 1.72x |
| 90% | 1.92x |
Highlights & Insights¶
- Matrix-level differentiated compression is the core innovation: up projection quantization + gate/down projection sparsification preserves quality significantly better than uniform quantization.
- The dual-predictor design is elegant: the learning-based inter-expert predictor is parameterized but lightweight, while the reuse-based intra-expert predictor is parameter-free with zero overhead, collectively enabling a seamless computation-transmission pipeline.
- Theory and practice are well aligned: the authors prove that pruning based on the down-projection input is optimal, and experimentally verify that using the up-projection output is near-optimal while also making it possible to predict which channels to transfer.
- Comprehensive system engineering: Full-stack optimization from algorithms and kernels to transmission protocols renders the end-to-end acceleration highly tangible.
- Deployability on consumer-grade GPUs: Running Mixtral-8×7B on just 11GB of VRAM drastically lowers the barrier to using MoE models.
Limitations & Future Work¶
- A full evaluation is conducted only on Mixtral-8×7B; generalization to newer MoE architectures such as DeepSeek-V2/V3 remains to be validated.
- Currently only supports single-batch inference (latency-sensitive scenarios), while high-throughput batch inference is not yet covered.
- The inter-expert predictor requires training, inducing additional costs when scaling to new models.
- The trade-off between sparsity and performance is more pronounced in knowledge-intensive tasks like MMLU.
Related Work & Insights¶
- Mixtral-Offloading (Eliseev & Mazur, 2023): A pioneering scheme featuring unified quantization, prediction, and caching.
- CATS (Lee et al., 2024a): An activation sparsification method for dense LLMs.
- Fiddler (Kamahori et al., 2024): A CPU-GPU co-execution scheme.
- Insight: MoE inference optimization should exploit redundancy "within experts" rather than focusing solely on routing between experts.
Rating¶
- Novelty: ⭐⭐⭐⭐ (System-level innovation combining hybrid compression and dual predictors)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Dual dimensions of efficiency and effectiveness, verified across multiple GPUs and tasks)
- Writing Quality: ⭐⭐⭐⭐ (Rich illustrations, clear logical flow from motivation to observation and design)
- Value: ⭐⭐⭐⭐⭐ (Enables on-the-fly MoE execution on consumer-grade GPUs, offering extremely high practical value)