Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis¶

Conference: ACL 2026
arXiv: 2502.04416
Code: GitHub
Area: Model Compression / MoE
Keywords: FFN-to-MoE, activation pattern analysis, shared experts, analytical routing, post-training compression

TL;DR¶

This paper proposes an analytical post-training framework that restructures dense FFN layers into sparse MoE layers by analyzing neuron activation patterns. It assigns high-frequency neurons to a shared expert, clusters low-frequency neurons into routed experts, and constructs routers directly from activation statistics. The conversion takes minutes and yields a 1.17× speedup; an optional fine-tuning step on only 2k samples further improves quality.

Background & Motivation¶

Background: MoE architectures decouple parameter scale from computational cost via sparse activation, yet conventional approaches require training MoE models from scratch at prohibitive cost.

Limitations of Prior Work: (1) Existing dense-to-MoE methods (e.g., MoEfication) rely on weight clustering and ignore differences in neuron activation frequency; (2) methods such as LLaMA-MoE require continued training on 200B tokens to recover quality; (3) a critical observation has been overlooked — neuron activation frequencies follow a bimodal distribution, with a small subset always active and the majority conditionally active.

Key Challenge: Treating always-active high-frequency neurons and conditionally active low-frequency neurons uniformly forces the router to activate most experts for nearly every input, undermining the sparsity of MoE.

Goal: Exploit the bimodal structure of activation patterns to design an analytical FFN-to-MoE conversion method that requires no large-scale training.

Key Insight: FFN hidden-layer activations are highly sparse, and neuron activation frequencies are bimodally distributed. High-frequency neurons are therefore assigned to a shared expert, low-frequency neurons are clustered into routed experts by co-activation similarity, and the router is constructed directly from activation statistics.

Core Idea: The structured partition into shared and routed experts leverages the natural structure of activations, enabling the router to select only among genuinely input-dependent experts.

Method¶

Overall Architecture¶

The framework consists of three stages: (A) Activation Pattern Analysis — computing the activation rate \(\mu_i\) of each neuron using a small calibration dataset; (B) Expert Construction — high-frequency neurons are assigned to the shared expert, while low-frequency neurons are clustered into routed experts via a balanced assignment algorithm; (C) Analytical Router — the routing function is constructed directly from activation statistics without any training.
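
Stage (A) can be illustrated with a minimal sketch. It assumes the hidden activations of the calibration tokens are already available as plain lists and that "top-\(K_a\)" means the \(K_a\) largest activations by absolute value; the function name `activation_rates` and the tie-breaking rule are hypothetical choices made for illustration, not details taken from the paper.

```python
def activation_rates(hidden, k_a):
    """Estimate the activation rate mu_i of every neuron.

    hidden: list of per-token hidden vectors (lists of floats, equal length).
    k_a:    number of top activations counted per token.
    Returns a list mu where mu[i] is the fraction of tokens for which
    neuron i is among the k_a largest-magnitude activations.
    """
    num_neurons = len(hidden[0])
    counts = [0] * num_neurons
    for h in hidden:
        # Assumption: rank by |h_i|; ties are broken by the lower neuron index.
        top = sorted(range(num_neurons), key=lambda i: (-abs(h[i]), i))[:k_a]
        for i in top:
            counts[i] += 1
    return [c / len(hidden) for c in counts]


# Toy calibration set: 3 tokens, 4 neurons. Neuron 0 is in the top-2 every time.
calib = [
    [3.0, 0.1, -2.0, 0.0],
    [2.5, 1.0, 0.2, 0.0],
    [4.0, 0.0, 0.1, -3.0],
]
mu = activation_rates(calib, k_a=2)
```

In this toy input neuron 0 has rate 1.0 while the other three each have rate 1/3, a small-scale version of the always-active versus conditionally-active split described above.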

Key Designs¶

Shared/Routed Expert Partition Based on Activation Rate:
- Function: Constructs a natural expert partition by exploiting the bimodal activation structure.
- Mechanism: The activation rate \(\mu_i\) (proportion of occurrences in the top-\(K_a\) activations) is computed for each neuron. High-frequency neurons enter the shared expert \(E^s\) (always active); the remainder are clustered into routed experts \(E_i^r\) by activation pattern similarity.
- Design Motivation: High-frequency neurons are important for nearly all inputs; distributing them across routed experts would force most experts to be activated at all times, destroying sparsity.
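
The partition described in the bullets above can be sketched as follows. This is only an illustration under stated assumptions: the shared expert is taken to be the `num_shared` neurons with the highest rate, co-activation similarity is measured with Jaccard overlap of the token sets on which each neuron fired, and the balanced clustering is a simple greedy assignment around caller-supplied seed neurons. All function names, the similarity measure, and the seeding scheme are hypothetical; the paper's actual balanced assignment algorithm is not specified in this summary.

```python
def split_shared_routed(mu, num_shared):
    """Return (shared, routed) neuron ids; the highest-rate neurons are shared."""
    order = sorted(range(len(mu)), key=lambda i: (-mu[i], i))
    return sorted(order[:num_shared]), sorted(order[num_shared:])


def jaccard(a, b):
    """Co-activation similarity between two sets of token ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def balanced_assign(patterns, seeds, capacity):
    """Greedily group neurons into equally sized routed experts.

    patterns: dict neuron_id -> set of token ids on which the neuron fired.
    seeds:    one neuron id per expert, used as that expert's centre.
    capacity: maximum number of neurons per expert (seed included).
    """
    experts = [[s] for s in seeds]
    rest = [n for n in patterns if n not in seeds]
    # All (similarity, neuron, expert) candidates, most similar first.
    candidates = sorted(
        ((jaccard(patterns[n], patterns[s]), n, e)
         for n in rest for e, s in enumerate(seeds)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    assigned = set()
    for _, n, e in candidates:
        if n in assigned or len(experts[e]) >= capacity:
            continue  # neuron already placed, or expert is full
        experts[e].append(n)
        assigned.add(n)
    return experts


mu = [1.0, 0.3, 0.9, 0.2, 0.1, 0.25]
shared, routed = split_shared_routed(mu, num_shared=2)   # shared == [0, 2]
patterns = {1: {0, 1, 2}, 3: {0, 1}, 4: {7, 8}, 5: {8, 9}}
experts = balanced_assign(patterns, seeds=[1, 4], capacity=2)
```

The capacity limit is what keeps the routed experts the same size, so that each selected expert costs the same amount of compute at inference time.
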
Analytical Router Construction:
- Function: Determines which routed experts to activate for each input without any training.
- Mechanism: Minimizing the reconstruction error \(\|F_{MoE}(\mathbf{x}) - F(\mathbf{x})\|^2\) reduces to minimizing the output contribution of the inactive experts. The \(L_1\) norm of each expert's hidden state serves as a proxy for its contribution, and the router selects the top-\(N_k\) experts with the largest norms.
- Design Motivation: This bypasses costly router training by deriving routing signals directly from the activation statistics of the original FFN.
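
The selection rule above can be sketched in a few lines. Since an FFN output is a sum of per-neuron contributions, dropping an expert removes exactly the contribution of its neurons, which is why a per-expert \(L_1\) norm of the hidden state is a plausible proxy. The sketch below assumes the routed neurons' hidden activations for one token are given as a list; the names `route` and `mask_inactive` are hypothetical. It computes the full hidden state purely for illustration, whereas a practical router has to estimate these scores more cheaply, and how the paper does so is not covered in this summary.

```python
def route(hidden, experts, n_k):
    """Pick the n_k routed experts with the largest L1 hidden-state norm.

    hidden:  hidden activations of the routed neurons for one token.
    experts: list of lists of neuron ids (one list per routed expert).
    Returns (selected expert indices, per-expert scores).
    """
    scores = [sum(abs(hidden[i]) for i in neurons) for neurons in experts]
    ranked = sorted(range(len(experts)), key=lambda e: (-scores[e], e))
    return sorted(ranked[:n_k]), scores


def mask_inactive(hidden, experts, selected):
    """Zero the activations of every neuron whose expert was not selected."""
    keep = {i for e in selected for i in experts[e]}
    return [h if i in keep else 0.0 for i, h in enumerate(hidden)]


hidden = [0.5, -2.0, 0.1, 0.0, 1.5, -0.2]
experts = [[0, 1], [2, 3], [4, 5]]
selected, scores = route(hidden, experts, n_k=2)      # selected == [0, 2]
sparse_hidden = mask_inactive(hidden, experts, selected)
```

Here the middle expert has the smallest norm, so it is the one dropped, and only its two activations are zeroed before the down projection.
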
Hierarchical Sparsity (Recursive Application to Existing MoE):
- Function: Achieves finer-grained sparsity by recursively applying the framework within each expert of an existing MoE model.
- Mechanism: The same shared/routed partition is applied recursively to the FFN of each expert in an MoE model.
- Design Motivation: While the dense→MoE pipeline targets dense models, recursive application extends the method to further accelerate existing MoE models.
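
The hierarchical idea can be pictured with a toy sketch in which each expert of an existing MoE model is treated as a small dense FFN and partitioned again. Everything here is an illustrative assumption: `restructure_ffn` is a hypothetical stand-in for the full dense-to-MoE step, and it deals the low-rate neurons out round-robin as a placeholder for similarity-based clustering.

```python
def restructure_ffn(mu, num_shared, num_routed):
    """Toy stand-in for the dense-to-MoE step applied to one FFN.

    mu: activation rates of this FFN's neurons.
    The num_shared highest-rate neurons form the shared sub-expert; the rest
    are dealt round-robin into num_routed groups (placeholder for clustering).
    """
    order = sorted(range(len(mu)), key=lambda i: (-mu[i], i))
    shared, rest = order[:num_shared], order[num_shared:]
    routed = [sorted(rest[e::num_routed]) for e in range(num_routed)]
    return {"shared": sorted(shared), "routed": routed}


def restructure_moe(per_expert_mu, num_shared, num_routed):
    """Apply the same partition inside every expert of an existing MoE layer."""
    return [restructure_ffn(mu, num_shared, num_routed) for mu in per_expert_mu]


# Two existing experts with six neurons each, rates measured per expert.
per_expert_mu = [
    [0.9, 0.1, 0.2, 0.8, 0.3, 0.05],
    [0.2, 0.95, 0.1, 0.3, 0.9, 0.4],
]
nested = restructure_moe(per_expert_mu, num_shared=2, num_routed=2)
```

The result is a two-level structure: the original router still picks experts, and within each picked expert only the shared sub-expert plus a subset of its routed sub-experts would be evaluated.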

Loss & Training¶

The analytical restructuring itself requires no training, so the converted model can be deployed directly. An optional fine-tuning step on 2k samples with a standard language modeling loss further improves quality.

Key Experimental Results¶

Main Results¶

Configuration	Speedup	Processing Time	Quality
Training-free	1.17×	Minutes	Usable
+2k fine-tuning	1.17×	Minutes + fine-tuning	Surpasses methods requiring orders of magnitude more resources

Ablation Study¶

Configuration	Key Metric	Note
Unified vs. partitioned experts	Partition significantly better	Validates the value of bimodal splitting
Analytical vs. learned router	Analytical is comparable	No router training required
Recursive hierarchical sparsity	Effective	Further accelerates MoE models

Key Findings¶

The bimodal activation pattern is pervasive across multiple LLM architectures (LLaMA-2, Mistral, etc.).
Minutes of processing plus 2k-sample fine-tuning suffice to surpass methods that require 200B tokens of training.
The analytical router achieves quality comparable to learned routers at drastically reduced cost.

Highlights & Insights¶

The method is observation-driven — starting from the bimodal distribution of activation patterns, the design is natural and elegant.
The efficiency contrast of "minutes of processing vs. 200B-token training" is highly compelling.
Recursive application extends the method beyond dense models to existing MoE models.

Limitations & Future Work¶

The 1.17× speedup is relatively modest and may be insufficient for extreme low-latency scenarios.
The proportion of neurons assigned to the shared expert requires model-specific tuning.
The method has not been evaluated on vision or multimodal models.
Future work could combine the approach with orthogonal techniques such as quantization for further acceleration.

Comparison with Related Work¶

vs. MoEfication: Distinguishes shared and routed neurons rather than applying uniform clustering, fundamentally exploiting activation structure.
vs. LLaMA-MoE: Eliminates the need for large-scale continued training, reducing cost by orders of magnitude.
vs. Activation Sparsity Methods (e.g., DejaVu): Operates at a different granularity and can be used in combination.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of the bimodal activation observation and the analytical router is original.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple models and tasks against strong baselines.
Writing Quality: ⭐⭐⭐⭐⭐ The observation→motivation→method→validation logical chain is exemplary.
Value: ⭐⭐⭐⭐ Directly applicable to efficient LLM inference.