Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation¶
Conference: CVPR 2026
arXiv: 2603.12577
Code: GitHub
Area: Parameter-Efficient Fine-Tuning / Large Language Models / Mixture of Experts
Keywords: Parameter-efficient fine-tuning, expert pyramid, deconvolution projection, contrastive task embedding, MoE-LoRA
TL;DR¶
This paper proposes Expert Pyramid Tuning (EPT), which transplants the multi-scale Feature Pyramid Network (FPN) idea from computer vision into the MoE-LoRA paradigm. By combining a shared low-dimensional meta-knowledge subspace, deconvolution expert projections with kernels of varying scales, and contrastive task embeddings, EPT reaches an average score of 87.0% on GLUE with only 0.41M parameters per task, roughly half the parameter count of existing MoE-LoRA variants.
Background & Motivation¶
Background: LoRA achieves strong performance in single-task fine-tuning, and MoE-LoRA variants (MoELoRA, MoRE, MixLoRA) mitigate multi-task negative transfer by using gated routing to assign tokens to different low-rank experts.
Limitations of Prior Work:
- Existing MoE-LoRA variants adopt a uniform expert architecture (identical rank and capacity), ignoring the hierarchical nature of task complexity—simple tasks (e.g., sentiment classification SST-2) require only high-level semantic abstraction, while complex tasks (e.g., grammaticality judgment CoLA) demand fine-grained syntactic operations.
- An empirical sweep of ranks from 1 to 32 shows that the optimal LoRA rank varies drastically across tasks (e.g., RTE peaks at rank 16, CoLA at rank 8), confirming the limitation of uniform rank assignment.
- Each expert independently learning a full LoRA matrix introduces parameter redundancy, as shared general knowledge is repeatedly encoded.
Key Challenge: Tasks of varying complexity require feature representations at different granularities, yet the one-size-fits-all expert design of existing MoE-LoRA frameworks cannot accommodate this requirement.
Goal: To enable a multi-task PEFT framework to adaptively allocate experts of different granularities according to task complexity, while eliminating parameter redundancy across independent experts, all without sacrificing parameter efficiency.
Key Insight: Drawing inspiration from FPN's multi-scale philosophy—using a shared low-dimensional meta-knowledge seed and constructing a "parameter pyramid" via deconvolution projections with kernels of varying scales.
Core Idea: All experts share a single low-dimensional linguistic prior seed, which is projected to different granularities through deconvolution kernels of varying sizes, forming a parameter pyramid ranging from fine-grained to coarse-grained representations.
Method¶
Overall Architecture¶
Input token \(x\) → frozen pretrained weights \(W_0\) → EPT layer (shared meta-knowledge subspace \(Z_\text{meta}\) → \(N\) deconvolution experts with different kernel scales → Adaptive LoRA Pruner for dimension alignment → Top-K routing → weighted fusion). At inference, the module can be re-parameterized and merged into the backbone with no additional latency.
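To make the pipeline concrete, below is a minimal PyTorch sketch of this forward pass, assembled from the description in this note; the class name, the square delta-weights, the 0.02 init scale, and the routing details are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EPTLayer(nn.Module):
    """A minimal sketch of one EPT layer; shapes and init scales are
    our assumptions, not the authors' exact implementation."""

    def __init__(self, d_model=768, rank=8,
                 scales=(2, 2, 4, 4, 6, 6, 8, 8), top_k=2):
        super().__init__()
        self.d_model, self.scales, self.top_k = d_model, scales, top_k
        h = d_model // min(scales)      # seed side large enough for every slice
        # Shared meta-knowledge seed Z_meta = B @ A, Gaussian-initialized
        # (not zero-initialized as in vanilla LoRA).
        self.B = nn.Parameter(torch.randn(h, rank) * 0.02)
        self.A = nn.Parameter(torch.randn(rank, h) * 0.02)
        # One zero-initialized deconvolution kernel per expert, so every
        # delta-W is exactly zero at step 0 (pretrained weights untouched).
        self.kernels = nn.ModuleList(
            nn.ConvTranspose2d(1, 1, kernel_size=s, stride=s, bias=False)
            for s in scales)
        for k in self.kernels:
            nn.init.zeros_(k.weight)
        self.router = nn.Linear(d_model, len(scales))

    def expert_weight(self, e):
        """ALP slice + deconvolution: scale-specific seed -> full delta-W."""
        s = self.scales[e]
        n = self.d_model // s                     # rows/cols before upsampling
        seed = self.B[:n] @ self.A[:, :n]         # (n, n) scale-specific seed
        return self.kernels[e](seed[None, None])[0, 0]  # (d_model, d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        gate, idx = self.router(x).topk(self.top_k, dim=-1)
        probs = F.softmax(gate, dim=-1)           # weights over Top-K experts
        weights = [self.expert_weight(e) for e in range(len(self.scales))]
        delta = torch.zeros_like(x)
        for j in range(self.top_k):
            for e, w in enumerate(weights):
                sel = (idx[..., j] == e).unsqueeze(-1).float()
                delta = delta + sel * probs[..., j:j+1] * (x @ w)
        return delta                              # added to the frozen W0 x path
```

Here the ALP step corresponds to the per-scale slicing of \(B\) and \(A\) inside `expert_weight`, and the zero-initialized kernels guarantee that every expert's delta is exactly zero before training begins.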
Key Designs¶
- Shared Meta-knowledge Subspace
- Function: Encodes universal linguistic patterns shared across all tasks, serving as the "seed" for all experts.
- Mechanism: A low-dimensional matrix \(Z_\text{meta} = B \cdot A \in \mathbb{R}^{h \times w}\) (with \(h, w \ll d_\text{model}\)) is learned. Unlike the zero initialization in conventional LoRA, both \(A\) and \(B\) are initialized from random Gaussian distributions, ensuring the seed encodes rich, non-degenerate representations from the outset of training.
- Design Motivation: Avoids the parameter redundancy caused by independent expert learning in MoE-LoRA, encoding shared general knowledge only once.
- Pyramid Projection Mechanism
- Function: Projects the low-dimensional meta-knowledge seed into high-dimensional feature spaces at different granularities, forming a parameter pyramid.
- Mechanism: \(N\) deconvolution experts are defined, each with its own kernel scale \(s_i\) (e.g., \(\{2, 2, 4, 4, 6, 6, 8, 8\}\)), where \(W_i = \text{Deconv}(Z_\text{meta}; K_i)\). Kernels are initialized to zero so that the pretrained weights are unperturbed at initialization. Small-kernel experts capture local fine-grained patterns, while large-kernel experts capture global semantic dependencies.
- Adaptive LoRA Pruner (ALP): Dynamically slices the \(B\) and \(A\) matrices for experts at different scales to produce scale-specific seeds, ensuring output dimensions are consistent with pretrained weights. A dimension-aware scaling factor \(d_t / T\) is introduced to balance the update frequency disparity between shared and task-specific parameters.
- Design Motivation: Just as FPN uses different resolutions to detect objects of different sizes, parameters at different granularities are matched to tasks of different complexity.
- Contrastive Task Embedding Module
- Function: Learns discriminative embeddings for each task to assist the router in accurately selecting experts.
- Mechanism: Parameterized embedding matrices are maintained for the \(T\) tasks, and a temperature-scaled contrastive loss maximizes the mutual information between samples and their corresponding task embeddings (a sketch follows this list).
- Evidence: PCA visualization confirms that similar tasks (QNLI/MNLI) naturally cluster together, while dissimilar tasks (STS-B/CoLA) are clearly separated.
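A minimal sketch of such a temperature-scaled objective (we assume an InfoNCE-style formulation; the paper's exact loss may differ), with the \(\tau = 0.05\) reported in the next section:

```python
import torch.nn.functional as F

def contrastive_task_loss(sample_repr, task_ids, task_emb, tau=0.05):
    # Pull each pooled sample representation toward its own task embedding
    # and push it away from the other T - 1 embeddings.
    z = F.normalize(sample_repr, dim=-1)      # (batch, d) sample features
    e = F.normalize(task_emb, dim=-1)         # (T, d) task embedding matrix
    logits = z @ e.t() / tau                  # (batch, T) scaled similarities
    return F.cross_entropy(logits, task_ids)  # task_ids: (batch,) long indices
```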
Loss & Training¶
- Total loss: \(\mathcal{L}_\text{total} = \mathcal{L}_\text{gen} + 0.1 \cdot \mathcal{L}_\text{con}\) (temperature \(\tau = 0.05\))
- Balanced data sampling: each task is sampled with probability \(1/T\)
- AdamW, lr=\(3\times10^{-4}\), linear decay with 500-step warmup, 5 epochs, batch size 32
- T5-base: 1×A100; LLaMA2-7B: 3×A800
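Putting these pieces together, a hypothetical training step might look as follows; `model`, `task_loaders`, `num_tasks`, `task_emb`, and the pooled-representation attribute are placeholder names, not the released code:

```python
import random
import torch

# Hypothetical wiring of one training step, reusing contrastive_task_loss
# from the sketch above; all names here are placeholders.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

task_id = random.randrange(num_tasks)   # balanced sampling: p = 1/T per task
batch = next(task_loaders[task_id])     # batch size 32 per the setup above
out = model(**batch)                    # frozen backbone + EPT deltas
l_con = contrastive_task_loss(out.pooled, batch["task_ids"], task_emb, tau=0.05)
loss = out.loss + 0.1 * l_con           # L_total = L_gen + 0.1 * L_con
loss.backward()
optimizer.step()
optimizer.zero_grad()
```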
Key Experimental Results¶
Main Results¶
| Method | params/task | MNLI | QQP | QNLI | SST-2 | STS-B | MRPC | RTE | CoLA | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| LoRA (r=8) | 0.39M | 85.8 | 89.2 | 93.1 | 93.2 | 90.4 | 89.9 | 76.3 | 62.8 | 85.1 |
| MoELoRA | 0.81M | 86.3 | 90.4 | 93.2 | 94.2 | 89.8 | 90.7 | 79.9 | 65.3 | 86.2 |
| MoRE | 0.81M | 85.6 | 90.2 | 93.1 | 93.9 | 89.9 | 90.7 | 77.7 | 68.7 | 86.2 |
| EPT | 0.41M | 86.4 | 90.2 | 93.6 | 94.5 | 90.0 | 90.7 | 82.0 | 68.9 | 87.0 |
EPT achieves the best or tied-best score on 6 of the 8 GLUE tasks with an average of 87.0%, using only half the parameters of MoELoRA/MoRE.
| Method | params/task | BoolQ | OBQA | ARC-E | ARC-C | AVG |
|---|---|---|---|---|---|---|
| MoRE | 4.5M | 74.7 | 80.5 | 80.0 | 64.5 | 74.9 |
| EPT | 3.3M | 76.1 | 78.4 | 81.4 | 66.2 | 75.5 |
On LLaMA2-7B commonsense reasoning benchmarks, EPT achieves a higher average score with fewer parameters.
Ablation Study¶
| Configuration | GLUE AVG | Note |
|---|---|---|
| Full EPT | 87.0 | Baseline |
| Zero init. for A & B | 86.2 | Non-degenerate seed benefits deconv reconstruction |
| w/o Top-K routing | 86.0 | Adaptive multi-scale fusion is critical |
| w/o ALP module | 86.3 | Dimension-aware scaling stabilizes training |
| w/o contrastive loss | 86.5 | Task embedding discriminability contributes to routing |
| EPT-2 (all small kernels) | 85.8 | Fine-grained only is insufficient |
| EPT-8 (all large kernels) | 86.1 | Coarse-grained only is insufficient |
| EPT-2468 (mixed kernels) | 87.0 | Multi-scale combination is optimal |
Parameter efficiency: EPT uses only 6,384 parameters per layer versus 98,304 for MoE-LoRA, a roughly 15× reduction.
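As a sanity check on these figures (our arithmetic, assuming \(d_\text{model} = 768\) for T5-base, 8 experts, rank 8, and a shared seed of height \(768/2 = 384\)), both counts are recoverable exactly:

\[
\underbrace{8 \times (2 \times 768 \times 8)}_{\text{MoE-LoRA: 8 independent rank-8 } B, A \text{ pairs}} = 98{,}304,
\qquad
\underbrace{2 \times 384 \times 8}_{\text{shared seed}} + \underbrace{\sum_i s_i^2}_{\text{kernels}} = 6{,}144 + 240 = 6{,}384.
\]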
Key Findings¶
- Mixed kernel scales (pyramid) > all large kernels > all small kernels, validating the necessity of multi-scale design.
- Expert activation analysis shows that tasks requiring coarse-grained representations preferentially activate large-kernel experts, while tasks requiring fine-grained representations activate small-kernel experts, consistent with the design intuition.
- Random Gaussian initialization outperforms zero initialization (+0.8 pp), confirming that the seed must encode rich representations from the start of training.
- Re-parameterization at inference introduces no additional computational overhead.
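A sketch of that merge step, under our assumption that routing is resolved at the task level at deployment so the gate weights are fixed scalars (the paper does not spell out the mechanics; `expert_weight` refers to the EPT layer sketch above):

```python
import torch

@torch.no_grad()
def merge_ept(linear, ept_layer, gate_weights):
    # Fold the gated expert deltas into the frozen weight once; afterwards
    # the layer runs at exactly the pretrained model's latency.
    delta = sum(g * ept_layer.expert_weight(e)
                for e, g in enumerate(gate_weights))
    linear.weight += delta.t()  # nn.Linear stores weight as (out, in)
```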
Highlights & Insights¶
- Adapting FPN's multi-scale philosophy to the PEFT domain is an elegant piece of cross-domain inspiration: the "parameter pyramid" is a direct analogue of the "feature pyramid."
- Exceptional parameter efficiency: shared meta-knowledge combined with lightweight deconvolution kernels achieves better performance with 15× fewer parameters than conventional MoE-LoRA.
- The combination of zero-initialized deconvolution kernels and Gaussian-initialized meta-knowledge ensures a well-conditioned training starting point.
- Re-parameterization at inference enables seamless deployment with no additional latency.
Limitations & Future Work¶
- The pyramid's kernel-scale configuration \(\{2, 2, 4, 4, 6, 6, 8, 8\}\) is a static hyperparameter; future work could explore automatic scale assignment.
- Validation is limited to downstream fine-tuning tasks; effectiveness in large-scale pretraining scenarios remains unknown.
- The contrastive task embedding module requires known task labels; the routing strategy for unseen tasks at inference time is not addressed.
- Experiments are conducted on relatively small models (T5-base and LLaMA2-7B); performance on larger models remains to be verified.
Related Work & Insights¶
- vs. MoELoRA/MoRE: These methods have each expert independently learn a full LoRA matrix (0.81M params), whereas EPT requires only 0.41M through shared meta-knowledge and deconvolution projection while achieving a higher average score. The key distinction is "sharing + projection" vs. "independent learning."
- vs. MixLoRA: MixLoRA (1.49M) prioritizes high-throughput inference but overlooks multi-scale requirements; EPT uses fewer parameters and achieves 1.1% higher average performance.
- vs. DCFT: DCFT similarly employs deconvolution for subspace projection but is a single-task method; EPT extends this into a multi-scale, multi-expert framework.
- Insight: The pyramid projection paradigm could be extended to adapter tuning in VLMs, where different modalities may have inherently different granularity requirements.
Rating¶
- Novelty: ⭐⭐⭐⭐ The cross-domain inspiration (FPN → PEFT) is original, though the MoE+LoRA combination framework is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 8 GLUE tasks + 4 commonsense reasoning tasks, with complete ablations and parameter efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, complete mathematical derivations, and convincing motivation.
- Value: ⭐⭐⭐⭐ Provides a more efficient framework for multi-task PEFT with strong practical applicability.