SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Conference: ICML2026
arXiv: 2602.01990
Code: https://github.com/LAMDA-CL/Prism
Area: Multimodal VLM / Continual Learning / MoE-LoRA
Keywords: SAME, MCIT, router drift, expert drift, spectral-aware update, curvature-aware Riemannian scaling, adaptive expert activation

TL;DR

SAME explicitly decomposes "catastrophic forgetting" in multimodal continual instruction tuning (MCIT) within MoE-LoRA into two independent sources: router drift and expert drift. It uses spectral-aware subspace constraints to update routers, uses historical input covariance for Riemannian preconditioning to protect experts, and removes redundant updates via task-level adaptive freezing. The method achieves the highest average accuracy among the compared MoE continual learning methods on CoIN, UCIT, and the authors' TriGap long-sequence benchmark.

Background & Motivation

Background: MLLMs (e.g., LLaVA, Qwen-VL) perform strongly in static multi-task settings after instruction tuning. In real-world deployment, however, tasks arrive sequentially, which motivates MCIT (Multimodal Continual Instruction Tuning). Recent approaches integrate MoE + LoRA into FFN layers, leveraging sparse routing for "expert specialization" to resist forgetting (e.g., MoELoRA, CL-MoE, HiDe-LLaVA).

Limitations of Prior Work: The authors conduct diagnostic experiments (Fig. 1) by saving router and expert snapshots after each task in an 8-task sequence. Testing on Task 1 revealed: (a) as new tasks arrive, the router's activation distribution for Task 1 inputs drifts significantly, meaning the same input is routed to different experts; (b) even when retraining the router on Task 1 data (with frozen experts), Task 1 accuracy continues to decline and routing entropy decreases—indicating that the experts themselves are drifting and have lost their original Task 1 functionality.

Key Challenge: In MoE-based continual learning, "forgetting" is the superposition of two decoupled sources: (i) router drift, where router updates cause old task samples to be mismatched to new experts; (ii) expert drift, where shared experts are repeatedly overwritten by new task gradients, erasing their original functions. Previous methods did not treat these separately, often leading to a "whack-a-mole" problem where fixing one causes the other to fail.

Goal: Design independent stabilization mechanisms for both the router and the experts, complemented by a training efficiency component to reduce redundant updates, creating an end-to-end rehearsal-free solution.

Key Insight: The authors borrow two classical ideas from continual learning: "gradient subspace projection" and "natural gradient/Fisher metrics." The former addresses the router (preserving old task subspaces), while the latter protects experts (preconditioning under historical input geometry).

Core Idea: Maintain an uncentered covariance \(\mathbf{C}^t\) of historical inputs as a proxy for "past geometry." Router gradients are constrained using the spectral subspaces of this covariance (rescaled inside the high-energy subspace, passed through in the near-zero-variance subspace), while expert gradients are scaled via Riemannian preconditioning using \((\mathbf{C}^{t-1})^{-1}\), combined with task-level adaptive expert freezing.

Method

Overall Architecture

The backbone is LLaVA-v1.5-7B + CLIP-L/14-336, with LoRA experts inserted into the LLM's FFN layers. Given input \(\mathbf{x}\in\mathbb{R}^d\), the output is \(\mathbf{h}=\mathbf{W}_0\mathbf{x}+\sum_{i=1}^n \omega_i \mathbf{B}_i\mathbf{A}_i\mathbf{x}\), where \(\omega_i=\mathrm{Softmax}(\mathbf{W}_G\mathbf{x})_i\) is the router output. Workflow for task \(t\): (1) Online maintenance of input covariance \(\mathbf{C}^t\) with top-\(k\) SVD to obtain \(\mathbf{V}_\parallel, \mathbf{V}_\perp\); (2) Constraint of \(\mathbf{W}_G\) updates via spectral-aware rules; (3) Expert gradient preconditioning using \((\mathbf{C}^{t-1})^{-1}\); (4) Temporary freezing of experts based on utility-importance differentials.
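The layer equation above can be made concrete with a small NumPy sketch. This is only an illustration of \(\mathbf{h}=\mathbf{W}_0\mathbf{x}+\sum_i \omega_i \mathbf{B}_i\mathbf{A}_i\mathbf{x}\) with a dense softmax router; the function names, toy dimensions, and the zero initialization of \(\mathbf{B}_i\) are assumptions made for the example and are not taken from the authors' code.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def moe_lora_forward(x, W0, W_G, A, B):
    """h = W0 x + sum_i omega_i * B_i A_i x, with omega = softmax(W_G x).

    Shapes (hypothetical toy layout):
      x: (d,), W0: (d_out, d), W_G: (n, d),
      A: (n, r, d) LoRA down-projections, B: (n, d_out, r) LoRA up-projections.
    """
    omega = softmax(W_G @ x)                            # (n,) routing weights
    low_rank = np.einsum("nrd,d->nr", A, x)             # A_i x for every expert
    expert_out = np.einsum("nor,nr->no", B, low_rank)   # B_i A_i x for every expert
    h = W0 @ x + omega @ expert_out                     # routing-weighted sum
    return h, omega

rng = np.random.default_rng(0)
d, d_out, n, r = 16, 12, 4, 2        # toy sizes, not the paper's
x = rng.normal(size=d)
W0 = rng.normal(size=(d_out, d))
W_G = rng.normal(size=(n, d))
A = rng.normal(size=(n, r, d))
B = np.zeros((n, d_out, r))          # assumed LoRA-style init: B = 0, experts start as a no-op
h, omega = moe_lora_forward(x, W0, W_G, A, B)
```

With \(\mathbf{B}_i=\mathbf{0}\) the layer reproduces the frozen backbone output \(\mathbf{W}_0\mathbf{x}\), so all task-specific behaviour enters through the router \(\mathbf{W}_G\) and the expert factors, which are exactly the two components the method stabilizes.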

Key Designs

  1. Spectral-aware Routing:

    • Function: Restricts router updates mainly to "newly added input directions" for the current task, keeping the router nearly invariant along directions that old tasks rely on.
    • Mechanism: Maintains a running cross-task covariance \(\mathbf{C}^t=(\alpha_{t-1}\mathbf{C}^{t-1}+n_t\hat{\mathbf{C}}^t)/\alpha_t\) (with \(\hat{\mathbf{C}}^t\) the covariance estimated from the \(n_t\) samples of task \(t\)), storing only the top-\(k\) principal components such that the cumulative energy ratio \(\sum_{i=1}^k\sigma_i^2/\sum_{i=1}^d\sigma_i^2\geq \delta\). SVD yields \(\mathbf{C}^t=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top\), partitioned into \(\mathbf{V}_\parallel=\mathbf{V}[:,:k]\) (high-energy key directions for current/past tasks) and \(\mathbf{V}_\perp=\mathbf{V}[:,k:]\) (near-zero variance subspace). Updates inside \(\mathbf{V}_\parallel\) are rescaled by \(g(\boldsymbol{\Sigma})\) (using the inverse sliding average \(\hat\sigma_i\) as the per-direction weight \(\alpha_i\)): \(\Delta\mathbf{W}_\parallel^t=\Delta\mathbf{W}_G^t\mathbf{V}_\parallel g(\boldsymbol{\Sigma})\mathbf{V}_\parallel^\top\). Updates inside \(\mathbf{V}_\perp\) are kept as-is through direct projection: \(\Delta\mathbf{W}_\perp^t=\Delta\mathbf{W}_G^t\mathbf{V}_\perp\mathbf{V}_\perp^\top\). The applied update replaces the raw one: \(\Delta\mathbf{W}_G^t\leftarrow\Delta\mathbf{W}_\parallel^t+\Delta\mathbf{W}_\perp^t\).
    • Design Motivation: Since \(\mathbf{C}^t\propto \mathbf{X}\mathbf{X}^\top\), the low-energy directions satisfy \(\mathbf{V}_\perp^\top\mathbf{X}^{old}\approx \mathbf{0}\), so \(\Delta\mathbf{W}_\perp^t \mathbf{X}^{old}\approx \mathbf{0}\) and the routing logits for old-task inputs are preserved. Re-weighting within \(\mathbf{V}_\parallel\) based on the "local context \(\hat\sigma_i\)" avoids treating all high-energy directions equally, addressing the drift observed in Fig. 1(a-c).
  2. Curvature-aware Riemannian Scaling:

    • Function: Forces expert updates to take smaller steps in "historically frequent input directions" to avoid overwriting learned functions.
    • Mechanism: In a rehearsal-free setting, functional degradation is approximated as \(\Delta_{degrad}\triangleq \mathbb{E}_{\mathbf{x}\sim \mathcal{D}_{<t}}[\|\Delta\mathbf{W}_i \mathbf{x}\|^2]=\mathrm{tr}(\Delta\mathbf{W}_i\mathbf{C}^{t-1}\Delta\mathbf{W}_i^\top)\). Optimizing \(\min_{\Delta\mathbf{W}_i}\mathcal{L}+\lambda\max(0,\Delta_{degrad}-\epsilon)\) leads to Riemannian updates \(\Delta\mathbf{W}_i=-\eta\nabla_{\mathbf{W}_i}\mathcal{L}(\mathbf{C}^{t-1})^{-1}\). The term \((\mathbf{C}^{t-1})^{-1}\) is approximated using the damped pseudo-inverse from the spectral stage: \((\mathbf{C}^{t-1})^{-1}\approx \mathbf{V}_k(\boldsymbol{\Sigma}_k+\mu\mathbf{I})^{-1}\mathbf{V}_k^\top+\frac{1}{\mu}(\mathbf{I}-\mathbf{V}_k\mathbf{V}_k^\top)\).
    • Design Motivation: High-variance directions (large \(\sigma_i\)) result in a small \((\sigma_i+\mu)^{-1}\), "compressing" the update. Low-variance or new directions use the default scale \(1/\mu\). This follows the principle of natural gradients: directions used frequently in the past automatically receive more resistance to change.
  3. Adaptive Expert Activation:

    • Function: Temporarily freezes experts that are "low utility for the current task but high importance for historical tasks" to save computation and prevent interference.
    • Mechanism: Maintains two running averages—current task utilization \(\mathcal{U}(i)\) (mean routing weight per batch) and historical importance \(\mathcal{F}^{pre}(i)\) (approximated via routing-weighted input energy \(\omega_i(\mathbf{x})\|\mathbf{x}\|^2\)). Min-max normalization yields \(\tilde{\mathcal{U}}(i)\) and \(\tilde{\mathcal{F}}^{pre}(i)\). Activation score: \(\mathrm{Score}(i)=\tilde{\mathcal{U}}(i)-\tilde{\mathcal{F}}^{pre}(i)\). Experts with \(\mathrm{Score}(i)<\tau_{score}\) are frozen during the current task's training.
    • Design Motivation: Top-\(k\) routing is sample-level and scatters single-task gradients across many experts. Task-level freezing focuses updates on the few experts truly needed, strengthening compartmentalization and reducing training time and VRAM usage.
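The first two designs above (spectral-aware routing and curvature-aware scaling) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes \(\alpha_t=\alpha_{t-1}+n_t\), picks \(g(\boldsymbol{\Sigma})=\mathrm{diag}(\mu/(\sigma_i+\mu))\) as one possible inverse-style scaling, applies the preconditioner to a generic gradient matrix whose last dimension is \(d\), and uses random toy data. All function names are hypothetical.

```python
import numpy as np

def update_covariance(C_prev, alpha_prev, X_t):
    """C^t = (alpha_{t-1} C^{t-1} + n_t C_hat^t) / alpha_t for inputs X_t of shape (d, n_t).

    Assumption: alpha_t = alpha_{t-1} + n_t, i.e. a cumulative sample count.
    """
    n_t = X_t.shape[1]
    C_hat = X_t @ X_t.T / n_t                 # uncentered covariance of the new task
    alpha_t = alpha_prev + n_t
    return (alpha_prev * C_prev + n_t * C_hat) / alpha_t, alpha_t

def split_subspaces(C, delta=0.999):
    """Return (V_par, V_perp, top-k singular values) for energy threshold delta."""
    _, s, Vt = np.linalg.svd(C)               # s is sorted in descending order
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(energy, delta)) + 1, len(s))
    V = Vt.T
    return V[:, :k], V[:, k:], s[:k]

def router_update(dW, V_par, V_perp, s_par, mu=1e-3):
    """Rescale the raw router update inside V_par and keep it unchanged inside V_perp.

    Assumption: g(Sigma) = diag(mu / (sigma_i + mu)); the summary only states
    that g is built from inverse (sliding-average) singular values.
    """
    g = mu / (s_par + mu)
    dW_par = ((dW @ V_par) * g) @ V_par.T     # dW V_par g(Sigma) V_par^T
    dW_perp = dW @ V_perp @ V_perp.T          # dW V_perp V_perp^T
    return dW_par + dW_perp

def damped_inverse(V_k, s_k, mu=1e-3):
    """V_k (Sigma_k + mu I)^{-1} V_k^T + (1/mu)(I - V_k V_k^T)."""
    d = V_k.shape[0]
    return (V_k / (s_k + mu)) @ V_k.T + (np.eye(d) - V_k @ V_k.T) / mu

def expert_update(grad, C_inv, lr=2e-4):
    """Preconditioned step: Delta W_i = -lr * grad @ C^{-1}."""
    return -lr * grad @ C_inv

# Toy data: "old task" inputs live in a 3-dimensional subspace of R^8.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
basis, _ = np.linalg.qr(rng.normal(size=(d, 3)))
X_old = basis @ rng.normal(size=(3, 200))

C, alpha = update_covariance(np.zeros((d, d)), 0, X_old)
V_par, V_perp, s_par = split_subspaces(C)

dW_raw = rng.normal(size=(n_experts, d))
dW_router = router_update(dW_raw, V_par, V_perp, s_par)
drift_raw = np.linalg.norm(dW_raw @ X_old)            # logit change without constraint
drift_constrained = np.linalg.norm(dW_router @ X_old)  # logit change with constraint

C_inv = damped_inverse(V_par, s_par)
dW_expert = expert_update(rng.normal(size=(2, d)), C_inv)
```

In this toy setup the constrained router update changes the old-task logits far less than the raw update, and the damped inverse equals \((\mathbf{V}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top+\mu\mathbf{I})^{-1}\), so historically used directions are scaled by roughly \(1/(\sigma_i+\mu)\) while unused directions get the larger default \(1/\mu\).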

Loss & Training

Standard next-token CE for multimodal instruction tuning. LoRA rank=8, 1 epoch per task, bs=6, lr=2e-4 with cosine decay and 0.03 warmup, 8×RTX 5090. All constraints are implemented by modifying gradient update rules without additional loss terms. Overall training cost is nearly identical to vanilla MoELoRA.
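For readers who want to reproduce the schedule, the learning-rate setting can be sketched as below. This is a hedged reading of "cosine decay and 0.03 warmup": it assumes 0.03 is a warmup ratio (fraction of total steps), that warmup is linear, and that the cosine decays to zero. The function name `lr_at` and the step count in the usage lines are hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule shape)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length of 1000 optimizer steps for one task.
total = 1000
schedule = [lr_at(s, total) for s in range(total + 1)]
peak_step = max(range(total + 1), key=lambda s: schedule[s])
```

Under these assumptions the peak of 2e-4 is reached once the first 3% of steps are done, after which the rate decreases monotonically for the remainder of the task.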

Key Experimental Results

Main Results

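CoIN 8-task benchmark (aligned with original paper reports; only four task columns are shown, and the Average column does not equal the mean of the displayed columns, so it presumably covers all eight tasks):

| Method | ScienceQA | ImageNet | REC | OCR-VQA | Average |
|---|---|---|---|---|---|
| MoELoRA (Chen 2024) | 62.02 | 37.21 | 33.22 | 65.75 | 50.58 |
| SEFE (Chen 2025) | 75.35 | 83.10 | 16.75 | 66.25 | 58.57 |
| HiDe-LLaVA (Guo 2025a) | 73.20 | 69.28 | 59.18 | 64.76 | 63.95 |
| SAME (Ours) | 78.35 | 90.21 | 59.87 | 63.59 | 66.82 |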

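TriGap (10-task) long-sequence benchmark (three task columns shown; the Average column does not equal the mean of the displayed columns, so it presumably covers all ten tasks):

| Method | DocVQA | IconQA | FloodNet | Average |
|---|---|---|---|---|
| MoE-LoRA | 37.49 | 43.43 | 90.41 | 44.45 |
| CL-MoE | 36.79 | 52.70 | 80.09 | 44.11 |
| ModalPrompt | 38.23 | 44.73 | 71.52 | 40.15 |
| SAME | 43.87 | 64.03 | 81.09 | 46.53 |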

On UCIT (6 tasks), SAME leads with an average accuracy of 67.12% (vs ModalPrompt at 65.52%).

Ablation Study

| Configuration | CoIN Avg Acc | Note |
|---|---|---|
| Baseline (MoELoRA) | 50.58 | No constraints on router/experts |
| + Spectral-aware Routing | 61.32 | Stable routing, +10.74 |
| + Curvature-aware Scaling | 65.89 | Experts not overwritten, +4.57 |
| + Adaptive Expert Activation | 66.82 | Task-level freeze +0.93, saves time/VRAM |
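The incremental gains in the ablation can be checked with a few lines of arithmetic. The snippet below only reuses the four accuracies listed above; the "share of total gain" figure is a derived quantity added here for illustration and is not reported in the paper.

```python
# CoIN average accuracies copied from the ablation rows above.
ablation = [
    ("Baseline (MoELoRA)", 50.58),
    ("+ Spectral-aware Routing", 61.32),
    ("+ Curvature-aware Scaling", 65.89),
    ("+ Adaptive Expert Activation", 66.82),
]

# Gain contributed by each added component (difference between consecutive rows).
gains = [round(cur[1] - prev[1], 2) for prev, cur in zip(ablation, ablation[1:])]

# Total improvement over the baseline and each component's share of it.
total_gain = round(ablation[-1][1] - ablation[0][1], 2)
shares = [round(100 * g / total_gain, 1) for g in gains]

for (name, _), g, share in zip(ablation[1:], gains, shares):
    print(f"{name}: +{g:.2f} points ({share}% of the total gain)")
```

The per-step differences reproduce the +10.74, +4.57, and +0.93 quoted in the notes column, and roughly two thirds of the total improvement comes from the routing constraint alone.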

Key Findings

  • Router stability provides the largest single gain: Fig. 3 shows that with spectral-aware routing, Task 1's expert activation distribution barely drifts during training, corresponding to the +10.74 gain.
  • Expert drift exists independently: Using a "re-routing protocol" (frozen experts + retrained router), Fig. 4 shows that Task 1 accuracy still drops as expert snapshots from later tasks are used; curvature scaling significantly slows this decay.
  • Format drift is a silent killer: On the ScienceQA → TextVQA → ImageNet sequence, 70.6% of "errors" were merely case changes ("a" vs "A"). SAME flattens such format drift.
  • Adaptive freezing is a free lunch: While gaining +0.93 accuracy, it saves an average of 32.1 minutes and about 2.3K MiB of VRAM per GPU per task.
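The freezing rule behind the last finding, \(\mathrm{Score}(i)=\tilde{\mathcal{U}}(i)-\tilde{\mathcal{F}}^{pre}(i)\) with experts frozen when the score falls below \(\tau_{score}\), can be sketched in plain Python. The utilization and importance numbers, the threshold of 0.0, and the function names are made up for illustration; the paper's actual statistics and threshold are not given in this summary.

```python
def min_max(values):
    """Min-max normalize a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def frozen_experts(utility, importance, tau_score=0.0):
    """Score(i) = normalized utility - normalized historical importance.

    Experts whose score is below tau_score are frozen for the current task.
    """
    u = min_max(utility)
    f = min_max(importance)
    scores = [ui - fi for ui, fi in zip(u, f)]
    frozen = [i for i, s in enumerate(scores) if s < tau_score]
    return scores, frozen

# Hypothetical running averages for four experts.
utility = [0.40, 0.05, 0.30, 0.25]   # mean routing weight on the current task
importance = [1.0, 9.0, 2.0, 8.0]    # routing-weighted input energy on past tasks
scores, frozen = frozen_experts(utility, importance, tau_score=0.0)
```

In this toy example experts 1 and 3 are frozen: both matter a lot for past tasks but are used comparatively little by the current one, which is the "low utility, high importance" case the design targets.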

Highlights & Insights

  • Microscopic decomposition of forgetting: Unlike previous work that vaguely cited "parameter drift," this paper uses a re-routing control experiment to isolate router drift and expert drift as independent observable signals. This "diagnosis → decoupling → divide-and-conquer" paradigm is highly valuable.
  • Unified covariance \(\mathbf{C}^t\) usage: The same cumulative covariance is reused for router subspace partitioning, expert Riemannian metrics, and adaptive freezing input energy approximation, reducing storage overhead to nearly zero.
  • Natural anti-forgetting from a Riemannian perspective: \((\mathbf{C}^{t-1})^{-1}\) preconditioning ensures that "historically used directions" automatically gain resistance to change, analogous to natural gradient descent, with the input covariance playing the role of the Fisher Information metric.
  • Empirical evidence of format drift: The analysis of case-sensitivity errors suggests that much "forgetting" in LLM continual learning is superficial output style shifts rather than knowledge loss.
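The format-drift observation suggests a simple diagnostic: split errors into those that differ from the reference only in letter case and those that are genuinely wrong. The sketch below shows one way to do this; the predictions and references are invented toy strings, not data from the paper, and the function name is hypothetical.

```python
def classify_errors(predictions, references):
    """Count exact matches, case-only mismatches, and all other errors."""
    exact = case_only = other = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.strip(), ref.strip()
        if p == r:
            exact += 1
        elif p.lower() == r.lower():
            case_only += 1          # right answer, wrong output format
        else:
            other += 1              # genuinely different answer
    errors = case_only + other
    return {
        "exact": exact,
        "case_only": case_only,
        "other": other,
        "case_only_share": case_only / errors if errors else 0.0,
    }

# Toy multiple-choice outputs (hypothetical).
predictions = ["A", "b", "c", "D", "a"]
references = ["A", "B", "C", "A", "B"]
report = classify_errors(predictions, references)
```

Reporting `case_only_share` alongside strict accuracy separates superficial style shifts from actual knowledge loss, which is the distinction this highlight draws.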

Limitations & Future Work

  • Known task boundaries: Covariance updates and subspace partitioning currently assume task IDs are known.
  • Covariance maintained only at the routing layer: There is a potential small mismatch between FFN input distribution and router input distribution.
  • Hyperparameter sensitivity: Parameters like \(\delta, \rho, \mu, \lambda, \epsilon, \tau_{score}\) require tuning.
  • Limited to LoRA-on-FFN: Extending the method to attention layers or vision encoders is necessary for further validation.

Comparison with Related Work

  • vs MoELoRA / CL-MoE / HiDe-LLaVA: These rely on routing sparsity/auxiliary losses or hierarchical distillation. SAME intervenes directly in update rules with zero auxiliary loss and zero rehearsal.
  • vs Replay-LoRA / SEFE: Replay methods require storing old data; SAME is rehearsal-free, using covariance summaries to preserve "past geometry."
  • vs O-LoRA / GEM / OGD: While sharing the "orthogonal gradient projection" concept, SAME upgrades it to the MoE dimension—managing both parameter-dimensional projection and routing distribution stability.

Rating

  • Novelty: ⭐⭐⭐⭐ Effective decoupling of MoE forgetting sources.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive benchmarks plus diagnostic protocols and efficiency analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Logical flow and clear problem decomposition.
  • Value: ⭐⭐⭐⭐ Rehearsal-free with low engineering cost.