PRISM: Synergizing Vision Foundation Models via Self-Organized Expert Specialization

Conference: ICML 2026
arXiv: 2606.03444
Code: https://github.com/robotyingtang/PRISM-VFM
Area: Multimodal VLM / Vision Foundation Model Distillation
Keywords: Multi-teacher Distillation, Vision Foundation Models, MoE, Contextual Routing, Gradient Conflict

TL;DR

PRISM distills three heterogeneous vision foundation models (CLIP, SAM, DINOv2) into a single ViT student built on a Dual-stream Conditional MoE. A shared anchor stream stabilizes gradients, while a context-routed sparse expert stream resolves conflicts, so that experts specialize in a self-organized way: consensual knowledge is shared and conflicting knowledge is branched. PRISM outperforms the previous SOTA (SAK) on all five PASCAL-Context tasks.

Background & Motivation

Background: Vision Foundation Models (VFMs) like CLIP (semantic alignment), SAM (boundary/geometry), and DINOv2 (fine-grained local texture) each have unique strengths. Industrial deployment seeks to compress these capabilities into a single student backbone to reduce memory and latency.

Limitations of Prior Work: Distilling multi-teacher features into a dense student (e.g., RADIO, Theia, UNIC) leads to severe gradient conflicts. CLIP encourages features to be category-invariant (compressing variance), while DINO requires local textures to be separable (preserving variance). Shared parameters therefore receive opposing gradients (\(\cos(\mathbf{g}_i, \mathbf{g}_j)<0\)), so the summed update partially cancels and the student settles for a sub-optimal "jack of all trades, master of none" compromise.

Key Challenge: Existing "divide-and-conquer" solutions (like SAK using Teacher-Agnostic Stems + Teacher-Specific Adapters) mitigate interference via hard branching. However, this assumes vision knowledge can be explicitly partitioned into disjoint sub-domains. In reality, CLIP and DINO often encode different frequency bands of the same concept (e.g., a "cat"). Hard splitting either wastes parameters via redundancy or stifles positive transfer.

Goal: In multi-teacher VFM distillation, avoid both dense "full sharing" (conflict) and SAK-style "hard splitting" (redundancy). Instead, pursue an intermediate route that dynamically decides between sharing and branching based on token, layer, and teacher context.

Key Insight: Treat sparse MoE routing as a tool for "gradient orthogonalization." Conflicting teacher gradients are routed to different experts to minimize the effective inner product \(\langle \tilde{\mathbf{g}}_{i,n}, \tilde{\mathbf{g}}_{j,n}\rangle\approx 0\), while consensual components are directed through a shared anchor stream.
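The contrast between dense sharing and routed experts can be illustrated with a small numeric sketch. The gradient values, the two-expert parameter split, and the hard teacher-to-expert routing below are all hypothetical; PRISM's actual routing is soft Top-K, so the per-expert inner product would only be approximately zero.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical gradients from two teachers on four shared parameters.
g_clip = [1.0, -2.0, 0.5, 1.5]
g_dino = [-0.8, 1.9, 0.4, -1.2]

# Dense sharing: both gradients act on the same parameters and partly cancel.
dense_cos = cosine(g_clip, g_dino)
g_total = [a + b for a, b in zip(g_clip, g_dino)]
dense_norm = norm(g_total)

# Sparse routing (illustrative hard assignment): parameters 0-1 belong to
# expert 0, parameters 2-3 to expert 1, and each teacher updates one expert.
expert_of_param = [0, 0, 1, 1]
route = {"clip": 0, "dino": 1}

def effective_grad(g, teacher, expert):
    """Gradient of `teacher` as seen by `expert`; zero if not routed there."""
    if route[teacher] != expert:
        return [0.0] * len(g)
    return [x if e == expert else 0.0 for x, e in zip(g, expert_of_param)]

per_expert_inner = [
    dot(effective_grad(g_clip, "clip", n), effective_grad(g_dino, "dino", n))
    for n in (0, 1)
]

print(f"dense cosine: {dense_cos:.3f}, |g_total|: {dense_norm:.3f}")
print(f"per-expert inner products: {per_expert_inner}")
```

In the dense case the cosine is negative and the summed gradient is shorter than either teacher's gradient; once the teachers stop co-activating the same expert, there is nothing left to cancel.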

Core Idea: A "Decompose-then-Recombine" two-stage paradigm is proposed. Stage 1 uses teacher IDs as context to drive emergent specialization of sparse experts. Stage 2 uses task IDs as context to recombine these experts for downstream tasks. A locality-aware decorrelation loss is introduced to prevent shallow layers from collapsing due to strong CLIP semantic supervision.

Method

Overall Architecture

PRISM replaces specific FFN blocks (layers 2, 5, 8, 11) of a standard ViT-B/16 with PRISM blocks, each comprising a Dual-stream Conditional MoE:

  1. Universal Anchor (Stability Stream): A dense MLP \(\mathcal{F}_{\text{anc}}\) shared across all contexts to capture task-agnostic low-frequency consensus.
  2. Specialized Delta (Plasticity Stream): A sparse MoE \(\mathcal{F}_{\text{moe}}\) with 15 experts (Top-3 routing) and 1 internal shared expert, modulated by context \(c\).

The block output is a weighted sum: \(\mathbf{y}=\mathbf{x}+\lambda\cdot \mathcal{F}_{\text{anc}}(\text{LN}(\mathbf{x}))+(1-\lambda)\cdot \mathcal{F}_{\text{moe}}(\mathbf{x}, c)\), where \(\lambda\in[0,1]\) is a learnable gate. Experiments show it automatically learns a hierarchy: high \(\lambda\) (stability) in shallow layers and low \(\lambda\) (specialization) in deep layers.
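A minimal sketch of this combination rule, in plain Python on a single token vector. The sigmoid parameterization of \(\lambda\) and the toy `toy_anchor` / `toy_moe` functions are assumptions made for illustration; the source only states that \(\lambda\in[0,1]\) is learnable.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a token vector to zero mean and unit variance (no affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def prism_block(x, context, anchor_fn, moe_fn, gate_logit):
    """y = x + lam * F_anc(LN(x)) + (1 - lam) * F_moe(x, c)."""
    # Assumption: a sigmoid over a learnable logit keeps lambda in [0, 1].
    lam = 1.0 / (1.0 + math.exp(-gate_logit))
    anchor_out = anchor_fn(layer_norm(x))
    moe_out = moe_fn(x, context)
    return [xi + lam * a + (1.0 - lam) * m
            for xi, a, m in zip(x, anchor_out, moe_out)]

# Hypothetical stand-ins for the dense anchor MLP and the sparse MoE stream.
def toy_anchor(h):
    return [0.5 * v for v in h]

def toy_moe(x, context):
    scale = {"clip": 1.0, "dino": -1.0}[context]
    return [scale * v for v in x]

x = [1.0, 2.0, 3.0]
shallow_like = prism_block(x, "clip", toy_anchor, toy_moe, gate_logit=4.0)
deep_like = prism_block(x, "clip", toy_anchor, toy_moe, gate_logit=-4.0)
print(shallow_like, deep_like)
```

A large positive logit makes the block behave almost like a dense residual MLP (the reported shallow-layer regime), while a large negative logit hands the update to the context-dependent stream (the reported deep-layer regime).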

Training consists of two stages: Stage 1 involves distillation from three frozen ViT-L teachers (DINOv2-L, CLIP-L, SAM-L) on ImageNet-1k for 30 epochs. Stage 2 involves multi-task fine-tuning on PASCAL-Context or NYUD-v2 for 40k iterations.

Key Designs

  1. Gradient Conflict Diagnosis + Orthogonalization Goal → MoE Routing:

    • Function: Explicitly models optimization contradictions in multi-teacher distillation as gradient conflicts and provides an actionable mitigation target.
    • Mechanism: In a dense backbone, the aggregate gradient \(\mathbf{g}_{\text{total}}=\sum_k \gamma_k \mathbf{g}_k\) partially cancels whenever \(\cos(\mathbf{g}_i,\mathbf{g}_j)<0\), shrinking the effective update. PRISM posits that a sparse MoE can route conflicting gradients to different experts \(E_n\) so that, on each expert, \(\langle \tilde{\mathbf{g}}_{i,n}, \tilde{\mathbf{g}}_{j,n}\rangle\approx 0\). This is achieved by reducing the co-activation of conflicting teachers on the same expert.
    • Design Motivation: Prior work either ignores conflicts (RADIO) or uses heuristic hard branching (SAK). PRISM elevates this to a gradient-driven orthogonalization design, letting data determine where to share and where to branch.
  2. Context-Modulated Routing (FiLM):

    • Function: Ensures the router makes decisions based on both "image content" and "task/teacher identity," preventing the router from producing a single decision for different tasks on the same image.
    • Mechanism: FiLM injects Context ID \(c\) into normalized token features via an affine transformation: \(\hat{\mathbf{x}}=(1+\gamma(c))\odot \text{LayerNorm}(\mathbf{x})+\beta(c)\). The router \(G(\hat{\mathbf{x}})\) then performs Top-\(K\) dispatch. The MoE output is \(\mathcal{F}_{\text{moe}}(\mathbf{x}, c)=E_{\text{shared}}(\mathbf{x})+\sum_{i\in \text{TopK}} G(\hat{\mathbf{x}})_i\, E_i(\mathbf{x})\).
    • Design Motivation: A standard MoE router sees only the token features, so it would make the same routing decision for a given image regardless of whether CLIP or DINO is supervising it, which hinders specialization. FiLM shifts and rescales the router's input per context, so different contexts can produce different routing decisions. Unlike MoFME (which uses FiLM for expert computation), PRISM only modulates routing, maintaining separation between routing and representation learning.
  3. Locality-Aware Decorrelation Loss (LDL):

    • Function: Prevents shallow layers from being "short-circuited" by strong semantic signals like CLIP, maintaining high-rank local features for deeper experts.
    • Mechanism: Observation of "semantic short-circuiting" shows CLIP supervision causes shallow layers to collapse into global semantics (rank collapse). LDL is applied to the first two layers, penalizing high cosine similarity between spatially distant tokens while preserving local correlations: \(\mathcal{L}_{\text{decorr}}=\frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}}\max(0,\cos(\mathbf{z}_i,\mathbf{z}_j)-\epsilon)\cdot \mathbb{I}(d_{ij}>r)\), where \(r\) is a local radius and \(d_{ij}\) is Euclidean distance.
    • Design Motivation: MoE effectiveness depends on token diversity. LDL serves as a regularization with local inductive bias, forcing distant tokens to remain distinct and supporting a high-rank feature base.
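The context-modulated routing from design 2 can be sketched in plain Python as follows. The router weights, FiLM parameters, expert functions, and the choice to use un-renormalized softmax gates are all hypothetical; only the FiLM formula and the "shared expert plus Top-K experts" structure come from the description above.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def film(x, gamma, beta):
    """x_hat = (1 + gamma(c)) * LayerNorm(x) + beta(c); used by the router only."""
    return [(1.0 + g) * v + b for v, g, b in zip(layer_norm(x), gamma, beta)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def route_top_k(x_hat, router_weights, k):
    """Return {expert_index: gate} for the k highest-scoring experts."""
    logits = [sum(w * v for w, v in zip(row, x_hat)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top}

def moe_forward(x, context, film_table, router_weights, experts, shared, k):
    gamma, beta = film_table[context]
    gates = route_top_k(film(x, gamma, beta), router_weights, k)
    out = shared(x)  # internal shared expert is always active
    for i, g in gates.items():
        # Experts receive the unmodulated x: FiLM affects dispatch only.
        out = [o + g * e for o, e in zip(out, experts[i](x))]
    return out, sorted(gates)

# Hypothetical toy setup: 4 experts, 3-dim tokens, Top-2 routing.
router_weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [-2, 0, 0]]
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: list(x)
film_table = {
    "clip": ([0.0, 0.0, 0.0], [3.0, 0.0, 0.0]),
    "dino": ([0.5, 0.0, 0.0], [0.0, 3.0, 0.0]),
}

x = [1.0, 2.0, 3.0]
_, clip_experts = moe_forward(x, "clip", film_table, router_weights, experts, shared, 2)
_, dino_experts = moe_forward(x, "dino", film_table, router_weights, experts, shared, 2)
print(clip_experts, dino_experts)
```

With these toy values the same token is dispatched to experts 0 and 2 under the CLIP context and to experts 1 and 3 under the DINO context, which is the behavior a content-only router cannot produce.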

Loss & Training

  • Stage 1: \(\mathcal{L}_{\text{stage1}}=\mathcal{L}_{\text{aux}}+\alpha \mathcal{L}_{\text{distill}}+\beta \mathcal{L}_{\text{decorr}}\), with \(\alpha=0.9, \beta=0.1\). A teacher \(T_k\) is randomly sampled per iteration.
  • Stage 2: \(\mathcal{L}_{\text{stage2}}=\mu \mathcal{L}_{\text{distill}}+\sum_{t}w_t \mathcal{L}_t\), with \(\mu=1.0\) and fixed multi-task weights \(w_t\).
  • Backbone: ViT-B/16 with 15 + 1 experts per PRISM layer, Top-3 routing.
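The \(\mathcal{L}_{\text{decorr}}\) term and the Stage 1 weighting can be sketched in plain Python. This assumes \(\mathcal{P}\) is the set of all unordered token pairs and that \(d_{ij}\) is measured in patch-grid units; the grid, token values, `radius`, and `eps` below are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def decorrelation_loss(tokens, positions, radius, eps):
    """Mean over all pairs of max(0, cos(z_i, z_j) - eps) * 1[d_ij > radius]."""
    pairs = list(combinations(range(len(tokens)), 2))
    total = 0.0
    for i, j in pairs:
        if math.dist(positions[i], positions[j]) > radius:
            total += max(0.0, cosine(tokens[i], tokens[j]) - eps)
    return total / len(pairs)

def stage1_loss(l_aux, l_distill, l_decorr, alpha=0.9, beta=0.1):
    """L_stage1 = L_aux + alpha * L_distill + beta * L_decorr."""
    return l_aux + alpha * l_distill + beta * l_decorr

# 2x2 patch grid; with radius 1.0 only the two diagonal pairs count as distant.
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
collapsed = [[1.0, 0.0]] * 4                                # every token identical
diverse = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # diagonals orthogonal

loss_collapsed = decorrelation_loss(collapsed, positions, radius=1.0, eps=0.1)
loss_diverse = decorrelation_loss(diverse, positions, radius=1.0, eps=0.1)
print(loss_collapsed, loss_diverse)
```

The collapsed layout is penalized because distant tokens are identical, while the diverse layout incurs no loss even though adjacent tokens 0 and 1 remain identical, reflecting the "preserve local correlations" part of the design.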

Key Experimental Results

Main Results

Benchmarks on PASCAL-Context (5 tasks) and NYUD-v2 (4 tasks).

| Method (PASCAL-Context, ViT-B) | SemSeg mIoU↑ | Parsing mIoU↑ | Saliency maxF↑ | Normal mErr↓ | Boundary odsF↑ | \(\Delta_m\) (%)↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Single-task baseline | 80.25 | 70.54 | 84.54 | 13.57 | 74.22 | 0.00 |
| Multi-task baseline | 76.76 | 65.26 | 84.39 | 13.98 | 70.37 | -4.04 |
| RADIO | 78.06 | 68.13 | 85.18 | 13.59 | 72.64 | -1.53 |
| Theia | 76.51 | 67.53 | 84.38 | 14.56 | 70.34 | -4.33 |
| SAK (Prev. SOTA) | 81.88 | 74.30 | 84.79 | 14.02 | 74.09 | 0.83 |
| PRISM (Ours) | 82.20 | 75.34 | 84.81 | 13.47 | 75.92 | 2.29 |

Key observations: (1) PRISM raises \(\Delta_m\) to 2.29%, above both the single-task baseline (0.00) and SAK (0.83%). (2) It outperforms SAK on all five tasks, although the Saliency margin is small (+0.02 maxF) and RADIO remains best on that task (85.18). The largest gains over SAK are in Boundary (+1.83 odsF), Parsing (+1.04 mIoU), and Normal (-0.55 mErr); the geometric gains suggest emergent experts extract shared geometric structures more efficiently than SAK's isolated adapters.
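The summary does not define \(\Delta_m\). A sketch under the assumption that it is the usual multi-task relative gain (the mean per-task relative change against the single-task baseline, sign-flipped for lower-is-better metrics) is shown below, using the SAK and multi-task baseline rows from the table; the short task keys are illustrative names.

```python
def delta_m(method, single_task, lower_is_better):
    """Average relative gain over the single-task baseline, in percent."""
    gains = []
    for task, base in single_task.items():
        sign = -1.0 if task in lower_is_better else 1.0
        gains.append(sign * (method[task] - base) / base)
    return 100.0 * sum(gains) / len(gains)

# Rows copied from the PASCAL-Context table above.
single = {"semseg": 80.25, "parsing": 70.54, "saliency": 84.54,
          "normal": 13.57, "boundary": 74.22}
multi_task = {"semseg": 76.76, "parsing": 65.26, "saliency": 84.39,
              "normal": 13.98, "boundary": 70.37}
sak = {"semseg": 81.88, "parsing": 74.30, "saliency": 84.79,
       "normal": 14.02, "boundary": 74.09}

lower_is_better = {"normal"}  # mErr: smaller is better

print(round(delta_m(multi_task, single, lower_is_better), 2))
print(round(delta_m(sak, single, lower_is_better), 2))
```

Under this assumed definition, the two rows evaluate to the -4.04 and 0.83 listed in the table.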

NYUD-v2 & Ablation Study

On NYUD-v2, PRISM and SAK perform comparably. PRISM leads in SemSeg and Depth, while SAK is slightly better in Normal and Boundary. This reflects a trade-off between flexible recombination and specialized local adaptation for indoor geometry.

| Configuration | Key Observation | Description |
| --- | --- | --- |
| Full PRISM | \(\Delta_m=2.29\%\) | Dual-stream + FiLM + LDL active |
| Shallow vs Deep \(\lambda\) | High \(\lambda\) shallow, low \(\lambda\) deep | Learned "shallow stability, deep specialization" pattern |
| Stage 1 Routing | Different teachers use different experts | Validates that emergent specialization occurred |

Key Findings

  • Emergent Layer Hierarchy: The learned gating \(\lambda\) confirms that shallow layers prefer shared anchors for robust optimization, while deep layers prefer sparse experts for fine differentiation.
  • Geometric Extraction: Improved performance on Boundary/Normal tasks indicates emergent experts are better at aggregating implicit geometric signals from SAM and DINO compared to hard-branched adapters.
  • LDL Layer Placement: Applying LDL only to the first two layers is sufficient; applying it to deep layers hinders specialization.

Highlights & Insights

  • MoE as Gradient Orthogonalizer: This perspective treats MoE as a structural solution to gradient conflicts in multi-objective optimization, rather than just a capacity booster.
  • FiLM for Routing: Modulation is applied only to the dispatch decision, not the expert computation. This maintains a clean separation of concerns between routing and representation learning.
  • Dual-stream Philosophy: The "Stability + Plasticity" design is transferable to other scenarios where general capabilities must be preserved during downstream specialization (e.g., multimodal instruction tuning).

Limitations & Future Work

  • Training Cost: The dual-stream MoE architecture is heavier than a dense ViT-B during training, and the reliance on multiple teacher forwards increases peak VRAM and computation time.
  • Teacher Sensitivity: The stability of emergent specialization when scaling to a larger number of teachers (e.g., Depth Anything, ConvNeXt) remains unverified.
  • Indoor Geometry: On NYUD-v2, hard-branched adapters (SAK) still hold advantages in specific tasks, suggesting a potential hybrid paradigm of routing and lightweight adapters.

Comparison with Related Work

  • vs SAK (Lu et al., 2025): PRISM replaces SAK's "hard split" with "soft split + contextual routing," yielding higher efficiency and better geometric feature extraction.
  • vs RADIO (Ranzinger et al., 2024): RADIO relies on loss weighting to manage conflicts in a dense backbone; PRISM resolves conflicts structurally via routing.
  • vs Mod-Squad (Chen et al., 2023): Mod-Squad uses information-theoretic objectives for specialization; PRISM achieves emergent specialization across multiple teachers without explicit constraint terms.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of Dual-stream MoE, FiLM routing, and LDL provides a novel and well-motivated framework for multi-VFM distillation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on two benchmarks with competitive baselines (SAK, RADIO, UNIC) and supportive diagnostic analysis.
  • Writing Quality: ⭐⭐⭐⭐ Logical flow from gradient conflict diagnosis to architectural design is clear.
  • Value: ⭐⭐⭐⭐ Offers a reproducible recipe for compressing multiple VFMs, backed by a significant 2.29% \(\Delta_m\) improvement over single-task baselines.