Revisiting Model Stitching In the Foundation Model Era¶
- Conference: CVPR 2026
- arXiv: 2603.12433
- Code: To be confirmed
- Area: Multimodal VLM
- Keywords: Model stitching, vision foundation models, representation compatibility, feature matching, multi-VFM fusion, efficiency optimization
TL;DR¶
A two-stage stitching training method (Final Feature Matching + Task Loss Training) for heterogeneous Vision Foundation Models (VFMs) is proposed, demonstrating that heterogeneous VFMs can be reliably stitched and fused to combine complementary knowledge. A VFM Stitch Tree (VST) architecture is also designed to achieve controllable accuracy–efficiency trade-offs in multi-VFM systems.
Background & Motivation¶
- Proliferation of VFMs with unknown representation compatibility: Contemporary VFMs such as CLIP, DINOv2, and SigLIP 2 differ substantially in training objectives (contrastive learning vs. self-supervised reconstruction), data sources (LAION vs. WebLI vs. LVD-142M), and modality combinations (vision-language vs. vision-only), yet whether their internal representations are mutually compatible remains unclear.
- Model stitching as a probe for representation compatibility: Prior work has shown that small models trained on the same dataset (e.g., ResNet-18 on CIFAR-10) can be stitched with near-lossless accuracy even under different initializations or objectives, but whether this finding generalizes to heterogeneous VFMs has not been verified.
- Failure of existing stitching strategies on VFMs: Conventional Layer Feature Matching (LFM)—which aligns features at the stitch point—and Task Loss Training (TLT)—which directly optimizes the downstream loss—perform poorly in the VFM setting, especially at shallow stitch points.
- Gradient propagation difficulty at shallow stitch points: When the stitch point is shallow, all subsequent layers of the target model are frozen, so gradients must traverse a long chain of frozen layers to update the stitch layer, making optimization difficult.
- Efficiency bottleneck in multi-VFM systems: Modern multimodal LLMs (e.g., MoF-LLaVA, Cambrian-1) deploy multiple VFMs to capture complementary visual information, incurring linearly growing compute and memory costs (\(k\) VFMs = \(k\times\) overhead).
- Demand for a transition from diagnostic tool to practical solution: There is a need to elevate model stitching from a representation analysis tool to a practical approach for fusing complementary VFM capabilities.
Method¶
Overall Architecture¶
Given a source model \(f_\theta\) and an \(N\)-layer target model \(f_\phi\), stitching is performed at layer \(n\): the first \(n\) layers of the source model and the last \(N-n\) layers of the target model are frozen, and only a lightweight stitch layer \(S\) is trained. The stitched model is defined as \(F(x) = T_\phi^N \circ S \circ R_\theta^n(x)\), where \(R_\theta^n\) denotes the first \(n\) source layers and \(T_\phi^N\) the remaining target layers.
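A minimal PyTorch-style sketch of this stitched forward pass, assuming both VFMs expose their transformer blocks as an indexable `blocks` module list; patch embedding and the final norm are omitted for brevity, and the class and attribute names are illustrative rather than the paper's code:

```python
import torch.nn as nn

class StitchedModel(nn.Module):
    """F(x) = T_phi^N ∘ S ∘ R_theta^n(x): frozen source prefix, trainable stitch layer, frozen target suffix."""

    def __init__(self, source_vfm, target_vfm, stitch_layer, n):
        super().__init__()
        self.source_blocks = source_vfm.blocks[:n]   # R_theta^n: first n source blocks (frozen)
        self.target_blocks = target_vfm.blocks[n:]   # T_phi^N: remaining target blocks (frozen)
        self.stitch = stitch_layer                   # S: the only trainable component
        for p in list(self.source_blocks.parameters()) + list(self.target_blocks.parameters()):
            p.requires_grad = False

    def forward(self, x):  # x: patch-embedded token sequence, (batch, num_tokens, dim)
        for blk in self.source_blocks:
            x = blk(x)
        x = self.stitch(x)
        for blk in self.target_blocks:
            x = blk(x)
        return x
```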
Key Designs: Two-Stage Training¶
Stage 1: Final Feature Matching (FFM)
- Rather than matching intermediate features at the stitch point, the final-layer output features of the target model are matched.
- Loss function: \(\mathcal{L}_{FFM} = \frac{1}{M}\sum_{i=1}^{M}\|T_\phi^N(S(R_\theta^n(x_i))) - T_\phi^N(R_\phi^n(x_i))\|_2^2\)
- Key finding: Although FFM directly matches final-layer features, it also implicitly preserves local alignment at the stitch point (layer feature distance is comparable to that of LFM).
- This stage requires no labels; only unlabeled images are needed (see the loss sketch below).
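A sketch of the Stage-1 FFM objective under the same assumptions: `target_model(x)` stands in for the intact frozen target's final features \(T_\phi^N(R_\phi^n(x))\), and no labels are used.

```python
import torch
import torch.nn.functional as F

def ffm_loss(stitched_model, target_model, x):
    """L_FFM: MSE between the stitched model's final features and the frozen target model's final features."""
    with torch.no_grad():
        reference = target_model(x)    # T_phi^N(R_phi^n(x)): reference features, no gradients needed
    prediction = stitched_model(x)     # T_phi^N(S(R_theta^n(x))): gradients reach only the stitch layer S
    return F.mse_loss(prediction, reference)
```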
Stage 2: Task Loss Training (TLT)
- The stitch layer is initialized with the parameters obtained from Stage 1 FFM, avoiding the optimization difficulties caused by random initialization.
- Fine-tuning on the downstream task: \(\mathcal{L}_{task} = \frac{1}{M}\sum_{i=1}^{M}\ell(F(x_i), y_i)\)
- FFM initialization places the stitch layer in a more favorable loss landscape, and subsequent fine-tuning converts this favorable initialization into strong stitching accuracy (see the two-stage sketch below).
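Continuing the sketches above, a condensed view of the two-stage schedule; `unlabeled_loader`, `labeled_loader`, `classifier_head`, and the learning rate are illustrative placeholders, and only the stitch layer is optimized:

```python
# Stage 1: FFM pretraining of the stitch layer on unlabeled data (no labels used).
opt = torch.optim.AdamW(stitched_model.stitch.parameters(), lr=1e-4)
for x in unlabeled_loader:
    opt.zero_grad()
    ffm_loss(stitched_model, target_model, x).backward()
    opt.step()

# Stage 2: TLT fine-tuning, starting from the FFM-initialized stitch layer.
for x, y in labeled_loader:
    opt.zero_grad()
    logits = classifier_head(stitched_model(x))   # downstream head; assumed given, not trained here
    F.cross_entropy(logits, y).backward()
    opt.step()
```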
Stitch Layer Design¶
- Linear: Processes tokens independently; weakest expressive capacity.
- MLP (default): Two-layer perceptron with ReLU; best overall performance (sketched after this list).
- LoRA: Applies LoRA to layer \(n\) of the source model, allowing inter-token interactions; strongest expressive capacity in theory, yet underperforms MLP—indicating that moderate mismatch facilitates complementary information fusion.
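A plausible form of the default MLP stitch layer, applied token-wise; the hidden width is an assumption, since the paper specifies only a two-layer perceptron with ReLU:

```python
import torch.nn as nn

class MLPStitch(nn.Module):
    """Token-wise two-layer MLP mapping source feature width to target feature width."""

    def __init__(self, src_dim, tgt_dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or tgt_dim   # hidden width not specified in the paper; assumed
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, tgt_dim),
        )

    def forward(self, tokens):               # tokens: (batch, num_tokens, src_dim), processed independently
        return self.net(tokens)
```

The Linear variant corresponds to dropping the hidden layer and ReLU, while LoRA instead adapts layer \(n\) of the source model itself.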
VFM Stitch Tree (VST)¶
- Core Idea: Multiple VFMs share shallow-layer computation, retaining only their respective specialized deep layers, connected via stitch layers.
- Architecture: A tree structure in which the trunk consists of the shallow layers of one VFM and branches consist of the deep layers of each VFM (sketched after this list).
- Using Cambrian-1 (4 VFMs) as an example, stitching at layer 14 reduces GPU memory and compute by 54%.
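A minimal sketch of a VST forward pass under the assumptions above: the trunk runs once, and each branch is a stitch layer followed by one VFM's specialized deep layers (the construction is illustrative, not the paper's released code):

```python
import torch.nn as nn

class VFMStitchTree(nn.Module):
    """Shared shallow trunk from one VFM; each branch keeps only its own specialized deep layers."""

    def __init__(self, trunk_blocks, branches):
        super().__init__()
        self.trunk = nn.ModuleList(trunk_blocks)    # shallow layers of the trunk VFM, computed once
        self.branches = nn.ModuleList(branches)     # each branch: stitch layer + deep layers of one VFM

    def forward(self, tokens):
        for blk in self.trunk:
            tokens = blk(tokens)
        return [branch(tokens) for branch in self.branches]   # one feature output per VFM branch
```

With \(k\) VFMs, the trunk cost is paid once rather than \(k\) times, which is the source of the reported 54% memory and compute reduction when stitching Cambrian-1's four VFMs at layer 14.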
Key Experimental Results¶
Two-Stage Training vs. Naive TLT (fMoW Classification, Accuracy %)¶
| Direction | Pretrain | L2 | L6 | L10 | L14 | L18 | L22 |
|---|---|---|---|---|---|---|---|
| DINOv2→SigLIP2 | None | 25.1 | 39.4 | 52.6 | 62.3 | 68.6 | 68.6 |
| DINOv2→SigLIP2 | FFM | 51.7 | 55.8 | 59.3 | 68.0 | 72.0 | 71.8 |
| SigLIP2→DINOv2 | None | 38.7 | 56.7 | 58.3 | 64.4 | 70.4 | 70.1 |
| SigLIP2→DINOv2 | FFM | 53.8 | 53.8 | 61.9 | 69.6 | 70.4 | 72.2 |
FFM initialization yields gains of up to +26.6% at shallow stitch points (L2), with consistent improvements at deeper layers as well.
Consistency Across Datasets and Tasks (Classification Accuracy % / Segmentation mIoU %)¶
| Direction | fMoW (L6/14/22) | iNaturalist (L6/14/22) | Aircraft (L6/14/22) | ADE20K (L14/22) |
|---|---|---|---|---|
| Self-Stitch DINOv2 | 41.5/59.7/69.9 | 56.9/81.5/91.2 | 37.8/79.3/91.2 | 35.4/50.9 |
| Self-Stitch SigLIP2 | 50.5/62.0/68.9 | 71.2/88.5/87.3 | 67.9/88.1/89.3 | 44.5/50.5 |
| DINOv2→SigLIP2 | 55.8/68.0/71.8 | 75.9/89.1/92.8 | 77.8/87.6/92.4 | 44.9/51.2 |
| SigLIP2→DINOv2 | 53.8/69.6/72.2 | 86.3/88.9/91.9 | 80.7/89.0/91.0 | 49.0/51.4 |
Cross-model stitching consistently surpasses self-stitching baselines, with classification gains of +0.7%–+5.5% and segmentation gains of +0.5–+0.7 mIoU.
VST Accuracy–Efficiency Trade-off (MoF-LLaVA)¶
| Configuration | Additional Resources | Gain Recovery Ratio |
|---|---|---|
| Single-VFM baseline | 0% | 0% |
| VST-22 (1 specialized layer) | 4.3% | 45% |
| VST-14 (9 specialized layers) | 39% | 84% |
| Full dual-VFM | 100% | 100% |
Ablation Study: Stitch Layer Type Comparison (fMoW Accuracy %)¶
| Method | L2 | L6 | L10 | L14 | L18 | L22 |
|---|---|---|---|---|---|---|
| D→S Linear | 26.1 | 54.3 | 59.5 | 66.5 | 69.1 | 69.6 |
| D→S MLP | 51.7 | 55.8 | 59.3 | 68.0 | 72.0 | 71.8 |
| D→S LoRA | 49.1 | 49.4 | 57.4 | 61.7 | 67.7 | 67.3 |
MLP consistently outperforms both Linear and LoRA across all stitch points, even though LoRA has the greater expressive capacity in theory.
Highlights & Insights¶
- Precise problem diagnosis: The paper clearly analyzes why LFM fails at shallow stitch points (small errors amplified through frozen layers) and why TLT faces gradient propagation difficulties at shallow depths; the FFM solution is elegant and effective.
- Well-designed self-stitching baseline: Self-stitch control experiments eliminate "stitch layer capacity gains" as a confounding factor, establishing that complementary knowledge fusion is a genuine effect.
- Complete loop from analysis to application: Starting from representation analysis, the work identifies VFM stitchability, and subsequently designs the VST architecture to address the practical efficiency problem in multi-VFM systems.
- Comprehensive and systematic experiments: Covering 4 VFMs, 4 datasets, 2 task types (classification + segmentation), 3 stitch layer types, and 6 stitch depths.
Limitations & Future Work¶
- Weak source model constraint: Stitching performance degrades when CLIP is used as the source model, suggesting that a weak encoder may discard critical information that the target model cannot recover.
- Limited VST evaluation: Validation is confined to the LLaVA framework and a small number of VQA benchmarks, without coverage of broader MLLM architectures or more extensive benchmark suites.
- Unexplored stitch layer design space: Only Linear, MLP, and LoRA are evaluated; more complex mechanisms such as cross-attention remain unexplored.
- Restricted to ViT architectures: All evaluated VFMs are Transformer-based; stitchability of CNN or hybrid architectures is not verified.
- Static stitch points: The stitch point is fixed; input-adaptive dynamic stitching strategies are not explored.
Related Work & Insights¶
- vs. Bansal et al. (2021): Prior work validated stitchability only on small models trained on the same dataset; this paper is the first to systematically extend the investigation to heterogeneous VFMs and demonstrate the failure of naive approaches.
- vs. Collins et al. (2025): That work observed that TLT may create out-of-distribution representations; FFM initialization in this paper directly mitigates this issue by preserving representation fidelity before task adaptation.
- vs. Smith et al. (2025): That work questioned whether stitching success merely reflects representation clustering rather than semantic similarity; the self-stitching control experiments in this paper directly address this concern.
- vs. SN-Net (Pan et al.): SN-Net focuses on stitchability training across scales within the same model family, whereas this paper addresses post-hoc stitching of independently trained heterogeneous VFMs.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The FFM training strategy and VST architecture are innovative, though the core ideas build upon the existing model stitching framework.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers multiple VFMs, datasets, tasks, stitch layer types, and stitch depths, with well-designed control experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ — Logically coherent; the problem–analysis–solution–validation narrative flows smoothly, and the introduction of the self-stitching baseline is convincing.
- Value: ⭐⭐⭐⭐ — VST provides a practical solution for efficiency optimization in multi-VFM systems, though weak source model limitations and limited MLLM evaluation somewhat reduce its practical impact.