ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads¶
Conference: ICCV 2025 arXiv: 2506.03433 Code: jackyfl.github.io/vitsplit.github.io Area: 3D Vision Keywords: Vision Foundation Models, Efficient Adapters, DINOv2, Hierarchical Features, Semantic Segmentation
TL;DR¶
Grounded in the key observation that VFM layers can be partitioned into low-level feature extractors and high-level task adapters, this paper proposes ViT-Split, which freezes the VFM backbone and introduces a task head (replicating the last \(K_t\) layers) and a prior head (a lightweight CNN aggregating multi-scale prior features). On ADE20K, ViT-Split achieves 58.2 mIoU (DINOv2-L) with only a linear head, offers 4× faster training, and requires only 1/4–1/5 of the trainable parameters compared to conventional adapters.
Background & Motivation¶
Problem Definition¶
How to efficiently transfer pre-trained knowledge from vision foundation models (VFMs, e.g., DINOv2) to downstream tasks (segmentation, detection, depth estimation, VQA, etc.), while substantially reducing training overhead without sacrificing—or even surpassing—state-of-the-art performance?
Limitations of Prior Work¶
Inefficiency of VFM adapters (ViT-Adapter, ViT-CoMer): - Dual-branch architectures (CNN + ViT) require gradient backpropagation through all layers, with computational and memory costs scaling linearly with model size. - All components (VFM + CNN + adapter + task head) require fine-tuning, resulting in large parameter counts. - The pre-trained prior features of the VFM are altered, preventing full exploitation of pre-trained knowledge.
Limitations of PEFT methods: - VPT (prompt tuning), AdaptFormer (adapter tuning), and LoRA (low-rank fine-tuning) still insert learnable parameters at every layer, triggering gradient backpropagation through early layers. - Without low-level features (as provided by the CNN branch in VFM adapters), performance is typically on par with or slightly below full fine-tuning. - Pre-trained prior features of the VFM are underutilized.
Low efficiency in multi-task deployment: Conventional VFM adapters maintain a complete pipeline (VFM + CNN + adapter + head) per task, incurring severe memory and computational redundancy.
Core Motivation¶
Key Observation: CKA (Centered Kernel Alignment) analysis reveals that layers in VFMs such as DINOv2 can be explicitly divided into two groups—early layers learn low-level features (textures, edges, etc., similar across tasks) and later layers learn task-specific features (segmentation focuses on semantics, detection on corners, with large cross-task divergence). This implies: 1. An additional CNN branch for low-level features is unnecessary—the VFM's own early layers already capture them. 2. Only the later layers need fine-tuning for downstream adaptation—the early layers can be frozen. 3. Prior features from all VFM layers should be leveraged rather than overwritten by fine-tuning.
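The layer-grouping observation rests on CKA similarity between layer representations. As a minimal illustration (not the paper's code), linear CKA between two feature matrices can be computed as follows:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).
    Returns 1.0 for identical representations (up to orthogonal transform
    and isotropic scaling), values near 0 for unrelated ones."""
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style similarity with linear kernels
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

# Toy usage: a layer compared with itself vs. with unrelated features
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 64))   # e.g. pooled tokens from one layer
same = linear_cka(feats, feats)      # identical representations -> 1.0
diff = linear_cka(feats, rng.normal(size=(256, 64)))  # unrelated -> low
```

Comparing such scores between, say, segmentation-tuned and detection-tuned copies of each layer is what reveals the early-layer/late-layer split.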
Method¶
Overall Architecture¶
The entire VFM backbone is frozen. Two "split heads" are introduced: (1) Task head: a copy of the last \(K_t\) VFM layers, receiving the output of layer \(L-K_t\) and learning task-specific features; (2) Prior head: a lightweight CNN that uniformly samples features from \(K_p\) frozen VFM layers and aggregates them. The outputs of both heads are fused via a fusion net and passed to the downstream task head.
Key Designs¶
1. VFM Layer Grouping Observation & Task Head¶
- Function: Copies the last few VFM layers as a trainable task-specific adapter.
- Mechanism:
Observation validation: CKA analysis and feature visualization reveal that: - Early layers (e.g., L1–L6 of DINOv2-S): cross-task features (pre-training/segmentation/detection) are similar, focusing on textures and edges. - Later layers (L7–L12): features diverge across tasks—segmentation biases toward semantic information, detection toward object corners.
Task head design: The output \(f_{L-K_t}\) of VFM layer \(L-K_t\) is fed into the copied last \(K_t\) layers: \(f_t = g_{\theta_t}(f_{L-K_t})\). The class token is then discarded and the output is reshaped into a spatial feature map \(f_t' \in \mathbb{R}^{h \times w \times D}\).
\(K_t\) is a key hyperparameter: it controls the trade-off between adapter capacity and training efficiency. For segmentation, gains diminish as \(K_t\) increases, so a smaller value suffices.
- Design Motivation: Since early layers are shared across tasks, fine-tuning them (and their associated gradient backpropagation) is unnecessary. Copying the later layers as the task head provides a naturally good initialization while preserving the original VFM prior features intact. This also implies that large segmentation heads (Mask2Former, UperNet) may be unnecessary.
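The "freeze the backbone, deep-copy the last \(K_t\) layers" construction can be sketched as follows. This is a toy stand-in (a generic transformer stack rather than DINOv2, with made-up sizes), not the authors' implementation:

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for a pretrained VFM: L transformer blocks of width D.
L, K_t, D = 12, 4, 64
backbone = nn.ModuleList(
    [nn.TransformerEncoderLayer(D, nhead=4, batch_first=True) for _ in range(L)]
)

# Freeze the entire backbone: no gradients ever reach the VFM layers.
for p in backbone.parameters():
    p.requires_grad = False

# Task head = trainable deep copy of the last K_t layers, so it starts
# from the pretrained weights (the "naturally good initialization").
task_head = copy.deepcopy(backbone[L - K_t:])
for p in task_head.parameters():
    p.requires_grad = True

def forward(tokens):
    feats, x = [], tokens
    for blk in backbone:            # frozen pass, also yields prior features
        x = blk(x)
        feats.append(x)
    f_t = feats[L - K_t - 1]        # output of layer L - K_t enters the task head
    for blk in task_head:
        f_t = blk(f_t)
    return f_t, feats

x = torch.randn(2, 16, D)           # (batch, tokens, dim)
f_t, feats = forward(x)
```

Because the frozen pass already produces every intermediate feature map, the prior head described next comes almost for free.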
2. Prior Head: Multi-Scale Prior Feature Aggregation¶
- Function: Extracts multi-scale prior features from the frozen VFM to enhance task-specific features.
- Mechanism:
Layer selection—uniform sampling: \(K_p\) layers are uniformly sampled from the \(L\) layers with interval \(\delta = \frac{L-b-1}{K_p-1}\): \(\mathcal{S} = \{b + \mathrm{round}(i \cdot \delta) \mid i = 0, \ldots, K_p-1\}\), where \(b=2\) or \(b=3\) (skipping the noisiest features in the very first layers).
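The sampling rule is a one-liner; a small sketch of the index computation (the function name is ours):

```python
def sample_prior_layers(L, K_p, b=2):
    """Uniformly sample K_p layer indices from [b, L-1] with interval
    delta = (L - b - 1) / (K_p - 1); b skips the noisiest early layers."""
    delta = (L - b - 1) / (K_p - 1)
    return [b + round(i * delta) for i in range(K_p)]

# e.g. a 12-layer backbone (DINOv2-S scale), K_p = 4, b = 2
print(sample_prior_layers(12, 4, b=2))   # -> [2, 5, 8, 11]
```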
Prior head architecture: Features from the sampled layers are concatenated as \(f_p \in \mathbb{R}^{h \times w \times (K_p \cdot D)}\) and processed by two CNN layers: \(f_p' = g_{\theta_p}(f_p)\). The CNN consists of a \(1 \times 1\) convolution (channel compression) followed by a \(3 \times 3\) deformable convolution (enhancing low-level features and modeling geometric transformations).
- Design Motivation: Uniform sampling reduces redundancy among highly similar adjacent-layer features while increasing diversity. VFM prior features, learned from large-scale data, encode rich general-purpose knowledge—directly leveraging them enhances task-specific features and mitigates overfitting in the task head. Deformable convolutions are better suited for irregular spatial structures than standard convolutions.
3. Fusion Net & Multi-Task Adaptation¶
- Function: Fuses task-specific and prior features for adaptation to different downstream tasks.
- Mechanism:
Task and prior feature maps are concatenated (preserving more information): \(f_o = g_{\theta_f}([f_p'; f_t'])\). The fusion net shares the same structure as the prior head (\(1 \times 1\) convolution + \(3 \times 3\) deformable convolution).
Task-specific transformations: - Segmentation: \(4\times\) upsampling (two transposed convolution layers). - Detection: generates 4 scales (\(4\times, 2\times, 1\times, 0.5\times\)) to match MaskRCNN input. - VQA: reshaped into sequence dimension \((h \cdot w) \times D\) and fed into the LLM decoder.
- Design Motivation: Concatenation outperforms addition and attention-based fusion (verified by ablation), as it retains all information. A single frozen VFM backbone can be shared across multiple tasks, with each task training only its own task head + prior head + fusion net, substantially reducing multi-task deployment overhead.
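For segmentation, the fuse-then-upsample step above can be sketched as follows. Here the fusion net is approximated by a single \(1 \times 1\) convolution (the paper uses \(1 \times 1\) + \(3 \times 3\) deformable conv), and the two stride-2 transposed convolutions realize the \(4\times\) upsampling; activations and sizes are assumptions:

```python
import torch
import torch.nn as nn

D = 64
fuse = nn.Conv2d(2 * D, D, kernel_size=1)          # concat -> project
up4 = nn.Sequential(                               # two stride-2 transposed
    nn.ConvTranspose2d(D, D, kernel_size=2, stride=2),   # convs give 4x
    nn.GELU(),
    nn.ConvTranspose2d(D, D, kernel_size=2, stride=2),
)

f_p = torch.randn(2, D, 16, 16)   # prior-head output
f_t = torch.randn(2, D, 16, 16)   # task-head output
f_o = fuse(torch.cat([f_p, f_t], dim=1))   # concatenation, not addition
seg_feat = up4(f_o)                        # (2, 64, 64, 64): 4x resolution
```

A linear per-pixel classifier on `seg_feat` is then all the segmentation head the method needs.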
Loss & Training¶
- Segmentation: Standard cross-entropy loss, AdamW (lr=2e-4, wd=1e-2), batch size 16, 40k iterations.
- Detection: Standard MaskRCNN losses, AdamW (lr=1e-4, wd=5e-2), 12 epochs.
- Task head learning rate is scaled by 0.1× (to avoid rapid corruption of the initialization).
- The backbone is completely frozen; no gradients are backpropagated into the early VFM layers.
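The segmentation recipe above (AdamW, lr=2e-4, wd=1e-2, with the task head at \(0.1\times\) the base learning rate) maps directly onto per-parameter-group options; a sketch with toy stand-in modules of arbitrary shape:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three trainable components (shapes are arbitrary).
task_head = nn.Linear(64, 64)
prior_head = nn.Conv2d(256, 64, kernel_size=1)
fusion_net = nn.Conv2d(128, 64, kernel_size=1)

base_lr = 2e-4
optimizer = torch.optim.AdamW(
    [
        # 0.1x lr so the copied initialization is not corrupted early on
        {"params": task_head.parameters(), "lr": base_lr * 0.1},
        {"params": prior_head.parameters()},   # default lr
        {"params": fusion_net.parameters()},   # default lr
    ],
    lr=base_lr,
    weight_decay=1e-2,
)
```

The frozen backbone simply never appears in any parameter group, so no optimizer state or gradients are allocated for it.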
Key Experimental Results¶
Main Results¶
ADE20K Semantic Segmentation (512×512):
| Method | Head | Trainable Params (M) | mIoU |
|---|---|---|---|
| ViT-Adapter-S | UperNet | 57.6 | 46.2 |
| ViT-CoMer-S | UperNet | 61.4 | 46.5 |
| DINOv2-S (Linear) | Linear | 22.1 | 49.6 |
| ViT-Split-S | Linear | 10.2 | 51.6 |
| ViT-Adapter-B | UperNet | 133.9 | 48.8 |
| DINOv2-B (UperNet) | UperNet | 120.7 | 54.8 |
| ViT-Split-B | Linear | 40.5 | 55.7 |
| ViT-Adapter-L | UperNet | 363.8 | 53.4 |
| DINOv2-L (UperNet) | UperNet | 341.2 | 57.1 |
| ViT-Split-L | Linear | 88.6 | 58.2 |
Cityscapes Semantic Segmentation (896×896):
| Method | Head | Trainable Params (M) | mIoU (SS/MS) |
|---|---|---|---|
| ViT-Adapter-L | Mask2Former | 571 | 84.9/85.8 |
| DINOv2-L | Linear | 312.9 | 83.5/84.3 |
| ViT-Split-L | Linear | 164.1 | 85.8/86.7 |
Ablation Study¶
VQA Performance (LLaVA-1.5 + ViT-Split vs. Baseline):
| Benchmark | LLaVA-1.5 | + ViT-Split | Change |
|---|---|---|---|
| VQAv2 | 78.5 | 78.2 | -0.3 |
| LLaVA-Wild | 65.4 | 71.1 | +5.7 |
| SciQA-IMG | 66.8 | 70.4 | +3.6 |
| POPE (adv) | 84.2 | 86.1 | +1.9 |
| MMBench | 64.3 | 66.4 | +2.1 |
Training Efficiency Comparison:
| Method | Training Time (10k iters) | Ratio vs. ViT-Split |
|---|---|---|
| ViT-Adapter-S | ~38 min | ~4× slower |
| ViT-CoMer-S | ~37 min | ~3.9× slower |
| ViT-Split-S | ~9 min 25 s | baseline |
Fusion Strategy Ablation:
| Fusion Strategy | mIoU |
|---|---|
| Addition | Lower |
| Concatenation | Best |
Key Findings¶
- A simple linear head suffices to outperform complex segmentation heads: ViT-Split-L (Linear, 88.6M params) surpasses ViT-Adapter-L (UperNet, 363.8M params) by 4.8 mIoU.
- Significant parameter efficiency: Trainable parameters are only 1/4–1/5 of conventional adapters, with only 1/4 of the training iterations required.
- 4× training speedup: Achieved by eliminating gradient backpropagation through early layers and removing the CNN branch.
- Value of VFM prior features: The prior head yields an average improvement of ~2% mIoU over the no-prior baseline, confirming the utility of frozen multi-scale prior features.
- Multi-task generality: Effective across segmentation, detection, depth estimation, and VQA; integrating ViT-Split into LLaVA-1.5 yields a +5.7-point gain on LLaVA-Wild.
Highlights & Insights¶
- Concise and powerful core observation: VFM layers = feature extractor (shared across tasks) + task adapter (task-specific), validated through both CKA analysis and feature visualization.
- "Less is more" philosophy: Freezing the majority of parameters and training only a small subset outperforms full fine-tuning by preventing prior knowledge from being overwritten.
- Elegant design: The entire method reduces to "copy the last few layers + add a CNN to aggregate priors," yet substantially outperforms complex multi-branch adapter architectures.
- Efficient multi-task inference: A single frozen VFM is shared; each task requires only its own lightweight heads, making the approach GPU-memory friendly.
- Deep understanding of VFMs: The paper reveals the intuitive divergence between DINOv2 segmentation features (semantic-focused) and detection features (corner-focused).
Limitations & Future Work¶
- VFM dependency: The approach works best with DINOv2 (where layer grouping is most pronounced); layer grouping is less distinct for other VFMs (e.g., CLIP).
- Detection task gap: Due to the large discrepancy between detection and DINOv2 pre-training objectives, ViT-Split only matches ViT-CoMer on COCO detection.
- Manual tuning of \(K_t\) and \(K_p\): Optimal values vary across tasks and model scales.
- Resolution limitations: Inherits ViT's patch size (14×14 or 16×16), leading to reduced efficiency at high input resolutions.
- Video/temporal extension unexplored: The current framework supports only single-frame input.
Related Work & Insights¶
- Relation to ViT-Adapter: ViT-Adapter employs a CNN branch for local features with cross-attention interaction; ViT-Split demonstrates that the VFM's own early layers already capture local features, rendering an additional CNN branch unnecessary.
- Comparison with LoRA/VPT: PEFT methods modify every layer without leveraging low-level features; ViT-Split modifies only high-level layers while utilizing prior features from all layers.
- Encoder-decoder nature of DINOv2: Self-supervised pre-training (reconstructing masked patches) naturally induces a MAE-like hierarchical "encode-then-decode" layer structure.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The core observation is insightful and the design is clean and effective, though the adapter paradigm itself is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers four tasks (segmentation/detection/depth/VQA) with comprehensive ablations and training efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ — Observations and method descriptions are clear, with highly informative figures.
- Value: ⭐⭐⭐⭐⭐ — Provides an efficient paradigm for VFM adaptation with strong practical utility.