Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models¶
Conference: AAAI 2026 arXiv: 2504.08915 Code: N/A Area: 3D Vision Keywords: Vision Foundation Models, Parameter-Free Fine-Tuning, Channel Redundancy, SAM, Feature Selection
TL;DR¶
This work identifies pervasive redundant channels in vision foundation models (SAM/SAM2/DINOv2) and proposes a parameter-free adaptation method: an output-difference-based channel selection algorithm identifies optimal replacement pairs and substitutes redundant channels with effective ones to enhance feature representations for downstream tasks, yielding average mIoU gains of 5–11 points without updating a single weight.
Background & Motivation¶
Vision Foundation Models (VFMs) such as SAM and DINOv2, trained on large-scale data, possess powerful general-purpose visual representations. Adapting them to downstream tasks typically requires parameter fine-tuning:
- Full fine-tuning: updates all parameters at high computational cost.
- Parameter-Efficient Fine-Tuning (PEFT/LoRA/Adapter): updates a small subset of parameters (thousands to millions), yet still requires backpropagation and computation graph maintenance.
Key Observation (control experiment in Table 1): On the PerSeg dataset with SAM, zeroing out individual channel activations reveals:
- Channel 6 zeroed: mIoU unchanged (50.6 → 50.6), indicating the channel is redundant.
- Channel 216 zeroed: mIoU improves (50.6 → 52.7), indicating the channel is actively harmful.
- Channels 175/19/189 zeroed: mIoU drops, indicating these channels are beneficial for the task.
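This kind of channel-ablation probe is straightforward to reproduce. A minimal sketch (the function name and tensor shapes are illustrative, not from the paper's code):

```python
import numpy as np

def zero_channel(features: np.ndarray, channel: int) -> np.ndarray:
    """Zero out one channel of encoder features shaped (N, C, H, W),
    leaving the input array untouched."""
    ablated = features.copy()
    ablated[:, channel] = 0.0
    return ablated

# Toy example: a batch of 2 feature maps with 8 channels.
feats = np.random.rand(2, 8, 4, 4)
ablated = zero_channel(feats, channel=6)
assert np.all(ablated[:, 6] == 0.0)             # probed channel is zeroed
assert np.allclose(ablated[:, 0], feats[:, 0])  # other channels untouched
```

Running the frozen decoder on `ablated` versus `feats` and comparing mIoU is exactly the control experiment described above.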
Key Challenge: Among the general features learned by VFMs on large-scale data, many are irrelevant or even detrimental to specific downstream tasks. This redundancy arises because the model must generalize across a large variety of tasks.
Core Problem: Can downstream tasks be addressed without modifying any model parameters—solely by selecting, reusing, and enhancing existing features?
Method¶
Overall Architecture¶
The proposed approach contrasts sharply with conventional fine-tuning paradigms:
- (a) Conventional decoder fine-tuning: updates decoder parameters to adapt pre-trained features.
- (b) Conventional encoder fine-tuning: updates encoder parameters to modify pre-trained features.
- (c) Proposed method: updates no parameters; instead, redundant channels are replaced with more effective ones.
Pipeline: Search dataset → Encoder feature extraction → Pairwise channel replacement → Output difference comparison → Dictionary construction → Optimal combination search → Replacement application.
Key Designs¶
- Problem Formulation
The objective is to find the optimal set of replacement pairs \(P^*\) that maximizes performance on the downstream dataset \(S\):

$$P^* = \arg\max_P \text{mIoU}(S, P)$$

where \(P = \{(i,j)_1, (i,j)_2, \ldots, (i,j)_k\}\) and \((i,j)\) denotes replacing channel \(i\) with channel \(j\).
Direct enumeration of all combinations is intractable: for \(C=256\), it requires \(2^{C^2}\) inference passes.
- Channel Selection Algorithm
Three strategies to reduce search cost:
(1) Output-difference-based search: Given a search dataset \(\mathbf{S}\), the encoder produces features \(X \in \mathbb{R}^{D \times C \times W \times H}\). For each replacement pair \((i,j)\), the gain is computed as:

$$\Delta\text{Acc}_{(i \to j)} = D(X') - D(X)$$

where \(D(X)\) and \(D(X')\) denote the accuracy of the decoder's outputs for the original and replaced features, respectively.
A dictionary \(\mathcal{D} = \{(i,j): \Delta\text{Acc}_{(i \to j)}\}\) is constructed, and the top \(N\) pairs form \(\mathcal{D}_{topN}\).
All \(2^N - 1\) combinations within \(\mathcal{D}_{topN}\) are then enumerated to find the optimal combination \(P^*\).
Complexity reduction: from \(2^{C^2}\) to \(C^2 + 2^N - 1\) inference passes (for \(C=256\), \(N=10\): \(65{,}536 + 1{,}023 = 66{,}559\)).
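The arithmetic behind this reduction is easy to check directly:

```python
# Complexity of the filter-then-combine search (C = channels, N = top pairs kept).
C, N = 256, 10
naive = 2 ** (C * C)           # enumerate every subset of all C^2 replacement pairs
reduced = C * C + 2 ** N - 1   # score each pair once, then try all top-N subsets
print(reduced)  # 66559
```

The naive count has roughly 19,700 decimal digits; the reduced count fits in a single afternoon of forward passes.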
(2) Sample reduction: only 50 images are used as the search dataset.
(3) Feature caching: encoder features are precomputed and stored; each inference pass only modifies the cached features before passing them to the decoder, avoiding redundant encoding.
Design Motivation: The output difference of a single replacement pair serves as a reliable predictor of that pair's contribution within a combination. The filter-then-combine strategy substantially reduces computation while preserving search effectiveness. The process requires only forward inference—no backpropagation—resulting in minimal GPU memory overhead.
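The filter-then-combine search can be sketched as follows. Here `decoder_score` stands in for running the frozen decoder on (cached) modified features and measuring accuracy; all names and shapes are illustrative, not the paper's code:

```python
from itertools import combinations
import numpy as np

def apply_pairs(feats, pairs):
    """Replace channel i's activations with channel j's for each pair (i, j)."""
    out = feats.copy()
    for i, j in pairs:
        out[:, i] = feats[:, j]
    return out

def search_replacement_pairs(cached_feats, decoder_score, top_n=10):
    """Filter-then-combine search over channel-replacement pairs.
    Stage 1 scores every single pair (C*(C-1) inference passes);
    stage 2 enumerates the 2^N - 1 non-empty subsets of the top-N pairs."""
    num_channels = cached_feats.shape[1]
    base = decoder_score(cached_feats)

    # Stage 1: build the gain dictionary D = {(i, j): delta_acc}.
    gains = {}
    for i in range(num_channels):
        for j in range(num_channels):
            if i != j:
                gains[(i, j)] = decoder_score(apply_pairs(cached_feats, [(i, j)])) - base
    top_pairs = sorted(gains, key=gains.get, reverse=True)[:top_n]

    # Stage 2: exhaustive search over subsets of the top-N pairs.
    best_combo, best_score = (), base
    for k in range(1, len(top_pairs) + 1):
        for combo in combinations(top_pairs, k):
            score = decoder_score(apply_pairs(cached_feats, combo))
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score
```

Because `cached_feats` is precomputed once, each candidate evaluation costs only a decoder forward pass, which is why the whole search fits in modest GPU memory.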
- Channel Replacement Implementation
Given a replacement pair \((i,j)\), the feature transformation is:

$$X'_{d,c,w,h} = X_{d,\, f_{i \to j}(c),\, w,h}$$

where \(f_{i \to j}(c) = j\) if \(c = i\) and \(f_{i \to j}(c) = c\) otherwise, i.e., channel \(i\) is read from channel \(j\) while all other channels pass through unchanged.
This is not a random shuffle but a selective, deterministic replacement of redundant channels with effective ones.
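In tensor terms the mapping \(f_{i \to j}\) is just an index remap along the channel axis. A minimal sketch (shapes illustrative):

```python
import numpy as np

def replace_channels(feats: np.ndarray, pairs) -> np.ndarray:
    """Deterministically overwrite channel i with channel j for each (i, j),
    leaving every other channel intact: X'[:, i] = X[:, j]."""
    mapping = np.arange(feats.shape[1])
    for i, j in pairs:
        mapping[i] = j            # f_{i->j}: index i now reads from channel j
    return feats[:, mapping]      # fancy-index along the channel axis

feats = np.random.rand(2, 4, 3, 3)
out = replace_channels(feats, [(0, 2)])
assert np.allclose(out[:, 0], feats[:, 2])  # channel 0 replaced by channel 2
assert np.allclose(out[:, 1], feats[:, 1])  # others unchanged
```

Building the full index map up front also makes it cheap to apply many replacement pairs in a single gather.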
Loss & Training¶
During the search phase, evaluation uses the same Dice + CE loss as the baseline. Notably, the search process involves only model inference—no gradient computation or backpropagation is required.
Implementation details:
- Search dataset: 50 randomly sampled images.
- \(N = 10\) (top-\(N\) replacement pairs).
- Baseline fine-tuning comparisons use 25 epochs, the Adam optimizer, and an initial learning rate of \(10^{-4}\).
Key Experimental Results¶
Main Results¶
Parameter-free fine-tuning results on various SAM versions (average mIoU across 9 datasets):
| Model | Backbone | Parameters | Baseline Avg | +Ours Avg | Gain Δ |
|---|---|---|---|---|---|
| SAM | ViT-B | 91M | 49.14 | 58.08 | +8.94 |
| SAM | ViT-L | 308M | 56.15 | 67.61 | +11.46 |
| SAM | ViT-H | 636M | 55.54 | 60.68 | +5.14 |
| SAM2 | Hiera-T | 39M | 57.29 | 65.63 | +8.34 |
| SAM2 | Hiera-S | 46M | 61.04 | 68.69 | +7.65 |
| SAM2 | Hiera-B+ | 81M | 61.62 | 66.94 | +5.32 |
| SAM2 | Hiera-L | 224M | 67.77 | 73.53 | +5.76 |
Gains of 5–11 mIoU points are achieved without updating any parameters.
Combined with existing fine-tuning methods:
| Fine-tuning Method | Baseline Avg | +Ours Avg | Additional Gain |
|---|---|---|---|
| Decoder-only | 73.61 | 74.62 | +1.01 |
| SAMed (LoRA) | 78.56 | 79.72 | +1.16 |
| SAM-COBOT | 78.73 | 79.32 | +0.59 |
| SAM-Adapter | 72.89 | 73.80 | +0.91 |
| SAM-PARSER | 60.96 | 65.39 | +4.43 |
| DoRA | 79.12 | 79.92 | +0.80 |
These results demonstrate that channel redundancy persists even after parameter fine-tuning, and the proposed method can serve as a plug-and-play module for further improvement.
Ablation Study¶
Computational cost comparison:
| Method | GPU Memory (GB) | Trainable Parameters (K) |
|---|---|---|
| Encoder-only | 34.6 | 89,670 |
| Decoder-only | 13.7 | 4,057 |
| MedSAM | 34.7 | 93,735 |
| SAMed (LoRA) | 28.9 | 147 |
| SAM-PARSER | 15.9 | 0.5 |
| Ours | 11.1 | 0 |
The proposed method achieves the lowest GPU memory usage (11.1 GB vs. 13.7–34.7 GB for other methods) with zero trainable parameters.
Effect of the number of replacement pairs: Increasing the number of replacement pairs generally improves performance, with performance peaking at 6 pairs on the COCO dataset.
Extension to other visual tasks:
| Model | Backbone | NYUv2 MSE↓ / AbsRel↓ / δ₁↑ | CIFAR Acc↑ |
|---|---|---|---|
| DINOv2 | ViT-S | 0.225 / 0.126 / 0.893 | 80.41 |
| +Ours | ViT-S | 0.209 / 0.112 / 0.907 | 80.81 |
| DINOv2 | ViT-B | 0.210 / 0.110 / 0.900 | 88.08 |
| +Ours | ViT-B | 0.193 / 0.095 / 0.916 | 88.49 |
The method is effective for both depth estimation and image classification.
Key Findings¶
- Feature maps of effective channels exhibit clearer structure, edges, and textures, whereas redundant channels appear blurry and noisy (visualized in Figure 5).
- Certain channels exhibit cross-domain consistency: e.g., Channel 19 is effective across natural, medical, and camouflaged scenarios, while Channels 20/98/162/226 are universally redundant.
- Larger models (ViT-H, Hiera-L) show slightly smaller gains, possibly because larger models exhibit relatively less redundancy.
- Gains on in-domain datasets (natural images) are larger than on out-of-domain datasets (medical images), consistent with SAM's training data distribution.
Highlights & Insights¶
- Paradigm innovation: This is the first work to demonstrate that VFMs can be adapted in a completely parameter-free manner—requiring no gradients, no backpropagation, and no additional parameters. Significant downstream performance gains are achieved solely through channel substitution.
- Minimal computational barrier: Requiring only 11.1 GB of GPU memory and forward inference, the method is practically deployable on consumer-grade GPUs, substantially lowering the barrier to VFM adaptation.
- Orthogonal complementarity with PEFT: The method functions as a plug-and-play post-processing step, delivering additional gains of 0.5–4.4 mIoU points on top of already fine-tuned models.
- Insights into channel redundancy: The work reveals the pervasive feature redundancy in foundation models, offering a new perspective on understanding the efficiency of feature utilization in large models.
- Cross-task generalization: The method generalizes from segmentation to depth estimation and classification, and from SAM to DINOv2, validating its universality.
Limitations & Future Work¶
- The search process still requires iterating over \(C^2\) (~65,536) pairs and \(2^N - 1\) combinations; although only inference is needed, this incurs non-trivial time costs at scale.
- The choice of search dataset may influence the optimal replacement pairs; 50 images may not be sufficiently representative for all datasets.
- The value of \(N=10\) is fixed; adaptive determination of \(N\) has not been explored.
- Channel replacement is a hard substitution; softer channel reweighting schemes have not been investigated.
- Operations are limited to the last encoder layer; multi-layer channel replacement has not been explored.
- Validation is restricted to SAM/SAM2/DINOv2; generalizability to other VFMs such as CLIP and MAE remains unknown.
Related Work & Insights¶
- SAM-PARSER (2024): Compresses trainable parameters to as few as 512; this work goes further to zero parameters.
- ShuffleNet: Channel shuffling for cross-group information fusion during training—fundamentally different in objective and mechanism from the proposed method.
- Channel-Exchanging Network: Channel exchange for multimodal fusion; this work targets intra-modal redundancy elimination.
- Network Pruning: Removes redundancy but typically requires retraining; the proposed method does not.
- Insight: Feature redundancy is a universal phenomenon in foundation models; "subtraction" (redundancy elimination) can sometimes be more effective than "addition" (parameter augmentation). This perspective may generalize to the adaptation of NLP large language models.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Parameter-free fine-tuning paradigm is proposed for the first time in the VFM domain; bold and effective.)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (9 datasets × 7 backbones × 6 fine-tuning method combinations, plus depth estimation and classification extensions.)
- Writing Quality: ⭐⭐⭐⭐ (Clear motivation, well-designed experiments, and informative visualizations.)
- Value: ⭐⭐⭐⭐⭐ (Highly practical, extremely low computational barrier, orthogonally complementary to existing methods.)