Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models¶
Conference: AAAI 2026 arXiv: 2504.08915 Code: N/A Area: 3D Vision Keywords: Vision Foundation Models, Parameter-Free Fine-Tuning, Channel Redundancy, SAM, Feature Selection
TL;DR¶
This work identifies pervasive redundant channels in vision foundation models (SAM/SAM2/DINOv2) and proposes a parameter-free adaptation method: an output-difference-based channel selection algorithm identifies optimal replacement pairs and substitutes redundant channels with effective ones to enhance feature representations for downstream tasks, yielding average mIoU gains of 5–11 points without updating a single weight.
Background & Motivation¶
Vision Foundation Models (VFMs) such as SAM and DINOv2, trained on large-scale data, possess powerful general-purpose visual representations. Adapting them to downstream tasks typically requires parameter fine-tuning:
- Full fine-tuning: updates all parameters at high computational cost.
- Parameter-Efficient Fine-Tuning (PEFT/LoRA/Adapter): updates a small subset of parameters (thousands to millions), yet still requires backpropagation and computation graph maintenance.
Key Observation (control experiment in Table 1): On the PerSeg dataset with SAM, zeroing out individual channel activations reveals:
- Channel 6 zeroed: mIoU unchanged (50.6 → 50.6), indicating the channel is redundant.
- Channel 216 zeroed: mIoU improves (50.6 → 52.7), indicating the channel is actively harmful.
- Channels 175/19/189 zeroed: mIoU drops, indicating these channels are beneficial for the task.
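This kind of channel-ablation probe is straightforward to reproduce. A minimal sketch (the function name and tensor shapes are illustrative, not from the paper's code):

```python
import numpy as np

def zero_channel(features: np.ndarray, channel: int) -> np.ndarray:
    """Zero out one channel of encoder features shaped (N, C, H, W),
    leaving the input array untouched."""
    ablated = features.copy()
    ablated[:, channel] = 0.0
    return ablated

# Toy example: a batch of 2 feature maps with 8 channels.
feats = np.random.rand(2, 8, 4, 4)
ablated = zero_channel(feats, channel=6)
assert np.all(ablated[:, 6] == 0.0)             # probed channel is zeroed
assert np.allclose(ablated[:, 0], feats[:, 0])  # other channels untouched
```

Running the frozen decoder on `ablated` versus `feats` and comparing mIoU is exactly the control experiment described above.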
Key Challenge: Among the general features learned by VFMs on large-scale data, many are irrelevant or even detrimental to specific downstream tasks. This redundancy arises because the model must generalize across a large variety of tasks.
Core Problem: Can downstream tasks be addressed without modifying any model parameters—solely by selecting, reusing, and enhancing existing features?
Method¶
Overall Architecture¶
The proposed approach contrasts sharply with conventional fine-tuning paradigms:
- (a) Conventional decoder fine-tuning: updates decoder parameters to adapt pre-trained features.
- (b) Conventional encoder fine-tuning: updates encoder parameters to modify pre-trained features.
- (c) Proposed method: updates no parameters; instead, redundant channels are replaced with more effective ones.
Pipeline: Search dataset → Encoder feature extraction → Pairwise channel replacement → Output difference comparison → Dictionary construction → Optimal combination search → Replacement application.
Key Designs¶
- Problem Formulation
The objective is to find the optimal set of replacement pairs \(P^*\) that maximizes performance on the downstream dataset \(S\):

$$P^* = \arg\max_P \text{mIoU}(S, P)$$

where \(P = \{(i,j)_1, (i,j)_2, \ldots, (i,j)_k\}\) and \((i,j)\) denotes replacing channel \(i\) with channel \(j\).
Direct enumeration of all combinations is intractable: for \(C=256\), it requires \(2^{C^2}\) inference passes.
- Channel Selection Algorithm
Three strategies to reduce search cost:
(1) Output-difference-based search: Given a search dataset \(\mathbf{S}\), the encoder produces features \(X \in \mathbb{R}^{D \times C \times W \times H}\). For each replacement pair \((i,j)\), the gain is computed as:

$$\Delta\text{Acc}_{(i \to j)} = D(X') - D(X)$$

where \(D(X)\) and \(D(X')\) denote the accuracy of the decoder's outputs for the original and replaced features, respectively.
A dictionary \(\mathcal{D} = \{(i,j): \Delta\text{Acc}_{(i \to j)}\}\) is constructed, and the top \(N\) pairs form \(\mathcal{D}_{topN}\).
All \(2^N - 1\) combinations within \(\mathcal{D}_{topN}\) are then enumerated to find the optimal combination \(P^*\).
Complexity reduction: from \(2^{C^2}\) to \(C^2 + 2^N - 1\) inference passes (for \(C=256\), \(N=10\): \(65{,}536 + 1{,}023 = 66{,}559\)).
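The arithmetic behind this reduction is easy to check directly:

```python
# Complexity of the filter-then-combine search (C = channels, N = top pairs kept).
C, N = 256, 10
naive = 2 ** (C * C)           # enumerate every subset of all C^2 replacement pairs
reduced = C * C + 2 ** N - 1   # score each pair once, then try all top-N subsets
print(reduced)  # 66559
```

The naive count has roughly 19,700 decimal digits; the reduced count fits in a single afternoon of forward passes.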
(2) Sample reduction: only 50 images are used as the search dataset.
(3) Feature caching: encoder features are precomputed and stored; each inference pass only modifies the cached features before passing them to the decoder, avoiding redundant encoding.
Design Motivation: The output difference of a single replacement pair serves as a reliable predictor of that pair's contribution within a combination. The filter-then-combine strategy substantially reduces computation while preserving search effectiveness. The process requires only forward inference—no backpropagation—resulting in minimal GPU memory overhead.
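The filter-then-combine search can be sketched as follows. Here `decoder_score` stands in for running the frozen decoder on (cached) modified features and measuring accuracy; all names and shapes are illustrative, not the paper's code:

```python
from itertools import combinations
import numpy as np

def apply_pairs(feats, pairs):
    """Replace channel i's activations with channel j's for each pair (i, j)."""
    out = feats.copy()
    for i, j in pairs:
        out[:, i] = feats[:, j]
    return out

def search_replacement_pairs(cached_feats, decoder_score, top_n=10):
    """Filter-then-combine search over channel-replacement pairs.
    Stage 1 scores every single pair (C*(C-1) inference passes);
    stage 2 enumerates the 2^N - 1 non-empty subsets of the top-N pairs."""
    num_channels = cached_feats.shape[1]
    base = decoder_score(cached_feats)

    # Stage 1: build the gain dictionary D = {(i, j): delta_acc}.
    gains = {}
    for i in range(num_channels):
        for j in range(num_channels):
            if i != j:
                gains[(i, j)] = decoder_score(apply_pairs(cached_feats, [(i, j)])) - base
    top_pairs = sorted(gains, key=gains.get, reverse=True)[:top_n]

    # Stage 2: exhaustive search over subsets of the top-N pairs.
    best_combo, best_score = (), base
    for k in range(1, len(top_pairs) + 1):
        for combo in combinations(top_pairs, k):
            score = decoder_score(apply_pairs(cached_feats, combo))
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score
```

Because `cached_feats` is precomputed once, each candidate evaluation costs only a decoder forward pass, which is why the whole search fits in modest GPU memory.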
- Channel Replacement Implementation
Given a replacement pair \((i,j)\), the feature transformation is:

$$X'_{d,c,w,h} = X_{d,\, f_{i \to j}(c),\, w,h}$$

where \(f_{i \to j}(c) = j\) if \(c = i\) and \(f_{i \to j}(c) = c\) otherwise, i.e., channel \(i\) is read from channel \(j\) while all other channels pass through unchanged.
This is not a random shuffle but a selective, deterministic replacement of redundant channels with effective ones.
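In tensor terms the mapping \(f_{i \to j}\) is just an index remap along the channel axis. A minimal sketch (shapes illustrative):

```python
import numpy as np

def replace_channels(feats: np.ndarray, pairs) -> np.ndarray:
    """Deterministically overwrite channel i with channel j for each (i, j),
    leaving every other channel intact: X'[:, i] = X[:, j]."""
    mapping = np.arange(feats.shape[1])
    for i, j in pairs:
        mapping[i] = j            # f_{i->j}: index i now reads from channel j
    return feats[:, mapping]      # fancy-index along the channel axis

feats = np.random.rand(2, 4, 3, 3)
out = replace_channels(feats, [(0, 2)])
assert np.allclose(out[:, 0], feats[:, 2])  # channel 0 replaced by channel 2
assert np.allclose(out[:, 1], feats[:, 1])  # others unchanged
```

Building the full index map up front also makes it cheap to apply many replacement pairs in a single gather.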
Loss & Training¶
During the search phase, evaluation uses the same Dice + CE loss as the baseline. Notably, the search process involves only model inference—no gradient computation or backpropagation is required.
Implementation details:
- Search dataset: 50 randomly sampled images.
- \(N = 10\) (top-\(N\) replacement pairs).
- Baseline fine-tuning comparisons use 25 epochs, the Adam optimizer, and an initial learning rate of \(10^{-4}\).
Key Experimental Results¶
Main Results¶
Parameter-free fine-tuning results on various SAM versions (average mIoU across 9 datasets):
| Model | Backbone | Parameters | Baseline Avg | +Ours Avg | Gain Δ |
|---|---|---|---|---|---|
| SAM | ViT-B | 91M | 49.14 | 58.08 | +8.94 |
| SAM | ViT-L | 308M | 56.15 | 67.61 | +11.46 |
| SAM | ViT-H | 636M | 55.54 | 60.68 | +5.14 |
| SAM2 | Hiera-T | 39M | 57.29 | 65.63 | +8.34 |
| SAM2 | Hiera-S | 46M | 61.04 | 68.69 | +7.65 |
| SAM2 | Hiera-B+ | 81M | 61.62 | 66.94 | +5.32 |
| SAM2 | Hiera-L | 224M | 67.77 | 73.53 | +5.76 |
Gains of 5–11 mIoU points are achieved without updating any parameters.
Combined with existing fine-tuning methods:
| Fine-tuning Method | Baseline Avg | +Ours Avg | Additional Gain |
|---|---|---|---|
| Decoder-only | 73.61 | 74.62 | +1.01 |
| SAMed (LoRA) | 78.56 | 79.72 | +1.16 |
| SAM-COBOT | 78.73 | 79.32 | +0.59 |
| SAM-Adapter | 72.89 | 73.80 | +0.91 |
| SAM-PARSER | 60.96 | 65.39 | +4.43 |
| DoRA | 79.12 | 79.92 | +0.80 |
These results demonstrate that channel redundancy persists even after parameter fine-tuning, and the proposed method can serve as a plug-and-play module for further improvement.
Ablation Study¶
Computational cost comparison:
| Method | GPU Memory (GB) | Trainable Parameters (K) |
|---|---|---|
| Encoder-only | 34.6 | 89,670 |
| Decoder-only | 13.7 | 4,057 |
| MedSAM | 34.7 | 93,735 |
| SAMed (LoRA) | 28.9 | 147 |
| SAM-PARSER | 15.9 | 0.5 |
| Ours | 11.1 | 0 |
The proposed method achieves the lowest GPU memory usage (11.1 GB vs. 13.7–34.7 GB for other methods) with zero trainable parameters.
Effect of the number of replacement pairs: Increasing the number of replacement pairs generally improves performance, with performance peaking at 6 pairs on the COCO dataset.
Extension to other visual tasks:
| Model | Backbone | NYUv2 MSE↓ / AbsRel↓ / δ₁↑ | CIFAR Acc↑ |
|---|---|---|---|
| DINOv2 | ViT-S | 0.225 / 0.126 / 0.893 | 80.41 |
| +Ours | ViT-S | 0.209 / 0.112 / 0.907 | 80.81 |
| DINOv2 | ViT-B | 0.210 / 0.110 / 0.900 | 88.08 |
| +Ours | ViT-B | 0.193 / 0.095 / 0.916 | 88.49 |
The method is effective for both depth estimation and image classification.
Key Findings¶
- Feature maps of effective channels exhibit clearer structure, edges, and textures, whereas redundant channels appear blurry and noisy (visualized in Figure 5).
- Certain channels exhibit cross-domain consistency: e.g., Channel 19 is effective across natural, medical, and camouflaged scenarios, while Channels 20/98/162/226 are universally redundant.
- Larger models (ViT-H, Hiera-L) show slightly smaller gains, possibly because larger models exhibit relatively less redundancy.
- Gains on in-domain datasets (natural images) are larger than on out-of-domain datasets (medical images), consistent with SAM's training data distribution.
Highlights & Insights¶
- Paradigm innovation: This is the first work to demonstrate that VFMs can be adapted in a completely parameter-free manner—requiring no gradients, no backpropagation, and no additional parameters. Significant downstream performance gains are achieved solely through channel substitution.
- Minimal computational barrier: Requiring only 11.1 GB of GPU memory and forward inference, the method is practically deployable on consumer-grade GPUs, substantially lowering the barrier to VFM adaptation.
- Orthogonal complementarity with PEFT: The method functions as a plug-and-play post-processing step, delivering additional gains of 0.5–4.4 mIoU points on top of already fine-tuned models.
- Insights into channel redundancy: The work reveals the pervasive feature redundancy in foundation models, offering a new perspective on understanding the efficiency of feature utilization in large models.
- Cross-task generalization: The method generalizes from segmentation to depth estimation and classification, and from SAM to DINOv2, validating its universality.
Limitations & Future Work¶
- The search process still requires iterating over \(C^2\) (~65,536) pairs and \(2^N - 1\) combinations; although only inference is needed, this incurs non-trivial time costs at scale.
- The choice of search dataset may influence the optimal replacement pairs; 50 images may not be sufficiently representative for all datasets.
- The value of \(N=10\) is fixed; adaptive determination of \(N\) has not been explored.
- Channel replacement is a hard substitution; softer channel reweighting schemes have not been investigated.
- Operations are limited to the last encoder layer; multi-layer channel replacement has not been explored.
- Validation is restricted to SAM/SAM2/DINOv2; generalizability to other VFMs such as CLIP and MAE remains unknown.
Related Work & Insights¶
- SAM-PARSER (2024): Compresses trainable parameters to as few as 512; this work goes further to zero parameters.
- ShuffleNet: Channel shuffling for cross-group information fusion during training—fundamentally different in objective and mechanism from the proposed method.
- Channel-Exchanging Network: Channel exchange for multimodal fusion; this work targets intra-modal redundancy elimination.
- Network Pruning: Removes redundancy but typically requires retraining; the proposed method does not.
- Insight: Feature redundancy is a universal phenomenon in foundation models; "subtraction" (redundancy elimination) can sometimes be more effective than "addition" (parameter augmentation). This perspective may generalize to the adaptation of NLP large language models.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Parameter-free fine-tuning paradigm is proposed for the first time in the VFM domain; bold and effective.)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (9 datasets × 7 backbones × 6 fine-tuning method combinations, plus depth estimation and classification extensions.)
- Writing Quality: ⭐⭐⭐⭐ (Clear motivation, well-designed experiments, and informative visualizations.)
- Value: ⭐⭐⭐⭐⭐ (Highly practical, extremely low computational barrier, orthogonally complementary to existing methods.)