ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation

Conference: CVPR 2026 · arXiv: 2603.11542 · Code: github.com/Jahid12012021/ReHARK · Area: Multimodal VLM · Keywords: CLIP adaptation, one-shot learning, kernel ridge regression, training-free, RBF kernel

TL;DR

This paper proposes ReHARK, a training-free one-shot CLIP adaptation framework that constructs a hybrid prior by fusing CLIP text knowledge, GPT-3 semantic descriptions, and visual prototypes, and performs global proximal regularization in RKHS via multi-scale RBF kernels. It sets a new one-shot state of the art of 65.83% average accuracy across 11 benchmarks.

Background & Motivation

Adapting VLMs such as CLIP to downstream tasks under extremely few-shot (one-shot) settings faces a stability–plasticity dilemma. Fine-tuning methods (e.g., CoOp) are computationally expensive and prone to catastrophic forgetting; training-free methods such as Tip-Adapter are lightweight but are theoretically equivalent to local Nadaraya–Watson estimators, suffering from boundary bias and lacking global structural regularization. ProKeR mitigates this by introducing global RKHS regularization, yet under extreme data scarcity with a single visual sample per class, it struggles to capture domain-specific details, limiting performance.

Core Problem

How can multimodal prior knowledge (text + vision + LLM semantics) be fully exploited under the extreme constraint of one sample per class, while achieving stable, robust CLIP domain adaptation via global kernel methods?

Method

Overall Architecture

A four-stage pipeline: (1) feature transformation and rectification → (2) hybrid prior construction → (3) support set augmentation (bridging) → (4) multi-scale RBF kernel ridge regression in RKHS. All steps require no backpropagation; hyperparameters are searched via Optuna with 1,000 trials.
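
To make the flow concrete, a minimal skeleton of the four stages is sketched below. The function names, the `hp` hyperparameter dictionary, and all signatures are hypothetical reading aids (the official code lives at the linked repository); each helper is expanded in the sketch after the Key Designs list.

```python
import numpy as np

def rehark_adapt(F_vis, W_clip, W_gpt3, labels, hp):
    """Hypothetical skeleton of ReHARK's four training-free stages.

    F_vis: (NK, d) one-shot support features; W_clip, W_gpt3: (C, d) text weights;
    hp: Optuna-searched hyperparameters (p, gamma, omega, eta, beta1, beta2, pi, lam).
    """
    # (1) Feature transformation and rectification
    F = power_transform(F_vis, hp["p"])
    # (2) Hybrid prior from CLIP text, GPT-3 descriptions, and visual prototypes
    W_prior = build_hybrid_prior(W_clip, W_gpt3, F, labels, hp["gamma"], hp["omega"])
    # (3) Support-set bridge augmentation: NK -> 2NK samples
    S_aug, y_aug = bridge_augment(F, labels, W_prior, hp["eta"])
    # (4) Multi-scale RBF kernel ridge regression, solved in closed form
    alpha = solve_krr(S_aug, y_aug, W_prior, hp)
    return W_prior, S_aug, alpha
```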

Key Designs

  1. Hybrid Semantic–Visual Prior: Three knowledge sources are fused — (a) standard CLIP text weights \(W_{clip}\), (b) weights \(W_{gpt3}\) obtained by encoding GPT-3-generated dense semantic descriptions through CLIP, and (c) 1-shot visual class prototypes \(P_{vis}\). Text priors are first blended with weight \(\gamma\), then fused with visual prototypes using weight \(\omega\), yielding a stable global anchor \(W_{prior}\).

  2. Support Set Bridge Augmentation: Cross-modal "bridge" samples are generated by interpolating visual features with their corresponding refined text priors: \(x_{bridge} = \text{norm}(x_{vis} + \eta \cdot w_{label})\). This smooths the adaptation manifold between the text and visual modalities, expanding the support set from \(NK\) to \(2NK\).

  3. Multi-Scale RBF Kernel Ensemble: A convex combination of two Gaussian RBF kernels with bandwidths \(\beta_1\) (local) and \(\beta_2\) (global) is employed: \(K(\mathbf{x}, \mathbf{x}') = \pi \exp(-\beta_1 \|\mathbf{x}-\mathbf{x}'\|^2) + (1-\pi) \exp(-\beta_2 \|\mathbf{x}-\mathbf{x}'\|^2)\). The closed-form solution is \(\boldsymbol{\alpha} = (K + \lambda I)^{-1}(Y - \hat{Y}_{zs})\).

  4. Nonlinear Power Transform: The transform \(f(\mathbf{x}, p) = \text{sign}(\mathbf{x}) \cdot |\mathbf{x}|^p\) is applied to features, followed by L2 normalization, alleviating high-dimensional distribution bias and domain shift. (A consolidated sketch of all four designs follows this list.)
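
The sketch below fills in the helpers used in the pipeline skeleton, one per design, in plain NumPy. It follows the formulas as written; details the formulas leave open, such as \(\sigma_{zs}\) being a temperature-scaled softmax (CLIP's logit scale of 100 is assumed here) and the exact points at which L2 normalization is applied, are assumptions rather than the authors' implementation.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def power_transform(x, p):
    # Design 4: f(x, p) = sign(x) * |x|^p, followed by L2 normalization.
    return l2norm(np.sign(x) * np.abs(x) ** p)

def build_hybrid_prior(W_clip, W_gpt3, F_vis, labels, gamma, omega):
    # Design 1: blend the two text priors with gamma, then fuse the 1-shot
    # visual prototypes with omega. With one shot per class, a prototype is
    # just that class's single (transformed) support feature.
    W_text = l2norm(gamma * W_clip + (1.0 - gamma) * W_gpt3)
    P_vis = np.stack([l2norm(F_vis[labels == c].mean(axis=0))
                      for c in range(W_clip.shape[0])])
    return l2norm(omega * W_text + (1.0 - omega) * P_vis)

def bridge_augment(F_vis, labels, W_prior, eta):
    # Design 2: x_bridge = norm(x_vis + eta * w_label); support grows NK -> 2NK.
    bridges = l2norm(F_vis + eta * W_prior[labels])
    return (np.concatenate([F_vis, bridges], axis=0),
            np.concatenate([labels, labels], axis=0))

def multi_scale_rbf(A, B, beta1, beta2, pi):
    # Design 3: convex combination of a local and a global Gaussian RBF kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return pi * np.exp(-beta1 * d2) + (1.0 - pi) * np.exp(-beta2 * d2)

def solve_krr(S_aug, y_aug, W_prior, hp, temperature=100.0):
    # alpha = (K + lam*I)^{-1} (Y - Y_zs): regress the residual between one-hot
    # labels and the zero-shot prediction on the augmented support set.
    n, n_cls = len(y_aug), W_prior.shape[0]
    Y = np.eye(n_cls)[y_aug]
    Y_zs = softmax(temperature * S_aug @ W_prior.T)
    K = multi_scale_rbf(S_aug, S_aug, hp["beta1"], hp["beta2"], hp["pi"])
    return np.linalg.solve(K + hp["lam"] * np.eye(n), Y - Y_zs)

def predict(X_q, S_aug, alpha, W_prior, hp, temperature=100.0):
    # Inference: Phi(x_q) = sigma_zs(x_q W_prior^T) + K(x_q, S_aug) @ alpha.
    zs = softmax(temperature * X_q @ W_prior.T)
    return zs + multi_scale_rbf(X_q, S_aug, hp["beta1"], hp["beta2"], hp["pi"]) @ alpha
```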

Loss & Training

No training is required; adaptation coefficients are computed directly via a closed-form solution at inference time. Hyperparameters (\(\beta_1, \beta_2, p, \gamma, \omega\), etc.) are optimized with Optuna over 1,000 trials. The inference formula is \(\Phi(x_q) = \sigma_{zs}(x_q W_{prior}^T) + K(x_q, S_{aug})\boldsymbol{\alpha}\).
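
Since the only optimization is this search, a hedged sketch of what the 1,000-trial Optuna loop could look like follows, reusing the functions from the sketches above. The search ranges are illustrative, not the paper's, and `F_vis`, `W_clip`, `W_gpt3`, `labels`, `F_val`, `y_val` are assumed to be precomputed CLIP features and labels for the support and validation splits.

```python
import optuna

def objective(trial):
    # Illustrative search ranges; the paper's actual ranges are not reproduced here.
    hp = {
        "p":     trial.suggest_float("p", 0.25, 2.0),
        "gamma": trial.suggest_float("gamma", 0.0, 1.0),
        "omega": trial.suggest_float("omega", 0.0, 1.0),
        "eta":   trial.suggest_float("eta", 0.0, 1.0),
        "beta1": trial.suggest_float("beta1", 1.0, 50.0, log=True),  # local (narrow) kernel
        "beta2": trial.suggest_float("beta2", 0.01, 1.0, log=True),  # global (wide) kernel
        "pi":    trial.suggest_float("pi", 0.0, 1.0),
        "lam":   trial.suggest_float("lam", 1e-4, 10.0, log=True),
    }
    W_prior, S_aug, alpha = rehark_adapt(F_vis, W_clip, W_gpt3, labels, hp)
    scores = predict(power_transform(F_val, hp["p"]), S_aug, alpha, W_prior, hp)
    return (scores.argmax(axis=-1) == y_val).mean()  # validation accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=1000)
print(study.best_params, study.best_value)
```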

Key Experimental Results

One-shot classification accuracy (%) on the 11 benchmarks:

| Method | ImageNet | Caltech | DTD | EuroSAT | Aircraft | Pets | Flowers | Food101 | Cars | SUN397 | UCF101 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Shot CLIP | 60.35 | 85.68 | 42.91 | 36.27 | 17.01 | 77.37 | 66.02 | 85.72 | 55.75 | 58.82 | 61.78 | 58.88 |
| Tip-Adapter | 60.58 | 88.09 | 45.90 | 56.76 | 19.06 | 77.54 | 75.06 | 86.02 | 57.11 | 60.85 | 64.40 | 62.85 |
| ProKeR | 60.60 | 88.17 | 47.99 | 59.75 | 20.65 | 77.40 | 78.85 | 86.44 | 56.79 | 59.66 | 65.13 | 63.77 |
| ReHARK | 61.88 | 90.13 | 49.23 | 69.19 | 21.45 | 77.55 | 80.82 | 86.34 | 59.18 | 63.53 | 64.83 | 65.83 |

ReHARK outperforms ProKeR by +2.06% on average, with the most substantial gain on EuroSAT (+9.44%).

Ablation Study

  • Removing the Power Transform causes the largest drop among single-component removals (65.75 → 65.32), underscoring the importance of nonlinear feature rectification.
  • Removing Rectify (distribution alignment) and Refine (visual prototype fusion) each results in approximately 0.3% degradation.
  • Using only a visual prior (ONLY_VISUAL) causes accuracy to collapse to 43.83%, demonstrating the critical role of text priors under one-shot settings.
  • The RBF kernel substantially outperforms the Linear kernel (55.45%) and Laplacian kernel (60.84%).
  • Increasing the number of search trials from 50 to 1,000 monotonically improves performance (64.87% → 65.83%).

Highlights & Insights

  • Clear theoretical perspective: Tip-Adapter is interpreted as a local NW estimator, while ReHARK is framed as global RKHS regularization, providing well-grounded theoretical motivation.
  • The training-free design with closed-form solutions runs on a single P100 GPU, making it extremely lightweight.
  • Fusing GPT-3 semantic descriptions with CLIP text weights effectively increases "knowledge density" under one-shot constraints.
  • The +9.44% gain on EuroSAT suggests the framework is particularly effective for datasets with strong structural sensitivity or large distribution shifts.

Limitations & Future Work

  • The 1,000-trial Optuna search introduces non-trivial computational overhead; the hyperparameter search itself is not cost-free despite training-free inference.
  • Generic GPT-3 descriptions may lack sufficient discriminability in highly specialized domains.
  • Only the one-shot setting is evaluated; performance under few-shot regimes (2/4/8/16-shot) is not reported.
  • Experiments are conducted solely with a ViT-B/16 backbone; generalization to larger backbones or other VLMs is not verified.

Comparison with Baselines

  • Tip-Adapter: A local NW estimator averaging 62.85% vs. ReHARK's 65.83%; the gap originates from ReHARK's global regularization and hybrid priors.
  • ProKeR: Also employs global RKHS regularization but lacks GPT-3 priors and multi-scale kernels, yielding 63.77% vs. 65.83%, with the largest gap on EuroSAT (59.75 vs. 69.19).
  • GDA: Gaussian discriminant analysis with Mahalanobis distance achieves 62.24%, one of the stronger training-free baselines.

Takeaways

The hybrid prior construction strategy is broadly applicable to other VLM adaptation scenarios: rather than relying on a single prompt or description, fusing multi-source semantic knowledge is a generally useful principle. The multi-scale kernel ensemble idea is simple yet effective and transferable to other kernel methods. The bridge augmentation concept (mixing visual and text features) also holds general value for cross-modal alignment tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ — A systematic improvement over ProKeR; the combination of hybrid priors, multi-scale kernels, and bridge augmentation is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated on 11 benchmarks with multi-dimensional ablations; lacks few-shot and multi-backbone comparisons.
  • Writing Quality: ⭐⭐⭐ — Structure is clear, but some equations are redundant and excessive use of passive voice reduces readability.
  • Value: ⭐⭐⭐⭐ — Offers clear practical value for one-shot VLM adaptation; the training-free design is well-suited for real-world deployment.