# FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
Conference: CVPR 2026 arXiv: 2603.04733 Code: eVI-group-SCU/FOZO Area: Model Compression Keywords: Test-Time Adaptation, Zeroth-Order Optimization, Visual Prompt, Forward Propagation, Quantized Model Deployment
## TL;DR
This paper proposes FOZO, a forward-only zeroth-order prompt optimization paradigm that updates prompts via SPSA gradient estimation, a dynamic perturbation strategy, and shallow–deep feature statistics alignment—without modifying model weights. FOZO achieves 59.52% accuracy on ImageNet-C, surpassing all forward-only methods including FOA (58.13%), and supports INT8 quantized models.
## Background & Motivation
Distribution shift is ubiquitous: Deep learning models frequently encounter train-test distribution shift in real-world deployment. TTA addresses this by dynamically adapting models using unlabeled test data at inference time.
Backpropagation-based methods are resource-intensive: Gradient-based TTA methods such as TENT, SAR, and EATA require backpropagation to update model weights, incurring high computational and memory overhead (e.g., TENT: 5,495 MiB vs. FOZO: 831 MiB), making them unsuitable for low-power edge devices.
Gradient-free methods have limited capacity: Methods such as AdaBN, T3A, and LAME do not construct explicit optimization objectives, resulting in limited learning capacity and suboptimal adaptation performance (e.g., LAME achieves only 54.16% on ImageNet-C).
CMA-ES is inefficient in high-dimensional spaces: The recent forward-only method FOA employs the CMA-ES evolutionary strategy to update prompts, but CMA-ES has \(O(d^2)\) complexity and converges slowly in high-dimensional prompt spaces.
ZOA modifies internal model parameters: ZOA applies zeroth-order optimization to update normalization layer parameters, limiting its applicability in scenarios where model weights cannot be modified (e.g., hardware-encoded or quantized models).
OOD data streams pose optimization challenges: The continuously shifting data distribution in TTA makes zeroth-order gradient estimates prone to unreliability, necessitating specialized optimization strategies to ensure convergence stability.
## Method

### Overall Architecture
FOZO injects a small set of learnable prompts \(\mathbf{P} = \{\mathbf{p}^k \in \mathbb{R}^d \mid 1 \leq k \leq p\}\) (default \(p=3\)) into the input layer of a pretrained ViT, with all model weights frozen. For each incoming test batch, the prompts are updated via SPSA zeroth-order gradient estimation using only forward passes. The core pipeline is as follows:
- Apply symmetric positive and negative perturbations to prompt \(\mathbf{P}\): \(\mathbf{P}_+ = \mathbf{P} + \epsilon_t \mathbf{Z}\), \(\mathbf{P}_- = \mathbf{P} - \epsilon_t \mathbf{Z}\) (where \(\mathbf{Z} \sim \mathcal{N}(0, I_d)\))
- Compute losses \(\ell_+\) and \(\ell_-\) via separate forward passes
- Estimate the projected gradient: \(\hat{g} = \frac{\ell_+ - \ell_-}{2\epsilon_t} \mathbf{Z}\)
- Average gradient estimates over \(n\) SPSA samples and update the prompts
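The SPSA update above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: `loss_fn`, the learning rate, and the plain gradient-descent step are assumptions; only the symmetric perturbation and the projected-gradient formula come from the paper.

```python
import numpy as np

def spsa_step(prompts, loss_fn, eps, lr, n_samples=1, rng=None):
    """One SPSA update of the prompt tensor using only forward passes.

    loss_fn(prompts) -> scalar loss computed by the frozen model;
    no backpropagation is ever invoked.
    """
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(prompts)
    for _ in range(n_samples):
        z = rng.standard_normal(prompts.shape)       # Z ~ N(0, I)
        l_pos = loss_fn(prompts + eps * z)           # forward pass on P + eps*Z
        l_neg = loss_fn(prompts - eps * z)           # forward pass on P - eps*Z
        grad += (l_pos - l_neg) / (2.0 * eps) * z    # projected gradient estimate
    grad /= n_samples                                # average over n SPSA samples
    return prompts - lr * grad                       # (assumed) plain SGD step
```

Each SPSA sample costs exactly two forward passes, which matches the FP=2 configuration in the main results table.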
### Key Designs: Dynamic Perturbation Strategy
According to the convergence analysis (Theorem 1), the bias term \(C\eta\ell\epsilon_t^2 r\) vanishes only as \(\epsilon_t \to 0\), so accurate convergence favors small perturbations; yet large perturbations are needed early in training, or after domain shifts, to encourage exploration. FOZO resolves this tension with an adaptive decay mechanism:
- When a loss spike is detected (domain shift or optimization stagnation), \(\epsilon_t\) is reset to \(\epsilon_0\)
- Otherwise, it decays gradually with factor \(\alpha = 0.9\)
- Convergence rate is theoretically shown to depend on the effective Hessian rank \(r\) rather than the parameter dimension \(d\) (Theorem 2)
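A minimal sketch of such a reset-and-decay schedule. The spike-detection rule via `spike_ratio` is our assumption; the paper specifies only that a detected loss spike resets \(\epsilon_t\) to \(\epsilon_0\) and that it otherwise decays by \(\alpha = 0.9\).

```python
def update_eps(eps, loss, prev_loss, eps0, alpha=0.9, spike_ratio=1.5):
    """Adaptive perturbation schedule (illustrative).

    - loss spike (e.g. domain shift or stagnation) -> reset eps to eps0
      to re-enable exploration
    - otherwise -> geometric decay by alpha, shrinking the O(eps^2)
      bias of the SPSA estimate
    """
    if prev_loss is not None and loss > spike_ratio * prev_loss:
        return eps0
    return alpha * eps
```

Called once per test batch, this keeps \(\epsilon_t\) small in stable regimes while recovering exploration capacity the moment the data distribution shifts.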
### Loss & Training
Shallow–deep feature statistics alignment \(\mathcal{L}_{stats}\): The mean \(\mu\) and standard deviation \(\sigma\) of [CLS] token activations are collected from shallow (\(1 \sim N/2\)) and deep (\(N/2+1 \sim N\)) ViT layers, and aligned to source-domain statistics precomputed offline.
Entropy minimization \(\mathcal{L}_{ent}\): Encourages high-confidence predictions on the target domain.
Total loss: \(\mathcal{L} = \lambda \mathcal{L}_{stats} + \mathcal{L}_{ent}\), where \(\lambda = 0.4\).
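A sketch of the combined objective, assuming a squared-L2 distance for the statistics term (the paper's exact distance function may differ) and standard softmax entropy averaged over the batch:

```python
import numpy as np

def entropy_loss(logits):
    """Mean softmax prediction entropy over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)   # stabilize exp
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def stats_loss(mu, sigma, mu_src, sigma_src):
    """Distance between test-time and source [CLS] statistics
    (assumed squared L2; applied per shallow/deep layer group)."""
    return float(np.sum((mu - mu_src) ** 2) + np.sum((sigma - sigma_src) ** 2))

def total_loss(logits, mu, sigma, mu_src, sigma_src, lam=0.4):
    """L = lambda * L_stats + L_ent, with lambda = 0.4 as in the paper."""
    return lam * stats_loss(mu, sigma, mu_src, sigma_src) + entropy_loss(logits)
```

Both terms are computable from a single forward pass, which is what makes them usable inside the two-pass SPSA loop.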
## Key Experimental Results

### Main Results
Comparison with forward-only methods on ImageNet-C (5K, severity level 5, 2 forward passes):
| Method | FP | Avg Acc (%) | Time (s) | Memory (MiB) | #Params |
|---|---|---|---|---|---|
| NoAdapt | 1 | 55.57 | 94 | 819 | 0 |
| LAME | 1 | 54.16 | 97 | 819 | 0 |
| T3A | 1 | 53.76 | 311 | 823 | 0 |
| FOA | 2 | 58.13 | 224 | 831 | 2304 |
| ZOA | 2 | 58.56 | 198 | 859 | 26145 |
| FOZO | 2 | 59.52 | 179 | 831 | 2304 |
Comparison with backpropagation-based methods (28 forward passes):
| Method | Avg Acc (%) | Time (s) | Memory (MiB) |
|---|---|---|---|
| TENT | 58.32 | 208 | 5495 |
| EATA | 61.35 | 218 | 5496 |
| SAR | 60.36 | 393 | 5495 |
| DEYO | 60.76 | 282 | 5499 |
| FOZO (FP=26) | 62.60 | 2102 | 831 |
### Ablation Study
| Configuration | Acc (%) | Δ |
|---|---|---|
| NoAdapt | 55.1 | — |
| Base FOZO (ZO + Entropy) | 57.3 | +2.2 |
| + Deep-Shallow Alignment | 60.1 | +2.8 |
| + Dynamic Perturbation (full) | 62.7 | +2.6 |
Shallow–deep feature alignment contributes the most (+2.8%), followed by the dynamic perturbation strategy (+2.6%).
### Key Findings
- Strong compatibility with quantized models: On INT8 PTQ4ViT, FOZO achieves 58.00% vs. FOA 57.07% and ZOA 56.91%, demonstrating the advantage of forward-only methods in quantized settings.
- Exceptional memory efficiency: FOZO requires only 831 MiB, on par with the no-adaptation baseline and approximately 15% of backpropagation-based methods (831 vs. 5,495 MiB).
- Parameter efficiency: Only 2,304 prompt parameters are updated, representing 8.8% of ZOA's 26,145 parameters.
- Convergence speed: FOZO reaches 65% accuracy in approximately 66% of the time required by FOA/ZOA.
- Cross-dataset generalization: FOZO outperforms all forward-only methods on ImageNet-R (64.1%) and ImageNet-Sketch (50.5%).
## Highlights & Insights
- Solid theoretical foundation: Convergence is rigorously proven under the SPSA framework and a local effective rank assumption; the convergence rate depends on the effective Hessian rank \(r\) rather than the parameter dimension \(d\).
- Strong practical applicability: Pure forward-pass inference, no model weight modification, and low memory footprint make FOZO directly deployable on edge devices and quantized models.
- Elegant dynamic perturbation design: Domain shifts are automatically detected to reset the perturbation scale, striking a balance between exploration and convergence.
- Comprehensive evaluation: Experiments cover full-precision and quantized models, multiple datasets, continual adaptation scenarios, clear ablation studies, and hyperparameter sensitivity analysis.
## Limitations & Future Work
- High wall-clock time: Under multiple forward passes (FP=26/28), adaptation time (2,102 s) is substantially higher than backpropagation-based methods (208–393 s); the memory-for-speed trade-off may be unfavorable in latency-sensitive scenarios.
- Validated only on ViT architectures: CNN or other architectures are not tested, and prompt injection relies on ViT's token concatenation mechanism.
- Dependency on source-domain statistics: Feature statistics must be precomputed from a source-domain validation set, which may be unavailable in fully black-box deployment settings.
- Sensitivity to prompt count and batch size: Although ablations show that 3 prompts and batch size 64 are robust configurations, performance degrades noticeably at small batch sizes (4/8).
## Related Work & Insights
- FOA (CVPR 2024): The first forward-only prompt optimization TTA method, which uses CMA-ES to update prompts. FOZO replaces CMA-ES with SPSA to address its \(O(d^2)\) complexity.
- ZOA (ACM MM 2025): Applies zeroth-order optimization to TTA but updates normalization layer parameters (26,145 parameters); FOZO updates only prompts (2,304 parameters) without modifying the model.
- TENT (ICLR 2021): A landmark entropy-minimization TTA method that requires backpropagation to update batch normalization parameters; FOZO inherits the entropy loss formulation while eliminating backpropagation.
- MeZO (NeurIPS 2023): Proposes the local effective rank hypothesis to establish the feasibility of zeroth-order optimization; FOZO extends this theoretical framework to the TTA setting.
- Visual Prompt Tuning (ECCV 2022): The seminal method for injecting learnable prompts into the ViT input layer; FOZO adapts this technique to a backpropagation-free test-time setting.
## Rating
- Novelty: ⭐⭐⭐⭐ — Replacing CMA-ES with SPSA zeroth-order estimation for TTA prompt optimization is a well-motivated and novel combination; the dynamic perturbation design is theoretically grounded.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset evaluation, quantized model testing, continual adaptation, detailed ablations, and hyperparameter analysis are provided, though non-ViT architectures are absent.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, complete theoretical derivations, and well-articulated motivation.
- Value: ⭐⭐⭐⭐ — Provides a strongly competitive solution for TTA on edge-deployed and quantized models in practical settings.