ProAPO: Progressively Automatic Prompt Optimization for Visual Classification¶

Conference: CVPR 2025
arXiv: 2502.19844
Code: Yes
Area: AIGC Detection
Keywords: Vision-Language Models, Prompt Optimization, Evolutionary Algorithms, Fine-Grained Classification, Few-Shot Learning

TL;DR¶

Proposes ProAPO, a progressively automatic prompt optimization method based on evolutionary algorithms. Using only one-shot supervision and no human intervention, it optimizes prompts in stages, from task-level templates to category-level descriptions, to counter the hallucinations and weak discriminativeness of LLM-generated descriptions. It outperforms existing text prompting methods on 13 datasets.

Background & Motivation¶

Background: Vision-Language Models (VLMs) such as CLIP perform classification by computing the similarity between images and text prompts. Prompt quality directly determines performance—handcrafted templates require domain expertise and lack fine-grained information; prompt tuning (CoOp) requires extra training and lacks interpretability; LLM-generated descriptions (CuPL, DCLIP) provide category-level semantics but suffer from LLM hallucinations.
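
To make the similarity-based classification concrete, here is a minimal sketch. It assumes each class is represented by several prompt embeddings whose cosine similarities to the image are averaged; the 3-d vectors are toy stand-ins for real CLIP features, and `cosine` and `classify` are hypothetical helper names, not part of any CLIP API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(image_emb, class_prompt_embs):
    """Pick the class whose prompts are, on average, most similar to the image."""
    scores = {
        name: sum(cosine(image_emb, e) for e in embs) / len(embs)
        for name, embs in class_prompt_embs.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings (assumption): two prompts per class.
prompts = {
    "duck": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "jackfruit": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]],
}
label, scores = classify([0.8, 0.3, 0.0], prompts)
print(label)
```

Because the prediction depends only on which text embeddings sit closest to the image, changing the wording of a prompt changes the decision boundary without touching the model.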

Limitations of Prior Work: Category descriptions generated by LLMs suffer from three issues: (1) Inaccuracy—e.g., generating descriptions about "feet" for "Peking duck"; (2) Lack of discriminativeness—generating identical "hooked beak" and "webbed feet" descriptions for different bird species; (3) Non-visual features—e.g., generating "strong smell" for jackfruit.

Key Challenge: Optimizing category-level prompts faces a search space explosion. Each category has multiple candidate descriptions, so the number of combinations grows with (number of categories) \(\times\) (number of descriptions) and far exceeds the search space of task-level templates. This leads to high generation costs, excessive iterations, and severe overfitting (multiple candidates achieve identical training accuracy but differ significantly in test performance).
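
A rough count illustrates the gap. The sketch below assumes a candidate is any subset of the available templates (task level) or any subset of each category's descriptions (category level); this counting model and the function names are illustrative assumptions, not the paper's exact formulation.

```python
def num_template_candidates(num_templates):
    """Subsets of a shared template pool: 2^T."""
    return 2 ** num_templates

def num_description_candidates(num_classes, descs_per_class):
    """Independent subsets per class: (2^D)^C = 2^(C*D)."""
    return 2 ** (num_classes * descs_per_class)

# Hypothetical sizes: 10 templates vs. 100 classes with 10 descriptions each.
template_space = num_template_candidates(10)
description_space = num_description_candidates(100, 10)
print(template_space)
print(len(str(description_space)), "decimal digits")
```

Even with modest per-class libraries, exhaustive search over descriptions is out of reach, which is why the method relies on local edits and sampling.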

Goal: To find the most visually discriminative category-level prompts under minimal supervision (one-shot) and with zero human intervention.

Key Insight: Drawing inspiration from Automatic Prompt Optimization (APO) in NLP, this work searches for optimal prompts in the language space using an evolutionary algorithm. However, unlike methods that only optimize templates, this work progressively optimizes from templates to category descriptions.

Core Idea: First optimize task-level templates using an evolutionary algorithm, and then optimize category-specific descriptions on top of them through editing operations (addition, deletion, replacement) and evolutionary operations (crossover, mutation). This is combined with an entropy-constrained fitness score and sampling strategies to resolve the search space explosion.

Method¶

Overall Architecture¶

ProAPO consists of two phases: (1) Template Optimization Phase—starting from an initial template "a photo of a {class}.", it iteratively evolves to find the optimal task-level template; (2) Description Optimization Phase—built upon the optimal template, it progressively optimizes visual descriptions for each category. Each phase is executed in a loop using the APO (Automatic Prompt Optimization) algorithm: candidate generation \(\rightarrow\) fitness evaluation \(\rightarrow\) retaining the best \(\rightarrow\) continuing evolution.
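
The generate, evaluate, retain, repeat loop can be sketched as follows. This is a simplified, deterministic illustration: `apo_loop`, the toy template library, and its hand-assigned usefulness scores are all hypothetical, and the real fitness would come from CLIP predictions on the one-shot samples.

```python
def apo_loop(initial, generate, fitness, iterations=5, top_k=3):
    """Keep the top_k candidates each round and expand them with `generate`."""
    population = [initial]
    for _ in range(iterations):
        candidates = list(population)
        for parent in population:
            candidates.extend(generate(parent))
        unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        unique.sort(key=fitness, reverse=True)
        population = unique[:top_k]
    return population[0]

# Toy library: template -> made-up usefulness score (assumption).
LIBRARY = {
    "a photo of a {class}.": 0.5,
    "a close-up photo of a {class}.": 0.3,
    "a blurry photo of a {class}.": -0.2,
}

def toy_generate(parent):
    """All candidates reachable by adding one unused template."""
    return [tuple(sorted(parent + (t,))) for t in LIBRARY if t not in parent]

def toy_fitness(candidate):
    return sum(LIBRARY[t] for t in candidate)

best = apo_loop(("a photo of a {class}.",), toy_generate, toy_fitness)
print(best)
```

The same loop can serve both phases: only the candidate representation and the `generate` function differ between template and description optimization.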

Key Designs¶

Candidate Generation via Editing + Evolution:
- Function: To generate diverse candidate prompts without repeatedly querying LLMs.
- Mechanism: Queries the LLM once during the initialization phase to construct a template/description library. Subsequently, in each iteration, two types of operations are used to generate candidates: (a) editing operations—performing Add (adding a new description from the library), Delete (deleting an existing description), or Replace (replacing a description) on the current best candidate; (b) evolutionary operations—Crossover (concatenating two high-scoring candidates) and Mutation (randomly replacing some descriptions). These operations search around the neighborhood of the current optimal solution without requiring additional LLM queries.
- Design Motivation: Querying the LLM in every iteration is extremely costly. Constructing the library offline and performing editing/evolution online reduces LLM calls to a single time while maintaining search diversity.
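
The five operators described above can be sketched on a single category's description list. This is a minimal illustration under assumptions: a candidate is a plain list of strings drawn from a fixed library, and the function names, the mutation `rate`, and the example descriptions are hypothetical.

```python
import random

def add(cand, library, rng):
    """Append one unused description from the library."""
    unused = [d for d in library if d not in cand]
    return cand + [rng.choice(unused)] if unused else list(cand)

def delete(cand, rng):
    """Drop one description, keeping at least one."""
    if len(cand) <= 1:
        return list(cand)
    i = rng.randrange(len(cand))
    return cand[:i] + cand[i + 1:]

def replace(cand, library, rng):
    """Swap one description for an unused one."""
    unused = [d for d in library if d not in cand]
    if not cand or not unused:
        return list(cand)
    i = rng.randrange(len(cand))
    return cand[:i] + [rng.choice(unused)] + cand[i + 1:]

def crossover(a, b):
    """Concatenate two candidates, dropping duplicates in order."""
    return list(dict.fromkeys(a + b))

def mutate(cand, library, rng, rate=0.3):
    """Independently replace each description with probability `rate`."""
    out = []
    for d in cand:
        unused = [x for x in library if x not in cand and x not in out]
        out.append(rng.choice(unused) if unused and rng.random() < rate else d)
    return out

rng = random.Random(0)
library = ["hooked beak", "webbed feet", "red crest", "white wing bars"]
cand = ["hooked beak", "webbed feet"]
print(crossover(cand, ["webbed feet", "red crest"]))
```

None of these operators calls the LLM; they only recombine entries already in the library, which is what keeps each iteration cheap.
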
Entropy-Constrained Fitness Scoring:
- Function: To evaluate candidate prompt quality and mitigate overfitting in the one-shot setting.
- Mechanism: Fitness is formulated as \(F(\mathcal{D}, P) = Acc + \alpha \cdot H\), where \(Acc\) is the training-set accuracy of prompt \(P\) on data \(\mathcal{D}\), \(H = \mathbb{E}[\log(s(x,y))]\) is the expected log-similarity score of the ground-truth label \(y\) for image \(x\), and \(\alpha\) weights the second term. When multiple candidates achieve identical training accuracy, the entropy constraint favors candidates with higher prediction confidence for the correct category, which breaks ties at a finer granularity.
- Design Motivation: Under a one-shot setup, many candidates achieve identical training accuracy (all correctly predicting the single sample), yet their test performance varies significantly. Accuracy alone cannot distinguish them; hence, the entropy constraint acts as a "soft gradient" to resolve this issue.
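
A small sketch shows how the second term breaks ties. It assumes \(s(x,y)\) is the softmax probability of the ground-truth class over the class similarities and uses an arbitrary \(\alpha = 0.1\); both choices, the function names, and the logits are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fitness(logits_per_sample, labels, alpha=0.1):
    """Accuracy plus alpha times the mean log-score of the true label."""
    correct = 0
    log_score = 0.0
    for logits, y in zip(logits_per_sample, labels):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        correct += int(pred == y)
        log_score += math.log(probs[y])
    n = len(labels)
    return correct / n + alpha * log_score / n

labels = [0, 1]
confident = [[2.0, 0.5], [0.2, 1.5]]   # both correct, wide margins
hesitant = [[1.0, 0.9], [0.8, 1.0]]    # both correct, narrow margins
f_confident = fitness(confident, labels)
f_hesitant = fitness(hesitant, labels)
print(f_confident > f_hesitant)
```

Both candidates reach 100% training accuracy, yet the one with wider margins receives the higher fitness, which is the tie-breaking behavior the entropy constraint is meant to provide.
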
Two-Stage Sampling Strategy:
- Function: To reduce the iteration cost of category-level description optimization.
- Mechanism: (a) Prompt Sampling: the search starts from the LLM-generated descriptions with the highest initial scores rather than from the bare template, which shortens the search path; (b) Group Sampling: categories are grouped by significance so that optimization focuses on easily confused categories instead of traversing all of them. Categories are sorted by prediction entropy, and high-entropy (high-uncertainty) categories are prioritized.
- Design Motivation: The number of categories can reach hundreds or thousands, making it impractical to optimize descriptions for every category one by one. The grouping and sampling strategy reduces complexity from linear to sub-linear.
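
The entropy-based prioritization in group sampling might look like the sketch below. It assumes per-sample class probabilities are available and that a category's uncertainty is the mean entropy of its training samples; the helper names and the probabilities are hypothetical.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_categories_by_entropy(probs_per_sample, labels):
    """Sort categories from most to least uncertain."""
    per_class = defaultdict(list)
    for probs, y in zip(probs_per_sample, labels):
        per_class[y].append(entropy(probs))
    mean_entropy = {c: sum(v) / len(v) for c, v in per_class.items()}
    return sorted(mean_entropy, key=mean_entropy.get, reverse=True)

def sample_group(ranked, group_size):
    """Keep only the most uncertain categories for this round."""
    return ranked[:group_size]

# One-shot toy data: one probability vector per category.
probs = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.10, 0.20, 0.70]]
labels = [0, 1, 2]
ranked = rank_categories_by_entropy(probs, labels)
group = sample_group(ranked, 2)
print(ranked, group)
```

Only the categories in `group` would have their descriptions edited in the current round, so the cost per iteration depends on the group size rather than the total number of categories.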

Loss & Training¶

No gradient-based training is required. Candidates are evaluated with CLIP zero-shot inference, and the one-shot samples are used solely to compute fitness scores. Evolutionary hyperparameters: the top-k candidates are kept per iteration, and the search runs for T iterations. The LLM (e.g., GPT-4) is queried only once, during the initialization phase.

Key Experimental Results¶

Main Results¶

Method	Type	Average Accuracy on 13 Datasets
CLIP (Handcrafted Templates)	Template	Baseline
CuPL (LLM Descriptions)	Description	Above baseline
PN (Template Optimization)	Template Optimization	Above CuPL
ProAPO	Progressive Optimization	Best among all methods

Ablation Study¶

Configuration	Key Metric	Description
Template Optimization Only	Marginal improvement	Templates lack fine-grained information
+ Description Optimization	Significant improvement	Category-level descriptions provide the core gain
w/o Entropy Constraint	Drop in accuracy	Overfitting to the one-shot training samples
w/o Prompt Sampling	Requires more iterations	Low search efficiency from a poor starting point
w/o Group Sampling	High time cost	Traversing all categories

Key Findings¶

Progressive optimization (template \(\rightarrow\) description) yields better results than directly optimizing descriptions, as a good template provides a superior starting point for description optimization.
Optimized prompts transfer across different vision backbones (e.g., from ViT-B/16 to ViT-L/14), suggesting that the optimization improves semantic quality rather than overfitting to a specific model.
Optimized prompts also boost adapter-based methods (e.g., Tip-Adapter), showing model-agnostic generality.
Among editing operations, Replace contributes the most, while crossover and mutation operations help escape local optima.

Highlights & Insights¶

Progressive Search Strategy: Hierarchical optimization from template to description effectively manages the complexity of the search space. This "coarse-to-fine" strategy is generalizable to other optimization problems with large search spaces.
Single LLM Query + Offline Evolution: Limits the role of the LLM to initialization (providing the candidate library) and then relies entirely on lightweight editing/evolutionary operations. This drastically reduces API call costs.
Entropy Constraint for One-Shot Overfitting: Employs logarithmic similarity scores as soft targets to distinguish candidates with identical accuracies, offering a simple yet effective solution.

Limitations & Future Work¶

The search efficiency of evolutionary algorithms is still limited—it requires multiple iterations to converge, which remains costly on datasets with a huge number of categories.
High randomness in one-shot evaluation—different one-shot samples can lead to different optimization results.
The upper bound of the description library's quality is constrained by the initial LLM query—if the LLM generates entirely low-quality descriptions, editing and evolutionary operations can hardly compensate.
Only validated on image classification tasks, leaving more complex vision-language tasks like object detection and VQA unexplored.

Comparison with Related Methods¶

vs CuPL: CuPL directly uses LLM-generated descriptions without subsequent optimization. ProAPO builds upon this by utilizing evolutionary search to prune hallucinated descriptions and retain discriminative ones.
vs PN: PN only optimizes task-level templates, whereas ProAPO further optimizes category-level descriptions, covering a much larger search space.
vs CoOp / CoCoOp: Prompt tuning methods optimize continuous token embeddings via gradients, requiring more training samples and lacking interpretability. ProAPO searches in the natural language space, functioning effectively with a single shot while remaining readable.

Rating¶

Novelty: ⭐⭐⭐ Extends NLP’s APO methods to category-level description optimization for visual classification; the progressive strategy is a notable contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 13 datasets with comprehensive ablations; cross-backbone transfer experiments are highly convincing.
Writing Quality: ⭐⭐⭐⭐ The algorithm is clearly described (with several pseudocode listings), and the problem definition and motivation are thoroughly discussed.
Value: ⭐⭐⭐ Holds practical value for improving prompt quality in CLIP-like models, though the method is somewhat engineering-oriented.