Self-Evolving Visual Concept Library using Vision-Language Critics

Conference: CVPR 2025
arXiv: 2504.00185
Code: https://trishullab.github.io/escher-web
Area: Multimodal VLM
Keywords: Concept Bottleneck Models, Library Learning, Visual Concept Evolution, VLM Critic, Fine-grained Classification

TL;DR

This paper proposes Escher, a framework that automatically evolves a visual concept library through an iterative loop in which a VLM acts as critic and an LLM acts as concept generator. The evolved library improves concept bottleneck models for image classification, raising LM4CV accuracy on CUB from 63.26% to 83.17% (+19.91 percentage points).

Background & Motivation

Concept-Bottleneck Visual Recognition performs classification by identifying intermediate visual concepts (e.g., "stainless steel rocket body", "grid fins"), providing better interpretability and accuracy compared to direct classification with VLMs. However, existing methods rely on LLMs to generate concept sets in a single pass, which suffers from two limitations:

  1. Concepts generated by LLMs may lack discriminative power: Generated concepts may apply to all classes (e.g., "has wings" is true for all birds) and fail to distinguish fine-grained categories.
  2. Inter-concept interactions are overlooked: Confusions arise when the same concept is activated across multiple categories, yet existing methods generate concepts for each class in isolation.

Key Insight: When facing a new domain, scientists do not rely on a fixed set of concepts; instead, they continually learn and expand their conceptual knowledge base. Similarly, a visual concept library should evolve dynamically. From the perspective of Library Learning, the authors model this problem as hierarchical Bayesian optimization.

Method

Overall Architecture

Escher adopts an alternating maximization strategy, iterating between two phases:

  1. Fix the concept set \(\rightarrow\) train/optimize the concept-bottleneck classifier.
  2. Fix the classifier \(\rightarrow\) identify confused category pairs \(\rightarrow\) have the LLM generate new discriminative concepts.

The entire process requires no human annotation, with the VLM acting as a "visual critic" to provide feedback signals.
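
The alternating loop can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `evolve_library` and its three callables (`fit_classifier`, `find_confused_pairs`, `propose_concepts`) are hypothetical names, and appending new concepts to the existing library (rather than replacing old ones) is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# A concept library maps each class name to its list of concept strings.
Library = Dict[str, List[str]]


def evolve_library(
    library: Library,
    fit_classifier: Callable[[Library], object],
    find_confused_pairs: Callable[[object, Library], List[Tuple[str, str]]],
    propose_concepts: Callable[[str, str, Library], Dict[str, List[str]]],
    num_iterations: int,
) -> Library:
    """Alternate between fitting the classifier and growing the library.

    All three callables are hypothetical stand-ins:
      fit_classifier      - phase 1, trains/optimizes the bottleneck classifier
      find_confused_pairs - phase 2a, the VLM critic's confusion heuristic
      propose_concepts    - phase 2b, the LLM concept generator
    """
    # Work on a copy so the caller's library is left untouched.
    library = {cls: list(concepts) for cls, concepts in library.items()}
    for _ in range(num_iterations):
        # Phase 1: concept set fixed, optimize the classifier.
        classifier = fit_classifier(library)
        # Phase 2: classifier fixed, find confused pairs and ask for concepts.
        for class_a, class_b in find_confused_pairs(classifier, library):
            proposals = propose_concepts(class_a, class_b, library)
            for cls, concepts in proposals.items():
                existing = library.setdefault(cls, [])
                for concept in concepts:
                    # Assumption: new concepts are appended, duplicates skipped.
                    if concept not in existing:
                        existing.append(concept)
    return library
```

In the setting described here, `num_iterations` would correspond to \(T\), and no callable needs human labels beyond what the underlying classifier already uses.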

Key Designs

  1. Concept Bottleneck Optimization:

    • Function: Given the current concept library \(\mathcal{C}\), train the classifier weights \(w_\mathcal{Y}\).
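    • Mechanism: The prediction is \(y^* = \arg\max_{y \in \mathcal{Y}} w_y^\top \text{score}_{\text{VLM}}(\mathbf{x}, \mathcal{C})\), where \(\text{score}_{\text{VLM}}(\mathbf{x}, \mathcal{C})\) is the vector of VLM scores between image \(\mathbf{x}\) and each concept in \(\mathcal{C}\). Three paradigms are supported: zero-shot (uniform weights over each class's LLM-generated concepts), few-shot (linear probing), and fine-tuning (fully training a linear layer with weights in \(\mathbb{R}^{|\mathcal{C}| \times |\mathcal{Y}|}\)).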
    • Design Motivation: The modular design allows Escher to be plug-and-play and integrated into any concept bottleneck framework (CbD, LaBO, LM4CV).
  2. Confusion Heuristics:

    • Function: Identify frequently confused category pairs from the classifier's predictions.
    • Mechanism: Compute the score matrix \(\hat{\mathbf{y}} \in \mathbb{R}^{N \times |\mathcal{Y}|}\) for all images across categories, and use heuristics to identify highly confused pairs \(\{r_{ij}\}\). Four heuristics are provided: Top-k confusion, Pearson correlation, agglomerative clustering, and confusion matrix.
    • Design Motivation: Evolving only the top-\(K\) most confused category pairs instead of all pairs significantly improves efficiency. An exponential decay parameter \(\gamma\) prevents repeatedly evolving the same pair while ignoring other issues.
  3. History-Sensitive Concept Evolution:

    • Function: Generate new discriminative concepts for the confused category pairs while avoiding the regeneration of previously failed concepts.
    • Mechanism: Maintain a history library \(H^{(i,j)}_{[1:t]}\) recording past concepts and VLM feedback scores for each category pair. The LLM prompt includes history (similar to "execution history" in program synthesis), enabling it to learn from past failures. A scratchpad is also used to enhance reasoning capabilities.
    • Design Motivation: Drawing inspiration from execution traces in program synthesis, this ensures novel concepts are generated in each feedback round. Without history, LLMs tend to repeatedly propose the same features.
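
The second and third designs can be illustrated with a small standard-library sketch. Several details are assumptions made for the example, not taken from the paper: Top-k confusion is read here as "two classes are confusable when both appear in an image's top-\(k\) scores", the decay is applied as \(\exp(-\gamma \cdot n)\) where \(n\) is the number of times a pair was already selected, and the prompt wording is invented. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]  # (class index i, class index j) with i < j


def topk_confusion_counts(scores: Sequence[Sequence[float]], k: int) -> Counter:
    """Count how often two classes co-occur in an image's top-k scores.

    `scores` plays the role of the N x |Y| score matrix.
    """
    counts: Counter = Counter()
    for row in scores:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        for pair in combinations(sorted(top), 2):
            counts[pair] += 1
    return counts


def select_pairs(
    counts: Counter, times_selected: Dict[Pair, int], gamma: float, num_pairs: int
) -> List[Pair]:
    """Pick the most confused pairs, damping pairs that were already evolved.

    Assumption: damping factor is exp(-gamma * times_selected[pair]).
    """
    def damped(pair: Pair) -> float:
        return counts[pair] * math.exp(-gamma * times_selected.get(pair, 0))

    ranked = sorted(counts, key=lambda p: (-damped(p), p))
    chosen = ranked[:num_pairs]
    for pair in chosen:
        times_selected[pair] = times_selected.get(pair, 0) + 1
    return chosen


def build_prompt(class_a: str, class_b: str, history: List[Tuple[str, float]]) -> str:
    """Assemble a history-aware prompt (wording is illustrative only)."""
    lines = [f"Classes '{class_a}' and '{class_b}' are often confused."]
    if history:
        lines.append("Previously tried concepts and their critic scores:")
        lines += [f"- {concept}: {score:.2f}" for concept, score in history]
    lines.append("Propose new visual concepts that separate the two classes "
                 "and do not repeat earlier concepts.")
    return "\n".join(lines)
```

Keeping `times_selected` across iterations is what lets the damping shift attention to other pairs; with \(\gamma = 1/30\) the per-selection damping under this assumed form is mild, so a pair is deprioritized only after it has been evolved many times.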

Loss & Training

  • Under the fine-tuning setup, the linear adapter is trained using cross-entropy loss with regularization.
  • Under the zero-shot setup, no training is required; each class \(y\) places a uniform weight \(1/|c_y|\) on its own LLM-generated concepts, where \(c_y\) denotes the concept set of class \(y\).
  • Hyperparameters for Escher: number of iterations \(T=60\), Top-k confusion \(k=3\), sampling Top-50 pairs, decay rate \(\gamma=1/30\).
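
As a worked illustration of the zero-shot rule, uniform weights \(1/|c_y|\) make each class score the mean VLM score of that class's concepts, and the prediction is the class with the highest mean. The sketch below assumes the per-concept VLM scores are already computed; `predict_zero_shot` and the example concepts are hypothetical.

```python
from typing import Dict, List


def predict_zero_shot(
    concept_scores: Dict[str, float], library: Dict[str, List[str]]
) -> str:
    """Return argmax over classes of the mean score of the class's concepts.

    concept_scores: VLM score of the image against each concept (assumed given).
    library:        class name -> that class's concepts (c_y in the text).
    """
    def class_score(cls: str) -> float:
        concepts = library[cls]
        if not concepts:
            raise ValueError(f"class {cls!r} has no concepts")
        # Uniform weight 1/|c_y| on each of the class's own concepts.
        return sum(concept_scores[c] for c in concepts) / len(concepts)

    return max(library, key=class_score)


# Usage with made-up scores: class "A" averages 0.5, class "B" averages 0.4.
example_scores = {"red crown": 0.8, "long tail": 0.2,
                  "yellow belly": 0.3, "short beak": 0.5}
example_library = {"A": ["red crown", "long tail"],
                   "B": ["yellow belly", "short beak"]}
predicted = predict_zero_shot(example_scores, example_library)
```

Dividing by \(|c_y|\) keeps classes with many concepts from winning simply because they sum more terms.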

Key Experimental Results

Main Results (Fine-tuning Setup with LM4CV)

| Dataset | LM4CV (%) | LM4CV+Escher (%) | Gain (pts) |
|---|---|---|---|
| CIFAR-100 | 84.48 | 89.63 | +5.15 |
| CUB-200-2011 | 63.26 | 83.17 | +19.91 |
| Food101 | 94.77 | 94.90 | +0.13 |
| NABirds | 76.58 | 78.21 | +1.63 |
| Oxford Flowers | 94.80 | 96.86 | +2.06 |
| Stanford Cars | 86.84 | 93.76 | +6.92 |

Zero-Shot Setup (CbD)

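| Dataset | CLIP (%) | CbD (%) | CbD+Escher (%) | Gain vs. CbD (pts) |
|---|---|---|---|---|
| CIFAR-100 | 73.30 | 76.20 | 77.80 | +1.60 |
| CUB-200-2011 | 64.83 | 62.00 | 63.33 | +1.33 |
| Food101 | 92.51 | 93.11 | 93.58 | +0.47 |
| Stanford Cars | 74.53 | 75.65 | 77.14 | +1.49 |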

Ablation Study

| Configuration | CUB Top-1 (%) | Description |
|---|---|---|
| Original LM4CV | 63.26 | Concepts sampled once by LLM |
| LM4CV + 3x Concepts (no feedback) | 66.09 | Same number of concepts added but without VLM feedback |
| LM4CV + Escher | 83.17 | Concept evolution with VLM feedback |

Key Findings

  • The largest improvements tend to occur where the baseline leaves headroom: The gains on CUB (+19.91) and Stanford Cars (+6.92) far exceed the gain on Food101 (+0.13), suggesting that Escher is particularly effective at resolving confusions in fine-grained classification. NABirds is an exception, gaining only +1.63 from a 76.58% baseline.
  • Merely increasing the number of concepts is insufficient: Tripling the concepts for LM4CV improves CUB accuracy only from 63.26% to 66.09% (+2.83 points), whereas Escher reaches 83.17% (+19.91 points). This indicates that the gain comes from concept evolution guided by VLM feedback rather than from library size.
  • Mixed results in few-shot settings: LaBO+Escher shows inconsistent performance at 8-shot, becoming more stable at 16-shot, potentially due to noisy signals from poorly calibrated classifiers under few-shot conditions.
  • Also effective with weaker backbones: Escher consistently improves performance even when using a ViT-B/16 + Llama-3.3-70B-4bit configuration.

Highlights & Insights

  • Concept evolution vs. concept sampling is the core contribution: focus is on "better concepts" rather than "more concepts". The feedback-driven iterative evolution is inherently a search process.
  • The modular design is a strength: Escher plugs into three paradigms (CbD for zero-shot, LaBO for few-shot, and LM4CV for fine-tuning) without modification.
  • The +19.91-point gain on CUB is remarkable for fine-grained classification and suggests that concept selection is a severely underestimated bottleneck.
  • The history-sensitive prompt design cleverly exploits the in-context learning capabilities of LLMs to avoid redundant concepts.

Limitations & Future Work

  • Unstable performance in few-shot settings suggests a need to integrate few-shot learning techniques to improve calibration.
  • Every iteration invokes both LLM and VLM inference, so the total computational cost over 60 iterations is high.
  • Application to visual reasoning tasks such as VQA and object detection remains unexplored.
  • Confusion heuristics and hyperparameters (\(k\), \(K\), \(\gamma\)) require tuning for individual datasets.

Related Work & Insights

  • Difference from LLM-mutate: The latter performs mutations for each category in isolation, which does not scale to large datasets; Escher jointly reasons over all categories and focuses on the underperforming subsets.
  • Transferring library learning concepts from program synthesis to visual recognition represents a novel cross-domain connection.
  • Insight: Similar "VLM-critic + LLM-proposer" loops could be applied to scenarios such as prompt engineering and data augmentation strategy evolution.

Rating

  • Novelty: ⭐⭐⭐⭐ Introducing library learning to visual concept discovery is an innovative perspective, and the closed-loop VLM-critic design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive with 7 datasets, 3 paradigms (zero-shot/few-shot/fine-tuning), backbone ablations, and concept count ablations.
  • Writing Quality: ⭐⭐⭐⭐ The Bayesian formulation is clear and the algorithm description is complete, though the notation is occasionally over-complicated.
  • Value: ⭐⭐⭐⭐ The ~20-point improvement on CUB shows that concept selection is a real bottleneck; the framework can directly enhance existing CBM systems.