External Knowledge Injection for CLIP-Based Class-Incremental Learning¶
Conference: ICCV 2025 arXiv: 2503.08510 Code: GitHub Area: Information Retrieval Keywords: Class-Incremental Learning, CLIP, External Knowledge Injection, GPT-4, Vision-Language Models
TL;DR¶
This paper proposes Engine (ExterNal knowledGe INjEction), a framework for CLIP-based class-incremental learning that combines dual-branch injection tuning during training (a visual branch driven by data augmentation and a text branch driven by GPT-4-generated discriminative descriptions) with post-tuning knowledge injection at inference (re-ranking the top-\(k\) predictions with pairwise discriminative features). Without storing any historical samples, Engine improves over prior CLIP-based class-incremental learning methods by 3–10% across 9 benchmark datasets.
Background & Motivation¶
Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting previously learned ones. CLIP-based CIL methods leverage CLIP's strong zero-shot capability by matching visual embeddings against text embeddings of class names. However, three core issues persist in existing approaches:
Insufficient template text information: CLIP defaults to "a photo of a [CLASS]" as the text matching target, neglecting rich fine-grained visual features. For instance, "cat" can be decomposed into features such as "long thin tail," "soft fur," and "round face with large eyes," which provide more accurate classification signals when used as matching targets.
Limitations of visual prompts: Methods such as L2P and DualPrompt adapt to downstream tasks by learning visual prompts, but cannot exploit information from the text side. Moreover, the model must autonomously decompose class names into fine-grained concepts for visual feature matching — a decomposition capability that degrades during incremental updates.
Overfitting of text prompts: Methods such as CoOp learn trainable text prompts but tend to overfit to the specific distribution of training instances, resulting in poor performance under distribution shift at test time.
The core idea of this paper is to leverage large language models such as GPT-4 as an external knowledge source to provide general, comprehensive visual feature descriptions for each class, and to encode this knowledge into CLIP via lightweight injection units, thereby avoiding the overfitting problem associated with prompt learning.
Method¶
Overall Architecture¶
Engine comprises two stages of knowledge injection:

- On-the-fly knowledge injection (training phase): the CLIP encoders are frozen, and lightweight injection units for the visual and text branches are learned.
- Post-tuning knowledge injection (inference phase): GPT-4-generated pairwise discriminative features are used to re-rank the top-\(k\) predictions.
Key Designs¶
- Text Knowledge Injection Unit:
    - Function: GPT-4 generates discriminative visual feature descriptions for each class, and this external knowledge is encoded into a linear injection unit.
    - Mechanism: GPT-4 is queried for the unique visual features of each class (e.g., "What are unique visual features of [CLASS] in a photo?"), yielding a description set \(\mathbf{d}_i\). A linear layer \(u_t(\cdot): \mathbb{R}^d \to \mathbb{R}^d\) is appended after the frozen text encoder \(\bar{g_t}\), and knowledge is injected by minimizing the negative-similarity loss \(\mathcal{L}_t = -\text{Sim}\left(u_t\left(\bar{g_t}(\mathbf{t}_i)\right), \bar{g_t}(\mathbf{d}_i)\right)\), which encourages the injected template feature \(u_t(\bar{g_t}(\mathbf{t}_i))\) to approximate the GPT-4 description features \(\bar{g_t}(\mathbf{d}_i)\) (see the sketch after this item).
    - Design Motivation: Through this mechanism, even when a simple class-name template is used at inference, the injection unit automatically maps it to an embedding space that encodes fine-grained features such as "long thin tail," yielding better generalization and less overfitting than directly learning trainable prompts.
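A minimal PyTorch sketch of the text injection unit and its loss, assuming pre-computed frozen CLIP embeddings; `TextInjectionUnit` and `text_injection_loss` are illustrative names, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextInjectionUnit(nn.Module):
    """Linear map u_t appended after the frozen CLIP text encoder."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(text_feat)

def text_injection_loss(u_t: TextInjectionUnit,
                        template_feat: torch.Tensor,
                        desc_feats: torch.Tensor) -> torch.Tensor:
    """L_t: pull u_t(g_t(template)) toward the GPT-4 description features.

    template_feat: (d,) frozen embedding of "a photo of a [CLASS]".
    desc_feats:    (m, d) frozen embeddings of m GPT-4 descriptions.
    """
    injected = F.normalize(u_t(template_feat), dim=-1)
    targets = F.normalize(desc_feats, dim=-1)
    return -(injected @ targets.t()).mean()  # negative mean cosine similarity
```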
- Visual Knowledge Injection Unit:
    - Function: Enhances the diversity of visual features via data augmentation.
    - Mechanism: A linear layer \(u_i(\cdot): \mathbb{R}^d \to \mathbb{R}^d\) is appended after the frozen visual encoder \(\bar{g_i}\), and the similarity between features of the original image and its augmented counterpart is maximized: \(\mathcal{L}_i = -\text{Sim}\left(u_i\left(\bar{g_i}(\mathbf{x})\right), \bar{g_i}(\mathcal{A}(\mathbf{x}))\right)\), where \(\mathcal{A}\) denotes the AutoAugment data augmentation strategy (see the sketch after this item).
    - Design Motivation: The visual injection unit learns to map original features into a space that is invariant to augmentation transformations, resulting in visual embeddings that are richer and more robust.
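A matching sketch for the visual branch, under the same assumptions; `g_i` is the frozen visual encoder and `augment` stands in for an AutoAugment policy applied on the fly:

```python
import torch
import torch.nn.functional as F

def visual_injection_loss(u_i, g_i, x, augment):
    """L_i: pull u_i(g_i(x)) toward the frozen feature of the augmented view."""
    with torch.no_grad():                  # the CLIP visual encoder stays frozen
        feat = g_i(x)                      # (B, d) original-image features
        feat_aug = g_i(augment(x))         # (B, d) AutoAugmented-view features
    injected = F.normalize(u_i(feat), dim=-1)
    target = F.normalize(feat_aug, dim=-1)
    return -(injected * target).sum(dim=-1).mean()  # negative cosine similarity
```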
- Task-Specific Injection Unit Expansion + Prototype Replay:
    - Function: Independent injection units are created for each incremental task to prevent forgetting.
    - Mechanism: When learning the \(b\)-th task, new units \(u_t^b, u_i^b\) are initialized and all previously trained injection units are frozen. The final representation sums the outputs of all injection units: \(G_i(\mathbf{x}) = \sum_{p=1}^{b-1} \bar{u}_i^p(\bar{g_i}(\mathbf{x})) + u_i^b(\bar{g_i}(\mathbf{x}))\). Visual prototypes \(\mathbf{p}_k\) (class-mean embeddings) serve as proxies for old classes during training, augmented with Gaussian noise \(\epsilon \sim \mathcal{N}(0, \alpha^2 \mathbf{I})\) to increase diversity. Since the injection units are linear layers, they can be re-parameterized into a single equivalent layer \(\sum_p u_i^p\), introducing no additional inference overhead (see the sketch after this item).
    - Design Motivation: Because the CLIP image encoder remains frozen throughout, visual prototypes are consistent across stages and serve as reliable surrogates for historical classes, eliminating the need to store raw data.
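A sketch of the task-wise expansion, prototype noising, and linear re-parameterization; class and method names are illustrative:

```python
import torch
import torch.nn as nn

class ExpandingInjection(nn.Module):
    """Task-wise expandable stack of linear injection units."""
    def __init__(self):
        super().__init__()
        self.units = nn.ModuleList()

    def add_task(self, dim: int = 512) -> None:
        for u in self.units:               # freeze units from earlier tasks
            u.requires_grad_(False)
        self.units.append(nn.Linear(dim, dim))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # G(x) = sum_p u^p(g(x)): sum the outputs of all units
        return sum(u(feat) for u in self.units)

    def reparameterize(self, dim: int = 512) -> nn.Linear:
        """Collapse the sum of linear units into one equivalent layer."""
        merged = nn.Linear(dim, dim)
        with torch.no_grad():
            merged.weight.copy_(sum(u.weight for u in self.units))
            merged.bias.copy_(sum(u.bias for u in self.units))
        return merged

def noised_prototype(proto: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
    """Gaussian-noised class prototype used as an old-class training proxy."""
    return proto + alpha * torch.randn_like(proto)
```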
- Post-Tuning Knowledge Injection (Inference-Time Re-Ranking):
    - Function: GPT-4-generated pairwise discriminative features are used to locally refine the top-\(k\) predictions.
    - Mechanism: Given the top-\(k\) predicted class set \(\{y_{i_1}, \ldots, y_{i_k}\}\), GPT-4 generates pairwise discriminative descriptions for each class pair (e.g., "a cat has a long thin tail, whereas a lion has a stockier tail with a tuft"), forming a discriminative matrix \(\mathbf{D} = [\mathbf{d}_{ij}]\). Zero-shot CLIP is then used to match the query image against these pairwise descriptions: \(f_{\text{pt}, y_i}(\mathbf{x}) = \frac{1}{k-1} \sum_{j=1, j \neq i}^{k} f_{y_i}(\mathbf{x}, \mathbf{d}_{ij})\). The final prediction is the sum of injection-based matching and post-tuning matching: \(f(\mathbf{x}) = f_{\text{inj}}(\mathbf{x}) + f_{\text{pt}}(\mathbf{x})\) (see the sketch after this item).
    - Design Motivation: Post-tuning employs zero-shot CLIP (which does not change with incremental updates), providing local fine-grained refinement of predictions, particularly for distinguishing visually similar classes (e.g., cat vs. lion).
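A sketch of the re-ranking step; `clip_score` is an assumed helper returning the zero-shot CLIP image-text similarity, and `pairwise_desc[i][j]` holds the GPT-4 description contrasting candidate class \(i\) against candidate class \(j\):

```python
import torch

def post_tuning_scores(x, topk_classes, pairwise_desc, clip_score):
    """f_pt: average zero-shot match over pairwise discriminative texts."""
    k = len(topk_classes)
    scores = torch.zeros(k)
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            scores[i] += clip_score(x, pairwise_desc[i][j])
        scores[i] /= (k - 1)          # 1/(k-1) * sum over j != i
    return scores                     # added to f_inj for the final prediction
```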
Loss & Training¶
The overall training objective combines three loss terms: \(\min_{\{u_t^b, u_i^b\}} \sum_{(\mathbf{x}, y) \in \mathcal{D}^b \cup \mathbf{P}} \mathcal{L}_t + \mathcal{L}_i + \mathcal{L}_c\)
where \(\mathcal{L}_c\) is the contrastive loss based on injected features (Eq. 9), and \(\mathbf{P}\) denotes the prototype set. Training uses SGD with batch size 64 for 10 epochs, with a learning rate of 0.05 under a cosine annealing schedule. Key hyperparameters: \(\alpha = 0.25\) (prototype noise), \(k = 5\) (post-tuning candidate count).
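A self-contained sketch of one training epoch under this recipe, with random tensors standing in for frozen CLIP features; the cross-entropy term is only a stand-in for the contrastive loss of Eq. 9, and \(\mathcal{L}_t\)/\(\mathcal{L}_i\) from the earlier sketches would be added in the full objective:

```python
import torch
import torch.nn.functional as F

d, num_classes = 512, 10
u_t = TextInjectionUnit(d)            # from the earlier sketches
u_i = ExpandingInjection()
u_i.add_task(d)                       # start task b
opt = torch.optim.SGD(list(u_t.parameters()) + list(u_i.parameters()), lr=0.05)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)

for epoch in range(10):                        # 10 epochs, batch size 64
    img_feat = torch.randn(64, d)              # stand-in for frozen g_i(x)
    txt_feat = torch.randn(num_classes, d)     # stand-in for frozen g_t(t_y)
    labels = torch.randint(0, num_classes, (64,))

    logits = F.normalize(u_i(img_feat), dim=-1) @ \
             F.normalize(u_t(txt_feat), dim=-1).t()
    loss = F.cross_entropy(logits / 0.01, labels)  # CLIP-style temperature
    # The full objective adds text_injection_loss and visual_injection_loss,
    # and the batch mixes D^b with Gaussian-noised prototypes.
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```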
Key Experimental Results¶
Main Results¶
Final accuracy \(\mathcal{A}_B\) on 9 benchmark datasets (CLIP ViT-B/16, no exemplar storage):
| Method | Aircraft B0 Inc10 | CIFAR100 B0 Inc10 | Cars B0 Inc10 | ImageNet-R B0 Inc20 | CUB B0 Inc20 | UCF B0 Inc10 |
|---|---|---|---|---|---|---|
| SimpleCIL | 48.09 | 76.63 | 86.85 | 74.48 | 77.52 | 85.68 |
| L2P | 28.29 | 73.03 | 61.82 | 66.52 | 57.93 | 76.43 |
| CODA-Prompt | 27.69 | 73.43 | 66.47 | 68.95 | 62.98 | 80.14 |
| RAPF | 23.61 | 78.04 | 62.85 | 70.48 | 62.77 | 80.33 |
| Engine | 58.69 | 79.22 | 90.08 | 80.37 | 80.20 | 90.03 |
Engine outperforms all baselines across all datasets by a significant margin, most notably on the fine-grained Aircraft dataset, where it exceeds RAPF by 35 percentage points (58.69 vs. 23.61).
Ablation Study¶
Contribution of each component (CIFAR100 B0 Inc10 final accuracy):
| Configuration | \(\mathcal{A}_B\) | Note |
|---|---|---|
| ZS-CLIP (baseline) | 71.38 | Zero-shot CLIP |
| + Visual Injection | ~74 | Visual augmentation takes effect immediately |
| + Dual Injection (visual + text) | ~77 | Dual-modality injection yields further gains |
| + Post-tuning Injection (full Engine) | 79.22 | Post-tuning provides final refinement |
Fair comparison with external knowledge (all methods given identical GPT-4 descriptions and data augmentation):
| Method | Template | ImageNet-R \(\mathcal{A}_B\) | CIFAR100 \(\mathcal{A}_B\) |
|---|---|---|---|
| ZS-CLIP | GPT descriptions | 77.59 | 71.70 |
| RAPF | GPT descriptions | 71.04 | 78.52 |
| Engine | GPT descriptions | 80.37 | 79.22 |
Even when all methods are provided with identical external knowledge, Engine achieves the best performance, indicating that its gains come from how the knowledge is injected rather than from access to the knowledge itself.
Key Findings¶
- Outperforms exemplar-based methods without storing exemplars: Engine (0 exemplars) achieves 58.69% on Aircraft, surpassing PROOF (20 exemplars/class) at 53.59%.
- Every component contributes: Visual injection, text injection, and post-tuning each provide incremental gains, with ablations confirming the necessity of each component.
- Robustness to hyperparameters: Performance remains stable for \(\alpha \in [0.15, 0.35]\) and \(k \in [3, 7]\).
- t-SNE visualization confirms alignment: Engine achieves visual-text feature alignment (intra-class clustering) and preserves clustering structure of previous tasks when learning new ones.
- Post-tuning corrects erroneous predictions: Visualizations show that post-tuning can correct mispredictions (e.g., acorn→birdhouse) by attending to locally discriminative features.
Highlights & Insights¶
- Elegant use of external knowledge: GPT-4 provides general and comprehensive descriptions for each class, which are less prone to overfitting to the training data distribution than directly learned prompts.
- Lightweight injection unit design: A single linear layer suffices to encode external knowledge into the frozen encoder, and weight summation enables re-parameterization with no additional inference overhead.
- Two-stage knowledge injection: Global class-level knowledge is injected during training, while local pairwise discriminative knowledge is injected at inference — the two stages are complementary.
- Exemplar-free design: Consistent prototypes produced by the frozen encoder serve as surrogates for old classes, elegantly circumventing the need for raw data storage.
Limitations & Future Work¶
- The framework depends on GPT-4 as an external knowledge source and may fail for specialized domains not well covered by GPT-4's knowledge base (e.g., highly specialized medical imaging categories).
- Post-tuning requires GPT-4 to generate pairwise discriminative descriptions for all top-\(k\) class pairs; the combinatorial cost may become prohibitive when the number of classes is very large (e.g., 10,000+).
- The number of injection units grows linearly with the number of tasks (though re-parameterizable), and the number of prototypes also grows linearly.
- The data augmentation strategy (AutoAugment) is domain-agnostic; domain-specific augmentation strategies may yield further improvements.
Related Work & Insights¶
- The external knowledge injection paradigm is generalizable to other incremental learning scenarios (e.g., class-incremental object detection).
- GPT-4-generated pairwise discriminative descriptions can be applied to disambiguate difficult class pairs in zero-shot classification.
- The injection unit design philosophy (frozen encoder + lightweight linear mapping) is applicable to any pre-trained model requiring adaptation to downstream tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dual-stage external knowledge injection concept is novel, particularly the pairwise discriminative re-ranking in post-tuning.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 9 datasets and multiple settings, with thorough ablation and comparative studies.
- Writing Quality: ⭐⭐⭐⭐ Motivation is analyzed in depth, method presentation is clear, and visualizations are informative.
- Value: ⭐⭐⭐⭐⭐ Achieves substantial improvements in CLIP-based CIL, and the external knowledge injection paradigm has broad applicability.