
Enhancing Few-Shot Class-Incremental Learning via Training-Free Bi-Level Modality Calibration

Conference: CVPR 2025
Code: https://github.com/yychen016/BiMC
Area: Few-Shot Class-Incremental Learning / Vision-Language Models
Keywords: FSCIL, Bi-level Modality Calibration, CLIP, Training-Free, LLM Descriptions, Visual Prototypes

TL;DR

This paper proposes BiMC (Bi-level Modality Calibration), a framework built on a frozen CLIP model. It combines intra-modal calibration (pairing fine-grained, LLM-generated class descriptions with visual prototypes) and inter-modal calibration (fusing pre-trained language knowledge with task-specific visual priors). BiMC achieves state-of-the-art FSCIL performance without training any parameters, outperforming the best baseline by 4.25 percentage points in average accuracy on CIFAR-100.

Background & Motivation

Background: Few-Shot Class-Incremental Learning (FSCIL) requires a model to learn new classes in each session from only a handful of samples (e.g., 5-shot) while keeping its ability to recognize all previously learned classes. The setting is hard because the model risks both overfitting the new classes and forgetting the old ones. Existing methods mostly rely on vision models (e.g., ResNet): they train a feature extractor during the base session and mitigate forgetting in incremental sessions through freezing or regularization strategies.

Limitations of Prior Work: (1) Regardless of how regularization strategies are designed, training-based methods inevitably disturb existing knowledge representations as long as parameters are updated, meaning catastrophic forgetting is merely "alleviated" rather than "eliminated"; (2) Purely visual models easily overfit the extremely few samples of new classes under the 5-shot setting, and the learned features are not robust; (3) Pre-trained VLMs (such as CLIP) possess powerful zero-shot generalization capabilities, but existing FSCIL methods do not fully utilize the cross-modal alignment capability of VLMs — especially the semantic understanding capability of the text modality for categories.

Key Challenge: FSCIL requires an extreme balance between "stability" (not forgetting old classes) and "plasticity" (learning new classes). Training-based methods struggle to escape the stability-plasticity dilemma, while entirely training-free approaches seemingly fail to adapt effectively to new tasks.

Goal: To achieve FSCIL while completely avoiding parameter updates, leveraging the frozen pre-trained knowledge of CLIP to fundamentally eliminate catastrophic forgetting.

Key Insight: The joint vision-language space of CLIP already implicitly contains rich semantic structures. The core question is not how to learn new features, but how to locate the representation of each category more precisely in the existing joint space — through "calibration" rather than "learning."

Core Idea: By completely freezing CLIP and employing a bi-level calibration mechanism — intra-modal calibration to make the visual and textual category representations more precise, and inter-modal calibration to fuse complementary information from both modalities and eliminate biases — training-free incremental learning is achieved.

Method

Overall Architecture

The BiMC workflow has four parts: (1) Base Session: build a visual prototype (mean feature) for each base class from all of its samples, and use an LLM (such as GPT) to generate fine-grained natural language descriptions of each class, which are encoded as text prototypes; (2) Intra-modal Calibration: combine LLM descriptions and visual prototypes to estimate the classifier within each modality more accurately; (3) Inter-modal Calibration: fuse textual semantic knowledge with visual task priors to remove modality bias; (4) Incremental Session: when new classes arrive, compute only the mean visual feature of their few-shot samples and the text prototype from their LLM descriptions, with no training.
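The base and incremental sessions described above can be sketched as a prototype bank that only ever appends entries. This is a minimal illustration, not the authors' implementation: the `PrototypeBank` class, the L2 normalization, and the random arrays standing in for CLIP image and text features are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class PrototypeBank:
    """Hypothetical container: one visual and one text prototype per class."""

    def __init__(self):
        self.visual = {}  # class id -> (D,) visual prototype
        self.text = {}    # class id -> (D,) text prototype

    def add_class(self, class_id, image_feats, description_feats):
        # Visual prototype: mean of the image features of this class.
        # (Normalizing before and after the mean is an assumption.)
        self.visual[class_id] = l2_normalize(l2_normalize(image_feats).mean(axis=0))
        # Text prototype: mean of the encoded LLM descriptions of this class.
        self.text[class_id] = l2_normalize(l2_normalize(description_feats).mean(axis=0))


rng = np.random.default_rng(0)
dim = 16  # stand-in feature size; real CLIP features are larger
bank = PrototypeBank()

# Base session: many samples per class, a few descriptions per class.
for c in range(3):
    bank.add_class(c, rng.normal(size=(50, dim)), rng.normal(size=(4, dim)))
base_snapshot = {c: v.copy() for c, v in bank.visual.items()}

# Incremental session: two new classes with 5 shots each, no training step.
for c in range(3, 5):
    bank.add_class(c, rng.normal(size=(5, dim)), rng.normal(size=(4, dim)))

# Old prototypes are untouched because nothing is ever updated in place.
unchanged = all(np.array_equal(bank.visual[c], base_snapshot[c]) for c in base_snapshot)
print(len(bank.visual), unchanged)
```

Because a new session only adds dictionary entries, the stored representation of every earlier class stays bit-for-bit identical, which is the structural property the training-free design relies on.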

Key Designs

  1. Intra-modal Calibration:

    • Function: To improve the accuracy of category prototypes within a single modality.
    • Mechanism: For the text modality, an LLM (such as GPT-3.5/4) generates \(N_d\) fine-grained descriptions \(d_1^c, \dots, d_{N_d}^c\) for each category \(c\) (e.g., "a small songbird with a red breast"). These descriptions are encoded by the CLIP text encoder and averaged to obtain a text prototype \(\mathbf{t}_c = \frac{1}{N_d} \sum_{i=1}^{N_d} \text{CLIP}_t(d_i^c)\), which carries more semantic information than the class-name template "a photo of a [class]". For the visual modality, the visual prototype \(\mathbf{v}_c\) is the mean CLIP visual feature over all samples of the class in the base session; in incremental sessions, the mean over the 5 available shots serves as an approximate prototype.
    • Design Motivation: The original class name templates of CLIP are too coarse, and the fine-grained descriptions generated by the LLM help distinguish fine-grained categories (e.g., different bird species). Visual prototypes provide data-driven category centers, and the two modalities complement each other.
  2. Inter-modal Calibration:

    • Function: To fuse textual semantic knowledge and visual task priors, eliminating biases of a single modality.
    • Mechanism: Text prototypes originate from a pre-trained general semantic space (which may not completely align with specific dataset distributions), whereas visual prototypes originate from specific task-related data but suffer from noise due to limited samples. The two are fused through adaptive weighted combination, obtaining the final classification prototype \(\mathbf{p}_c = \lambda \mathbf{t}_c + (1 - \lambda) \mathbf{v}_c\). The weight \(\lambda\) is adaptively determined based on the classification confidence of each modality on a validation set.
    • Design Motivation: The CLIP text encoder is strong in general semantics but may not understand specific task data distributions; visual prototypes are close to the actual data but exhibit high variance in the 5-shot setting. Inter-modal calibration allows both types of information to complement each other's weaknesses.
  3. Prediction Enhancement Strategy:

    • Function: To further improve classification robustness.
    • Mechanism: Additional scoring strategies make fuller use of the limited data: (a) a Mahalanobis distance metric is added on top of cosine similarity to exploit the within-class covariance structure; (b) the BiMC-Ensemble variant ensembles predictions over multiple text templates and augmented visual features to improve the reliability of uncertainty estimation.
    • Design Motivation: Under 5-shot scenarios, a single metric is highly susceptible to noise, and a multi-strategy ensemble can effectively smooth predictions.
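The fusion rule \(\mathbf{p}_c = \lambda \mathbf{t}_c + (1 - \lambda) \mathbf{v}_c\) and the two scoring metrics named above can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions: the function names are hypothetical, \(\lambda\) is passed in as a fixed number because this note does not give the rule for choosing it, the Mahalanobis score uses a single shared covariance with ridge shrinkage (an assumption), and the toy 3-dimensional vectors stand in for CLIP features. How the cosine and Mahalanobis scores are combined is not specified here, so the sketch computes them separately.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fuse_prototypes(text_protos, visual_protos, lam):
    """Inter-modal fusion p_c = lam * t_c + (1 - lam) * v_c on (C, D) arrays."""
    return lam * text_protos + (1.0 - lam) * visual_protos


def cosine_logits(feats, protos):
    """Cosine similarity between (N, D) features and (C, D) prototypes."""
    return l2_normalize(feats) @ l2_normalize(protos).T


def mahalanobis_logits(feats, visual_protos, cov, shrinkage=1e-3):
    """Negative squared Mahalanobis distance to each visual prototype.

    Assumption: one covariance shared by all classes, regularized with
    a small ridge term so that it is always invertible.
    """
    d = cov.shape[0]
    precision = np.linalg.inv(cov + shrinkage * np.eye(d))
    diff = feats[:, None, :] - visual_protos[None, :, :]  # (N, C, D)
    return -np.einsum("ncd,de,nce->nc", diff, precision, diff)


# Toy prototypes for two classes (stand-ins for CLIP text/visual prototypes).
text_protos = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
visual_protos = np.array([[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]])
fused = fuse_prototypes(text_protos, visual_protos, lam=0.5)

# Two query features, one near each class.
queries = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.2]])
pred_cos = cosine_logits(queries, fused).argmax(axis=1)
pred_maha = mahalanobis_logits(queries, visual_protos, cov=0.1 * np.eye(3)).argmax(axis=1)
print(pred_cos, pred_maha)
```

Setting `lam=1.0` recovers a text-only classifier and `lam=0.0` a visual-only one, which correspond to the two single-modality rows of the ablation study.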

Loss & Training

The proposed method is entirely training-free and does not involve any loss function or backpropagation. All operations are forward inference: feature extraction -> prototype calculation -> similarity matching -> calibration and fusion -> prediction. This guarantees zero catastrophic forgetting.

Key Experimental Results

Main Results

| Dataset | Method | Base Acc (%) | Last Session Acc (%) | Average Acc (%) |
|---|---|---|---|---|
| CIFAR-100 | TEEN (NeurIPS 2023) | 83.41 | 71.20 | 76.83 |
| CIFAR-100 | LP-DiF (CVPR 2024) | 85.72 | 73.45 | 79.16 |
| CIFAR-100 | BiMC (Ours) | 89.26 | 78.15 | 83.41 |
| CUB-200 | TEEN | 79.85 | 67.23 | 73.45 |
| CUB-200 | LP-DiF | 82.13 | 70.58 | 76.29 |
| CUB-200 | BiMC (Ours) | 86.54 | 74.36 | 79.85 |
| miniImageNet | TEEN | 81.20 | 68.95 | 74.88 |
| miniImageNet | BiMC (Ours) | 87.15 | 75.62 | 80.93 |

Ablation Study

| Configuration | CIFAR-100 Last Acc (%) | CUB-200 Last Acc (%) | Description |
|---|---|---|---|
| Full BiMC | 78.15 | 74.36 | Full model |
| w/o LLM descriptions (class-name template only) | 74.28 | 70.12 | Rough text prototypes degrade significantly |
| w/o Visual prototypes (text only) | 73.95 | 69.84 | Lacks task-specific visual information |
| w/o Inter-modal calibration | 75.62 | 71.89 | Modalities are not fused |
| BiMC-Ensemble | 79.40 | 75.92 | Ensemble further improves performance |

Key Findings

  • BiMC outperforms the best baseline (LP-DiF) in average accuracy by 4.25 percentage points on CIFAR-100 and by 3.56 points on CUB-200; the corresponding last-session gaps are 4.70 and 3.78 points.
  • LLM descriptions add about 3.9 points of last-session accuracy on CIFAR-100 and about 4.2 points on CUB-200 over the class-name template alone, showing that fine-grained text descriptions substantially strengthen CLIP-based classifiers.
  • Inter-modal calibration yields approx +2.5% improvement, demonstrating the necessity of cross-modal information fusion.
  • Throughout all incremental sessions, the forgetting rate of BiMC remains zero (since there are no parameter updates), whereas the training-based method TEEN suffers from accelerated forgetting in later sessions.
  • The ensemble strategy adds a further 1.2-1.6 points, at roughly three times the inference cost.
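The gaps quoted in these findings can be re-derived from the two tables. The short check below assumes only the accuracy values as transcribed in this note; the dictionary keys such as `no_llm` and `no_inter` are labels chosen for the example.

```python
# Accuracies (%) copied from the Main Results and Ablation Study tables.
avg_acc = {
    "CIFAR-100": {"BiMC": 83.41, "LP-DiF": 79.16},
    "CUB-200": {"BiMC": 79.85, "LP-DiF": 76.29},
}
last_acc = {
    "CIFAR-100": {"BiMC": 78.15, "LP-DiF": 73.45, "no_llm": 74.28,
                  "no_inter": 75.62, "ensemble": 79.40},
    "CUB-200": {"BiMC": 74.36, "LP-DiF": 70.58, "no_llm": 70.12,
                "no_inter": 71.89, "ensemble": 75.92},
}


def gap(table, dataset, a, b):
    """Difference a - b in percentage points, rounded to two decimals."""
    return round(table[dataset][a] - table[dataset][b], 2)


report = {}
for ds in avg_acc:
    report[ds] = {
        "avg_vs_baseline": gap(avg_acc, ds, "BiMC", "LP-DiF"),
        "last_vs_baseline": gap(last_acc, ds, "BiMC", "LP-DiF"),
        "llm_descriptions": gap(last_acc, ds, "BiMC", "no_llm"),
        "inter_modal": gap(last_acc, ds, "BiMC", "no_inter"),
        "ensemble": gap(last_acc, ds, "ensemble", "BiMC"),
    }
    print(ds, report[ds])
```

Under these table values, the 4.25 and 3.56 figures match the Average Acc column, while the ablation gains are computed on last-session accuracy.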

Highlights & Insights

  • "Training-Free" Paradigm Advantage: Completely freezing CLIP guarantees zero forgetting. This design philosophy possesses unique structural advantages in FSCIL. As long as the pre-trained feature space of CLIP is sufficiently robust, "calibration" is safer than "learning."
  • Introducing LLM as a Knowledge Source: Injecting domain knowledge into text prototypes via GPT-generated fine-grained class descriptions is cost-effective and highly efficient. This idea can be easily extended to all zero/few-shot scenarios requiring semantic descriptions of categories.
  • Simplicity of Design: The entire method has no tedious training loops, requiring only forward inference and simple weighted fusion, which makes deployment and replication exceptionally easy.

Limitations & Future Work

  • The method depends heavily on the quality of CLIP pre-training; in domains poorly covered by the CLIP pre-training distribution (e.g., medical images, remote sensing), performance may degrade substantially.
  • The quality of LLM-generated descriptions affects the accuracy of textual prototypes. Differences in writing styles of various LLMs can cause performance fluctuations.
  • The adaptive weight \(\lambda\) requires tuning on a validation set, which is itself unreliable in extreme few-shot situations.
  • The estimation variance of 5-shot visual prototypes is relatively high; more robust prototype estimation methods can be explored in future work.

Related Work & Insights

  • vs TEEN (NeurIPS 2023): TEEN mitigates forgetting through knowledge distillation but still requires training. BiMC is completely training-free, experiences zero forgetting, and yields higher accuracy.
  • vs LP-DiF (CVPR 2024): LP-DiF generates pseudo-samples using diffusion models to enhance new classes, incurring high training costs. BiMC requires no generative processes.
  • vs CuPL: CuPL also uses LLM descriptions to augment CLIP classifiers, but only for zero-shot classification. BiMC extends this idea to incremental learning scenarios.
  • vs FeCAM: FeCAM estimates class prototypes based on Gaussian Discriminant Analysis. The bi-modal calibration of BiMC is more robust than purely visual prototypes.

Rating

  • Novelty: ⭐⭐⭐⭐ Training-free FSCIL + LLM description + bi-level calibration is highly novel in its combination.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three standard benchmarks, detail-oriented ablation studies, and analysis across all incremental stages.
  • Writing Quality: ⭐⭐⭐⭐ Clear system workflow and comprehensive experiments.
  • Value: ⭐⭐⭐⭐⭐ Highly practical training-free paradigm, already cited 9 times.
