Confidence Self-Calibration for Multi-Label Class-Incremental Learning¶

Conference: ECCV2024
arXiv: 2403.12559
Authors: Kaile Du, Yifan Zhou, Fan Lyu, Yuyang Li, Chen Lu, Guangcan Liu (Southeast University, Institute of Automation, Chinese Academy of Sciences)
Area: Graph Learning
Keywords: multi-label class-incremental learning, confidence calibration, graph convolutional network, max-entropy regularization, partial label

TL;DR¶

To address the overconfident predictions and false-positive errors caused by partial labels in Multi-Label Class-Incremental Learning (MLCIL), a Confidence Self-Calibration (CSC) framework is proposed. It calibrates label relationships using a Class-Incremental Graph Convolutional Network (CI-GCN) and calibrates confidence via max-entropy regularization, significantly outperforming SOTA methods on MS-COCO and VOC.

Background & Motivation¶

The core challenge of Multi-Label Class-Incremental Learning (MLCIL) is the task-level partial label problem: in each incremental task, only new classes of the current task are annotated, while past and future class labels are missing. This inherently results in disjoint new and old label spaces.
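The task-level partial-label setting can be illustrated with a small sketch. It assumes the common reading of protocol names such as B0-C10 (no base classes, then 10 new classes per task) and B4-C2 (4 base classes, then 2 per task). The helper names `split_tasks` and `visible_labels` and the toy class list are hypothetical and not taken from the paper.

```python
def split_tasks(class_names, base, increment):
    """Split an ordered class list into tasks.

    `base` classes form the first task (skipped when base == 0), and every
    later task receives `increment` new classes.
    """
    tasks = []
    start = 0
    if base > 0:
        tasks.append(class_names[:base])
        start = base
    for i in range(start, len(class_names), increment):
        tasks.append(class_names[i:i + increment])
    return tasks


def visible_labels(full_labels, task_classes):
    """Keep only the annotations that belong to the current task's classes."""
    return sorted(set(full_labels) & set(task_classes))


# Toy label space (assumption): six classes in a fixed order.
classes = ["bird", "car", "cat", "dog", "person", "sofa"]
tasks = split_tasks(classes, base=0, increment=2)

# An image that truly contains a dog and a person, seen during the third task:
# the "dog" annotation is missing because "dog" belongs to an earlier task.
seen = visible_labels(["dog", "person"], tasks[2])
print(tasks)
print(seen)
```

The missing "dog" label is exactly the kind of gap that makes co-occurrence statistics incomplete and leaves old-class outputs without supervision.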

Existing methods (such as AGCN, KRT) overlook a key phenomenon: the model outputs overconfident prediction distributions under the partial-label setting, producing many false-positive errors. For example, even if a test image only contains "person", the model might output high confidence for the old class "dog". As the label space continuously expands, this overconfidence issue exacerbates catastrophic forgetting.

The authors' motivation follows directly: because the root cause is miscalibrated confidence, calibration should be applied at two levels at once, to the label relationships and to the predicted confidence.

Core Problem¶

Disrupted Label Relationships: Under the partial label setup, complete label co-occurrence statistics are unattainable, making cross-task label relationships difficult to construct.
Overconfident Predictions: With labels missing, the model easily confuses new- and old-class features. The output distribution becomes multi-modal and overconfident, with precision far lower than recall, so the false-positive rate stays persistently high.
Exacerbated Catastrophic Forgetting: The superposition of the above two factors leads to a severe degradation of performance on old classes.
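The gap between precision and recall can be made concrete with a toy example. The sketch below is illustrative only: the scores, the 0.5 threshold, and the helper names `predict` and `precision_recall` are assumptions, not values or code from the paper.

```python
def predict(scores, threshold=0.5):
    """Turn per-class confidence scores into a set of predicted labels."""
    return {label for label, score in scores.items() if score >= threshold}


def precision_recall(pred_sets, true_sets):
    """Micro-averaged precision and recall over a list of images."""
    tp = sum(len(p & t) for p, t in zip(pred_sets, true_sets))
    fp = sum(len(p - t) for p, t in zip(pred_sets, true_sets))
    fn = sum(len(t - p) for p, t in zip(pred_sets, true_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The image contains only "person", but the old class "dog" also receives
# a high (overconfident) score and becomes a false positive.
scores = {"person": 0.95, "dog": 0.90, "cat": 0.20}
pred = predict(scores)
precision, recall = precision_recall([pred], [{"person"}])
print(precision, recall)  # 0.5 1.0
```

Recall is perfect because the true label is found, while precision is halved by the single overconfident old-class output.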

Method¶

Overall Architecture of CSC¶

CSC comprises two major components:

(1) Class-Incremental Graph Convolutional Network (CI-GCN) — Label Relationship Calibration

CI-GCN is a two-layer stacked GCN structure that does not rely on prior statistical information:

General GCN: Uses a learnable general correlation matrix (CM) \(A_g\) to learn cross-task label relationships automatically through gradient updates. \(A_g\) is divided into an old-task part \(A_g^{1:t-1}\) (inherited from previous tasks to preserve old label relationships) and a new-task part \(A_g^t\) (establishing relationships for the new label space). The key innovation is that the CM is updated by gradient descent using ground-truth labels and pseudo-labels jointly, which avoids the error accumulation of fixed statistical matrices.
Specific GCN: Adaptively generates an image-specific CM \(A_s\) from the output \(V_1\) of the General GCN. Specifically, global pooling and a convolution are applied to \(V_1\) to obtain a global feature \(v\); \(v\) is concatenated with \(V_1\) to form \(V_1'\), and a convolutional layer then computes \(A_s = \sigma(V_1' W)\). This provides sample-level, fine-grained label relationships.
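How the general CM might grow when a new task arrives can be sketched as follows. This is a minimal illustration under stated assumptions: the block layout (old-old entries copied, every entry touching a new class freshly initialized), the initialization range, and the function name `expand_general_cm` are hypothetical. In a real model all entries would then be trained by gradient descent.

```python
import random


def expand_general_cm(old_cm, num_new, seed=0):
    """Grow a square correlation matrix from n_old to n_old + num_new classes.

    The old-old block is copied unchanged (inherited relationships). Every
    entry that involves a new class gets a small random initial value.
    """
    rng = random.Random(seed)
    n_old = len(old_cm)
    n = n_old + num_new
    new_cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_old and j < n_old:
                new_cm[i][j] = old_cm[i][j]          # inherited old-task part
            else:
                new_cm[i][j] = rng.uniform(-0.01, 0.01)  # new-task part
    return new_cm


# Two old classes, two new classes (toy values).
old_cm = [[1.0, 0.3], [0.2, 1.0]]
new_cm = expand_general_cm(old_cm, num_new=2)
print(len(new_cm), len(new_cm[0]))  # 4 4
```

Because the matrix is simply enlarged, no co-occurrence statistics over the full label space are required at any point.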

The graph nodes \(V_0\) are class-specific features obtained by decoupling the CNN backbone's feature map \(F\) with the class activation map \(M\): \(V_0 = M^\top \otimes F\). The two-layer GCN computes:

\[V_1 = \text{LReLU}(A_g V_0 W_g), \quad V_2 = \text{LReLU}(A_s V_1 W_s)\]
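The two-layer computation can be sketched in plain Python. Several simplifications are assumptions rather than details from the paper: \(\sigma\) is taken to be a sigmoid, the global feature \(v\) is a plain average over nodes (the extra convolution is omitted), the LeakyReLU slope is 0.2, and all function names and toy weights are hypothetical.

```python
import math


def matmul(a, b):
    """Matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def leaky_relu(m, slope=0.2):
    return [[x if x > 0 else slope * x for x in row] for row in m]


def sigmoid(m):
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in m]


def specific_cm(v1, w_cat):
    """A_s = sigmoid(V_1' W), where V_1' appends the pooled feature v to each node."""
    num_nodes = len(v1)
    v = [sum(col) / num_nodes for col in zip(*v1)]  # global average over nodes
    v1_cat = [row + v for row in v1]                # shape C x 2D
    return sigmoid(matmul(v1_cat, w_cat))           # shape C x C


def ci_gcn_forward(v0, a_g, w_g, w_cat, w_s):
    """V_1 = LReLU(A_g V_0 W_g), then V_2 = LReLU(A_s V_1 W_s)."""
    v1 = leaky_relu(matmul(matmul(a_g, v0), w_g))
    a_s = specific_cm(v1, w_cat)
    v2 = leaky_relu(matmul(matmul(a_s, v1), w_s))
    return v1, a_s, v2


# Toy setup: C = 3 classes (nodes), D = 2 feature dimensions.
eye3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eye2 = [[1.0, 0.0], [0.0, 1.0]]
v0 = [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]]
w_cat = [[0.1] * 3 for _ in range(4)]  # maps 2D = 4 inputs to C = 3 outputs
v1, a_s, v2 = ci_gcn_forward(v0, eye3, eye2, w_cat, eye2)
print(len(a_s), len(a_s[0]), len(v2), len(v2[0]))  # 3 3 3 2
```

With identity matrices for \(A_g\) and \(W_g\), the first layer reduces to a LeakyReLU over \(V_0\), which makes the role of the image-specific \(A_s\) in the second layer easy to inspect.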

(2) Max-Entropy Regularization — Confidence Calibration

Observing that the model's output distribution is overconfident (low entropy), the authors quantify the uncertainty of old-class predictions using Shannon entropy:

\[H = -\mathbb{E}_{c \in \mathcal{C}^{1:t-1}} [\hat{y}_c^t \log(\hat{y}_c^t)]\]

During training, \(H\) is subtracted from the loss, so minimizing the loss maximizes entropy and penalizes overconfident output distributions:

\[L = L_{\text{cls}} - \beta H\]

where \(L_{\text{cls}}\) integrates cross-entropy (for new classes) and knowledge distillation (for old classes), and \(\beta\) controls the regularization intensity. The final prediction fuses the classifier output and the graph representation: \(\hat{y}^t = \hat{y}_{\text{cls}}^t + \hat{y}_{\text{gcn}}^t\).
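The effect of the regularizer can be checked numerically. The sketch below follows the formulas as written above (the mean of \(-\hat{y}_c \log \hat{y}_c\) over old classes, then \(L = L_{\text{cls}} - \beta H\)). The probability values, the fixed \(L_{\text{cls}} = 1.0\), \(\beta = 0.5\), and the function names are illustrative assumptions.

```python
import math


def old_class_entropy(probs_old, eps=1e-12):
    """H = -mean over old classes of y_c * log(y_c)."""
    return -sum(p * math.log(p + eps) for p in probs_old) / len(probs_old)


def csc_loss(l_cls, probs_old, beta):
    """L = L_cls - beta * H: higher entropy on old classes lowers the loss."""
    return l_cls - beta * old_class_entropy(probs_old)


# Old-class outputs pushed towards 0 or 1 (overconfident) versus softer outputs.
confident = [0.99, 0.98, 0.01]
uncertain = [0.40, 0.35, 0.30]

h_conf = old_class_entropy(confident)
h_unc = old_class_entropy(uncertain)
loss_conf = csc_loss(1.0, confident, beta=0.5)
loss_unc = csc_loss(1.0, uncertain, beta=0.5)
print(h_conf < h_unc, loss_unc < loss_conf)  # True True
```

For the same classification loss, the overconfident outputs receive the larger total loss, which is the pressure that discourages spurious high scores on old classes.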

Key Designs¶

The General CM is learnable rather than fixed from statistics, which avoids accumulating pseudo-label errors.
Specific CM is adaptively generated for each image, introducing more flexibility in handling rare label combinations.
Both types of CM scale automatically as the number of classes increases, requiring no manual adjustments.
Max-entropy regularization acts only on old classes, directly targeting false positives.

Key Experimental Results¶

MS-COCO 2014¶

Setting	Method	Buffer	Last mAP	CF1	OF1
B0-C10	KRT (SOTA)	0	65.9	55.6	56.5
B0-C10	CSC	0	72.8	64.9	66.8
B0-C10	KRT-R	5/class	68.3	60.0	61.0
B0-C10	CSC-R	5/class	73.7	67.3	68.1
B0-C10	CSC-R	20/class	74.8	67.8	68.6

CSC without a buffer (72.8%) even surpasses the performance of KRT-R using a 20/class buffer (70.2%).

PASCAL VOC 2007¶

Setting	Method	Buffer	Last mAP	Avg. mAP
B0-C4	KRT-R	2/class	83.4	90.7
B0-C4	CSC-R	2/class	87.9	92.4
B4-C2	AGCN-R	2/class	59.3	74.3
B4-C2	CSC-R	2/class	86.6	90.4

CSC-R outperforms AGCN-R by 27.3 points of Last mAP (86.6 vs. 59.3) in the most challenging B4-C2 scenario, demonstrating strong robustness.

Ablation Study¶

Component	mAP (B0-C10)	CF1	OF1
Baseline (KD only)	42.4	45.3	43.7
+ Max-Entropy	47.6	50.3	49.5
+ CI-GCN	69.3	59.0	59.5
+ Combination of both (CSC)	72.8	64.9	66.8

CI-GCN contributes the most (+26.9 mAP points over the baseline), and Max-Entropy adds a further 3.5 points on top of it. Max-Entropy reduces the false-positive rate from 35% to 19%.
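The gains quoted for the ablation can be recomputed from the mAP column of the table. The dictionary keys and the helper `gain` are hypothetical names used only for this check.

```python
# mAP values (B0-C10) copied from the ablation table above.
ablation_map = {
    "baseline": 42.4,
    "max_entropy": 47.6,
    "ci_gcn": 69.3,
    "csc": 72.8,
}


def gain(from_key, to_key):
    """Absolute mAP difference in points, rounded to one decimal."""
    return round(ablation_map[to_key] - ablation_map[from_key], 1)


ci_gcn_gain = gain("baseline", "ci_gcn")        # CI-GCN alone over the baseline
entropy_on_top = gain("ci_gcn", "csc")          # Max-Entropy added on top of CI-GCN
entropy_alone = gain("baseline", "max_entropy")  # Max-Entropy alone over the baseline
print(ci_gcn_gain, entropy_on_top, entropy_alone)  # 26.9 3.5 5.2
```

These are absolute differences in mAP points, not relative percentage improvements.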

CM Structure Ablation¶

The G → S (Softmax) combination is optimal (mAP 72.8%), outperforming the fixed statistical CM Z → Z (64.1%), validating the advantage of the learnable CM.

Highlights & Insights¶

Insightful Problem Analysis: It is the first to explicitly point out the connection between overconfident output distributions and false-positive errors in MLCIL, and proposes a solution from the perspective of calibration.
Well-Designed CI-GCN: The two-layer General + Specific structure calibrates label relationships from coarse (cross-task) to fine (per-image), and the CM can be learned and expanded, which avoids accumulating statistical errors.
Simple and Effective Max-Entropy Regularization: A single additional regularization term significantly reduces the false-positive rate (35% → 19%) and yields gains when added to different methods.
Strong Experimental Results: CSC without a buffer even outperforms SOTA methods with a buffer of 20/class.
Strong Robustness: Performance fluctuates very little across scenarios with different incremental step sizes.

Limitations & Future Work¶

Only validated on MS-COCO (80 classes) and VOC (20 classes); larger-scale datasets (such as Open Images) have not been tested.
The backbone network is based on CNN (TResNetM); the adaptation of CI-GCN under Vision Transformer architectures has not been explored.
Max-entropy regularization only operates on old classes, and confidence calibration for new classes remains uninvestigated.
Random initialization of the General CM might affect the performance of the first task, suggesting the need for exploring better initialization strategies.
The class incremental order is fixed lexicographically, and the sensitivity to random or hard-first ordering is not analyzed.

Method	Label Relationship Modeling	Confidence Calibration	Mechanism
AGCN	Fixed statistical CM + pseudo-labels	None	Constructing a fixed graph with pseudo-labels
KRT (Prev. SOTA)	Knowledge recovery and transfer tokens	None	old/new knowledge token framework
CSC (Ours)	Learnable dual-layer CM (CI-GCN)	Max-Entropy regularization	Label relationship calibration + confidence calibration

Compared with AGCN and KRT, CSC offers two advantages: (1) its CM is learned through gradients rather than fixed from statistics as in AGCN, yielding stronger adaptability; (2) it explicitly addresses the overconfidence issue that both methods overlook. Compared with L2P (ViT-B/16, 86M parameters), CSC achieves better results with fewer parameters (TResNetM, 29.4M).

Universality of Calibration Perspective: The overconfidence problem is prevalent in incremental learning. The Max-Entropy concept can be transferred to tasks such as single-label CIL and incremental semantic segmentation.
Learnable Graph Structures: The design concept in CI-GCN where the CM is learned via gradients can be extended to other scenarios that require dynamic relational graph construction.
Complementarity with Focal Loss: Focal Loss focuses on hard samples, while Max-Entropy regularization focuses on calibration. The two might be complementary.
Comparison with Label Smoothing: Both adjust the "sharpness" of the output distribution, but Max-Entropy directly optimizes information entropy, which is theoretically more direct.

Rating¶

Novelty: ⭐⭐⭐⭐ (The calibration perspective is novel for MLCIL, and the CI-GCN design is reasonable)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Multiple datasets, multiple scenarios, detailed ablation studies, and comprehensive visualizations)
Writing Quality: ⭐⭐⭐⭐ (Clear problem formulation and abundant charts)
Value: ⭐⭐⭐⭐ (An important advancement in the MLCIL direction; the calibration perspective is highly inspiring)