C²Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning¶
Conference: NeurIPS 2025 arXiv: 2509.19674 Code: https://github.com/zhoujiahuan1991/NeurIPS2025-C2Prompt Area: CV Keywords: federated continual learning, prompt learning, class-aware aggregation, distribution compensation, knowledge conflict
TL;DR¶
C²Prompt tackles class-level knowledge inconsistency during prompt communication in federated continual learning by explicitly enhancing class-level knowledge coherence across clients via two mechanisms: Local Class Distribution Compensation (LCDC) and Class-aware Prompt Aggregation (CPA). The method achieves an Avg accuracy of 87.20% on ImageNet-R, surpassing the previous SOTA, Powder, by 2.51%.
Background & Motivation¶
Background: Federated Continual Learning (FCL) requires distributed clients to learn from continuously arriving task data under privacy constraints. Prompt-based methods (e.g., CODAPrompt + FedAvg), which maintain task-specific prompts while freezing the pre-trained backbone, have shown strong performance in FCL.
Limitations of Prior Work: Existing prompt-based FCL methods overlook class-level knowledge consistency during server-side prompt aggregation: (a) different clients hold heterogeneous data distributions for the same class (intra-class distribution gap), leading to semantically inconsistent representations; (b) inter-prompt class-wise relevance is ignored, causing irrelevant or even conflicting knowledge to be fused during aggregation.
Key Challenge: Lack of class-level consistency in prompt communication → knowledge conflicts among new prompts → interference with old prompts → simultaneously exacerbating spatial forgetting (across clients) and temporal forgetting (across tasks).
Goal: (a) How to compensate for intra-class distribution bias introduced by non-IID data at the client side? (b) How to perform precise prompt aggregation on the server side based on class-level relevance?
Key Insight: The problem is approached from the perspective of class-level knowledge coherence — applying distribution compensation at the data input level (LCDC) and class-aware weighting at the parameter aggregation level (CPA).
Core Idea: Estimate the global class distribution to compensate for local distribution bias, and employ a prompt-class affinity matrix for class-aware aggregation — a dual-pronged strategy to resolve knowledge conflicts in FCL.
Method¶
Overall Architecture¶
C²Prompt builds upon the CODAPrompt architecture, learning two types of prompts on a frozen ViT-B/16:
- Local Class Distribution Compensation Prompts \(\mathcal{P}^c_{t,k}\): one per class, aligning local features to the global class distribution.
- Local Discriminative Prompts \(\mathcal{P}^d_{t,k}\): standard CODAPrompt prompts for learning classification knowledge.
Training proceeds in two stages: Round 0 performs global distribution estimation and LCDC training; Rounds 1–\(N_r\) perform discriminative prompt learning with CPA aggregation.
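As a reading aid, here is a hypothetical outline of the per-task schedule described above; every name (`server`, `clients`, `train_compensation_prompts`, ...) is a placeholder standing in for the paper's components, not the released implementation.

```python
# Hypothetical per-task training schedule (all names are placeholders).
def train_task(server, clients, task_id, num_rounds):
    # Round 0: clients upload per-class (mean, variance, class proportion);
    # the server moment-matches a global Gaussian per class, and each client
    # then trains its compensation prompts (LCDC) and freezes them.
    stats = [c.local_class_stats(task_id) for c in clients]
    global_dist = server.estimate_global_distribution(stats)
    for c in clients:
        c.train_compensation_prompts(task_id, global_dist)

    # Rounds 1..N_r: clients learn discriminative prompts locally and upload
    # them with their prompt-class histograms; the server merges via CPA
    # instead of plain FedAvg and broadcasts the result.
    for _ in range(num_rounds):
        uploads = [c.train_discriminative_prompts(task_id) for c in clients]
        merged = server.class_aware_aggregation(uploads)
        for c in clients:
            c.load_prompts(merged)
```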
Key Designs¶
- Global Distribution Estimation:
  - Function: Aggregates per-class distribution statistics from all clients on the server to estimate the global distribution of each class.
  - Mechanism: Assuming the features of class \(i\) on client \(k\) follow a Gaussian distribution \(\mathcal{N}(\mu^t_{i,k}, (\sigma^t_{i,k})^2)\), the global mean and variance are derived via moment estimation of a Gaussian mixture: \(\mu^g_i = \sum_k \mu^t_{i,k} p^t_{k,i}\) and \((\sigma^g_i)^2 = \sum_k ((\mu^t_{i,k})^2 + (\sigma^t_{i,k})^2) p^t_{k,i} - (\mu^g_i)^2\).
  - Design Motivation: Only means and variances are transmitted (no raw data), preserving privacy while obtaining a global distributional view; the communication overhead of these few distribution parameters is minimal. A moment-matching sketch follows this list.
- Local Class Distribution Compensation (LCDC):
  - Function: Learns class-specific compensation prompts to align local features to the global distribution.
  - Mechanism: For each class \(i\), a compensation prompt \(\mathbf{p}^c_i \in \mathbb{R}^{L_c \times d}\) is learned and appended to the input tokens before passing through the frozen ViT. A distribution-alignment objective \(\mathcal{L}_c = -\frac{1}{2}(f_{x,p} - \mu^g_i)^\top (\Sigma^g_i)^{-1} (f_{x,p} - \mu^g_i)\), i.e., the Gaussian log-likelihood up to constants, is maximized so that the prompted features \(f_{x,p}\) gravitate toward the global distribution of class \(i\).
  - Design Motivation: No data generation or raw data sharing is required; local non-IID bias is mitigated purely through prompt-based feature modulation. After training, the prompts are frozen and serve as "distribution correctors" for subsequent discriminative learning. A sketch of the alignment loss follows this list.
- Class-aware Prompt Aggregation (CPA):
  - Function: Performs weighted aggregation on the server based on prompt-class affinity, rather than simple averaging.
  - Mechanism: During training, the cumulative matching score between each prompt and each class is recorded online (client histogram \(H^i_k\)) and uploaded to the server to form the matrix \(\mathbf{H}^t_g \in \mathbb{R}^{KN \times |\mathcal{C}_t|}\). An inter-prompt correlation matrix is then computed as \(W^t_g = \text{softmax}(\mathbf{H}^t_g (\mathbf{H}^t_g)^\top / \tau)\), and the aggregated prompts are obtained via \(\mathbf{P}^{t*}_g = W^t_g \mathbf{P}^t_g\).
  - Design Motivation: Prompts with similar class affinities should receive higher aggregation weights, reducing interference from irrelevant class knowledge. Histogram accumulation incurs virtually zero additional overhead during online learning. An aggregation sketch follows this list.
- Discriminative Learning + Knowledge Distillation:
  - Function: Standard classification learning with cross-round knowledge retention.
  - The total loss is \(\mathcal{L}_d = \mathcal{L}_{ce} + \beta \mathcal{L}_{kd}\), where \(\mathcal{L}_{kd}\) is the distillation loss from Powder.
  - Compensation prompts are applied with probability 0.5, balancing information from the original and compensated views (see the training-step sketch after this list).
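Below is a minimal NumPy sketch of the server-side moment matching in Global Distribution Estimation. The arithmetic follows the formulas above; the function name, array shapes, and interface are illustrative assumptions, with `props` holding the mixture weights \(p^t_{k,i}\).

```python
import numpy as np

def estimate_global_gaussian(mus, sigmas, props):
    """Moment-match K per-client Gaussians of one class into a global Gaussian.

    mus, sigmas: (K, d) arrays of per-client feature means / std devs.
    props:       (K,) mixture weights p_{k,i} (client k's sample share of
                 class i), assumed to sum to 1.
    Returns the global mean and std dev, both of shape (d,).
    """
    w = props[:, None]                                   # (K, 1) for broadcasting
    mu_g = (w * mus).sum(axis=0)                         # mixture mean
    # second moment of the mixture minus the squared mixture mean
    var_g = (w * (mus ** 2 + sigmas ** 2)).sum(axis=0) - mu_g ** 2
    return mu_g, np.sqrt(var_g)
```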
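Next, a sketch of the LCDC alignment objective, written with the sign flipped so it reads as a loss to minimize (equivalent to maximizing \(\mathcal{L}_c\)). The covariance is taken as diagonal since only per-dimension variances are exchanged; all names are illustrative.

```python
import torch

def lcdc_loss(feats, mu_g, var_g, eps=1e-6):
    """Alignment of prompted features to the global class Gaussian.

    feats:       (B, d) features f_{x,p} of one class, produced with that
                 class's compensation prompt attached to the input tokens.
    mu_g, var_g: (d,) global mean / variance of the class; the covariance is
                 diagonal because only per-dimension statistics are shared.
    Minimizing this Mahalanobis term maximizes the Gaussian log-likelihood,
    i.e., it equals -L_c up to additive constants.
    """
    diff = feats - mu_g
    return 0.5 * (diff.pow(2) / (var_g + eps)).sum(dim=1).mean()
```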
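The CPA step maps directly onto its stated formula \(W^t_g = \text{softmax}(\mathbf{H}^t_g (\mathbf{H}^t_g)^\top / \tau)\), \(\mathbf{P}^{t*}_g = W^t_g \mathbf{P}^t_g\); the stacking order and per-row flattening of prompt parameters below are assumptions.

```python
import torch

def class_aware_aggregation(H, P, tau=1.0):
    """Server-side Class-aware Prompt Aggregation.

    H: (K*N, C) prompt-class matching histograms stacked over all clients.
    P: (K*N, D) prompt parameters, one flattened prompt per row.
    Returns P* = softmax(H H^T / tau) P, so prompts with similar class
    affinities contribute more to each other's aggregate.
    """
    W = torch.softmax(H @ H.T / tau, dim=1)  # inter-prompt correlation weights
    return W @ P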
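Finally, a sketch of one client-side discriminative update combining the composite loss with the 50% compensation-prompt application. The `(logits, features)` model interface, the `use_compensation` flag, and the weight `beta` are hypothetical, and the feature-level MSE is only a stand-in for Powder's distillation term \(\mathcal{L}_{kd}\), whose exact form is not reproduced here.

```python
import random

import torch
import torch.nn.functional as F

def discriminative_step(model, prev_model, x, y, beta=1.0, p_comp=0.5):
    # With probability p_comp = 0.5, route inputs through the frozen
    # compensation prompts so training sees both the original and the
    # distribution-corrected views.
    use_comp = random.random() < p_comp
    logits, feats = model(x, use_compensation=use_comp)
    loss = F.cross_entropy(logits, y)  # L_ce
    if prev_model is not None:
        with torch.no_grad():  # frozen teacher from the previous round
            _, prev_feats = prev_model(x, use_compensation=use_comp)
        # stand-in for Powder's distillation term L_kd (exact form differs)
        loss = loss + beta * F.mse_loss(feats, prev_feats)
    return loss
```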
Loss & Training¶
- Backbone: ViT-B/16 (pre-trained on ImageNet-21k, frozen)
- Discriminative prompts: \(N=8\), \(L_p=10\), \(d=768\); compensation prompts: \(L_c=3\)
- Number of clients \(K=5\); communication rounds per task \(N_r=3\)
- Optimizer: Adam, lr=0.01
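For convenience, the same setup as a single config mapping; the timm-style backbone identifier is an assumption.

```python
config = dict(
    backbone="vit_base_patch16_224_in21k",  # frozen; assumed timm identifier
    num_prompts=8,        # N discriminative prompts
    prompt_len=10,        # L_p
    embed_dim=768,        # d
    comp_prompt_len=3,    # L_c
    num_clients=5,        # K
    rounds_per_task=3,    # N_r
    optimizer="adam",
    lr=0.01,
)
```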
Key Experimental Results¶
Main Results¶
| Method | Venue | ImageNet-R Avg↑ | ImageNet-R AIA↑ | DomainNet Avg↑ | DomainNet AIA↑ |
|---|---|---|---|---|---|
| FedWEIT | ICML2021 | 71.10 | 74.30 | 67.84 | 69.63 |
| GLFC | CVPR2022 | 72.96 | 75.21 | 69.75 | 70.34 |
| Fed-CODAP | CVPR2023 | 79.65 | 75.14 | 72.47 | 72.84 |
| Powder | ICML2024 | 84.69 | 84.08 | 75.98 | 77.28 |
| C²Prompt | Ours | 87.20 | 85.93 | 78.88 | 77.55 |
C²Prompt surpasses Powder by 2.51% in Avg accuracy on ImageNet-R and by 2.90% on DomainNet.
Ablation Study¶
| Configuration | ImageNet-R Avg | Note |
|---|---|---|
| Baseline (Powder) | 84.69 | baseline |
| + LCDC only | 86.57 (+1.88) | distribution compensation is effective |
| + CPA only | 86.02 (+1.33) | class-aware aggregation is effective |
| + LCDC + CPA (Full) | 87.20 (+2.51) | two components are complementary |
Key Findings¶
- LCDC and CPA resolve knowledge inconsistency at the input and parameter levels respectively, exhibiting strong complementarity.
- C²Prompt is the only method achieving a negative forgetting measure (FM < 0) on the large-scale DomainNet dataset.
- Forward transfer (FT) shows the largest improvement: +3.15% on ImageNet-R and +2.59% on DomainNet, indicating that global distribution estimation effectively facilitates new task learning.
- Communication overhead increases by only 0.6% over Powder, with no additional parameters or computation at inference time.
Highlights & Insights¶
- The privacy–efficiency trade-off in global distribution estimation is elegantly designed: only means and variances are transmitted — no raw data, no gradients — with minimal communication overhead yet effective mitigation of non-IID gaps.
- "Free" implementation of class-aware aggregation: client histograms are accumulated online during training at zero additional cost, yet provide precise prompt-class affinity information as aggregation weights.
- Prompt attention visualizations (Figure 5) intuitively demonstrate that CPA focuses prompts on discriminative regions, whereas Powder's prompt attention is diffuse.
Limitations & Future Work¶
- The Gaussian assumption may be inaccurate for complex multimodal distributions, particularly when intra-class sub-cluster structures exist.
- Validation is limited to ViT-B/16 and image classification; extension to larger backbones or NLP tasks has not been explored.
- The number of clients is fixed at 5; scalability to larger federations (e.g., 50+ clients) remains unverified.
- The compensation prompt usage probability \(p=0.5\) is fixed; an adaptive probability strategy could be explored.
Related Work & Insights¶
- vs. Powder (ICML2024): Powder introduces knowledge distillation for cross-round retention; this work builds upon it by adding class-level distribution compensation and class-aware aggregation. C²Prompt retains Powder's distillation loss.
- vs. CODAPrompt: C²Prompt adopts CODA as the base prompt architecture, with the key difference that aggregation is replaced from simple FedAvg to class-aware weighted aggregation.
- vs. PILoRA/LoRM: LoRA-based FCL methods perform substantially worse than prompt-based methods (PILoRA achieves only 45.43%), highlighting the inherent advantage of prompt learning in FCL.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of global distribution estimation and class-aware aggregation is clear and effective, though each individual module has precedents.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, 12 baselines, ablation studies, and visualizations are thorough; however, large-scale client experiments are absent.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, though the notation is dense and derivations are deferred to the appendix.
- Value: ⭐⭐⭐⭐ FCL is a practically motivated research direction; the proposed method is pragmatic with low overhead.