PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/xwangrs/PACTCVPR2026.git
Area: Multimodal VLM / Continual Learning
Keywords: Continual Learning, PAC-Bayes, adapter, Stability-Plasticity Trade-off, Phase Transition Constraints
TL;DR
Addressing the limitation where orthogonal constraints isolate task adapters and suppress cross-task knowledge sharing, the authors derive "Phase-Like Transition Constraints (PACT)" from PAC-Bayes theory for the post-convergence phase. This allows adapters to smoothly transition rather than hard-threshold between "frozen" (preserving history) and "melting" (adapting to new tasks) states, similar to the phase transitions of water. Implemented via a dual-branch ViT, Stable Adapter Initialization (SAI), and Prior Anchoring (PA), the method outperforms SOTA across multiple continual learning settings while using 36.96% fewer trainable parameters than standard adapter baselines.
Background & Motivation
Background: To enable Vision-Language Models (VLMs) to learn new tasks without catastrophic forgetting, Parameter-Efficient Fine-Tuning (PEFT) is commonly used—freezing most pre-trained weights while training small modules like adapters, LoRA, or prompts. A standard approach involves optimizing task-specific adapters to convergence, often employing (approximate) orthogonal constraints to ensure different task adapters occupy mutually orthogonal directions in a shared subspace to reduce interference.
Limitations of Prior Work: Research in deep learning and neuroscience indicates that forcing different tasks into mutually orthogonal subspaces suppresses cross-task knowledge transfer and sharing, creating "knowledge islands." In other words, while "orthogonal isolation" reduces interference, it also severs legitimate synergies between related tasks.
Key Challenge: The stability-plasticity dilemma. Plasticity terms encourage adapters to adapt to new tasks, while stability terms prevent deviation from accumulated knowledge; orthogonal constraints act as a "one-size-fits-all" hard isolation, leaving no elastic space for sharing when tasks are related or isolation when they are not.
Goal: To provide a theoretically grounded constraint mechanism that adaptively transitions between sharing and isolation based on task similarity, replacing rigid orthogonal constraints.
Key Insight: The authors frame continual learning within the PAC-Bayes framework by sequentially treating the posterior of the previous task as the prior for the current task (\(P_t:=Q_{t-1}\)). The resulting generalization bound naturally yields two complementary terms: a plasticity term rewarding task adaptation and a stability term penalizing deviation from accumulated knowledge. A crucial observation is that PAC-Bayes theory assumes bounded loss, a condition that only approximately holds after training has largely converged. Therefore, these constraints should manifest as a "post-convergence phenomenon."
Core Idea: Following standard convergence for each task, adapter updates are modulated by a "phase-like transition relationship." If the current adapter is distant from historical adapters, it remains free (melting); if it is close, constraints are smoothly increased (frozen). This enables a soft switch between states like the phase transition of water, rather than a hard threshold.
Method
Overall Architecture
PACT processes a task sequence \(T_1,\dots,T_T\) using a frozen CLIP ViT-B/16 backbone. Only newly inserted adapters are trained for each task (sparse insertion: one every 3 layers, i.e., layers 3/6/9/12). Specifically, PAC-Bayes is used to decompose the KL complexity term of the current task into a plasticity term \(\mathcal{P}\) (the KL between the current adapter's conditional posterior and its marginal posterior) and a stability term \(\mathcal{S}\) (the drift of the current adapter posterior relative to its initialization). Plasticity is implemented via a dual-branch ViT: a conditional branch aggregates frozen historical adapters with the current adapter, while a marginal branch uses only the current adapter; both branches minimize their respective cross-entropy losses to align the posteriors. Stability is achieved through two components: SAI (Stable Adapter Initialization) initializes new adapters to be functionally equivalent to the pre-trained MLPs, and PA (Prior Anchoring) uses a KL term to anchor the posterior to the MLP prior, limiting drift. The "phase transition" is driven by a phase-like transition weight: Gaussian probes measure the distance between the current adapter and recent historical adapters; a small distance drives the frozen weight \(\alpha_t\to1\), and a large distance drives \(\alpha_t\to0\). This weight is further modulated by training stability into \(\tilde\alpha_t\), which gates the marginal and PA losses so that they act only during the post-convergence stage. Total loss: \(\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{cond}}+\tilde\alpha_t(\mathcal{L}_{\text{marg}}+\mathcal{L}_{\text{pa}})\).
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["New Task Tt: Frozen CLIP ViT + Hist. Adapters<br/>Sparse Insert New Adapters (every 3 layers)"] --> B["Stable Adapter Initialization SAI<br/>Clone parallel MLP, Maintain Preds"]
B --> C["Dual-Branch ViT for Plasticity<br/>Cond. Branch (Agg. Hist.) ↔ Marg. Branch (Current Only)"]
C --> D["Phase-Like Transition Weight<br/>Gaussian Probe Distance → Frozen Weight αt"]
D --> E["Prior Anchoring PA<br/>KL Anchored to MLP Prior"]
E --> F["Total Loss<br/>Lcond + α̃t(Lmarg + Lpa)"]
Key Designs
1. PAC-Bayes \(\to\) Cond.-Marg. Decomposition for Continual Learning: Formulating Stability-Plasticity as KL Terms
Classical PAC-Bayes handles single tasks and does not model the sequential evolution of posteriors across tasks. The authors extend this to continual learning: for task \(t\), the prior is set as \(P_t:=Q_{t-1}\) (posterior of the previous task), which depends only on historical data \(D_{<t}\) and is independent of current data \(D_t\), systematizing "transfer from old to new tasks." Adapter parameters are split into frozen historical blocks \(\theta_a^{[t-1]}\) and the trainable current block \(\theta_a^t\). The theorem provides an upper bound decomposition of the KL term \(I_t\) for each task: \(I_t\le\mathbb{E}_{\theta_a^{[t-1]}\sim Q_t^a}[\underbrace{\mathcal{P}}_{\text{Plasticity}}+\underbrace{\kappa\mathcal{S}}_{\text{Stability}}]\), where the plasticity term \(\mathcal{P}=\mathrm{KL}(Q_t^a(\theta_a^t\mid\theta_a^{[t-1]})\Vert Q_t^a(\theta_a^t))\) measures the statistical dependence of the current adapter inference on historical adapters (lower is more plastic), and the stability term \(\mathcal{S}=\mathrm{KL}(Q_t^a(\theta_a^t)\Vert Q_{t-1}^a(\theta_a^t))\) measures the posterior drift relative to initialization (lower is more stable), with \(\kappa<\infty\) as an overlapping constant. This transforms the "stability-plasticity trade-off" from empirical intuition into two explicit optimizable objectives, serving as the theoretical foundation for the algorithm.
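For orientation, one textbook single-task PAC-Bayes bound (the McAllester/Maurer form, valid for losses in \([0,1]\)) is written below with the sequential prior substituted. This is an assumption made for illustration: the paper's exact bound and constants may differ, and the symbols \(n_t\) (number of samples in \(D_t\)), \(\delta\) (confidence level), \(L_t\) (expected loss), and \(\hat L_t\) (empirical loss) are introduced here and do not appear in the summary above. With probability at least \(1-\delta\) over the draw of \(D_t\), simultaneously for all posteriors \(Q_t\):

\[
\mathbb{E}_{\theta\sim Q_t}\big[L_t(\theta)\big]\;\le\;\mathbb{E}_{\theta\sim Q_t}\big[\hat L_t(\theta)\big]+\sqrt{\frac{\mathrm{KL}(Q_t\Vert P_t)+\ln\frac{2\sqrt{n_t}}{\delta}}{2n_t}},\qquad P_t:=Q_{t-1}.
\]

The bound requires \(P_t\) to be chosen independently of \(D_t\), which is why the prior may depend only on \(D_{<t}\). The KL complexity term is the quantity that the decomposition above then splits into the plasticity and stability contributions.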
2. Dual-Branch ViT for Plasticity: Using Conditional/Marginal Contrast as a Trainable Surrogate for \(\mathcal{P}\)
The plasticity term is the KL between the conditional posterior \(Q_t^a(\theta_a^t\mid\theta_a^{[t-1]})\) and the marginal posterior \(Q_t^a(\theta_a^t)\); computing this divergence between distributions directly is infeasible, so the authors train a surrogate. They construct two parallel branches sharing the frozen backbone \(\theta_f\) and current adapter \(\theta_a^t\), differing only in whether they aggregate historical adapter outputs: the conditional branch mean-pools the outputs of all adapters, historical and current, \(h_{\text{cond}}=\frac{1}{t}\sum_{i=1}^t h_a^{(i)}+0.1\epsilon\); the marginal branch uses only the current adapter, \(h_{\text{marg}}=h_a^{(t)}+0.1\epsilon\) (where \(\epsilon\sim\mathcal{N}(0,I)\) provides PAC-Bayes stochasticity). The two branches feed into independent heads that produce \(\mathcal{L}_{\text{cond}}\) and \(\mathcal{L}_{\text{marg}}\). Since both branches use the same data and share \(\theta_f\) and \(\theta_a^t\), minimizing \(\mathcal{L}_{\text{cond}}+\mathcal{L}_{\text{marg}}\) serves as a tractable surrogate for aligning the conditional and marginal posteriors. This reduces \(\mathcal{P}\), encourages the current adapter to focus on new knowledge, and weakens its dependence on specific historical blocks.
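The two branch features can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas for \(h_{\text{cond}}\) and \(h_{\text{marg}}\), not the authors' code: the function name `branch_features`, the array shapes, and the choice to draw independent noise for each branch are assumptions.

```python
import numpy as np

def branch_features(adapter_outputs, rng, noise_scale=0.1):
    """Return (h_cond, h_marg) for one layer.

    adapter_outputs: list of equally shaped arrays ordered by task;
    the last entry is the current adapter's output h_a^(t).
    """
    stacked = np.stack(adapter_outputs, axis=0)   # (t, batch, dim)
    h_current = stacked[-1]
    # Assumption: each branch draws its own Gaussian noise sample.
    eps_cond = rng.standard_normal(h_current.shape)
    eps_marg = rng.standard_normal(h_current.shape)
    # Conditional branch: mean over all t adapter outputs, plus noise.
    h_cond = stacked.mean(axis=0) + noise_scale * eps_cond
    # Marginal branch: current adapter output only, plus noise.
    h_marg = h_current + noise_scale * eps_marg
    return h_cond, h_marg

rng = np.random.default_rng(0)
outputs = [rng.standard_normal((4, 8)) for _ in range(3)]  # t = 3 adapters
h_cond, h_marg = branch_features(outputs, rng)
print(h_cond.shape, h_marg.shape)
```

For the first task (\(t=1\)) the mean runs over a single adapter, so the two branches differ only by their noise samples.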
3. SAI + PA for Stability: Zero-Drift Starting Point and Drift Limitation
The stability term \(\mathcal{S}\) measures the current adapter's posterior drift relative to its initialization; excessive drift distorts representations shared by historical tasks. The authors employ a two-step approach. SAI (Stable Adapter Initialization): The new adapter is placed in parallel with frozen MLP blocks, sharing the same structure and initially cloning MLP parameters. A gate \(g_t=\sigma(\gamma_t)\) is used for convex combination \(h_{\text{out}}=(1-g_t)h_{\text{base}}+g_t h_{\text{adapter}}\), inserted sparsely (every 3 layers). Consequently, the model's prediction distribution remains nearly unchanged at the moment of insertion, minimizing \(\mathcal{S}=\mathrm{KL}(Q_t^a\Vert Q_{t-1}^a)\) (approaching zero with exact matching), providing a well-behaved starting point for optimization. PA (Prior Anchoring): Since SAI only ensures a good starting point but does not constrain subsequent drift, a stability loss \(\mathcal{L}_{\text{pa}}=\mathrm{KL}(Q_t^a(\theta_a^t)\Vert Q_{\text{mlp}}^a(\theta_a^t))\) is added to anchor the current adapter posterior to the fixed MLP prior (\(Q_{\text{mlp}}^a\equiv Q_{t-1}^a\)), directly penalizing drift. Given the strong zero-shot generalization of pre-trained MLPs, keeping the adapter close ensures stability and may even enhance zero-shot performance on unseen tasks.
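The following sketch illustrates why cloning the MLP makes insertion prediction-preserving, and what an anchoring penalty could look like. It rests on several assumptions that are not from the paper: the toy two-layer `mlp` with a sigmoid-based GELU approximation stands in for the real block, and `pa_penalty` assumes isotropic Gaussian posteriors with a shared variance \(\sigma^2\), for which the KL reduces to a scaled squared distance between means. How the paper actually evaluates \(\mathcal{L}_{\text{pa}}\) is not stated in this summary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(x, params):
    """Toy two-layer MLP (stand-in for a ViT MLP block)."""
    w1, b1, w2, b2 = params
    hidden = x @ w1 + b1
    hidden = hidden * sigmoid(1.702 * hidden)   # sigmoid-based GELU approximation
    return hidden @ w2 + b2

def gated_block(x, base_params, adapter_params, gamma):
    """Convex combination h_out = (1 - g) * h_base + g * h_adapter, g = sigmoid(gamma)."""
    g = sigmoid(gamma)
    return (1.0 - g) * mlp(x, base_params) + g * mlp(x, adapter_params)

def pa_penalty(adapter_params, base_params, sigma=1.0):
    """KL(N(mu, s^2 I) || N(mu0, s^2 I)) = ||mu - mu0||^2 / (2 s^2) (assumed posterior form)."""
    sq = sum(np.sum((a - b) ** 2) for a, b in zip(adapter_params, base_params))
    return float(sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
d, hdim = 8, 32
base = (rng.standard_normal((d, hdim)), np.zeros(hdim),
        rng.standard_normal((hdim, d)), np.zeros(d))
adapter = tuple(p.copy() for p in base)          # SAI: clone the frozen MLP
x = rng.standard_normal((4, d))

unchanged = np.allclose(gated_block(x, base, adapter, gamma=0.3), mlp(x, base))
print(unchanged, pa_penalty(adapter, base))
```

Because both arms of the convex combination are identical at insertion, the output is unchanged for any gate value; the penalty grows only once training moves the adapter away from the cloned weights.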
4. Phase-Like Transition Weight: Smooth Frozen/Melting Gating via Adapter Distance
This is the core of the "phase transition." The authors construct a fixed Gaussian probe library \(Z=\{z_m\}_{m=1}^M\) (\(z_m\sim\mathcal{N}(0,I_d)\), shared across layers and steps). Probes are fed to the current adapter and each historical adapter to obtain response matrices \(H^{(t)},H^{(j)}\). The difference is quantified using a scale-normalized Frobenius distance \(d_j=\frac{\lVert H^{(t)}-H^{(j)}\rVert_F^2}{\varepsilon+\lVert H^{(t)}\rVert_F^2+\lVert H^{(j)}\rVert_F^2}\). Taking the nearest historical adapter \(j^\star=\arg\min_j d_j\), an exponential kernel converts this into a frozen weight \(\alpha_t=\exp(-d_{j^\star}/\tau)\): when the current adapter behaves similarly to a historical one, \(\alpha_t\approx1\) (frozen); when it deviates significantly, \(\alpha_t\approx0\) (melted). This \(\alpha_t\) is further modulated by a training stability metric into \(\tilde\alpha_t\in[0,1]\), ensuring regularization only takes effect during the relatively stable post-convergence stage. This is because the theory behind \(\mathcal{L}_{\text{marg}}\) and \(\mathcal{L}_{\text{pa}}\) relies on the "bounded loss" assumption, which is only approximately met after the main conditional objective has converged. Thus, in the total loss \(\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{cond}}+\tilde\alpha_t(\mathcal{L}_{\text{marg}}+\mathcal{L}_{\text{pa}})\), \(\tilde\alpha_t\approx0\) during the unstable phase focuses on adaptation, while the marginal and PA losses are activated after stabilization—providing a "dual-mode, phase-like, smooth transition."
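The distance and exponential kernel described above can be sketched as follows. The adapters are modeled as plain callables and the demo uses linear maps; `linear_adapter`, the probe count, and the value of `tau` are illustrative assumptions, and the stability modulation that produces \(\tilde\alpha_t\) is not included.

```python
import numpy as np

def probe_distance(h_cur, h_hist, eps=1e-8):
    """Scale-normalized squared Frobenius distance between two response matrices."""
    num = np.sum((h_cur - h_hist) ** 2)
    den = eps + np.sum(h_cur ** 2) + np.sum(h_hist ** 2)
    return float(num / den)

def frozen_weight(current_adapter, history_adapters, probes, tau=0.1):
    """Return (alpha_t, j_star): kernel weight and index of the nearest historical adapter."""
    h_cur = current_adapter(probes)
    dists = [probe_distance(h_cur, adapter(probes)) for adapter in history_adapters]
    j_star = int(np.argmin(dists))
    return float(np.exp(-dists[j_star] / tau)), j_star

def linear_adapter(weight):
    """Hypothetical stand-in adapter: a single linear map."""
    return lambda z: z @ weight

rng = np.random.default_rng(0)
probes = rng.standard_normal((64, 16))           # fixed probe library Z, z_m ~ N(0, I_d)
w_cur = rng.standard_normal((16, 16))
current = linear_adapter(w_cur)
history = [linear_adapter(-w_cur),                                    # very different behavior
           linear_adapter(w_cur + 0.01 * rng.standard_normal((16, 16)))]  # nearly identical

alpha, j_star = frozen_weight(current, history, probes)
print(j_star, alpha)
```

Since \(\lVert A-B\rVert_F^2\le 2(\lVert A\rVert_F^2+\lVert B\rVert_F^2)\), the normalized distance lies in \([0,2]\), so \(\tau\) directly sets how quickly the weight decays from 1 toward \(\exp(-2/\tau)\).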
Loss & Training
The total objective is \(\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{cond}}+\tilde\alpha_t(\mathcal{L}_{\text{marg}}+\mathcal{L}_{\text{pa}})\). The conditional cross-entropy dominates throughout, while the marginal cross-entropy and PA (KL anchoring loss) are smoothly activated in the post-convergence phase by the stability-aware weight \(\tilde\alpha_t\). The implementation uses CLIP ViT-B/16, with adapters replicating the MLP block structure and inserted only at layers 3/6/9/12. Optimization uses AdamW, with MTIL trained for 3k steps and CIL for 1k steps. Sparse insertion reduces trainable parameters by 36.96% compared to standard adapter baselines.
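A rough sketch of how the gated objective could be assembled is given below. The combination rule in `total_loss` follows the formula above, but `stability_factor` is entirely hypothetical: the summary does not specify how training stability is measured, so a simple relative-spread heuristic over recent \(\mathcal{L}_{\text{cond}}\) values is assumed here purely for illustration.

```python
import numpy as np

def stability_factor(recent_cond_losses, sharpness=10.0):
    """HYPOTHETICAL stability score in [0, 1]; near 1 when recent L_cond values are flat."""
    losses = np.asarray(recent_cond_losses, dtype=float)
    if losses.size < 2:
        return 0.0                                # not enough history: keep constraints off
    rel_spread = losses.std() / (abs(losses.mean()) + 1e-8)
    return float(np.exp(-sharpness * rel_spread))

def total_loss(l_cond, l_marg, l_pa, alpha, recent_cond_losses):
    """L_total = L_cond + alpha_tilde * (L_marg + L_pa), with alpha_tilde = alpha * stability."""
    alpha_tilde = alpha * stability_factor(recent_cond_losses)
    return l_cond + alpha_tilde * (l_marg + l_pa), alpha_tilde

# Early training: L_cond is still dropping fast, so the extra terms are almost switched off.
early, a_early = total_loss(0.25, 0.8, 0.4, alpha=0.9,
                            recent_cond_losses=[2.0, 1.0, 0.5, 0.25])
# Post-convergence: L_cond is flat, so the marginal and PA terms contribute fully.
late, a_late = total_loss(0.5, 0.8, 0.4, alpha=0.9,
                          recent_cond_losses=[0.5, 0.5, 0.5, 0.5])
print(a_early, a_late)
```

Multiplying the distance-based weight by a stability score keeps \(\tilde\alpha_t\) in \([0,1]\) and makes both conditions necessary: the adapter must be close to a historical one and training must have settled before the constraints bite.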
Key Experimental Results
Main Results
Two settings are evaluated: Multi-domain Task Incremental Learning (MTIL, 11-dataset benchmark) and Class Incremental Learning (CIL, CIFAR-100 / TinyImageNet-100). MTIL metrics: Transfer (TF, average accuracy on unseen tasks after learning task \(i\), i.e., zero-shot generalization), Last (LS, retention on seen tasks), and Avg (accuracy averaged over all tasks across all incremental steps; as the table values show, it is not the simple mean of TF and LS). The table below shows the overall means for MTIL Order-I.
| Method | Transfer | Average | Last |
|---|---|---|---|
| Zero-shot CLIP | 69.4 | 65.3 | 65.3 |
| MoE-Adapters | 67.4 | 77.2 | 87.4 |
| ConDU | 70.3 | 78.3 | 86.2 |
| AFA | 70.3 | 78.5 | 87.2 |
| TRGE | 69.8 | 78.5 | 87.6 |
| PACT (Ours) | 71.76 (+1.46) | 81.05 (+2.55) | 90.54 (+2.94) |
PACT leads on all three metrics, exceeding the best competing result in each column by +1.46 (Transfer), +2.55 (Average), and +2.94 (Last). Notably, zero-shot generalization (Transfer) and task retention (Last) improve simultaneously, indicating a better stability-plasticity balance: the model adapts to new data while forgetting less.
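The reported margins can be re-derived from the table with a short check. The numbers are copied from the MTIL Order-I table above; the helper name `margins` is illustrative.

```python
# Baseline rows from the MTIL Order-I table: (Transfer, Average, Last).
baselines = {
    "Zero-shot CLIP": (69.4, 65.3, 65.3),
    "MoE-Adapters": (67.4, 77.2, 87.4),
    "ConDU": (70.3, 78.3, 86.2),
    "AFA": (70.3, 78.5, 87.2),
    "TRGE": (69.8, 78.5, 87.6),
}
pact = (71.76, 81.05, 90.54)
metrics = ("Transfer", "Average", "Last")

def margins(baseline_rows, ours):
    """Gap between our score and the best baseline score, per metric."""
    result = {}
    for k, name in enumerate(metrics):
        best = max(row[k] for row in baseline_rows.values())
        result[name] = round(ours[k] - best, 2)
    return result

print(margins(baselines, pact))
# → {'Transfer': 1.46, 'Average': 2.55, 'Last': 2.94}
```

The strongest baseline differs by column (ConDU/AFA for Transfer, AFA/TRGE for Average, TRGE for Last), so the margins are measured against a per-metric best rather than a single runner-up method.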
Ablation Study
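On Few-Shot MTIL (FS-MTIL), PACT achieves the best Transfer and Last under the 5-shot protocol (though not the best Avg in the figures listed below) and is reported as ranking first on all three metrics under the 16-shot protocol, supporting the robustness of its PAC-Bayes formulation. Values in parentheses give the change relative to zero-shot CLIP.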
| Configuration | Key Metrics | Description |
|---|---|---|
| Zero-Shot | TF 69.4 / Avg 65.3 / Last 65.3 | CLIP Baseline |
| ConDU (5-shot) | TF 70.3 / Avg 72.7 / Last 77.4 | Runner-up |
| AFA (5-shot) | TF 70.2 / Avg 74.1 / Last 79.4 | Runner-up |
| PACT (5-shot) | TF 70.5 / Avg 72.0 / Last 80.7 (+15.4) | Ours |
| IAP (16-shot) | TF 70.9 / Avg 72.5 / Last 77.7 | Comparison |
| PACT (16-shot) | Ranked 1st in TF/Avg/Last | Ours |
| Sparse Insertion | Trainable params −36.96% | vs. standard adapter baseline |
Key Findings
- Simultaneous Gain in Stability and Plasticity: PACT surpasses baselines in both zero-shot generalization (Transfer) and retention (Last), whereas many methods trade one for the other. This validates that the "phase-like transition + complementary cond./marg. regularization" enables sharing when tasks are related and interference suppression when not.
- Higher Performance with Fewer Parameters: Sparse insertion (one adapter every 3 layers) reduces trainable parameters by 36.96% compared to standard baselines while achieving higher accuracy, suggesting more efficient utilization of adapter capacity.
- Critical Importance of Post-Convergence Activation: \(\tilde\alpha_t\) delays regularization until training stabilizes, echoing the "bounded loss" premise of PAC-Bayes. Early training focuses on adaptation via conditional loss, while subsequent constraints prevent drift.
- Training Dynamics Visualization: Introducing PACT after CE convergence causes loss to temporarily rise before re-converging to a new equilibrium. Geometrically, adapters move smoothly between "Free" and "PACT" states (denoted as ↭ in the paper), visualizing the "phase-like" soft transition.
Highlights & Insights
- "Adapters phase-shifting like water" is a resonant analogy with theoretical support: It turns the abstract stability-plasticity trade-off into a soft transition between frozen and melting states, backed by the dual-mode optimization behavior derived from PAC-Bayes.
- The insight that "constraints should be applied post-convergence" is highly reusable: While most regularization methods are active throughout training, PACT identifies that PAC-Bayes generalization bounds depend on bounded loss assumptions met post-convergence. This "stage-aware regularization scheduling" can be migrated to other bound-based methods.
- Gaussian probe response distance is a lightweight, parameter-free similarity metric: Using a fixed probe library shared across layers to compare mappings provides a versatile trick for any scenario needing task or module similarity estimation.
- SAI ensures adapters start as equivalents to pre-trained MLPs: By reducing the initial stability term to near zero, it provides a well-posed starting point while preserving CLIP’s zero-shot capabilities—an elegant paradigm for "painless module insertion."
Limitations & Future Work
- The method relies on "post-convergence" detection and \(\tilde\alpha_t\) modulation. The detailed calculations for both are in the appendix and not fully expanded in the main text, so sensitivity to the stability thresholds and scheduling is hard to assess.
- The use of the nearest historical adapter \(j^\star\) for distance calculation might overlook complex relationships with multiple historical tasks as the task sequence grows.
- Experiments focus specifically on CLIP ViT-B/16 + adapters. Whether the PAC-Bayes derivation fits other PEFT forms like LoRA or prompts, or whether it relies on the adapter's replication of the MLP structure, remains to be verified.
- While PACT is expected in theory to outperform orthogonal constraints on highly related tasks, the paper lacks a fine-grained analysis of "Task Similarity vs. Gain."
Related Work & Insights
- vs. Orthogonal PEFT (e.g., Orthogonal-LoRA series): These methods use (approximate) orthogonality to hard-isolate task adapters into non-interfering subspaces. This paper argues this suppresses sharing and creates knowledge islands. PACT uses phase-like soft gating to allow sharing when related and constraint when not.
- vs. Classical Regularization (EWC/Parameter Importance): These penalize changes in important parameters to prevent forgetting. PACT derives stability-plasticity from sequential PAC-Bayes priors, offering a more systematic theoretical framework and limiting constraints to the post-convergence phase.
- vs. Other PEFT-CL (MoE-Adapters / DIKI / IAP / AFA / TRGE, etc.): Many of these focus on converging and then fixing individual adapters. PACT’s key differentiator is "reshaping after convergence"—using phase-like constraints to induce structural coupling across tasks, leading to consistent gains in Transfer/Avg/Last with fewer parameters.