PACE: Pretrained Audio Continual Learning

  • Conference: ICLR 2026
  • arXiv: 2602.03355
  • Code: Available (to be released with the paper)
  • Area: Audio & Speech
  • Keywords: Audio continual learning, pretrained models, parameter-efficient fine-tuning, analytic classifier, catastrophic forgetting

TL;DR

This paper presents the first systematic benchmark for audio continual learning (CL), identifies a fundamental upstream–downstream mismatch in pretrained audio models caused by the dominance of low-level spectral features, and proposes PACE—comprising improved first-session adaptation, adaptive subspace-orthogonal PEFT, and boundary-aware perturbation regularization—achieving substantial improvements over SOTA across 6 audio CL benchmarks.

Background & Motivation

Pretrained audio models excel on static tasks but suffer catastrophic forgetting under continuously evolving data distributions. Directly transferring visual-domain CL methods to the audio domain faces fundamental obstacles:

Severe upstream–downstream mismatch: Audio backbones (e.g., EAT) are pretrained via spectrogram reconstruction, emphasizing low-level time-frequency patterns rather than structured semantics, whereas downstream CL requires high-level discriminative representations.

More severe representation drift: The representation shift between consecutive sessions in the audio domain far exceeds that in the visual domain (quantified via t-SNE and CKA analysis), leading to more pronounced forgetting.

Failure of PEFT-based methods: Methods such as L2P and DualPrompt degrade approximately 3× more on audio than on vision.

Three key findings motivate the method design:

| Finding | Content | Implication |
|---|---|---|
| Finding 1 | Statistical methods (FSA + analytic classifier) outperform PEFT methods | Establishes the technical direction |
| Finding 2 | Coarse-grained representation saturation: the first session already captures most of the information | FSA requires improvement |
| Finding 3 | Larger fine-grained gap: the first session alone cannot bridge the semantic gap | Multi-session adaptation is needed |

Method

Overall Architecture

PACE is a staged framework consisting of three stages:

  • Stage 1 (FSA): Freeze the output head; adapt deep backbone layers with LoRA; replace the head with an analytic classifier.
  • Stage 2 (MSA): For sessions \(t \in (1, T_3]\), introduce subspace-orthogonal PEFT to progressively align representations.
  • Stage 3 (Frozen): For \(t > T_3\), freeze the backbone and update only the analytic classifier.

Key Designs

1. Improved First-Session Adaptation (Improved FSA)

Constrained head learning:

  • Conventional FSA jointly trains the head and backbone, which leads to head overfitting and insufficient backbone adaptation.
  • PACE instead uses asymmetric optimization with \(\eta_{head} \ll \eta_{bb}\).
  • Training is staged: first freeze the backbone and train the head for \(E_{head}\) epochs, then freeze the head and fine-tune the backbone for \(E_0\) epochs (see the sketch after this list).
  • This is the opposite of the LAE/SLCA strategy in visual CL: audio backbones need their adaptation encouraged rather than suppressed.
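A minimal PyTorch sketch of this two-phase schedule is given below, using a toy MLP in place of the EAT backbone and illustrative values for \(E_{head}\), \(E_0\), \(\eta_{head}\), \(\eta_{bb}\); in the actual method, phase 2 would update only the LoRA parameters inserted into the deep layers.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the EAT backbone and classification head (dims are illustrative).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 50)                                   # e.g. 50 classes for ESC-50
x, y = torch.randn(32, 128), torch.randint(0, 50, (32,))    # stand-in session-1 batch
ce = nn.CrossEntropyLoss()

E_head, E_0 = 3, 10            # illustrative epoch counts
eta_head, eta_bb = 1e-4, 1e-3  # asymmetric learning rates: eta_head << eta_bb

# Phase 1: freeze the backbone, briefly fit the head with the small learning rate.
for p in backbone.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(head.parameters(), lr=eta_head)
for _ in range(E_head):
    opt.zero_grad()
    ce(head(backbone(x)), y).backward()
    opt.step()

# Phase 2: freeze the head, adapt the backbone (in PACE, only deep-layer LoRA params).
for p in head.parameters():
    p.requires_grad_(False)
for p in backbone.parameters():
    p.requires_grad_(True)
opt = torch.optim.AdamW(backbone.parameters(), lr=eta_bb)
for _ in range(E_0):
    opt.zero_grad()
    ce(head(backbone(x)), y).backward()
    opt.step()
```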

Later-layer LoRA:

  • CKA analysis shows that shallow layers encode domain-general time-frequency patterns while deep layers encode task-specific semantics.
  • The first \(L_{tune}-1\) layers are frozen, and LoRA is applied only to layers \(l \geq L_{tune}\): \(W_1^l = W_0^l + A_1^l B_1^l, \quad L_{tune} \leq l \leq L\)
  • The boundary layer \(L_{tune}\) is determined automatically via a CKA deviation threshold \(\rho_{layer}\) (one possible selection rule is sketched below).
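A sketch of how the boundary layer could be chosen with standard linear CKA is shown below; the comparison pair (pretrained vs. adapted features at each layer) and the helper name `select_boundary_layer` are assumptions, not code from the paper.

```python
import torch

def linear_cka(X, Y):
    """Standard linear CKA between two feature matrices of shape (n, d)."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).pow(2).sum()                      # ||X^T Y||_F^2
    return hsic / (torch.linalg.norm(X.T @ X) * torch.linalg.norm(Y.T @ Y))

def select_boundary_layer(feats_pre, feats_adapted, rho_layer=0.1):
    """Return the first layer whose CKA deviation from the pretrained
    representation exceeds rho_layer (an assumed reading of the rule)."""
    for l, (fp, fa) in enumerate(zip(feats_pre, feats_adapted), start=1):
        if 1.0 - linear_cka(fp, fa) > rho_layer:
            return l
    return len(feats_pre)                              # fall back: tune only the last layer

# Illustrative usage with random per-layer features for a 12-layer ViT.
feats_pre = [torch.randn(64, 768) for _ in range(12)]
feats_adapted = [torch.randn(64, 768) for _ in range(12)]
L_tune = select_boundary_layer(feats_pre, feats_adapted)
```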

Analytic classifier (replacing the trainable head):

  • A random projection \(W_{proj}\) enhances feature discriminability.
  • The inverse of the autocorrelation matrix is updated recursively via the Woodbury identity: \(R_t = R_{t-1} - R_{t-1}\hat{Z}_t^\top(I + \hat{Z}_t R_{t-1} \hat{Z}_t^\top)^{-1}\hat{Z}_t R_{t-1}\)
  • Classifier weights are updated in closed form, with no sample storage and no destructive updates (a minimal sketch follows).
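A minimal sketch of such a recursive analytic head in the style of RanPAC/ACIL ridge regression is given below; the projection dimension, the ReLU after the random projection, and the ridge term `gamma` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class AnalyticClassifier:
    """Recursive ridge-regression head in a random-projection space (RanPAC/ACIL style)."""
    def __init__(self, feat_dim, proj_dim, num_classes, gamma=1.0):
        self.W_proj = torch.randn(feat_dim, proj_dim)      # frozen random projection
        self.R = torch.eye(proj_dim) / gamma               # inverse (regularized) autocorrelation
        self.W = torch.zeros(proj_dim, num_classes)        # closed-form classifier weights
        self.num_classes = num_classes

    def project(self, feats):
        return torch.relu(feats @ self.W_proj)

    def update(self, feats, labels):
        Z = self.project(feats)                            # (n, proj_dim)
        Y = F.one_hot(labels, self.num_classes).float()    # (n, num_classes)
        # Woodbury identity: R_t = R_{t-1} - R_{t-1} Z^T (I + Z R_{t-1} Z^T)^{-1} Z R_{t-1}
        K = torch.linalg.inv(torch.eye(Z.shape[0]) + Z @ self.R @ Z.T)
        self.R = self.R - self.R @ Z.T @ K @ Z @ self.R
        # Closed-form recursive least-squares weight update: no samples stored, no gradients.
        self.W = self.W + self.R @ Z.T @ (Y - Z @ self.W)

    def predict(self, feats):
        return self.project(feats) @ self.W

# Illustrative usage: one closed-form update per incoming batch/session.
clf = AnalyticClassifier(feat_dim=768, proj_dim=2048, num_classes=50)
clf.update(torch.randn(32, 768), torch.randint(0, 50, (32,)))
logits = clf.predict(torch.randn(8, 768))
```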

2. Adaptive Multi-Session Subspace-Orthogonal PEFT

Multi-session adaptation (MSA): each session introduces an independent LoRA module while parameters from previous sessions stay frozen (see the sketch below): \(W_t = W_0 + \sum_{\tau=0}^{t-1} B_\tau A_\tau + B_t A_t\)
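A compact sketch of accumulating per-session LoRA pairs on a frozen base weight, with only the newest pair trainable; the class name, rank, and initialization scheme are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SessionLoRALinear(nn.Module):
    """Linear layer realizing W_t = W_0 + sum_{tau<t} B_tau A_tau + B_t A_t,
    where all but the newest (A_t, B_t) pair are frozen."""
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)              # frozen pretrained weight W_0
        self.A = nn.ParameterList()
        self.B = nn.ParameterList()
        self.rank = rank

    def new_session(self):
        for a, b in zip(self.A, self.B):                     # freeze all previous sessions
            a.requires_grad_(False)
            b.requires_grad_(False)
        in_dim, out_dim = self.base.in_features, self.base.out_features
        self.A.append(nn.Parameter(torch.randn(self.rank, in_dim) * 0.01))
        self.B.append(nn.Parameter(torch.zeros(out_dim, self.rank)))

    def forward(self, x):
        w = self.base.weight
        for a, b in zip(self.A, self.B):                     # frozen + current LoRA updates
            w = w + b @ a
        return F.linear(x, w)

# Illustrative usage: call new_session() at the start of every session.
layer = SessionLoRALinear(768, 768)
layer.new_session()
out = layer(torch.randn(4, 768))
```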

Gradient projection constraint, which ensures that updates do not disrupt old-task representations: \(g_{update} = P_{\mathcal{U}_t} \nabla_\theta \mathcal{L}_{ce}(g_t(f_t(\mathcal{X}_t)), \mathcal{Y}_t)\)

Efficient null-space computation (via LoRA subtraction):

  • Construct an "unlearned model": \(W_t^{unlearn} = W_0 - \sum_{\tau=0}^{t-1} A_\tau B_\tau\)
  • Compute the uncentered covariance matrix \(X_t^{ucov}\) of the current-session features under this model.
  • An SVD of this covariance identifies the projection subspace, retaining principal components with energy ratio \(> \rho_{svd}\).
  • No historical features need to be stored, which significantly reduces memory overhead (a sketch follows this list).
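One plausible realization, following the standard null-space projection idea: take the SVD of the uncentered feature covariance obtained from the unlearned model on current data, keep the high-energy principal directions, and project gradients onto their orthogonal complement. The thresholding convention and the right-multiplication of the gradient are assumptions.

```python
import torch

def nullspace_projector(feats_unlearn, rho_svd=0.95):
    """Projector onto the orthogonal complement of the dominant feature subspace.
    feats_unlearn: (n, d) current-session features from the 'unlearned' model W_t^unlearn.
    """
    X = feats_unlearn
    cov = X.T @ X / X.shape[0]                 # uncentered covariance X_t^ucov
    U, S, _ = torch.linalg.svd(cov)
    energy = torch.cumsum(S, dim=0) / S.sum()
    k = int((energy <= rho_svd).sum()) + 1     # smallest k whose cumulative energy exceeds rho_svd
    U_k = U[:, :k]                             # principal directions approximating the old-task span
    return torch.eye(X.shape[1]) - U_k @ U_k.T # project away from that span

# Illustrative usage: project a weight gradient before the optimizer step.
d = 256
P = nullspace_projector(torch.randn(512, d))
grad = torch.randn(d, d)                       # gradient of a d x d weight matrix
grad_proj = grad @ P                           # constrain the input-side update directions
```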

Adaptive freezing: when \(\sum_{\tau=0}^{T_3} N_\tau > N_{stop}\), the framework transitions to Stage 3 and freezes the backbone.

3. Boundary-Aware Regularization

Addresses decision boundary confusion caused by entanglement between old and new class representations.

Boundary sample detection:

  • For each input, \(N_p\) time-frequency masked perturbations are generated: \(\tilde{x}_{i,t}^k = \mathcal{Q}(x_{i,t}, r_T, r_F)\)
  • If the misclassification rate of a temporary model \(\theta_{temp}\) on these perturbations exceeds the threshold \(\rho_p\), the sample is added to the boundary set \(\mathcal{B}_t\).

Regularization loss: \(\mathcal{L}_{reg}(i) = \max\left(0,\ \delta + \frac{1}{|\mathcal{S}_i|}\sum_{u \in \mathcal{S}_i}\|f_t(u) - \mu(x_c)\|_2^2 - \min_{b \in \mathcal{B}_t}\|f_t(x_{i,t}) - b\|_2^2\right)\)

Effect: pulls representations toward class centers while pushing them away from boundary points, enlarging inter-class margins.
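A direct transcription of the hinge-style regularizer into code; tensor shapes, the margin value, and the helper name are illustrative.

```python
import torch

def boundary_reg_loss(feat_views, class_center, feat_anchor, boundary_feats, delta=1.0):
    """L_reg(i) = max(0, delta + mean_{u in S_i} ||f(u) - mu(x_c)||^2
                            - min_{b in B_t} ||f(x_i) - b||^2)
    feat_views:     (k, d) embeddings of the perturbed views S_i
    class_center:   (d,)   class mean mu(x_c)
    feat_anchor:    (d,)   embedding of the clean example x_{i,t}
    boundary_feats: (m, d) embeddings of the boundary set B_t
    """
    pull = (feat_views - class_center).pow(2).sum(dim=1).mean()    # stay close to the class center
    push = (boundary_feats - feat_anchor).pow(2).sum(dim=1).min()  # stay far from the nearest boundary point
    return torch.clamp(delta + pull - push, min=0.0)

# Illustrative usage with random embeddings.
loss = boundary_reg_loss(torch.randn(4, 256), torch.randn(256),
                         torch.randn(256), torch.randn(10, 256))
```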

Loss & Training

  • FSA stage: cross-entropy \(\mathcal{L}_{ce}\) + boundary regularization \(\mathcal{L}_{reg}\)
  • MSA stage: cross-entropy + regularization + subspace-orthogonal gradient projection
  • Stage 3: analytic classifier updated only (closed-form solution, no gradient-based training)
  • Pretrained backbone: EAT (12-layer ViT, self-supervised pretraining on AudioSet-2M, ~5,000 hours)
  • Data augmentation: SpecAugment-style time-frequency masking

Key Experimental Results

Main Results

Table 1: Average Top-1 Accuracy (%) on 6 Audio CL Benchmarks

| Method | ESC-50 | US8K | SC2 | TIMIT-2 | TIMIT-3 | VocalSet |
|---|---|---|---|---|---|---|
| Joint Training (upper bound) | 96.50 | 98.07 | 95.91 | 95.22 | 95.22 | 76.65 |
| L2P | 39.50 | 38.75 | 14.70 | 1.50 | 2.53 | 20.39 |
| RanPAC (w/ FSA) | 92.25 | 97.08 | 90.53 | 85.63 | 89.92 | 62.82 |
| HiDe-Prompt | 83.75 | 79.89 | 40.10 | 47.78 | 49.60 | 48.36 |
| PACE | 95.75 | 97.49 | 91.87 | 90.95 | 94.05 | 69.08 |

Gap to joint training upper bound: only 0.75% on ESC-50, 0.58% on US8K, and 1.17% on TIMIT-3.

Table 2: Ablation—Improved FSA Components (Coarse-Grained)

| Strategy | ESC-50 | US8K | SC2 |
|---|---|---|---|
| w/o FSA | 92.50 | 96.49 | 81.22 |
| Naive FSA | 92.25 | 97.08 | 90.53 |
| + Low LR | 93.75 | 97.35 | 90.95 |
| + Later Layer LoRA | 95.75 | 97.49 | 91.87 |

Ablation Study

PACE maintains its advantage on the SSLAM backbone, confirming backbone-agnostic generalizability.

Contribution of MSA on fine-grained benchmarks:

  • FSA only → +MSA: +3.2% (TIMIT-2)
  • + Subspace orthogonality: +1.5%
  • + Boundary-aware regularization: +0.6%

Key Findings

  1. Fundamental difference between audio and visual CL: Audio backbones' emphasis on low-level spectral features leads to representation drift 3× greater than in vision.
  2. Counter-intuitive FSA finding: Audio CL requires encouraging backbone adaptation (opposite to visual CL); freezing shallow layers while tuning deep layers is critical.
  3. Stability of analytic classifiers: Prevents cumulative bias and propagation of representation drift.
  4. Novel use of LoRA subtraction: Approximates the representation subspace of old tasks without storing historical features.

Highlights & Insights

  • First systematic audio CL benchmark: 6 benchmarks covering coarse/fine-grained, speech/music/environmental sound tasks.
  • "Adaptation over freezing": Stands in sharp contrast to the visual-domain finding that "freezing the backbone suffices," revealing unique properties of pretrained audio models.
  • Three-stage progressive framework: FSA → MSA → frozen naturally balances plasticity and stability.
  • LoRA subtraction to construct an "unlearned model": Elegantly and efficiently approximates the historical representation subspace via parameter arithmetic.

Limitations & Future Work

  1. Approximation assumption in LoRA subtraction: subtracting LoRA \(\neq\) exact unlearning; bias may be non-trivial under high rank or strong adaptation regimes.
  2. Boundary detection relies on the quality of the temporary model \(\theta_{temp}\).
  3. The adaptive freezing threshold \(N_{stop}\) requires manual specification; the optimal value may vary across scenarios.
  4. Validation is limited to the class-incremental setting; task-aware, domain-incremental, and other CL protocols are not explored.
  5. The gap on VocalSet remains 7.57%, indicating the most severe mismatch on fine-grained music tasks.

Notes

  • The analytic classifier from RanPAC serves as a cornerstone of the technical approach.
  • LoRA subtraction's parameter arithmetic is innovatively repurposed for null-space projection.
  • The mismatch between EAT's spectrogram-reconstruction pretraining objective and downstream classification is the root cause of the core challenge.
  • Insight: the degree of alignment between pretraining objectives and downstream tasks determines the difficulty of continual learning.

Rating

  • Novelty: ⭐⭐⭐⭐ — First audio CL benchmark + three-stage framework
  • Technical Depth: ⭐⭐⭐⭐⭐ — Theoretically complete subspace-orthogonal PEFT and boundary-aware regularization
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 6 benchmarks, multi-backbone validation, comprehensive ablation
  • Value: ⭐⭐⭐⭐ — Addresses genuine needs in audio CL; deployment scenarios warrant further clarification