Dual-Imbalance Continual Learning for Real-World Food Recognition¶
Conference: CVPR 2026 | arXiv: 2603.29133 | Code: GitHub | Area: Continual Learning / Food Recognition | Keywords: continual learning, dual imbalance, adapter merging, long-tail distribution, food recognition
TL;DR¶
This paper proposes DIME, a continual learning framework that employs class-count-aware spectral adapter merging and rank-wise threshold modulation to address dual imbalance (an intra-step long-tail class distribution combined with inter-step class-count skew), outperforming the strongest baseline by roughly 2-3% in last-step accuracy (and by over 4% under extreme imbalance) on four long-tail food recognition benchmarks.
Background & Motivation¶
Real-world food recognition systems must continuously incorporate new dish categories. Such settings exhibit dual imbalance:
Class Imbalance: Food data naturally follows a long-tail distribution, where common items (e.g., rice, burgers) have abundant samples while the majority of niche dishes are severely underrepresented.
Step Imbalance: The number of categories introduced at each incremental learning step varies substantially—existing methods assume a roughly equal number of new classes per step, whereas in practice some steps may introduce many new dishes and others only a few.
The compounded effect of these two imbalances remains largely unexplored. The key challenge they jointly produce is asymmetric learning dynamics: head classes and large steps supply stable gradients, while tail classes and small steps generate high-variance, noisy updates that can disrupt previously learned representations.
Method¶
Overall Architecture¶
DIME builds upon a pre-trained ViT backbone with parameter-efficient fine-tuning:
1. A lightweight MLP adapter is trained at each learning step.
2. Balanced Softmax is applied to handle intra-step long-tail distributions.
3. After training, the new adapter is integrated into a cumulative base adapter via a spectral merging strategy.
4. At inference, only a single merged adapter is used, eliminating the need to maintain multiple task-specific modules.
Key Designs¶
- Balanced Softmax Training (see the loss sketch after this list):
- Function: Incorporates class frequency priors into the softmax to balance inter-class contributions.
- Mechanism: The adjusted logit is \(\tilde{z}_y = z_y + \log \pi_y\), where \(\pi_y\) is the empirical frequency of class \(y\).
- Design Motivation: Prevents standard cross-entropy loss from being dominated by head classes, ensuring tail classes receive equitable learning signal.
- Class-Count-Aware Spectral Merging (see the merge sketch after this list):
- Function: Merges old and new adapters within a shared SVD-aligned space.
- Mechanism:
- Concatenates the base adapter \(M_B\) and new adapter \(M_t\) column-wise and applies SVD: \(X = [M_B \ M_t] = U\Sigma V^\top\)
- Blends representations in the aligned space using class-count proportional weights: \(w_b = \frac{C_{\text{old}}}{C_{\text{old}}+C_{\text{new}}}\), \(w_t = \frac{C_{\text{new}}}{C_{\text{old}}+C_{\text{new}}}\)
- \(V_{\text{blend}}^\top = w_b V_B^\top + w_t V_t^\top\)
- Design Motivation: Naive parameter averaging causes destructive interference across steps; SVD alignment ensures updates interact along consistent principal directions; class-count weighting prevents noisy updates from a small new-class step from overwriting accumulated knowledge.
- Rank-Wise Threshold Modulation:
- Function: Differentially modulates update magnitude according to the importance of each singular-value direction.
- Mechanism:
- Directions corresponding to larger singular values encode dominant visual patterns (e.g., prevalent colors and textures) and should remain stable.
- Directions with smaller singular values encode fine-grained variation and can more flexibly absorb new knowledge.
- A gating mask is defined: the top \(r_h\) directions use \(\gamma_{\text{head}}\) (small value); remaining directions use \(\gamma_{\text{tail}}\) (large value).
- \(V_{\text{final}}^\top = V_B^\top + G \odot \Delta V^\top\)
- Design Motivation: Large steps typically produce strong dominant directions, while small steps contribute weak but potentially useful variations; uniform merging fails to accommodate the distinct nature of both types of directions.
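To make the training loss concrete, here is a minimal PyTorch sketch of Balanced Softmax as described above; the `class_counts` tensor and the epsilon guard are illustrative details, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def balanced_softmax_loss(logits, targets, class_counts):
    """Balanced Softmax: shift each logit by its log class prior,
    z_y -> z_y + log(pi_y), before standard cross-entropy, so that
    head classes no longer dominate the gradient."""
    prior = class_counts.float() / class_counts.sum()
    adjusted_logits = logits + torch.log(prior + 1e-12)  # broadcast over the batch
    return F.cross_entropy(adjusted_logits, targets)
```

Below is a hedged sketch of the merging step, combining class-count weighting and rank-wise gating in one pass. How exactly \(V_{\text{blend}}\) and \(V_{\text{final}}\) compose is not fully spelled out in these notes, so the composition here, along with the default values of `r_h`, `gamma_head`, and `gamma_tail`, is one plausible reading rather than the authors' implementation.

```python
import torch

def dime_style_merge(M_B, M_t, C_old, C_new, r_h=8, gamma_head=0.1, gamma_tail=1.0):
    """Class-count-aware spectral merging with rank-wise threshold modulation.
    M_B, M_t: (d_in, d_out) weights of the cumulative and newly trained adapters."""
    # 1. Column-wise concatenation and SVD: X = [M_B  M_t] = U diag(S) V^T.
    X = torch.cat([M_B, M_t], dim=1)                      # (d_in, 2*d_out)
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)   # Vh: (r, 2*d_out)

    # 2. Split the right singular vectors into the blocks aligned with M_B and M_t.
    d_out = M_B.shape[1]
    Vh_B, Vh_t = Vh[:, :d_out], Vh[:, d_out:]

    # 3. Class-count-proportional blend: w_b*V_B + w_t*V_t == V_B + w_t*(V_t - V_B),
    #    so a small new-class step moves the aligned representation only slightly.
    w_t = C_new / (C_old + C_new)
    delta_Vh = w_t * (Vh_t - Vh_B)

    # 4. Rank-wise gate: the top-r_h directions (largest singular values) stay
    #    nearly fixed (gamma_head small); the rest absorb more new knowledge.
    gate = torch.full((Vh.shape[0], 1), gamma_tail)
    gate[:r_h] = gamma_head
    Vh_final = Vh_B + gate * delta_Vh

    # 5. Reconstruct a single merged adapter from the aligned factors.
    return U @ torch.diag(S) @ Vh_final                   # (d_in, d_out)
```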
Loss & Training¶
- The backbone (ViT-B/16 pre-trained on ImageNet-21K) is frozen; only adapter parameters and the classification head are trained.
- Adapters use an MLP structure with hidden dimension 64.
- SGD optimizer with learning rate 0.07, weight decay 0.0005, batch size 16, trained for 20 epochs.
- Step imbalance is controlled via an exponential decay sequence \(s_t = \rho^{(t-1)/(T-1)}\); the step order is then randomly permuted to avoid artificial curriculum effects.
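A small sketch of how such an imbalanced step schedule could be generated; the rounding and normalization scheme, as well as the example values (186 classes, 10 steps, \(\rho=0.01\)), are assumptions for illustration, not the paper's exact allocation.

```python
import numpy as np

def step_class_counts(num_classes=186, T=10, rho=0.01, seed=0):
    """Relative step sizes follow s_t = rho**((t-1)/(T-1)) (so s_1 = 1, s_T = rho),
    normalized to the total class budget and shuffled so that large and small
    steps appear in random order."""
    t = np.arange(T)
    sizes = rho ** (t / (T - 1))
    counts = np.maximum(1, np.round(sizes / sizes.sum() * num_classes)).astype(int)
    counts[0] += num_classes - counts.sum()   # absorb rounding error in the largest step
    rng = np.random.default_rng(seed)
    rng.shuffle(counts)                        # random permutation of the step order
    return counts

print(step_class_counts())
```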
Key Experimental Results¶
Main Results¶
| Dataset | Metric (\(A_T\)) | DIME | TUNA (strongest baseline) | Gain |
|---|---|---|---|---|
| VFN186-LT | Last Acc | 69.07% | 66.19% | +2.88% |
| VFN186-Insulin | Last Acc | 69.40% | 66.28% | +3.12% |
| VFN186-T2D | Last Acc | 69.88% | 67.32% | +2.56% |
| Food101-LT | Last Acc | 77.01% | 75.00% | +2.01% |
The advantage of DIME is more pronounced under extreme imbalance (\(\rho=0.001\)): on VFN186-LT, DIME achieves 69.33% vs. TUNA's 66.60% (+2.73%); on Food101-LT, 78.13% vs. 74.02% (+4.11%).
Ablation Study¶
| Configuration | \(A_T\) | \(wA\) | Description |
|---|---|---|---|
| Base (direct merge + equal weights + CE) | 66.73% | 74.90% | Baseline |
| + SM (spectral merging) | 67.20% | 74.95% | SVD alignment reduces conflict |
| + CCW (class-count weighting) | 67.95% | 76.68% | Step-imbalance awareness |
| + RTM (threshold modulation) | 68.68% | 77.67% | Selective protection of dominant directions |
| + BSM (Balanced Softmax) | 69.31% | 78.07% | Handles intra-step long-tail |
Key Findings¶
- The impact of dual imbalance is real and significant: the greater the imbalance (smaller \(\rho\)), the larger the advantage of DIME.
- Each component contributes clearly and complementarily: SM, CCW, RTM, and BSM each deliver consistent incremental gains.
- Inference efficiency is strong: DIME's inference time (9.50s) and FLOPs (33.73G) are on par with the lightest baseline ACMap, while achieving approximately 4% higher accuracy.
- Large steps are well protected without sacrificing small steps: task-size analysis shows DIME performs best or near-best across large, medium, and small tasks.
- Robustness to hyperparameters: performance remains stable across reasonable ranges of \(r_h\), \(\gamma_{\text{head}}\), and \(\gamma_{\text{tail}}\).
Highlights & Insights¶
- Precise problem formulation: The paper introduces the concept of "dual imbalance" and is the first to systematically study the compounded effects of class imbalance and step imbalance.
- Elegant design philosophy: Merging is performed in an SVD-aligned space, with rank-adaptive gating realizing the principle of "stability for important directions, flexibility for secondary ones."
- Strong practicality: Only a single merged adapter is maintained at inference, incurring no storage or retrieval overhead.
- Introduction of weighted average accuracy \(wA\): This metric more fairly evaluates overall performance under step imbalance, as the conventional \(\bar{A}\) can be inflated by simple steps with few classes.
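As a concrete illustration of the last point, assuming \(wA\) weights each step's accuracy by the number of classes that step introduced (the paper's exact definition may differ), a tiny, easy step cannot inflate the overall score:

```python
import numpy as np

def weighted_average_accuracy(step_acc, step_class_counts):
    """wA as described above: per-step accuracy weighted by that step's class count.
    Illustrative reading of the metric, not necessarily the paper's exact formula."""
    acc = np.asarray(step_acc, dtype=float)
    counts = np.asarray(step_class_counts, dtype=float)
    return float((acc * counts).sum() / counts.sum())

# wA for these steps is about 0.695, while the unweighted mean is about 0.777:
# the 2-class step at 95% barely moves the weighted score.
print(weighted_average_accuracy([0.70, 0.68, 0.95], [80, 60, 2]))
```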
Limitations & Future Work¶
- Validation is limited to food recognition; generalizability to other long-tail continual learning domains (e.g., medical imaging, autonomous driving) remains unverified.
- SVD decomposition introduces additional computational overhead at the merging stage, though it is performed only once per step transition.
- Combination with rehearsal strategies is unexplored; integrating an exemplar memory may yield further improvements.
- The adapter dimension is fixed at 64; the effect of varying adapter capacity across steps of different scales is not investigated.
- Only ViT-B/16 is evaluated; performance on larger backbones (e.g., ViT-L) is not reported.
Related Work & Insights¶
- Builds upon the spectral alignment idea from KnOTS, extending it from LoRA to MLP adapters.
- Balanced Softmax originates from the long-tail learning literature, using log-prior compensation to address class imbalance.
- Comparison with recent continual learning methods including EASE, MOS, and TUNA demonstrates systematic advantages in the dual-imbalance setting.
- The rank-adaptive gating concept is generalizable to other scenarios requiring selective knowledge merging.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The dual-imbalance formulation and rank-adaptive merging are innovative, though individual components each have precedents.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, multiple imbalance ratios, complete ablations, efficiency comparisons, and sensitivity analyses.
- Writing Quality: ⭐⭐⭐⭐ — Problem formalization is rigorous with consistent notation.
- Value: ⭐⭐⭐⭐ — Addresses a practical and underexplored problem with a method that has broad generalization potential.