Model-Agnostic Meta Learning for Class Imbalance Adaptation

Conference: ACL 2026
arXiv: 2604.18759
Code: GitHub
Area: Natural Language Processing
Keywords: Class Imbalance, Meta-Learning, Adaptive Weighting, Hardness-Aware Resampling, Bi-level Optimization

TL;DR

This paper proposes HAMR (Hardness-Aware Meta-Resample), a unified meta-learning framework. It dynamically estimates instance-level weights through bi-level optimization to prioritize truly difficult samples. Combined with a neighborhood-aware resampling mechanism, it shifts training focus toward hard samples and their semantic neighbors, consistently outperforming strong baselines across six imbalanced NLP datasets.

Background & Motivation

Background: Class imbalance is pervasive in NLP tasks such as text classification and Named Entity Recognition (NER). Existing methods mainly fall into two categories: loss re-weighting (e.g., Focal Loss, Dice Loss) and data resampling (oversampling/synthetic generation).

Limitations of Prior Work: (1) These methods often rely on predefined static heuristics, applying the same adjustment ratio to all samples within the same class; (2) Sample difficulty is not synonymous with class membership—not all minority class instances are inherently difficult, and not all majority class samples are trivial; (3) Static schemes may erroneously down-weight informative majority samples while over-emphasizing simple minority instances.

Key Challenge: There is a need for a method that can dynamically identify and prioritize truly difficult samples—regardless of their class membership—adjusting the learning strategy based on the model's evolving understanding of the data.

Goal: Design a unified framework to simultaneously address both class imbalance and instance-level difficulty, dynamically guiding the model's learning focus.

Key Insight: Decouple "what the model should focus on" (adaptive weights) and "what the model should see" (resampling) into two complementary modules, unified and optimized via a meta-learning framework.

Core Idea: Use bi-level meta-optimization to dynamically learn instance importance weights (the inner loop uses pre-meta weights for intermediate updates, while the outer loop updates the weight network on a balanced meta-validation set to obtain post-meta weights for the actual update). This is coupled with neighborhood-enhanced resampling based on FAISS to shift the training distribution toward difficult semantic regions.
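
One common way to write such a bi-level scheme, in the style of Meta-Weight-Net, is sketched below. This is an illustrative formalization, not the paper's exact equations: the step sizes \(\alpha\) and \(\beta\), the mini-batch \(\mathcal{B}\), the per-sample loss \(\ell_i\), its normalized version \(\tilde{\ell}_i\), and the notation \(\theta^{+}\) for the updated weight network are all assumptions introduced here. \(\phi\) denotes the model parameters and \(f_\theta\) the weight network.

\[
\begin{aligned}
w_i^{\text{pre}} &= f_\theta\big(\tilde{\ell}_i(\phi)\big) \\
\phi'(\theta) &= \phi - \alpha \, \nabla_\phi \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} w_i^{\text{pre}} \, \ell_i(\phi) \\
\theta^{+} &= \theta - \beta \, \nabla_\theta \frac{1}{|\mathcal{D}_{\text{meta}}|} \sum_{j \in \mathcal{D}_{\text{meta}}} \ell_j\big(\phi'(\theta)\big) \\
w_i^{\text{post}} &= f_{\theta^{+}}\big(\tilde{\ell}_i(\phi)\big) \\
\phi &\leftarrow \phi - \alpha \, \nabla_\phi \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} w_i^{\text{post}} \, \ell_i(\phi)
\end{aligned}
\]

The third line is where the meta-validation set enters: the gradient with respect to \(\theta\) flows through \(\phi'(\theta)\), so the weight network is rewarded for weights that lower the balanced validation loss after one trial step.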

Method

Overall Architecture

HAMR consists of two core modules: (1) Adaptive Weight Estimation, which maps normalized sample losses to importance weights via a lightweight weight network \(f_\theta\) adjusted through bi-level meta-optimization; (2) Hardness-Aware Region Resampling, which dynamically adjusts the training distribution based on EMA-smoothed hardness scores and KNN-based neighborhood enhancement. The two modules collaborate within a unified training loop.

Key Designs

  1. Adaptive Weight Estimation via Bi-level Meta-Optimization:

    • Function: Dynamically adjusts the importance of each training sample based on the model's current learning state.
    • Mechanism: The inner loop uses the current pre-meta weights \(w_i^{\text{pre}}\) from the weight network \(f_\theta\) to perform an intermediate gradient update of the model parameters \(\phi\), resulting in \(\phi'\). The outer loop evaluates the model with parameters \(\phi'\) on a balanced meta-validation set \(\mathcal{D}_{\text{meta}}\) and uses that loss to update the weight network parameters \(\theta\). Finally, the updated weight network recalculates the post-meta weights \(w_i^{\text{post}}\), which drive the actual model update. The per-sample loss is the maximum token loss per sentence for token-level tasks and the sample-wise cross-entropy for classification tasks.
    • Design Motivation: Pre-meta weights reflect "what the model considers important before the update," while post-meta weights reflect "what is truly important under the guidance of a balanced validation set." This "try-before-you-act" strategy adapts better to training dynamics than static heuristics.
  2. Hardness-Aware Region Resampling with Neighborhood Enhancement:

    • Function: Dynamically shifts the training distribution to expose the model to more difficult semantic regions.
    • Mechanism: Global hardness scores \(h_i\) are obtained by EMA-smoothing post-meta weights: \(h_i \leftarrow \gamma \cdot h_i + (1-\gamma) \cdot w_i^{\text{post}}\). The top 20% hardest samples are selected, FAISS-accelerated KNN finds \(k\) semantic neighbors for each of them, and a neighborhood enhancement score \(b_i\) is computed from these neighborhoods. The final sampling probability is \(p_i \propto (h_i + \varepsilon)^\tau \cdot (1 + \lambda b_i)\), where a temperature \(\tau < 1\) flattens the distribution so that easier samples are still drawn, encouraging balanced exploration.
    • Design Motivation: Focusing solely on isolated hard samples is insufficient; semantic neighbors of hard samples often pose similar challenges. Neighborhood enhancement diffuses difficulty from single samples to entire semantic regions.
  3. Unified Training Loop:

    • Function: Seamlessly integrates weight estimation and resampling into an end-to-end training process.
    • Mechanism: Neighborhood enhancement is updated every \(F\) epochs (to avoid the overhead of per-step KNN). Every mini-batch follows a complete pipeline: Sampling \(\rightarrow\) Pre-meta weights \(\rightarrow\) Inner update \(\rightarrow\) Meta-step \(\rightarrow\) Post-meta weights \(\rightarrow\) Outer update \(\rightarrow\) EMA update.
    • Design Motivation: The two modules reinforce each other—weight estimation provides instance-level signals, while resampling ensures the model encounters sufficient samples from difficult regions.
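
The resampling step in design 2 can be illustrated with a minimal NumPy sketch. Several parts are assumptions rather than the paper's implementation: a brute-force cosine KNN stands in for FAISS, the boost \(b_i\) is defined here (hypothetically) as how often sample \(i\) appears among the neighbors of a hard seed, scaled to \([0, 1]\), and the values of `gamma`, `k`, `tau`, `lam`, and `eps` are placeholders.

```python
import numpy as np

def ema_update(h, w_post, idx, gamma=0.9):
    """h_i <- gamma * h_i + (1 - gamma) * w_i_post, for the samples in this batch."""
    h = h.copy()
    h[idx] = gamma * h[idx] + (1.0 - gamma) * w_post
    return h

def neighborhood_boost(emb, h, k=3, top_frac=0.2):
    """Hypothetical b_i: how often sample i is a k-NN of a hard seed, scaled to [0, 1]."""
    n = len(h)
    n_seeds = max(1, int(np.ceil(top_frac * n)))
    seeds = np.argsort(-h)[:n_seeds]              # top-20% hardest samples
    # Brute-force cosine similarity (assumption: FAISS would replace this at scale).
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit[seeds] @ unit.T                    # (n_seeds, n)
    sim[np.arange(n_seeds), seeds] = -np.inf      # a seed is not its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :k]        # k nearest neighbors per seed
    counts = np.bincount(nbrs.ravel(), minlength=n).astype(float)
    return counts / counts.max()

def sampling_probs(h, b, tau=0.5, lam=1.0, eps=1e-6):
    """p_i proportional to (h_i + eps)^tau * (1 + lam * b_i)."""
    scores = (h + eps) ** tau * (1.0 + lam * b)
    return scores / scores.sum()

# Toy usage: 10 samples with random 4-d embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))
h = np.full(10, 0.5)
h = ema_update(h, w_post=np.array([2.0, 1.5]), idx=np.array([0, 1]))
b = neighborhood_boost(emb, h)
p = sampling_probs(h, b)
batch = rng.choice(10, size=4, replace=True, p=p)  # indices for the next mini-batch
```

In this sketch the boost would be recomputed only every \(F\) epochs, while `ema_update` runs after each mini-batch, mirroring the split described in the unified training loop.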

Loss & Training

The main loss is weighted cross-entropy or token-level loss. The balanced meta-validation set is constructed by combining the full original validation set with supplementary samples from the training set, sampled to match the median class frequency. Per-sample losses are normalized using batch-wise z-scores before being processed by the weight network, and the resulting weights are clipped to a fixed range for numerical stability.
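
The normalize-then-clip pipeline can be sketched as follows. The weight network here is a hypothetical one-hidden-layer MLP with a scaled sigmoid output, and the clipping bounds `w_min` and `w_max` are placeholder values; the source only states that the network is lightweight and that outputs are clipped to a fixed range.

```python
import numpy as np

def zscore(losses, eps=1e-8):
    """Batch-wise z-score of per-sample losses."""
    return (losses - losses.mean()) / (losses.std() + eps)

def weight_net(z, theta):
    """Hypothetical f_theta: scalar -> hidden -> scalar, output in (0, 2)."""
    W1, b1, W2, b2 = theta
    hidden = np.tanh(z[:, None] * W1 + b1)   # (batch, hidden)
    out = hidden @ W2 + b2                   # (batch,)
    return 2.0 / (1.0 + np.exp(-out))        # scaled sigmoid

def instance_weights(losses, theta, w_min=0.1, w_max=1.9):
    """Normalize losses, map them to weights, clip to [w_min, w_max]."""
    return np.clip(weight_net(zscore(losses), theta), w_min, w_max)

# Toy usage: a hand-set theta that makes weights increase with loss.
theta = (np.ones(4), np.zeros(4), np.ones(4), 0.0)
losses = np.array([0.1, 0.2, 0.3, 3.0])
w = instance_weights(losses, theta)
```

With this hand-set `theta`, the high-loss sample receives the largest weight and hits the upper clipping bound; in the actual method \(\theta\) is learned by the outer loop rather than fixed.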

Key Experimental Results

Main Results

| Dataset | Task | HAMR Macro-F1 | Best Baseline Macro-F1 | Gain (pp) |
|---|---|---|---|---|
| BioNLP | NER | 72.7 | 70.6 (Dice) | +2.1 |
| TweetNER | NER | 60.2 | 59.0 (Dice/LNR) | +1.2 |
| MIT-Restaurant | NER | 81.1 | 80.4 (Dice) | +0.7 |
| Hurricane-Irma17 | CLS | 73.4 | 72.7 (ICF) | +0.7 |
| Cyclone-Idai19 | CLS | 65.7 | 63.8 (ICF) | +1.9 |
| SST-5 | CLS | 57.0 | 56.3 (ICF) | +0.7 |

Ablation Study

| Configuration | BioNLP F1 | Irma17 F1 | Description |
|---|---|---|---|
| HAMR (Full) | 72.7 | 73.4 | Complete model |
| w/o Resampling | 71.4 | 72.1 | Removed neighborhood resampling |
| w/o Meta-Weight | 70.9 | 71.8 | Removed adaptive weighting |
| w/o Neighborhood Boost | 71.8 | 72.5 | Removed KNN neighborhood boost |

Key Findings

  • HAMR achieves the best Macro-F1 across all 6 datasets, with the largest gain (+2.1 pp) on BioNLP and the second largest (+1.9 pp) on the highly imbalanced Cyclone-Idai19 dataset (\(IR=98.4\)).
  • Both modules contribute synergistically; removing either leads to performance degradation, though meta-weighting contributes slightly more than resampling.
  • Neighborhood enhancement provides a consistent marginal gain (+0.6–0.9 pp reported; +0.9 pp on both datasets in the ablation table), supporting the value of diffusing difficulty from single points to regions.
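
As a quick arithmetic check, the gains and ablation drops can be recomputed from the numbers in the two tables above. The snippet only re-derives differences from the reported scores; it adds no new results.

```python
# (HAMR Macro-F1, best baseline Macro-F1) from the main results table.
main = {
    "BioNLP": (72.7, 70.6),
    "TweetNER": (60.2, 59.0),
    "MIT-Restaurant": (81.1, 80.4),
    "Hurricane-Irma17": (73.4, 72.7),
    "Cyclone-Idai19": (65.7, 63.8),
    "SST-5": (57.0, 56.3),
}
gains = {name: round(hamr - base, 1) for name, (hamr, base) in main.items()}
largest = max(gains, key=gains.get)  # dataset with the largest gain

# Ablation table: drop relative to the full model, per dataset.
full = {"BioNLP": 72.7, "Irma17": 73.4}
ablations = {
    "w/o Resampling": {"BioNLP": 71.4, "Irma17": 72.1},
    "w/o Meta-Weight": {"BioNLP": 70.9, "Irma17": 71.8},
    "w/o Neighborhood Boost": {"BioNLP": 71.8, "Irma17": 72.5},
}
drops = {
    cfg: {ds: round(full[ds] - f1, 1) for ds, f1 in scores.items()}
    for cfg, scores in ablations.items()
}

print(gains)
print(largest)
print(drops)
```

On both ablation datasets, removing meta-weighting costs more than removing resampling, which matches the second finding above.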

Highlights & Insights

  • The "try-before-you-act" strategy of bi-level meta-optimization is elegant—pre-meta weights act as a "draft," and feedback from the meta-validation set allows the weight network to learn which weight distributions truly favor generalization. This concept is transferable to any scenario requiring dynamic training priority adjustment.
  • The neighborhood-enhanced resampling approach is unique—treating hard samples as "seeds" and diffusing difficulty signals through semantic neighborhoods is more robust than focusing on isolated hard samples.
  • The unified framework design decouples "what to focus on" (weights determine how to learn) and "what to see" (resampling determines what to learn from).

Limitations & Future Work

  • Reliance on FAISS for KNN may present computational bottlenecks for extremely large datasets.
  • Construction of the meta-validation set relies on assumptions regarding reasonable class distributions.
  • Validated only on BERT-based encoders; applicability to large language models remains untested.
  • Integration with synthetic data augmentation methods has not been explored.

Comparison with Related Methods

  • vs Focal Loss/Dice Loss: Static heuristics do not distinguish instance difficulty; HAMR dynamically learns instance weights.
  • vs Meta-Weight-Net: Similar meta-learning frameworks lack neighborhood resampling; HAMR adds region-level training distribution adjustments.
  • vs SMOTE: Instead of synthesizing new samples, HAMR dynamically adjusts existing sample weights and frequencies, making it more lightweight and free from generation noise.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of bi-level meta-optimization and neighborhood resampling is novel, though individual components have precedents.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 6 datasets across 2 tasks with a detailed ablation, though more comparisons with recent methods could be included.
  • Writing Quality: ⭐⭐⭐⭐ The method is clearly presented, and the algorithmic pseudocode is complete.
  • Value: ⭐⭐⭐⭐ Provides a general and effective solution for class imbalance in NLP.