Model-Agnostic Meta Learning for Class Imbalance Adaptation¶
Conference: ACL 2026 arXiv: 2604.18759 Code: GitHub Area: NLP Keywords: class imbalance, meta-learning, adaptive weighting, hardness-aware resampling, bi-level optimization
TL;DR¶
This paper proposes HAMR (Hardness-Aware Meta-Resample), a unified meta-learning framework that dynamically estimates instance-level importance weights via bi-level optimization to prioritize genuinely difficult samples, coupled with a neighborhood-aware resampling mechanism that shifts training focus toward hard samples and their semantic neighbors. HAMR consistently outperforms strong baselines across 6 imbalanced NLP datasets.
Background & Motivation¶
Background: Class imbalance is pervasive in NLP tasks such as text classification and named entity recognition. Existing approaches fall into two main categories: loss re-weighting (e.g., Focal Loss, Dice Loss) and data resampling (oversampling / synthetic generation).
Limitations of Prior Work: (1) These methods typically rely on predefined static heuristics that apply uniform adjustment ratios to all samples within the same class. (2) Sample difficulty is not equivalent to class membership — not all minority-class instances are inherently hard, nor are all majority-class samples trivial. (3) Static schemes may incorrectly downweight informative majority-class samples while over-emphasizing easy minority-class instances.
Key Challenge: A method is needed that can dynamically identify and prioritize genuinely difficult samples — regardless of class membership — and adapt its learning strategy in accordance with the model's evolving understanding of the data.
Goal: To design a unified framework that simultaneously addresses class imbalance and instance-level difficulty, dynamically steering the model's learning focus.
Key Insight: Decouple "what the model should attend to" (adaptive weighting) from "what the model should be exposed to" (resampling) into two complementary modules, jointly optimized within a meta-learning framework.
Core Idea: Employ bi-level meta-optimization to dynamically learn instance importance weights — the inner loop performs an intermediate update using pre-meta weights, while the outer loop updates the weight network on a balanced meta-validation set to obtain post-meta weights for the actual model update — combined with FAISS-accelerated neighborhood-enhanced resampling to shift the training distribution toward challenging semantic regions.
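As a minimal numeric sketch of this try-then-revise loop, the toy example below uses 1-D linear regression as the model, a single-parameter sigmoid weight network, and a finite-difference meta-gradient. All of these simplifications, and every hyperparameter value, are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

# Toy bi-level "try-then-revise" step: pre-meta weights drive an intermediate
# update (inner loop); the intermediate model is scored on a balanced meta set
# to update the weight net (outer loop); the model then updates with post-meta
# weights. Model, weight net, and meta-gradient are deliberately simplified.

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = 2.0 * x + rng.normal(scale=0.1, size=8)   # noisy training set, true slope 2
x_m = rng.normal(size=8)
y_m = 2.0 * x_m                                # clean "balanced" meta set (toy)
w, theta, lr, meta_lr = 0.0, 0.0, 0.2, 0.5

def sample_weights(th, w_model):
    losses = (w_model * x - y) ** 2
    z = (losses - losses.mean()) / (losses.std() + 1e-8)  # normalized losses
    return 1.0 / (1.0 + np.exp(-th * z))                  # weight net: sigmoid(theta * z)

def inner_update(th, w_model):
    v = sample_weights(th, w_model)                       # pre-meta weights
    grad = np.mean(2 * v * (w_model * x - y) * x)         # weighted loss gradient
    return w_model - lr * grad                            # intermediate model phi'

def meta_loss(th, w_model):
    return np.mean((inner_update(th, w_model) * x_m - y_m) ** 2)  # phi' on meta set

for _ in range(100):
    h = 1e-4  # outer loop: finite-difference meta-gradient w.r.t. theta
    g = (meta_loss(theta + h, w) - meta_loss(theta - h, w)) / (2 * h)
    theta -= meta_lr * g
    w = inner_update(theta, w)  # actual update with post-meta weights
```

In the paper's full version the meta-gradient flows analytically through the intermediate update; the finite-difference step here is only to keep the sketch dependency-free.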
Method¶
Overall Architecture¶
HAMR comprises two core modules: (1) Adaptive Weight Estimation — a lightweight weight network \(f_\theta\) maps normalized per-sample losses to importance weights, dynamically adjusted via bi-level meta-optimization; and (2) Hardness-Aware Region Resampling — dynamically adjusts the training distribution based on EMA-smoothed difficulty scores and KNN neighborhood augmentation. Both modules operate collaboratively within a unified training loop.
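A minimal sketch of what such a weight network might look like, assuming a one-hidden-layer MLP with a sigmoid output; the layer sizes and activations are illustrative guesses, not the paper's architecture:

```python
import numpy as np

# Hypothetical weight network f_theta: maps batch-normalized per-sample losses
# to importance weights in (0, 1). Sizes and activations are assumptions.

rng = np.random.default_rng(0)

class WeightNet:
    def __init__(self, hidden=16):
        self.W1 = rng.normal(scale=0.1, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, losses):
        # z-score normalize per-sample losses within the batch
        z = (losses - losses.mean()) / (losses.std() + 1e-8)
        h = np.maximum(z[:, None] @ self.W1 + self.b1, 0.0)  # ReLU
        out = h @ self.W2 + self.b2
        return 1.0 / (1.0 + np.exp(-out[:, 0]))              # sigmoid -> (0, 1)

f_theta = WeightNet()
losses = np.array([0.2, 1.5, 0.4, 3.0])  # toy per-sample losses
weights = f_theta(losses)                # one importance weight per sample
```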
Key Designs¶
- Adaptive Weight Estimation via Bi-Level Meta-Optimization:
- Function: Dynamically adjusts the importance of each training sample based on the model's current learning state.
- Mechanism: In the inner loop, the current weight network produces pre-meta weights \(w_i^{\text{pre}}\) and performs an intermediate gradient update to obtain \(\phi'\). In the outer loop, \(f_{\phi'}\) is evaluated on a balanced meta-validation set \(\mathcal{D}_{\text{meta}}\) to update the weight network parameters \(\theta\). The updated weight network then recomputes post-meta weights \(w_i^{\text{post}}\) for the actual model update. For token-level tasks, the sentence-level maximum token loss is used; for classification tasks, per-sample cross-entropy is used.
- Design Motivation: Pre-meta weights reflect "what the model deems important before updating," while post-meta weights reflect "what is truly important under the guidance of the balanced validation set." This try-then-revise strategy adapts more effectively to training dynamics than static heuristics.
- Hardness-Aware Region Resampling with Neighborhood Augmentation:
- Function: Dynamically adjusts the training distribution to expose the model more frequently to challenging semantic regions.
- Mechanism: EMA smoothing is applied to post-meta weights to obtain global difficulty scores: \(h_i \leftarrow \gamma \cdot h_i + (1-\gamma) \cdot w_i^{\text{post}}\). The top 20% hardest samples are selected, and FAISS-accelerated KNN retrieves \(k\) semantic neighbors for each hard sample, yielding a neighborhood-augmented score \(b_i\). The final sampling probability is \(p_i \propto (h_i + \varepsilon)^\tau \cdot (1 + \lambda b_i)\), where temperature \(\tau < 1\) encourages balanced exploration.
- Design Motivation: Focusing solely on isolated hard samples is insufficient — the semantic neighbors of hard samples tend to exhibit similarly challenging characteristics. Neighborhood augmentation propagates difficulty signals from individual samples to entire semantic regions.
- Unified Training Loop:
- Function: Seamlessly integrates weight estimation and resampling into an end-to-end training procedure.
- Mechanism: Neighborhood augmentation is updated every \(F\) epochs to avoid per-step KNN overhead. Each mini-batch undergoes the complete pipeline: sampling → pre-meta weights → inner update → meta step → post-meta weights → outer update → EMA update.
- Design Motivation: The two modules are mutually reinforcing — weight estimation provides instance-level importance signals, while resampling ensures the model is sufficiently exposed to hard-region samples.
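The resampling mechanism above can be sketched end to end. Brute-force cosine KNN stands in for FAISS here, and the embeddings and hyperparameter values (γ, τ, λ, ε, k) are illustrative choices, not the paper's settings:

```python
import numpy as np

# Sketch of hardness-aware region resampling: EMA difficulty update, top-20%
# hard-seed selection, KNN neighborhood score b_i, and the final sampling
# probability p_i ∝ (h_i + eps)^tau * (1 + lam * b_i).

rng = np.random.default_rng(0)
n, d = 200, 32
emb = rng.normal(size=(n, d))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm embeddings

h = np.full(n, 0.5)            # EMA-smoothed difficulty scores
w_post = rng.random(n)         # post-meta weights from the meta step (toy)
gamma = 0.9
h = gamma * h + (1 - gamma) * w_post  # EMA update of difficulty

# Select the top 20% hardest samples as seeds
hard_idx = np.argsort(h)[-int(0.2 * n):]

# Neighborhood score b_i: how often sample i is among a hard seed's k neighbors
k = 5
b = np.zeros(n)
sims = emb[hard_idx] @ emb.T             # cosine similarity to all samples
for row in sims:
    nbrs = np.argsort(row)[-(k + 1):-1]  # k neighbors, excluding the seed itself
    b[nbrs] += 1.0
b /= max(b.max(), 1.0)                   # normalize to [0, 1]

# Final sampling probability, temperature tau < 1 flattens the distribution
tau, lam, eps = 0.5, 1.0, 1e-6
p = (h + eps) ** tau * (1.0 + lam * b)
p /= p.sum()
batch = rng.choice(n, size=32, p=p)      # hardness-aware mini-batch
```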
Loss & Training¶
The primary loss is weighted cross-entropy / token-level loss. The meta-validation set is constructed by taking the full validation set and supplementing it with training samples drawn to match the median class size, forming a balanced set. Per-sample losses undergo batch-wise z-score normalization before being fed to the weight network; its output weights are clipped to a fixed range to ensure numerical stability.
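A sketch of the balanced meta-validation construction under these rules, with toy class distributions; the exact sampling scheme is an assumption:

```python
import numpy as np
from collections import Counter

# Build the meta-validation set: start from the full validation set, then draw
# training samples per class until each class reaches the median training class
# size (or the class's training pool is exhausted). Toy label distributions.

rng = np.random.default_rng(0)
train_labels = np.array([0] * 500 + [1] * 60 + [2] * 20)  # imbalanced train set
val_labels = rng.integers(0, 3, size=50)                  # full validation set

# Target per-class size: median class size of the training set
median_size = int(np.median(list(Counter(train_labels.tolist()).values())))

meta_labels = list(val_labels)
val_counts = Counter(val_labels.tolist())
for c in (0, 1, 2):
    deficit = median_size - val_counts[c]
    if deficit > 0:
        pool = np.flatnonzero(train_labels == c)          # train indices of class c
        take = min(deficit, len(pool))                    # can't exceed the pool
        picked = rng.choice(pool, size=take, replace=False)
        meta_labels.extend(train_labels[picked].tolist())

meta_counts = Counter(meta_labels)  # near-balanced per-class counts
```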
Key Experimental Results¶
Main Results¶
| Dataset | Task | HAMR Macro-F1 | Best Baseline Macro-F1 | Gain |
|---|---|---|---|---|
| BioNLP | NER | 72.7 | 70.6 (Dice) | +2.1 |
| TweetNER | NER | 60.2 | 59.0 (Dice/LNR) | +1.2 |
| MIT-Restaurant | NER | 81.1 | 80.4 (Dice) | +0.7 |
| Hurricane-Irma17 | CLS | 73.4 | 72.7 (ICF) | +0.7 |
| Cyclone-Idai19 | CLS | 65.7 | 63.8 (ICF) | +1.9 |
| SST-5 | CLS | 57.0 | 56.3 (ICF) | +0.7 |
Ablation Study¶
| Configuration | BioNLP F1 | Irma17 F1 | Note |
|---|---|---|---|
| HAMR (full) | 72.7 | 73.4 | Complete model |
| w/o resampling | 71.4 | 72.1 | Remove neighborhood resampling |
| w/o meta-weights | 70.9 | 71.8 | Remove adaptive weighting |
| w/o neighborhood augmentation | 71.8 | 72.5 | Remove KNN neighborhood boost |
Key Findings¶
- HAMR achieves the best Macro-F1 on all 6 datasets; among the classification tasks, the largest margin appears on the most highly imbalanced dataset (Cyclone-Idai19, IR = 98.4), a +1.9 pp improvement.
- Both modules contribute synergistically — removing either one degrades performance, though the meta-weighting module contributes slightly more than resampling.
- Neighborhood augmentation provides consistent marginal gains (+0.6–0.9 pp), validating the benefit of propagating difficulty from point-level to region-level.
Highlights & Insights¶
- The bi-level meta-optimization "try-then-revise" strategy is elegant — pre-meta weights serve as a "draft," and feedback from the meta-validation set teaches the weight network what weight assignments truly benefit generalization. This paradigm is transferable to any scenario requiring dynamic adjustment of training priorities.
- The neighborhood-augmented resampling approach is distinctive — hard samples are treated as "seeds" from which difficulty signals are diffused through semantic neighborhoods, yielding greater robustness than attending only to isolated hard instances.
- The unified framework cleanly decouples "what to attend to" from "what to learn from" — weights govern how to learn, while resampling governs what to learn from.
Limitations & Future Work¶
- Reliance on FAISS for KNN may introduce computational bottlenecks on very large datasets.
- Meta-validation set construction assumes a reasonable class distribution prior.
- Validation is limited to BERT-based encoders; applicability in the LLM era remains unexplored.
- Integration with synthetic data augmentation methods has not been investigated.
Related Work & Insights¶
- vs. Focal Loss / Dice Loss: Static heuristics do not differentiate instance-level difficulty; HAMR dynamically learns per-instance weights.
- vs. Meta-Weight-Net: A similar meta-learning framework but without neighborhood resampling; HAMR additionally introduces region-level training distribution adjustment.
- vs. SMOTE: Generates synthetic samples rather than dynamically adjusting weights of existing samples; HAMR is more lightweight and free from generation noise.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of bi-level meta-optimization and neighborhood resampling is novel, though individual components have precedents.
- Experimental Thoroughness: ⭐⭐⭐⭐ Six datasets across two tasks with detailed ablations, but comparisons with more recent methods are limited.
- Writing Quality: ⭐⭐⭐⭐ The method is clearly presented with complete algorithmic pseudocode.
- Value: ⭐⭐⭐⭐ Provides a general and effective solution to class imbalance in NLP.