# Batch Loss Score for Dynamic Data Pruning
Conference: CVPR 2026 · arXiv: 2604.04681 · Code: https://github.com/mrazhou/BLS · Area: Training Efficiency / Data Pruning · Keywords: dynamic data pruning, batch loss, EMA, training efficiency, sample importance
## TL;DR
This paper proposes Batch Loss Score (BLS), a method that estimates sample importance from the mean batch loss alone (which every training pipeline already computes) rather than from per-sample losses, which are difficult to obtain in practice. Grounded in a signal-processing view of EMA as a low-pass filter, BLS comes with theoretical guarantees and can be integrated into existing dynamic pruning frameworks with just 3 lines of code.
## Background & Motivation
Dynamic data pruning accelerates deep learning training by skipping less informative samples. Per-sample loss is the most intuitive importance measure, yet obtaining it in practice poses significant obstacles: standard training pipelines are highly optimized for computing the mean batch loss, and recovering individual losses from the aggregated scalar is non-trivial. For complex objective functions (e.g., multi-component detection losses), defining and isolating a per-sample scalar requires deep task-specific knowledge and invasive code modifications.
Core insight of BLS: Although per-sample loss is hard to access, the mean batch loss is ubiquitous. By maintaining an EMA score for each sample — updated only when that sample appears in the current batch — sample importance can be inferred indirectly.
## Method

### Overall Architecture
Each sample is associated with a score \(s_i(t)\), updated via EMA whenever sample \(i\) appears in batch \(B_t\): \(s_i(t) = \alpha \cdot s_i(t-1) + (1-\alpha) \cdot L(B_t, t)\), where \(L(B_t, t)\) is the mean loss of batch \(B_t\) at step \(t\) and \(\alpha\) is the EMA decay factor. BLS then serves as a transparent proxy, replacing per-sample loss within existing pruning frameworks.
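A minimal sketch of this update rule, assuming scores live in a NumPy array and are initialized with the first batch's mean loss (per the paper's description); class and method names here are illustrative, not the official implementation:

```python
import numpy as np

class BLSScores:
    """Per-sample importance scores driven only by the mean batch loss."""

    def __init__(self, num_samples: int, alpha: float = 0.9):
        self.alpha = alpha                  # EMA decay factor
        self.scores = np.zeros(num_samples)
        self.initialized = False

    def update(self, batch_indices: np.ndarray, mean_batch_loss: float) -> None:
        if not self.initialized:
            # Initialize all scores with the mean loss of the first batch.
            self.scores[:] = mean_batch_loss
            self.initialized = True
            return
        # EMA update only for samples that appear in the current batch.
        idx = np.asarray(batch_indices)
        self.scores[idx] = (self.alpha * self.scores[idx]
                            + (1 - self.alpha) * mean_batch_loss)
```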
### Key Designs
- Signal Decomposition and Filtering: From a single sample's perspective, the mean batch loss equals a scaled signal (\(\frac{1}{B} l_i(t)\), the contribution of sample \(i\)) plus batch-composition noise (the loss contributions of the remaining \(B-1\) samples). EMA acts as a first-order IIR low-pass filter, attenuating the high-frequency batch-composition noise while preserving the low-frequency, persistent loss trend.
- Frequency Separation Assumption: Batch-composition noise fluctuates at a much higher frequency (random sampling changes it at every step) than the scaled per-sample loss evolves (it is driven by slow parameter updates), which is what makes low-pass filtering effective.
- Seamless Proxy Integration: BLS serves as a plug-and-play replacement for per-sample loss; downstream pruning algorithms remain entirely agnostic to the source of the scores, requiring no modifications to core scheduling logic or hyperparameters. Integration requires only 3 lines of code, compared with 33+ lines of invasive modification in InfoBatch (see the sketch after this list).
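To make the decoupling concrete, here is a self-contained toy loop, assuming a simplistic keep-top-fraction pruner (real InfoBatch/SeTa schedules are richer) and random values standing in for actual training losses; only the final line touches BLS state:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, alpha = 1000, 64, 0.9
scores = np.full(N, 0.5)  # BLS scores, as in the update sketch above

def keep_top_fraction(batch_scores: np.ndarray, keep_ratio: float = 0.7) -> np.ndarray:
    """Toy stand-in for a loss-based pruner: keep the highest-scored fraction.
    It reads scores without knowing whether they come from per-sample
    losses or from BLS -- the proxy is invisible to it."""
    k = max(1, int(len(batch_scores) * keep_ratio))
    return np.argsort(batch_scores)[-k:]

for step in range(100):
    batch = rng.choice(N, size=B, replace=False)
    kept = batch[keep_top_fraction(scores[batch])]    # pruner consumes the scores
    mean_batch_loss = rng.random()                    # stand-in for a training step
    scores[kept] = alpha * scores[kept] + (1 - alpha) * mean_batch_loss  # BLS update
```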
### Loss & Training
BLS does not alter the training loss; it only affects sample selection. The EMA decay factor \(\alpha\) controls the filtering characteristics: larger \(\alpha\) yields stronger noise suppression at the cost of slower response.
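A small numerical illustration of this trade-off, assuming a synthetic slowly varying loss trend corrupted by Gaussian noise (stand-ins for the real signals, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
trend = 1.0 + 0.5 * np.sin(2 * np.pi * np.arange(T) / T)  # slow per-sample loss trend
observed = trend + rng.normal(0.0, 0.5, T)                # plus batch-composition noise

def ema(x: np.ndarray, alpha: float) -> np.ndarray:
    s = np.empty_like(x)
    s[0] = x[0]
    for n in range(1, len(x)):
        s[n] = alpha * s[n - 1] + (1 - alpha) * x[n]
    return s

for alpha in (0.8, 0.99):
    # Larger alpha averages over roughly 1 / (1 - alpha) recent batches:
    # smoother scores, but slower to track changes in the underlying trend.
    err = np.abs(ema(observed, alpha) - trend).mean()
    print(f"alpha={alpha}: mean |filtered - trend| = {err:.3f}")
```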
### Theoretical Guarantees
From a signal-decomposition perspective, the mean batch loss for a batch containing sample \(i\) splits into a scaled signal \(\frac{1}{B} l_i(t)\) and batch-composition noise \(\frac{1}{B}\sum_{j \neq i} l_j(t)\). The frequency separation assumption states that the batch-composition noise fluctuates at frequencies far above those at which the scaled per-sample loss evolves. EMA acts as a first-order IIR low-pass filter \(H_\alpha\) with impulse response \(h[n] = (1-\alpha)\alpha^n u[n]\) and frequency response \(|H(e^{j\omega})| = \frac{1-\alpha}{\sqrt{1-2\alpha\cos(\omega)+\alpha^2}}\), which attains its maximum at \(\omega = 0\); it therefore attenuates the high-frequency noise while retaining the low-frequency trend.
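The stated impulse and frequency responses follow directly from the EMA recursion via the z-transform; a short derivation (standard signal-processing material, spelled out here for completeness):

```latex
\begin{align*}
s[n] &= \alpha\, s[n-1] + (1-\alpha)\, x[n]
  \quad\Longrightarrow\quad H(z) = \frac{1-\alpha}{1-\alpha z^{-1}},
  \qquad h[n] = (1-\alpha)\,\alpha^{n}\, u[n],\\[2pt]
\bigl|H(e^{j\omega})\bigr|^{2}
  &= \frac{(1-\alpha)^{2}}{\bigl|1-\alpha e^{-j\omega}\bigr|^{2}}
   = \frac{(1-\alpha)^{2}}{1-2\alpha\cos\omega+\alpha^{2}},\\[2pt]
\bigl|H(e^{j0})\bigr| &= 1,
  \qquad \bigl|H(e^{j\pi})\bigr| = \frac{1-\alpha}{1+\alpha}
  \;\longrightarrow\; 0 \quad (\alpha \to 1).
\end{align*}
```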
## Key Experimental Results

### Main Results
| Dataset / Task | Method | Pruning Ratio | Performance | Notes |
|---|---|---|---|---|
| ToCa (3M, zero-shot captioning) | BLS-SeTa | 32% | CIDEr 71.2 | ≈ SeTa 71.5 |
| MJ+ST (15M, text recognition) | BLS-SeTa | 33% | IIIT5k 96.2% | ≈ Full 96.1% |
| CIFAR10 | BLS-InfoBatch | 30% | 95.5% | ≈ Full 95.6% |
Within both the InfoBatch and SeTa frameworks, BLS acts as a transparent proxy: the downstream pruning algorithms remain fully agnostic to the score source, with no changes to core scheduling logic or hyperparameters (integration itself takes only the 3 lines noted above).
### Key Findings
- BLS is validated across 14 datasets, 11 tasks, and 18 models, achieving lossless pruning of 20%–50% of samples.
- When used as a proxy replacement for per-sample loss, performance is comparable to or better than the original methods.
- BLS is particularly well-suited for complex scenarios (multi-component losses, large-scale data) where per-sample loss is difficult to obtain.
- BLS is initialized with the mean loss of the first batch and subsequently updated only when a sample appears in the current batch.
## Highlights & Insights
- A signal-processing perspective (low-pass filtering) provides rigorous theoretical grounding for BLS.
- The minimalist 3-line implementation substantially lowers the barrier to adoption.
- Decoupling "sample scoring" from "sample selection" enables combination with any loss-based pruning strategy.
- The frequency separation assumption is intuitively clear and empirically validated.
## Limitations & Future Work
- The EMA decay factor \(\alpha\) requires task-specific tuning.
- BLS may be less accurate in the very early stages of training, before sufficient score accumulation.
## Rating
- Novelty: ⭐⭐⭐⭐ — Using batch loss as a proxy for per-sample loss is a novel idea.
- Technical Depth: ⭐⭐⭐⭐⭐ — Signal-processing theoretical analysis is rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 14 datasets, 11 tasks, 18 models.
- Practical Value: ⭐⭐⭐⭐⭐ — 3 lines of code; extremely high practical utility.