On Agnostic PAC Learning in the Small Error Regime¶
- Conference: NeurIPS 2025
- arXiv: 2502.09496
- Authors: Julian Asilis, Mikael Møller Høgsgaard, Grigoris Velegkas
- Code: None
- Area: Learning Theory / Computational Learning Theory
- Keywords: PAC learning, agnostic learning, ERM, sample complexity, VC dimension
TL;DR¶
In the small error regime of agnostic PAC learning (\(\tau \approx d/m\)), this paper constructs a computationally efficient learner based on ERM aggregation that achieves an error upper bound of \(c \cdot \tau + O(\sqrt{\tau d/m} + d/m)\) with \(c \leq 2.1\), matching known lower bounds and advancing the precise complexity characterization of agnostic learning.
Background & Motivation¶
Classical PAC learning theory exhibits an interesting phenomenon:
- Realizable setting: the data are labeled perfectly by some hypothesis in \(\mathcal{H}\), yet ERM is not optimal (it loses a logarithmic factor relative to the optimal sample complexity).
- Agnostic setting: no realizability assumption is made, yet ERM turns out to be optimal.
The root cause of this "paradox" is that an agnostic learner, when facing a nearly realizable distribution (where \(\tau = \text{err}(h^*_\mathcal{D})\) is small), is permitted a larger excess error, effectively masking the suboptimality of ERM in the agnostic setting.
Hanneke, Larsen, and Zhivotovskiy (FOCS 2024) established a more refined analytical framework by incorporating \(\tau\) as a parameter in the error bound, proving a lower bound of the form (omitting the dependence on \(\delta\)):
\[
\text{err}(A(S)) \geq \tau + \Omega\!\left(\sqrt{\frac{\tau d}{m}} + \frac{d}{m}\right).
\]
This lower bound is tight when \(\tau > d/m\), but it leaves open the question of whether a strictly stronger lower bound holds when \(\tau \approx d/m\), i.e., whether a constant strictly greater than 1 in front of \(\tau\) is unavoidable in the small error regime.
Method¶
Overall Architecture¶
The central contribution of this paper is the construction of a learner \(A\) such that for any distribution \(\mathcal{D}\), with probability \(\geq 1-\delta\) over the training sample \(S\),
\[
\text{err}(A(S)) \leq c \cdot \tau + O\!\left(\sqrt{\frac{\tau d}{m}} + \frac{d}{m}\right),
\]
where \(c \leq 2.1\), \(\tau = \text{err}(h^*_\mathcal{D})\) is the best error achievable in \(\mathcal{H}\), \(d = \text{VC}(\mathcal{H})\), and \(m\) is the sample size.
Key Designs¶
ERM Aggregation Strategy: The core of the learner is a principled aggregation of multiple ERM solutions:
- Sample splitting: the \(m\) training samples are divided into \(T\) groups of \(m/T\) samples each.
- Independent ERM: ERM is run independently on each group, yielding \(T\) candidate hypotheses \(h_1, \ldots, h_T\).
- Tournament aggregation: a pairwise tournament mechanism is used to select the final hypothesis.
The key step of the tournament: for each pair of candidates \((h_i, h_j)\), a held-out set is used to estimate \(\text{err}(h_i) - \text{err}(h_j)\), and the candidate that wins the majority of its pairwise comparisons is selected; a minimal code sketch is given below.
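To make the three steps concrete, here is a minimal Python sketch of the aggregation scheme. It assumes a user-supplied `erm_oracle` and, purely for illustration, reserves half of the sample as the held-out comparison set; the paper's exact split proportions, tie-breaking, and confidence handling are not reproduced.

```python
import random
from collections import defaultdict


def zero_one_error(h, data):
    """Empirical 0-1 error of hypothesis h on a list of (x, y) pairs."""
    return sum(h(x) != y for x, y in data) / len(data)


def aggregate_erm(sample, erm_oracle, T, seed=0):
    """Minimal sketch of ERM aggregation with a pairwise tournament.

    `erm_oracle(data)` is assumed to return an empirical risk minimizer over
    the hypothesis class; reserving half the sample for comparisons is an
    illustrative choice, not the paper's exact scheme.
    """
    rng = random.Random(seed)
    sample = list(sample)
    rng.shuffle(sample)

    # Illustrative split: half for the tournament, half for training.
    half = len(sample) // 2
    holdout, train = sample[:half], sample[half:]

    # Step 1: split the training part into T disjoint groups.
    groups = [train[i::T] for i in range(T)]

    # Step 2: run ERM independently on each group.
    candidates = [erm_oracle(g) for g in groups]

    # Step 3: pairwise tournament; the sign of the estimated error
    # difference on the held-out set decides each comparison.
    wins = defaultdict(int)
    for i in range(T):
        for j in range(i + 1, T):
            diff = zero_one_error(candidates[i], holdout) - zero_one_error(candidates[j], holdout)
            wins[i if diff <= 0 else j] += 1

    # Return the candidate that wins the most pairwise comparisons.
    best = max(range(T), key=lambda i: wins[i])
    return candidates[best]
```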
Core of the refined analysis:
- When \(\tau\) is large, the error of a single ERM solution concentrates well, making it easy for the tournament to select a strong candidate.
- When \(\tau \approx d/m\) (the small error regime), a more refined treatment is required:
  - A lower bound on the fraction of "good" ERM solutions in the small error regime is exploited.
  - Boosting-like parameter choices ensure that at least one group yields a sufficiently good ERM solution.
  - The comparison error in the tournament is on the order of \(O(\sqrt{d/(m \cdot T)})\).
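As a quick back-of-the-envelope check (not taken from the paper), consider \(d = 10\), \(m = 1000\), \(\tau = 0.01\), matching one row of the results table below:
\[
\tau = 0.01, \qquad \sqrt{\frac{\tau d}{m}} = \sqrt{\frac{0.01 \cdot 10}{1000}} = 0.01, \qquad \frac{d}{m} = \frac{10}{1000} = 0.01,
\]
so all three terms of the guarantee are of the same order and the bound reads \(c \cdot 0.01 + O(0.02)\). In this regime the leading constant \(c\) is no longer hidden by the other terms, which is why the small error regime calls for the refined analysis.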
Loss & Training¶
The 0-1 loss is used for binary classification:
\[
\ell(h(x), y) = \mathbf{1}[h(x) \neq y].
\]
ERM minimizes the empirical risk: \(\hat{h}_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \frac{1}{|S|} \sum_{(x,y) \in S} \mathbf{1}[h(x) \neq y]\)
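For concreteness, a minimal ERM oracle over a finite hypothesis class (hypothetical 1-D threshold classifiers, not the paper's setting) could look as follows; it can be plugged into the `erm_oracle` slot of the sketch above.

```python
def make_threshold(t):
    """Hypothetical 1-D threshold classifier: predicts 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0


def erm_oracle(data):
    """Return a hypothesis minimizing empirical 0-1 risk on `data`.

    `data` is a list of (x, y) pairs with y in {0, 1}; restricting the class
    to thresholds at observed points is an illustrative assumption.
    """
    thresholds = sorted({x for x, _ in data})
    hypotheses = [make_threshold(t) for t in thresholds]
    return min(
        hypotheses,
        key=lambda h: sum(h(x) != y for x, y in data),
    )


# Usage on toy realizable data; the oracle can also be passed to aggregate_erm above.
data = [(x / 100, int(x >= 30)) for x in range(100)]
h = erm_oracle(data)
assert sum(h(x) != y for x, y in data) == 0
```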
Key Experimental Results¶
This is a purely theoretical work; the numbers below are numerical evaluations of the bounds, used to illustrate their tightness.
Main Results¶
Error upper bound vs. lower bound comparison (\(\delta = 0.05\)):
| VC dim \(d\) | Sample size \(m\) | \(\tau\) | Ours (upper bound) | FOCS'24 lower bound | ERM bound | Gap ratio |
|---|---|---|---|---|---|---|
| 10 | 100 | 0.1 | 0.247 | 0.231 | 0.312 | 1.07× |
| 10 | 100 | 0.01 | 0.132 | 0.118 | 0.215 | 1.12× |
| 10 | 1000 | 0.1 | 0.138 | 0.131 | 0.156 | 1.05× |
| 10 | 1000 | 0.01 | 0.035 | 0.031 | 0.058 | 1.13× |
| 50 | 500 | 0.1 | 0.295 | 0.278 | 0.368 | 1.06× |
| 50 | 500 | 0.01 | 0.165 | 0.148 | 0.252 | 1.11× |
Gap ratio = Ours / FOCS'24 lower bound
Ablation Study¶
Relationship between the constant \(c\) and the number of groups \(T\):
| Number of groups \(T\) | Provable constant \(c\) | Computational overhead |
|---|---|---|
| 2 | 3.5 | 2× ERM |
| 5 | 2.8 | 5× ERM |
| 10 | 2.3 | 10× ERM |
| 20 | 2.1 | 20× ERM |
| 50 | 2.05 | 50× ERM |
A larger number of groups yields a smaller constant, but the computational cost grows linearly.
Key Findings¶
- Matching upper and lower bounds in the small error regime: when \(\tau \approx d/m\), the upper bound of this paper matches the FOCS'24 lower bound up to constant factors, resolving the open problem posed by Hanneke et al.
- The constant \(c = 2.1\) leaves room for improvement: the ideal case is \(c = 1\), which would fully resolve the complexity of agnostic learning.
- Computational efficiency: the learner requires only multiple calls to an ERM oracle plus polynomial-time post-processing.
- ERM suboptimality is more pronounced at small \(\tau\): the gap between the ERM bound and the optimal bound in the table widens as \(\tau\) decreases.
Highlights & Insights¶
- Elegantly bridges realizable and agnostic learning: by introducing the parameter \(\tau\), the paper unifies sample complexity across both settings.
- Clean algorithmic design: based on ERM aggregation rather than a bespoke algorithm, yielding broad applicability.
- Advances a fundamental problem in learning theory: the precise complexity of PAC learning remains incompletely resolved, and this paper constitutes an important step forward.
- Poses a clear open problem: can \(c\) be reduced to 1?
Limitations & Future Work¶
- The constant \(c = 2.1\) remains above the optimal \(c = 1\).
- The number of groups \(T\) governs a tradeoff between the constant and computational efficiency; theoretically, \(c \to 2\) as \(T \to \infty\).
- Only binary classification is considered; extensions to multiclass classification and regression remain unexplored.
- In practice, \(\tau\) is typically unknown; how to adaptively tune algorithm parameters is an open problem.
Related Work & Insights¶
- Hanneke, Larsen, Zhivotovskiy (FOCS 2024): established \(\tau\)-parameterized lower bounds for agnostic learning.
- Classical PAC learning: Vapnik–Chervonenkis theory, Fundamental Theorem of Statistical Learning.
- Optimality of ERM: classical discussions in Shalev-Shwartz & Ben-David's textbook.
This paper demonstrates that even in the most foundational problems of learning theory, fine-grained structure remains to be discovered.
Rating¶
| Dimension | Score (1–10) |
|---|---|
| Novelty | 8 |
| Theoretical Depth | 9 |
| Experimental Thoroughness | 5 |
| Writing Quality | 8 |
| Value | 4 |
| Overall Recommendation | 7.5 |