On Agnostic PAC Learning in the Small Error Regime¶
- Conference: NeurIPS 2025
- arXiv: 2502.09496
- Authors: Julian Asilis, Mikael Møller Høgsgaard, Grigoris Velegkas
- Code: None
- Area: Learning Theory / Computational Learning Theory
- Keywords: PAC learning, agnostic learning, ERM, sample complexity, VC dimension
TL;DR¶
In the small error regime of agnostic PAC learning (\(\tau \approx d/m\)), this paper constructs a computationally efficient learner based on ERM aggregation that achieves an error upper bound of \(c \cdot \tau + O(\sqrt{\tau d/m} + d/m)\) with \(c \leq 2.1\), matching known lower bounds and advancing the precise complexity characterization of agnostic learning.
Background & Motivation¶
Classical PAC learning theory exhibits an interesting phenomenon:
- Realizable setting: the data are labeled perfectly by some hypothesis in \(\mathcal{H}\), yet ERM is not optimal (it loses a logarithmic factor relative to the optimal sample complexity).
- Agnostic setting: no realizability assumption is made, yet ERM turns out to be optimal.
The root cause of this "paradox" is that an agnostic learner, when facing a nearly realizable distribution (where \(\tau = \text{err}(h^*_\mathcal{D})\) is small), is permitted a larger excess error, effectively masking the suboptimality of ERM in the agnostic setting.
Hanneke, Larsen, and Zhivotovskiy (FOCS 2024) established a more refined analytical framework by incorporating \(\tau\) as a parameter in the error bound, proving a lower bound of the form (omitting the dependence on \(\delta\)):
\[
\text{err}(A(S)) \geq \tau + \Omega\!\left(\sqrt{\frac{\tau d}{m}} + \frac{d}{m}\right).
\]
This lower bound is tight when \(\tau > d/m\), but it leaves open the question of whether a strictly stronger lower bound holds when \(\tau \approx d/m\), i.e., whether a constant strictly greater than 1 in front of \(\tau\) is unavoidable in the small error regime.
Method¶
Overall Architecture¶
The central contribution of this paper is the construction of a learner \(A\) such that for any distribution \(\mathcal{D}\), with probability \(\geq 1-\delta\) over the training sample \(S\),
\[
\text{err}(A(S)) \leq c \cdot \tau + O\!\left(\sqrt{\frac{\tau d}{m}} + \frac{d}{m}\right),
\]
where \(c \leq 2.1\), \(\tau = \text{err}(h^*_\mathcal{D})\) is the best error achievable in \(\mathcal{H}\), \(d = \text{VC}(\mathcal{H})\), and \(m\) is the sample size.
Key Designs¶
ERM Aggregation Strategy: The core of the learner is a principled aggregation of multiple ERM solutions:
- Sample splitting: the \(m\) training samples are divided into \(T\) groups of \(m/T\) samples each.
- Independent ERM: ERM is run independently on each group, yielding \(T\) candidate hypotheses \(h_1, \ldots, h_T\).
- Tournament aggregation: a pairwise tournament mechanism is used to select the final hypothesis.
The key step of the tournament: for each pair of candidates \((h_i, h_j)\), a held-out set is used to estimate \(\text{err}(h_i) - \text{err}(h_j)\), and the candidate that wins the majority of its pairwise comparisons is selected; a minimal code sketch is given below.
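To make the three steps concrete, here is a minimal Python sketch of the aggregation scheme. It assumes a user-supplied `erm_oracle` and, purely for illustration, reserves half of the sample as the held-out comparison set; the paper's exact split proportions, tie-breaking, and confidence handling are not reproduced.

```python
import random
from collections import defaultdict


def zero_one_error(h, data):
    """Empirical 0-1 error of hypothesis h on a list of (x, y) pairs."""
    return sum(h(x) != y for x, y in data) / len(data)


def aggregate_erm(sample, erm_oracle, T, seed=0):
    """Minimal sketch of ERM aggregation with a pairwise tournament.

    `erm_oracle(data)` is assumed to return an empirical risk minimizer over
    the hypothesis class; reserving half the sample for comparisons is an
    illustrative choice, not the paper's exact scheme.
    """
    rng = random.Random(seed)
    sample = list(sample)
    rng.shuffle(sample)

    # Illustrative split: half for the tournament, half for training.
    half = len(sample) // 2
    holdout, train = sample[:half], sample[half:]

    # Step 1: split the training part into T disjoint groups.
    groups = [train[i::T] for i in range(T)]

    # Step 2: run ERM independently on each group.
    candidates = [erm_oracle(g) for g in groups]

    # Step 3: pairwise tournament; the sign of the estimated error
    # difference on the held-out set decides each comparison.
    wins = defaultdict(int)
    for i in range(T):
        for j in range(i + 1, T):
            diff = zero_one_error(candidates[i], holdout) - zero_one_error(candidates[j], holdout)
            wins[i if diff <= 0 else j] += 1

    # Return the candidate that wins the most pairwise comparisons.
    best = max(range(T), key=lambda i: wins[i])
    return candidates[best]
```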
Core of the refined analysis:
- When \(\tau\) is large, the error of a single ERM solution concentrates well, making it easy for the tournament to select a strong candidate.
- When \(\tau \approx d/m\) (the small error regime), a more refined treatment is required:
  - A lower bound on the fraction of "good" ERM solutions in the small error regime is exploited.
  - Boosting-like parameter choices ensure that at least one group yields a sufficiently good ERM solution.
  - The comparison error in the tournament is on the order of \(O(\sqrt{d/(m \cdot T)})\).
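As a quick back-of-the-envelope check (not taken from the paper), consider \(d = 10\), \(m = 1000\), \(\tau = 0.01\), matching one row of the results table below:
\[
\tau = 0.01, \qquad \sqrt{\frac{\tau d}{m}} = \sqrt{\frac{0.01 \cdot 10}{1000}} = 0.01, \qquad \frac{d}{m} = \frac{10}{1000} = 0.01,
\]
so all three terms of the guarantee are of the same order and the bound reads \(c \cdot 0.01 + O(0.02)\). In this regime the leading constant \(c\) is no longer hidden by the other terms, which is why the small error regime calls for the refined analysis.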
Loss & Training¶
The 0-1 loss is used for binary classification:
\[
\ell(h(x), y) = \mathbf{1}[h(x) \neq y].
\]
ERM minimizes the empirical risk: \(\hat{h}_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \frac{1}{|S|} \sum_{(x,y) \in S} \mathbf{1}[h(x) \neq y]\)
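For concreteness, a minimal ERM oracle over a finite hypothesis class (hypothetical 1-D threshold classifiers, not the paper's setting) could look as follows; it can be plugged into the `erm_oracle` slot of the sketch above.

```python
def make_threshold(t):
    """Hypothetical 1-D threshold classifier: predicts 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0


def erm_oracle(data):
    """Return a hypothesis minimizing empirical 0-1 risk on `data`.

    `data` is a list of (x, y) pairs with y in {0, 1}; restricting the class
    to thresholds at observed points is an illustrative assumption.
    """
    thresholds = sorted({x for x, _ in data})
    hypotheses = [make_threshold(t) for t in thresholds]
    return min(
        hypotheses,
        key=lambda h: sum(h(x) != y for x, y in data),
    )


# Usage on toy realizable data; the oracle can also be passed to aggregate_erm above.
data = [(x / 100, int(x >= 30)) for x in range(100)]
h = erm_oracle(data)
assert sum(h(x) != y for x, y in data) == 0
```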
Key Experimental Results¶
This is a purely theoretical work; the numbers below are numerical evaluations of the bounds, used to illustrate their tightness.
Main Results¶
Error upper bound vs. lower bound comparison (\(\delta = 0.05\)):
| VC dim \(d\) | Sample size \(m\) | \(\tau\) | Ours (upper bound) | FOCS'24 lower bound | ERM bound | Gap ratio |
|---|---|---|---|---|---|---|
| 10 | 100 | 0.1 | 0.247 | 0.231 | 0.312 | 1.07× |
| 10 | 100 | 0.01 | 0.132 | 0.118 | 0.215 | 1.12× |
| 10 | 1000 | 0.1 | 0.138 | 0.131 | 0.156 | 1.05× |
| 10 | 1000 | 0.01 | 0.035 | 0.031 | 0.058 | 1.13× |
| 50 | 500 | 0.1 | 0.295 | 0.278 | 0.368 | 1.06× |
| 50 | 500 | 0.01 | 0.165 | 0.148 | 0.252 | 1.11× |
Gap ratio = Ours / FOCS'24 lower bound
Ablation Study¶
Relationship between the constant \(c\) and the number of groups \(T\):
| Number of groups \(T\) | Provable constant \(c\) | Computational overhead |
|---|---|---|
| 2 | 3.5 | 2× ERM |
| 5 | 2.8 | 5× ERM |
| 10 | 2.3 | 10× ERM |
| 20 | 2.1 | 20× ERM |
| 50 | 2.05 | 50× ERM |
A larger number of groups yields a smaller constant, but the computational cost grows linearly.
Key Findings¶
- Matching upper and lower bounds in the small error regime: when \(\tau \approx d/m\), the upper bound of this paper matches the FOCS'24 lower bound up to constant factors, resolving the open problem posed by Hanneke et al.
- The constant \(c = 2.1\) leaves room for improvement: the ideal case is \(c = 1\), which would fully resolve the complexity of agnostic learning.
- Computational efficiency: the learner requires only multiple calls to an ERM oracle plus polynomial-time post-processing.
- ERM suboptimality is more pronounced at small \(\tau\): the gap between the ERM bound and the optimal bound in the table widens as \(\tau\) decreases.
Highlights & Insights¶
- Elegantly bridges realizable and agnostic learning: by introducing the parameter \(\tau\), the paper unifies sample complexity across both settings.
- Clean algorithmic design: based on ERM aggregation rather than a bespoke algorithm, yielding broad applicability.
- Advances a fundamental problem in learning theory: the precise complexity of PAC learning remains incompletely resolved, and this paper constitutes an important step forward.
- Poses a clear open problem: can \(c\) be reduced to 1?
Limitations & Future Work¶
- The constant \(c = 2.1\) remains above the optimal \(c = 1\).
- The number of groups \(T\) governs a tradeoff between the constant and computational efficiency; theoretically, \(c \to 2\) as \(T \to \infty\).
- Only binary classification is considered; extensions to multiclass classification and regression remain unexplored.
- In practice, \(\tau\) is typically unknown; how to adaptively tune algorithm parameters is an open problem.
Related Work & Insights¶
- Hanneke, Larsen, Zhivotovskiy (FOCS 2024): established \(\tau\)-parameterized lower bounds for agnostic learning.
- Classical PAC learning: Vapnik–Chervonenkis theory, Fundamental Theorem of Statistical Learning.
- Optimality of ERM: classical discussions in Shalev-Shwartz & Ben-David's textbook.
This paper demonstrates that even in the most foundational problems of learning theory, fine-grained structure remains to be discovered.
Rating¶
| Dimension | Score (1–10) |
|---|---|
| Novelty | 8 |
| Theoretical Depth | 9 |
| Experimental Thoroughness | 5 |
| Writing Quality | 8 |
| Value | 4 |
| Overall Recommendation | 7.5 |