Block-Sample MAC-Bayes Generalization Bounds
Conference: ICLR 2026 · arXiv: 2602.12605 · Code: None · Area: LLM Pretraining · Keywords: PAC-Bayes, MAC-Bayes, generalization bounds, information theory, block samples
TL;DR
This paper proposes block-sample MAC-Bayes (mean approximately correct) generalization bounds that partition the training data into \(J\) blocks and replace the monolithic KL divergence with a sum of per-block KL divergences between marginalized posteriors and the prior. The resulting bounds remain finite and meaningful even in settings where the original PAC-Bayes bounds are vacuous (e.g., deterministic learning algorithms such as mean estimation). The paper further establishes that a high-probability (PAC) version of these bounds is generally unattainable.
Background & Motivation
State of the Field
The PAC-Bayes framework is a central tool in statistical learning theory for bounding generalization error. It quantifies the gap between empirical and population loss via the KL divergence between the posterior \(P_{W|S}\) induced by the learning algorithm and a prior \(Q_W\). PAC-Bayes bounds have regained prominence in recent years for their ability to yield non-trivial generalization guarantees for deep neural networks. MAC-Bayes bounds are the expectation counterpart of PAC-Bayes bounds, controlling expected rather than high-probability generalization error.
Limitations of Prior Work
Failure of PAC-Bayes bounds for deterministic algorithms: When the learning algorithm \(P_{W|S}\) is deterministic (e.g., \(W = \frac{1}{n}\sum_i Z_i\)), \(P_{W|S}\) is a Dirac measure, and the KL divergence with respect to any prior \(Q_W\) is infinite, rendering both PAC-Bayes and the corresponding MAC-Bayes bounds vacuous.
The monolithic KL term is too coarse: The global \(D(P_{W|S} \| Q_W)\) summarizes the influence of the entire training set on the hypothesis in a single scalar. When this influence is too strong (as in deterministic algorithms), the bound explodes.
Analogous limitations of information-theoretic bounds: Generalization bounds based on mutual information \(I(W;S)\) suffer from the same deficiency.
Root Cause
The divergence term \(D(P_{W|S} \| Q_W)\) in PAC-Bayes bounds measures the total "informational influence" of the complete training set \(S\) on the hypothesis \(W\); the bound breaks down whenever this influence is large. Intuitively, however, the influence of a small subset (a single block \(S_j\)) on the hypothesis, \(D(P_{W|S_j} \| Q_W)\), can remain finite even when the total influence is infinite.
Paper Goals
(1) Construct a family of MAC-Bayes generalization bounds that exploit block structure and yield meaningful guarantees even when the original PAC-Bayes bounds are vacuous; (2) analyze the optimal choice of block size; (3) investigate whether a high-probability (PAC) counterpart can be obtained.
Starting Point
Inspired by the individual-sample bounds of Bu et al. (2020) in the information-theoretic literature, the paper partitions the training set \(S\) into \(J = n/m\) blocks of size \(m\) and replaces the global KL divergence with the sum of KL divergences between the marginalized posteriors \(P_{W|S_j} := \mathbb{E}_{P_{S \setminus S_j}} P_{W|S}\) and the prior.
Core Idea
By partitioning the training set into blocks and substituting the sum of "partial-data conditional KL divergences" for the "full-data KL divergence," the paper constructs tighter MAC-Bayes generalization bounds.
Method
Overall Architecture
Let the training set \(S = (Z_1, \ldots, Z_n)\) consist of i.i.d. samples and partition it uniformly into \(J = n/m\) blocks \(S_j\) of size \(m\). Define the marginalized posterior \(P_{W|S_j} := \mathbb{E}_{P_{S_1,\ldots,S_{j-1},S_{j+1},\ldots,S_J}} P_{W|S}\) (this is not the posterior of an algorithm trained only on \(S_j\), but rather the full algorithm's posterior averaged over the remaining blocks). The goal is to establish generalization bounds in which a comparator \(d\) between the expected empirical loss and the expected population loss is controlled by the averaged per-block divergence \(\frac{1}{n}\sum_{j=1}^{J} \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)\), possibly together with a moment-generating-function term.
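To make the marginalization concrete, below is a minimal NumPy sketch (my own illustration; the paper releases no code) that approximates \(P_{W|S_j}\) by Monte Carlo for the deterministic mean estimator \(W = \frac{1}{n}\sum_i Z_i\) used in the paper's Gaussian example. The specific sample sizes and the use of the known data distribution for resampling are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 100, 10      # training-set size and block size (illustrative values)
J = n // m          # number of blocks
mu = 0.0            # true mean of P_Z = N(mu, 1)

S = rng.normal(mu, 1.0, size=n)   # one realization of the training set
blocks = S.reshape(J, m)          # contiguous blocks S_1, ..., S_J

def marginalized_posterior_samples(block, n_mc=10_000):
    """Monte Carlo draws from P_{W|S_j}: keep block S_j fixed and
    resample the remaining n - m points from P_Z = N(mu, 1)."""
    rest = rng.normal(mu, 1.0, size=(n_mc, n - m))
    return (block.sum() + rest.sum(axis=1)) / n

w_given_block = marginalized_posterior_samples(blocks[0])
print("deterministic W from the full sample:", S.mean())
print("P_{W|S_1} mean / std:", w_given_block.mean(), w_given_block.std())
# P_{W|S} is a Dirac mass at mean(S), so D(P_{W|S} || Q_W) is infinite,
# whereas P_{W|S_1} is a smooth (Gaussian) distribution whose KL to a
# well-chosen continuous prior stays finite.
```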
Key Designs

- Core theorem (Theorem 1) — Block-sample MAC-Bayes bound:
    - Function: Provides a general block-sample generalization bound, parameterized by a divergence function \(d\) and a moment-generating-function condition \(\Phi_m\).
    - Mechanism: Jensen's inequality (joint convexity of \(d\)) pulls the expectation inside \(d\); Fubini's theorem decouples the blocks; the Donsker–Varadhan variational representation is then applied per block to perform the measure change from posterior to prior. Crucially, the per-block KL divergence \(D(P_{W|S_j} \| Q_W)\) depends only on the marginalized distribution \(P_{W|S_j}\) and can be far smaller than the full-data \(D(P_{W|S} \| Q_W)\) when \(m \ll n\).
    - Design Motivation: For deterministic algorithms, \(D(P_{W|S} \| Q_W) = \infty\), whereas \(D(P_{W|S_j} \| Q_W)\) is finite because \(P_{W|S_j}\), obtained by averaging over the remaining blocks, is a "smoothed" distribution rather than a Dirac measure.
- Catoni function specialization (Corollary 1):
    - Function: Specializes the bound to bounded losses \(\ell(w,z) \in [0,1]\), using the Catoni function as the comparison function.
    - Core result: \(\mathbb{E}_{P_S} C_\beta(\mathbb{E}_{P_{W|S}} \hat{L}, \mathbb{E}_{P_{W|S}} L) \leq \frac{1}{n} \sum_j \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)\), with the moment-generating-function term eliminated entirely. A generalization-error bound \(\text{gen} \leq \sqrt{\frac{1}{4n} \sum_j \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)}\) follows immediately (a numerical sketch of Corollaries 1 and 2 follows this list).
    - This is the tightest version; the corresponding binary-KL and difference-function specializations are both looser.
- Sub-Gaussian loss extension (Corollary 2):
    - Function: Extends the bound from bounded to \(\sigma^2\)-sub-Gaussian losses.
    - Core result: \(\text{gen} \leq \sqrt{\frac{2\sigma^2}{n} \sum_j \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)}\).
    - Broader applicability at the cost of a slightly looser bound.
- Impossibility of a high-probability version (Theorem 2):
    - Function: Proves that a block-sample PAC-Bayes bound (high-probability version) is generally unattainable.
    - Mechanism: A counterexample learning scenario is constructed in which the algorithm severely overfits the training set with small probability and outputs a zero-loss hypothesis with large probability. In this setting the MAC-Bayes bound converges at rate \(\mathcal{O}(n^{-1/2})\), yet any PAC-Bayes bound of the form \(P_S(\text{gen} > A_n + B_n f(1/\delta)) \leq \delta\) must either have \(f\) growing faster than logarithmically or have \(B_n\) converging slowly.
    - Significance: This result draws a fundamental distinction between MAC-Bayes and PAC-Bayes in the block-sample setting.
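As mentioned above, here is a minimal sketch of how the two corollaries would be evaluated once the expected per-block divergences are available; the per-block KL values below are placeholders for illustration, not quantities computed in the paper.

```python
import numpy as np

def catoni_bound(block_kls, n):
    """Corollary 1 (losses in [0, 1]):
    gen <= sqrt( (1 / (4n)) * sum_j E D(P_{W|S_j} || Q_W) )."""
    return np.sqrt(np.sum(block_kls) / (4 * n))

def subgaussian_bound(block_kls, n, sigma2):
    """Corollary 2 (sigma^2-sub-Gaussian losses):
    gen <= sqrt( (2 * sigma^2 / n) * sum_j E D(P_{W|S_j} || Q_W) )."""
    return np.sqrt(2 * sigma2 * np.sum(block_kls) / n)

n, m = 1_000, 10
J = n // m
block_kls = np.full(J, 0.5 * m / n)   # placeholder per-block KLs, Theta(m/n) as in mean estimation
print("Catoni bound:      ", catoni_bound(block_kls, n))
print("sub-Gaussian bound:", subgaussian_bound(block_kls, n, sigma2=0.25))
# Both are O(n^{-1/2}) as long as sum_j E D(P_{W|S_j} || Q_W) stays O(1).
```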
Block Size Optimization

Under the assumption \(\mathbb{E}_{P_S} D(P_{W|S_j} \| Q_W) \leq \mathcal{O}(m^\gamma) / \Theta(n)\):

- \(\gamma < 1\): a constant block size \(m\) (including \(m=1\)) is optimal, and the bound decays at rate \(\mathcal{O}(n^{-1/2})\).
- \(\gamma > 1\): the block size should grow linearly, \(m = \Theta(n)\) (with \(m \neq n\) still required).
- \(\gamma = 1\) (the transition point): any choice \(m \neq n\) yields an \(\mathcal{O}(n^{-1/2})\) bound; the Gaussian mean-estimation example falls exactly at this transition.
Key Experimental Results
Numerical Example
The paper includes a single numerical validation based on Gaussian mean estimation (\(Z_i \sim \mathcal{N}(\mu, 1)\), \(W = \frac{1}{n}\sum Z_i\), truncated squared loss) rather than large-scale ML experiments.
| Block size \(m\) | Bound behavior | Remarks |
|---|---|---|
| \(m = n\) (original PAC-Bayes) | \(\infty\) (vacuous) | KL divergence is infinite |
| \(m = 1\) (finest partition) | \(\mathcal{O}(n^{-1/2})\), optimal | Tightest in this example |
| \(1 < m < n\) | Finite and \(\mathcal{O}(n^{-1/2})\) | Relatively insensitive to choice |
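To make the table concrete, here is a minimal closed-form sketch for this example under one convenient (assumed) prior choice, \(Q_W = P_W = \mathcal{N}(\mu, 1/n)\), the marginal law of \(W\); the paper's actual prior may differ. With this choice, \(P_{W|S_j} = \mathcal{N}\big(\frac{m}{n}\bar{Z}_{S_j} + \frac{n-m}{n}\mu,\ \frac{n-m}{n^2}\big)\), and the expected per-block KL reduces to \(\frac{1}{2}\log\frac{n}{n-m}\).

```python
import numpy as np

def expected_block_kl(n, m):
    """E_{S_j} D(P_{W|S_j} || Q_W) for W = mean of n i.i.d. N(mu, 1) samples,
    with the assumed prior Q_W = N(mu, 1/n). Closed form: 0.5 * log(n / (n - m))."""
    return 0.5 * np.log(n / (n - m))

def catoni_gen_bound(n, m):
    """Corollary 1 form: sqrt( (1 / (4n)) * sum_{j=1}^{n/m} E D(P_{W|S_j} || Q_W) )."""
    J = n // m
    return np.sqrt(J * expected_block_kl(n, m) / (4 * n))

n = 10_000
for m in (1, 10, 100, n // 2):
    print(f"m = {m:5d}:  bound = {catoni_gen_bound(n, m):.4f}")
# m = n would give log(n / 0), i.e. an infinite KL: the vacuous original PAC-Bayes case.
# Every m < n yields a finite O(n^{-1/2}) bound, smallest here at m = 1.
```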
Theoretical Comparison
| Bound type | Comparator \(d\) | Bound form | Requires \(m \neq n\) |
|---|---|---|---|
| Corollary 1 (Catoni) | \(C_\beta\) | \(\sqrt{\frac{1}{4n}\sum D}\) | Yes |
| Eq. (11) (direct binary KL substitution) | \(\text{kl}\) | \(\sqrt{\frac{\log(2\sqrt{m})}{m} + \frac{1}{n}\sum D}\) | Yes, and looser |
| Corollary 2 (sub-Gaussian) | \(s - r\) | \(\sqrt{\frac{2\sigma^2}{n}\sum D}\) | Yes |
| Original PAC-Bayes (\(m = n\)) | Same | \(\frac{D(P_{W \mid S} \,\Vert\, Q_W) + \cdots}{n}\) | N/A (the \(m = n\) case; vacuous for deterministic algorithms) |
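For intuition on the comparison in the table, here is a toy evaluation of the Catoni form against the binary-KL substitution form as written above; the per-block divergence values are placeholders rather than quantities from the paper.

```python
import numpy as np

n, m = 10_000, 10
J = n // m
sum_D = J * (0.5 * m / n)   # placeholder: sum_j E D(P_{W|S_j} || Q_W), roughly O(1)

catoni_form    = np.sqrt(sum_D / (4 * n))                          # Corollary 1
binary_kl_form = np.sqrt(np.log(2 * np.sqrt(m)) / m + sum_D / n)   # Eq. (11) form from the table

print(f"Catoni form:    {catoni_form:.4f}")     # ~0.0035
print(f"binary-KL form: {binary_kl_form:.4f}")  # ~0.43, dominated by the extra log(2*sqrt(m))/m term
```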
Key Findings
- Block-sample bounds provide meaningful \(\mathcal{O}(n^{-1/2})\) convergence guarantees in settings where the original PAC-Bayes bounds are completely vacuous.
- The bounds are relatively insensitive to the choice of block size \(m\) (as long as \(m \neq n\)), though \(m = 1\) is optimal in the Gaussian mean-estimation example.
- The Catoni function specialization (Corollary 1) strictly dominates the binary KL and difference-function alternatives.
- The impossibility of a high-probability version constitutes a fundamental rather than a merely technical limitation.
Highlights & Insights
- The "block marginalization" idea is remarkably elegant: by averaging the deterministic \(P_{W|S}\) over part of the data to obtain the "smoothed" \(P_{W|S_j}\), a Dirac measure is transformed into a continuous distribution, converting an infinite KL divergence into a finite one. This trick reveals that the failure of PAC-Bayes bounds is not a deficiency of the framework per se, but rather a consequence of the KL divergence operating at too coarse a granularity.
- Counterexample construction technique: The counterexample in Theorem 2 ("severe overfitting with small probability, perfect generalization with large probability") is an elegant probabilistic construction that cleanly exposes the unbridgeable gap between expectation bounds and high-probability bounds.
- The advantage of the Catoni comparator function is further amplified in the block-sample setting: it eliminates the moment-generating function term entirely, whereas the binary KL and difference-function substitutions both introduce additional terms.
Limitations & Future Work
- Only a simple numerical example: The paper validates only the toy Gaussian mean-estimation example and does not demonstrate effectiveness on any practical ML algorithm (e.g., SGD-trained neural networks). The authors acknowledge that "substantial follow-up work is needed to address practical applications."
- Dependence on the data distribution: The divergence term \(D(P_{W|S_j} \| Q_W)\) depends on the data-generating distribution \(P_Z\), making the bound uncomputable without knowledge of \(P_Z\). This is a common limitation of information-theoretic generalization bounds.
- Computational difficulty of the marginalized posterior: \(P_{W|S_j}\) requires averaging over the remaining \(J-1\) blocks, which is analytically or computationally intractable for complex learning algorithms such as deep networks.
- Future directions: (1) further upper-bounding the divergence term by exploiting properties of the learning algorithm; (2) deriving computable versions for practical deep learning settings; (3) exploring alternative blocking strategies (e.g., random vs. contiguous blocks).
Related Work & Insights
- vs. Bu et al. (2020) individual-sample bounds: Bu et al. decompose the mutual-information bound into per-sample terms \(I(W; Z_i)\); the present work pursues an analogous decomposition within the PAC-Bayes framework using KL divergence and block structure. The block approach is more flexible, as the block size can be tuned to optimize the bound.
- vs. Harutyunyan et al. (2021, 2022): They also consider subset mutual information bounds and the impossibility of high-probability versions, but under different system assumptions and restricted to the special case \(m=1\).
- vs. Wu et al. (2024) recursive PAC-Bayes: They also employ block partitioning but with a recursive structure, precluding direct comparison.
- This work demonstrates that the PAC-Bayes framework still offers substantial room for improvement, and that information decomposition via a "divide-and-conquer" strategy is a promising direction.
Rating
- Novelty: ⭐⭐⭐⭐ The block-sample decomposition idea is novel and theoretically deep; the impossibility result enhances completeness.
- Experimental Thoroughness: ⭐⭐ Only a single toy numerical example is provided, with no validation on real ML scenarios; this is, however, acceptable for a purely theoretical contribution.