Block-Sample MAC-Bayes Generalization Bounds
Conference: ICLR 2026 · arXiv: 2602.12605 · Code: None · Area: LLM Pretraining · Keywords: PAC-Bayes, MAC-Bayes, generalization bounds, information theory, block samples
TL;DR
This paper proposes block-sample MAC-Bayes (mean approximately correct) generalization bounds that partition the training data into \(J\) blocks and replace the monolithic KL divergence with a sum of per-block KL divergences between marginalized posteriors and the prior. The resulting bounds remain finite and meaningful even in settings where the original PAC-Bayes bounds are vacuous (e.g., deterministic learning algorithms such as mean estimation). The paper further establishes that a high-probability (PAC) version of these bounds is generally unattainable.
Background & Motivation
State of the Field
The PAC-Bayes framework is a central tool in statistical learning theory for bounding generalization error. It quantifies the gap between empirical and population loss via the KL divergence between the posterior \(P_{W|S}\) induced by the learning algorithm and a prior \(Q_W\). PAC-Bayes bounds have regained prominence in recent years for their ability to yield non-trivial generalization guarantees for deep neural networks. MAC-Bayes bounds are the expectation counterpart of PAC-Bayes bounds, controlling expected rather than high-probability generalization error.
Limitations of Prior Work
Failure of PAC-Bayes bounds for deterministic algorithms: When the learning algorithm \(P_{W|S}\) is deterministic (e.g., \(W = \frac{1}{n}\sum_i Z_i\)), \(P_{W|S}\) is a Dirac measure, and the KL divergence with respect to any prior \(Q_W\) is infinite, rendering both PAC-Bayes and the corresponding MAC-Bayes bounds vacuous.
The monolithic KL term is too coarse: The global \(D(P_{W|S} \| Q_W)\) summarizes the influence of the entire training set on the hypothesis in a single scalar. When this influence is too strong (as in deterministic algorithms), the bound explodes.
Analogous limitations of information-theoretic bounds: Generalization bounds based on mutual information \(I(W;S)\) suffer from the same deficiency.
Root Cause
The divergence term \(D(P_{W|S} \| Q_W)\) in PAC-Bayes bounds measures the total "informational influence" of the complete training set \(S\) on the hypothesis \(W\); the bound breaks down whenever this influence is large. Intuitively, however, the influence of a small subset (a single block \(S_j\)) on the hypothesis, \(D(P_{W|S_j} \| Q_W)\), can remain finite even when the total influence is infinite.
Paper Goals
(1) Construct a family of MAC-Bayes generalization bounds that exploit block structure and yield meaningful guarantees even when the original PAC-Bayes bounds are vacuous; (2) analyze the optimal choice of block size; (3) investigate whether a high-probability (PAC) counterpart can be obtained.
Starting Point
Inspired by the individual-sample bounds of Bu et al. (2020) in the information-theoretic literature, the paper partitions the training set \(S\) into \(J = n/m\) blocks of size \(m\) and replaces the global KL divergence with the sum of KL divergences between the marginalized posteriors \(P_{W|S_j} := \mathbb{E}_{P_{S \setminus S_j}} P_{W|S}\) and the prior.
Core Idea
By partitioning the training set into blocks and substituting the sum of "partial-data conditional KL divergences" for the "full-data KL divergence," the paper constructs tighter MAC-Bayes generalization bounds.
Method
Overall Architecture
Let the training set \(S = (Z_1, \ldots, Z_n)\) consist of i.i.d. samples and partition it uniformly into \(J = n/m\) blocks \(S_j\) of size \(m\). Define the marginalized posterior \(P_{W|S_j} := \mathbb{E}_{P_{S_1,\ldots,S_{j-1},S_{j+1},\ldots,S_J}} P_{W|S}\) (this is not the posterior of an algorithm trained only on \(S_j\), but rather the full algorithm's posterior averaged over the remaining blocks). The goal is to establish generalization bounds in which a comparator \(d\) between the expected empirical loss and the expected population loss is controlled by the averaged per-block divergence \(\frac{1}{n}\sum_{j=1}^{J} \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)\), possibly together with a moment-generating-function term.
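To make the marginalization concrete, below is a minimal NumPy sketch (my own illustration; the paper releases no code) that approximates \(P_{W|S_j}\) by Monte Carlo for the deterministic mean estimator \(W = \frac{1}{n}\sum_i Z_i\) used in the paper's Gaussian example. The specific sample sizes and the use of the known data distribution for resampling are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 100, 10      # training-set size and block size (illustrative values)
J = n // m          # number of blocks
mu = 0.0            # true mean of P_Z = N(mu, 1)

S = rng.normal(mu, 1.0, size=n)   # one realization of the training set
blocks = S.reshape(J, m)          # contiguous blocks S_1, ..., S_J

def marginalized_posterior_samples(block, n_mc=10_000):
    """Monte Carlo draws from P_{W|S_j}: keep block S_j fixed and
    resample the remaining n - m points from P_Z = N(mu, 1)."""
    rest = rng.normal(mu, 1.0, size=(n_mc, n - m))
    return (block.sum() + rest.sum(axis=1)) / n

w_given_block = marginalized_posterior_samples(blocks[0])
print("deterministic W from the full sample:", S.mean())
print("P_{W|S_1} mean / std:", w_given_block.mean(), w_given_block.std())
# P_{W|S} is a Dirac mass at mean(S), so D(P_{W|S} || Q_W) is infinite,
# whereas P_{W|S_1} is a smooth (Gaussian) distribution whose KL to a
# well-chosen continuous prior stays finite.
```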
Key Designs

- Core theorem (Theorem 1) — Block-sample MAC-Bayes bound:
    - Function: Provides a general block-sample generalization bound, parameterized by a divergence function \(d\) and a moment-generating-function condition \(\Phi_m\).
    - Mechanism: Jensen's inequality (joint convexity of \(d\)) pulls the expectation inside \(d\); Fubini's theorem decouples the blocks; the Donsker–Varadhan variational representation is then applied per block to perform the measure change from posterior to prior. Crucially, the per-block KL divergence \(D(P_{W|S_j} \| Q_W)\) depends only on the marginalized distribution \(P_{W|S_j}\) and can be far smaller than the full-data \(D(P_{W|S} \| Q_W)\) when \(m \ll n\).
    - Design Motivation: For deterministic algorithms, \(D(P_{W|S} \| Q_W) = \infty\), whereas \(D(P_{W|S_j} \| Q_W)\) is finite because \(P_{W|S_j}\), obtained by averaging over the remaining blocks, is a "smoothed" distribution rather than a Dirac measure.
- Catoni function specialization (Corollary 1):
    - Function: Specializes the bound to bounded losses \(\ell(w,z) \in [0,1]\), using the Catoni function as the comparison function.
    - Core result: \(\mathbb{E}_{P_S} C_\beta(\mathbb{E}_{P_{W|S}} \hat{L}, \mathbb{E}_{P_{W|S}} L) \leq \frac{1}{n} \sum_j \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)\), with the moment-generating-function term eliminated entirely. A generalization-error bound \(\text{gen} \leq \sqrt{\frac{1}{4n} \sum_j \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)}\) follows immediately (a numerical sketch of Corollaries 1 and 2 follows this list).
    - This is the tightest version; the corresponding binary-KL and difference-function specializations are both looser.
- Sub-Gaussian loss extension (Corollary 2):
    - Function: Extends the bound from bounded to \(\sigma^2\)-sub-Gaussian losses.
    - Core result: \(\text{gen} \leq \sqrt{\frac{2\sigma^2}{n} \sum_j \mathbb{E}_{P_{S_j}} D(P_{W|S_j} \| Q_W)}\).
    - Broader applicability at the cost of a slightly looser bound.
- Impossibility of a high-probability version (Theorem 2):
    - Function: Proves that a block-sample PAC-Bayes bound (high-probability version) is generally unattainable.
    - Mechanism: A counterexample learning scenario is constructed in which the algorithm severely overfits the training set with small probability and outputs a zero-loss hypothesis with large probability. In this setting the MAC-Bayes bound converges at rate \(\mathcal{O}(n^{-1/2})\), yet any PAC-Bayes bound of the form \(P_S(\text{gen} > A_n + B_n f(1/\delta)) \leq \delta\) must either have \(f\) growing faster than logarithmically or have \(B_n\) converging slowly.
    - Significance: This result draws a fundamental distinction between MAC-Bayes and PAC-Bayes in the block-sample setting.
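As mentioned above, here is a minimal sketch of how the two corollaries would be evaluated once the expected per-block divergences are available; the per-block KL values below are placeholders for illustration, not quantities computed in the paper.

```python
import numpy as np

def catoni_bound(block_kls, n):
    """Corollary 1 (losses in [0, 1]):
    gen <= sqrt( (1 / (4n)) * sum_j E D(P_{W|S_j} || Q_W) )."""
    return np.sqrt(np.sum(block_kls) / (4 * n))

def subgaussian_bound(block_kls, n, sigma2):
    """Corollary 2 (sigma^2-sub-Gaussian losses):
    gen <= sqrt( (2 * sigma^2 / n) * sum_j E D(P_{W|S_j} || Q_W) )."""
    return np.sqrt(2 * sigma2 * np.sum(block_kls) / n)

n, m = 1_000, 10
J = n // m
block_kls = np.full(J, 0.5 * m / n)   # placeholder per-block KLs, Theta(m/n) as in mean estimation
print("Catoni bound:      ", catoni_bound(block_kls, n))
print("sub-Gaussian bound:", subgaussian_bound(block_kls, n, sigma2=0.25))
# Both are O(n^{-1/2}) as long as sum_j E D(P_{W|S_j} || Q_W) stays O(1).
```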
Block Size Optimization

Under the assumption \(\mathbb{E}_{P_S} D(P_{W|S_j} \| Q_W) \leq \mathcal{O}(m^\gamma) / \Theta(n)\):

- \(\gamma < 1\): a constant block size \(m\) (including \(m=1\)) is optimal, and the bound decays at rate \(\mathcal{O}(n^{-1/2})\).
- \(\gamma > 1\): the block size should grow linearly, \(m = \Theta(n)\) (with \(m \neq n\) still required).
- \(\gamma = 1\) (the transition point): any choice \(m \neq n\) yields an \(\mathcal{O}(n^{-1/2})\) bound; the Gaussian mean-estimation example falls exactly at this transition.
Key Experimental Results
Numerical Example
The paper includes a single numerical validation based on Gaussian mean estimation (\(Z_i \sim \mathcal{N}(\mu, 1)\), \(W = \frac{1}{n}\sum Z_i\), truncated squared loss) rather than large-scale ML experiments.
| Block size \(m\) | Bound behavior | Remarks |
|---|---|---|
| \(m = n\) (original PAC-Bayes) | \(\infty\) (vacuous) | KL divergence is infinite |
| \(m = 1\) (finest partition) | \(\mathcal{O}(n^{-1/2})\), optimal | Tightest in this example |
| \(1 < m < n\) | Finite and \(\mathcal{O}(n^{-1/2})\) | Relatively insensitive to choice |
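To make the table concrete, here is a minimal closed-form sketch for this example under one convenient (assumed) prior choice, \(Q_W = P_W = \mathcal{N}(\mu, 1/n)\), the marginal law of \(W\); the paper's actual prior may differ. With this choice, \(P_{W|S_j} = \mathcal{N}\big(\frac{m}{n}\bar{Z}_{S_j} + \frac{n-m}{n}\mu,\ \frac{n-m}{n^2}\big)\), and the expected per-block KL reduces to \(\frac{1}{2}\log\frac{n}{n-m}\).

```python
import numpy as np

def expected_block_kl(n, m):
    """E_{S_j} D(P_{W|S_j} || Q_W) for W = mean of n i.i.d. N(mu, 1) samples,
    with the assumed prior Q_W = N(mu, 1/n). Closed form: 0.5 * log(n / (n - m))."""
    return 0.5 * np.log(n / (n - m))

def catoni_gen_bound(n, m):
    """Corollary 1 form: sqrt( (1 / (4n)) * sum_{j=1}^{n/m} E D(P_{W|S_j} || Q_W) )."""
    J = n // m
    return np.sqrt(J * expected_block_kl(n, m) / (4 * n))

n = 10_000
for m in (1, 10, 100, n // 2):
    print(f"m = {m:5d}:  bound = {catoni_gen_bound(n, m):.4f}")
# m = n would give log(n / 0), i.e. an infinite KL: the vacuous original PAC-Bayes case.
# Every m < n yields a finite O(n^{-1/2}) bound, smallest here at m = 1.
```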
Theoretical Comparison
| Bound type | Comparator \(d\) | Bound form | Requires \(m \neq n\) |
|---|---|---|---|
| Corollary 1 (Catoni) | \(C_\beta\) | \(\sqrt{\frac{1}{4n}\sum D}\) | Yes |
| Eq. (11) (direct binary KL substitution) | \(\text{kl}\) | \(\sqrt{\frac{\log(2\sqrt{m})}{m} + \frac{1}{n}\sum D}\) | Yes, and looser |
| Corollary 2 (sub-Gaussian) | \(s - r\) | \(\sqrt{\frac{2\sigma^2}{n}\sum D}\) | Yes |
| Original PAC-Bayes (\(m = n\)) | Same | \(\frac{D(P_{W \mid S} \,\Vert\, Q_W) + \cdots}{n}\) | N/A (the \(m = n\) case; vacuous for deterministic algorithms) |
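For intuition on the comparison in the table, here is a toy evaluation of the Catoni form against the binary-KL substitution form as written above; the per-block divergence values are placeholders rather than quantities from the paper.

```python
import numpy as np

n, m = 10_000, 10
J = n // m
sum_D = J * (0.5 * m / n)   # placeholder: sum_j E D(P_{W|S_j} || Q_W), roughly O(1)

catoni_form    = np.sqrt(sum_D / (4 * n))                          # Corollary 1
binary_kl_form = np.sqrt(np.log(2 * np.sqrt(m)) / m + sum_D / n)   # Eq. (11) form from the table

print(f"Catoni form:    {catoni_form:.4f}")     # ~0.0035
print(f"binary-KL form: {binary_kl_form:.4f}")  # ~0.43, dominated by the extra log(2*sqrt(m))/m term
```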
Key Findings
- Block-sample bounds provide meaningful \(\mathcal{O}(n^{-1/2})\) convergence guarantees in settings where the original PAC-Bayes bounds are completely vacuous.
- The bounds are relatively insensitive to the choice of block size \(m\) (as long as \(m \neq n\)), though \(m = 1\) is optimal in the Gaussian mean-estimation example.
- The Catoni function specialization (Corollary 1) strictly dominates the binary KL and difference-function alternatives.
- The impossibility of a high-probability version constitutes a fundamental rather than a merely technical limitation.
Highlights & Insights
- The "block marginalization" idea is remarkably elegant: by averaging the deterministic \(P_{W|S}\) over part of the data to obtain the "smoothed" \(P_{W|S_j}\), a Dirac measure is transformed into a continuous distribution, converting an infinite KL divergence into a finite one. This trick reveals that the failure of PAC-Bayes bounds is not a deficiency of the framework per se, but rather a consequence of the KL divergence operating at too coarse a granularity.
- Counterexample construction technique: The counterexample in Theorem 2 ("severe overfitting with small probability, perfect generalization with large probability") is an elegant probabilistic construction that cleanly exposes the unbridgeable gap between expectation bounds and high-probability bounds.
- The advantage of the Catoni comparator function is further amplified in the block-sample setting: it eliminates the moment-generating function term entirely, whereas the binary KL and difference-function substitutions both introduce additional terms.
Limitations & Future Work
- Only a simple numerical example: The paper validates only the toy Gaussian mean-estimation example and does not demonstrate effectiveness on any practical ML algorithm (e.g., SGD-trained neural networks). The authors acknowledge that "substantial follow-up work is needed to address practical applications."
- Dependence on the data distribution: The divergence term \(D(P_{W|S_j} \| Q_W)\) depends on the data-generating distribution \(P_Z\), making the bound uncomputable without knowledge of \(P_Z\). This is a common limitation of information-theoretic generalization bounds.
- Computational difficulty of the marginalized posterior: \(P_{W|S_j}\) requires averaging over the remaining \(J-1\) blocks, which is analytically or computationally intractable for complex learning algorithms such as deep networks.
- Future directions: (1) further upper-bounding the divergence term by exploiting properties of the learning algorithm; (2) deriving computable versions for practical deep learning settings; (3) exploring alternative blocking strategies (e.g., random vs. contiguous blocks).
Related Work & Insights
- vs. Bu et al. (2020) individual-sample bounds: Bu et al. decompose the mutual-information bound into per-sample terms \(I(W; Z_i)\); the present work pursues an analogous decomposition within the PAC-Bayes framework using KL divergence and block structure. The block approach is more flexible, as the block size can be tuned to optimize the bound.
- vs. Harutyunyan et al. (2021, 2022): They also consider subset mutual information bounds and the impossibility of high-probability versions, but under different system assumptions and restricted to the special case \(m=1\).
- vs. Wu et al. (2024) recursive PAC-Bayes: They also employ block partitioning but with a recursive structure, precluding direct comparison.
- This work demonstrates that the PAC-Bayes framework still offers substantial room for improvement, and that information decomposition via a "divide-and-conquer" strategy is a promising direction.
Rating
- Novelty: ⭐⭐⭐⭐ The block-sample decomposition idea is novel and theoretically deep; the impossibility result enhances completeness.
- Experimental Thoroughness: ⭐⭐ Only a single toy numerical example is provided, with no validation on real ML scenarios; this is, however, acceptable for a purely theoretical contribution.