
Ensemble++: Scalable Exploration via Ensemble

Conference: NeurIPS 2025
arXiv: 2407.13195
Code: https://github.com/szrlee/Ensemble_Plus_Plus
Area: Exploration / Bandits
Keywords: Thompson Sampling, Ensemble Sampling, Linear Bandit, Exploration-Exploitation, Approximate Posterior

TL;DR

This paper proposes Ensemble++, which achieves regret bounds comparable to exact Thompson Sampling with an ensemble size of only \(\Theta(d\log T)\), using an incremental update mechanism over a shared factor matrix, and extends naturally to nonlinear/neural-network settings.

Background & Motivation

Thompson Sampling (TS) is a classical Bayesian method for balancing exploration and exploitation in sequential decision-making. However, exact posterior sampling is computationally prohibitive in high-dimensional or non-conjugate settings (e.g., neural networks). Ensemble Sampling approximates TS by maintaining \(M\) model replicas, but existing theory requires an ensemble size of \(M = \Omega(T \cdot |\mathcal{X}|)\) to achieve near-optimal regret (Qin et al., 2022), which is infeasible for long horizons or large action spaces.

Key Challenge: How can a practically small ensemble size approximate the posterior sampling of TS while retaining near-optimal regret bounds?

Key Insight: The paper designs a novel "shared-factor ensemble" architecture in which \(M\) shared factor directions, combined through a random linear combination mechanism, approximately represent the posterior covariance matrix, fundamentally reducing the required ensemble size.

Method

Overall Architecture

Ensemble++ maintains a shared matrix factor \(\mathbf{A}_t \in \mathbb{R}^{d \times M}\), which approximates the posterior covariance square root \(\Sigma_t^{1/2}\) via incremental updates. At action selection, a "pseudo-posterior sample" is generated via the random linear combination \(\theta_t(\zeta_t) = \mu_{t-1} + \mathbf{A}_{t-1}\zeta_t\), rather than randomly selecting from \(M\) independent models.
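A minimal NumPy sketch of this sampling step is given below. The shapes and the scaling of the reference distribution \(P_\zeta\) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def sample_pseudo_posterior(mu, A, rng, dist="gaussian"):
    """Pseudo-posterior sample theta = mu_{t-1} + A_{t-1} @ zeta_t (illustrative sketch).

    mu : (d,)   current mean estimate mu_{t-1}
    A  : (d, M) shared factor matrix A_{t-1}, approximating Sigma_{t-1}^{1/2}
    """
    M = A.shape[1]
    if dist == "gaussian":
        zeta = rng.standard_normal(M) / np.sqrt(M)   # scaling is an assumption
    else:                                            # coordinate-type reference distribution
        zeta = np.zeros(M)
        zeta[rng.integers(M)] = rng.choice([-1.0, 1.0])
    return mu + A @ zeta

# Usage with made-up dimensions:
rng = np.random.default_rng(0)
d, M = 16, 8
theta = sample_pseudo_posterior(np.zeros(d), rng.standard_normal((d, M)) / np.sqrt(M), rng)
```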

Key Designs

  1. Shared Factor Incremental Update: Each step requires only \(O(d^2 M)\) computation. After observing the reward, the mean \(\mu_t\) and the shared factor matrix are updated incrementally as \(\mathbf{A}_t = \Sigma_t(\Sigma_{t-1}^{-1}\mathbf{A}_{t-1} + X_t \mathbf{z}_t^\top)\), where \(\mathbf{z}_t \in \mathbb{R}^M\) is a perturbation vector. This avoids retraining from scratch and large-scale matrix factorization; a minimal sketch of this update appears after this list.

  2. Random Linear Combination Sampling: At action selection, a reference vector \(\zeta_t\) is sampled from distribution \(P_\zeta\), and an approximate posterior sample is constructed as \(\theta_t = \mu_{t-1} + \mathbf{A}_{t-1}\zeta_t\). Unlike conventional ensemble sampling that randomly selects a single model, this approach linearly combines all columns, substantially improving information utilization efficiency.

  3. Symmetrized Regression Objective: The base and ensemble parameters are unified under a symmetrized regression objective: \(L(\theta; D, f) = \sum_{m=1}^M \sum_{s \in D} \sum_{\beta \in \{\pm 1\}} (Y_s + \beta \mathbf{z}_{s,m} - f(X_s, \beta e_m))^2 + \lambda\|\theta\|^2\). This admits a closed-form solution in the linear case and is optimized via SGD in the neural network case, providing a seamless bridge from theory to practice.

  4. Neural Network Extension: The linear features are replaced by a learnable neural feature extractor \(h(x;w)\), while maintaining the same incremental update principle. A FIFO buffer and a fixed number of SGD steps ensure constant-time updates.
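The linear-case incremental update from item 1 can be sketched as follows. This is a minimal illustration assuming unit noise variance and Gaussian perturbations \(\mathbf{z}_t\); the explicit matrix inverse is for readability only (a Sherman-Morrison rank-one update would keep the per-step cost at the \(O(d^2 M)\) stated above), and this is not the released implementation.

```python
import numpy as np

def ensemble_pp_linear_update(mu, Sigma_inv, A, x, y, rng):
    """One incremental Ensemble++-style update in the linear case (sketch, unit noise variance).

    mu        : (d,)    posterior mean mu_{t-1}
    Sigma_inv : (d, d)  posterior precision Sigma_{t-1}^{-1}
    A         : (d, M)  shared factor matrix A_{t-1}
    x, y      : observed feature vector X_t and scalar reward Y_t
    """
    M = A.shape[1]
    z = rng.standard_normal(M)                      # perturbation vector z_t (Gaussian assumed)
    Sigma_inv_new = Sigma_inv + np.outer(x, x)      # rank-one precision update
    Sigma_new = np.linalg.inv(Sigma_inv_new)        # explicit inverse for readability only
    mu_new = Sigma_new @ (Sigma_inv @ mu + x * y)
    A_new = Sigma_new @ (Sigma_inv @ A + np.outer(x, z))  # A_t = Sigma_t (Sigma_{t-1}^{-1} A_{t-1} + X_t z_t^T)
    return mu_new, Sigma_inv_new, A_new
```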

Loss & Training

  • Linear case: closed-form update of the shared factor matrix \(\mathbf{A}_t\)
  • Neural case: symmetrized loss + SGD, FIFO buffer of capacity \(C\), \(G\) gradient updates per step (sketched below)
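
A minimal PyTorch-style sketch of the neural training step follows. The model architecture (shared feature extractor, base head, and \(M\) ensemble heads combined by a reference vector), the buffer capacity \(C\), minibatch size, and update count \(G\) are illustrative assumptions, not the paper's released code.

```python
import collections
import torch

class EnsemblePPNet(torch.nn.Module):
    """Illustrative model f(x, zeta): shared features + base head + M ensemble heads."""
    def __init__(self, d, hidden, M):
        super().__init__()
        self.features = torch.nn.Sequential(torch.nn.Linear(d, hidden), torch.nn.ReLU())
        self.base = torch.nn.Linear(hidden, 1)
        self.heads = torch.nn.Linear(hidden, M, bias=False)
        self.M = M

    def forward(self, x, zeta):
        h = self.features(x)
        return self.base(h).squeeze(-1) + self.heads(h) @ zeta  # base + linear combination of heads

def symmetrized_loss(model, X, Y, Z, lam=1e-3):
    """L = sum_m sum_s sum_{beta in +-1} (Y_s + beta z_{s,m} - f(X_s, beta e_m))^2 + lam ||theta||^2."""
    loss = torch.tensor(0.0)
    eye = torch.eye(model.M)
    for m in range(model.M):
        for beta in (1.0, -1.0):
            pred = model(X, beta * eye[m])
            loss = loss + ((Y + beta * Z[:, m] - pred) ** 2).sum()
    return loss + lam * sum((p ** 2).sum() for p in model.parameters())

# FIFO buffer of capacity C; G gradient steps per environment step (values are illustrative).
d, M, C, G, B = 8, 8, 256, 4, 32
buffer = collections.deque(maxlen=C)          # holds (x, y, z) tuples; z drawn at insertion time
model = EnsemblePPNet(d, hidden=64, M=M)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step():
    if len(buffer) < B:
        return
    for _ in range(G):
        idx = torch.randint(len(buffer), (B,)).tolist()
        X = torch.stack([buffer[i][0] for i in idx])
        Y = torch.stack([buffer[i][1] for i in idx])
        Z = torch.stack([buffer[i][2] for i in idx])
        opt.zero_grad()
        symmetrized_loss(model, X, Y, Z).backward()
        opt.step()
```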

Key Experimental Results

Main Results

| Setting | Metric | Ensemble++ | Ensemble+ | EpiNet | Notes |
| --- | --- | --- | --- | --- | --- |
| Quadratic Bandit | Regret | Sublinear | Linear | Linear | Significant advantage under nonlinear rewards |
| Neural Bandit | Accuracy | Highest | Second best | Second best | 2-layer MLP setting |
| UCI Shuttle | Accuracy | Best | Second best | Second best | Real-world dataset |
| Hate Speech (GPT-2) | Accuracy | +5% vs Ensemble+ | Baseline | N/A | Validated at Transformer scale |

Ablation Study

| Configuration | Key Result | Notes |
| --- | --- | --- |
| Ensemble size \(M\) vs. dimension \(d\) | \(M\) scales linearly with \(d\) | Validates the \(M = \Theta(d\log T)\) theory |
| \(M\) vs. action set size \(|\mathcal{X}|\) | \(M\) is independent of \(|\mathcal{X}|\) | Consistent with the theoretical prediction |
| Gaussian vs. coordinate reference distribution | Gaussian superior | Smaller \(\rho/p\) ratio for continuous distributions |
| Buffer size \(C\) | Small buffer suffices | Full history storage not required |

Key Findings

  • Ensemble++ with only \(M=8\) matches exact TS performance at half the computational cost of conventional ensemble sampling.
  • Ensemble size scales linearly with dimension and is independent of action space size, validating the theoretical prediction \(M = \Theta(d\log T)\).
  • The method scales effectively to GPT-2-level Transformers, demonstrating feasibility for extension to large models.

Highlights & Insights

  • Theoretical Contribution: The first proof in linear bandits that incremental ensemble updates with an ensemble size of \(\Theta(d\log T)\) achieve TS-level regret bounds, reducing the dependence on the horizon from linear to logarithmic and removing the dependence on \(|\mathcal{X}|\), compared to the \(\Omega(|\mathcal{X}|T)\) requirement of Qin et al. (2022).
  • Unified Framework: The same algorithm handles all four combinations of compact/finite action sets × stationary/time-varying contexts without modification.
  • Sequential JL Lemma: A sequential Johnson-Lindenstrauss variant is proposed to handle adaptively collected data, addressing the limitation that the standard JL lemma only covers vectors chosen independently of the random projection (the classical statement is recalled below for reference).
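For reference, the classical (non-sequential) guarantee being extended is, roughly, the distributional JL property, stated here from general knowledge rather than as the paper's exact lemma: for fixed vectors \(v_1, \dots, v_n \in \mathbb{R}^d\) and a suitably scaled random matrix \(\Pi \in \mathbb{R}^{M \times d}\) with i.i.d. sub-Gaussian entries,

\[
(1-\epsilon)\,\|v_i\|^2 \;\le\; \|\Pi v_i\|^2 \;\le\; (1+\epsilon)\,\|v_i\|^2 \quad \text{for all } i \in [n]
\]

holds with probability at least \(1-\delta\) once \(M = \Omega(\epsilon^{-2}\log(n/\delta))\). The sequential variant must cover vectors chosen adaptively, i.e., \(v_t\) may depend on earlier realizations of the projection, which the classical statement does not permit.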

Limitations & Future Work

  • The neural extension lacks rigorous theoretical guarantees; theoretical analysis is confined to the linear case.
  • Computational complexity \(O(d^3\log T)\) incurs an additional \(\log T\) factor compared to exact TS at \(O(d^3)\).
  • Validation in truly large-scale LLM agent settings is absent; experiments are limited to the GPT-2 scale.

Comparison with Related Methods

  • vs. Ensemble+ (Osband et al., 2019): Ensemble++ requires neither a large ensemble nor retraining, achieving a superior regret-computation trade-off via shared factor updates.
  • vs. EpiNet (Osband et al., 2023): Both inject uncertainty via architectural modifications, but Ensemble++ provides theoretical guarantees in the linear setting.
  • vs. LMC-TS (Xu et al., 2022): LMC-TS costs \(O(d^2T)\) per step versus \(O(d^2M)\) for Ensemble++, conferring a significant advantage at long horizons.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The shared-factor ensemble architecture and Sequential JL Lemma constitute entirely novel technical contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage of linear and nonlinear bandit settings, though large-scale experiments are limited.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear, and the extension from linear to nonlinear settings follows naturally.
  • Value: ⭐⭐⭐⭐⭐ Addresses the core bottleneck of ensemble sampling and provides a foundational framework for exploration in LLM agents.