Laplacian Kernelized Bandit¶
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=5majEbTAZa
Code: Not publicly available
Area: Learning Theory / Online Learning / Kernelized Bandits
Keywords: Multi-user Contextual Bandits, Gang of Bandits, Graph Homophily, RKHS, Gaussian Process, Effective Dimension, Regret Bound
TL;DR¶
This paper reduces the multi-user, nonlinear reward Gang-of-Bandits problem on graphs to learning a single "lifted function" in a unified multi-user RKHS. The reproducing kernel elegantly fuses the graph Laplacian and the arm kernel as \(K((x,u),(x',u'))=[L_\rho^{-1}]_{u,u'}K_x(x,x')\), leading to the design of LK-GP-UCB / LK-GP-TS algorithms with theoretical regret guarantees.
Background & Motivation¶
- Background: Graph structures are ubiquitous in sequential decision-making (capturing similarities between users, items, or sensors). The classical paradigm for multi-user contextual bandits is the Gang of Bandits (GOB), which treats the reward functions \(\{f_u\}\) of \(n\) users as smooth signals on a graph. It utilizes the graph Laplacian to penalize "roughness" for information sharing, which is more efficient than independent learning for each user.
- Limitations of Prior Work: Foundational works like GoB.Lin assume linear rewards \(f_u(x)=\theta_u^\top x\), and subsequent improvements mostly remain within this linear paradigm. However, rewards in real-world scenarios such as recommendation systems and personalized medicine are often nonlinear. While single-agent kernelized bandits can handle nonlinearity, previous attempts in the "multi-user + graph" setting only used heuristic products of user and arm kernels (e.g., COOP-KernelUCB), leaving a gap between modeling objectives and final algorithms.
- Key Challenge: How to simultaneously characterize the nonlinear structure of rewards relative to arm features and encode graph homophily (similar rewards for connected users), while ensuring this regularization derives a theoretically guaranteed and computationally feasible exploration algorithm—rather than an ad-hoc heuristic kernel.
- Goal: Starting from first principles, design a principled joint penalty for the "set of user reward functions" (graph smoothing + individual complexity) and prove it is equivalent to a norm in a unified space, thereby transforming multi-function learning into single-function learning to leverage mature GP bandit frameworks.
- Core Idea: [Unified Kernel] The joint penalty \(\mathrm{PEN}(f_{1:n};\rho)\) is not a makeshift regularization term but is precisely the squared norm of a single RKHS defined on the product domain \(\mathcal U\times\mathcal D\), where the reproducing kernel tensor-fuses the graph Green's function \(L_\rho^{-1}\) with the arm kernel \(K_x\).
Method¶
Overall Architecture¶
The framework consists of three steps: (1) Construct a joint penalty for the function set \(\{f_u\}\) using a "graph smoothing term + ridge penalty"; (2) Prove that this penalty equals the squared RKHS norm of the lifted function \(f(x,u):=f_u(x)\) on the product space \(\mathcal H=\mathcal H_G\otimes\mathcal H_x\), and derive the reproducing kernel explicitly; (3) Place a Gaussian Process prior on this unified kernel, using GP posterior mean/variance to drive UCB and Thompson Sampling exploration, and provide regret bounds based on the "effective dimension."
flowchart LR
A["Joint Penalty<br/>Graph Smoothing + Ridge<br/>PEN(f₁:ₙ;ρ)"] --> B["Theorem 2.1<br/>= Unified RKHS Squared Norm<br/>Explicit Reproducing Kernel K"]
B --> C["Lifted Function f(x,u)=fᵤ(x)<br/>Single GP Prior"]
C --> D["GP Posterior μ,σ<br/>recursive O(t²) update"]
D --> E1["LK-GP-UCB<br/>μ+βσ"]
D --> E2["LK-GP-TS<br/>μ+νzσ"]
E1 --> F["Regret Bound Õ(d̃√T)<br/>Effective Dimension d̃"]
E2 --> F
Key Designs¶
1. Joint Penalty: Integrating "Graph Homophily + Individual Smoothness" into one objective. The authors require an ideal set of reward functions to be smooth on the graph (homophily) and have low complexity individually within the RKHS \(\mathcal H_x\). These are combined additively into a single penalty—the graph smoothing term measures differences between connected users via squared distance in the RKHS, while a standard ridge term constrains individual norms: $\(\mathrm{PEN}(f_{1:n};\rho)=\underbrace{\tfrac12\sum_{i,j}w_{ij}\|f_i-f_j\|_{\mathcal H_x}^2}_{\text{Graph Smoothing}}+\underbrace{\rho\sum_i\|f_i\|_{\mathcal H_x}^2}_{\text{Ridge}}=\sum_{i,j}[L_\rho]_{ij}\langle f_i,f_j\rangle_{\mathcal H_x},\)$ where \(L_\rho=L+\rho I\) is the regularized Laplacian. This form is crucial: it unifies graph structure (coupling users via off-diagonal terms of \(L_\rho\)) and individual regularization (diagonal \(\rho\)) into a quadratic form, setting the stage for proving it is a norm.
2. Unified Multi-user Kernel: Tensor fusion of Laplacian and arm kernel (Core Contribution). Theorem 2.1 proves that the penalty above is exactly the squared RKHS norm of the "lifted function" \(f(x,u):=f_u(x)\) on the product space \(\mathcal H=\mathcal H_G\otimes\mathcal H_x\), where the kernel of \(\mathcal H_G\) is \(K_G(u,u')=[L_\rho^{-1}]_{u,u'}\). This explicitly derives the multi-user reproducing kernel: $\(K((x,u),(x',u'))=[L_\rho^{-1}]_{u,u'}\,K_x(x,x').\)$ Intuitively, \(K_x\) measures similarity between arms, while \(L_\rho^{-1}\) (the graph Green’s function) measures similarity between users—\([L_\rho^{-1}]_{u,u'}\) accumulates connection strength through all paths on the graph. The associated feature map is \(\phi(x,u)=L_\rho^{-1/2}e_u\otimes\varphi(x)\). This step signifies that learning "\(n\) correlated functions" can be rewritten as learning a "single function in a well-defined kernel space," removing heuristic guesswork and enabling the direct use of mature GP/Kernel bandit tools.
3. GP Exploration on Unified Kernel: UCB and Thompson Sampling. A zero-mean GP prior \(f\sim\mathrm{GP}(0,K)\) is placed on the lifted function. Given history, the posterior for any \((x,u)\) is \(\mathcal N(\mu_{u,t},\sigma_{u,t}^2)\), with mean and variance: $\(\mu_{u,t}(x)=k_t(x,u)^\top(K_t+\lambda I)^{-1}y_t,\quad \sigma_{u,t}^2(x)=K((x,u),(x,u))-k_t^\top(K_t+\lambda I)^{-1}k_t.\)$ The posterior mean matches the solution of offline "Kernel Laplacian Regularized Regression (KLRR)," aligning online exploration with batch learning objectives. Two decision rules follow naturally: LK-GP-UCB selects \(x_t=\arg\max_x\mu_{u_t,t-1}(x)+\beta_t\sigma_{u_t,t-1}(x)\) (optimism under uncertainty), and LK-GP-TS selects \(x_t=\arg\max_x\mu+\nu_t z_t(x)\sigma\) (probability matching, \(z_t\sim\mathcal N(0,1)\)). To address the \(O(t^3)\) cost of naive posterior updates, the authors use recurrence formulas to reduce mean/variance updates to \(O(t^2)\), employing a hybrid strategy of exact inversion for small \(t\) and recurrence for large \(t\) to balance stability and efficiency.
4. Effective Dimension Regret Bounds: Regret independent of user count, dependent on spectral structure. The confidence bound (Theorem 4.1) maintains the precise \(\log\det(I_t+\lambda^{-1}K_t)\) term (avoiding loosening via information-theoretic quantities) and does not require \(\lambda\ge1\). The effective dimension is defined as \(\tilde d=\frac{\log\det(I_T+K_T/\lambda)}{\log(1+TK_{\max}/\lambda)}\), a robust measure of the "dimensionality" explored by the algorithm. Both algorithms satisfy \(R_T=\tilde O(\tilde d\sqrt T)\), where regret depends on \(\tilde d\) rather than the user count \(n\) or feature dimension. Spectral analysis reveals that since \(K=K_G\otimes K_x\), its spectrum is the pairwise product of eigenvalues; under strong homophily in a complete graph, \(L_\rho^{-1}\) is approximately rank-1. "Head+Tail" decomposition (Eq. 14) shows that for \(T\le Cn\), \(\gamma_T^{\text{clique}}=O(1)\), meaning \(\tilde d=O(1/\log T)\) and regret is independent of \(n\)—a phenomenon absent in single-user settings. For \(k\) clusters, \(\tilde d=O(k/\log T)\), effectively "soft-counting" the number of user clusters.
Key Experimental Results¶
Evaluated in synthetic environments with fixed noise \(\sigma=0.1\) and \(n=20\) users, across three difficulty levels (Easy: 10 arms/50% visibility/\(T{=}1000\); Medium: 20 arms/25%/\(T{=}3000\); Hard: 50 arms/10%/\(T{=}5000\)). User graphs were generated via Erdős–Rényi or RBF random graphs.
Main Results¶
| Dimension | Setting |
|---|---|
| Ours | LK-GP-UCB, LK-GP-TS |
| Baselines | GraphUCB, GoB.Lin, COOP-KernelUCB, GP-UCB, Pooled LinUCB, Per-User LinUCB |
| Reward Regime | Linear-GOB (linear, \(\Theta=(I+\eta L)^{-1}\Theta_0\)); Laplacian-Kernel (nonlinear, generated via unified kernel GP draw / representer draw) |
| Evaluation Metrics | Cumulative regret across 9 data environments |
Key Findings¶
- Full Dominance in Nonlinear Regimes: In both Laplacian-Kernel settings (GP draw / representer draw), LK-GP-UCB and LK-GP-TS remain the optimal choices, consistently ranking first in GP draw settings.
- Only Sublinear Performance under Representer Draw: In this setting, both algorithms from this work achieve sublinear regret, while most baselines fail to reach sublinearity, leading to a widening gap over time.
- Competitive in Linear Regimes: Even in Linear-GOB—the home turf of linear bandits—the proposed methods outperform most baselines by a significant margin, confirming that they do not sacrifice performance when rewards are actually linear.
- Regret Independent of \(n\) under Strong Homophily: Figure 1 demonstrates that information gain \(\gamma_T\) barely grows (or even slightly decreases) with \(n\) under a complete graph, and \(\tilde d\) grows much slower than \((\log T)^d\), corresponding to the "rank collapse" phenomenon.
- Similarity Kernel Validation: Five similarity kernels were tested for COOP-KernelUCB; experiments confirmed \(L_\rho^{-1}\) as the optimal choice (empirical MMD closest to optimal). Even when provided with its best kernel, LK-GP-UCB remains consistently superior.
Highlights & Insights¶
- "Heuristic Kernel → First-Principles Kernel" Paradigm Shift: Instead of arbitrarily choosing a "user kernel × arm kernel" product, the authors prove it is the unique reproducing kernel \(L_\rho^{-1}K_x\) corresponding to the joint penalty, aligning the modeling objective perfectly with the algorithm.
- Leveraging a Unified Perspective: By compressing a multi-user problem into "single function + single kernel," the entire GP/Kernel bandit toolbox (posteriors, confidence bounds, SupKernelUCB improvements) becomes plug-and-play, reducing both theoretical and engineering overhead.
- Insights into Spectral Collapse: When strong homophily exists (complete graphs or few clusters), the kernel is approximately low-rank. The effective dimension \(\tilde d\) degrades to "number of user clusters / \(\log T\)," directly explaining why sharing graph information allows regret to escape dependence on \(n\).
Limitations & Future Work¶
- Synthetic Data Limitations: Validation is limited to ER/RBF random graphs and synthetic rewards, lacking verification on real-world recommendation/medical datasets or actual user graphs.
- Static and Known Graphs: The method assumes the user graph \(G\) is fully known and does not handle unknown, noisy, or time-evolving graphs.
- Loose Regret Bound for Infinite Arms: In the worst case (independent users), the \(\tilde d\sqrt T\) bound is looser than the independent learning bound \(\sqrt{nT}\) by a factor of \(\sqrt n\). The authors suggest that SupKernelUCB-style techniques could recover \(\sqrt{\tilde dT}\) in consistently finite arm spaces, but this was not implemented in the main algorithm.
- Computation and Hyperparameters: While updates are reduced to \(O(t^2)\), they remain heavy for large-scale/long-term sequences. \(\lambda\) uses spectral ratio scheduling, and \(\beta, \nu\) require tuning, making robustness dependent on parameter selection.
Related Work & Insights¶
- Gang of Bandits Lineage: GoB.Lin (Cesa-Bianchi 2013) and GraphUCB (Yang 2020) model user rewards as smooth linear signals on graphs; this work is a natural nonlinear and kernelized extension.
- Kernelized/GP Bandits: GP-UCB (Srinivas 2009), IGP-UCB (Chowdhury & Gopalan 2017), and SupKernelUCB (Valko 2013) provide the single-agent framework for confidence bounds and regret analysis, which this work elevates to multi-user product kernels.
- Laplacian Regularization and Green's Function Kernels: Smola & Kondor (2003) noted that kernels induced by Laplacian regularization in scalar cases are (regularized) graph Green's functions. This work proves the principle holds in contextual settings with arm features.
- Insight: Reducing "multi-task/multi-user + structural prior" problems to "single-function learning on product kernels" is a reusable reduction paradigm applicable to federated bandits, collaborative filtering, or Bayesian optimization on graphs.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Transforming "joint penalty = unified RKHS norm + explicit fusion kernel" elevates heuristic approaches to first principles. Spectral collapse analysis is insightful, though the blocks (Laplacian reg, GP-UCB, tensor kernels) are established.
- Experimental Thoroughness: ⭐⭐⭐ — 9 synthetic environments with baseline/kernel ablations are systematic, but the lack of real-world datasets and graphs limits the overall persuasiveness.
- Writing Quality: ⭐⭐⭐⭐ — The logical flow from penalty to kernel to algorithm to regret bound is smooth; theorems are clear, and the "Head+Tail" spectral analysis is intuitive.
- Value: ⭐⭐⭐⭐ — Provides a clean, scalable, and theoretically guaranteed unified framework for nonlinear multi-user bandits. This methodological contribution is solid for future research in structured sequential decision-making.