Skip to content

Variance-Dependent Regret Lower Bounds for Contextual Bandits

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=zM23kMEt5q
Code: None (Pure theory)
Area: Learning Theory / Online Learning
Keywords: Contextual bandits, Variance-dependent regret, Lower bounds, Peeling technique, High-probability lower bounds

TL;DR

This paper establishes the first lower bound for "variance-dependent regret" in linear contextual bandits that matches the upper bound (up to logarithmic factors) at \(\Omega\!\big(d\sqrt{\sum_k \sigma_k^2}\big)\). This result covers any pre-determined variance sequence as well as adaptive weak adversary sequences. Furthermore, it demonstrates through a counter-example that such lower bounds cannot hold once an adversary can select variances after observing the decision sets.

Background & Motivation

Background: In linear contextual bandits, the classic minimax regret is \(\tilde O(d\sqrt K)\) (where \(d\) is the feature dimension and \(K\) is the number of rounds). Recent research has shifted toward "heteroscedastic" settings—where the noise variance \(\sigma_k^2\) varies per round—improving regret to \(\tilde O\!\big(d\sqrt{\sum_{k=1}^K \sigma_k^2}\big)\). Specifically, the SAVE algorithm by Zhao et al. (2023) achieves this near-optimal upper bound (Theorem 1.1) without prior knowledge of the variances, representing the current state-of-the-art upper bound.

Limitations of Prior Work: Research on variance-dependent regret has been almost exclusively focused on "upper bounds," with a significant lack of "lower bounds." The only existing lower bound comes from Jia et al. (2024), who proved \(\Omega(\sqrt{d_{\mathrm{elu}}\Lambda})\) for general function classes with finite eluder dimension. However, this result has two major weaknesses: first, when specialized to the linear case (\(d\ge\sqrt A\)), it yields only \(\sqrt{d\Lambda}\), which is a full \(\sqrt d\) factor smaller than the upper bound, leaving it unclear if the \(d\) in SAVE is necessary. Second, it only holds for a "fixed total variance budget \(\sum_k\sigma_k^2\le\Lambda\)," constructed by using constant variance in the early rounds and zero thereafter, failing to characterize arbitrary variance sequences \(\{\sigma_1^2,\dots,\sigma_K^2\}\).

Key Challenge: To prove a lower bound, the constructed hard instance must distribute variance "sequentially" rather than as a single budget and must fully saturate the dependence on dimension \(d\). Jia's fixed-budget construction bypasses both requirements, leading to the missing \(\sqrt d\) and the inability to handle general sequences.

Goal: To prove a lower bound of \(\tilde\Omega\!\big(d\sqrt{\sum_k\sigma_k^2}\big)\) for "linear" contextual bandits under general variance sequences, thereby establishing the optimality of the SAVE upper bound, and to clarify the boundaries regarding whether variances are pre-determined or adaptively generated by different types of adversaries.

Key Insight: The authors observe a critical degree of freedom in contextual bandits compared to stochastic linear bandits: the decision set \(D_k\) can change in each round. This provides the space to construct hard instances: rounds can be grouped by variance magnitude, and mutually orthogonal decision sets can be assigned to different groups. This prevents information leakage between groups, allowing the total regret to be decomposed into the sum of regret from each group, each of which can be pushed to its specific lower bound.

Core Idea: Use "magnitude-based peeling + orthogonal sub-instances" to decompose a complex variance sequence into several simple sub-problems with fixed variances. Each sub-problem applies a base hard instance of \(d\sqrt{K\sigma^2}\), which are then summed to reconstruct \(d\sqrt{\sum_k\sigma_k^2}\).

Method

Overall Architecture

The paper does not propose an algorithm but rather a set of techniques for "hard instance construction + lower bound proof." The proof progresses through three layers. The first layer is a base hardness lemma under a fixed variance threshold \(\sigma\) (Lemma 4.3): using a hypercube action set and scaled Bernoulli rewards to create a family of instances where any algorithm incurs an expected regret of at least \(d\sqrt{K\sigma^2}/(8\sqrt6)\) over \(K\) rounds. The second layer targets "pre-determined" general variance sequences: dividing \(K\) rounds into \(L=\lceil\log_2 K\rceil+1\) groups based on variance magnitude via peeling. Each group has approximately equal variances and is assigned a decision set in an orthogonal subspace. Summing the individual lower bounds yields Theorem 4.1. The third layer addresses "adaptive" sequences: when an adversary generates variances round-by-round and group sizes are unknown, the authors use "multi-instances + cyclic allocation" to fill the gap and elevate the expected lower bound to a high-probability lower bound (Theorem 5.2). Finally, a counter-example (Theorem 5.4) shows that if an adversary can determine variances after seeing the decision set, the construction fails, and no such lower bound exists—precisely characterizing the boundary conditions for the lower bound.

Key Designs

1. Peeling by Variance Magnitude + Orthogonal Sub-instances: Decomposing Sequences into Summable Sub-problems

This is the foundation of the proof, addressing the issue that varying variances in a general sequence \(\{\sigma_k^2\}\) prevent the direct application of a uniform hard instance. The authors first fix a variance threshold \(\sigma\) and set the action set to a hypercube \(\mathcal A=\{-1,1\}^d\). Rewards follow a scaled Bernoulli \(\sigma\cdot B\!\big(1/3+\langle\mu,a\rangle\big)\) with unknown weights \(\mu\in\{-\Delta,\Delta\}^d\) and \(\Delta=1/\sqrt{96K}\). The variance of each action is bounded by \(\sigma^2\). This yields the base lower bound (Lemma 4.3):

\[\mathbb E_\mu[\mathrm{Regret}(K)] \ge \frac{d\sqrt{K\sigma^2}}{8\sqrt6},\qquad K\ge 1.5\,d^2.\]

For general sequences, rounds are partitioned into geometrically increasing groups: \(\mathcal K_0=\{k:\sigma_k\le 1/K\}\) and \(\mathcal K_i=\{k: 2^{i-1}/K<\sigma_k\le 2^i/K\}\). Each group \(\mathcal K_i\) is assigned a sub-instance \(M_i\) with dimension \(d_i=d/L\), variance threshold \(\sigma(i)=2^{i-1}/K\), and \(|\mathcal K_i|\) rounds. By embedding these into mutually orthogonal subspaces—where the decision set for a round in \(\mathcal K_i\) is non-zero only in the \(i\)-th block—orthogonality ensures that actions in one group provide no information about the weights \(\mu_j\) of other groups. Thus, total regret decomposes into the sum of individual group regrets, leading to Theorem 4.1:

\[\mathbb E[\mathrm{Regret}(K)] \ge \Omega\!\Big(d\sqrt{\textstyle\sum_{k=1}^K \sigma_k^2 \big/ \log K}\Big).\]

This matches the SAVE upper bound up to \(\sqrt{\log K}\) and adds the missing \(\sqrt d\) factor compared to Jia et al. Notably, this utilizes the varying decision sets of contextual bandits; it fails in stochastic linear bandits where the decision set is fixed, as a learner could learn perfectly with zero regret by selecting standard basis vectors (Remark 4.5).

2. Cyclic Multi-instance Allocation: Bridging Unknown Group Sizes in Adaptive Sequences

Peeling assumes that the number of rounds per group \(|\mathcal K_i|\) is known. In adaptive settings where an adversary generates variances sequentially, \(|\mathcal K_i|\) is unknown and potentially random. The solution is to create \(L=\lceil\log_2 K\rceil+1\) instances \(M_{i,j}\) for each group \(i\), where the \(j\)-th instance is tailored for the case where the round count falls in \(2^{j-1}\le|\mathcal K_i|<2^j\).

Whenever a round falls into group \(\mathcal K_i\), it is cyclically assigned to one of \(\{M_{i,1},\dots,M_{i,L}\}\). This ensures each \(M_{i,j}\) is visited roughly \(|\mathcal K_i|/L\) times. The instance \(j\) matching the true interval will then accumulate a regret of \(\tilde\Omega\!\big(d_i\sqrt{\sigma^2(i)\,|\mathcal K_i|}\ bomb\big)\). The lower bound thus holds regardless of whether group sizes are known.

3. From Expectation to High Probability: Constant Probability Compression and Independent Instance Amplification

In adaptive sequences, the cumulative variance \(\sum_k\sigma_k^2\) itself depends on the stochastic process, making expected lower bounds difficult to state. The authors upgrade the expected bound to a high-probability bound in two steps: first, converting the expected bound to one that holds with "constant probability" through precise variance control; second, creating multiple independent instances and using a union bound to amplify the constant probability to \(1-1/K\), resulting in Theorem 5.2:

\[\Pr\!\Big[\mathrm{Regret}(K)\ge\Omega\big(d\sqrt{\textstyle\sum_k\sigma_k^2}\big/\log^6(dK)\big)\Big]\ge 1-\tfrac1K,\quad \text{when }\textstyle\sum_k\sigma_k^2\ge\Omega(d^2).\]

This is the first high-probability lower bound for linear contextual bandits. The cost is a relaxation of the logarithmic factor from \(\log K\) to \(\log^6(dK)\) due to the additional randomness introduced by adaptivity.

4. Strong Adversary Counter-example: Defining Exact Boundaries via an \(O(d)\) Regret Algorithm

The previous designs assume a "weak adversary" who must determine \(\sigma_k\) before seeing the decision set \(D_k\). This design proves that this timing restriction is essential. The authors construct a "strong adversary" (who selects \(\sigma_k\) after seeing \(D_k\)) and a specific learning algorithm to prove Theorem 5.4: there exists a pair such that even if \(\sum_k\sigma_k^2\ge K/2\), the regret is bounded by \(O(d)\).

The intuition is that a strong adversary can "collude" with the learner by placing high variance only in directions that have already been explored. This breaks the link between "high variance" and "high regret." This result generalizes to stochastic linear bandits (Corollary 5.7), implying that variance-dependent lower bounds only hold under the weak adversary / contextual (varying decision set) setting.

Loss & Training

Not applicable (Theoretical proof).

Key Experimental Results

This is a pure theory paper with no numerical experiments. The following tables summarize the core theoretical conclusions.

Main Results: Lower Bound vs. Existing Bounds

Setting Ours (Lower Bound) Prev. SOTA Upper Bound (SAVE) Comparison with Jia et al. (2024)
Pre-given Sequence \(\Omega\!\big(d\sqrt{\sum_k\sigma_k^2/\log K}\big)\) (Thm 4.1) \(\tilde O\!\big(d\sqrt{\sum_k\sigma_k^2}+d\big)\) Gains \(\sqrt d\); applies to any sequence
Adaptive (Weak) \(\Omega\!\big(d\sqrt{\sum_k\sigma_k^2}/\log^6(dK)\big)\), High-Prob (Thm 5.2) As above First high-probability lower bound
Adaptive (Strong) Does not exist (Regret \(\le d\) while \(\sum\sigma_k^2\ge K/2\)) Defines boundaries of lower bounds

Conclusion: In the weak adversary/contextual setting, the lower bound matches the upper bound up to logarithmic factors, proving the optimality of the \(d\sqrt{\sum_k\sigma_k^2}\) term in the SAVE upper bound.

Key Designs and Scope

Technique Problem Solved Scope / Limitation
Peeling + Orthogonal Sub-instances Decomposes general sequences Requires varying decision sets
Cyclic Multi-instance Allocation Unknown group sizes under adaptivity Weak adversary
Expectation to High-probability Random \(\sum_k\sigma_k^2\) Transferable to other problems
Strong Adversary Counter-example Defining boundary of the lower bound Strong adversary or fixed decision sets

Key Findings

  • The \(\sqrt d\) Gap is Closed: Unlike Jia et al., this work pushes the dimension dependence to \(d\) using orthogonal partitioning, establishing the optimality of SAVE.
  • Requirement of \(\sum_k\sigma_k^2\ge\Omega(d^2)\): Theorem 4.1 requires the total variance to be sufficiently large, analogous to the requirement of \(K\ge\Omega(d^2)\) for standard linear bandit lower bounds.
  • Sensitivity to Timing: Whether the variance is generated "before" or "after" seeing the decision set determines whether the lower bound exists or vanishes.

Highlights & Insights

  • Orthogonal Partitioning enables Summation: Magnitude-based peeling combined with orthogonal decision sets prevents information leakage, cleanly decomposing total regret into the sum of sub-problem regrets.
  • Handling Adaptivity via Cyclic Instances: Using \(\log K\) instances for different round-count intervals bypasses the obstacle of unknown group sizes in adaptive sequences.
  • First High-probability Lower Bound: The conversion framework from expectation to high-probability fills a long-standing gap in linear contextual bandit theory.
  • Precise Boundary via Counter-examples: The \(O(d)\) regret counter-example elevates the "weak adversary" assumption from a technical convenience to a fundamental necessity.

Limitations & Future Work

  • Linear Only: Results are tied to linear rewards; extending the "general variance sequence" analysis to general function approximation (e.g., eluder dimension) remains open.
  • Loose Logarithmic Factors: The \(\log^6(dK)\) factor in the adaptive setting is significantly larger than the \(\log K\) in the pre-given setting.
  • Homogeneous Variance Restriction: The weak adversary lower bound assumes all actions in \(D_k\) share the same \(\sigma_k\), which is more restrictive than SAVE's action-dependent variances.
  • Stochastic Linear Bandits: The lack of a variance-dependent lower bound when decision sets are fixed suggests that the "learning performance gain" from variance is unique to the contextual setting.
  • vs. Jia et al. (2024): They proved lower bounds for fixed budgets; this work handles arbitrary sequences and adds the \(\sqrt d\) factor.
  • vs. Zhao et al. (2023) SAVE: This work provides the matching lower bound to SAVE's upper bound, establishing its optimality.
  • vs. Classic Worst-case Lower Bounds: Unlike Dani (2008) or Li (2019) which give \(\Omega(d\sqrt K)\), this work incorporates heteroscedastic noise structures into the characterization.

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Theorems cover all critical adversary settings; logic is self-contained)
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐