Bandit and Delayed Feedback in Online Structured Prediction
Conference: NeurIPS 2025 arXiv: 2502.18709 Code: None Area: Online Learning / Structured Prediction Keywords: Online Structured Prediction, Bandit Feedback, Delayed Feedback, Surrogate Regret, Fenchel-Young Loss
TL;DR
This paper is the first to study bandit and delayed feedback settings in online structured prediction. By designing a novel pseudo-inverse matrix gradient estimator, it achieves an \(O(T^{2/3})\) surrogate regret bound that does not explicitly depend on the output set size \(K\).
Background & Motivation
Online structured prediction is the task of sequentially predicting outputs with complex structure, encompassing problems such as online classification, multi-label classification, and ranking. Given an input space \(\mathcal{X}\) and a finite output space \(\mathcal{Y}\) (with \(K = |\mathcal{Y}|\)), the learner must predict an output based on the input at each round and incur a target loss \(L(\hat{y}_t; y_t)\).
Surrogate Loss Framework: Since directly optimizing the target loss over a structured output space is generally intractable, surrogate losses such as the Fenchel-Young loss are used in practice. Surrogate regret measures the learner's cumulative target loss minus the cumulative surrogate loss of the best fixed predictor in hindsight.
Limitations of Prior Work:
- Prior work (Sakaue et al., 2024) achieves bounded surrogate regret only under full-information feedback.
- Full-information feedback requires immediate access to the true structured output, which is often impractical.
- For example, in online advertising ranking, the learner observes only click feedback on the displayed ranking (bandit feedback), not the underlying true output.
Core Challenges:
1. Under bandit feedback, the true output \(y_t\) is unobserved, so the true gradient of the surrogate loss cannot be computed.
2. When the output set is extremely large (e.g., \(K = \binom{d}{m}\) in multi-label classification or \(K = m!\) in ranking), bounds that depend on \(K\) are highly unfavorable.
3. Delayed feedback adds further complexity to algorithm design.
Method
Overall Architecture
This paper builds upon the online structured prediction framework with Fenchel-Young losses and develops four classes of algorithms to handle different combinations of feedback and delay settings. The core pipeline is: receive input → compute score vector via linear estimator → generate prediction via randomized decoding → receive feedback → estimate gradient and update parameters.
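As a concrete (and deliberately simplified) rendering of this pipeline, the sketch below runs the full loop on a toy multi-class problem. The softmax decoding distribution, exploration rate `q`, and fixed step size `eta` are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, q, eta = 4, 3, 200, 0.2, 0.1
Y = np.eye(K)              # toy output set: one-hot vectors, so |Y| = K
W = np.zeros((K, d))       # linear estimator, updated online

for t in range(T):
    x = rng.standard_normal(d)              # receive input
    y = rng.integers(K)                     # true output (never observed directly)
    theta = W @ x                           # score vector via linear estimator
    s = np.exp(theta - theta.max())
    s /= s.sum()                            # randomized decoding distribution
    p = q / K + (1 - q) * s                 # mix in uniform exploration
    i_hat = rng.choice(K, p=p)              # prediction
    hit = float(i_hat == y)                 # bandit feedback: correct or not
    y_tilde = (hit / p[i_hat]) * Y[i_hat]   # inverse-weighted surrogate label
    W -= eta * np.outer(s - y_tilde, x)     # step on the estimated FY-loss gradient
```

Only the binary feedback `hit` is used to form the update, which is exactly what distinguishes this loop from the full-information setting.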
Key Designs
1. Randomized Decoding with Uniform Exploration (RDUE):
- With probability \(q\), select an output uniformly at random from \(\mathcal{Y}\).
- With probability \(1-q\), use standard randomized decoding.
- This guarantees that each output is selected with probability at least \(q/K\), which is the basis for inverse-weighting estimation.
- The expected loss satisfies \(\mathbb{E}[L(\psi_\Omega(\theta); y)] \leq \frac{4\gamma}{\lambda\nu}(1-q)S_\Omega(\theta; y) + q\,\frac{K-1}{K}\).
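The exploration mixture can be written directly. The softmax decoding distribution here is an assumption for illustration, but the \(q/K\) floor holds for any decoding distribution:

```python
import numpy as np

def explore_mix(theta, q):
    """Mix a decoding distribution with uniform exploration."""
    s = np.exp(theta - theta.max())
    s /= s.sum()                   # illustrative decoding distribution (softmax)
    K = len(theta)
    return q / K + (1 - q) * s     # every output retains probability >= q / K

p = explore_mix(np.array([2.0, 0.5, -1.0, 0.0]), q=0.2)
print(p.min() >= 0.2 / 4)          # True: the q/K exploration floor
```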
2. Inverse-Weighting Gradient Estimator (\(O(\sqrt{KT})\) algorithm):

\[\tilde{G}_t = \frac{\mathbb{1}[\hat{y}_t = y_t]}{p_t(y_t)} G_t\]

This estimator is unbiased, but its variance is proportional to \(K/q\), causing the regret bound to depend on \(K\).
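A quick Monte Carlo check of unbiasedness: averaging the weight \(\mathbb{1}[\hat{y}_t = y_t]/p_t(y_t)\) over draws from \(p_t\) recovers 1, hence \(\mathbb{E}[\tilde{G}_t] = G_t\). The distribution below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])       # prediction distribution over 3 outputs
y = 1                                # index of the true (unobserved) output
draws = rng.choice(3, size=200_000, p=p)
w = (draws == y) / p[y]              # the weight 1[ŷ = y] / p(y)
print(w.mean())                      # close to 1.0: the estimator is unbiased
print(w.var())                       # close to (1 - p_y)/p_y: variance grows as p_y shrinks
```

The second line is the source of the \(K/q\) factor: under uniform exploration, \(p_t(y_t)\) can be as small as \(q/K\).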
3. Pseudo-Inverse Matrix Gradient Estimator (\(O(T^{2/3})\) algorithm, core contribution):
For the special subclass of Structured Encoding Loss Functions denoted SELF*, the following surrogate label is constructed:

\[\tilde{y}_t = V^{-1} P_t^+ \hat{y}_t \langle \hat{y}_t, V y_t \rangle,\]

where \(P_t = \mathbb{E}[\hat{y}_t \hat{y}_t^\top]\) is the second-moment matrix of the prediction distribution. The gradient estimator is then defined as

\[\tilde{G}_t = (\hat{y}_\Omega(\theta_t) - \tilde{y}_t)\, x_t^\top.\]
Key property: \(\mathbb{E}_t[\|\tilde{G}_t\|_F^2] \leq 2bS_t(W_t) + 2C_x^2 \omega / q\), where \(\omega = \text{tr}(V^{-1}Q^+(V^{-1})^\top)\) does not depend on \(K\).
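The identity behind this estimator is \(\mathbb{E}_t[\tilde{y}_t] = V^{-1} P_t^+ P_t V y_t = y_t\), which holds whenever the prediction distribution puts mass on a spanning set of outputs. A small numeric check, with toy one-hot outputs and an arbitrary invertible \(V\) (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
Y = np.eye(K)                                        # toy output embeddings (one-hot)
V = np.eye(K) + 0.1 * rng.standard_normal((K, K))    # illustrative invertible SELF matrix
p = np.array([0.4, 0.3, 0.2, 0.1])                   # prediction distribution (full support)

P = sum(p[i] * np.outer(Y[i], Y[i]) for i in range(K))   # second moment E[ŷ ŷᵀ]
P_pinv = np.linalg.pinv(P)
V_inv = np.linalg.inv(V)

y = Y[2]                                             # true output
# closed-form expectation of ỹ = V⁻¹ P⁺ ŷ ⟨ŷ, V y⟩ over the K possible draws of ŷ
y_tilde_mean = sum(p[i] * V_inv @ P_pinv @ Y[i] * (Y[i] @ V @ y) for i in range(K))
print(np.allclose(y_tilde_mean, y))                  # True: the surrogate label is unbiased
```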
4. Handling Delayed Feedback:
- The ODAFTRL (Optimistic Delayed Adaptive FTRL) algorithm is employed to handle fixed \(D\)-step delays.
- This yields an AdaGrad-type regret bound that fully exploits the adaptive structure of the gradients.
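ODAFTRL itself involves optimistic hints and adaptive regularization; as a minimal illustration of the mechanics of a fixed \(D\)-step delay only, the sketch below buffers gradients and applies each one once its feedback arrives (plain delayed gradient descent on a toy regression problem, not the paper's algorithm):

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(3)
D, T, d, eta = 5, 600, 3, 0.02
w_star = np.array([1.0, -2.0, 0.5])    # hidden target used to generate feedback
w = np.zeros(d)
pending = deque()                       # gradients whose feedback is still in flight

for t in range(T):
    x = rng.standard_normal(d)
    g = (w @ x - w_star @ x) * x        # squared-loss gradient computed at round t
    pending.append((t, g))
    # the feedback for round s only becomes available at round s + D
    while pending and pending[0][0] <= t - D:
        _, g_old = pending.popleft()
        w -= eta * g_old                # update with the now-available stale gradient
```

With a step size small relative to the delay, the stale updates still converge; the \(D^{1/3}\) and \(\sqrt{D}\) factors in the bounds quantify the price of this staleness.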
Loss & Training
- Surrogate Loss: Fenchel-Young loss \(S_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle\theta, y\rangle\)
- Target Loss: Structured Encoding Loss Function (SELF), \(L(\hat{y}_t; y_t) = \langle\hat{y}_t, Vy_t + b\rangle + c(y_t)\)
- Update Strategy: Adaptive online gradient descent (bandit setting) or ODAFTRL (delayed setting), with learning rate \(\eta_t = B / \sqrt{2\sum_{i=1}^t \|\tilde{G}_i\|_F^2}\)
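As a concrete special case (an illustrative assumption, not the paper's general setting), the Fenchel-Young loss with the negative-entropy regularizer reduces to the logistic loss, and the adaptive step size is a simple running function of past gradient norms:

```python
import numpy as np

def fy_entropy(theta, y_idx):
    """Fenchel-Young loss with negative-entropy regularizer = logistic loss:
    S(theta; e_y) = logsumexp(theta) - theta[y], since Omega(e_y) = 0 for one-hot y."""
    m = theta.max()
    return m + np.log(np.exp(theta - m).sum()) - theta[y_idx]

def adaptive_eta(grad_sq_norms, B=1.0):
    """eta_t = B / sqrt(2 * sum_i ||G_i||_F^2), as in the update strategy above."""
    s = sum(grad_sq_norms)
    return B / np.sqrt(2.0 * s) if s > 0 else B

print(fy_entropy(np.array([0.0, 0.0]), 0))   # log(2) ≈ 0.6931
print(adaptive_eta([3.0, 5.0], B=2.0))       # 2 / sqrt(16) = 0.5
```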
Key Experimental Results
Main Results: Summary of Surrogate Regret Bounds
| Problem Setting | Feedback Type | Target Loss | Surrogate Regret Bound | Source |
|---|---|---|---|---|
| Multi-class classification | Bandit | 0-1 loss | \(O(\sqrt{KT})\) | Van der Hoeven et al. (2021) |
| Structured prediction | Full-information | SELF | Finite bound | Sakaue et al. (2024) |
| Structured prediction | Bandit | SELF | \(O(\sqrt{KT})\) | Ours (Thm 3.5, 3.6) |
| Structured prediction | Bandit | SELF* | \(O(T^{2/3})\) | Ours (Thm 3.8) |
| Structured prediction | Full-info + Delay | SELF | \(O(D^{2/3}T^{1/3})\) | Ours (Thm 4.3, 4.4) |
| Structured prediction | Bandit + Delay | SELF | \(O(\sqrt{DKT})\) | Ours (Thm 5.1) |
| Structured prediction | Bandit + Delay | SELF* | \(O(D^{1/3}T^{2/3})\) | Ours (Thm 5.2) |
Ablation Study: Numerical Comparison under Bandit Feedback (MNIST)
| Algorithm | Surrogate Loss | Misclassification Rate (Median) | Designed for Multi-class Classification |
|---|---|---|---|
| Gaptron | Hinge loss | Higher | Yes |
| Gappletron | Logistic loss | Moderate | Yes |
| Ours | FY loss | Lowest | No (general structured prediction) |
On the MNIST dataset (\(K=10\)), the proposed algorithm outperforms methods specifically designed for multi-class classification, despite being designed for general structured prediction.
Key Findings
- \(K\)-independent bound: The pseudo-inverse matrix estimator eliminates explicit dependence on \(K\) in the regret bound. This improvement is particularly significant in multi-label classification (\(K = \binom{d}{m}\)) and ranking (\(K = m!\)).
- Cost of \(O(T^{2/3})\): Eliminating the \(K\) dependence comes at the cost of increasing the exponent of \(T\) from \(1/2\) to \(2/3\). This trade-off is favorable when \(K\) is extremely large.
- Effect of delay: A fixed delay of \(D\) introduces a multiplicative factor of \(D^{1/3}\) (pseudo-inverse estimator) or \(\sqrt{D}\) (inverse-weighting estimator).
- Synthetic data experiments: The proposed method is competitive when \(K \leq 24\); for \(K = 48, 96\), specialized algorithms hold an advantage due to the use of different surrogate losses.
- High-probability bounds: Theorems 3.6 and 4.4 establish high-probability regret bounds via Bernstein's inequality.
Highlights & Insights
- Elegant design of the pseudo-inverse estimator: Unlike the standard inverse-weighting estimator, the pseudo-inverse estimator cleverly exploits the structure of SELF, ensuring that the second moment of the gradient estimator does not depend on \(K\).
- Completeness of the theoretical framework: The paper systematically covers three feedback scenarios—bandit, delayed, and bandit with delay—as well as two classes of loss functions, SELF and SELF*.
- Practical relevance: Settings such as online advertising ranking and recommendation systems typically involve limited (bandit) or delayed feedback, making the proposed methods directly applicable.
Limitations & Future Work
- Absence of lower bounds: Lower bounds on surrogate regret under bandit and delayed feedback have not been established, and the optimal dependence on \(d\) remains unknown.
- Fixed delay assumption: Only a fixed delay \(D\) is considered, whereas delays in practice are typically variable.
- Optimality of \(O(T^{2/3})\): Whether \(T^{2/3}\) can be improved to \(\sqrt{T}\) (at the cost of some polynomial dependence on \(K\)) remains an open question.
- Linear estimator restriction: The model is assumed to be a linear estimator; extensions to deep learning settings warrant further exploration.
- Restrictiveness of SELF*: The \(O(T^{2/3})\) bound holds only for a subclass of SELF; general SELF still requires dependence on \(K\).
Related Work & Insights
- Online multi-class classification: Van der Hoeven (2020) introduced the concept of surrogate regret; Gaptron and Gappletron are representative algorithms.
- Combinatorial bandits: The pseudo-inverse estimator of Cesa-Bianchi & Lugosi (2012) inspired the design in this paper.
- Delayed online learning: Flaspohler et al. (2021)'s ODAFTRL provides the core algorithmic foundation for handling delayed feedback.
Rating
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Theoretical Depth | 5 |
| Experimental Thoroughness | 3 |
| Writing Quality | 4 |
| Overall | 4 |