Bandit and Delayed Feedback in Online Structured Prediction

Conference: NeurIPS 2025 | arXiv: 2502.18709 | Code: None | Area: Online Learning / Structured Prediction | Keywords: Online Structured Prediction, Bandit Feedback, Delayed Feedback, Surrogate Regret, Fenchel-Young Loss

TL;DR

This paper is the first to study bandit and delayed feedback settings in online structured prediction. By designing a novel pseudo-inverse matrix gradient estimator, it achieves an \(O(T^{2/3})\) surrogate regret bound that does not explicitly depend on the output set size \(K\).

Background & Motivation

Online structured prediction is the task of sequentially predicting outputs with complex structure, encompassing problems such as online classification, multi-label classification, and ranking. Given an input space \(\mathcal{X}\) and a finite output space \(\mathcal{Y}\) (with \(K = |\mathcal{Y}|\)), the learner must predict an output based on the input at each round and incur a target loss \(L(\hat{y}_t; y_t)\).

Surrogate Loss Framework: Since directly optimizing the target loss over a structured output space is generally intractable, surrogate losses such as the Fenchel-Young loss are employed in practice. Surrogate regret measures the cumulative excess of the learner's target loss over the optimal surrogate loss.

Limitations of Prior Work:

  • Prior work (Sakaue et al., 2024) achieves bounded surrogate regret only under full-information feedback.
  • Full-information feedback requires immediate access to the complete structured output, which is often impractical.
  • For example, in online advertising ranking, only click-through rates can be observed (bandit feedback), not the identities of which ads were clicked.

Core Challenges:

  1. Under bandit feedback, the true output \(y_t\) is unobserved, making it impossible to compute the true gradient of the surrogate loss.
  2. When the output set is extremely large (e.g., \(K = \binom{d}{m}\) in multi-label classification or \(K = m!\) in ranking), bounds that depend on \(K\) are highly unfavorable.
  3. Delayed feedback introduces additional complexity into algorithm design.

Method

Overall Architecture

This paper builds upon the online structured prediction framework with Fenchel-Young losses and develops four classes of algorithms to handle different combinations of feedback and delay settings. The core pipeline is: receive input → compute score vector via linear estimator → generate prediction via randomized decoding → receive feedback → estimate gradient and update parameters.

Key Designs

1. Randomized Decoding with Uniform Exploration (RDUE):

  • With probability \(q\), select an output uniformly at random from \(\mathcal{Y}\).
  • With probability \(1-q\), use standard randomized decoding.
  • This guarantees that each output is selected with probability at least \(q/K\), providing the basis for inverse-weighting estimation.
  • The expected target loss satisfies \(\mathbb{E}[L(\psi_\Omega(\theta); y)] \leq \frac{4\gamma}{\lambda\nu}(1-q)S_\Omega(\theta; y) + q\frac{K-1}{K}\).
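The exploration mix above can be sketched numerically. The decoder distribution, \(q\), and \(K\) below are illustrative placeholder values, not taken from the paper; the check simply verifies the \(q/K\) floor on every output's selection probability.

```python
import numpy as np

K = 5          # number of structured outputs (illustrative)
q = 0.2        # uniform-exploration rate (illustrative)

def rdue_distribution(p_decode, q):
    """Mix a randomized-decoding distribution with uniform exploration:
    with probability q pick uniformly over the K outputs, otherwise
    draw from the decoder's own distribution p_decode."""
    K = len(p_decode)
    return q / K + (1.0 - q) * p_decode

# A sharply peaked decoder distribution (hypothetical example values).
p_decode = np.array([0.90, 0.05, 0.03, 0.01, 0.01])
p = rdue_distribution(p_decode, q)

assert np.isclose(p.sum(), 1.0)
assert np.all(p >= q / K)   # every output is kept with probability >= q/K
```

The floor \(q/K\) is exactly what the inverse-weighting estimator in the next step divides by, which is why it appears in the variance bound.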

2. Inverse-Weighting Gradient Estimator (\(O(\sqrt{KT})\) algorithm):

\[\tilde{G}_t = \frac{\mathbb{1}[\hat{y}_t = y_t]}{p_t(y_t)} G_t\]

This estimator is unbiased, but its variance is proportional to \(K/q\), causing the regret bound to depend on \(K\).
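A quick Monte Carlo sanity check of the unbiasedness claim, with made-up values for \(K\), \(p_t\), and \(G_t\): since \(\mathbb{E}[\mathbb{1}[\hat{y}_t = y_t]/p_t(y_t)] = 1\), the scalar weight should average to 1 over many draws.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 5
p = np.full(K, 1.0 / K)            # sampling distribution p_t over outputs (illustrative)
y_true = 3                          # unobserved true label index (illustrative)

# E[ 1[y_hat == y_t] / p_t(y_t) ] = p_t(y_t) / p_t(y_t) = 1, so the
# inverse-weighted estimator matches the true gradient in expectation.
n = 200_000
y_hat = rng.choice(K, size=n, p=p)
weights = (y_hat == y_true) / p[y_true]
scale = weights.mean()              # concentrates around 1

assert abs(scale - 1.0) < 0.05
```

The same simulation also illustrates the drawback: individual weights are \(1/p_t(y_t)\), which is as large as \(K/q\) under uniform exploration, and that magnitude is what drives the \(K\)-dependent variance.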

3. Pseudo-Inverse Matrix Gradient Estimator (\(O(T^{2/3})\) algorithm, core contribution):

For the special subclass of Structured Encoding Loss Functions (SELF), the following surrogate label is constructed:

\[\tilde{y}_t = V^{-1} P_t^+ \hat{y}_t \langle \hat{y}_t, Vy_t \rangle\]

where \(P_t = \mathbb{E}[\hat{y}_t \hat{y}_t^\top]\) is the second-moment matrix. The gradient estimator is then defined as:

\[\tilde{G}_t = (\hat{y}_\Omega(\theta_t) - \tilde{y}_t) x_t^\top\]

Key property: \(\mathbb{E}_t[\|\tilde{G}_t\|_F^2] \leq 2bS_t(W_t) + 2C_x^2 \omega / q\), where \(\omega = \text{tr}(V^{-1}Q^+(V^{-1})^\top)\) does not depend on \(K\).
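To make the surrogate-label construction concrete, here is a minimal numerical sketch in the simplest case: one-hot outputs and \(V = I\) (both simplifying assumptions; the distribution \(p\) and label index are made up). It checks that \(\tilde{y}_t\) is an unbiased reconstruction of \(y_t\), even though each round only reveals the inner product \(\langle \hat{y}_t, Vy_t \rangle\).

```python
import numpy as np

rng = np.random.default_rng(2)

K = 4
p = np.array([0.4, 0.3, 0.2, 0.1])   # prediction distribution over outputs (illustrative)
V = np.eye(K)                         # identity encoding matrix (simplifying assumption)
Y = np.eye(K)                         # one-hot embeddings of the K outputs
y_true = Y[2]                         # the unobserved true label

# Second-moment matrix P_t = E[y_hat y_hat^T]; diagonal for one-hot outputs.
P = (Y.T * p) @ Y
P_pinv = np.linalg.pinv(P)

# Monte Carlo average of tilde_y = V^{-1} P^+ y_hat <y_hat, V y_true>.
n = 100_000
idx = rng.choice(K, size=n, p=p)
inner = Y[idx] @ (V @ y_true)                            # <y_hat, V y_true> per round
tilde = (Y[idx] @ P_pinv.T @ np.linalg.inv(V).T) * inner[:, None]
tilde_mean = tilde.mean(axis=0)

# Unbiasedness: E[tilde_y] recovers y_true without ever observing it directly.
assert np.allclose(tilde_mean, y_true, atol=0.05)
```

Intuitively, \(P_t^+\) plays the role that the scalar \(1/p_t(y_t)\) played in the inverse-weighting estimator, but its trace-based second moment \(\omega\) replaces the \(K/q\) factor.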

4. Handling Delayed Feedback:

  • The ODAFTRL (Optimistic Delayed Adaptive FTRL) algorithm is employed to handle a fixed \(D\)-step delay.
  • This yields an AdaGrad-type regret bound that fully exploits the adaptive structure of the gradients.
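ODAFTRL itself adds optimism and adaptive regularization; as a minimal sketch of just the delay mechanism (not the paper's algorithm), the toy loop below buffers each gradient and applies it \(D\) rounds after it is generated. All names and values are illustrative.

```python
from collections import deque
import numpy as np

def delayed_ogd(grads, D, eta=0.1, dim=3):
    """Toy delayed online gradient descent: the gradient produced at
    round t only becomes available, and is applied, at round t + D."""
    w = np.zeros(dim)
    pending = deque()
    for g in grads:
        pending.append(g)            # feedback generated this round ...
        if len(pending) > D:         # ... arrives D rounds later
            w -= eta * pending.popleft()
    return w

rng = np.random.default_rng(3)
grads = [rng.standard_normal(3) for _ in range(10)]
w = delayed_ogd(grads, D=2)

# Only the first 10 - 2 = 8 gradients have arrived by the final round.
expected = -0.1 * np.sum(grads[:8], axis=0)
assert np.allclose(w, expected)
```

The \(D\) gradients still in flight at any time are the source of the extra \(D\)-dependent terms in the delayed-feedback regret bounds.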

Loss & Training

  • Surrogate Loss: Fenchel-Young loss \(S_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle\theta, y\rangle\)
  • Target Loss: Structured Encoding Loss Function (SELF), \(L(\hat{y}_t; y_t) = \langle\hat{y}_t, Vy_t + b\rangle + c(y_t)\)
  • Update Strategy: Adaptive online gradient descent (bandit setting) or ODAFTRL (delayed setting), with learning rate \(\eta_t = B / \sqrt{2\sum_{i=1}^t \|\tilde{G}_i\|_F^2}\)
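The adaptive step size above can be sketched as plain adaptive online gradient descent (this is a simplified stand-in, not the paper's full procedure; \(B\), the matrix shapes, and the gradients are placeholders).

```python
import numpy as np

def adaptive_ogd_step(W, G, cum_sq, B=1.0):
    """One step of adaptive OGD with the scalar AdaGrad-norm rate
    eta_t = B / sqrt(2 * sum_{i<=t} ||G_i||_F^2); B stands in for the
    comparator-norm bound."""
    cum_sq += np.linalg.norm(G, "fro") ** 2
    eta = B / np.sqrt(2.0 * cum_sq)
    return W - eta * G, cum_sq

rng = np.random.default_rng(4)
W = np.zeros((3, 2))
cum_sq = 0.0
for _ in range(5):
    G = rng.standard_normal((3, 2))   # gradient estimate for this round
    W, cum_sq = adaptive_ogd_step(W, G, cum_sq)

assert np.isfinite(W).all() and cum_sq > 0.0
```

Because the rate shrinks with the cumulative squared norms of the gradient estimates, the second-moment bounds on \(\tilde{G}_t\) from the previous subsection translate directly into the stated regret bounds.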

Key Experimental Results

Main Results: Summary of Surrogate Regret Bounds

| Problem Setting | Feedback Type | Target Loss | Surrogate Regret Bound | Source |
| --- | --- | --- | --- | --- |
| Multi-class classification | Bandit | 0-1 loss | \(O(\sqrt{KT})\) | Van der Hoeven et al. (2021) |
| Structured prediction | Full-information | SELF | Finite bound | Sakaue et al. (2024) |
| Structured prediction | Bandit | SELF | \(O(\sqrt{KT})\) | Ours (Thm 3.5, 3.6) |
| Structured prediction | Bandit | SELF* | \(O(T^{2/3})\) | Ours (Thm 3.8) |
| Structured prediction | Full-info + Delay | SELF | \(O(D^{2/3}T^{1/3})\) | Ours (Thm 4.3, 4.4) |
| Structured prediction | Bandit + Delay | SELF | \(O(\sqrt{DKT})\) | Ours (Thm 5.1) |
| Structured prediction | Bandit + Delay | SELF* | \(O(D^{1/3}T^{2/3})\) | Ours (Thm 5.2) |

Experiment: Numerical Comparison under Bandit Feedback (MNIST)

| Algorithm | Surrogate Loss | Misclassification Rate (Median) | Designed for Multi-class Classification |
| --- | --- | --- | --- |
| Gaptron | Hinge loss | Higher | Yes |
| Gappletron | Logistic loss | Moderate | Yes |
| Ours | FY loss | Lowest | No (general structured prediction) |

On the MNIST dataset (\(K=10\)), the proposed algorithm outperforms methods specifically designed for multi-class classification, despite being designed for general structured prediction.

Key Findings

  1. \(K\)-independent bound: The pseudo-inverse matrix estimator eliminates explicit dependence on \(K\) in the regret bound. This improvement is particularly significant in multi-label classification (\(K = \binom{d}{m}\)) and ranking (\(K = m!\)).
  2. Cost of \(O(T^{2/3})\): Eliminating the \(K\) dependence comes at the cost of increasing the exponent of \(T\) from \(1/2\) to \(2/3\). This trade-off is favorable when \(K\) is extremely large.
  3. Effect of delay: A fixed delay of \(D\) introduces a multiplicative factor of \(D^{1/3}\) (pseudo-inverse estimator) or \(\sqrt{D}\) (inverse-weighting estimator).
  4. Synthetic data experiments: The proposed method is competitive when \(K \leq 24\); for \(K = 48, 96\), specialized algorithms hold an advantage due to the use of different surrogate losses.
  5. High-probability bounds: Theorems 3.6 and 4.4 establish high-probability regret bounds via Bernstein's inequality.

Highlights & Insights

  1. Elegant design of the pseudo-inverse estimator: Unlike the standard inverse-weighting estimator, the pseudo-inverse estimator cleverly exploits the structure of SELF, ensuring that the second moment of the gradient estimator does not depend on \(K\).
  2. Completeness of the theoretical framework: The paper systematically covers three feedback scenarios—bandit, delayed, and bandit with delay—as well as two classes of loss functions, SELF and SELF*.
  3. Practical relevance: Settings such as online advertising ranking and recommendation systems typically involve limited (bandit) or delayed feedback, making the proposed methods directly applicable.

Limitations & Future Work

  1. Absence of lower bounds: Lower bounds on surrogate regret under bandit and delayed feedback have not been established, and the optimal dependence on \(d\) remains unknown.
  2. Fixed delay assumption: Only a fixed delay \(D\) is considered, whereas delays in practice are typically variable.
  3. Optimality of \(O(T^{2/3})\): Whether \(T^{2/3}\) can be improved to \(\sqrt{T}\) (at the cost of some polynomial dependence on \(K\)) remains an open question.
  4. Linear estimator restriction: The model is assumed to be a linear estimator; extensions to deep learning settings warrant further exploration.
  5. Restrictiveness of SELF*: The \(O(T^{2/3})\) bound holds only for a subclass of SELF; general SELF still requires dependence on \(K\).

Related Work

  • Online multi-class classification: Van der Hoeven (2020) introduced the concept of surrogate regret; Gaptron and Gappletron are representative algorithms.
  • Combinatorial bandits: The pseudo-inverse estimator of Cesa-Bianchi & Lugosi (2012) inspired the design in this paper.
  • Delayed online learning: Flaspohler et al. (2021)'s ODAFTRL provides the core algorithmic foundation for handling delayed feedback.

Rating

| Dimension | Score (1–5) |
| --- | --- |
| Novelty | 4 |
| Theoretical Depth | 5 |
| Experimental Thoroughness | 3 |
| Writing Quality | 4 |
| Overall | 4 |