# Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities
Conference: NeurIPS 2025 · arXiv: 2601.06913 · Authors: Taehyun Hwang, Dahngoon Kim, Min-hwan Oh (Seoul National University) · Code: Not available · Area: Reinforcement Learning · Keywords: contextual bandits, multinomial logit, non-linear utility functions, combinatorial assortment optimization, UCB, neural networks
## TL;DR
This work presents ONL-MNL, the first computationally tractable and statistically optimal algorithm for the MNL contextual bandit problem under non-linear utility functions (including neural networks), achieving \(\widetilde{\mathcal{O}}(\sqrt{T})\) regret without relying on NTK assumptions.
## Background & Motivation
### State of the Field
The multinomial logit (MNL) contextual bandit is a standard sequential assortment selection model widely used in recommendation systems, online retail, and personalized marketing. At each round, an agent observes item features and selects a subset (assortment) to present to a user, who then chooses an item according to the MNL choice model. The objective is to minimize cumulative regret relative to the optimal assortment policy.
### Limitations of Prior Work
- Nearly all existing MNL contextual bandit work assumes the utility function is linear in the features, which fails to capture complex non-linear user behavior.
- Zhang & Luo (2025) were the first to consider general utility function classes, but their methods face a fundamental tension between computational tractability and statistical efficiency:
- The \(\varepsilon\)-greedy approach is computationally tractable but achieves only \(\widetilde{\mathcal{O}}(T^{2/3})\) regret (stochastic setting) or \(\widetilde{\mathcal{O}}(T^{5/6})\) (adversarial setting).
- Log-barrier and Feel-Good TS methods achieve \(\widetilde{\mathcal{O}}(\sqrt{T})\) regret but are computationally intractable (non-convex optimization).
- Existing neural bandit methods rely on NTK approximations, requiring unrealistically large over-parameterization.
### Root Cause
At bottom, the tension stems from non-convexity: once utilities are non-linear, the confidence sets and posteriors used by statistically optimal methods can no longer be optimized efficiently. This work therefore aims to design an MNL contextual bandit algorithm that is simultaneously computationally tractable and \(\widetilde{\mathcal{O}}(\sqrt{T})\)-optimal, while supporting non-linear parameterized utility functions, including neural networks.
## Method
### Problem Formulation
- MNL choice model: The probability that a user selects item \(i\) from the displayed assortment \(S_t\) is \(p(i|X_t,S_t,\mathbf{w}^*) = \frac{\exp(f_{\mathbf{w}^*}(\mathbf{x}_{ti}))}{1+\sum_{j\in S_t}\exp(f_{\mathbf{w}^*}(\mathbf{x}_{tj}))}\) (see the numerical sketch after this list).
- Non-linear parameterization: The utility function \(f_\mathbf{w}(\cdot)\) belongs to a general parameterized function class \(\mathcal{F}\) (e.g., a neural network), differentiable with respect to parameter \(\mathbf{w}\) but not necessarily with respect to feature \(\mathbf{x}\).
- Equivalence set: \(\mathcal{W}^* := \{\mathbf{w} \in \mathcal{W}: f_\mathbf{w}(\mathbf{x}) = f_{\mathbf{w}^*}(\mathbf{x})\ \forall \mathbf{x}\}\), accommodating multiple global optima arising from parameter symmetries in neural networks.
- Objective: Minimize cumulative regret \(\mathrm{Regret}_T = \sum_{t=1}^T [R_t(S_t^*, \mathbf{w}^*) - R_t(S_t, \mathbf{w}^*)]\).
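To make these quantities concrete, here is a minimal NumPy sketch of the MNL choice probabilities and the expected revenue of an assortment. The toy utilities and unit revenues are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mnl_choice_probs(utilities):
    """MNL choice probabilities for an assortment, with index 0 as no-purchase.

    utilities: 1-D array of f_w(x_i) for the items i in S.
    Returns (p_no_purchase, p_items); together they sum to 1.
    """
    expu = np.exp(utilities)
    denom = 1.0 + expu.sum()  # the leading 1 is the outside (no-purchase) option
    return 1.0 / denom, expu / denom

def expected_revenue(utilities, revenues):
    """Expected revenue R(S, w) = sum_i r_i * p(i | X, S, w)."""
    _, p_items = mnl_choice_probs(utilities)
    return float(p_items @ revenues)

# Toy assortment of 3 items with unit revenues.
u = np.array([0.5, -0.2, 1.1])  # utilities f_w(x_i)
p0, p = mnl_choice_probs(u)
print(p0, p, expected_revenue(u, np.ones(3)))
```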
### Core Technical Challenges
- The negative log-likelihood of the non-linear MNL model lacks the self-concordance property available in the linear case.
- Gradients depend on the unknown parameter, precluding direct application of concentration inequalities from linear models.
- The non-convex optimization landscape admits multiple global optima (e.g., permutation symmetry of neurons in neural networks).
### Algorithm: ONL-MNL (Two-Phase Design)
Phase I — Uniform Exploration (\(t_0\) rounds):
- Sample assortments uniformly at random and collect user choice data.
- Minimize the negative log-likelihood \(\mathcal{L}(\mathbf{w})\) to obtain an initial estimate \(\hat{\mathbf{w}}_0\) (see the sketch after this list).
- Establish a convergence rate via the generalized geometric condition (Assumption 4): \(\min_{\tilde{\mathbf{w}}\in\mathcal{W}^*}\|\hat{\mathbf{w}}_0 - \tilde{\mathbf{w}}\|_2^2 = \mathcal{O}(1/t_0)\).
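A minimal PyTorch sketch of the Phase I fit, under assumed specifics: a small two-layer sigmoid network stands in for the class \(\mathcal{F}\), and the no-purchase option is given utility 0, consistent with the choice model above. It illustrates the likelihood being minimized, not the paper's implementation.

```python
import torch
import torch.nn as nn

class UtilityNet(nn.Module):
    """Two-layer sigmoid network f_w: R^d -> R (an illustrative choice of F)."""
    def __init__(self, d, hidden=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.Sigmoid(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):               # x: (K, d) features of one assortment
        return self.net(x).squeeze(-1)  # (K,) utilities f_w(x_i)

def mnl_nll(f_w, assortments, choices):
    """Negative log-likelihood L(w) of the observed choices.

    assortments: list of (K_t, d) feature tensors; choices: chosen index in
    {0, ..., K_t}, where 0 is the no-purchase option with utility fixed to 0.
    """
    loss = torch.zeros(())
    for X, y in zip(assortments, choices):
        logits = torch.cat([torch.zeros(1), f_w(X)])  # prepend outside option
        loss = loss - torch.log_softmax(logits, dim=0)[y]
    return loss / len(assortments)

# Phase I: fit w_hat_0 on the t_0 uniformly explored rounds, e.g.
# (assortments/choices collected beforehand; Adam is an illustrative optimizer)
# f_w = UtilityNet(d=3)
# opt = torch.optim.Adam(f_w.parameters(), lr=1e-2)
# for _ in range(200):
#     opt.zero_grad(); mnl_nll(f_w, assortments, choices).backward(); opt.step()
```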
Phase II — Optimistic Exploration (remaining \(T-t_0\) rounds):
- Construct a linearized MNL model by performing a first-order Taylor expansion of \(f_\mathbf{w}\) around the current estimate \(\hat{\mathbf{w}}_t\).
- Minimize the regularized negative log-likelihood using \(\hat{\mathbf{w}}_0\) as the regularization center rather than the origin.
- Construct optimistic utility estimates by adding an exploration bonus along the gradient features: \(\widetilde{u}_{ti} = f_{\hat{\mathbf{w}}_t}(\mathbf{x}_{ti}) + \beta_t\|\nabla f_{\hat{\mathbf{w}}_t}(\mathbf{x}_{ti})\|_{\mathbf{V}_t^{-1}}\) (see the sketch after this list).
- Select the assortment maximizing the optimistic expected revenue: \(S_t = \arg\max_{S\in\mathcal{S}} \widetilde{R}_t(S)\).
- Update the Gram matrix \(\mathbf{V}_{t+1} = \mathbf{V}_t + \sum_{i\in S_t}\nabla f_{\hat{\mathbf{w}}_t}(\mathbf{x}_{ti})\nabla f_{\hat{\mathbf{w}}_t}(\mathbf{x}_{ti})^\top\).
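The optimistic step can be sketched as follows, reusing `UtilityNet` from the Phase I snippet. Gradients of \(f_{\hat{\mathbf{w}}_t}\) play the role of feature vectors and \(\beta_t\) is the confidence radius from the analysis; the top-\(K\) rule shown is a valid assortment oracle only under the assumption of unit revenues (expected revenue is then increasing in each utility) and stands in for whatever combinatorial oracle is actually used.

```python
import torch

def flat_grad(f_w, x):
    """Utility f_w(x) and its gradient w.r.t. all parameters, flattened to R^{d_w}."""
    val = f_w(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(val, tuple(f_w.parameters()))
    return val.detach(), torch.cat([g.reshape(-1) for g in grads])

def optimistic_utilities(f_w, X, V_inv, beta):
    """u~_ti = f_w(x_ti) + beta_t * ||grad f_w(x_ti)||_{V_t^{-1}} per item.

    X: (N, d) candidate item features; V_inv: inverse of the regularized
    Gram matrix V_t; beta: confidence radius beta_t.
    """
    out = []
    for x in X:
        val, g = flat_grad(f_w, x)
        out.append(val + beta * torch.sqrt(g @ V_inv @ g))
    return torch.stack(out)

def select_assortment(u_tilde, K):
    """With unit revenues, expected revenue is increasing in every utility,
    so the optimistic assortment is simply the K largest optimistic utilities."""
    return torch.topk(u_tilde, K).indices

def update_gram(V, f_w, X_sel):
    """V_{t+1} = V_t + sum_{i in S_t} grad f(x_ti) grad f(x_ti)^T."""
    for x in X_sel:
        _, g = flat_grad(f_w, x)
        V = V + torch.outer(g, g)
    return V
```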
### Key Technical Innovations
Generalized Geometric Condition (Assumption 4): The expected squared loss \(\ell_\mathrm{sq}(\mathbf{w})\) is required to satisfy a \((\tau,\gamma)\)-growth condition or \(\mu\)-local strong convexity relative to the equivalence set \(\mathcal{W}^*\). This is strictly weaker than the assumption of a globally unique optimum required by prior work. It naturally accommodates highly non-convex loss landscapes with arbitrarily many spurious local optima and multiple global optima, making it well-suited to neural networks.
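For orientation, growth conditions of this kind are typically stated in roughly the following form; this is a hedged reconstruction of Assumption 4's shape, not a verbatim quote:
\[
\ell_{\mathrm{sq}}(\mathbf{w}) - \ell_{\mathrm{sq}}(\mathbf{w}^*) \;\geq\; \tau \, \Big(\min_{\tilde{\mathbf{w}}\in\mathcal{W}^*} \|\mathbf{w}-\tilde{\mathbf{w}}\|_2\Big)^{\gamma},
\]
with \(\mu\)-local strong convexity relative to \(\mathcal{W}^*\) corresponding roughly to the quadratic case \(\gamma = 2\), \(\tau = \mu/2\) in a neighborhood of \(\mathcal{W}^*\).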
\(\lambda\)-Independent Concentration Inequality: For the large regularization regime \(\lambda = \mathcal{O}(\sqrt{T})\), a novel concentration inequality is derived that is independent of \(\lambda\) (up to logarithmic factors), breaking the bottleneck of existing Bernstein-type bounds that require \(\lambda = \mathcal{O}(d\log T)\).
Inductive Proof of Confidence Sets: Mathematical induction is used to recursively establish that the estimation error \(\min_{\tilde{\mathbf{w}}\in\mathcal{W}^*}\|\hat{\mathbf{w}}_t-\tilde{\mathbf{w}}\|_2^2 = \widetilde{\mathcal{O}}(1/t_0)\) holds for all rounds in Phase II, guaranteeing \(\beta_t = \widetilde{\mathcal{O}}(\mu^{-2}\kappa^{-4}d_w)\).
### Main Theorem
Let \(\kappa := \min_{\mathbf{w},X,S,i} p(0|X,S,\mathbf{w})\cdot p(i|X,S,\mathbf{w})\), and set \(t_0 = \lceil \kappa^{-3/2}d_w\sqrt{T}\rceil\). Then with high probability,
\[
\mathrm{Regret}_T = \widetilde{\mathcal{O}}\big(\kappa^{-2}\mu^{-1}d_w\sqrt{T}\big).
\]
This bound is \(\widetilde{\mathcal{O}}(\sqrt{T})\) in \(T\) and independent of the total number of items \(N\).
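The choice of \(t_0\) can be sanity-checked with the usual two-phase decomposition (a reconstruction, assuming per-round regret is bounded by a constant so Phase I contributes at most \(t_0\)):
\[
\mathrm{Regret}_T \;\lesssim\; \underbrace{t_0}_{\text{Phase I}} + \underbrace{\widetilde{\mathcal{O}}\big(\kappa^{-2}\mu^{-1}d_w\sqrt{T}\big)}_{\text{Phase II}},
\]
and \(t_0 = \lceil\kappa^{-3/2}d_w\sqrt{T}\rceil\) is itself \(\widetilde{\mathcal{O}}(\sqrt{T})\), so neither phase breaks the \(\sqrt{T}\) rate.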
## Key Experimental Results
### Experiment 1: Cumulative Regret under Realizable and Misspecified Settings
Setup: \(N=100\) items, \(K=5\) maximum assortment size, \(d=3\) feature dimension, 30 random seeds, \(T=1000\) rounds.
| Algorithm | Utility Type | Realizability | Performance (Gaussian context) | Key Property |
|---|---|---|---|---|
| ONL-MNL (Ours) | Non-linear | Realizable | Lowest regret, fast convergence | Tractable + statistically optimal |
| ONL-MNL (Ours) | Non-linear | Misspecified | Still achieves low regret | Strong robustness |
| \(\varepsilon\)-greedy-MNL | General | Realizable | Regret significantly higher than ONL-MNL | Tractable but suboptimal |
| UCB-MNL | Linear | Misspecified | Regret grows linearly | Fails to capture non-linearity |
| TS-MNL | Linear | Misspecified | Regret grows linearly | Linear assumption breaks down |
| OFU-MNL+ | Linear | Misspecified | Regret grows linearly | Near-optimal but limited to linear |
- Realizable setting: True utility is a 2-layer sigmoid neural network (3 hidden neurons), \(t_0=50\).
- Misspecified setting: True utility is \(\cos(2\pi(\mathbf{x}^\top\mathbf{w}^*)) - \frac{1}{2}(\mathbf{x}^\top\mathbf{w}^*)\); estimator uses 15 hidden neurons, \(t_0=100\) (a reproduction sketch follows this list).
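For concreteness, the misspecified ground-truth utility is easy to reproduce; the sampling and normalization of \(\mathbf{w}^*\) below are assumptions, only the functional form comes from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)  # assumed normalization of w*

def misspecified_utility(x, w=w_star):
    """True utility in the misspecified setting: cos(2*pi*x^T w) - (x^T w)/2."""
    z = x @ w
    return np.cos(2.0 * np.pi * z) - 0.5 * z
```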
### Experiment 2: Effect of Number of Items \(N\) on Regret
Setup: Realizable environment (10 hidden neurons), \(T=500\), comparing \(N=100\) and \(N=800\).
| Algorithm | \(N=100\) (Gaussian) | \(N=800\) (Gaussian) | Effect of Increasing \(N\) |
|---|---|---|---|
| ONL-MNL (Ours) | Low regret | Slight increase (Phase I only) | Phase II slope unchanged; learning efficiency unaffected by \(N\) |
| \(\varepsilon\)-greedy-MNL | Moderate regret | Regret increases significantly | Growth consistent with the theoretical \(N^{1/3}\) dependence |
Consistent conclusions are observed under both Gaussian and Uniform context distributions: the convergence slope of ONL-MNL in Phase II is independent of \(N\).
## Theoretical Comparison with Prior Methods
| Algorithm | Utility Type | Computationally Tractable | Regret | Depends on \(N\)? |
|---|---|---|---|---|
| \(\varepsilon\)-greedy [Zhang & Luo] | Lipschitz | ✓ | \(\widetilde{\mathcal{O}}((d_wNK)^{1/3}T^{2/3})\) | Yes |
| Log-barrier [Zhang & Luo] | Lipschitz | ✗ | \(\widetilde{\mathcal{O}}(K^2\sqrt{d_wNT})\) | Yes |
| Feel-Good TS [Zhang & Luo] | Lipschitz | ✗ | \(\widetilde{\mathcal{O}}(K^2\sqrt{d_wNT})\) | Yes |
| ONL-MNL (Ours) | Smooth | ✓ | \(\widetilde{\mathcal{O}}(\kappa^{-2}\mu^{-1}d_w\sqrt{T})\) | No |
## Highlights & Insights
- First tractable and optimal algorithm: ONL-MNL is the first method to simultaneously achieve computational tractability and \(\widetilde{\mathcal{O}}(\sqrt{T})\) regret for the non-linear utility MNL bandit, resolving the fundamental computation–statistics tension in prior work.
- No NTK assumptions required: The algorithm directly handles general non-linear parameterizations including neural networks, without requiring over-parameterization or specific architectural constraints.
- Regret independent of \(N\): Unlike prior methods that exhibit polynomial dependence on the total number of items \(N\), ONL-MNL scales to large item pools.
- Generalized geometric condition: The proposed assumption strictly weakens existing geometric conditions, accommodating non-convex loss landscapes with multiple global optima, naturally suited to neural networks.
- Novel concentration inequality: A new parameter estimation concentration bound independent of the regularization parameter \(\lambda\) is derived, breaking the technical bottleneck that arises when \(\lambda = \mathcal{O}(\sqrt{T})\).
## Limitations & Future Work
- Instance-dependent factor \(\kappa\): The regret bound scales with inverse powers of \(\kappa\), and \(\kappa^{-1}\) can be as large as \(\mathcal{O}(K^2)\) in the worst case; eliminating or weakening this dependence remains an open problem.
- Phase I requires i.i.d. contexts: Assumption 3 requires contexts in the exploration phase to be independent and identically distributed; although Phase II admits adversarial contexts, this remains a restriction.
- Difficulty verifying the generalized geometric condition: Assumption 4 is difficult to verify a priori for general neural network architectures, and the paper does not provide concrete verification methods.
- Limited experimental scale: Experiments are conducted only with \(N\leq 800\) and \(d=3\); high-dimensional features and larger real-world datasets are not considered.
- Fully adversarial setting not studied: Although Phase II theoretically permits adversarial contexts, experiments are conducted only under stochastic contexts.
- Extension to MNL MDPs: The paper proposes extending the framework to MNL MDPs as future work but provides no concrete methodology.
## Related Work & Insights
- UCB-MNL [Oh & Iyengar 2021], TS-MNL [Cheung & Simchi-Levi 2017]: Support only linear utilities; when the true utility is non-linear, their regret grows linearly, as the experiments confirm.
- OFU-MNL+ [Lee & Oh 2024]: A near-optimal algorithm for the linear utility case with improved \(\kappa\) dependence, but still restricted to linear utilities.
- Zhang & Luo [2025]: The first to consider general utilities, but face a computation–statistics tradeoff; their computationally tractable method achieves only \(T^{2/3}\) regret.
- NTK-based neural bandits [Zhou et al., Hwang et al.]: Require over-parameterization assumptions and are unsuitable for practically sized networks.
- Non-NTK neural bandits [Xu et al. 2024, Huang et al. 2021]: Require specific activation functions (quadratic) or context distributions (Gaussian), limiting their applicability.
- Linear/GLM bandits [Abbasi-Yadkori et al., Filippi et al.]: Global strong convexity assumptions do not extend to non-linear models.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First to resolve the tension between computational tractability and statistical optimality in the non-linear MNL bandit.
- Experimental Thoroughness: ⭐⭐⭐ — Core claims are validated, but experiments are small-scale and lack real-world datasets.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clear, technical development is coherent, and theoretical derivations are rigorous.
- Value: ⭐⭐⭐⭐ — Fills an important theoretical gap with potential practical impact for applications such as recommendation systems.