Online Inventory Optimization in Non-Stationary Environment¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=mkl9iIXIr2
Code: TBD
Area: Learning Theory / Online Convex Optimization
Keywords: Online Inventory Optimization, Dynamic Regret, Online Convex Optimization, Smoothed Online Optimization, Switching Cost

TL;DR¶

This paper proposes a "two-stage projection + doubling trick" algorithm for Online Inventory Optimization (OIO) under non-stationary demand. By transforming the inventory carry-over constraint into a switching cost proportional to the sell-out period, the authors reduce OIO to Smoothed Online Convex Optimization (SOCO). They provide the first near-optimal dynamic regret bound of $\tilde{O}(\sqrt{L_{\max}T(1+P_T)})$ along with a matching lower bound of $\Omega(\sqrt{L_{\max}T})$.

Background & Motivation¶

Background: Inventory management is a core problem in supply chains. When demand models are unknown, the mainstream approach models "periodic replenishment + carry-over inventory" systems as Online Convex Optimization (OCO): in each round, the decision-maker sets a target inventory level $y_t$, then the environment reveals a convex loss (e.g., newsvendor loss $\ell_t(y)=|y-d_t|$) and a subgradient $g_t$. Hihat et al. (2023) formalized this framework as Online Inventory Optimization (OIO) and proposed the MaxCOSD algorithm, proving sublinear static regret.

Limitations of Prior Work: Existing OIO algorithms almost exclusively guarantee static regret—comparing against a fixed optimal level $u$, i.e., $\sum_t\ell_t(y_t)-\min_{u}\sum_t\ell_t(u)$. However, in non-stationary environments with fluctuating demand, a "fixed level" reference is unreasonable. The authors provide a clear example: in a single-item system where demand $d_t=Dt/T$ grows linearly, the optimal total loss for a fixed level under newsvendor loss is $O(DT)$, while the time-varying reference $u_t=d_t$ yields a total loss of $0$. This means even an algorithm with $O(\sqrt{T})$ static regret might suffer $\Omega(T)$ regret relative to a truly time-varying reference.

Key Challenge: To evaluate algorithms in non-stationary environments, one should use dynamic regret $R_T(u_1,\dots,u_T)=\sum_t\ell_t(y_t)-\sum_t\ell_t(u_t)$, which allows the reference sequence $u_1,\dots,u_T$ to change every round. Standard OCO already has mature two-layer structures (meta + base learner) that achieve $O(\sqrt{(1+P_T)T})$ dynamic regret, where $P_T=\sum_{t\ge 2}\|u_{t-1}-u_t\|_1$ is the path length. However, OIO introduces an inventory carry-over constraint: the level $y_t$ cannot be less than the remaining inventory $x_t$ from the previous round, whereas the reference $u_t$ is not subject to this constraint. Thus, the feasible set for $u_t$ is always a superset of the feasible set for $y_t$. Standard OCO algorithms only provide guarantees for "references within the same feasible set"; applying them directly results in $O(T)$ regret due to the gap between $\hat{u}_t$ and $u_t$. Worse, naively plugging MaxCOSD as a base learner into a two-layer structure leads to self-contradiction: the meta-learner's $y_t$ might be larger than a base learner's output $y_t^a$. If demand is low, the carry-over inventory $x_{t+1}$ might exceed $y_t^a$, violating the base learner's own "$x_t\le y_{t-1}$" assumption and causing theoretical guarantees to collapse.

Goal: Construct an algorithm under carry-over constraints and warehouse capacity (linear sum) constraints that achieves a dynamic regret of $\tilde{O}(\sqrt{L_{\max}T(1+P_T)})$ without prior knowledge of $L_{\max}$ and $P_T$. Additionally, improve the static regret from the state-of-the-art $O(L_{\max}\sqrt{T})$ to $\tilde{O}(\sqrt{L_{\max}T})$ (saving a factor of $\sqrt{L_{\max}}$).

Core Idea: Utilize a two-stage projection—the base learner freely decides $\hat{y}_{t+1}$ within the feasible set $C(0)$ (containing only capacity constraints), which is then projected onto the feasible set $C(x_{t+1})$ (containing carry-over constraints) to obtain the actual order $y_t$. The authors prove that the cost of this projection is exactly a switching cost proportional to the sell-out period $L_{\max}$, thereby cleanly reducing OIO to Smoothed Online Convex Optimization (SOCO).

Method¶

Overall Architecture¶

Problem Setup (Alg. 1): For $N$ items, the inventory vector takes values in a convex set $C\subset\mathbb{R}^N_{\ge 0}$. In each round $t$: ① Replenish to the target level $y_t$ decided in the previous round; ② The environment reveals the carry-over inventory $x_{t+1}$ (where $x^i_{t+1}=\max(0,y^i_t-d^i_t)\le y^i_t$) and subgradient $g_t\in\partial\ell_t(y_t)$; ③ The decision-maker sets the next level $y_{t+1}\in C(x_{t+1})$. The feasible set is defined as: $$C(x):=\Big\{y\in[0,D]^N \;\Big|\; y^i\ge x^i\ \forall i,\ \textstyle\sum_{i}y^i\le D\Big\},$$ which is limited by the carry-over lower bound $y^i\ge x^i$ and the linear warehouse capacity constraint $\sum_i y^i\le D$. Subgradients are bounded $\|g_t\|_2\le G$. Note an engineering detail: the stockout penalty $p\max(0,d_t-y_t)$ of the newsvendor loss is linear in $y_t$, so the subgradient can be calculated just by observing whether a stockout occurred, without needing the true demand volume—making the algorithm applicable in retail scenarios with "lost sales/unobservable demand."

Mechanism (Alg. 2): The pipeline is concise. Every round, $g_t$ is fed to a base learner $E$, which outputs a decision $\hat{y}_{t+1}\in C(0)$ considering only capacity constraints. This is then projected to the carry-over feasible set: $y_{t+1}=\Pi_{C(x_{t+1})}(\hat{y}_{t+1})$. Parallel to this, the algorithm tracks the current "cycle length" $\max L_t$. If it exceeds the current parameter $L$, $L$ is doubled and the base learner is restarted. The analysis framework is: the two-stage projection ensures the regret of real decisions is controlled by the base learner's regret (Lemma 1), with a penalty of a switching cost proportional to cycle length; the doubling trick adaptively calibrates this switching cost without knowing $L_{\max}$; finally, selecting a SOCO-oriented base learner (SOGD) achieves near-optimal dynamic regret without knowing $P_T$.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Round t Observation:<br/>g_t and carry-over x_{t+1}"] --> B["Doubling Trick:<br/>Track maxL_t<br/>Double L and restart E if exceeded"]
    B --> C["SOGD Base Learner E:<br/>Decides ŷ_{t+1} in C(0)<br/>Ignoring carry-over constraints"]
    C --> D["Two-Stage Projection:<br/>y_{t+1} = Π_{C(x_{t+1})}(ŷ_{t+1})"]
    D -->|Projection Cost = Switching Cost proportional to L_max| E2["Reduction to SOCO:<br/>Dynamic Regret Õ(√(L_max T(1+P_T)))"]
    D --> A

Key Designs¶

1. Two-Stage Projection: Decoupling Inventory Carry-over Constraints from the Base Learner

The root cause of failure in naive two-layer structures is that the base learner's output is "contaminated" by carry-over constraints, breaking the internal assumption $x_t\le y_{t-1}$. This paper allows the base learner $E$ to completely ignore carry-over inventory, making decisions $\hat{y}_{t+1}$ only within the capacity-constrained set $C(0)=\{y\in[0,D]^N\mid\sum_i y^i\le D\}$. Carry-over constraints are addressed in the second stage via $y_{t+1}=\Pi_{C(x_{t+1})}(\hat{y}_{t+1})$. The key difference: unlike MaxCOSD's cyclic update where $y_t$ only updates if a candidate is feasible, this method allows the true level $y_t$ to change even if the base learner's decision $\hat{y}_t$ is infeasible—the base learner remains independent of carry-over inventory, preserving its theoretical guarantees.

2. OIO↔SOCO Reduction: Projection Cost as Switching Cost Proportional to Cycle Length (Lemma 1)

After decoupling, the question remains: how to compensate for the regret difference between the actual order $y_t$ and the base learner's $\hat{y}_t$? The authors introduce the "cycle" concept: for item $i$, a cycle is a continuous period where $\hat{y}^i_t$ cannot be realized due to high carry-over $x^i_t$ (i.e., $y^i_t > \hat{y}^i_t$). The cycle length is denoted $L^i_t$. A core lemma shows: $$\sum_{t=1}^{T}\langle g_t,\,y_t-\hat{y}_t\rangle \;\le\; 2G\sum_{t=1}^{T}\Big(\max_{i\in[N]}L^i_t\Big)\,\|\hat{y}_t-\hat{y}_{t+1}\|_1 .$$ Essentially, the extra regret of real decisions compared to the base learner is bounded by a switching cost term with a coefficient proportional to the current cycle length $L^*_t=\max_i L^i_t$. Thus, the total dynamic regret becomes: $$R_T\le\sum_{t=1}^T\big(\langle g_t,\hat{y}_t-u_t\rangle+2GL^*_t\|\hat{y}_t-\hat{y}_{t+1}\|_1\big),$$ where the right side is exactly the dynamic regret of a SOCO problem: the base learner chooses $\hat{y}_t\in C(0)$, incurring loss $\langle g_t,\hat{y}_t\rangle$ and switching cost $2GL^*_{t-1}\|\hat{y}_{t-1}-\hat{y}_t\|_1$. This step is the "pivot" of the paper—it translates OIO-specific time-varying carry-over constraints into well-studied switching costs in SOCO. The differences from standard SOCO are that $L^*_t$ is time-varying and delay-observable (cycle length is known only after it ends), and the switching cost uses the $\ell_1$ norm instead of $\ell_2$.

3. Doubling Trick for Unknown Switching Cost: Capping Cycle Length by Sell-Out Period $L_{\max}$

Since $L^*_t$ is unknown beforehand and delay-observable, it cannot be used directly to set learning rates. This paper uses a doubling trick: maintain a parameter $L$ and track the maximum observed cycle length $\max L_t$ (requiring $O(N)$ memory). Once $\max L_t > L$, let $L \leftarrow 2L$ and restart the base learner with the new parameter. This is justified by a key environmental characterization—the sell-out period $L_{\max}$ (Definition 1): $$L_{\max}:=\min\Big\{L\in[T]\ \Big|\ \textstyle\sum_{s=t}^{\min(t+L-1,T+1)}d^i_s\ge D,\ \forall t\in[T],\,\forall i\in[N]\Big\},$$ Intuitively, at least $D$ units of each item are sold within any continuous $L_{\max}$ rounds. For static demand $d^i$, $L_{\max}=\lceil D/d^i\rceil$. Lemma 2 proves: no cycle length exceeds $L_{\max}$—as carry-over inventory must be consumed within one sell-out period. Thus, the doubling trick triggers at most $O(\log L_{\max})$ times, and $L$ stabilizes at $O(L_{\max})$. Theorem 2 concludes: if the base learner's regret can be decomposed into $L^\alpha R(T)$ with switching cost $\le O(L^{-\beta})$, then Alg. 2 regret $\le C(\alpha)R^{E(2L_{\max},T)}+\Delta(L_{\max},\beta)$, where the doubling overhead $\Delta$ is $O(L_{\max}\log L_{\max})$ (for $\beta=1$), which is negligible for sufficiently large $T$.

4. SOGD Base Learner + Matching Lower Bound: Near-Optimal Regret without knowing $P_T$

Finally, a specific base learner is chosen. Simple Online Gradient Descent (OGD, Alg. 3) with an $L$-parameterized learning rate $\eta=\sqrt{\tfrac{2D(3D+P_T)}{G^2(\sqrt{N}L+1/2)T}}$ yields $O(\sqrt{L_{\max}(1+P_T)T}+L_{\max}\log L_{\max})$ (Theorem 3), but requires knowing $P_T$ in advance. Instead, this paper adopts Smoothed Online Gradient Descent (SOGD) (Alg. 5) from Zhang et al. (2022a): a meta-algorithm uses a Discounted-Normal-Predictor to adaptively aggregate multiple OGD experts with different learning rates. This achieves $R_T\le O(\sqrt{L_{\max}(1+P_T)T\log T}+L_{\max}\log L_{\max})$ (Theorem 4) without $P_T$ as a prior, at the cost of a $\log T$ factor in per-round computation. Complementing this, the paper provides the first $\Omega(GD\sqrt{L_{\max}T})$ lower bound for OIO (Theorem 5). The matching upper and lower bounds prove $\tilde{O}(\sqrt{L_{\max}T})$ is near-optimal, answering the open question from Hihat et al. (2023) and confirming the optimality of the $\sqrt{L}$ factor in SOCO (Corollary 1).

Key Experimental Results¶

This is a purely theoretical work; no numerical experiments were conducted. "Key data" refers to the comparison of regret bounds against existing work.

Main Results: Regret Comparison¶

The table compares literature results, converting demand parameters to the $L_{\max}$ metric (S/M denotes Single/Multi-item; NV/O/F denotes Newsvendor/Obsolete/Fixed cost loss).

Work	Regret Type	Upper Bound	Lower Bound	Items	Capacity	Demand
Huh & Rusmevichientong (2009)	Static	$O(L_{\max}\sqrt{T})$	—	S	Interval	i.i.d.
Zhang et al. (2018a)	Static	$O(L_{\max}\sqrt{T})$	$\Omega(\sqrt{T})$	S	Interval	i.i.d.
Agrawal & Jia (2022)	Static	$\tilde{O}(\sqrt{T}+L_{\max})$	—	S	Interval	i.i.d.
Hihat et al. (2023)	Static	$O(L_{\max}\sqrt{T})$	—	M	Convex	non-i.i.d.
Ours	Static	$\tilde{O}(\sqrt{L_{\max}T})$	$\Omega(\sqrt{L_{\max}T})$	M	Linear	non-i.i.d.
Ours	Dynamic	$\tilde{O}(\sqrt{L_{\max}(1+P_T)T})$	$\Omega(\sqrt{(1+P_T)T})$	M	Linear	non-i.i.d.

Results for different base learners¶

Base learner	Dynamic Regret Upper Bound	$P_T$ Prior Needed?	Computation/Round
OGD (Alg. 3, Theorem 3)	$O(\sqrt{L_{\max}(1+P_T)T}+L_{\max}\log L_{\max})$	Yes	$O(1)$
SOGD (Alg. 5, Theorem 4)	$O(\sqrt{L_{\max}(1+P_T)T\log T}+L_{\max}\log L_{\max})$	No	$O(\log T)$

Key Findings¶

Static Regret Improved by $\sqrt{L_{\max}}$: Prior works generally achieved $O(L_{\max}\sqrt{T})$; this work achieves $\tilde{O}(\sqrt{L_{\max}T})$, reducing the dependence on $L_{\max}$ from linear to square root, and proving optimality with an $\Omega(\sqrt{L_{\max}T})$ lower bound.
First Dynamic Regret Guarantee: Under linear capacity constraints, no previous work provided OIO algorithms with theoretical guarantees in non-stationary environments. The $\tilde{O}(\sqrt{L_{\max}(1+P_T)T})$ bound fills this gap.
Impossibility of Sublinear Regret when $L_{\max}=\Omega(T)$: The lower bound analysis shows that if the sell-out period is as long as $\Omega(T)$ (demand remains extremely low permanently), sublinear regret is unattainable—defining the solvable boundary of the "non-stationary" problem.

Highlights & Insights¶

Translating "Constraint Difference" to "Switching Cost": Lemma 1 is the most elegant contribution—addressing the fact that OIO's $y_t$ and reference $u_t$ reside in different feasible sets by quantifying the difference as a switching cost proportional to cycle length. This reduction strategy can be transferred to other online decision problems with state constraints.
Appropriate Environmental Metric $L_{\max}$: This metric only restricts demand fluctuations during the "window that determines $L_{\max}$" and does not restrict them elsewhere—providing an analytical handle without over-limiting non-stationarity, more fitting for inventory semantics than metrics like demand variance.
Symmetry of Lower Bounds: Because OIO and SOCO are connected via reduction, the OIO lower bound also proves the optimality of the $\sqrt{L}$ factor in SOCO—a structural observation where one domain's lower bound constrains another.
Subgradients only require "Stockout" observations: By leveraging the linearity of newsvendor loss, the algorithm updates without knowing the true demand volume, directly addressing the reality of unobservable demand in retail.

Limitations & Future Work¶

Lead Time and Fixed Costs: The model does not account for delays between ordering and arrival or fixed ordering costs. While these are studied under i.i.d. demand, extensions to non-stationary environments remain open.
Reliance on Linear Capacity Constraints: Capacity constraints are restricted to linear sums $\sum_i y^i\le D$. While earlier work allowed general convex constraints, this restriction was key for Lemma 5/6. Generalizing this is left for future work.
$\log T$ Computational Overhead: To achieve low dynamic regret, SOGD's meta-algorithm requires $O(T\log T)$ per round, which is more expensive than OGD.
Logarithmic Factors in Regret: The dynamic regret bound contains a $\sqrt{\log T}$ factor not strictly matched by the lower bound; removing this factor is a potential research direction.

vs. MaxCOSD (Hihat et al., 2023): The preceding work on the same OIO setup used cyclic updates that only achieved $O(L_{\max}\sqrt{T})$ static regret. This paper permits updates even when candidates are infeasible, improving static regret and providing the first dynamic results, albeit by narrowing general convex constraints to linear ones.
vs. Standard Two-Layer Dynamic Regret (Ader, Zhang et al., 2018b): Standard OCO structures fail in OIP because carry-over inventory breaks base learner assumptions. This paper's two-stage projection + SOCO reduction bypasses this contradiction.
vs. SOCO Series (Lin et al., 2011; Zhang et al., 2021/2022a): This paper reuses SOGD as a base learner after reduction and, in turn, uses OIO results to prove the optimality of SOCO parameters.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First near-optimal dynamic regret bound for OIO + first $\Omega(\sqrt{L_{\max}T})$ lower bound; elegant reduction to SOCO.
Experimental Thoroughness: ⭐⭐⭐ Purely theoretical; no numerical experiments, but theoretical "thoroughness" via matching bounds is high.
Writing Quality: ⭐⭐⭐⭐ Motivations are introduced with clear counter-examples; the reduction logical path (Lemma 1→2→Theorem 2→4) is well-structured.
Value: ⭐⭐⭐⭐ Resolves the open question from Hihat et al. (2023) and significantly advances online learning theory for non-stationary inventory management.

Work	Regret Type	Upper Bound	Lower Bound	Items	Capacity	Demand
Huh & Rusmevichientong (2009)	Static	\(O(L_{\max}\sqrt{T})\)	—	S	Interval	i.i.d.
Zhang et al. (2018a)	Static	\(O(L_{\max}\sqrt{T})\)	\(\Omega(\sqrt{T})\)	S	Interval	i.i.d.
Agrawal & Jia (2022)	Static	\(\tilde{O}(\sqrt{T}+L_{\max})\)	—	S	Interval	i.i.d.
Hihat et al. (2023)	Static	\(O(L_{\max}\sqrt{T})\)	—	M	Convex	non-i.i.d.
Ours	Static	\(\tilde{O}(\sqrt{L_{\max}T})\)	\(\Omega(\sqrt{L_{\max}T})\)	M	Linear	non-i.i.d.
Ours	Dynamic	\(\tilde{O}(\sqrt{L_{\max}(1+P_T)T})\)	\(\Omega(\sqrt{(1+P_T)T})\)	M	Linear	non-i.i.d.

Base learner	Dynamic Regret Upper Bound	\(P_T\) Prior Needed?	Computation/Round
OGD (Alg. 3, Theorem 3)	\(O(\sqrt{L_{\max}(1+P_T)T}+L_{\max}\log L_{\max})\)	Yes	\(O(1)\)
SOGD (Alg. 5, Theorem 4)	\(O(\sqrt{L_{\max}(1+P_T)T\log T}+L_{\max}\log L_{\max})\)	No	\(O(\log T)\)