Online Inventory Optimization in Non-Stationary Environment¶
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=mkl9iIXIr2
Code: TBD
Area: Learning Theory / Online Convex Optimization
Keywords: Online Inventory Optimization, Dynamic Regret, Online Convex Optimization, Smoothed Online Optimization, Switching Cost
TL;DR¶
This paper proposes a "two-stage projection + doubling trick" algorithm for Online Inventory Optimization (OIO) under non-stationary demand. By transforming the inventory carry-over constraint into a switching cost proportional to the sell-out period, the authors reduce OIO to Smoothed Online Convex Optimization (SOCO). They provide the first near-optimal dynamic regret bound of \(\tilde{O}(\sqrt{L_{\max}T(1+P_T)})\) along with a matching lower bound of \(\Omega(\sqrt{L_{\max}T})\).
Background & Motivation¶
Background: Inventory management is a core problem in supply chains. When demand models are unknown, the mainstream approach models "periodic replenishment + carry-over inventory" systems as Online Convex Optimization (OCO): in each round, the decision-maker sets a target inventory level \(y_t\), then the environment reveals a convex loss (e.g., newsvendor loss \(\ell_t(y)=|y-d_t|\)) and a subgradient \(g_t\). Hihat et al. (2023) formalized this framework as Online Inventory Optimization (OIO) and proposed the MaxCOSD algorithm, proving sublinear static regret.
Limitations of Prior Work: Existing OIO algorithms almost exclusively guarantee static regret—comparing against a fixed optimal level \(u\), i.e., \(\sum_t\ell_t(y_t)-\min_{u}\sum_t\ell_t(u)\). However, in non-stationary environments with fluctuating demand, a "fixed level" reference is unreasonable. The authors provide a clear example: in a single-item system where demand \(d_t=Dt/T\) grows linearly, the optimal total loss for a fixed level under newsvendor loss is \(O(DT)\), while the time-varying reference \(u_t=d_t\) yields a total loss of \(0\). This means even an algorithm with \(O(\sqrt{T})\) static regret might suffer \(\Omega(T)\) regret relative to a truly time-varying reference.
Key Challenge: To evaluate algorithms in non-stationary environments, one should use dynamic regret \(R_T(u_1,\dots,u_T)=\sum_t\ell_t(y_t)-\sum_t\ell_t(u_t)\), which allows the reference sequence \(u_1,\dots,u_T\) to change every round. Standard OCO already has mature two-layer structures (meta + base learner) that achieve \(O(\sqrt{(1+P_T)T})\) dynamic regret, where \(P_T=\sum_{t\ge 2}\|u_{t-1}-u_t\|_1\) is the path length. However, OIO introduces an inventory carry-over constraint: the level \(y_t\) cannot be less than the remaining inventory \(x_t\) from the previous round, whereas the reference \(u_t\) is not subject to this constraint. Thus, the feasible set for \(u_t\) is always a superset of the feasible set for \(y_t\). Standard OCO algorithms only provide guarantees for "references within the same feasible set"; applying them directly results in \(O(T)\) regret due to the gap between \(\hat{u}_t\) and \(u_t\). Worse, naively plugging MaxCOSD as a base learner into a two-layer structure leads to self-contradiction: the meta-learner's \(y_t\) might be larger than a base learner's output \(y_t^a\). If demand is low, the carry-over inventory \(x_{t+1}\) might exceed \(y_t^a\), violating the base learner's own "\(x_t\le y_{t-1}\)" assumption and causing theoretical guarantees to collapse.
Goal: Construct an algorithm under carry-over constraints and warehouse capacity (linear sum) constraints that achieves a dynamic regret of \(\tilde{O}(\sqrt{L_{\max}T(1+P_T)})\) without prior knowledge of \(L_{\max}\) and \(P_T\). Additionally, improve the static regret from the state-of-the-art \(O(L_{\max}\sqrt{T})\) to \(\tilde{O}(\sqrt{L_{\max}T})\) (saving a factor of \(\sqrt{L_{\max}}\)).
Core Idea: Utilize a two-stage projection—the base learner freely decides \(\hat{y}_{t+1}\) within the feasible set \(C(0)\) (containing only capacity constraints), which is then projected onto the feasible set \(C(x_{t+1})\) (containing carry-over constraints) to obtain the actual order \(y_t\). The authors prove that the cost of this projection is exactly a switching cost proportional to the sell-out period \(L_{\max}\), thereby cleanly reducing OIO to Smoothed Online Convex Optimization (SOCO).
Method¶
Overall Architecture¶
Problem Setup (Alg. 1): For \(N\) items, the inventory vector takes values in a convex set \(C\subset\mathbb{R}^N_{\ge 0}\). In each round \(t\): ① Replenish to the target level \(y_t\) decided in the previous round; ② The environment reveals the carry-over inventory \(x_{t+1}\) (where \(x^i_{t+1}=\max(0,y^i_t-d^i_t)\le y^i_t\)) and subgradient \(g_t\in\partial\ell_t(y_t)\); ③ The decision-maker sets the next level \(y_{t+1}\in C(x_{t+1})\). The feasible set is defined as: $\(C(x):=\Big\{y\in[0,D]^N \;\Big|\; y^i\ge x^i\ \forall i,\ \textstyle\sum_{i}y^i\le D\Big\},\)$ which is limited by the carry-over lower bound \(y^i\ge x^i\) and the linear warehouse capacity constraint \(\sum_i y^i\le D\). Subgradients are bounded \(\|g_t\|_2\le G\). Note an engineering detail: the stockout penalty \(p\max(0,d_t-y_t)\) of the newsvendor loss is linear in \(y_t\), so the subgradient can be calculated just by observing whether a stockout occurred, without needing the true demand volume—making the algorithm applicable in retail scenarios with "lost sales/unobservable demand."
Mechanism (Alg. 2): The pipeline is concise. Every round, \(g_t\) is fed to a base learner \(E\), which outputs a decision \(\hat{y}_{t+1}\in C(0)\) considering only capacity constraints. This is then projected to the carry-over feasible set: \(y_{t+1}=\Pi_{C(x_{t+1})}(\hat{y}_{t+1})\). Parallel to this, the algorithm tracks the current "cycle length" \(\max L_t\). If it exceeds the current parameter \(L\), \(L\) is doubled and the base learner is restarted. The analysis framework is: the two-stage projection ensures the regret of real decisions is controlled by the base learner's regret (Lemma 1), with a penalty of a switching cost proportional to cycle length; the doubling trick adaptively calibrates this switching cost without knowing \(L_{\max}\); finally, selecting a SOCO-oriented base learner (SOGD) achieves near-optimal dynamic regret without knowing \(P_T\).
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Round t Observation:<br/>g_t and carry-over x_{t+1}"] --> B["Doubling Trick:<br/>Track maxL_t<br/>Double L and restart E if exceeded"]
B --> C["SOGD Base Learner E:<br/>Decides ŷ_{t+1} in C(0)<br/>Ignoring carry-over constraints"]
C --> D["Two-Stage Projection:<br/>y_{t+1} = Π_{C(x_{t+1})}(ŷ_{t+1})"]
D -->|Projection Cost = Switching Cost proportional to L_max| E2["Reduction to SOCO:<br/>Dynamic Regret Õ(√(L_max T(1+P_T)))"]
D --> A
Key Designs¶
1. Two-Stage Projection: Decoupling Inventory Carry-over Constraints from the Base Learner
The root cause of failure in naive two-layer structures is that the base learner's output is "contaminated" by carry-over constraints, breaking the internal assumption \(x_t\le y_{t-1}\). This paper allows the base learner \(E\) to completely ignore carry-over inventory, making decisions \(\hat{y}_{t+1}\) only within the capacity-constrained set \(C(0)=\{y\in[0,D]^N\mid\sum_i y^i\le D\}\). Carry-over constraints are addressed in the second stage via \(y_{t+1}=\Pi_{C(x_{t+1})}(\hat{y}_{t+1})\). The key difference: unlike MaxCOSD's cyclic update where \(y_t\) only updates if a candidate is feasible, this method allows the true level \(y_t\) to change even if the base learner's decision \(\hat{y}_t\) is infeasible—the base learner remains independent of carry-over inventory, preserving its theoretical guarantees.
2. OIO↔SOCO Reduction: Projection Cost as Switching Cost Proportional to Cycle Length (Lemma 1)
After decoupling, the question remains: how to compensate for the regret difference between the actual order \(y_t\) and the base learner's \(\hat{y}_t\)? The authors introduce the "cycle" concept: for item \(i\), a cycle is a continuous period where \(\hat{y}^i_t\) cannot be realized due to high carry-over \(x^i_t\) (i.e., \(y^i_t > \hat{y}^i_t\)). The cycle length is denoted \(L^i_t\). A core lemma shows: $\(\sum_{t=1}^{T}\langle g_t,\,y_t-\hat{y}_t\rangle \;\le\; 2G\sum_{t=1}^{T}\Big(\max_{i\in[N]}L^i_t\Big)\,\|\hat{y}_t-\hat{y}_{t+1}\|_1 .\)$ Essentially, the extra regret of real decisions compared to the base learner is bounded by a switching cost term with a coefficient proportional to the current cycle length \(L^*_t=\max_i L^i_t\). Thus, the total dynamic regret becomes: $\(R_T\le\sum_{t=1}^T\big(\langle g_t,\hat{y}_t-u_t\rangle+2GL^*_t\|\hat{y}_t-\hat{y}_{t+1}\|_1\big),\)$ where the right side is exactly the dynamic regret of a SOCO problem: the base learner chooses \(\hat{y}_t\in C(0)\), incurring loss \(\langle g_t,\hat{y}_t\rangle\) and switching cost \(2GL^*_{t-1}\|\hat{y}_{t-1}-\hat{y}_t\|_1\). This step is the "pivot" of the paper—it translates OIO-specific time-varying carry-over constraints into well-studied switching costs in SOCO. The differences from standard SOCO are that \(L^*_t\) is time-varying and delay-observable (cycle length is known only after it ends), and the switching cost uses the \(\ell_1\) norm instead of \(\ell_2\).
3. Doubling Trick for Unknown Switching Cost: Capping Cycle Length by Sell-Out Period \(L_{\max}\)
Since \(L^*_t\) is unknown beforehand and delay-observable, it cannot be used directly to set learning rates. This paper uses a doubling trick: maintain a parameter \(L\) and track the maximum observed cycle length \(\max L_t\) (requiring \(O(N)\) memory). Once \(\max L_t > L\), let \(L \leftarrow 2L\) and restart the base learner with the new parameter. This is justified by a key environmental characterization—the sell-out period \(L_{\max}\) (Definition 1): $\(L_{\max}:=\min\Big\{L\in[T]\ \Big|\ \textstyle\sum_{s=t}^{\min(t+L-1,T+1)}d^i_s\ge D,\ \forall t\in[T],\,\forall i\in[N]\Big\},\)$ Intuitively, at least \(D\) units of each item are sold within any continuous \(L_{\max}\) rounds. For static demand \(d^i\), \(L_{\max}=\lceil D/d^i\rceil\). Lemma 2 proves: no cycle length exceeds \(L_{\max}\)—as carry-over inventory must be consumed within one sell-out period. Thus, the doubling trick triggers at most \(O(\log L_{\max})\) times, and \(L\) stabilizes at \(O(L_{\max})\). Theorem 2 concludes: if the base learner's regret can be decomposed into \(L^\alpha R(T)\) with switching cost \(\le O(L^{-\beta})\), then Alg. 2 regret \(\le C(\alpha)R^{E(2L_{\max},T)}+\Delta(L_{\max},\beta)\), where the doubling overhead \(\Delta\) is \(O(L_{\max}\log L_{\max})\) (for \(\beta=1\)), which is negligible for sufficiently large \(T\).
4. SOGD Base Learner + Matching Lower Bound: Near-Optimal Regret without knowing \(P_T\)
Finally, a specific base learner is chosen. Simple Online Gradient Descent (OGD, Alg. 3) with an \(L\)-parameterized learning rate \(\eta=\sqrt{\tfrac{2D(3D+P_T)}{G^2(\sqrt{N}L+1/2)T}}\) yields \(O(\sqrt{L_{\max}(1+P_T)T}+L_{\max}\log L_{\max})\) (Theorem 3), but requires knowing \(P_T\) in advance. Instead, this paper adopts Smoothed Online Gradient Descent (SOGD) (Alg. 5) from Zhang et al. (2022a): a meta-algorithm uses a Discounted-Normal-Predictor to adaptively aggregate multiple OGD experts with different learning rates. This achieves \(R_T\le O(\sqrt{L_{\max}(1+P_T)T\log T}+L_{\max}\log L_{\max})\) (Theorem 4) without \(P_T\) as a prior, at the cost of a \(\log T\) factor in per-round computation. Complementing this, the paper provides the first \(\Omega(GD\sqrt{L_{\max}T})\) lower bound for OIO (Theorem 5). The matching upper and lower bounds prove \(\tilde{O}(\sqrt{L_{\max}T})\) is near-optimal, answering the open question from Hihat et al. (2023) and confirming the optimality of the \(\sqrt{L}\) factor in SOCO (Corollary 1).
Key Experimental Results¶
This is a purely theoretical work; no numerical experiments were conducted. "Key data" refers to the comparison of regret bounds against existing work.
Main Results: Regret Comparison¶
The table compares literature results, converting demand parameters to the \(L_{\max}\) metric (S/M denotes Single/Multi-item; NV/O/F denotes Newsvendor/Obsolete/Fixed cost loss).
| Work | Regret Type | Upper Bound | Lower Bound | Items | Capacity | Demand |
|---|---|---|---|---|---|---|
| Huh & Rusmevichientong (2009) | Static | \(O(L_{\max}\sqrt{T})\) | — | S | Interval | i.i.d. |
| Zhang et al. (2018a) | Static | \(O(L_{\max}\sqrt{T})\) | \(\Omega(\sqrt{T})\) | S | Interval | i.i.d. |
| Agrawal & Jia (2022) | Static | \(\tilde{O}(\sqrt{T}+L_{\max})\) | — | S | Interval | i.i.d. |
| Hihat et al. (2023) | Static | \(O(L_{\max}\sqrt{T})\) | — | M | Convex | non-i.i.d. |
| Ours | Static | \(\tilde{O}(\sqrt{L_{\max}T})\) | \(\Omega(\sqrt{L_{\max}T})\) | M | Linear | non-i.i.d. |
| Ours | Dynamic | \(\tilde{O}(\sqrt{L_{\max}(1+P_T)T})\) | \(\Omega(\sqrt{(1+P_T)T})\) | M | Linear | non-i.i.d. |
Results for different base learners¶
| Base learner | Dynamic Regret Upper Bound | \(P_T\) Prior Needed? | Computation/Round |
|---|---|---|---|
| OGD (Alg. 3, Theorem 3) | \(O(\sqrt{L_{\max}(1+P_T)T}+L_{\max}\log L_{\max})\) | Yes | \(O(1)\) |
| SOGD (Alg. 5, Theorem 4) | \(O(\sqrt{L_{\max}(1+P_T)T\log T}+L_{\max}\log L_{\max})\) | No | \(O(\log T)\) |
Key Findings¶
- Static Regret Improved by \(\sqrt{L_{\max}}\): Prior works generally achieved \(O(L_{\max}\sqrt{T})\); this work achieves \(\tilde{O}(\sqrt{L_{\max}T})\), reducing the dependence on \(L_{\max}\) from linear to square root, and proving optimality with an \(\Omega(\sqrt{L_{\max}T})\) lower bound.
- First Dynamic Regret Guarantee: Under linear capacity constraints, no previous work provided OIO algorithms with theoretical guarantees in non-stationary environments. The \(\tilde{O}(\sqrt{L_{\max}(1+P_T)T})\) bound fills this gap.
- Impossibility of Sublinear Regret when \(L_{\max}=\Omega(T)\): The lower bound analysis shows that if the sell-out period is as long as \(\Omega(T)\) (demand remains extremely low permanently), sublinear regret is unattainable—defining the solvable boundary of the "non-stationary" problem.
Highlights & Insights¶
- Translating "Constraint Difference" to "Switching Cost": Lemma 1 is the most elegant contribution—addressing the fact that OIO's \(y_t\) and reference \(u_t\) reside in different feasible sets by quantifying the difference as a switching cost proportional to cycle length. This reduction strategy can be transferred to other online decision problems with state constraints.
- Appropriate Environmental Metric \(L_{\max}\): This metric only restricts demand fluctuations during the "window that determines \(L_{\max}\)" and does not restrict them elsewhere—providing an analytical handle without over-limiting non-stationarity, more fitting for inventory semantics than metrics like demand variance.
- Symmetry of Lower Bounds: Because OIO and SOCO are connected via reduction, the OIO lower bound also proves the optimality of the \(\sqrt{L}\) factor in SOCO—a structural observation where one domain's lower bound constrains another.
- Subgradients only require "Stockout" observations: By leveraging the linearity of newsvendor loss, the algorithm updates without knowing the true demand volume, directly addressing the reality of unobservable demand in retail.
Limitations & Future Work¶
- Lead Time and Fixed Costs: The model does not account for delays between ordering and arrival or fixed ordering costs. While these are studied under i.i.d. demand, extensions to non-stationary environments remain open.
- Reliance on Linear Capacity Constraints: Capacity constraints are restricted to linear sums \(\sum_i y^i\le D\). While earlier work allowed general convex constraints, this restriction was key for Lemma 5/6. Generalizing this is left for future work.
- \(\log T\) Computational Overhead: To achieve low dynamic regret, SOGD's meta-algorithm requires \(O(T\log T)\) per round, which is more expensive than OGD.
- Logarithmic Factors in Regret: The dynamic regret bound contains a \(\sqrt{\log T}\) factor not strictly matched by the lower bound; removing this factor is a potential research direction.
Related Work & Insights¶
- vs. MaxCOSD (Hihat et al., 2023): The preceding work on the same OIO setup used cyclic updates that only achieved \(O(L_{\max}\sqrt{T})\) static regret. This paper permits updates even when candidates are infeasible, improving static regret and providing the first dynamic results, albeit by narrowing general convex constraints to linear ones.
- vs. Standard Two-Layer Dynamic Regret (Ader, Zhang et al., 2018b): Standard OCO structures fail in OIP because carry-over inventory breaks base learner assumptions. This paper's two-stage projection + SOCO reduction bypasses this contradiction.
- vs. SOCO Series (Lin et al., 2011; Zhang et al., 2021/2022a): This paper reuses SOGD as a base learner after reduction and, in turn, uses OIO results to prove the optimality of SOCO parameters.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First near-optimal dynamic regret bound for OIO + first \(\Omega(\sqrt{L_{\max}T})\) lower bound; elegant reduction to SOCO.
- Experimental Thoroughness: ⭐⭐⭐ Purely theoretical; no numerical experiments, but theoretical "thoroughness" via matching bounds is high.
- Writing Quality: ⭐⭐⭐⭐ Motivations are introduced with clear counter-examples; the reduction logical path (Lemma 1→2→Theorem 2→4) is well-structured.
- Value: ⭐⭐⭐⭐ Resolves the open question from Hihat et al. (2023) and significantly advances online learning theory for non-stationary inventory management.