ICLR2026 Reinforcement Learning vehicle routing problem multi-task learning reinforcement-learning constraint-aware decoding neural combinatorial optimization

Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs¶

Conference: ICLR2026 arXiv: 2603.01667 Code: To be confirmed Area: Reinforcement Learning Keywords: vehicle routing problem, multi-task learning, reinforcement-learning, constraint-aware decoding, neural combinatorial optimization

TL;DR¶

This paper proposes Chain-of-Context Learning (CCL), which achieves stepwise dynamic constraint-aware decoding via Relevance-Guided Context Reformulation (RGCR, adaptively aggregating constraint information to construct context) and Trajectory-Shared Node Re-embedding (TSNR, sharing node updates across trajectories to avoid redundant computation). CCL comprehensively outperforms existing methods across 48 VRP variants (16 in-distribution + 32 out-of-distribution).

Background & Motivation¶

Background: The Vehicle Routing Problem (VRP) is a core task in logistics optimization. Neural solvers employ RL with an encoder-decoder architecture to autoregressively generate routes, achieving high efficiency but typically training on a single VRP variant. Multi-task VRP requires a unified framework to handle diverse constraint combinations (capacity, backhaul, open routes, time windows, multi-depot, etc.).

Limitations of Prior Work: Existing multi-task VRP solvers (MTPOMO, MVMoE, CaDA, etc.) encode constraint and node information as static embeddings—node representations are not updated throughout decoding once encoding is complete. However, decoding is sequential: as routes progress, the relative importance of different constraints shifts (e.g., distance limits become more critical when the vehicle is nearly full), and static embeddings cannot capture this dynamic.

Key Challenge: Although the context (e.g., current time, remaining capacity) is updated at each step, node embeddings are not—resulting in a context-node misalignment that leads to inaccurate state estimation and degraded decision quality. This violates the Markov property in MDPs, which requires that the state contain sufficient information for optimal decision-making.

Goal: To enable both constraint information and node representations to be progressively and dynamically updated during multi-task VRP decoding, achieving accurate state representation at every step.

Key Insight: Constraint information is explicitly incorporated into context construction at each step (RGCR), and the updated context is in turn used to refresh node embeddings (TSNR), forming a "context chain" in which node embeddings at each step carry historical information.

Core Idea: Synchronously update both context and node embeddings during RL-based VRP decoding, so that each decision is grounded in the most accurate current state rather than outdated static embeddings.

Method¶

Overall Architecture¶

CCL follows the encoder-decoder paradigm. Encoder: A Transformer encodes constraint tokens and node attributes into initial embeddings. Decoder: At each step, RGCR first constructs a context embedding by aggregating constraint attributes and the current node → TSNR then updates node embeddings by incorporating multi-trajectory context → the updated context and node embeddings are used to make selection decisions. During training, 16 VRP tasks (4 constraint combinations) are mixed; at inference, the model generalizes zero-shot to 32 additional variants.

Key Designs¶

RGCR (Relevance-Guided Context Reformulation):
- Function: Adaptively aggregates information from each constraint at every step to construct the current-step context embedding.
- Mechanism: Operates in three stages—(1) each constraint's attributes are projected via a linear layer to produce constraint embeddings \(\mathbf{C}^k\) (e.g., remaining capacity and demand for the capacity constraint, time windows and current time for the TW constraint); (2) the dot-product relevance between each constraint embedding and the current node embedding is computed as \(s^k = \mathbf{H}_{\tau} \cdot \mathbf{C}^k\); (3) constraint embeddings are relevance-weighted and summed, then concatenated with the raw constraint features to yield a unified context.
- Design Motivation: Different constraints carry different importance at different decoding steps—capacity constraints are most critical when the vehicle is nearly full, and TW constraints become paramount near a window's closing time. Weighting constraints by their similarity to the current node embedding allows the model to automatically learn where to focus attention.
TSNR (Trajectory-Shared Node Re-embedding):
- Function: Efficiently updates shared node embeddings using multi-trajectory context.
- Mechanism: Node embeddings serve as queries; the concatenation of node embeddings and all trajectory contexts serves as keys/values; multi-head attention is applied to update node embeddings. A distance bias term \(\mathbf{B}_j\) (inter-node distances plus the distance from each trajectory's current position to each node) is added to prevent overfitting to time windows. Updated node embeddings are passed to the next step as queries.
- Design Motivation: In multi-trajectory parallel exploration, independently updating node embeddings per trajectory incurs prohibitive computational overhead. TSNR allows all trajectories to share a single set of node embeddings while dynamically updating them by aggregating per-trajectory context via attention, balancing efficiency and information completeness.
Chain-of-Context Propagation:
- Function: Propagates node embeddings across steps to capture sequential dependencies.
- Mechanism: The output \(\mathbf{H}_j\) from TSNR at step \(j\) serves as the input query at step \(j+1\). Update frequency is controlled by probability \(P_{tr}\) (training) and \(P_{ts}\) (testing).
- Design Motivation: Prior methods make decisions at each step using initial (step-0) embeddings combined with the current context, discarding information accumulated over intermediate steps. Chain propagation allows node embeddings at each step to "remember" the historical decision context.

Loss & Training¶

The REINFORCE algorithm is used with multi-trajectory parallel exploration (different starting points); the reward is the negative total route length. Training mixes 16 VRP variants (combinations of 4 constraints: B/L/O/TW) with uniform sampling. At inference, the model directly generalizes zero-shot to 32 new variants incorporating MB/MD constraints not seen during training.

Key Experimental Results¶

Main Results (16 In-Distribution Tasks, \(N=50\))¶

Method	CVRP Gap	OVRP Gap	VRPTW Gap	OVRPTW Gap	Avg. Gap
HGS-PyVRP	0%*	0%*	0%*	0%*	Baseline
MTPOMO	1.42%	3.19%	2.42%	1.56%	~2.1%
CaDA	1.29%	2.59%	1.75%	1.12%	~1.7%
CaDA†	0.96%	2.21%	1.67%	1.03%	~1.5%
CCL	0.98%	1.96%	0.98%	0.54%	~1.1%
CCL†	0.88%	1.57%	0.91%	0.51%	~1.0%

Ablation Study¶

Configuration	Avg. Gap (\(N=50\))	Note
CCL (full)	1.0%	All components
w/o RGCR	~1.5%	No adaptive constraint aggregation
w/o TSNR	~1.4%	No node embedding updates; degrades to static embeddings
w/o distance bias	~1.2%	No distance bias in attention
w/o chain propagation	~1.3%	Uses initial embeddings instead of previous-step embeddings

Key Findings¶

CCL achieves the best performance on all 16 in-distribution tasks: The advantage is largest on tasks with time-window constraints (VRPTW, OVRPTW), confirming that dynamic context is most valuable for time-sensitive constraints.
CCL also leads on the majority of 32 out-of-distribution tasks: Zero-shot generalization to MB/MD constraint combinations unseen during training demonstrates that the dynamic constraint-awareness mechanism transfers well.
Both RGCR and TSNR contribute substantially: Removing either module increases the gap by 0.3–0.5%, confirming that dynamic context construction and dynamic node updating are both indispensable.
Computational overhead is reasonable: CCL inference takes approximately 5–7 s (\(N=50\))—somewhat slower than CaDA (1–3 s) but far faster than HGS-PyVRP (10+ minutes), and the quality gap (~1%) is considerably smaller than that of other neural methods.

Highlights & Insights¶

Core insight on context-node alignment: The paper explicitly identifies the misalignment between static node embeddings and dynamic context as a deficiency in MDP state representation. This framing suggests a general improvement applicable to any autoregressive decoding scheme where the environment state changes over time.
Efficiency design via trajectory-shared node embeddings: Sharing node embeddings across trajectories and aggregating per-trajectory context via attention avoids the \(O(N)\) overhead of per-trajectory updates—a practical engineering optimization for multi-trajectory parallel search.
Distance bias as an anti-overfitting trick: Without the distance bias, the model is found to over-rely on time-window information; adding a simple Euclidean distance bias mitigates this. This technique may generalize to other combinatorial optimization problems with multiple constraints.

Limitations & Future Work¶

Evaluation limited to the VRP family: Although VRP encompasses many variants, combinatorial optimization extends far beyond VRP. CCL's applicability to other problems such as TSP, JSP, and BPP remains unverified.
Limited training constraint diversity: Training uses combinations of only 4 constraints (B/L/O/TW) with zero-shot transfer to 2 new constraint types. Scalability to larger constraint spaces (e.g., stochastic travel times, dynamic demands) requires validation.
~1% gap versus LKH/HGS persists: Despite inference speedups exceeding 100×, CCL's absolute solution quality still falls short of traditional solvers, which may be unacceptable in precision-critical applications without additional search augmentation.
TSNR update frequency requires tuning: \(P_{tr}\) and \(P_{ts}\) are hyperparameters whose optimal values may vary with problem scale and constraint type.

vs. CaDA: CaDA also targets multi-task VRP but maintains static node embeddings. CCL's dynamic node updates yield consistent improvements, especially on complex constraint combinations involving time windows.
vs. MTPOMO: CCL inherits MTPOMO's multi-trajectory exploration paradigm but adds constraint-aware context construction and trajectory-shared node updating.
vs. the classic Attention Model: The original AM (Kool et al., 2019) relies on static encoding; TSNR can be viewed as an enhancement to the AM decoder that frees node embeddings from being "encoded once and frozen for life."

Rating¶

Novelty: ⭐⭐⭐⭐ — Dynamic context-node alignment is a clear and significant contribution; the RGCR+TSNR design is well-motivated.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 48 VRP variants (16 in-distribution + 32 OOD), multiple scales, and complete ablations.
Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear; the three-challenge / three-solution correspondence provides a coherent structure.
Value: ⭐⭐⭐⭐ — Offers broadly applicable insights for the decoding process in combinatorial optimization RL.