
Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP

Conference: NeurIPS 2025
arXiv: 2510.21453
Code: github.com/panyxy/moses_vrp
Area: Model Compression
Keywords: Vehicle Routing Problem, Mixture of Experts, State-Decomposable MDP, LoRA, Multi-Task Learning

TL;DR

This paper proposes the State-Decomposable MDP (SDMDP) framework, which reformulates multiple VRP variants as Cartesian products of base state spaces, and introduces the Mixture-of-Specialized-Experts Solver (MoSES), which leverages dedicated LoRA experts to enable latent space reuse of base policies, efficiently handling 16 VRP variants.

Background & Motivation

Background: The Vehicle Routing Problem (VRP) is a classic combinatorial optimization problem. Real-world applications involve numerous variants (capacity constraints, open routes, backhaul pickup, time windows, distance limits, etc.), and neural solvers have demonstrated strong performance on individual VRP variants.

Limitations of Prior Work: Specialized neural solvers require training from scratch for each VRP variant, incurring prohibitive costs. While unified solvers can handle multiple variants simultaneously, they fail to exploit the combinatorial structure of VRP variants—each variant is derived from shared base VRP components.

Key Challenge: The number of VRP variants grows exponentially with attribute combinations (four optional attributes on top of the capacitated base yield \(2^4 = 16\) variants from 5 base variants), and existing fine-tuning/adapter methods do not scale efficiently; unified models, meanwhile, fail to leverage the specialized knowledge of base solvers.

Goal: To enable unified solvers to be aware of the shared component nature among VRP variants, and to actively reuse specialized solvers trained on base VRP variants.

Key Insight: Starting from the decomposition of MDP state spaces, the paper proves that the optimality of base policies can be transferred to a unified policy.

Core Idea: The VRP state space is decomposed into a Cartesian product of base state spaces; frozen LoRA base experts combined with an adaptive gating mechanism mix policies in the latent space.

Method

Overall Architecture

A three-stage pipeline: (1) pre-train a shared backbone on CVRP as the 0-th base policy; (2) fine-tune dedicated LoRA experts for each base VRP variant using Gated-LoRA; (3) freeze backbone and LoRA weights, then train an adaptive gating mechanism to dynamically aggregate experts.

Key Design 1: State-Decomposable MDP (SDMDP)

  • Function: Formally decomposes the MDP state space of VRP into a Cartesian product of base state spaces.
  • Mechanism: The complete state space is \(\bar{\mathcal{S}}_t = \prod_{i=0}^{n} \mathcal{S}_t^{(i)}\), where each state \(\bar{s}_t = \{s_t^{(i)}\}_{i=0}^n\) decomposes into conditionally independent base states. For a variant composed of base variants \(\{b_i\}_{i=0}^m\), the transition probability factorizes accordingly: \(\mathcal{P}(\bar{s}_{t+1} \mid \bar{s}_t, a_t) = \prod_{i=0}^m \mathcal{P}(s_{t+1}^{(b_i)} \mid s_t^{(b_i)}, a_t)\).
  • Design Motivation: The dynamic contexts and constraints of VRP variants naturally originate from different base variants; this structured decomposition makes base policy reuse feasible.
  • Theoretical Guarantee (Theorem 1): The optimal unified policy \(\pi^*\) and the \(i\)-th optimal base policy \(\pi^{(i)*}\) yield equal value functions on the states of the corresponding base task.
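The factorized transition above can be illustrated with a toy sketch (all state spaces and transition tables here are hypothetical, not the paper's actual VRP dynamics): a composite state is a tuple of base states, and its transition probability is the product of per-base-variant transitions.

```python
import numpy as np

# Toy illustration of the SDMDP factorization: the composite transition
# P(s' | s, a) is the product of independent base-variant transitions.

rng = np.random.default_rng(0)

def random_transition(n_states: int, n_actions: int) -> np.ndarray:
    """P[s, a, s'] -- a row-stochastic transition table over s'."""
    p = rng.random((n_states, n_actions, n_states))
    return p / p.sum(axis=-1, keepdims=True)

# Two hypothetical base variants (e.g. a capacity state and a
# time-window state), each with 3 discrete states and 2 actions.
P_cap = random_transition(3, 2)
P_tw = random_transition(3, 2)

def composite_transition(s, s_next, a):
    """P(s_next | s, a) = prod_i P(s_next^(i) | s^(i), a)."""
    return P_cap[s[0], a, s_next[0]] * P_tw[s[1], a, s_next[1]]

# The factorized composite transition is still a valid distribution:
total = sum(composite_transition((0, 1), (i, j), a=0)
            for i in range(3) for j in range(3))
print(round(total, 6))  # 1.0
```

Because each base table is row-stochastic, the product automatically normalizes over the Cartesian product of next states, which is what makes the decomposition well-defined.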

Key Design 2: Latent Space-based SDMDP (LS-SDMDP)

  • Function: Reuses optimal base policies in the latent space.
  • Mechanism: A mixing function \(f_\phi\) maps embeddings \(\{z^{(b_i)}\}\) generated by base policies into a unified state embedding: \(z = f_\phi(z^{(b_0)}, \ldots, z^{(b_m)}; s)\). The policy is rewritten as: \(\pi(a \mid s) = \sum_{z^{(b_0)},\ldots,z^{(b_m)}} \pi(a \mid f_\phi(z^{(b_0)},\ldots,z^{(b_m)};s)) \prod_{i=0}^m \pi(z^{(b_i)} \mid s^{(b_i)})\)
  • Theoretical Guarantee (Theorem 2): Under mild assumptions, the optimal unified policy of LS-SDMDP can recover the optimal unified policy of SDMDP.
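A minimal sketch of the mixing function \(f_\phi\), assuming a plain learned linear map over the concatenated base embeddings (the paper's actual \(f_\phi\) is the gated aggregation described in Key Design 3; the dimensions and weights below are hypothetical):

```python
import numpy as np

# Sketch: combine base-policy embeddings z^(b_0..b_m) into one unified
# state embedding z = f_phi(z^(b_0), ..., z^(b_m); s), here via a single
# linear map over their concatenation.

d = 8            # embedding dim of each base policy (assumed)
m_plus_1 = 3     # number of base policies being reused (assumed)

rng = np.random.default_rng(0)
W_phi = rng.standard_normal((d, m_plus_1 * d)) / np.sqrt(m_plus_1 * d)

def f_phi(base_embeddings):
    """Map the list of base embeddings to a unified embedding of dim d."""
    return W_phi @ np.concatenate(base_embeddings)

z_bases = [rng.standard_normal(d) for _ in range(m_plus_1)]
z = f_phi(z_bases)
print(z.shape)  # (8,)
```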

Key Design 3: MoSES Implementation

  • Gated-LoRA Experts: For each base VRP variant (excluding CVRP), LoRA matrices \(B_i A_i\) are appended to the linear layers of the frozen backbone, with dynamic gating to suppress task-irrelevant features in the backbone: \(h^{\text{out}} = \text{sigmoid}(\langle W^g, h^{\text{in}} \rangle)\, W_0 h^{\text{in}} + B_i A_i h^{\text{in}}\)
  • Adaptive Gating Aggregation: With all weights frozen, a gating function \(G(h^{\text{in}}) = \text{act}(W^G h^{\text{in}})\) is trained to compute coefficients \(\{\alpha_i\}_{i=0}^4\), augmented with a residual LoRA \(\hat{B}\hat{A}\): \(h^{\text{out}} = \alpha_0 W_0 h^{\text{in}} + \sum_{i=1}^4 \alpha_i B_i A_i h^{\text{in}} + \hat{B}\hat{A} h^{\text{in}}\)
  • Three Gating Activations: softmax (convex combination), norm_softplus (prevents gradient vanishing), sigmoid (relaxes the unity-sum constraint).
  • Three Routing Strategies: Dense Routing, Variant-Aware Routing-I (Top-K), Variant-Aware Routing-II (exact matching).
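The two layer variants above can be sketched as follows. This is an illustrative numpy rendering, not the paper's implementation: all dimensions and weights are hypothetical, and only the softmax gating activation is shown.

```python
import numpy as np

# Stage 2 (Gated-LoRA expert): a sigmoid gate on the frozen backbone
# branch suppresses task-irrelevant features, plus a per-expert LoRA.
# Stage 3 (adaptive gating): mix the frozen backbone and LoRA experts
# with learned coefficients alpha_0..alpha_4, plus a residual LoRA.

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 4   # hidden dim, LoRA rank, number of experts

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

W0 = rng.standard_normal((d, d)) / np.sqrt(d)      # frozen backbone weight
Wg = rng.standard_normal(d) / np.sqrt(d)           # gate vector W^g
A = [rng.standard_normal((r, d)) / np.sqrt(d) for _ in range(n_experts)]
B = [np.zeros((d, r)) for _ in range(n_experts)]   # standard LoRA init: B = 0

def gated_lora(h_in, i):
    """Stage-2 expert i: sigmoid(<Wg, h>) * W0 h + B_i A_i h."""
    return sigmoid(Wg @ h_in) * (W0 @ h_in) + B[i] @ (A[i] @ h_in)

WG = rng.standard_normal((n_experts + 1, d)) / np.sqrt(d)  # gating W^G
A_hat = rng.standard_normal((r, d)) / np.sqrt(d)           # residual LoRA
B_hat = np.zeros((d, r))

def adaptive_mix(h_in):
    """Stage-3: alpha_0 W0 h + sum_i alpha_i B_i A_i h + B^ A^ h."""
    alpha = softmax(WG @ h_in)          # softmax activation variant
    out = alpha[0] * (W0 @ h_in)
    for i in range(n_experts):
        out += alpha[i + 1] * (B[i] @ (A[i] @ h_in))
    return out + B_hat @ (A_hat @ h_in)

h = rng.standard_normal(d)
print(gated_lora(h, 0).shape, adaptive_mix(h).shape)  # (16,) (16,)
```

With the zero-initialized \(B\) matrices, both layers reduce at initialization to a gated backbone pass, matching the usual LoRA property that training starts from the pretrained function.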

Loss & Training

The REINFORCE algorithm is used, with reward \(-c(\tau)\) (negative total route length); training proceeds through three separate stages, each optimized independently.
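The training objective can be illustrated on a toy bandit-style problem (the routes, costs, and baseline below are hypothetical; the paper trains a full sequential policy, typically with a shared/greedy-rollout baseline):

```python
import numpy as np

# REINFORCE sketch: reward is the negative route cost -c(tau), and the
# policy gradient estimate is (R - baseline) * grad log pi. Here the
# "policy" is a softmax over 3 fixed candidate routes.

rng = np.random.default_rng(0)

logits = np.zeros(3)                      # toy policy parameters
route_costs = np.array([5.0, 3.0, 4.0])   # c(tau) per route (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    p = softmax(logits)
    a = rng.choice(3, p=p)
    reward = -route_costs[a]               # R = -c(tau)
    baseline = -(route_costs @ p)          # expected reward as baseline
    # grad of log pi(a) wrt logits is onehot(a) - p; ascend the estimate
    grad_logp = -p
    grad_logp[a] += 1.0
    logits += 0.5 * (reward - baseline) * grad_logp

print(int(np.argmax(logits)))  # index of the cheapest route (should be 1)
```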

Key Experimental Results

Main Results: 16 VRP Variants (N=50, N=100)

| Solver (N=50) | CVRP Cost | Gap | OVRP Cost | Gap |
| --- | --- | --- | --- | --- |
| HGS-PyVRP (reference) | 10.372 | * | 6.507 | * |
| RF-TE | 10.504 | 1.276% | 6.684 | 2.693% |
| CaDA | 10.483 | 1.072% | 6.662 | 2.350% |
| MoSES(RF) | 10.465 | 0.900% | 6.632 | 1.892% |
| MoSES(CaDA) | 10.462 | 0.873% | 6.629 | 1.857% |

| Solver (N=100) | VRPTW Cost | Gap | VRPBLTW Cost | Gap |
| --- | --- | --- | --- | --- |
| HGS-PyVRP (reference) | 25.423 | * | 29.026 | * |
| RF-TE | 26.234 | 3.177% | 30.688 | 2.923% |
| CaDA | 26.128 | 2.753% | 30.586 | 2.579% |
| MoSES(RF) | 26.143 | 2.822% | 30.627 | 2.712% |
| MoSES(CaDA) | 26.032 | 2.383% | 30.510 | 2.329% |

Ablation Study

  • Gating Activation: sigmoid > norm_softplus > softmax; relaxing the unity-sum constraint enables more flexible expert combination.
  • Routing Strategy: Variant-Aware Routing-II (exact matching of base variants) is generally optimal.
  • Residual LoRA: Removing \(\hat{B}\hat{A}\) degrades performance, confirming the expressiveness limitation of linear expert mixing alone.
  • Gated-LoRA vs. Standard LoRA: The gating mechanism significantly outperforms direct LoRA fine-tuning.

Key Findings

  • MoSES outperforms existing unified solvers (MTPOMO, MVMoE, RF series, CaDA) across all 16 VRP variants.
  • Inference time increases by approximately 2–3× compared to baselines (due to parallel expert computation), but remains far faster than classical solvers such as HGS-PyVRP (seconds vs. minutes).
  • The framework is plug-and-play: applicable to different backbone networks (RF and CaDA).

Highlights & Insights

  1. Theory-Practice Unification: SDMDP is not merely a formalization tool; Theorems 1 and 2 provide theoretical justification for expert reuse.
  2. Exploiting Combinatorial Structure: This is the first work to explicitly model the combinatorial attribute structure of VRP as state space decomposition, circumventing the exponential growth in the number of variants.
  3. Parameter Efficiency: The frozen backbone + LoRA architecture makes the marginal cost of adding new base variants extremely low.
  4. Interpretable Expert Activation: Gating coefficients can reveal the degree to which different VRP variants depend on each base policy.

Limitations & Future Work

  1. Domain Knowledge Dependency in Base Variant Definition: The selection of 5 base VRP variants is predefined; automatic discovery of base tasks remains a future direction.
  2. Inference Efficiency: Multi-expert parallelism introduces inference overhead (~2–3×), which may be prohibitive in real-time scenarios.
  3. Scaling to More Constraints: How the gating mechanism scales as the number of base variants grows further requires investigation.
  4. Gap from Exact Methods: An optimality gap of 2–5% remains on complex variants.
  5. LoRA Rank Selection: A fixed rank is used throughout; adaptive rank selection may yield further gains.

Comparisons & Broader Implications

  • Distinction from MVMoE: MVMoE learns implicit, less interpretable expert specialization, whereas MoSES uses explicitly pre-trained base solvers as experts.
  • Distinction from LoRA Hub-style Methods: MoSES goes beyond simple composition of LoRA adapters, providing a theoretical guarantee of optimal policy recovery.
  • Implications for Other Combinatorial Optimization: The state decomposition idea underlying SDMDP can be generalized to any combinatorial optimization setting that naturally decomposes into sub-problems.

Rating

⭐⭐⭐⭐ (4/5)

An outstanding contribution with rigorous theory and comprehensive experiments. The SDMDP framework elegantly addresses the exponential proliferation of VRP variants, and the MoSES implementation is concise and effective. The primary limitations are inference efficiency overhead and the manual dependency in defining base variants.