
Parametrized Multi-Agent Routing via Deep Attention Models

Conference: AAAI 2026 arXiv: 2507.22338 Authors: Salar Basiri, Dhananjay Tiwari, Srinivasa M. Salapaka (UIUC) Code: GitHub Area: Optimization Keywords: Combinatorial Optimization, Multi-Agent Path Planning, Maximum Entropy Principle, Transformer, Facility Location

TL;DR

This paper proposes the Deep FLPO framework, which integrates the algebraic structure of the Maximum Entropy Principle (MEP) with a permutation-invariant encoder-decoder neural network (SPN) to address the NP-hard mixed-integer problem of joint facility location and routing optimization. The framework achieves roughly a 100× speedup in policy inference and matches Gurobi's exact solutions while running about 1500× faster.

Background & Motivation

Parametrized Sequential Decision-Making (ParaSDM)

Traditional neural combinatorial optimization (NCO) focuses on discrete optimization over fixed state/action spaces (e.g., TSP, VRP). However, a more complex class of problems exists in practice — Parametrized Sequential Decision-Making (ParaSDM): \(N\) agents make discrete decisions over \(K\) steps while jointly optimizing a set of shared continuous parameters \(\mathcal{Y} \subset \mathbb{R}^d\) (e.g., facility locations). This induces strong coupling and high non-convexity between discrete policies and continuous parameters.

The FLPO Problem

A canonical instance of ParaSDM is the Facility Location and Path Optimization (FLPO) problem: \(N\) agents (e.g., drones) must travel from a source to a destination through \(M\) facilities (e.g., charging stations), with the objective of jointly optimizing discrete routes \(\eta^i(\gamma)\) and continuous node positions \(\mathcal{Y}\) to minimize total transportation cost. Even at the small scale of \(N=2, M=4\), the solution space contains numerous local optima.

Scalability Bottleneck of MEP-Based Methods

Prior work based on the Maximum Entropy Principle (MEP) relaxes hard assignments \(\eta^i \in \{0,1\}\) to soft probabilities \(p^i \in [0,1]\), balancing exploration and exploitation via an annealing parameter \(\beta\). The MEP policy variables admit a closed-form Gibbs distribution:

\[p_\beta^i(\gamma) = \frac{\exp[-\beta \cdot d^i(\gamma)]}{\sum_{\gamma' \in \mathcal{G}^i} \exp[-\beta \cdot d^i(\gamma')]}\]

However, the per-step gradient complexity is \(O(NM^4)\) (even after Markov DP optimization), rendering it infeasible for medium-scale problems.
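The Gibbs policy above is straightforward to compute numerically. A minimal sketch (function name and example costs are illustrative, not from the paper): at \(\beta = 0\) the distribution is uniform (pure exploration), and as \(\beta\) grows it concentrates on the cheapest path (pure exploitation).

```python
import torch

def gibbs_policy(costs: torch.Tensor, beta: float) -> torch.Tensor:
    """Soft path-assignment probabilities p_beta(gamma) ∝ exp(-beta * d(gamma)).

    costs: path costs d(gamma') for one agent, shape (num_paths,).
    Implemented as a softmax over -beta * costs for numerical stability.
    """
    return torch.softmax(-beta * costs, dim=-1)

costs = torch.tensor([1.0, 2.0, 4.0])
print(gibbs_policy(costs, beta=0.0))   # uniform: exploration at low beta
print(gibbs_policy(costs, beta=10.0))  # concentrates on the cheapest path
```

Annealing \(\beta\) from low to high thus sweeps the policy from uniform exploration toward the hard shortest-path assignment.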

Method

Shortest Path Network (SPN)

The SPN is a permutation-invariant encoder-decoder model that approximates the MEP Gibbs policy distribution.

Encoder: Employs an induced attention mechanism. The source \(S\), destination \(\Delta\), and facility coordinates \(Y^t\) for all agents are concatenated into \(E_0 \in \mathbb{R}^{N \times (M+2) \times d_h}\), processed through \(L_{\text{enc}}\) attention blocks. Crucially, \(M^*\) learnable inducing points \(T \in \mathbb{R}^{M^* \times d_h}\) reduce the standard self-attention complexity from \(O(NM^2)\) to \(O(NM \cdot M^*)\):

\[\hat{E}_{l-1} = \text{MHA}(T, E_{l-1}, E_{l-1})\]

\[I_l = \text{LN}(E_{l-1} + \text{MHA}(E_{l-1}, \hat{E}_{l-1}, \hat{E}_{l-1}))\]
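The two-step routing above can be sketched with standard PyTorch attention layers (a minimal, ISAB-style sketch; class name, layer sizes, and the residual/normalization placement are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class InducedAttentionBlock(nn.Module):
    """Attention routed through M* learnable inducing points T:
    cost O(n_nodes * M*) per instance instead of O(n_nodes^2)."""

    def __init__(self, d_h: int, m_star: int, n_heads: int = 4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(m_star, d_h))  # T
        self.mha1 = nn.MultiheadAttention(d_h, n_heads, batch_first=True)
        self.mha2 = nn.MultiheadAttention(d_h, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_h)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, n_nodes, d_h) node embeddings E_{l-1}
        t = self.inducing.unsqueeze(0).expand(e.size(0), -1, -1)
        # E_hat = MHA(T, E, E): inducing points summarize the node set
        e_hat, _ = self.mha1(t, e, e)
        # I_l = LN(E + MHA(E, E_hat, E_hat)): nodes read the summary back
        out, _ = self.mha2(e, e_hat, e_hat)
        return self.ln(e + out)

x = torch.randn(2, 12, 32)            # 2 instances, 12 nodes, d_h = 32
block = InducedAttentionBlock(32, m_star=4)
print(block(x).shape)                 # torch.Size([2, 12, 32])
```

Because the node set only ever attends to (and is attended by) the \(M^*\) inducing points, the quadratic term in the node count disappears, which is what enables the encoder to scale to hundreds of facilities.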

Decoder: Fuses current position and destination information via gated interpolation to produce a goal-aware query vector:

\[\alpha = \sigma([h_X, h_\Delta] W), \quad q = \alpha \odot h_X + (1-\alpha) \odot h_\Delta\]

\[\Pi_k = \text{softmax}(E_{L_{\text{enc}}} \cdot q / \sqrt{d_h})\]
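The gated query and the resulting next-node distribution can be sketched as follows (function names and the random test shapes are illustrative; in practice already-visited nodes would also be masked out of the logits):

```python
import torch

def goal_aware_query(h_x, h_delta, w):
    """Gated interpolation of current-position and destination embeddings.
    h_x, h_delta: (batch, d_h); w: (2*d_h, d_h) gate weights."""
    alpha = torch.sigmoid(torch.cat([h_x, h_delta], dim=-1) @ w)
    return alpha * h_x + (1 - alpha) * h_delta

def decode_policy(e_enc, q, d_h):
    """Pi_k = softmax(E_enc q / sqrt(d_h)) over candidate next nodes.
    e_enc: (batch, n_nodes, d_h); q: (batch, d_h)."""
    logits = (e_enc @ q.unsqueeze(-1)).squeeze(-1) / d_h ** 0.5
    return torch.softmax(logits, dim=-1)

d_h = 32
h_x, h_delta = torch.randn(2, d_h), torch.randn(2, d_h)
w = torch.randn(2 * d_h, d_h)
q = goal_aware_query(h_x, h_delta, w)
pi = decode_policy(torch.randn(2, 12, d_h), q, d_h)
print(pi.shape)   # each row of pi sums to 1
```

The learned gate \(\alpha\) lets the query emphasize the current position early in a route and the destination as the route nears completion.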

Two decoder variants are proposed: DED (Direct Embedding Decoder) and DCAD (Dual Cross-Attention Decoder). Ablation studies show that DED with annealing supervision achieves the best performance, with an average optimality gap of approximately 6%.

Mixed-Sampling Gradient Estimation

To approximate the annealing behavior of MEP, \(L\) paths are sampled per agent: the first \(b\) are generated via SPN beam search or sampling (short paths assigned high Gibbs weights for exploitation), while the remaining \(L-b\) are drawn from uniform random sampling (promoting exploration at low \(\beta\)). The free energy gradient is estimated as:

\[\widehat{\nabla_{\mathcal{Y}^t} F_\beta} = \sum_{i=1}^N \rho_i \sum_{q=1}^L \hat{p}_\beta^i(\gamma^q) \nabla_{\mathcal{Y}^t} d^i(\gamma^q)\]

where \(\hat{p}_\beta^i\) re-normalizes Gibbs weights over the sample set. All gradients \(\nabla_{\mathcal{Y}^t} d^i(\gamma^q)\) are obtained via PyTorch automatic differentiation.
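A minimal sketch of this estimator (all names, the Euclidean path cost, and the tiny example instance are illustrative assumptions): the renormalized Gibbs weights \(\hat{p}_\beta^i\) are detached so that autograd differentiates only the path costs, matching the estimator above term by term.

```python
import torch

def mixed_sampling_free_energy(y, sources, dests, paths, beta, rho):
    """Sampled surrogate whose autograd gradient w.r.t. y equals
    sum_i rho_i sum_q p_hat_beta(gamma_q) * grad d^i(gamma_q).

    y: (M, d) facility positions with requires_grad=True.
    paths: per-agent list of facility-index sequences (mixing SPN-sampled
    short paths with uniform random paths).  Costs are Euclidean lengths.
    """
    total = y.new_zeros(())
    for i, agent_paths in enumerate(paths):
        costs = []
        for gamma in agent_paths:
            pts = torch.cat([sources[i:i + 1], y[gamma], dests[i:i + 1]])
            costs.append((pts[1:] - pts[:-1]).norm(dim=-1).sum())
        costs = torch.stack(costs)                        # d^i(gamma^q)
        p_hat = torch.softmax(-beta * costs.detach(), 0)  # renormalized Gibbs
        total = total + rho[i] * (p_hat * costs).sum()
    return total

M, d = 5, 2
y = torch.rand(M, d, requires_grad=True)
sources, dests = torch.rand(2, d), torch.rand(2, d)
paths = [[torch.tensor([0, 2]), torch.tensor([1])],   # agent 1's samples
         [torch.tensor([3]), torch.tensor([4, 0])]]   # agent 2's samples
mixed_sampling_free_energy(y, sources, dests, paths, 2.0,
                           torch.tensor([0.5, 0.5])).backward()
print(y.grad.shape)   # one gradient per facility coordinate
```

Detaching the weights is the key design choice: the estimator treats \(\hat{p}_\beta^i\) as constants at the current iterate, exactly as in the closed-form MEP gradient.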

Curriculum Learning Training

| Phase | Mode | Initialized From | Epochs | Nodes \(M\) |
|-------|------|------------------|--------|-------------|
| 1 | Supervised (KL divergence) | None | 10k | 10 |
| 2 | Supervised | Phase 1 | 1.5k | 50 |
| 3 | RL (Policy Gradient) | None / Phase 2 | 50k | 50 |
| 4 | RL (Policy Gradient) | Phase 3 | 10k | 100 |
| FT | RL (Fine-Tuning) | Phase 4 | 1k | {10, 100} |

The supervised phase minimizes the KL divergence between the model policy and the Gibbs target distribution; the reinforcement learning phase uses REINFORCE with a POMO-style baseline.
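The supervised objective can be sketched as follows (a minimal sketch; the function name, shapes, and the use of per-step next-node costs as the Gibbs target are illustrative assumptions about how the KL loss is wired up):

```python
import torch
import torch.nn.functional as F

def supervised_kl_loss(model_logits, target_costs, beta):
    """KL(p_beta_target || p_model) between the Gibbs target distribution
    over candidate next nodes and the SPN policy.

    model_logits: (batch, num_nodes) raw SPN scores.
    target_costs: (batch, num_nodes) costs defining the Gibbs target.
    """
    log_p_model = F.log_softmax(model_logits, dim=-1)
    p_target = F.softmax(-beta * target_costs, dim=-1)
    # kl_div expects log-probs as input and probs as target:
    # sum p_t * (log p_t - log p_m), averaged over the batch
    return F.kl_div(log_p_model, p_target, reduction='batchmean')

logits = torch.randn(4, 10, requires_grad=True)
costs = torch.rand(4, 10)
loss = supervised_kl_loss(logits, costs, beta=5.0)
loss.backward()
print(loss.item())   # non-negative, zero iff the policies match
```

The RL phases then drop the Gibbs target and optimize expected path cost directly with REINFORCE, using a POMO-style shared baseline across multiple rollouts of the same instance.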

Key Experimental Results

Scalability Speedup

  • Gradient computation: SPN + Sampling is approximately \(200\times\) faster than the original ParaSDM solver at \(N=200, M=100\)
  • Shortest path inference: SPN greedy inference is approximately \(100\times\) faster than ParaSDM DP at \(N=M=200\)

Comparison with Baselines

| Method | Cost (vs. Gurobi) | Speed |
|--------|-------------------|-------|
| GA / SA / CEM | \(>10\times\) higher than Deep FLPO | Slower |
| Gurobi (exact) | Optimal | Baseline |
| Deep FLPO (high \(\beta\)) | Only 2% higher | \(\sim 1500\times\) faster |
| Deep FLPO (annealing) | Matches Gurobi | \(\sim 1500\times\) faster |

Large-Scale FLPO Scenarios

  • \(N=200, M=40\): SPN-only is \(10\times\) faster than the annealing variant, while the annealing variant achieves 17% lower cost
  • \(N=800, M=200\): The annealing variant achieves 8% lower cost but is \(\sim 3\times\) slower — at this scale, both Gurobi and metaheuristics become practically infeasible
  • SPN generalization: Trained on 10–100 nodes, the average optimality gap is only 6% when evaluated on 10–300 nodes

Ablation Study

  • DED + annealing supervision achieves an average gap of ~6% for \(M\) from 10 to 300, outperforming DCAD + annealing by ~4%
  • Supervised pretraining reduces the gap for DED by ~3%; without supervised pretraining, DCAD's RL training collapses (400–500% gap)

Highlights & Insights

  • Deep integration of structural priors and neural networks: Rather than black-box replacement of classical methods, the algebraic structure of MEP's Gibbs distribution is embedded directly into the SPN architecture
  • First scalable neural solver for ParaSDM: The original \(O(NM^4)\) DP complexity is substantially reduced to \(O(NM^2)\) with GPU parallelism
  • Elegant annealing + mixed-sampling strategy: Uniform exploration at low \(\beta\) avoids local optima, while SPN focuses on short paths for exploitation at high \(\beta\)
  • Cross-scale generalization: Trained on 10–100 nodes, the model maintains a 6% optimality gap at 300-node test instances
  • Practical utility: Matches Gurobi's exact solutions while running 1500× faster, supporting large-scale scenarios with 800 agents × 200 nodes

Limitations & Future Work

  • Euclidean distance costs only: Road networks, obstacles, and other realistic cost functions are not addressed
  • No facility capacity constraints: Real charging stations have capacity limits; extensions are needed to handle multi-agent competition
  • Primarily 2D settings: Although the formulation supports \(d\)-dimensional spaces, experiments are conducted only on \(\mathcal{D} = [0,1]^2\)
  • SPN training cost: The four-phase curriculum training likely requires substantial GPU time, which is not reported in detail
  • Fixed Gibbs temperature during training: SPN is trained and evaluated at high \(\beta\), but the choice of \(\beta\) significantly affects performance

Comparison with Related Work

  • Attention Model (Kool et al.): Designed for TSP/VRP; cannot jointly optimize continuous parameters \(\mathcal{Y}\). SPN extends its decoder to accommodate ParaSDM
  • Pointer Network: Uses LSTM + attention; SPN adopts Transformer + induced attention for greater efficiency
  • SIL (Self-Improved Learning): Extends NCO with linear attention but does not handle parametrized action spaces
  • Original MEP (Srivastava & Salapaka): Theoretically rigorous but not scalable at \(O(NM^4)\); Deep FLPO preserves MEP's structural guarantees while achieving 100× speedup
  • Gurobi: Exact but not scalable to large instances; Deep FLPO is 1500× faster at matching accuracy

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First work to integrate MEP structural priors with NCO transformers for ParaSDM, opening a new research direction
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers scalability, baseline comparisons, and ablations, but lacks validation on real-world application scenarios
  • Writing Quality: ⭐⭐⭐⭐ — Problem formulation is rigorous and method derivation is clear, though heavy notation increases reading difficulty
  • Value: ⭐⭐⭐⭐⭐ — Addresses the core scalability bottleneck of ParaSDM with strong practical utility