Embracing Large Language Models in Traffic Flow Forecasting¶

Conference: ACL 2025
arXiv: 2412.12201
Code: https://github.com/YushengZhao/LEAF
Area: Autonomous Driving
Keywords: Traffic Flow Forecasting, LLM Discriminative Ability, Graph Neural Network, Hypergraph, Ranking Loss

TL;DR¶

LEAF pairs a dual-branch predictor, consisting of a graph branch for pair-wise relations and a hypergraph branch for non-pair-wise relations, with a frozen LLM that acts as a selector. The predictor generates candidate forecasts, the LLM picks the most plausible one (a discriminative rather than generative use of the model), and that choice is fed back to the predictor through a ranking loss. The method reports SOTA results on the PEMS datasets.

Background & Motivation¶

Background: Traffic flow forecasting is a core challenge in Intelligent Transportation Systems (ITS). Mainstream approaches use GNN, RNN, or Transformer models to capture spatio-temporal dynamics. More recently, some works have begun introducing LLMs into traffic forecasting.

Limitations of Prior Work: (1) Prior methods assume identical train/test distributions, yet traffic conditions undergo distribution shift due to special events, weather, or epoch-level transitions, causing performance degradation; (2) graphs only capture pair-wise relationships, whereas hypergraphs only model non-pair-wise relationships, making a single structure insufficient.

Key Challenge: The intuitive way to use an LLM in traffic forecasting is to have it directly "generate" forecast values. However, traffic data encodes complex spatio-temporal dynamics that are hard for a language model to reproduce numerically: in the results table, LLM-MPE has a higher MAE than the simple GNN baselines on PEMS03 and PEMS04.

Goal: How can one harness the generalization and reasoning capabilities of LLMs to enhance traffic flow forecasting, while avoiding having the LLM directly process complex spatio-temporal relationships?

Key Insight: Instead of utilizing the "generative capability" of the LLM, this work leverages its "discriminative capability"—allowing the LLM to select the most plausible forecast from multiple candidates.

Core Idea: Generate candidate forecasts using traditional spatio-temporal models, employ an LLM as a selector, and adopt ranking loss based on the selection feedback to train the predictor.

Method¶

Overall Architecture¶

LEAF consists of two parts: (1) a dual-branch predictor—where the graph branch captures pair-wise spatio-temporal relationships and the hypergraph branch models non-pair-wise relationships; (2) an LLM-based selector—where a frozen LLaMA 3 70B selects the optimal prediction from a candidate set. Workflow: pre-train the dual branches \(\rightarrow\) generate forecasts during testing \(\rightarrow\) construct candidate outputs (with transformations) \(\rightarrow\) select using the LLM \(\rightarrow\) fine-tune the predictor via ranking loss \(\rightarrow\) iterate.

Key Designs¶

Spatio-Temporal Graph Construction and Graph Branch:
- Function: Expands \(T \times N\) spatio-temporal data into a spatio-temporal graph, using GCNs to capture pair-wise spatio-temporal relationships.
- Mechanism: Constructs a spatio-temporal graph \(\mathcal{G}^{ST}\) where nodes are the \(TN\) spatio-temporal points and edges include spatial edges (adjacent sensors at the same time step) and temporal edges (adjacent time steps for the same sensor). Information is propagated through 7 layers of standard GCN convolution: \(X^{(l)} = \sigma(\hat{A}^{ST} X^{(l-1)} W_G^{(l)})\).
- Design Motivation: The graph branch excels at modeling localized propagation effects (e.g., traffic congestion at one junction affecting adjacent junctions).
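
The graph construction and propagation rule described in the bullets above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the node indexing `t * N + i`, the symmetric normalization with self-loops, and the choice of ReLU for \(\sigma\) are all assumptions made for the example.

```python
import numpy as np

def build_st_adjacency(spatial_adj: np.ndarray, num_steps: int) -> np.ndarray:
    """Adjacency over the T*N spatio-temporal points; point (t, i) has index t*N + i."""
    n = spatial_adj.shape[0]
    adj = np.zeros((num_steps * n, num_steps * n))
    idx = np.arange(n)
    for t in range(num_steps):
        lo = t * n
        # Spatial edges: adjacent sensors at the same time step.
        adj[lo:lo + n, lo:lo + n] = spatial_adj
        if t + 1 < num_steps:
            # Temporal edges: the same sensor at adjacent time steps.
            adj[lo + idx, lo + n + idx] = 1.0
            adj[lo + n + idx, lo + idx] = 1.0
    return adj

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization with self-loops (a common GCN choice, assumed here)."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat: np.ndarray, x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One propagation step X' = sigma(A_hat X W), with sigma assumed to be ReLU."""
    return np.maximum(a_hat @ x @ w, 0.0)
```

For a 3-sensor road segment observed over 2 time steps, `build_st_adjacency` returns a 6 x 6 matrix; stacking several `gcn_layer` calls lets information travel across both space and time.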
Hypergraph Branch:
- Function: Learns a hypergraph incidence matrix to capture non-pair-wise group dynamics.
- Mechanism: Uses a learnable incidence matrix \(I_H = \text{softmax}(X_H^{(l-1)} W_H)\) to execute hypergraph convolutions via \(X_H^{(l)} = I_H(I_H^\top X_H^{(l-1)} + \sigma(W_E I_H^\top X_H^{(l-1)}))\). The first term models node interactions inside hyperedges, and the second models inter-hyperedge interactions.
- Design Motivation: Morning-peak commuting from residential areas to commercial districts is a typical non-pair-wise relationship: a whole set of nodes fluctuates together, which simple pair-wise graph edges cannot represent accurately.
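
The hypergraph convolution formula above can be written out as a short NumPy sketch to make the tensor shapes concrete. Several details are assumptions for illustration only: the softmax is taken over hyperedges for each node, \(\sigma\) is ReLU, \(W_E\) is an \(m \times m\) matrix mixing the \(m\) hyperedges, and the function names are hypothetical.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hypergraph_conv(x: np.ndarray, w_h: np.ndarray, w_e: np.ndarray) -> np.ndarray:
    """x: (n, d) node features; w_h: (d, m) incidence weights; w_e: (m, m) edge mixing."""
    inc = softmax(x @ w_h, axis=1)        # (n, m) soft incidence matrix I_H
    edge_feat = inc.T @ x                 # (m, d) node features gathered per hyperedge
    # First term: interactions inside each hyperedge.
    # Second term: interactions between hyperedges via W_E.
    mixed = edge_feat + np.maximum(w_e @ edge_feat, 0.0)
    return inc @ mixed                    # (n, d) features scattered back to nodes
```

Because the incidence matrix is computed from the features, group membership is learned rather than fixed by the road network.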
Candidate Set Construction and LLM Selector:
- Function: Applies various transformations to the forecasts from the dual branches to expand the candidate set, from which the LLM selects.
- Mechanism: Five transformations are applied to each branch's forecast: smoothing, an increasing trend (a linear increase from 1% to 12%), a decreasing trend, overestimation (+5%), and underestimation (-5%). Together with the raw predictions, this yields 12 candidates (6 per branch). A prompt containing the task description, spatio-temporal information, historical data, and the candidate set is constructed for LLaMA 3 70B, which selects the best candidate.
- Design Motivation: (1) Providing more candidates gives the LLM greater latitude to cope with distribution shifts (e.g., selecting an increasing trend during Monday morning peaks); (2) choosing (discriminating) among candidates is substantially easier for the LLM than direct numerical generation, effectively unleashing its commonsense reasoning capabilities.
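
As a rough illustration of the candidate set, the sketch below applies the five transformations to each branch's forecast. The percentages come from the description above; the moving-average window, the way the 1% to 12% ramp is spread over the horizon, and all function and key names are assumptions, and the LLM prompt itself is not reproduced here.

```python
import numpy as np

def smooth(y: np.ndarray, window: int = 3) -> np.ndarray:
    """Moving average with edge padding (window size is an assumed value)."""
    padded = np.pad(y, window // 2, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

def build_candidates(forecast) -> dict:
    """Raw forecast plus five transformed variants for one branch."""
    y = np.asarray(forecast, dtype=float)
    ramp = np.linspace(0.01, 0.12, num=len(y))  # linear ramp from 1% to 12%
    return {
        "raw": y,
        "smoothed": smooth(y),
        "increasing": y * (1 + ramp),
        "decreasing": y * (1 - ramp),
        "over": y * 1.05,
        "under": y * 0.95,
    }

def candidate_set(graph_forecast, hypergraph_forecast) -> dict:
    """Union of both branches' candidates: 6 per branch, 12 in total."""
    out = {}
    for branch, f in (("graph", graph_forecast), ("hypergraph", hypergraph_forecast)):
        for name, cand in build_candidates(f).items():
            out[f"{branch}/{name}"] = cand
    return out
```

Calling `candidate_set` on two 12-step forecasts returns 12 labeled arrays that could then be serialized into the selector prompt.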
Ranking Loss Feedback:
- Function: Backpropagates the LLM selection feedback using ranking loss to train the predictor.
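- Mechanism: \(\mathcal{L}^G = [\Delta(y_i^G, \hat{y}_i) - \inf_{y_i' \in \mathcal{C}_i \setminus \{\hat{y}_i\}} \Delta(y_i^G, y_i') + \epsilon]_+\), where \(y_i^G\) is the graph-branch output, \(\hat{y}_i\) is the candidate chosen by the LLM, \(\mathcal{C}_i\) is the candidate set, \(\Delta\) is a distance, \(\epsilon\) is the margin, and \([\cdot]_+ = \max(\cdot, 0)\). The loss pushes the predictor output closer to the chosen candidate than to any of the rejected candidates.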
- Design Motivation: Since the LLM selection might not perfectly match the ground truth, directly optimizing with MSE/MAE would introduce excessive noise. Ranking loss only demands correct relative ranking, making it more robust.
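
The ranking loss above can be sketched for a single sample as follows. It is a minimal illustration assuming a Huber distance for \(\Delta\) with threshold `delta=1.0` (the threshold is an assumed value) and replacing the infimum with a minimum, which is equivalent for a finite candidate set; the function names are hypothetical.

```python
import numpy as np

def huber(a, b, delta: float = 1.0) -> float:
    """Mean Huber distance between two forecasts."""
    r = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))))

def ranking_loss(pred, chosen, others, margin: float = 0.0) -> float:
    """Hinge loss: pred should be closer to the chosen candidate than to any other."""
    d_chosen = huber(pred, chosen)
    d_nearest_other = min(huber(pred, c) for c in others)
    return max(d_chosen - d_nearest_other + margin, 0.0)
```

The loss is zero whenever the prediction already sits closer to the LLM's choice than to every rejected candidate, so an imperfect choice only nudges the predictor instead of being fitted exactly.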

Loss & Training¶

Pre-training phase: Both branches are trained independently using MAE loss.
Test-time adaptation: Ranking loss (Huber distance, margin \(\epsilon=0\)), updating for \(M=5\) steps per round, with \(K=2\) prediction-selection iteration rounds.
Hidden dimension \(d=64\), 7 layers, batch training.
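
The round/step structure of test-time adaptation (\(K\) prediction-selection rounds, \(M\) updates per round) can be illustrated with a toy loop. This is a heavily simplified sketch under stated assumptions: the forecast vector itself is updated in place of the predictor's parameters, a squared distance replaces the Huber distance so the gradient has a closed form, `select` is a stand-in callable for the LLM, and the learning rate `lr` is a hypothetical value.

```python
import numpy as np

def adapt(pred, make_candidates, select, rounds: int = 2, steps: int = 5, lr: float = 0.1):
    """Toy test-time adaptation: K rounds of selection, M ranking-loss steps each."""
    p = np.asarray(pred, dtype=float).copy()
    for _ in range(rounds):
        cands = make_candidates(p)          # rebuild candidates from the current forecast
        chosen_idx = select(cands)          # stand-in for the LLM selector
        chosen = cands[chosen_idx]
        others = [c for i, c in enumerate(cands) if i != chosen_idx]
        for _ in range(steps):
            dists = [np.mean((p - o) ** 2) for o in others]
            nearest = others[int(np.argmin(dists))]
            loss = np.mean((p - chosen) ** 2) - min(dists)
            if loss <= 0:                   # hinge inactive: ranking already satisfied
                break
            # Gradient of the active hinge w.r.t. p (nearest candidate held fixed).
            grad = 2.0 * (nearest - chosen) / p.size
            p -= lr * grad
    return p
```

With a selector that always prefers the overestimated candidate, for example `adapt([100.0, 100.0], lambda p: [p, p * 1.05, p * 0.95], lambda c: 1)`, the forecast drifts upward over the two rounds.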

Key Experimental Results¶

Main Results¶

Method	PEMS03 MAE	PEMS04 MAE	PEMS08 MAE	PEMS08 RMSE	PEMS08 MAPE
DCRNN (GNN+RNN)	29.99	34.36	31.41	43.91	15.44%
STSGNN (GNN)	28.21	33.43	29.58	41.95	12.90%
DyHSL (Hypergraph)	27.10	33.36	27.34	39.05	11.56%
STAEformer (Transformer)	27.87	33.77	27.43	38.16	11.36%
LLM-MPE (LLM Generative)	33.82	35.63	26.42	40.02	10.61%
LEAF (Ours)	25.46	31.49	24.68	36.07	10.56%

Ablation Study (PEMS08)¶

Configuration	MAE	RMSE	MAPE
Graph branch only	29.12	41.36	13.54%
Hypergraph branch only	27.94	39.11	11.82%
w/o hypergraph	26.29	38.18	12.83%
w/o graph	25.80	37.23	11.00%
w/o transformation	25.47	36.47	11.01%
w/o ranking loss	25.41	37.00	11.34%
LEAF	24.68	36.07	10.56%

Key Findings¶

The LLM works much better as a discriminator than as a generator: LLM-MPE (generative) reaches 33.82 MAE on the large-scale PEMS03 network, worse than the simple GNN baselines, whereas LEAF (discriminative) reaches 25.46 MAE.
Dual branches complement each other: Eliminating either branch leads to performance degradation, demonstrating the importance of modeling both pair-wise and non-pair-wise relations.
Significant impact of the LLM selector: The graph branch alone reaches 29.12 MAE, which drops to 26.29 once the LLM selector is added (the "w/o hypergraph" row); the hypergraph branch alone reaches 27.94, which drops to 25.80 with the selector (the "w/o graph" row).
Ranking loss is superior to direct fitting: Removing the ranking loss raises the MAE from 24.68 to 25.41 and the RMSE from 36.07 to 37.00.
Greater advantage in long-term forecasting: Across the 12 prediction steps, LEAF's error at early steps is close to that of the base branches, while its error at later steps is noticeably lower.
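
To put the margins in the main results table into relative terms, the short script below computes LEAF's MAE reduction over the strongest baseline on each dataset. The numbers are copied from the table above; the helper name `relative_gain` is hypothetical.

```python
# MAE values copied from the main results table.
mae = {
    "PEMS03": {"DCRNN": 29.99, "STSGNN": 28.21, "DyHSL": 27.10,
               "STAEformer": 27.87, "LLM-MPE": 33.82, "LEAF": 25.46},
    "PEMS04": {"DCRNN": 34.36, "STSGNN": 33.43, "DyHSL": 33.36,
               "STAEformer": 33.77, "LLM-MPE": 35.63, "LEAF": 31.49},
    "PEMS08": {"DCRNN": 31.41, "STSGNN": 29.58, "DyHSL": 27.34,
               "STAEformer": 27.43, "LLM-MPE": 26.42, "LEAF": 24.68},
}

def relative_gain(scores: dict, ours: str = "LEAF"):
    """Return the best baseline and the MAE reduction over it, in percent."""
    baselines = {k: v for k, v in scores.items() if k != ours}
    best_name = min(baselines, key=baselines.get)
    best = baselines[best_name]
    return best_name, 100.0 * (best - scores[ours]) / best

for dataset, scores in mae.items():
    name, gain = relative_gain(scores)
    print(f"{dataset}: {gain:.2f}% lower MAE than {name}")
```

The strongest baseline differs by dataset (DyHSL on PEMS03 and PEMS04, LLM-MPE on PEMS08), which is why the comparison is made per dataset rather than against a single method.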

Highlights & Insights¶

"Employing LLM for selection rather than generation": This is the core architectural insight. LLMs excel at semantic comprehension and commonsense reasoning (e.g., "peak hours end at 7 PM") but struggle with precise numerical regression. Using the LLM as a discriminator rather than a generator matches the task to what the model is good at.
Ranking loss tolerates selection noise: Since the LLM selection is imperfect, ranking loss only requires correct relative rankings, elegantly handling noisy supervisory signals.
Candidate expansion via transformations: Even simple transformations (trend/smoothing/offsetting) supply the LLM with sufficient operational range to adapt to distribution shifts.

Limitations & Future Work¶

Evaluated solely on PEMS traffic datasets, lacking validation on other spatio-temporal tasks (such as meteorology or energy).
The LLM is not fine-tuned. A Parameter-Efficient Fine-Tuning approach such as LoRA could further improve selector performance.
Performance degrades when the number of iteration rounds exceeds 2 (\(K>2\)). The summary attributes this to the absence of cross-turn context memory, which leads the LLM to reconsider the same factors in every round.
The inference cost of LLaMA 3 70B is high, so practical deployment requires careful attention to efficiency.
Only 10% of the training data is utilized; performance under full data scale remains unexplored.

vs LLM-MPE: LLM-MPE forces LLMs to generate predictions directly, showing poor results on large networks. LEAF shifts to a discriminative approach, bypassing the LLMs' bottleneck in handling complex spatio-temporal dynamics.
vs DyHSL: DyHSL is the predecessor of LEAF's hypergraph branch. LEAF further improves upon it by integrating a graph branch and an LLM selector.
vs STAEformer: Pure Transformer schemes lack explicit graph/hypergraph structure modeling.

Rating¶

Novelty: ⭐⭐⭐⭐ The positioning of "LLM as a discriminator" is very creative, but the dual-branch predictor design itself is a combination of existing works.
Experimental Thoroughness: ⭐⭐⭐ Only evaluated on 3 PEMS datasets, and the 10% training data configuration is unusual.
Writing Quality: ⭐⭐⭐⭐ The motivation is clearly articulated, and the visual analyses are highly convincing.
Value: ⭐⭐⭐⭐ Provides a practical paradigm for leveraging LLMs in spatio-temporal forecasting.