doi: 10.11916/j.issn.1005-9113.24075
Multi-Step Short-Term Traffic Flow Prediction of Urban Road Network Based on ISTA-Transformer Model
School of Transportation, Southeast University, Nanjing 211189, China
Funds: Sponsored by the National Key Research and Development Program of China (Grant No. 2020YEB1600500).
Abstract
Short-term traffic flow prediction plays a crucial role in the planning of intelligent transportation systems. Nowadays, a large amount of traffic flow data is generated by the monitoring devices of urban road networks, containing road network traffic information with high application value. In this study, an improved spatio-temporal attention transformer model (ISTA-transformer) is proposed to provide a more accurate method for predicting multi-step short-term traffic flow based on monitoring data. By embedding a temporal attention layer and a spatial attention layer, the model learns the relationships between traffic flows at different time intervals and different geographic locations, and realizes more accurate multi-step short-term flow prediction. Finally, we validate the superiority of the model with monitoring data spanning 15 days from 620 monitoring points in Qingdao, China. Over the four prediction time steps, the MAPE (Mean Absolute Percentage Error) values of the ISTA-transformer's predictions are 0.22, 0.29, 0.37, and 0.38, respectively, and its prediction accuracy is usually better than that of six baseline models (Transformer, GRU, CNN, LSTM, Seq2Seq and LightGBM), indicating that the proposed model maintains superior predictive ability across the time steps of multi-step prediction.
0 Introduction
Intelligent Transportation Systems ( ITS) have become a focal point of research in recent years, serving as a cornerstone for the development of smart cities. Traffic flow prediction, as a key component of ITS, provides critical data that enables optimised allocation of urban traffic resources, helps city managers respond quickly to varying traffic conditions, and provides travellers with more efficient route planning. However, with the continued expansion of urban road networks and the increase in vehicle ownership, the stochastic and uncertain nature of traffic flows has become more pronounced, posing significant challenges to accurate short-term forecasting within specific urban areas.
Short-term, multi-step traffic flow forecasting has significant value for traffic management in these areas, where multi-step refers to predicting a sequence of future time steps based on historical data, rather than making independent forecasts for each time step. Accurate multi-step forecasting can help urban transport authorities allocate resources in advance, manage congestion at a micro level and implement demand-based control measures tailored to specific network segments. By accurately predicting traffic flow over multiple time steps, decision makers can better deal with the unique challenges posed by sudden fluctuations, improving the accuracy and efficiency of real-time traffic management. For instance, shorter time steps (15-30 min) can support immediate congestion mitigation, whereas longer time steps (45-60 min) can aid in strategic planning for traffic signal adjustments and rerouting.
Although recent advances in predictive models, including deep learning and spatio-temporal methods, have improved prediction accuracy, current approaches often lack the ability to fully capture the spatio-temporal relationships between multiple monitoring points within a region. This limitation affects the robustness of predictions, especially in dynamically changing urban environments. To address these challenges, this study proposes a model that exploits the spatio-temporal relationships between different monitoring points within a given area to improve the accuracy of short-term, multi-step forecasts over the next few intervals, thereby supporting more effective and timely urban traffic management.
1 Related Work
Traffic flow prediction is a typical regression problem: forecasting future traffic flow based on historical information[1]. As research in the field of traffic flow prediction continues to expand, a multitude of models and methodologies have been proposed. The most commonly utilized prediction models can be broadly classified into two categories: linear theoretical models and neural network models[2].
1.1 Linear Theoretical Models
Linear theoretical forecasting models incorporate traditional time series analysis methodologies, including the ARIMA (Autoregressive Integrated Moving Average) model, the exponential smoothing method, the Markov Chain (MC), and so on. The ARIMA method was initially proposed by Ahmed et al.[3] in 1996, with van der Voort et al.[4] subsequently proposing a hybrid method, the Kohonen-ARIMA model, based on existing studies. Later, Lee et al.[5] used a subset ARIMA model, and Williams[6] tried an ARIMAX model with explanatory variables. Kamarianakis et al.[7] proposed a space-time ARIMA model, while Williams et al.[8] put forward the seasonal ARIMA model by considering univariate modelling as a periodic process. Exponential smoothing is a process of repeatedly updating forecasts based on recent experience. Typically, in metropolitan areas where weekday traffic flow patterns differ from weekend patterns, Holt-Winters exponential smoothing works well when the data exhibit both trend and seasonality[9]. In traffic flow forecasting, the triple exponential smoothing method is commonly used for data with trend and seasonal components. The Markov model is a powerful tool for describing state spaces and analyzing time series data. It is suitable for stochastic systems with large randomness and obvious data fluctuation[10]. These models are typically predicated on the assumption of a linear relationship or a simple change rule in the data, which makes them applicable to forecasting relatively stable time series with strong linear relationships[11]. Nevertheless, time series models such as ARIMA, despite their efficacy in linear analysis, are incapable of addressing nonlinear relationships in traffic flow because they inadequately encapsulate the dependencies between historical data.
1.2 Neural Network Models
Neural network prediction models include feed-forward neural networks, recurrent neural networks such as Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNN), and other deep learning models. These models are capable of handling complex non-linear relationships and time series data, and enhance the accuracy and generalization ability of time series forecasting by learning patterns and features from a substantial quantity of data. The advent of artificial neural networks has opened up new avenues for research in the field of traffic flow prediction. In 2015, Ref.[12] demonstrated the applicability of deep learning methods in the field of traffic flow prediction. The application of deep learning, an extension of multi-layer artificial neural networks, has led to a significant advancement in the field of traffic flow prediction. By leveraging large amounts of historical data, deep learning has enabled a more accurate and comprehensive approach to traffic flow prediction, marking a notable shift in the research landscape. In the present era, predictive models based on deep learning networks have become the dominant paradigm in the domain of traffic flow prediction. In recent years, a significant number of scholars have sought to enhance the performance of deep learning models by improving the original models or combining different deep learning methods in order to leverage their respective advantages. For example, Ref.[13] introduced the SW-BiLSTM model, which considers the interaction between adjacent road segments and achieved impressive predictive accuracy when validated with real-world GPS data. In Ref.[14], a BiGRU-BiGRU model with two modules is proposed to extract temporal and periodic features from traffic data, where a novel limited attention mechanism is incorporated in the first module to improve the model's performance by focusing only on the most recent relevant information in the traffic flow sequence. Luo et al.[15] proposed a novel model, the Graph Temporal Convolutional Long Short-Term Memory network (GT-LSTM), which is primarily constituted of feature splicing and pattern capturing. In feature splicing, the spatial dependencies of traffic flow are captured through the employment of a self-adaptive Graph Convolutional Network (GCN). Zhang et al.[16] proposed a deep learning framework that utilizes Seq2Seq models and graph convolutional networks to capture spatial and temporal dependencies, incorporating attention mechanisms and a novel training method to tackle multi-step prediction challenges and effectively address the temporal heterogeneity of traffic patterns.
Summarizing the literature on these different approaches, it is easy to see that most studies optimize the ability of the model to capture the temporal and spatial relationships of historical traffic flows, thus improving the accuracy of the predictions[17-20]. However, if the prediction sequences are too long, memory degradation and information loss may occur due to the limitations of the above networks[21]. The transformer network is a sequence-to-sequence learning model that relies fully on the self-attention mechanism, first used in natural language translation, to compute the mapping relationship between input and output in parallel[22]. In 2017, the transformer architecture consisting entirely of attention mechanisms was introduced[22]. In recent years, it has been applied to traffic flow prediction. For example, Xiao and Chen[23] developed a novel temporal attention module in a spatio-temporal transformer graph network that uses local context to improve the stability of long-term predictions in the temporal dimension. Hu et al.[24] proposed a multi-layer model based on a transformer and deep learning, which uses multiple encoders and decoders to perform feature extraction on the initial traffic data without human experience. However, these models only consider single-step prediction, which results in a limited total time span of prediction when the time granularity is fine, or the loss of some detailed information when the time granularity is too coarse. In contrast, multi-step short-term prediction is capable of further dividing the long time span and obtaining more detailed and complete prediction information[25-26].
In conclusion, this paper puts forth a multi-step prediction model to address the challenge of short-term flow forecasting at pivotal monitoring locations within an urban road network. This model, designated the ISTA-transformer (Improved Spatio-Temporal Attention transformer), employs a novel approach that incorporates enhanced spatial and temporal attention mechanisms. This study proposes a systematic processing framework (as shown in Fig. 1) for the collected road network monitoring point data, and optimizes the underlying transformer algorithm by innovatively embedding a spatio-temporal attention mechanism that takes the historical flow and spatial weight matrices as inputs, allowing the model to capture the spatio-temporal relationships therein and predict the future short-term flow on a given road section. The process of short-term flow prediction for road network monitoring point flow data can be divided into the following steps:
Fig.1Flowchart of traffic flow prediction
1) Data preparation: toll booth traffic data is segmented by time period, the flow for each period is calculated, and the data is converted into a format that the model can process.
2) Model training: historical traffic data is used to train the constructed traffic prediction model, allowing the model to extract the spatio-temporal features contained in the data.
3) Model evaluation: predicted traffic is compared with actual traffic, and various metrics are used to assess prediction accuracy.
2 Problem Description and Modeling
2.1 Problem Description
The short-term traffic flow multi-step prediction problem can be formulated as follows: a data series consisting of traffic flow observations from several historical periods is input into a prediction model, which outputs a prediction of a traffic flow sequence for several future time periods[27]. The objective of this study is to predict the traffic flow at various monitoring points along a road network for a number of future time steps, based on historical monitoring point flow data and the latitude and longitude data of the road network. The core problem is to capture the spatio-temporal relationships of the traffic flow at each monitoring point within the study area. In this paper, we adopt a multi-step prediction strategy of "multiple inputs-multiple outputs", which allows for the acquisition of more detailed prediction information while widening the prediction time span. The selection of specific time steps is determined by analyzing the time-variation characteristics of traffic flow in the study area. The pertinent definitions are presented below.
1) Feature matrix: The traffic flow at each time step is utilized as the feature matrix, denoted as Q_t ∈ ℝ^(N×1), where N represents the number of monitoring points within the traffic network. The model's input sequence is defined as {Q_(t-l+1), Q_(t-l+2), …, Q_t} ∈ ℝ^(l×N×1), with l indicating the size of the input time window. The output sequence produced by the model is {Q_(t+1), Q_(t+2), …, Q_(t+k)} ∈ ℝ^(k×N×1), which represents the predicted traffic flow for the next k time steps.
2) Spatial relationship matrix: The road network location map, which depicts the positioning of monitoring points, is represented as G = (V, E, A), the topological framework of the transportation network. The set V = {v_1, v_2, …, v_n} represents the monitoring points, while the set E denotes the edges. The adjacency matrix A ∈ ℝ^(n×n) encapsulates the node connectivity relationships. For any two points v_i, v_j ∈ V with (v_i, v_j) ∈ E, the nodes are connected and the element a_ij in the adjacency matrix is 1; otherwise it is 0.
3) Problem formulation: The prediction task can be understood as learning a mapping function f from the input historical traffic data {Q_(t-l+1), Q_(t-l+2), …, Q_t} and a known spatial relation matrix A, and then using this function to predict future traffic flow over several specific time intervals, as illustrated in Eq.(1):
{Q_(t+1), Q_(t+2), …, Q_(t+k)} = f({Q_(t-l+1), Q_(t-l+2), …, Q_t}, G)
(1)
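The shapes implied by these definitions can be sanity-checked with a short sketch (a NumPy illustration; the naive persistence baseline merely stands in for the learned mapping f, and all sizes are illustrative assumptions):

```python
import numpy as np

# Illustrative sizes: l = 12 input steps, k = 4 output steps, N = 16 points.
l, k, N = 12, 4, 16

history = np.random.rand(l, N, 1)               # {Q_(t-l+1), ..., Q_t}
A = (np.random.rand(N, N) > 0.5).astype(float)  # adjacency matrix of G

def f(history, A, k=4):
    """Placeholder for the learned mapping f of Eq.(1).
    A persistence baseline: repeat the last observed step k times."""
    return np.repeat(history[-1:], k, axis=0)

pred = f(history, A)
assert pred.shape == (k, N, 1)                  # {Q_(t+1), ..., Q_(t+k)}
```

Any real model must produce outputs of exactly this (k, N, 1) shape; the trained ISTA-transformer replaces the persistence placeholder.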
2.2 Architecture of ISTA-Transformer Model
The transformer model was initially developed for natural language processing and comprises two principal components: the encoder and the decoder. The encoder is composed of the following layers, arranged from top to bottom: Multi-Head Self-Attention, Add & Norm (residual connection and layer normalization), a Feed-Forward Neural Network, and another Add & Norm layer. The decoder comprises the following layers, arranged from top to bottom: Masked Multi-Head Self-Attention, an Add & Norm layer, a Feed-Forward Neural Network, and another Add & Norm layer.
In this paper, PyTorch is employed to construct an enhanced spatio-temporal attention transformer model, designated the ISTA-transformer. The model's comprehensive architectural framework is depicted in Fig.2.
The complete encoder architecture comprises a spatio-temporal embedding layer and K encoders. Each encoder is constituted of multiple encoder layers, which in turn comprise multi-head self-attention mechanisms and feed-forward neural networks. The historical sequence S and the spatial correlation matrix are taken as input in order to obtain a high-order spatio-temporal feature embedding representation STEI^(K). The self-attention mechanism enables the model to consider dependencies both longitudinally, within the same time sequence, and laterally, across time sequences from disparate traffic monitoring points. This mechanism enables the model to learn dynamic patterns within time sequences while also capturing common features or associations across different traffic monitoring points.
Fig.2Architecture of ISTA-transformer
In the proposed model, the traditional transformer is extended by the incorporation of spatial self-attention sub-layers and temporal hierarchical diffusion convolution sub-layers within each encoder layer. The incorporation of spatial self-attention sub-layers enhances the model’s capacity to discern spatial dependencies across disparate monitoring points, enabling each point to attend to other points within the network. This integration enables the model to effectively learn spatial correlations and dependencies, which are crucial for understanding traffic patterns across different locations.
The temporal hierarchical diffusion convolution sub-layers are introduced to capture temporal dependencies more effectively by applying diffusion convolution across a range of time scales. The hierarchical approach permits the model to extract both short-term variations and long-term trends within the traffic flow data. The combination of these temporal convolution layers with the self-attention mechanism enables the model to better capture complex temporal patterns that traditional self-attention might miss.
The combination of these specialized sub-layers with the traditional transformer architecture allows our model to address the spatial and temporal dimensions of the data simultaneously, resulting in a more comprehensive and accurate spatio-temporal feature extraction.
A residual connection is added after each sub-layer to ensure effective gradient backpropagation, while layer normalization is applied to stabilize training. Finally, the encoder feeds the spatio-temporal feature representation of the historical sequence into each decoder in the lower layers.
2.2.1 Spatio-temporal embedding layer
In this paper, a learnable embedding matrix is employed to encode the spatial information, thereby generating the corresponding spatial embedding, which is then integrated with the traffic flow data. The precise formula is given by Eq.(2):
STEI^(0) = Add(Concat(X, W_sp), S_t)
(2)
where X represents the traffic feature matrix of the monitoring points, W_sp is the spatial correlation coefficient matrix, S_t is the learnable temporal embedding, Add is the element-wise addition function, and Concat is the concatenation function that merges the traffic feature matrix X and the spatial correlation coefficient matrix W_sp along a specified dimension.
The spatio-temporal embedding information STEI^(0) is used as the input to the first encoder layer. Residual skip connections are employed to prevent embedding features from vanishing as the number of encoders or decoders increases. The embedding results for each node are concatenated with the node features and then projected into the model dimension.
The combination of X and W_sp, followed by the addition of S_t, results in the generation of an initial feature representation that encompasses both spatial and temporal data. This constitutes the initial stage of the feature extraction process. Once the preliminary embedding has been produced, the subsequent step is to incorporate the requisite spatial information into the feature matrix, thus enabling the model to utilize spatial dependencies more effectively.
The spatio-temporal embedding layer of the encoder establishes a connection between the flow data from each traffic monitoring point and the spatial correlation coefficient matrix between these points. The spatial correlation coefficient matrix embedding is derived from a preliminary analysis of the spatial relationships between traffic monitoring points and is employed for the computation of spatial attention.
By applying the matrix W_sp to transform the input feature matrix X, we obtain:
X' = XW_sp
(3)
This transformed feature matrix X ′ is subsequently fed into the self-attention layers, providing a representation that incorporates the learned spatial correlations. Furthermore, in convolutional layers designed to capture spatial patterns, the spatial weight matrix can be applied to adjust the convolutional filters, enhancing the spatial feature extraction process. This integration of the spatial weight matrix within different components of the encoder architecture ensures a more comprehensive and accurate capture of both spatial and temporal dependencies in the traffic data.
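As a minimal illustration of Eq.(3), the spatial transform is a single matrix product (a NumPy sketch; the sizes and the row normalization of W_sp are illustrative assumptions, not taken from the paper):

```python
import numpy as np

l, N = 12, 16                              # illustrative: 12 steps, 16 points
X = np.random.rand(l, N)                   # input feature matrix X
W_sp = np.random.rand(N, N)                # spatial correlation coefficients
W_sp /= W_sp.sum(axis=1, keepdims=True)    # row-normalize (an assumption)

X_prime = X @ W_sp                         # Eq.(3): X' = X W_sp
assert X_prime.shape == (l, N)
```

Each column of X' is thus a spatially weighted mixture of the flows at all monitoring points, which the subsequent self-attention layers consume.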
2.2.2 Attention mechanism
The multi-head attention mechanism represents a pivotal element of the transformer model, devised to address both local and global dependencies inherent in sequential data. This is achieved by the model learning multiple sets of different attention weights (heads), which are then merged to form a composite representation of the input data. This enables the model to concurrently concentrate on disparate elements of the input sequence across distinct subspaces, thereby facilitating more effective capture of dependencies within the sequence. This attention mechanism in the ISTA-transformer, by learning spatial dependencies between disparate monitoring points and selecting pivotal dependencies between time series, remedies the inefficient modelling of dependencies between time steps observed in the conventional self-attention mechanism. In the multi-head attention mechanism, the input query (Q), key (K) and value (V) matrices are processed through multiple heads. For the i-th head, the computation process is as follows:
Attention_i(Q, K, V) = softmax(QW_i^Q(KW_i^K)^T / √d_k) VW_i^V
(4)
where W_i^Q, W_i^K, and W_i^V are the weight matrices of the query, key, and value of the i-th head, and d_k is the dimension of the key vector. In the ISTA-transformer, the multi-head attention mechanism facilitates the capture of both global and local dependencies, while also enhancing spatial self-attention through the introduction of a spatial weight matrix, denoted as W_sp. In particular, each head yields a standardized attention score, designated as A_i:
A_i = softmax(QW_i^Q(KW_i^K)^T / √d_k)
(5)
The attention matrix is adjusted using the spatial weight matrix W_sp:

A'_i = A_i ⊙ W_sp
(6)
where ⊙ denotes element-wise multiplication. This modification ensures that the attention scores are influenced by the actual spatial dependencies, allowing the model to emphasize more relevant spatial relationships in the attention process. The adjusted attention scores A'_i are then used to compute the weighted sum of the value vectors, facilitating a more contextually aware representation.
Apply the adjusted attention matrix to the value vectors and merge the outputs of all the heads:
MultiHeadAttention(Q, K, V) = Concat(Attention_1(Q, K, V), Attention_2(Q, K, V), …, Attention_n(Q, K, V))W_0
(7)
where n is the number of heads and W_0 is the weight matrix used to combine the outputs of all the heads.
The multi-head attention mechanism, a core component of the ISTA-transformer, markedly enhances the model's capacity to discern intricate spatio-temporal patterns. This is achieved by enabling the model to simultaneously learn diverse attention patterns and spatial dependencies. The incorporation of the spatial weight matrix W_sp enables each attention head to adjust the attention scores in accordance with the spatial correlations, thereby enhancing the model's capacity for spatio-temporal feature extraction. This enhancement not only optimizes the attention mechanism but also improves the model's capacity to process real monitoring point data.
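A single spatially weighted attention head of Eqs.(4)-(6) can be sketched as follows (NumPy; all dimensions and the random weight matrices are illustrative assumptions, not trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention_head(Q, K, V, Wq, Wk, Wv, W_sp):
    """One head of Eqs.(4)-(6): scaled dot-product attention whose scores
    are element-wise reweighted by the spatial weight matrix W_sp."""
    d_k = Wk.shape[1]
    A = softmax((Q @ Wq) @ (K @ Wk).T / np.sqrt(d_k))  # Eq.(5)
    A_prime = A * W_sp                                  # Eq.(6): A'_i = A_i ⊙ W_sp
    return A_prime @ (V @ Wv)                           # weighted sum of values

N, d_model, d_k = 16, 32, 8                # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d_model))          # one time step, N monitoring points
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
W_sp = rng.random((N, N))                  # spatial weight matrix

out = spatial_attention_head(X, X, X, Wq, Wk, Wv, W_sp)
assert out.shape == (N, d_k)
```

Multi-head attention (Eq.(7)) concatenates the outputs of n such heads and projects them with W_0; note that after the element-wise reweighting the rows of A' no longer sum to one, which is precisely the intended spatial adjustment.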
3 Case Study
3.1 Dataset Processing
This paper presents an analysis of the road network in the Shibei District and Shinan District of Qingdao, China. Fig.3(a) illustrates that the distribution of traffic monitoring points and monitoring devices on the road network is relatively dense, encompassing the primary traffic nodes within the designated research area. Following an initial screening process, whereby data from certain monitoring points were excluded due to errors resulting from device ageing or damage, this study selected a total of 620 monitoring points for the analysis of 15 days of traffic data (from 12 January to 26 January, 2022). The data set comprises approximately one million vehicle travel records for each day. The original data set comprises a series of records, including the location, ID number, license plate number and monitoring time of each monitoring point. These records are transformed into traffic flow data, representing the number of vehicles passing through a specific direction of a road within a given time period.
Traffic flow is defined as the number of vehicles passing through a road monitoring point during a specific time interval. In order to gain a preliminary understanding of the temporal characteristics of traffic flow in Qingdao, this study counted the daily number of vehicles travelling on weekdays and weekends during the study period, as shown in Fig.3(b). As can be seen from the figure, there is a significant difference between the number of vehicles travelling on weekdays and weekends, and it can be deduced that the overall flow on weekdays is higher than that on weekends. Therefore, in order to ensure that the training model can learn a stable spatio-temporal pattern, it is necessary that the training set fully covers the characteristics of traffic flow changes in different periods, such as the traffic difference between weekdays and weekends. In addition, the test set should select data in subsequent periods in order to test the generalization ability of the model on previously unseen data and evaluate its prediction accuracy for new traffic flow conditions.
Fig.3Traffic flow distribution feature
Furthermore, it should also include both weekdays and weekends. Consequently, the data from 12 January 2022 ( Wednesday) to 21 January 2022 (Friday) are allocated to the training sets, and the data from 22 January 2022 (Saturday) to 26 January 2022 (Wednesday) are allocated to the test sets. This partitioning method is a common approach in the field of traffic flow prediction, and it has been demonstrated to be an effective means of evaluating model performance.
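The date-based split described above might be implemented as follows (a pandas sketch; the frame layout and column name are hypothetical):

```python
import pandas as pd

# Hypothetical 15 min flow series covering the 15-day study period.
idx = pd.date_range("2022-01-12 00:00", "2022-01-26 23:45", freq="15min")
flow = pd.DataFrame({"q": range(len(idx))}, index=idx)

# Training: 12-21 Jan (covers weekdays and a weekend);
# testing: 22-26 Jan (subsequent, previously unseen days).
train = flow.loc[:"2022-01-21"]
test = flow.loc["2022-01-22":]

assert train.index.max().date().isoformat() == "2022-01-21"
assert test.index.min().date().isoformat() == "2022-01-22"
```

Partial-string slicing on a DatetimeIndex makes the boundary dates inclusive of the whole day, so no records fall between the two sets.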
Of the aforementioned traffic monitoring points, Xianggang Middle Road and Fuzhou South Road represent two of the most significant traffic arteries in the Shinan District. Their intersection is situated at the core of the densely interconnected road network in the Shinan District, in close proximity to two major commercial districts and a multitude of office buildings. The mean daily traffic volume at this intersection is approximately 120,000 vehicles, with an average daily congestion time of 1.5 h. Moreover, the surrounding intersections are closely related to this one. This is a relatively typical urban road traffic monitoring point.
The data from the traffic monitoring points at the Fuzhou South Road-Xianggang Middle Road intersection (ID 601018114050) are used as the prediction target. A spatial autocorrelation analysis is conducted on the traffic flow within a 1.2 km radius centered around the target point to identify groups of monitoring points with strong correlations, as illustrated in Fig. 4. The model is trained to complete traffic flow prediction for this specific traffic monitoring point.
Fig.4Monitoring points group
3.2 Traffic Spatio-Temporal Feature Analysis
1) Time features: Traffic flow can be classified into distinct intervals based on the temporal resolution of the data, including daily, hourly, and 15 min traffic flow. Fig. 5 depicts the fluctuations in traffic volume at identical traffic monitoring points with varying sampling intervals. Fig. 5(a) depicts traffic statistics at a 5 min interval. It can be observed that a very short sampling time window leads to significant fluctuations in traffic volume with relatively low absolute values, which is not conducive to prediction and practical application. Conversely, an overly long sampling time window results in a reduction in the number of data samples. Therefore, a 15 min sampling interval, as illustrated in Fig. 5(c), can effectively smooth the traffic fluctuation curve at the monitoring points while preserving relatively rich detail features.
Fig.5Traffic volume at different intervals
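The aggregation of raw passage records into fixed-interval flow counts can be sketched with pandas (the column names and the toy records are hypothetical):

```python
import pandas as pd

# Hypothetical raw records: one row per vehicle passage at a monitoring point.
records = pd.DataFrame({
    "point_id": [601018114050] * 6,
    "timestamp": pd.to_datetime([
        "2022-01-12 08:01", "2022-01-12 08:07", "2022-01-12 08:14",
        "2022-01-12 08:16", "2022-01-12 08:25", "2022-01-12 08:31",
    ]),
})

# Count passages per point in 15 min bins to obtain the traffic flow series.
flow = (records.set_index("timestamp")
               .groupby("point_id")
               .resample("15min")
               .size())
assert flow.tolist() == [3, 2, 1]   # bins 08:00, 08:15, 08:30
```

The same aggregation applied at 5 min or hourly resolution reproduces the alternative curves compared in Fig. 5.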
The individual elements of the input datasets are the traffic flows at the 16 monitoring points, recorded at 15 min intervals. Fig. 6 illustrates, in three-dimensional format, the traffic flow at the chokepoints on 12 January 2022 from 5:00 to 22:00 at a 15 min analysis interval. It is evident that there are discrepancies in the traffic flow patterns observed at different monitoring points within the same time frame. Additionally, the traffic flow at a given monitoring point exhibits considerable fluctuations over time. Accordingly, this paper investigates the spatial and temporal relationship between the surrounding chokepoint traffic and the predicted target chokepoint traffic, with the objective of making a more accurate prediction of the target chokepoint traffic.
Fig.6 illustrates the historical flow data for the 16 selected monitoring points at 15 min statistical intervals. It can be seen that there are significant differences in the flow values between the points, but they are roughly similar in terms of periodicity, making it necessary to explore the coupling effect between the flow and spatial location of the surrounding and target monitoring points. The following section will further examine the spatial correlation between the traffic flow at different monitoring points and the flow at the target prediction points.
Fig.6Traffic flows at group of monitoring points (interval is 15 min)
2) Spatial features: The configuration of urban road traffic networks is inherently complex, characterized by the intersection of road segments and the relatively short distances between roads. The traffic flow on a given road segment can be affected by the traffic flow on the upstream segment, which in turn can affect the downstream traffic flow. The congestion or smooth flow of traffic on the downstream segment is likely to be influenced by the traffic conditions on the upstream segment. This illustrates that urban road traffic flow displays spatial correlation. As illustrated in Fig. 6, the flow change curves of monitoring points situated on the same main road exhibit a notable degree of resemblance.
In this study, the Pearson coefficient is employed to quantify the relationships between the target traffic monitoring points and their surrounding traffic monitoring points. This coefficient incorporates a spatial weight, w_ij, to derive spatial correlation results. The formula is presented in Eq.(8):
ρ_ij = w_ij E[(X_i(t) − X̄_i)(X_j(t) − X̄_j)] / (σ_Xi σ_Xj)
(8)
where X_i and X_j are the traffic flow time series at monitoring points i and j, respectively; X_i(t) and X_j(t) are their values at time t; X̄_i and X̄_j are their means; E denotes the mathematical expectation; and σ represents the standard deviation.
Fig.7 depicts the spatial correlation coefficient diagram between the research objects. This diagram illustrates the spatial correlations between the various traffic monitoring points under study, based on the spatial correlation analysis conducted using the provided formula. The coefficients offer insights into the relationship between the traffic flow at one traffic monitoring point and the traffic flow at neighbouring traffic monitoring points, thereby facilitating an understanding of the spatial relationships within the road network.
Fig.7Spatial correlation coefficient
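Eq.(8) can be implemented directly (a NumPy sketch; the synthetic series and the weight value are illustrative):

```python
import numpy as np

def weighted_pearson(x, y, w_ij):
    """Eq.(8): Pearson correlation between two flow series, scaled by the
    spatial weight w_ij between monitoring points i and j."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # E[(X_i - X̄_i)(X_j - X̄_j)]
    return w_ij * cov / (x.std() * y.std())

rng = np.random.default_rng(1)
x = rng.random(96)                  # one day of 15 min flows at point i
y = x + 0.1 * rng.random(96)        # a strongly correlated neighbouring point
rho = weighted_pearson(x, y, w_ij=1.0)
assert 0.9 < rho <= 1.0
```

With w_ij = 1 this reduces to the ordinary Pearson coefficient; the spatial weight down-weights correlations between distant point pairs when the full matrix of Fig.7 is assembled.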
A sliding window is a data processing technique employed to generate sample sequences for the analysis of time series data. In this article, the sliding window method is employed to transform the original time series data into a set of data points suitable for training and testing deep learning models. A time window of 12 time steps is created, wherein the observations within the window serve as input features and the traffic values for the subsequent four time steps are used as the target sequence (see Fig. 8). After dividing the samples with the sliding time window, the dimensions of the samples processed by the algorithm employed in this study are (number of samples, length of time window, number of monitoring points); that is, each sample comprises data from 12 time steps with 16 features per time step. The dimensions of the labels are (number of samples, number of predicted time steps, number of predicted points), so each label contains data from 4 time steps with 1 feature per time step.
Fig. 8 Sliding prediction time window
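The sliding-window split described above can be sketched as follows; a minimal NumPy illustration, assuming a (time, points) flow array with the target monitoring point in a known column (names are hypothetical):

```python
import numpy as np

def make_windows(series, window=12, horizon=4, target_col=0):
    """Slice a (T, N) multi-point flow series into supervised samples.

    Inputs  : (num_samples, window, N)  - 12 past steps at all N points.
    Targets : (num_samples, horizon, 1) - next 4 steps at the target point.
    """
    X, y = [], []
    T = series.shape[0]
    for start in range(T - window - horizon + 1):
        X.append(series[start : start + window])
        y.append(series[start + window : start + window + horizon,
                        target_col : target_col + 1])
    return np.stack(X), np.stack(y)
```

For 16 monitoring points, each sample has shape (12, 16) and each label (4, 1), matching the dimensions stated above.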
3.3 Model Evaluation
To evaluate the model’s performance, error functions are commonly used as evaluation metrics for prediction results, including Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²).
$$\mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i-y_i}{y_i}\right|\times 100\%\qquad(9)$$

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2\qquad(10)$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2}\qquad(11)$$

$$R^2=1-\frac{\sum_{i=1}^{n}(\hat{y}_i-y_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}\qquad(12)$$
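The four metrics of Eqs. (9)-(12) can be computed directly; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def mape(y_true, y_pred):
    """Eq. (9): mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_pred - y_true) / y_true)) * 100

def mse(y_true, y_pred):
    """Eq. (10): mean squared error."""
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_true, y_pred):
    """Eq. (11): root mean squared error."""
    return np.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """Eq. (12): coefficient of determination."""
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot
```

Note that MAPE is undefined when a true value is zero, which is why flows near zero (as in the fourth-step outliers discussed later) inflate percentage errors.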
To evaluate the performance of the ISTA-transformer model, it was compared with several baseline models: the Transformer, GRU (Gated Recurrent Unit), CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) network, sequence-to-sequence network (Seq2Seq), and LightGBM.
1) Transformer: Configured with 6 layers for both the encoder and decoder, 8 attention heads, a hidden dimension of 512, a feedforward dimension of 2048, and a dropout rate of 0.1.
2) Gated Recurrent Unit (GRU): A variant of recurrent neural networks (RNNs) that simplifies the architecture compared with LSTM models. The GRU model used here has 2 layers, 64 hidden units, and a dropout rate of 0.2.
3) Convolutional Neural Network (CNN): Designed to process and learn from spatial and temporal patterns in data. For this comparison, the CNN was configured with 3 convolutional layers, a kernel size of 3, a stride of 1, and 64 filters.
4) LSTM Network: A type of recurrent neural network designed to capture long-term dependencies in sequential data. The LSTM model used in this evaluation has 2 layers, 64 hidden units per layer, and a dropout rate of 0.2.
5) Sequence-to-Sequence Network (Seq2Seq): A neural network architecture designed to address sequence-to-sequence problems, comprising two principal components, an encoder and a decoder. The LSTM layers of the encoder and decoder are configured with 64 hidden units, the network has two LSTM layers, and a dropout rate of 0.2 is employed. The model was trained for 200 epochs, with each batch comprising 32 samples.
6) LightGBM: An efficient gradient boosting framework for handling large-scale datasets and high-dimensional features. The number of estimators is 200, the learning rate 0.01, the number of leaves 31, random_state 42, and stopping_rounds 50. Additionally, a logging callback function monitors the evaluation metrics throughout training and logs the model performance every 10 rounds.
These baseline models were used to assess the performance of the ISTA-transformer model by providing a range of different forecasting approaches and capabilities.
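As an illustration of the recurrence underlying the GRU baseline described in 2), a single GRU step can be sketched in NumPy. This is a minimal sketch of the standard gating equations, not the trained model: biases are omitted for brevity and the weight matrices here are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step (biases omitted).

    x : (d_in,) input at the current time step.
    h : (d_hid,) previous hidden state.
    W*: (d_hid, d_in) input weights; U*: (d_hid, d_hid) recurrent weights.
    """
    z = sigmoid(Wz @ x + Uz @ h)              # update gate: how much to renew h
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate: how much history to keep
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate hidden state
    return (1 - z) * h + z * h_tilde          # interpolate old and candidate state
```

The baseline in the comparison stacks two such layers with 64 hidden units and applies dropout of 0.2 between them.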
4 Results and Discussion
4.1 Model Training
Fig. 9 depicts the mean-square error loss curve of the ISTA-transformer model on the normalized traffic data. The raw loss values are displayed as a blue curve showing all fluctuation details, while the red curve represents the smoothed loss trend: a 5-epoch moving window is used to calculate the mean and standard deviation, with the red shaded area indicating a fluctuation range of ±1 standard deviation. The loss decreased sharply during the initial 10 epochs, fluctuated slightly over the subsequent 100 epochs, and then rapidly converged. This suggests that the model is a suitable tool for traffic flow prediction tasks.
Fig. 9 Loss curve of the ISTA-transformer model
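The smoothing used for Fig. 9 can be reproduced with a trailing moving window; a minimal sketch, assuming per-epoch loss values and the 5-epoch window stated above (the exact windowing convention in the paper is not specified, so a trailing window is assumed):

```python
import numpy as np

def smooth_loss(losses, window=5):
    """Rolling mean and standard deviation of a per-epoch loss curve,
    as used to draw the smoothed curve and the +/-1 std band."""
    losses = np.asarray(losses, dtype=float)
    mean = np.empty_like(losses)
    std = np.empty_like(losses)
    for i in range(len(losses)):
        lo = max(0, i - window + 1)      # trailing window of up to `window` epochs
        seg = losses[lo : i + 1]
        mean[i], std[i] = seg.mean(), seg.std()
    return mean, std                     # band is mean - std .. mean + std
```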
4.2 Prediction Results
The ISTA-transformer model was evaluated against all baseline models on the monitoring-point traffic dataset over four sequential time steps (15, 30, 45, and 60 min), i.e., four consecutive 15-minute intervals, collectively referred to as "four-step prediction". This formulation enables the model to consider dependencies across multiple future intervals, thereby improving forecasting stability over different time horizons. The results are presented in Fig. 10.
As demonstrated in Fig. 10, prediction accuracy varies across time steps, with all models predicting the first time step better than the subsequent ones. This suggests that, with limited historical data, each model is better able to predict traffic flow in the first 15 min of the future. However, as the time step increases, the discrepancy between the predicted and true values of each model grows significantly, indicating a decline in accuracy over longer time spans. This may be attributable to the elevated uncertainty of traffic conditions, together with the time-dependence and error-propagation effects inherent in the models, which make forecasting more difficult.
Fig. 10 Actual values versus predicted values
Furthermore, the majority of baseline models exhibit overly optimistic prediction trends during off-peak hours, while during peak periods baseline models such as LightGBM consistently produce predictions significantly higher than the true values. The baseline models also exhibit an outlier at the fourth time step, deviating significantly from the actual values and accompanied by overall anomalous fluctuations (predictions close to 0). In contrast, the ISTA-transformer model demonstrates a superior capacity to capture traffic flow trends, exhibiting substantial consistency between predicted and actual values during periods of significant traffic flow variability; the predicted and actual values remain close during both peak and off-peak periods.
To provide a clearer indication of the error level of the predicted values, inverse normalization was carried out to convert the normalized data back to the original scale. The four evaluation indicators were then calculated using Eqs. (9)-(12) (see Table 1 and Fig. 11). For each time step, the ISTA-transformer model consistently outperforms the other baseline models on all four evaluation metrics, and a review of overall performance across the metrics shows that it gives the most favourable predictions in the early time steps.
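The inverse normalization step can be sketched as follows, assuming min-max scaling was used (the paper does not specify the scaler, so the parameters here are illustrative and would come from the training data):

```python
import numpy as np

def inverse_minmax(x_norm, x_min, x_max):
    """Map normalized predictions back to the original flow scale.

    x_norm : values in [0, 1] produced by min-max normalization.
    x_min, x_max : the minimum and maximum flow observed in training.
    """
    return x_norm * (x_max - x_min) + x_min
```

Computing MAPE and its companions on this original scale, rather than on normalized values, makes the reported errors directly interpretable as flow errors.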
When predicting traffic flow for the first 15 min, the ISTA-transformer model has a MAPE of 0.22, significantly lower than the other models, showing that it better captures the short-term dynamic characteristics of traffic flow. At the 30 min horizon, the ISTA-transformer model again has the most favourable MAPE, 0.29. For a forecast horizon of 45 min, its MAPE rises slightly to 0.37 but still shows superior performance compared with the other models, indicating that ISTA-transformer remains effective at capturing the spatio-temporal dependence of the data as the time step increases, while other models begin to show larger errors. At the longest horizon of 60 min, the MAPE of all models increases significantly; the ISTA-transformer's MAPE of 0.38 is the lowest of all the models, showing that even over a longer horizon it retains superior robustness and generalization and can effectively adapt to short time intervals and traffic data with significant periodicity. The model also consistently achieves higher R² values at the different time steps, demonstrating its strong ability to adapt to fluctuations in traffic data and identify underlying patterns, and confirming its advantages in traffic flow prediction tasks.
Table 1 Prediction error results for different time steps for each model
Fig. 11 Comparison of errors of different models and time steps
5 Conclusions and Future Work
The ISTA-transformer traffic flow prediction model proposed in this paper is designed to facilitate effective multi-step prediction of short-term traffic flow. The model incorporates a spatial attention weight matrix into the traffic flow prediction process, thereby addressing the spatial dimension of traffic data and modelling the highly nonlinear spatial correlation between road segments. On this basis, a spatio-temporal attention mechanism is introduced, whereby the spatio-temporal features extracted by the encoder are superimposed. The historical traffic flow data from 16 monitoring points within the selected research area are used as feature inputs, with a time step of 15 min and a sliding time window to divide the training samples, enabling prediction of the traffic flow at the target monitoring point for the next four time steps. The experimental results on the Qingdao traffic monitoring dataset demonstrate that, for multi-step short-term prediction, the ISTA-transformer model outperforms the six selected baseline models (GRU, CNN, LSTM, Transformer, Seq2Seq, and LightGBM) in terms of the MSE, RMSE, MAPE, and R² evaluation indicators. In the case study, the ISTA-transformer model, trained on a limited amount of historical data, demonstrated robust predictive capability on the test set. The model's performance is most precise at shorter time horizons (15-30 min), indicating its potential for application in traffic flow prediction while also conserving computing resources.
While ISTA-transformer effectively captures spatio-temporal dependencies, an increase in prediction errors is observed for longer horizons (45-60 min), indicating that incorporating advanced techniques such as adaptive attention mechanisms or hybrid models may be necessary to enhance the accuracy of long-horizon forecasting.
In the future, this research intends to incorporate additional factors, such as weather and traffic accidents, into traffic prediction by translating these influencing factors into learnable features. Additionally, further analysis will be conducted on the distinct characteristics of different time steps in multi-step prediction, and techniques such as dynamic time-step adjustment and uncertainty quantification will be explored to enhance the robustness of long-horizon forecasts. Moreover, the objective is to enhance fine-grained forecasting of peak-hour traffic, bringing the model closer to real traffic scenarios and thereby improving its performance in multivariate forecasting.