Foreword
The following article is based on "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", a Best Paper at AAAI 2021. The paper proposes the Informer model to address a series of problems that limit the Transformer in long-sequence forecasting, such as quadratic time complexity, high memory usage, and the structural limitations of the encoder-decoder architecture. Experimental results show that the proposed model outperforms existing methods and provides a new solution for long-sequence forecasting. The full reference to the original paper is given at the end of this article.
1 Abstract
Many real-world applications, such as electricity consumption planning, require long-sequence time-series forecasting. Long Sequence Time-series Forecasting (LSTF) demands a model with high predictive capacity, i.e., the ability to accurately capture the long-range dependencies between output and input. Recent studies have shown the Transformer's potential to increase prediction capacity. However, the Transformer has several serious issues that prevent it from being applied directly to LSTF, such as quadratic time complexity, high memory usage, and the inherent limitations of the encoder-decoder architecture. To solve these problems, the paper designs a Transformer-based LSTF model, the Informer, with three distinctive characteristics:
- A ProbSparse self-attention mechanism that achieves O(L log L) time complexity and memory usage.
- A self-attention distilling operation that highlights the dominant attention by halving the cascading layer input, allowing the model to efficiently handle extremely long input sequences.
- A generative-style decoder which, while conceptually simple, predicts the long sequence in one forward operation rather than step by step, drastically improving the inference speed of long-sequence prediction.
Finally, extensive experiments on four large-scale datasets show that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.
Time-series forecasting is a key ingredient in many domains, such as sensor network monitoring, energy and smart grid management, economics and finance (Zhu and Shasha 2002), and disease propagation analysis. In these scenarios, a large amount of time-series data on past behavior can be used to make long-term forecasts, i.e., long-sequence time-series forecasting. However, existing methods are designed for limited problem settings, such as predicting 48 points or fewer, and their predictive power becomes limited as the sequences grow longer. The figure below shows prediction results on a real dataset, where an LSTM network predicts the hourly temperature of an electrical transformer station from the short term (12 points, 0.5 days) to the long term (480 points, 20 days). When the prediction length exceeds 48 points, the overall performance gap becomes very large: as panel (c) shows, the MSE begins to rise rapidly and the inference speed drops sharply.
Therefore, the main challenge for LSTF is to improve prediction capacity to meet the increasing demand for long sequences, which requires (a) extraordinary long-range alignment ability and (b) efficient operations on long sequence inputs and outputs. Recently, Transformer models have shown superior performance in capturing long-range dependencies compared with RNN models. The self-attention mechanism reduces the maximum path length of network signal propagation to the theoretical shortest O(1) and avoids the recurrent structure, so the Transformer shows great potential for LSTF problems. On the other hand, the self-attention mechanism violates requirement (b) because of its quadratic computation and memory consumption on L-length inputs/outputs. Some large-scale Transformer models achieve impressive results on NLP tasks at great resource cost (Brown et al. 2020), but training on dozens of GPUs and expensive deployment make such models unaffordable for real-world LSTF problems. The efficiency of the self-attention mechanism and the Transformer framework thus becomes the bottleneck for applying them to LSTF problems.
In this article, the authors ask the following question about the Transformer model: can the Transformer be improved to be more efficient in computation, memory, and architecture while maintaining high prediction capacity? The vanilla Transformer has the following three problems:
- The quadratic computation of self-attention. The dot-product operation of canonical self-attention gives each layer O(L²) time complexity and memory usage.
- The memory bottleneck when stacking layers for long inputs. A stack of J encoder/decoder layers brings the total memory usage to O(J·L²), which limits the model's scalability on long sequence inputs.
- The speed plunge when predicting long outputs. The Transformer's dynamic decoding makes step-by-step inference very slow.
To this end, this work explicitly addresses these three issues: it investigates the sparsity in the self-attention mechanism, improves the network components, and carries out extensive experiments. The contributions of the paper are summarized below:
- Informer is proposed to successfully enhance prediction capacity on LSTF problems, validating the potential value of Transformer-like models in capturing long-range dependencies between the outputs and inputs of long sequence time series.
- The ProbSparse self-attention mechanism is proposed to efficiently replace canonical self-attention, achieving O(L log L) time complexity and memory usage.
- A self-attention distilling operation is proposed to privilege dominant attention scores across the J stacked layers, sharply reducing the total space complexity to O((2 − ε)L log L).
- A generative-style decoder is proposed to obtain the long sequence output with only one forward step, which avoids cumulative error spreading during inference.
2 Model Introduction
The overall framework of the model proposed in this paper is shown in the figure below. It can be seen that the proposed Informer model still preserves the encoder-decoder architecture:
First, the problem is defined as follows. The input at time $t$ is $\mathcal{X}^t = \{x_1^t, \ldots, x_{L_x}^t \mid x_i^t \in \mathbb{R}^{d_x}\}$, and the goal is to predict the corresponding output sequence $\mathcal{Y}^t = \{y_1^t, \ldots, y_{L_y}^t \mid y_i^t \in \mathbb{R}^{d_y}\}$. For LSTF problems, the output length $L_y$ is required to be much longer than in conventional settings.
Self-attention mechanism
First, the canonical self-attention is defined on the tuple input (query, key, value), i.e., $Q \in \mathbb{R}^{L_Q \times d}$, $K \in \mathbb{R}^{L_K \times d}$, $V \in \mathbb{R}^{L_V \times d}$, on which the scaled dot-product is applied: $\mathcal{A}(Q, K, V) = \mathrm{Softmax}\!\left(QK^\top / \sqrt{d}\right) V$, where $d$ is the input dimension.
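As a point of reference, here is a minimal PyTorch sketch of the canonical scaled dot-product attention described above; shapes and names are illustrative, not taken from the paper's implementation:

```python
import torch
import torch.nn.functional as F

def full_attention(Q, K, V):
    """Canonical attention: Softmax(Q K^T / sqrt(d)) V, computed densely."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (L_Q, L_K) dot products
    return F.softmax(scores, dim=-1) @ V          # weighted sum of the values

# toy usage: L = 96 time steps, d = 64 dimensions
Q, K, V = torch.randn(96, 64), torch.randn(96, 64), torch.randn(96, 64)
out = full_attention(Q, K, V)                     # (96, 64)
```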
The attention coefficient of the $i$-th query, written in probability form, is $p(k_j \mid q_i) = k(q_i, k_j) / \sum_{l} k(q_i, k_l)$ with the exponential kernel $k(q_i, k_j) = \exp(q_i k_j^\top / \sqrt{d})$, so that the output for $q_i$ is $\mathcal{A}(q_i, K, V) = \sum_j p(k_j \mid q_i)\, v_j$.
Computing this probability requires quadratic dot-product computation and $O(L_Q L_K)$ memory usage, which is the major obstacle to improving prediction capacity. In addition, previous studies have found that the self-attention probability distribution has potential sparsity and have designed "selective" counting strategies on all $p(k_j \mid q_i)$ without significantly affecting performance. Therefore, the authors first qualitatively examine the learned attention patterns of canonical self-attention: the "sparse" attention scores follow a long-tail distribution, i.e., a few dot-product pairs contribute the dominant attention while the others can be ignored. The next question is how to distinguish them.
To measure query sparsity, the KL divergence between the query's attention probability distribution and the uniform distribution is used. Dropping the constant, the sparsity measurement of the $i$-th query is: $M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{q_i k_j^\top / \sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}$
where the first term is the Log-Sum-Exp (LSE) of $q_i$ over all keys and the second term is their arithmetic mean. A larger $M(q_i, K)$ means the query's attention distribution is more "diverse" and is more likely to contain the dominant dot-product pairs.
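The measurement above translates directly into code. The following sketch computes $M(q_i, K)$ exactly for every query; note that the paper instead approximates it by randomly sampling keys, which is what keeps the overall cost at $O(L \ln L)$. Tensor names and shapes are illustrative:

```python
import torch

def sparsity_measure(Q, K):
    """M(q_i, K) = LSE_j(q_i k_j^T / sqrt(d)) - mean_j(q_i k_j^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (L_Q, L_K) scaled dot products
    lse = torch.logsumexp(scores, dim=-1)         # first term: log-sum-exp over keys
    mean = scores.mean(dim=-1)                    # second term: arithmetic mean
    return lse - mean                             # (L_Q,) sparsity score per query
```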
Based on this measurement, ProbSparse self-attention is obtained as: $\mathcal{A}(Q, K, V) = \mathrm{Softmax}\!\left(\bar{Q} K^\top / \sqrt{d}\right) V$
where $\bar{Q}$ is a sparse matrix of the same size as $Q$ that contains only the top-$u$ queries under the sparsity measurement $M$, with $u = c \cdot \ln L_Q$ controlled by a constant sampling factor $c$. ProbSparse self-attention therefore only needs to compute $O(\ln L_Q)$ dot products for each query-key lookup. In addition, Lemma 1 of the paper gives a computable upper bound of the sparsity measurement (obtained by randomly sampling keys), which guarantees an overall time and space complexity of $O(L \ln L)$.
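Putting the two pieces together, the sketch below shows one simplified way to realize ProbSparse self-attention for a single head: it scores every query exactly (whereas the paper samples keys to approximate $M$), keeps only the top-$u$ queries, and lets the remaining "lazy" queries output the mean of $V$, as the paper does in the unmasked self-attention case. Function and variable names are hypothetical:

```python
import math
import torch
import torch.nn.functional as F

def prob_sparse_attention(Q, K, V, c=5):
    """Simplified ProbSparse self-attention sketch (single head, no batching)."""
    L_Q, d = Q.shape
    u = min(L_Q, int(c * math.log(L_Q)))                        # u = c * ln(L_Q)

    scores = Q @ K.transpose(-2, -1) / d ** 0.5                 # (L_Q, L_K)
    M = torch.logsumexp(scores, dim=-1) - scores.mean(dim=-1)   # sparsity measure
    top_idx = M.topk(u).indices                                 # dominant queries

    active = F.softmax(scores[top_idx], dim=-1) @ V             # attention for top-u queries
    out = V.mean(dim=0, keepdim=True).expand(L_Q, -1).clone()   # lazy queries -> mean(V)
    out[top_idx] = active
    return out
```

Because the lazy queries' outputs default to the mean of $V$, skipping their dot products changes the result very little in practice, which is the intuition behind the complexity reduction.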
Encoder
The encoder is designed to extract the robust long-range dependencies of long sequence inputs. As a consequence of the ProbSparse self-attention mechanism, the encoder's feature maps contain redundant combinations of the value $V$. Therefore, a distilling operation is used to privilege the dominant features and produce a focused self-attention feature map in the next layer. The distilling procedure from layer $j$ to layer $j+1$ is: $X_{j+1}^t = \mathrm{MaxPool}\!\left(\mathrm{ELU}\!\left(\mathrm{Conv1d}\!\left([X_j^t]_{AB}\right)\right)\right)$
where $[\cdot]_{AB}$ denotes the attention block, which includes multi-head ProbSparse self-attention and the essential operations; Conv1d performs a one-dimensional convolution over the time dimension with an ELU activation, followed by max pooling with stride 2, which down-samples the sequence to half its length.
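As a concrete illustration, a distilling step of this form can be written as a small PyTorch module; the kernel size and pooling parameters below are illustrative choices consistent with the description above, not necessarily the paper's exact settings:

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Conv1d + ELU + MaxPool distilling step that halves the sequence length."""

    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):               # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)           # Conv1d expects (batch, channels, seq_len)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)        # (batch, seq_len // 2, d_model)
```

Stacking such a layer after each attention block halves the sequence length at every level, which is where the encoder's memory saving comes from.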
Decoder
The decoder uses a standard decoder structure (Vaswani et al. 2017), composed of a stack of two identical multi-head attention layers. In addition, generative inference is employed to alleviate the speed plunge in long-sequence prediction. The decoder is fed the following input: $X_{de}^t = \mathrm{Concat}\!\left(X_{token}^t, X_{0}^t\right) \in \mathbb{R}^{(L_{token} + L_y)\times d_{model}}$, where $X_{token}^t$ is a start token sliced from the input sequence and $X_{0}^t$ is a placeholder for the target sequence, filled with zeros.
Masked multi-head attention is applied in the ProbSparse self-attention computation, which prevents each position from attending to future positions and thus avoids auto-regressive leakage. Finally, a fully connected layer produces the final output, whose dimension depends on whether univariate or multivariate forecasting is performed.
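The generative decoder input described above can be sketched as follows; `label_len` (the length of the start-token slice) and `pred_len` (the prediction horizon) are assumed hyperparameter names introduced for illustration:

```python
import torch

def build_decoder_input(x_enc, label_len, pred_len):
    """Start token sliced from the known sequence + zero placeholder for the
    prediction span, concatenated along the time dimension."""
    start_token = x_enc[:, -label_len:, :]                    # X_token
    placeholder = torch.zeros(
        x_enc.size(0), pred_len, x_enc.size(-1),
        dtype=x_enc.dtype, device=x_enc.device)               # X_0 (zeros)
    return torch.cat([start_token, placeholder], dim=1)       # (batch, label_len + pred_len, d)
```

Feeding this whole vector to the decoder lets the model emit all predicted positions in a single forward pass instead of decoding them one step at a time.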
Loss Function
The model uses MSE as the loss function between the decoder's predictions and the target sequences, and the loss is propagated back from the decoder's outputs through the entire model.
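For completeness, a minimal sketch of this training objective in PyTorch (shapes and tensors are illustrative stand-ins for the model's outputs and targets):

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
pred = torch.randn(32, 24, 7, requires_grad=True)  # stand-in decoder output: (batch, L_y, d_y)
target = torch.randn(32, 24, 7)                    # ground-truth target sequence
loss = criterion(pred, target)                     # mean squared error over all elements
loss.backward()                                    # gradients flow back through the decoder
```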
3 Experimental Verification
In the experimental section, the authors compare Informer with several common baselines on both univariate and multivariate forecasting. Multivariate forecasting is obtained simply by changing the number of output units of the final fully connected layer. The evaluation metrics are MSE and MAE. The experimental results are shown in the following table:
In addition, the sensitivity of the model to its hyperparameters is examined:
The contribution of each component of the model is also examined through an ablation study:
Finally, the time and space complexity of the model is analyzed:
4 Conclusion
This paper proposes the Informer model for long-sequence time-series forecasting. Specifically, the authors design the ProbSparse self-attention mechanism and the distilling operation to address the challenges of the Transformer's quadratic time complexity and quadratic memory usage. Meanwhile, the carefully designed generative-style decoder alleviates the limitations of the traditional encoder-decoder architecture. Finally, experiments on real-world data demonstrate the effectiveness of Informer in improving prediction capacity on LSTF problems.
Bibliography:
Zhou, Haoyi & Zhang, Shanghang & Peng, Jieqi & Zhang, Shuai & Li, Jianxin & Xiong, Hui & Zhang, Wancai. (2020). Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting.