Encoding Time Series as Images
Although deep learning is well established in computer vision and speech recognition, it is still difficult to build predictive models for time series: recurrent neural networks are hard to train, many published results are hard to apply or reproduce, and there are no large pre-trained networks available, which makes 1D-CNNs inconvenient to use.
With the Gramian Angular Field (GAF), however, you can encode a time series as an image and take advantage of today's machine-vision tools.
This article will include the following:
- Mathematical prior knowledge;
- Why the Gram matrix gives a good two-dimensional representation of a univariate time series;
- Why the plain dot-product Gram matrix is a poor representation for a CNN;
- Which operation prepares a Gram-matrix-like structure suitable for a CNN;
Python code will also be included:
- a NumPy implementation of the GAF computation;
The following GIF shows the data being encoded in polar coordinates, and then a Gram-matrix-like operation being applied to the resulting angles:
1. Mathematical prior knowledge
The mathematics behind the GAF is closely related to the inner product and the corresponding Gram matrix.
1.1. Dot Product
An inner product is an operation between two vectors that measures their “similarity”. It allows concepts from traditional Euclidean geometry to be used: lengths, angles, and orthogonality in two and three dimensions.
In two-dimensional space, the inner product of two vectors $u$ and $v$ is defined as:

$$\langle u, v \rangle = u_1 \cdot v_1 + u_2 \cdot v_2$$

Or, equivalently:

$$\langle u, v \rangle = \|u\| \cdot \|v\| \cdot \cos(\theta)$$

If $u$ and $v$ both have norm 1, we get:

$$\langle u, v \rangle = \cos(\theta)$$
So, when dealing with unit vectors, the inner product is determined only by the angle $\theta$ between them (expressed in radians), and its value lies in $[-1, 1]$. Keep these facts in mind; they will be used later in this article.
Note: In $n$-dimensional Euclidean space, the inner product of two vectors is formally defined as:

$$\langle u, v \rangle = \sum_{i=1}^{n} u_i \cdot v_i$$
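As a quick illustration (my own snippet, not from the original article), the following NumPy code checks that the dot product of two unit vectors equals the cosine of the angle between them:

```python
import numpy as np

# Two angles (in radians) and the corresponding 2D unit vectors
alpha, beta = 0.3, 1.2
u = np.array([np.cos(alpha), np.sin(alpha)])
v = np.array([np.cos(beta), np.sin(beta)])

# For unit vectors, <u, v> = cos(theta), where theta is the angle between them
print(np.dot(u, v))          # ~0.6216
print(np.cos(beta - alpha))  # same value
```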
1.2. Gram Matrix
In linear algebra and geometry, the Gram matrix is a useful tool often used to measure the linear dependence of a set of vectors.
Definition: The Gram matrix of $n$ vectors is the matrix formed by the dot products of every pair of vectors. Mathematically:

$$G = \begin{pmatrix} \langle v_1, v_1 \rangle & \cdots & \langle v_1, v_n \rangle \\ \vdots & \ddots & \vdots \\ \langle v_n, v_1 \rangle & \cdots & \langle v_n, v_n \rangle \end{pmatrix}$$

Again, assuming that all the two-dimensional vectors are unit vectors, we get:

$$G_{i,j} = \cos\big(\phi(i,j)\big)$$

where $\phi(i,j)$ is the angle between vectors $v_i$ and $v_j$.
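A minimal NumPy sketch (my own illustration) of a Gram matrix built from unit vectors:

```python
import numpy as np

# Angles of a few 2D unit vectors
angles = np.array([0.0, 0.5, 1.0, 2.0])
vectors = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (4, 2)

# Gram matrix: G[i, j] = <v_i, v_j>, which equals cos(angle_i - angle_j) for unit vectors
G = vectors @ vectors.T
print(np.allclose(G, np.cos(angles[:, None] - angles[None, :])))  # True
```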
Key conclusion: Why Gram matrix?
The Gram matrix preserves the time dependence. Since time increases as position moves from the top left to the bottom right, the time dimension is encoded into the geometry of the matrix.
Note: A univariate time series cannot, on its own, explain the co-occurrences and latent states in the data; our goal should be to find an alternative, richer representation.
2. Implementation method
Suppose we have a time series $X = \{x_1, \cdots, x_n\}$:
2.1. Scaling
A min-max scaler is used to scale the time series into $[-1, 1]$, so that the inner product is not biased towards the observations with the largest values.
In this use case, a standard scaler is not a good candidate, because both its output range and the resulting inner products can fall outside $[-1, 1]$.
With the min-max scaler, however, the inner product does preserve the output range:

$$x, y \in [-1, 1] \implies x \cdot y \in [-1, 1]$$

The choice of $[-1, 1]$ as the input range of the dot product is not innocent: it is desirable, if not strictly necessary, for the operations that follow.
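A minimal sketch (my own code, not the article's) of min-max scaling a series into $[-1, 1]$:

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1D series into [-1, 1] with a min-max scaler."""
    x = np.asarray(x, dtype=float)
    return 2 * (x - x.min()) / (x.max() - x.min()) - 1

series = np.array([3.0, 7.0, 5.0, 9.0, 1.0])
scaled = min_max_scale(series)
print(scaled)                       # values in [-1, 1]
print(scaled.min(), scaled.max())   # -1.0 1.0
```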
2.2. Noisy images
After scaling the time series, we compute the dot product of every pair of values and place them in a Gram matrix:

$$G_{i,j} = \tilde{x}_i \cdot \tilde{x}_j$$
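A quick way to build this matrix with NumPy (my own sketch) is an outer product of the scaled series with itself:

```python
import numpy as np

# 'scaled' is a series already scaled into [-1, 1] (see the previous snippet)
scaled = np.random.uniform(-1, 1, size=100)

# Naive "Gram matrix": G[i, j] = scaled[i] * scaled[j]
G = np.outer(scaled, scaled)
print(G.shape)   # (100, 100)
```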
Let’s look at the values of G as an image:
You can see that:
- the output values appear to follow a Gaussian distribution centered on 0;
- the resulting picture is noisy.
The former explains the latter: the more Gaussian the data, the harder it is to distinguish from Gaussian noise.
This is a problem for our neural network. Moreover, it has been shown that CNNs work better with sparse data.
2.3. Non-sparsity
The Gaussian-looking distribution is not surprising. If we plot the dot product $z = x \cdot y$ for all possible combinations $(x, y) \in \mathbb{R}^2$, we get a three-dimensional surface:
If we assume the values of the time series are uniformly distributed on $[-1, 1]$, the values of the resulting matrix concentrate around 0 in a bell-shaped, roughly Gaussian distribution. Below are histograms of the Gram matrix values for time series of different lengths $n$:
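A small simulation (my own sketch) reproduces this effect: products of uniformly distributed values pile up around 0 in a bell-shaped distribution.

```python
import numpy as np
import matplotlib.pyplot as plt

# Uniform "time series" on [-1, 1] and its naive Gram matrix
x = np.random.uniform(-1, 1, size=500)
G = np.outer(x, x)

# Histogram of the matrix values: concentrated around 0, bell-shaped
plt.hist(G.ravel(), bins=100, density=True)
plt.title("Distribution of naive Gram matrix values")
plt.show()
```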
3. Encoding the time series
Since a univariate time series is one-dimensional, the dot product cannot separate the valuable information from Gaussian noise, and there is no way to exploit an “angle” relation without changing the space.
Therefore, we must encode the time series into a space of at least two dimensions before applying a Gram-matrix-like construction. To do this, we build a bijective map between the one-dimensional time series and a two-dimensional space, so that no information is lost.
This encoding is largely inspired by the polar coordinate transformation, but in this case the radius coordinate represents time.
3.1. Scaling and polar encoding
Step 1: Scale the sequence to $[-1, 1]$ with a min-max scaler
The process is the same as in the implementation above. Thanks to the min-max scaler, the polar encoding will be bijective, since it relies on the $\arccos$ function (see the next step).
Step 2: Convert the scaled time series to “polar coordinates”
There are two quantities to encode: the value of the time series and its corresponding timestamp. They are expressed as an angle and a radius, respectively.
Assume the time series consists of $N$ timestamps $t_i$ with corresponding values $x_i$. Then:
- The angle is computed as $\arccos(\tilde{x}_i)$ of the scaled value, so it lies in $[0, \pi]$.
- For the radius, we divide the interval $[0, 1]$ into $N$ equal parts, which gives the $N + 1$ separation points $\{0, \cdots, 1\}$. We then discard 0 and associate the remaining points, in order, with the successive timestamps of the series.
The mathematical definition is:

$$\phi_i = \arccos(\tilde{x}_i), \qquad r_i = \frac{t_i}{N}$$
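A minimal sketch of this polar encoding (my own code, following the definition above):

```python
import numpy as np

def polar_encode(series):
    """Scale a series into [-1, 1], then encode values as angles and time as radii."""
    x = np.asarray(series, dtype=float)
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1

    n = len(x_scaled)
    phi = np.arccos(x_scaled)        # angles in [0, pi]
    r = np.arange(1, n + 1) / n      # radii in (0, 1], encoding the timestamps
    return phi, r

phi, r = polar_encode([3.0, 7.0, 5.0, 9.0, 1.0])
print(phi)
print(r)   # [0.2, 0.4, 0.6, 0.8, 1.0]
```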
These encodings have several advantages:
- The whole encoding is bijective (as a composition of bijective functions).
- It preserves temporal dependence through the radius coordinate $r$, which is a very useful property.
4. Inner product of time series
Now that we are in two dimensions, the next question is how to deal with sparsity using an inner-product-like operation.
4.1. Why not inner product of polar coded values?
The inner product in this two-dimensional polar space has several limitations, because the norm of each vector was adjusted to carry the time dependence. More precisely:
- the inner product between two distinct observations is biased towards the more recent one (because the norm grows with time);
- the inner product of an observation with itself is also biased by its norm.
So any inner-product-like operation we use should depend only on the angles.
4.2. Use angles
Since any operation resembling an inner product inevitably compresses the information from two different observations into a single value, we cannot fully retain the information carried by both angles; we have to make a trade-off.
To best capture both the individual and the joint information of the two angles, the authors define another inner-product-like operation:

$$x \oplus y = \cos(\theta_x + \theta_y)$$

where $\theta_x$ and $\theta_y$ are the angles of $x$ and $y$, respectively.
Note: I chose a different notation rather than the inner product symbol, because this operation does not satisfy the axioms of an inner product (linearity, positive-definiteness).
This results in the following Gram matrix:

$$GAF = \begin{pmatrix} \cos(\theta_1 + \theta_1) & \cdots & \cos(\theta_1 + \theta_n) \\ \vdots & \ddots & \vdots \\ \cos(\theta_n + \theta_1) & \cdots & \cos(\theta_n + \theta_n) \end{pmatrix}$$
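A minimal NumPy sketch of this Gramian Angular (Summation) Field (my own code, based on the definition above):

```python
import numpy as np

def gasf(series):
    """Gramian Angular Summation Field: GAF[i, j] = cos(phi_i + phi_j)."""
    x = np.asarray(series, dtype=float)
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(x_scaled)
    return np.cos(phi[:, None] + phi[None, :])

image = gasf(np.sin(np.linspace(0, 4 * np.pi, 100)))
print(image.shape)   # (100, 100), values in [-1, 1]
```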
The authors' motivation for this choice is that, compared with Cartesian coordinates, polar coordinates preserve absolute temporal relations.
Advantages
- The main diagonal contains the value and angular information of the scaled time series, so the series can be approximately reconstructed from the high-level features learned by a deep neural network;
- Temporal correlations are accounted for: the entry at position $(i, j)$ captures the relative correlation, by superposition of directions, over the time interval $k = |i - j|$.
4.3. Sparse representation
Now let’s draw the density distribution of the values of the Gramian Angular Field:
As the figure above shows, the Gramian Angular Field is much sparser. To explain why, let us re-express $x \oplus y$ in Cartesian coordinates:

$$x \oplus y = \cos(\theta_x + \theta_y) = x \cdot y - \sqrt{1 - x^2} \cdot \sqrt{1 - y^2}$$

In other words, the newly constructed operation corresponds to a penalized version of the traditional inner product, with penalty term $\sqrt{1 - x^2} \cdot \sqrt{1 - y^2}$.
To see how this penalty behaves, let's look at the whole operation as a 3D surface:
You can see:
- the penalty shifts the average output towards $-1$;
- the closer $x$ and $y$ are to zero, the heavier the penalty; this is desirable, because these points are the ones closest to Gaussian noise;
- points with $x = y = 0$ are mapped to exactly $-1$;
- the output is therefore easy to distinguish from Gaussian noise.
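A quick numerical check of this Cartesian form (my own sketch):

```python
import numpy as np

x = np.random.uniform(-1, 1, size=1000)
y = np.random.uniform(-1, 1, size=1000)

# cos(arccos(x) + arccos(y)) == x*y - sqrt(1 - x^2) * sqrt(1 - y^2)
lhs = np.cos(np.arccos(x) + np.arccos(y))
rhs = x * y - np.sqrt(1 - x**2) * np.sqrt(1 - y**2)
print(np.allclose(lhs, rhs))   # True
print(lhs.mean())              # noticeably below 0: the penalty pulls values down
```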
Disadvantages
- The resulting GAF is large: a time series of length $n$ produces an $n \times n$ image ($n \mapsto n^2$). The authors propose to reduce the size using Piecewise Aggregate Approximation (PAA).
- This operation is not really an inner product.
5. The code
A NumPy implementation for converting univariate time series to images can be found on GitHub.
The code above doesn’t seem to work, so I wrote a working demo of my own using the PyTS package and posted it on my GitHub. See my other blog post: Converting one-dimensional time series into two-dimensional images using PyTS.
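For reference, here is a minimal usage sketch with pyts (assuming a recent version that exposes `GramianAngularField` in `pyts.image`; check the API of your installed version):

```python
import numpy as np
from pyts.image import GramianAngularField

# pyts expects a 2D array of shape (n_samples, n_timestamps)
X = np.sin(np.linspace(0, 4 * np.pi, 100)).reshape(1, -1)

gaf = GramianAngularField(method='summation')  # use 'difference' for the GADF variant
images = gaf.fit_transform(X)
print(images.shape)   # (1, 100, 100)
```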
Summary: This blog post is mainly inspired by a detailed paper by Zhiguang Wang and Tim Oates, who encode time series as images and classify them with tiled convolutional neural networks. The paper also describes another interesting encoding technique: Markov Transition Fields.