• Time Series Analysis, Visualization & Forecasting with LSTM
  • By Susan Li
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Minghao23
  • Proofreaders: Xuyuey, TrWestdoor

Statistical normality testing, the Dickey-Fuller test for stationarity, and long short-term memory networks

The title says it all.

Without further ado, let’s get straight to business.

Data

The data, which measures a household’s electricity consumption at a one-minute sampling rate over a period of nearly four years, can be downloaded here.

The data includes different electrical values and some sub-meter values. However, we focus only on the Global_active_power variable.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.4f' % x)
import seaborn as sns
sns.set_context("paper", font_scale=1.3)
sns.set_style('white')
import warnings
warnings.filterwarnings('ignore')
from time import time
import matplotlib.ticker as tkr
from scipy import stats
from statsmodels.tsa.stattools import adfuller
from sklearn import preprocessing
from statsmodels.tsa.stattools import pacf
%matplotlib inline

import math
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import *
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from keras.callbacks import EarlyStopping

df = pd.read_csv('household_power_consumption.txt', delimiter=';')
print('Number of rows and columns:', df.shape)
df.head(5)

The following data preprocessing and feature engineering steps need to be completed:

  • Merge the date and time into the same column and convert to datetime type.
  • Convert Global_active_power to a numeric type and remove missing values (about 1.2% of rows).
  • Create year, quarter, month, and day features.
  • Create a weekday feature: 0 indicates a weekend and 1 indicates a weekday.
df['date_time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df = df.dropna(subset=['Global_active_power'])
df['date_time']=pd.to_datetime(df['date_time'])
df['year'] = df['date_time'].apply(lambda x: x.year)
df['quarter'] = df['date_time'].apply(lambda x: x.quarter)
df['month'] = df['date_time'].apply(lambda x: x.month)
df['day'] = df['date_time'].apply(lambda x: x.day)
df = df.loc[:, ['date_time', 'Global_active_power', 'year', 'quarter', 'month', 'day']]
df.sort_values('date_time', inplace=True, ascending=True)
df = df.reset_index(drop=True)
df["weekday"]=df.apply(lambda row: row["date_time"].weekday(),axis=1)
df["weekday"] = (df["weekday"] < 5).astype(int)

print('Number of rows and columns after removing missing values:', df.shape)
print('The time series starts from: ', df.date_time.min())
print('The time series ends on: ', df.date_time.max())

After removing the missing values, the data contains 2,049,280 measurements collected between December 2006 and November 2010 (47 months).

The raw data includes multiple variables. We’ll focus on a single one here: the household’s Global_active_power history, the minute-averaged active power consumed by the whole house, in kilowatts.

Statistical normality test

There are statistical tests we can use to quantify whether our data looks as though it was drawn from a Gaussian distribution. We will use D’Agostino’s K² test.

In SciPy’s implementation of this test, we interpret the p-value as follows.

  • p <= alpha: reject H0; the data does not look normal.
  • p > alpha: fail to reject H0; the data looks normal.
stat, p = stats.normaltest(df.Global_active_power)
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('Data looks Gaussian (fail to reject H0)')
else:
    print('Data does not look Gaussian (reject H0)')

Kurtosis and skewness are also calculated to determine if the data distribution deviates from the normal distribution.

sns.distplot(df.Global_active_power);
print( 'Kurtosis of normal distribution: {}'.format(stats.kurtosis(df.Global_active_power)))
print( 'Skewness of normal distribution: {}'.format(stats.skew(df.Global_active_power)))

Kurtosis: Describes the tail weight of the distribution

The kurtosis of a normal distribution is close to zero. If the kurtosis is greater than 0, the tail of the distribution is heavier. If the kurtosis is less than 0, the tail of the distribution is lighter. We calculated that the kurtosis is greater than zero.

Skewness: Measures the asymmetry of a distribution

If the skewness is between -0.5 and 0.5, the data is basically symmetric. If the skewness is between -1 and -0.5 or between 0.5 and 1, the data is slightly skewed. If the skewness is less than -1 or greater than 1, the data is highly skewed. We calculate that the skewness is greater than one.
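
As a quick aid, these thresholds can be wrapped in a small helper. This is a sketch of my own, not part of the original analysis; the `describe_skewness` name is mine.

# My own helper (not in the original post): label a distribution using
# the skewness thresholds described above. `stats` is scipy.stats,
# imported earlier.
def describe_skewness(series):
    skew = stats.skew(series)
    if -0.5 <= skew <= 0.5:
        label = 'approximately symmetric'
    elif -1 <= skew <= 1:
        label = 'moderately skewed'
    else:
        label = 'highly skewed'
    return skew, label

skew, label = describe_skewness(df.Global_active_power)
print('Skewness = %.4f (%s)' % (skew, label))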

A first plot of the time series

df1 = df.loc[:, ['date_time', 'Global_active_power']]
df1.set_index('date_time', inplace=True)
df1.plot(figsize=(12, 5))
plt.ylabel('Global active power')
plt.legend().set_visible(False)
plt.tight_layout()
plt.title('Global Active Power Time Series')
sns.despine(top=True)
plt.show();

Obviously, this plot is not what we want: drawing two million minute-level observations in one figure is unreadable. Don’t plot a time series this way.

Yearly and quarterly box plots of global active power

plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
plt.subplots_adjust(wspace=0.2)
sns.boxplot(x="year", y="Global_active_power", data=df)
plt.xlabel('year')
plt.title('Box plot of Yearly Global Active Power')
sns.despine(left=True)
plt.tight_layout()

plt.subplot(1, 2, 2)
sns.boxplot(x="quarter", y="Global_active_power", data=df)
plt.xlabel('quarter')
plt.title('Box plot of Quarterly Global Active Power')
sns.despine(left=True)
plt.tight_layout();

Comparing the yearly box plots side by side, we notice that the median global active power in 2006 is much higher than in other years. This is a bit misleading: if you remember, we only have data for December 2006, and December is clearly a peak month for household electricity consumption.

The quarterly medians of global active power are more in line with expectations: higher in the first and fourth quarters (winter), and lowest in the third quarter (summer).

Total active power distribution

plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
df['Global_active_power'].hist(bins=50)
plt.title('Global Active Power Distribution')

plt.subplot(1, 2, 2)
stats.probplot(df['Global_active_power'], plot=plt);
df1.describe().T

The normal probability plot also shows that this data deviates greatly from a normal distribution.

Resample the average total active power by day, week, month, quarter, and year

fig = plt.figure(figsize=(18, 16))
fig.subplots_adjust(hspace=0.4)
ax1 = fig.add_subplot(5, 1, 1)
ax1.plot(df1['Global_active_power'].resample('D').mean(), linewidth=1)
ax1.set_title('Mean Global active power resampled over day')
ax1.tick_params(axis='both', which='major')

ax2 = fig.add_subplot(5, 1, 2, sharex=ax1)
ax2.plot(df1['Global_active_power'].resample('W').mean(), linewidth=1)
ax2.set_title('Mean Global active power resampled over week')
ax2.tick_params(axis='both', which='major')

ax3 = fig.add_subplot(5, 1, 3, sharex=ax1)
ax3.plot(df1['Global_active_power'].resample('M').mean(), linewidth=1)
ax3.set_title('Mean Global active power resampled over month')
ax3.tick_params(axis='both', which='major')

ax4 = fig.add_subplot(5, 1, 4, sharex=ax1)
ax4.plot(df1['Global_active_power'].resample('Q').mean(), linewidth=1)
ax4.set_title('Mean Global active power resampled over quarter')
ax4.tick_params(axis='both', which='major')

ax5 = fig.add_subplot(5, 1, 5, sharex=ax1)
ax5.plot(df1['Global_active_power'].resample('A').mean(), linewidth=1)
ax5.set_title('Mean Global active power resampled over year')
ax5.tick_params(axis='both', which='major');

Generally speaking, our time series does not have an upward or downward trend. The highest average power consumption appears to be prior to 2007; that is only because 2006 contributes just one month of data (December), and December is a peak month. In other words, comparing year to year, the series is relatively stable.

Plot the mean total active power grouped by year, quarter, month, and day

plt.figure(figsize=(14, 8))
plt.subplot(2, 2, 1)
df.groupby('year').Global_active_power.agg('mean').plot()
plt.xlabel(' ')
plt.title('Mean Global active power by Year')

plt.subplot(2, 2, 2)
df.groupby('quarter').Global_active_power.agg('mean').plot()
plt.xlabel(' ')
plt.title('Mean Global active power by Quarter')

plt.subplot(2, 2, 3)
df.groupby('month').Global_active_power.agg('mean').plot()
plt.xlabel(' ')
plt.title('Mean Global active power by Month')

plt.subplot(2, 2, 4)
df.groupby('day').Global_active_power.agg('mean').plot()
plt.xlabel(' ')
plt.title('Mean Global active power by Day');

The plots above confirm our previous findings. Year over year, the series is relatively stable. By quarter, the lowest average power consumption falls in the third quarter. By month, the lowest averages are in July and August. By day of the month, the lowest average falls around the 8th (I don’t know why).

Total active power per year

This time, we remove 2006.

pd.pivot_table(df.loc[df['year'] != 2006], values="Global_active_power",
               columns="year", index="month").plot(subplots=True, figsize=(12, 12), layout=(3, 5), sharey=True);

From 2007 to 2010, the pattern was similar every year.

Total active power comparison on weekdays and weekends

dic = {0: 'Weekend', 1: 'Weekday'}
df['Day'] = df.weekday.map(dic)

a = plt.figure(figsize=(9, 4))
plt1 = sns.boxplot('year', 'Global_active_power', hue='Day', width=0.6, fliersize=3,
                   data=df)
a.legend(loc='upper center', bbox_to_anchor=(0.5, 1.00), shadow=True, ncol=2)
sns.despine(left=True, bottom=True)
plt.xlabel(' ')
plt.tight_layout()
plt.legend().set_visible(False);

Prior to 2010, the median total active power on weekdays was lower than on weekends. In 2010, they were exactly equal.

Factor plot of total active power on weekdays versus weekends

plt1 = sns.factorplot('year', 'Global_active_power', hue='Day',
                      data=df, size=4, aspect=1.5, legend=False)
plt.title('Factor Plot of Global active power by Weekend/Weekday')
plt.tight_layout()
sns.despine(left=True, bottom=True)
plt.legend(loc='upper right');

On a yearly basis, weekdays and weekends follow the same pattern.

In principle, we do not need to test for or correct stationarity when using an LSTM. However, if the data is stationary, it helps the model perform better and makes the neural network easier to train.

Stationarity

In statistics, the Dickey-Fuller test tests the null hypothesis that a unit root is present in an autoregressive model. The alternative hypothesis varies depending on the version of the test used, but is usually stationarity or trend-stationarity.

A stationary series has a constant mean and variance; its mean and standard deviation over a sliding window do not change over time.

The Dickey-Fuller test

Null hypothesis (H0): the time series has a unit root, meaning it is non-stationary and has some time-dependent structure.

Alternative hypothesis (H1): the time series has no unit root, meaning it is stationary and has no time-dependent structure.

p-value > 0.05: fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.

p-value <= 0.05: reject the null hypothesis (H0); the data has no unit root and is stationary.

df2 = df1.resample('D').mean()

def test_stationarity(timeseries):
    rolmean = timeseries.rolling(window=30).mean()
    rolstd = timeseries.rolling(window=30).std()

    plt.figure(figsize=(14, 5))
    sns.despine(left=True)
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')

    plt.legend(loc='best'); plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    print ('<Results of Dickey-Fuller Test>')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
test_stationarity(df2.Global_active_power.dropna())

From the results above, we reject the null hypothesis H0: the data has no unit root and is stationary.
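
Had the test pointed the other way, first-order differencing is the usual remedy. The following is a minimal sketch of my own (not needed here, since the series is already stationary), differencing the daily means and re-running the ADF test:

# Illustrative sketch (my own addition, not part of the original post):
# difference the daily-mean series once and re-test for a unit root.
daily_diff = df2.Global_active_power.diff().dropna()
stat, p = adfuller(daily_diff, autolag='AIC')[:2]
print('ADF statistic on differenced series: %.3f, p-value: %.3f' % (stat, p))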

LSTM

Our task is to make a time series prediction based on roughly two million minutes of this household’s electricity consumption history. We will use a multi-layer LSTM recurrent neural network to predict the last value of a sequence of values.

If you want to reduce the computation time and get quick results to test the model, you can resample the data to hourly frequency. I will keep the original minute-level resolution in this experiment.
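
For reference, an hourly resample is a one-liner. This is purely illustrative (the `df_hourly` name is mine); everything below keeps the minute-level data:

# Optional: resample to hourly means, cutting the row count roughly 60x.
# Illustrative only -- the rest of this post uses minute-level data.
df_hourly = df1.Global_active_power.resample('H').mean().dropna()
print('Minute-level rows: %d, hourly rows: %d' % (len(df1), len(df_hourly)))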

Before building the LSTM model, the following data preprocessing and feature engineering work is required.

  • Create the dataset and ensure all values are of type float.
  • Normalize the features.
  • Split into training and test sets.
  • Convert an array of values into a dataset matrix.
  • Reshape into X=t and Y=t+1.
  • Reshape the input into three dimensions (num_samples, num_timesteps, num_features).
dataset = df.Global_active_power.values #numpy.ndarray
dataset = dataset.astype('float32')
dataset = np.reshape(dataset, (-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
train_size = int(len(dataset) * 0.80)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

def create_dataset(dataset, look_back=1):
    X, Y = [], []
    for i in range(len(dataset) - look_back - 1):
        a = dataset[i:(i+look_back), 0]
        X.append(a)
        Y.append(dataset[i + look_back, 0])
    return np.array(X), np.array(Y)

look_back = 30
X_train, Y_train = create_dataset(train, look_back)
X_test, Y_test = create_dataset(test, look_back)

# Convert input dimensions to [samples, time steps, features]
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))

Model structure

  • Define the LSTM model: the first hidden layer has 100 neurons and the output layer has 1 neuron for predicting Global_active_power. The input shape is one time step with 30 features.
  • 20% dropout.
  • Mean squared error loss, optimized with Adam, an efficient variant of stochastic gradient descent.
  • The model trains for 20 epochs with a batch size of 70, with early stopping on validation loss as shown in the code below.
model = Sequential()
model.add(LSTM(100, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

history = model.fit(X_train, Y_train, epochs=20, batch_size=70, validation_data=(X_test, Y_test),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=10)], verbose=1, shuffle=False)

model.summary()
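
Note that the model above contains a single LSTM layer. If you want a genuinely multi-layer (stacked) LSTM, a variant could look like the sketch below. This is my own illustration, not the author’s tuned model; every LSTM layer except the last needs return_sequences=True so that the next layer receives a full sequence as input.

# A stacked two-layer LSTM variant -- an illustrative sketch of my own,
# not the author's tuned model.
stacked = Sequential()
stacked.add(LSTM(100, return_sequences=True,
                 input_shape=(X_train.shape[1], X_train.shape[2])))
stacked.add(Dropout(0.2))
stacked.add(LSTM(50))  # last LSTM layer returns only its final output
stacked.add(Dropout(0.2))
stacked.add(Dense(1))
stacked.compile(loss='mean_squared_error', optimizer='adam')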

Make predictions

train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
# Invert predictions and targets back to the original scale
train_predict = scaler.inverse_transform(train_predict)
Y_train = scaler.inverse_transform([Y_train])
test_predict = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform([Y_test])

print('Train Mean Absolute Error:', mean_absolute_error(Y_train[0], train_predict[:,0]))
print('Train Root Mean Squared Error:',np.sqrt(mean_squared_error(Y_train[0], train_predict[:,0])))
print('Test Mean Absolute Error:', mean_absolute_error(Y_test[0], test_predict[:,0]))
print('Test Root Mean Squared Error:',np.sqrt(mean_squared_error(Y_test[0], test_predict[:,0])))

Plotting the model loss

plt.figure(figsize=(8, 4))
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Test Loss')
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.legend(loc='upper right')
plt.show();

Compare the actual value with the predicted value

In my result, each time step is 1 minute. If you had previously resampled the data in hours, each time step in your results would have been 1 hour.

I’m going to compare the actual and predicted values for the last 200 minutes.

aa = [x for x in range(200)]
plt.figure(figsize=(8, 4))
plt.plot(aa, Y_test[0][:200], marker='.', label="actual")
plt.plot(aa, test_predict[:,0][:200], 'r', label="prediction")
# plt.tick_params(left=False, labelleft=True)  # remove ticks
plt.tight_layout()
sns.despine(top=True)
plt.subplots_adjust(left=0.07)
plt.ylabel('Global_active_power', size=15)
plt.xlabel('Time step', size=15)
plt.legend(fontsize=15)
plt.show();

LSTMs are amazing!

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week!

Reference: Multivariate Time Series Forecasting with LSTMs in Keras
