043 - Manipulating time series data with pandas

(Python libraries and versions used in this article: Python 3.6, Numpy 1.14, Scikit-learn 0.19, matplotlib 2.2)

Time series analysis is an important area of machine learning. Time series data changes continuously over time; typical examples are stock prices, which vary by date, temperature over the course of a year, and the track of a typhoon. The key point is that the samples are ordered in time, so we cannot shuffle them as we would with other data sets, and the field needs its own methods.


1. Prepare time series data

The time series data used in this chapter comes from data_timeseries.txt, in which the first column is the year (1940 to 2015), the second column is the month, and the third and fourth columns are the data.

1.1 Converting the data to time series format

We could merge the first and second columns and use a function to convert the resulting strings to dates. However, the data is regular and no time points are missing, so we can instead construct the time series with the pandas date_range() function and use it as the index of the DataFrame.
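
For comparison, the first option (merging the year and month columns) might look roughly like the sketch below. The column names year and month and the three sample rows are assumptions made for illustration; the real file is loaded without a header, so its columns are numbered 0 to 3.

```python
import pandas as pd

# Hypothetical stand-in for the first two columns of the data file
raw = pd.DataFrame({'year': [1940, 1940, 1940], 'month': [1, 2, 3]})

# to_datetime() can assemble dates from columns named year, month and day;
# the day is fixed to 1 here because the data is monthly
dates = pd.to_datetime(raw.assign(day=1))
print(dates)
```

This approach also works when some months are missing, which is the case where date_range() would no longer line up with the rows.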

Step 1: Load the dataset with pandas

# Load the data set
import pandas as pd

data_path = r'E:\PyProjects\DataSet\FireAI/data_timeseries.txt'  # raw string so the backslashes are kept literally
df = pd.read_csv(data_path, header=None)
df.info()  # check the data to make sure there are no errors
print(df.head())
print(df.tail())

Step 2: Build time series data with pd.date_range()

start = str(df.iloc[0, 0]) + '-' + str(df.iloc[0, 1])  # 1940-1
if df.iloc[-1, 1] % 12 == 0:  # if the data ends in December, roll over to January of the following year
    end = str(int(df.iloc[-1, 0]) + 1) + '-01'
else:
    end = str(df.iloc[-1, 0]) + '-' + str(int(df.iloc[-1, 1]) + 1)
print(end)

# Build month-end dates at monthly intervals (pandas 2.2 and later prefer the alias 'ME')
dates = pd.date_range(start, end, freq='M')
print(dates[0])
print(dates[-1])  # the last date is 2015-12-31, as expected
print(len(dates))

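With freq='M', date_range() generates month-end dates, and an end string such as '2015-12' is read as 2015-12-01, which falls before 2015-12-31. The end point is therefore pushed forward by one month so that December 2015 is included and the number of dates matches the number of rows.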

--------------------------------------------------

2016-01
1940-01-31 00:00:00
2015-12-31 00:00:00
912

--------------------------------------------------
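
As an alternative to computing the end month by hand, date_range() also accepts a number of periods. The sketch below assumes the 912 rows reported above and uses month-start labels ('MS') rather than the month-end labels used in the article, so the dates differ by design.

```python
import pandas as pd

n_rows = 912  # 76 years x 12 months, matching the 1940-2015 data set
dates = pd.date_range(start='1940-01', periods=n_rows, freq='MS')

print(dates[0])    # first month
print(dates[-1])   # last month
print(len(dates))
```

In practice n_rows would be len(df), which guarantees that the index and the data have the same length.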

Step 3: Set the time series as the index of df

df.index = dates  # set_axis() returns a new object in recent pandas versions, so assign the index directly
df.info()
print(df.head())

As the output shows, the dates built above are now the index of df.

1.2 Plotting of time series data

pandas has plotting built in, so time series data can be drawn directly. For example, the following code plots column 2 (zero-based), which is the first data column:

# Plot
df.iloc[:, 2].plot()  # draw the series in column 2

The plot above contains too much data to read clearly, so it helps to draw only part of it and look at the trend over a given period:

# The plot above is too dense; look at a specific time period instead
start = '2008-2'
end = '2010-3'
# Select the column first to get a Series, then slice that Series by date
df.iloc[:, 2].loc[start:end].plot()

You can also select whole years of data to plot:

start = '2008'  # a year string selects the entire year
end = '2010'
df.iloc[:, 2].loc[start:end].plot()
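
The date-string slicing used above can be checked on a small synthetic series. The index range and values below are made up for illustration; only the slicing behaviour matters.

```python
import pandas as pd

# Synthetic monthly series standing in for one data column (values are made up)
idx = pd.date_range(start='2005-01', periods=120, freq='MS')
s = pd.Series(range(120), index=idx)

by_month = s.loc['2008-02':'2010-03']  # both end points are included
by_year = s.loc['2008':'2010']         # all of 2008, 2009 and 2010

print(len(by_month), len(by_year))
```

Unlike ordinary Python slices, label-based slices on a DatetimeIndex include the end point, so '2010' covers everything up to December 2010.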

You can draw more than one column at a time:

# Draw two columns at once
start = '2008'
end = '2010'
df.iloc[:, 2:4].loc[start:end].plot()  # draw columns 2 and 3 (both data columns) together

You can also plot the difference between two columns, or their sum, minimum, maximum and so on:

# Plot the difference between the two data columns
start = '2008'
end = '2010'
temp_df = df.iloc[:, 2].loc[start:end] - df.iloc[:, 3].loc[start:end]
temp_df.plot()

You can also plot only the rows where one column is greater than a certain value and another column is less than a certain value.

# Column 2 greater than 60 and column 3 less than 20
temp_df2 = df[(df.iloc[:, 2] > 60) & (df.iloc[:, 3] < 20)].iloc[:, 2:4]
temp_df2.plot()
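
A minimal sketch of this kind of two-condition filter on made-up data is shown below. The column names a and b and all values are hypothetical; the point is that the two conditions are combined into a single boolean mask with &.

```python
import pandas as pd

# Made-up values; 'a' and 'b' are hypothetical names for the two data columns
df_demo = pd.DataFrame({'a': [70, 50, 80, 65], 'b': [10, 5, 30, 15]})

# Each comparison needs its own parentheses because & binds more tightly than > and <
mask = (df_demo['a'] > 60) & (df_demo['b'] < 20)
selected = df_demo[mask]
print(selected)
```

Building one mask avoids filtering the DataFrame twice in a row, where the second condition no longer has the same index as the already filtered rows.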

1.3 Obtaining Statistics

1.3.1 Obtaining max, min, mean, etc.

# Get the statistics of the dataset
part_df = df.iloc[:, 2:4]  # only the two data columns (columns 2 and 3) are summarized
print('Max: \n{}'.format(part_df.max()))
print('Min: \n{}'.format(part_df.min()))
print('Mean: \n{}'.format(part_df.mean()))
# These calls return max, min and mean one at a time; describe() below is more convenient

print(part_df.describe())  # summarizes the overall distribution of the data

These functions are simple, so the printed results are omitted.

1.3.2 Calculation of moving average

A moving average is mainly used to suppress noise and make a signal look smoother. It is computed by averaging the most recent N data points, then sliding the window forward one point at a time, so that each value is always the mean of the latest N points. Anyone who trades stocks will be familiar with what a moving average means and how it is calculated.

# Calculate the N-point moving average MAn
N = 20
MAn = part_df.rolling(N).mean()  # the first N-1 rows are NaN because the window is not yet full
MAn.plot()
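
To make the calculation concrete, here is a plain-Python sketch of the same idea. The function name moving_average and the sample data are illustrative only, and None plays the role of the NaN values that pandas produces while the window is still filling.

```python
def moving_average(values, n):
    """Return the trailing n-point mean; the first n-1 entries are None."""
    out = []
    window_sum = 0.0
    for i, v in enumerate(values):
        window_sum += v                 # add the newest point
        if i >= n:
            window_sum -= values[i - n]  # drop the point that left the window
        out.append(window_sum / n if i >= n - 1 else None)
    return out

data = [1, 2, 3, 4, 5, 6]
ma3 = moving_average(data, 3)
print(ma3)  # [None, None, 2.0, 3.0, 4.0, 5.0]
```

Keeping a running sum means each step costs the same regardless of the window size.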

1.3.3 Calculating the correlation coefficient of the moving averages

The rolling correlation coefficient measures how strongly two columns of data move together over a sliding window. Using stocks as an illustration, it describes how correlated the price movements of two stocks are: if the correlation is strong, the two stocks tend to rise and fall together; if it is weak, there is little relationship between their price movements.

# Calculate the rolling correlation between the two moving averages
N = 20
MAn = part_df.rolling(N).mean()
corr = MAn.iloc[:, 0].rolling(window=40).corr(MAn.iloc[:, 1])  # correlation over a 40-point window
corr.plot()
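
The idea behind a rolling correlation can be sketched in plain Python as a Pearson coefficient computed over each trailing window. The helper names pearson and rolling_corr and the sample data are illustrative, and the sketch assumes no window is constant (which would make the denominator zero).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rolling_corr(x, y, window):
    """Correlation over each trailing window; None until the window is full."""
    return [
        None if i < window - 1
        else pearson(x[i - window + 1:i + 1], y[i - window + 1:i + 1])
        for i in range(len(x))
    ]

x = [1, 2, 3, 4, 5, 6]
r_pos = rolling_corr(x, [2, 4, 6, 8, 10, 12], 3)  # moves with x
r_neg = rolling_corr(x, [6, 5, 4, 3, 2, 1], 3)    # moves against x
print(r_pos[-1], r_neg[-1])
```

Values close to 1 mean the two series rise and fall together within the window, and values close to -1 mean they move in opposite directions.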

########################## Summary ##########################

This section covers little new ground, because most of what it uses are basic pandas methods.

#############################################################


Note: The code for this part has been uploaded to (my Github); feel free to download it.

References:

1. Classic Examples of Python Machine Learning, by Prateek Joshi, translated by Tao Junjie and Chen Xiaoli