• Time Series Analysis in Python: An Introduction
  • Will Koehrsen
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: the PPP – man

An additive model for time series modeling

Time series are one of the most common data types in daily life. Financial market prices, the weather, household energy consumption, and even body weight are all examples of data that can be collected at regular intervals. Almost every data scientist will come across time series in their work, and learning how to model them is an important skill in data science. A simple yet powerful method for analyzing and predicting periodic data is the additive model. The intuition behind it is that a time series can be decomposed into a combination of an overall trend and patterns repeating on different time scales: daily, weekly, quarterly, or yearly. Your home may use more energy in the summer than in the winter, but show an overall declining trend as energy efficiency improves. Additive models show us patterns/trends and make predictions based on these observations.
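To make this concrete, here is a minimal sketch (my own illustration, not part of the original analysis) that builds a synthetic daily series as the sum of a trend, a yearly pattern, a weekly pattern, and noise; these are exactly the kinds of components an additive model tries to recover:

import numpy as np
import pandas as pd

# Three years of daily observations
dates = pd.date_range('2015-01-01', periods=3 * 365, freq='D')
t = np.arange(len(dates))

trend = 0.05 * t                              # slow overall rise
yearly = 10 * np.sin(2 * np.pi * t / 365.25)  # annual cycle
weekly = 2 * np.sin(2 * np.pi * t / 7)        # weekly cycle
noise = np.random.normal(0, 1, len(dates))

# The observed series is the sum of its components
series = pd.Series(trend + yearly + weekly + noise, index=dates)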

The chart below shows a time series decomposed into an overall trend plus yearly and weekly patterns.

Examples of additive model decomposition

This article will walk through an introductory example of modeling financial time series data with Python using Facebook’s Prophet forecasting package. Along the way we’ll cover manipulating data with pandas, retrieving financial data from Quandl, and plotting with matplotlib. I’ve included the introductory code here, and I encourage you to check out the full analysis in the Jupyter Notebook on GitHub. This introduction will show you all the steps you need to start modeling a time series yourself.

Disclaimer: past performance of financial data is not an indicator of future performance, and this approach will not make you rich. I chose stocks because daily data is easy to get and fun to work with.

Extracting financial data

On average, nearly 80% of the time on a data science project is spent acquiring and cleaning data. Thanks to the Quandl financial library, that dropped to about 5% for this project. Quandl can be installed with pip from the command line, lets you access thousands of financial indicators with a single line of Python, and allows up to 50 requests without registration. Once you sign up for a free account, you get an API key and unlimited requests.

First, we import the required libraries and fetch some data. Quandl automatically puts our data into a pandas dataframe, the data structure of choice for data science. (For other companies, just replace the 'TSLA' or 'GM' ticker; you can also specify a date range.)

# Use quandl to fetch financial data
import quandl

# Use pandas for data processing
import pandas as pd

quandl.ApiConfig.api_key = 'getyourownkey!'

# Retrieve TSLA data from Quandl
tesla = quandl.get('WIKI/TSLA')

# Retrieve GM data from Quandl
gm = quandl.get('WIKI/GM')
gm.head(5)

Screenshot of GM data from Quandl

There is nearly unlimited data on Quandl, but I wanted to focus on comparing two companies in the same industry: Tesla and General Motors. Tesla is fascinating not only because it is the first successful American car startup in 111 years, but also because at times in 2017 it was the most valuable car company in America, despite selling only four models. General Motors, its rival for that title, has recently shown signs of embracing the car of the future by building some impressive (if not particularly pretty) all-electric vehicles.

The answer is obvious

We could spend a lot of time searching for this data and downloading it as CSV files, but thanks to Quandl, we get everything we need in a few seconds.

Data exploration

Before we start modeling, it’s best to plot the data to get a feel for its structure. Plotting also lets us check for outliers and missing values.

Matplotlib makes it easy to plot a pandas dataframe. Don’t worry if the plotting code looks intimidating: I also find matplotlib unintuitive, and I often copy and paste working examples from Stack Overflow or the documentation. One of the rules of programming is never to rewrite something that already exists.

import matplotlib.pyplot as plt

# The stock-split-adjusted closing price is the data we should plot
plt.plot(gm.index, gm['Adj. Close'])
plt.title('GM Stock Price')
plt.ylabel('Price ($)');
plt.show()

plt.plot(tesla.index, tesla['Adj. Close'], 'r')
plt.title('Tesla Stock Price')
plt.ylabel('Price ($)');
plt.show();

Original stock price

Comparing the raw stock prices of the two companies doesn’t tell us which one is more valuable, because a company’s value (its market capitalization) also depends on its number of shares (market cap = share price × number of shares). Quandl doesn’t carry share counts, but a quick Google search turned up the average yearly number of shares for both companies. It’s not exact, but it’s accurate enough for our analysis; sometimes we have to make do with imperfect data!

To build a market cap column in each dataframe, we use a few pandas tricks, such as moving the index into a column (reset_index) and using ix to index specific rows and columns of the dataframe.

# Average yearly number of shares for Tesla and General Motors
tesla_shares = {2018: 168e6, 2017: 162e6, 2016: 144e6, 2015: 128e6, 2014: 125e6, 2013: 119e6, 2012: 107e6, 2011: 100e6, 2010: 51e6}

gm_shares = {2018: 1.42e9, 2017: 1.50e9, 2016: 1.54e9, 2015: 1.59e9, 2014: 1.61e9, 2013: 1.39e9, 2012: 1.57e9, 2011: 1.54e9, 2010: 1.50e9}

# Create a year column
tesla['Year'] = tesla.index.year

# Move the dates from the index into a Date column
tesla.reset_index(level=0, inplace=True)
tesla['cap'] = 0

# Calculate market cap for all years
for i, year in enumerate(tesla['Year']):
    # Retrieve the number of shares for that year
    shares = tesla_shares.get(year)

    # The market cap column equals shares times price
    tesla.ix[i, 'cap'] = shares * tesla.ix[i, 'Adj. Close']

That is how Tesla’s market cap column is built. We create GM’s in the same way and then merge the two dataframes. Merging is an essential operation in data science because it lets us join datasets on a shared column. Here we merge the two companies’ data on the date, using an “inner” merge to keep only the dates present in both dataframes. We then rename the merged columns so we know which belongs to which car company.

# Merge the two dataframes and rename the columns
cars = gm.merge(tesla, how='inner', on='Date')

cars.rename(columns={'cap_x': 'gm_cap', 'cap_y': 'tesla_cap'}, inplace=True)

# Select only the relevant columns
cars = cars.ix[:, ['Date', 'gm_cap', 'tesla_cap']]

# Divide to get market cap in billions of dollars
cars['gm_cap'] = cars['gm_cap'] / 1e9
cars['tesla_cap'] = cars['tesla_cap'] / 1e9

cars.head()

The merged market cap dataframe

Market capitalization is in billions of dollars. We can see that GM started the data period with a market cap about 30 times Tesla’s! Does that change over time?

plt.figure(figsize=(10, 8))
plt.plot(cars['Date'], cars['gm_cap'], 'b-', label='GM')
plt.plot(cars['Date'], cars['tesla_cap'], 'r-', label='TESLA')
plt.xlabel('Date'); plt.ylabel('Market Cap (Billions $)'); plt.title('Market Cap of GM and Tesla')
plt.legend();

Historical market cap data

We can see Tesla’s rapid rise and GM’s modest growth over the course of the data. Tesla even passed GM in value in 2017!

import numpy as np

# Find the first and last time Tesla was valued higher than GM
first_date = cars.ix[np.min(list(np.where(cars['tesla_cap'] > cars['gm_cap'])[0])), 'Date']
last_date = cars.ix[np.max(list(np.where(cars['tesla_cap'] > cars['gm_cap'])[0])), 'Date']

print("Tesla was valued higher than GM from {} to {}.".format(first_date.date(), last_date.date()))

Tesla was valued higher than GM from 2017-04-10 to 2017-09-21.

During that period, Tesla sold about 48,000 cars while GM sold 1.5 million. GM was valued less than Tesla during a period in which it sold 30 times more cars! That definitely shows the power of a persuasive executive and a very high quality (if very low volume) product. Although Tesla’s market cap is now lower than GM’s, can we expect Tesla to surpass it again? When will it happen? To answer that, we turn to additive models for predicting the future.

Prophet modeling

Facebook’s Prophet package, released in 2017 for Python and R, has been embraced by data scientists around the world. Prophet is designed for analyzing time series of daily observations that exhibit patterns on different time scales. It also has advanced capabilities for modeling the effects of holidays on a time series and for implementing custom changepoints, but we will stick to the basic functions needed to get a model up and running. Like Quandl, Prophet can be installed with pip from the command line.

We first import fbprophet and rename the columns in our data to the correct format. The date column must be called “ds” and the value column we want to predict “y”. We then create a prophet model and fit it to our data, much as in Scikit-Learn:

import fbprophet

# Prophet requires columns named ds (date) and y (value)
gm = gm.rename(columns={'Date': 'ds', 'cap': 'y'})

# Put market cap in billions of dollars
gm['y'] = gm['y'] / 1e9

# Create a prophet model and fit it to the data
gm_prophet = fbprophet.Prophet(changepoint_prior_scale=0.15)
gm_prophet.fit(gm)

When creating the prophet model, I set the changepoint prior to 0.15, higher than the default of 0.05. This hyperparameter controls how sensitive the trend is to changes: the higher the value, the more sensitive; the lower, the less sensitive. This value exists to counter one of the most fundamental trade-offs in machine learning: bias versus variance.

If a model follows its training data too closely, which is called overfitting, it has too much variance and will struggle to generalize to new data. On the other hand, a model that fails to capture the trends in the training data is underfitting and has too much bias. When a model underfits, increase the changepoint prior to let it fit the data more closely; when it overfits, decrease the prior to constrain the model. The effect of the changepoint prior can be shown by plotting predictions from models with a range of prior values:
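As a sketch of that comparison (assuming gm is already in prophet’s ds/y format as above; the prior values are only examples), we can fit one model per prior and overlay the forecasts:

import fbprophet
import matplotlib.pyplot as plt

# Fit a model for each changepoint prior and plot its forecast
for prior, color in zip([0.001, 0.05, 0.15, 0.5], ['b', 'g', 'r', 'k']):
    model = fbprophet.Prophet(changepoint_prior_scale=prior)
    model.fit(gm)
    future = model.make_future_dataframe(periods=365 * 2, freq='D')
    forecast = model.predict(future)
    plt.plot(forecast['ds'], forecast['yhat'], color, label='prior = %.3f' % prior)

# Overlay the observations for comparison
plt.plot(gm['ds'], gm['y'], 'ko', markersize=2, label='observations')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Market Cap (billions $)');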

The higher the changepoint prior, the more flexible the model and the more closely it fits the training data. That might seem like exactly what we want, but fitting the training data too well can weaken the model’s ability to predict new data. We therefore look for a balance: a model that follows the training data while still generalizing to other data. Since stock prices change every day, after some experimenting I increased the model’s flexibility.

When creating the prophet model, we can also specify changepoints: the points where a time series goes from increasing to decreasing, or from increasing slowly to increasing rapidly; in other words, where its rate of change is greatest. Changepoints can correspond to major events such as product launches or macroeconomic swings. If we do not specify them, prophet calculates them for us.
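If we did want to supply them ourselves, a minimal sketch might look like this (the dates here are invented purely for illustration; changepoints is the relevant prophet argument):

# Hypothetical example: hand-picked changepoint dates instead of automatic detection
model = fbprophet.Prophet(changepoints=['2014-01-01', '2016-06-01'])
model.fit(gm)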

To make a forecast, we create what prophet calls a future dataframe. We specify the number of periods to forecast (two years) and the frequency of the predictions (daily), then make predictions with the prophet model and the future dataframe.

# Make a future dataframe for 2 years
gm_forecast = gm_prophet.make_future_dataframe(periods=365 * 2, freq='D')

# Make predictions
gm_forecast = gm_prophet.predict(gm_forecast)

Our future dataframes now contain the estimated market cap of Tesla and GM for the next two years. We can visualize a forecast with prophet’s built-in plot function.

gm_prophet.plot(gm_forecast, xlabel = 'Date', ylabel = 'Market Cap (billions $)')
plt.title('Market Cap of GM');

The black dots represent the actual values (note that they stop at the beginning of 2018), the blue line is the forecast, and the light blue region is the uncertainty (always the most important part of a forecast). The region of uncertainty widens further out in time, because the initial uncertainty compounds, just as weather forecasts become less accurate the further ahead they reach.

We can also inspect the changepoints detected by the model. To reiterate, changepoints are where the growth of the time series changes significantly (for example, from increasing to decreasing). Here we list the changepoints of the Tesla model, which was built in exactly the same way as the GM model.

tesla_prophet.changepoints[:10]

61    2010-09-24
122   2010-12-21
182   2011-03-18
243   2011-06-15
304   2011-09-12
365   2011-12-07
425   2012-03-06
486   2012-06-01
547   2012-08-28
608   2012-11-27

We can compare Google search trends for Tesla over the same period to see if the results line up. We put the changepoints (vertical lines) and the search frequency on the same graph:

# Load the search data
tesla_search = pd.read_csv('data/tesla_search_terms.csv')

# Convert the month to a datetime
tesla_search['Month'] = pd.to_datetime(tesla_search['Month'])
tesla_changepoints = [str(date) for date in tesla_prophet.changepoints]

# Plot the search frequency
plt.plot(tesla_search['Month'], tesla_search['Search'], label = 'Searches')

# Plot the changepoints
plt.vlines(tesla_changepoints, ymin = 0, ymax = 100, colors = 'r', linewidth = 0.6, linestyles = 'dashed', label = 'Changepoints')

# Format the plot
plt.grid('off'); plt.ylabel('Relative Search Freq'); plt.legend()
plt.title('Tesla Search Terms and Changepoints');

Tesla search frequency and stock Changepoints

Some of the changepoints in Tesla’s market cap line up with changes in the frequency of Tesla searches, but not all of them. From this, I would say that relative Google search frequency is not a great indicator of stock movements.

We still want to know when Tesla’s market cap will surpass GM’s. Since we now have two years of forecasts for both companies, we can merge the dataframes and plot both market caps on the same graph. Before merging, we rename the columns so we can keep track of which data belongs to which company.

gm_names = ['gm_%s' % column for column in gm_forecast.columns]
tesla_names = ['tesla_%s' % column for column in tesla_forecast.columns]

# Dataframes to merge
merge_gm_forecast = gm_forecast.copy()
merge_tesla_forecast = tesla_forecast.copy()

# Rename the columns
merge_gm_forecast.columns = gm_names
merge_tesla_forecast.columns = tesla_names

# Merge the two datasets
forecast = pd.merge(merge_gm_forecast, merge_tesla_forecast, how = 'inner', left_on = 'gm_ds', right_on = 'tesla_ds')

# Rename the date column and drop the duplicate
forecast = forecast.rename(columns={'gm_ds': 'Date'}).drop('tesla_ds', axis=1)

First we will plot just the estimate. The estimate (called “yhat” in the prophet package) smooths out some of the noise in the data, so it looks a little different from the raw plots. The degree of smoothness depends on the changepoint prior: a higher prior means a more flexible model with more ups and downs.
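As a sketch, the estimates can be drawn from the merged forecast dataframe (the gm_yhat and tesla_yhat column names follow from the renaming above):

plt.figure(figsize=(10, 8))

# Plot the estimate (yhat) for each company
plt.plot(forecast['Date'], forecast['gm_yhat'], label = 'GM prediction')
plt.plot(forecast['Date'], forecast['tesla_yhat'], 'r', label = 'Tesla prediction')
plt.xlabel('Date'); plt.ylabel('Market Cap (billions $)')
plt.title('Market Cap Prediction'); plt.legend();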

The projected market capitalization of GM and Tesla

Our model says Tesla’s brief 2017 overtaking of GM was just noise, and it is not until early 2018 that Tesla permanently beats GM in the forecast. The exact date is January 27, 2018, so if that comes true, I will happily take credit for predicting the future!
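One way to locate that date in the merged forecast (a sketch; the cutoff assumes, as noted above, that the actual data ends at the start of 2018):

# Keep only dates beyond the observed data, then find the first date
# where Tesla's estimate exceeds GM's
future_rows = forecast[forecast['Date'] > '2018-01-01']
overtake = future_rows[future_rows['tesla_yhat'] > future_rows['gm_yhat']]
print(overtake['Date'].min())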

When making the graph above, we left out the most important part of a forecast: the uncertainty! We can use matplotlib (see the notebook) to show the regions of uncertainty:
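A sketch of that plot, using the yhat_lower and yhat_upper columns prophet produces (prefixed with gm_/tesla_ after our merge; to_pydatetime makes the dates safe for fill_between):

plt.figure(figsize=(10, 8))

# GM estimate with its shaded uncertainty interval
plt.plot(forecast['Date'], forecast['gm_yhat'], 'b', label = 'GM prediction')
plt.fill_between(forecast['Date'].dt.to_pydatetime(), forecast['gm_yhat_lower'], forecast['gm_yhat_upper'], alpha = 0.3, facecolor = 'b')

# Tesla estimate with its shaded uncertainty interval
plt.plot(forecast['Date'], forecast['tesla_yhat'], 'r', label = 'Tesla prediction')
plt.fill_between(forecast['Date'].dt.to_pydatetime(), forecast['tesla_yhat_lower'], forecast['tesla_yhat_upper'], alpha = 0.3, facecolor = 'r')

plt.legend(); plt.xlabel('Date'); plt.ylabel('Market Cap (billions $)')
plt.title('Market Cap Prediction with Uncertainty');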

This is a much better representation of the forecast. It shows that both companies are expected to grow, but Tesla more quickly than GM. Again, the uncertainty increases with time, and Tesla’s lower bound falls below GM’s upper bound in 2020, meaning GM may well retain the lead.

Trends and patterns

The final step of our market cap analysis is to look at the overall trend and patterns. Prophet makes this easy for us.

# Describe trends and patterns
gm_prophet.plot_components(gm_forecast)

Decomposition of time series of General Motors

The trend is unmistakable: GM’s stock price is rising and will keep rising. The interesting pattern is within a single year: the stock price rises toward the end of the year and then slowly declines into the summer. We can try to find out whether there is a correlation between the yearly market cap pattern and GM’s average monthly sales. I first collected monthly sales figures from Google, then averaged them by month using a groupby. Grouping is an important operation because we often want to compare data across categories, such as users in a particular age group or cars from one manufacturer. Here, to compute average monthly sales, we group the rows by month and average the sales within each group.

gm_sales_grouped = gm_sales.groupby('Month').mean()
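The grouped averages can then be plotted (a sketch; gm_sales is assumed to be a dataframe of monthly sales with 'Month' and 'Sales' columns built from the Google figures):

# Plot the average sales for each month (gm_sales_grouped from above)
gm_sales_grouped.plot(kind='bar', legend=False)
plt.ylabel('Average Monthly Sales'); plt.title('GM Average Monthly Sales');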

Monthly sales appear to have little to do with market cap. Sales are highest in August, exactly when the market cap is lowest!

The weekly pattern carries no useful signal (there are no stock prices on weekends, so we only look at weekdays). This is to be expected: economics tells us that daily stock prices follow a random walk, so looking at daily movements alone tells us nothing. Our analysis shows that stocks grow over the long run, but on a day-to-day scale there is barely any pattern we can exploit, even with the best models.

A quick look at the Dow Jones Industrial Average (an index of the 30 largest companies) illustrates this point well:

Dow Jones Industrial Average (Source)

The obvious takeaway: go back to 1900 and invest! Or, in the real world, don’t pull out of the market when it dips, because history says it will rise again. On the whole, the daily fluctuations are too small to even see, and rather than obsessing over stocks every day, we would do better to invest in the whole market and hold for the long term.

Prophet can also be applied to larger-scale measures such as gross domestic product (GDP), a measure of the size of a country’s economy. I made the following forecast using prophet models built on historical GDP for the US and China.
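A minimal sketch of that kind of forecast (assuming the quarterly GDP observations are already loaded into a dataframe us_gdp with prophet’s ds and y columns):

# Fit prophet to quarterly GDP data and forecast 20 years ahead
gdp_prophet = fbprophet.Prophet()
gdp_prophet.fit(us_gdp)

# 80 quarterly periods = 20 years
gdp_future = gdp_prophet.make_future_dataframe(periods=80, freq='Q')
gdp_forecast = gdp_prophet.predict(gdp_future)
gdp_prophet.plot(gdp_forecast, xlabel = 'Date', ylabel = 'GDP');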

The exact date at which China’s GDP is predicted to surpass that of the US is 2036! This model is limited by its sparse observations (GDP is measured quarterly, while prophet works best with daily data), but it provides a basic forecast without requiring any macroeconomic expertise.

There are many ways to model a time series, from simple linear regression to recurrent neural networks built with LSTM cells. The appeal of additive models is that they are quick to build and train and produce interpretable patterns, along with predictions that include uncertainty. Prophet is powerful, and we have only scratched the surface here. I encourage you to use this article and the notebook to explore some of the data offered by Quandl, or a time series of your own. Stay tuned for future articles where I apply these techniques to daily life, such as analyzing and predicting weight changes. Additive models are a great place to start exploring time series!

My email is [email protected]. Corrections and constructive criticism are welcome.

If you find any mistakes in the translation or other areas that could be improved, you are welcome to revise the translation and open a PR in the Nuggets Translation Project, for which you can earn reward points. The permanent link at the top of this article is the Markdown link to this article on GitHub.


The Nuggets Translation Project is a community that translates quality internet technical articles from English. Its content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence, and other fields. For more high-quality translations, follow the Nuggets Translation Project on the official Weibo and its Zhihu column.