We are in the midst of a difficult time.

At this critical moment, when the novel coronavirus pneumonia outbreak is affecting all of society, this article applies data analysis, data mining, and machine learning to the epidemic. It presents the epidemic situation, analyzes forecasts of how it will develop, and mines the relationships in complex, heterogeneous, multi-source data, showing the results visually in the hope of contributing to the fight to prevent and control the epidemic.

Three models are considered for predicting the number of confirmed cases: the traditional time series model Prophet, the deep learning model Seq2seq, and the infectious disease model SIR.

Data description

Source: www.kaggle.com/sudalairajk…

This data set provides the global numbers of 2019 Novel Coronavirus infections, deaths, and recoveries. Please note that this is time series data, so the number of cases on any given day is cumulative. Data is available from January 22, 2020, and is updated daily.
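Because the counts are cumulative, the number of new cases on a given day is the difference between that day's value and the previous day's. The sketch below shows the arithmetic in plain Python. The numbers and the function name `daily_increments` are hypothetical, chosen for illustration and not taken from the data set.

```python
# Hypothetical cumulative confirmed counts for five consecutive days
# (illustrative numbers, not taken from the data set).
cumulative = [555, 653, 941, 1438, 2118]


def daily_increments(cumulative_counts):
    """Return the day-over-day increase of a cumulative series.

    The first day has no predecessor, so it is reported as its own
    cumulative value.
    """
    if not cumulative_counts:
        return []
    increments = [cumulative_counts[0]]
    for previous, current in zip(cumulative_counts, cumulative_counts[1:]):
        increments.append(current - previous)
    return increments


new_cases = daily_increments(cumulative)
print(new_cases)
```

Summing the increments recovers the last cumulative value, which is a quick sanity check when converting between the two views.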

Data set:

- 2019_nCoV_data.csv
- time_series_2019_ncov_confirmed.csv
- time_series_2019_ncov_deaths.csv
- time_series_2019_ncov_recovered.csv

Columns in 2019_nCoV_data.csv:

  • Sno – Serial number
  • Date – Date and time of the observation (MM/DD/YYYY HH:MM:SS)
  • Province/State – Province or state of the observation (may be empty when missing)
  • Country – Country of the observation
  • Last Update – Time in UTC at which the row was updated for the given province or country. (The format is not yet standardized, so please clean it before use.)
  • Confirmed – Number of confirmed cases
  • Deaths – Number of deaths
  • Recovered – Number of recovered cases

This article mainly uses the 2019_nCoV_data.csv file.

Data analysis

1. Basic import

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline
plt.style.use('ggplot')

import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
import plotly.figure_factory as ff
from plotly import subplots
from plotly.subplots import make_subplots
init_notebook_mode(connected=True)

from datetime import datetime, date, timedelta

from fbprophet import Prophet

import warnings
warnings.filterwarnings('ignore')

# Show up to 100 columns and 100 rows when displaying DataFrames
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

2. Import data

import os
# List every file available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Output:

/kaggle/input/novel-corona-virus-2019-dataset/2019_nCoV_data.csv

/kaggle/input/novel-corona-virus-2019-dataset/time_series_2019_ncov_deaths.csv

/kaggle/input/novel-corona-virus-2019-dataset/time_series_2019_ncov_confirmed.csv

/kaggle/input/novel-corona-virus-2019-dataset/time_series_2019_ncov_recovered.csv

df = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/2019_nCoV_data.csv')
df.head(5)

3. Missing value processing

You can see that the Province/State column has many missing values. Let's take a closer look at why.

df[df['Province/State'].isnull()]

Some countries have no province/state information, so the field is empty for their rows.

4. Visualization of confirmed cases

First, the global picture

# Bar chart of confirmed cases per date, colored by country
fig = px.bar(df, x='Date', y='Confirmed', hover_data=['Province/State', 'Deaths', 'Recovered'], color='Country')
# Place the chart title as an annotation above the plot area
annotations = []
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Confirmed bar plot for each country',
                              font=dict(family='Arial',
                                        size=30,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig.update_layout(annotations=annotations)
fig.show()

Here are the confirmed cases in China

# Keep only the rows for Mainland China and color the bars by province
fig = px.bar(df.loc[df['Country'] == 'Mainland China'], x='Date', y='Confirmed', hover_data=['Province/State', 'Deaths', 'Recovered'], color='Province/State')
annotations = []
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Confirmed bar plot for Mainland China',
                              font=dict(family='Arial',
                                        size=30,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig.update_layout(annotations=annotations)
fig.show()

Death toll in China

Number of deaths and recoveries in Hubei Province

Data aggregation

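# Sum confirmed cases per date; Prophet expects the columns to be named ds and y.
# Note: this filters on 'China' while the plot above filters on 'Mainland China';
# use whichever label appears in your copy of the data.
confirmed_training_dataset = df[df['Country'] == 'China'].groupby('Date')['Confirmed'].sum().reset_index().rename(columns={'Date': 'ds', 'Confirmed': 'y'})
confirmed_training_dataset.head()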

The dates are not evenly spaced, because the data was recorded at particular times of day rather than in real time. Here we treat the values as true daily confirmed counts for analysis and prediction.
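One way to make the spacing even is to collapse the irregular timestamps to calendar days: keep the latest cumulative snapshot per province on each day, then sum over provinces. This is only a sketch of that idea in plain Python, not the approach used in this article. The rows and the function name `daily_totals` are hypothetical, and the timestamps are assumed to follow the MM/DD/YYYY HH:MM:SS format described earlier.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (timestamp, province, cumulative confirmed).
# The values are made up for illustration.
rows = [
    ("01/22/2020 12:00:00", "Hubei", 444),
    ("01/23/2020 12:00:00", "Hubei", 444),
    ("01/24/2020 00:00:00", "Hubei", 549),
    ("01/24/2020 12:00:00", "Hubei", 729),
    ("01/24/2020 12:00:00", "Guangdong", 53),
]


def daily_totals(records):
    """Keep the latest snapshot per (day, province), then sum over provinces."""
    latest = {}  # (day, province) -> (timestamp, confirmed)
    for stamp, province, confirmed in records:
        ts = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        key = (ts.date(), province)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, confirmed)

    totals = defaultdict(int)
    for (day, _province), (_ts, confirmed) in latest.items():
        totals[day] += confirmed
    return dict(sorted(totals.items()))


for day, total in daily_totals(rows).items():
    print(day.isoformat(), total)
```

Taking the latest snapshot instead of the sum within a day matters because the counts are cumulative: adding two snapshots from the same day would count the same cases twice.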

Model prediction – Prophet

The Prophet algorithm, open-sourced by Facebook, can handle time series that contain some outliers as well as some missing values, and it can forecast the future trend of a series almost automatically. Prophet works as follows:

  1. Input the timestamps and corresponding values of the known time series;
  2. Input the length of the time series to be predicted;
  3. Output the future trend of the time series;
  4. The output also provides the necessary statistical indicators, including the fitted curve and its upper and lower bounds.

from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation, performance_metrics
from fbprophet.plot import plot_cross_validation_metric, add_changepoints_to_plot, plot_plotly

We start by building a simple baseline model that includes daily seasonality. (This is not always useful, of course: the time of day in each date is not the actual time at which newly confirmed cases were registered, so there are various confounding factors.)

prophet = Prophet(
    yearly_seasonality=False,
    weekly_seasonality = False,
    daily_seasonality = True,
    seasonality_mode = 'additive')
prophet.fit(confirmed_training_dataset)
future = prophet.make_future_dataframe(periods=7)
confirmed_forecast = prophet.predict(future)

The results are analyzed visually

# Interactive plot of the fitted values and the 7-day forecast
fig = plot_plotly(prophet, confirmed_forecast)
annotations = []
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Forecast number of confirmed cases',
                              font=dict(family='Arial',
                                        size=30,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig.update_layout(annotations=annotations)
fig

Prophet has parameters for yearly, weekly, and daily seasonality. Now we re-forecast without daily seasonality.

prophet = Prophet(
    yearly_seasonality=False,
    weekly_seasonality = False,
    daily_seasonality = False,
    seasonality_mode = 'additive')
prophet.fit(confirmed_training_dataset)
future = prophet.make_future_dataframe(periods=7)
confirmed_forecast_2 = prophet.predict(future)

The results become much worse. Now compare the mean absolute percentage error (MAPE) of the two models.

def mean_absolute_percentage_error(y_true, y_pred):
    # MAPE in percent; assumes y_true contains no zeros
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Compare fitted values with observations over the training period only
max_date = pd.to_datetime(confirmed_training_dataset['ds']).max()
y_true = confirmed_training_dataset['y'].values
y_pred_daily = confirmed_forecast.loc[confirmed_forecast['ds'] <= max_date].yhat.values
y_pred_daily_2 = confirmed_forecast_2.loc[confirmed_forecast_2['ds'] <= max_date].yhat.values

print('MAPE with daily seasonality: {}'.format(mean_absolute_percentage_error(y_true, y_pred_daily)))
print('MAPE without daily seasonality: {}'.format(mean_absolute_percentage_error(y_true, y_pred_daily_2)))

Output:

MAPE with daily seasonality: 39.37057017194978
MAPE without daily seasonality: 162.6290389271529
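To make these numbers easier to interpret, here is a small worked example of MAPE in plain Python. A MAPE above 100% means the fitted values are off, on average, by more than the observed values themselves. The input values are hypothetical, and skipping zero-valued observations is an assumption of this sketch, not something the code above does.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Pairs whose true value is zero are skipped, because the percentage
    error is undefined there (an assumption of this sketch).
    """
    terms = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every true value is zero")
    return 100 * sum(terms) / len(terms)


# Hypothetical values chosen so the arithmetic is easy to follow.
y_true = [100, 200, 400]
y_pred = [110, 180, 400]

# Per-point relative errors: 10/100 = 0.10, 20/200 = 0.10, 0/400 = 0.00,
# so the result is their mean expressed in percent.
print(mape(y_true, y_pred))
```

Because every error is divided by the observed value, MAPE weights early days with small case counts as heavily as later days with large counts, which is worth keeping in mind for a fast-growing cumulative series.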

Clearly both models perform poorly, so we can try tuning some parameters of each model to see whether the results improve.

In Prophet, you can generally set the following four kinds of parameters:

  1. Capacity: the saturation value that must be set when the growth function is logistic.
  2. Change points: n_changepoints and changepoint_range place equidistant change points automatically; alternatively, the change points of the time series can be set manually.
  3. Seasonality and holidays: holidays can be specified according to actual business needs.
  4. Smoothing parameters: changepoint_prior_scale controls the flexibility of the trend, seasonality_prior_scale controls the flexibility of the seasonality terms, and holidays_prior_scale controls the flexibility of the holiday effects.
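The capacity parameter is easiest to understand by looking at the shape of a logistic curve, which grows quickly at first and then levels off at the capacity. The sketch below evaluates the basic logistic function in plain Python. It is a simplified illustration, not Prophet's internal implementation, and the function name and all parameter values are hypothetical.

```python
import math


def logistic_trend(t, capacity, growth_rate, midpoint):
    """Saturating growth curve that approaches `capacity` as t increases."""
    return capacity / (1.0 + math.exp(-growth_rate * (t - midpoint)))


# Hypothetical parameters, chosen only to show the shape of the curve.
CAPACITY = 100_000   # assumed upper bound on cumulative confirmed cases
GROWTH_RATE = 0.3    # assumed growth rate per day
MIDPOINT = 20        # assumed day on which half the capacity is reached

for day in (0, 10, 20, 30, 60):
    print(day, round(logistic_trend(day, CAPACITY, GROWTH_RATE, MIDPOINT)))
```

In Prophet itself, a logistic trend is requested with `Prophet(growth='logistic')`, and the capacity is supplied as a `cap` column in both the training DataFrame and the future DataFrame. Choosing a sensible capacity for an ongoing outbreak is a modeling assumption in its own right.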

If you don’t want to set it, just use Prophet’s default parameters.

For a detailed introduction to Prophet, see: zhuanlan.zhihu.com/p/52330017

Seq2seq and SIR epidemic prediction will be described in detail in subsequent articles

Reference links:

www.kaggle.com/shubhamai/c…

www.kaggle.com/parulpandey…
