We are in the midst of a difficult time.
At this critical moment, when the novel coronavirus pneumonia outbreak is affecting all of society, this article uses data analysis, data mining, and machine learning to visualize the epidemic, analyze forecasts of its course, and mine the relationships in complex, heterogeneous, multi-source data, presenting the results vividly as a contribution to winning the fight for epidemic prevention and control.
The traditional time series model Prophet, the deep learning model Seq2seq, and the infectious disease model SIR will be used to predict the number of confirmed cases.
Data description
Source: www.kaggle.com/sudalairajk…
This data set provides the global numbers of infections, deaths, and recoveries for the 2019 novel coronavirus. Note that this is time series data, so the number of cases on any given day is cumulative. Data is available from January 22, 2020, and is updated daily.
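Because the counts are cumulative, the number of new cases per day has to be derived by differencing. The following is a minimal, dependency-free sketch of that step; the helper name `daily_new_cases` and the sample numbers are made up for illustration and are not taken from the data set.

```python
def daily_new_cases(cumulative):
    """Turn a cumulative series into per-day increments."""
    if not cumulative:
        return []
    # The first day has no predecessor, so its increment is its own value.
    increments = [cumulative[0]]
    for previous, current in zip(cumulative, cumulative[1:]):
        increments.append(current - previous)
    return increments


# Hypothetical cumulative confirmed counts for five consecutive days.
cumulative_confirmed = [10, 14, 14, 25, 40]
new_cases = daily_new_cases(cumulative_confirmed)
print(new_cases)  # [10, 4, 0, 11, 15]
```

The increments always add back up to the last cumulative value, which is a quick sanity check when working with this kind of data.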
Data set:
- 2019_nCoV_data.csv
- time_series_2019_ncov_confirmed.csv
- time_series_2019_ncov_deaths.csv
- time_series_2019_ncov_recovered.csv
Columns:
- Sno – Serial number
- Date – Date and time of the observation (MM/DD/YYYY HH:MM:SS)
- Province/State – Province or state of the observation (may be empty when missing)
- Country – Country of the observation
- Last Update – Time in UTC at which the row was updated for the given province or country. (The format is currently not standardized, so clean it before use.)
- Confirmed – Number of confirmed cases
- Deaths – Number of deaths
- Recovered – Number of recovered cases
This article mainly uses 2019_nCoV_data.csv.
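Since the Last Update column is not standardized, one possible cleaning step is to try a few timestamp formats in turn. The sketch below uses only the standard library; the list of candidate formats and the sample strings are assumptions for illustration, not formats confirmed from the data set, and `parse_timestamp` is a hypothetical helper.

```python
from datetime import datetime

# Candidate formats are assumptions; extend the list to match whatever
# actually appears in the column.
CANDIDATE_FORMATS = (
    "%m/%d/%Y %H:%M:%S",
    "%m/%d/%Y %H:%M",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
)


def parse_timestamp(text):
    """Return a datetime for the first matching format, or None."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


# Hypothetical raw values in mixed formats.
samples = ["01/22/2020 12:00:00", "1/23/2020 17:00", "2020-02-01T19:43:03", "unknown"]
parsed = [parse_timestamp(s) for s in samples]
for raw, value in zip(samples, parsed):
    print(raw, "->", value)
```

Returning `None` for unparseable values keeps the bad rows visible so they can be inspected rather than silently dropped.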
Data analysis
1. Basic import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline
plt.style.use('ggplot')
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
import plotly.figure_factory as ff
from plotly import subplots
from plotly.subplots import make_subplots
init_notebook_mode(connected=True)
from datetime import datetime, date, timedelta
from fbprophet import Prophet
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
2. Import data
import os
# List every file under the Kaggle input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Output:
/kaggle/input/novel-corona-virus-2019-dataset/2019_nCoV_data.csv
/kaggle/input/novel-corona-virus-2019-dataset/time_series_2019_ncov_deaths.csv
/kaggle/input/novel-corona-virus-2019-dataset/time_series_2019_ncov_confirmed.csv
/kaggle/input/novel-corona-virus-2019-dataset/time_series_2019_ncov_recovered.csv
df = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/2019_nCoV_data.csv')
df.head(5)
3. Missing value processing
You can see that the Province/State column has many missing values. Let's take a closer look at why.
df[df['Province/State'].isnull()]
For some countries there is no province/state information, so it is empty.
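To check which countries account for the empty Province/State values, the missing rows can be counted per country. This is a small standard-library sketch on made-up rows that only mimic the column layout described above; none of the numbers come from the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the same column layout as the data set.
SAMPLE_CSV = """Sno,Date,Province/State,Country,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020 12:00:00,Hubei,Mainland China,01/22/2020 12:00:00,50,1,0
2,01/22/2020 12:00:00,,Japan,01/22/2020 12:00:00,2,0,0
3,01/22/2020 12:00:00,,Thailand,01/22/2020 12:00:00,3,0,0
4,01/23/2020 12:00:00,,Japan,01/23/2020 12:00:00,2,0,0
"""

missing_by_country = Counter()
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # An empty or whitespace-only field counts as missing.
    if not row["Province/State"].strip():
        missing_by_country[row["Country"]] += 1

print(missing_by_country.most_common())
```

If every missing value belongs to a country that is reported only at the national level, the gaps are expected and need no imputation.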
4. Visualization of confirmed cases
First, the global picture
fig = px.bar(df, x='Date', y='Confirmed', hover_data=['Province/State', 'Deaths', 'Recovered'], color='Country')
annotations = []
# Place the title above the plot area, left-aligned.
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                        xanchor='left', yanchor='bottom',
                        text='Confirmed bar plot for each country',
                        font=dict(family='Arial',
                                  size=30,
                                  color='rgb(37,37,37)'),
                        showarrow=False))
fig.update_layout(annotations=annotations)
fig.show()
Here are the confirmed cases in China
fig = px.bar(df.loc[df['Country'] == 'Mainland China'], x='Date', y='Confirmed', hover_data=['Province/State', 'Deaths', 'Recovered'], color='Province/State')
annotations = []
# Place the title above the plot area, left-aligned.
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                        xanchor='left', yanchor='bottom',
                        text='Confirmed bar plot for Mainland China',
                        font=dict(family='Arial',
                                  size=30,
                                  color='rgb(37,37,37)'),
                        showarrow=False))
fig.update_layout(annotations=annotations)
fig.show()
Death toll in China
Number of deaths/cured persons in Hubei Province
Data aggregation
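# Sum confirmed cases over all provinces per date and rename the columns to the ds/y names Prophet expects.
# 'Mainland China' matches the country label used for the plot above.
confirmed_training_dataset = pd.DataFrame(df[df['Country'] == 'Mainland China'].groupby('Date')['Confirmed'].sum().reset_index()).rename(columns={'Date': 'ds', 'Confirmed': 'y'})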
confirmed_training_dataset.head()
The Date values are not evenly spaced, because the data is recorded at particular times of day rather than in real time. Here we treat the values as true daily confirmed counts for analysis and prediction.
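One way to make that daily assumption explicit is to collapse the snapshots to a single value per calendar day. The sketch below is not part of the original notebook; it assumes the series has already been summed over provinces and that cumulative counts never decrease, so the largest value seen on a day is the latest one. The helper `daily_cumulative` and the sample records are hypothetical.

```python
from datetime import datetime


def daily_cumulative(records):
    """Keep one cumulative value per calendar day (the largest seen)."""
    per_day = {}
    for timestamp, confirmed in records:
        day = timestamp.date()
        per_day[day] = max(confirmed, per_day.get(day, 0))
    # Sort by date so the result is a proper time series.
    return sorted(per_day.items())


# Hypothetical snapshots taken at irregular times, with one day skipped.
records = [
    (datetime(2020, 1, 22, 12, 0), 100),
    (datetime(2020, 1, 23, 12, 0), 150),
    (datetime(2020, 1, 23, 21, 30), 180),
    (datetime(2020, 1, 25, 9, 0), 260),
]
daily = daily_cumulative(records)
for day, confirmed in daily:
    print(day.isoformat(), confirmed)
```

Days without any snapshot simply stay absent from the result, which Prophet can tolerate because it handles missing values.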
Model prediction – Prophet
Prophet, the forecasting library released by Facebook, tolerates outliers and missing values in a time series and can forecast the future trend of the series almost automatically. Working with Prophet involves the following:
- Input the timestamps and corresponding values of the known time series;
- Input the length of the period to be predicted;
- Output the future trend of the time series;
- The output also provides the necessary statistics, including the fitted curve and its upper and lower bounds.
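The second input, the forecast length, amounts to extending the known dates by a number of further steps. The plain-Python sketch below illustrates that idea for daily data; `extend_dates` is a hypothetical helper, not a Prophet function, although it mirrors what Prophet's `make_future_dataframe(periods=7)` produces at its default daily frequency (the historical dates followed by seven more).

```python
from datetime import date, timedelta


def extend_dates(history_dates, periods):
    """Return the historical dates followed by `periods` further daily steps."""
    last = max(history_dates)
    future = [last + timedelta(days=step) for step in range(1, periods + 1)]
    return sorted(history_dates) + future


# Hypothetical three-day history extended by a seven-day horizon.
history = [date(2020, 1, 22), date(2020, 1, 23), date(2020, 1, 24)]
all_dates = extend_dates(history, periods=7)
print(len(all_dates), all_dates[-1])  # 10 2020-01-31
```

Keeping the historical dates in the frame is what later allows fitted values to be compared with the observations.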
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation, performance_metrics
from fbprophet.plot import plot_cross_validation_metric, add_changepoints_to_plot, plot_plotly
We start with a simple baseline model that includes daily seasonality. (This is not necessarily meaningful, because the time of day in the Date column is not the time at which newly confirmed cases were actually registered, so there are various confounding factors.)
prophet = Prophet(
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=True,
    seasonality_mode='additive')
prophet.fit(confirmed_training_dataset)
future = prophet.make_future_dataframe(periods=7)
confirmed_forecast = prophet.predict(future)
The results are analyzed visually
fig = plot_plotly(prophet, confirmed_forecast)
annotations = []
# Place the title above the plot area, left-aligned.
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                        xanchor='left', yanchor='bottom',
                        text='Forecast number of confirmed cases',
                        font=dict(family='Arial',
                                  size=30,
                                  color='rgb(37,37,37)'),
                        showarrow=False))
fig.update_layout(annotations=annotations)
fig
Prophet has yearly, weekly, and daily seasonality parameters. Now we re-forecast with daily seasonality turned off.
prophet = Prophet(
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=False,
    seasonality_mode='additive')
prophet.fit(confirmed_training_dataset)
future = prophet.make_future_dataframe(periods=7)
confirmed_forecast_2 = prophet.predict(future)
The fit becomes much worse. Now compare the mean absolute percentage error (MAPE) of the two models.
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Compare fitted values with the observations over the training period only.
max_date = pd.to_datetime(confirmed_training_dataset['ds']).max()
y_true = confirmed_training_dataset['y'].values
y_pred_daily = confirmed_forecast.loc[confirmed_forecast['ds'] <= max_date].yhat.values
y_pred_daily_2 = confirmed_forecast_2.loc[confirmed_forecast_2['ds'] <= max_date].yhat.values
print('MAPE with daily seasonality: {}'.format(mean_absolute_percentage_error(y_true, y_pred_daily)))
print('MAPE without daily seasonality: {}'.format(mean_absolute_percentage_error(y_true, y_pred_daily_2)))
Output:
MAPE with daily seasonality: 39.37057017194978
MAPE without daily seasonality: 162.6290389271529
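To make the metric concrete, here is a small worked example of MAPE in plain Python on made-up numbers. The function `mape` is a hypothetical stand-in for the NumPy version above, with an added guard: the metric divides by the actual values, so it is undefined when any of them is zero.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if any(value == 0 for value in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(errors) / len(errors)


# Hypothetical values: relative errors of 10%, 10% and 0%.
actual = [100, 200, 400]
predicted = [110, 180, 400]
score = mape(actual, predicted)
print(round(score, 2))  # 6.67
```

Because every error is scaled by the actual value, a miss of a few dozen cases early in the outbreak, when cumulative counts are still small, weighs far more than the same miss later on.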
Both models clearly perform poorly, so the next step is to tune some of Prophet's parameters and see whether the fit improves.
In Prophet, you can generally set the following four groups of parameters:
- Capacity: the saturation value that must be set when the growth function is logistic.
- Change points: n_changepoints and changepoint_range can be used to place equidistant change points, or the change points of the time series can be set manually.
- Seasonality and holidays: you can specify holidays according to actual business requirements.
- Smoothing parameters: changepoint_prior_scale controls the flexibility of the trend, seasonality_prior_scale controls the flexibility of the seasonal terms, and holidays_prior_scale controls the flexibility of the holiday terms.
If you do not want to set them, just use Prophet's default parameters.
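The role of the capacity is easiest to see on the logistic curve itself. The sketch below evaluates a basic logistic growth function in plain Python; it is only an illustration of the curve shape, not Prophet's internal implementation, and the parameter names and values (`capacity`, `growth_rate`, `midpoint`) are assumptions chosen for the example.

```python
import math


def logistic_trend(t, capacity, growth_rate, midpoint):
    """Logistic growth: rises slowly, accelerates, then saturates at capacity."""
    return capacity / (1.0 + math.exp(-growth_rate * (t - midpoint)))


# Hypothetical settings: saturation at 1000 cases, midpoint on day 10.
days = [0, 5, 10, 20, 40]
trend = [logistic_trend(t, capacity=1000, growth_rate=0.5, midpoint=10) for t in days]
for t, value in zip(days, trend):
    print(t, round(value, 1))
```

However far the forecast is extended, the curve never exceeds the capacity, which is why a cumulative case count, bounded by population, is a natural candidate for logistic rather than linear growth. In Prophet the capacity is supplied as a `cap` column in the input data frame.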
For a detailed introduction to Prophet, see: zhuanlan.zhihu.com/p/52330017
Seq2seq and SIR epidemic prediction will be described in detail in subsequent articles.
Reference links:
www.kaggle.com/shubhamai/c…
www.kaggle.com/parulpandey…