Original link: tecdat.cn/?p=19211
Original source: Tuoduan Data Tribe official account
As the novel coronavirus (COVID-19) spreads around the world, we live in a time of growing concern. In this article, we use MATLAB to analyze COVID-19 datasets.
COVID-19 data source
The unzipped archive contains the following files:
- Data.csv – Daily case counts worldwide by province/state, 2020
- Confirmed.csv – Time series data of confirmed cases
- Shuffle.csv – Time series data of deaths
- Recovered.csv – Time series data of recovered cases
Map visualization
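We visualize the number of confirmed cases on a map. We start by loading the confirmed-case time series, which includes the latitude and longitude variables.
opts = detectImportOptions(filenames(4), "TextType","string");
times_conf = readtable(filenames(4), opts); % load the confirmed-case time series into a table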
The dataset contains a "Province/State" variable, but we want to aggregate the data at the "Country/Region" level. Before doing that, we need to clean up the data a little.
times_conf.("Country/Region")(times_conf.("Country/Region") == "China") = "Mainland China";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Czechia") = "Czech Republic";
We can now use groupsummary to aggregate the data by country/region, summing the confirmed cases and averaging the latitude and longitude.
times_conf_country = groupsummary(times_conf,"Country/Region",{'sum','mean'},vars(3:end));
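To see what groupsummary returns for this kind of call, here is a minimal sketch on a small made-up table. The table T, its variable names and all numbers are assumptions for illustration; only the shape of the groupsummary call follows the article.

```matlab
% Toy data: two provinces of country "A" plus a single-row country "B".
% All names and numbers are made up for illustration.
T = table(["A";"A";"B"], [10;20;50], [100;110;5], [1;2;3], [4;6;7], ...
    'VariableNames', {'Country','Lat','Long','Day1','Day2'});

% Apply both methods to every data variable, as in the article.
G = groupsummary(T, "Country", {'sum','mean'}, ["Lat","Long","Day1","Day2"]);

% Each output variable is named <method>_<variable>, next to GroupCount.
disp(G.Properties.VariableNames)

% Country-level totals and an approximate country position.
totals = G(:, ["Country","mean_Lat","mean_Long","sum_Day1","sum_Day2"])
```

Because every method is applied to every variable, the result also holds columns such as sum_Lat and mean_Day1 that have no useful meaning, which is why a clean-up step follows.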
The output contains unnecessary columns, such as the sums of latitude and longitude and the means of the daily counts. Let's mark these variables with a remove_ prefix so that they can be deleted.
vars = regexprep(vars,"^(sum_)(?=L(a|o))","remove_");
vars = regexprep(vars,"^(mean_)(?=[0-9])","remove_");
times_conf_exChina = times_conf_country(times_conf_country.("Country/Region") ~= "Mainland China",:);
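The two regexprep calls rely on lookahead: the prefix is replaced only when it is followed by a latitude/longitude name or by a digit. The sketch below applies them to a hypothetical list of variable names shaped like groupsummary output. The names and the final filtering step with startsWith and erase are assumptions, not the article's exact code.

```matlab
% Hypothetical variable names, shaped like groupsummary output.
vars = ["Country/Region","GroupCount","sum_Lat","mean_Lat", ...
        "sum_Long","mean_Long","sum_1/22/20","mean_1/22/20"];

% Flag sum_Lat and sum_Long: "sum_" followed by "La" or "Lo".
vars = regexprep(vars, "^(sum_)(?=L(a|o))", "remove_");
% Flag the means of the daily counts: "mean_" followed by a digit.
vars = regexprep(vars, "^(mean_)(?=[0-9])", "remove_");

% Keep the unflagged names and strip the method prefixes.
keep = vars(~startsWith(vars, "remove_") & vars ~= "GroupCount");
keep = erase(keep, ["sum_","mean_"])
```

The lookahead group is not part of the match, so only the prefix is rewritten and the rest of the name is left intact.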
Let's use geobubble to visualize the data for the first and last dates in the dataset.
for ii = [4, length(vars)]
times_conf_exChina.Category = categorical(repmat("<100",height(times_conf_exChina),1));
times_conf_exChina.Category(table2array(times_conf_exChina(:,ii)) >= 100) = ">=100";
gb.LegendVisible = "off";
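As a rough illustration of how such a bubble map can be put together, the sketch below builds a tiny made-up table, adds a two-level category for the 100-case threshold, and passes it to geobubble. The table contents and the way the category is constructed are assumptions. The property names match those used in the article's snippet.

```matlab
% Made-up country-level counts; names and numbers are illustrative only.
Country = ["P"; "Q"; "R"];
Lat     = [35.9; 36.2; 41.9];
Long    = [127.8; 138.3; 12.6];
Cases   = [50; 150; 3000];
T = table(Country, Lat, Long, Cases);

% Two-level category used to color the bubbles.
T.Category = categorical(T.Cases >= 100, [false true], ["<100", ">=100"]);

% Bubble size follows Cases, bubble color follows Category.
figure
gb = geobubble(T, "Lat", "Long", "SizeVariable", "Cases", "ColorVariable", "Category");
gb.BubbleColorList = [1,0,1; 1,0,0];  % one RGB row per category
gb.LegendVisible = "off";
gb.Title = "Illustrative data";
```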
We can see that the virus initially affected only the countries surrounding Mainland China. It is worth noting that the United States already had confirmed cases as early as January 22, 2020.
Confirmed cases in US
Let's look at the United States at the province/state level.
figure
t = tiledlayout("flow");
for ii = [5, length(vars)]
    gb.BubbleColorList = [1,0,1;1,0,0];
    gb.LegendVisible = "off";
    gb.Title = "As of " + vars(ii);
    gb.SizeLimits = [0, max(times_conf_us.(vars{length(vars)}))];
    gb.MapCenter = [44.9669 -113.6201]; % western longitudes are negative
    gb.ZoomLevel = 1.7678;
You can see that the outbreak started in Washington, followed by large outbreaks in California and New York.
Rank countries/territories by confirmed cases
Let's compare the number of confirmed cases by country/region using COVID_19_data.csv. The date-time format is inconsistent, so we import it as text at first.
opts = detectImportOptions(filenames(3), "TextType","string","DatetimeType","text");
provData = readtable(filenames(3), opts); % dates are read as text for now
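Clean up the date-time format.
% Expand two-digit years ("/20") to four digits, then convert to datetime
provData.ObservationDate = regexprep(provData.ObservationDate,"\/20$","/2020");
provData.ObservationDate = datetime(provData.ObservationDate);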
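The following minimal sketch shows the effect of this clean-up on two made-up date strings, one with a two-digit year and one with a four-digit year. The sample values and the explicit InputFormat are assumptions for illustration.

```matlab
% Made-up date strings with inconsistent year formats.
raw = ["01/22/20"; "01/23/2020"];

% Only a trailing "/20" is expanded; "/2020" does not end in "/20".
fixed = regexprep(raw, "\/20$", "/2020");

% With a consistent format, conversion to datetime is unambiguous.
d = datetime(fixed, "InputFormat", "MM/dd/yyyy")
```

Anchoring the pattern with $ keeps the day and month fields safe, since a "/20" in the middle of the string is never touched.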
We also need to standardize the values in Country_Region.
provData.Country_Region(provData.Country_Region == "Iran (Islamic Republic of)") = "Iran";
The dataset contains province/state-level data. Let's aggregate it at the country/region level.
countryData = groupsummary(provData,{'ObservationDate','Country_Region'}, ...
"sum",{'Confirmed','Deaths','Recovered'});
countryData contains daily cumulative data. We only need the latest numbers.
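One way to pull out the latest numbers and rank the countries is sketched below on a small made-up table. The data, the variable topK and the choice of two top entries are assumptions. The column names follow the groupsummary output used above.

```matlab
% Made-up cumulative counts for three countries on two dates.
ObservationDate = datetime(2020,3,[1;1;1;2;2;2]);
Country_Region  = ["X";"Y";"Z";"X";"Y";"Z"];
sum_Confirmed   = [5;40;10;9;45;30];
countryData = table(ObservationDate, Country_Region, sum_Confirmed);

% Keep only the rows for the most recent date.
isLatest = countryData.ObservationDate == max(countryData.ObservationDate);
latest = countryData(isLatest, :);

% Rank countries by cumulative confirmed cases, largest first.
latest = sortrows(latest, "sum_Confirmed", "descend");
topK = latest.Country_Region(1:2)
```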
Increase in confirmed cases by country/territory
We can also examine the rate of increase in cases in these countries.
figure
plot(countryData.ObservationDate(countryData.Country_Region == labelsK(2)), ...
hold on
for ii = 3:length(labelsK)
plot(countryData.ObservationDate(countryData.Country_Region == labelsK(ii)), ...
Although growth in South Korea is showing signs of slowing, it is accelerating elsewhere.
Increase in new cases by country/region
We can calculate the number of new cases by taking the difference between the cumulative confirmed counts on consecutive dates.
for ii = 1:length(labelsK)
country = provData(provData.Country_Region == labelsK(ii),:);
country = groupsummary(country,{'ObservationDate','Country_Region'}, ...
if labelsK(ii) ~= "Others"
nexttile
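The differencing step itself can be sketched with diff on a made-up cumulative series. The dates and counts below are assumptions for illustration, and leaving the first day as NaN is one possible convention rather than the article's.

```matlab
% Made-up cumulative confirmed counts on consecutive days.
dates      = datetime(2020,3,1) + caldays(0:4)';
cumulative = [2; 5; 5; 11; 20];

% New cases per day: difference between consecutive cumulative counts.
% The first day has no predecessor, so it is left as NaN here.
newCases = [NaN; diff(cumulative)];

T = table(dates, cumulative, newCases)
```

A flat stretch in the cumulative series shows up as zero new cases, which is what a contained outbreak looks like in this view.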
As you can see, China and South Korea are no longer seeing many new cases, which suggests that the epidemic there has been contained.
China
As the rate of infection in China is slowing, let's look at how many active cases remain. We can compute active cases by subtracting recovered cases and deaths from confirmed cases.
for ii = 1:length(labelsK)
    by_country{ii}.Active = by_country{ii}.Confirmed - by_country{ii}.Deaths - ...
        by_country{ii}.Recovered; % active = confirmed - deaths - recovered
end
figure
Fitting curve
The number of active cases is falling, and the curve looks roughly Gaussian. Can we fit a Gaussian model and predict when the number of active cases will reach zero?
We use Curve Fitting Toolbox for the Gaussian fit.
ft = fittype("gauss1");
opts = fitoptions(ft); % fit options for the Gaussian model
[fobj, gof] = fit(x,y,ft,opts);
gof
gof = struct with fields:
    sse: 4.4145e+08
    rsquare: 0.9743
    dfe: 47
    adjrsquare: 0.9732
    rmse: 3.0647e+03
Let's extend the forecast by 20 days.
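A minimal sketch of this fit-and-extend step is shown below on a synthetic curve rather than the real data. The peak height, peak day and width are made up, and the one-case threshold used to define "zero" is an assumption. The code requires Curve Fitting Toolbox.

```matlab
% Synthetic "active cases" curve over 50 days (made-up parameters).
x = (1:50)';
y = 50000 * exp(-((x - 30) / 10).^2);

% Fit the single-term Gaussian a1*exp(-((x-b1)/c1)^2).
ft = fittype("gauss1");
[fobj, gof] = fit(x, y, ft);

% Extend the x-range by 20 days and evaluate the fitted curve there.
xhat = (1:70)';
yhat = fobj(xhat);

% First day after the peak on which the fitted curve drops below one case.
% A Gaussian never reaches exactly zero, so a threshold is needed.
zeroDay = xhat(find(yhat < 1 & xhat > fobj.b1, 1))
```

If the curve does not fall below the threshold within the extended range, zeroDay comes back empty, which signals that the forecast horizon is too short.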
Now let’s plot the result.
figure
area(ObservationDate,by_country{1}.Active)
hold on
plot(xdates,yhat,"lineWidth",2)
South Korea
Let’s look at the number of active cases, recovered cases and deaths in South Korea.
We were unable to obtain a suitable fit using the Gaussian model.