As the novel coronavirus (COVID-19) spreads around the world, concern keeps growing. In this article, we use MATLAB to analyze COVID-19 data sets.

COVID-19 data source

We examine the unzipped archive. It contains:

  • Data.csv – Daily levels of global cases by province/state, 2020
  • Confirmed.csv – Time series data of confirmed cases
  • Shuffle.csv – Time series data on deaths
  • Recovered.csv – Time series data of recovered people
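The later snippets index into a `filenames` array (for example `filenames(3)` and `filenames(4)`), but the article does not show how it is built. A minimal sketch of one way to build it, assuming the archive was unzipped into a folder; the folder name `covid19_data` is hypothetical, and the order of the entries depends on the actual file names, so it is worth checking before indexing:

```matlab
% Hypothetical folder holding the unzipped CSV files
dataFolder = "covid19_data";

% List every CSV file in that folder
listing = dir(fullfile(dataFolder, "*.csv"));

% Build a column of full paths so that filenames(k) can be passed to readtable
filenames = dataFolder + filesep + string({listing.name})';

disp(filenames)
```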

Map visualization

We visualize the number of confirmed cases on a map. We start by loading the data, including the latitude and longitude variables.

opts = detectImportOptions(filenames(4), "TextType","string");
The dataset contains “Province/State” variables, but we aggregate the data at the “Country/Region” level. Before we do that, we need to clean up the data a little.

times_conf.("Country/Region")(times_conf.("Country/Region") == "China") = "Mainland China";
times_conf.("Country/Region")(times_conf.("Country/Region") == "Czechia") = "Czech Republic";
We can now use groupsummary to aggregate the data by country/region, summing the confirmed cases and averaging the latitude and longitude.

vars = times_conf.Properties.VariableNames;
times_conf_country = groupsummary(times_conf,"Country/Region",{'sum','mean'},vars(3:end));
The output contains unnecessary columns, such as the sum of latitude and longitude. Let’s delete these variables.

vars = times_conf_country.Properties.VariableNames;
vars = regexprep(vars,"^(sum_)(?=L(a|o))","remove_");
vars = regexprep(vars,"^(mean_)(?=[0-9])","remove_");
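To see what the two lookahead patterns do, here is a small sketch on a hypothetical list of variable names of the kind groupsummary produces. The names and the final clean-up with `startsWith` and `erase` are illustrative assumptions, not the article's own steps:

```matlab
% Hypothetical variable names as groupsummary would produce them
names = ["sum_Lat","mean_Lat","sum_Long","mean_Long","sum_1/22/20","mean_1/22/20"];

% Mark summed coordinates: "sum_" only when followed by "La" or "Lo"
names = regexprep(names,"^(sum_)(?=L(a|o))","remove_");
% Mark averaged case counts: "mean_" only when followed by a digit
names = regexprep(names,"^(mean_)(?=[0-9])","remove_");

% Keep the unmarked names and strip the remaining prefixes
kept = names(~startsWith(names,"remove_"));
kept = erase(kept,["sum_","mean_"]);
disp(kept)
```

Because the lookahead is zero-width, only the prefix is replaced and the rest of each name is left intact.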


times_conf_exChina = times_conf_country(times_conf_country.("Country/Region") ~= "Mainland China",:);
Let’s visualize the data for the first and last dates in the dataset using geobubble.

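for ii = [4, length(vars)]
    times_conf_exChina.Category = categorical(repmat("<100",height(times_conf_exChina),1));
    times_conf_exChina.Category(table2array(times_conf_exChina(:,ii)) >= 100) = ">=100";
    figure
    % Assumes the averaged coordinates are stored in variables named Lat and Long
    gb = geobubble(times_conf_exChina,"Lat","Long","SizeVariable",ii,"ColorVariable","Category");
    gb.LegendVisible = "off";
end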
We can see that it initially affected only the countries around Mainland China. Note that the United States already had confirmed cases as early as January 22, 2020.

Confirmed cases in US

Let’s zoom in on the United States at the province/state level.

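figure
t = tiledlayout("flow");
for ii = [5, length(vars)]
    nexttile
    % Assumes times_conf_us holds the US rows with coordinates in variables named Lat and Long
    gb = geobubble(times_conf_us,"Lat","Long","SizeVariable",ii);
    gb.BubbleColorList = [1,0,1;1,0,0];
    gb.LegendVisible = "off";
    gb.Title = "As of " + vars(ii);
    gb.SizeLimits = [0, max(times_conf_us.(vars{length(vars)}))];
    gb.MapCenter = [44.9669 -113.6201];
    gb.ZoomLevel = 1.7678;
end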


You can see that it started in Washington, and there were big outbreaks in California and New York.

Rank countries/territories by confirmed cases

Let’s compare the number of confirmed cases by country/region using COVID_19_data.csv. There are inconsistencies in the date-time format, so we treat it as text at first.

opts = detectImportOptions(filenames(3), "TextType","string","DatetimeType","text");
provData = readtable(filenames(3), opts);
Clean up the date-time format.

provData.ObservationDate = regexprep(provData.ObservationDate,"\/20$","/2020");
provData.ObservationDate = datetime(provData.ObservationDate);
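As a small illustration of this clean-up, the sketch below applies the same pattern to a few made-up date strings. The sample values and the explicit `InputFormat` are assumptions for the example:

```matlab
% Hypothetical raw dates mixing two-digit and four-digit years
raw = ["01/22/20"; "01/23/2020"; "02/01/20"];

% Expand a trailing "/20" to "/2020"; entries already ending in "/2020" do not match
fixed = regexprep(raw,"\/20$","/2020");

% Parse with an explicit format so the result does not depend on format guessing
dates = datetime(fixed,"InputFormat","MM/dd/yyyy");
disp(dates)
```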
We also need to standardize values in country/region.

provData.Country_Region(provData.Country_Region == "Iran (Islamic Republic of)") = "Iran";

The dataset contains province/state-level records. Let’s aggregate the data at the country/region level.

countryData = groupsummary(provData,{'ObservationDate','Country_Region'}, ...
countryData contains cumulative daily data. We just need the latest numbers.
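The aggregation and the "latest numbers" step can be sketched on a toy table. The data values and the names `toy`, `byCountry` and `latest` are hypothetical; only the variable names `ObservationDate`, `Country_Region` and `Confirmed` follow the article:

```matlab
% Hypothetical province-level data: two dates, two countries
ObservationDate = datetime(2020,3,[1;1;1;2;2;2]);
Country_Region = ["A";"A";"B";"A";"A";"B"];
Confirmed = [10;5;3;12;9;4];
toy = table(ObservationDate,Country_Region,Confirmed);

% Sum the provinces within each date and country
byCountry = groupsummary(toy,{'ObservationDate','Country_Region'},"sum","Confirmed");

% Keep only the most recent date and rank by confirmed cases
latest = byCountry(byCountry.ObservationDate == max(byCountry.ObservationDate),:);
latest = sortrows(latest,"sum_Confirmed","descend");
disp(latest)
```

Because the counts are cumulative, filtering on the maximum date is enough to get the current totals per country.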


Increase in confirmed cases by country/territory

We can also examine the rate of increase in cases in these countries.

plot(countryData.ObservationDate(countryData.Country_Region == labelsK(2)), ...
hold on
for ii = 3:length(labelsK)
    plot(countryData.ObservationDate(countryData.Country_Region == labelsK(ii)), ...
Although South Korea is showing signs of slowing growth, the spread is accelerating elsewhere.

Increase in new cases by country/region

We can calculate the number of new cases by subtracting the cumulative number of confirmed cases on one date from the number on the next.

for ii = 1:length(labelsK)
    country = provData(provData.Country_Region == labelsK(ii),:);
    country = groupsummary(country,{'ObservationDate','Country_Region'}, ...

    if labelsK(ii) ~= "Others"
As you can see, China and South Korea are not seeing many new cases, which suggests that the epidemic has been contained there.
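The differencing step used for new cases can be sketched with `diff` on a made-up cumulative series. The numbers are hypothetical, and this is one simple way to do it rather than necessarily the article's exact code:

```matlab
% Hypothetical cumulative confirmed counts on consecutive days
cumulative = [1; 3; 6; 6; 10];

% New cases per day are the day-over-day differences
newCases = diff(cumulative);

% A negative value would indicate a downward revision in the source data
disp(newCases')
```

Note that `diff` returns one fewer element than its input, so the result aligns with the second date onward.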


As the rate of infection in China is slowing, let’s take a look at how many active cases there are still. You can count active cases by subtracting recovered cases and deaths from confirmed cases.

for ii = 1:length(labelsK)
    % Assumes the recovered counts are stored in a field named Recovered
    by_country{ii}.Active = by_country{ii}.Confirmed - by_country{ii}.Deaths - by_country{ii}.Recovered;
end

Curve fitting

The number of active cases is falling, and the curve looks roughly Gaussian. Can we fit a Gaussian model and predict when the number of active cases will reach zero?

I use Curve Fitting Toolbox for the Gaussian fit.

ft = fittype("gauss1");
% Assumption: default fit options for the gauss1 model
opts = fitoptions(ft);
[fobj, gof] = fit(x,y,ft,opts);
gof = struct with fields:
           sse: 4.4145e+08
       rsquare: 0.9743
           dfe: 47
    adjrsquare: 0.9732
          rmse: 3.0647e+03

Let’s extend the forecast by 20 days.
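A minimal sketch of the fit and the 20-day extension, using synthetic data in place of the real active-case series. The data, the start point and the variable names `xx` and `yy` are assumptions for illustration, and Curve Fitting Toolbox is required:

```matlab
% Synthetic "active cases" following a Gaussian bump (hypothetical data)
x = (1:50)';
y = 5e4 * exp(-((x - 25) / 10).^2);

% Fit the one-term Gaussian library model a1*exp(-((x-b1)/c1)^2)
ft = fittype("gauss1");
[fobj, gof] = fit(x, y, ft, "StartPoint", [max(y), mean(x), 10]);

% Evaluate the fitted curve over the observed range plus 20 more days
xx = (1:max(x) + 20)';
yy = fobj(xx);

plot(x, y, "o", xx, yy, "-")
legend("data", "Gaussian fit")
```

A Gaussian only approaches zero asymptotically, so in practice a "zero" date has to be read off as the day the fitted curve drops below some small threshold.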

Now let’s plot the result.

hold on

South Korea

Let’s look at the number of active cases, recovered cases and deaths in South Korea.


We cannot obtain a reasonable fit with the Gaussian model here.


