Perform exploratory analysis in the simplest way possible.
Pain points
In practice, a lot of data analysis time is spent on data cleaning and exploratory data analysis (EDA), namely summarizing missing values and visualizing the distribution of each variable.
During data collection, some values may go missing.
You need to know how much data is missing and how it might affect subsequent analysis.
If a variable has only a few missing values, you can simply drop the rows (observations) that contain them without much effect on the accuracy of the analysis.
But if there’s too much missing data, throwing it all away isn’t feasible. You need to think about how to fill it. Do you use 0, do you use “unknown”, do you use mean or median?
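As a minimal sketch of these options in base R, the snippet below counts the missing values per column and then either drops or fills them. It uses the built-in airquality data set purely as a stand-in (an assumption for illustration; it is not the flights data used later), and filling with the median is just one of the choices mentioned above.

```r
# Illustration only: airquality ships with base R and contains missing values.
df <- airquality

# How many values are missing in each column?
missing_per_column <- colSums(is.na(df))
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
complete_rows <- df[complete.cases(df), ]

# Option 2: fill the gaps in one column with that column's median.
filled <- df
ozone_median <- median(df$Ozone, na.rm = TRUE)
filled$Ozone[is.na(filled$Ozone)] <- ozone_median

cat("Rows before:", nrow(df), " after dropping:", nrow(complete_rows), "\n")
```

Dropping rows is simplest when few are affected; filling keeps every row but changes the distribution of the filled column, so the choice depends on how much is missing.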
In addition, you may want to look at the distribution of each variable.
For example, is a quantitative variable normally distributed or power-law distributed? This will have an impact on your research hypothesis.
Even for categorical data, you need to know how many unique values there are so you know what to expect.
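Done by hand in base R, these checks might look roughly like the sketch below. The built-in mtcars data set is used only as a stand-in, and treating cyl as a categorical variable is an assumption made for the example.

```r
# Illustration only: mtcars ships with base R.
df <- mtcars

# Quantitative variable: numeric summary plus a histogram of its shape.
print(summary(df$mpg))
hist(df$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Categorical variable: number of unique values and how often each occurs.
cyl_levels <- unique(df$cyl)
cyl_counts <- table(df$cyl)
cat("Unique values of cyl:", length(cyl_levels), "\n")
print(cyl_counts)
```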
This work is necessary, but doing it has been a hassle. Even in R, a language designed for statisticians, it used to take several commands (generally in proportion to the number of variables) to complete.
I recently found an R package that makes it very easy to produce a summary overview of a data set. A single statement takes you through many of the steps in exploratory data analysis.
I share it with you through this article. I hope it will be helpful to your data analysis work.
Demo
You don’t need to install any software. Just click on this link (t.cn/Rg1JFfo) to use the R programming environment.
When you’re ready, you’ll see an RStudio interface open in your browser.
Click File -> New File in the upper left corner and select R Script, the first item in the menu.
At this point, you should see a blank editing area open on the left side, where you can enter statements.
Before entering, let’s give the file a name. Click the File -> Save button.
In the new dialog box, type Demo and press Enter.
We need to enter a total of four statements, as follows. You can just copy and paste them into the editing area.
library(tidyverse)
library(summarytools)
flights <- read_csv("https://gitlab.com/wshuyi/demo-data-flights/raw/master/flights.csv")
view(dfSummary(flights))
Let's go through them one by one. The first three lines are all preparation. Only the fourth statement actually produces the summary overview.
Line 1: tidyverse is a very important collection of packages. It is fair to say that it has improved R's data processing ecosystem. Most of the tools in it were developed by Hadley Wickham.
Line 2: summarytools is the package we use today to produce the summary overview of the data.
Line 3: Use read_csv to read the data. We read it from the URL and store it in the flights variable.
You can download the raw data CSV file and view its contents by clicking on the link (t.cn/Rg1XCCN).
This data set, from Hadley Wickham's GitHub project, is called nycflights13.
It records flights departing from New York City's three major airports in 2013: JFK, LGA (LaGuardia), and EWR (Newark Liberty International).
The recorded information (feature columns) includes departure time, arrival time, delays, airline, origin airport, destination airport, flight duration, and flight distance.
The table looks clear enough. However, because of the large number of observations (rows), it is hard to get an intuitive sense of the missing values, the data distributions, and similar information.
The fourth statement helps us view and explore the data better. It passes the flights data frame to the dfSummary function, then uses the view function to display the result to the user.
Go to Code -> Run Region -> Run All to run the code.
While it runs, some warning messages may appear. Just ignore them.
The results of the analysis are shown in the lower right area. Because that area is relatively small and there is a lot of content, you cannot see everything at once.
You can click the third button, “Show in New Window”, in the upper left of this area to open the full display in a new browser window.
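If you would rather read the summary in the R console than in the viewer pane, a variant like the following may help. It is a sketch under assumptions: it uses the built-in airquality data set as a stand-in for flights, it only runs the summary if summarytools is installed, and it switches off the graph column with graph.col = FALSE to keep the text output compact.

```r
# Illustration only: airquality ships with base R and stands in for flights.
df <- airquality

if (requireNamespace("summarytools", quietly = TRUE)) {
  # Qualify the call with the package name so it is clear where it comes from.
  overview <- summarytools::dfSummary(df, graph.col = FALSE)
  # print() writes a plain-text version of the summary table to the console.
  print(overview)
} else {
  message("summarytools is not installed; install.packages('summarytools') adds it.")
}
```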
Interpretation
The output is too long to fit into a single screenshot. Let's start with the first screen and go through what the results mean.
- The first column is the row number. You can ignore it.
- The second column contains the name of the variable and its type. For example, integer refers to quantitative data of integer type, while character is a string, which is categorical data.
- The third column is the statistical summary. For quantitative data, it reports the maximum, minimum, mean, median and other information directly.
- The fourth column is frequency. It displays the number of occurrences of each unique value of the variable.
- The fifth column is the most interesting: it plots the distribution directly as a graph.
- The sixth column is the number of valid values. Complementing this, the seventh column is the number of missing values.
Let’s turn to the next page.
You can see that departure delay follows a typical power-law distribution.
It seems reasonable to expect arrival delays to be distributed much like departure delays.
But what does the distribution of arrival delays actually look like? Why the difference?
Here’s a question for you to think about.
Explore
The summarytools package described in this article does more than provide a summary overview of the data set.
It can also display relationships between variables. For example, suppose you want to know whether the proportion of flights departing from each of the three airports differs. You can use a single statement to get an analysis table like this:
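A sketch of how such proportions could be computed follows. The origin vector is made up for illustration, and freq() is only one of the summarytools functions that might fit here (an assumption, guarded so the snippet still runs without the package); base R's prop.table() gives the same shares.

```r
# Made-up sample of departure airports, for illustration only.
origin <- c("JFK", "LGA", "EWR", "EWR", "JFK", "EWR", "LGA", "EWR")

# Base R: counts per airport and their share of all flights.
origin_counts <- table(origin)
origin_share <- prop.table(origin_counts)
print(round(origin_share, 3))

# With summarytools installed, freq() reports counts and percentages in one table.
if (requireNamespace("summarytools", quietly = TRUE)) {
  print(summarytools::freq(origin))
}
```

To compare airports across a second variable such as airline, summarytools also provides ctable() for cross-tabulations.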
Want to do this analysis yourself? Click on this link (github.com/dcomtois/su…) and read the documentation to learn more about summarytools.
If you like this article, please give it a thumbs up. You can also follow and pin my official account “Nkwangshuyi” on WeChat.
If you're interested in data science, check out my tutorial index post, “How to Get Started in Data Science Effectively”. There are more interesting problems and solutions there.