- EXPLORATORY STATISTICAL DATA ANALYSIS WITH A KAGGLE DATASET USING PANDAS
- Original by Strikingloo
- Translation from: The Gold Project
- This article is permalink: github.com/xitu/gold-m…
- Translator: haiyang – tju
- Proofreader: Rocheers Leviding
Sometimes, when faced with a data problem, we must first delve into the data set and understand it. Understanding its properties, its distribution and so on, that’s the area we need to focus on.
Today, we will use the Python Pandas framework for data analysis and Seaborn for data visualization.
As a geek programmer, I have a low aesthetic. Seaborn is a great visualization tool for me because all I need is coordinate points.
It uses Matplotlib as the drawing engine underneath and uses default styles to set up the graphics, which makes them look prettier than I could have done. Let’s take a look at the data set, and I’ll give you a sense of what it’s like to look at the different features. Maybe we can get some insight from it!
You can’t Make an omelet without eggs: a data set.
In the analysis below, I used the data set of the Olympic Games from 120 years ago, which you can download or read more about by clicking on this link.
I downloaded this data set for free from Kaggle. If you need to grab a data set to try out some new machine learning algorithms, brush up on some of the framework’s apis, or just want to play around, Kaggle is a great site.
I’ll just use ‘athlete_events’ in the CSV file, which records the athlete information for every Olympic event since 1900, the country in which each athlete was born, whether they won a prize, and so on.
What’s interesting is that 85 percent of the information in the diving column is empty, so on average only 15 percent of Olympic athletes win MEDALS. There were also some athletes who won more than one medal, suggesting that even fewer of the few Athletes at the Olympic level can win MEDALS. So their credit is greater!
Start analysis: What does the data set look like?
First, before diving into the data set, you can look at some intuitive data to understand the schema of the data set. Like how much data is missing from the data set? How many columns does the data have? I want to start with these questions.
I use Jupyter notes in the analysis process, and I will add comments for each piece of code I run so that you can continue to learn.
The Jupyter notes can be found in this warehouse, which you can open and have a look at, and you can start from anywhere.
The first thing I’ll do is use Pandas to load the data and check their size.
import pandas as pd
import seaborn as sns
df = pd.read_csv('athlete_events.csv')
df.shape
# (271116, 15)
Copy the code
In this example, there are 15 different columns in the dataset, and a full 271116 rows! That’s over 270,000 athletes! But then I wonder how many different athletes there are actually. Also, how many of them won MEDALS?
To see this data, first call the ‘list’ function on the word data set to list the row data. We can see many interesting features.
list(df)
#['ID','Name','Sex','Age','Height','Weight','Team','NOC','Games','Year','Season','City',
# 'Sport','Event','Medal']
Copy the code
Some of the things I can think of is that we can look at the average height and weight of Olympic athletes, or divide them by different sports. We can also look at the distribution of the two variables that depend on gender. We can even see how many MEDALS each country has, as a time series of the rise and fall of civilizations throughout the 20th century.
The possibilities are endless! But first let’s tackle this conundrum: How complete is our data set?
def NaN_percent(df, column_name):
row_count = df[column_name].shape[0]
empty_values = row_count - df[column_name].count()
return(100.0 * empty_values)/row_countfor i in list(df):
print(i +':' + str(NaN_percent(df,i))+The '%')
' ''0% incomplete columns omitted for brevity. Age: 3.49444518214% Height: 22.193821095% Weight: 23.191180159% Medal: When you think or speculate about something 85.3262072323% --Notice how 15% of athletes did not get any diving '' '
Copy the code
The number of non-empty lines can be obtained by using Pandas’ counting methods on sequence data. By looking at the Shape property, you can see the total number of rows, whether they are empty or not.
Then there’s the problem of subtraction and division. We can see that only four columns are incomplete: height, weight, age and medal.
The medal attribute is incomplete because an athlete may not actually win a medal, so it would be expected that this statistic is incomplete. However, when it comes to weight, height and age, incomplete data present considerable challenges.
I tried to filter the rows by different years, but the incompleteness seemed to be consistent over time, leading me to think that there might be some countries that don’t provide such relevant data for their athletes.
Here’s our real analysis: What’s the medal picture?
The first question we asked was, how many different people have won MEDALS since 1900? The following code snippet answers that question:
total_rows = df.shape[0]
unique_athletes = len(df.Name.unique())
medal_winners = len(df[df.Medal.fillna('None')! ='None'].Name.unique())
"{0} {1} {2}".format(total_rows, unique_athletes, medal_winners)
# ', 271116, 134732, 28202 '
Copy the code
As you can see, over the past 120 years, about 135,000 different people have participated in the Olympics, but only over 28,000 have won at least one medal.
The medal rate is about one in five, which is not bad. But if you consider that many people actually participate in multiple categories of sports, that’s not so optimistic.
Now that we’re done, how many MEDALS have athletes won in the past 120 years?
# See medal distribution
print(df[df.Medal.fillna('None')! ='None'].Medal.value_counts())
# Total MEDALS
df[df.Medal.fillna('None')! ='None'].shape[0]
' '' Gold 13372 Bronze 13295 Silver 13116 Total: 39783 '' '
Copy the code
Not surprisingly, the medal table was almost evenly distributed: almost the same number of gold, silver and bronze MEDALS were won.
However, a total of nearly 39,000 MEDALS were awarded, meaning that if you were in the top 20 per cent of athletes who won the most MEDALS, your average medal haul would be more than one.
What about distribution by country? To get this information, run the following code snippet:
team_medal_count = df.groupby(['Team'.'Medal']).Medal.agg('count')
# Rank by number
team_medal_count = team_medal_count.reset_index(name='count').sort_values(['count'], ascending=False)
#team_medal_count.head(40) is used to display the first line
def get_country_stats(country):
return team_medal_count[team_medal_count.Team==country]
# get_country_stats('some_country'
Copy the code
Using this function we can get the number of MEDALS of each type that a country has won, and we can see which country has won the most MEDALS by fetching the Pandas data frame headers.
Interestingly, the runner-up for the country with the most MEDALS is still the Soviet Union, even though it hasn’t been around for nearly 20 years.
In all categories, the United States came in first and Germany came in third. I also looked at my two countries, Argentina and Croatia, and was surprised to find that Croatia had won 58 gold MEDALS, although that was since 1991 (that was the 1992 Olympics).
Write a code as an exercise to get the data of different years when a particular country participated in the Olympics. I believe you can do it!
Female Participation
The other interesting thing that comes to my mind is, how have women performed in the Olympics over this whole century? This code answers the question:
unique_women = len(df[df.Sex=='F'].Name.unique())
unique_men = len(df[df.Sex=='M'].Name.unique())
women_medals = df[df.Sex=='F'].Medal.count()
men_medals = df[df.Sex=='M'].Medal.count()
print("{} {} {} {}".format(unique_women, unique_men, women_medals, men_medals ))
df[df.Sex=='F'].Year.min()
# 33808, 100979, 11253, 28530
# 1900
Copy the code
What surprised me was that there were women participating in the Olympic Games as early as 1900. Historically, however, the Ratio of men to women at the Olympics has been three to one. Surprised that women were participating in the Olympics as early as 1900, I decided to look at the number of them over the entire time period. I finally used Seaborn!
We can see that over the last few decades, the participation rate of women has been rising rapidly, from almost nothing to thousands. But are their participation rates really growing faster than those of men? Or is it just a matter of world population? To solve this problem, I made a second picture:
f_year_count = df[df.Sex=='F'].groupby('Year').agg('count').Name
m_year_count = df[df.Sex=='M'].groupby('Year').agg('count').Name
(sns.scatterplot(data= m_year_count),
sns.scatterplot(data =f_year_count))
Copy the code
Over time, women participated (orange) versus men participated (blue).
This time, we can clearly see a pattern emerging: the number of women participating is actually rapidly approaching that of men! Another interesting thing: See that little dot down there, on the right? I think that’s the Winter Olympics! In any case, the picture looks pretty optimistic for the female delegates, although there has never been a single year in which women had more participants than men.
Other analysis: height and weight
I spent a lot of time looking at height and weight graphs, but I didn’t come up with anything interesting.
- Both of these properties are normally distributed in most motion
- In all the sports I have observed, men are always heavier and taller than women
- The only interesting change seems to be the extent to which the differences between the two sexes can be analysed according to the exercise.
If you have any interesting ideas that you can use to analyze the weight and height data, please let me know! I didn’t go deep enough into the groupings for each sport, so there may be some misinterpretations. So that’s it for today. I hope you found this analysis interesting, or at least learned something about Pandas or data analysis.
I put the notes up on GitHub, so you can copy the project, do your own analysis, and make a pull request.
Of course you get all the credit! I hope you do a better job of graphical display and visual analysis than I did.
A deeper understanding of movement in Part 2 can be found here.
Follow me on Medium for more tutorials, tips, and tips on software development and data science. If you really enjoyed this post, share it with a friend!
If you find any errors in the translation or other areas that need improvement, you are welcome to revise and PR the translation in the Gold Translation program, and you can also get corresponding bonus points. The permanent link to this article at the beginning of this article is the MarkDown link to this article on GitHub.
Diggings translation project is a community for translating quality Internet technical articles from diggings English sharing articles. The content covers the fields of Android, iOS, front end, back end, blockchain, products, design, artificial intelligence and so on. For more high-quality translations, please keep paying attention to The Translation Project, official weibo and zhihu column.