From today on, we will learn data analysis together and make progress together!
First, let's work through a hands-on data cleaning exercise using a classic dataset: the Titanic survival prediction data.
Data set download address
Github.com/zhouwei713/…
Import data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('titanic_data.csv')
df
The data set information is as follows:
Meaning of each field
Pclass: ticket class
Sibsp: number of siblings or spouses aboard
Parch: number of parents or children aboard
Ticket: ticket number
Fare: ticket price
Cabin: cabin number
Embarked: place of embarkation
Overall view of data
Once we have the data, let's take an overall look at it
df.describe()
Some of the columns are non-numerical, so describe() does not show them.
Keep the statistics we get here in mind; we will use them later when handling missing values.
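If you also want a quick summary of the non-numerical columns, one option (a minimal sketch using pandas' own describe parameters) is:
# include='all' also reports count, unique, top and freq for the string columns
df.describe(include='all')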
Handling missing values
First look at the missing values
df.isnull().sum()
>>>
pclass 1
survived 1
name 1
sex 1
age 264
sibsp 1
parch 1
ticket 1
fare 2
cabin 1015
embarked 3
dtype: int64
As you can see, cabin and age have the most missing values.
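As a side note, a quick way to see the missing percentage of every column at once (a minimal sketch, reusing the df loaded above):
# percentage of missing values per column, highest first
missing_pct = (df.isnull().sum() / df.shape[0] * 100).round(2)
print(missing_pct.sort_values(ascending=False))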
Age processing
View missing percentage
print('Missing: %.2f%%' % ((df['age'].isnull().sum() / df.shape[0]) * 100))
>>>
Missing: 20.15%
Now let’s look at the age distribution
age = df['age'].hist(bins=15, color='teal', alpha=0.6)
age.set(xlabel='age')
plt.xlim(-10, 85)
plt.show()
As the figure shows, the data is skewed: most values fall below the mean. So we should fill the missing values with the median rather than the mean.
As the output of the describe function above shows, the mean is 29.88 and the median is 28; the median is clearly closer to where most of the data lies.
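A quick way to confirm these two figures directly (a minimal sketch; the values quoted come from the describe() output above):
print(df['age'].mean())    # about 29.88
print(df['age'].median())  # 28.0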
Fill the missing age values with the median
data = df.copy()
data['age'].fillna(df['age'].median(skipna=True), inplace=True)
Cabin processing
View missing percentage
print('Missing: %.2f%%' % ((df['cabin'].isnull().sum() / df.shape[0]) * 100))
>>>
Missing: 77.48%
Since most of the cabin information is missing, we simply drop the column here
data.drop(columns=['cabin'], inplace=True)
Place of embarkation
Let's first look at what the embarkation column contains
print(df['embarked'].value_counts())
sns.countplot(x='embarked', data=df, palette='Set2')
plt.show()
>>>
S 914
C 270
Q 123
Name: embarked, dtype: int64
As you can see, the embarkation column contains three values, S, C and Q, occurring 914, 270 and 123 times respectively.
Since only three values are missing in this column, the missing rate is very low, so filling them with the mode is perfectly fine.
Fill with mode
data['embarked'].fillna(df['embarked'].value_counts().idxmax(), inplace=True)
Other missing value processing
For the other columns, where only one or two values are missing, we can fill them with the mode or simply drop those rows; either way the overall distribution is barely affected.
data.dropna(axis=0, how='any', inplace=True)
Finally, check to make sure there are no missing values
data.isnull().sum()
>>>
pclass 0
survived 0
name 0
sex 0
age 0
sibsp 0
parch 0
ticket 0
fare 0
embarked 0
dtype: int64
Other feature column processing
The sibsp and parch columns can be abstracted into a single piece of information: whether the passenger boarded alone. So we can merge them into one column and use 0/1 to represent it.
We add a new column alone: rows where sibsp and parch are both 0 (the passenger boarded alone) get the value 1, and rows where their sum is greater than 0 get the value 0. The two original columns can then be dropped.
data['alone']=np.where((data["sibsp"]+data["parch"])>0, 0, 1)
data.drop('sibsp', axis=1, inplace=True)
data.drop('parch', axis=1, inplace=True)
data.head()
The embarked and sex columns are strings, which must be converted to numbers before an algorithm or model can process them.
Here we can use one-hot encoding to convert the data
data =pd.get_dummies(data, columns=["embarked","sex"])
data.head()
One-hot encoding is a common data conversion method: a feature with m possible values becomes m binary features after encoding, and these features are mutually exclusive, with exactly one active at a time.
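As a tiny illustration (a hypothetical three-value column, not part of the Titanic data):
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
print(pd.get_dummies(toy, columns=['embarked']))
# each row ends up with exactly one of embarked_C / embarked_Q / embarked_S set to 1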
The name and ticket columns have no significance for our analysis; such meaningless columns are usually dropped outright
data.drop('name', axis=1, inplace=True)
data.drop('ticket', axis=1, inplace=True)
At this point, we have turned a piece of raw data into a relatively standard dataset that is easy to analyze.
Pivot table analysis
After processing the data, we can also use pivot tables to analyze it as a whole.
Here we mainly examine the impact of various features (ticket class, gender, age group, etc.) on the survival rate.
Note the difference below between the original dataset df and the cleaned data.
Gender pivot table
So let’s first look at survival by sex
sex_sur_table = pd.pivot_table(df, index=['sex'], values='survived')
print(sex_sur_table)
>>>
        survived
sex
female  0.727468
male    0.190985
The female survival rate is much higher than the male one. Ladies first.
Ticket class and survival rate
pclass_sur_table = pd.pivot_table(df, index=['sex'], columns=['pclass'], values='survived')
print(pclass_sur_table)
>>>
pclass       1.0       2.0       3.0
sex
female  0.965278  0.886792  0.490741
male    0.340782  0.146199  0.152130
As you can see, the female survival rate for first-class tickets is very high, and the higher the class, the higher the survival rate for both men and women
Survival rate at different ages
Discretize age into bins
data['age_cut'] = pd.cut(data['age'], [0, 18, 80])
data['sex'] = df['sex']
print(data.head())
>>>
   pclass  survived      age      fare  alone  embarked_C  embarked_Q  \
0     1.0       1.0  29.0000  211.3375      1           0           0
1     1.0       1.0   0.9167  151.5500      0           0           0
2     1.0       0.0   2.0000  151.5500      0           0           0
3     1.0       0.0  30.0000  151.5500      0           0           0
4     1.0       0.0  25.0000  151.5500      0           0           0

   embarked_S  sex_female  sex_male   age_cut     sex
0           1           1         0  (18, 80]  female
1           1           0         1   (0, 18]    male
2           1           1         0   (0, 18]  female
3           1           0         1  (18, 80]    male
4           1           1         0  (18, 80]  female
Age and survival rate
age_cut_sur_table = pd.pivot_table(data, index=['sex'], columns=['pclass', 'age_cut'], values='survived')
print(age_cut_sur_table)
>>>
pclass       1.0                  2.0                  3.0
age_cut  (0, 18]  (18, 80]    (0, 18]  (18, 80]    (0, 18]  (18, 80]
sex
female       ...       ...        ...       ...        ...       ...
male    0.750000  0.321637   0.523810  0.093333   0.208333  0.142857
Of course, pivot tables have many more powerful features that you can explore.
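For example, aggfunc and margins let you compute several statistics and overall totals in one call (a minimal sketch against the same df; both are standard pd.pivot_table parameters):
# mean survival rate and passenger count per sex and class, with overall margins
extra_table = pd.pivot_table(df, index=['sex'], columns=['pclass'],
                             values='survived', aggfunc=['mean', 'count'],
                             margins=True)
print(extra_table)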
The importance of data cleansing
After all, a good data analyst is first a good data cleaner. Data cleaning is the most time-consuming and labor-intensive step of data analysis, and data quality directly determines the quality of the final analysis, so it must not be treated carelessly.
Guidelines for data quality
Since data cleaning is so important, how far does the raw data need to be processed before it counts as ready for analysis? Here I summarize some common criteria for your reference.
- Integrity: whether the dataset has null values and whether the recorded fields are complete.
- Comprehensiveness: whether a column fully represents the real situation, or only covers part of it.
- Validity: whether the type, content and magnitude of the data are reasonable. For example: are there ages over 150, scores over 10,000, inconsistent units, and so on.
- Uniqueness: whether the data contains duplicate records.
When cleaning data, first observe it patiently, make sure you understand what each column means, and judge whether the values make sense in the real world; then handle the issues one by one, drawing on everyday and domain experience.
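As a rough sketch, some of these criteria can be checked programmatically (the column name age and the threshold are arbitrary examples; comprehensiveness usually still needs human judgment):
print(df.isnull().sum())        # integrity: null values per column
print(df[df['age'] > 150])      # validity: obviously impossible ages
print(df.duplicated().sum())    # uniqueness: number of fully duplicated rows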
Let’s do one more little example to understand this
| name | height | weight | age | age |
|---|---|---|---|---|
| Zhang Fei | 180 | | 500 | 500 |
| Guan Yu | 181 | 100 | 28 | 28 |
| Liu Bei | 1.78 | 160 | 30K | 30K |
| Zhao Yun | 175 | 140 | 23 | 23 |
| Cao Cao | 180 | 150 | 37 | 37 |
| Zhao Yun | 175 | 140 | 23 | 23 |
| Cao Cao | | | | |
Load the data into a pandas DataFrame
mydata = pd.read_csv('mydata.csv', index_col='name')
print(mydata)
>>>
           height  weight  age age.1
name
Zhang Fei  180.00     NaN  500   500
Guan Yu    181.00   100.0   28    28
Liu Bei      1.78   160.0  30K   30K
Zhao Yun   175.00   140.0   23    23
Cao Cao    180.00   150.0   37    37
Zhao Yun   175.00   140.0   23    23
Cao Cao       NaN     NaN  NaN   NaN
Integrity
View missing values
mydata1 = mydata.copy()   # work on a copy so the original data is kept
mydata1.isnull().sum()    # check missing values
Integrity issues are usually handled by filling null values and removing empty rows.
mydata1['weight'].fillna(mydata1['weight'].mean(), inplace=True)   # fill missing weight with the mean
# mydata1['height'].fillna(mydata['height'].value_counts().index[0], inplace=True)   # fill with the mode instead
mydata1.dropna(how='any', inplace=True)   # drop the rows that are still empty
Note that null-value filling must happen before the code that drops rows, otherwise every row containing a null will be deleted.
Comprehensiveness
Liu Bei's height is recorded in meters; we need to convert it to centimeters.
mydata1.loc[mydata1['height']<100, 'height'] = mydata1[mydata1['height']<100]['height']*100
Validity
Zhang Fei's age is 500, which is obviously unreasonable and needs fixing. Since Zhang Fei is the third sworn brother, he should be younger than Liu Bei and Guan Yu, so let's set his age to 27.
mydata1.loc['Zhang Fei', 'age'] = 27
At the same time, there is a K character in Liu Bei’s age, which needs to be removed
mydata1['age'].replace({r'[K]': ''}, regex=True, inplace=True)
Uniqueness
Duplicate rows and columns in the data must be deleted to ensure data uniqueness
mydata1.drop_duplicates(inplace=True)            # delete duplicate rows
mydata1.drop('age.1', axis=1, inplace=True)      # delete the duplicate age column
The final result is:
print(mydata1)
>>>
           height  weight age
name
Zhang Fei   180.0   138.0  27
Guan Yu     181.0   100.0  28
Liu Bei     178.0   160.0  30
Zhao Yun    175.0   140.0  23
Cao Cao     180.0   150.0  37
Conclusion
In this section we completed a hands-on data cleaning exercise plus a small practice example. For missing values, decide how to fill them based on the missing percentage and the data distribution. For non-numeric data, one-hot encoding is a convenient way to convert it. We also summarized guidelines for data quality; as long as you process your data against these guidelines, what you end up with is basically "good" data.
Welcome to comment!