From today on, we will learn data analysis together and make progress together!

First, let's work through a hands-on data cleaning exercise using a classic dataset: the Titanic survival prediction data.

Data set download address

Github.com/zhouwei713/…

Import data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('titanic_data.csv')
df

The data set information is as follows:

Meaning of each field

Pclass: Ticket class

Sibsp: Number of siblings or spouses aboard

Parch: Number of parents or children aboard

Ticket: ticket number

Fare: Ticket fare

Cabin: Cabin number

Embarked: Place of embarkation

Overall view of data

Once we have the data, let's take an overall look at it:

df.describe()

Some columns are non-numerical, so they are not shown in the describe() output.
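If you also want to inspect the non-numeric columns, df.info() lists every column together with its dtype and non-null count. This is a small addition for illustration, not part of the original output:

df.info()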

Keep the statistics we get here in mind; we will use them later when dealing with missing values.

Handling missing values

First look at the missing values

df.isnull().sum()
>>>
pclass         1
survived       1
name           1
sex            1
age          264
sibsp          1
parch          1
ticket         1
fare           2
cabin       1015
embarked       3
dtype: int64

As you can see, cabin and age have the most missing values.

Age processing

View missing percentage

print('Missing: %.2f%%' % ((df['age'].isnull().sum() / df.shape[0]) * 100))
>>>
Missing: 20.15%

Now let’s look at the age distribution

age = df['age'].hist(bins=15, color='teal', alpha=0.6)
age.set(xlabel='age')
plt.xlim(-10, 85)
plt.show()

As the figure shows, the distribution is skewed: most of the values are concentrated below the mean. So we should fill the missing values with the median rather than the mean.

From the describe() output above, the mean age is 29.88 and the median is 28; the median is clearly closer to where most of the data lies.
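As a quick sanity check (a small sketch added here for illustration), you can print both candidate fill values before deciding:

print('mean:   %.2f' % df['age'].mean(skipna=True))
print('median: %.2f' % df['age'].median(skipna=True))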

Fill the missing age values with the median:

data = df.copy()
data['age'].fillna(df['age'].median(skipna=True), inplace=True)

Cabin processing

View missing percentage

print('Missing: %.2f%%' % ((df['cabin'].isnull().sum() / df.shape[0]) * 100))
>>>
Missing: 77.48%

Since most of the cabin information is missing, we simply drop the column here.

data.drop(columns=['cabin'], inplace=True)

Place of embarkation

Let’s first look at what the embarkation column contains.

print(df['embarked'].value_counts())
sns.countplot(x='embarked', data=df, palette='Set2')
plt.show()
>>>
S    914
C    270
Q    123
Name: embarked, dtype: int64

As you can see, the embarkation column contains three values, S, C, and Q, which occur 914, 270, and 123 times respectively.

Since only three values are missing in this column, the missing rate is very low, so filling those three gaps with the mode is perfectly fine.

Fill with mode

data['embarked'].fillna(df['embarked'].value_counts().idxmax(), inplace=True)

Other missing value processing

For the other columns, where only one or two values are missing, we can either fill with the mode or simply drop the affected rows without affecting the overall data distribution.

data.dropna(axis=0, how='any', inplace=True)

Finally, check to make sure there are no missing values

data.isnull().sum()
>>>
pclass      0
survived    0
name        0
sex         0
age         0
sibsp       0
parch       0
ticket      0
fare        0
embarked    0
dtype: int64

Other feature column processing

The sibsp and parch columns can be abstracted into a single feature: whether the passenger boarded alone. We can merge the two columns into one and use 0/1 to represent it.

We add a new column, alone: rows where sibsp + parch is greater than 0 get alone = 0 (traveled with family), and rows where the sum is 0 get alone = 1 (boarded alone). The original two columns can then be deleted.

data['alone']=np.where((data["sibsp"]+data["parch"])>0, 0, 1)
data.drop('sibsp', axis=1, inplace=True)
data.drop('parch', axis=1, inplace=True)
data.head()

Embarked and sex are string columns; they must be converted to numbers before an algorithm or model can process them.

Here we can use one-hot encoding to convert the data.

data =pd.get_dummies(data, columns=["embarked","sex"])
data.head()

One-hot encoding is a common data conversion method. For each feature with m possible values, one-hot encoding turns it into m binary features. These features are mutually exclusive, and only one of them is active at a time.
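To make the idea concrete, here is a tiny standalone illustration (not from the original article): a column with three possible values becomes three mutually exclusive 0/1 columns.

demo = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})   # hypothetical mini example
print(pd.get_dummies(demo, columns=['embarked']))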

The name and ticket columns carry no value for our analysis, and this kind of uninformative data is usually dropped directly.

data.drop('name', axis=1, inplace=True)
data.drop('ticket', axis=1, inplace=True)

At this point, we have turned the raw data into a relatively standard dataset that is easy to analyze.
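As a quick check (added here as a sketch), you can confirm that every remaining column is numeric and that no missing values are left:

print(data.dtypes)
print(data.isnull().sum().sum())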

Pivot table analysis

After processing the data, we can also use pivot tables to analyze it as a whole.

Here we mainly examine how various features (ticket class, sex, age, and so on) affect the survival rate.

Note the difference between the datasets df and data below.
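A simple way to see that difference (a small illustrative check, not in the original): df still holds the raw columns, while data holds the cleaned and encoded ones.

print(df.columns.tolist())
print(data.columns.tolist())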

Gender pivot table

So let’s first look at survival by sex

sex_sur_table = pd.pivot_table(df, index=['sex'], values='survived')
print(sex_sur_table)
>>>
        survived
sex
female  0.727468
male    0.190985

The female survival rate is much higher than the male survival rate. Ladies first.

Ticket class and survival rate

pclass_sur_table = pd.pivot_table(df, index=['sex'], columns=['pclass'], values='survived')
print(pclass_sur_table)
>>>
pclass       1.0       2.0       3.0
sex
female  0.965278  0.886792  0.490741
male    0.340782  0.146199  0.152130

As you can see, the survival rate of women in first class is very high, and broadly speaking, the higher the ticket class, the higher the survival rate for both men and women.

Survival rate at different ages

First, discretize age into bins:

data['age_cut'] = pd.cut(data['age'], [0, 18, 80])
data['sex'] = df['sex']
print(data.head())
>>>
   pclass  survived      age      fare  alone  embarked_C  embarked_Q  \
0     1.0       1.0  29.0000  211.3375      1           0           0
1     1.0       1.0   0.9167  151.5500      0           0           0
2     1.0       0.0   2.0000  151.5500      0           0           0
3     1.0       0.0  30.0000  151.5500      0           0           0
4     1.0       0.0  25.0000  151.5500      0           0           0

   embarked_S  sex_female  sex_male   age_cut     sex
0           1           1         0  (18, 80]  female
1           1           0         1   (0, 18]    male
2           1           1         0   (0, 18]  female
3           1           0         1  (18, 80]    male
4           1           1         0  (18, 80]  female

Age and survival rate

age_cut_sur_table = pd.pivot_table(data, index=['sex'], columns=['pclass', 'age_cut'], values='survived')
print(age_cut_sur_table)
>>>
pclass        1.0                  2.0                  3.0
age_cut   (0, 18]  (18, 80]    (0, 18]  (18, 80]    (0, 18]  (18, 80]
sex
female        ...       ...        ...       ...        ...       ...
male     0.750000  0.321637   0.523810  0.093333   0.208333  0.142857

Of course, pivot tables have many more powerful features that you can explore.
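For example (a sketch of one such feature), the aggfunc and margins parameters let you count passengers per group and add row and column totals:

count_table = pd.pivot_table(df, index=['sex'], columns=['pclass'],
                             values='survived', aggfunc='count', margins=True)
print(count_table)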

The importance of data cleansing

After all, a good data analyst is first a good data cleaner. Data cleaning is the most time-consuming and labor-intensive step of data analysis, and data quality directly determines the quality of the final analysis, so it must not be treated carelessly.

Guidelines for data quality

Since data cleaning is so important, how thoroughly do we need to process raw data before it is ready for analysis? Here are some common industry guidelines for reference.

  • Integrity: whether there are null values in the data set and whether the statistical fields are complete.
  • Comprehensiveness: Whether a column of data is a complete representation of the real situation, or whether it contains only part of the situation.
  • Validity: Whether the type, content, and range of the data are reasonable. For example: is any age over 150, is any score over 10,000, are the units consistent, and so on.
  • Uniqueness: Whether the data has duplicate records.
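As a rough sketch (added for illustration), the first, third, and fourth guidelines map to quick pandas checks on the Titanic data; comprehensiveness mostly requires domain knowledge rather than code:

print(df.isnull().sum())       # integrity: missing values per column
print(df.describe())           # validity: unreasonable ranges stand out here
print(df.duplicated().sum())   # uniqueness: number of fully duplicated rows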

When cleaning data, first observe it patiently, fully understand the meaning of each column, judge whether the values make sense in the real world, and then process the data item by item based on experience from life and work.

Let’s do one more little example to understand this

name       height  weight  age  age
Zhang Fei  180             500  500
Guan Yu    181     100     28   28
Liu Bei    1.78    160     30K  30K
Zhao Yun   175     140     23   23
Cao Cao    180     150     37   37
Zhao Yun   175     140     23   23

Load the data into a pandas DataFrame:

mydata = pd.read_csv('mydata.csv', index_col='name')
print(mydata)
>>>
           height  weight  age age.1
name
Zhang Fei  180.00     NaN  500   500
Guan Yu    181.00   100.0   28    28
Liu Bei      1.78   160.0  30K   30K
Zhao Yun   175.00   140.0   23    23
Cao Cao    180.00   150.0   37    37
Zhao Yun   175.00   140.0   23    23
NaN           NaN     NaN  NaN   NaN

Integrity

View missing values

mydata1 = mydata.copy()   # work on a copy
mydata1.isnull().sum()    # count missing values in each column

The usual steps here are filling null values, dropping empty rows, and so on.

mydata1['weight'].fillna(mydata1['weight'].mean(), inplace=True)   # fill missing weight with the mean
# mydata1['height'].fillna(mydata['height'].value_counts().index[0], inplace=True)   # alternatively, fill with the mode
mydata1.dropna(how='any', inplace=True)   # drop rows that still contain missing values

The null-value filling must come before the line that drops empty rows; otherwise every row containing a null will be deleted.
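A small sketch of why the order matters, assuming (as in the table above) that Zhang Fei's weight is the missing value:

demo = mydata.copy()
demo.dropna(how='any', inplace=True)   # dropping first removes Zhang Fei's row before it can be filled
print(demo.shape)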

Comprehensiveness

Liu Bei’s height is in meters; we need to convert it to centimeters.

mydata1.loc[mydata1['height']<100, 'height'] = mydata1[mydata1['height']<100]['height']*100

Validity

Zhang Fei’s age is 500, which is obviously unreasonable and needs to be fixed. Since Zhang Fei is the third brother, he should be younger than Liu Bei and Guan Yu, so let's set his age to 27.

mydata1.loc['Zhang Fei', 'age'] = 27   # fix the unreasonable age

At the same time, there is a K character in Liu Bei’s age, which needs to be removed

mydata1['age'].replace({r'[K]': ''}, regex=True, inplace=True)
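One optional extra step (a sketch, not in the original): after stripping the 'K' the ages are still stored as strings, so convert the column if you need numeric ages later.

mydata1['age'] = mydata1['age'].astype(int)   # only valid once all remaining values are plain digits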

Uniqueness

Duplicate rows and columns in the data must be deleted to ensure data uniqueness

mydata1.drop_duplicates(inplace=True)          # drop duplicate rows
mydata1.drop('age.1', axis=1, inplace=True)    # drop the duplicate age column

In the end we get the following:

print(mydata1)
>>>
           height  weight age
name
Zhang Fei   180.0   138.0  27
Guan Yu     181.0   100.0  28
Liu Bei     178.0   160.0  30
Zhao Yun    175.0   140.0  23
Cao Cao     180.0   150.0  37

Conclusion

In this section we worked through a hands-on data cleaning exercise and a small practice example. For missing values, decide how to fill them based on the missing percentage and the data distribution. For non-numeric data, you can convert it with one-hot encoding. We also summarized guidelines for data quality; as long as you follow them when processing data, what you end up with will basically be "good" data.

Welcome to comment!