Python missing value handling case study: Titanic data

The original link: www.cnblogs.com/tecdat/p/94…

Missing value handling

Real data often has missing values for some variables.

First, we use the info() method to get an overview of the data:

titanic_df.info()

The output shows that there are 891 rows in total, so any column with fewer than 891 non-null values has missing values. For example, the Age column has 714 non-null values, so 891 - 714 = 177 values are missing. Cabin has even more missing values. Embarked (the port of embarkation) has very few, so it can be left unprocessed.

How should these missing values be handled? There are generally three options: leave them alone, drop them, or fill them.
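The same arithmetic can be done in code rather than by reading the info() output. The sketch below is only an illustration: sample_df is a hypothetical five-row table, not the real Titanic data, and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature sample, NOT the real Titanic data.
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", "S", None, "Q"],
})

n_rows = len(sample_df)
non_null = sample_df.count()             # non-null values per column, as info() reports
missing = n_rows - non_null              # same arithmetic as 891 - 714 = 177
missing_share = sample_df.isna().mean()  # fraction of missing values per column

print(missing)
print(missing_share)
```

The share of missing values per column is what the decision between dropping a column and filling it usually rests on.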

Cabin has over 70% missing values, so we can consider dropping this variable entirely (delete a column of data).

An important variable like Age has about 20% missing values, which we can consider filling in with the median (fill in the missing values).

We generally discourage deleting rows with missing values, because the other, non-missing variables in those rows may still provide useful information (delete rows with missing values).

Drop rows with missing values (generally not recommended): df.dropna()

Drop a column: df.drop('column_name', axis=1, inplace=True)

Fill missing values: df.column_name.fillna()

Here axis=1 indicates that a column is deleted, namely column_name, and inplace=True indicates that the original data df is modified directly.
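A minimal sketch of the three calls side by side, assuming a hypothetical four-row table (not the real Titanic data). It returns new objects instead of using inplace=True so that the original df stays available for comparison; the fill value 0 is only a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature sample, not the real Titanic data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Drop a column: axis=1 means "column". Without inplace=True a new DataFrame is returned.
no_cabin = df.drop("Cabin", axis=1)

# Drop every row that has at least one missing value (generally not recommended).
complete_rows = df.dropna()

# Fill the missing values of one column with a chosen value (0 is just a placeholder).
filled_age = df["Age"].fillna(0)

print(no_cabin.columns.tolist())
print(len(complete_rows))
print(filled_age.tolist())
```

In this sample every row has a missing value in at least one column, so dropna() leaves nothing behind, which shows why dropping rows is usually the last resort.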

This lesson focuses on the last option: filling in missing values. In the name fillna, "fill" means to fill in and "na" stands for a missing value.

In the info() output we saw quite a few missing values for Age, and we will fill them in with the median.

Fill in the missing values in the age data

Simply fill them in with the median age of all passengers

To make later comparisons easier, we first use describe to get summary statistics.

View statistics for the Age column

The output shows that the non-missing count is 714, the mean is 29.7 years, and the standard deviation (std) is 14.5. Note also the 50% row: 28.

The median

To avoid working on data that has already been changed, we reload the data before we start.

The median can be obtained directly with the median method, which gives the same number as the 50% row above.
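As a small illustration of that equality, the sketch below uses a hypothetical Age series (not the real Titanic column) and compares the 50% row of describe() with median().

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not the real Titanic column; two values are missing.
age = pd.Series([22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0], name="Age")

stats = age.describe()      # count, mean, std, min, quartiles, max (NaN ignored)
age_median1 = age.median()  # NaN is skipped by default

print(stats)
print(age_median1)
```

Both describe() and median() skip missing values, so count reports only the non-missing entries.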

Fill in age missing values

The second line of code assigns the median to age_median1, so that afterwards we can write age_median1 instead of typing out the longer expression titanic_df.Age.median().

titanic_df.Age.fillna(age_median1, inplace=True): the argument in the parentheses of fillna() is the value to fill with, so the missing values are replaced by age_median1, which we just assigned. The inplace=True after the comma indicates that the data in df is modified directly. Without this parameter, the result would have to be assigned back to the Age column, so inplace is just a shortcut.

Now the non-null count has become 891, and the mean has dropped from 29.7 to 29.4. Because the median we filled in, 28, is lower than the original mean, the new mean is lower as well.
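The effect on the count and the mean can be reproduced on a small scale. This is a sketch on hypothetical ages, not the real Titanic data. It assigns the result back to the column instead of calling fillna(..., inplace=True) on the selected column, because recent pandas versions can warn about, or not apply, in-place changes made through a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not the real Titanic data.
df = pd.DataFrame({"Age": [22.0, 28.0, np.nan, 26.0, 54.0, np.nan]})

mean_before = df["Age"].mean()    # mean of the 4 known ages
age_median1 = df["Age"].median()  # median of the 4 known ages

# Assign back rather than relying on inplace=True on a selected column.
df["Age"] = df["Age"].fillna(age_median1)

mean_after = df["Age"].mean()
print(df["Age"].count(), mean_before, age_median1, mean_after)
```

Here the median (27) is below the original mean (32.5), so filling with it pulls the mean down, which is the same mechanism as in the article.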

So far we have used the median age of all passengers. Now let's think a little further: does gender make a difference?

Fill in the median age of male and female passengers, taking gender into account

Because the operation above modified the original data, we need to reload the original data before filling by group; otherwise the following steps would run on data in which all missing ages have already been filled. Well, I’ve had…

Median age by gender

We got a median of 27 for women and 29 for men, so there is indeed a gap. Next we need to fill in the missing ages of women and men with their respective medians.

The usual approach would be to use a Boolean index to select the missing ages of female passengers and assign 27 to them, and then do the same for male passengers.
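A sketch of that Boolean-index approach, assuming a hypothetical six-row table whose group medians happen to come out at 27 and 29. The column names Sex and Age follow the Titanic data, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature sample, not the real Titanic data.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Age": [26.0, 30.0, np.nan, 28.0, 28.0, np.nan],
})

# Median age per gender (NaN is ignored).
medians = df.groupby("Sex")["Age"].median()

# Boolean masks: rows of one gender whose Age is missing.
female_missing = (df["Sex"] == "female") & df["Age"].isna()
male_missing = (df["Sex"] == "male") & df["Age"].isna()

# Assign each group's median to its missing rows.
df.loc[female_missing, "Age"] = medians["female"]
df.loc[male_missing, "Age"] = medians["male"]

print(df)
```

This works, but it needs one mask and one assignment per group, which becomes tedious once there are more than two groups.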

But in this lesson we learn a new way to use fillna.

Earlier, fillna filled in a single value. Now there is more than one value, and which one applies depends on the row. fillna can match fill values automatically by index: if the original data is also indexed by gender, each row is filled with the value that has the same index label.

Indexing by gender for fillna

So we re-index the original data: the index used to be 0, 1, 2, and so on, and now we change it to the gender column. Use the set_index method with Sex, and add inplace=True to indicate that the original data is modified.

The output shows that the index is now Sex, with the labels male and female, and that Sex no longer appears among the columns.
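What set_index does to the table can be seen on a tiny example. The data below is hypothetical, and the sketch returns a new DataFrame (by_sex) instead of using inplace=True so that both versions can be compared.

```python
import pandas as pd

# Hypothetical miniature sample, not the real Titanic data.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan"],
    "Sex": ["female", "male", "female", "male"],
    "Age": [26.0, 30.0, 28.0, 28.0],
})

# Move the Sex column into the index; the default 0, 1, 2, ... index is replaced.
by_sex = df.set_index("Sex")

print(by_sex.index.name)         # name of the new index
print(by_sex.columns.tolist())   # Sex is no longer a regular column
```

The index labels are not unique here (several rows share "female" or "male"), which is fine for the label matching that follows.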

Fill in missing values for gender categories

We assign the median of each gender to age_median2. Given these values, fillna matches each row to the appropriate one by index label. Because the Sex column will be needed later, the index also has to be turned back into a column, which is what reset_index does.

A non-null count of 891 shows that all missing values have been filled, and the mean is 29.4.
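The whole sequence (group medians, set_index, fillna, reset_index) can be sketched as follows on hypothetical data. The name age_median2 follows the article; the sample values are made up, and the result of fillna is assigned back to the column rather than applied with inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature sample, not the real Titanic data.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan", "Eve", "Fred"],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Age": [26.0, 30.0, np.nan, 28.0, 28.0, np.nan],
})

# Series of medians, indexed by Sex.
age_median2 = df.groupby("Sex")["Age"].median()

# Index the data by Sex so that fillna can line up rows with the medians by label.
df = df.set_index("Sex")
df["Age"] = df["Age"].fillna(age_median2)

# Turn Sex back into an ordinary column for later use.
df = df.reset_index()
print(df)
```

Compared with the Boolean-index version, no per-group mask is needed: the index labels decide which median each missing row receives.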

At this point we have filled in the missing ages for each gender with the respective median. The next step is to consider two factors together:

Consider both gender and passenger class

First, let's look at how the median age varies by gender and by class.

We group by Pclass (passenger class) and Sex. Because there are two grouping factors, they are passed as a list in square brackets, followed by Age.median() to get the median of each group.

Median age of men and women by class

The result has a two-level index, class and sex, and we can see that the median age falls as the class gets lower. A plausible explanation is that older people have had more time to accumulate wealth, while younger people tend to be less well off.
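A minimal sketch of grouping by two keys, on a hypothetical eight-row table (not the real Titanic data). The name age_median3 follows the article.

```python
import pandas as pd

# Hypothetical miniature sample, not the real Titanic data.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "male"],
    "Age": [30.0, 40.0, 38.0, 42.0, 20.0, 23.0, 24.0, 26.0],
})

# Two grouping keys go into a list; the result is a Series with a two-level index.
age_median3 = df.groupby(["Pclass", "Sex"])["Age"].median()

print(age_median3)
print(age_median3.loc[(3, "male")])   # look up one (class, sex) combination
```

Each entry of the result is addressed by a (Pclass, Sex) pair, which is what makes it usable as a lookup table for filling.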

Next we use these medians to fill in the missing ages for each combination of class and sex. fillna still works, but the data needs a two-level index.

Indexing by class and sex

Again, we assign the medians to age_median3 and then re-index the data. There are two factors again, so set_index takes a list in square brackets.

Then look at the data after re-indexing. The output shows the two-level index, which covers 3 × 2 = 6 combinations. Pclass and Sex no longer appear among the columns because they are now in the index.

Now use fillna in the same way to match the different medians by index value.

Er… these two outputs look exactly the same as the ones above. Did I do something wrong…?

To turn the index levels back into columns, reset_index is used here.
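The two-level version follows the same pattern as the single-level one. The sketch below uses a hypothetical six-row table, not the real Titanic data, and assigns the filled column back instead of using inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature sample, not the real Titanic data.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 24.0, 26.0, np.nan],
})

# Medians indexed by (Pclass, Sex).
age_median3 = df.groupby(["Pclass", "Sex"])["Age"].median()

# Give the data the same two-level index, then fill by matching index labels.
df = df.set_index(["Pclass", "Sex"])
df["Age"] = df["Age"].fillna(age_median3)

# Restore Pclass and Sex as ordinary columns.
df = df.reset_index()
print(df)
```

After reset_index the table has its default integer index again, with Pclass and Sex back as the first two columns.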

Result of filling by group

The output shows that the non-null count is now 891, so the missing values have been filled. The mean fell to 29.1 years, pulled down by the large number of third-class passengers and their younger ages.

To summarize: with fillna we can fill in the median of the whole population, or the medians of groups. Group medians carry an index, so the original data needs the same index; fillna then fills each row with the value whose index label matches.