
Earlier we talked about how to deal with duplicate values. Today we will talk about missing values. Missing values arise mainly for two kinds of reasons: mechanical and human. Mechanical causes include failed storage or a faulty machine, so that no data is collected for a certain period of time. Human causes are more varied, for example when someone deliberately withholds information.

Build a DataFrame with missing values as follows:

import pandas as pd
import numpy as np
data = pd.DataFrame([[1,np.nan,3],[np.nan,5,np.nan]],columns = ['a','b','c'])
print(data)

As you can see, np.nan shows up as NaN, which pandas treats as a null (missing) value.

pandas provides two methods for checking for null values, isnull() and isna(). isnull() is simply an alias of isna(). Let's try them out separately:

import pandas as pd
import numpy as np
data = pd.DataFrame([[1,np.nan,3],[np.nan,5,np.nan]],columns = ['a','b','c'])
# Both calls return a Boolean DataFrame of the same shape as data
print(data.isnull())
print(data.isna())


As you can see, both functions do the same thing: they check each element and return True where the value is null and False where it is not.
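Because the result is a Boolean DataFrame, it can also be summed to count the null values. The following is a minimal sketch on the same data; the variable names (nulls_per_column, nulls_per_row, total_nulls) are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, np.nan, 3], [np.nan, 5, np.nan]], columns=['a', 'b', 'c'])

# True counts as 1 when summed, so sum() on the mask counts the nulls
nulls_per_column = data.isna().sum()       # one count per column
nulls_per_row = data.isna().sum(axis=1)    # one count per row
total_nulls = int(data.isna().sum().sum()) # count for the whole DataFrame

print(nulls_per_column)
print(nulls_per_row)
print(total_nulls)
```

Counting first is a quick way to decide whether deleting or filling is the more reasonable choice for a given column.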

In general, there are two ways to handle null values: delete them or fill them in. Let's start with deletion, which uses dropna(). Note that by default it deletes every row that contains at least one null value. For example:

import pandas as pd
import numpy as np
data = pd.DataFrame([[1,np.nan,3],[np.nan,5,np.nan]],columns = ['a','b','c'])
# dropna() returns a new DataFrame; data itself is not modified
print(data.dropna())

The example above uses dropna(), and nothing is left! Both rows contain at least one null value, so both are dropped and the result is an empty DataFrame.

To be less strict, we can use the thresh parameter of dropna(). It sets the minimum number of non-null values a row must have to be kept: with thresh=2, rows with at least 2 non-null values are retained, and rows with fewer are deleted.
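Here is a small sketch of thresh on the same DataFrame. In pandas, thresh is the minimum number of non-null values a row needs in order to survive; the variable names strict and kept are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, np.nan, 3], [np.nan, 5, np.nan]], columns=['a', 'b', 'c'])

# Default behaviour: any null in a row removes that row
strict = data.dropna()

# thresh=2: keep rows that have at least 2 non-null values
# Row 0 has two non-null values (1 and 3), row 1 has only one (5)
kept = data.dropna(thresh=2)

print(len(strict))
print(kept)
```

Row 0 survives because it has two non-null values, while row 1 is dropped because it has only one.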

There are many ways to fill in null values: you can fill with the mean, with the median, and so on. All of them use fillna(). For example, we fill the data above with the column means as follows:

import pandas as pd
import numpy as np
data = pd.DataFrame([[1,np.nan,3],[np.nan,5,np.nan]],columns = ['a','b','c'])
# data.mean() is the per-column mean; fillna() matches it to columns by name
print(data.fillna(data.mean()))

When the code runs, each null value is filled with the mean of its column. pandas skips null values when computing the mean, so here the column means are 1.0 for a, 5.0 for b, and 3.0 for c.
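Filling with the median works the same way. The sketch below uses a small made-up DataFrame (sample, which is not from the example above) with an outlier in column a, to show how the two choices can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'a' contains an outlier (100)
sample = pd.DataFrame({'a': [1, 2, 100, np.nan], 'b': [np.nan, 5, 7, 9]})

# Each null is replaced by the statistic of its own column
filled_mean = sample.fillna(sample.mean())
filled_median = sample.fillna(sample.median())

print(filled_mean)
print(filled_median)
```

In column a the mean is pulled up by the outlier, while the median stays at 2, so the median is often the safer choice for skewed data.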