Public account: Youerhuts author: Peter editor: Peter
Data cleaning is really a complex and tedious job. Some people sneer at 😭 : data people spend 80% of their time on data cleaning. It sounds crazy, but that’s really the case.
But it is also the most important part of the whole data analysis process. This article describes how to use the Pandas library to handle missing values.
Pandas series
Pandas has published 12 articles in a series.
1- Illustrates the sorting mechanism for pandas
2- Illustrates the ranking mechanism for Pandas
3- Groupby mechanism for Pandas
Commonly used functions
When there is a null value or missing value in our data, we often deal with the following functions: determine whether the missing value is true, directly delete the missing value, and fill in the missing value
- Df.isnull (), df.notnull() : These two functions are the inverse of each other
- Df.isna () : equivalent to df.isnull()
- Df.dropna () : deletes the missing value
- Df.fillna () : fills the missing value
Relevant concepts
First, a few concepts that may be vague to Pandas or Python involve:
- The null value in PANDAS is “” (a pair of double quotes); An empty string: “” with an extra space in the middle
- Missing values are NaN or NaT in DataFrame and None or NaN in Series
- When we need to manually specify a missing value, we default to None and np.nan
Next, let’s look at the association between “None” and “NaN” in Pandas:
- By default, they are the same when we create them
- If we specify None, it will still be None in Series and will display the float64 data type
Simulated data
For interpretation, the author simulated a data with missing values and null values:
Missing value in birthday, address, English; There are also null values in the address: they can be generated by pressing the space bar directly
When we read the data into the Notebook, it looks like this: null values have nothing, and missing values are shown as NaN or NaT.
Viewing missing Values
Isnull () and isna() are two functions that check for missing values in the data:
Look at the missing values at each location
Places where missing values exist are represented by True
df.isnull() The position of the null value is marked True
Copy the code
Check for missing values for each column attribute
df.isnull().any(a)# as long as there is a missing value and True
Copy the code
See how many missing values exist for each column attribute
df.isnull().sum(a)# Total number of missing values for each column attribute
Copy the code
⚠️isna() and isnull() are used the same way:
View non-missing value case Notnull ()
The uses of notnull() and isnull() are completely opposite:
1. Check the case of each non-missing value, where missing values are represented by False
2. Check the number of non-missing values for each column attribute
Delete the missing value dropna
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Copy the code
Parameters that
The specific explanation of parameters is as follows:
- Axis: Delete rows or columns, axis=0 indicates index direction; Axis =1; The default is 0
- How: “all”, “any”; All: deletes all rows or columns when all rows or columns are missing. Any: deletes at least one row or column. The default value is
- Thresh: The minimum number of cases in a row or column that will not be deleted
- Subset: subset is deleted only for a subset of a specified column. No operations are performed on rows or columns that are not in the subset
- Inplace: indicates whether a new DataFrame is generated or deleted from the original data
For the convenience of explanation, we made a copy of dF1:
Parameters of the axis
Default is axis=0 delete:
Instead of using axis=0 or axis=1, you can use axis=”index” or axis=”columns” :
How parameters
There are two ways to delete:
- How =”any” : if at least one missing value is deleted, default
- How =”all” : All missing values are deleted
Parameter thresh
The thRESH parameter indicates thresh=N. Data is retained only when at least N non-missing values exist.
In the following example, N=6 means that at least 6 non-missing values will be retained; There are two missing values in the data with index 1 and the index is deleted
When we change it to N=7: any row with missing values will be deleted
Parameters subset
The use of subset specifies a subset element to be deleted. It can be one or more, and more are lists
Parameter inplace
Delete the missing data to determine whether the new data is saved as a copy or directly modified on the original data.
If inplace is not specified, the default is saved as a copy and new data frames are generated
If True is specified, the original data is changed directly:
The missing value is filled in the fillNA
Sometimes when processing data, we can simply delete missing values; But sometimes we need to fill in the missing values for data integrity:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Copy the code
Parameter interpretation
- Value: the value to be filled in. It can be a specific value, a dictionary value, a function value, etc
- Axis: direction of filling, axis=0(row), default; The axis = 1 (column)
- Method: Fill method {‘ backfill ‘, ‘bfill’, ‘pad’, ‘ffill’, None}, default is None
- Inplace: generating a new copy or modifying the original data
- Limit: indicates the number of missing values
- Downcast: reduction of data types,
item-> dtype
The dictionary of, if possible, will be converted downward, or strings"Infer"
, will attempt to cast down to the appropriate equivalent type. Look at the official website explanation, with less
Downcast: A dict of item->dtype of what to downcast if possible, Or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Simulated data
Simulate another piece of data:
df2 = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3.4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list("ABCD"))
df2
Copy the code
Parameter value
The parameter value is the value we need to fill in. It takes several forms:
It is populated with a dictionary whose key is the name of our column property:
Fill with the values computed by the function:
Parameters of the axis
The axis argument is used to indicate the direction of filling. The default is Axis =0
Parameters of the limit
The limit argument represents the maximum number of missing values to be filled: in the following example, the maximum number of missing values is 2
Parameter method
The method parameter indicates the method of padding. This parameter cannot exist with the value
- Ffill, pad: Fill with the value before the missing value
- Bfill, backfill: fill with the value after the missing value
Bfill and backfill indicate that the current missing value is filled with the value following the missing value: for example, a missing value with index 0 followed by 3 is filled.
Parameter Inplace
Effect similar to Dropna; If inplace=True is added, the original data is changed directly. The default value is False. We don’t modify the original data directly
Parameter downcast
To explain the use of downcast, an example was found on Stack Overflow:
Stackoverflow.com/questions/2…
conclusion
Data can’t always be perfect, and missing values are quite common. This article describes how to view the missing value, delete the missing value, and fill the missing value in detail. It focuses on the use of Dropna and Fillna.