This article is participating in Python Theme Month. See the link to the event for more details

A missing value is a part of the data that is empty for some reason. Missing values can be handled in one of two ways: either delete or fill (replacing the missing value with a value). There are generally two kinds of missing values. One is the data missing in a certain column. The other is that an entire line of data is missing, i.e. a blank line.

The Contents of the Excel table used in this article are as follows:

1. Check the missing value

1.1 First print the data structure to see the overall picture

import pandas as pd


df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df)
Copy the code

result:

0 northeast Liaoning Dalian 2019-09-06 12.01 Northwest NaN Xi 'an NaT 87.02 South China Guangdong Shenzhen 2019-09-08 NaN 3 North China Beijing Beijing 2021-05-13 45.04 4 Central China 2019-09-13 34.0 8 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-11 42.0 6 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-17 2019-09-15 2019-09-17 2019-09-17 2019-09-15 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-15 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17Copy the code

As a result, each column has a missing value. It is important to note here that missing data values of the time and date types are represented by NaT, and all other types are represented by NaN. Don’t assume that missing values are all NaN’s

1.2 info(): Displays the missing value

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.info())
Copy the code

result:

<class 'pandas.core.frame.DataFrame'> RangeIndex: 12 entries, 0 to 11 Data columns (total 5 columns): # Column non-null Count Dtype --------- -------------- ----- 0 Region 11 non-null object 1 province 8 Non-null Object 2 City 11 Non-null object 3 Time 7 Non-NULL datetime64[ns] 4 Float64 dtypes: Datetime64 [ns](1), Float64 (1), Object (3) Memory Usage: 608.0+ bytes NoneCopy the code

As a result, the province column is 9 non-null. Note The province column has four null values. Similarly, the time column has 4 missing values, the index column has 2 missing values, the city column has 1 missing value, and the region column has 1 missing value

2. Judgment of missing value

Isnull () : checks whether a specific value is missing, returning True if it is, and False otherwise

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.isnull())
Copy the code

result:

Region Province City Time Indicator 0 False False False False False False 1 False True False True False False 2 False False False False False True 3 False False False False False 4 False False False True False 5 False False False False False 6 True True True True True 7 False True False False False 8 False False False True False 9 False True False False True 10 False False False True False 11 False False False False FalseCopy the code

3. Delete the missing value

Dropna (): deletes missing values

3.1 Strictly delete all rows with all missing values

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.dropna())
Copy the code

result:

0 northeast Liaoning Dalian 2019-09-06 12.03 North China Beijing Beijing 2021-05-13 45.0 5 Northeast Heilongjiang Harbin 2019-09-11 42.0 11 Central China Hubei Jingzhou 2019-09-17 32.0Copy the code

As a result, the dropna() method will delete rows with missing values by default, and any row that has a field missing will be deleted. So only four complete data records were kept

3.2 Delete only blank lines

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.dropna(how='all'))
Copy the code

result:

0 northeast Liaoning Dalian 2019-09-06 12.01 Northwest NaN Xi 'an NaT 87.02 South China Guangdong Shenzhen 2019-09-08 NaN 3 North China Beijing Beijing 2021-05-13 45.04 4 Central China 2019-09-11 42.0 7 2019-09-13 34.0 8 2019-09-13 56.0 9 10 northeast Liaoning Anshan NaT 32.0 11 Central China Hubei Jingzhou 2019-09-17 32.0Copy the code

How =’all’ means that only rows with missing values in all fields are deleted, that is, only blank rows are deleted. So 11 records were kept. Only the blank line on line 6 was deleted.

4. Missing value supplement

Generally, fill with 0, fill with average, fill with mode (most of the time), fill forward (fill with the value of the field on the row above the missing value, for example, if D3 cells are missing, then fill with the value of D2 cells), fill backward (corresponding to fill forward), and so on.

4.1 Fill in all missing values

Fillna (): a missing value. The parameter represents the content to be filled.

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.fillna(0))
Copy the code

result:

Regional province city time index 0 northeast Liaoning Dalian 0 northwest Xi 'an 0 87.02 South China Guangdong Shenzhen 2019-09-08 00:00:00 0.03 North China Beijing Beijing 2021-05-13 00:00:00 45.0 4 Wuhan, Central China, Hubei province 0 21.0 5 Harbin, Northeast China, heilongjiang, China 0 09-11 00:00:00 42.0 600 000 0.07 2019-09-13 0 56.0 9 0 Yiyang 10 northeast Liaoning 0 32.0 11 Jingzhou 2019-09-17 00:00:00 8 North China 32.0Copy the code

As it turns out, this method fills in all missing values by default

4.2 Populating a single column

Df = pd. Read_excel (r'C:\Users\admin\Desktop\data_test.xlsx') print(df.Copy the code

result:

0 northeast Liaoning Dalian 2019-09-06 12.01 Northwest NaN Xi 'an NaT 87.02 South China Guangdong Shenzhen 2019-09-08 NaN 3 North China Beijing Beijing 2021-05-13 45.04 4 Central China 2019-09-13 34.0 8 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-13 42.0 6 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-13 42.0 6 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-17 2019-09-15 2019-09-17 2019-09-17 2019-09-15 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-15 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17Copy the code

4.3 Filling multiple columns

4.3.1 Specify that multiple columns are populated with the same content

Df = pd read_excel (r 'C: \ Users \ admin \ Desktop \ data_test XLSX') print (df) fillna ({' regional ': "Shanghai", "province", "Shanghai"}))Copy the code

result:

0 northeast Liaoning Dalian 2019-09-06 12.01 Northwest Shanghai Xi 'an NaT 87.02 South China Guangdong Shenzhen 2019-09-08 NaN 3 North China Beijing Beijing 2021-05-13 45.04 4 Central China 2019-09-13 34.0 8 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-13 42.0 6 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17Copy the code

4.3.2 Specify that multiple columns are populated differently

Df = pd read_excel (r 'C: \ Users \ admin \ Desktop \ data_test XLSX') print (df) fillna ({' regional ': "Chinese", "province", "Shanghai"}))Copy the code

result:

0 northeast Liaoning Dalian 2019-09-06 12.01 Northwest Shanghai Xi 'an NaT 87.02 South China Guangdong Shenzhen 2019-09-08 NaN 3 North China Beijing Beijing 2021-05-13 45.04 4 Central China 2019-09-13 34.0 8 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-13 42.0 6 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-13 42.0 6 2019-09-13 34.0 8 2019-09-13 36.0 9 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17 2019-09-17Copy the code