Data cleaning mainly deals with missing values, outliers, and duplicate values. Cleaning means discarding, filling, replacing, and deduplicating data in order to remove anomalies, correct errors, and make up for missing information.

1. Four ways to handle missing values in data columns

Missing data falls into two types:

  • Missing row records, also called missing data records;
  • Missing values in data columns, that is, values missing from certain columns of a data record for various reasons.

This section mainly discusses how to handle missing values in data columns. There are usually four approaches:

  1. Discard
  2. Complete. Compared with discarding, completion is a more common way to deal with missing values. Common completion methods include:
    • Statistical method: for numerical columns, fill with the mean, weighted mean, or median; for categorical columns, fill with the mode (the most frequent category).
    • Model-based method
    • Expert completion
    • Other methods
  3. Truth-value conversion (treat the missing state itself as a valid value that participates in modeling)
  4. No processing

In the data preprocessing stage, doing nothing with records that contain missing values is also an option. Whether this works mainly depends on the analysis and modeling applied later: many models can tolerate or flexibly handle missing values, so no processing is needed during preprocessing. Common models that handle missing values automatically include:

  • Models that ignore missing values so they do not take part in distance calculation, such as KNN;
  • Models that treat the missing value as one state of the distribution and let it participate in modeling, such as decision trees and their variants;
  • Models that do not compute based on distances between values, so the influence of missing values on distance calculation is eliminated, such as DBSCAN.
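As a minimal sketch of the "no processing" idea (an addition to the original text; the model choice is only illustrative), recent versions of scikit-learn provide HistGradientBoostingClassifier, which accepts NaN inputs directly, so no imputation step is required:

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier  # natively supports NaN inputs

# Toy data with missing values deliberately left in place
X = pd.DataFrame({'f1': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
                  'f2': [0.5, 1.5, 2.5, np.nan, 4.5, 5.5]})
y = [0, 0, 0, 1, 1, 1]

model = HistGradientBoostingClassifier().fit(X, y)  # no imputation needed before fitting
print(model.predict(X))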

2. Abnormal data does not always need to be cleaned or discarded

2.1 Outliers that normally reflect business operation results

2.2 Anomaly detection model

2.3 Data modeling containing outliers

3. Do duplicate values always need to be removed?

Deduplication is not recommended in the following situations:

  1. Duplicate records are used to analyze evolution rules;
  2. Duplicate records are used for sample imbalance processing (see the sketch after this list);
  3. Duplicate records are used to detect business rule problems.
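A minimal sketch of point 2 (the column names and toy data below are hypothetical): naive random oversampling duplicates minority-class records on purpose, so deduplicating afterwards would undo the balancing.

import pandas as pd

# Hypothetical imbalanced data: label 1 is the minority class
df = pd.DataFrame({'feature': [1, 2, 3, 4, 5, 6],
                   'label':   [0, 0, 0, 0, 0, 1]})

minority = df[df['label'] == 1]
n_extra = (df['label'] == 0).sum() - (df['label'] == 1).sum()  # how many extra copies are needed
oversampled = minority.sample(n=n_extra, replace=True, random_state=0)  # duplicate minority rows
df_balanced = pd.concat([df, oversampled], ignore_index=True)
print(df_balanced['label'].value_counts())  # classes are now balanced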

1. Missing value processing

import pandas as pd # Import library Pandas
import numpy as np # Import Numpy library
from sklearn.preprocessing import Imputer  # Import Imputer from sklearn.preprocessing

# Generate missing data
df = pd.DataFrame(np.random.randn(6, 4), columns=['col1', 'col2', 'col3', 'col4'])  # Generate a DataFrame of random data

df.iloc[1, 1] = np.nan  # Add missing values
df.iloc[4, 3] = np.nan
print(df)

       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689       NaN  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749       NaN
5  0.615499  0.138902 -0.523084 -0.608018
# Check which values are missing
nan_all = df.isnull()  # Get a Boolean indicator of NaN values for the whole DataFrame
print(nan_all) # Print out
    col1   col2   col3   col4
0  False  False  False  False
1  False   True  False  False
2  False  False  False  False
3  False  False  False  False
4  False  False  False   True
5  False  False  False  False
# Obtain the column containing NA
nan_col = df.isnull().any()  # Find columns with at least one missing value; any() returns True if any element along the axis is True
print("Get the column that contains NA")
print(nan_col)


Get the column that contains NA
col1    False
col2     True
col3    False
col4     True
dtype: bool
# select * from NA
nan_col2 = df.isnull().all()  # Find columns where all values are missing; all() returns True only if every element along the axis is True
print("Get all columns that are NA")
print(nan_col2)


Get all columns that are NA
col1    False
col2    False
col3    False
col4    False
dtype: bool

1.1 Discarding missing values

df2 = df.dropna() # Discard rows containing NA directly
print(df2)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
5  0.615499  0.138902 -0.523084 -0.608018

1.2 Complete missing values

1.2.1 Replace missing values with specific values using sklearn

Missing values can be processed with sklearn's data preprocessing functionality. First, create a preprocessing object with Imputer; its parameters are described as follows:

  • missing_values: the placeholder for missing values, default 'NaN';
  • strategy: the imputation strategy, which can be mean, median, or most_frequent;
  • axis: the axis along which to impute; the default 0 applies the calculation column by column.
nan_model = Imputer(missing_values='NaN', strategy='mean', axis=0)  # Create the replacement rule: replace NaN missing values with the column mean
nan_result = nan_model.fit_transform(df) # Apply model rules
print(nan_result)
[[-1.00285617 -0.07899506 -0.60695655 -1.62471244]
 [ 0.04468906 -0.15452215  1.39967806  1.34189211]
 [-0.25801157 -0.341365    0.126365   -2.010339  ]
 [-0.76607272  0.03276415  0.03368488 -0.68469122]
 [ 0.38592178 -0.52391692  1.2277487  -0.717174  ]
 [ 0.61549888  0.13890169 -0.5230837  -0.60801774]]
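Note that in newer scikit-learn releases Imputer has been removed; SimpleImputer from sklearn.impute provides the same column-wise strategies. A rough equivalent of the rule above (a sketch, assuming a recent scikit-learn version):

import numpy as np
from sklearn.impute import SimpleImputer

nan_model = SimpleImputer(missing_values=np.nan, strategy='mean')  # column-wise mean replacement
nan_result = nan_model.fit_transform(df)  # same fit/transform usage as Imputer above
print(nan_result)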

1.2.2 Use Pandas to replace missing values with specific values

Pandas handles missing values with df.fillna(). Its two main parameters are value and method: the former replaces missing values with a fixed (or manually specified) value, and the latter replaces missing values using a built-in method provided by pandas. The method parameter supports:

  • Pad and ffill: Replace missing values with previous values
  • Backfill and bfill: Replace missing values with later values
# Replace the missing value with the following value
nan_result_pd1 = df.fillna(method='backfill') 
print(nan_result_pd1)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689 -0.341365  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749 -0.608018
5  0.615499  0.138902 -0.523084 -0.608018
# Replace the missing value with the following value, limiting the replacement of one missing value per column
nan_result_pd2 = df.fillna(method='bfill', limit=1) 
print(nan_result_pd2)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689 -0.341365  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749 -0.608018
5  0.615499  0.138902 -0.523084 -0.608018
# Replace missing values with previous values
nan_result_pd3 = df.fillna(method='pad') 
print(nan_result_pd3)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689 -0.078995  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749 -0.684691
5  0.615499  0.138902 -0.523084 -0.608018
# Replace missing values with 0
nan_result_pd4 = df.fillna(0) 
print(nan_result_pd4)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689  0.000000  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749  0.000000
5  0.615499  0.138902 -0.523084 -0.608018
# Replace missing values for different columns with different values
nan_result_pd5 = df.fillna({'col2': 1.1, 'col4': 1.4})
print(nan_result_pd5)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689  1.100000  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749  1.400000
5  0.615499  0.138902 -0.523084 -0.608018
# Replace with the mean. Select the mean of each column to replace the missing value
nan_result_pd6 = df.fillna(df.mean()['col2':'col4']) 
print(nan_result_pd6)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689 -0.154522  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749 -0.717174
5  0.615499  0.138902 -0.523084 -0.608018
# Replace the missing value with the replace method
nan_result_pd7 = df.replace(np.nan, 0)
print(nan_result_pd7)
       col1      col2      col3      col4
0 -1.002856 -0.078995 -0.606957 -1.624712
1  0.044689  0.000000  1.399678  1.341892
2 -0.258012 -0.341365  0.126365 -2.010339
3 -0.766073  0.032764  0.033685 -0.684691
4  0.385922 -0.523917  1.227749  0.000000
5  0.615499  0.138902 -0.523084 -0.608018

2. Outlier handling

There are many rules and methods for identifying outliers. Here, a threshold on the Z-score (the standardized value) is used as the criterion: when the absolute standardized score exceeds the threshold, the value is regarded as an outlier.

import pandas as pd # Import library Pandas

# Generate abnormal data
df = pd.DataFrame({'col1': [1, 120, 3, 5, 2, 12, 13], 'col2': [12, 17, 31, 53, 22, 32, 43]})
print(df)
   col1  col2
0     1    12
1   120    17
2     3    31
3     5    53
4     2    22
5    12    32
6    13    43

2.1 Judging outliers with the Z-score method

df_zscore = df.copy()  # Copy a DataFrame to store the Z-scores
cols = df.columns  # Get the column names of the DataFrame
print(cols)
for col in cols:  # Loop over each column
    df_col = df[col]  # Get the values of the column
    z_score = (df_col - df_col.mean()) / df_col.std()  # Calculate the Z-score for each column
    df_zscore[col] = z_score.abs() > 2.2  # True if the absolute Z-score exceeds 2.2, otherwise False
print(df_zscore)
Index(['col1', 'col2'], dtype='object')
    col1   col2
0  False  False
1   True  False
2  False  False
3  False  False
4  False  False
5  False  False
6  False  False

In the above example, setting the threshold is the key to determining what counts as an outlier; usually a standardized score greater than 2 already indicates a fairly unusual value. The key point to consider is how to judge outliers. If there are fixed business rules, use them directly. If there are no fixed business rules, common data models can be used: methods based on a probability distribution (e.g., the standard-deviation range of a normal distribution), clustering-based methods (e.g., KMeans), density-based methods (e.g., LOF), classification-based methods (e.g., KNN), statistics-based methods (e.g., the quantile method), and so on. In these cases the definition of an outlier carries a strong element of subjective judgment and should be chosen according to the actual situation.
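As a complement to the Z-score rule, here is a minimal sketch (an addition, not from the original text) of the quantile-based method mentioned above, applied to the same toy data with the common 1.5×IQR rule:

import pandas as pd

df = pd.DataFrame({'col1': [1, 120, 3, 5, 2, 12, 13],
                   'col2': [12, 17, 31, 53, 22, 32, 43]})

q1 = df.quantile(0.25)  # first quartile of each column
q3 = df.quantile(0.75)  # third quartile of each column
iqr = q3 - q1
outliers = (df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))  # flag values outside the IQR fences
print(outliers)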

3. Duplicate value processing

import pandas as pd 

# Generate duplicate data
data1 = ['a', 3]
data2 = ['b', 2]
data3 = ['a', 3]
data4 = ['c', 2]
df = pd.DataFrame([data1, data2, data3, data4], columns=['col1', 'col2'])
print(df)
  col1  col2
0    a     3
1    b     2
2    a     3
3    c     2

3.1 Identifying duplicate data

isDuplicated = df.duplicated() # Judge duplicate data records
print(isDuplicated)
0    False
1    False
2     True
3    False
dtype: bool

3.2 Deleting duplicate values

3.2.1 Deleting records where all columns have the same value

new_df1 = df.drop_duplicates()  # Delete data records where all columns have the same value
print(new_df1)
  col1  col2
0    a     3
1    b     2
3    c     2

3.2.2 Deleting records with the same col1 value

new_df2 = df.drop_duplicates(['col1'])
print(new_df2)
  col1  col2
0    a     3
1    b     2
3    c     2

3.2.3 Deleting records with the same col2 value

new_df3 = df.drop_duplicates(['col2'])
print(new_df3)
  col1  col2
0    a     3
1    b     2

3.2.4 Deleting records with the same values in the specified columns (col1 and col2)

new_df4 = df.drop_duplicates(['col1', 'col2'])
print(new_df4)
  col1  col2
0    a     3
1    b     2
3    c     2

4. Summary of technical points

  1. Create a DataFrame with pd.DataFrame;
  2. Select specific rows and columns by position with df.iloc[];
  3. Check whether values are missing with pandas' isnull();
  4. Use any() and all() to determine whether a column contains at least one True or consists entirely of True values;
  5. Drop missing values directly with pandas' dropna();
  6. Fill and replace missing values with Imputer from sklearn.preprocessing, which supports three strategies: mean, median, and most_frequent;
  7. Fill missing values with pandas' fillna(), which allows custom values as well as common predefined methods;
  8. Use copy() to obtain a copy of an object, typically when both the original and the copy need to be operated on;
  9. Traverse iterable values with a for loop;
  10. Compute Z-scores with a custom formula;
  11. Identify duplicate records with pandas' duplicated();
  12. Delete duplicate records with pandas' drop_duplicates(), based on specific columns or all columns.