Data cleaning mainly deals with missing values, outliers, and duplicate values. Cleaning refers to discarding, filling, replacing, and deduplicating data in a data set, in order to remove anomalies, correct errors, and make up for missing values.

1. How to handle missing values in data columns

Missing data falls into two types: one is the loss of entire row records, also known as data record loss; the other is the absence of a column value within a data record, that is, a field is left empty. Missing values are represented differently depending on the environment: Null in a database, None returned by Python, and NaN in Pandas or Numpy.

Missing data records usually cannot be recovered. Here we focus on handling missing column values, for which there are usually four approaches:

1.1 Discarding

This method is straightforward: delete row records containing missing values (whole-row deletion) or column fields (whole-column deletion) to reduce the impact of incomplete records on the overall data. However, discarding reduces the data's features, and the method is not suitable in either of the following scenarios:

  • Incomplete records make up a large proportion of the data set, for example more than 10%. Deleting them would lose too much useful information.
  • Records with missing values carry an obvious distribution pattern or feature, for example when their target labels (the Label variable in classification) are concentrated in one or a few classes. Deleting these records would lose a large amount of feature information for those classes, leading to inaccurate model fitting or misclassification.

1.2 Completion

Compared with discarding, completion is the more commonly used way to handle missing values. Filling in missing data in a reasonable way to form complete records matters greatly for subsequent processing, analysis, and modeling. Commonly used completion methods include:

  • Statistical methods: for numerical data, fill with the mean, weighted mean, median, and so on; for categorical data, fill with the mode (the most frequent category).
  • Model methods: in most cases the missing field can be predicted from the other existing fields, treating it as the target variable, to obtain the most probable completion value (see the sketch after this list). If the column with missing values is numerical, use a regression model; if it is categorical, use a classification model.
  • Expert completion: for small amounts of highly significant data, completion by domain experts is also an important approach.
  • Other methods: random filling, special-value filling, multiple imputation, and so on.
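
Below is a minimal sketch of model-based completion on a small hypothetical data frame: the column with missing values is treated as the regression target and predicted from the other columns. The column names and values are illustrative only.

# A minimal sketch of model-based completion (hypothetical data): the column
# with missing values is treated as the target of a regression model
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_demo = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                        'x2': [3, 1, 4, 2, 5],
                        'y': [1.1, 2.0, np.nan, 4.2, 5.1]})
known = df_demo[df_demo['y'].notnull()]   # rows where the target is present
unknown = df_demo[df_demo['y'].isnull()]  # rows whose 'y' needs completion
model = LinearRegression().fit(known[['x1', 'x2']], known['y'])
df_demo.loc[df_demo['y'].isnull(), 'y'] = model.predict(unknown[['x1', 'x2']])
print(df_demo)  # the NaN in 'y' is replaced by the model's prediction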

1.3 Truth-value conversion

In some cases we may not know the distribution of the missing values and cannot use any of the methods above; or we may believe the missingness itself carries a pattern and should not be tampered with arbitrarily. In that case there is another way to handle missing values: truth-value conversion.

The basic idea is to acknowledge the missing values and treat missing data as part of the data's distribution, taking both the actual values and the missing values of a variable as input dimensions for subsequent processing and modeling. However, while actual values can participate in model computation directly, missing values usually cannot, so they need to be converted into truth values.

Take a user gender field as an example. Many data sets cannot complete the gender of their members, yet the field is too valuable to discard. So we convert it: the values male, female, and unknown are transformed from multiple value states of one variable into truth-value states of multiple variables.

  • Before conversion: gender (range: male, female, unknown).
  • After conversion: gender_male (range: 1 or 0), gender_female (range: 1 or 0), gender_unknown (range: 1 or 0). These three new fields then replace the original field as input dimensions in subsequent model calculations, as the sketch below shows.
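
A minimal sketch of this conversion using pandas one-hot encoding (pd.get_dummies); the column and value names follow the gender example above.

# Convert one multi-valued variable into multiple binary (truth-value) variables
import pandas as pd

df_user = pd.DataFrame({'gender': ['male', 'female', 'unknown', 'male']})
dummies = pd.get_dummies(df_user['gender'], prefix='gender')  # one binary column per value
df_user = pd.concat([df_user.drop(columns='gender'), dummies], axis=1)
print(df_user)  # columns: gender_female, gender_male, gender_unknown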

1.4 No processing

Doing nothing about records with missing values in the preprocessing stage is also an option. This choice mainly depends on the analysis and modeling applications that follow: many models tolerate missing values or handle them flexibly, so the values can be left alone during preprocessing. Common models that handle missing values automatically include KNN, decision trees and random forests, neural networks, naive Bayes, and DBSCAN (density-based spatial clustering of applications with noise). These models deal with missing values as follows:

  • Ignore them: missing values do not participate in distance calculations, as in KNN.
  • Treat missing values as one state of the distribution and let them participate in the modeling process, as in various decision trees and their variants.
  • Do not compute based on distances between values at all, so the effect missing values would have on distance calculations disappears, as in DBSCAN. A sketch of a tolerant model follows this list.
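
As one concrete example (not among the models listed above): recent versions of scikit-learn ship HistGradientBoostingClassifier, a tree-based model that accepts NaN inputs natively, so no imputation is needed beforehand. A minimal sketch, assuming scikit-learn 1.0 or later:

# A model that tolerates missing values: NaN is routed at tree split points
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [3.0, 1.0]])
y = np.array([0, 1, 0, 1])
clf = HistGradientBoostingClassifier().fit(X, y)  # no NaN preprocessing needed
print(clf.predict(X))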

2. Do not discard abnormal data lightly

Abnormal data is a regular feature of data distributions: data falling outside a specific distribution region or range is usually defined as anomalous, or "noise". Data noise arises for many reasons, such as business operations, data collection problems, and data synchronization problems. Before processing abnormal data, you need to identify which anomalies are real. There are two kinds:

  • "Pseudo anomalies": these are caused by specific business operations and reflect a normal business state rather than an abnormality in the data itself.
  • "True anomalies": these are not caused by a specific business action but objectively reflect abnormal distribution in the data itself.

In most data mining work, outliers are treated as noise and removed during preprocessing to avoid skewing the overall evaluation, analysis, and mining. However, in the following cases discarding outliers is unnecessary or harmful.

2.1 Outliers reflect business operation results

This scenario is a data distribution anomaly caused by a specific action of the business department; if the outliers were discarded, the business results could not be reported accurately.

For example, company product A normally sells about 1,000 units per day. Because of a promotional campaign yesterday, total sales reached 10,000 units; then, because of a stock shortage in the back end, sales dropped to 100. Both the 10,000 and the 100 correctly reflect the outcome of business operations and are not data anomalies.

2.2 Anomaly detection model

An anomaly detection model analyzes and mines the abnormal data within the whole sample in order to find unusual cases and patterns. This kind of application is built on outliers, so the outliers cannot be discarded.

Anomaly detection models are commonly used for abnormal customer identification, credit card fraud, loan approval screening, drug variation identification, severe weather prediction, network intrusion detection, traffic cheating detection, and so on. In these cases the anomalous data itself is the target, and preprocessing it away would lose critical information.

2.3 Data modeling that includes outliers

If the algorithm or model is insensitive to outliers, the model will not be adversely affected even if outliers are left untouched. In a decision tree, for example, outliers can themselves serve as split nodes.

Besides discarding and keeping, there is a third way to handle outliers: replacing them with other statistics or model-predicted values. This method is not recommended, however, because it removes key distribution features and changes the distribution of the original data set.

3. Does duplicate data always need to be deduplicated?

Duplicate values in the dataset include the following two cases:

  • Multiple data records whose values are completely identical. This is the most common case of duplication.
  • Records describing the same entity but matched to different values of a supposedly unique attribute. This is most often seen in the changing dimension tables of data warehouses, where the same fact-table entity matches multiple values of one attribute.

Deduplication is the main way to handle duplicate values; the goal is to keep a single, representative, unique record. However, in the following situations deduplication should be avoided or performed only with great caution.

3.1 Duplicate records are used to analyze evolution patterns

Take changing dimension tables as an example. In a commodity category dimension table, each commodity should correspond to a unique category value; for instance, the Apple iPhone7 belongs to personal electronic consumer goods, so every commodity can be assigned one unique category value. But when the category hierarchy is refactored or upgraded, which most companies do as they grow, the same commodity may be assigned a different category value.

Now, when we use a Full Join to match categories across the refactoring point, we find that Apple iPhone7 matches both the personal electronic consumer goods record and the mobile digital record (as the sketch after this list illustrates). In this case, handle the duplicates according to the specific business requirement:

  • If, after checking with the business side, the two records need to be consolidated, determine an integration field to cover both records, remapping the two pieces of data onto one category body.
  • If, after checking with the business side, both records need to be kept, then do nothing here; subsequent handling depends on the modeling requirements.
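
A hypothetical pandas illustration of the iPhone7 situation: after the category refactor the dimension table holds two category values for the same commodity, so a full join produces two matched records.

# One fact record matches two dimension records after the category refactor
import pandas as pd

facts = pd.DataFrame({'commodity': ['Apple iPhone7'], 'sales': [1000]})
dim = pd.DataFrame({'commodity': ['Apple iPhone7', 'Apple iPhone7'],
                    'category': ['personal electronic consumer goods',
                                 'mobile digital']})
matched = facts.merge(dim, on='commodity', how='outer')  # full join
print(matched)  # the same sales record appears once per category value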

Changing dimension tables: a changing dimension table is a data warehouse concept. A dimension table is similar to a mapping table and stores dimensions, attributes, and other data that generally does not change. But change and immutability are relative: as an enterprise develops, dimensions will change in many cases, so a dimension may be constant at a given point in time yet change over the long run. There are three ways to handle dimensional changes:

  • Overwrite the original value directly. Each unique ID then corresponds to only one attribute value; this is simple to implement but preserves no history.
  • Add a new dimension row. The same ID will then match two records.
  • Add a new attribute column. No new rows are added; instead a new column on the original record marks the values of different periods. Which method to use within an enterprise is usually decided by the database administrator.

3.2 Duplicate records are used to handle sample imbalance

When building classification models, sample imbalance is one of the key factors affecting model quality. One way to address it is to oversample the minority class: random oversampling increases the number of minority samples by simply copying them, as the sketch below shows. After this step the data set will contain multiple copies of the same record, and at this point we must not deduplicate them.
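
A minimal sketch of random oversampling on a hypothetical binary-labeled data frame; the duplicated minority rows are intentional and must survive any later cleaning.

# Duplicate minority-class rows by sampling with replacement
import pandas as pd

df_clf = pd.DataFrame({'feature': [1, 2, 3, 4, 5, 6],
                       'label': [0, 0, 0, 0, 1, 1]})
minority = df_clf[df_clf['label'] == 1]
extra = minority.sample(n=2, replace=True, random_state=0)  # copied records
balanced = pd.concat([df_clf, extra], ignore_index=True)
print(balanced['label'].value_counts())  # classes are now balanced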

3.3 Duplicate records are used to detect business rule problems

For data sets used mainly for analysis, duplicate records do not directly affect actual operations; after all, the data is only being analyzed. For transactional data, however, duplicates can signal serious operational rule problems, especially when they occur in money-related business scenarios such as duplicate orders, duplicate recharges, duplicate reservations, or duplicate outbound requests.

Such duplicate records usually result from inadequate data collection, storage, validation, and audit mechanisms, and they show up directly in the front-end production and operations systems. Take duplicate orders as an example: if the front-end order submission function enforces no uniqueness constraint, clicking the submit button repeatedly creates duplicate order records. If those orders then pass approval, they drive downstream operations such as picking, outbound processing, and delivery. If the user receives the duplicate goods, the company takes a heavy loss; if the user returns them, reverse orders increase and every related logistics, distribution, and warehousing step is affected, wasting operational resources and raising goods consumption, warehousing, and logistics costs.

Therefore, such problems must be prevented by proper mechanisms at data collection and storage time. If they do occur, data or operations staff can use the duplicate values to identify rule vulnerabilities and work with the relevant parties to minimize the operational risk to the enterprise, as sketched below.
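
A hypothetical sketch of using duplicate values to surface a rule vulnerability: flag orders where the same user submitted the same product and amount within the same minute. The field names and the "same minute" criterion are illustrative assumptions.

# Flag suspicious duplicate orders; keep=False marks every copy as a duplicate
import pandas as pd

orders = pd.DataFrame({'user_id': [101, 101, 102],
                       'product': ['A', 'A', 'B'],
                       'amount': [99.0, 99.0, 59.0],
                       'minute': ['10:00', '10:00', '10:05']})
dup_mask = orders.duplicated(subset=['user_id', 'product', 'amount', 'minute'],
                             keep=False)
print(orders[dup_mask])  # both copies of the repeated order are flagged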

4. Handling missing values in Python

Missing values are handled here with the Imputer class (sklearn.preprocessing), Pandas, and Numpy. Because Pandas has good support for data exploration and analysis, missing value handling is most commonly done with Pandas.

4.1 Importing the Pandas, Numpy, and sklearn libraries

import pandas as pd  # import the pandas library
import numpy as np  # import the numpy library
from sklearn.preprocessing import Imputer  # import the Imputer class from sklearn.preprocessing

4.2 Generating Missing Data

Create a data frame with 6 rows and 4 columns named 'col1', 'col2', 'col3', and 'col4', and add two missing values to it. Besides creating a data frame directly with pd.DataFrame as in the example, you can use the data frame methods pd.DataFrame.from_records, pd.DataFrame.from_dict, and pd.DataFrame.from_items to create one from tuple records, dictionaries, and key-value pairs, or use pandas.read_csv, pandas.read_table, and pandas.read_clipboard to create one from files or the clipboard.

# Generate data with missing values
df = pd.DataFrame(np.random.randn(6, 4),
                  columns=['col1', 'col2', 'col3', 'col4'])  # generate a data set
df.iloc[1:2, 1] = np.nan  # add a missing value
df.iloc[4, 3] = np.nan  # add a missing value
print(df)

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
1  0.723414       NaN -0.664368 -0.740117
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
4  0.094032  0.643247  0.259631       NaN
5  0.918172 -0.146227 -1.087238  0.376928

4.3 Finding missing data

Use the df.isnull() method to find missing values in the data frame (by default missing values are in NaN format), then use the any() or all() methods to find columns with at least one, or with all, missing values: any() returns True if any element along the given axis is True, and all() returns True only if all elements along the given axis are True.

(1) Determine which elements are missing

# See which values are missing
nan_all = df.isnull()  # get all NaN locations in the data frame
print(nan_all)  # print output

The result:

    col1   col2   col3   col4
0  False  False  False  False
1  False   True  False  False
2  False  False  False  False
3  False  False  False  False
4  False  False  False   True
5  False  False  False  False

(2) List the columns containing at least one missing value

# See which columns have missing values
nan_col1 = df.isnull().any()  # get columns containing at least one NA
print(nan_col1)  # print output

The result:

col1    False
col2    True
col3    False
col4    True
dtype:  bool

(3) List the columns in which all values are missing

# See which columns are entirely missing
nan_col2 = df.isnull().all()  # get columns where all values are NA
print(nan_col2)  # print output

The result:

col1    False
col2    False
col3    False
col4    False
dtype:  bool

4.4 Discarding missing values

Missing values are discarded with the Pandas dropna() method, which returns only the records without missing values. The code snippet below returns the following result (the rows with index 1 and 4 are deleted):

# Discard missing values
df2 = df.dropna()  # drop rows containing NA directly
print(df2)  # print output

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
5  0.918172 -0.146227 -1.087238  0.376928

4.5 Replacing missing values with the mean

Here missing values are handled with sklearn's preprocessing tools. First, a preprocessing object is created with the Imputer class, where missing_values is the string that marks missing values (default 'NaN').

In the example the missing values are replaced by the mean (the default); you can also replace with the median or the mode by setting strategy to 'median' or 'most_frequent'. The axis argument sets the axis along which to compute, defaulting to 0, which computes along columns. The data frame df is then processed with the preprocessing object's fit_transform method, which combines fit and transform in one step.

# Use sklearn to replace missing values with a computed value
nan_model = Imputer(missing_values='NaN', strategy='mean',
                    axis=0)  # build the rule: replace NaN with the column mean
nan_result = nan_model.fit_transform(df)  # apply the rule to df
print(nan_result)  # print output

The result:

[[ 1.41136738 -0.36790328  2.61005414 -0.59482946]
 [ 0.72341364  0.12179985 -0.66436841 -0.7401174 ]
 [-0.36736481 -0.17396777 -0.73227893  0.10299754]
 [ 1.81116379  0.65385047  0.60561066  1.79355171]
 [ 0.0940324   0.64324712  0.25963105  0.18770606]
 [ 0.91817218 -0.14622709 -1.08723833  0.37692791]]

The value at row 2, column 2 and the value at row 5, column 4 have been replaced by the means of their respective columns. To verify this, we computed the column means manually with df['col2'].mean() and df['col4'].mean(), obtaining approximately 0.121800 and 0.187706, the same values sklearn produced.
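
Note that Imputer was removed from scikit-learn in version 0.22; in recent versions the equivalent operation uses SimpleImputer from sklearn.impute, which always imputes column-wise and has no axis parameter. A minimal equivalent sketch:

from sklearn.impute import SimpleImputer  # scikit-learn >= 0.22
import numpy as np

nan_model = SimpleImputer(missing_values=np.nan, strategy='mean')
nan_result = nan_model.fit_transform(df)  # replace NaN with column means
print(nan_result)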

4.6 Replacing missing values with specific values

Pandas handles missing values with df.fillna(). Its two main arguments are value and method: the former replaces missing values with a fixed (manually specified) value, while the latter replaces them using a method built into Pandas. The following values of method are supported:

  • pad and ffill: fill the missing value with the previous valid value.
  • backfill and bfill: fill the missing value with the next valid value.

(1) Replace the missing value with the next value

# Replace missing values using pandas
nan_result_pd1 = df.fillna(method='backfill')  # fill with the next valid value
print(nan_result_pd1)  # print output

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
1  0.723414 -0.173968 -0.664368 -0.740117
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
4  0.094032  0.643247  0.259631  0.376928
5  0.918172 -0.146227 -1.087238  0.376928

(2) Replace the missing value with the next value, limiting each column to at most one filled value

# Replace missing values using pandas
nan_result_pd2 = df.fillna(method='bfill', limit=1)  # fill with the next valid value, at most one per column
print(nan_result_pd2)  # print output

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
1  0.723414 -0.173968 -0.664368 -0.740117
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
4  0.094032  0.643247  0.259631  0.376928
5  0.918172 -0.146227 -1.087238  0.376928

(3) Replace the missing value with the previous value

# Replace missing values using pandas
nan_result_pd3 = df.fillna(method='pad')  # fill with the previous valid value
print(nan_result_pd3)  # print output

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
1  0.723414 -0.367903 -0.664368 -0.740117
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
4  0.094032  0.643247  0.259631  1.793552
5  0.918172 -0.146227 -1.087238  0.376928

(4) Replace the missing value with 0

# Replace missing values using pandas
nan_result_pd4 = df.fillna(0)  # replace missing values with 0
print(nan_result_pd4)  # print output

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
1  0.723414  0.000000 -0.664368 -0.740117
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
4  0.094032  0.643247  0.259631  0.000000
5  0.918172 -0.146227 -1.087238  0.376928

(5) Replace the missing values of different columns with different values

# Replace missing values using pandas
nan_result_pd5 = df.fillna({'col2': 1.1, 'col4': 1.2})  # fill different columns with different values
print(nan_result_pd5)  # print output

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
1  0.723414  1.100000 -0.664368 -0.740117
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
4  0.094032  0.643247  0.259631  1.200000
5  0.918172 -0.146227 -1.087238  0.376928

(6) Replace the missing value with the mean value of each column

# Replace missing values using pandas
nan_result_pd6 = df.fillna(df.mean()['col2':'col4'])  # fill with the means of columns col2 through col4
print(nan_result_pd6)  # print output

The result:

       col1      col2      col3      col4
0  1.411367 -0.367903  2.610054 -0.594829
1  0.723414  0.121800 -0.664368 -0.740117
2 -0.367365 -0.173968 -0.732279  0.102998
3  1.811164  0.653850  0.605611  1.793552
4  0.094032  0.643247  0.259631  0.187706
5  0.918172 -0.146227 -1.087238  0.376928

5. Handling outliers in Python

There are many rules and methods for identifying outliers. Here the threshold from Z-score standardization is used as the criterion: a value whose standardized score exceeds the threshold is considered an outlier. The complete code follows.

5.1 Importing the Pandas Library

import pandas as pd  # import the pandas library

5.2 Generating Outlier Data

Create a data frame with 7 rows and 2 columns directly with pd.DataFrame.

# Generate data containing outliers
df = pd.DataFrame({'col1': [1, 120, 3, 5, 2, 12, 13],
                   'col2': [12, 17, 31, 53, 22, 32, 43]})
print(df)  # print output

The result:

   col1  col2
0     1    12
1   120    17
2     3    31
3     5    53
4     2    22
5    12    32
6    13    43

5.3 Outlier determination process

First, the original data frame is copied with df.copy() to store the standardized Z-scores, and the column names are obtained with df.columns. Then a loop judges the outliers in each column: a custom calculation produces the Z-score for every value in the column, and the absolute score is compared against the threshold 2.2. If it exceeds the threshold, the value is considered an outlier.

Why is the threshold 2.2? The topic of data standardization will be discussed later.

# Use the Z-score method to identify outliers
df_zscore = df.copy()  # copy a data frame to store the Z-score results
cols = df.columns  # get the column names of the data frame
for col in cols:  # loop over each column
    df_col = df[col]  # get the values of the column
    z_score = (df_col - df_col.mean()) / df_col.std()  # calculate the column's Z-scores
    df_zscore[col] = z_score.abs() > 2.2  # True if |Z-score| > 2.2, otherwise False
print(df_zscore)  # print output

The result:

    col1   col2
0  False  False
1   True  False
2  False  False
3  False  False
4  False  False
5  False  False
6  False  False

5.4 Deleting outliers

# Delete the rows where outliers reside
df_drop_outlier = df[df_zscore['col1'] == False]  # keep rows whose col1 value is not an outlier
print(df_drop_outlier)

The result:

   col1  col2
0     1    12
2     3    31
3     5    53
4     2    22
5    12    32
6    13    43

In the process above, the key question is how to define an outlier. Where fixed business rules exist, they can be applied directly; where they do not, common mathematical models can be used: methods based on probability distributions (for example, a standard deviation range of the normal distribution), on clustering (for example, KMeans), on density (for example, LOF), on classification (for example, KNN), or on statistics (for example, the quantile method; see the sketch below). The definition of an outlier therefore involves strong subjective judgment and should be chosen according to the actual situation.
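
As a minimal sketch of the quantile method mentioned above, applied to the same df from 5.2; the 1.5 * IQR multiplier is a common convention, assumed here rather than taken from the text.

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
q1, q3 = df['col1'].quantile(0.25), df['col1'].quantile(0.75)
iqr = q3 - q1  # interquartile range
outliers = df[(df['col1'] < q1 - 1.5 * iqr) | (df['col1'] > q3 + 1.5 * iqr)]
print(outliers)  # row 1 (value 120) is flagged, as with the Z-score method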

6. Handling duplicate values in Python

6.1 Importing the Pandas Library

import pandas as pd  # import the pandas library

6.2 Generating Duplicate Data

The data is a data frame with 4 rows and 2 columns.

# Generate duplicate data
data1 = ['a', 3]
data2 = ['b', 2]
data3 = ['a', 3]
data4 = ['c', 2]
df = pd.DataFrame([data1, data2, data3, data4], columns=['col1', 'col2'])
print(df)

The result:

   col1  col2
0    a     3
1    b     2
2    a     3
3    c     2

6.3 Checking whether data records are duplicated

df.duplicated() returns, for each data record, whether it is a duplicate (True or False). Its two main parameters are subset and keep:

  • subset: the column(s) on which to judge duplication; a specific column or several columns can be given. By default all columns are used.
  • keep: the rule for which occurrence is not marked True. It can be set to 'first', 'last', or False (mark all duplicates True). The default is 'first', meaning the first occurrence is not marked True.

# Judge duplicate data
isDuplicated = df.duplicated()  # judge duplicate data records
print(isDuplicated)  # print output

The result:

0    False
1    False
2     True
3    False
dtype: bool

6.4 Deleting duplicate values

The core method is df.drop_duplicates(), which deletes duplicated records after identifying them according to the specified rules. Its parameters are the same as those of df.duplicated(). The examples below deduplicate by, respectively, the default rule (all column values identical), identical col1 values, identical col2 values, and identical values in both col1 and col2.

(1) Delete records in which all columns have the same value

# Delete duplicate values
new_df1 = df.drop_duplicates()  # delete records in which all column values are the same
print(new_df1)  # print output

The result:

  col1  col2
0    a     3
1    b     2
3    c     2

(2) Delete records with the same col1 value

# Delete duplicate values
new_df2 = df.drop_duplicates(['col1'])  # delete records with the same col1 value
print(new_df2)  # print output

The result:

  col1  col2
0    a     3
1    b     2
3    c     2

(3) Delete records with the same col2 value

# Delete duplicate values
new_df3 = df.drop_duplicates(['col2'])  # delete records with the same col2 value
print(new_df3)  # print output

The result:

  col1  col2
0    a     3
1    b     2

(4) Delete records with the same values in the specified columns (col1 and col2)

# Delete duplicate values
new_df4 = df.drop_duplicates(['col1', 'col2'])  # delete records with the same values in col1 and col2
print(new_df4)  # print output

The result:

  col1  col2
0    a     3
1    b     2
3    c     2