Introduction to data processing
Data processing generally has the following stages:
Collection → Cleaning/Transformation → Integration → Modeling → Visualization
The first three are the preparation stage, and the last two are the analysis stage.
Data analysis task
Time spent by a data engineer (percentages):
- Building training sets: 3%
- Cleaning/organizing data: 60%
- Collecting data: 19%
- Mining data for patterns: 9%
- Other: 9%
Data preparation
Data exploration
Data exploration is the process of building a preliminary understanding of a dataset's attributes, such as its size, types, formats, completeness, distributions, and the relationships between fields.
A number of tools can help you build your own approach to the data: MIT DIVE, and in Python, Matplotlib, Pandas, and NumPy.
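As a minimal sketch of such an exploration pass with Pandas and Matplotlib (the file name mydata.csv is a placeholder, not a file from this article):

```python
import pandas as pd
import matplotlib.pyplot as plt

# load the dataset (file name is a placeholder)
df = pd.read_csv("mydata.csv")

print(df.shape)            # size: number of rows and columns
print(df.dtypes)           # type of each column
print(df.isna().sum())     # completeness: missing values per column
print(df.describe())       # basic distribution statistics

df.hist(figsize=(10, 6))   # distribution of every numeric column
plt.show()
```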
Data Cleaning
Typical difficulties:
- Handle missing values
- Remove values with large deviations (outliers)
- Resolve inconsistencies
- Handle noisy data (noise reduction)
Noise in the data can distort our understanding and analysis, so it should be removed before the analysis begins. Data noise may be caused by:
1. Incorrect sensor placement
2. Different data sources
3. Human or programming error
Missing data means that expected index points are simply absent from the raw data. Common treatments include dropping the affected records or imputing the missing values, as sketched below.
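A minimal sketch of the usual options with Pandas; the tiny DataFrame and its column names are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# toy data with gaps (columns are hypothetical)
df = pd.DataFrame({"temp": [21.0, np.nan, 23.5],
                   "humidity": [40.0, 55.0, np.nan]})

dropped = df.dropna()                            # option 1: drop incomplete rows
filled = df.fillna(df.mean(numeric_only=True))   # option 2: impute with the column mean
interpolated = df.interpolate()                  # option 3: interpolate between neighbours
```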
Outliers in the data are generally found by the following means (a small statistical sketch follows the list):
- Visual exploration.
- Statistical tests.
- Modeling (linear models, covariance estimators, one-class SVM, local outlier factor)
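As a hedged sketch of the statistical route, the common z-score rule flags values more than three standard deviations from the mean; the threshold of 3 is a convention, not something fixed by this article:

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask of values further than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

# hypothetical usage on a numeric column:
# mask = zscore_outliers(df["AT"])
# print(df[mask])
```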
Data inconsistency occurs when values do not match their attribute definitions, or when the encoding or units change partway through the data.
Data Transformation
Methods:
- Normalization
- Aggregation
- Discretization
Normalization rescales values to a common range without changing the relative differences between them. It reduces the (algorithm-dependent) effect that differing feature scales have on an algorithm's ability to learn, ensuring that features are standardized and that all features implicitly carry comparable weight.
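As an illustration, a min-max normalization sketch that rescales chosen columns to the [0, 1] range while preserving the relative differences; the helper and the column list are assumptions, not code from this article:

```python
import pandas as pd

def min_max_normalize(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Rescale the given columns to the [0, 1] range."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out

# hypothetical usage:
# df = min_max_normalize(df, ["AT", "V", "AP", "RH"])
```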
Aggregation combines two or more attributes into one (for example, merging two data columns). It can be done either automatically (for example, via correlation detection) or manually. Aggregation reduces the variability of the data; it operates on attributes rather than on individual values.
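A small sketch of manual aggregation with Pandas, merging two correlated columns into a single attribute; the column names and the averaging rule are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"sensor_a": [20.1, 21.4, 19.8],
                   "sensor_b": [20.3, 21.0, 20.1]})

# check how strongly the two attributes are related
print(df["sensor_a"].corr(df["sensor_b"]))

# merge the two columns into one aggregated attribute and drop the originals
df["sensor_avg"] = df[["sensor_a", "sensor_b"]].mean(axis=1)
df = df.drop(columns=["sensor_a", "sensor_b"])
```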
Use Python for data preparation
Data link: pan.baidu.com/s/1fas7BQXT… Extraction code: 4ZH5
Viewing the data
```python
import pandas as pd
import matplotlib.pyplot as plt

# read the data
df = pd.read_csv("CombinedCyclePowerPlantDirty.csv")

# a few samples
print("data samples")
print(df.head(3).T)

# columns and types
print("columns types")
print(df.dtypes)

print("Full Info")
print(df.info())
```
Using df.info() to inspect the composition of the data, you can see that the column types and non-null counts vary.
Data cleaning
```python
df.V.hist()
plt.show()
print("min ", df.V.min())
print("max ", df.V.max())
print("mean ", df.V.mean())
```
Printing the minimum, maximum, and mean of a column and plotting its histogram shows the distribution of the data, which makes the later cleaning easier.
```python
# keep only rows with plausible AP and RH values
df = df[df.AP > 0]
df = df[df.AP < 100000]
df = df[df.RH > 0]
df = df[df.RH <= 100]

plt.hist(df.RH)
plt.title("RH")
plt.show()
print(df.info())

# keep only PE values that can be parsed as floats, then convert the column
def test(value):
    try:
        float(value)
        return True
    except:
        return False

df = df[df.PE.apply(test)]
df.PE = df.PE.apply(float)
print(df.info())
```
Removing the out-of-range values gives the figure below; the distribution is now more even and easier to read.
You can see that after this pass, info() shows every column except PE aligned at 10,432 rows. The test function is then used to drop the PE values that cannot be converted to float, leaving 9,502 rows.
```python
# the body of the first transformation is truncated in the original source
df.AT = df.AT.apply(lambda x: ...)
df.AT = df.AT.apply(float)
print(df.info())
```
The erroneous values in AT are then cleared; at this point the data is fully cleaned, with 9,497 rows remaining.
View relationships between data
```python
# scatter plot of PE against AT
plt.scatter(df.AT, df.PE)
plt.title("PE vs AT")
plt.show()
```
```python
datasets = {}
for name in df.TCN.unique():
    datasets[name] = df[df.TCN == name]
    plt.scatter(datasets[name].AT, datasets[name].PE, label=name)
plt.legend()
plt.show()
```
The TCN column contains the names of the people who collected the data. A subset is built for each unique name in TCN, and the plot then shows the distribution for each name separately.
```python
# corrector: the AT values attributed to "Daniel Smithson" are converted
# from Fahrenheit to Celsius; all other rows are left unchanged
def corrector(row):
    if row["TCN"] == "Daniel Smithson":
        return (row["AT"] - 32) * 5 / 9
    else:
        return row["AT"]

df.AT = df.apply(corrector, axis=1)

# replot the per-name subsets after the correction
datasets = {}
for name in df.TCN.unique():
    datasets[name] = df[df.TCN == name]
    plt.scatter(datasets[name].AT, datasets[name].PE, label=name)
plt.legend()
plt.show()
```