Introduction to data processing
Data processing generally has the following stages:
Collection → Cleaning/Transformation → Integration → Modeling → Visualization
The first three are the preparation stage, and the last two are the analysis stage.
Data analysis task
Time spent by a data engineer (percentages):
- Building training sets: 3%
- Cleaning/organizing data: 60%
- Collecting data: 19%
- Mining data for patterns: 9%
- Other: 9%
Data preparation
Data exploration
Data exploration is the process of building a preliminary understanding of a dataset's attributes, such as its size, types, formats, completeness, distributions, and the relationships between fields.
A number of tools can help you build your own approach to the data: MIT DIVE, and in Python, Matplotlib, Pandas, and NumPy.
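As a minimal sketch of such an exploration pass with Pandas and Matplotlib (the file name mydata.csv is a placeholder, not a file from this article):

```python
import pandas as pd
import matplotlib.pyplot as plt

# load the dataset (file name is a placeholder)
df = pd.read_csv("mydata.csv")

print(df.shape)            # size: number of rows and columns
print(df.dtypes)           # type of each column
print(df.isna().sum())     # completeness: missing values per column
print(df.describe())       # basic distribution statistics

df.hist(figsize=(10, 6))   # distribution of every numeric column
plt.show()
```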
Data Cleaning
Typical difficulties:
- Handle missing values
- Remove values with large deviations (outliers)
- Resolve inconsistencies
- Handle noisy data (noise reduction)
Noise in the data can distort our understanding and analysis, so it should be removed before the analysis begins. Data noise may be caused by:
1. Incorrect sensor placement
2. Different data sources
3. Human or programming error
Missing data means that expected index points are simply absent from the raw data. Common treatments include dropping the affected records or imputing the missing values, as sketched below.
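A minimal sketch of the usual options with Pandas; the tiny DataFrame and its column names are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# toy data with gaps (columns are hypothetical)
df = pd.DataFrame({"temp": [21.0, np.nan, 23.5],
                   "humidity": [40.0, 55.0, np.nan]})

dropped = df.dropna()                            # option 1: drop incomplete rows
filled = df.fillna(df.mean(numeric_only=True))   # option 2: impute with the column mean
interpolated = df.interpolate()                  # option 3: interpolate between neighbours
```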
Outliers in the data are generally found by the following means (a small statistical sketch follows the list):
- Visual exploration.
- Statistical tests.
- Modeling (linear models, covariance estimators, one-class SVM, local outlier factor)
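As a hedged sketch of the statistical route, the common z-score rule flags values more than three standard deviations from the mean; the threshold of 3 is a convention, not something fixed by this article:

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask of values further than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

# hypothetical usage on a numeric column:
# mask = zscore_outliers(df["AT"])
# print(df[mask])
```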
Data inconsistency occurs when values do not match their attribute definitions, or when the encoding or units change partway through the data.
Data Transformation
Methods:
- Normalization
- Aggregation
- Discretization
Normalization rescales values to a common range without changing the relative differences between them. It reduces the (algorithm-dependent) effect that differing feature scales have on an algorithm's ability to learn, ensuring that features are standardized and that all features implicitly carry comparable weight.
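As an illustration, a min-max normalization sketch that rescales chosen columns to the [0, 1] range while preserving the relative differences; the helper and the column list are assumptions, not code from this article:

```python
import pandas as pd

def min_max_normalize(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Rescale the given columns to the [0, 1] range."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out

# hypothetical usage:
# df = min_max_normalize(df, ["AT", "V", "AP", "RH"])
```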
Aggregation combines two or more attributes into one (for example, merging two data columns). It can be done either automatically (for example, via correlation detection) or manually. Aggregation reduces the variability of the data; it operates on attributes rather than on individual values.
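A small sketch of manual aggregation with Pandas, merging two correlated columns into a single attribute; the column names and the averaging rule are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"sensor_a": [20.1, 21.4, 19.8],
                   "sensor_b": [20.3, 21.0, 20.1]})

# check how strongly the two attributes are related
print(df["sensor_a"].corr(df["sensor_b"]))

# merge the two columns into one aggregated attribute and drop the originals
df["sensor_avg"] = df[["sensor_a", "sensor_b"]].mean(axis=1)
df = df.drop(columns=["sensor_a", "sensor_b"])
```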
Use Python for data preparation
Data link: pan.baidu.com/s/1fas7BQXT… Extraction code: 4ZH5
Viewing the data
```python
import pandas as pd
import matplotlib.pyplot as plt

# read the data
df = pd.read_csv("CombinedCyclePowerPlantDirty.csv")

# a few samples
print("data samples")
print(df.head(3).T)

# columns and types
print("columns types")
print(df.dtypes)

print("Full Info")
print(df.info())
```
Using df.info() to inspect the composition of the data, you can see that the column types and non-null counts vary.
Data cleaning
```python
df.V.hist()
plt.show()
print("min ", df.V.min())
print("max ", df.V.max())
print("mean ", df.V.mean())
```
Printing the minimum, maximum, and mean of a column and plotting its histogram shows the distribution of the data, which makes the later cleaning easier.
```python
# keep only rows with plausible AP and RH values
df = df[df.AP > 0]
df = df[df.AP < 100000]
df = df[df.RH > 0]
df = df[df.RH <= 100]

plt.hist(df.RH)
plt.title("RH")
plt.show()
print(df.info())

# keep only PE values that can be parsed as floats, then convert the column
def test(value):
    try:
        float(value)
        return True
    except:
        return False

df = df[df.PE.apply(test)]
df.PE = df.PE.apply(float)
print(df.info())
```
Removing the out-of-range values gives the figure below; the distribution is now more even and easier to read.
You can see that after this pass, info() shows every column except PE aligned at 10,432 rows. The test function is then used to drop the PE values that cannot be converted to float, leaving 9,502 rows.
```python
# the body of the first transformation is truncated in the original source
df.AT = df.AT.apply(lambda x: ...)
df.AT = df.AT.apply(float)
print(df.info())
```
The erroneous values in AT are then cleared; at this point the data is fully cleaned, with 9,497 rows remaining.
View relationships between data
```python
# scatter plot of PE against AT
plt.scatter(df.AT, df.PE)
plt.title("PE vs AT")
plt.show()
```
```python
datasets = {}
for name in df.TCN.unique():
    datasets[name] = df[df.TCN == name]
    plt.scatter(datasets[name].AT, datasets[name].PE, label=name)
plt.legend()
plt.show()
```
The TCN column contains the names of the people who collected the data. A subset is built for each unique name in TCN, and the plot then shows the distribution for each name separately.
```python
# corrector: the AT values attributed to "Daniel Smithson" are converted
# from Fahrenheit to Celsius; all other rows are left unchanged
def corrector(row):
    if row["TCN"] == "Daniel Smithson":
        return (row["AT"] - 32) * 5 / 9
    else:
        return row["AT"]

df.AT = df.apply(corrector, axis=1)

# replot the per-name subsets after the correction
datasets = {}
for name in df.TCN.unique():
    datasets[name] = df[df.TCN == name]
    plt.scatter(datasets[name].AT, datasets[name].PE, label=name)
plt.legend()
plt.show()
```