I have recently been learning machine learning, and it is difficult to get started. While studying Andrew Ng's videos, I found many concepts hard to understand. Not long ago a Chinese translation of "Machine Learning A-Z" came out, and the instructor explains things very clearly, so I started learning from that course.
To organise what I have learned more systematically, and as a form of self-supervision, I am writing up the main points on this blog. The corresponding code is also synced to GitHub.
All of the code below is written in Python; data preprocessing mainly uses the sklearn.preprocessing module [sklearn.apachecn.org/cn/0.19.0/m…].
Contents
In this part, I will focus on data preprocessing.
1. Import the standard library
- numpy: provides many of the mathematical tools used in machine learning
- matplotlib.pyplot: mainly used for plotting
- pandas: imports the dataset and performs a series of operations on it
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
2. Import the data set
iloc takes row selectors to the left of the comma and column selectors to the right; a bare colon selects all rows or columns, and :-1 selects every column except the last.
# Import the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,3].values
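To make the slicing concrete, here is a minimal sketch with a hand-made frame standing in for Data.csv (the column names and values are hypothetical):

```python
import pandas as pd

# A tiny illustrative frame standing in for Data.csv (hypothetical values)
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44, 27, 30],
    'Salary': [72000, 48000, 54000],
    'Purchased': ['No', 'Yes', 'No'],
})

X = df.iloc[:, :-1].values   # all rows, every column except the last
y = df.iloc[:, 3].values     # all rows, column index 3 only

print(X.shape)  # (3, 3)
print(y)        # ['No' 'Yes' 'No']
```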
3. Missing data
The Imputer class handles missing data. Parameter axis:
- axis = 0: use the mean of each column
- axis = 1: use the mean of each row
Parameter strategy: string, optional (default="mean"), the imputation strategy:
- "mean": replace missing values with the mean along the axis.
- "median": replace missing values with the median along the axis.
- "most_frequent": replace missing values with the most frequent value along the axis.
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN",strategy="mean",axis=0)
imputer.fit(X[:,1:3]) # the slice 1:3 selects columns 1 and 2
X[:,1:3] = imputer.transform(X[:,1:3])
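What the "mean" strategy with axis=0 computes can be sketched with plain NumPy (the numbers below are made up; note that in sklearn 0.20+ the Imputer class was replaced by sklearn.impute.SimpleImputer):

```python
import numpy as np

# Columns 1 and 2 (Age, Salary) with one NaN each -- hypothetical values
X = np.array([[44.0, 72000.0],
              [np.nan, 48000.0],
              [30.0, np.nan]])

# strategy="mean", axis=0: per-column means, ignoring the NaN entries
col_means = np.nanmean(X, axis=0)

# Replace each NaN with its column's mean
filled = np.where(np.isnan(X), col_means, X)
print(filled[1, 0])  # 37.0 (mean of 44 and 30)
print(filled[2, 1])  # 60000.0 (mean of 72000 and 48000)
```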
4. Categorizing data
4.1. Label encoding
Function: converts text categories to numbers
Drawback: the original country labels carry no numerical order, but after converting them to the values 0, 1, 2 the classes have different magnitudes, so the encoding implies an ordering between classes that does not exist.
Solution: one-hot encoding (dummy coding)
4.2. Dummy Coding
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:,0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# Encode labels with value between 0 and n_classes-1
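In newer sklearn versions the categorical_features parameter has been removed from OneHotEncoder, but the idea of dummy coding itself can be sketched with pandas.get_dummies (the country values here are hypothetical): each category becomes its own 0/1 column, so no ordering is implied.

```python
import pandas as pd

countries = pd.Series(['France', 'Spain', 'Germany', 'Spain'])

# One column per category, with a 1 where the row matches that category
dummies = pd.get_dummies(countries)
print(list(dummies.columns))          # ['France', 'Germany', 'Spain']
print(dummies.values.sum(axis=1))     # each row has exactly one 1
```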
5. Divide the data set into training set and test set
- test_size: a float between 0 and 1; the default is 0.25, and 0.2 or 0.25 are generally good choices
- random_state: seeds the random number generator, making the split reproducible
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
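What test_size and random_state control can be sketched by hand with NumPy; this is only an illustration of the idea (shuffle, then cut), not how train_test_split is implemented internally:

```python
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.arange(10)

# Shuffle the sample indices; fixing the seed (random_state) makes it reproducible
rng = np.random.RandomState(0)
idx = rng.permutation(len(X))

n_test = int(len(X) * 0.2)        # test_size = 0.2 -> 2 of 10 samples
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(len(X_train), len(X_test))  # 8 2
```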
6. Feature scaling
Why feature scale the data?
Many machine learning algorithms rely on Euclidean distance, the length of the line segment between two points.
When features live on very different orders of magnitude, the larger one dominates the distance: without feature scaling, Age contributes almost nothing next to Salary.
To solve this problem, we scale Age and Salary to the same order of magnitude.
6.1. Standardisation
After standardisation the data has mean 0 and variance 1. It is commonly used with support vector machines, logistic regression, and neural networks.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
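A quick NumPy sketch (with made-up ages) confirming that standardisation really does produce mean 0 and standard deviation 1:

```python
import numpy as np

ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0])

# Standardisation: subtract the mean, divide by the standard deviation
scaled = (ages - ages.mean()) / ages.std()

print(scaled.mean())  # essentially 0 (up to floating-point error)
print(scaled.std())   # 1.0
```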
6.2. Normalisation
Maps all values into the range 0 to 1.
import sklearn.preprocessing as sp
mms = sp.MinMaxScaler(feature_range=(0, 1))
mms_samples = mms.fit_transform(raw_samples)
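The same mapping written out by hand with NumPy (made-up salaries), to show where the 0-to-1 range comes from:

```python
import numpy as np

salaries = np.array([72000.0, 48000.0, 54000.0, 61000.0])

# Min-max normalisation: the smallest value maps to 0, the largest to 1
normalised = (salaries - salaries.min()) / (salaries.max() - salaries.min())

print(normalised.min(), normalised.max())  # 0.0 1.0
```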
7. Data pre-processing template
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,3].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:,0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
8. Questions
8.1. What is the difference between fit, fit_transform, and transform?
- fit(): computes the statistics of the training set X -- mean, variance, maximum, minimum -- which are properties of the training data; it can be understood as the training step.
- transform(): applies standardisation, dimensionality reduction, normalisation, and similar operations using the statistics obtained by fit.
- fit_transform(): combines fit and transform, performing both the training and the transformation.
Note:
- Call fit_transform(trainData) first, then transform(testData).
- Calling transform(testData) directly, before any fit, raises an error.
- If fit_transform(trainData) is followed by fit_transform(testData) instead of transform(testData), both sets are scaled, but by different statistics: the two results are not on the same "standard" and can differ significantly.
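The "different standard" problem can be sketched by hand with NumPy (toy numbers): fitting on the training data and reusing its statistics on the test data is not the same as re-fitting on the test data.

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 35.0])

# fit on train: learn the mean and std from the TRAINING data only
mu, sigma = train.mean(), train.std()

# transform(testData): reuse the training statistics -- same "standard"
right = (test - mu) / sigma

# fit_transform(testData): re-learns the stats from test -- a different "standard"
wrong = (test - test.mean()) / test.std()

print(right)  # scaled relative to the training distribution
print(wrong)  # forced to mean 0 / std 1 on its own, not comparable with train
```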
8.2. What are the differences between standardisation and normalisation, and when should each be used?
www.jianshu.com/p/95a8f035c…