This is the 25th day of my participation in the Gwen Challenge.

What is feature engineering? Here is a very simple example. Consider a binary classification problem: using logistic regression, design a body-type classifier. The input data X is height and weight; the label Y is body type (fat, not fat). Obviously, you can't judge a person by weight alone. Yao Ming is very heavy; is he fat? Apparently not. A classic piece of feature engineering here is BMI: BMI = weight / height². BMI alone tells us very clearly what kind of body a person has; you could even throw away the raw weight and height data.
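The BMI transformation above can be sketched in a few lines (the figures 141 kg and 2.26 m for Yao Ming are illustrative assumptions, not from the original text):

```python
def bmi(weight_kg, height_m):
    """Derived feature: BMI = weight / height^2."""
    return weight_kg / height_m ** 2

# Heavy in absolute terms, but tall -- so the derived BMI feature stays moderate
print(bmi(141, 2.26))
```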

1. Definition of feature engineering:

The process of transforming raw feature data, through transformation functions, into feature data better suited to the algorithm or model.

2. What it includes

  • Normalization: mapping the original data into a fixed interval (default: [0, 1])
  • Standardization: transforming the original data so that it has mean 0 and standard deviation 1
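The two transforms can be written out by hand to make the formulas concrete (a minimal sketch, not the sklearn implementation):

```python
def min_max_scale(x, lo=0.0, hi=1.0):
    """Normalization: x' = (x - min) / (max - min), mapped to [lo, hi]."""
    mn, mx = min(x), max(x)
    return [(v - mn) / (mx - mn) * (hi - lo) + lo for v in x]

def z_score(x):
    """Standardization: x' = (x - mean) / std, giving mean 0 and std 1."""
    n = len(x)
    mean = sum(x) / n
    std = (sum((v - mean) ** 2 for v in x) / n) ** 0.5
    return [(v - mean) / std for v in x]

print(min_max_scale([1, 2, 3]))  # [0.0, 0.5, 1.0]
print(z_score([1, 2, 3]))        # symmetric around 0
```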

3. API

sklearn.preprocessing

First, import all the libraries we need:

from sklearn.datasets import load_iris  # , fetch_20newsgroups
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']

1. Data set acquisition

1.1 Small data set acquisition

iris = load_iris()

1.2 Data set attribute description

# print('Target values:\n', iris['target'])
# print('Feature names:\n', iris.feature_names)
# print('Target names:\n', iris.target_names)
# print('Dataset description:\n', iris.DESCR)

1.3 Converting the data into a DataFrame for storage

iris_data = pd.DataFrame(data=iris.data,columns=['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width'])
iris_data['target']=iris.target

Define a visualization function plot_iris:

def plot_iris(iris_data, col1, col2):
    sns.lmplot(x=col1, y=col2, data=iris_data, hue='target', fit_reg=False)
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title('Iris species distribution')
    plt.show()

1.4 Splitting the dataset

# x holds the feature values, y the target values
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=2)
# print('Training-set features:\n', x_train)
# print('Test-set features:\n', x_test)
# print('Training-set targets:\n', y_train)
# print('Test-set targets:\n', y_test)

The result is an 80% training set and a 20% test set.
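The 80/20 ratio can be sanity-checked with a toy, hand-written version of the split (a sketch of what train_test_split does, not its actual implementation):

```python
import random

def simple_split(X, y, test_size=0.2, seed=2):
    """Shuffle indices, then split samples into train/test by the given ratio."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

X = list(range(150))      # 150 samples, like the iris dataset
y = [i % 3 for i in X]    # 3 classes
x_train, x_test, y_train, y_test = simple_split(X, y)
print(len(x_train), len(x_test))  # 120 30
```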

Normalization requires MinMaxScaler, which we import directly here

from sklearn.preprocessing import MinMaxScaler

You also need a transformer, so instantiate one:

transfer = MinMaxScaler(feature_range=(0, 1))

Call the fit_transform method:

# 'data' is assumed to be a DataFrame with 'Milage' and 'Consumtime' columns, loaded elsewhere
minmax_data = transfer.fit_transform(data[['Milage','Consumtime']])
print('The normalized data is:\n', minmax_data)

You can see that the normalized data is between 0 and 1.
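Standardization follows the same transformer pattern, using StandardScaler instead of MinMaxScaler (a minimal sketch; the toy values below are made up for illustration):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Toy data: each row is a sample, each column a feature
data = np.array([[90.0, 2.0], [60.0, 4.0], [75.0, 3.0]])
transfer = StandardScaler()
std_data = transfer.fit_transform(data)
print('Column means:', std_data.mean(axis=0))  # ~0 for each column
print('Column stds: ', std_data.std(axis=0))   # ~1 for each column
```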

To put it simply, standardization processes the data along the columns of the feature matrix, converting the samples' feature values to the same scale via the z-score method. Normalization processes the data along the rows of the feature matrix; its purpose is to give sample vectors a common standard when computing dot products or other kernel-based similarities, that is, each sample is converted into a "unit vector". The L2 normalization formula is: x'_j = x_j / √(x_1² + x_2² + … + x_n²).

The code for normalizing data using the Normalizer class from sklearn.preprocessing is as follows:

from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(iris.data)
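To confirm that each row really becomes a unit vector, you can check the L2 norm of every normalized sample:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer
import numpy as np

iris = load_iris()
normed = Normalizer().fit_transform(iris.data)  # default norm='l2'
row_norms = np.linalg.norm(normed, axis=1)
print(row_norms[:5])  # each sample now has L2 norm 1
```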