1. Feature scaling
If one feature has a much larger range of values than the others, numerical computations (such as Euclidean distances) are dominated by that feature. But the feature with the largest range is not necessarily the most important one, and we often want every feature to be treated as equally important. Normalizing/standardizing the data puts features of different scales on a comparable footing and can significantly improve model accuracy. In addition, when gradient descent is used to solve an optimization problem, normalizing/standardizing the data speeds up its convergence.
(1) Data normalization
Data normalization rescales values into the range [0, 1] or [-1, 1].
To map data into [0, 1]:
newValue = (oldValue - min) / (max - min)
To map data into [-1, 1]:
newValue = ((oldValue - min) / (max - min) - 0.5) * 2
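The two min-max formulas above can be sketched in plain Python (function names here are illustrative, not from any particular library):

```python
def minmax_01(values):
    """Scale values into [0, 1]: newValue = (oldValue - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def minmax_neg1_1(values):
    """Scale values into [-1, 1]: newValue = ((oldValue - min) / (max - min) - 0.5) * 2."""
    lo, hi = min(values), max(values)
    return [((v - lo) / (hi - lo) - 0.5) * 2 for v in values]

data = [10, 20, 30, 40, 50]
print(minmax_01(data))       # [0.0, 0.25, 0.5, 0.75, 1.0]
print(minmax_neg1_1(data))   # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Note that both variants assume max > min; a constant-valued feature would need special handling.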
(2) Standardization (z-score)
Standardization shifts and scales the data so that its mean is 0 and its standard deviation is 1.
Let x be a feature value, u the mean of the data, and s the standard deviation of the data:
newValue = (oldValue - u) / s
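A minimal sketch of the z-score formula, using Python's standard-library `statistics` module:

```python
from statistics import mean, stdev

def standardize(values):
    """z-score: newValue = (oldValue - u) / s, with u the mean and s the standard deviation."""
    u = mean(values)
    s = stdev(values)
    return [(v - u) / s for v in values]

scaled = standardize([2, 4, 6, 8])
# After standardization the data has mean 0 and standard deviation 1.
print(mean(scaled), stdev(scaled))
```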
(3) Considerations for feature scaling
First split the data into a training set and a validation set. Compute the required statistics (such as the mean and standard deviation) on the training set only, and use them to standardize/normalize the training data. Do not compute these statistics on the whole dataset, because that leaks information from the validation set into training. Then apply the same standardization/normalization to the validation set, reusing the mean, standard deviation, and other statistics computed from the training set.
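The fit-on-train, apply-to-validation rule can be sketched as follows (the data values here are made up for illustration):

```python
from statistics import mean, pstdev

train = [1.0, 3.0, 5.0, 7.0, 9.0]
valid = [2.0, 6.0]

# Statistics come from the training set ONLY.
u, s = mean(train), pstdev(train)

train_scaled = [(v - u) / s for v in train]
# The validation set reuses the *training* mean and standard deviation;
# it never contributes to computing them.
valid_scaled = [(v - u) / s for v in valid]
```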
2. Cross-validation
In machine learning we generally cannot use all of the data to train the model; otherwise no data would be left to validate it and evaluate its predictive performance. The simplest approach is to split the dataset into two parts: one used for training (the training set) and one used for validation (the test set). But this approach has two drawbacks:
1. The final choice of model and parameters depends heavily on how the training and test sets are split. If the split happens to be unrepresentative, the best model and parameters may not be selected.
2. Only part of the data is used to train the model. Generally, the more data used for training, the better the resulting model, so holding out a test set means we cannot make full use of the data at hand, and model quality suffers to some extent.
Cross-validation was proposed to address these problems. Its basic idea is to reuse the data: slice the sample data and combine the slices into different training and test sets, train the model on each training set, and evaluate its predictive quality on the corresponding test set. In this way multiple distinct training/test splits are obtained, and a sample that belongs to the training set in one split may belong to the test set in the next, hence the name "cross".
Cross-validation is used when data is scarce. If the sample size is below about 10,000, we use cross-validation to train and select a model. With more than about 10,000 samples, we usually split the data randomly into three parts: a training set, a validation set, and a test set. The training set is used to train models; the validation set is used to evaluate their predictions and choose a model and its corresponding parameters; the chosen model is then evaluated on the test set, where the final decision is made on which model and parameters to use.
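A minimal k-fold splitting sketch in plain Python (no ML library assumed): indices are assigned to k folds, and each fold serves once as the test set while the remaining folds form the training set — so every sample is tested exactly once.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Round-robin assignment of sample indices to k folds.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in (folds[:i] + folds[i + 1:]) for j in f]
        splits.append((train_idx, test_idx))
    return splits

for train_idx, test_idx in k_fold_indices(6, 3):
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

In practice one would train a model on each `train_idx` slice, score it on the matching `test_idx` slice, and average the k scores to compare models.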