Feature engineering, machine learning optimization methods are all here! Collection!

1. What are the feature projects?

Feature engineering, as the name implies, is a series of engineering processes on raw data that are refined into features that are used as inputs for algorithms and models. Essentially, feature engineering is a process of representing and presenting data. In practice, feature engineering aims to remove impurities and redundancy in the raw data and design more efficient features to describe the relationship between the solved problem and the prediction model.

The following two common data types are discussed.

Structured data. Structured data type can be regarded as a table in a relational database. Each column is clearly defined and contains two basic types, numeric type and category type. Each row of data represents information about a sample.
Unstructured data. Unstructured data mainly includes text, image, audio and video data, and the information it contains cannot be represented by a simple value, nor is there a clear category definition, and each data is of different sizes.

1.1 Feature normalization

In order to eliminate the dimensional influence between data features, we need to normalize the features to make different indicators comparable. For example, to analyze the impact of a person’s height and weight on health, if meters (m) and kilograms (kg) are used as units, then the height characteristics will be in the range of 1.6 ~ 1.8m, and the weight characteristics will be in the range of 50 ~ 100kg, and the analysis results will obviously tend to the weight characteristics with a large difference ratio. To get more accurate results, there is a process of Normalization that puts indicators on the same level for analysis.

Normalization of the features of a numeric type can unify all the features into a roughly identical numeric interval. The most common methods are the following two.

The linear function is normalized(Min-max Scaling). It carries out linear transformation on the original data, so that the result is mapped to the range of [0, 1], and the equal scale of the original data is realized. The normalization formula is as follows, whereXIs the original data,Are the maximum value and minimum value of data respectively.
There has been z-score Normalization. It maps the raw data to a distribution with a mean of 0 and a standard deviation of 1. In particular, assuming that the mean of the original features is μ and the standard deviation is σ, the normalization formula is defined as

Advantages: After the normalization of training data, it is easier and faster to find the optimal solution through gradient descent.

Of course, data normalization is not a panacea. In practical application, the modes solved by gradient descent method usually need normalization, including linear regression, logistic regression, support vector machine, neural network and so on. But it is not applicable to the decision tree model.

1.2 Category characteristics

Categorical Feature refers to gender (male, female), blood type (A, B, AB, O) and other features that can only be selected in A limited number of options. The original input of categorical features is usually in the form of string. In addition to a few models such as decision tree, which can directly process the input in the form of string, for models such as logistic regression and support vector machine, categorical features must be processed and converted into numerical features to work correctly.

The serial number coding

Ordinal coding is usually used to process data with size relationships between categories. For example, grades can be divided into low, middle and three grades, and there is a “high > middle > low” ordering relationship. The serial number coding will assign a numerical ID to the category features according to the size relationship, for example, the high representation is 3, the middle representation is 2, and the low representation is 1, and the size relationship remains after the transformation.
One-hot coding

Unique thermal coding is usually used to deal with features that do not have a size relationship between categories. For example, there are four values of blood type (A, B, AB and O). The unique thermal coding will turn blood type into A 4-dimensional sparse vector, which is represented by (1, 0, 0, 0) for blood type A, (0, 1, 0, 0) for blood type B, and (0, 0, 1, 0) for blood type AB. Blood type O is (0, 0, 0, 1). In the case of a large number of categories, unique heat encoding is used.
** Binary encoding **

Binary encoding is mainly divided into two steps. First, each category is assigned a category ID with ordinal encoding, and then the binary encoding corresponding to the category ID is taken as the result. Taking blood types A, B, AB and O as examples, the following is the process of binary coding. The ID of blood type A is 1, which is 001 in binary. The ID of blood type B is 2, which is 010 in binary. The analogy can be used to get binary representations of blood types AB and O.

1.3 Processing of high-dimensional combination features

In order to improve the fitting ability of complex relations, first-order discrete features are often combined in pairs to form higher-order combined features in feature engineering. Take the advertising click estimation problem as an example. The original data has two discrete features, language and type. The first figure shows the influence of language and type on click. In order to improve the ability of fitting, language and type can be used to form second-order features. The second figure shows the influence of the combination features of language and type on click.

1.4 Text representation model

Text is a kind of very important unstructured data. How to represent text data has always been an important research direction in the field of machine learning.

Word bag model and N-gram model

The most basic text representation model is the word bag model. As the name suggests, this is to treat each text as a bag of words, ignoring the order in which each word appears. To be specific, the whole text is cut apart in terms of words, and then each article can be represented as a long vector. Each dimension in the vector represents a word, and the weight of the corresponding dimension reflects the importance of the word in the original article. Tf-idf is commonly used to calculate the weight.
Topic model

The topic model is used to find representative topics from the text library (to get the distribution characteristics of the words above each topic) and to calculate the topic distribution for each article.
Word embedding and deep learning models

Word embedding is a general term for a class of word vectorization models. The core idea is to map each word into a Dense Vector in a low-dimensional space (K=50 ~ 300 dimensions). Each dimension of the k-dimensional space can also be viewed as an implicit theme, though not as intuitively as the theme in the theme model.

1.5 Other feature engineering

If there are few missing values in a feature, the average value of the feature or other reliable data can be used for filling. If there are too many missing features, you can consider deleting this feature.
The correlation between features and results can be analyzed, and features with little correlation can be removed.

1.6 Feature engineering brain map

2. Machine learning optimization methods

Optimization is a branch of applied mathematics and a core component of machine learning. In fact, machine learning algorithm = model representation + model evaluation + optimization algorithm. What the optimization algorithm does is to find the model with the best model evaluation index in the model representation space. Different optimization algorithms have different model representation and evaluation indexes.

2.1 Loss functions commonly used in machine learning

Loss function is used to estimate the degree of inconsistency between the predicted value F (x) and the real value Y of your model. It is a non-negative real value function and is usually represented by L(Y, f(x)). The smaller the loss function is, the better the model’s robustness will be. Common loss functions are as follows:

Square loss function

Y minus f(X) represents the residuals, and the whole formula represents the sum of the squares of the residuals, and our goal is to minimize the value of the objective function (note: this formula does not include the regular term), which is to minimize the sum of the squares of the residuals. In practical application, the mean square deviation (MSE) is usually used as a measurement index, and the formula is as follows:

This loss function is commonly used in linear regression.
Log loss function

In the formula y=1 means that the first formula is used when the true value is 1, and the second formula is used to calculate the loss when the true value is 0. Why do I add the log function? If you think about it, if the real sample is 1, but h=0, then log0= infinity, that’s the maximum penalty for the model; When h=1, log base 1=0, there’s no penalty, there’s no loss, there’s an optimal result. So the mathematicians came up with the log function for the loss function.

Finally, according to the gradient descent method, the minimum point is solved to obtain the desired model effect. This loss function is commonly used in logistic regression.
Hinge loss function

SVM adopts Hinge Loss, which is used for “max-margin” classification.

See the previous SVM article 1.2.3 for details

2.2 What is convex optimization

A convex function is strictly defined as that the function L(·) is convex if and only if for any two points x, y and any real number λ∈[0,1] in the domain:

An intuitive interpretation of this inequality is that any point on a line segment joined by any two points on a convex surface will not lie below the surface of the function, as shown in the figure below.

Examples of convex optimization problems include linear models such as support vector machines and linear regression, while examples of non-convex optimization problems include low-rank models (such as matrix decomposition) and deep neural network models.

2.3 Regularization terms

The regularization item is used, that is, a parameter item is added to loss function. The regularization item includes L1 regularization, L2 regularization and ElasticNet. The benefits of adding this regularization item:

Control parameter amplitude, do not let the model “lawless”.
Limit the parameter search space
Solve the problem of underfitting and overfitting.

Refer to the previous article for details: Linear regression — Point 5

2.4 Common optimization methods

Gradient descent method

Gradient descent is the earliest, simplest and most commonly used optimization method. When the objective function is convex, the solution of gradient descent method is global solution. In general, its solution is not guaranteed to be the global optimal solution, and the speed of gradient descent method is not necessarily the fastest. The optimization idea of gradient descent method is to use the negative gradient direction of the current position as the search direction, because this direction is the fastest descent direction of the current position, so it is also called “the fastest descent method”. The closer the fastest descent method is to the target value, the smaller the step size and the slower the progress. The search iteration of gradient descent method is shown in the figure below:

Disadvantages: convergence speed slows down when approaching the minimum value; Linear search can cause some problems; It may zigzag down.
Newton’s method

Newton’s method is a method to approximate the solution of equations in the field of real and complex numbers. Methods Use the first few terms of the Taylor series of the function f (x) to find the roots of the equation f (x) = 0. The greatest characteristic of Newton’s method is that it converges quickly. Specific steps:
- First, select an x0 that is close to the zero of the function f (x) and compute the corresponding f (x0) and the slope of the tangent line F ‘(x0) (where f’ represents the derivative of the function f).
- Then we compute the x-coordinate of the X-axis intersection of the line through the point (x0, f (x0)) with slope F ‘(x0), that is, solve the following equation:
- We’ll call our new point x1, and in general x1 will be closer to the solution to the equation f (x) = 0 than x0. So we can now start the next iteration with X1.
Because Newton’s method is based on the tangent line of the current position to determine the next position, Newton’s method is also vividly called the “tangent line method.” Dynamic example graph of Newton method search:

Essentially, Newton’s method is second-order convergence, and gradient descent is first-order convergence, so Newton’s method is faster. Disadvantages:
- Newton’s method is an iterative algorithm, and every step needs to solve the inverse matrix of the Hessian matrix of the objective function, so the calculation is complicated.
- In higher dimensions this matrix is very large, and computation and storage are problems.
- In the case of small batches, Newton’s method is too noisy to estimate the second derivative.
- When the objective function is not convex, Newton’s method is easily attracted by saddle point or maximum point.
Quasi-newton method

Quasi-newton’s method is one of the most effective methods to solve nonlinear optimization problems. The essential idea is to improve the defect of Newton’s method to solve the inverse of the complex Hessian matrix every time. It uses the positive definite matrix to approximate the inverse of the Hessian matrix, thus simplifying the operation complexity. ** Quasi-Newton method, like gradient descent method, only requires that the gradient of the objective function be known at each iteration step. By measuring the variation of the gradient, a model of the objective function is constructed that is sufficient to produce superlinear convergence. Such methods are vastly superior to gradient descent, especially for difficult problems. In addition, the quasi-Newtonian method is sometimes more efficient than Newton’s method because it does not require information about the second derivative. Today, optimization software contains a large number of quasi-Newtonian algorithms to solve unconstrained, constrained, and large-scale optimization problems.
Conjugate gradient method

Conjugate gradient method is between gradient descent and Newton’s laws of a method, it only using first derivative information, but overcomes the drawback of gradient descent method converges slowly, and avoids the Newton’s method need storage and computing Hesse matrix and the disadvantage of the inverse conjugate gradient method is not only one of the most useful method to solve large linear equations, It is also one of the most efficient algorithms for solving large nonlinear optimization. Among all kinds of optimization algorithms, conjugate gradient method is very important. It has the advantages of small storage, step convergence, high stability, and does not need any external parameters.

See Wikipedia conjugate gradient method for details. The following figure shows the path comparison between conjugate gradient method and gradient descent method for searching optimal solutions:

3. Machine learning assessment methods

Confusion matrix, also known as error matrix, is a standard format for precision evaluation, expressed in matrix form of N rows and n columns. The specific evaluation indexes include overall accuracy, mapping accuracy, user accuracy, etc. These accuracy indexes reflect the accuracy of image classification from different aspects. The confusion matrix is shown below

	Is the class	Negative class
Right,	TP(True Positives)	FP(False Positives)
Prediction error	FN(False Negatives)	TN(True Negatives)

3.1 Accuracy

** Accuracy. **, as the name suggests, is the total percentage of all predictions that are correct.

Accuracy is the simplest and most intuitive evaluation index in classification problems, but there are obvious defects. For example, when 99 percent of the samples are negative, the classifier can also achieve 99 percent accuracy by predicting all samples as negative. Therefore, when the sample proportion of different categories is very imbalanced, the category with a large proportion often becomes the most important factor affecting the accuracy.

3.2 Precision

Precision, Precision. That’s the percentage of all positive predictions that are correct. Personal understanding: the percentage of all predictions that are truly correct is positive.

3.3 Recall

Recall d. That is, the percentage of the people who are correctly predicted to be positive that are actually positive. Personal understanding: what is truly true is the percentage of all that is actually positive.

In order to comprehensively evaluate the quality of a ranking model, not only the Precision@N and Recall@N of the model under different Top N conditions, but also the p-R (Precision-recall) curve of the model should be drawn. Here is a brief introduction to the drawing method of P-R curve.

The horizontal axis of the P-R curve is the recall rate and the vertical axis is the accuracy rate. For a ranking model, a point on the p-R curve represents that, under a certain threshold, the model determines the results greater than the threshold as positive samples, and those less than the threshold as negative samples. At this point, the recall rate and accuracy rate corresponding to the results are returned. The entire P-R curve is generated by moving the threshold from high to low. The figure below is A sample p-R curve diagram, where the solid line represents the P-R curve of model A and the dotted line represents the p-R curve of model B. The vicinity of the origin represents the accuracy and recall rate of the model when the threshold value is maximum.

As can be seen from the figure, when the recall rate is close to 0, the accuracy rate of model A is 0.9, and that of model B is 1, which indicates that the samples with the top scores of model B are all real positive samples, while the prediction errors exist even in the samples with the highest scores of Model A. And, with the increase of recall rate, the overall accuracy tends to decrease. However, when the recall rate is 1, the accuracy of model A is higher than that of model B. This fully shows that only the accuracy rate and recall rate corresponding to a certain point cannot comprehensively measure the performance of the model, and only the overall performance of the P-R curve can make a more comprehensive evaluation of the model.

3.4 F1 value (H-mean value)

F1 value (H-mean value). The value of F1 is the arithmetic mean divided by the geometric mean, and the larger the better. By substituting the above formula of Precision and Recall, it can be found that when the value of F1 is small, the True Positive increases while the false decreases, that is, both Precision and Recall increase. That is, F1 weights both Precision and Recall.

3.4 ROC curve

The ROC curve. Receiver operating characteristic Curve is a comprehensive index reflecting the continuous variables of sensitivity and specificity. Each point on the ROC curve reflects the receptivity to the same signal stimulus. Below is an example of the ROC curve.

Abscissa: 1-specificity, False positive rate (FPR, FPR=FP/(FP+TN)), percentage of positive samples in all negative samples;

Ordinate: Sensitivity, True positive rate (TPR, TPR=TP/(TP+FN)), and the proportion of predicted positive and actual positive samples in all positive samples.

In a truly ideal case, TPR would be close to 1 and FPR close to 0, the point (0,1) on the graph. The closer the ROC curve is to the point (0,1), the more it deviates from the 45 degree diagonal, the better.

AUC value

Area Under Curve (AUC) is defined as the Area Under the ROC Curve. Obviously, the value of this Area is not greater than 1. And since the ROC curve is generally above the straight line y=x, the value range of AUC is generally between 0.5 and 1. The AUC value is used as the evaluation standard because in many cases, the ROC curve cannot clearly explain which classifier has a better effect, and as a value, the classifier with a larger AUC has a better effect.

Criteria for judging the merits and demerits of classifier (prediction model) from AUC:

AUC = 1 is a perfect classifier. When this prediction model is adopted, there is at least one threshold to get a perfect prediction. For the vast majority of prediction situations, there is no perfect classifier.
0.5 < AUC < 1, better than random guess. This classifier (model) can have predictive value if the threshold is set properly.
AUC = 0.5, the model has no predictive value as a random guess (e.g., lost copper plate).
AUC < 0.5, worse than random guess; But as long as you always make a counter-prediction, it’s better than a random guess.

In a word, the higher the AUC value of the classifier, the higher the accuracy.

3.5 Cosine distance and Euclidean distance

Cosine distance:

** Euclidean distance: ** In mathematics, the Euclidean distance or Euclidean metric is the “ordinary” (i.e. straight line) distance between two points in a Euclidean space.

For two vectors A and B, the cosine distance is concerned with the angular relationship between the vectors and does not care about their absolute size, which ranges from [−1,1]. If word frequency or word vector are used as features, the Euclidean distance of a pair of texts in feature space is usually very large when the length of similarity is very different but the content is similar. If cosine similarity is used, the Angle between them may be small, so the similarity is high. In addition, in the fields of text, image, video and so on, the characteristic dimension of the object of study is often very high. In the case of high dimension, the cosine similarity still retains the property of “1 when identical, 0 when orthogonal, and −1 when opposite”, while the value of Euclidean distance is affected by the dimension, the range is not fixed, and the meaning is vague.

3.6 A/B testing

AB test is made for Web App interface or process (A/B) two or more (A/B/n) version, at the same time dimension, respectively, let visitors to the components of the same (similar) group (target groups) random access these versions, collect user experience data and business data of each group in the final analysis, evaluating the best version, formally adopted.

3.7 Model evaluation method

Holdout test

Holdout test is the simplest and most direct verification method, which randomly divides the original sample set into training set and verification set. For example, for a CTR prediction model, we divide the sample into two parts in a proportion of 70% to 30%. 70% of the sample is used for model training. 30% of the samples were used for model validation, including drawing ROC curves, calculating accuracy and recall to evaluate model performance.

The shortcoming of Holdout test is obvious, that is, the final evaluation index calculated on the verification set has a great relationship with the original grouping. To eliminate randomness, the researchers introduced the idea of “cross-checking.”
Cross check

K-fold cross validation: Firstly, all samples were divided into k equal sample subsets; The k subsets are traversed in turn, and the current subset is taken as the validation set each time, and all the other subsets are taken as the training set to train and evaluate the model. Finally, the average value of k evaluation indexes was taken as the final evaluation index. In actual experiments, k is often 10.
Self-help method

Both Holdout test and cross test are based on the method of dividing training set and test set to evaluate the model. However, when the sample size is relatively small, dividing the sample set will further reduce the training set, which may affect the training effect of the model. Is there a validation method that can maintain the sample size of the training set? Self-help method can solve this problem better.

Self-help method is an inspection method based on self-help sampling method. For the sample set with a total number of N, n times of random sampling with repositions are carried out to obtain a training set with a size of N. In the n sampling process, some samples will be repeatedly sampled and some samples have not been extracted. These samples that have not been extracted will be used as the verification set for model verification, which is the verification process of self-help method.

3.8 Hyperparameter Tuning

In order to carry out hyperparameter tuning, we usually use grid search, random search, Bayesian optimization and other algorithms. Before introducing the algorithm in detail, it is necessary to make clear which elements are generally included in the hyperparameter search algorithm. One is the objective function, that is, the objective that the algorithm needs to maximize/minimize; Second, the search scope is generally determined by upper and lower limits; Third, other parameters of the algorithm, such as the search step.

Grid search, perhaps the simplest and most widely used hyperparametric search algorithm, determines the optimal value by searching all points within the search norm. If large search scope and small step size are used, grid search has a high probability to find the global optimal value. However, this search scheme consumes computational resources and time, especially when there are many hyperparameters to tune. Therefore, in practical application, the grid search method usually uses a wider search range and a larger step size to find the possible location of the global optimal value. Then the search scope and step size will be gradually narrowed to find a more accurate optimal value. This operation can reduce the time and computation required, but because the objective function is generally non-convex, it is likely to miss the global optimal value.
Random search, the idea of random search is similar to grid search, but instead of testing all values between upper and lower bounds, random sample points are selected in the search range. Its theoretical basis is that if the sample point set is large enough, the global optimal value, or its approximate value, can be found with high probability through random sampling. Random search is generally faster than grid search, but like the fast version of grid search, results are not guaranteed.
Bayesian optimization algorithm, bayesian optimization algorithm in the search for optimal parameters, using grid search, random search completely different methods. Grid search and random search will ignore the information of the previous point when testing a new point. The Bayesian optimization algorithm makes full use of the previous information. By learning the shape of the objective function, the bayesian optimization algorithm finds the parameters that promote the objective function to the global optimal value.

3.9 Over-fitting and under-fitting

Overfitting means that the model fits the training data too properly, which is reflected in the evaluation index. That is, the model performs well on the training set, but poorly on the test set and new data. Underfitting refers to a situation in which the model performs poorly both in training and prediction. The following diagram illustrates the difference between overfitting and underfitting.

To prevent overfitting:
- Start with data and get more training data.
- Reduce model complexity.
- The regularization method adds some regularization constraints to the parameters of the model.
- Ensemble learning is an approach to integrating multiple models together.
To prevent under-fitting:
- Add new features.
- Increase model complexity.
- Decrease the regularization coefficient.

4. References

Hundred side machine learning

5. Machine learning series

GitHub:github.com/NLP-LOVE/ML…

Machine Learning

Author: @ mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]