
Author: Xu Min

Preface

1. Machine learning algorithm classification

1) Supervised learning: there is a training set, and the value of y in the training set is known.

2) Unsupervised learning: there is a training set, but the value of y in the training set is unknown.

3) Semi-supervised learning: there is a training set in which some of the values of y are known and some are not.

4) Reinforcement learning

2. Common algorithms

3. Learning algorithm concepts

1) Least squares regression

Least squares regression, i.e., Ordinary Least Squares (OLS), is a common linear regression method. The basic principle of the least squares method is that the best-fit line should minimize the total distance from the data points to the line, usually expressed as the sum of the squared distances, i.e., the residual sum of squares.
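To make the principle concrete, here is a minimal sketch (synthetic data, NumPy assumed; the coefficients are illustrative, not from the article) that fits a linear model by minimizing the residual sum of squares:

```python
# Minimal OLS sketch: fit y ≈ X_design @ beta by minimizing the sum of squared residuals.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                  # two explanatory variables
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_design = np.column_stack([np.ones(len(X)), X])               # add an intercept column
beta, rss, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)  # least squares solution

print("estimated coefficients:", beta)                         # roughly [3.0, 1.5, -2.0]
print("residual sum of squares:", rss)                         # the quantity OLS minimizes
```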

The basic assumptions of the classical linear regression model are: (1) the residuals have zero mean; (2) Var(u_t) = σ² < ∞, i.e., the residuals have constant, finite variance for all values of x; (3) the residuals are statistically independent of each other; (4) the residual term is uncorrelated with the explanatory variable X; (5) the residual term is normally distributed.

If assumptions (1)-(4) are satisfied, the estimators obtained by the least squares method have certain desirable properties: they are the Best Linear Unbiased Estimators (BLUE). 1) Linear: the relationship between X and the random variable Y is a linear function; 2) Unbiased: on average, the parameter values estimated from the sample data coincide with the true values in the population; 3) Best: the OLS estimators have the minimum variance among all linear unbiased estimators.

Three common problems that regression must address are:

1) Heteroskedasticity: the variance of the residuals is not constant and the residuals are related to X (e.g., the residuals grow as x grows), violating assumptions (2) and (4).

2) Autocorrelation: the residuals are correlated with each other, violating assumption (3).

3) Multicollinearity: the explanatory variables X are not independent, i.e., there is correlation between X_i and X_j.

2) Ridge regression

Ridge regression (also known as Tikhonov regularization) is a biased-estimation regression method dedicated to the analysis of collinear data; it is essentially an improved least squares estimation (an improvement on OLS regression). By abandoning the unbiasedness of least squares, it obtains regression coefficients that are more practical and reliable at the cost of losing some information and accuracy, and it fits ill-conditioned data better than the least squares method.

Ordinary linear regression is least squares regression, whose loss is the squared-error term computed from the residuals. Ridge regression adds a regularization term to the squared error, so that a balance between variance and bias can be reached through the regularization parameter λ: as λ increases, the model variance decreases while the bias increases.
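A minimal sketch of the ridge estimator in closed form, β = (XᵀX + λI)⁻¹Xᵀy, on synthetic, nearly collinear data (NumPy assumed; the variable names and data are illustrative):

```python
# Ridge sketch: larger lam shrinks the coefficients, trading extra bias for lower variance.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=50)        # nearly collinear column
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 0.1, 10.0):                           # lam = 0 reduces to ordinary OLS
    print(lam, ridge(X, y, lam))
```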

Ridge regression is thus a complement to least squares regression: it sacrifices unbiasedness in exchange for higher numerical stability and therefore higher computational accuracy. In general, the R² of a ridge regression equation is slightly lower than that of ordinary regression analysis, but the significance of the regression coefficients is often markedly higher, which is of great practical value in studies with collinearity problems and large amounts of ill-conditioned data.

Note: In a regression problem, the choice of the x variables matters. If too few x are selected, the predictive ability of the model may be insufficient. If too many are selected, the model becomes overly complex and, in addition, overfits the training set data, so its performance on the test set deteriorates. A mechanism for feature selection is therefore needed, i.e., a trade-off must be made among candidate sets of X. There are three types of feature selection: 1) subset selection; 2) shrinkage, also known as regularization, which gives rise to the ridge and LASSO regression methods; 3) dimension reduction.

3) LASSO regression

LASSO (Least Absolute Shrinkage and Selection Operator) is a shrinkage and variable selection method proposed by Tibshirani (1996). The algorithm obtains a more parsimonious model by constructing a penalty function: LASSO simplifies the set of predictors by forcing the coefficients of some of them to be exactly zero. It is a biased-estimation approach for handling data with severe collinearity. The basic idea of LASSO is to minimize the residual sum of squares under the constraint that the sum of the absolute values of the regression coefficients is less than a constant, so that some regression coefficients become strictly equal to 0 and a model with strong explanatory power is obtained.
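A minimal sketch of this coefficients-to-zero behaviour using scikit-learn's Lasso (library assumed available; the data and alpha value are illustrative):

```python
# LASSO sketch: the L1 penalty drives most coefficients exactly to zero,
# which is the variable-selection behaviour described above.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 informative features

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)          # most entries are exactly 0.0
```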

4) LARS regression

LARS regression (Least Angle Regression) is a variable selection method proposed by Efron in 2004. Its form is similar to forward stepwise regression, and from the point of view of the solution process it provides an efficient way of computing the LASSO regression solution.

Like other model selection procedures, least angle regression is a stepwise process in which, at each step, the feature most correlated with the current residual is selected. The total number of steps depends only on the number of features, not on the size of the training set. The input of least angle regression is the feature matrix X = {x_1, x_2, ..., x_p} and the response vector y = {y_1, y_2, ..., y_N}, where each x_i is a vector of length N, N is the size of the training set, and p is the number of features. Another point to note is that the vectors x_i and y are standardized, i.e., their elements have mean 0 and each vector has unit length; this is done for the convenience of computing correlations and angles later.

LARS is a bit like forward stepwise regression, but with a difference: forward stepwise fully fits the linear model on the currently selected subset of variables each time, computes the RSS, and uses a statistic (such as AIC) to penalize higher model complexity. LARS instead finds the variable most correlated with the response and then moves that predictor's coefficient along the direction of its least squares estimate (LSE). In this process the correlation between that variable and the residual decreases; when the correlation is no longer the largest, the new variable with the highest correlation is brought in and the coefficients again move in the direction of the joint LSE. At the end, all variables have been selected and the solution essentially coincides with the least squares estimate.

The actual implementation steps of the LARS algorithm are as follows:
1. Standardize the predictors (to remove the effects of different scales) and center the target variable (to remove the effect of the intercept term). Set all initial coefficients to 0, so the residual r equals the centered target variable.
2. Find the variable X_j that has the highest correlation with the residual r.
3. Move the coefficient Beta_j of X_j from 0 along the direction of the LSE (the least squares estimate using only the variable X_j) until some new variable X_k becomes more correlated with the residual r than X_j.
4. Move the coefficients Beta_j and Beta_k of X_j and X_k together along the direction of the new joint least squares estimate until another variable is selected.
5. Repeat steps 2, 3, and 4 until all variables have been selected; the final estimate is the OLS estimate of ordinary linear regression.
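A minimal sketch of this stepwise path using scikit-learn's Lars (library assumed available; the synthetic data is illustrative): coef_path_ shows the coefficients growing step by step as each new most-correlated variable enters the model.

```python
# LARS sketch: inspect the order of variable entry and the coefficient path.
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lars().fit(X, y)
print(model.active_)       # order in which variables were selected
print(model.coef_path_)    # coefficient values at each LARS step
```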

5) Support vector regression

SVR (Support Vector Regression). The support vector machine (SVM) is a method built on structural risk minimization: its learning strategy is the structural risk minimization principle, i.e., in order to minimize the expected risk, both the empirical risk and the confidence interval should be minimized.

The support vector machine (SVM) can be used for both regression and classification. Its basic ideas are as follows: (1) it is a learning machine designed specifically for the finite-sample case and implements structural risk minimization: it seeks a compromise between the accuracy of the approximation of the given data and the complexity of the approximating function, in order to obtain the best generalization ability; (2) it ultimately solves a convex quadratic programming problem, so in theory it obtains a global optimum and avoids the local extremum problem that is unavoidable in neural network methods;

(3) it maps the original problem into a high-dimensional feature space through a nonlinear transformation and constructs a linear decision function in that high-dimensional space to realize a nonlinear decision function in the original space; this cleverly solves the dimensionality problem, guarantees good generalization ability, and keeps the algorithmic complexity independent of the sample dimension.

At present, the SVM algorithm has been applied to pattern recognition, regression estimation, probability density function estimation, and other areas, and its efficiency and accuracy have exceeded or matched traditional learning algorithms. The empirical risk R can be described with different loss functions, such as the ε-insensitive function, the quadratic function, the Huber function, the Laplace function, etc. Common kernel functions include the polynomial kernel, the Gaussian RBF kernel, the exponential RBF kernel, the multilayer perceptron (sigmoid) kernel, the Fourier series kernel, the spline kernel, the B-spline kernel, etc. Although some experiments show that different kernel functions produce almost the same results in classification, for regression different kernel functions tend to have a larger influence on the fitted result.

The support vector regression algorithm mainly realizes linear regression by constructing a linear decision function in a high-dimensional space after raising the dimension. When the ε-insensitive loss is used, its foundation is mainly the ε-insensitive function together with the kernel function. If the fitted mathematical model is expressed as a curve in a multi-dimensional space, the result obtained from the ε-insensitive function is an "ε-tube" that contains the curve and the training points. Of all the sample points, only those distributed on the "tube wall" determine the position of the tube; this part of the training samples is called the "support vectors". To adapt to the nonlinearity of the training sample set, the traditional fitting approach usually adds higher-order terms to the linear equation. This is effective, but the increase in adjustable parameters raises the risk of overfitting. The support vector regression algorithm uses the kernel function to resolve this contradiction: substituting a kernel function for the linear term in the linear equation makes the originally linear algorithm "nonlinear", i.e., capable of nonlinear regression. At the same time, the introduction of the kernel function achieves the purpose of "raising the dimension", while the number of adjustable parameters remains controlled, so overfitting can still be kept in check.
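A minimal sketch of support vector regression using scikit-learn's SVR (library assumed available; the data, kernel, and parameter values are illustrative): epsilon sets the width of the ε-tube, and the kernel replaces the linear term so the regression becomes nonlinear.

```python
# SVR sketch: fit a nonlinear curve; points on or outside the epsilon tube become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("number of support vectors:", len(model.support_))
print("prediction at x=2.5:", model.predict([[2.5]]))
```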

6) CART regression

CART (Classification And Regression Tree) is an important machine learning algorithm that can be used to build both classification trees and regression trees. When used for regression analysis it is called CART regression.

The important foundations of the CART algorithm include the following three aspects: (1) Binary split: in each decision, the observed variable is split into two. The CART algorithm adopts a binary recursive partitioning technique: it always divides the current sample set into two sub-sample sets, so that every non-leaf node of the generated decision tree has exactly two branches. The decision tree generated by CART is therefore a binary tree with a simple structure, and CART is well suited to scenarios where a feature takes yes/no values; continuous features are handled similarly to the C4.5 algorithm. (2) Split based on one variable: each optimal split is made on a single variable. (3) Pruning strategy: this is the key point of the CART algorithm and a key step of any tree-based algorithm, playing an important role in generating the optimal decision tree. Some studies have shown that the pruning process is even more important than tree generation: for the maximal trees generated under different splitting criteria, the most important attribute splits are retained after pruning with little difference between them, so the pruning method is the more critical factor in producing the optimal tree.

When the data have many features and the relationships between features are complex, building a single global model can be difficult and unwieldy. Moreover, many real-life problems are nonlinear, and it is impossible to fit all data with one global linear model. One possible approach is to divide the dataset into pieces that are easy to model and then use linear regression techniques to model each piece. If a linear model is still hard to fit after the first split, the splitting continues. Used in this way, the tree structure and regression complement each other.

When a regression tree is created, the target values of the observations are continuous and there are no class labels; a prediction rule can only be derived from the observed values themselves. In this case the optimal split rules of the classification tree do not apply, and CART uses Squared Residuals Minimization to determine the optimal split of the regression tree: the split criterion is to minimize the error variance of the subtrees after the split. One can also create a model tree, in which each leaf node is itself a machine learning model, such as a linear regression model.
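A minimal sketch of a CART regression tree using scikit-learn's DecisionTreeRegressor (library assumed available; the two-"community" data is illustrative): splits are chosen to minimize the squared error of the resulting subsets, and each leaf predicts a constant value.

```python
# Regression tree sketch: each leaf returns the mean of its "community" of samples.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 6, size=(120, 1)), axis=0)
y = np.where(X.ravel() < 3, 1.0, 4.0) + rng.normal(scale=0.2, size=120)  # two "communities"

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[1.0], [5.0]]))    # roughly the mean value of each community
```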

The idea of the regression tree is similar to that of the classification tree, except that the data type at the leaf nodes is continuous rather than discrete; a slight modification of CART can handle regression problems. Depending on whether a leaf holds a specific value or another machine learning model, CART can be divided into regression trees and model trees. Both regression trees and model trees apply to the following scenario: label values are continuously distributed but can be divided into "communities", with distinct differences between communities, i.e., a similar continuous distribution within each community but genuinely different distributions between communities. In this sense regression trees and model trees perform both regression and classification.

Regression deals with the situation where the predicted values are continuously distributed and the returned value should be a specific predicted value. The leaves of a regression tree are specific values, so strictly speaking a regression tree cannot be called a "regression algorithm" in the sense of continuous predicted values: the regression tree returns the mean value of a "cluster" of data rather than a specific, continuous predicted value (i.e., although the label values of the training data are continuous, the predictions of the regression tree can only take discrete values). A regression tree can therefore also be regarded as a "classification" algorithm, and its applicable scenarios should have the characteristic that "birds of a feather flock together": combinations of feature values place the label in a certain "community", with relatively distinct "gaps" between communities. For example, people's style is continuously distributed, but it can be "grouped" into three categories: artistic, ordinary, and goofy ("2B"). A regression tree can be used to judge whether a person is artistic or goofy, but it cannot measure how artistic or goofy he is. Regression trees can therefore be used to divide complex training data into relatively simple communities, which can then be learned further with other machine learning models.

The leaves of a model tree are machine learning models, such as linear regression models, so it better deserves the name "regression" algorithm; a model tree could, for example, be used to measure how artistic a person is. Regression trees and model trees also need pruning, following the same theory as classification trees. To obtain the best model, a combination of pre-pruning and post-pruning is often used.

In tree regression, the consistency of the data must be measured in order to build a tree whose leaf nodes are piecewise constants. When a classification decision tree is created, the impurity of the class data at a given node is computed. So how is the "impurity" of continuous values computed? It is in fact very simple: measure the total deviation of the label data before and after splitting on a certain feature, and at each step select the feature that minimizes the total deviation as the best splitting feature. To treat positive and negative deviations equally, the absolute value or the squared value is generally used in place of the raw deviation. The smaller the deviation, the more similar the data and the more likely they belong to one community. If the variance is chosen as the measure of deviation, there are two ways to compute the total variance: (1) compute the mean of the dataset, then compute the squared deviation of each data point from the mean and sum over the n points; (2) compute the variance var of the dataset and take var_sum = var * n, where n is the number of data points in the dataset. In Python, the var() method of a NumPy array (matrix) computes the variance of the dataset directly, so the second method is simple and convenient.

This is similar to how Gini gain handles discrete and continuous features: a multi-valued discrete feature needs the optimal binary grouping, while a continuous feature needs the optimal split point. The function chooseBestSplitFeature() works as follows: (1) first set the best variance to infinity, bestVar = INF; (2) for each feature (and each candidate split value), compute the total variance currentVar of the two subsets obtained by splitting the data on it; if currentVar < bestVar, update bestVar and record that feature and split value, so that the split minimizing the total variance is chosen as the best branch. A sketch of this logic follows below.
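A minimal Python sketch of the chooseBestSplitFeature() logic just described, using NumPy's var() to compute the total variance (var * n) of each candidate split; the data and helper names are illustrative.

```python
# Best-split search for a regression tree: minimize the total variance of the two subsets.
import numpy as np

def total_var(y):
    # total variance = variance * number of points (0 for an empty subset)
    return np.var(y) * len(y) if len(y) > 0 else 0.0

def chooseBestSplitFeature(X, y):
    best_var, best_feat, best_val = np.inf, None, None
    for feat in range(X.shape[1]):
        for val in np.unique(X[:, feat]):               # candidate split points
            left, right = y[X[:, feat] <= val], y[X[:, feat] > val]
            current_var = total_var(left) + total_var(right)
            if current_var < best_var:                   # smaller total deviation => better split
                best_var, best_feat, best_val = current_var, feat, val
    return best_feat, best_val, best_var

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2])
print(chooseBestSplitFeature(X, y))   # splits between the two "communities" (3.0 and 10.0)
```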

7) CART classification

When CART is used for classification problems, a CART classification tree needs to be built.

In the recursive process of building a classification tree, CART selects, at each step, the feature whose split yields the smallest Gini split value (GiniGain) on the current dataset as the splitting node of the decision tree. Although the ID3 and C4.5 algorithms can mine as much information as possible from the training sample set, the decision trees they generate have large branching factors and large scale. The binary splits of the CART algorithm can reduce the scale of the decision tree and improve the efficiency of tree generation. For continuous features, CART adopts the same approach as C4.5. To avoid overfitting, the CART decision tree needs pruning. The prediction process itself is simple: following the generated decision tree model, the matching feature values are traced down to a leaf node, which gives the predicted class.

The difference between CART and C4.5 is that node splitting is based on the GINI index, which measures the impurity of a data partition or training dataset: the smaller the GINI value, the higher the purity of the sample (i.e., the higher the probability that the samples belong to the same class). Measuring the Gini index over all values of a feature in the dataset yields the Gini split information, also called the GiniGain, for that feature. When pruning is not considered, the recursive construction of the classification decision tree consists of choosing, at each step, the split with the smallest GiniGain, until every sub-dataset belongs to a single class or all features have been used up.

Because of CART's binary splits, when the training data have more than two classes, CART needs to consider merging the target classes into two superclasses; this is called binarization. The GINI index is a measure of inequality, commonly used to measure income inequality, but it can be used to measure any uneven distribution. It is a number between 0 and 1, where 0 means complete equality and 1 complete inequality. When measured over classes, the more mixed the classes contained in the population, the larger the GINI index (similar to entropy). For a dataset T it is computed as Gini(T) = 1 - Σ_j p_j², where p_j is the proportion of samples in T belonging to class j.
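A minimal sketch of this Gini computation for a set of class labels (NumPy assumed; the label arrays are illustrative):

```python
# Gini index sketch: Gini(T) = 1 - sum_j p_j^2, where p_j is the proportion of class j in T.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["a", "a", "a", "a"]))        # 0.0: completely pure
print(gini(["a", "a", "b", "b"]))        # 0.5: maximally mixed for two classes
```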

By analyzing the recursive tree-building process of the classification and regression tree, it is not hard to see that it suffers from overfitting. During construction of the decision tree, noise or outliers in the training data cause many branches to reflect anomalies of the training data, so the accuracy of such a decision tree on unseen data is not high. We therefore try to detect and remove such branches; this process is called tree pruning, and it is used to deal with over-adaptation to the data. Typically, pruning uses statistical measures to cut off the least reliable branches, which results in faster classification and improves the tree's ability to classify correctly on data independent of the training set. There are two common pruning methods for decision trees: pre-pruning and post-pruning. Pre-pruning stops tree growth as early as possible according to some principle, for example when the depth of the tree reaches a user-specified depth, when the number of samples at a node falls below a user-specified number, or when the maximum reduction of the impurity index is smaller than a user-specified threshold. Post-pruning cuts branches from the fully grown tree by deleting a node's branches; various post-pruning methods can be used, such as cost-complexity pruning, minimum error pruning, and pessimistic error pruning. CART usually adopts post-pruning. The second key step in decision tree construction is therefore to prune the tree grown on the training set using an independent validation dataset.
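A minimal sketch of post-pruning via cost-complexity pruning using scikit-learn (library and the iris dataset assumed available; parameter choices are illustrative): larger ccp_alpha removes the least reliable branches, yielding a smaller tree.

```python
# Cost-complexity pruning sketch: sweep ccp_alpha and watch the tree shrink.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas[:-1]:        # skip the last alpha, which prunes to a single node
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}")
```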

Recommended posts in this series

Summary Notes on Machine Learning Concepts (2) · Summary Notes on Machine Learning Concepts (3) · Summary Notes on Machine Learning Concepts (4)

This article is published in the Tencent Cloud Technology Community with the author's authorization.