Introduction to Machine Learning series 2 — How to Build a Complete Machine Learning Project, Part 9!

The first eight articles in this series:

  • How to Build a Complete Machine Learning Project
  • Machine learning data set acquisition and test set construction method
  • Data Preprocessing for feature Engineering (PART 1)
  • Data Preprocessing for feature Engineering (Part 2)
  • Feature scaling & feature coding for feature engineering
  • Feature Engineering (finished)
  • Summary and Comparison of Common Machine Learning Algorithms (PART 1)
  • Summary and Comparison of Commonly used Machine Learning Algorithms (Middle)

Boosting algorithms, GBDT, optimization algorithms, and convolutional neural networks are all discussed in detail.


9. Boosting method

Overview

The boosting method (Boosting) is a commonly used statistical learning technique. In classification problems, it learns multiple classifiers by changing the weights of the training samples, and linearly combines these classifiers to improve classification performance.

Bagging and boosting

Boosting and Bagging are both basic algorithms in the field of ensemble learning, and both combine multiple classifiers of the same type.

Bagging

Bagging, also called bootstrap aggregating, works as follows: from an original data set of N samples, draw N samples with replacement (a bootstrap sample), and repeat this S times. This yields S new data sets of N samples each, which are used to train S classifiers; to classify a new point, each of the S classifiers votes, and the class with the most votes is taken as the final result. Generally speaking, a bootstrap sample contains about 63% of the original training data, because:

If we draw N samples with replacement from a data set of N samples, the probability that a particular sample is never drawn in the N draws is

$$\left(1 - \frac{1}{N}\right)^{N}$$

so the probability that a given sample is drawn at least once is

$$1 - \left(1 - \frac{1}{N}\right)^{N}$$

and, when N is large,

$$\left(1 - \frac{1}{N}\right)^{N} \rightarrow \frac{1}{e} \approx 0.368, \qquad 1 - \left(1 - \frac{1}{N}\right)^{N} \rightarrow 1 - \frac{1}{e} \approx 0.632$$

In this way, about 36.8% of the original samples are never drawn during the bootstrap process. These samples are called out-of-bag (OOB) samples, and they are one of the advantages that bootstrap sampling brings to Bagging, because we can use them for **out-of-bag estimation**.
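A quick simulation of this 63%/37% split (an illustrative check, not part of the original article, assuming NumPy is available):

import numpy as np

rng = np.random.default_rng(0)
N = 10000
sample = rng.integers(0, N, size=N)      # bootstrap: draw N indices with replacement
in_bag = np.unique(sample).size / N      # fraction of distinct original samples drawn
print(in_bag, 1 - in_bag)                # roughly 0.632 in-bag, 0.368 out-of-bag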

Bagging reduces generalization error by reducing the variance of the base classifiers, and its performance depends on the stability of the base classifier. If the base classifier is unstable, Bagging helps reduce the error caused by random fluctuations in the training data. If the base classifier is stable, i.e. robust to small changes in the training data, then the error of the combined classifier is mainly caused by the bias of the base classifier; in this case Bagging may not improve on the base classifier significantly, and may even degrade its performance.

Differences between Bagging and Boosting
  • Bagging obtains its S data sets by sampling with replacement, while Boosting always uses the original data set but changes the weights of the samples.
  • Boosting trains its classifiers serially: the training of each new classifier is affected by the classification results of the previous one.
  • In Bagging the classifiers have equal weight, but in Boosting they do not: each classifier's weight reflects how well it performed in the previous round of classification.

AdaBoost is the most popular version of boosting

AdaBoost algorithm

AdaBoost (Adaptive Boosting) is a meta-algorithm, which builds a strong classifier by combining several weak classifiers.

We assign a weight to each training sample; these weights form a vector D. Initially all weights are equal. In each round a weak classifier is trained on the weighted samples; from the second round onward, the weights of samples misclassified in the previous round are increased while the weights of correctly classified samples are decreased, and this is repeated until the last iteration.

In addition, each weak classifier has its own weight alpha, which depends on its weighted error rate: the lower the weighted error rate, the higher that classifier's alpha. The final prediction combines the outputs of the weak classifiers weighted by their alpha values. AdaBoost is one of the best supervised classification methods.
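As a hedged illustration (not the article's original code), scikit-learn's AdaBoostClassifier implements this reweighting scheme with decision stumps as the default weak learner; the toy data and parameter values below are assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)  # 50 weak learners
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))     # accuracy of the combined strong classifier
print(clf.estimator_weights_[:5])    # the per-classifier alpha weights described above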

The advantages and disadvantages

advantages
  1. Low generalization error
  2. Easy to implement, high classification accuracy, and few parameters to tune
disadvantages
  • Sensitive to outliers
  • Training can be time-consuming
  • The execution effect depends on the choice of weak classifier

10. GBDT

Overview

GBDT is a decision tree algorithm based on iterative accumulation. It constructs a group of weak learners (trees) and aggregates the results of the multiple decision trees as the final prediction.

The tree in GBDT is a regression tree, not a classification tree.

Comparison between Random Forest (RF) and GBDT
  1. Trees in RF are generated in parallel; in GBDT, trees are generated sequentially. Too many trees cause overfitting in either case, but GBDT overfits more easily
  2. The features each tree splits on in RF are relatively random; in GBDT, the earlier trees preferentially split on features that distinguish the majority of samples, while the later trees split on features that distinguish the remaining minority of samples
  3. The main parameter in RF is the number of trees; the main parameter in GBDT is the tree depth, which is generally 1

The advantages and disadvantages

advantages
  1. High precision
  2. Ability to process nonlinear data
  3. Able to handle multiple feature types
  4. Suitable for low dimensional dense data
  5. The model can be interpreted well
  6. No need to do feature normalization, can automatically select features
  7. Can adapt to a variety of loss functions, including mean squared error, LogLoss, etc.
disadvantages
  1. Boosting is a serial process, so parallelism is troublesome and the connection between upper and lower trees needs to be considered
  2. High computational complexity
  3. Not well suited to high-dimensional sparse features

Tunable parameters (an illustrative scikit-learn mapping follows the list)

  1. Number of trees: 100 to 10000
  2. Tree depth: 3 to 8
  3. Learning rate: 0.01 to 1
  4. Maximum number of leaf nodes: 20
  5. Training sample (row) subsampling ratio: 0.5 to 1
  6. Training feature subsampling ratio
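As a rough, hedged illustration of how these knobs map onto a concrete library, here is a scikit-learn GradientBoostingRegressor configured with example values inside the ranges above (the specific values are arbitrary, not recommendations from the article):

from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(
    n_estimators=500,       # number of trees (item 1)
    max_depth=4,            # tree depth (item 2)
    learning_rate=0.05,     # learning rate (item 3)
    max_leaf_nodes=20,      # maximum number of leaf nodes (item 4)
    subsample=0.8,          # training sample subsampling ratio (item 5)
    max_features=0.8,       # training feature subsampling ratio (item 6)
)
# gbdt.fit(X_train, y_train) with your own data.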

XGBoost

XGBoost is a boosted tree implementation that has shone in Kaggle competitions. It has the following notable characteristics:

  1. The complexity of the tree model is added to the optimization objective as a regularization term.
  2. The objective is approximated with a second-order Taylor expansion, so second-order gradient information is used.
  3. An approximate algorithm for finding split points is implemented.
  4. The sparsity of features is utilized.
  5. Data is sorted in advance and stored in block form, which facilitates parallel computing.
  6. Based on distributed communication framework Rabit, it can run on MPI and YARN. (The latest version is not based on Rabit)
  7. Architecture-oriented optimizations are implemented for cache and memory performance.

In practice, XGBoost trains much faster than traditional GBDT implementations, often by about an order of magnitude.

Code implementation

The following is a simple example of using the XGBoost framework.

import xgboost as xgb
from sklearn.model_selection import train_test_split

# Split the data set (X and y are assumed to be prepared beforehand)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1729)
print(X_train.shape, X_test.shape)

# Model parameter Settings
xlf = xgb.XGBRegressor(max_depth=10, 
                        learning_rate=0.1, 
                        n_estimators=10, 
                        silent=True, 
                        objective='reg:linear', 
                        nthread=-1, 
                        gamma=0,
                        min_child_weight=1, 
                        max_delta_step=0, 
                        subsample=0.85, 
                        colsample_bytree=0.7, 
                        colsample_bylevel=1, 
                        reg_alpha=0, 
                        reg_lambda=1, 
                        scale_pos_weight=1, 
                        seed=1440, 
                        missing=None)

xlf.fit(X_train, y_train, eval_metric='rmse', verbose=True, eval_set=[(X_test, y_test)], early_stopping_rounds=100)

# Make predictions on the test set
preds = xlf.predict(X_test)

11. Optimization algorithms

Common optimization methods include gradient descent, Newton's method and quasi-Newton methods, the conjugate gradient method, and so on.

Gradient descent method

Gradient descent is the earliest, simplest and most commonly used optimization method.

When the objective function is convex, the solution found by gradient descent is a global optimum.

In general, its solution is not guaranteed to be the global optimal solution, and the speed of gradient descent method is not necessarily the fastest.

The optimization idea of gradient descent is to use the negative gradient direction at the current position as the search direction, since this is the direction of steepest descent at that position; hence it is also called the method of steepest descent. The closer the method gets to the target value, the smaller the steps become and the slower the progress.

The search iteration of gradient descent method is shown in the figure below:

Its disadvantages are:

(1) Convergence slows down when it approaches the minimum value, as shown in the figure below;

(2) Line search may cause some problems;

(3) It may descend in a zigzag pattern.

As can be seen from the figure above, the convergence of gradient descent slows down noticeably in the region close to the optimal solution, and solving the problem with gradient descent requires many iterations.

In machine learning, three variants of the basic gradient descent method have been developed:

  1. Batch gradient descent: each iteration uses the entire training set
  2. Stochastic gradient descent: each iteration uses a single randomly chosen training sample
  3. Mini-batch gradient descent: each iteration uses a small subset of the training set

Mini-batch gradient descent is a compromise between the other two and is currently the most commonly used variant. It avoids batch gradient descent's drawback of having to compute over the entire training set, while also avoiding the oscillating, unstable training of stochastic gradient descent. Of course, compared with the other two methods it can be more prone to getting stuck in local minima. A minimal sketch is shown below.
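A minimal NumPy sketch of mini-batch gradient descent on a least-squares problem (the data, learning rate, batch size, and epoch count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy features
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                                # parameters to learn
lr, batch_size, n_epochs = 0.1, 32, 20

for epoch in range(n_epochs):
    idx = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # indices of the current mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient of MSE on the mini-batch
        w -= lr * grad                         # step in the negative gradient direction

print(w)  # should be close to the true coefficients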

Newton’s method

Newton's method is a method for approximating the solutions of equations over the real and complex fields. It uses the first few terms of the Taylor series of a function f(x) to find roots of the equation f(x) = 0.

In optimization it is a second-order algorithm: it uses the Hessian matrix of second-order partial derivatives of the loss function with respect to the weights, aiming to find a better training direction from this second-order information.

The greatest characteristic of Newton’s method is that it converges quickly.
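In one dimension the Hessian reduces to the second derivative, so a minimal sketch of the Newton iteration x ← x − f′(x)/f″(x) looks like this (the function and starting point are illustrative assumptions):

def newton_minimize(grad, hess, x0, n_iter=20):
    # Iterate x <- x - f'(x) / f''(x); at a minimum the gradient is zero.
    x = x0
    for _ in range(n_iter):
        x = x - grad(x) / hess(x)
    return x

# Example: f(x) = x**4 - 3*x**3 + 2 has a local minimum at x = 9/4 = 2.25.
print(newton_minimize(grad=lambda x: 4*x**3 - 9*x**2,
                      hess=lambda x: 12*x**2 - 18*x,
                      x0=3.0))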

Newton’s method is based on the current position of the tangent line to determine the next position, so Newton’s method is also vividly known as the “tangent method”. The search path of Newton’s method (two-dimensional case) is shown in the figure below:

Dynamic example graph of Newton method search:

Comparison of efficiency between Newton method and gradient descent method:

  • Essentially, Newton's method has second-order convergence while gradient descent has first-order convergence, so Newton's method converges faster. Informally, if you want to find the shortest path to the bottom of a basin, gradient descent takes one step in the direction of steepest descent from the current position, whereas Newton's method, when choosing a direction, also considers whether the slope will become steeper after the step. So Newton's method can see a little further ahead than gradient descent and reach the bottom faster. (Newton's method takes a longer view and so makes fewer detours; gradient descent, in contrast, only considers the local optimum, without a global perspective.)
  • According to the explanation on the wiki, geometrically speaking, Newton's method fits the local surface at the current position with a quadric surface, while gradient descent fits it with a plane. The quadric surface usually fits better, so the descent path chosen by Newton's method more closely follows the true optimal path.

Note: in the figure, the red path is the iteration path of Newton's method and the green path is the iteration path of gradient descent.

The advantages and disadvantages
advantages

Second order convergence, fast convergence;

disadvantages
  1. Computing the Hessian matrix (and its inverse) is expensive. When the problem is large, this requires not only a great deal of computation but also a great deal of storage, so the per-iteration cost makes Newton's method impractical for massive data.
  2. Newton's method cannot guarantee that the Hessian matrix is positive definite at every iteration. Once the Hessian is not positive definite, the optimization direction may go astray and Newton's method fails, which also shows that Newton's method is not very robust.

Quasi-newton method

The essential idea of the quasi-Newton method is to remedy Newton's method's need to solve for the inverse of the complicated Hessian matrix at every iteration: it uses a positive definite matrix to approximate the inverse of the Hessian, thus simplifying the computation.

Like the steepest descent method, the quasi-Newton method only requires the gradient of the objective function at each iteration. By measuring the change in the gradient, it constructs a model of the objective function that is sufficient to produce superlinear convergence. Such methods are vastly superior to steepest descent, especially on difficult problems.

In addition, because the quasi-Newton method does not require second-derivative information, it computes at each iteration a matrix that approximates the inverse of the Hessian. Most importantly, this approximation is built using only the first partial derivatives of the loss function, so it is sometimes more efficient than Newton's method.

Today, optimization software contains a large number of quasi-Newton algorithms for solving unconstrained, constrained, and large-scale optimization problems.
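A minimal sketch, assuming SciPy is available: BFGS, a widely used quasi-Newton method, needs only the objective and its gradient. The Rosenbrock test function and starting point here are illustrative choices:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])                               # arbitrary starting point
res = minimize(rosen, x0, jac=rosen_der, method='BFGS')  # gradient only, no Hessian needed
print(res.x, res.nit)                                    # optimum near [1, 1]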

Conjugate Gradient method

The conjugate gradient method lies between the steepest descent method and Newton's method. **It uses only first-derivative information, yet it overcomes the slow convergence of steepest descent while avoiding Newton's method's need to store and compute the Hessian matrix and its inverse.** The conjugate gradient method is not only one of the most useful methods for solving large systems of linear equations, it is also one of the most efficient algorithms for solving large nonlinear optimization problems. Among the various optimization algorithms, the conjugate gradient method is very important: it requires little storage, converges quickly, is highly stable, and does not need any external parameters.

In conjugate gradient training algorithms, because the search is performed along conjugate directions, the algorithm generally converges more rapidly than search along the gradient descent direction. The training directions of the conjugate gradient method are conjugate with respect to the Hessian matrix.

The conjugate gradient method has proved to be much more efficient than gradient descent for neural networks. And because it does not require the Hessian matrix, it can also achieve good performance on large-scale neural networks.
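SciPy's nonlinear conjugate gradient solver can be used in the same way as the quasi-Newton sketch above, changing a single argument (again an illustrative sketch):

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

res_cg = minimize(rosen, np.array([-1.2, 1.0]), jac=rosen_der, method='CG')  # conjugate gradient
print(res_cg.x, res_cg.nit)   # only gradients are used; no Hessian is stored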

Heuristic optimization method

A heuristic method solves a problem using empirically discovered rules: it draws on past experience to select approaches that have already proved effective, rather than systematically exhausting every possible step to determine the answer. There are many heuristic optimization methods, including the classical simulated annealing method, genetic algorithms, ant colony algorithms, and particle swarm optimization.

There is also a special class of optimization algorithms known as multi-objective optimization algorithms, which are aimed at problems that optimize multiple objectives (two or more) simultaneously; classical algorithms in this area include NSGA-II, MOEA/D, and artificial immune algorithms.

Solving constrained optimization problem — Lagrange multiplier method

This method can be found in the article Lagrange multiplier method

Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm, also known as the damped least-squares method, uses a loss function in the form of a sum of squared errors. The algorithm does not need to compute the exact Hessian matrix; it uses only the gradient vector and the Jacobian matrix.

The Levenberg-Marquardt algorithm is tailored to loss functions that take the form of a sum of squared errors, which allows neural networks using this error measure to be trained very quickly. However, the Levenberg-Marquardt algorithm has some disadvantages:

  • It cannot be applied to loss functions such as the root mean squared error or the cross-entropy error;
  • The algorithm is also incompatible with regularization terms;
  • Finally, for large data sets or neural networks, the Jacobian matrix becomes very large and therefore requires a lot of memory, so the Levenberg-Marquardt algorithm is not recommended in those cases. (A small SciPy sketch follows this list.)
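As a hedged sketch (assuming SciPy; the exponential-decay residual model below is an arbitrary example), Levenberg-Marquardt is available through scipy.optimize.least_squares, which minimizes a sum of squared residuals:

import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0, 1, 50)
y_obs = 2.0 * np.exp(-1.3 * t) + 0.01 * np.random.default_rng(0).normal(size=t.size)  # toy data

def residuals(params):
    a, b = params
    return a * np.exp(-b * t) - y_obs        # LM minimizes the sum of these squared residuals

res = least_squares(residuals, x0=[1.0, 1.0], method='lm')   # damped least squares
print(res.x)                                                 # close to [2.0, 1.3]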

Comparison of memory and convergence speed

The following figure shows all of the algorithms discussed above together with their convergence speed and memory requirements. Gradient descent converges the most slowly but requires the least memory; conversely, the Levenberg-Marquardt algorithm is probably the fastest to converge but also requires the most memory. A good compromise is the quasi-Newton method.

To sum up:

  • If our neural network has tens of thousands of parameters, we can use gradient descent or conjugate gradient method to save memory.
  • If we need to train multiple neural networks, each of which has only hundreds of parameters and thousands of samples, then we can consider the Levenberg-Marquardt algorithm.
  • For the rest, the quasi-Newton method works just fine.

12. Convolutional Neural Networks (CNN)

CNN can be applied to scene classification, image classification, and now it can also be applied to many problems in natural language processing (NLP), such as sentence classification.

LeNet is one of the earliest CNN architectures; it was created by Yann LeCun and was mainly used for character classification problems.

A convolutional neural network mainly contains four different types of layers: the convolution layer, the nonlinear layer (using the ReLU function), the pooling layer, and the fully connected layer. These layers are introduced one by one below.

12.1 Convolution layer

Introduction to convolution

CNN gets its name because it uses convolution. The main purpose of convolution is to extract the features of images. The convolution operation can preserve the spatial relationship between pixels.

Each image can be thought of as a matrix of pixel values ranging from 0 to 255, where 0 is black and 255 is white. Below is an example of a 5×5 matrix whose values are either 0 or 1.

Here’s another 3 by 3 matrix:

The convolution of the above two matrices results in the pink matrix on the right side of the figure below.

The yellow matrix slides over the green matrix from left to right and from top to bottom, moving 1 pixel per step, which produces a 3×3 matrix.

In CNN, the yellow matrix is called filter or kernel or Feature extractor, while the matrix obtained through convolution is called “Feature Map” or “Activation Map”.

In addition, different Feature maps can be obtained by using different filter matrices, as shown in the figure below:

In the figure above, different operations such as edge detection, sharpening and blurring are implemented through the filter matrix.

In practical applications, a CNN learns the values of these filters on its own during training, but we still need to specify the filter size, the number of filters, and the network architecture beforehand. The more filters are used, the more image features can be extracted and the better the network tends to perform.
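A minimal NumPy sketch of this sliding-window computation (strictly a cross-correlation, which is how CNN layers actually compute, as noted in the formula section below; the matrices and stride are illustrative stand-ins for the figures):

import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image and take the element-wise product-sum at each position.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1,1,1,0,0],
                  [0,1,1,1,0],
                  [0,0,1,1,1],
                  [0,0,1,1,0],
                  [0,1,1,0,0]])            # a 5x5 binary image (illustrative values)
kernel = np.array([[1,0,1],
                   [0,1,0],
                   [1,0,1]])               # a 3x3 filter (illustrative values)
print(conv2d(image, kernel))               # the resulting 3x3 feature map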

The dimension of Feature Map is determined by the following three parameters:

  • Depth: Depth is equal to the number of filters.
  • Stride: The Stride value is the distance of each slide when using the filter to slide on the input matrix. The larger the step value is, the smaller the size of the Feature Map obtained.
  • Zero-padding: Sometimes we can fill in 0 at the edge of the input matrix, so that the filter can be applied to the edge pixels. A good zero-padding allows us to control the size of the feature image. Convolution with this method is called wide convolution, while convolution without this method is called narrow convolution.
Convolution formula and number of parameters

Convolution is one of the most common operations in nature; signal observation, acquisition, transmission, and processing can all be realized through a convolution process, which can be expressed as

$$y(m, n) = x(m, n) * h(m, n) = \sum_{i}\sum_{j} x(i, j)\, h(m - i,\, n - j)$$

where $h(m, n)$ denotes the convolution kernel.

The computation performed by a convolution layer in a CNN differs slightly from the two-dimensional convolution defined by the formula above. First, the dimensionality rises to three or four dimensions: compared with two-dimensional convolution there is an extra **channel** dimension. Each channel is still processed with a two-dimensional convolution, multiple channels are each convolved with their own two-dimensional kernels, and the resulting multi-channel outputs are merged (summed) into a single output channel. Second, the kernel is not "flipped" during the computation; instead it performs a "correlation" with the input image inside the sliding window. This can be expressed as

$$Y^{l} = \sum_{k=1}^{K} X^{k} * H^{kl}, \qquad l = 1, \dots, L$$

Here it is assumed that the convolution layer has L output channels and K input channels, so K×L convolution kernels are needed to convert the number of channels. X^k denotes the two-dimensional feature map of the k-th input channel, Y^l denotes the two-dimensional feature map of the l-th output channel, and H^{kl} denotes the two-dimensional convolution kernel in the k-th row and l-th column.

Assuming each convolution kernel has size I×J and each output-channel feature map has size M×N, the computation performed by the convolution layer for one sample during forward propagation is

Computations (MACs) = I × J × M × N × K × L.

The learnable parameters of the convolution layer are the number of kernels times the kernel size:

Parameters = I × J × K × L.

Define the ratio of the number of computations to the number of parameters as

CPR = Computations / Parameters = M × N.

Therefore, the larger the output feature maps of a convolution layer, the larger the CPR and the higher the degree of parameter reuse. If a mini-batch of B samples is processed, the computation grows by a factor of B while the parameters stay unchanged, so the CPR increases by a factor of B.
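A small helper makes this counting concrete (an illustration of the formulas above; the layer sizes are made-up examples):

def conv_layer_stats(I, J, M, N, K, L, B=1):
    macs = B * I * J * M * N * K * L        # multiply-accumulates per forward pass
    params = K * L * I * J                  # learnable weights (ignoring biases)
    return macs, params, macs / params      # the last value is CPR = B * M * N

print(conv_layer_stats(I=3, J=3, M=32, N=32, K=64, L=128))  # CPR = 32 * 32 = 1024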

advantages

Through **parameter reduction** and **weight sharing**, a convolutional neural network greatly reduces the number of connections, i.e. the number of parameters that need to be trained.

Suppose our image is 1000×1000 and we have 10^6 hidden-layer neurons. If they are fully connected, i.e. each hidden neuron connects to every pixel of the image, then there are 10^12 connections and therefore 10^12 weight parameters to train, which is clearly impractical.

But for a convolution kernel that recognizes only a specific feature, does its receptive field really need to cover the entire image?

Usually not: a particular feature, especially one extracted in the first layer, is usually fairly basic and occupies only a small part of the image. So we set up a small local receptive field, say 10×10, which means each neuron only connects to a 10×10 local patch of the image; 10^6 neurons then have 10^8 connections. This is parameter reduction.

So what is weight sharing?

In the local connection above there are 10^6 neurons, each with 100 parameters, i.e. 10^8 parameters in total. If every neuron shares the same 100 parameters, then only 100 parameters need to be trained.

The reasoning behind this is that those 100 parameters are one convolution kernel, and a convolution kernel is a way of extracting a feature that is independent of position in the image: the local statistics of one part of an image are the same as those of other parts, so a kernel that extracts a feature from one local patch can be applied to any other part of the image.

Of course, those 100 parameters are only one convolution kernel and can extract only one feature, so we can instead use 100 convolution kernels and extract 100 features, which still requires training only 10^4 parameters instead of the 10^12 needed for full connection. With 100 kernels we obtain 100 feature maps, and each feature map can be regarded as a different channel of the image.

CNNs are mainly used to recognize two-dimensional patterns that are invariant to translation, scaling, and other forms of distortion.

Because the feature-detection layers of a CNN learn from the training data, explicit feature extraction is avoided: features are learned implicitly from the training data.

Moreover, since the neurons on the same feature map share the same weights, the network can learn in parallel, which is another big advantage of convolutional networks over fully connected networks.

With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing, and its layout is closer to that of real biological neural networks. Weight sharing reduces the complexity of the network and avoids the complexity of data reconstruction during feature extraction and classification.

12.2 Nonlinear Layer (ReLU)

The nonlinear rectification function **ReLU (Rectified Linear Unit)** is shown in the figure below:

ReLU is an element-wise operation applied to each pixel, replacing all negative values in the feature map with zero (output = max(0, input)).

Its purpose is to add nonlinearity into CNN, because the real world problems solved by CNN are all nonlinear, while convolution operation is linear operation, so a nonlinear function such as ReLU must be used to add nonlinear properties.

Other nonlinear functions include TANH and Sigmoid, but the ReLU function has been shown to perform best in most cases.

12.3 Pooling layer

**Spatial Pooling**, also called subsampling or downsampling, reduces the dimensionality of a feature map while preserving the most important information. It comes in different types, such as max, average, and sum pooling.

For the Max Pooling operation, first define a spatial neighborhood, such as a 2×2 window, and take the largest element of the ReLU feature map within that window. Instead of the largest element, one can also take the average or the sum of the elements in the window.

However, Max Pooling provides the best performance. An example would look like this:

The stride used in the figure above is 2.
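A minimal NumPy sketch of 2×2 max pooling with stride 2 (the feature-map values are illustrative):

import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    h, w = fmap.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()        # keep only the strongest activation in the window
    return out

fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])             # a toy 4x4 ReLU feature map
print(max_pool2d(fmap))                     # [[6, 8], [3, 4]]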

According to relevant theories, the error of feature extraction mainly comes from two aspects:

  1. The variance of the estimate increases because of the limited size of the neighborhood.
  2. Errors in the convolution layer parameters cause a shift in the estimated mean.

Generally speaking, mean-pooling can reduce the first error and retain more background information of the image, while max-pooling can reduce the second error and retain more texture information.

The reasons for Pooling are as follows:

  • Invariance: we care more about whether a feature is present than about exactly where it is. This can be thought of as adding a very strong prior, so that the learned features tolerate some variation.
  • Reduce the input size of the next layer, reduce the amount of calculation and the number of parameters.
  • Get fixed-length output. (For text categorization, the input is of variable length, and the output can be pooled to a fixed length.)
  • Helps prevent overfitting (at the possible risk of underfitting)

12.4 Full Connection Layer

The fully connected layer is a traditional multi-layer perceptron that uses a Softmax activation function in the output layer.

Its main function is to combine the features extracted from the previous convolution layer and then classify them.

The Softmax function turns a vector of arbitrary real-valued scores into a vector of values between 0 and 1 whose elements sum to 1.
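For instance, a small illustrative snippet:

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))    # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.659, 0.242, 0.099] -- sums to 1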

Before CNN, the earliest computing types of deep learning networks were all fully connected.

Comparing the convolution layer with the fully connected layer: the convolution layer shares weights across the spatial dimensions of the output feature map, which is an important way to reduce the number of parameters; at the same time, its local connectivity (compared with full connection) also significantly reduces the parameter count.

As a result, the convolution layers account for a small share of the parameters but a large share of the computation, while the fully connected layers account for a large share of the parameters but a small share of the computation. Computational acceleration therefore focuses on the convolution layers, whereas parameter compression and weight pruning focus on the fully connected layers.

12.5 Backpropagation

The entire training process of CNN is as follows:

  1. The first is to randomly initialize all filters and other parameters and weight values;
  2. Feed an image forward through the network, i.e. through the convolution, ReLU, and pooling layers and finally the fully connected layer, to obtain a classification result: a vector containing the predicted probability of each class;
  3. Compute the error, i.e. the cost function; there are several possible choices, with the sum-of-squares loss being a common one;
  4. Use backpropagation to compute the gradient of the error with respect to each weight in the network, and then use gradient descent to update each filter's weights so as to minimize the output error, i.e. the value of the cost function;
  5. Repeat steps 2 through 4 until the specified number of training iterations is reached. (A minimal training-step sketch follows this list.)
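A hedged sketch of one such training iteration, assuming a PyTorch-style framework (the tiny network, data shapes, and hyperparameters are illustrative, not from the article):

import torch
import torch.nn as nn

model = nn.Sequential(                      # conv -> ReLU -> pool -> flatten -> fully connected
    nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 13 * 13, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # step 1: weights start out random
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(32, 1, 28, 28)         # a toy batch of 28x28 grayscale images
labels = torch.randint(0, 10, (32,))

logits = model(images)                      # step 2: forward propagation
loss = loss_fn(logits, labels)              # step 3: compute the error (cost function)
optimizer.zero_grad()
loss.backward()                             # step 4: backpropagate the gradients ...
optimizer.step()                            # ... and update the weights by gradient descent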

summary

This is a brief introduction to commonly used machine learning algorithms, and the next article will introduce model evaluation methods.


