Editor's note: this article is reposted from Liu Zhiwei. Choosing a proper algorithm is very important in machine learning. This article introduces 8 kinds of machine learning algorithms and their advantages and disadvantages, to offer some advice on how to choose among them.

Introduction

There are many kinds of machine learning algorithms: classification, regression, clustering, recommendation, image recognition, and so on. Finding a suitable one is not easy, so in practice we generally experiment heuristically. We usually start with the algorithms that are widely recognized, such as SVM, GBDT, and Adaboost; deep learning is very popular now, and neural networks are also a good choice. If you care about accuracy, the best approach is to cross-validate each candidate algorithm one by one, compare them, and tune the parameters so that each algorithm reaches its best solution, and then choose the best performer. But if you are just looking for a "good enough" algorithm for your problem, or simply want some tips to narrow the choice, the pros and cons of each algorithm summarized below make it easier to pick one.
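As a concrete illustration of the "cross-validate each candidate and compare" approach, here is a minimal sketch, assuming scikit-learn is available; the dataset and the particular candidate models are only placeholders for your own.

```python
# Compare several commonly used classifiers by 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in for your own data

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=5000),
    "svm_rbf": SVC(kernel="rbf"),
    "gbdt": GradientBoostingClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)      # accuracy per fold
    print(f"{name:>20}: {scores.mean():.3f} +/- {scores.std():.3f}")
```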

Bias & variance

In statistics, a model is often evaluated in terms of bias and variance, so let's first review these two concepts:

Bias: describes the difference between the expected value of the predictions (estimates) and the true value Y. The larger the bias, the further the model deviates from the real data.

Variance: describes the range and dispersion of the predicted values P, i.e. the variance of the predictions around their expected value E. The larger the variance, the more spread out the predictions are.

The real error of the model is the combination of the two: Error = Bias² + Variance (plus irreducible noise).

In the case of a small training set, a classifier with high bias and low variance (e.g., naive Bayes) has an advantage over a classifier with low bias and high variance (e.g., KNN), because the latter will overfit. However, as the training set grows, the model becomes better at capturing the underlying data distribution, the bias decreases, and the low-bias/high-variance classifiers gradually win out (because they have lower asymptotic error), while the high-bias classifiers are no longer sufficient to provide an accurate model.

Of course, you can also think of this as a difference between a generative model (NB) and a discriminative model (KNN).

Why does naive Bayes have high bias and low variance?

The following is quoted from Zhihu:

First, suppose you know the relationship between the training set and the test set. Simply put, we learn a model from the training set and then use it on the test set, and how well it does is measured by the test-set error rate. Most of the time, however, we can only assume that the test set and the training set follow the same data distribution, without access to the real test data. How do you estimate the test error rate when you can only observe the training error rate?

Since the training samples are few (or at least not sufficiently many), the model obtained from the training set is never truly correct. (Even 100% accuracy on the training set does not mean the model depicts the real data distribution; depicting the real distribution is our goal, not just fitting the finite points of the training set.) Moreover, in practice training samples often contain a certain amount of noise, so if we adopt a very complex model in pursuit of perfection on the training set, the model will treat all the errors in the training set as genuine features of the data distribution and thus arrive at a wrong estimate of that distribution. When it then meets the actual test set, it performs badly (a phenomenon called overfitting). On the other hand, the model should not be too simple either, or it will not be able to describe the data distribution when that distribution is complicated (reflected in a high error rate even on the training set, which is underfitting). Overfitting means the adopted model is more complex than the real data distribution; underfitting means it is simpler.

In the statistical learning framework, one view describes model complexity as Error = Bias + Variance. Error here can be roughly understood as the prediction error rate of the model, and it comes from two sources: inaccurate estimation caused by the model being too simple (Bias), and a larger variation space caused by the model being too complex (Variance).

This makes it easy to analyze naive Bayes. It simply assumes that the individual features are independent of each other, a grossly simplified model. For such a simple model, the Bias term will dominate the Variance term in most cases, that is, high bias and low variance.

In practice, to keep the Error as small as possible, we need to balance the proportions of Bias and Variance when selecting a model, that is, to balance overfitting against underfitting.
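To make the trade-off concrete, here is a toy sketch (synthetic data; the polynomial degrees and noise level are arbitrary choices; scikit-learn assumed) in which a too-simple model underfits and a too-complex one overfits:

```python
# Fit polynomials of increasing degree to noisy data and compare train/test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy nonlinear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```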

The relationship between bias, variance, and model complexity can be summarized as follows:

As the model complexity increases, the bias decreases and the variance increases.

Advantages and disadvantages of common algorithms

1. Naive Bayes

Naive Bayes is a generative model and is very simple: you essentially just count feature occurrences. With the conditional-independence assumption (a strict condition), a naive Bayes classifier converges faster than a discriminative model such as logistic regression, so it needs less training data. Even when the conditional-independence assumption does not hold, the NB classifier still performs well in practice. Its main disadvantage is that it cannot learn interactions between features (in mRMR terms, it cannot exploit feature redundancy). A classic example: even though you like both Brad Pitt and Tom Cruise movies, it cannot learn that you dislike the movies they appear in together.
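Below is a minimal sketch of naive Bayes for text classification, assuming scikit-learn; the tiny corpus and its labels are purely illustrative. Under the hood the model is essentially counting word occurrences per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training texts and labels.
texts = ["free prize money now", "lowest price guaranteed",
         "meeting agenda attached", "see you at lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["win free money"]))          # likely 'spam'
print(model.predict_proba(["agenda for lunch"]))  # class probabilities from the counts
```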

Advantages:

  • The naive Bayes model originates from classical probability theory, so it has a solid mathematical foundation and stable classification performance.

  • Performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training;

  • It is not sensitive to missing data, and the algorithm is relatively simple; it is often used for text classification.

Disadvantages:

  • You need to calculate the prior probability;

  • The classification decision has an inherent error rate;

  • Sensitive to how the input data are represented.

2. Logistic Regression

Logistic regression is a discriminative model. There are many ways to regularize it (L0, L1, L2, etc.), and you don't have to worry about whether your features are correlated, as you would with naive Bayes. Compared with decision trees and SVMs, you also get a good probabilistic interpretation, and you can easily update the model with new data (using online gradient descent). Use it if you need a probabilistic framework (for example, to simply adjust classification thresholds, express uncertainty, or obtain confidence intervals), or if you expect to quickly incorporate more training data into the model in the future.

The sigmoid function: σ(z) = 1 / (1 + e^(-z))
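As a sketch of the probabilistic output and the online update mentioned above, assuming scikit-learn's SGDClassifier with logistic loss as the learner and purely synthetic data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real score to a probability in (0, 1)

rng = np.random.RandomState(0)
X1, y1 = rng.randn(100, 3), rng.randint(0, 2, 100)   # first batch of (toy) data
X2, y2 = rng.randn(50, 3), rng.randint(0, 2, 50)     # new data arriving later

# loss="log_loss" gives logistic regression trained by SGD (older scikit-learn uses loss="log").
clf = SGDClassifier(loss="log_loss", random_state=0)
clf.partial_fit(X1, y1, classes=[0, 1])   # initial fit
clf.partial_fit(X2, y2)                   # online update, no retraining from scratch

scores = clf.decision_function(X2[:3])
print(sigmoid(scores))                    # same probabilities predict_proba would report
```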

Advantages:

  • Simple implementation, widely used in industrial problems;

  • When classifying, the amount of computation is very small, the speed is fast, and the storage requirements are low;

  • It is easy to observe predicted probability scores for samples;

  • For logistic regression, multicollinearity is not a fatal problem; it can be handled by combining the model with L2 regularization.

Disadvantages:

  • When the feature space is very large, logistic regression does not perform very well;

  • Prone to underfitting; the accuracy is generally not very high;

  • Unable to deal with a large number of multi-class features or variables;

  • It can only handle binary classification (softmax, which is derived from it, can be used for multi-class problems), and the classes must be linearly separable;

  • Nonlinear features need to be transformed first.

3. Linear regression

Linear regression is used for regression, unlike logistic regression, which is used for classification. Its basic idea is to optimize the least-squares error function with gradient descent; of course, the normal equation can also be used to solve for the parameters directly.

In LWLR (locally weighted linear regression), the parameters are computed as θ = (X^T W X)^(-1) X^T W y, where W is a diagonal matrix that assigns a weight to each training sample (typically according to its distance from the query point).

It can be seen that LWLR differs from ordinary LR in that LWLR is a non-parametric model: every regression computation has to traverse the training samples at least once.

Advantages: simple to implement and cheap to compute. Disadvantages: cannot fit nonlinear data.
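Here is a small numpy sketch of both routes: ordinary least squares via the normal equation, and an LWLR helper with a Gaussian kernel. The data, the bandwidth k, and the query point are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(100), rng.uniform(-3, 3, 100)]             # bias column + one feature
y = 2.0 + 1.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# Ordinary least squares via the normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print("OLS theta:", theta)

def lwlr(x_query, X, y, k=0.5):
    """Locally weighted prediction at x_query using a Gaussian kernel of bandwidth k."""
    diff = X - x_query
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * k ** 2))     # one weight per training sample
    W = np.diag(w)
    theta_local = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # (X^T W X)^-1 X^T W y
    return x_query @ theta_local

print("LWLR prediction at x = 1:", lwlr(np.array([1.0, 1.0]), X, y))
```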

4. The nearest neighbor algorithm: KNN

KNN is the k-nearest neighbor algorithm. Its main steps are:

1. Compute the distance from the test sample to every sample point in the training set (common distance measures include the Euclidean distance, the Mahalanobis distance, etc.);
2. Sort all of the distance values;
3. Select the k samples with the smallest distances;
4. Vote according to the labels of these k samples to obtain the final classification.
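A direct numpy sketch of these four steps; the Euclidean distance, the toy training set, and k = 3 are all assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(x_test, X_train, y_train, k=3):
    # 1. compute the distance from the test point to every training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # 2. and 3. sort the distances and keep the k nearest samples
    nearest = np.argsort(dists)[:k]
    # 4. vote with the labels of those k samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.5, 0.5]), X_train, y_train))   # -> 0
print(knn_predict(np.array([5.5, 5.5]), X_train, y_train))   # -> 1
```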

How to choose the optimal K depends on the data. In general, a larger K reduces the influence of noise during classification, but it also blurs the boundaries between categories. A good K can be obtained with heuristic techniques such as cross-validation. In addition, noise and irrelevant features reduce the accuracy of the k-nearest neighbor algorithm.

The nearest neighbor algorithm has strong consistency: as the amount of data goes to infinity, the algorithm's error rate is guaranteed not to exceed twice the Bayes error rate. For a good choice of K, the k-nearest neighbor error rate is guaranteed not to exceed the Bayes theoretical error rate.

Advantages of the KNN algorithm:

  • Mature theory and a simple idea; it can be used for both classification and regression;

  • It can be used for nonlinear classification;

  • The training time complexity is O(n).

  • No assumptions about the data, high accuracy, insensitive to outliers;

Disadvantages:

  • Large amount of calculation;

  • Suffers from the sample-imbalance problem (that is, some categories have many samples while others have few);

  • Requires a lot of memory;

5. The decision tree

Decision trees are easy to interpret. They handle interactions between features without any trouble and are non-parametric, so you don't have to worry about outliers or whether the data are linearly separable (for example, a decision tree can easily handle the case where class A appears at the low end of feature dimension x, class B in the middle, and class A again at the high end of x). One of their drawbacks is that they do not support online learning, so the tree has to be rebuilt from scratch when new samples arrive. Another drawback is the tendency to overfit, but this is exactly where ensemble methods such as random forests (or boosted trees) come in. In addition, random forests are often the winners of many classification problems (usually slightly better than SVMs); they are fast to train and tunable, and you don't have to worry about tuning a lot of parameters as with SVMs, so they have long been very popular.

An important point in decision trees is choosing which attribute to branch on, so pay attention to the information gain formula and understand it in depth.

The information entropy is calculated as H = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn)), where pi is the proportion of samples belonging to category i.

Here n means there are n categories (for example, n = 2 for a binary classification problem). With this formula, the information entropy before branching on any attribute can be calculated.

Now select an attribute x_i for branching. The branching rule is: if x_i = v, the sample goes to one branch of the tree; otherwise it goes to the other branch. The samples in each branch are likely to include both categories. Compute the entropies H1 and H2 of the two branches, and the total information entropy after branching, H' = p1*H1 + p2*H2 (where p1 and p2 are the fractions of samples falling into each branch); the information gain is then ΔH = H - H'. Following the principle of maximizing information gain, test all attributes and select the one with the largest gain as the branching attribute.
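A small numpy sketch of this entropy and information-gain computation; the label array and the candidate attribute below are made-up illustrations.

```python
import numpy as np

def entropy(labels):
    """H = -(p1*log2(p1) + ... + pn*log2(pn)) over the class proportions in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    """Gain of splitting labels into the two branches given by the boolean mask."""
    left, right = labels[split_mask], labels[~split_mask]
    p1, p2 = len(left) / len(labels), len(right) / len(labels)
    h_after = p1 * entropy(left) + p2 * entropy(right)   # H' = p1*H1 + p2*H2
    return entropy(labels) - h_after                     # delta H = H - H'

y = np.array(["A", "A", "A", "B", "A", "B", "B", "B"])
x = np.array([1, 1, 1, 0, 1, 0, 0, 1])                   # candidate attribute x_i
print(information_gain(y, x == 1))                       # gain of branching on x_i == 1
```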

Advantages of decision tree:

  • Simple to compute, easy to understand, and highly interpretable;

  • It is suitable for processing samples with missing attributes;

  • Ability to deal with irrelevant features;

  • Able to produce feasible and effective results for large data sources in a relatively short time.

Disadvantages:

  • Overfitting is easy to occur (random forest can greatly reduce overfitting);

  • Ignoring correlations between data;

  • For data where the number of samples in each class is inconsistent, the information-gain results in a decision tree are biased toward features with more distinct values (this drawback applies whenever information gain is used, as in RF).

5.1 Adaboost

Adaboost is an additive model: each new model is built according to the error rate of the previous model, paying more attention to the misclassified samples and less to the correctly classified ones. After successive iterations, a fairly good model is obtained. Adaboost is a typical boosting algorithm. Here is a summary of its advantages and disadvantages.
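A brief sketch of Adaboost, assuming scikit-learn; its default weak learner is a depth-1 decision tree (a stump), and the dataset is only a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each of the 100 rounds reweights the samples the previous round misclassified.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```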

Advantages:

  • Adaboost is a classifier with very high precision.

  • There are various ways to build subclassifiers, and the Adaboost algorithm provides the framework.

  • When simple classifiers are used, the calculated results are understandable, and the construction of weak classifiers is extremely simple.

  • Simple, no need to do feature screening.

  • Overfitting is not easy to occur.

  • For combinatorial algorithms such as random forest and GBDT, see this article: Summary of Machine Learning combinatorial Algorithms

Disadvantages: Sensitive to outliers

6. SVM (support vector machine)

SVMs have high accuracy and provide good theoretical guarantees against overfitting, and even if the data are not linearly separable in the original feature space, they work well as long as an appropriate kernel function is given. They are especially popular for text classification problems, which tend to be very high-dimensional. Unfortunately, they are memory-hungry, hard to interpret, and somewhat annoying to run and tune, whereas random forests avoid exactly these shortcomings and are more practical.

Advantages:

  • It can solve the problem of high dimension, that is, large feature space;

  • Able to deal with the interaction of nonlinear features;

  • It does not rely on the entire data set (only the support vectors matter);

  • Can improve generalization ability;

Disadvantages:

  • It is not very efficient when there are many samples;

  • There is no general solution to nonlinear problems, sometimes it is difficult to find a suitable kernel function;

  • Sensitive to missing data;

The choice of kernel is also tricky (libSVM comes with four built-in kernel functions: linear, polynomial, RBF, and sigmoid):

  • First, if the number of samples is smaller than the number of features, there is no need to choose a nonlinear kernel; simply use a linear kernel;

  • Second, if the number of samples is greater than the number of features, then nonlinear kernel can be used to map the samples to higher dimensions, and generally better results can be obtained.

  • Third, if the number of samples is equal to the number of features, a nonlinear kernel can be used in this case, as in the second case.

In the first case, it is also possible to reduce the dimensionality of the data first and then use a nonlinear kernel, which is another approach.
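A short sketch of these rules of thumb, assuming scikit-learn; the dataset is just an example with far more samples than features, so RBF is the expected pick, and cross-validation makes the comparison explicit.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features

for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{kernel}: {score:.3f}")
```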

7. Advantages and disadvantages of artificial neural networks

Advantages of artificial neural network:

  • High accuracy of classification;

  • Strong parallel distributed processing ability, with distributed storage and learning ability;

  • It is robust and fault-tolerant to noisy data, and can fully approximate complex nonlinear relationships;

  • Have associative memory function.

Disadvantages of artificial neural network:

  • Neural networks need many parameters, such as the network topology and the initial values of the weights and thresholds;

  • The learning process cannot be observed, and the output is difficult to interpret, which affects the credibility and acceptability of the results;

  • Training may take too long, and may even fail to achieve the learning goal.

8. K-means clustering

I have written an article on k-means clustering before; see the post "Machine learning algorithm - K-means clustering". As for the derivation of k-means, there is a strong EM flavor in it.
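A minimal sketch of k-means, assuming scikit-learn and synthetic blob data; each iteration alternates an EM-like pair of steps, assigning points to the nearest center and then recomputing the centers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data with 3 blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # the k centers found
print(km.inertia_)           # the within-cluster sum of squared errors being minimized
```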

Advantages:

  • The algorithm is simple and easy to implement.

  • The algorithm scales relatively well and is efficient on large data sets, since its complexity is approximately O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations; usually k << n. The algorithm typically converges to a local optimum.

  • The algorithm tries to find the k partitions that minimize the squared-error function. The clustering works best when the clusters are dense and roughly spherical, and when the differences between clusters are obvious.

Disadvantages:

  • High requirements for data type, suitable for numerical data;

  • May converge to a local minimum, and converges slowly on large-scale data;

  • K value is difficult to select;

  • It is sensitive to the initial cluster centers; different initial values may lead to different clustering results;

  • Not suitable for finding clusters that are not convex shapes, or clusters that vary greatly in size.

  • Sensitive to noise and outlier data; a small amount of such data can greatly affect the mean.

Algorithm selection reference

I have translated some articles from abroad before, and one of them gave a simple technique for algorithm selection:

1. Logistic regression should be tried first. If it does not work well, its results can still serve as a baseline for comparing the other algorithms;

2. Then try decision trees (random forests) to see whether they dramatically improve the performance of your model. Even if you do not adopt them as the final model, you can use random forests to remove noisy variables and do feature selection;

3. If the number of features and observations is very large, then, provided resources and time are sufficient (this premise is important), SVM is an option.

As a rough reference: GBDT >= SVM >= RF >= Adaboost >= Other... Deep learning is now very popular and used in many fields; it is based on neural networks. I am still learning it myself, and my theoretical knowledge is not yet solid and my understanding not deep enough, so I will not introduce it here.

Algorithms are important, but good data beats a good algorithm, and designing good features goes a long way. If you have a very large data set, then whichever algorithm you use probably will not affect classification performance much (at that point you can choose based on speed and ease of use).

Lei Feng net copyright article, unauthorized reprint prohibited. See instructions for details.