Abstract: There are many different approaches to solving classification and regression problems with machine learning models. Here I try to briefly summarize the algorithmic pattern of each model, hoping to help you find the right solution for your particular problem.

There are many different approaches to solving classification and regression problems with machine learning models. These different models can be viewed as black boxes for solving the same problem. However, each model is derived from a different algorithm and behaves differently on different data sets. The best approach is to use cross-validation to determine which model works best on the data to be tested. Here I try to briefly summarize the algorithmic pattern of each model, hoping to help you find the right solution for your particular problem.

Commonly used machine learning models

1. Naive Bayes model

The naive Bayes model is simple but very important. It is a generative model, that is, it models the joint distribution of the problem. By the product rule of probability, we can obtain:

p(y, x1, …, xn) = p(y) p(x1 | y) p(x2 | x1, y) … p(xn | x1, …, xn−1, y)

Because of the complexity of the above form, naive Bayes makes the assumption that, given y, the generation probabilities of x1, …, xn are completely independent, that is:

p(y, x1, …, xn) = p(y) p(x1 | y) p(x2 | y) … p(xn | y)

Note that this does not mean that x1, …, xn are mutually independent; they are independent only given y, that is, it is a kind of "conditional independence". Readers familiar with probabilistic graphical models will recognize this as the structure in which y is the parent node of every xi.

Now that we have said naive Bayes is a generative model, what is its generative process? For spam classification, the generation process is as follows:

  • First, sample y from p(y) to decide whether the mail being generated is spam or non-spam;
  • Then, with the length n of the mail fixed, sample the words x1, x2, …, xn one by one from p(xi | y), using the y obtained in the previous step.

This is the naive Bayes model. Obviously, the naive Bayes assumption is a very strong one that is seldom met in practice: it claims that once a mail is determined to be spam or non-spam, its words are generated completely independently, with no connection between them.
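The generative training and prediction described above can be sketched in a few lines of Python. This is a minimal bag-of-words spam filter with Laplace smoothing; the toy documents and labels are invented for the example:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate the prior p(y) and, via Laplace smoothing, p(word | y)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    vocab = {w for doc in docs for w in doc}

    def word_prob(w, c):
        # Add-one smoothing so unseen words still get nonzero probability.
        return (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))

    return prior, word_prob

def predict_nb(doc, prior, word_prob):
    """Pick the class maximizing log p(y) + sum_i log p(xi | y)."""
    score = lambda c: math.log(prior[c]) + sum(math.log(word_prob(w, c)) for w in doc)
    return max(prior, key=score)

# Invented toy data: tokenized mails labeled spam / ham.
docs = [["free", "money"], ["hi", "friend"], ["free", "offer"], ["meeting", "friend"]]
labels = ["spam", "ham", "spam", "ham"]
prior, word_prob = train_nb(docs, labels)
```

Working in log space avoids numerical underflow when multiplying many small word probabilities.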

Advantages and disadvantages of naive Bayes model

  • Advantages: Good performance for small-scale data, suitable for multi-classification tasks, suitable for incremental training.
  • Disadvantages: Sensitive to the representation of input data.

2. Decision tree model

The decision tree model is an easy-to-use nonparametric classifier. It requires no prior assumptions about the data, is fast to compute, produces results that are easy to interpret, and is robust.

In complex decision-making situations, multi-level or multi-stage decisions are often required. After a decision is made at one stage, m new and different natural states may occur; in each natural state there are again m new strategies to choose from, each producing different results and leading to new natural states in turn. This continuing series of decisions is called sequential decision making or multi-level decision making.

At this point, continuing to apply the above decision criterion, or analyzing the problem with a payoff matrix, easily makes the corresponding tables very complicated. The decision tree is an effective tool to help decision makers perform sequential decision analysis: it expresses the strategies, natural states, probabilities and payoff values of the problem in tree form using lines and symbols.

The decision tree model is a tree diagram composed of decision points, strategy points (event points) and outcomes. It is generally used for sequential decision making, usually with maximum expected profit or minimum expected cost as the decision criterion.
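The maximum-expected-profit criterion at a chance node can be sketched as follows; the probabilities, payoffs, and strategy names are invented purely for illustration:

```python
def expected_profit(outcomes):
    """Expected value of a chance node, given (probability, payoff) pairs."""
    return sum(p * v for p, v in outcomes)

# Invented example: choose a strategy under two natural states
# (high demand with probability 0.7, low demand with probability 0.3).
strategies = {
    "large plant": [(0.7, 200), (0.3, -40)],  # big upside, risk of loss
    "small plant": [(0.7, 80), (0.3, 30)],    # modest but safe payoffs
}
# Decision criterion: pick the branch with the maximum expected profit.
best = max(strategies, key=lambda s: expected_profit(strategies[s]))
```

A full decision tree simply applies this calculation recursively, folding back from the leaves: chance nodes take expectations, decision nodes take the maximum.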

Advantages and disadvantages of decision tree model

  • Advantages: Shallow decision trees are visually intuitive and easy to interpret; no assumptions are needed about the structure and distribution of the data; interactions between variables can be captured.
  • Disadvantages: Deep decision trees are hard to visualize and interpret; decision trees tend to be unstable, since small changes in the sample data can produce very different trees; decision trees need a large sample size; the ability to handle missing values is very limited.

3. The KNN algorithm

KNN is the k-nearest neighbors algorithm. Its core idea is that if most of the k samples closest to a sample in the feature space belong to a certain category, then that sample also belongs to this category and shares the characteristics of samples in that category.

This method determines the classification of a sample only according to the category of the nearest one or a few samples; kNN is concerned with only a very small number of adjacent samples. Because kNN relies mainly on the limited surrounding samples rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains overlap or cross heavily. The main process is as follows:

1. Calculate the distance between the test sample and each sample point in the training set (common distance measures include Euclidean distance, Mahalanobis distance, etc.);

2. Sort all the above distance values;

3. Select the k samples with the smallest distances;

4. Vote according to the labels of those k samples to get the final classification category.
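The four steps above can be sketched directly in Python, using Euclidean distance; the toy training points are invented for the example:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; classify query by majority vote."""
    # Steps 1-2: compute Euclidean distances and sort by them.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))
    # Steps 3-4: keep the k closest samples and vote on their labels.
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Invented toy data: two well-separated clusters.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
```

Note that every prediction scans the whole training set, which illustrates the "large amount of computation" disadvantage listed below; real implementations use structures such as k-d trees to speed up the neighbor search.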

How to choose the optimal value of k depends on the data. In general, a larger k reduces the influence of noise during classification, but it blurs the boundaries between categories. A good k can be chosen by various heuristic techniques, such as cross-validation. In addition, noise and irrelevant features reduce the accuracy of the k-nearest neighbors algorithm.

The nearest neighbor algorithm has strong consistency: as the amount of data tends to infinity, its error rate is guaranteed not to exceed twice the Bayes error rate. For suitably chosen values of k, the k-nearest neighbors error rate approaches the Bayes error rate.

Advantages and disadvantages of KNN algorithm

  • Advantages: simple, easy to understand, easy to implement; no parameter estimation and no training; mature theory, usable for both classification and regression; usable for nonlinear classification; suitable for classifying rare events; high accuracy, no assumptions about the data, insensitive to outliers.
  • Disadvantages: large amount of computation; sensitive to sample imbalance (some categories have many samples while others have few); requires a lot of memory; poor interpretability, unable to produce rules the way decision trees do.

4. The SVM algorithm

The Support Vector Machine (SVM) is a common discriminative method. In machine learning it is a supervised learning model, usually used for pattern recognition, classification, and regression analysis.

The main ideas of SVM can be summarized as follows:

1. It analyzes the linearly separable case; for linearly inseparable data, it uses a nonlinear mapping to transform the samples from the low-dimensional input space, where they are not linearly separable, into a high-dimensional feature space where they are, making it possible to analyze the nonlinear features of the samples with a linear algorithm in that space.

2. Based on the theory of structural risk minimization, it constructs the optimal separating hyperplane in the feature space, so that the learner is globally optimal and the expected risk over the whole sample space satisfies a certain upper bound with a certain probability.
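Idea 1 can be illustrated with a toy lifting map; the points and the specific map are invented for the example. Samples inside a circle cannot be separated from samples outside it by any straight line in 2-D, but after adding x1^2 + x2^2 as a third feature, a flat plane separates them:

```python
def lift(point):
    """Map (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# Invented data: one class near the origin, the other on a surrounding ring.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3)]
outer = [(2.0, 0.0), (0.0, 2.1), (-1.5, 1.5)]

# In the lifted 3-D space the plane z = 1 separates the two classes:
# every inner point lands below it and every outer point above it.
inner_sep = all(lift(p)[2] < 1 for p in inner)
outer_sep = all(lift(p)[2] > 1 for p in outer)
```

A real SVM never computes such a mapping explicitly; kernel functions give it the inner products in the high-dimensional space directly, which is what makes the approach practical.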

Advantages and disadvantages of SVM algorithm

  • Advantages: Can be used for linear and nonlinear classification, as well as regression; low generalization error; easy to interpret; low computational complexity.
  • Disadvantages: sensitive to the choice of parameters and kernel function; the original SVM handles only binary classification well.

5. Logistic regression model

Logistic regression, also known as logistic regression analysis, is a generalized linear regression model often used in data mining, automatic disease diagnosis, economic forecasting and other fields, for example to identify the risk factors that cause a disease and to predict the probability of its occurrence from those factors.

Taking gastric cancer analysis as an example, two groups of people are selected: one with gastric cancer and one without. The two groups will certainly differ in physical signs, lifestyle and so on. The dependent variable is therefore whether the person has gastric cancer, taking the value "yes" or "no", while the independent variables can include many factors, such as age, gender, eating habits, and Helicobacter pylori infection, and may be either continuous or categorical. Logistic regression analysis then yields the weights of the independent variables, giving a rough picture of which factors are risk factors for gastric cancer. These weights can also be used to predict a person's likelihood of developing the disease from the risk factors.

Applicable conditions of Logistic regression model:

1. The dependent variable is a binary categorical variable or the occurrence rate of an event, the latter being a numerical variable. Note, however, that repeatedly counted indicators are not suitable for logistic regression.

2. Both the residuals and the dependent variable should follow a binomial distribution. The binomial distribution corresponds to a categorical variable, not the normal distribution, so equation estimation and testing are done by maximum likelihood rather than least squares.

3. There is a linear relationship between the independent variables and the logit of the probability.

4. The observations are mutually independent.

The essence of logistic regression is to divide the probability of occurrence by the probability of non-occurrence and take the logarithm. It is this not-so-complicated transformation that resolves the two difficulties in relating the dependent variable to the independent variables: its restricted value range and the curvilinear relationship between them. Turning the probability into the odds, the ratio of occurrence to non-occurrence, provides a buffer that expands the value range to (0, +∞), and taking the logarithm then expands it to the whole real line, so the transformed dependent variable is no longer restricted.

Moreover, this transformation often makes the relationship between the dependent and independent variables linear, a pattern summarized from much practice. Logistic regression thus fundamentally solves the problem of what to do when the dependent variable is not continuous. It is widely used because many practical problems match its model, such as whether or not an event occurs as a function of other numerical independent variables.
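The transformation described above, dividing the probability of occurrence by the probability of non-occurrence and taking the logarithm, is the logit function; its inverse, the sigmoid, maps any real-valued linear combination of the independent variables back to a probability:

```python
import math

def logit(p):
    """Log-odds: divide p(occurrence) by p(non-occurrence), then take the log."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-z))
```

In a fitted model, z is the linear combination of the independent variables and their weights, and sigmoid(z) is the predicted probability, for example a person's likelihood of developing the disease given their risk factors.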

Advantages and disadvantages of logistic regression model

  • Advantages: simple to implement; classification requires very little computation, is fast, and uses little storage.
  • Disadvantages: prone to underfitting, so accuracy is generally not very high; handles only binary classification directly (softmax, derived on this basis, can be used for multi-class problems), and the data must be linearly separable.
