This article gives a common-sense overview of machine learning algorithms: no complex theoretical derivations, just diagrams (plus a few minimal code sketches) showing what each algorithm is and how it is applied. The examples are mainly classification problems.

For each algorithm I watched several explainer videos and picked out the clearest and most interesting ones to summarize here.


We’ll have time to do a more in-depth analysis of individual algorithms later.

Today’s algorithms are as follows:

  • The decision tree

  • Random forest algorithm

  • Logistic regression

  • SVM

  • Naive Bayes

  • K nearest neighbor algorithm

  • K-means algorithm

  • Adaboost algorithm

  • The neural network

  • Markov chains


1. The decision tree

Classify according to some features: each node in the tree asks a question, and the answer splits the data into two branches; the next node then asks another question. These questions are learned from the existing data, and when new data comes in, it follows the questions down the tree until it lands in the appropriate leaf.
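As a minimal sketch (my illustration, assuming scikit-learn and its toy Iris dataset; the original article is diagram-only), here is a decision tree that learns such questions from data:

```python
# A minimal decision-tree sketch; scikit-learn and the Iris dataset
# are assumptions, not from the article.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)           # features and class labels
tree = DecisionTreeClassifier(max_depth=3)  # each node asks one question
tree.fit(X, y)                              # questions are learned from data
print(tree.predict(X[:2]))                  # new data falls into a leaf/class
```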



2. Random forest

Select data randomly from the source data to form several subsets


The matrix S is the source data, containing rows 1 through N; A, B, and C label the columns, and the last column, C, is the class label.


Generate M submatrices randomly from S


These M subsets are used to train M decision trees.


Put the new data into each of the M trees to get M classification results. Count which category receives the most votes and take that category as the final prediction.
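A minimal sketch of this subset-then-vote procedure (my illustration; scikit-learn, NumPy, and the Iris data are assumptions):

```python
# Sketch of the bagging + majority-vote idea described above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
M = 10                                      # number of random subsets / trees
trees = []
for _ in range(M):
    idx = rng.integers(0, len(X), len(X))   # random subset (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([t.predict(X[:1]) for t in trees])  # M classification results
print(np.bincount(votes.ravel()).argmax())           # majority vote wins
```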



3. Logistic regression

When the prediction target is a probability, the output must lie between 0 and 1. A simple linear model cannot guarantee this, because its output is unbounded and can fall outside that interval.


So it would be good to have a model with this shape


So how do you get this model?

This model has to satisfy two conditions: its output is greater than or equal to 0, and less than or equal to 1.


For a model whose output is greater than or equal to 0, you could choose the absolute value, the square, or an exponential function; here the exponential is used, since it is always greater than 0.


To keep it less than or equal to 1, divide: with the value itself as the numerator and the value plus 1 as the denominator, the ratio must be less than 1.


And if you do another transformation, you get a Logistic regression model
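Written out (a standard derivation, filled in here since the article shows it only as a figure), with the linear score written as w·x:

$$
p(x) = \frac{e^{w \cdot x}}{e^{w \cdot x} + 1} = \frac{1}{1 + e^{-w \cdot x}}
$$

which always lies strictly between 0 and 1: the logistic (sigmoid) curve.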


You can calculate the coefficients from the source data


And you end up with the logistic graph
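A minimal sketch of fitting those coefficients (my illustration; scikit-learn and its breast-cancer toy dataset are assumptions):

```python
# Sketch: fit logistic regression coefficients from data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # binary classification data
model = LogisticRegression(max_iter=5000)
model.fit(X, y)                             # coefficients learned from data
print(model.predict_proba(X[:2]))           # probabilities within [0, 1]
```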



4. SVM

support vector machine

To separate the two classes, we want a hyperplane; the optimal hyperplane is the one that maximizes the margin between the two classes. The margin is the distance between the hyperplane and its nearest points. As shown in the figure below, Z2 > Z1, so the green hyperplane is better.


Represent the hyperplane as a linear equation: the points of one class lie above the line with values greater than or equal to 1, and the points of the other class have values less than or equal to -1.


The distance from a point to the hyperplane is calculated with the formula in the figure: distance = |w·x + w0| / ||w||.


Therefore, the expression for the total margin is as follows. The goal is to maximize the margin, so the denominator ||w|| needs to be minimized, which turns this into an optimization problem.
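In standard notation (reconstructed here, since the figure is not reproduced), the total margin is 2/||w||, so:

$$
\max \frac{2}{\lVert w \rVert} \;\Longleftrightarrow\; \min \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{subject to } y_i \,(w \cdot x_i + w_0) \ge 1
$$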


For example, given three points, find the optimal hyperplane, with the weight vector defined along the direction (2,3) - (1,1).


This gives the weight vector (a, 2a). Substitute the two points into the hyperplane equation, with (2,3) giving the value 1 and (1,1) giving the value -1; solving for a and the intercept w0 yields the expression of the hyperplane.


After solving for a and substituting it into (a, 2a), the support vectors are obtained.

Substituting a and w0 into the hyperplane equation gives the support vector machine.
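Filling in the arithmetic for this example (the numbers follow from the setup above; the original shows them only in a figure):

$$
\begin{aligned}
(2,3):\;& 2a + 3(2a) + w_0 = 8a + w_0 = 1 \\
(1,1):\;& a + 1(2a) + w_0 = 3a + w_0 = -1 \\
&\Rightarrow\; 5a = 2,\quad a = \tfrac{2}{5},\quad w_0 = -\tfrac{11}{5} \\
&\Rightarrow\; \tfrac{2}{5}\,x_1 + \tfrac{4}{5}\,x_2 - \tfrac{11}{5} = 0
\end{aligned}
$$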

5. Naive Bayes

Take an example of an application in NLP

Given a paragraph of text, return its sentiment classification: is the attitude of the paragraph positive or negative?


To solve this problem, just look at some of the words


The text is then represented by just a few words and their counts.


The original question is: given this text, which category does it belong to?


Bayes’ rule makes this a relatively easy problem to solve.


The question becomes: what is the probability of this sentence given the category? And of course, remember the other two probabilities in the formula.
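For reference, Bayes’ rule in this setting (the standard form; the article shows it only in a figure):

$$
P(\text{class} \mid \text{words}) = \frac{P(\text{words} \mid \text{class})\, P(\text{class})}{P(\text{words})}
$$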

The probability of the word “love” occurring in positive situations is 0.1, and in negative situations 0.001
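A toy sketch of the scoring (my illustration; only the 0.1 and 0.001 values for “love” come from the article, while the equal priors and the tiny fallback for unseen words are assumptions):

```python
# Toy naive Bayes decision using the article's example numbers.
word_probs = {"love": {"positive": 0.1, "negative": 0.001}}
prior = {"positive": 0.5, "negative": 0.5}   # assumed equal priors

def score(words, label):
    p = prior[label]
    for w in words:
        p *= word_probs.get(w, {}).get(label, 1e-6)  # assumed smoothing
    return p

words = ["love"]
print(max(prior, key=lambda label: score(words, label)))  # -> positive
```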



6. K nearest neighbor

k nearest neighbours

Given a new data point, find the k points closest to it; whichever category holds the majority among those k neighbours is the category assigned to the new point.

To distinguish cats from dogs using the claw and sound features, the circles and triangles are already categorized. So which category does the star belong to?


When k = 3, the three points connected by the lines are the nearest neighbours. Circles are in the majority, so the star belongs to the cat category.
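A minimal sketch of that vote (my illustration; the claw/sound feature values below are made up to mirror the figure):

```python
# k-nearest-neighbour majority vote, written out by hand.
import numpy as np

X = np.array([[1.0, 1.2], [1.1, 0.9], [0.9, 1.0],   # circles (cats)
              [3.0, 3.1], [3.2, 2.9]])               # triangles (dogs)
y = np.array([0, 0, 0, 1, 1])                        # 0 = cat, 1 = dog
star = np.array([1.2, 1.1])                          # the new point

k = 3
dist = np.linalg.norm(X - star, axis=1)   # distance to every known point
nearest = y[np.argsort(dist)[:k]]         # labels of the k closest points
print(np.bincount(nearest).argmax())      # majority vote -> 0 (cat)
```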



7. K-means

You want to divide a set of data into three categories; pink points have high values and yellow points have low values.


It is best to initialize first, so we choose the simplest values 3, 2, and 1 as the initial centers of the three classes.


For the rest of the data, compute the distance to each of the three initial centers and assign each point to the category of the closest center.


After the points are grouped, calculate the mean of each category and use it as the center point for the next round.


After a few rounds, when the grouping has stopped changing, you can stop.
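A minimal 1-D sketch of this loop (my illustration; the initial centers 3, 2, 1 come from the walkthrough, but the data points are made up):

```python
# k-means: assign points to nearest center, recenter, repeat.
import numpy as np

data = np.array([0.5, 0.9, 1.8, 2.2, 2.9, 3.4, 3.1, 1.1])  # made-up values
centers = np.array([3.0, 2.0, 1.0])       # initial values of each class

for _ in range(10):                       # a few rounds is enough here
    # assign each point to the closest center
    labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
    # recompute each center as the mean of its group
    new_centers = np.array([data[labels == k].mean() for k in range(3)])
    if np.allclose(new_centers, centers):  # grouping stopped changing
        break
    centers = new_centers

print(labels, centers)
```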



8. Adaboost

Adaboost is one of the Boosting methods.

Boosting combines several classifiers with poor classification performance into one classifier with good performance.

In the figure below, the left and right decision trees are not very good individually, but feeding the same data into both and considering the two results together increases the credibility of the prediction.


As an example of Adaboost, take handwriting recognition: the drawing board can capture many features, such as the direction of the starting stroke, the distance between the starting and ending points, and so on.


During training, each feature is given a weight. For example, the beginning strokes of 2 and 3 are very similar, so this feature plays a small role in telling them apart and its weight will be small.


This alpha angle, by contrast, is highly discriminative, so this feature’s weight will be large. The final prediction is a combined consideration of the results of all these features.
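A minimal sketch with scikit-learn (my illustration; the article describes the idea with handwriting features rather than code, and the digits dataset here is an assumption):

```python
# Sketch: boost many weak classifiers into one strong one.
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier

X, y = load_digits(return_X_y=True)        # handwritten-digit features
clf = AdaBoostClassifier(n_estimators=50)  # default weak learner: depth-1 tree
clf.fit(X, y)                              # learns a weight per weak learner
print(clf.predict(X[:5]))                  # weighted vote of weak learners
```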



9. Neural networks

Neural networks are suited to inputs that can fall into at least two categories.

NN consists of several layers of neurons, and the connections between them


The first layer is the input layer, and the last layer is the Output layer

Both the hidden layers and the output layer have their own classifiers.


The input is fed into the network and activated, and the computed scores are passed to the next layer, activating the neurons behind it. Finally, the scores on the output-layer nodes represent the scores for each class. In the example below, the classification result is class 1.

The same input is transmitted to different nodes, and the results differ because each node has different weights and biases.

This is also called forward propagation.
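A minimal forward-pass sketch in NumPy (my illustration; the layer sizes and random weights are placeholders, not the figure’s values):

```python
# Forward propagation through one hidden layer.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input(4) -> hidden(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # hidden(3) -> output(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -0.2, 0.1])   # one input example
h = sigmoid(x @ W1 + b1)              # hidden layer: weights, bias, activation
scores = h @ W2 + b2                  # output scores for each class
print(scores.argmax())                # predicted class = highest score
```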



10. Markov

Markov chains are composed of states and transitions.

Take the sentence “the quick brown fox jumps over the lazy dog” and build a Markov chain from it.

First, make each word a state, then calculate the probability of transitioning between the states.


These are the probabilities calculated from a single sentence. When you run the statistics over a large amount of text, you get a larger state-transition matrix, for example the set of words that can follow “the” along with their corresponding probabilities.
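A short sketch of that counting step (my illustration, using the article’s sentence):

```python
# Build word-to-word transition probabilities from text.
from collections import Counter, defaultdict

text = "the quick brown fox jumps over the lazy dog"
words = text.split()

counts = defaultdict(Counter)
for a, b in zip(words, words[1:]):
    counts[a][b] += 1                 # count each observed transition

probs = {w: {nxt: c / sum(nbrs.values()) for nxt, c in nbrs.items()}
         for w, nbrs in counts.items()}
print(probs["the"])                   # {'quick': 0.5, 'lazy': 0.5}
```

With only one sentence the probabilities are coarse; a large corpus fills in many more possible successors for each word.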


The suggestion candidates offered by a phone keyboard’s input method work on the same principle, just with a more advanced model.
