This article is an overview of machine learning algorithms and a summary of my own learning. By reading it, you can quickly get a clear picture of the machine learning landscape. I promise there will be no mathematical formulas or derivations, so it is suitable for relaxed after-dinner reading; I hope readers come away with something useful without much effort.

This article is divided into three parts. The first introduces anomaly detection algorithms, which I feel are very useful for monitoring systems; the second briefly introduces several common machine learning algorithms; the third introduces deep learning and reinforcement learning. A short summary of my own comes at the end.

Anomaly detection, as the name implies, means detecting the abnormal: degraded network quality, unusual user access behavior, anomalies in servers, switches, and systems, and so on can all be watched for with anomaly detection algorithms. I personally think this family of algorithms is well worth referencing when building monitoring, so I will introduce this part first, on its own.

Outliers are defined as points that are “more likely to be separated”: sparsely distributed and far from any dense group. In statistical terms, a sparsely populated region of the data space is one where data occur with very low probability, so data falling in such regions can be considered abnormal.

Figure 1-1 Outliers lie far from the dense group of normal points

As shown in Figure 1-1, the data inside the blue circle are likely to belong to the group, while the farther away a point lies, the less likely it is to belong.

The following is a brief introduction to several anomaly detection algorithms.

1.1 Distance-based anomaly detection

Figure 1-2 Distance-based anomaly detection

Idea: a point is considered an anomaly if it has few “friends” around it.

Procedure: given a radius r, compute the ratio of the number of points inside the circle of radius r centered on the current point to the total number of points. If the ratio is below a threshold, the point is considered an outlier.
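
To make the procedure concrete, here is a minimal Python sketch (the function name, default radius, and threshold are illustrative assumptions of mine, not prescribed by the algorithm):

```python
import numpy as np

def distance_outliers(points, r=1.0, ratio_threshold=0.05):
    """Flag points whose radius-r neighborhood holds too small a share of all points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Fraction of the dataset inside each point's radius-r ball (itself included).
    neighbor_ratio = (dists <= r).sum(axis=1) / n
    return neighbor_ratio < ratio_threshold
```

Note this computes all pairwise distances, so it costs O(n²); fine for illustration, not for huge datasets.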

1.2 Depth-based anomaly detection

Figure 1-3 Depth-based anomaly detection

Idea: Outliers are far from dense groups and tend to be at the very edge of the group.

Procedure: connect the outermost points and assign that layer depth 1; then connect the points of the next outer layer and assign that layer depth 2; repeat. Points whose depth value is below some threshold k can be considered outliers, because they lie farthest from the central population.
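
A common way to realize this in 2-D is convex-hull peeling. Below is a minimal sketch using SciPy; it assumes the points are in general position (SciPy's ConvexHull fails on fewer than three non-collinear points), and the function name is my own:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_depths(points):
    """Assign each point a depth by convex-hull peeling: outermost layer = depth 1."""
    points = np.asarray(points, dtype=float)
    remaining = np.arange(len(points))
    depths = np.zeros(len(points), dtype=int)
    depth = 1
    while len(remaining) >= 3:
        hull = ConvexHull(points[remaining])
        layer = remaining[hull.vertices]   # indices of the current outermost layer
        depths[layer] = depth
        remaining = np.setdiff1d(remaining, layer)
        depth += 1
    depths[remaining] = depth              # whatever is left is the innermost layer
    return depths
```

Points with `hull_depths(X) <= k` would then be flagged as outliers.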

1.3 Distribution-based anomaly detection

Figure 1-4 Gaussian distribution

Idea: a data point can be considered an outlier when it deviates from the mean of the overall data by more than 3 standard deviations (the multiplier can be adjusted to the actual situation).

Procedure: compute the mean and standard deviation of the existing data. When a new data point deviates from the mean by more than 3 standard deviations, treat it as an outlier.
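
This rule is nearly a one-liner. A minimal sketch (the multiplier k = 3 matches the text and can be adjusted):

```python
import numpy as np

def three_sigma_outlier(history, new_point, k=3.0):
    """Flag new_point if it lies more than k standard deviations from the mean."""
    mean, std = np.mean(history), np.std(history)
    return abs(new_point - mean) > k * std

print(three_sigma_outlier([9.8, 10.1, 10.0, 9.9, 10.2], 14.0))  # True: far outside
```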

1.4 Partition-based anomaly detection

Figure 1-5 Isolation forest

Idea: by repeatedly partitioning the data on its attributes, outliers tend to be set aside, or “isolated”, early, while normal points sit in dense groups and need many more partitions before they are isolated.

Procedure: construct multiple isolation trees as follows: at the current node, randomly select an attribute of the data and a split value of that attribute, and divide all data at the node into left and right children; if a child is still shallow or still contains many data points, keep splitting. Outliers are the points whose tree depth, averaged across all isolation trees, is very low, shown in red in Figure 1-5.
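
In practice you rarely build the trees by hand; scikit-learn ships an implementation of isolation forest [3]. A minimal sketch (the cluster sizes and the contamination rate are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))     # dense cluster of normal points
outliers = rng.uniform(-6, 6, size=(5, 2))   # a few scattered anomalies
X = np.vstack([normal, outliers])

clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = clf.fit_predict(X)                  # -1 = outlier, 1 = inlier
print(np.where(labels == -1)[0])             # indices of the isolated points
```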

This part briefly introduces several common machine learning algorithms: k-nearest neighbors, k-means clustering, decision trees, the naive Bayes classifier, linear regression, logistic regression, hidden Markov models, and support vector machines. Feel free to skip the parts that are not explained well.

2.1 k-nearest neighbors

Figure 2-1 Two of the three points nearest the query point are red triangles, so the query point is classified as a red triangle

This is a classification method. For a point to be classified, find its k nearest neighbors among the existing labeled data points, and assign it the majority label among those neighbors.
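
A minimal scikit-learn sketch on toy data of my own (two obvious clusters):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]  # toy 2-D points
y = [0, 0, 0, 1, 1, 1]                                # two classes

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[4.5, 5.5]]))  # majority vote of the 3 nearest neighbors -> [1]
```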

2.2 k-means clustering

Figure 2-2 Iterating until “birds of a feather flock together”

The goal of k-means clustering is to find a partition that minimizes the sum of squared distances. Initialize k center points; compute each data point's distance to every center (Euclidean or some other distance measure) and assign each point to the cluster of its nearest center; then recompute each center as the centroid of the points assigned to it, replacing the previous center. Repeat this process until convergence, i.e. the center positions no longer change.
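
Here is a bare-bones sketch of exactly that loop (it assumes no cluster ever goes empty, which a production implementation would have to handle):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign points to the nearest center, move centers to cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # pick k initial centers
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for each point.
        labels = np.linalg.norm(X[:, None] - centers[None, :], axis=-1).argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers
```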

2.3 Decision trees

Figure 2-3 Deciding whether today is a good day to play, via a decision tree

A decision tree takes a form similar to if-else rules, except that when the tree is generated from data, information gain is used to decide which attribute to split on first. The advantage of decision trees is that they are expressive and interpretable: it is easy to see how a conclusion was reached.
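
A minimal scikit-learn sketch on invented toy data; criterion="entropy" makes the splits use information gain, and export_text prints the learned if-else rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy "good day to play?" data: [outlook (0=sunny, 1=rainy), windy (0/1)]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [1, 1, 1, 0]  # 1 = play, 0 = stay home

tree = DecisionTreeClassifier(criterion="entropy")  # entropy -> information-gain splits
tree.fit(X, y)
print(export_text(tree, feature_names=["outlook", "windy"]))
```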

2.4 Naive Bayes classifier

Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence between features. It learns the joint probability distribution from the training data and then computes the posterior probability distribution. (Sorry, no picture and no formula; let's leave it at that. -_-)
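
Keeping the no-formula promise, here is what simply using it looks like in scikit-learn, on toy data of my own (GaussianNB additionally assumes each feature is Gaussian within a class):

```python
from sklearn.naive_bayes import GaussianNB

X = [[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]]  # two loose clusters
y = [0, 0, 1, 1]

nb = GaussianNB()  # features treated as conditionally independent given the class
nb.fit(X, y)
print(nb.predict_proba([[1.1, 2.0]]))  # posterior probability for each class
```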

2.5 Linear regression

Figure 2-4 Draw a straight line that minimizes the sum of squared differences from the actual values of all data points

For the function f(x) = ax + b, substitute the existing data (x, y) to find the most appropriate parameters a and b, so that the function best expresses the mapping between input and output in the existing data and can then predict the output for future inputs.
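
A minimal sketch with NumPy's least-squares polynomial fit (the data points are invented, scattered around y = 2x + 1):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])  # roughly y = 2x + 1 with noise

a, b = np.polyfit(x, y, deg=1)  # least-squares fit of a degree-1 polynomial
print(a, b)                     # close to 2 and 1
print(a * 5.0 + b)              # predict the output for a future input x = 5
```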

2.6 Logistic regression

Figure 2-5 The logistic function

In fact, the logistic regression model just applies a logistic function on top of the linear regression above: the logistic function squashes the linear regression output into a value between 0 and 1, which can then represent the probability of belonging to a certain class.
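
A minimal sketch of just that squashing step (the parameters a and b are illustrative, as if already learned):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

a, b = 2.0, 1.0            # illustrative linear-regression parameters
x = 0.5
print(sigmoid(a * x + b))  # ~0.88: interpreted as P(class = 1 | x)
```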

2.7 Hidden Markov model

Figure 2-6 Transition probabilities between hidden states X, and the probability that state X is observed as Y

A hidden Markov model is a probabilistic model of time sequences. It describes a hidden Markov chain randomly generating an unobservable sequence of states, with each state then randomly generating an observation, yielding an observation sequence. The hidden Markov model has three elements and three basic problems, which interested readers can look into on their own. Recently I read an interesting paper [4] in which a hidden Markov model is used to predict the stage at which American university students will change their major, so that countermeasures can be taken to retain students in a given program. Could it also be used by corporate human resources to predict at what stage employees will jump ship, and implement the necessary retention measures in advance? (^_^)
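
To give one of the three basic problems some flesh: the forward algorithm computes the probability of an observation sequence under a given model. A minimal sketch, with all probabilities invented for illustration:

```python
import numpy as np

# A tiny HMM: 2 hidden states, 3 possible observations (all numbers illustrative).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # transition probabilities between hidden states
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])   # probability of each observation given the state
pi = np.array([0.6, 0.4])          # initial state distribution

def forward(obs):
    """Forward algorithm: probability of the whole observation sequence."""
    alpha = pi * B[:, obs[0]]          # joint prob. of first observation and each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step, then weight by emission
    return alpha.sum()

print(forward([0, 1, 2]))
```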

2.8 Support vector machines

Figure 2-7 The maximum margin is determined by the support vectors

A support vector machine (SVM) is a binary classification model. Its basic form is a linear classifier with the maximum margin in feature space. As shown in Figure 2-7, this model is called a support vector machine because the support vectors play the key role in determining the separating hyperplane.

For a classification problem that is nonlinear in the input space, a nonlinear transformation (a kernel function) can turn it into a linear classification problem in a high-dimensional feature space, where a linear support vector machine can be learned. As shown in Figure 2-8, the training points are mapped into three-dimensional space, where a separating hyperplane is easy to find.

Figure 2-8 Converting a two-dimensionally linearly inseparable problem into a three-dimensionally linearly separable one
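
A minimal scikit-learn sketch of the same effect: two concentric rings are not linearly separable in 2-D, but an RBF kernel separates them easily (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # kernel implicitly maps to a higher dimension
print(linear.score(X, y))          # poor: no separating line exists in 2-D
print(rbf.score(X, y))             # near 1.0: separable after the kernel mapping
```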

Here is a brief introduction to the origins of neural networks, in this order: the perceptron, the multilayer perceptron (neural network), convolutional neural networks, and recurrent neural networks.

3.1 The perceptron

Figure 3-1 The input vector is weighted, summed, and fed into the activation function to produce the output

Neural networks originated in the 1950s and 1960s, when they were called perceptrons, with an input layer, an output layer, and a hidden layer. The perceptron's weakness is that it cannot represent even slightly more complex functions, hence the multilayer perceptron described next.
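
A minimal sketch of the structure in Figure 3-1; the weights are chosen by hand so that this particular perceptron computes a logical AND:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Weighted sum of the inputs fed through a step activation (Figure 3-1)."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative hand-picked weights: this perceptron computes AND of two inputs.
w, b = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_output(np.array(x), w, b))
```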

3.2 Multilayer perceptron

Figure 3-2 A multilayer perceptron: multiple hidden layers between input and output

On the basis of the perceptron, several hidden layers are added to gain the ability to represent more complex functions; this is called a multilayer perceptron. To make it sound fancier, it is also called a neural network. The more layers a neural network has, the stronger its representational power, but more layers also lead to vanishing gradients during backpropagation (BP).
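
A minimal NumPy sketch of the forward pass: the multilayer perceptron is just the perceptron's "weighted sum + activation" repeated layer after layer (the sizes and random weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """Each layer is a weighted sum of the previous layer's outputs plus an activation."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
# Two hidden layers of width 4 between a 3-dim input and a 2-dim output.
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(4, 4)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
print(mlp_forward(rng.normal(size=3), layers))
```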

3.3 Convolutional neural network

Figure 3-3 General form of a convolutional neural network

A fully connected neural network with many hidden layers suffers from an explosion in the number of parameters, and full connectivity makes no use of local patterns (for example, adjacent pixels in an image are correlated and can be composed into more abstract features, such as eyes). Hence convolutional neural networks: they limit the number of parameters and exploit local structure, which makes them especially suitable for image recognition.
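
A minimal PyTorch sketch (the layer sizes are illustrative assumptions of mine, sized for 28×28 single-channel images):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Convolution shares a small set of weights across the whole image, so the
    parameter count stays limited while local patterns are still captured."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # 8 local feature maps
        self.pool = nn.MaxPool2d(2)                            # downsample by 2x
        self.fc = nn.Linear(8 * 14 * 14, 10)                   # classify into 10 classes

    def forward(self, x):                  # x: (batch, 1, 28, 28), e.g. digit images
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(1))

print(TinyCNN()(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```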

3.4 Recurrent neural network

Figure 3-4 A recurrent neural network can be viewed as a neural network unrolled over time

A recurrent neural network can be viewed as a neural network unrolled over time: its depth is the length of the sequence, and a neuron's output can influence the processing of the next sample. Ordinary fully connected networks and convolutional networks process samples independently, while recurrent networks can handle tasks that require learning from time-ordered samples, such as natural language processing and speech recognition.
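
A minimal PyTorch sketch showing the hidden state being carried across time steps (all sizes are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)

x = torch.randn(1, 5, 4)  # one sequence of 5 time steps, 4 features each
out, h = rnn(x)           # the hidden state is carried from step to step
print(out.shape)          # torch.Size([1, 5, 8]): one output per time step
print(h.shape)            # torch.Size([1, 1, 8]): final hidden state
```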

Machine learning is really about learning the mapping f from inputs x to outputs y = f(x).

That is, we hope to find the rules hidden in the data by way of large amounts of data. (In unsupervised learning, the main task is to find patterns in the data itself, rather than a mapping.)

In summary, the general machine learning recipe is: choose an appropriate model according to the scenarios each algorithm suits, determine the objective function, select a suitable optimization algorithm, and determine the model's parameters by iteratively approaching the optimum.

As for future prospects, some say reinforcement learning is the real hope of artificial intelligence. I hope to study reinforcement learning further and deepen my understanding of deep learning, so that I can make sense of articles on deep reinforcement learning.

Finally, I have only spent a few months studying this in my spare time, so there are bound to be mistakes in this article. I hope readers will forgive and correct them; I will fix them right away, and I hope not to mislead anyone.

References

[1] Li Hang. Statistical Learning Methods. Beijing: Tsinghua University Press, 2012.

[2] Kriegel H P, Kröger P, Zimek A. Outlier Detection Techniques. Tutorial at KDD, 2010.

[3] Liu F T, Ting K M, Zhou Z H. Isolation Forest. Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE, 2008: 413-422.

[4] Aulck L, Aras R, Li L, et al. Stemming the Tide: Predicting STEM Attrition Using Student Transcript Data. arXiv preprint arXiv:1708.09344, 2017.

[5] Li Hongyi (Hung-yi Lee). Deep Learning Tutorial. http://speech.ee.ntu.edu.tw/~tlkagk/slide/Deep%20Learning%20Tutorial%20Complete%20(v3)

[6] Zhihu user "Scientific Research Jun". What are the differences between the internal structures of convolutional neural networks, recurrent neural networks, and deep neural networks? https://www.zhihu.com/question/34681168