Author 1. Write first
If you want to engage in data mining or machine learning, it is necessary to master common machine learning algorithms. Here is a brief overview of common machine learning algorithms:
- Supervised learning algorithms: logistic regression, linear regression, decision tree, Naive Bayes, K-nearest Neighbor, support vector machine, integration algorithm Adaboost, etc
- Unsupervised algorithms: clustering, dimensionality reduction, association rules, PageRank, etc
For detailed understanding of the principles, have seen watermelon book, statistical learning methods, such as machine learning field, also heard some machine learning courses, but always feel more abstruse words, don’t have the patience to read, and theory are everywhere, but practice is the most important, so here want to use the most simple language to write a vernacular theory + practice series of machine learning algorithm.
In my opinion, it is more important to understand the idea behind the algorithm and its use than to understand its mathematical derivation. Idea will let you have an intuitive feeling, so as to understand the rationality of the algorithm, the mathematical deduction is to express this kind of rationality in a more rigorous language, for example, a pear is sweet and can be expressed in mathematical language to sugar content is 90%, but only the bite personally, you can truly feel the pear how sweet, And really understand the math of what 90 percent sugar looks like. If the algorithm is a pear, the primary purpose of this article is to take a bite out of it. There are also several other purposes:
- Test your understanding of the algorithm and make a summary of the algorithm theory
- Can be happy to learn the core ideas of these algorithms, find the interest in learning these algorithms, for in-depth learning these algorithms lay a foundation.
- The theory of each class will be put to a practical case, can really learn to apply, not only can exercise programming ability, but also can deepen the grasp of algorithm theory.
- Also want to put all the previous notes and reference in a piece, convenient for the convenience of checking later.
The process of learning algorithms should not only gain the theory of algorithms, but also have fun and the ability to solve practical problems!
Today is the vernacular machine learning algorithm theory + actual combat of the seventh K nearest neighbor algorithm, this should be many machine learning algorithm is the most easy to understand or operate an algorithm, but don’t look at it simple, but very practical. Because it can be used to do classification, also can be used to do regression, easy to implement, no parameter estimation, no training. Of course, there are also some disadvantages, such as a large amount of calculation, slow analysis speed, sample imbalance problem is not very easy to use, can not give the inherent meaning of data, is a lazy learning method. Of course, KNN can also be used in the recommendation algorithm. Although tD-IDF, collaborative filtering and Apriori algorithm are used in many algorithms of recommendation systems, it is feasible to adopt KNN as the recommendation algorithm when the amount of data is not large.
So simple, and the role of the algorithm we should immediately grasp, so in this class, we will learn the calculation principle of KNN, and use KNN to achieve a real case, let’s start.
The outline is as follows:
- How do you classify movies according to the number of fights and kisses? (We get KNN)
- How KNN works
- How to use KNN to do regression? (A brief introduction)
- Use KNN to realize the recognition of written numbers (handwritten number recognition is the Hellow World of deep learning, here first take you to know)
OK, let ‘s go!
2. K-nearest Neighbor algorithm? Again, we’ll start with a category of movies
KNN is called K-nearest Neighbor in English, which should be the simplest data mining algorithm. Although simple, but very practical, old rules, since the vernacular, or from the example to start, let you experience the algorithm source life again.
Suppose I had the following movies, and I counted the number of fights, the number of kisses, and everything else, as well, as below:It’s easy to understand that Wolf Warrior, Operation Red Sea, Mission: Impossible 6 are action movies, While Exes 3, Puff and Titanic are romance movies. (If you don’t understand, go to the movies by yourself and find an interface to relax.)
But let’s say I have a new movie right now, fast & Furious 9, and I count the fights and the kisses. Would that classify as love or action? That’s not easy, you say. I knew it was a Vin Diesel action movie. But the machine doesn’t know, the machine doesn’t care whether you are male god or goddess, you have to let the machine understand this classification rules, let the machine to classify ah, then you should use what method?
We can view the number of fights as the X-axis and the number of kisses as the Y-axis, and then label these movies on a two-dimensional coordinate axis, as shown in the figure below.For Fast and Furious, (x,y), we need to look at what movies are closest to movie A, which category most of those movies belong to, and which category movie A belongs to.
Well, that’s what you teach the machine to do, actually it’s K Nearest Neighbor, easy to understand, it’s a new movie coming, see which neighbor it’s close to, what kind of neighbor it is, it’s what kind of neighbor it is.
However, there are still some details we need to know, let’s really look at the principle.
3. The working principle of K Nearest Neighbor
A simple sentence can explain the working principle of KNN: “Keep company and you will get red.
The process of k-nearest Neighbor algorithm is that given a training data set, for the newly input instance, K instances closest to the instance are found in the training data set. Most of these K instances belong to a certain class, and the input instance is divided into this class.Take a look at the figure above: there are two different types of sample data, which are represented by blue squares and red triangles respectively, and the green circle in the middle represents the data to be classified. Below, we classify green dots according to the K-nearest Neighbor idea:
- If k=3, the three nearest neighbors of the green dot are two red triangles and one blue small square, and the minority is subordinate to the majority. Based on statistical methods, the green point to be classified is determined to belong to the category of red triangle
- If k=5, the five nearest neighbors of the green dot are two red triangles and three blue small squares, or the minority is subordinate to the majority. Based on the statistical method, it is determined that the point to be classified belongs to the blue square
As can be seen from the above examples, the idea of k-nearest neighbor algorithm is very simple. How to classify the new points? Find the k instances closest to it, which category is the most.
Therefore, the working principle of KNN can be roughly divided into three steps:
- Calculate the distance between the object to be classified and other objects;
- Count K neighbors closest to each other.
- For K nearest neighbors, the object to be classified belongs to whichever category they belong to the most.
But this raises two questions:
- How do I measure distance and lock the K nearest neighbors? (Different measurement methods may result in K different neighbors)
- How to determine the value of K (different values of K, the final result may be different, as above)
Let’s solve these two problems.
So let’s start with the second question, how do WE choose K?
You can see the whole KNN classification process, the choice of K is still very important
- If K is small, it means that the unclassified object is very close to its neighbor. A problem arising from this is that, if the neighbor point is a noise point, the classification of unclassified objects will also produce errors, so KNN classification will produce overfitting. For example, in the following case (K=1, the green point is the classification point) :So in this case, K=1, it puts the green point in the orange category, but the orange point is a noisy point.
- If the K value is relatively large, it is equivalent to the point with too far distance will also have an impact on the classification of unknown objects. Although the advantage of this case is strong robustness, the deficiency is also obvious, which will produce the under-fitting situation, that is, the unclassified objects are not really classified. Take this example:In the above case K=N, no matter what the input instance is, it will simply be predicted that it belongs to the class with the largest number of training instances. In this case, it is equivalent to having no training model at all, and it is not desirable to completely ignore a large amount of useful information in training data instances.
So how do we determine K? Ha ha, sorry, this value can not be determined in advance, need constant practice, again and again try. In engineering, we generally choose K value by cross validation.
u
The idea of cross-validation is to take most samples in the sample set as the training set, and the remaining small samples for prediction to verify the accuracy of the classification model. Therefore, in KNN algorithm, K value is generally selected in a small range, and the one with the highest accuracy in the verification set is finally determined as K value.
“
And then the second question, how do you calculate distance?
In KNN algorithm, another important calculation is about the measurement of distance. The distance between two sample points represents the similarity between the two samples. The greater the distance, the greater the difference; The smaller the distance, the greater the similarity.
There are five ways to measure distance:
- Euclidean distanceEuclidean distanceIs our most commonly used distance formula, also known as Euclidean distance. In two dimensions, the Euclidean distance between two points is:Similarly, we can also find the distance between two points in n-dimensional space:
- Manhattan distanceManhattan distanceIt’s used more in geometric space. The Manhattan distance is equal to the sum of the absolute wheelbases of the two points on the coordinate system. It can be expressed by the formula:For example, the green line represents the Euclidean distance between two points, while the red and yellow lines represent the Manhattan distance between two points. Feel the difference:
- Minkowski distanceMinkowski distanceNot a distance, but a definition of a set of distances. For two points x(x1,x2… , xn) and y (y1, y2,… ,yn), the Minkowski distance between x and y is:Where p represents the dimension of space. When P =1, it is the Manhattan distance. When p=2, it’s the Euclidean distance; As p goes to infinity, it’s the Chebyshev distance. The graph of the point with a distance of 1 from the original Lp under different P values is given below to intuitively feel the difference of P values and the change of distance:
- Chebyshev distance so how do we calculate Chebyshev distance? Chebyshev distance between two points is the point of the two coordinates, a maximum of the absolute value of the difference, expressed in mathematics is that Max (| x1, y1, | | x2, y2 |).
- Cosine distance Cosine distance actually computes the Angle between two vectors, it computes the difference between them in direction, it’s not sensitive to absolute values. In the comparison of interest correlation, the Angle relation is more important than the absolute value of distance, so cosine distance can be used to measure the differentiation degree of users’ interest in content. For example, if you use a search engine to search for a certain keyword, it will also recommend other related searches based on cosine distance calculation. So cosine distance is going to be used a lot in search recommendations.
The first three are the most commonly used distances in KNN.
Well, once we’ve solved this problem, we’ve figured out how KNN works. Can’t wait to get real? Before you panic, there’s something else I need to introduce.
4. KD tree
I’m not going to tell you too much about KD trees, there are a lot of mathematical formulas involved, but I’m going to tell you something about this, you don’t need to know much about the mathematics of KD trees, you just need to know that it’s a binary tree data structure, convenient for storing data in k-dimensional space. And in SkLearn, we can call KD trees directly, which is very convenient.
In fact, as you can see from the above, the calculation process of KNN is to calculate the distance between sample points in large quantities. In order to reduce The Times of calculating distances and improve the search efficiency of KNN, people put forward KD tree (abbreviation of K-dimensional). KD tree is a data structure dividing logarithmic data points in K – dimensional space. In the construction of KD tree, every node is a binary tree of k-dimensional numerical points. Since it is a binary tree, the binary tree can be used to add, delete, change and search operations, so as to greatly improve the search efficiency.
Well, about KD tree, say so much, before actual combat, say again, KNN how to do regression, after all, the actual combat is a classification task, since said KNN can do regression before, you can not say I lie.
5. KNN does regression
KNN can not only do classification, but also do regression. So first of all, what is regression. In the case of opening films, if we want to categorize unknown films, it is a classification problem. First, let’s look at the unknown movie to be classified. This movie belongs to whichever category most of the K movies closest to it belong to.
If it’s a new movie and you know it’s a romance, it’s a regression problem to know how many fights, how many kisses it’s likely to have.
So how does KNN do regression?
For a new movie X(think fast and the Furious, an action movie). We want to predict the value of one of its properties, such as the number of fights, as shown below.At this point, we will first calculate the distance between the point to be measured (new movie X) and the known point, and select K points closest to it. Assuming K=3 and the last three points are Wolf Warrior, Operation Red Sea and Mission Impossible 6, the number of fights is the average value of this attribute of these three points, i.e. (100+95+105)/3=100 times.
Ok, this is KNN, nothing special to explain, the following actual combat.
6. KNN: How to recognize handwritten numbers?
Again, let’s look at using the sklearn tool.
6.1 How do I Use KNN in SkLearn?
The KNN algorithm is available in Python’s Sklearn toolkit. KNN can be used as both classifier and regression.
- For classification, you need to quote: from sklearn.neihbors import KNeighborsClassifier
- Neighbors import KNeighborsRegressor is used for regressions
How to create KNN classifier in SkLearn?
u
KNeighborsClassifier(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30)
- N_neighbors: the K value in KNN represents the number of neighbors. If the K value is relatively small, it will cause overfitting. If K is large, you can’t classify unknown objects. Generally we use the default value 5.
- Weights: weights are used to determine the weights of neighbors. There are three methods:
u
- Weights =uniform, which means that all neighbors have the same weights.
- Weights =distance, which means that the weight is the reciprocal of the distance, that is, inversely proportional to the distance;
- Custom functions, you can customize the weights for different distances. In most cases, you don’t need to define your own functions.
“
- Algorithm: Specifies the methods for calculating neighbors. There are four methods:
u
- Algorithm =auto, an algorithm is automatically selected according to the data. Auto is selected by default.
- Algorithm = KD_tree, also known as KD tree, is a data structure of multi-dimensional space, which is convenient for key data retrieval. However, KD tree is suitable for the case of few dimensions. Generally, the dimension is less than 20, and if the dimension is greater than 20, the efficiency will decrease.
- Algorithm = BALL_tree, also known as ball tree, is the data results of multi-dimensional space like KD tree. Different from KD tree, ball tree is more suitable for large dimensions.
- Algorithm = Brute, also known as brute search, differs from KD trees in that it uses a linear scan rather than a tree structure for quick retrieval. When the training set is large, it is inefficient.
“
- 4. Leaf_size: represents the number of leaves when constructing KD tree or sphere tree. The default value is 30.
“
After creating the KNN classifier, we can input the training set to train it. Here, we use fit() function to input the sample feature matrix and classification identifier of the training set, and the trained KNN classifier will be automatically obtained. Then the predict() function can be used to predict the results, where the characteristic matrix of the test set is passed in to obtain the predicted classification results of the test set.
6.2 How to Use KNN in SkLearn?
Handwritten digital data set is a well-known data set for image recognition. The process of digital recognition is to match these pictures with the classification results 0-9 one by one. The complete handwritten digital dataset MNIST includes 60,000 training samples and 10,000 test samples. MNIST is basically the first data set you touch if you study deep learning.
For KNN classification, we use the handwritten digital dataset that comes with SkLearn. You can think of this dataset as a simplified version of the MNIST dataset, which contains only 1797 digital images, each of which is 8 × 8 pixels.
Let’s break down the process:The whole training process basically consists of three stages:
- Data loading: We can load our own handwritten numeric data sets directly from Sklearn;
- Preparation stage: In this stage, we need to have a preliminary understanding of the data set, such as the number of samples, what the image looks like, and what the recognition result is. You can use visualization to see how the image is rendered. Data normalization allows data to be in the same order of magnitude of dimension. In addition, since the training set is an image and each image is an 8*8 matrix, we do not need to carry out feature selection for it and just take all the image data as the eigenvalue matrix.
- Classification stage: classifier can be obtained through training, and then the accuracy of the test set is calculated.
Ok, let’s get started:
- First of all, import package, here import a lot of algorithm package, because this project is relatively simple, so let’s compare the previous several methods, see the effect.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
Copy the code
- Load the data set and explore
Data = load_digits() data = digits.dataprint(data.shape) # View the first imageprint(digits.images[0] # Number meaning represented by the first imageprint(digits.target[0Plt.gray () plt.imshow(digits.images[)0]) plt.show() #1797.64)
[[ 0. 0. 5. 13. 9. 1. 0. 0.]
[ 0. 0. 13. 15. 10. 15. 5. 0.]
[ 0. 3. 15. 2. 0. 11. 8. 0.]
[ 0. 4. 12. 0. 0. 8. 8. 0.]
[ 0. 5. 8. 0. 0. 9. 8. 0.]
[ 0. 4. 11. 0. 1. 12. 7. 0.]
[ 0. 2. 14. 5. 10. 12. 0. 0.]
[ 0. 0. 6. 13. 10. 0. 0. 0.]]
0
Copy the code
We carried out data visualization on the first image in the original data set, and it can be seen that the image is an 8*8 pixel matrix. The image above is a “0”. We can also see that the classification label is “0” in the classification label of the training set.
- Klearn’s own handwritten digital dataset consists of 1797 samples, each image is an 8 by 8 pixel matrix. Since there is no special test set, we need to divide the data set into training set and test set. Since KNN algorithm is related to distance definition, we need to normalize the data and adopt Z-Score normalization. The code is as follows:
# Split data, will25% of the data as a test set, the rest as a training set (you can also specify other proportions of the data as a training set) train_X, test_x, train_y, test_Y = train_test_split(data1, target1, test_size=0.25Ss = StandardScaler() train_ss_scaled = ss. Fit_transform (train_x) test_SS_scaled = ss. Transform (test_x) # using0- 1Train_scaled = mm. Fit_transform (train_x) test_mm_scaled = mm. Transform (test_x)Copy the code
The reason why 0-1 normalization is used here is because the polynomial naive Bayesian classification model, the incoming data cannot have negative numbers. Because z-Score normalizes the values to a standard normal distribution with a mean of 0 and a variance of 1, the values will contain negative numbers. Therefore, we need to adopt min-max normalization to normalize the data to the range of [0,1].
- The model is established and compared: five classifiers are constructed here, which are K-nearest neighbor, SVM, polynomial naive Bayes, decision tree model and AdaBoost model. PS: Note that when doing polynomial naive Bayes, the incoming data cannot have negative numbers. Since z-Score will normalize the values and a standard normal distribution will contain negative numbers, it cannot have this. MinMax can be normalized to the range [0-1].
models = {}
models['knn'] = KNeighborsClassifier()
models['svm'] = SVC()
models['bayes'] = MultinomialNB()
models['tree'] = DecisionTreeClassifier()
models['ada'] = AdaBoostClassifier(base_estimator=models['tree'], learning_rate=0.1)
for model_key in models.keys():
if model_key == 'knn' or model_key == 'svm' or model_key == 'ada':
model = models[model_key]
model.fit(train_ss_scaled, train_y)
predict = model.predict(test_ss_scaled)
print(model_key, "Accuracy:", accuracy_score(test_y, predict))
else:
model = models[model_key]
model.fit(train_mm_scaled, train_y)
predict = model.predict(test_mm_scaled)
print(model_key, "Accuracy:", accuracy_score(test_y, predict))
Copy the code
Output result:You can see that the accuracy of KNN is still good, comparable to SVM. And it’s even better than AdaBoost, which makes me wonder why decision trees and AdaBoost are so bad. Later, I found that it was the problem of the number of samples. Our maximum data set was only over 1000 photos, which was too small for AdaBoost to play its role. Therefore, I memorized the data amplification and copied the original data three times:
data1 = np.vstack((data, data, data))
target1 = np.hstack((target, target, target))
Copy the code
When tested, AdaBoost and Tree improved their performance to match that of SVM:
7. To summarize
Ok, this is basically the end, KNN is not too difficult, so the space may be a little less, but KNN is still a good algorithm to use, the following simple summary:
First of all, through the example of film classification, experience what is KNN, and then write about the working principle of KNN, the main problem is how to choose K value? How do you measure distance?
Then, two episodes KD tree and KNN do regression.
Finally is handwritten number recognition actual combat, this of course is relatively simple, and then is compared to the effect of these algorithms. KNN has a good performance in it, and THE SVM algorithm is still good in dealing with this problem. When there are many samples, the integration method still has a certain advantage. In this process, you should have some experience in the use of data exploration, data visualization, data normalization, model training, and results evaluation. Using Sklearn is convenient for small amounts of data.
Reference:
- Note.youdao.com/noteshare?i…
- Note.youdao.com/noteshare?i…
- Note.youdao.com/noteshare?i…
Machine Learning Online Manual Deep Learning online Manual AI Basics Download (PDF updated to25Set) site QQ group1003271085To join the wechat group, please reply to "add group" to get a discount station knowledge planet coupon, please reply to "knowledge planet" like the article, click on itCopy the code