kNN (k-nearest neighbor) classification model
Let’s start with a question
Rather than talking about the model in the abstract, let’s think about it through an example:
Suppose we have four categories: black, white, yellow, and alien. The person we need to classify is standing in a predominantly black neighborhood in Guangzhou.
If we limit the scope to this neighborhood, then there is a high probability that he is black, so we classify him as black.
If we broaden the scope to include all of China, we might think he’s yellow.
Well, if we were to scale it out globally, we’d be inclined to say he’s white, because white people make up 43 percent of the world, yellow people 41 percent, and black people 16 percent.
If we zoom out to the infinite universe, the model makes us more inclined to think he’s an alien.
Hey, hey, wait a minute, you can see there’s a problem here: no matter what, there’s no way he’s actually an alien.
So let’s keep this problem in mind and look at the steps of the kNN algorithm.
kNN algorithm steps
Let’s look at the setup first.
Training data set D: each sample is a feature vector with X as the coordinates (a location in the Guangzhou black neighborhood) and Y as the data category (black, white, or yellow).
Now we have a lot of samples, for example in that black neighborhood:
(Building 1, 302, yellow), (Building 2, 302, black), (Building 3, 302, black), (Building 4, 302, black), ...
And for all of Guangzhou the numbers are even bigger.
Now we receive a new sample, (community park, Y).
We need to decide what this Y should be.
(1) Compute the distance d between the new data point x (community park) and every sample in the training set.
(2) Sort the distances in ascending order and take the first k samples.
(3) Count the Y values of those k samples and find the label with the highest frequency.
(4) Assign that label as the Y of (community park, Y).
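To make these four steps concrete, here is a minimal sketch in Python (my own illustration, not from the course); the coordinates and labels are made up, and plain Euclidean distance is assumed for now (distance is discussed below).

```python
from collections import Counter
import math

def knn_classify(new_x, samples, k):
    """Classify new_x using the four steps above.

    samples: list of (coordinates, label) pairs, e.g. ((1, 3), "black").
    """
    # (1) compute the distance from new_x to every sample
    distances = [(math.dist(new_x, coords), label) for coords, label in samples]

    # (2) sort by distance and keep the k nearest samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    # (3) count the labels of the k nearest samples
    counts = Counter(label for _, label in nearest)

    # (4) the most frequent label is the prediction
    return counts.most_common(1)[0][0]

# made-up stand-ins for "Building 1, Room 302", etc.
samples = [((1, 3), "yellow"), ((2, 3), "black"),
           ((3, 3), "black"), ((4, 3), "black")]
print(knn_classify((2.5, 1), samples, k=3))  # -> "black"
```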
At this point, a careful reader may spot a second problem.
It has to do with this distance d.
What’s wrong with it? Well, we all know the Earth is round!
So what if it’s round? Think about the distance from Guangzhou to the United States: by plane you travel along the surface, but a straight line drilled through the Earth would be much shorter.
So let’s talk about the distance calculation in kNN
Distance calculation in kNN
Euclidean distance
This is that true straight-line distance, the one that cuts right through the Earth. And note that the (x, y) here are the coordinates of the person’s position on the Earth used as data features, not the (x, Y) notation I used just now.
Manhattan distance
Compared with the Euclidean distance, this one is easy to understand: it is the actual distance you would have to walk along the street grid.
Those are the two distances we care about; there are others, but we won’t worry about them here.
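As a quick sketch (my own addition, not from the course), for feature vectors the two distances can be written like this in Python:

```python
import math

def euclidean_distance(a, b):
    # straight-line distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan_distance(a, b):
    # "walking" distance along a grid: sum of absolute coordinate differences
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(manhattan_distance((0, 0), (3, 4)))  # 7
```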
Look at a classification diagram of kNN
The k in this diagram is how many neighbors we count, from nearest to farthest. We can see that when k=1, the decision boundary is very jagged.
Imagine that in the same black neighborhood one household is black, the next is yellow, then two more are black and three are yellow; with k=1 the boundary has to twist and turn three or four times inside every building. But if we take k=20, all that matters is that this building has more black residents and that building has fewer, so we judge whether a person is black or yellow by which building he is in, and the classification boundary gets drawn between the buildings.
Therefore we can conclude: the smaller the value of k, the more tortuous the classification boundary and the more easily it is disturbed by noise, that is, the worse its resistance to noise.
However, if k is too large, you might conclude that the person is an alien.
Since this question of how to choose k is not covered in the course, we will have to work it out ourselves.
Copied from online:
If we select a smaller value of k, it is equivalent to making predictions with training instances in a smaller neighborhood. The approximation error of “learning” is reduced, since only training instances close to the input instance play a role in the prediction. The disadvantage is that the estimation error of “learning” increases: the prediction is very sensitive to the nearby instance points, and if a nearby instance happens to be noise, the prediction will be wrong. In other words, decreasing k means the overall model becomes more complex and prone to overfitting.
If it is hard to see why a small k makes the model complex, consider the opposite extreme and let k = N, where N is the size of the training set. Then no matter what the input instance is, the prediction is simply the class with the largest number of training instances, which is obviously unreasonable: the model at this point is extremely simple and completely ignores the large amount of useful information in the training instances.
If a larger value of k is selected, it is equivalent to making predictions with training instances in a larger neighborhood. The advantage is that the estimation error of “learning” is reduced, but the disadvantage is that the approximation error of “learning” increases: training instances far from the input instance also influence the prediction, which can make it wrong. A larger k means the overall model becomes simpler. In practice, k is generally taken to be a fairly small value, and the optimal k is usually selected by cross-validation.
(Quoted from the CSDN blogger “BlackEyes_SGC” under the CC 4.0 BY-SA license; original link: blog.csdn.net/u011204487/…)
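To make the k = N extreme concrete, here is a tiny sketch (my own illustration, with made-up one-dimensional data): when k covers the whole training set, every query simply gets the majority class, no matter where it sits.

```python
from collections import Counter

# made-up 1-D training set: six "black" samples and two "yellow" samples
train = [(1.0, "black"), (1.2, "black"), (1.4, "black"),
         (1.6, "black"), (1.8, "black"), (2.0, "black"),
         (9.0, "yellow"), (9.2, "yellow")]

def knn_predict(x, train, k):
    # take the k samples closest to x and return their most common label
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict(9.1, train, k=len(train)))  # "black": ignores x completely
print(knn_predict(9.1, train, k=1))           # "yellow": uses the local neighborhood
```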
Cross validation
Cross-validation, sometimes called rotation estimation, is a practical, statistically grounded way of slicing a data sample into smaller subsets; the idea was developed by Seymour Geisser. From a given modeling sample, take out most of the samples to build the model and hold back a small part; use the model just built to predict the held-back part, compute the prediction errors of those samples, and record their sum of squares. Repeat this until every sample has been predicted exactly once. The predicted residual error sum of squares (PRESS) is the sum of the squared prediction errors over all samples.
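To illustrate that “hold some out, predict it, add up the squared errors” loop, here is a leave-one-out PRESS sketch; the data and the trivial predictor (the mean of the remaining samples) are assumptions purely for illustration.

```python
# leave-one-out PRESS: each sample is held out once, predicted from the rest,
# and the squared prediction errors are summed
data = [2.0, 3.0, 5.0, 4.0, 6.0]

press = 0.0
for i, held_out in enumerate(data):
    rest = data[:i] + data[i + 1:]
    prediction = sum(rest) / len(rest)   # trivial "model": the mean of the rest
    press += (held_out - prediction) ** 2

print(press)  # the smaller this is, the better the model generalizes
```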
I barely understand that definition myself, but let me force an intuition on you anyway.
First, set the scope to the whole globe, compute an error, and use it to evaluate the model.
Then narrow the scope, say to the continent level: for Guangzhou we use Asia, and for other places their own continents; compute an error again and evaluate the model.
Keep narrowing: country, then city. Finally, we choose the k with the smallest error as the k used in this model.
I found an article that uses this cross-validation approach, so I’ll save it here to come back to once I can make sense of it:
kNN on the iris data set: using cross-validation to determine the optimal value of k, blog.csdn.net/woswod/arti…
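I haven’t worked through that article yet, but here is a minimal sketch of the same idea, assuming scikit-learn and its built-in iris data set (the library calls and the range of candidate k values are my assumptions, not taken from the article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# score each candidate k with 5-fold cross-validation
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# keep the k with the smallest error, i.e. the highest cross-validated accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```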
With that problem sorted out, let’s continue studying.