
Are you more knowledgeable today?

🎞 Basic Concepts

The k-nearest neighbor (KNN) algorithm is a basic classification and regression method.

Given a training data set, the k-nearest neighbor algorithm finds the k instances in the training set that are closest to a new input instance. If most of these k instances belong to a certain class, the input instance is classified into that class (similar to the idea of the minority obeying the majority).

  • If k = 3, the three nearest neighbors of the green dot are two small red triangles and one small blue square; the minority obeys the majority, so by this statistical rule the green point to be classified is judged to belong to the red triangles.
  • If k = 5, the five nearest neighbors of the green dot are two red triangles and three blue squares; again the minority obeys the majority, so by the same rule the green point to be classified is judged to belong to the blue squares.

Therefore, as the example above shows, the core idea of the k-nearest neighbor algorithm is: to classify a new point, find the k instances closest to it and assign the point to whichever class appears most among them.
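To make the idea concrete, here is a minimal from-scratch sketch (assuming plain numeric feature vectors and Euclidean distance; the data and names are purely illustrative):

```python
# Minimal k-nearest-neighbor classification by majority vote.
# Assumes numeric feature tuples and Euclidean distance; illustrative only.
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point.
    distances = [math.dist(query, p) for p in train_points]
    # Indices of the k closest training points.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # "The minority obeys the majority": vote over the neighbors' labels.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data mirroring the figure: triangles and squares around two centers.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["triangle", "triangle", "triangle", "square", "square", "square"]
print(knn_predict(points, labels, query=(2, 2), k=3))  # -> triangle
```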

🔈 Selection of the k value

🚩 k value is too small

If the value of k is too small, the overall model becomes more complex and overfitting occurs easily.

So-called overfitting means very high accuracy on the training set but low accuracy on the test set.

Suppose we have the following training data and a point to be classified:

In the figure above there are two classes, black dots and blue rectangles, and we now need to classify the red pentagon.

When k = 1, we can easily see that the point nearest to the pentagon is a black dot. According to the idea of the k-nearest neighbor algorithm, we would therefore conclude that the pentagon to be classified belongs to the black dots.

If k is too small, we easily learn the noise 🙁: the classification is dominated by a noisy point and the true distribution of the data is ignored. In the figure above, if k is made a little larger, say k = 8, the blue rectangles are included and we easily obtain the correct classification: blue rectangle.

🚩 k value is too large

If the value of k is too large, training instances far from the input instance (i.e., dissimilar ones) also influence the prediction, leading to prediction errors. Increasing k means the overall model becomes simpler.

When k = n (where n is the number of training samples), then no matter what the input instance is, it will simply be predicted to belong to the class with the most training instances. At that point the model has become extremely simple 🙄: it is equivalent to not training a model at all, just counting how many training instances each class has and picking the largest count.

In the figure above there are 9 black dots and 7 blue rectangles. If k = 16 (the number of samples, 9 + 7), then according to the k-nearest neighbor algorithm the conclusion is that the red pentagon belongs to the black dots, which is obviously wrong 😅

At this point the model is too simple; completely ignoring the large amount of useful information in the training instances is undesirable.

The value of k should therefore be neither too large nor too small. Choosing k so that only the points within the red circle boundary are used, as shown below, works best.

😳 In general, how do we pick the value of k when the features are multidimensional?

According to Dr. Li Hang's book Statistical Learning Methods, we generally choose a relatively small value of k and usually adopt cross-validation to select the optimal value.

Cross validation

The core idea of cross-validation is to try a number of candidate values of k one by one and then select the k that works best.

The most common form is S-fold cross-validation, which works as follows:

First, the given data is randomly partitioned into S disjoint subsets of the same size. Then S-1 of the subsets are used to train the model and the remaining subset is used to test it; this process is repeated for each of the S possible choices of the held-out subset. Finally, the value of k whose model has the smallest average test error over the S evaluations is selected.

Another common heuristic is to set k equal to the square root of the number of instances in the training set.
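As a rough sketch of both ideas, a model can be scored for each candidate k with S-fold cross-validation (here S = 5, using scikit-learn and a synthetic data set purely for illustration):

```python
# Choosing k by 5-fold cross-validation, plus the square-root heuristic.
# scikit-learn is assumed to be installed; the data set is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Try several candidate k values and keep the one with the best mean score.
scores = {}
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()
best_k = max(scores, key=scores.get)
print("best k by 5-fold cross-validation:", best_k)

# Rule-of-thumb alternative: k close to the square root of the sample count.
print("square-root heuristic:", int(np.sqrt(len(X))))
```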

♠ Measure of distance

As mentioned above, the k-nearest neighbor algorithm finds the k instances in the training data set that are closest to the given instance; the class to which most of these k instances belong tells us which class to predict for the new point.

But how is "closest" measured in this definition ❓ How do we know which points are nearest to the test point? This leads to several measures of distance.

We can measure it in the following ways:
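The most commonly used measures are special cases of the general $L_p$ (Minkowski) distance; for $x_i, x_j \in \mathbf{R}^n$ it is defined as

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}$$

  • p = 1 gives the Manhattan distance $L_1(x_i, x_j) = \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|$.
  • p = 2 gives the Euclidean distance $L_2(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^2 \right)^{\frac{1}{2}}$.
  • p → ∞ gives the maximum coordinate difference $L_\infty(x_i, x_j) = \max_l |x_i^{(l)} - x_j^{(l)}|$.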

In practical applications, the choice of distance function should be determined by the characteristics of the data and the needs of the analysis. Different distance measures can determine different nearest neighbors; in general, the Euclidean distance (p = 2) is chosen.

The necessity of feature normalization

First, an example: suppose we use a person's height (cm) and foot size (shoe size) as the feature values, and the class label is male or female. There are 5 training samples, distributed as follows:

A [(179, 42), male], B [(178, 43), male], C [(165, 36), female], D [(177, 42), male], E [(160, 35), female]

It is easy to see that the first feature (height) is roughly 4 times as large as the second feature (foot size), so when measuring distance we implicitly favor the first feature. The two features are then not equally important, which may ultimately lead to distorted distance calculations and therefore prediction errors.

For example, suppose there is now a test sample F (167, 43), and we want to predict whether this person is male or female, taking k = 3.

Below, we compute the Euclidean distance between F and each training sample, take the three nearest ones, and let the majority class among them be the final result. The calculation is as follows:
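Using the sample values listed above:

$$d(F, A) = \sqrt{(167-179)^2 + (43-42)^2} \approx 12.04$$
$$d(F, B) = \sqrt{(167-178)^2 + (43-43)^2} = 11$$
$$d(F, C) = \sqrt{(167-165)^2 + (43-36)^2} \approx 7.28$$
$$d(F, D) = \sqrt{(167-177)^2 + (43-42)^2} \approx 10.05$$
$$d(F, E) = \sqrt{(167-160)^2 + (43-35)^2} \approx 10.63$$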

According to the calculation, the three nearest samples are C, D and E. C and E are female, D is male; females outnumber males, so our prediction result is female.

So here is the question: the probability of a woman's foot being size 43 is much lower than the probability of a man's foot being size 43, so why does the algorithm still predict that F is female? The reason is that the features have different scales, which makes height far more important than foot size in the distance calculation; this is not objective. So we should make each feature equally important!

The normalization formula is as follows:
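A common choice is min-max normalization, which rescales each feature into [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature over the training set. After rescaling, height and foot size contribute on a comparable scale; with the sample values above, the three nearest neighbors of F become D, B and A, and F is correctly predicted to be male.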

📽 Classification decision rules

The classification decision rule in the k-nearest neighbor method is usually majority voting: the class that appears most among the k nearest training instances of the input instance determines the class of the input instance.

The majority voting rule can be explained as follows:

If the classification loss function is the 0-1 loss, the classification function is:
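Following the standard formulation in Li Hang's Statistical Learning Methods, the classification function maps a feature vector to one of the K classes:

$$f : \mathbf{R}^n \rightarrow \{c_1, c_2, \ldots, c_K\}$$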

Then the probability of misclassification is:
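In that formulation the misclassification probability is

$$P(Y \neq f(X)) = 1 - P(Y = f(X))$$

For a given instance $x$ with k-nearest neighborhood $N_k(x)$, if the predicted class is $c_j$, the misclassification rate within the neighborhood is

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

so minimizing the misclassification rate (the empirical risk) is equivalent to maximizing $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, which is exactly the majority voting rule.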

In other words, among the candidate classes c1, c2, ..., cj, we choose the one that minimizes our empirical risk (colloquially, the empirical risk is the error on the training data).

📓 Summary

1️⃣ The k-nearest neighbor method is a basic and simple classification and regression method. Its process is: given a data set, when a new instance arrives, the simplest and most brute-force approach is to compute the distance between it and all the points, then find the k nearest neighbors, and finally determine the class of the instance according to the majority voting rule.

2️⃣ The k-nearest neighbor model corresponds to a partition of the feature space based on the training data set. In the k-nearest neighbor method, once the training set, the distance measure, the value of k, and the classification decision rule are determined, the result is uniquely determined.

3️⃣ The three elements of the k-nearest neighbor method are the distance measure, the selection of the value of k, and the classification decision rule. Common distance measures are the Euclidean distance and the more general Lp distance. When k is small, the k-nearest neighbor model is more complex; when k is large, it is simpler. The selection of k reflects the trade-off between approximation error and estimation error, and the optimal k is usually selected by cross-validation. The commonly used classification decision rule is majority voting, which corresponds to empirical risk minimization.

💻 Code example

Simple KNN examples, a KNN implementation of iris classification, kd-tree construction and kd-tree search, and more are all available at 👉gitee.com/xin-yue-qin… 👈
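For a quick self-contained taste of the iris example, here is a minimal scikit-learn sketch (an illustration only, not the code from the repository above):

```python
# KNN on the iris data set with feature normalization, as a minimal sketch.
# scikit-learn is assumed to be installed; this is not the repository's code.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalize features to [0, 1] (see the normalization section), then k=3 vote.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```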

Reference: One Article to Understand the k-Nearest Neighbor (KNN) Algorithm – Yizhen