1. Basic principles

The principle of the k-nearest neighbors (KNN) algorithm is very simple: for a new sample, find the K samples in the existing data that are most similar to it, that is, the K samples "closest" to it. If most of these K samples belong to a certain category, then the new sample is assigned to that category as well.

In the picture below, for example, the stars represent romance films and the triangles represent science-fiction films. A new sample, drawn as a square, is added and its category needs to be determined. If the 3 nearest neighbors (K = 3) of the new sample are taken as the basis for judgment, these 3 points consist of 1 star and 2 triangles; by majority vote, the new sample is classified as a triangle, that is, a science-fiction film. Likewise, if the 5 nearest neighbors (K = 5) are taken as the basis, these 5 points consist of 3 stars and 2 triangles; by majority vote, the new sample is classified as a star, that is, a romance film.

Now that we understand the meaning of K, let us look at how to judge the similarity of two data points, that is, the distance between them. Here the most common measure, Euclidean distance, is used to define the distance between two points in a vector space. In two-dimensional space, if the two feature values of sample A are (X1, Y1) and those of sample B are (X2, Y2), the distance between the two samples is calculated as follows:

|AB| = √((X1 − X2)² + (Y1 − Y2)²)

In practical applications, the data usually have n features, and the distance formula extends to n-dimensional space. For example, if point A in n-dimensional space has coordinates (X1, X2, X3, …, Xn) and point B has coordinates (Y1, Y2, Y3, …, Yn), the Euclidean distance between A and B is calculated as follows:

|AB| = √((X1 − Y1)² + (X2 − Y2)² + … + (Xn − Yn)²)
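The n-dimensional formula above can be sketched in a few lines of Python; the sample points used here are made up purely for illustration.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 2-D case: A = (X1, Y1), B = (X2, Y2)
print(euclidean_distance((1, 2), (4, 6)))      # 5.0

# The same function handles the n-dimensional case
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```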

2. Calculation steps

  • 1. Sample data

A simple example, "how to judge the type of a wine", is used here to explain the calculation steps of the k-nearest neighbors algorithm. In commercial practice, many indicators are used to judge wine; for ease of demonstration, wines are divided into 2 categories based on only two feature variables, "alcohol content" and "malic acid content", and there are only 5 groups of original sample data, as shown in the table below.

In the table, "alcohol content" is the wine's alcohol level, "malic acid content" is its malic acid level, and a classification value of 0 denotes wine A while 1 denotes wine B.

Now the k-nearest neighbors algorithm is used to classify a new sample, whose feature data are shown in the following table. Does this new sample belong to wine A or wine B?

  • 2. Calculate distance

At this point, the distance formula can be used to calculate the distance between the new sample and each existing sample, that is, the similarity between the samples. For example, the distance between the new sample and sample 1 is calculated as follows.
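The calculation can be sketched as follows. The feature values are taken from the code example later in the article (sample 1 = (5, 2), new sample = (7, 1)); since the original table is not reproduced here, treat them as assumed values.

```python
import math

# (alcohol content, malic acid content) -- values assumed from the
# code example later in the article; the original table is not shown.
sample_1 = (5, 2)
new_sample = (7, 1)

d = math.sqrt((new_sample[0] - sample_1[0]) ** 2
              + (new_sample[1] - sample_1[1]) ** 2)
print(round(d, 2))  # sqrt(5), roughly 2.24
```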

Similarly, the distance between the new sample and other original samples can be calculated, as shown in the following table.

  • 3. Determine the category according to K value

After calculating the distance between each original sample and the new sample, sort the samples from nearest to farthest. The results are shown in the following table.

If K is set to 1, the category of the single original sample closest to the new sample is taken as the category of the new sample. The new sample is closest to sample 2, so its classification is 0, that is, wine A.

If K is set to 3, the majority category among the 3 original samples closest to the new sample is used. These 3 samples are sample 2, sample 1 and sample 4; classification 0 is the majority among them, so the new sample is classified as 0, that is, wine A.
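The sort-and-vote step above can be sketched by hand as follows. The feature values are assumed (the original table is not reproduced here), so the neighbor ordering need not match the table exactly, but the predicted category is wine A in both cases.

```python
import math
from collections import Counter

# Assumed (alcohol content, malic acid content) values; labels: 0 = wine A, 1 = wine B
samples = [(5, 2), (6, 1), (8, 3), (10, 2), (12, 3)]
labels = [0, 0, 0, 1, 1]
new = (7, 1)

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Sort sample indices from nearest to farthest
order = sorted(range(len(samples)), key=lambda i: distance(samples[i], new))

for k in (1, 3):
    majority = Counter(labels[i] for i in order[:k]).most_common(1)[0][0]
    print(f"K={k}: predicted class {majority}")
```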

3. Data standardization

In the demonstration data used in this section, the scales of the different feature variables do not differ much. But if the "alcohol content" data are enlarged to 10 times their original values while the "malic acid content" data remain unchanged, the scales of the two variables differ greatly, as shown in the following table.

At this point, if the k-nearest neighbors algorithm is used directly to build the model, the importance of "alcohol content" in the model will far exceed that of "malic acid content", effectively discarding "malic acid content" as a feature variable and introducing a large error into the results. For example, for a new sample with 70% alcohol content and 1% malic acid content, its distance from sample 1 is calculated as follows.

It can be seen that the distance is almost entirely determined by "alcohol content"; because of the large difference in scale, "malic acid content" plays almost no role. If the data are not preprocessed, the prediction results will be biased.
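The effect can be checked numerically. Following the text's hypothetical example, sample 1 after scaling is assumed to be (50, 2) and the new sample (70, 1).

```python
import math

# Assumed values after scaling alcohol content by 10x:
# sample 1 = (50, 2), new sample = (70, 1)
new, sample_1 = (70, 1), (50, 2)
d = math.sqrt((new[0] - sample_1[0]) ** 2 + (new[1] - sample_1[1]) ** 2)
print(round(d, 2))  # roughly 20.02 -- almost entirely the 20-unit alcohol gap
```

The malic acid term contributes only 1 out of 401 under the square root, so it is effectively invisible to the model.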

Therefore, when the scales of different feature variables differ greatly and affect the modeling, the data are usually preprocessed first; this is called data standardization or data normalization. Common methods include min-max normalization (also known as deviation standardization) and Z-score standardization (also known as zero-mean normalization).
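Both methods can be sketched on a single feature column; the values below (the scaled alcohol contents) are assumed, since the table is not reproduced here.

```python
# Assumed scaled alcohol contents (10x the original values)
values = [50.0, 60.0, 80.0, 100.0, 120.0]

# Min-max normalization: x' = (x - min) / (max - min), mapping into [0, 1]
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: x' = (x - mean) / std, giving mean 0 and std 1
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_score = [(v - mean) / std for v in values]

print(min_max)
print(z_score)
```

In practice, scikit-learn's `preprocessing.MinMaxScaler` and `preprocessing.StandardScaler` apply the same transformations column by column.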

4. Code implementation

```python
from sklearn.neighbors import KNeighborsClassifier as KNN

# (alcohol content, malic acid content); the last sample's values are
# assumed so that x and y have matching lengths, as the table is not shown
x = [[5, 2], [6, 1], [8, 3], [10, 2], [12, 3]]
y = [0, 0, 0, 1, 1]
knn = KNN(n_neighbors=3)
knn.fit(x, y)
print(knn.predict([[7, 1]]))
```

The output is as follows:

array([0])

5. K nearest neighbor algorithm regression model

The previous code used KNeighborsClassifier, the classification model of the k-nearest neighbors algorithm. The algorithm can also perform regression analysis, with the corresponding model being KNeighborsRegressor. The classification model takes the most frequent category among the K training samples closest to the point to be predicted as that point's category, whereas the regression model takes the average of the target values of the K closest training samples as that point's predicted value.

The demonstration code of the k-nearest neighbors regression model is as follows.

```python
from sklearn.neighbors import KNeighborsRegressor as KNR

x = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]
model = KNR(n_neighbors=2)
model.fit(x, y)
print(model.predict([[5, 5]]))
```

Here x is the feature variable, with two features per sample, and y is the target variable. The model is created with the hyperparameter n_neighbors (K) set to 2, trained with the fit() function, and then the predict() function makes a prediction. The prediction is as follows.

array([2.5])