
Introduction to the k-nearest neighbors (KNN) algorithm

The simplest, most elementary classifier just memorizes all of the training data together with their labels: a test object is classified only when its attributes exactly match those of some training object. This raises two problems. First, it is unlikely that every test object can find a training object that matches it exactly. Second, a test object may match several training objects with different labels at the same time, so that one test object ends up assigned to multiple classes. KNN was developed to address these problems.

KNN classifies a sample by measuring the distances between feature values. The idea is that a sample belongs to a category if most of the K most similar samples (i.e., its nearest neighbors in feature space) belong to that category, where K is usually an integer no greater than 20. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified; the method decides the category of the sample to be classified based solely on the categories of its nearest one or few samples.

Here is a simple example. In the figure below, should the green circle be assigned to the class of the red triangles or the class of the blue squares? If K = 3, the red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red triangle class; if K = 5, the blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue square class. This also shows that the result of KNN depends heavily on the choice of K.
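To make the vote concrete, here is a minimal sketch of the majority vote, assuming (from the figure's setup) that the five nearest neighbors of the green circle are two red triangles and three blue squares, with the two triangles closest:

from collections import Counter

# Labels of the 5 nearest neighbors, sorted by increasing distance
# (assumed from the figure: the two closest neighbors are triangles).
neighbor_labels = ["triangle", "triangle", "square", "square", "square"]

for k in (3, 5):
    votes = Counter(neighbor_labels[:k])      # count labels among the k nearest
    print(k, votes.most_common(1)[0][0])      # K = 3 -> triangle, K = 5 -> square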

In KNN, the distance between objects is used as a measure of their dissimilarity, which avoids the problem of having to match objects exactly. The Euclidean distance or the Manhattan distance is generally used:

Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Meanwhile, KNN makes its decision based on the dominant category among the K nearest objects rather than on the category of a single object. These two points are the main advantages of the KNN algorithm.
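As a quick illustration of the two distance measures, here is a minimal sketch in plain Python (the two feature vectors are made-up example values):

import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = [1.5, 2.0, 0.5], [1.0, 1.0, 0.8]
print(euclidean(x, y))   # about 1.16
print(manhattan(x, y))   # 1.8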

Next, let us summarize the idea of the KNN algorithm: given a training set with known labels, take a test sample, compare its features with the corresponding features of every sample in the training set, and find the K training samples most similar to it; the predicted category of the test sample is then the category that appears most frequently among those K samples (a code sketch follows the steps below). The algorithm can be described as:

1) Calculate the distance between test data and each training data;

2) Sort the distances in increasing order;

3) Select K points with the smallest distance;

4) Determine the occurrence frequency of the categories of the first K points;

5) Return the category with the highest occurrence frequency among the first K points as the predicted classification of test data.
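Here is a minimal sketch of these five steps in plain Python; the function and variable names are my own, and the tiny training set at the end is just placeholder data:

from collections import Counter
import math

def knn_predict(test_point, train_points, train_labels, k):
    # 1) distance between the test sample and every training sample
    distances = [math.dist(test_point, p) for p in train_points]
    # 2) + 3) sort by increasing distance and keep the indices of the k nearest
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # 4) count how often each category appears among those k points
    votes = Counter(train_labels[i] for i in nearest)
    # 5) return the most frequent category as the prediction
    return votes.most_common(1)[0][0]

# Placeholder example: two training points per class
print(knn_predict([0.9, 1.1], [[1, 1], [1, 2], [5, 5], [6, 5]], ["A", "A", "B", "B"], k=3))  # -> A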

The target

Predicting happiness

Collecting survey data

  1. Student A: monthly salary 10,000 yuan, 2 hours overtime, 0.5 hours commuting
  2. Student B: monthly salary 20,000 yuan, 3 hours overtime, 1 hour commuting

Structured data processing

We arrange each record in the format: monthly salary (ten thousand yuan), overtime (hours), commuting (hours):

data_happy = [
    [1.5, 1.5, 0.5], [1.6, 2.1, 0.2], [1.8, 1.4, 0.5], [1.5, 2.0, 0.3],
    [1.7, 1.0, 0.4],
    [1.3, 1.0, 0.3], [1.5, 2.0, 0.5], [1.4, 1.7, 0.4], [1.6, 0.5, 0.5],
    [1.9, 2.0, 0.3],
]
data_sad = [
    [0.8, 2.3, 0.8], [1.0, 2.6, 0.9], [2.0, 1.0, 0.5], [1.2, 3.0, 1.1],
    [1.8, 3.5, 1.3],
]

Visualizing the survey data

We follow the standard supervised-learning workflow: existing data → train a model → new data (Student XX: 15,000 yuan per month, 2 hours of overtime, 0.8 hours of commuting) → predict whether that student is happy.
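A minimal prediction sketch with scikit-learn, assuming scikit-learn is installed and using the data_happy and data_sad lists defined above (K = 5 is an arbitrary choice for the example):

from sklearn.neighbors import KNeighborsClassifier

# Features: monthly salary (ten thousand yuan), overtime (hours), commuting (hours)
X = data_happy + data_sad
y = [1] * len(data_happy) + [0] * len(data_sad)   # 1 = happy, 0 = sad

model = KNeighborsClassifier(n_neighbors=5)       # K = 5, chosen only for illustration
model.fit(X, y)

# Student XX: 15,000 yuan per month -> 1.5, 2 hours overtime, 0.8 hours commute
print(model.predict([[1.5, 2.0, 0.8]]))           # [1] -> predicted happy on this toy data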

Choosing the value of K

We saw from the figure above that K matters, so how do we choose it? The usual answer is cross-validation: split the sample data into a training set and a validation set in a fixed proportion (for example 9:1), start from a small value of K, increase K step by step, and compute the error (variance) on the validation set for each K; the K with the lowest validation error is a reasonable choice. Computing the variance by cross-validation (training on 9 parts and validating on the remaining part in turn) gives roughly the following figure:
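A minimal sketch of this search with scikit-learn, again using the data_happy and data_sad lists from above; 5-fold cross-validation is used here instead of a 9:1 split only because the toy dataset is very small, and the candidate range of K is my own choice:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = data_happy + data_sad
y = [1] * len(data_happy) + [0] * len(data_sad)   # 1 = happy, 0 = sad

# Try several candidate values of K and keep the one with the best validation score
best_k, best_score = None, -1.0
for k in range(1, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()
print(best_k, best_score)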

Conclusion

The k-nearest neighbors algorithm can be summarized in four steps:

  1. Prepare data and preprocess data (cleaning: format, noise, etc.)
  2. Distance calculation
  3. Arrange the distances in ascending order, and then select the first k points with the smallest distances
  4. The new sample points are classified according to the majority principle