Abstract: This article briefly introduces the basic idea of the k-nearest neighbors algorithm and its implementation in Python, analyzes its advantages, disadvantages, and scope of application, and is suitable for beginners to understand and practice.

As freshmen start school, some universities have made headlines by assigning roommates according to their interests, which involves the use of machine learning algorithms. In addition, freshmen are likely to join at least a few student organizations or societies upon entering college. Clubs are divided into different categories based on students' interests, so how do you define these categories, or distinguish one organization from another? If you ask the people who run these clubs, they will surely say their club is not the same as the others, and yet the clubs are similar in certain ways. For example, the hometown association and the high school alumni reunion share a common background; the football club and the badminton association share an interest in sports; the science and technology innovation association and the entrepreneurship club have similar pursuits, and so on. Perhaps by measuring what these groups do or how they operate, you could work out for yourself which of them interest you. But there is an algorithm that can help you make better decisions: the k-nearest neighbors (k-NN) algorithm. In this article, student clubs are used to explain some concepts of the k-NN algorithm, which is arguably the simplest machine learning algorithm: the model it builds consists only of the stored training data set. The algorithm predicts a new data point by finding the closest data point in the training set, its "nearest neighbor".


How the algorithm works

In its simplest version, the k-NN algorithm considers only one nearest neighbor: the training data point closest to the point we want to predict. The prediction is then simply the known output of that training point. The following figure illustrates this for a constructed classification dataset.


As you can see from the figure, we have added three new data points, represented by stars. For each of them, we mark the closest point in the training set; the prediction of the one-nearest-neighbor algorithm is the label of that point (shown by the colour of the cross).
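As a minimal sketch of this one-nearest-neighbor idea (the toy points below are made up for illustration and are not the dataset shown in the figure):

import numpy as np

def one_nn_predict(x_new, X_train, y_train):
    # Distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # The prediction is simply the label of the closest training point
    return y_train[np.argmin(distances)]

# Hypothetical 2D toy data: two classes, 0 and 1
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [8.0, 8.5], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
print(one_nn_predict(np.array([1.5, 1.8]), X_train, y_train))  # expected: 0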

Similarly, we can consider any number k of neighbors instead of just the single nearest one; that is where the k-NN algorithm gets its name. When considering multiple neighbors, we use voting to assign a label. This means that for each test point we count how many of its neighbors belong to class 0 and how many belong to class 1, and then assign the class that is more frequent among them: in other words, the majority among the k nearest neighbors decides. The following example uses five nearest neighbors:


Again, the predictions are shown by the colour of the cross. As you can see from the figure, the prediction for the new data point in the upper left corner is not the same as it was when we used only one nearest neighbor.

Although this figure only shows a binary classification problem, the approach can be applied to datasets with any number of classes. For multi-class problems, we likewise count how many of the k neighbors fall into each class and select the class with the largest count as the prediction, as in the sketch below.
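A minimal sketch of this majority vote, assuming the labels of the k nearest neighbors have already been collected into a list (the names and values here are purely illustrative):

from collections import Counter

def majority_vote(neighbor_labels):
    # Count how many neighbors belong to each class and return the most common one
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote([0, 1, 1, 2, 1]))  # expected: 1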

Implementing the k-NN algorithm from scratch

The following is the pseudo-code of the k-NN algorithm for classifying a data point (call it point A):

  • First, for each point in the data set, calculate the distance between point A and the current point;
  • Then, sort the distances in ascending order;
  • Next, take the k points closest to A as its nearest neighbors;
  • After that, find the majority class among these neighbors;
  • Finally, return that majority class as our prediction for point A.

The Python implementation code is as follows:

import numpy as np
import operator

def knnClassify(A, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Compute the Euclidean distance between A and every point in the data set
    diffMat = np.tile(A, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices of the distances sorted in ascending order
    sortedDistIndices = distances.argsort()
    # Vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the (label, count) pairs by count, from largest to smallest
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

Let’s take a closer look at the code above:

  • The function knnClassify requires four input parameters: the input vector to classify, called A; the full matrix of training samples, called dataSet; the vector of labels, called labels; and k, the number of nearest neighbors to use in the vote.
  • The distance between A and each point is calculated using the Euclidean distance.
  • The distances are sorted in ascending order.
  • The k smallest distances are chosen, and those neighbors vote on the class of A.
  • After that, the classCount dictionary is decomposed into a list of (label, count) tuples, which is sorted by the second item of each tuple. Because reverse=True is set, the list is ordered from the largest count to the smallest.
  • Finally, the most frequent class label is returned (see the usage example below).
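A quick way to try the function on a small hand-made data set (the values below are purely illustrative) might look like this:

import numpy as np

# Hypothetical toy training data: two clusters labelled 'A' and 'B'
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knnClassify(np.array([0.1, 0.2]), dataSet, labels, 3))  # expected: 'B'
print(knnClassify(np.array([0.9, 1.0]), dataSet, labels, 3))  # expected: 'A'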

Implementing k-NN with scikit-learn

Scikit-learn is a machine learning toolkit that incorporates many machine learning algorithms. Now let's look at how to implement k-NN using scikit-learn. The code is as follows:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris data set
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build the classifier with k = 5 and fit it on the training set
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict the test set and evaluate the accuracy
predictions = clf.predict(X_test)
print("Test set predictions:\n{}".format(predictions))
accuracy = clf.score(X_test, y_test)
print("Test set accuracy: {:.2f}".format(accuracy))

Let’s take a look at the code above:

  • First, the iris data set is loaded;
  • Then, the data is split into a training set and a test set so that generalization performance can be evaluated;
  • After that, the number of neighbors (k) is specified as 5;
  • Next, the classifier is fitted using the training set;
  • To predict the test data, for each point in the test set the predict method finds its nearest neighbors in the training set and picks the most frequent class among them;
  • Finally, the generalization ability of the model is evaluated by calling the score method with the test data and test labels.

After the model is run, 97% accuracy is obtained on the test set, which means that the model correctly predicts the category for 97% of the samples in the test data set.
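To get a feel for how the choice of k affects the result, a small experiment could simply loop over several values of n_neighbors. This sketch assumes the X_train, X_test, y_train, y_test variables and the KNeighborsClassifier import from the snippet above:

# Try several values of k and report the test accuracy for each
for k in [1, 3, 5, 7, 9]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    print("k = {}: test accuracy = {:.2f}".format(k, clf.score(X_test, y_test)))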


Advantages and disadvantages

Generally speaking, the k-NN classifier has two important parameters: the number of neighbors and the way the distance between data points is measured.

  • In practice, a small number of neighbors, such as three or five, usually works well; of course, this parameter should be tuned to the specific problem;
  • Choosing the right distance measure can be difficult. By default the Euclidean distance is used, and it works well in many settings; scikit-learn also lets you swap in another metric, as in the sketch after this list.
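As an illustration (the parameter values are chosen only for demonstration), KNeighborsClassifier accepts a metric argument, so switching from the default Euclidean distance to, say, the Manhattan distance is a one-line change:

from sklearn.neighbors import KNeighborsClassifier

# Default behaviour is the Euclidean distance (Minkowski distance with p=2)
clf_euclidean = KNeighborsClassifier(n_neighbors=5)
# Manhattan (city-block) distance as an alternative measure
clf_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")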

One of the advantages of k-NN is that the model is very easy to understand and often achieves reasonable performance without much parameter tuning. It is a good baseline to try before considering more advanced techniques. Building a k-NN model is usually fast, but when the training set is very large (whether in the number of features or the number of samples), prediction can take a lot of time. In addition, it is very important to preprocess the data before applying k-NN, for example by scaling the features as sketched below. This approach generally performs poorly on data sets with many features (hundreds or more), and especially badly on data sets where most features are zero most of the time (so-called sparse data sets).
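Because k-NN is distance-based, features on a large scale can dominate the vote. A common remedy, shown here as a sketch with scikit-learn's standard tools and using the iris data only as a stand-in, is to standardize the features before fitting the classifier:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale each feature to zero mean and unit variance, then apply k-NN
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(pipe.score(X_test, y_test)))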

Conclusion

The k-NN algorithm is a simple and effective method for classifying data. It is an instance-based machine learning algorithm: to make predictions it must carry the complete training data set with it, so large data sets require a lot of storage. In addition, the distance between the prediction point and every data point in the data set has to be computed, which can be cumbersome and time-consuming. Another disadvantage is that k-NN gives you no idea of the underlying structure of the data, or of what an "average" or "typical" sample of each category looks like.

Therefore, although the k-NN algorithm is easy to understand, it is not often used in practice because prediction is slow and it cannot handle data with many features.

References

  • Peter Harrington, Machine Learning in Action (2012)
  • Andreas C. Müller and Sarah Guido, Introduction to Machine Learning with Python (2016)

The above was translated by the Ali Yunqi Community organization from "K-nearest Neighbors: Who are Close to You". This is an abridged translation; for more details, please refer to the original article.
