• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • This paper addresses: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reproduction, please contact the platform and the author, and indicate the source.

Introduction

The k-nearest neighbors (KNN) algorithm is a simple and fundamental machine learning method.

The idea behind KNN also shows up in everyday life. For example, to judge a person's character, we often only need to look at the character of the people closest to them. That is the intuition behind KNN, and the method can be used not only for classification but also for regression. In this article we walk through the principles of KNN.

(This article assumes basic machine learning knowledge. Readers without this background can first read the ShowMeAI article Illustrated Machine Learning | Machine Learning Basics.)

1. Machine learning and classification

1) Classification problems

The classification problem is a very important part of machine learning. Its goal is to determine which category a sample belongs to based on certain of its features, using a set of samples whose categories are known. Classification problems can be subdivided as follows:

  • Binary classification problem: the classification task has exactly two categories, and a new sample must be assigned to one of the two known classes.

  • Multiclass classification problem: the classification task has more than two categories.

  • Multilabel classification problem: each sample is assigned a set of target labels.

2) Mathematical abstraction of classification problems

To solve a classification problem algorithmically, the training data is mapped to sample points in an n-dimensional space (where n is the feature dimension). The task is then to assign the points of this n-dimensional sample space to categories.

The following figure shows two classes of sample points in a two-dimensional plane. Our model (classifier) learns a way to distinguish the categories, for example a straight line that separates the two classes of sample points.

Classification problems arise in many application scenarios. A few examples:

  • Spam identification: a binary classification problem, classifying messages as "spam" or "normal".

  • Image content recognition: since the image content can be of many types (cats, dogs, people, and so on), this is a multiclass classification problem.

  • Text sentiment analysis: this can be treated as a binary classification problem, dividing sentiment into positive or negative, or as a multiclass classification problem with more sentiment levels, such as very negative, negative, positive, and very positive.

2. The core idea of the k-nearest neighbor algorithm

In the field of pattern recognition, the k-nearest neighbor algorithm (KNN) is a nonparametric statistical method used for classification and regression. In both cases, the input consists of the k closest training samples in the feature space.

1) The core idea of k-nearest neighbors

In KNN classification, the output is a class label. An object is classified by a "majority vote" of its neighbors: it is assigned to the most common class among its k nearest neighbors (k is a positive integer, usually small).

  • If k = 1, the object is simply assigned the class of its single nearest neighbor.

In KNN regression, the output is the attribute value of the object: the average of the values of its k nearest neighbors.

The k-nearest neighbor method uses a vector space model: cases of the same category are highly similar to each other, so the likely category of an unknown case can be estimated by computing its similarity to known cases.

KNN is a kind of instance-based learning, or "lazy learning": it approximates only locally and postpones all computation until classification. The k-nearest neighbor algorithm is among the simplest of all machine learning algorithms.
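To make the two modes concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier and KNeighborsRegressor; the tiny 2-D dataset is invented purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Invented toy data: four 2-D points with class labels and numeric targets.
X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_class = np.array(["A", "A", "B", "B"])    # labels for classification
y_value = np.array([10.0, 9.5, 1.0, 1.5])   # numeric targets for regression

# Classification: majority vote among the k nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[0.9, 0.9]]))  # -> ['A'] (two of its 3 neighbors are 'A')

# Regression: average of the k nearest neighbors' values.
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
print(reg.predict([[0.9, 0.9]]))  # -> [7.0] (mean of 10.0, 9.5 and 1.5)
```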

2) Example: bean classification

Think about it: the picture below contains three kinds of beans, plus three beans whose species is unknown. How can we determine their species?

In 1967, Cover and Hart proposed the original nearest neighbor method. The idea: an unknown bean is assigned the species of whichever known bean it is closest to.

From this we get the definition of the nearest neighbor algorithm: to determine the category of an unknown sample, use all training samples as representative points, compute the distance from the unknown sample to every training sample, and take the category of its nearest neighbor as the sole basis for deciding the unknown sample's category.

The disadvantage of the nearest neighbor algorithm is that it is too sensitive to noisy data. As the figure shows, the circled blue point and the circled red point are equally distant from the green point, so the green point's category cannot be determined by the nearest neighbor rule alone.

To solve this problem, we can consider several nearby samples around the unknown sample, enlarging the set of samples involved in the decision so that no single data point determines the result by itself.

This leads to the k-nearest neighbor algorithm: within a certain range around the unknown sample, select the k nearest samples; if most of these k samples belong to some category, the unknown sample is judged to belong to that category. The k-nearest neighbor algorithm is an extension of the nearest neighbor algorithm.

According to the k-nearest neighbor rule, two of the three points closest to the green point are red and one is blue. Since red points outnumber blue ones among the neighbors, the green point is classified as red.

3. K-nearest neighbor algorithm: steps and examples

The following first summarizes the steps of the k-nearest neighbor algorithm, then demonstrates its computation with an example.

1) The working principle of the k-nearest neighbor algorithm

  • There is a sample data set, also called the training sample set, in which every sample carries a label; that is, we know the correspondence between each sample in the set and its category.

  • When new, unlabeled data is input, each of its features is compared with the corresponding features of the samples in the set, and the algorithm extracts the classification labels of the most similar (nearest-neighbor) samples.

  • In general, only the k most similar samples in the data set are considered, with k usually no larger than 20. Finally, the category that occurs most often among these k samples is chosen as the category of the new data.
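These three steps translate almost line by line into code. Below is a from-scratch sketch in plain NumPy, with toy data invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Compare the new point's features with every training sample (Euclidean distance).
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Take the k most similar (smallest-distance) samples...
    nearest = np.argsort(dists)[:k]
    # ...and return the label that occurs most often among them.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Invented toy data: two features per sample, two classes.
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(["blue", "blue", "red", "red"])
print(knn_classify(np.array([5.5, 8.5]), X_train, y_train, k=3))  # -> 'red'
```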

2) Parameter selection for the k-nearest neighbor algorithm

  • How to choose an optimal value of k depends on the data. In general, a larger k reduces the influence of noise on the classification but blurs the boundaries between categories. A good k can be found with various heuristic techniques (see hyperparameter optimization).

  • Noise, irrelevant features, or feature scales that are inconsistent with the features' importance can severely reduce the accuracy of the k-nearest neighbor algorithm. Much research has gone into selecting and scaling features to improve classification: one common approach uses evolutionary algorithms to optimize feature scaling, another selects features using the mutual information of the training samples.

  • In binary (two-class) classification problems, choosing an odd k helps avoid tied votes between the two classes. In this setting the bootstrap ("self-help") method can be used to select the best empirical k, as sketched below.
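As a rough sketch of that bootstrap idea: repeatedly resample the training set with replacement, score each candidate (odd) k on the out-of-bag samples, and keep the k with the best average accuracy. The data and the candidate list below are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))             # invented 2-feature data
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # invented binary labels

def bootstrap_score(k, n_rounds=30):
    """Average out-of-bag accuracy of k-NN over bootstrap resamples."""
    scores = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        oob = np.setdiff1d(np.arange(len(X)), idx)   # held-out (out-of-bag) rows
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        scores.append(clf.score(X[oob], y[oob]))
    return np.mean(scores)

best_k = max([1, 3, 5, 7, 9], key=bootstrap_score)   # odd values avoid tied votes
print("best k:", best_k)
```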

Note: KNN has no explicit training process. It is the archetype of "lazy learning": it only stores the data during the training stage (so the training time cost is zero) and processes test samples only when they arrive.

3) A worked example of the k-nearest neighbor algorithm

For example, take movie classification: movie genres include romance, action, and so on. What are the characteristics of a romance film? What are the characteristics of an action film? Given a movie, how do we classify it?

If a movie contains a lot of kissing and little fighting, it is clearly a romance film, and vice versa.

Suppose the numbers of fight and kiss scenes in a set of movies have been counted, as shown here. Given a movie with data (18, 90), i.e. 18 fight scenes and 90 kiss scenes, how do we determine its genre?

Sorting the known movies by increasing distance from this movie, we can find the k movies closest to it.

Taking k = 3 and looking at the genres of the three closest movies (romance, romance, action), the vote gives this unknown movie 2 votes for romance and 1 for action, so we classify it as a romance film.
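In code this vote looks as follows. The exact counts from the figure are not reproduced in the text, so the fight/kiss numbers of the known movies below are hypothetical; only the query point (18, 90) comes from the example:

```python
import numpy as np
from collections import Counter

# Hypothetical known movies: (fight scenes, kiss scenes) and their genres.
movies = np.array([[3, 104], [2, 100], [99, 5], [98, 2]])
labels = np.array(["romance", "romance", "action", "action"])

query = np.array([18, 90])                      # the unknown movie from the text
dists = np.linalg.norm(movies - query, axis=1)  # Euclidean distance to each movie
order = np.argsort(dists)                       # sort by increasing distance

votes = Counter(labels[order[:3]])              # k = 3
print(votes)                                    # Counter({'romance': 2, 'action': 1})
print(votes.most_common(1)[0][0])               # -> 'romance'
```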

4. Shortcomings and improvements of the k-nearest neighbor algorithm

1) Advantages and disadvantages of the k-nearest neighbor algorithm

Sample points of different categories are distributed in different regions of the space. K-nearest neighbor classification is based on the categories of spatially close samples, which is essentially a partition of the feature space.

  • Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.

  • Disadvantages: High computational complexity and space complexity.

  • Applicable data range: numerical and nominal.

2) A core element of the k-nearest neighbor algorithm: the distance metric

The nearest neighbor algorithm implicitly computes the decision boundary in an effective way. It can also compute the decision boundary explicitly, and do so efficiently, so that the computational complexity is a function of the boundary complexity. Since the k-nearest neighbor algorithm relies on nearby points in the space to make its category judgments, the criterion used to measure distance is very important.

The distance metric is a core element of many algorithms (for example, unsupervised clustering algorithms also rely heavily on distance measures) and has a great influence on their results.

The Lp distance (also known as the Minkowski distance) is not a single distance but the definition of a family of distances. Its most common special cases are the following (see the code sketch after this list):

  • When p = 1, it is the Manhattan distance (also known as the L1 distance or city block distance): the sum of the absolute differences of the two points' coordinates in the standard coordinate system.

  • When p = 2, it is the Euclidean distance (also known as the L2 distance or Euclidean metric): the familiar straight-line distance between two points.

  • When p → ∞, it is the Chebyshev distance: the maximum of the absolute differences over all coordinates.
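These three special cases of the Lp distance, Lp(a, b) = (sum_i |a_i - b_i|^p)^(1/p), are easy to compute directly. A small NumPy sketch with made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])
diff = np.abs(a - b)  # |a_i - b_i| per coordinate: [3.0, 2.0, 0.5]

manhattan = diff.sum()                  # p = 1: sum of absolute differences -> 5.5
euclidean = np.sqrt((diff ** 2).sum())  # p = 2: straight-line distance -> ~3.64
chebyshev = diff.max()                  # p -> infinity: largest difference -> 3.0

print(manhattan, euclidean, chebyshev)
```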

3) A core element of the k-nearest neighbor algorithm: the size of k

For the KNN algorithm, the value of k is also very important. Choosing a smaller k makes the overall model more complex (prone to overfitting): the approximation error of the learned model decreases, but its estimation error increases.

Choosing a larger k makes the overall model simpler and reduces the estimation error, but at the cost of a larger approximation error.

In practical applications, a relatively small value of k is generally used, and cross-validation is employed to select an optimal value, as sketched below.
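A minimal sketch of choosing k by cross-validation with scikit-learn; the dataset here is generated as a stand-in for real training data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Score each candidate k by 5-fold cross-validation and keep the best.
candidate_ks = range(1, 21)
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidate_ks]
best_k = candidate_ks[int(np.argmax(scores))]
print("best k:", best_k)
```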

4) Shortcomings and improvements of the k-nearest neighbor algorithm

(1) Disadvantages

Consider the example below. For sample X, KNN clearly and correctly places X in the red category. For sample Y, however, the KNN vote assigns Y to the blue category even though Y lies closer to the red sample points. The basic KNN algorithm thus only counts how many neighbors each category contributes and ignores how far away they are.
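A common remedy (one option among several, not a method described further in this article) is to weight each neighbor's vote by the inverse of its distance, which scikit-learn exposes as weights='distance'. The toy data below is constructed to reproduce the Y situation: a plain vote picks blue, while a distance-weighted vote picks red:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented data: two red points near the query, four blue points far away.
X = np.array([[0.0, 0.0], [0.1, 0.1],
              [2.0, 2.0], [2.1, 2.0], [2.0, 2.1], [2.1, 2.1]])
y = np.array(["red", "red", "blue", "blue", "blue", "blue"])
query = [[0.2, 0.2]]

plain = KNeighborsClassifier(n_neighbors=5).fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

print(plain.predict(query))     # -> ['blue']: 3 of the 5 nearest are blue
print(weighted.predict(query))  # -> ['red']: the two close red points outweigh them
```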

Beyond the distance issue just described, KNN also has the following disadvantages:

  • Strong dependence on the capacity of the sample base greatly restricts the practical application of the KNN algorithm: many categories cannot provide enough training samples, so the relatively uniform feature space that KNN requires cannot be achieved, resulting in large recognition error.

  • Determination of k: the KNN algorithm requires a value of k to be specified, and a poor choice of k cannot guarantee classification accuracy.

(2) Improvement methods

Accelerating the classification speed of the KNN algorithm:

  • When the number of training samples is large, the training sample set can be edited: an optimal reference subset is selected from the original training set for the k-nearest neighbor search, which reduces the storage of training samples and improves computational efficiency.

  • Speeding up the k-nearest neighbor search: this class of methods uses fast search algorithms to find the k nearest neighbors of the sample to be classified in a short time, as sketched below.
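One widely used fast-search structure is the k-d tree (scikit-learn accepts algorithm='kd_tree' for this; SciPy exposes the structure directly). A minimal sketch with random data:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))     # invented data: 10,000 points in 3-D

tree = cKDTree(points)               # build the k-d tree once...
dists, idx = tree.query([0.5, 0.5, 0.5], k=5)  # ...then find 5 neighbors quickly
print(idx, dists)
```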

Maintaining the training sample base:

  • The training sample base should be maintained to meet the needs of the KNN algorithm: samples are added or removed with appropriate measures to keep the base at a suitable size, admitting samples that meet the conditions and deleting some samples from the base when conditions are met, so that the samples in the base provide the relatively uniform feature space that the KNN algorithm requires.

5. Case study

Suppose you plan to rent out a house but do not know the market price. Based on the house's specifications (area, number of rooms, number of toilets, capacity, and so on), you can search the existing data set for houses with similar specifications (its k nearest neighbors) and see how much the same or similar units were rented for.

Estimation process: in the known data set, each rented house has fields such as the number of rooms, the number of toilets, and the capacity, together with the corresponding rental price. Compare the house to be rented with every record in the data set by computing the Euclidean distance, take the 5 records with the smallest distances, and average their prices; this average can be regarded as the market price of the house.
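A sketch of this estimate with scikit-learn's KNeighborsRegressor; all listings, field values, and prices below are hypothetical:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical listings: [rooms, toilets, capacity] and their monthly rents.
X = np.array([[2, 1, 3], [3, 2, 5], [1, 1, 2], [2, 2, 4], [3, 1, 4],
              [1, 1, 1], [4, 2, 6], [2, 1, 2], [3, 2, 6], [2, 2, 3]])
rents = np.array([1500, 2400, 900, 1800, 2100, 800, 3000, 1300, 2600, 1700])

# Average the rents of the 5 most similar listings (Euclidean distance).
model = KNeighborsRegressor(n_neighbors=5).fit(X, rents)
print(model.predict([[2, 1, 4]]))  # estimated market price for the planned rental
```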

Note:

  • It is best not to test on all the data: separate the data into a training set and a test set, with the split ratio determined by the data set.

  • Ideally, every field in the data set would have the same range of values, but in practice this is almost never the case. Using the raw data directly in the computation would cause large errors, so the data in each column should be standardized or normalized to minimize unnecessary error (see the sketch after these notes).

  • Non-numeric fields in the data set need to be converted, for example by removing the dollar sign ($) and the thousands-separator comma.
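A combined sketch of the last two notes, cleaning the formatted price field with pandas and then min-max normalizing every column into [0, 1]; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw listings with a formatted, non-numeric price field.
df = pd.DataFrame({"rooms": [2, 3, 1],
                   "capacity": [3, 5, 2],
                   "price": ["$1,500", "$2,400", "$900"]})

# Convert the price field: strip the dollar sign and the thousands comma.
df["price"] = (df["price"]
               .str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
               .astype(float))

# Min-max normalize each column so every field contributes on the same
# [0, 1] scale to the distance computation.
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```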

ShowMeAI related articles recommended

  • 1. Machine learning basics
  • 2. Model evaluation methods and criteria
  • 3. KNN algorithm and its application
  • 4. Detailed explanation of logistic regression algorithm
  • 5. Detailed explanation of naive Bayes algorithm
  • 6. Detailed explanation of decision tree model
  • 7. Detailed explanation of random forest classification model
  • 8. Detailed analysis of regression tree model
  • 9. Detailed explanation of GBDT model
  • 10. The XGBoost model is fully resolved
  • 11. Detailed explanation of LightGBM model
  • 12. Detailed explanation of support vector machine model
  • 13. Detailed explanation of clustering algorithm
  • 14. PCA dimension reduction algorithm in detail

ShowMeAI series tutorials recommended

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • The Mathematical Basics of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master series of tutorials
  • Illustrated Machine Learning Algorithms: From Beginner to Master series of tutorials