1. What is KNN

1.1 An intuitive explanation of KNN

What is the k-nearest neighbor algorithm? The k-nearest neighbor algorithm, abbreviated as KNN, can be crudely summarized as "find the K nearest neighbors". When K = 1, it reduces to the nearest neighbor algorithm, i.e. simply find the single closest neighbor.

Stated more formally, the so-called k-nearest neighbor algorithm is this: given a training data set and a new input instance, find the K instances in the training set that are nearest to the new instance (its K neighbors); the new instance is then assigned to the class to which the majority of these K neighbors belong.

As shown in the figure above, there are two different types of sample data, represented by small blue squares and small red triangles respectively. The data point marked by the green circle in the middle of the figure is the point to be classified. In other words, we do not yet know which category the green point belongs to (blue square or red triangle); KNN is designed to answer exactly this question.

If K = 3, the three nearest neighbors of the green point are two red triangles and one blue square. Following the "minority obeys the majority" rule, this statistical vote assigns the green point to the red triangle class.

If K = 5, the five nearest neighbors of the green point are two red triangles and three blue squares. Again by majority vote, the green point is assigned to the blue square class.

Thus, when we cannot directly decide which of the known categories a new point belongs to, we can, following this statistical idea, look at its neighborhood, weigh the votes of its neighbors, and assign it to the category that carries the greater weight. This is the core idea of the k-nearest neighbor algorithm.
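
As a concrete illustration of this voting idea, here is a minimal Python sketch; the coordinates and labels below are made up to mimic the figure and are not taken from the original:

```python
from collections import Counter
import math

# Made-up coordinates mimicking the figure: three red triangles, three blue squares.
samples = [((3.0, 2.5), "red triangle"), ((2.5, 2.0), "red triangle"),
           ((5.0, 4.0), "red triangle"), ((2.0, 2.5), "blue square"),
           ((2.2, 3.0), "blue square"),  ((2.0, 1.8), "blue square")]

def knn_vote(query, samples, k):
    """Return the majority class among the k samples closest to `query`."""
    by_distance = sorted(samples, key=lambda s: math.dist(query, s[0]))
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

green_point = (2.8, 2.4)                    # the point to be classified
print(knn_vote(green_point, samples, k=3))  # -> "red triangle"  (2 red vs 1 blue)
print(knn_vote(green_point, samples, k=5))  # -> "blue square"   (3 blue vs 2 red)
```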

1.2 Distance measurement of nearest neighbors

As we have seen, the core of the k-nearest neighbor algorithm is finding the neighbors of a query point. This immediately raises a series of questions: how do we find the neighbors, what is the criterion for being a neighbor, and with what do we measure closeness? The answer is the set of distance measures described next.

What are the commonly used distance measures?

  1. Euclidean distance is the most common measure of the distance between two or more points, also known as the Euclidean metric. It is defined in Euclidean space: for points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, the distance is $d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$.


    • Euclidean distance between two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane: $d_{ab} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$


    • Euclidean distance between two points $a(x_1, y_1, z_1)$ and $b(x_2, y_2, z_2)$ in three-dimensional space: $d_{ab} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$


    • Euclidean distance between two n-dimensional vectors $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$: $d_{ab} = \sqrt{\sum_{k=1}^{n}(x_{1k} - x_{2k})^2}$


      It can also be expressed as a vector operation: $d_{ab} = \sqrt{(a - b)(a - b)^T}$


  2. Manhattan distance can be formally defined as the $L_1$ distance or city block distance: the sum of the lengths of the projections, onto the axes of a fixed rectangular coordinate system in Euclidean space, of the line segment joining two points. For example, in the plane, the Manhattan distance between point $P_1$ at coordinates $(x_1, y_1)$ and point $P_2$ at coordinates $(x_2, y_2)$ is $d(P_1, P_2) = |x_1 - x_2| + |y_1 - y_2|$. Note that the Manhattan distance depends on the rotation of the coordinate system, but not on its translation or on reflection about a coordinate axis.

    In layman's terms, imagine driving from one intersection to another in Manhattan. Is the driving distance the straight-line ("as the crow flies") distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from; it is also known as the city block distance.

    • Manhattan distance between two points $a(x_1, y_1)$ and $b(x_2, y_2)$ in a two-dimensional plane: $d_{ab} = |x_1 - x_2| + |y_1 - y_2|$


    • Manhattan distance between two n-dimensional vectors $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$: $d_{ab} = \sum_{k=1}^{n}|x_{1k} - x_{2k}|$


  3. Chebyshev distance: if two points (or vectors) $p$ and $q$ have coordinates $p_i$ and $q_i$, the Chebyshev distance between them is defined as: $D_{\mathrm{Chebyshev}}(p, q) = \max_i |p_i - q_i|$


    This is also the limiting value of the following $L_p$ metrics: $\lim_{k \to \infty}\left(\sum_{i=1}^{n}|p_i - q_i|^k\right)^{1/k}$, so the Chebyshev distance is also called the $L_\infty$ metric.

    From a mathematical point of view, the Chebyshev distance is the metric induced by the uniform norm (or supremum norm), and it is an example of a hyperconvex metric space.

    In plane geometry, if two points $p$ and $q$ have Cartesian coordinates $(x_1, y_1)$ and $(x_2, y_2)$, their Chebyshev distance is: $d = \max(|x_2 - x_1|, |y_2 - y_1|)$


    Those of you who have played chess may know that the king can move to any of the eight adjacent squares in one move. What is the minimum number of moves the king needs to go from square $(x_1, y_1)$ to square $(x_2, y_2)$? You will find that it is always $\max(|x_2 - x_1|, |y_2 - y_1|)$ moves. A distance measure of this kind is the Chebyshev distance.

    • Chebyshev distance between two points $a(x_1, y_1)$ and $b(x_2, y_2)$ in a two-dimensional plane: $d_{ab} = \max(|x_1 - x_2|, |y_1 - y_2|)$


    • Chebyshev distance between two n-dimensional vectors $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$: $d_{ab} = \max_k |x_{1k} - x_{2k}|$


      An equivalent form of this formula is $d_{ab} = \lim_{p \to \infty}\left(\sum_{k=1}^{n}|x_{1k} - x_{2k}|^p\right)^{1/p}$


  4. Minkowski distance: the Minkowski distance is not a single distance but the definition of a whole family of distances.

    The Minkowski distance between two n-dimensional variables $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$ is defined as: $d_{ab} = \sqrt[p]{\sum_{k=1}^{n}|x_{1k} - x_{2k}|^p}$


    Here $p$ is a variable parameter. When $p = 1$ it is the Manhattan distance; when $p = 2$ it is the Euclidean distance; and as $p \to \infty$ it becomes the Chebyshev distance. Depending on the value of the parameter, the Minkowski distance can therefore represent a whole class of distances.

  5. Standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance. The idea: since the components of the data are distributed differently along each dimension, first "standardize" every component so that they all have the same mean and variance. Speaking of mean and variance, let us briefly review some statistics.

    Suppose the mathematical expectation (mean) of a sample set $X$ is $m$ and its standard deviation is $s$. The "standardized variable" $X^*$ of $X$ is $(X - m)/s$, and the standardized variable has expectation 0 and variance 1. That is, the standardization of a sample set can be described by the formula: $X^* = \dfrac{X - m}{s}$


    In other words, standardized value = (value before standardization − mean of the component) / standard deviation of the component. A simple derivation then gives the standardized Euclidean distance between two n-dimensional vectors $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$: $d_{ab} = \sqrt{\sum_{k=1}^{n}\left(\dfrac{x_{1k} - x_{2k}}{s_k}\right)^2}$, where $s_k$ is the standard deviation of the $k$-th component.


  6. Mahalanobis distance

    Given M sample vectors $X_1, \ldots, X_M$ with covariance matrix $S$ and mean vector $\mu$, the Mahalanobis distance from a sample vector $X$ to $\mu$ is: $D(X) = \sqrt{(X - \mu)^T S^{-1} (X - \mu)}$


    • If the covariance matrix is the identity matrix (the components of the sample vectors are independent and identically distributed), the formula reduces to the Euclidean distance: $D(X) = \sqrt{(X - \mu)^T (X - \mu)}$


    • If the covariance matrix is diagonal, the formula becomes the standardized Euclidean distance.

    Advantages of the Mahalanobis distance: it is dimensionless (scale-invariant) and removes the interference of correlations between variables.

  7. Bhattacharyya distance

    In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient, which measures the amount of overlap between two statistical samples or populations. Both are named after A. Bhattacharyya, a statistician who worked at the Indian Statistical Institute in the 1930s. The Bhattacharyya coefficient can be used to judge how close two samples are, and it is used to measure the separability of classes in classification.

    For discrete probability distributions $p$ and $q$ over the same domain $X$, the Bhattacharyya distance is defined as:

    $D_B(p, q) = -\ln\big(BC(p, q)\big)$

    where:

    $BC(p, q) = \sum_{x \in X}\sqrt{p(x)\,q(x)}$

    is the Bhattacharyya coefficient.

  8. Hamming distance

    The Hamming distance between two strings of equal length, s1 and s2, is the minimum number of substitutions required to change one into the other. For example, the Hamming distance between the strings 1111 and 1001 is 2. Application: information coding (to improve fault tolerance, the minimum Hamming distance between code words should be made as large as possible).

  9. Cosine of the included angle

    In geometry, the cosine of the included angle can be used to measure the difference in direction between two vectors. Machine learning borrows this concept to measure the difference between sample vectors.

    • The cosine of the angle between vector $A(x_1, y_1)$ and vector $B(x_2, y_2)$ in two-dimensional space: $\cos\theta = \dfrac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$


    • The cosine of the angle between two n-dimensional sample points $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$: $\cos\theta = \dfrac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^2}\,\sqrt{\sum_{k=1}^{n} x_{2k}^2}} = \dfrac{a \cdot b}{|a|\,|b|}$


    The cosine of the included angle ranges over [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. The cosine attains its maximum value 1 when the two vectors point in the same direction, and its minimum value -1 when they point in opposite directions.

  10. Jaccard similarity coefficient

    The proportion of the elements in the intersection of two sets A and B relative to their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B). It is a measure of how similar the two sets are: $J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$


    The opposite concept to the Jaccard similarity coefficient is the Jaccard distance: $J_\delta(A, B) = 1 - J(A, B) = \dfrac{|A \cup B| - |A \cap B|}{|A \cup B|}$


  11. Pearson correlation coefficient

    In statistics, Pearson’s product moment correlation coefficient is used to measure the correlation (linear correlation) between two variables X and Y, with a value between -1 and 1. In general, the correlation strength of variables can be judged by the following value range:

    0.8–1.0: extremely strong correlation
    0.6–0.8: strong correlation
    0.4–0.6: moderate correlation
    0.2–0.4: weak correlation
    0.0–0.2: very weak or no correlation

Simply put, the typical application scenarios of the various "distances" can be summarized as follows (a short code sketch follows the list):

  • Space: Euclidean distance,
  • City-block paths: Manhattan distance,
  • Chess kings: Chebyshev distance,
  • Unified form of the three above: Minkowski distance,
  • Weighted by component variance: standardized Euclidean distance,
  • Dimensions and correlations removed: Mahalanobis distance,
  • Difference in vector direction: cosine of the included angle,
  • Coding differences: Hamming distance,
  • Set similarity: Jaccard similarity coefficient and Jaccard distance,
  • Correlation: correlation coefficient and correlation distance.
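
To make the formulas above concrete, here is a small sketch in Python with NumPy (the language, the example vectors, and the sample set used to estimate $s_k$ are all assumptions for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))            # sqrt(sum (a_i - b_i)^2)
manhattan = np.sum(np.abs(a - b))                    # sum |a_i - b_i|
chebyshev = np.max(np.abs(a - b))                    # max |a_i - b_i|
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1.0 / p)  # general L_p form

# Standardized Euclidean distance needs the per-component standard deviation s_k,
# estimated here from a small made-up sample set.
X = np.array([[1.0, 2.0, 3.0], [4.0, 0.0, 3.0], [2.0, 1.0, 5.0]])
s = X.std(axis=0)
std_euclidean = np.sqrt(np.sum(((a - b) / s) ** 2))

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cos(theta)

s1, s2 = "1111", "1001"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))    # -> 2, as in the example above

A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)                    # Jaccard similarity coefficient

print(euclidean, manhattan, chebyshev, minkowski,
      std_euclidean, cosine, hamming, jaccard)
```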

1.3 K value selection

  1. If a smaller value of K is chosen, it is equivalent to making the prediction with training instances in a smaller neighborhood. The approximation error of "learning" decreases, because only training instances close (similar) to the input instance contribute to the prediction. The problem is that the estimation error of "learning" increases: the prediction becomes very sensitive to the nearby instance points, and if a nearby point happens to be noise the prediction will be wrong. In other words, a smaller K makes the overall model more complex and prone to overfitting;
  2. If a larger value of K is selected, it is equivalent to making the prediction with training instances in a larger neighborhood. The advantage is that the estimation error of learning decreases, but the disadvantage is that the approximation error increases: training instances far from (dissimilar to) the input instance also influence the prediction and can make it wrong. A larger K means the overall model becomes simpler.
  3. When K = N (the size of the training set), the model is completely inadequate: whatever the input instance is, it is simply predicted to belong to the class with the most instances in the training set. Such a model is far too simple and ignores a large amount of useful information in the training instances.

In practical applications, K is generally taken to be a relatively small value, and the optimal K is usually selected by cross-validation (simply put, part of the samples are used as the training set and the rest as the validation set), as sketched below.
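
For example, one common way to pick K by cross-validation, sketched here with scikit-learn and the Iris data set (both are assumptions for illustration; the original text does not prescribe a library or data set):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, -1.0
for k in range(1, 16):                      # try a range of small K values
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: average accuracy over the held-out folds.
    score = cross_val_score(knn, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print("best K:", best_k, "cross-validated accuracy:", round(best_score, 3))
```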

1.4 The process of KNN nearest neighbor classification algorithm

  1. Compute the distance between the test sample and every sample in the training set (common distance measures include the Euclidean distance, Mahalanobis distance, etc.);
  2. Sort all of the distance values obtained above;
  3. Select the k training samples with the smallest distances;
  4. Vote according to the labels of these k samples to obtain the final classification, as in the sketch below.
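
A minimal NumPy sketch of these four steps, using a brute-force distance computation and made-up toy data (an illustration, not the project's actual implementation):

```python
import numpy as np
from collections import Counter

def knn_classify(test_point, train_X, train_y, k=3):
    # 1. Compute the distance between the test sample and every training sample
    #    (Euclidean distance is used here).
    distances = np.sqrt(np.sum((train_X - test_point) ** 2, axis=1))
    # 2. Sort all the distance values.
    order = np.argsort(distances)
    # 3. Select the k training samples with the smallest distances.
    k_labels = train_y[order[:k]]
    # 4. Vote among the labels of these k samples to get the final class.
    return Counter(k_labels.tolist()).most_common(1)[0][0]

# Toy data (made up): two classes in the plane.
train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.1, 0.1], [0.0, 0.2]])
train_y = np.array(["A", "A", "B", "B"])
print(knn_classify(np.array([0.2, 0.1]), train_X, train_y, k=3))  # -> "B"
```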

2. Implementation of KNN: the KD tree

A kd-tree (short for k-dimensional tree) is a data structure that partitions data points in a k-dimensional space (e.g. 2-D $(x, y)$, 3-D $(x, y, z)$, ..., k-D $(x_1, x_2, \ldots, x_k)$). It is mainly used for key searches over multidimensional data (such as range searches and nearest neighbor searches). In essence, a kd-tree is a balanced binary tree.

First of all, it must be made clear that a kd-tree is a space-partitioning tree. Put bluntly, the whole space is divided into several specific parts, and the relevant search operations are then carried out only within certain parts of the space. Imagine a three-dimensional space (higher dimensions are harder to visualize); a kd-tree divides this three-dimensional space into multiple subspaces according to certain partitioning rules, as shown in the figure below:

2.1 KD tree Construction

The pseudo-code of kd tree construction is shown in the figure below:

Let us introduce the kd-tree construction algorithm with a simple, intuitive example. Assume there are six two-dimensional data points {(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)}, located in the two-dimensional plane as shown in the figure below. To find nearest neighbors efficiently, the kd-tree adopts the idea of divide and conquer, i.e. the whole space is split into several small parts. First the thick black line splits the space into two parts; then, within the two subspaces, the thin black lines split the space into four parts; finally, the dashed black lines subdivide those four parts further.

The steps for building a kd-tree from the six two-dimensional data points {(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)} are as follows:

  1. Determine the split dimension: split = x. Specifically, the variances of the 6 data points along the x and y dimensions are 39 and 28.63 respectively; the variance along the x-axis is larger, so the split dimension is x.
  2. Determine the node data: node-data = (7, 2). Specifically, the data are sorted by their values in the x dimension, and the median of the six values (the median being the value in the middle) is 7, so the node-data point is (7, 2). The splitting hyperplane of this node is therefore the line x = 7 passing through (7, 2) and perpendicular to the split axis x;
  3. Determine the left and right subspaces. Specifically, the hyperplane x = 7 divides the space into two parts: the part with x <= 7 is the left subspace, containing the three points {(2, 3), (5, 4), (4, 7)}; the other part is the right subspace, containing the two points {(9, 6), (8, 1)};
  4. As mentioned in the algorithm above, building a kd-tree is a recursive process. Repeating the procedure used for the root node on the data in the left and right subspaces yields the next-level child nodes (5, 4) and (9, 6), while the space and the data set are further subdivided. The process repeats until each subspace contains only one data point.

After partitioning the space as shown above, we can see that the point (7, 2) is the root node, the two thick red lines from the root point to its left and right children (5, 4) and (9, 6), the points (2, 3) and (4, 7) are the left and right children of (5, 4) (connected by the two thin red lines), and finally (8, 1) is the left child of (9, 6) (connected by a thin red line). This yields the following kd-tree:

For N instances of k-dimensional data, the time complexity of building a kd-tree is O(k·N·logN). A small construction sketch follows.
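
The construction procedure above can be sketched in a few lines of Python; the KDNode class and the NumPy usage are illustrative assumptions, not the original pseudo-code:

```python
import numpy as np

class KDNode:
    def __init__(self, point, split_dim, left=None, right=None):
        self.point = point          # the data point stored at this node
        self.split_dim = split_dim  # dimension of the splitting hyperplane
        self.left, self.right = left, right

def build_kdtree(points):
    """Recursively build a kd-tree from a list/array of k-dimensional points."""
    if len(points) == 0:
        return None
    points = np.asarray(points, dtype=float)
    split_dim = int(np.argmax(points.var(axis=0)))   # dimension with largest variance
    points = points[points[:, split_dim].argsort()]  # sort along that dimension
    median = len(points) // 2                        # median point becomes node-data
    return KDNode(point=points[median],
                  split_dim=split_dim,
                  left=build_kdtree(points[:median]),
                  right=build_kdtree(points[median + 1:]))

# The six example points from the text:
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point, tree.split_dim)   # root: [7. 2.], split dimension 0 (the x-axis)
```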

The kd-tree algorithm can be divided into two parts: one part is the algorithm for building the kd-tree data structure itself, described above; the other part consists of the algorithms for the various operations on an existing kd-tree, such as insertion, deletion, and search (nearest neighbor search). Next, let us look at the insertion, deletion, and search operations of the kd-tree in turn.

2.2 Insertion of KD tree

Elements are inserted into a kd-tree in much the same way as into a binary search tree. Essentially, x-coordinates are compared on even levels and y-coordinates on odd levels. When we reach the bottom of the tree (i.e. hit a null pointer), we know where the new node should be inserted. The shape of the resulting kd-tree depends on the order in which the nodes are inserted. Given N points, the average cost of inserting or retrieving a single node is O(log₂N).

The insertion process is as follows:

Note that during the insertion process each node splits the plane it occupies into two parts. For example, Chicago splits all the nodes in the plane into two parts: those with x-coordinate less than 35, and those with x-coordinate greater than or equal to 35. Similarly, Mobile splits all the nodes whose x-coordinate is greater than 35 into two parts: those with y-coordinate less than 10 and those with y-coordinate greater than or equal to 10. Toronto and Buffalo continue to split according to the same bisection rule. A sketch of this insertion logic is given below.
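
A minimal sketch of the insertion logic just described (comparing x on even levels and y on odd levels); the KDNode class and the coordinates are illustrative assumptions, not the original code:

```python
class KDNode:
    def __init__(self, point, split_dim, left=None, right=None):
        self.point, self.split_dim = point, split_dim
        self.left, self.right = left, right

def kd_insert(root, point, depth=0):
    """Insert a 2-D point into a kd-tree, alternating x/y comparisons by level."""
    dim = depth % 2                      # even levels compare x, odd levels compare y
    if root is None:                     # reached a null pointer: this is the spot
        return KDNode(point=point, split_dim=dim)
    if point[dim] < root.point[dim]:
        root.left = kd_insert(root.left, point, depth + 1)
    else:
        root.right = kd_insert(root.right, point, depth + 1)
    return root

# Example: insert a few points one by one (coordinates are made up).
root = None
for p in [(35, 42), (52, 10), (9, 77), (25, 35)]:
    root = kd_insert(root, p)
```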

2.3 Deletion of KD tree

Deletion from a kd-tree can be implemented with a recursive procedure. Suppose we want to delete the node (a, b) from the kd-tree. If both subtrees of (a, b) are empty, replace (a, b) with an empty tree. Otherwise, find a suitable replacement node in one of the subtrees of (a, b), say (c, d), recursively delete (c, d) from the kd-tree, and then replace (a, b) with (c, d). Suppose (a, b) is an x-discriminator node; then its replacement must be either the node with the maximum x-value in the left subtree of (a, b) or the node with the minimum x-value in the right subtree of (a, b). A sketch of this replacement-finding step is given below.
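
A sketch of the key subroutine, finding the node with the minimum coordinate along a given dimension in a subtree, assuming the KDNode layout used in the construction sketch; the full deletion then recursively removes that replacement node as described above:

```python
def kd_find_min(node, dim):
    """Return the node with the smallest coordinate in dimension `dim`
    within the subtree rooted at `node` (illustrative sketch)."""
    if node is None:
        return None
    if node.split_dim == dim:
        # This node splits on `dim`: the minimum is this node or in its left subtree.
        left_min = kd_find_min(node.left, dim)
        return left_min if left_min is not None else node
    # Otherwise the minimum may be this node or lie in either subtree.
    candidates = [c for c in (node,
                              kd_find_min(node.left, dim),
                              kd_find_min(node.right, dim)) if c is not None]
    return min(candidates, key=lambda c: c.point[dim])

# Deleting a node (a, b) that splits on x would then, as described above, replace it
# with kd_find_min(node.right, dim=0) and recursively delete that replacement
# from the right subtree.
```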

Here is a practical example (source: electronic courseware from the China University of Geosciences; an error in the original courseware has been corrected below). The picture below shows the original point set and the corresponding kd-tree. We now want to delete node A from the figure; the sequence of deletion steps is as follows:

To delete node A in the figure above, select the node with the smallest x-coordinate in the right subtree of A, here node C, and C becomes the new root, as shown below:

Next, find a node in the right subtree of C to replace C; here it is D. Convert the left subtree of D into its right subtree, and D takes the previous position of C, as shown below:

In the new right subtree of D, find the node with the smallest x-coordinate; here it is H, and H takes the place of D.

Then find the node with the minimum y-coordinate in the right subtree of D; here it is I, and I takes the original position of H. Node A has now been successfully deleted from the figure, as shown below:

Removing a node from a kd-tree is expensive; clearly, the cost of deleting the root of a subtree is bounded by the number of nodes in that subtree. Let TPL(T) denote the total path length of tree T; it can be shown that the sum of the sizes of all subtrees of the tree is TPL(T) + N. The TPL of a tree formed by randomly inserting N points is O(N·log₂N), which means the upper bound on the average cost of deleting a randomly selected node from a randomly built kd-tree is O(log₂N).

2.4 The nearest neighbor search algorithm of KD tree

The pseudo-code of the K-D tree query algorithm is as follows:

I also wrote a recursive version of the two-dimensional kd-tree search function, which you can compare with the pseudo-code:

For example,

The asterisk marks the query point (2, 4.5). Through binary search, an approximate nearest point can be found quickly by descending along the search path. However, the leaf node found this way is not necessarily the true nearest neighbor; the true nearest neighbor must lie inside the circle centered at the query point and passing through that leaf node. To find the true nearest neighbor, a "backtracking" step is needed: the algorithm searches back up along the search path to check whether any other region may contain a data point closer to the query point.

  1. Binary tree search: starting from (7,2), the search first reaches node (5,4), whose partition hyperplane is y = 4. Since the query point has y = 4.5, the search enters the upper subspace and reaches (4,7), giving the search path <(7,2), (5,4), (4,7)>. The distance between (4,7) and the query point is 3.202, while the distance between (5,4) and the query point is 3.041, so (5,4) becomes the current nearest neighbor.
  2. Backtracking: draw a circle centered at (2,4.5) with radius 3.041, as shown in the figure below. The circle intersects the hyperplane y = 4, so the search must also enter the left subspace of (5,4); that is, node (2,3) is added to the search path, giving <(7,2), (2,3)>. The search then reaches the leaf node (2,3). The distance from (2,4.5) to (2,3) is 1.5, which is smaller than the distance to (5,4), so the nearest neighbor is updated to (2,3) and the nearest distance is updated to 1.5.
  3. Backtracking continues to (5,4) and finally to the root node (7,2). A circle centered at (2,4.5) with radius 1.5 does not intersect the splitting hyperplane x = 7, as shown in the figure below. At this point the backtracking along the search path is finished, and the nearest neighbor (2,3) is returned with nearest distance 1.5 (a recursive sketch of this search is given below).
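
The descend-and-backtrack procedure just walked through can be written as a short recursive function. The sketch below reuses the `tree` and KDNode class from the construction sketch in section 2.1 and is an illustrative version, not the author's original listing:

```python
import math

def kd_nearest(node, query, best=None):
    """Recursive nearest-neighbor search with backtracking on a kd-tree."""
    if node is None:
        return best
    # Update the current best with this node if it is closer.
    dist = math.dist(query, tuple(node.point))
    if best is None or dist < best[1]:
        best = (tuple(node.point), dist)
    dim = node.split_dim
    diff = query[dim] - node.point[dim]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    # 1. Descend into the subspace that contains the query point.
    best = kd_nearest(near, query, best)
    # 2. Backtrack: only search the other side if the circle of radius
    #    `current best distance` crosses the splitting hyperplane.
    if abs(diff) < best[1]:
        best = kd_nearest(far, query, best)
    return best

point, distance = kd_nearest(tree, (2, 4.5))
print(point, round(distance, 3))   # -> (2.0, 3.0) 1.5, matching the walkthrough
```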

2.5 Kd tree neighbor search algorithm improvement: BBF algorithm

If the instance points are randomly distributed, the average computational complexity of a kd-tree search is O(logN), where N is the number of training instances. The kd-tree is therefore better suited to k-nearest neighbor search when the number of training instances is much larger than the dimensionality of the space. When the dimensionality approaches the number of training instances, its efficiency drops rapidly to almost that of a linear scan.

As described in the k-nearest neighbor search algorithm above, "the search ends when it has backtracked to the root node". Every nearest-neighbor query therefore finishes only after backtracking all the way back to the root, which causes many unnecessary backtracks and node comparisons. This redundant cost makes the search quite inefficient on high-dimensional data. So how can the original kd-tree nearest neighbor search algorithm be improved?

From the standard kd-tree query process above, we can see that the "backtracking" is determined purely by the "query path" and does not take into account the properties of the data points along that path. A simple improvement is to order the nodes on the query path, for example by the distance between each node's partition hyperplane (also called a bin) and the query point, so that backtracking always starts from the tree node with the highest priority (the best bin).

Take the above query (2,4.5) as an example, the search algorithm flow is as follows:

  1. Put (7,2) into the priority queue;
  2. Extract (7,2) from the priority queue. Since (2,4.5) lies to the left of the splitting hyperplane x = 7 of node (7,2), its left child (5,4) is searched;
  3. At the same time, following the BBF rule "when searching the left/right subtree, push the sibling node at this level, i.e. the right/left child, into the queue", the sibling of (5,4), namely the right child (9,6), is pushed into the priority queue;
  4. At this point the priority queue is {(9,6)} and the current "best point" is (7,2). Node (5,4) is then processed in the same way; afterwards the priority queue is {(2,3), (9,6)} and the "best point" is updated to (5,4);
  5. Extract the highest-priority node (2,3) and repeat step 2, continuing until the priority queue is empty (a priority-queue sketch is given below).
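
A compact sketch of this best-bin-first idea using Python's heapq, again reusing the `tree` from section 2.1; real BBF implementations usually also cap the number of examined leaves, which is omitted here for brevity (an illustrative assumption, not the original implementation):

```python
import heapq, math

def bbf_nearest(root, query):
    """Nearest-neighbor search driven by a priority queue of 'bins'."""
    best_point, best_dist = None, float("inf")
    # Priority = distance from the query point to the node's splitting hyperplane.
    heap = [(0.0, 0, root)]              # (priority, tie-breaker, node)
    counter = 1
    while heap:
        priority, _, node = heapq.heappop(heap)
        if priority >= best_dist:        # no closer point can exist in remaining bins
            break
        # Descend from this node to a leaf, always taking the near side and
        # pushing the far sibling (the other bin) onto the priority queue.
        while node is not None:
            dist = math.dist(query, tuple(node.point))
            if dist < best_dist:
                best_point, best_dist = tuple(node.point), dist
            dim = node.split_dim
            diff = query[dim] - node.point[dim]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            if far is not None:
                heapq.heappush(heap, (abs(diff), counter, far))
                counter += 1
            node = near
    return best_point, best_dist

print(bbf_nearest(tree, (2, 4.5)))       # -> ((2.0, 3.0), 1.5)
```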

2.6 Application of KD tree

For the SIFT + KD-tree/BBF search algorithm, refer to the references at the end of this article.

3. Some questions about KNN

  1. In K-means or kNN, we use Euclidean distance to calculate the distance between the nearest neighbors. Why not use the Manhattan distance?

    **A:** We do not usually use the Manhattan distance because it only measures horizontal and vertical displacements and is therefore tied to the orientation of the axes. The Euclidean distance, on the other hand, measures distance in any direction of the space. Since data points can lie anywhere in the space, the Euclidean distance is the more general choice. For example, think of a chess board: the movement of pieces such as the rook ("chariot") or the elephant in Chinese chess is measured by the Manhattan distance, because they move along the horizontal and vertical directions of the board.

  2. What is the advantage of a KD-tree, compared with brute-force KNN search, for fast image feature matching?

    A: It greatly reduces the time cost. If the distance from the query point to the splitting line is greater than the current minimum distance, there is no need to backtrack into that branch; only if it is smaller must the other branch also be searched.

4. References

From K nearest neighbor algorithm, distance measure to KD tree, SIFT+BBF algorithm

5. Handwritten digit recognition case

KNN handwritten digit recognition system

Machine Learning


Author: @mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion and help improve this project! Group number: [541954936]