
Hardware and software environment

  • Ubuntu 18.04 64-bit
  • NumPy 1.12.1

Euclidean distance

Euclidean distance, also known as the Euclidean metric, is the most common distance measure; it measures the absolute distance between points in a multi-dimensional space. It is an intuitive and widely used way to compute similarity (for example, in face recognition): the smaller the Euclidean distance, the greater the similarity, and the larger the Euclidean distance, the smaller the similarity.

Definition from the Chinese Wikipedia

In mathematics, the Euclidean distance or Euclidean metric is the “ordinary” (i.e. straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. In older literature it is called the Pythagorean metric.

The mathematical formula for Euclidean distance
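For two n-dimensional points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) it is

d(a, b) = sqrt((x11 - x21)^2 + (x12 - x22)^2 + … + (x1n - x2n)^2)

In the two-dimensional plane this reduces to the familiar Pythagorean form d(a, b) = sqrt((x1 - x2)^2 + (y1 - y2)^2).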

Code implementation

We use the scientific computing library numpy to calculate the Euclidean distance; the code is very simple.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2018-08-17 16:31:07
# @Author  : xugaoxiang ([email protected])
# @Link    : link
# @Version : 1.0.0

import numpy as np

def get_euclidean_distance(vect1, vect2):
    dist = np.sqrt(np.sum(np.square(vect1 - vect2)))
    # equivalently: dist = np.linalg.norm(vect1 - vect2)
    return dist

if __name__ == '__main__':
    vect1 = np.array([1, 2, 3])
    vect2 = np.array([4, 5, 6])
    print(get_euclidean_distance(vect1, vect2))

Execution result display

5.19615242271

Manhattan distance

Wikipedia definition

Taxicab geometry, or Manhattan distance, is a term coined by Hermann Minkowski in the 19th century. It is a geometric term for a metric on Euclidean space, indicating the sum of the absolute axis distances between two points in a standard coordinate system.

Imagine you are in Manhattan, driving from one intersection to another: the actual driving distance is the Manhattan distance. That is where the name comes from; it is also known as city block distance.

In the figure above, the green line is the Euclidean distance, the red line is the Manhattan distance, and the blue and yellow lines are equivalent Manhattan distances.

The Manhattan distance between two points a(x1, y1) and b(x2, y2) in a two-dimensional plane is:
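d(a, b) = |x1 - x2| + |y1 - y2|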

The Manhattan distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n) is:
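d(a, b) = |x11 - x21| + |x12 - x22| + … + |x1n - x2n|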

Code implementation

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2018-08-20 16:10:23
# @Author  : xugaoxiang ([email protected])
# @Link    : link
# @Version : 1.0.0

import numpy as np

def get_manhattan_distance(vect1, vect2):
    dist = np.sum(np.abs(vect1 - vect2))
    # equivalently: dist = np.linalg.norm(vect1 - vect2, ord=1)
    return dist

if __name__ == '__main__':
    vect1 = np.array([1, 2, 3])
    vect2 = np.array([4, 5, 6])
    dist = get_manhattan_distance(vect1, vect2)
    print(dist)

The output

9

K nearest neighbor algorithm

K-Nearest Neighbor, abbreviated KNN, is a classification and regression algorithm and one of the simplest machine learning algorithms. The idea: in a feature space, if the majority of the k samples nearest to a given sample (nearest in the feature space) belong to a certain category, then that sample also belongs to that category.

Looking at the picture above, there are two categories, red triangles and blue squares, and we want to decide which category the green circle in the middle belongs to. With KNN, we start from its neighbors, but how many neighbors should we look at?

  • K=3: the three nearest neighbors of the green dot are two red triangles and one blue square. The minority is subordinate to the majority, so by this statistical method the green circle is judged to belong to the red triangle category.
  • K=5: the five nearest neighbors of the green dot are two red triangles and three blue squares. Again the minority is subordinate to the majority, so by the same statistical method the green circle is judged to belong to the blue square category.

As you can see, the choice of K has a big impact on the final result. The smaller K is, the more susceptible the result is to single individuals; the larger K is, the more it is influenced by more distant points. The distances mentioned here are usually computed with the Euclidean or Manhattan distance described above.
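To make the idea concrete, here is a minimal from-scratch sketch of KNN classification built on the Euclidean distance above (the arrays X_train, y_train and the query point are made-up examples, not data from this post):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training sample
    dists = np.sqrt(np.sum(np.square(X_train - x), axis=1))
    # labels of the k nearest neighbors
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # majority vote: the minority is subordinate to the majority
    return Counter(nearest_labels).most_common(1)[0][0]

if __name__ == '__main__':
    X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
    y_train = np.array([0, 0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # prints 0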

So how do you set K?

Unfortunately, there is no definitive answer. The value of K is affected by the problem itself and by the size of the data set; in many cases you need to try several values and pick the one that works best.
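One common approach is to score several candidate values of K with cross-validation and keep the best one; a sketch with scikit-learn (assuming numpy arrays X and y holding the features and labels):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, candidates=range(1, 11)):
    best_k, best_score = None, 0.0
    for k in candidates:
        # mean 5-fold cross-validation accuracy for this value of K
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score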

The disadvantage of KNN

KNN needs to compute the distance from a query sample to every sample in the training set. The amount of computation is therefore very large and efficiency is low, which makes it hard to apply to large data sets.
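A common mitigation is to index the training set with a tree structure so that a query does not have to scan every sample; scikit-learn, used below, exposes this through its algorithm parameter:

from sklearn.neighbors import KNeighborsClassifier

# ask scikit-learn to build a KD-tree index over the training data
# instead of brute-force scanning all samples on every query
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')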

Code sample

The scikit-learn library provides a complete KNN implementation and is very simple to use; install it with pip install scikit-learn.

from sklearn.neighbors import KNeighborsClassifier

...

knn_classifier = KNeighborsClassifier(n_neighbors=5)
# train on the prepared data
knn_classifier.fit(X_train, y_train)
pred = knn_classifier.predict(X_test)
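For a complete, runnable illustration, here is a sketch using the classic iris dataset that ships with scikit-learn (the dataset choice is mine, not from the original post):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load a small built-in dataset and hold out 30% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
print(knn_classifier.score(X_test, y_test))  # mean accuracy on the test set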

References

  • zh.wikipedia.org/wiki/%E6%AC…
  • zh.wikipedia.org/wiki/%E6%9B…
  • scikit-learn.org/