Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.
1. Euclidean Distance
(1) Definition
Euclidean distance is one of the most easy to understand distance calculation method, derived from the distance formula between two points in Euclidean space.
(2) Formula
1. Two n-dimensional vectors (x11,x12… X1n) and (x21, x22,… The Euclidean distance between,x2n) is:
2. Expressed as vector operation in the following form:
Ii. Manhattan Distance
(1) Definition
Imagine driving from one intersection to another in Manhattan. The actual driving distance is the Manhattan distance. This is where the name Manhattan Distance comes from, also known as City Block Distance.
(2) Formula
1. The Manhattan distance between two points (x1,y1) and (x2,y2) in a two-dimensional plane
2. Two n-dimensional vectors (x11,x12… X1n) and (x21, x22,… ,x2n)
Chebyshev Distance
(1) Definition
In chess, a king can move to any of the eight adjacent squares. King from grid (x1, y1) to the grid (x2, y2) steps at least always Max (| x2 – x1 | | y2 – y1 |) step. A similar distance measure is called Chebyshev distance.
(2) Formula
1. Chebyshev distance between two points (x1,y1) and (x2,y2) in a two-dimensional plane
(2) Two n-dimensional vectors (x11,x12… X1n) and (x21, x22,… The Chebyshev distance between,x2n)
The other equivalent form of this formula is
Standardized Euclidean Distance
(1) Definition
The standardized Euclidean distance is a scheme to improve the disadvantages of simple Euclidean distance. The idea of standard Euclidean distance: since the distribution of each dimension component of the data is not the same, ok! So let me normalize all the components to the point where the mean and variance are equal. How much does the mean and variance get normalized to? Assuming that the mean and standard deviation of sample set X are M and S, the “standardized variable” of X can be expressed as follows:
And the mathematical expectation of the standardized variable is 0 and the variance is 1. Therefore, the standardization process of sample sets can be described by the formula:
The normalized value is equal to the pre-normalized value minus the mean of the components divided by the standard deviation of the components
(2) Formula
After simple derivation, two n-dimensional vectors (x11,x12… X1n) and (x21, x22,… The formula of standardized Euclidean distance between,x2n) is:
If the reciprocal of variance is regarded as a weight, this formula can be regarded as a Weighted Euclidean distance.
5. Mahalanobis Distance
(1) Definition
There are M sample vectors X1~Xm, the covariance matrix is denoted as S, and the mean is denoted as vector μ. Then, the Mahalanobis distance between sample vector X and u can be expressed as:
Where, the Mahalanobis distance between vector Xi and Xj is defined as:
If the covariance matrix is the identity matrix (independent and identically distributed among each sample vector), the formula becomes:
That’s the Euclidean distance. If the covariance matrix is diagonal, the formula becomes the normalized Euclidean distance.
6. Cosine of included Angle
(1) Definition
In geometry, Angle cosine can be used to measure the difference between two vectors. Machine learning uses this concept to measure the difference between sample vectors.
(2) Formula
1. Cosine formula of the included Angle between vectors (x1,y1) and vectors (x2,y2) in two-dimensional space:
(2) Two n-dimensional sample points (X11,x12… X1n) and (x21, x22,… Similarly, for two n-dimensional sample points a(x11,x12… X1n) and b (x21, x22,… ,x2n), and you can measure their similarity using something like Angle cosine.
That is:
The cosine of the included Angle ranges from [-1,1]. The larger the Angle cosine, the smaller the Angle between the two vectors, the larger the Angle between the two vectors. The maximum Angle cosine is 1 when the two vectors are in the same direction, and the minimum Angle cosine is -1 when the two vectors are in opposite directions.
7. Hamming Distance
(1) Definition
The Hamming distance between two strings of equal length, S1 and S2, is defined as the minimum number of substitutions required to change one into the other. For example, the hamming distance between the string 1111 and 1001 is 2.
(2) Application
Information coding (to enhance fault tolerance, the minimum Hamming distance between codes should be as large as possible).