Nine Distance Measures in Data Science
- Originally written by Maarten Grootendorst
- Translated by the Nuggets Translation Project (permanent link: github.com/xitu/gold-m…)
- Translator: chzh9311
- Proofreaders: samyu2000, PassionPenguin
Many machine learning algorithms, both supervised and unsupervised, rely on the concept of a distance measure. Algorithms such as k-NN, UMAP, and HDBSCAN are examples; Euclidean distance and cosine similarity are two of the measures they commonly use.
Understanding distance metrics may be more important than you think. Take k-NN, an algorithm commonly used for supervised learning, which uses Euclidean distance as its default measure. Euclidean distance is, in itself, a great distance measure.
However, what happens if your data is high-dimensional? Does Euclidean distance still work? And what if the data contains geospatial information? Probably not. In that case, the haversine distance might be more appropriate!
Knowing when to use which distance metric can help you turn a poor classifier into an accurate model.
In this article, we’ll introduce a variety of distance metrics and explore the best ways and scenarios to use them. More importantly, I’ll cover their shortcomings so you can understand when certain algorithms are not appropriate.
Note: Several long papers have detailed the use scenarios for most distance measures and their respective advantages and disadvantages. I’ll try to cover them as thoroughly as possible, but it’s impossible to cover everything. Therefore, this article should be seen as an overview of these distance measures.
1. Euclidean Distance
We’ll start with the most common distance measure, the Euclidean distance. The best way to think about it is the straight-line distance between two points.
The formula for calculating Euclidean distance is fairly straightforward. Given the Cartesian coordinates of two points, it can be calculated according to the Pythagorean theorem.
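In symbols, for two points $p$ and $q$ in $n$ dimensions:

$$D(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$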
Disadvantages
Although it is a common distance measure, Euclidean distance is not scale invariant: the computed distances change depending on the units of the features. In general, data should be standardized before this distance measure is applied.
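As a small illustration of this point, here is a minimal sketch (using NumPy and made-up toy values) showing how rescaling one feature changes the Euclidean distance, and how standardization removes that effect:

```python
import numpy as np

# Two toy points: feature 0 in metres, feature 1 in kilograms (hypothetical values)
a = np.array([1.80, 70.0])
b = np.array([1.60, 90.0])
print(np.linalg.norm(a - b))  # Euclidean distance on the raw scale

# The same data, but with the first feature expressed in centimetres:
a_cm = np.array([180.0, 70.0])
b_cm = np.array([160.0, 90.0])
print(np.linalg.norm(a_cm - b_cm))  # a very different distance, same data

# Standardizing each feature (zero mean, unit variance) makes the result
# independent of the original units.
X = np.vstack([a_cm, b_cm])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(X_std[0] - X_std[1]))
```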
Moreover, as the dimensionality of the data increases, Euclidean distance becomes less useful. This is related to the curse of dimensionality: high-dimensional spaces do not behave as intuitively as the 2- or 3-dimensional analogies we would expect. For a summary, refer to this Q&A.
Use Cases
Euclidean distance works well when the data is low-dimensional and the magnitude of the vectors matters. Methods such as k-NN and HDBSCAN give good results when Euclidean distance is applied to low-dimensional data.
Although many other measures have been developed to address its limitations, Euclidean distance is still the most commonly used. It matches our intuition perfectly, is simple to implement, and gives good results in many use cases.
2. Cosine Similarity
Cosine similarity is often used to compensate for the problems of Euclidean distance in high-dimensional spaces. It is defined as the cosine of the angle between two vectors, which equals the inner product of the two vectors after each has been normalized to unit length.
Two vectors pointing in exactly the same direction have a cosine similarity of 1; two vectors pointing in exactly opposite directions have a cosine similarity of -1. Note that cosine similarity reflects differences in orientation, not in magnitude.
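A minimal sketch of cosine similarity with NumPy, using toy vectors chosen here for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: the dot product of their unit vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction, twice the magnitude
c = np.array([-1.0, -2.0, -3.0])  # opposite direction

print(cosine_similarity(a, b))  # 1.0  (magnitude is ignored)
print(cosine_similarity(a, c))  # -1.0
```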
Disadvantages
The main disadvantage of cosine similarity is that it ignores the magnitude of the vectors and considers only their direction. In practice this means that differences in values are not fully taken into account. In a recommendation system, for example, cosine similarity cannot capture the fact that different users rate on different scales.
Use Cases
We typically use cosine similarity when the data is high-dimensional and the magnitude of the vectors does not matter. In text analysis, this measure is used very often when the data is represented by word counts. For example, a word occurring more frequently in one document than in another does not necessarily mean that document is more relevant to the word; the documents may simply have different lengths. Since the magnitude of the count vectors should not affect the result, cosine similarity, which ignores magnitude, becomes the natural choice.
3. Hamming Distance
Hamming distance is the number of positions at which two vectors differ. It is typically used to compare two binary strings of equal length, and it can also measure the similarity of two strings of equal length by counting the positions at which their characters differ.
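A minimal sketch, assuming two sequences of equal length:

```python
def hamming_distance(s1, s2):
    # Number of positions at which the two sequences differ.
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("10011010", "10101010"))  # 2 differing bits
print(hamming_distance("karolin", "kathrin"))    # 3 differing characters
```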
Disadvantages
As you might expect, Hamming distance is hard to use when the two vectors do not have the same length; you need vectors of equal length in order to compare them position by position.
Moreover, Hamming distance does not take the actual values into account, only whether they are equal. It is therefore not recommended when the magnitude of the values matters.
Use Cases
A typical application of Hamming distance is error detection and correction when data is transmitted over computer networks: it can be used to quantify corruption by counting the number of flipped bits.
You can also use Hamming distance to measure the distance between categorical variables.
4. Manhattan Distance
Manhattan distance, often called taxicab distance or city block distance, computes the distance between real-valued vectors. Imagine vectors describing points on a unit grid. Manhattan distance is the length of the path between two vectors when you can only move horizontally or vertically; diagonal moves are not allowed.
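A minimal sketch with NumPy, using toy vectors for illustration:

```python
import numpy as np

def manhattan_distance(u, v):
    # Sum of absolute coordinate differences: only horizontal/vertical moves count.
    return np.sum(np.abs(u - v))

a = np.array([1, 2])
b = np.array([4, 6])
print(manhattan_distance(a, b))  # 7  (|1-4| + |2-6|)
print(np.linalg.norm(a - b))     # 5.0, the Euclidean distance, for comparison
```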
Disadvantages
Although Manhattan distance seems to work for high-dimensional data, it is less intuitive than Euclidean distance, especially in high dimensions.
Moreover, it is likely to give a larger value than Euclidean distance, because it does not measure the shortest possible path. This is not necessarily a problem, but it is something to keep in mind.
Use Cases
When the attributes in your data set are discrete or binary, Manhattan distance seems to work well, because it measures a path that could realistically be taken given the values of those attributes. Euclidean distance, by contrast, draws a straight line between two vectors that may not actually be attainable.
5. Chebyshev Distance
Chebyshev distance is defined as the maximum difference between two vectors along any coordinate dimension. In other words, it is simply the greatest distance along any single axis. Chebyshev distance is often called chessboard distance, because in chess the minimum number of moves a king needs to go from one square to another equals the Chebyshev distance between them.
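A minimal sketch, again with toy coordinates; the chess analogy is included as a comment:

```python
import numpy as np

def chebyshev_distance(u, v):
    # Largest absolute difference along any single coordinate axis.
    return np.max(np.abs(u - v))

# A king moving from square (1, 1) to square (4, 3) needs
# max(|4-1|, |3-1|) = 3 moves, since it can also move diagonally.
print(chebyshev_distance(np.array([1, 1]), np.array([4, 3])))  # 3
```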
Disadvantages
Chebyshev distance is typically used in rather specific situations, which makes it hard to use as a general-purpose measure like Euclidean distance or cosine similarity. For that reason, it is recommended only when your use case clearly calls for it.
Use Cases
As mentioned earlier, Chebyshev distance can be used to count the minimum number of moves a king needs to go from one square to another. It can also be a useful measure in games that allow 8-way movement.
In practice, Chebyshev distance is often used in warehouse logistics because it describes the time it takes for a bridge crane to move an object.
6. Minkowski Distance
Minkowski distance is a bit more complicated than most measures. It is defined on normed vector spaces (n-dimensional real spaces), that is, spaces in which distances can be represented as vectors that have a length.
There are three requirements for this measurement:
- Zero vector: the zero vector has length zero, while every other vector has a positive length. If we travel from one point to another, the distance is always positive; if we travel from a point to itself, the distance is zero.
- Scalar factor: when you multiply a vector by a positive number, its length changes while its direction stays the same. For example, if we walk a certain distance in one direction and then walk the same distance again, the direction does not change.
- Triangle inequality: the shortest distance between two points is a straight line.
Here is the formula for Minkowski distance:
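$$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$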
The most interesting aspect of this measure is the parameter p. By adjusting it, the Minkowski distance turns into other well-known distance measures.
Common values of p are the following; a short code sketch after the list illustrates them:
- p = 1: Manhattan distance
- p = 2: Euclidean distance
- p = ∞: Chebyshev distance
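A minimal sketch of this relationship, implemented directly with NumPy (toy vectors; a large p is used to approximate p = ∞):

```python
import numpy as np

def minkowski_distance(u, v, p):
    # (sum of |u_i - v_i|^p) ** (1/p)
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski_distance(a, b, p=1))    # 7.0  -> Manhattan distance
print(minkowski_distance(a, b, p=2))    # 5.0  -> Euclidean distance
print(minkowski_distance(a, b, p=100))  # ~4.0 -> approaches Chebyshev distance
```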
Disadvantages
The Minkowski distance has the same shortcomings as the distance measures it represents, so it is important to have a deep understanding of measures such as Manhattan distance, Euclidean distance, and Chebyshev distance.
Moreover, the parameter p can be troublesome in practice: depending on your use case, searching for the right value of p can be computationally inefficient.
Use Cases
The upside of p is that you can iterate over it to find the distance measure that works best for the problem at hand. This gives you a great deal of freedom, which is a big advantage if you are familiar with p and with the measures it can represent.
7. Jaccard Index
The Jaccard index (also known as Intersection over Union, IoU) is a measure of the similarity and diversity of sample sets. It is computed by dividing the size of the intersection by the size of the union.
In practice, it is the number of shared instances divided by the total number of distinct instances. For example, if two sets share one instance and together contain five distinct instances, the Jaccard index is 1/5 = 0.2.
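A minimal sketch of that example using Python sets (the element names are made up for illustration):

```python
def jaccard_index(a, b):
    # Size of the intersection divided by the size of the union.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

x = {"apple", "banana", "cherry"}
y = {"cherry", "dragonfruit", "elderberry"}
print(jaccard_index(x, y))      # 1/5 = 0.2
print(1 - jaccard_index(x, y))  # Jaccard distance = 0.8
```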
To compute the Jaccard distance, we simply subtract the Jaccard index from 1:
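$$d_J(A, B) = 1 - J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$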
Disadvantages
A major disadvantage of the Jaccard index is that it is strongly influenced by the size of the data. Large data sets can skew the index considerably, since they may greatly enlarge the union while the intersection stays roughly the same.
Use Cases
The Jaccard index is often used when the data is binary or has been binarized. If you have a deep learning model that performs image segmentation, say of cars, the Jaccard index can be used to compute the accuracy of the predicted segments against the ground-truth annotations.
Similarly, in document similarity analysis it can measure how much word choice overlaps between documents, so it can be used to compare sets of patterns.
8. Haversine Distance
Haversine distance is the distance between two points on a sphere, given their latitudes and longitudes. It is very similar to Euclidean distance in that it computes the shortest path between two points; the main difference is that the points lie on a sphere, so a straight-line path is impossible.
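A minimal sketch of the haversine formula in plain Python, assuming coordinates in decimal degrees and a mean Earth radius of 6371 km:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_distance(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance between two points given in decimal degrees.
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Approximate distance between Amsterdam and New York in kilometres.
print(haversine_distance(52.37, 4.90, 40.71, -74.01))
```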
Disadvantages
One drawback of this measure is that it assumes the two points lie on a sphere. In practice this is rarely exact: the Earth, for instance, is not a perfect sphere, which can make some calculations tricky. Instead, you may want to look at the Vincenty distance, which assumes an ellipsoid rather than a sphere.
Use Cases
As you might expect, haversine distance is often used in navigation. For example, you can use it to calculate the distance of a flight between two countries. Note that it is less appropriate when the two points are close together, since the curvature then hardly matters.
9. Sørensen-Dice Index
Like the Jaccard index, the Sørensen-Dice index measures the similarity and diversity of sample sets. Although the two are computed in a similar way, the Sørensen-Dice index is somewhat more intuitive, because it can be viewed as the percentage of overlap between two sets, a value between 0 and 1:
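$$SDC(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$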
Disadvantages
As with the Jaccard index, the Sørensen-Dice index overstates the importance of sets that contain few or no ground-truth positives. As a result, such sets can dominate the score when it is averaged over multiple sets. The measure weights each item inversely proportionally to the size of its set, rather than treating all items equally.
Use Cases
The use cases for the Sørensen-Dice index are very similar to those for the Jaccard index: it is typically used in image segmentation tasks and in document similarity analysis.
Note: There are far more types of distance measures than the nine mentioned above. If you want to learn more about interesting metrics, I suggest you investigate one of the following: Mahalanobis, Canberra, Bray-Curtis distance, and KL-Divergence.