The Mahalanobis distance, proposed by the Indian statistician P. C. Mahalanobis, represents the distance between a point and a distribution. It is an effective way to measure the similarity between two unknown sample sets. Unlike the Euclidean distance, it takes the correlations between features into account. This article introduces the Mahalanobis distance and how it is derived.
Disadvantages of Euclidean distance
Distance measurement is widely used across disciplines. When data are expressed as vectors $\overrightarrow{\mathbf{x}} = \left(x_1, x_2, \cdots, x_n\right)^T$ and $\overrightarrow{\mathbf{y}} = \left(y_1, y_2, \cdots, y_n\right)^T$, the most obvious distance metric is the Euclidean distance:
{% raw %}
d(\overrightarrow{\mathbf{x}}, \overrightarrow{\mathbf{y}}) = \sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}
{% endraw %}
However, this metric takes no account of the differing scales of, and the correlations among, the dimensions: every dimension carries the same weight in the distance, which can undermine the reliability of the results.
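To make the problem concrete, here is a minimal Python sketch; the features (height in meters, weight in kilograms) and the numbers are invented for illustration:

```python
import numpy as np

# Hypothetical feature vectors: (height in m, weight in kg).
a = np.array([1.70, 60.0])
b = np.array([1.80, 60.0])   # differs from a by 0.10 m of height
c = np.array([1.70, 61.0])   # differs from a by 1.0 kg of weight

print(np.linalg.norm(a - b))  # ~0.1: the height pair looks much "closer"
print(np.linalg.norm(a - c))  # ~1.0: only because kilograms yield larger numbers
```

Rescaling either feature reorders the distances, which is exactly the arbitrariness the Mahalanobis distance is designed to remove.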
Mahalanobis distance
To measure the distance between a sample and a distribution, we normalize both the sample and the distribution to a multidimensional standard normal distribution, and then measure the Euclidean distance.
Idea
- Rotate the variables onto the principal components, eliminating the correlation between dimensions
- Standardize the vectors and the distribution so that every dimension has zero mean and unit variance
Derivation
- The distribution is characterized by $n$ vectors of dimension $m$; that is, there are $n$ data points, each represented by an $m$-dimensional vector:
{% raw %}
X = \left[ \overrightarrow{\mathbf{x}}_1, \overrightarrow{\mathbf{x}}_2, \cdots, \overrightarrow{\mathbf{x}}_n \right] \in \mathbb{R}^{m \times n}
{% endraw %}
- The mean of $X$ is $\mu_X$
- The covariance matrix of $X$ is $\Sigma_X = \frac{1}{n}\left(X - \mu_X\right)\left(X - \mu_X\right)^T$, where $\mu_X$ is subtracted from every column
- In order to eliminate the correlation between dimensions, an $m \times m$ matrix $Q^T$ is used to perform a coordinate transformation on $X$, mapping the data to a new coordinate system, denoted $Y$: $Y = Q^T X$
At this point, we expect that under the action of $Q^T$, the different dimensions of $Y$ are mutually uncorrelated, so the covariance matrix of $Y$ should be a diagonal matrix (all elements except the diagonal ones are 0).
- Mean of $Y$: $\mu_Y = Q^T \mu_X$
- Covariance matrix of $Y$:
{% raw %}
\Sigma_Y = \frac{1}{n}\left(Y - \mu_Y\right)\left(Y - \mu_Y\right)^T = Q^T \left[ \frac{1}{n}\left(X - \mu_X\right)\left(X - \mu_X\right)^T \right] Q = Q^T \Sigma_X Q
{% endraw %}
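As a sanity check on this step, the following NumPy sketch (the sample size, mean, and covariance are made up for the demo) draws correlated data, applies an arbitrary orthogonal $Q^T$, and confirms that the sample covariance of $Y$ equals $Q^T \Sigma_X Q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: n = 1000 samples of an m = 3 dimensional distribution.
mean = [1.0, 2.0, 3.0]
cov = [[4.0, 1.5, 0.5],
       [1.5, 2.0, 0.3],
       [0.5, 0.3, 1.0]]
X = rng.multivariate_normal(mean, cov, size=1000).T   # shape (m, n)

# Any orthogonal Q works for this identity; build one via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Y = Q.T @ X

Sigma_X = np.cov(X)   # rows are variables, columns are observations
Sigma_Y = np.cov(Y)
print(np.allclose(Sigma_Y, Q.T @ Sigma_X @ Q))        # True
```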
- From this we can see that when $Q$ is the matrix of eigenvectors of $\Sigma_X$, $\Sigma_Y$ is a diagonal matrix whose diagonal entries are the eigenvalues corresponding to those eigenvectors. Since $\Sigma_X$ is a symmetric matrix, $Q$ can always be obtained by eigendecomposition, and $Q$ is an orthogonal matrix.
- The diagonal elements of $\Sigma_Y$ are the variances of the individual dimensions of $Y$, so they are non-negative. From this perspective, it can be seen that the eigenvalues of a covariance matrix are non-negative.
- In fact, the covariance matrix itself is positive semi-definite, so its eigenvalues are non-negative
- A note on uncorrelatedness versus independence:
    - Here we have only shown that the correlation coefficient between the transformed dimensions is 0, i.e., that they are uncorrelated
    - Independence is a stronger condition than uncorrelatedness; uncorrelatedness generally does not imply independence
    - Under a Gaussian distribution, however, uncorrelatedness and independence are equivalent
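This diagonalization step can be checked numerically as well. In the sketch below (the 2-D covariance matrix is illustrative), `np.linalg.eigh` supplies the orthogonal eigenvector matrix $Q$, and rotating the data by $Q^T$ makes the sample covariance diagonal, with the eigenvalues on the diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D distribution with strongly correlated dimensions.
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.9], [1.9, 1.0]], size=2000).T

Sigma_X = np.cov(X)
eigvals, Q = np.linalg.eigh(Sigma_X)   # symmetric matrix -> orthogonal Q

Y = Q.T @ X                            # rotate onto the principal components
print(np.round(np.cov(Y), 3))          # diagonal: the off-diagonal entries vanish
print(np.round(eigvals, 3))            # the diagonal of cov(Y) holds the eigenvalues
```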
Now let’s standardize the vectors:
- After subtracting the mean, the data already has zero mean; all that remains is to standardize each variance to 1
- After the transformation $Y = Q^T X$, the covariance matrix of $Y$ is diagonal, and its diagonal elements are the variances of the individual dimensions of $Y$, so we only need to divide the data in each dimension of $Y$ by that dimension's standard deviation
- We denote the de-correlated, zero-mean, standardized data by $Z$:
{% raw %}
\begin{aligned} Z &= \left[ {\begin{array}{*{20}{c}} {\frac{1}{{{\sigma _1}}}}&{}&{}&{}\\ {}&{\frac{1}{{{\sigma _2}}}}&{}&{}\\ {}&{}& \ddots &{}\\ {}&{}&{}&{\frac{1}{{{\sigma _m}}}} \end{array}} \right]\left(Y - {\mu _Y}\right) \\ &= \Sigma _Y^{-\frac{1}{2}}{Q^T}\left(X - {\mu _X}\right) \\ &= \left({Q^T}{\Sigma _X}Q\right)^{-\frac{1}{2}}{Q^T}\left(X - {\mu _X}\right) \end{aligned}
{% endraw %}
- The Mahalanobis distance is the Euclidean distance between the corrected vector $Z$ and the distribution center (the origin):
{% raw %}
D_M\left(\overrightarrow{\mathbf{x}}\right) = \sqrt{Z^T Z} = \sqrt{\left(\overrightarrow{\mathbf{x}} - \mu_X\right)^T \Sigma_X^{-1} \left(\overrightarrow{\mathbf{x}} - \mu_X\right)}
{% endraw %}
Expanding $Z^T Z$ and using the orthogonality $Q Q^T = I$ recovers the closed form on the right.
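Putting the whole derivation together, here is a sketch in which the distribution parameters and the query point are invented for illustration; it computes the distance both by whitening to $Z$ and taking the Euclidean norm, and by the closed-form expression, and the two routes agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative distribution (sampled) and query point.
X = rng.multivariate_normal([1.0, 2.0], [[4.0, 1.5], [1.5, 2.0]], size=5000).T
x = np.array([[3.0], [4.0]])

mu = X.mean(axis=1, keepdims=True)     # mu_X, shape (m, 1)
Sigma = np.cov(X)                      # Sigma_X
eigvals, Q = np.linalg.eigh(Sigma)

# Route 1: whiten to Z as in the derivation, then take the Euclidean norm.
Z = np.diag(1.0 / np.sqrt(eigvals)) @ Q.T @ (x - mu)
d_whitened = np.linalg.norm(Z)

# Route 2: the closed-form expression with the inverse covariance matrix.
diff = x - mu
d_direct = float(np.sqrt(diff.T @ np.linalg.inv(Sigma) @ diff))

print(d_whitened, d_direct, np.isclose(d_whitened, d_direct))   # the two agree
```

In practice, `scipy.spatial.distance.mahalanobis(u, v, VI)` computes the same quantity directly, given the inverse covariance matrix `VI`.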