Author: Han Xinzi @ ShowMeAI
Tutorial address: www.showmeai.tech/tutorials/8…
Article address: www.showmeai.tech/article-det…
Statement: All rights reserved. For reproduction, please contact the platform and the author and indicate the source.


1. Scalar

A scalar is a single number. It has only magnitude, no direction (though it can be positive or negative), and its operations follow the ordinary rules of algebra.

  • Variable names are usually in lower case.
  • Physical quantities such as mass $m$, speed $v$, time $t$, and resistivity $\rho$ are scalars.

2. Vector

A vector is a quantity that has both magnitude and direction; in written form it is an ordered array of numbers.

  • Vector variable names are usually written in bold lowercase; in handwritten form, an arrow is drawn above the letter.

  • The elements in a vector are ordered, and each element can be identified by index.

  • There are two common ways to write out the elements of a vector explicitly (note the square brackets).

  • A vector can be regarded as a directed line segment in space, and each component of the vector corresponds to the length of the projection of the vector on a different coordinate axis.

Application in AI: In machine learning, individual data samples are represented as vectors. Vectorization allows AI algorithms to iterate and compute far more efficiently.
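As a small illustration of why vectorization matters (a minimal sketch, not from the original article; the vector sizes and values are made up), compare an element-by-element Python loop with a single vectorized NumPy call:

import numpy as np
import time

x = np.random.rand(1_000_000)   # a sample represented as a vector
w = np.random.rand(1_000_000)   # a weight vector

# Loop version: multiply and accumulate element by element
start = time.time()
total = 0.0
for i in range(len(x)):
    total += x[i] * w[i]
loop_time = time.time() - start

# Vectorized version: a single dot product
start = time.time()
total_vec = np.dot(x, w)
vec_time = time.time() - start

print("loop:      ", loop_time, "seconds")
print("vectorized:", vec_time, "seconds")

The vectorized call typically runs orders of magnitude faster, because the work is done inside optimized compiled code rather than the Python interpreter.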

3. Matrix

A matrix is a two-dimensional array in which each element is determined by two indexes. Matrices are crucial in machine learning. They’re everywhere.

  • The matrix is usually given variable names in bold uppercase.

Application in AI: A data set is represented in matrix form: a data set with $m$ data samples and $n$ features is an $m \times n$ matrix.
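For instance (a minimal sketch with made-up numbers, not from the original article), three samples with four features each form a $3 \times 4$ matrix:

import numpy as np

# Each row is one sample, each column is one feature
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.2, 3.4, 5.4, 2.3],
])
print(X.shape)   # (3, 4): m = 3 samples, n = 4 features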

4. Tensor

A tensor, as defined in geometric algebra, is a generalization of vectors and matrices.

  • A scalar can be regarded as a zeroth-order tensor.
  • A vector can be regarded as a first-order tensor.
  • A matrix can be regarded as a second-order tensor.

  • An image is represented in tensor form: a color image is a third-order tensor of shape $H \times W \times C$, where $H$ is the height, $W$ is the width, and $C$ is usually 3, corresponding to the three color channels of the image.
  • Extending this example, a data set containing multiple images is represented by a fourth-order tensor (sample, height, width, channel), where sample is the number of images in the data set.
  • A video is represented by a fifth-order tensor (sample, frame, height, width, channel).

Application in AI: Tensors are a very important concept in deep learning. Most data and weights are stored in the form of tensors, and all subsequent operations and optimization algorithms are also based on tensors.
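As a quick illustration of these tensor orders (a minimal NumPy sketch with random data, not from the original article; the shapes are arbitrary examples):

import numpy as np

scalar = np.array(3.14)                           # 0th-order tensor, shape ()
vector = np.random.rand(5)                        # 1st-order tensor, shape (5,)
matrix = np.random.rand(3, 4)                     # 2nd-order tensor, shape (3, 4)
image = np.random.rand(224, 224, 3)               # 3rd-order tensor: H x W x C
image_batch = np.random.rand(32, 224, 224, 3)     # 4th-order tensor: sample x H x W x C
video_batch = np.random.rand(8, 16, 64, 64, 3)    # 5th-order tensor: sample x frame x H x W x C

for t in [scalar, vector, matrix, image, image_batch, video_batch]:
    print(t.ndim, t.shape)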

5. Norm

A norm is a strengthened notion of distance; informally, a norm can be thought of as a distance.

In mathematics, norms include the “vector norm” and the “matrix norm”:

  • The vector norm represents the size of a vector in a vector space. Every vector in a vector space has a size, and that size is measured by a norm. Different norms can be used to measure this size, just as meters and feet can both be used to measure distance.

  • The matrix norm represents the size of the change that a matrix causes. For example, through the linear transformation $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{B}$, the vector $\boldsymbol{X}$ is changed into $\boldsymbol{B}$; the matrix norm measures the magnitude of this change.

Calculation of vector norm:

For the $p$-norm: if $\boldsymbol{x} = \left[x_{1}, x_{2}, \cdots, x_{n}\right]^{\mathrm{T}}$, then the $p$-norm of the vector $\boldsymbol{x}$ is $\|\boldsymbol{x}\|_{p} = \left(\left|x_{1}\right|^{p} + \left|x_{2}\right|^{p} + \cdots + \left|x_{n}\right|^{p}\right)^{\frac{1}{p}}$.

The L1 norm: $\|\boldsymbol{x}\|_{1} = \left|x_{1}\right| + \left|x_{2}\right| + \left|x_{3}\right| + \cdots + \left|x_{n}\right|$

  • When $p = 1$, we get the L1 norm, which is the sum of the absolute values of the elements of $\boldsymbol{x}$.

  • The L1 norm goes by many names, such as the Manhattan distance and the least absolute error.

The L2 norm: $\|\boldsymbol{x}\|_{2} = \left(\left|x_{1}\right|^{2} + \left|x_{2}\right|^{2} + \left|x_{3}\right|^{2} + \cdots + \left|x_{n}\right|^{2}\right)^{1/2}$

  • When $p = 2$, we get the L2 norm, which is the square root of the sum of squares of the elements of $\boldsymbol{x}$.

  • The L2 norm is the most commonly used norm; the Euclidean distance is a form of the L2 norm.

Application in AI: L1 norm and L2 norm are common in machine learning, such as “calculation of evaluation criteria”, “regularization terms used to limit model complexity in loss functions”, etc.
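A minimal NumPy sketch (not from the original article) that computes the L1 and L2 norms of a vector, both directly from the formulas above and with np.linalg.norm:

import numpy as np

x = np.array([3.0, -4.0, 12.0])

# L1 norm: sum of absolute values
l1 = np.sum(np.abs(x))
# L2 norm: square root of the sum of squares
l2 = np.sqrt(np.sum(x ** 2))

print(l1, np.linalg.norm(x, ord=1))   # 19.0 both ways
print(l2, np.linalg.norm(x, ord=2))   # 13.0 both ways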

6. Eigen-decomposition

By breaking mathematical objects down into their constituent parts, we can discover some of their properties, or understand them better. For example, the integer 12 can be decomposed into prime factors as $12 = 2 \times 2 \times 3$, from which we can conclude that any multiple of 12 is divisible by 3, and that 12 is not divisible by 5.

Similarly, we can decompose a matrix into a set of eigenvectors and eigenvalues to discover functional properties that are not obvious when the matrix is represented as an array of elements. Eigen-decomposition is a widely used matrix decomposition method.

  • Eigenvector: an eigenvector of a square matrix $\boldsymbol{A}$ is a non-zero vector $\nu$ that is merely scaled when multiplied by $\boldsymbol{A}$: $\boldsymbol{A}\nu = \lambda\nu$.

  • Eigenvalue: the scalar $\lambda$ is called the eigenvalue corresponding to this eigenvector.

When we analyze the matrix $\boldsymbol{A}$ with eigen-decomposition, we obtain the matrix $\boldsymbol{Q}$ whose columns are the eigenvectors $\nu$ and the diagonal matrix $\boldsymbol{\Lambda}$ formed from the eigenvalues, and we can rewrite $\boldsymbol{A}$ as: $\boldsymbol{A} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^{-1}$
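A minimal NumPy sketch (not from the original article) that eigendecomposes a small symmetric matrix and verifies the reconstruction $\boldsymbol{A} = \boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^{-1}$:

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Columns of Q are the eigenvectors; eigvals holds the corresponding eigenvalues
eigvals, Q = np.linalg.eig(A)
Lambda = np.diag(eigvals)

A_rebuilt = Q @ Lambda @ np.linalg.inv(Q)
print(np.allclose(A, A_rebuilt))   # True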

7. Singular Value Decomposition (SVD)

Eigendecomposition has a precondition: only diagonalizable matrices can be eigendecomposed. In practice, many matrices do not satisfy this condition, so what do we do?

Generalizing the eigendecomposition of matrices gives a method called singular value decomposition (SVD), which decomposes an ordinary matrix into singular vectors and singular values. Through singular value decomposition we obtain information similar to that provided by eigendecomposition.

SVD decomposes the matrix $\boldsymbol{A}$ into the product of three matrices: $\boldsymbol{A} = \boldsymbol{U} \boldsymbol{D} \boldsymbol{V}^{T}$.

  • If $\boldsymbol{A}$ is an $m \times n$ matrix, then $\boldsymbol{U}$ is an $m \times m$ matrix, $\boldsymbol{D}$ is an $m \times n$ matrix, and $\boldsymbol{V}$ is an $n \times n$ matrix.

  • The matrices $\boldsymbol{U}$, $\boldsymbol{D}$, and $\boldsymbol{V}$ each have a special structure:

    • $\boldsymbol{U}$ and $\boldsymbol{V}$ are orthogonal matrices. The column vectors of $\boldsymbol{U}$ are called left singular vectors, and the column vectors of $\boldsymbol{V}$ are called right singular vectors.

    • $\boldsymbol{D}$ is a diagonal matrix (note that $\boldsymbol{D}$ is not necessarily square). The elements on the diagonal of $\boldsymbol{D}$ are called the singular values of the matrix $\boldsymbol{A}$.

Application in AI: Perhaps the most useful property of SVD is that it extends the matrix inverse to non-square matrices. You will also see SVD-based algorithms applied in recommendation systems.
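A minimal NumPy sketch (not from the original article) that computes the SVD of a non-square matrix and reconstructs it:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # a 2 x 3 matrix

# U is 2 x 2, s holds the singular values, Vt is 3 x 3
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Place the singular values on the diagonal of a 2 x 3 matrix D
D = np.zeros_like(A)
np.fill_diagonal(D, s)

print(np.allclose(A, U @ D @ Vt))   # True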

8. Moore-Penrose Pseudoinverse

Suppose we want to solve the linear equation $\boldsymbol{A}x = y$ through a left inverse $\boldsymbol{B}$ of the matrix $\boldsymbol{A}$: $x = \boldsymbol{B}y$. Whether there exists a unique mapping from $\boldsymbol{A}$ to $\boldsymbol{B}$ depends on the form of the problem:

  • If the matrix $\boldsymbol{A}$ has more rows than columns, then the above equation may have no solution;

  • If the matrix $\boldsymbol{A}$ has fewer rows than columns, then the above equation may have multiple solutions.

The Moore-Penrose pseudoinverse enables us to handle these cases. The pseudoinverse of the matrix $\boldsymbol{A}$ is defined as:


$$\boldsymbol{A}^{+} = \lim_{\alpha \rightarrow 0} \left(\boldsymbol{A}^{T} \boldsymbol{A} + \alpha \boldsymbol{I}\right)^{-1} \boldsymbol{A}^{T}$$

However, the practical algorithm for computing the pseudoinverse is not based on this definition, but on the following formula instead:


$$\boldsymbol{A}^{+} = \boldsymbol{V} \boldsymbol{D}^{+} \boldsymbol{U}^{T}$$

  • The matrices $\boldsymbol{U}$, $\boldsymbol{D}$, and $\boldsymbol{V}^{T}$ are obtained from the singular value decomposition of the matrix $\boldsymbol{A}$;

  • The pseudoinverse $\boldsymbol{D}^{+}$ of the diagonal matrix $\boldsymbol{D}$ is obtained by taking the reciprocal of its non-zero elements and then transposing the result.
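A minimal NumPy sketch (not from the original article) that builds the pseudoinverse from the SVD exactly as described above and checks it against np.linalg.pinv:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # 3 x 2: more rows than columns

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# D+: take reciprocals of the non-zero singular values, then transpose (2 x 3)
D_plus = np.zeros(A.shape).T
np.fill_diagonal(D_plus, 1.0 / s)

A_plus = Vt.T @ D_plus @ U.T           # pseudoinverse via SVD
print(np.allclose(A_plus, np.linalg.pinv(A)))   # True

# For the over-determined system Ax = y, x = A+ y gives the least-squares solution
y = np.array([1.0, 2.0, 3.0])
print(A_plus @ y)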

9. Common distance measures

In machine learning, most operations are based on vectors. If a data set contains N feature fields, each sample can be represented as an N-dimensional vector. By calculating the distance between the corresponding vectors of the two samples, the similarity of the two samples can be reflected in some scenarios. Other algorithms, like KNN and K-means, rely heavily on distance measures.

Suppose there are two $n$-dimensional variables:

$$A = \left[x_{11}, x_{12}, \cdots, x_{1n}\right]^{T}$$

$$B = \left[x_{21}, x_{22}, \cdots, x_{2n}\right]^{T}$$

Some commonly used distance formulas are defined as follows:

1) Manhattan Distance

Manhattan distance, also known as city block distance, is mathematically defined as follows:


$$d_{12} = \sum_{k=1}^{n} \left|x_{1k} - x_{2k}\right|$$

Python implementation of Manhattan distance:

import numpy as np
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

# Manhattan distance: sum of absolute element-wise differences
manhattan_dist = np.sum(np.abs(vector1 - vector2))
print("The Manhattan distance is", manhattan_dist)

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

2) Euclidean Distance

The Euclidean distance is actually the L2 norm, mathematically defined as follows:


$$d_{12} = \sqrt{\sum_{k=1}^{n} \left(x_{1k} - x_{2k}\right)^{2}}$$

A Python implementation of Euclidean distance:

import numpy as np
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

# Euclidean distance: square root of the sum of squared differences
eud_dist = np.sqrt(np.sum((vector1 - vector2) ** 2))
print("The Euclidean distance is", eud_dist)

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

3) Minkowski Distance

Strictly speaking, the Minkowski distance is not a single distance but a definition of a family of distances:


$$d_{12} = \sqrt[p]{\sum_{k=1}^{n} \left|x_{1k} - x_{2k}\right|^{p}}$$

In fact, when $p = 1$ it is the Manhattan distance, and when $p = 2$ it is the Euclidean distance.
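The original article does not show a code block for this one; a minimal sketch in the same style as the other distances (using the same example vectors, with an arbitrarily chosen $p$) might look like this:

import numpy as np
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

p = 3   # p = 1 and p = 2 reproduce the Manhattan and Euclidean distances above
minkowski_dist = np.sum(np.abs(vector1 - vector2) ** p) ** (1 / p)
print("The Minkowski distance is", minkowski_dist)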

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

4) Chebyshev Distance

The Chebyshev distance is the infinity norm (L∞ norm) of the difference, mathematically expressed as follows:


$$d_{12} = \max_{k} \left(\left|x_{1k} - x_{2k}\right|\right)$$

The Python implementation of Chebyshev distance is as follows:

import numpy as np
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

# Chebyshev distance: maximum absolute element-wise difference
cb_dist = np.max(np.abs(vector1 - vector2))
print("The Chebyshev distance is", cb_dist)

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

5) Cosine Similarity

Cosine similarity takes values in the range [-1, 1] and can be used to measure how much the directions of two vectors differ:

  • The larger the cosine of the angle, the smaller the angle between the two vectors.
  • When the two vectors point in the same direction, the cosine of the angle takes its maximum value of 1.
  • When the two vectors point in exactly opposite directions, the cosine of the angle takes its minimum value of -1.

Machine learning uses this concept to measure the difference between sample vectors; the mathematical expression is as follows:


$$\cos\theta = \frac{A \cdot B}{|A| \, |B|} = \frac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}} \sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

Python implementation of cosine similarity:

import numpy as np
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print("The cosine similarity is", cos_sim)

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

6) Hamming Distance

The Hamming distance is the number of positions at which two strings differ. For example, the Hamming distance between the strings ‘1111’ and ‘1001’ is 2. In information coding, to improve error tolerance, the minimum Hamming distance between codewords should be made as large as possible.


$$d_{12} = \sum_{k=1}^{n} \left(x_{1k} \oplus x_{2k}\right)$$

Python implementation of hamming distance:

import numpy as np
a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# Hamming distance: number of positions where the two arrays differ
hamming_dist = np.count_nonzero(a != b)
print("The Hamming distance is", hamming_dist)

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

7) Jaccard Index

The proportion of elements in the intersection of two sets $A$ and $B$ relative to their union is called the Jaccard coefficient of the two sets, denoted by the symbol $J(A, B)$. The mathematical expression is:


$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard similarity coefficient is an index that measures the similarity of two sets, and it can be used to measure the similarity of samples.
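The original article links to runnable code for this one without showing it; a minimal sketch (my addition, assuming the two sets are encoded as binary vectors, where 1 means the element is present) might be:

import numpy as np

a = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1])
b = np.array([0, 1, 1, 0, 0, 0, 1, 1, 1])

# Intersection: positions where both are 1; union: positions where at least one is 1
intersection = np.count_nonzero(a & b)
union = np.count_nonzero(a | b)

jaccard_sim = intersection / union
print("The Jaccard coefficient is", jaccard_sim)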

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

8) Jaccard Distance

The opposite of the Jaccard coefficient is the Jaccard distance, which is defined as:


$$J_{\sigma} = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$

Python implementation of the Jaccard distance:

import numpy as np
vec1 = np.random.random(10) > 0.5
vec2 = np.random.random(10) > 0.5
vec1 = np.asarray(vec1, np.int32)
vec2 = np.asarray(vec2, np.int32)

# Numerator: positions where the vectors differ; denominator: size of the union
up = np.double(np.bitwise_and((vec1 != vec2), np.bitwise_or(vec1 != 0, vec2 != 0)).sum())
down = np.double(np.bitwise_or(vec1 != 0, vec2 != 0).sum())

jaccard_dist = up / down
print("The Jaccard distance is", jaccard_dist)

Go to our online programming environment to run the code: blog.showmeai.tech/python3-com…

ShowMeAI related articles recommended

  • Graphical linear algebra and matrix theory
  • Graphic information theory
  • Graphical calculus and optimization

ShowMeAI series tutorials recommended

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • The Mathematical Basics of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master