As a working programmer, you should have some basic mathematical literacy, especially now that so many people are learning about artificial intelligence and want to ride the AI wave. Yet many programmers cannot answer basic math questions like these:

  • If matrix A is m × n and matrix B is n × k, what are the dimensions of the product C = AB?
  • If I flip a coin, with heads counted as 1 and tails as 0, what is the mathematical expectation E(X)?

As a proud programmer, you should master this basic mathematical knowledge; it makes you more likely to create a great product.

Linear algebra

Vector A vector is an ordered array of real numbers that has both magnitude and direction. An n-dimensional vector a is composed of n ordered real numbers, denoted as a = [a1, a2, · · ·, an].

Matrix

Linear mapping A matrix usually represents a linear mapping f from an n-dimensional linear space V to an m-dimensional linear space W: f : V → W.

Note: For writing convenience, X.T denotes the transpose of the vector X. Here x = (x1, x2, · · ·, xn).T and y = (y1, y2, · · ·, ym).T are both column vectors, belonging to the linear spaces V and W respectively, and A = Am×n is a matrix describing the linear mapping from V to W.

Transpose The transpose swaps a matrix's rows and columns.

Addition If A and B are both m × n matrices, then the sum of A and B is also an m × n matrix, each element of which is the sum of the corresponding elements of A and B: [A + B]ij = aij + bij.

Multiplication If A is a k × m matrix and B is an m × n matrix, the product AB is a k × n matrix.
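
To make the dimension rule concrete, here is a minimal sketch using NumPy (the library choice and the example sizes are my own assumptions, not something from the article); it also answers the quiz question at the top: an m × n matrix times an n × k matrix gives an m × k matrix.

```python
import numpy as np

# A is m x n, B is n x k, so the product C = A @ B is m x k.
m, n, k = 2, 3, 4
A = np.arange(m * n).reshape(m, n)   # 2 x 3
B = np.arange(n * k).reshape(n, k)   # 3 x 4

C = A @ B
print(C.shape)    # (2, 4)

# Transpose swaps rows and columns: (A^T)_{ji} = A_{ij}.
print(A.T.shape)  # (3, 2)
```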

Diagonal matrix A diagonal matrix is a matrix in which all elements off the main diagonal are zero; the elements on the diagonal can be 0 or any other value. An n × n diagonal matrix A satisfies: [A]ij = 0 if i ≠ j, ∀ i, j ∈ {1, · · ·, n}.

Eigenvalues and eigenvectors If a scalar λ and a nonzero vector v satisfy Av = λv, then λ and v are called an eigenvalue and an eigenvector of the matrix A, respectively.

Matrix factorization A matrix can often be represented as a product of several "simpler" matrices; this is called matrix factorization (decomposition).

Singular value decomposition The singular value decomposition of an m × n matrix A is

A = UΣVᵀ

where U and V are orthogonal matrices of size m × m and n × n respectively, Σ is an m × n diagonal matrix, and the elements on its diagonal are called singular values.
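
As a quick check of the decomposition, here is a small NumPy sketch (the random 4 × 3 matrix is just an illustrative assumption):

```python
import numpy as np

# Singular value decomposition A = U @ Sigma @ Vt of a 4 x 3 matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# full_matrices=True returns U (4x4) and Vt (3x3); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the 4 x 3 diagonal matrix Sigma and check the reconstruction.
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)
print(np.allclose(A, U @ Sigma @ Vt))  # True
```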

Eigendecomposition The eigendecomposition of an n × n square matrix A is defined as

A = QΛQ⁻¹

where Q is an n × n square matrix, each column of which is an eigenvector of A, and Λ is a diagonal matrix, each diagonal element of which is an eigenvalue of A. If A is a symmetric matrix, then A can be decomposed as

A = QΛQᵀ

where Q is an orthogonal matrix.
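
A hedged NumPy illustration of the eigendecomposition of a symmetric matrix (the specific 2 × 2 matrix is only an example):

```python
import numpy as np

# Eigendecomposition A = Q @ diag(lam) @ inv(Q); for a symmetric matrix
# Q is orthogonal, so inv(Q) == Q.T.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

lam, Q = np.linalg.eigh(A)   # eigh handles symmetric/Hermitian matrices
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))   # True

# Each column of Q is an eigenvector: A v = lambda v.
v = Q[:, 0]
print(np.allclose(A @ v, lam[0] * v))           # True
```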

Differential and integral calculus

Derivative For a function f : R → R whose domain and range are both the real numbers, if f(x) is defined in some neighborhood of the point x0 and the limit

f′(x0) = lim(Δx→0) [f(x0 + Δx) − f(x0)] / Δx

exists, then f(x) is said to be differentiable at x0, and f′(x0) is called its derivative at x0. If the function f(x) is differentiable at every point of some interval contained in its domain, we also say that f(x) is differentiable on that interval. A continuous function is not necessarily differentiable, but a differentiable function must be continuous. For example, the function |x| is continuous but is not differentiable at the point x = 0.
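
To illustrate the limit definition numerically, here is a small Python sketch (the step size 1e-6 and the example functions are my own choices); it also shows why |x| fails to be differentiable at 0:

```python
def numerical_derivative(f, x0, dx=1e-6):
    """Central-difference approximation of f'(x0)."""
    return (f(x0 + dx) - f(x0 - dx)) / (2 * dx)

print(numerical_derivative(lambda x: x ** 2, 3.0))   # approximately 6.0

# |x| is continuous everywhere but not differentiable at x = 0:
# the one-sided difference quotients approach +1 and -1, so the limit does not exist.
print((abs(0 + 1e-6) - abs(0)) / 1e-6)    # +1.0 (right-hand quotient)
print((abs(0 - 1e-6) - abs(0)) / -1e-6)   # -1.0 (left-hand quotient)
```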

Derivative rules

The addition rule If y = f(x) and z = g(x), then (y + z)′ = f′(x) + g′(x).

The multiplication rule (y · z)′ = f′(x)g(x) + f(x)g′(x).

Chain rule The chain rule is a rule for differentiating a composite function and is a common method for computing derivatives in calculus. If x ∈ R, y = g(x) ∈ R, and z = f(y) ∈ R, then

dz/dx = dz/dy · dy/dx = f′(y) g′(x)
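
A short numerical check of the chain rule in Python (the composite function sin(x²) is just an illustrative assumption):

```python
import math

# Chain rule: if y = g(x) and z = f(y), then dz/dx = (dz/dy) * (dy/dx).
# Example: z = sin(x^2), with f(y) = sin(y) and g(x) = x^2.
def analytic(x):
    return math.cos(x ** 2) * 2 * x        # f'(g(x)) * g'(x)

def numerical(x, dx=1e-6):
    f = lambda t: math.sin(t ** 2)
    return (f(x + dx) - f(x - dx)) / (2 * dx)

x = 1.3
print(analytic(x), numerical(x))           # the two values agree closely
```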

The Logistic function

The logistic function is a commonly used S-shaped function, named by the Belgian mathematician Pierre François Verhulst, who studied it in 1844-1845 as a model of population growth. It was originally used as an ecological model. The logistic function is defined as:

logistic(x) = L / (1 + exp(−k(x − x0)))

When the parameters are k = 1, x0 = 0, L = 1, the logistic function is called the standard logistic function, denoted σ(x) = 1 / (1 + e⁻ˣ).

The standard logistic function is widely used in machine learning, often to map a real number to the interval (0, 1). The derivative of the standard logistic function is:

σ′(x) = σ(x)(1 − σ(x))
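
A minimal Python sketch of the logistic function and the identity σ′(x) = σ(x)(1 − σ(x)), using NumPy (the evaluation point x = 0.5 is arbitrary):

```python
import numpy as np

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic function; L = k = 1, x0 = 0 gives the standard sigmoid."""
    return L / (1.0 + np.exp(-k * (x - x0)))

def sigma(x):
    return logistic(x)

# The derivative of the standard logistic satisfies sigma'(x) = sigma(x) * (1 - sigma(x)).
x, dx = 0.5, 1e-6
numerical = (sigma(x + dx) - sigma(x - dx)) / (2 * dx)
print(numerical, sigma(x) * (1 - sigma(x)))   # both approximately 0.235
```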

Softmax function

The softmax function maps multiple scalars to a probability distribution. For K scalars x1, · · ·, xK, the softmax function is defined as

zk = softmax(xk) = exp(xk) / Σ(i=1..K) exp(xi)

In this way, the K variables x1, · · ·, xK are converted into a distribution z1, · · ·, zK satisfying

zk ∈ (0, 1), Σ(k=1..K) zk = 1

When the input of the softmax function is a K-dimensional vector x, it can be written as

softmax(x) = exp(x) / (1Kᵀ exp(x))

where 1K = [1, · · ·, 1] (K × 1) is the K-dimensional all-ones vector. The derivative is

∂softmax(x)/∂x = diag(softmax(x)) − softmax(x) softmax(x)ᵀ
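
A small NumPy sketch of softmax and its Jacobian (subtracting the maximum before exponentiating is a common numerical-stability trick, not something stated in the article):

```python
import numpy as np

def softmax(x):
    """Map K scalars to a probability distribution; subtracting max(x) avoids overflow."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = softmax(np.array([1.0, 2.0, 3.0]))
print(z, z.sum())          # values in (0, 1) that sum to 1

# Jacobian of softmax: diag(z) - z z^T.
J = np.diag(z) - np.outer(z, z)
print(J.shape)             # (3, 3)
```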

Mathematical optimization

Discrete optimization and continuous optimization: Mathematical optimization problems can be divided into discrete optimization problems and continuous optimization problems, according to whether the input variable x takes values in a discrete set or over the real numbers.

Unconstrained optimization and constrained optimization: Continuous optimization problems can be divided into unconstrained optimization problems and constrained optimization problems, according to whether there are constraints on the variables.

Optimization algorithms

Global optimum and local optimum

Hessian matrix

The gradient descent algorithm was also used in the previous article on operations research, where the step length along the gradient was calculated.

The gradient

The gradient is a vector: at a given point, the directional derivative of a function attains its maximum value along the direction of the gradient, i.e., the function changes fastest along that direction at that point, and the maximum rate of change equals the modulus (magnitude) of the gradient.

The gradient descent method, also known as the steepest descent method, is often used to solve unconstrained minimization problems.

The process of gradient descent is shown in the figure. Each curve is a contour line (level set), i.e., a curve on which the function F takes a constant value. The red arrows point in the opposite direction of the gradient at each point (perpendicular to the contour line through that point). Moving along the descent direction of the gradient, the iteration eventually reaches a local minimum of the function F.

If we want to solve a maximization problem instead, we search iteratively in the positive direction of the gradient, gradually approaching a local maximum of the function; this process is called gradient ascent.
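
As a sketch of the iteration described above, here is gradient descent on a simple quadratic in Python (the objective, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Minimize f(x, y) = (x - 1)^2 + 2 * (y + 2)^2 by moving against the gradient.
def grad(p):
    x, y = p
    return np.array([2 * (x - 1), 4 * (y + 2)])

p = np.array([5.0, 5.0])   # starting point
lr = 0.1                   # step size (learning rate)
for _ in range(200):
    p = p - lr * grad(p)   # gradient *descent*; use "+" for gradient ascent

print(p)                   # approximately [1, -2], the minimizer
```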

Probability theory

Probability theory mainly studies the quantitative laws of large numbers of random phenomena; it is applied very widely, in almost every field.

Discrete random variable

If the possible values of a random variable X are finite and enumerable, with n values {x1, · · ·, xn}, X is called a discrete random variable. To understand the statistics of X, we have to know the probability that it takes each possible value xi, denoted p(xi) = P(X = xi). The sequence

p(x1), · · ·, p(xn)

is called the probability distribution (or distribution) of the discrete random variable X, and it satisfies

p(xi) ≥ 0, Σ(i=1..n) p(xi) = 1

Common discrete probability distributions include:

  • Bernoulli distribution: X takes the value 1 with probability p and the value 0 with probability 1 − p.
  • Binomial distribution: the number of successes in n independent Bernoulli(p) trials, P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ.
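
A small sampling sketch with NumPy's random generator (the parameters p = 0.3 and n = 10 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 10

# Bernoulli(p): a single trial that is 1 with probability p (binomial with n = 1).
bernoulli_samples = rng.binomial(1, p, size=100_000)
print(bernoulli_samples.mean())        # close to p = 0.3

# Binomial(n, p): number of successes in n independent Bernoulli(p) trials.
binomial_samples = rng.binomial(n, p, size=100_000)
print(binomial_samples.mean())         # close to n * p = 3.0
```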

Continuous random variables differ from discrete random variables: some random variables X take values that are not enumerable but instead cover all real numbers or some interval. Such an X is called a continuous random variable.

Probability density function The probability distribution of a continuous random variable X is generally described by a probability density function p(x). p(x) is an integrable function and satisfies:

p(x) ≥ 0, ∫ p(x) dx = 1

Uniform distribution If a and b are finite numbers, the probability density function of the uniform distribution on [a, b] is defined as

p(x) = 1 / (b − a) for a ≤ x ≤ b, and p(x) = 0 otherwise.

Normal distribution The normal distribution, also known as the Gaussian distribution, is the most common distribution in nature; it has many good properties and a very important influence in many fields. Its probability density function is

p(x) = 1 / (√(2π) σ) · exp(−(x − µ)² / (2σ²))

where σ > 0, and µ and σ are constants. If the random variable X obeys a normal distribution with parameters µ and σ, it is denoted as X ∼ N(µ, σ²).

When µ = 0 and σ = 1, it is called the standard normal distribution. The figure shows the probability density functions of the uniform distribution and the normal distribution.
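
A minimal sketch of the two density functions in NumPy (the parameter values are illustrative):

```python
import numpy as np

def uniform_pdf(x, a=0.0, b=1.0):
    """Density 1 / (b - a) on [a, b], 0 elsewhere."""
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-4, 4, 9)
print(uniform_pdf(x))       # nonzero only on [0, 1]
print(normal_pdf(x))        # peaks at x = mu = 0
```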

Cumulative distribution function For a random variable X, the cumulative distribution function is the probability that the random variable X is less than or equal to x.

Taking a continuous random variable X as an example, the cumulative distribution function is defined as:

F(x) = P(X ≤ x) = ∫(−∞..x) p(t) dt

where p(x) is the probability density function. The cumulative distribution function of the standard normal distribution is:

Φ(x) = (1 / √(2π)) ∫(−∞..x) exp(−t² / 2) dt
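
As a rough check, the standard normal CDF can be approximated by numerically integrating the density; the closed form via the error function, Φ(x) = (1 + erf(x/√2)) / 2, is a known identity (the integration bounds and step count below are my own choices):

```python
import math

def standard_normal_cdf(x, steps=100_000):
    """Numerically integrate the standard normal density from -10 to x (trapezoidal rule)."""
    lo = -10.0
    h = (x - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * math.exp(-t * t / 2)
    return total * h / math.sqrt(2 * math.pi)

# Closed form: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
print(standard_normal_cdf(1.0), 0.5 * (1 + math.erf(1 / math.sqrt(2))))  # both ~0.8413
```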

Random vector A random vector is a set of random variables. If X1, X2, · · ·, Xn are n random variables, then [X1, X2, · · ·, Xn] is called an n-dimensional random vector. A one-dimensional random vector is just a random variable. Random vectors are likewise divided into discrete and continuous random vectors.

Conditional probability distribution For a discrete random vector (X, Y), given X = x, the conditional probability of Y = y is:

p(y|x) = P(Y = y | X = x) = p(x, y) / p(x)

For a two-dimensional continuous random vector (X, Y), given X = x, the conditional probability density function of Y = y is

p(y|x) = p(x, y) / p(x)
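
A tiny NumPy sketch of computing p(y|x) from a joint probability table (the 2 × 2 table values are made up for illustration):

```python
import numpy as np

# Joint distribution p(x, y) of a discrete random vector (X, Y),
# rows indexed by x and columns by y.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

p_x = p_xy.sum(axis=1)              # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]   # conditional p(y | x) = p(x, y) / p(x)

print(p_y_given_x)
print(p_y_given_x.sum(axis=1))      # each row sums to 1
```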

Expectation and variance

Expectation For a discrete random variable X with probability distribution p(x1), · · ·, p(xn), the expectation (or mean) of X is defined as

E[X] = Σ(i=1..n) xi p(xi)

For a continuous random variable X with probability density function p(x), its expectation is defined as

E[X] = ∫ x p(x) dx

Variance The variance of a random variable X measures the dispersion of its probability distribution and is defined as

var(X) = E[(X − E[X])²]

Standard deviation The variance of the random variable X is also called its second central moment. The square root of the variance is called the standard deviation of X.

Covariance The covariance of two continuous random variables X and Y measures how the distributions of the two random variables vary together and is defined as

cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Covariance is also often used to measure the linear correlation between two random variables. Two random variables are said to be uncorrelated (linearly independent) if their covariance is 0. The absence of linear correlation between two random variables does not mean they are independent; there may still be some nonlinear functional relationship. Conversely, if X and Y are statistically independent, the covariance between them must be 0.
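
A short NumPy sketch of expectation, variance, and covariance (the discrete distribution and the sample sizes are illustrative assumptions):

```python
import numpy as np

# Discrete distribution: values x_i with probabilities p(x_i).
x = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])

mean = np.sum(x * p)                        # E[X]
var = np.sum((x - mean) ** 2 * p)           # var(X) = E[(X - E[X])^2]
print(mean, var)

# Sample-based covariance of two linearly related variables.
rng = np.random.default_rng(0)
a = rng.standard_normal(10_000)
b = 2 * a + rng.standard_normal(10_000)
print(np.cov(a, b)[0, 1])                   # close to 2, the true covariance
```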

Random process

A stochastic process is a set of random variables Xt, where t belongs to an index set T. The index set T can be defined in the time domain or the space domain, but it is generally in the time domain, represented by real numbers or positive integers. When t is real, the process is a continuous-time stochastic process; when t is an integer, it is a discrete-time stochastic process. Many everyday phenomena, including stock price fluctuations, speech signals, and changes in a person's height, can be regarded as stochastic processes. Common time-dependent stochastic process models include the Bernoulli process, the random walk, the Markov process, and so on.

Markov process A Markov process is a stochastic process in which, given the present state and all past states, the conditional probability distribution of future states depends only on the current state:

P(Xt+1 = xt+1 | X0:t = x0:t) = P(Xt+1 = xt+1 | Xt = xt)

where X0:t denotes the sequence of variables X0, X1, · · ·, Xt, and x0:t is a state sequence in the state space.

Markov chain A discrete-time Markov process is also known as a Markov chain. If the conditional probability of a Markov chain, P(Xt+1 = s | Xt = s′), does not depend on the time t, the Markov chain is called time-homogeneous.

For an application of Markov chains, see the interesting earlier article: "Can you guess what your girlfriend is thinking? Markov chains can tell you." There are also other stochastic processes, such as Gaussian processes, which are more complicated, so I won't go into detail here.
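
A minimal simulation of a two-state Markov chain in Python (the transition matrix is an arbitrary example); each step depends only on the current state:

```python
import numpy as np

# Two-state Markov chain with transition matrix P[i, j] = P(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])   # the next state depends only on the current one
    counts[state] += 1

print(counts / counts.sum())            # empirical state frequencies
# The stationary distribution solves pi = pi P; here it is [5/6, 1/6].
```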

Information theory

Information theory is an interdisciplinary field spanning mathematics, physics, statistics, computer science, and more. First proposed by Claude Shannon, information theory mainly studies methods for quantifying, storing, and communicating information. It is also widely used in machine-learning-related areas such as feature extraction, statistical inference, and natural language processing.

Self-information and entropy

In information theory, entropy measures the uncertainty of a random event. Suppose we encode a random variable X with value set C and probability distribution p(x), x ∈ C. The self-information I(x) is the amount of information, or coding length, of the event X = x, defined as I(x) = − log p(x). The average coding length of the random variable X, namely its entropy, is then defined as

H(X) = E[I(x)] = − Σ(x∈C) p(x) log p(x)

Here, when p(x) = 0, we define 0 log 0 = 0. Entropy is the average coding length of a random variable, i.e., the mathematical expectation of the self-information. The higher the entropy, the more information the random variable carries; the lower the entropy, the less information. If the variable X takes a single value x with p(x) = 1, its entropy is 0. In other words, for a deterministic piece of information, the entropy is zero and the amount of information is zero. If the probability distribution is uniform, the entropy is maximal. For example, if a random variable X has three possible values x1, x2, and x3, the entropy is largest for the uniform distribution (1/3, 1/3, 1/3) and zero when all the probability mass sits on a single value.
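
A small Python sketch of the entropy of the three-value example above (the specific distributions are illustrative):

```python
import numpy as np

def entropy(p, base=2):
    """H(X) = -sum p(x) log p(x), with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero])) / np.log(base)

print(entropy([1.0, 0.0, 0.0]))    # 0.0  -> a deterministic variable carries no information
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
print(entropy([1/3, 1/3, 1/3]))    # log2(3) ~ 1.585 bits, the maximum for three values
```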

Joint entropy and conditional entropy For two discrete random variables X and Y, suppose X takes values in a set 𝒳 and Y takes values in a set 𝒴, with joint probability distribution p(x, y). The joint entropy of X and Y is

H(X, Y) = − Σ(x∈𝒳) Σ(y∈𝒴) p(x, y) log p(x, y)

The conditional entropy of Y given X is

H(Y|X) = H(X, Y) − H(X) = − Σ(x∈𝒳) Σ(y∈𝒴) p(x, y) log p(y|x)

Mutual information Mutual information measures the reduction in uncertainty about one variable when another is known. The mutual information of two discrete random variables X and Y is defined as

I(X; Y) = Σ(x∈𝒳) Σ(y∈𝒴) p(x, y) log [ p(x, y) / (p(x) p(y)) ]

Cross entropy and divergence For a random variable with (true) distribution p(x), the entropy H(p) represents its optimal coding length. Cross entropy is the expected length of encoding information from the true distribution p using the optimal code for the distribution q; it is defined as

H(p, q) = − Σ(x) p(x) log q(x)

For a given p, the closer q is to p, the smaller the cross entropy; the farther q is from p, the larger the cross entropy.
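
A brief Python sketch of cross entropy, showing that it is smallest when q equals p (the example distributions are my own choices):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum_x p(x) log q(x); equals H(p) when q == p."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask])) / np.log(base)

p = [0.7, 0.2, 0.1]
print(cross_entropy(p, p))                # entropy H(p), the minimum possible value
print(cross_entropy(p, [0.4, 0.4, 0.2]))  # larger: q is farther from p
print(cross_entropy(p, [0.1, 0.2, 0.7]))  # larger still
```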