This document is review material for the machine learning course CS229 at Stanford University, and it also serves as the mathematical foundation for Stanford's major artificial intelligence courses. This is the linear algebra part; the original file can be downloaded from [1].

Originally written by Zico Kolter, modified by Chuong Do, Tengyu Ma

Translation: Huang Haiguang [2]

Note: please follow the GitHub repository [3] for updates to the linear algebra and probability theory notes.

Linear algebra review and reference

1. Basic concepts and symbols

Linear algebra provides a compact way to represent and manipulate systems of linear equations. For example, consider the following system:

$$4x_1 - 5x_2 = -13, \qquad -2x_1 + 3x_2 = 9.$$

This is two equations in two variables $x_1$ and $x_2$, and as you know from high-school algebra, such a system has a unique solution (unless it degenerates in some way, for example if the second equation is just a multiple of the first; in the case above there really is a unique solution). In matrix notation, we can write the system more compactly as

$$Ax = b, \quad \text{where } A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} -13 \\ 9 \end{bmatrix}.$$

As we will see, representing linear equations in this form has many advantages (including the obvious one of saving space).
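As a quick numerical illustration (not part of the original notes), the system above can be solved with NumPy; `numpy.linalg.solve` is used here simply as an example tool:

```python
import numpy as np

# Coefficient matrix and right-hand side of the 2x2 system above.
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

# Solve Ax = b; valid because A is square and non-singular.
x = np.linalg.solve(A, b)
print(x)  # the unique solution [x1, x2]
```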

1.1 Basic Symbols

We use the following notation:

  • By $A \in \mathbb{R}^{m \times n}$ we denote a matrix with $m$ rows and $n$ columns, whose entries are real numbers.
  • By $x \in \mathbb{R}^{n}$ we denote a vector with $n$ entries. By convention, a vector will mean a column vector, i.e., a matrix with $n$ rows and 1 column. If we want to explicitly represent a row vector, a matrix with 1 row and $n$ columns, we typically write $x^T$ ($x^T$ denotes the transpose of $x$, defined below).
  • $x_i$ denotes the $i$th element of the vector $x$.
  • We use the notation $a_{ij}$ (or $A_{ij}$, $A_{i,j}$, etc.) to denote the entry of $A$ in the $i$th row and $j$th column.
  • We use $a_j$ or $A_{:,j}$ to denote the $j$th column of the matrix $A$.
  • We use $a_i^T$ or $A_{i,:}$ to denote the $i$th row of the matrix $A$.
  • In many cases it is important and convenient to think of a matrix as a collection of column vectors or row vectors. In general, operating on vectors is mathematically (and conceptually) cleaner than operating on scalars. There is no universal convention for denoting the columns or rows of a matrix, so you may use whatever notation you like as long as it is clearly defined.

2. Matrix multiplication

The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is the matrix $C = AB \in \mathbb{R}^{m \times p}$,

where

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$$

Note that in order for the matrix product to exist, the number of columns in $A$ must equal the number of rows in $B$. There are many ways of looking at matrix multiplication, and we'll start by examining a few special cases.

2.1 Vector-vector multiplication

Given two vectors $x, y \in \mathbb{R}^n$, the quantity $x^T y$, usually called the inner product or dot product, is a real number given by $x^T y = \sum_{i=1}^{n} x_i y_i$.

Note that $x^T y = y^T x$ is always true.

Given vectors $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ (they do not have to be of the same dimension), $xy^T \in \mathbb{R}^{m \times n}$ is called the outer product of the vectors. It is the matrix whose entries are given by $(xy^T)_{ij} = x_i y_j$.

As an example of how the outer product can be useful: let $\mathbf{1} \in \mathbb{R}^n$ denote an $n$-dimensional vector whose entries are all equal to 1, and consider a matrix $A \in \mathbb{R}^{m \times n}$ whose columns are all equal to some vector $x \in \mathbb{R}^m$. Using outer products, we can represent $A$ compactly as $A = x\mathbf{1}^T$.
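A short NumPy sketch (added for illustration, not from the original notes) of the inner and outer products described above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y            # scalar: sum_i x_i * y_i
outer = np.outer(x, y)   # 3x3 matrix with entries x_i * y_j

# A matrix whose columns are all equal to x, written as the outer product x 1^T.
ones = np.ones(4)
A = np.outer(x, ones)    # shape (3, 4); every column equals x
print(inner, outer.shape, A.shape)
```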

2.2 Matrix-vector multiplication

Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in \mathbb{R}^n$, their product is a vector $y = Ax \in \mathbb{R}^m$. There are a couple of ways of looking at matrix-vector multiplication, and we will look at each of them in turn.

If we write $A$ by rows, then we can express $Ax$ as

$$y = Ax = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}.$$

In other words, the $i$th entry of $y$ is the inner product of the $i$th row of $A$ and $x$, i.e., $y_i = a_i^T x$.

Alternatively, writing $A$ in column form, the formula becomes

$$y = Ax = \sum_{i=1}^{n} a_i x_i.$$

In other words, $y$ is a linear combination of the columns of $A$, where the coefficients of the linear combination are given by the entries of $x$.

So far, we have been multiplying by a column vector on the right, but it is also possible to multiply by a row vector on the left. This is written $y^T = x^T A$ for $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^m$, and $y \in \mathbb{R}^n$. As before, we can express $y^T$ in two possible ways, depending on whether we express $A$ by rows or by columns.

In the first case we express $A$ in terms of its columns:

$$y^T = x^T A = \begin{bmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{bmatrix},$$

which tells us that the $i$th entry of $y^T$ is equal to the inner product of $x$ and the $i$th column of $A$.

Finally, expressing $A$ in terms of rows, we get the final representation of the vector-matrix product:

$$y^T = x^T A = \sum_{i=1}^{m} x_i a_i^T,$$

so we see that $y^T$ is a linear combination of the rows of $A$, where the coefficients of the linear combination are given by the entries of $x$.
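The two viewpoints of $Ax$ above (inner products with the rows versus a linear combination of the columns) can be checked numerically; the following sketch is an illustrative addition:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

y = A @ x                                            # the usual product
y_rows = np.array([A[i, :] @ x for i in range(4)])   # inner product with each row
y_cols = sum(x[j] * A[:, j] for j in range(3))       # linear combination of columns

print(np.allclose(y, y_rows), np.allclose(y, y_cols))  # True True
```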

2.3 Matrix-matrix multiplication

Armed with this knowledge, we can now look at four different (different in form, but identical in result) ways of viewing the matrix-matrix multiplication $C = AB$ defined at the beginning of this section.

First, we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the $(i,j)$ entry of $C$ is equal to the inner product of the $i$th row of $A$ and the $j$th column of $B$. Symbolically,

$$C_{ij} = a_i^T b_j.$$

Here $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, so $a_i \in \mathbb{R}^n$ and $b_j \in \mathbb{R}^n$, and these inner products all make sense. This viewpoint represents $A$ by rows and $B$ by columns. Alternatively, we can represent $A$ by columns and $B$ by rows, in which case $AB$ takes the form of a sum of outer products:

$$C = AB = \sum_{i=1}^{n} a_i b_i^T.$$

In other words, $AB$ is equal to the sum, over all $i$, of the outer product of the $i$th column of $A$ and the $i$th row of $B$. Since in this case $a_i \in \mathbb{R}^m$ and $b_i \in \mathbb{R}^p$, the dimension of the outer product $a_i b_i^T$ is $m \times p$, which coincides with the dimension of $C$.

Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. If we represent $B$ by columns, we can view the columns of $C$ as matrix-vector products between $A$ and the columns of $B$:

$$C = AB = \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_p \end{bmatrix}.$$

Here the $i$th column of $C$ is given by the matrix-vector product with the vector on the right, $c_i = Ab_i$. These matrix-vector products can in turn be interpreted using both of the viewpoints given in the previous subsection. Finally, we have the analogous viewpoint, where we represent $A$ by rows and view the rows of $C$ as the matrix-vector products between the rows of $A$ and $B$:

$$C = AB = \begin{bmatrix} a_1^T B \\ a_2^T B \\ \vdots \\ a_m^T B \end{bmatrix}.$$

Here the $i$th row of $C$ is given by the matrix-vector product with the vector on the left: $c_i^T = a_i^T B$.

It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these viewpoints follow immediately from the initial definition (in about a line of math) that we gave at the beginning of this section.

The direct advantage of these different viewpoints is that they let you operate at the level/unit of vectors rather than scalars. To fully understand linear algebra without getting lost in complicated indexing manipulations, the key is to operate with concepts that are as large as possible.

Virtually all of linear algebra deals with matrix multiplication of some kind, and it is worthwhile to spend some time developing an intuitive understanding of the viewpoints presented here.
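The equivalent viewpoints discussed above can be verified directly; the NumPy check below is an illustrative addition:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
C = A @ B

# (i, j) entry as the inner product of row i of A and column j of B.
C_entries = np.array([[A[i, :] @ B[:, j] for j in range(2)] for i in range(3)])
# Sum of outer products of the columns of A with the rows of B.
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(4))
# Columns of C as matrix-vector products A b_j.
C_cols = np.column_stack([A @ B[:, j] for j in range(2)])

print(np.allclose(C, C_entries), np.allclose(C, C_outer), np.allclose(C, C_cols))
```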

It is also useful to know a few basic, higher-level properties of matrix multiplication:

  • Matrix multiplication is associative: $(AB)C = A(BC)$.
  • Matrix multiplication is distributive: $A(B + C) = AB + AC$.
  • Matrix multiplication is, in general, not commutative; that is, it can be the case that $AB \neq BA$. (For example, if $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times q}$, the matrix product $BA$ does not even exist if $m$ and $q$ are not equal!)

If you are not familiar with these properties, take the time to verify them for yourself. For example, to check the associativity of matrix multiplication, suppose that $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{p \times q}$. Note that $AB \in \mathbb{R}^{m \times p}$, so $(AB)C \in \mathbb{R}^{m \times q}$. Similarly, $BC \in \mathbb{R}^{n \times q}$, so $A(BC) \in \mathbb{R}^{m \times q}$. Therefore the dimensions of the resulting matrices agree. To show that matrix multiplication is associative, it suffices to check that the $(i,j)$ entry of $(AB)C$ is equal to the $(i,j)$ entry of $A(BC)$. This can be verified directly using the definition of matrix multiplication.

3. Operations and properties

In this section, we present several operations and properties of matrices and vectors. Hopefully much of this will be review for you, so these notes can simply serve as a reference on these topics.

3.1 Identity matrix and diagonal matrix

The identity matrix, denoted $I \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else:

$$I_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$

It has the property that for all $A \in \mathbb{R}^{m \times n}$:

$$AI = A = IA.$$

Note that, in a sense, the notation for the identity matrix is ambiguous, since it does not specify the dimension of $I$. Generally, the dimension of $I$ is inferred from context so that matrix multiplication is possible. For example, in the equation above, the $I$ in $AI = A$ is an $n \times n$ matrix, whereas the $I$ in $A = IA$ is an $m \times m$ matrix.

A diagonal matrix is a matrix where all elements outside the diagonal are 0. It is usually written $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, where

$$D_{ij} = \begin{cases} d_i & i = j \\ 0 & i \neq j. \end{cases}$$

Clearly, the identity matrix is $I = \mathrm{diag}(1, 1, \ldots, 1)$.

3.2 Transpose

The transpose of a matrix results from "flipping" its rows and columns.

Given a matrix $A \in \mathbb{R}^{m \times n}$, its transpose, written $A^T \in \mathbb{R}^{n \times m}$, is the matrix whose entries are given by

$$(A^T)_{ij} = A_{ji}.$$

In fact, we have already been using the transpose when describing row vectors, since the transpose of a column vector is naturally a row vector.

The following properties of the transpose are easy to verify:

  • $(A^T)^T = A$
  • $(AB)^T = B^T A^T$
  • $(A + B)^T = A^T + B^T$

3.3 Symmetric Matrix

A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^T$, and anti-symmetric if $A = -A^T$. It is easy to show that for any matrix $A \in \mathbb{R}^{n \times n}$, the matrix $A + A^T$ is symmetric and the matrix $A - A^T$ is anti-symmetric. From this it follows that any square matrix can be represented as the sum of a symmetric matrix and an anti-symmetric matrix, since

$$A = \tfrac{1}{2}(A + A^T) + \tfrac{1}{2}(A - A^T).$$

The first matrix on the right-hand side is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to denote the set of all symmetric matrices of size $n$ as $\mathbb{S}^n$, so that $A \in \mathbb{S}^n$ means that $A$ is a symmetric $n \times n$ matrix.

3.4 Trace of matrix

The trace of a square matrix $A \in \mathbb{R}^{n \times n}$, denoted $\mathrm{tr}(A)$ (or just $\mathrm{tr}A$ when the parentheses are obviously implied), is the sum of the diagonal elements of the matrix:

$$\mathrm{tr}A = \sum_{i=1}^{n} A_{ii}.$$

As described in the CS229 lecture notes, the trace has the following properties:

  • For $A \in \mathbb{R}^{n \times n}$: $\mathrm{tr}A = \mathrm{tr}A^T$.
  • For $A, B \in \mathbb{R}^{n \times n}$: $\mathrm{tr}(A + B) = \mathrm{tr}A + \mathrm{tr}B$.
  • For $A \in \mathbb{R}^{n \times n}$ and $t \in \mathbb{R}$: $\mathrm{tr}(tA) = t\,\mathrm{tr}A$.
  • For $A, B$ such that $AB$ is square: $\mathrm{tr}AB = \mathrm{tr}BA$.
  • For $A, B, C$ such that $ABC$ is square: $\mathrm{tr}ABC = \mathrm{tr}BCA = \mathrm{tr}CAB$; the product of more matrices has the analogous property.

As an example of how the fourth property can be proved, suppose $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$ (so that $AB$ is square). Then

$$\mathrm{tr}AB = \sum_{i=1}^{m} (AB)_{ii} = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ji} = \sum_{j=1}^{n} \sum_{i=1}^{m} B_{ji} A_{ij} = \sum_{j=1}^{n} (BA)_{jj} = \mathrm{tr}BA.$$

Here, the first two and last two equalities use the definitions of the trace operator and of matrix multiplication. The middle equality, where the main work occurs, uses the commutativity of scalar multiplication to reverse the order of the terms in each product, and the commutativity and associativity of scalar addition to rearrange the order of the summation.
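A small numerical check of the cyclic-trace property (illustrative only, not from the original notes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

# tr(AB) = tr(BA) even though AB is 3x3 and BA is 5x5.
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True
```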

3.5 Norms

The norm of a vector $\|x\|$ is, informally, a measure of the "length" of the vector. For example, we have the commonly used Euclidean or $\ell_2$ norm,

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$$

Note that $\|x\|_2^2 = x^T x$.

More formally, a norm is any function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies four properties:

  1. For all $x \in \mathbb{R}^n$, $f(x) \geq 0$ (non-negativity).
  2. $f(x) = 0$ if and only if $x = 0$ (definiteness).
  3. For all $x \in \mathbb{R}^n$ and $t \in \mathbb{R}$, $f(tx) = |t| f(x)$ (homogeneity).
  4. For all $x, y \in \mathbb{R}^n$, $f(x + y) \leq f(x) + f(y)$ (triangle inequality).

Other examples of norms are the $\ell_1$ norm,

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|,$$

and the $\ell_\infty$ norm,

$$\|x\|_\infty = \max_i |x_i|.$$

In fact, all three norms presented so far are examples of the family of $\ell_p$ norms, which are parameterized by a real number $p \geq 1$ and defined as

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

Norms can also be defined for matrices, such as the Frobenius norm:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}.$$

There are many more norms, but they are beyond the scope of this review material.
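The $\ell_1$, $\ell_2$, $\ell_\infty$, and Frobenius norms above are all available through `numpy.linalg.norm`; this is an illustrative sketch:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 1))       # l1 norm: 7.0
print(np.linalg.norm(x, 2))       # l2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, np.inf))  # l-infinity norm: 4.0

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))   # Frobenius norm = sqrt(tr(A^T A))
```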

3.6 Linear independence and rank

A set of vectors $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^m$ is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, the vectors are said to be (linearly) dependent if one vector belonging to the set can be represented as a linear combination of the remaining vectors. That is, if

$$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$$

for some scalar values $\alpha_1, \ldots, \alpha_{n-1} \in \mathbb{R}$, then the vectors are linearly dependent; otherwise, the vectors are linearly independent. For example, the vectors

$$x_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 4 \\ 1 \\ 5 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 2 \\ -3 \\ -1 \end{bmatrix}$$

are linearly dependent because $x_3 = -2x_1 + x_2$.

The column rank of a matrix $A \in \mathbb{R}^{m \times n}$ is the size of the largest subset of columns of $A$ that constitutes a linearly independent set. With some abuse of terminology, this is often referred to simply as the number of linearly independent columns of $A$. Similarly, the row rank is the largest number of rows of $A$ that constitute a linearly independent set. For any matrix $A \in \mathbb{R}^{m \times n}$, it turns out that the column rank equals the row rank (although we will not prove this), so both quantities are referred to collectively as the rank of $A$, denoted $\mathrm{rank}(A)$. Here are some basic properties of the rank:

  • For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \leq \min(m, n)$. If $\mathrm{rank}(A) = \min(m, n)$, then $A$ is said to be full rank.
  • For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) = \mathrm{rank}(A^T)$.
  • For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, $\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B))$.
  • For $A, B \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A + B) \leq \mathrm{rank}(A) + \mathrm{rank}(B)$.
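Rank can be computed numerically with `numpy.linalg.matrix_rank`; the sketch below (an illustrative addition) reuses the dependence example above:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 1.0, 5.0])
x3 = -2 * x1 + x2                  # linearly dependent on x1 and x2

A = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(A))                   # 2, so A is not full rank
print(np.linalg.matrix_rank(A) == min(A.shape))   # False
```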

3.7 Inverse of the square matrix

The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is denoted $A^{-1}$, and is the unique matrix such that

$$A^{-1} A = I = A A^{-1}.$$

Note that not all matrices have inverses. Non-square matrices, for example, do not have inverses by definition. However, for some square matrices $A$, it may still be the case that $A^{-1}$ does not exist. In particular, we say that $A$ is invertible or non-singular if $A^{-1}$ exists, and non-invertible or singular otherwise. In order for a square matrix $A$ to have an inverse, $A$ must be full rank. As we will soon see, there are many alternative necessary and sufficient conditions in addition to full rank. The following are properties of the inverse; assume $A, B \in \mathbb{R}^{n \times n}$ are non-singular:

  • $(A^{-1})^{-1} = A$
  • $(AB)^{-1} = B^{-1} A^{-1}$
  • $(A^{-1})^T = (A^T)^{-1}$. For this reason this matrix is often denoted $A^{-T}$.

As an example of how the inverse is used, consider the system of linear equations $Ax = b$, where $A \in \mathbb{R}^{n \times n}$ and $x, b \in \mathbb{R}^n$. If $A$ is non-singular (i.e., invertible), then $x = A^{-1} b$. (What if $A \in \mathbb{R}^{m \times n}$ is not a square matrix? Does this formula work?)
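In numerical practice one rarely forms $A^{-1}$ explicitly; `numpy.linalg.solve` is preferred for solving $Ax = b$. A small illustrative comparison:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))   # non-singular with probability 1
b = rng.standard_normal(4)

x_inv = np.linalg.inv(A) @ b      # x = A^{-1} b (fine for illustration)
x_solve = np.linalg.solve(A, b)   # numerically preferable

print(np.allclose(x_inv, x_solve))   # True
print(np.allclose(A @ x_solve, b))   # True
```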

3.8 Orthogonal matrices

Two vectors $x, y \in \mathbb{R}^n$ are orthogonal if $x^T y = 0$. A vector $x \in \mathbb{R}^n$ is normalized if $\|x\|_2 = 1$. A square matrix $U \in \mathbb{R}^{n \times n}$ is orthogonal if all of its columns are orthogonal to each other and are normalized (the columns are then referred to as orthonormal; note the difference in meaning of "orthogonal" when talking about vectors versus matrices).

It follows immediately from the definitions of orthogonality and normality that

$$U^T U = I = U U^T.$$

In other words, the inverse of an orthogonal matrix is its transpose. Note that if $U$ is not square, i.e., $U \in \mathbb{R}^{m \times n}$ with $n < m$, but its columns are still orthonormal, then $U^T U = I$ but $U U^T \neq I$. We generally only use the term orthogonal to describe the former case, where $U$ is square. Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix does not change its Euclidean norm, i.e.,

$$\|Ux\|_2 = \|x\|_2$$

for any $x \in \mathbb{R}^n$ and orthogonal $U \in \mathbb{R}^{n \times n}$.
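A quick check (illustrative, not from the original notes) that an orthogonal matrix obtained from a QR factorization satisfies $U^T U = I$ and preserves Euclidean norms:

```python
import numpy as np

rng = np.random.default_rng(4)
# The Q factor of a QR factorization of a random square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
x = rng.standard_normal(5)

print(np.allclose(Q.T @ Q, np.eye(5)))                        # True
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # norm is preserved
```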

3.9 Range and null space of the matrix

The span of a set of vectors $\{x_1, x_2, \ldots, x_n\}$ is the set of all vectors that can be expressed as a linear combination of $\{x_1, \ldots, x_n\}$. That is,

$$\mathrm{span}(\{x_1, \ldots, x_n\}) = \left\{ v : v = \sum_{i=1}^{n} \alpha_i x_i, \ \alpha_i \in \mathbb{R} \right\}.$$

It can be shown that if $\{x_1, \ldots, x_n\}$ is a set of $n$ linearly independent vectors, each $x_i \in \mathbb{R}^n$, then $\mathrm{span}(\{x_1, \ldots, x_n\}) = \mathbb{R}^n$. In other words, any vector $v \in \mathbb{R}^n$ can be written as a linear combination of $x_1$ through $x_n$.

The projection of a vector $y \in \mathbb{R}^m$ onto the span of $\{x_1, \ldots, x_n\}$ (here we assume $x_i \in \mathbb{R}^m$) is the vector $v \in \mathrm{span}(\{x_1, \ldots, x_n\})$ that is as close as possible to $y$, as measured by the Euclidean norm $\|v - y\|_2$.

We denote the projection as $\mathrm{Proj}(y; \{x_1, \ldots, x_n\})$ and can define it formally as

$$\mathrm{Proj}(y; \{x_1, \ldots, x_n\}) = \operatorname*{argmin}_{v \in \mathrm{span}(\{x_1, \ldots, x_n\})} \|y - v\|_2.$$

The range of a matrix $A \in \mathbb{R}^{m \times n}$ (sometimes also called the column space), denoted $\mathcal{R}(A)$, is the span of the columns of $A$. In other words,

$$\mathcal{R}(A) = \{ v \in \mathbb{R}^m : v = Ax, \ x \in \mathbb{R}^n \}.$$

Making a few technical assumptions (namely that $A$ is full rank and that $n < m$), the projection of a vector $y \in \mathbb{R}^m$ onto the range of $A$ is given by

$$\mathrm{Proj}(y; A) = A (A^T A)^{-1} A^T y.$$

This last equation should look extremely familiar, since it is almost the same formula we derived in class (and which we will soon derive again) for the least squares estimation of parameters. Looking at the definition of a projection, it should be clear that this is in fact the objective of our least squares minimization problem (except that the norm is squared there, which does not affect the optimal point), so these problems are naturally very related.

When $A$ contains only a single column, $a \in \mathbb{R}^m$, this gives the special case for the projection of a vector onto a line:

$$\mathrm{Proj}(y; a) = \frac{a a^T}{a^T a} \, y.$$
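The projection formula $\mathrm{Proj}(y; A) = A(A^TA)^{-1}A^Ty$ can be checked numerically; the sketch below is an illustrative addition:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))   # full column rank with probability 1
y = rng.standard_normal(6)

P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection matrix onto the range of A
proj = P @ y

# The residual y - proj is orthogonal to every column of A.
print(np.allclose(A.T @ (y - proj), 0))
# Projecting twice changes nothing (P is idempotent).
print(np.allclose(P @ proj, proj))
```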

The null space of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{N}(A)$, is the set of all vectors that equal 0 when multiplied by $A$, i.e.,

$$\mathcal{N}(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}.$$

Note that vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can say much more. It turns out that

$$\{ w : w = u + v, \ u \in \mathcal{R}(A^T), \ v \in \mathcal{N}(A) \} = \mathbb{R}^n \quad \text{and} \quad \mathcal{R}(A^T) \cap \mathcal{N}(A) = \{0\}.$$

In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are disjoint subsets that together span the entire space $\mathbb{R}^n$. Sets of this type are called orthogonal complements, and we denote this as $\mathcal{R}(A^T) = \mathcal{N}(A)^{\perp}$.

3.10 The determinant

The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $\det : \mathbb{R}^{n \times n} \to \mathbb{R}$, denoted $|A|$ or $\det A$ (somewhat like the trace operator, we usually omit the parentheses). Algebraically, one can write an explicit formula for the determinant, but it unfortunately gives little intuition about its meaning. Instead, we first provide a geometric interpretation of the determinant and then explore some of its specific algebraic properties.

Given a matrix $A \in \mathbb{R}^{n \times n}$ with rows $a_1^T, \ldots, a_n^T$, consider the set of points $S \subset \mathbb{R}^n$ formed by taking all possible linear combinations of the row vectors, where the coefficients of the linear combination all lie between 0 and 1; in other words, $S$ is the set of linear combinations constrained to have coefficients $\alpha_1, \ldots, \alpha_n$ satisfying $0 \leq \alpha_i \leq 1$. Formally,

$$S = \left\{ v \in \mathbb{R}^n : v = \sum_{i=1}^{n} \alpha_i a_i, \ 0 \leq \alpha_i \leq 1, \ i = 1, \ldots, n \right\}.$$

It turns out that the absolute value of the determinant of $A$ is a measure of the "volume" of the set $S$.

For example, consider the $2 \times 2$ matrix (4):

$$A = \begin{bmatrix} 1 & 3 \\ 3 & 2 \end{bmatrix}.$$

The rows of this matrix are

$$a_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \qquad a_2 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}.$$

The set $S$ corresponding to these rows is shown in Figure 1. For a two-dimensional matrix, $S$ generally has the shape of a parallelogram. In our example, the value of the determinant is $|A| = -7$ (as can be computed using the formulas shown later in this section), so the area of the parallelogram is 7. (Check for yourself!)

In three dimensions, the set $S$ corresponds to an object known as a parallelepiped (a three-dimensional box with skewed sides, such that every face is a parallelogram). The absolute value of the determinant of the $3 \times 3$ matrix whose rows define $S$ gives the three-dimensional volume of the parallelepiped. In even higher dimensions, the set $S$ is an object known as an $n$-dimensional parallelotope.

Figure 1: Illustration of the determinant of the $2 \times 2$ matrix $A$ given in (4). Here, $a_1$ and $a_2$ are the vectors corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The absolute value of the determinant, $|\det A| = 7$, is the area of the parallelogram.

Algebraically, the determinant satisfies the following three properties (from which all other properties follow, including the general formula):

  1. The determinant of the identity is 1, $|I| = 1$. (Geometrically, the volume of a unit hypercube is 1.)
  2. Given a matrix $A \in \mathbb{R}^{n \times n}$, if we multiply a single row of $A$ by a scalar $t \in \mathbb{R}$, then the determinant of the new matrix is $t|A|$.

(Geometrically, multiplying one of the sides of the set $S$ by a factor $t$ causes the volume to increase by a factor $t$.)

  3. If we exchange any two rows of $A$, then the determinant of the new matrix is $-|A|$.

It may come as a surprise that not many functions satisfy these three properties. In fact, such a function does exist and is unique (we will not prove this here).

Several properties that follow from the three properties above include:

  • For $A \in \mathbb{R}^{n \times n}$, $|A| = |A^T|$.
  • For $A, B \in \mathbb{R}^{n \times n}$, $|AB| = |A||B|$.
  • For $A \in \mathbb{R}^{n \times n}$, $|A| = 0$ if and only if $A$ is singular (i.e., non-invertible).
  • For $A \in \mathbb{R}^{n \times n}$ non-singular, $|A^{-1}| = 1/|A|$.

Before giving the general definition of the determinant, we define, for $A \in \mathbb{R}^{n \times n}$, the matrix $A_{\setminus i, \setminus j} \in \mathbb{R}^{(n-1) \times (n-1)}$ to be the matrix that results from deleting the $i$th row and $j$th column of $A$. The general (recursive) formula for the determinant is

$$|A| = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} \left| A_{\setminus i, \setminus j} \right| \quad (\text{for any } j \in 1, \ldots, n),$$

with the initial case $|A| = a_{11}$ for $A \in \mathbb{R}^{1 \times 1}$. If we were to expand this formula completely for $A \in \mathbb{R}^{n \times n}$, there would be a total of $n!$ different terms. For this reason, we hardly ever write out the complete determinant equation explicitly for matrices larger than $3 \times 3$. However, the determinant formulas for matrices up to size $3 \times 3$ are fairly common, and it is good to know them:

$$\left| [a_{11}] \right| = a_{11},$$

$$\left| \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \right| = a_{11} a_{22} - a_{12} a_{21},$$

$$\left| \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \right| = a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + a_{13} a_{21} a_{32} - a_{11} a_{23} a_{32} - a_{12} a_{21} a_{33} - a_{13} a_{22} a_{31}.$$

The classical adjoint (often simply called the adjoint) of a matrix $A \in \mathbb{R}^{n \times n}$ is denoted $\mathrm{adj}(A)$ and defined as

$$\mathrm{adj}(A) \in \mathbb{R}^{n \times n}, \qquad (\mathrm{adj}(A))_{ij} = (-1)^{i+j} \left| A_{\setminus j, \setminus i} \right|$$

(note the switch in the indices $A_{\setminus j, \setminus i}$). It can be shown that for any non-singular $A \in \mathbb{R}^{n \times n}$,

$$A^{-1} = \frac{1}{|A|} \mathrm{adj}(A).$$

While this is a nice "explicit" formula for the inverse of a matrix, we should note that, numerically, there are far more efficient ways of computing the inverse.
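A numerical illustration of the determinant and the adjoint-based inverse, reusing the $2 \times 2$ example above (added for illustration only):

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [3.0, 2.0]])
print(np.linalg.det(A))   # -7.0, so the parallelogram has area 7

# For a 2x2 matrix, adj(A) = [[a22, -a12], [-a21, a11]].
adj = np.array([[ A[1, 1], -A[0, 1]],
                [-A[1, 0],  A[0, 0]]])
print(np.allclose(np.linalg.inv(A), adj / np.linalg.det(A)))  # True
```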

3.11 Quadratic forms and positive semidefinite matrices

Given a square matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x \in \mathbb{R}^n$, the scalar value $x^T A x$ is called a quadratic form. Written out explicitly, we see that

$$x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j.$$

Note that

$$x^T A x = (x^T A x)^T = x^T A^T x = x^T \left( \tfrac{1}{2} A + \tfrac{1}{2} A^T \right) x,$$

where the first equality holds because the transpose of a scalar is equal to itself, and the last equality holds because we are averaging two quantities that are equal to each other. From this we can conclude that only the symmetric part of $A$ contributes to the quadratic form. For this reason, we often implicitly assume that the matrices appearing in a quadratic form are symmetric. We give the following definitions:

  • A symmetric matrix $A \in \mathbb{S}^n$ is positive definite (PD) if $x^T A x > 0$ for all non-zero vectors $x \in \mathbb{R}^n$. This is usually denoted $A \succ 0$ (or just $A > 0$), and the set of all positive definite matrices is often denoted $\mathbb{S}^n_{++}$.
  • A symmetric matrix $A \in \mathbb{S}^n$ is positive semidefinite (PSD) if $x^T A x \geq 0$ for all vectors $x$. This is written $A \succeq 0$ (or just $A \geq 0$), and the set of all positive semidefinite matrices is often denoted $\mathbb{S}^n_{+}$.
  • Likewise, a symmetric matrix $A \in \mathbb{S}^n$ is negative definite (ND), denoted $A \prec 0$ (or just $A < 0$), if $x^T A x < 0$ for all non-zero $x \in \mathbb{R}^n$.
  • Similarly, a symmetric matrix $A \in \mathbb{S}^n$ is negative semidefinite (NSD), denoted $A \preceq 0$ (or just $A \leq 0$), if $x^T A x \leq 0$ for all $x \in \mathbb{R}^n$.
  • Finally, a symmetric matrix $A \in \mathbb{S}^n$ is indefinite if it is neither positive semidefinite nor negative semidefinite, i.e., if there exist $x_1, x_2 \in \mathbb{R}^n$ such that $x_1^T A x_1 > 0$ and $x_2^T A x_2 < 0$.

It should be obvious that if $A$ is positive definite, then $-A$ is negative definite, and vice versa. Likewise, if $A$ is positive semidefinite, then $-A$ is negative semidefinite, and vice versa. If $A$ is indefinite, then so is $-A$.

One important property of positive definite and negative definite matrices is that they are always full rank, and hence invertible. To see why, suppose that some matrix $A \in \mathbb{R}^{n \times n}$ is not full rank. Then suppose that the $j$th column of $A$ can be expressed as a linear combination of the other $n - 1$ columns:

$$a_j = \sum_{i \neq j} x_i a_i$$

for some scalars $x_1, \ldots, x_{n-1} \in \mathbb{R}$. Setting $x_j = -1$, we have

$$Ax = \sum_{i=1}^{n} x_i a_i = 0.$$

But this means $x^T A x = 0$ for some non-zero vector $x$, so $A$ can be neither positive definite nor negative definite. Therefore, if $A$ is either positive definite or negative definite, it must be full rank. Finally, there is one type of positive definite matrix that comes up frequently and so deserves special mention. Given any matrix $A \in \mathbb{R}^{m \times n}$ (not necessarily symmetric or even square), the matrix $G = A^T A$ (sometimes called a Gram matrix) is always positive semidefinite. Furthermore, if $m \geq n$ (and we assume for convenience that $A$ is full rank), then $G = A^T A$ is positive definite.
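A numerical check (illustrative only) that a Gram matrix $A^T A$ is positive semidefinite, by inspecting its eigenvalues and a random quadratic form:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))
G = A.T @ A                      # Gram matrix, symmetric by construction

eigvals = np.linalg.eigvalsh(G)  # eigenvalues of a symmetric matrix
print(np.all(eigvals >= -1e-10)) # all (numerically) non-negative => PSD

x = rng.standard_normal(3)
print(x @ G @ x >= 0)            # the quadratic form is non-negative
```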

3.12 Eigenvalues and eigenvectors

Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and $x \in \mathbb{C}^n$ is the corresponding eigenvector if

$$Ax = \lambda x, \qquad x \neq 0.$$

Intuitively, this definition means that multiplying $A$ by the vector $x$ results in a new vector that points in the same direction as $x$ but is scaled by the factor $\lambda$. It is worth noting that for any eigenvector $x$ and scalar $c \neq 0$, $A(cx) = cAx = c\lambda x = \lambda(cx)$, so $cx$ is also an eigenvector. For this reason, when we talk about "the" eigenvector associated with $\lambda$, we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since $x$ and $-x$ are both eigenvectors, but we have to live with this).

We can rewrite the equation above to state that $(\lambda, x)$ is an eigenvalue-eigenvector pair of $A$ if

$$(\lambda I - A) x = 0, \qquad x \neq 0.$$

But $(\lambda I - A)x = 0$ has a non-zero solution $x$ only when $(\lambda I - A)$ has a non-empty null space, which is the case only when $(\lambda I - A)$ is singular, i.e.,

$$\left| \lambda I - A \right| = 0.$$

We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial in $\lambda$ of degree $n$. It is often called the characteristic polynomial of the matrix $A$.

We then find the $n$ (possibly complex) roots of this characteristic polynomial and denote them $\lambda_1, \ldots, \lambda_n$. These are all the eigenvalues of the matrix $A$, but we note that they may not be distinct. To find the eigenvector corresponding to the eigenvalue $\lambda_i$, we simply solve the linear system $(\lambda_i I - A)x = 0$, which is guaranteed to have a non-zero solution because $(\lambda_i I - A)$ is singular (but there may also be many or infinitely many solutions).

It should be noted that this is not the method actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that the complete expansion of the determinant has $n!$ terms); it is rather a mathematical argument.

The following are properties of eigenvalues and eigenvectors (in all cases assume $A \in \mathbb{R}^{n \times n}$ has eigenvalues $\lambda_1, \ldots, \lambda_n$):

  • The trace of $A$ is equal to the sum of its eigenvalues: $\mathrm{tr}A = \sum_{i=1}^{n} \lambda_i$.
  • The determinant of $A$ is equal to the product of its eigenvalues: $|A| = \prod_{i=1}^{n} \lambda_i$.
  • The rank of $A$ is equal to the number of non-zero eigenvalues of $A$.
  • Suppose $A$ is non-singular with eigenvalue $\lambda$ and associated eigenvector $x$. Then $1/\lambda$ is an eigenvalue of $A^{-1}$ with the same associated eigenvector, i.e., $A^{-1}x = (1/\lambda)x$. (To prove this, take the eigenvector equation $Ax = \lambda x$ and multiply both sides on the left by $A^{-1}$.)
  • The eigenvalues of a diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ are exactly the diagonal entries $d_1, \ldots, d_n$.
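The first two properties above can be verified numerically with `numpy.linalg.eig`; an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
eigvals, eigvecs = np.linalg.eig(A)

# A x = lambda x for each eigenpair (columns of eigvecs).
print(np.allclose(A @ eigvecs, eigvecs * eigvals))
# Trace equals the sum, and determinant the product, of the eigenvalues.
print(np.isclose(np.trace(A), eigvals.sum().real))
print(np.isclose(np.linalg.det(A), np.prod(eigvals).real))
```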

3.13 Eigenvalues and eigenvectors of symmetric matrices

In general, the structure of the eigenvalues and eigenvectors of a general square matrix can be quite subtle to characterize. Fortunately, in most scenarios in machine learning it suffices to deal with symmetric real matrices, whose eigenvalues and eigenvectors have remarkable properties.

Throughout this section, we assume that $A \in \mathbb{S}^n$ is a real symmetric matrix. It has the following properties:

  1. All eigenvalues of $A$ are real numbers. We denote them by $\lambda_1, \ldots, \lambda_n$.
  2. There exists a set of eigenvectors $u_1, \ldots, u_n$ such that (a) for all $i$, $u_i$ is an eigenvector with eigenvalue $\lambda_i$, and (b) $u_1, \ldots, u_n$ are unit vectors that are orthogonal to each other.

Let $U$ be the orthogonal matrix containing $u_1, \ldots, u_n$ as its columns:

$$U = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}.$$

Let $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ be the diagonal matrix containing $\lambda_1, \ldots, \lambda_n$ as its diagonal entries. Using the view of matrix-matrix multiplication as a set of matrix-vector products from equation (2) in Section 2.3, we can verify:

$$AU = \begin{bmatrix} Au_1 & \cdots & Au_n \end{bmatrix} = \begin{bmatrix} \lambda_1 u_1 & \cdots & \lambda_n u_n \end{bmatrix} = U\Lambda.$$

Recalling that an orthogonal matrix satisfies $UU^T = I$ and using the equation above, we get

$$A = AUU^T = U\Lambda U^T.$$

This new representation of $A$ as $U\Lambda U^T$ is usually called the diagonalization of the matrix $A$. The term diagonalization comes from the fact that, with this representation, we can often effectively treat a symmetric matrix $A$ as a diagonal matrix, which is much easier to understand, with respect to the basis defined by the eigenvectors $U$. We elaborate on this below with several examples.
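For symmetric matrices, `numpy.linalg.eigh` returns real eigenvalues and orthonormal eigenvectors, so the factorization $A = U\Lambda U^T$ can be checked directly (an illustrative addition):

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                        # a symmetric matrix

lam, U = np.linalg.eigh(A)               # real eigenvalues, orthonormal eigenvectors
Lam = np.diag(lam)

print(np.allclose(U.T @ U, np.eye(4)))   # U is orthogonal
print(np.allclose(A, U @ Lam @ U.T))     # A = U Lambda U^T
```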

Background: representing a vector with respect to another basis.

Any orthogonal matrix $U = [u_1 \ u_2 \ \cdots \ u_n]$ defines a new basis (coordinate system) of $\mathbb{R}^n$ in the following sense: any vector $x \in \mathbb{R}^n$ can be expressed as a linear combination of $u_1, \ldots, u_n$ with coefficients $\hat{x}_1, \ldots, \hat{x}_n$:

$$x = \hat{x}_1 u_1 + \cdots + \hat{x}_n u_n = U\hat{x},$$

where in the second equality we use the view of matrix-vector multiplication as a linear combination of columns. Indeed, such $\hat{x}$ is unique:

$$x = U\hat{x} \quad \Longleftrightarrow \quad U^T x = \hat{x}.$$

In other words, the vector $\hat{x} = U^T x$ can serve as another representation of the vector $x$ with respect to the basis defined by $U$.

Diagonalizing matrix-vector multiplication. With the setup above, we will see that left-multiplying by the matrix $A$ can be viewed as left-multiplying by a diagonal matrix with respect to the basis of eigenvectors. Suppose $x$ is a vector and $\hat{x}$ is its representation with respect to the basis $U$. Let $z = Ax$ be the matrix-vector product. Now let us compute the representation $\hat{z}$ of $z$ with respect to the basis $U$. Using $U^TU = I$ and $A = U\Lambda U^T$, we get

$$\hat{z} = U^T z = U^T A x = U^T U \Lambda U^T x = \Lambda \hat{x}, \qquad \text{i.e.,} \quad \hat{z}_i = \lambda_i \hat{x}_i.$$

We can see that left-multiplying by the matrix $A$ in the original space is equivalent to left-multiplying by the diagonal matrix $\Lambda$ with respect to the new basis, i.e., just scaling each coordinate by the corresponding eigenvalue. It is also much easier to multiply by the matrix several times with respect to the new basis. For example, suppose $q = AAAx$. Deriving the analytical form of $q$ using the original basis can be a nightmare, but it is much easier with the new basis:

$$\hat{q} = U^T q = U^T AAAx = \Lambda^3 \hat{x}, \qquad \text{i.e.,} \quad \hat{q}_i = \lambda_i^3 \hat{x}_i.$$

Diagonalizing the quadratic form. As a direct corollary, the quadratic form $x^T A x$ can also be simplified with respect to the new basis:

$$x^T A x = x^T U \Lambda U^T x = \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^{n} \lambda_i \hat{x}_i^2.$$

(Recall that with the old representation, $x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j A_{ij}$ involves a sum of $n^2$ terms, rather than the $n$ terms in the equation above.) With this viewpoint, we can also show that the definiteness of the matrix $A$ depends entirely on the sign of its eigenvalues:

  1. If all $\lambda_i > 0$, then the matrix $A$ is positive definite, because $x^T A x = \sum_{i=1}^{n} \lambda_i \hat{x}_i^2 > 0$ for any $\hat{x} \neq 0$.
  2. If all $\lambda_i \geq 0$, then the matrix $A$ is positive semidefinite, because $x^T A x = \sum_{i=1}^{n} \lambda_i \hat{x}_i^2 \geq 0$ for all $\hat{x}$.
  3. Likewise, if all $\lambda_i < 0$ or all $\lambda_i \leq 0$, then $A$ is negative definite or negative semidefinite, respectively.
  4. Finally, $A$ is indefinite if it has both positive and negative eigenvalues, say $\lambda_i > 0$ and $\lambda_j < 0$. This is because if we take $\hat{x}$ satisfying $\hat{x}_i = 1$ and $\hat{x}_k = 0$ for all $k \neq i$, then $x^T A x = \sum_k \lambda_k \hat{x}_k^2 = \lambda_i > 0$; similarly, if we take $\hat{x}$ satisfying $\hat{x}_j = 1$ and $\hat{x}_k = 0$ for all $k \neq j$, then $x^T A x = \lambda_j < 0$.

A common application of eigenvalues and eigenvectors is maximizing some function of a matrix. In particular, for a matrix $A \in \mathbb{S}^n$, consider the following maximization problem:

$$\max_{x \in \mathbb{R}^n} \ x^T A x \qquad \text{subject to } \|x\|_2^2 = 1.$$

In other words, we want to find the vector (of norm 1) that maximizes the quadratic form. Assuming the eigenvalues are ordered as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, the optimal value of this optimization problem is $\lambda_1$, and any eigenvector $u_1$ corresponding to $\lambda_1$ is a maximizer. (If $\lambda_1 > \lambda_2$, then there is a unique eigenvector corresponding to the eigenvalue $\lambda_1$, which is the unique maximizer of the optimization problem above.) We can show this using the diagonalization technique: note that $\|x\|_2 = \|\hat{x}\|_2$ (since $U$ is orthogonal), and using the formula $x^T A x = \hat{x}^T \Lambda \hat{x}$,

we can rewrite the optimization problem above as

$$\max_{\hat{x} \in \mathbb{R}^n} \ \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^{n} \lambda_i \hat{x}_i^2 \qquad \text{subject to } \|\hat{x}\|_2^2 = 1.$$

Then, we obtain an upper bound on the objective:

$$\sum_{i=1}^{n} \lambda_i \hat{x}_i^2 \leq \sum_{i=1}^{n} \lambda_1 \hat{x}_i^2 = \lambda_1.$$

Moreover, setting $\hat{x} = (1, 0, \ldots, 0)$ achieves equality in the equation above, and this corresponds to setting $x = u_1$.
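An illustrative numerical check that the maximum of $x^T A x$ over unit-norm $x$ equals the largest eigenvalue and is attained at the corresponding eigenvector:

```python
import numpy as np

rng = np.random.default_rng(9)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                       # symmetric

lam, U = np.linalg.eigh(A)              # eigenvalues in ascending order
u_max = U[:, -1]                        # eigenvector for the largest eigenvalue

# The quadratic form at random unit vectors never exceeds the largest eigenvalue.
xs = rng.standard_normal((1000, 5))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
vals = np.einsum('ij,jk,ik->i', xs, A, xs)

print(vals.max() <= lam[-1] + 1e-10)            # True
print(np.isclose(u_max @ A @ u_max, lam[-1]))   # the eigenvector attains it
```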

4. Matrix calculus

While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that seems to be covered less often (and which we will use extensively) is the extension of calculus to the vector setting. Although all of the actual calculus we use is relatively trivial, the notation can often make things look much more difficult than they are. In this section we present some basic definitions of matrix calculus and provide a few examples.

4.1 The gradient

Suppose that $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ is a function that takes as input a matrix $A$ of size $m \times n$ and returns a real value. Then the gradient of $f$ (with respect to $A \in \mathbb{R}^{m \times n}$) is the matrix of partial derivatives, defined as

$$\nabla_A f(A) \in \mathbb{R}^{m \times n}, \qquad (\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}},$$

i.e., an $m \times n$ matrix whose $(i,j)$ entry is $\partial f(A) / \partial A_{ij}$.

Note that the size of $\nabla_A f(A)$ is always the same as the size of $A$. In the special case where $A$ is just a vector $x \in \mathbb{R}^n$,

$$\nabla_x f(x) = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} & \dfrac{\partial f(x)}{\partial x_2} & \cdots & \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix}^T.$$

It is very important to remember that the gradient of a function is defined only if the function is real-valued, that is, if it returns a scalar value. For example, we cannot take the gradient of $Ax$ with respect to $x$ for $A \in \mathbb{R}^{n \times n}$, since this quantity is vector-valued. The following follow directly from the equivalent properties of partial derivatives:

  • $\nabla_x \big( f(x) + g(x) \big) = \nabla_x f(x) + \nabla_x g(x)$
  • For $t \in \mathbb{R}$, $\nabla_x \big( t\, f(x) \big) = t\, \nabla_x f(x)$

In principle, the gradient is a natural extension of partial derivatives to functions of multiple variables. In practice, however, working with gradients can sometimes be tricky for notational reasons. For example, suppose that $A \in \mathbb{R}^{m \times n}$ is a matrix of fixed coefficients, and that $b \in \mathbb{R}^m$ is a vector of fixed coefficients. Let $f : \mathbb{R}^m \to \mathbb{R}$ be the function defined by $f(z) = z^T z$, so that $\nabla_z f(z) = 2z$. Now consider the expression

$$\nabla f(Ax).$$

How should this expression be interpreted? There are at least two possibilities:

1. In the first interpretation, recall that $\nabla_z f(z) = 2z$. Here, we interpret $\nabla f(Ax)$ as evaluating the gradient at the point $Ax$, hence

$$\nabla f(Ax) = 2(Ax) = 2Ax \in \mathbb{R}^m.$$

2. In the second interpretation, we consider the quantity $f(Ax)$ as a function of the input variables $x$. More formally, let $g(x) = f(Ax)$. Then in this interpretation,

$$\nabla f(Ax) = \nabla_x g(x) \in \mathbb{R}^n.$$

Here we can see that the two interpretations are indeed different. One interpretation yields an $m$-dimensional vector as a result, while the other yields an $n$-dimensional vector! How can we resolve this?

The key here is to make explicit the variables with respect to which we are differentiating. In the first case, we are differentiating the function $f$ with respect to its argument $z$ and then substituting the argument $Ax$. In the second case, we are differentiating the composite function $g(x) = f(Ax)$ with respect to $x$ directly.

We denote the first case as $\nabla_z f(Ax)$ and the second case as $\nabla_x f(Ax)$.

It’s very important to keep the notation clear, as you’ll see when you do your course work.


4.2 Hessian matrix

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a function that takes a vector in $\mathbb{R}^n$ and returns a real number. Then the Hessian matrix with respect to $x$, written $\nabla_x^2 f(x)$ or simply $H$, is the $n \times n$ matrix of partial derivatives

$$\nabla_x^2 f(x) \in \mathbb{R}^{n \times n}, \qquad (\nabla_x^2 f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.$$

In other words, it is the $n \times n$ matrix whose $(i,j)$ entry is the second partial derivative above.

Note that the Hessian is always symmetric, since

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}.$$

Like the gradient, the Hessian is defined only when $f(x)$ is real-valued.

It is natural to think of the gradient as the analogue of the first derivative for functions of vectors, and of the Hessian as the analogue of the second derivative (the notation we use also suggests this relation). This intuition is generally correct, but there are a few caveats to keep in mind. First, for real-valued functions of one variable $f : \mathbb{R} \to \mathbb{R}$, it is the basic definition that the second derivative is the derivative of the first derivative, i.e.,

$$\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x} \frac{\partial}{\partial x} f(x).$$

However, for a function of a vector, the gradient of the function is a vector, and we cannot take the gradient of a vector, i.e., the expression

$$\nabla_x \nabla_x f(x)$$

makes no sense. Therefore, the Hessian is not the gradient of the gradient. However, this is almost true, in the following sense: if we look at the $i$th entry of the gradient, $(\nabla_x f(x))_i = \partial f(x) / \partial x_i$, and take the gradient with respect to $x$, we get

$$\nabla_x \frac{\partial f(x)}{\partial x_i} = \begin{bmatrix} \dfrac{\partial^2 f(x)}{\partial x_i \partial x_1} & \dfrac{\partial^2 f(x)}{\partial x_i \partial x_2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_i \partial x_n} \end{bmatrix}^T,$$

which is the $i$th row (column) of the Hessian. Therefore,

$$\nabla_x^2 f(x) = \begin{bmatrix} \nabla_x (\nabla_x f(x))_1 & \nabla_x (\nabla_x f(x))_2 & \cdots & \nabla_x (\nabla_x f(x))_n \end{bmatrix}.$$

To put it simply: we can say that $\nabla_x^2 f(x) = \nabla_x (\nabla_x f(x))^T$, as long as we understand that this really means taking the gradient of each entry of $(\nabla_x f(x))^T$, not the gradient of the whole vector.

Finally, note that while we can take the gradient with respect to a matrix $A \in \mathbb{R}^{m \times n}$, for this course we will only consider taking the Hessian with respect to a vector $x \in \mathbb{R}^n$. This makes things much simpler (and in fact, none of the calculations we do require us to find the Hessian with respect to a matrix), since the Hessian with respect to a matrix would have to contain the partial derivatives with respect to all pairs of entries of the matrix, which is rather cumbersome to represent as a matrix.

4.3 Gradients and Hessians of quadratic and linear functions

Now let us try to determine the gradient and Hessian matrices for a few simple functions. It should be noted that all of the gradients given here are special cases of the gradients given in the CS229 lecture notes.

For $x \in \mathbb{R}^n$, let $f(x) = b^T x$ for some known vector $b \in \mathbb{R}^n$. Then

$$f(x) = \sum_{i=1}^{n} b_i x_i,$$

so

$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k.$$

From this we can easily see that $\nabla_x b^T x = b$. This should be compared to the analogous situation in single-variable calculus, where $\partial (ax) / \partial x = a$. Now consider the quadratic function $f(x) = x^T A x$ for $A \in \mathbb{S}^n$. Remember that

$$f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j.$$

To take the partial derivative with respect to $x_k$, we consider the terms that include $x_k$ as a factor separately:

$$\frac{\partial f(x)}{\partial x_k} = \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k = \sum_{i=1}^{n} A_{ik} x_i + \sum_{j=1}^{n} A_{kj} x_j = 2 \sum_{i=1}^{n} A_{ki} x_i,$$

where the last equality follows because $A$ is symmetric (which we can safely assume, since it appears in a quadratic form). Note that the $k$th entry of $\nabla_x f(x)$ is just the inner product of the $k$th row of $A$ and $x$. Therefore, $\nabla_x x^T A x = 2Ax$. Again, this should remind you of the analogous fact in single-variable calculus, that $\partial (a x^2) / \partial x = 2ax$.

Finally, let us look at the Hessian of the quadratic function $f(x) = x^T A x$ (obviously, the Hessian of a linear function $b^T x$ is zero). In this case,

$$\frac{\partial^2 f(x)}{\partial x_k \partial x_l} = \frac{\partial}{\partial x_k} \left[ \frac{\partial f(x)}{\partial x_l} \right] = \frac{\partial}{\partial x_k} \left[ 2 \sum_{i=1}^{n} A_{li} x_i \right] = 2 A_{lk} = 2 A_{kl}.$$

Therefore, it should be clear that $\nabla_x^2 x^T A x = 2A$, which should be entirely expected (and again analogous to the single-variable fact that $\partial^2 (a x^2) / \partial x^2 = 2a$).

To summarize briefly:

  • $\nabla_x b^T x = b$
  • $\nabla_x x^T A x = 2Ax$ (if $A$ is symmetric)
  • $\nabla_x^2 x^T A x = 2A$ (if $A$ is symmetric)
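A finite-difference check of these identities; the `num_grad` helper below is purely illustrative and not from the original notes:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(10)
b = rng.standard_normal(4)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                  # symmetric
x = rng.standard_normal(4)

print(np.allclose(num_grad(lambda z: b @ z, x), b, atol=1e-4))
print(np.allclose(num_grad(lambda z: z @ A @ z, x), 2 * A @ x, atol=1e-4))
```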

4.4 Least squares

Let us apply the equations from the previous section to derive the least squares equations. Suppose we are given a matrix $A \in \mathbb{R}^{m \times n}$ (for simplicity we assume $A$ is full rank) and a vector $b \in \mathbb{R}^m$ such that $b \notin \mathcal{R}(A)$. In this situation we cannot find a vector $x \in \mathbb{R}^n$ such that $Ax = b$, so instead we want to find a vector $x$ such that $Ax$ is as close as possible to $b$, as measured by the square of the Euclidean norm $\|Ax - b\|_2^2$.

Using the fact that $\|x\|_2^2 = x^T x$, we have

$$\|Ax - b\|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b.$$

Taking the gradient with respect to $x$, and using the properties derived in the previous section,

$$\nabla_x \left( x^T A^T A x - 2 b^T A x + b^T b \right) = \nabla_x x^T A^T A x - \nabla_x 2 b^T A x + \nabla_x b^T b = 2 A^T A x - 2 A^T b.$$

Setting this last expression equal to zero and solving for $x$ gives the normal equations

$$x = (A^T A)^{-1} A^T b,$$

which is the same as what we derived in class.
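The normal-equations solution can be compared against `numpy.linalg.lstsq`; this check is an illustrative addition:

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((10, 3))   # tall, full column rank with probability 1
b = rng.standard_normal(10)        # generally not in the range of A

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # library least-squares solver

print(np.allclose(x_normal, x_lstsq))             # True
```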

4.5 Gradient of the determinant

Now let us consider a situation in which we find the gradient of a function with respect to a matrix; namely, for $A \in \mathbb{R}^{n \times n}$, we want to find $\nabla_A |A|$. Recall from our discussion of determinants that

$$|A| = \sum_{i=1}^{n} (-1)^{i+j} A_{ij} \left| A_{\setminus i, \setminus j} \right| \quad (\text{for any } j \in 1, \ldots, n),$$

so

$$\frac{\partial}{\partial A_{kl}} |A| = \frac{\partial}{\partial A_{kl}} \sum_{i=1}^{n} (-1)^{i+l} A_{il} \left| A_{\setminus i, \setminus l} \right| = (-1)^{k+l} \left| A_{\setminus k, \setminus l} \right| = (\mathrm{adj}(A))_{lk}.$$

From this it follows directly from the properties of the adjoint that

$$\nabla_A |A| = (\mathrm{adj}(A))^T = |A| A^{-T}.$$

Now let us consider the function $f : \mathbb{S}^n_{++} \to \mathbb{R}$, $f(A) = \log |A|$. Note that we have to restrict the domain of $f$ to the positive definite matrices, since this ensures that $|A| > 0$, so that the logarithm of $|A|$ is a real number. In this case we can use the chain rule (nothing fancy, just the ordinary chain rule from single-variable calculus) to see that

$$\frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{ij}}.$$

From this it should be obvious that

$$\nabla_A \log |A| = \frac{1}{|A|} \nabla_A |A| = A^{-1},$$

where we can drop the transpose in the last expression because $A$ is symmetric. Note the similarity to the single-valued case, where $\partial \log x / \partial x = 1/x$.
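A finite-difference sanity check (illustrative only) that $\nabla_A \log|A| = A^{-1}$ for a symmetric positive definite $A$:

```python
import numpy as np

rng = np.random.default_rng(12)
B = rng.standard_normal((3, 3))
A = B @ B.T + 3 * np.eye(3)        # symmetric positive definite

eps = 1e-6
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad[i, j] = (np.log(np.linalg.det(A + E)) -
                      np.log(np.linalg.det(A - E))) / (2 * eps)

print(np.allclose(grad, np.linalg.inv(A), atol=1e-5))  # True
```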

4.6 Eigenvalue optimization

Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following equality-constrained optimization problem:

$$\max_{x \in \mathbb{R}^n} \ x^T A x \qquad \text{subject to } \|x\|_2^2 = 1$$

for a symmetric matrix $A \in \mathbb{S}^n$. A standard way of solving an optimization problem with equality constraints is to form the Lagrangian, an objective function that includes the equality constraints. In this case, the Lagrangian can be given by

$$\mathcal{L}(x, \lambda) = x^T A x - \lambda x^T x,$$

where $\lambda$ is called the Lagrange multiplier associated with the equality constraint. It can be established that for $x^*$ to be an optimal point of the problem, the gradient of the Lagrangian must be zero at $x^*$ (this is not the only condition, but it is required). That is,

$$\nabla_x \mathcal{L}(x, \lambda) = \nabla_x \left( x^T A x - \lambda x^T x \right) = 2Ax - 2\lambda x = 0.$$

Notice that this is just the linear equation $Ax = \lambda x$. This shows that the only points that can possibly maximize (or minimize) $x^T A x$, subject to $x^T x = 1$, are the eigenvectors of $A$.

References

[1] Original file download: cs229.stanford.edu/summer2019/…

[2] Huang Haiguang: github.com/fengdu78

[3] GitHub: github.com/fengdu78/Da…

This article was first published on the public account “Machine Learning Beginners”