
This article is the sixth part of my notes on Andrew Ng's machine learning course. It focuses on a common data dimension reduction algorithm, principal component analysis (PCA), and then extends to another algorithm, independent component analysis (ICA).

Principal Component Analysis (PCA)

Before we go into principal component analysis, let’s first understand what dimension reduction is.

What is dimension reduction

According to the definition in the watermelon book (Zhou Zhihua's Machine Learning), dimension reduction transforms the original high-dimensional attribute space into a low-dimensional subspace through some mathematical transformation. In this subspace the sample density increases dramatically and distance calculations become much easier.

In fact, dimension reduction is usually a projection from a higher dimension to a lower dimension.

For example, as shown in the figure below, reducing three-dimensional data to two dimensions means projecting the data from the original three-dimensional space onto a two-dimensional plane.

PCA (Principal Component Analysis)

Principal component analysis is the most common dimension reduction algorithm. PCA extracts the principal components from redundant features and improves the training speed of the model without losing too much model quality.

Projection Error

When the data is projected onto a vector passing through the origin (the direction vector), the projection error is the length of the perpendicular from each feature vector to that direction vector.

The goal of PCA is to find a direction vector such that the mean squared projection error is as small as possible.

This describes reducing to a single dimension. The full description of PCA is:

To reduce n-dimensional data to k dimensions, the goal is to find a set of vectors $u_1, u_2, u_3, \dots, u_k$ that minimizes the total projection error of all data projected onto this set of vectors. These vectors should be orthogonal in the original space. The resulting set of k orthogonal features is called the principal components: k new features reconstructed from the original n features.
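Written out formally (a standard formulation, with $x_{\text{approx}}^{(i)}$ denoting the projection of $x^{(i)}$ onto the subspace spanned by $u_1,\dots,u_k$), the optimization objective is:

$$\min_{u_1,\dots,u_k}\ \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)}-x_{\text{approx}}^{(i)}\right\rVert^2$$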

Algorithm process

  1. Perform mean normalization (subtract the mean of each feature; apply feature scaling as well if needed).

  2. Calculate the covariance matrix:


    $$\Sigma = \dfrac{1}{m} \sum\limits_{i=1}^{m}(x^{(i)})(x^{(i)})^T = \dfrac{1}{m} X^TX$$

  3. Compute the eigenvectors of the covariance matrix $\Sigma$ via singular value decomposition:


    $$(U, S, V^T) = \mathrm{SVD}(\Sigma)$$

  4. Select the first $k$ vectors from $U$ to obtain an $n \times k$ matrix, denoted $U_{reduce}$. Here $k$ is the target dimension: we want to reduce the data from $n$ dimensions to $k$ dimensions.

  5. Compute the new feature vector $z^{(i)}$:


    $$z^{(i)} = U_{reduce}^T x^{(i)}$$

The end result $z^{(i)}$ has dimension $k \times 1$.

In general, to obtain the principal components, first compute the covariance matrix of the data matrix, then obtain its eigenvectors through singular value decomposition, and finally select the k eigenvectors with the largest eigenvalues, i.e. the directions of largest variance.
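As a minimal sketch of these steps in NumPy (not the course's reference implementation; the function name `pca_reduce` and the toy data are my own assumptions):

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce an (m, n) data matrix X to k dimensions with PCA via SVD."""
    m = X.shape[0]
    X_centered = X - X.mean(axis=0)          # 1. mean normalization
    Sigma = (X_centered.T @ X_centered) / m  # 2. covariance matrix, shape (n, n)
    U, S, Vt = np.linalg.svd(Sigma)          # 3. singular value decomposition
    U_reduce = U[:, :k]                      # 4. keep the first k eigenvectors, shape (n, k)
    Z = X_centered @ U_reduce                # 5. z^(i) = U_reduce^T x^(i), stacked as rows
    return Z, U_reduce

# Toy usage: 100 samples in 3-D reduced to 2-D.
X = np.random.randn(100, 3)
Z, U_reduce = pca_reduce(X, k=2)
print(Z.shape)  # (100, 2)
```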

In summary, PCA is an unsupervised dimension reduction method that only needs an eigendecomposition (or SVD) of the covariance matrix to compress and denoise data, so it is widely used in practical scenarios.

Independent Component Analysis (ICA)

PCA above is a process of information extraction used to reduce the dimension of the original data, while the independent component analysis (ICA) discussed next is a process of information unmixing.

Problem introduction

The premise of ICA is that the observed variables are linear combinations of several statistically independent components.

Let's start with the classic cocktail party problem: $n$ people are having a party in a room and may all talk at the same time. $n$ sound receivers are placed in different corners of the room, and at every moment each receiver picks up an overlapping mixture of the $n$ voices. Because each receiver is at a different distance from each person, the mixtures they record differ. After the party we obtain $m$ sound samples; each sample is the set of sound data collected by the $n$ receivers at a specific moment $i$. How can we separate the $n$ speakers' individual voices from these $m$ samples?

Let's look at this problem description in detail. Let $s$ denote the source signals emitted by each person over all moments. It is an $n \times m$ matrix, and each row is one person's sound signal sequence over time.


$$s=[s_1,s_2,\dots,s_n]^T$$

$x$ collects the mixtures of the $n$ voices sampled at each moment; it is also an $n \times m$ matrix. In other words, over $m$ moments we collect $m$ samples, each of dimension $n$. The superscript $i$ indexes the moment: $x^{(i)}$ is one sample, a linear combination of all $n$ sound signals received at time $i$.


$$x=\begin{bmatrix} |&|&\dots&|\\ x^{(1)} & x^{(2)}&\dots&x^{(m)}\\ |&|&\dots&|\end{bmatrix},\qquad x^{(i)}=[x_1^{(i)},x_2^{(i)},\dots,x_n^{(i)}]^T$$

Therefore, we have the following model:


$$x=As,\qquad \begin{bmatrix} |&|&\dots&|\\ x^{(1)} & x^{(2)}&\dots&x^{(m)}\\ |&|&\dots&|\end{bmatrix}=A\begin{bmatrix} s_1^{(1)}&s_1^{(2)}&\dots&s_1^{(m)}\\ s_2^{(1)}&s_2^{(2)}&\dots&s_2^{(m)}\\ \vdots&\vdots&\ddots&\vdots \\ s_n^{(1)}&s_n^{(2)}&\dots&s_n^{(m)}\end{bmatrix}$$

Here $A$ is an unknown mixing matrix; clearly $A$ is $n\times n$ and must be full rank.

At this point $A$ and $s$ are both unknown, while $x$ is observed. We have to recover $s$ (and $A$) from $x$ alone. This process is also called blind source separation. That sounds amazing, so let's take it step by step.
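To make the model concrete, here is a small simulation of the mixing step $x = As$; the source signals and the mixing matrix below are arbitrary assumptions for illustration, not data from the course:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 1000                           # n speakers/receivers, m time samples

# Two independent, non-Gaussian source signals as the rows of s, shape (n, m).
t = np.linspace(0, 8, m)
s = np.vstack([np.sign(np.sin(3 * t)),   # a square-ish wave
               rng.laplace(size=m)])     # heavy-tailed noise

A = np.array([[1.0, 0.6],                # unknown mixing matrix, shape (n, n)
              [0.4, 1.0]])

x = A @ s                                # observed mixtures, shape (n, m)
# Blind source separation: estimate W ≈ A^{-1} from x alone, then recover s ≈ W x.
```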

Algorithm

Without loss of generality, we can assume that both the mixtures and the independent components have zero mean; if the original data is not zero-mean, we can center the observed variable $x$ first so that the model is a zero-mean model.

First, define the transformation $W = A^{-1}$, so that $s^{(i)} = A^{-1}x^{(i)} = Wx^{(i)}$. Writing $W$ as $[w_1^T, w_2^T, \dots, w_n^T]^T$, i.e. its rows are the $w_j^T$, each source signal component is:


$$s_j^{(i)}=w^T_j x^{(i)}$$

Then, we assume that the sound signals $s_j$ emitted by each person are independent, each with probability density $p_s(s_j)$, so the joint probability density of the source signal at a given moment is:


$$p(s)=\prod_{j=1}^n p_s(s_j)$$

Now that we have $p(s)$, we want the probability density of the sampled signal $x$. How do we get it?

Recall from probability theory that a probability density can be derived from the cumulative distribution function, so let's write down the cumulative distribution function first:


$$F_x(a)=P(x\le a)=P(As\le a)=P(s\le Wa)=F_s(Wa)$$

Then take the derivative:


$$p_x(a)=F'_x(a)=\frac{d}{da}F_s(Wa)=|W|\,p_s(Wa)$$

So we have:


$$p(x)=p_s(Wx)|W|=|W|\prod_{j=1}^n p_s(w_j^Tx)$$
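As a quick sanity check of this change-of-variables step, consider the scalar case $x = as$ with $w = 1/a$ (my own illustrative special case): the density of $x$ is not simply $p_s(wx)$ but must carry the Jacobian factor,

$$p_x(x)=\frac{1}{|a|}\,p_s\!\left(\frac{x}{a}\right)=|w|\,p_s(wx),$$

which matches the $|W|$ factor in the multivariate formula above.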

Maximum likelihood estimation

Likelihood function:


$$L(W)=p_s(Wx)|W|=|W|\prod_{j=1}^n p_s(w_j^Tx)$$

Given the training samples $x^{(1)},x^{(2)},\dots,x^{(m)}$, the log-likelihood is:


$$\ln L(W)= \sum_{i=1}^m\left\{ \sum_{j=1}^n \ln p_s(w^T_jx^{(i)})+\ln|W|\right\}$$
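To actually maximize this log-likelihood we need a concrete $p_s$. A common choice, used for example in Ng's CS229 notes, is to take the sigmoid $g(s)=1/(1+e^{-s})$ as the source CDF, so that $p_s(s)=g'(s)$; this leads to the stochastic gradient ascent update $W := W + \alpha\big((1-2g(Wx^{(i)}))\,x^{(i)T} + (W^T)^{-1}\big)$. A minimal sketch, with the learning rate and iteration count chosen arbitrarily:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, epochs=50, seed=0):
    """Estimate the unmixing matrix W from X of shape (n, m) by stochastic
    gradient ascent on the log-likelihood, assuming a sigmoid source CDF."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            x_i = X[:, i:i + 1]   # one sample as a column vector, shape (n, 1)
            grad = (1 - 2 * sigmoid(W @ x_i)) @ x_i.T + np.linalg.inv(W.T)
            W += alpha * grad
    return W

# Usage with the simulated mixtures x from the earlier sketch:
# W = ica(x); s_hat = W @ x   # recovered sources, up to scale and order
```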

Classical assumptions and uncertainties of ICA

Classical assumptions

1. The components are independent of each other

This is the most basic and important principle of ICA, and the interesting point is that once this assumption is made, the model can be solved. The intuition is that if the random variables $(x_1, x_2, \dots, x_n)$ are statistically independent of each other, then we cannot get any information about a variable $x_j$ from the remaining variables.

2. The independent components follow non-Gaussian distributions, which is a very strong assumption of ICA.

This is because if the source signals are Gaussian, i.e. the independent components are all Gaussian, their joint distribution is uniform over directions and their density is completely symmetric, as in the two-dimensional Gaussian distribution shown in the figure below. From its shape you can see that any orthogonal transformation of the Gaussian variables $(x_1,x_2)$ has exactly the same distribution as $(x_1,x_2)$ itself. Because Gaussian random variables have all higher-order cumulants equal to zero, ICA could return infinitely many possible mixing matrices $A$.

If the source signals are non-Gaussian, the decomposition performed by ICA is unique. This is why standard independent component analysis allows at most one component to follow a Gaussian distribution. (A small numerical check of the Gaussian ambiguity appears after this list.)

3. The mixing matrix A is assumed to be square

This is natural: it makes $A$ invertible, which simplifies the computation.
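Here is the small numerical check promised above for assumption 2: rotating i.i.d. Gaussian data by any orthogonal matrix leaves its covariance (and hence its entire distribution) unchanged, so no particular rotation can be preferred and the mixing matrix cannot be identified. The rotation angle below is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal((2, 10000))           # i.i.d. Gaussian sources, shape (2, m)

theta = 0.7                                   # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.cov(g))      # ≈ identity
print(np.cov(R @ g))  # ≈ identity too: the rotated data is statistically indistinguishable
```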

Uncertainty

An observed $x$ with the same statistical characteristics may come from two different systems:


$$x=As=(A\alpha)(\alpha^{-1}s)$$

In other words, the sampled signal suffers from non-identifiability and noise. The expression above shows that, up to an invertible linear transformation $\alpha$, the mixing matrix $A$ and the sources $s$ are not unique, so we cannot uniquely recover the original signals (a tiny numerical illustration appears after the list below).

Factors that cannot be determined by ICA

  • The variances of the independent components cannot be determined
  • The order of the independent components cannot be determined
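As a tiny numerical illustration of these two points, rescaling and reordering the sources can be absorbed into the mixing matrix without changing the observed mixtures at all; the matrices below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 2))                 # some mixing matrix
s = rng.laplace(size=(2, 5))                    # some non-Gaussian sources

D = np.diag([3.0, -0.5])                        # arbitrary rescaling of the sources
P = np.array([[0.0, 1.0], [1.0, 0.0]])          # arbitrary reordering (permutation)

x1 = A @ s
x2 = (A @ np.linalg.inv(P @ D)) @ (P @ D @ s)   # rescaled and reordered sources
print(np.allclose(x1, x2))                      # True: the observed mixtures are identical
```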

Summary

Neither PCA nor ICA requires specific assumptions about the exact distribution of the source signals.

If the observed signals are Gaussian, then the source signals are also Gaussian, and in that case PCA and ICA are equivalent.

Most ICA algorithms require data preprocessing: PCA is first applied to obtain Y, and then each component of Y is normalized (i.e. divided by its own standard deviation) to obtain Z. After this preprocessing (see the sketch after the list below), Z satisfies the following properties:

  • The components of Z are uncorrelated;
  • Each component of Z has variance 1.
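A minimal sketch of this preprocessing, often called whitening, assuming the signal matrix is arranged as (n components) × (m samples); the variable names are my own:

```python
import numpy as np

def whiten(X):
    """Whiten an (n, m) signal matrix X: the rows of the returned Z are
    uncorrelated and each row has unit variance."""
    Xc = X - X.mean(axis=1, keepdims=True)   # zero-mean each component
    Sigma = (Xc @ Xc.T) / Xc.shape[1]        # covariance matrix, shape (n, n)
    eigvals, E = np.linalg.eigh(Sigma)       # PCA step: eigen-decomposition
    Y = E.T @ Xc                             # uncorrelated components Y
    Z = Y / np.sqrt(eigvals)[:, None]        # divide each component by its std -> Z
    return Z

# Check: np.cov(whiten(x)) is approximately the identity matrix.
```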