- Reducing Dimensionality from Dimensionality Reduction Techniques
- Elior Cohen
- The Nuggets Translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: haiyang-tju
- Proofread by: TrWestdoor, Kasheemlew
In this post, I will do my best to demystify three dimensionality reduction techniques that are used all the time: PCA, t-SNE, and autoencoders. My main motivation is that these methods are often treated as black boxes and are therefore sometimes misused. Understanding them will help you decide when and how to use each one.
I'll cover the inner workings of each method and implement the corresponding code from scratch in TensorFlow (except for t-SNE). Why TensorFlow? Because it is mostly used for deep learning, so let's give it some other challenge 🙂 The code for this post can be found in this notebook.
Motivation
When dealing with real problems and real data, we often face data with millions of dimensions.
Although the data is most expressive in its original high-dimensional form, it sometimes needs to be reduced. The need is usually associated with visualization (reducing to 2 or 3 dimensions so we can plot it), but that is not always the case.
Sometimes, in practice, we may value performance over precision, so we might reduce 1,000-dimensional data to 10 dimensions in order to process it faster (for example, in distance calculations).
Sometimes the need for dimensionality reduction is real, and it has many applications.
Before we begin, if you had to choose one dimensionality reduction technique for each of the following situations, which one would it be?
- Your system measures distance with cosine similarity, but you need to visualize the data for board members with no technical background who may not be familiar with the concept of cosine similarity at all. How do you do it?
- You need to compress the data into as few dimensions as possible, under the constraint of preserving about 80% of the data. What do you do?
- You have a database of data of a certain kind that has been collected over a long time, and similar data keeps coming in. You need to reduce the dimensionality of the data you already have as well as the data that keeps arriving. Which approach do you choose?
I hope this article will shed some light on dimensionality reduction so that you can feel comfortable facing similar questions.
Let’s start with the PCA approach.
PCA
The Principal Component Analysis (PCA) method is probably the oldest dimensionality reduction technique in the book.
Because of that, PCA is well studied, and many methods arrive at the same solution. We will discuss two of them here, eigendecomposition and singular value decomposition (SVD), and then implement the SVD version in TensorFlow.
From now on, X is the data matrix we are working with, of shape (n, p), where n is the number of samples and p is the number of dimensions.
Given X, both approaches try to factor X in their own way, and then use the factorization to represent as much of the information as possible in fewer dimensions. I know this sounds scary, so I will leave most of the math out and keep only the parts that help us understand the pros and cons of these methods.
So eigendecomposition and SVD are two different ways of factoring a matrix. Let's see how they are used in principal component analysis and how they are connected.
Take a look at the flow chart below and I’ll explain it in a minute.
Figure 1: PCA workflow
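The figure itself is not reproduced here; as a compact summary of the two routes (this is standard PCA linear algebra, not necessarily the exact steps of the author's diagram):
- Eigendecomposition route: center X, form the covariance matrix (1/(n-1)) X^T X, decompose it as W Λ W^T, and project with the top r eigenvectors: Z = X W_r.
- SVD route: decompose X = U Σ V^T directly and take Z = U_r Σ_r (which equals X V_r), keeping the r largest singular values.
For centered X the two routes are linked by V = W and Λ = Σ^2 / (n-1), which is why they reach the same answer.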
Why should we care? Because these two processes have something very fundamental in them that tells us a lot about PCA.
As you can see, both methods are pure linear algebra, which basically tells us that using PCA is just looking at the real data from a different angle. This is unique to PCA, since the other methods start from a random low-dimensional representation of the data and try to make it behave like the high-dimensional data.
It is also worth noting that all the operations are linear, so the SVD approach is extremely fast.
Also, given the same data, PCA always gives the same answer (the other two methods do not).
Notice how in the SVD approach we choose the parameter r (the number of dimensions we want to reduce to) so that we keep the columns associated with the larger values of Sigma. There is something special about Sigma.
Sigma is a diagonal matrix with p diagonal values (called singular values), and their magnitude indicates how important each dimension is for preserving the information.
So we can choose to reduce the number of dimensions to one that will keep approximately a given percentage of the information, and I'll demonstrate that in code (for example, reducing dimensionality under the constraint of losing at most 15% of the information).
As you will see, doing this in TensorFlow is fairly simple. All we need to do is write a class with a fit method and a reduce method, to which we supply the number of dimensions we want to reduce to.
Code (PCA)
Let's take a look at the fit method, where self.data holds the data we work with and self.dtype = tf.float32 is its type.
def fit(self):
    self.graph = tf.Graph()
    with self.graph.as_default():
        self.X = tf.placeholder(self.dtype, shape=self.data.shape)
        # Perform SVD
        singular_values, u, _ = tf.svd(self.X)
        # Create the sigma matrix
        sigma = tf.diag(singular_values)
    with tf.Session(graph=self.graph) as session:
        self.u, self.singular_values, self.sigma = session.run([u, singular_values, sigma],
                                                               feed_dict={self.X: self.data})
So the purpose of fit is to create sigma and u, which we will use later. We start with the tf.svd line, which gives us the singular values (the diagonal values denoted Sigma in Figure 1) and also computes the matrices U and V.
Then tf.diag is TensorFlow's way of turning a 1-D vector into a diagonal matrix, so here the singular values become the diagonal matrix sigma.
At the end of fit we have the singular values, sigma, and u.
Now let’s implement the Reduce method.
def reduce(self, n_dimensions=None, keep_info=None):
    if keep_info:
        # Normalize the singular values
        normalized_singular_values = self.singular_values / sum(self.singular_values)
        # Create the cumulative ladder of information kept per dimension
        ladder = np.cumsum(normalized_singular_values)
        # Get the first index that exceeds the requested information threshold
        index = next(idx for idx, value in enumerate(ladder) if value >= keep_info) + 1
        n_dimensions = index
    with self.graph.as_default():
        # Slice the relevant part of sigma
        sigma = tf.slice(self.sigma, [0, 0], [self.data.shape[1], n_dimensions])
        # PCA projection
        pca = tf.matmul(self.u, sigma)
    with tf.Session(graph=self.graph) as session:
        return session.run(pca, feed_dict={self.X: self.data})
As shown above, the reduce method accepts either keep_info or n_dimensions (I don't implement here the input checks, which are a must). If n_dimensions is supplied, the method simply reduces the data to that number of dimensions; if keep_info is supplied (a float between 0 and 1), it indicates how much of the information in the original data we want to keep (0.9 means keeping 90% of the information). In the first "if" block we normalize the singular values, check how many of them we need to reach keep_info, and derive n_dimensions from that.
In the second block we just slice the sigma matrix down to as many dimensions as needed and perform the matrix multiplication.
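For the curious, a one-line justification of that multiplication (standard SVD algebra, my addition): since X = U Σ V^T and the columns of V are orthonormal, projecting X onto its first r principal directions gives X V_r = U Σ V^T V_r = U Σ_r, where Σ_r is sigma with only its first r columns kept, which is exactly what multiplying u by the sliced sigma computes.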
Let's try it on the Iris dataset, which contains 3 kinds of iris flowers and has shape (150, 4).
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns

iris_dataset = datasets.load_iris()
tf_pca = TF_PCA(iris_dataset.data, iris_dataset.target)
tf_pca.fit()
pca = tf_pca.reduce(keep_info=0.9)  # Results in two dimensions
color_mapping = {0: sns.xkcd_rgb['bright purple'], 1: sns.xkcd_rgb['lime'], 2: sns.xkcd_rgb['ochre']}
colors = list(map(lambda x: color_mapping[x], tf_pca.target))
plt.scatter(pca[:, 0], pca[:, 1], c=colors)
Figure 2: Iris dataset reduced to two dimensions with PCA
Not bad, right?
t-SNE
T-sne is a relatively new approach to PCA, derived from a paper in 2008 (link to original paper).
It is more complicated to understand than the PCA method, so bear with me. Our representation of t-SNe is as follows: X represents the raw data, P is a matrix that stores the closeness (distance) between X points in the higher (original) space, and Q is also a matrix that stores the closeness (distance) between data points in the lower space. If we have n data samples, Q and P are both n by n matrices (that is, the distances from any point to any point including itself).
Now t – SNE method has its own “unique ways” (we will soon introduce) to measure the distance between the object, a measure of the distance between the data points in high dimensional space method, and the other a kind of method is in low dimension space measuring the distance between the data points, and the third method is the method of measuring the distance between P and Q. A point from the original literature, x_j x_i with another point of similarity by _p_j | I given, among them, if under x_i centered gaussian distribution, the probability density according to the proportion of selected x_j adjacent points.
“What?” Don’t worry, as I said, the T-SNe has its own distance measurement, so let’s take a look at the distance (closeness) measurement formula and find some explanations for how the T-SNe behaves.
At a high level, this is how the algorithm works (note that, unlike PCA, it is an iterative algorithm).
Figure 3: t-SNE workflow
Let’s take it step by step.
The algorithm takes two inputs: the data itself and a value called perplexity (Perp).
Perplexity, in simple terms, is how you want to balance the focus between the local structure (close points) and the global structure of the data during the optimization; the article recommends keeping it between 5 and 50.
Higher perplexity means a data point considers more points as its close neighbors; lower values mean fewer.
Perplexity really affects the visualization, and be careful, because it can be misleading when visualizing low-dimensional data, so I highly recommend reading this very good article on using t-SNE properly, which covers the effects of different perplexity values.
Where does perplexity come in? It is used to compute the σ_i in equation (1), since the two have a monotonic relationship; σ_i is found by binary search.
So σ_i is computed differently for every data point, based on the perplexity we feed the algorithm.
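To make that concrete, here is a minimal numpy sketch (my own illustration, not the reference implementation) of how σ_i can be found by binary search so that the conditional distribution around a point matches the requested perplexity; distances_i is assumed to hold the distances from x_i to every other point (x_i itself excluded):
import numpy as np

def find_sigma(distances_i, target_perplexity, tol=1e-5, max_iter=50):
    # perplexity = 2 ** entropy, so we match entropies measured in bits
    target_entropy = np.log2(target_perplexity)
    sigma, lo, hi = 1.0, None, None
    for _ in range(max_iter):
        # Gaussian conditional probabilities p_j|i around x_i (cf. equation (1))
        p = np.exp(-np.asarray(distances_i, dtype=float) ** 2 / (2.0 * sigma ** 2))
        p /= max(p.sum(), 1e-12)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:
            hi = sigma  # distribution too flat: shrink sigma
            sigma = sigma / 2.0 if lo is None else (lo + sigma) / 2.0
        else:
            lo = sigma  # distribution too peaked: grow sigma
            sigma = sigma * 2.0 if hi is None else (sigma + hi) / 2.0
    return sigma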
So let's see what the formulas of t-SNE tell us.
Before we look at equations (1) and (2), it is important to know that p_ii is set to 0 and q_ii is set to 0 (this is simply set by convention; applying the equations to a point and itself would not actually output 0).
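The equation images from the original post are not reproduced here; for reference, these are the standard affinities from the 2008 paper (I am assuming the post's numbering follows the paper):
p_j|i = exp(-||x_i - x_j||^2 / 2σ_i^2) / Σ_{k ≠ i} exp(-||x_i - x_k||^2 / 2σ_i^2),  with p_ij = (p_j|i + p_i|j) / 2n   (1)
q_ij = (1 + ||y_i - y_j||^2)^(-1) / Σ_{k ≠ l} (1 + ||y_k - y_l||^2)^(-1)   (2)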
Now look at equations (1) and (2) and notice that if two points are close (in the high-dimensional representation), the numerator yields a value of about 1, whereas if they are far apart we get an infinitesimal value. This will help us understand the cost function later.
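A quick numeric sanity check of that claim (my own numbers, taking σ_i = 1): for a nearby pair at distance 0.1 the numerator is exp(-0.1^2 / 2) ≈ 0.995, while for a distant pair at distance 10 it is exp(-10^2 / 2) ≈ 1.9 × 10^-22.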
This already reveals some of t-SNE's properties.
One is that interpreting distances in a t-SNE plot is problematic, because of the way the affinity equations are built.
This means that distances between clusters and cluster sizes can be misleading, and are also affected by the chosen perplexity (again, I'll refer you to the article mentioned above to see visualizations of these phenomena).
The other thing to notice in equation (1) is that it uses the Euclidean distance between points. This is where things get powerful: we can swap that distance metric for any other metric we want (cosine distance, Manhattan distance, or anything else that is a valid measure on the space) and keep the low-dimensional affinities the same. The result is that we plot, in Euclidean terms, distances that are actually more complex measures.
For example, suppose you are a CTO, you have some data, and you measure distance on it with cosine similarity; now your CEO wants you to present it to the board right away. I'm not sure you would have the time to explain cosine similarity to the board and how to read the clusters, but you can simply plot the cosine-similarity clusters as Euclidean-distance clusters using t-SNE, and that, I would say, is pretty neat.
In code, you can do this with the TSNE method in scikit-learn by supplying a precomputed distance matrix.
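For example, a minimal sketch of the idea using scikit-learn (the variable names are mine, and iris_dataset, plt, and colors are reused from the earlier snippets; init='random' is used because PCA initialization cannot be combined with a precomputed distance matrix):
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_distances

# Precompute the pairwise cosine distances and let t-SNE embed them
cos_dist = cosine_distances(iris_dataset.data)
model = TSNE(n_components=2, metric='precomputed', init='random', random_state=0)
cos_tsne = model.fit_transform(cos_dist)
plt.scatter(cos_tsne[:, 0], cos_tsne[:, 1], c=colors)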
OK, so now we know that p_ij and q_ij are large when two points are close in their respective spaces, and very small when they are far apart.
Let's see what effect that has on our loss function (which is called the Kullback-Leibler divergence) by plotting formula (3) without the summation.
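For reference (again, the image of the formula is not reproduced here), the cost in the paper is the Kullback-Leibler divergence of Q from P:
C = KL(P || Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)   (3)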
Figure 4: The t-SNE loss function, without the summation
It is harder to read, but I did put the axis names there. As you can see, the loss function is asymmetric.
It imposes a large cost when a point is nearby in the high-dimensional space (high value on the p axis) but is represented by far-apart points in the low-dimensional space; conversely, points that are far apart in the high-dimensional space but represented as close in the low-dimensional space incur a smaller cost.
This says even more about the distance-interpretation issue of t-SNE plots.
Let's run t-SNE on the Iris dataset and see what happens with different perplexities.
model = TSNE(learning_rate=100, n_components=2, random_state=0, perplexity=5)
tsne5 = model.fit_transform(iris_dataset.data)
model = TSNE(learning_rate=100, n_components=2, random_state=0, perplexity=30)
tsne30 = model.fit_transform(iris_dataset.data)
model = TSNE(learning_rate=100, n_components=2, random_state=0, perplexity=50)
tsne50 = model.fit_transform(iris_dataset.data)
plt.figure(1)
plt.subplot(311)
plt.scatter(tsne5[:, 0], tsne5[:, 1], c=colors)
plt.subplot(312)
plt.scatter(tsne30[:, 0], tsne30[:, 1], c=colors)
plt.subplot(313)
plt.scatter(tsne50[:, 0], tsne50[:, 1], c=colors)
plt.show()
Figure 5: t-SNE on the Iris dataset with different perplexities
As the math suggested, you can see that given a good perplexity the data does cluster nicely; also note the sensitivity to the hyperparameters (the clusters cannot be found without supplying a learning rate for the gradient descent).
Before we move on: t-SNE is a very powerful method if you apply it correctly. Don't take what you have just learned in a negative way, just be aware of how to use it.
Next comes the autoencoder.
Autoencoders
While PCA and t-SNE are single methods, autoencoders are a family of methods. An autoencoder is a neural network that tries to predict its own input (the network is trained so that the output is as similar to the input as possible), using fewer hidden nodes at the end of the encoder than input nodes, so that it has to encode as much information as possible into those few nodes.
Figure 6 shows a basic autoencoder for our 4-dimensional Iris dataset, where the connections from the input layer to the hidden layer are called the "encoder" and the connections from the hidden layer to the output layer the "decoder".
Figure 6: Structure of a basic autoencoder for the Iris dataset
So why are autoencoders a family of methods? Because our only constraint is that the input and output layers have the same dimension; in between, we can create any architecture we like that best encodes our high-dimensional data, as the sketch below illustrates.
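As an illustration of that freedom, here is a hedged sketch (same TF 1.x style as the code below; the layer sizes are arbitrary choices of mine) of a deeper, nonlinear autoencoder for the 4-dimensional Iris data:
import tensorflow as tf

# Input and output must both be 4-dimensional (the Iris feature count);
# everything in between is up to us.
X = tf.placeholder(tf.float32, shape=(None, 4))
hidden_enc = tf.layers.dense(X, 8, activation=tf.nn.relu)         # encoder hidden layer
encoding = tf.layers.dense(hidden_enc, 2)                         # 2-dimensional code
hidden_dec = tf.layers.dense(encoding, 8, activation=tf.nn.relu)  # decoder hidden layer
predicted_x = tf.layers.dense(hidden_dec, 4)                      # reconstruction, back to 4 dims
cost = tf.reduce_mean(tf.square(predicted_x - X))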
The autoencoder starts with some random low-dimensional representation (z) and works its way towards its goal by gradient descent, changing the weights that connect the input layer to the hidden layer and the hidden layer to the output layer.
By now we already know something important about autoencoders: since we control the inside of the network, we can engineer encoders that are able to pick up very complex relationships between features.
Another advantage of autoencoders is that, at the end of training, we are left with the weights leading into the hidden layer. Those weights were trained on a particular kind of input, and if we later encounter new data of the same kind, we can reduce its dimensionality with those weights without retraining. But be careful: such a scheme only works if the new data points resemble the data we trained on.
The math of autoencoders in this case may be simple, but it is not very useful to study it here, because the math changes with every internal architecture and cost function we choose.
It is worth spending a moment, though, on how the autoencoder's weights are optimized, because the definition of the cost function is very important.
Since the autoencoder uses the cost function to determine how good its predictions are, we can use that ability to emphasize what we care about. Whether we want Euclidean distance or some other measure, we can reflect it in the encoded data through the cost function, by using different distance measures, by using symmetric or asymmetric functions, and so on; see the sketch below.
Even more convincingly, since autoencoders are essentially neural networks, we can even weight classes and samples during training to give more significance to certain phenomena in the data.
This gives us a lot of flexibility in how we compress our data.
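As a hedged sketch of that cost-function flexibility (reusing the X and predicted_x tensors from the fit method shown below; this is my illustration, not part of the original notebook), we could penalize the cosine distance between each input row and its reconstruction instead of the squared error:
# Instead of:
#   cost = tf.reduce_mean(tf.pow(tf.subtract(predicted_x, X), 2))
# penalize the per-row cosine distance between input and reconstruction:
dot = tf.reduce_sum(predicted_x * X, axis=1)
norms = tf.norm(predicted_x, axis=1) * tf.norm(X, axis=1) + 1e-8  # avoid division by zero
cost = tf.reduce_mean(1.0 - dot / norms)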
Autoencoders are very powerful, and in some cases give very good results compared to other methods (search for "PCA vs autoencoders" and you will find examples), so they are certainly a valid approach.
Let's implement a basic autoencoder in TensorFlow, run it on the Iris dataset, and plot the result.
Code (Autoencoder)
Once again, we need to implement the fit and reduce methods.
def fit(self, n_dimensions):
    graph = tf.Graph()
    with graph.as_default():
        # Input variable
        X = tf.placeholder(self.dtype, shape=(None, self.features.shape[1]))
        # Network parameters
        encoder_weights = tf.Variable(tf.random_normal(shape=(self.features.shape[1], n_dimensions)))
        encoder_bias = tf.Variable(tf.zeros(shape=[n_dimensions]))
        decoder_weights = tf.Variable(tf.random_normal(shape=(n_dimensions, self.features.shape[1])))
        decoder_bias = tf.Variable(tf.zeros(shape=[self.features.shape[1]]))
        # Encoder
        encoding = tf.nn.sigmoid(tf.add(tf.matmul(X, encoder_weights), encoder_bias))
        # Decoder
        predicted_x = tf.nn.sigmoid(tf.add(tf.matmul(encoding, decoder_weights), decoder_bias))
        # Define the cost function (least squares) and the optimizer
        cost = tf.reduce_mean(tf.pow(tf.subtract(predicted_x, X), 2))
        optimizer = tf.train.AdamOptimizer().minimize(cost)
    with tf.Session(graph=graph) as session:
        # Initialize global variables
        session.run(tf.global_variables_initializer())
        for batch_x in batch_generator(self.features):
            self.encoder['weights'], self.encoder['bias'], _ = session.run([encoder_weights, encoder_bias, optimizer],
                                                                           feed_dict={X: batch_x})
Nothing special here; the code is pretty much self-explanatory. We save the encoder's weights and bias so we can reduce the data in the reduce method that follows.
def reduce(self):
    return np.add(np.matmul(self.features, self.encoder['weights']), self.encoder['bias'])
Heh, it’s that simple 🙂
Let's see how it did (batch size of 50, 1000 epochs).
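For completeness, a hedged usage sketch (the class name TF_AutoEncoder and its constructor arguments are my assumptions; the notebook may name things differently):
tf_ae = TF_AutoEncoder(iris_dataset.data, iris_dataset.target)
tf_ae.fit(n_dimensions=2)
ae = tf_ae.reduce()
plt.scatter(ae[:, 0], ae[:, 1], c=colors)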
Figure 7: Output of a simple autoencoder on the Iris dataset
Even without changing the internal architecture, we may get quite different results with different batch sizes, numbers of epochs, and optimizers; and that is before touching the architecture itself.
Note that I just picked arbitrary values for some of the hyperparameters. In a real scenario we would measure how well we are doing by cross-validation or held-out test data, and search for the best settings.
Conclusion
Posts like this often end with a comparison chart, pros and cons, and so on, but that is the opposite of what I want to achieve here.
My goal is to expose the inner workings of these methods so that readers can understand the strengths and weaknesses of each one. I hope you enjoyed the read and learned something new.
Now go back to the three questions at the beginning of the article. Do you feel more comfortable with them now?
If you find any mistakes in the translation or other places that need improvement, you are welcome to revise the translation in the Nuggets Translation Project and submit a PR, for which you can earn the corresponding reward points. The permanent link at the beginning of this article is the MarkDown link to this article on GitHub.
The Nuggets Translation Project is a community that translates quality technical articles from around the Internet and shares them on Nuggets. The content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence, and other fields. For more quality translations, please keep following the Nuggets Translation Project, its official Weibo, and its Zhihu column.