Machine Learning for Humans, Part 3: Unsupervised Learning
By Vishal Maini
Translator: Machine Heart (机器之心)
Clustering and dimensionality reduction: K-means clustering, hierarchical clustering, principal component analysis (PCA), singular value decomposition (SVD).
How can we discover the underlying structure of a dataset? How can we summarize it and group it most usefully? How can we effectively represent the data in a compressed format? These are the goals of unsupervised learning, which is called "unsupervised" because it starts with unlabeled data.
The two types of unsupervised learning tasks we will explore here are: 1) clustering the data into groups by similarity, and 2) reducing dimensionality to compress the data while preserving its structure and usefulness.
Examples of unsupervised learning methods that might be useful:
- An ad platform needs to segment the U.S. population into groups with similar demographics and purchasing habits so that advertisers can reach their target customers with relevant ads.
- Airbnb needs to group its listings into neighborhoods so that users can browse them more easily.
- A data science team needs to reduce the number of dimensions of a large data set in order to simplify modeling and reduce file sizes.
Unlike supervised learning, there is no straightforward metric for judging how well an unsupervised learning algorithm has done. "Performance" tends to be subjective and domain-specific.
Clustering
An interesting real-world application of clustering is Personicx, a life-stage clustering system from marketing data provider Acxiom. The service divides US households into 70 different clusters, grouped into 21 life-stages, that can be used by advertisers for targeted Facebook ads, display ads and direct mail.
Personicx's demographic clusters
Their white paper shows that they use clustering and principal component analysis, both of which are covered in this section.
As you can imagine, these clusters are very useful to advertisers who want to (1) understand their existing customer base and (2) target potential new customers with relevant demographics, interests, and lifestyle habits, so that their advertising dollars are spent efficiently.
In fact, you can check out Acxiom's "Which Cluster do I Belong to?" tool and answer a few simple questions to find out which cluster you personally belong to.
Let’s look at several clustering methods to see how such a task can be accomplished.
K-means clustering
"And k rings were given to the race of Centroids, who above all else, desire power."
The goal of clustering is to create groups of data points such that points in different clusters are dissimilar while points within a cluster are similar.
With k-means clustering, we want to cluster our data points into K groups. A larger K creates smaller, more granular groups; a smaller K creates larger, less granular groups.
The output of the algorithm is a set of "labels" assigning each data point to one of the K groups. In k-means clustering, these groups are defined by creating a centroid for each group. The centroids are like the heart of each cluster: they "capture" the points closest to them and add them to their cluster.
You can think of the centroids as the people who become the center of attention at a party. They're like magnets: if there is only one such person, everyone gathers around them; if there are many of them, many smaller centers of activity form.
The steps of k-means clustering are as follows:
- Define K centroids. Initially these are random (there are also more efficient algorithms for initializing the centroids).
- Find the nearest centroid and update the cluster assignments. Each data point is assigned to the cluster of the centroid closest to it. The measure of "closeness" is a hyperparameter, usually Euclidean distance.
- Move the centroids to the center of their clusters. The new position of each centroid is the mean position of all the data points in its cluster.
Steps 2 and 3 are repeated until the centroids stop moving appreciably from one iteration to the next (i.e., until the algorithm converges).
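To make these steps concrete, here is a minimal from-scratch sketch in Python/NumPy. The function name `k_means` and its parameters are illustrative, not from the original article:

```python
# A minimal sketch of the k-means steps described above, using NumPy.
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move appreciably (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny usage example: two obvious blobs in 2-D.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = k_means(X, k=2)
```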
That, in a simplified nutshell, is how k-means clustering works! A visual demonstration of the algorithm can be viewed here; you can read it like a comic book. Each data point on the plane is colored according to its nearest centroid. You can see that the centroids (the larger blue, red, and green dots) start out random and are then quickly adjusted to capture their respective clusters.
Another real application of k-means clustering is classifying handwritten digits. Suppose we have images of digits represented as long vectors of pixel brightnesses. Say the images are black and white and 64×64 pixels in size; each pixel is a dimension, so the images live in a space with 64×64 = 4,096 dimensions. In this 4,096-dimensional space, k-means clustering lets us group the images by proximity, on the assumption that images close to each other show the same digit. The algorithm can achieve quite good results on digit recognition.
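As a rough sketch of that idea, here is how digit clustering might look with scikit-learn's built-in digits dataset. Note that those digits are 8×8 pixels (64 dimensions) rather than the 64×64 images assumed above, but the principle is the same:

```python
# A hedged sketch of clustering handwritten digits with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()            # digits.data has shape (1797, 64)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(digits.data)

# Each cluster should (mostly) gather images of the same digit; comparing
# cluster_ids against digits.target gives a rough sense of how well it worked.
```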
Hierarchical clustering
"Let's make a million options become seven options. Or five. Or twenty? Meh, we can decide later."
Hierarchical clustering is similar to regular clustering, except that your goal is to build a hierarchy of clusters. This is very useful when you don't know how many clusters you want to end up with. For example, suppose you are grouping items on an online marketplace like Etsy or Amazon. On the home page you only need a few large groups for easy navigation, but as the categories become more specific you need increasing levels of granularity, that is, more distinct clusters of items.
As output, in addition to the cluster assignments, the algorithm builds a nice tree that helps you understand the hierarchy among the clusters. You can then pick the number of clusters you want from this tree.
The steps of hierarchical clustering are as follows:
- Start with N clusters, one for each data point.
- Merge the two clusters closest to each other into one. Now you have N-1 clusters.
- Recalculate the distances between the clusters. There are many ways to do this (see this tutorial for details). One of them (average-linkage clustering) defines the distance between two clusters as the average of all the distances between their respective members.
- Repeat steps 2 and 3 until you are left with one cluster containing all N data points. You end up with a tree like the one shown below, also known as a dendrogram.
- Pick a number of clusters and draw a horizontal line through the dendrogram. For example, if you want K=2 clusters, you should draw a horizontal line at a distance of about 20,000; you get one cluster containing data points 8, 9, 11, 16 and another cluster containing the remaining data points. In general, the number of clusters you get equals the number of points where your horizontal line intersects the vertical lines of the dendrogram.
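Here is a minimal sketch of these steps using SciPy's hierarchical clustering utilities, with average linkage as described above. The data and parameter choices are illustrative:

```python
# A minimal sketch of agglomerative (bottom-up) hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)                 # 20 points in 2-D

# Build the merge tree: start with N clusters and repeatedly merge the two
# closest ones; 'average' uses average-linkage distance between clusters.
Z = linkage(X, method='average')

# "Cut" the dendrogram to get a chosen number of clusters, e.g. K = 2.
labels = fcluster(Z, t=2, criterion='maxclust')

# Compute the dendrogram layout; plotting it requires matplotlib.
tree = dendrogram(Z, no_plot=True)
```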
Source: Solver.com. For more details on hierarchical clustering, see this video.
Dimensionality reduction
"It is not the daily increase but daily decrease. Hack away at the unessential." – Bruce Lee
Dimensionality reduction looks a lot like compression: reduce the complexity of the data while preserving as much of the relevant structure as possible. Take a simple 128×128×3 pixel image (height × width × RGB channels): that's 49,152 dimensions of data. If you can reduce the dimensionality of the space these images live in without destroying too much of their meaningful content, then you've done a good job of dimensionality reduction.
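A tiny sketch of that dimension count, assuming NumPy and purely for illustration:

```python
# A 128x128 RGB image flattened into a single vector has
# 128 * 128 * 3 = 49,152 entries, i.e. 49,152 dimensions.
import numpy as np

image = np.zeros((128, 128, 3))   # height x width x RGB channels
vector = image.reshape(-1)
print(vector.shape)               # (49152,)
```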
We will look at two dimensionality reduction techniques that are commonly used in practice: principal component analysis (PCA) and singular value decomposition (SVD).
First, a little linear algebra: a look at spaces and bases.
You probably know the coordinate plane defined by the origin O = (0,0) and the basis vectors i = (1,0) and j = (0,1). It turns out you can choose a completely different basis and the math still works. For example, you can keep the origin O but choose i′ = (2,1) and j′ = (1,2) as the basis vectors. If you have the patience to do the math, you'll see that the point labeled (2,2) in the i′, j′ coordinate system is the point (6,6) in the standard i, j system.
Drawn using Mathisfun's "Interactive Cartesian Coordinates"
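If you would rather let a computer do the math, here is a quick check of that change-of-basis arithmetic (a small NumPy sketch, not from the original article):

```python
# The point labeled (2, 2) in the i' = (2, 1), j' = (1, 2) basis is
# 2*i' + 2*j' = (6, 6) in standard coordinates.
import numpy as np

basis = np.array([[2, 1],    # i'
                  [1, 2]])   # j'
coords_in_new_basis = np.array([2, 2])

point_in_standard_coords = coords_in_new_basis @ basis
print(point_in_standard_coords)   # [6 6]
```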
This means we can change the basis of a space. Now imagine a much higher-dimensional space, say 50,000 dimensions. You can pick a basis for this space and then select only the 200 most important vectors from that basis. These basis vectors are called principal components, and the subset you select forms a new space that has fewer dimensions than the original space but retains as much of the data's complexity as possible.
To select the most important principal components, we look at how much of the data's variance each one captures and rank the components by that metric.
Another way to think about PCA is that it remaps the space our data lives in to a more compact space. The transformed space has fewer dimensions than the original.
Using only the first few dimensions of the remapped space, we can already start to understand how the dataset is organized. This is the purpose of dimensionality reduction: reduce complexity (here, the number of dimensions) while preserving structure (here, the variance). Here is a paper by Samer on using PCA (together with techniques like diffusion mapping) to try to make sense of the Wikileaks cables.
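As a hedged sketch of what this looks like in practice, here is PCA applied to the 64-dimensional digits dataset with scikit-learn; the dataset and the choice of 10 components are illustrative, not from the article:

```python
# Reduce 64-dimensional digit images to a handful of principal components
# while keeping as much variance as possible.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                    # shape (1797, 64)

pca = PCA(n_components=10)                # keep the 10 most important components
X_reduced = pca.fit_transform(X)          # shape (1797, 10)

# How much of the original variance each retained component explains:
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance preserved
```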
Singular Value Decomposition (SVD)
Suppose we represent our data as a large m × n matrix A. SVD lets us decompose this large matrix into the product of three smaller matrices: U (m × r), a diagonal matrix Σ (r × r), and V (r × n), where r is a small number.
The values in this r × r diagonal matrix Σ are called singular values. What's neat about them is that they can be used to compress the original matrix: if you drop the smallest 20% of the singular values and the columns of U and V associated with them, you save a lot of space while still representing the original matrix very well.
To get a better idea of what this means, let’s take a look at a picture of a puppy:
We will use code from Andrew Gibiansky's article on SVD. First, we find that if we sort the singular values (the values of the matrix Σ) by magnitude, the first 50 singular values contain 85% of the total magnitude of Σ.
Based on this fact, we can discard the remaining singular values (setting them to 0) and keep only a "rank 50" version of the dog image. Here, we create versions of the dog photo at rank 200, 100, 50, 30, 20, 10, and 3. Predictably, the lower the rank, the lower the image quality. But suppose we decide the rank 30 dog still looks pretty good; let's see how much compression we have achieved. The original image matrix contains 305×275 = 83,875 values, while the rank 30 representation needs 305×30 + 30 + 30×275 = 17,430 values. That is almost five times fewer values, with very little loss of quality. The reason this count works out is that when we compute UΣ′V, the parts of U and V that get multiplied by 0 can also be discarded (Σ′ is a modified version of Σ containing only the first 30 singular values).
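Here is a minimal sketch of that rank-k reconstruction with NumPy's SVD. The random matrix stands in for the grayscale dog photo, so the numbers match the storage arithmetic above but not the actual image:

```python
# Low-rank approximation of a matrix via truncated SVD.
import numpy as np

A = np.random.rand(305, 275)              # stand-in for a 305x275 grayscale image
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 30                                    # keep only the top 30 singular values
A_rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Storage needed: 305*k + k + k*275 values instead of 305*275.
print(305 * k + k + k * 275)              # 17,430 for k = 30
print(305 * 275)                          # 83,875
```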
Unsupervised learning is often used to preprocess data. Usually that means compressing it in some meaning-preserving way, such as with PCA or SVD, before feeding it to a deep neural network or another supervised learning algorithm.
Please continue!
Now that you've finished this chapter, you've earned a bad unsupervised learning joke that will never be mentioned again. Here it is:
Person-in-joke-#1: Y would u ever need to use unsupervised tho?
Person-in-joke-#2: Y? There's no Y.
Here’s chapter 4: Neural networks and deep learning.
3a. K-means clustering
Play around with this clustering demo to build intuition for how the algorithm works. Then take a look at this implementation of k-means clustering for handwritten digits, along with the related tutorial.
For a good reference on SVD, there's nothing better than Andrew Gibiansky's article.