
Notes of Andrew Ng’s Machine Learning —— (13) Unsupervised Learning

This article mainly introduces unsupervised learning and the K-means clustering algorithm.

Clustering

Unsupervised Learning: Introduction

In supervised learning, we are given a labeled training set (pairs $(x, y)$, with the label $y$ given) and fit a hypothesis to it:

In contrast, in unsupervised learning problems we are given data that does not have any labels associated with it ($x$ only, no $y$):

The slide above shows a set of points with no labels. So, in unsupervised learning we give this sort of unlabeled training set to an algorithm and ask the algorithm to find some structure in the data for us.

Given this data set, one type of structure we might have an algorithm find is that the points are grouped into two separate clusters (drawn in green). An algorithm that finds clusters like the ones just circled is called a clustering algorithm.

K-Means Algorithm

In the clustering problem we are given an unlabeled data set and we would like to have an algorithm automatically group the data into coherent subsets or into coherent clusters for us.

The K-Means Algorithm is by far the most popular and widely used clustering algorithm.

Given an unlabeled data set to group (cluster) into K clusters (let's say two, for example), what the K-means algorithm does is:

  1. Randomly initialize two points, called the cluster centroids.
  2. Cluster assignment step: go through each of the examples and, depending on which cluster centroid it is closer to, assign each point to one of the K cluster centroids (illustratively, color it the same color as its centroid).
  3. Move centroid step: take the K cluster centroids and move each one to the average of the points colored the same color.

We keep moving the centroids and then redoing the cluster assignment, looping until the cluster centroids no longer change (i.e. K-means has converged). At that point it has done a good job of finding the clusters in the data.

K-means algorithm

Input:


  • $K$ (number of clusters)

  • $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$ (training set)

    $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)

Output:


  • $K$ subsets (clusters)

Algorithm:


$$
\begin{array}{l}
\textrm{Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \cdots, \mu_K \in \mathbb{R}^n$}\\
\textrm{Repeat }\{\\
\qquad \textrm{for $i=1$ to $m$:}\qquad\textrm{// Cluster assignment step}\\
\qquad\qquad c^{(i)} := \arg\min_k ||x^{(i)}-\mu_k||^2 \\
\qquad\qquad \textrm{/* set $c^{(i)}$ to the index (from $1$ to $K$) of the cluster centroid closest to $x^{(i)}$ */}\\
\qquad \textrm{for $k=1$ to $K$:}\qquad\textrm{// Move centroid step}\\
\qquad\qquad \textrm{if cluster centroid $\mu_k$ has 0 points assigned to it:}\\
\qquad\qquad\qquad \textrm{eliminate that cluster centroid, continue}\qquad\textrm{// end up with $K-1$ clusters}\\
\qquad\qquad \mu_k := \textrm{average (mean) of points assigned to cluster $k$}\\
\}
\end{array}
$$
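To make the loop above concrete, here is a minimal NumPy sketch of K-means (the course itself uses Octave/MATLAB; the function name `k_means` and its signature are my own, not from the course):

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=None):
    """Minimal K-means sketch. X: (m, n) data matrix, K: number of clusters."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    # Randomly initialize: pick K distinct training examples as the centroids
    centroids = X[rng.choice(m, size=K, replace=False)].copy()

    for _ in range(max_iters):
        # Cluster assignment step: c[i] = index of the centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)

        # Move centroid step: mu_k = mean of the points assigned to cluster k;
        # a centroid with no points assigned to it is simply dropped
        new_centroids = np.array([X[c == k].mean(axis=0)
                                  for k in range(len(centroids)) if np.any(c == k)])

        # Converged once the centroids stop moving
        if new_centroids.shape == centroids.shape and np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return c, centroids
```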

K-Means for non-separated clusters

K-means can do well on both well-separated clusters and non-separated clusters:

K-means Optimization Objective

Notations:

  • $c^{(i)}$ = index of cluster ($1, 2, \cdots, K$) to which example $x^{(i)}$ is currently assigned.
  • $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$).
  • $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned.

For example, let's say $x^{(1)}$ is currently assigned to cluster $5$. In this case $c^{(1)} = 5$ and $\mu_{c^{(1)}} = \mu_5$, where the cluster centroid of cluster $5$ is denoted $\mu_5$.

Optimization objective:


$$
J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)=\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}-\mu_{c^{(i)}}||^2
$$

$$
\min_{\begin{array}{c}c^{(1)},\cdots,c^{(m)},\\ \mu_1,\cdots,\mu_K\end{array}} J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)
$$

Our cost function $J$ is also called the distortion.

In the K-means algorithm, what the cluster assignment step is doing is minimizing $J$ with respect to $c^{(1)}, \cdots, c^{(m)}$ (holding $\mu_1, \cdots, \mu_K$ fixed), and what the move centroid step is doing is minimizing $J$ with respect to $\mu_1, \cdots, \mu_K$.
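As a quick sketch, the distortion can be computed directly from the assignments and centroids; the array shapes follow the `k_means` sketch above (the helper name `distortion` is mine):

```python
import numpy as np

def distortion(X, c, centroids):
    """J(c, mu): average squared distance from each example to its assigned centroid."""
    return float(np.mean(np.sum((X - centroids[c]) ** 2, axis=1)))
```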

Random initialization

Here is a way to randomly initialize our $K$ cluster centroids:

  1. We should have $K < m$.
  2. Randomly pick $K$ training examples.
  3. Set $\mu_1, \cdots, \mu_K$ equal to these $K$ examples.

I.e., we pick $K$ distinct random integers $i_1, \cdots, i_K$ from $\{1, \cdots, m\}$ and set $\mu_1 = x^{(i_1)}, \mu_2 = x^{(i_2)}, \cdots, \mu_K = x^{(i_K)}$.
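A small sketch of this initialization (the function name `random_init` is my own):

```python
import numpy as np

def random_init(X, K, seed=None):
    """Pick K distinct training examples (requires K < m) as the initial centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=K, replace=False)  # K distinct random indices i_1, ..., i_K
    return X[idx].copy()                             # mu_k = x^(i_k)
```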

Local optima

Because of the random initialization, we sometimes get a good clustering (a global optimum, like the one at the top right of the picture) and we may also get a bad one (a local optimum, like the two at the bottom right of the picture).

To avoid having the K-means algorithm stop at a local optimum, we can try this:


$$
\begin{array}{l}
\textrm{For $i=1$ to $100$ (or anywhere from 50 to 1000) }\{\\
\qquad\textrm{Randomly initialize K-means.}\\
\qquad\textrm{Run K-means. Get $c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K$}\\
\qquad\textrm{Compute cost function (distortion):}\\
\qquad\qquad J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)\\
\}\\
\textrm{Pick the clustering that gave the lowest $J$.}
\end{array}
$$
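In code, this could look roughly like the following, reusing the `k_means` and `distortion` sketches from earlier (both are my own helpers, not course code):

```python
def best_of_n_runs(X, K, n_runs=100):
    """Run K-means n_runs times with different random initializations
    and keep the clustering with the lowest distortion J."""
    best_J, best = float("inf"), None
    for _ in range(n_runs):
        c, centroids = k_means(X, K)        # fresh random initialization each run
        J = distortion(X, c, centroids)
        if J < best_J:
            best_J, best = J, (c, centroids)
    return best
```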

Choosing the Number of Clusters

The most common way to choose the number of clusters $K$ is actually to choose it by hand.

The elbow method is one workable way to choose the value of $K$:
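A rough sketch of the elbow method, again assuming the `k_means` and `distortion` helpers from the earlier sketches: run K-means for a range of values of $K$, plot the distortion against $K$, and look for the "elbow" of the curve.

```python
import matplotlib.pyplot as plt

def elbow_curve(X, max_K=10):
    """Plot the distortion J for K = 1..max_K and look for an 'elbow' in the curve."""
    Ks = list(range(1, max_K + 1))
    Js = []
    for K in Ks:
        c, centroids = k_means(X, K)
        Js.append(distortion(X, c, centroids))
    plt.plot(Ks, Js, marker="o")
    plt.xlabel("K (number of clusters)")
    plt.ylabel("Cost J (distortion)")
    plt.show()
```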

However, the elbow method does not always make sense, for example in a situation like this:

We can hardly find an elbow in it.

So, here is another way to think about it:

Sometimes you are running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.

T-shirt sizing for example:

In particular, what we can do is think about this from the perspective of the T-shirt business and ask: "Well, if I have five segments, then how well will my T-shirts fit my customers, and so how many T-shirts can I sell? How happy will my customers be?" What really makes sense, from the perspective of the T-shirt business, is to ask whether I want to have more T-shirt sizes so that my T-shirts fit my customers better, or fewer T-shirt sizes so that I make fewer sizes of T-shirts and can sell them to the customers more cheaply. And so the T-shirt selling business might give you a way to decide between three clusters versus five clusters.