
Unsupervised learning

First of all, K-means is a typical unsupervised learning algorithm, in the same family as PCA for dimensionality reduction. These algorithms came up in my earlier machine learning articles, but there I only applied them with sklearn and never analyzed how they are actually implemented. So before anything else we must be clear about what "unsupervised" means. I will use the simplest language I can to describe the algorithm (K-means itself is not complicated; what I find genuinely hard in real life is quantifying all the parameters and modeling, that is, building a reasonable and useful data set. For example, imagine building a data set to analyze, with KNN, which kinds of traits girls prefer in boys, so as to improve male charm and achieve scientific matchmaking. Setting aside whether that is ethical engineering, constructing such a data set would face many difficulties.) So, about unsupervised learning:

Unsupervised learning is a form of machine learning that does not require manually labeled data. It stands alongside strategies such as supervised learning and reinforcement learning. In supervised learning, the typical tasks are classification and regression analysis, which require manually prepared labels; unsupervised learning, by contrast, must find structure in the raw input on its own, and that is what distinguishes it from the other two.

That is the encyclopedia definition, so here is a concrete example. Picture a plane with a bunch of dots on it, and I order you to sort them into three categories. How? The dots are points on a plane, so each has only two dimensions, x and y. That is the simplest case. A more complicated one is the Instacart Market Basket Analysis: a machine learning problem where you categorize user behavior and find out where users like to gather. It is also, in fact, a classification problem, a region classification, and the regions can likewise be treated on a plane, although the actual handling is a bit more involved. We will use it as the case study later.
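The "bunch of dots on a plane" example can be sketched in a few lines of sklearn. This is just an illustration I am adding here: `make_blobs`, the sample size, and the seed are all arbitrary choices, not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 points on a plane (2 dimensions: x and y), drawn around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Ask K-means to divide the points into three categories
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)        # each point gets a cluster id: 0, 1 or 2

print(labels[:10])
print(km.cluster_centers_.shape)  # three center points on the plane
```

Note that no labels went in: the algorithm was given only the raw points and the number of categories.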

Principle analysis

Let's walk through the algorithm concretely.

For this picture, let's say we want three categories. What should we do?

Select center point

First of all, we must choose three center points. Since this is unsupervised learning, we do not know the final classification at all, so naturally we have no idea where the center points should be; at the beginning we simply pick some to start with.

Cluster assignment

Now that we have the center points, what do we do next? No doubt, we naturally judge by distance: see which center point each point is closest to; whichever center is closest, that point belongs to the same category as that center. There are several ways to measure distance. The most typical is the Euclidean distance, the straight-line distance between two points. Another is the Manhattan distance, the sum of the absolute coordinate differences.
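The two distance measures are easy to compare on a concrete pair of points (the point and center values below are just made-up numbers for illustration):

```python
import numpy as np

p = np.array([1.0, 2.0])   # a sample point
c = np.array([4.0, 6.0])   # a candidate center point

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((p - c) ** 2))

# Manhattan distance: sum of absolute differences per coordinate
manhattan = np.sum(np.abs(p - c))

print(euclidean)  # 5.0 (a 3-4-5 right triangle)
print(manhattan)  # 7.0 (3 + 4)
```

K-means conventionally uses the (squared) Euclidean distance, since the centroid update is the mean, which minimizes exactly that quantity.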

demo

The following set of pictures illustrates the process very vividly.

Pure math

Now comes the most tedious part, where we make mathematically precise what we said earlier.

Algorithm process

1) For the K-means algorithm, the first thing to pay attention to is the choice of the value K. Generally speaking, we select an appropriate K based on prior experience with the data; if there is no prior knowledge, a suitable K can be chosen through cross-validation.

2) After determining k, we need to select k initial centroids, like the random centroids in figure b above. Since K-means is a heuristic method, the choice of the k initial centroids has a great influence on the final clustering result and the running time, so they should be chosen with care, and preferably should not be too close to each other.
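Both points can be explored with sklearn. In this sketch the blob data and all parameter values are assumptions of mine, purely for illustration: the inertia curve's "elbow" suggests a K (point 1), and the `init="k-means++"` option, which is sklearn's default, spreads the initial centroids apart instead of picking them uniformly at random (point 2).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn around 4 true centers
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Point 1: sweep K and record the inertia (within-cluster sum of squares);
# the "elbow" where it stops dropping sharply hints at a reasonable K.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}
for k, v in inertias.items():
    print(k, round(v, 1))

# Point 2: k-means++ initialization keeps the initial centroids far apart
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)
```

Running several random restarts (`n_init`) and keeping the best fit is the usual guard against a bad initialization.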

Ok, now let’s summarize the traditional K-means algorithm flow.

The input is the sample set D = {x1, x2, …, xm}, the number of clusters k, and the maximum number of iterations N.

The output is the cluster partition C = {C1, C2, …, Ck}.

1) Randomly select k samples from data set D as the initial k centroid vectors: {μ1, μ2, …, μk}

2) For n = 1, 2, …, N:

a) Initialize the cluster partition: Ct = ∅ for t = 1, 2, …, k

b) For i = 1, 2, …, m: compute the distance from sample xi to each centroid vector μj (j = 1, 2, …, k): dij = ||xi − μj||₂². Label xi with the category λi whose dij is minimal, and update Cλi = Cλi ∪ {xi}

c) For j = 1, 2, …, k: recompute the centroid of all the sample points in Cj: μj = (1/|Cj|) Σ x∈Cj x

d) If none of the k centroid vectors changed, go to step 3)

3) Output the cluster partition C = {C1, C2, …, Ck}
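The steps above can be sketched directly in NumPy. This is a minimal illustration of the flow (random initialization, assignment by squared Euclidean distance, centroid update, stop when nothing moves), not a production implementation; the tiny 1-D demo data at the bottom is my own made-up example.

```python
import numpy as np

def kmeans(D, k, N=100, seed=0):
    """Minimal K-means following the steps above."""
    rng = np.random.default_rng(seed)
    # 1) randomly pick k samples from D as the initial centroids
    mu = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(N):                        # 2) at most N iterations
        # b) distance from every sample to every centroid (squared Euclidean),
        #    then assign each sample to its nearest centroid
        d = ((D[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # c) recompute each centroid as the mean of its assigned samples
        #    (keep the old centroid if a cluster happens to be empty)
        new_mu = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                           else mu[j] for j in range(k)])
        # d) stop once no centroid vector changed
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

# Tiny demo: two obvious groups on a line
D = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels, mu = kmeans(D, k=2)
print(labels)
print(sorted(mu.ravel()))   # centroids near 0.1 and 10.1
```

Note how this directly mirrors the pseudocode: the only free choices are the distance (squared Euclidean) and the handling of empty clusters.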

The advantages and disadvantages

The main advantages of K-means are:

1) The principle is relatively simple, and the implementation is also very easy, with fast convergence speed.

2) Good clustering results.

3) The algorithm is highly interpretable.

4) The only main parameter to tune is the number of clusters K.

The main disadvantages of K-means are:

1) The value of K is hard to choose well.

2) It is hard to converge well on non-convex data sets.

3) If the hidden categories are unbalanced, for example if their sample counts differ severely or their variances differ, the clustering result is poor.

4) Because an iterative method is used, the result obtained is only a local optimum.

5) It is sensitive to noise and outliers.
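Point 5 is easy to demonstrate: a single extreme value can capture a centroid all by itself. The 1-D data below is an artificial example of mine, not from the case study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight groups plus one extreme outlier
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [100.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The outlier grabs one centroid for itself, so the two real groups
# are forced to share the other cluster.
print(sorted(km.cluster_centers_.ravel()))
```

Removing or down-weighting outliers first (or using a more robust variant such as k-medoids) avoids this failure mode.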

Case Study (Instacart Market Basket Analysis)

For this part I will just use sklearn directly; the algorithm principle is very simple, and implementing it yourself poses basically no difficulty. API:

from sklearn.cluster import KMeans

Evaluation:

from sklearn.metrics import silhouette_score

The goal is to explore users' preferences broken down by item category; in this case, that means using cluster analysis to group users by the categories of items they like.

This has already been done

Data set:

Link: pan.baidu.com/s/1P9xwvyYA… Extraction code: 6666

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
#1. Get the data
order_products=pd.read_csv("./data/order_products__prior.csv")
products=pd.read_csv("./data/products.csv")
orders=pd.read_csv("./data/orders.csv")
aisles=pd.read_csv("./data/aisles.csv")
# 2. Merge tables
# join aisles with products on aisle_id, then add order_products and orders
tab1=pd.merge(aisles,products,on="aisle_id")
tab2=pd.merge(tab1,order_products,on="product_id")
tab3=pd.merge(tab2,orders,on="order_id")
#3. Find the relationship between user_id and aisle
table=pd.crosstab(tab3["user_id"],tab3["aisle"])
data=table[:10000]# Take a portion to save time
# 4. PCA dimension reduction
# 4.1 Instantiate the transformer class
transfer=PCA(n_components=0.95)
# 4.2 Call fit_transform
data_new=transfer.fit_transform(data)

#5. The estimator process
estimator=KMeans(n_clusters=3)
estimator.fit(data_new)
y_predict=estimator.predict(data_new)
# Model evaluation - profile coefficient
silhouette_num=silhouette_score(data_new,y_predict)
print("Contour coefficient: \n",silhouette_num)
