Public account: You and the Cabin | Author: Peter | Editor: Peter
Hello, I’m Peter
Many readers have asked me: are there any good case studies for data analysis and data mining? The answer is yes: they're all on Kaggle.
You just have to take the time to learn from them, or even play along. Peter himself has never entered a competition, but he often visits Kaggle to study how the top competitors approach and solve problems.
To document these experts' methods and improve his own skills, Peter decided to start a column called Kaggle Case Sharing.
Case analyses will be posted irregularly. The ideas all come from top performers online, especially Top 1 write-ups; Peter mainly organizes the ideas and studies the techniques.
Today I'm sharing a clustering case based on the supermarket customer segmentation dataset; the official dataset page is linked under: supermarket
To make it easy to practice along, reply "supermarket" to the public account and you will receive this dataset ~
The notebook referenced here is the dataset's No. 1 ranked notebook.
Import libraries
# Data processing
import numpy as np
import pandas as pd
# KMeans clustering
from sklearn.cluster import KMeans
# drawing library
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
py.offline.init_notebook_mode(connected = True)
EDA on the data
Import data
First we import the data set:
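A minimal loading sketch (the CSV file name is an assumption; adjust the path to match your download):
# Read the CSV (file name assumed from the Kaggle download)
df = pd.read_csv("Mall_Customers.csv")
df.head()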
The data has five attribute fields: customer ID, gender, age, annual income, and spending score.
Data exploration
1. Data shape
df.shape
# the results
(200, 5)
The data has 200 rows and 5 columns.
2. Missing values
df.isnull().sum()
# the results
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
As you can see, all fields are complete with no missing values
3. Data types
df.dtypes
# the results
CustomerID int64
Gender object
Age int64
Annual Income (k$) int64
Spending Score (1-100) int64
dtype: object
All fields are int64 except Gender, which is a string (object)
4. Descriptive statistics
Descriptive statistics let us view the statistical parameters of the numeric fields, such as count, mean, standard deviation, extremes, and quartiles (including the median)
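In pandas this is a single call; df.describe() reports all of these for every numeric column:
df.describe()  # count, mean, std, min, 25%/50%/75% quantiles, max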
To facilitate subsequent processing and display, two preparation steps are taken:
# 1. Set the drawing style
plt.style.use("fivethirtyeight")
# 2. Take out the three fields for key analysis
cols = df.columns[2:].tolist()
cols
# the results
['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
Histograms of the three attributes
Plot histograms of Age, Annual Income (k$) and Spending Score (1-100) to observe their overall distributions:
# drawing
plt.figure(1, figsize=(15, 6))  # Canvas size
n = 0
for col in cols:
    n += 1  # Subplot position
    plt.subplot(1, 3, n)  # Subplot
    plt.subplots_adjust(hspace=0.5, wspace=0.5)  # Adjust width and height spacing
    sns.distplot(df[col], bins=20)  # Draw the histogram
    plt.title(f'Distplot of {col}')  # Title
plt.show()  # Display the figure
Gender factors
Gender statistics
Let's see how many men and women are in this dataset; whether gender affects the overall analysis will be considered later.
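The original chart is not reproduced here; a minimal sketch for the counts would be:
# Count of each gender
print(df["Gender"].value_counts())
# Bar chart of the counts
sns.countplot(x="Gender", data=df)
plt.title("Gender Counts")
plt.show()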
Data distribution by gender
sns.pairplot(df.drop(["CustomerID"], axis=1),
             hue="Gender",  # Grouping field
             aspect=1.5)
plt.show()
From the pairwise distribution plots above, gender appears to have little influence on the other three fields
The relationship between age and annual income by gender
plt.figure(1, figsize=(15, 6))  # Figure size
for gender in ["Male", "Female"]:
    plt.scatter(x="Age", y="Annual Income (k$)",  # The two fields to analyze
                data=df[df["Gender"] == gender],  # Data for one gender
                s=200, alpha=0.5, label=gender)  # Marker size, transparency, legend label
# Axis and title settings
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
plt.title("Age vs Annual Income w.r.t Gender")
# Display the figure
plt.show()
The relationship between annual income and spending score by gender
plt.figure(1, figsize=(15, 6))
for gender in ["Male", "Female"]:  # See the explanation above
    plt.scatter(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=df[df["Gender"] == gender],
                s=200, alpha=0.5, label=gender)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title("Annual Income vs Spending Score w.r.t Gender")
plt.show()
Data distribution by gender
Observe the data distribution with violin plots and swarm plots:
# Swarmplots and violinplots
plt.figure(1, figsize=(15, 7))
n = 0
for col in cols:
    n += 1  # Subplot order
    plt.subplot(1, 3, n)  # The nth subplot
    plt.subplots_adjust(hspace=0.5, wspace=0.5)  # Adjust width and height spacing
    # Draw two plots per column, grouped by Gender
    sns.violinplot(x=col, y="Gender", data=df, palette="vlag")
    sns.swarmplot(x=col, y="Gender", data=df)
    # Axis and title settings
    plt.ylabel("Gender" if n == 1 else ' ')
    plt.title("Violinplots & Swarmplots" if n == 2 else ' ')
plt.show()
From these plots we can:
- View how each field is distributed for each gender
- Check for outliers and extreme values (see the boxplot sketch below)
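As a quick numeric complement (an addition, not in the original notebook), boxplots make outliers easy to spot:
# Boxplots of the three fields to check for outliers
plt.figure(1, figsize=(15, 4))
for i, col in enumerate(cols, start=1):
    plt.subplot(1, 3, i)
    sns.boxplot(y=df[col])
    plt.title(col)
plt.show()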
Attribute correlation analysis
We mainly look at pairwise regressions between the attributes:
cols = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']  # The three attributes for correlation analysis
plt.figure(1, figsize=(15, 6))
n = 0
for x in cols:
    for y in cols:
        n += 1  # n increases each loop, moving to the next subplot
        plt.subplot(3, 3, n)  # 3x3 grid, the nth plot
        plt.subplots_adjust(hspace=0.5, wspace=0.5)  # Spacing between subplots
        sns.regplot(x=x, y=y, data=df, color="#AE213D")  # Data and color for the plot
        plt.ylabel(y.split()[0] + " " + y.split()[1] if len(y.split()) > 1 else y)
plt.show()
The specific graph is:
The figure above shows two things:
- The main diagonal shows each attribute plotted against itself, a perfectly proportional relationship
- The other panels show pairs of attributes: the scattered data plus a fitted trend line
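To quantify these pairwise relationships (an addition, not in the original notebook), the correlation matrix can be computed directly:
# Pearson correlation between the three attributes
df[cols].corr()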
Clustering between two attributes
We won't explain the principles and workflow of the clustering algorithm here; basic familiarity is assumed
K value selection
We determine the value of k by drawing an elbow plot of the data. References:
1. Parameter explanation from the official website: scikit-learn.org/stable/modu…
2. A Chinese explanation: blog.csdn.net/qq_34104548…
df1 = df[['Age', 'Spending Score (1-100)']].iloc[:, :].values  # Data to be fitted
inertia = []  # Empty list to store the sum of squared distances to the centroids
for k in range(1, 11):  # Try k from 1 to 10 (5 or 10 is a common rule-of-thumb upper bound)
    algorithm = (KMeans(n_clusters=k,  # k value
                        init="k-means++",  # Initialization method
                        n_init=10,  # Number of random restarts
                        max_iter=300,  # Maximum number of iterations
                        tol=0.0001,  # Convergence tolerance
                        random_state=111,  # Random seed
                        algorithm="full"))  # One of "auto", "full", "elkan"
    algorithm.fit(df1)  # Fit the data
    inertia.append(algorithm.inertia_)  # Sum of squared distances to the nearest centroid
Plot how the inertia (sum of squared distances) changes with k:
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')  # The data is drawn twice with different markers
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)
plt.xlabel("Choice of K")
plt.ylabel("Inertia")
plt.show()
From the elbow plot we find that k=4 is appropriate, so we fit the final model with k=4
Clustering modeling
algorithm = (KMeans(n_clusters=4,  # k=4
                    init="k-means++",
                    n_init=10,
                    max_iter=300,
                    tol=0.0001,
                    random_state=111,
                    algorithm="elkan"))
algorithm.fit(df1)  # Fit the data
After fitting the data, we get the cluster labels and the four centroids:
labels1 = algorithm.labels_  # Cluster assignments (4 classes)
centroids1 = algorithm.cluster_centers_  # Coordinates of the final centroids
print("labels1:", labels1)
print("centroids1:", centroids1)
To show the classification effect on the raw data, the official notebook does the following, which I personally find a bit tedious:
Perform data merge:
Show the classification effect:
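The step that builds xx, yy and Z is not shown in the excerpt above; a minimal sketch of the standard approach (predict a cluster label for every point on a grid covering the plane, assuming the fitted algorithm and df1 from earlier) would be:
# Grid covering the Age / Spending Score plane
h = 0.02  # grid step size
x_min, x_max = df1[:, 0].min() - 1, df1[:, 0].max() + 1
y_min, y_max = df1[:, 1].min() - 1, df1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict a cluster label for every grid point
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])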
plt.figure(1, figsize=(14, 5))
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z,interpolation="nearest",
extent=(xx.min(),xx.max(),yy.min(),yy.max()),
cmap = plt.cm.Pastel2,
aspect = 'auto',
origin='lower')
plt.scatter(x="Age",
y='Spending Score (1-100)',
data = df ,
c = labels1 ,
s = 200)
plt.scatter(x = centroids1[:,0],
y = centroids1[:,1],
s = 300 ,
c = 'red',
alpha = 0.5)
plt.xlabel("Age")
plt.ylabel("Spending Score(1-100)")
plt.show()
If it were me, what would I do? Pandas + Plotly.
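First merge the cluster labels back into the data with pandas (a sketch; the name df3 and the Labels column are taken from the plotting call below):
# Attach the k=4 cluster labels to a copy of the data
df3 = df.copy()
df3["Labels"] = labels1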
Take a look at the results of the classification visualization:
px.scatter(df3, x="Age", y="Spending Score (1-100)", color="Labels", color_continuous_scale="rainbow")
The process above clusters on Age and Spending Score (1-100). The official notebook applies the same method to the Annual Income (k$) and Spending Score (1-100) fields.
The result splits into five clusters; a sketch of that run follows:
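That code is not reproduced in this excerpt; following the same pattern, a sketch would be (the names X2 and algorithm2 are mine):
# Cluster on Annual Income and Spending Score, k=5 per the elbow on this pair
X2 = df[['Annual Income (k$)', 'Spending Score (1-100)']].values
algorithm2 = (KMeans(n_clusters=5,
                     init='k-means++',
                     n_init=10,
                     max_iter=300,
                     tol=0.0001,
                     random_state=111,
                     algorithm='elkan'))
algorithm2.fit(X2)
labels_income = algorithm2.labels_  # Five cluster labels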
Clustering of 3 attributes
Cluster according to Age, Annual Income and Spending Score, and finally draw a 3D graph.
K value selection
The method is the same, but 3 fields are selected.
X3 = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].iloc[:, :].values  # Select the three fields
inertia = []
for n in range(1, 11):
    algorithm = (KMeans(n_clusters=n,
                        init='k-means++',
                        n_init=10,
                        max_iter=300,
                        tol=0.0001,
                        random_state=111,
                        algorithm='elkan'))
    algorithm.fit(X3)  # Fit the data
    inertia.append(algorithm.inertia_)
Draw the elbow plot to determine k:
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
We finally choose k=6 for clustering
Model building
algorithm = (KMeans(n_clusters=6,  # The chosen value of k
                    init="k-means++",
                    n_init=10,
                    max_iter=300,
                    tol=0.0001,
                    random_state=111,
                    algorithm="elkan"))
algorithm.fit(X3)  # Fit the three-attribute data
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_
print(labels2)
print(centroids2)
Drawing
We finally chose Plotly to show the 3D clustering:
df["labels2"] = labels2
trace = go.Scatter3d(
x=df["Age"],
y= df['Spending Score (1-100)'],
z= df['Annual Income (k$)'],
mode='markers',
marker = dict(
color=df["labels2"],
size=20,
line=dict(color=df["labels2"],width=12),
opacity=0.8
)
)
data = [trace]
layout = go.Layout(
margin=dict(l=0,r=0,b=0,t=0),
title="six Clusters",
scene=dict(
xaxis=dict(title="Age"),
yaxis = dict(title = 'Spending Score'),
zaxis = dict(title = 'Annual Income')
)
)
fig = go.Figure(data=data,layout=layout)
fig.show()
The interactive chart rendered above shows the final six-cluster effect.
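To interpret the six segments (an addition, not in the original article), a quick profile of each cluster's averages:
# Mean Age, Annual Income and Spending Score per cluster
df.groupby("labels2")[cols].mean()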