Cluster analysis has become one of the most important methods in data analysis, machine learning, and data science.

The general idea of clustering is to divide a set of objects, each described by several features, into a number of groups called clusters.

There are many clustering algorithms, but here I will introduce you to one of the simplest and most widely used: K-means clustering.

This article covers the following topics:

  • The data set
  • The algorithm
  • Choosing the number of clusters
  • Cluster validation: silhouette analysis

Now, let's dive into the details.

The data set

This article applies and validates the full clustering workflow on a dataset provided by San Francisco Open Data: monthly airline statistics at San Francisco International Airport.

In our work, we will use the following Python toolkits:

  • matplotlib (and seaborn)
  • pandas
  • scikit-learn
  • numpy

First, as with every data science task, we have to familiarize ourselves with the data we're dealing with.

import pandas as pd

df = pd.read_csv("Datasets/Air_Traffic_Passenger_Statistics.csv")
df.head()

As the output of df.head() shows, our dataset has many columns, but for the cluster analysis we will use the operating airline, the geographic region, the passenger count, and the number of flights held by each airline.

So, first, to identify potential outliers and get a sense of our data, let's use the Python data visualization library Seaborn to draw some plots.
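The code that produced the airline frequency chart discussed next is not shown in the original; a minimal sketch that would produce it, assuming the "Operating Airline" column used later in this article, looks like this:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 15))
sns.countplot(x="Operating Airline", palette="Set3", data=df)
plt.xticks(rotation=90)
plt.ylabel("Number of flights held")
plt.show()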

From the airline frequency chart, we can see that some airlines hold far more flights than others. These are United Airlines and United Airlines (before July 1, 2013), which the dataset lists as separate carriers.

So, these airlines are potential outliers for us.

The reason we look for and eliminate outliers will be explained later.

Let's take a look at the frequency chart of the geographic regions the flights belong to.

plt.figure(figsize=(15, 15))
sns.countplot(x="GEO Region", palette="Set3", data=df)
plt.xticks(rotation=90)
plt.ylabel("Number of flights held")
plt.show()

As we can see, the United States ranks first among the regions, because our outlier airlines (United Airlines and United Airlines before July 1, 2013) are US carriers. It is clear that the regions, or the airlines, can be divided into groups based on the number of flights they hold.

The algorithm

K-means clustering is one of the most popular clustering algorithms nowadays. It was introduced in the 1950s by Hugo Steinhaus.

The main idea of the algorithm is to partition a set of points X in n-dimensional space into K clusters, each represented by a centroid in C, so as to minimize the objective function: the mean squared distance between the points and their corresponding centroids.
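Concretely, the standard K-means objective (the within-cluster sum of squared distances) can be written as:

$$ J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 $$

where $C_k$ is the $k$-th cluster and $\mu_k$ is its centroid.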

Let’s look at the specific implementation steps of the algorithm:

First, the centroids are initialized randomly. Then, the following two steps are repeated until convergence (that is, until the centroids are the same as in the previous iteration); a minimal code sketch follows the list:

  • Assign each point in X to the cluster of its closest centroid
  • Recompute the centroid of each cluster as the mean of its points along each dimension
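Here is a minimal NumPy sketch of those two steps, a toy implementation for illustration only, not the scikit-learn estimator we use below:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-means for illustration; empty clusters are not handled."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged: centroids identical to the previous iteration
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids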

As the sketch makes clear, the algorithm works entirely with distances, which is why it is sensitive to outliers.

Therefore, we must identify outliers and remove them from the dataset before performing the cluster analysis. To find the outliers more precisely, let's build a scatter plot.

airline_count = df["Operating Airline"].value_counts()
airline_count.sort_index(inplace=True)
passenger_count = df.groupby("Operating Airline")["Passenger Count"].sum()
passenger_count.sort_index(inplace=True)

x = airline_count.values
y = passenger_count.values

plt.figure(figsize=(10, 10))
plt.scatter(x, y)
plt.xlabel("Flights held")
plt.ylabel("Passengers")
for i, txt in enumerate(airline_count.index.values):
    plt.annotate(txt, (x[i], y[i]))
plt.show()

We can see that most airlines are grouped in the lower left of the chart, with a few above them, as well as our two outliers, United Airlines and United Airlines (before July 1, 2013). So let's remove them.

# Total activity per airline; the two outliers will have the largest values
df_1 = airline_count + passenger_count
df_1.sort_values(ascending=False, inplace=True)
outliers = df_1.head(2).index.values

# Drop the outliers from both series and restore the index order
airline_count = airline_count.drop(outliers)
airline_count.sort_index(inplace=True)
passenger_count = passenger_count.drop(outliers)
passenger_count.sort_index(inplace=True)

x = airline_count.values
y = passenger_count.values

We have now prepared the dataset for clustering. The remaining problem is determining the "optimal" number of clusters to use, which we discuss next.

Choosing the number of clusters

As discussed earlier, what we are doing is trying to minimize the objective function.

Obviously, the more centroids we have, the smaller the objective function becomes.

Furthermore, when every point is its own centroid, the objective function becomes zero, the global minimum. However, this is not what we want.

Let's plot the value of the objective function against the number of clusters:
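The original plotting code is not shown; a minimal sketch, assuming the feature matrix X stacks the two series we prepared above, would be:

import numpy as np
from sklearn.cluster import KMeans

# Assumption: the feature matrix stacks flights held and passenger counts
X = np.column_stack((x, y))

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=10)
    km.fit(X)
    inertias.append(km.inertia_)  # value of the objective function

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Objective function")
plt.show()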

Our task is to find the number of clusters after which the objective function stops decreasing so rapidly: if we imagine the graph as a human arm, the optimal number of clusters sits at the elbow.

In our case, the optimal number of clusters is 6.

Therefore, we can now perform cluster analysis on the data set.

kmeans = KMeans(n_clusters=6)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

plt.figure(figsize=(15, 15))
plt.xlabel("Flights held")
plt.ylabel("Passengers")
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=300, cmap="Set1")
for i, txt in enumerate(airline_count.index.values):
    plt.annotate(txt, (X[i, 0], X[i, 1]), size=7)
plt.show()

We have now clustered the data and can make some assumptions and predictions based on the results.

We can see that American Airlines and SkyWest Airlines hold many flights and carry many passengers, so we can infer that these two airlines have more aircraft capacity, or that people fly with them more often than with other airlines.

Cluster analysis can give you many insights into the data. Now it is important to validate the results of the clustering, that is, to see how good the partition is.

Cluster validation: silhouette analysis

To understand the main idea of this kind of validation, we must first understand what the silhouette coefficient is.
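The formula referenced below did not survive the page's formatting; the standard definition of the silhouette coefficient of a point $i$ is:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$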

For each point in the dataset, we can compute the silhouette coefficient using the formula above, but it is not yet clear what a(i) and b(i) are. So let's first see what the a(i) parameter means.

For each point i, a(i) is the average distance between i and all the other points j of the cluster that i belongs to; it measures the intra-cluster distance.

Now, let's see what the b(i) parameter means.

For each point i, b(i) is the smallest average distance from i to the points of any cluster that i does not belong to; it measures the inter-cluster distance.

Using these parameters, we can compute the silhouette coefficient of every point in the dataset. Silhouette scores always lie in the range [-1, 1].
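To make the definitions concrete, here is a tiny hypothetical example that computes s(i) for one point by hand and checks it against scikit-learn:

import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data: two well-separated clusters on a line (hypothetical example)
X_toy = np.array([[0.0], [0.5], [1.0], [10.0], [10.5]])
labels_toy = np.array([0, 0, 0, 1, 1])

i = 1  # the point at 0.5
# a(i): mean distance to the other points of i's own cluster
own = [j for j in range(len(X_toy)) if labels_toy[j] == labels_toy[i] and j != i]
a_i = np.mean([np.linalg.norm(X_toy[i] - X_toy[j]) for j in own])
# b(i): smallest mean distance to the points of any other cluster
b_i = min(np.mean([np.linalg.norm(X_toy[i] - X_toy[j])
                   for j in range(len(X_toy)) if labels_toy[j] == c])
          for c in set(labels_toy) - {labels_toy[i]})
s_i = (b_i - a_i) / max(a_i, b_i)

print(s_i)                                       # ~0.949
print(silhouette_samples(X_toy, labels_toy)[i])  # scikit-learn agrees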

It is useful to draw a silhouette plot to see what the clustering looks like.

In the silhouette plot, we draw the silhouette coefficients of all points, grouped by cluster and sorted within each cluster.

Looking at the figure, we can judge whether the partition is good: if all the "bars" are roughly the same width and the silhouette coefficient of every point in each group is greater than the global average, the partition can be called good; otherwise the clustering is not perfect, but it may still be acceptable.

Now let's draw the silhouette plot for our clustering (the code is based on the "Silhouette analysis" example in the scikit-learn documentation).

from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

n_clusters = 6
plt.figure(figsize=(10, 10))
plt.gca().set_xlim([-0.1, 1])
plt.gca().set_ylim([0, len(X) + (n_clusters + 1) * 10])

clusterer = KMeans(n_clusters=n_clusters, random_state=10)
labels = clusterer.fit_predict(X)
print("The average silhouette_score is: {}".format(silhouette_score(X, labels)))
sample_silhouette_values = silhouette_samples(X, labels)

y_lower = 10
for i in range(n_clusters):
    # Gather and sort the silhouette values of the points in cluster i
    ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = cm.nipy_spectral(float(i) / n_clusters)
    plt.gca().fill_betweenx(np.arange(y_lower, y_upper), 0,
                            ith_cluster_silhouette_values,
                            facecolor=color, edgecolor=color, alpha=0.7)
    plt.gca().text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10

plt.gca().axvline(x=silhouette_score(X, labels), color="red", linestyle="--",
                  label="Avg silhouette score")
plt.title("The silhouette plot")
plt.xlabel("The silhouette coefficient values")
plt.ylabel("Cluster label")
plt.legend()
plt.show()

As we can see, the clusters differ in size, and not every cluster consists of points whose silhouette coefficient is greater than the global average. However, given our data and the earlier scatter plot, the overall clustering, though not perfect, is good.

