
1. Experimental algorithm design

  1. Read the watermelon dataset.
  2. Randomly select K samples as the initial cluster centers.
  3. Compute the distance between each sample and each cluster center, and assign each sample to the cluster center closest to it. At this point, all samples have been divided into K groups.
  4. Update the cluster centers: the mean of the samples in each group becomes that group's new cluster center.
  5. Repeat steps 3 and 4 until the cluster centers become stable or the maximum number of iterations is reached (a short sketch of this loop follows below).
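
A minimal sketch of this loop, assuming the data is already a two-dimensional NumPy array (the function name and arguments are illustrative, not the code used in the experiment):

    import numpy as np

    def kmeans_sketch(x: np.ndarray, k: int, max_iter: int = 10):
        # 2. Randomly pick k samples as the initial cluster centers
        centers = x[np.random.choice(len(x), size=k, replace=False)]
        for _ in range(max_iter):
            # 3. Assign each sample to its nearest cluster center
            dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 4. Recompute each center as the mean of the samples assigned to it
            new_centers = np.array([x[labels == j].mean(axis=0) for j in range(k)])
            # 5. Stop once the centers no longer move
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels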

Experimental analysis

K-means clustering was used on the watermelon dataset

A simple analysis of the watermelon dataset gives the following:

| Feature | Type | Role |
| --- | --- | --- |
| Serial number | Discrete | Serial number |
| Density | Continuous | Feature |
| Sugar content | Continuous | Feature |
| Good melon | Discrete | Label |

Therefore, the features density and sugar content were selected for the cluster analysis.

2. Core code of the k-means cluster analysis

  1. Import the required libraries

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    In this experiment, I chose pandas [1] as the main tool for reading the data set, NumPy [2] to accelerate the main mathematical operations, and Matplotlib [3] for the data visualization and analysis.

  2. Define the k-means clustering class KMeans

    1. Define __init__() to initialize the clustering model

      class KMeans:
          def __init__(self, x: pd.DataFrame):
              # x holds the feature columns of the data set
              self.x = x
          ...

      Here, x represents the features of the data set.

    2. Predefine the distance function distanceAll()

      def distanceAll(center, rest):
          # Total distance from every sample in `rest` to the given center(s)
          distances = np.apply_along_axis(_distances, 1, rest, center)
          return distances.sum()

      def _distances(point: np.ndarray, centers: np.ndarray):
          # Distances from a single point to every center
          distances = np.apply_along_axis(_distance, 1, centers, point)
          return distances

      def _distance(x, y):
          # Euclidean distance computed from the expanded form x·x - 2·x·y + y·y
          return np.sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))

      I made several optimizations here; the specific points are as follows:

      Avoiding for loops to speed up the computation

      In the first function, distanceAll, center and rest are passed in as multidimensional matrices, and the pairwise distances between them are computed without any for loops, which greatly improves the running speed.

      Reusing the intermediate results of _distance(x, y)

      The general formula for calculating Euclidean distance is:


      $$ d = \left( \sum_{k=1}^m \left| x_{ki} - x_{kj} \right|^2 \right)^{\frac{1}{2}} $$

      But the formula I actually use here is the expanded form:


      $$ d = \left( \sum_{k=1}^m \left( \textcolor{red}{x_{ki}^2} - 2 \times x_{ki} \times x_{kj} + \textcolor{red}{x_{kj}^2} \right) \right)^{\frac{1}{2}} $$

      The terms in red appear many times when computing the Euclidean distances, so using this formula makes it possible to reuse those intermediate results in NumPy and reduce unnecessary repeated computation, as illustrated in the sketch below.
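
      As an illustration of this reuse, here is a minimal, fully vectorized sketch (assuming plain 2-D NumPy inputs; this is not the distanceAll code above):

          import numpy as np

          def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
              # The highlighted terms: squared norms computed once per row and reused for every pair
              a_sq = (a ** 2).sum(axis=1)[:, None]   # shape (n, 1)
              b_sq = (b ** 2).sum(axis=1)[None, :]   # shape (1, m)
              cross = a @ b.T                        # the cross terms x_ki * x_kj, shape (n, m)
              # Clip tiny negative values caused by floating-point error before the square root
              return np.sqrt(np.maximum(a_sq - 2 * cross + b_sq, 0.0))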

    3. Predefine the core method allocateAll(), which finds the nearest cluster center for each point

      def allocateAll(center, rest):
          # 2. Calculate the distance between each sample and each cluster center, and assign each sample to the nearest one
          allocates = np.apply_along_axis(_allocate, 1, rest, center)
          # sns.scatterplot(data=rest, x=0, y=1, hue=allocates)
          copied = rest.copy()
          copied["allocations"] = allocates
          groups = copied.groupby("allocations").groups
          # Drawing: samples colored by cluster, centers marked with an "x"
          ax = rest.plot.scatter(x=0, y=1, c=allocates, colormap='viridis', legend=True)
          center.iloc[list(groups.keys())].plot.scatter(x=0,
                                                        y=1,
                                                        c=list(groups.keys()),
                                                        marker="x",
                                                        colormap='viridis',
                                                        s=200,
                                                        ax=ax)
          plt.show()
          return groups

      def _allocate(point: np.ndarray, centers: np.ndarray):
          # Index of the nearest cluster center for a single point
          distances = np.apply_along_axis(_distance, 1, centers, point)
          nearest_center = np.argmin(distances)
          return nearest_center

      A plotting step is also integrated into the allocation process, so each call visualizes the current assignment of points to cluster centers; this is what produces the plots of the clustering process shown later.

    4. Define train() to perform iterative training on the training set

      class KMeans:
          ...
          def train(self, k):
              print(f"=== k = {k} ===")
              batch = self.x.shape[0]
              features = self.x.shape[1]
              # 1. Randomly select k samples as the initial cluster centers
              index = np.random.randint(0, batch, size=k)
              centers: pd.DataFrame = self.x.iloc[index]  # cluster centers
              # rest: pd.DataFrame = self.x.loc[~self.x.index.isin(index)]
              allocations = allocateAll(centers, self.x)
              for i in range(10):
                  last_centers = centers
                  centers = np.empty((k, 2))
                  # 2. The mean of the samples in each group becomes the group's new cluster center
                  for label, points in allocations.items():
                      center = self.x.iloc[points]
                      new_center = np.average(center, axis=0)
                      centers[label] = new_center
                  # 3. Stop once the centers no longer change and return the WSS distance
                  if np.isclose(last_centers, centers).all():
                      print(f"k = {k} converged, stopping!")
                      return distanceAll(pd.DataFrame(centers), self.x)
                  allocations = allocateAll(pd.DataFrame(centers), self.x)
              # If the maximum number of iterations is reached, return the WSS distance as well
              return distanceAll(pd.DataFrame(centers), self.x)

      In this code, I capped each training run at a maximum of 10 rounds; in general, it only takes about 5 iterations for the cluster centers to converge.

      The code has two parts. First, the initial cluster centers are chosen randomly from the samples and an initial clustering is performed. Then, in each loop iteration, the mean point of each cluster from the previous allocation is selected as the new center. When the centers of the current iteration no longer differ from those of the previous round, the iteration terminates and the WSS distance value is returned.

3. Analysis of experimental data and results

K-means clustering was used on the watermelon dataset

  1. Import the required libraries

    import matplotlib.pyplot as plt
    import pandas as pd
    
    from model import KMeans

    Here, KMeans and the plotting tool Matplotlib are imported in order to draw the WSS curve.

  2. Read the data set and build the model

    df = pd.read_csv("kmeansdata.csv")
    model = KMeans(df[["m", "h"]])

    The watermelon data set is read in here, and the features m and h are selected to build the model.

  3. KMeans model training, visualization, WSS curve analysis

    wss = []
    for i in range(2, 10):
        wss.append(model.train(k=i))
    plt.plot(range(2, 10), wss)
    plt.savefig("result.png")

    Here, I select k values from 2 to 9 and train the KMeans model with each of them, saving the WSS distance returned after every training run; finally, the WSS distances are analyzed visually (see the elbow sketch further below).

    Visualization of the training process (k = 3)

    Firstly, three samples are randomly selected from the data set as the cluster centers:

    It can be seen that the selected cluster centers lie toward the lower part of the plot; the first iteration is then carried out:

    Within each cluster, the mean point is selected as the next cluster center, the cluster of every point is re-determined, and the next iteration is carried out:

    As you can see, the centers shift toward the middle and the grouping becomes more reasonable. One more iteration:

    After that, the centers no longer change significantly, which means the cluster centers have converged and this round of clustering ends.

    WSS curve visualization
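
    Given the WSS values collected above, the elbow point could also be located programmatically. The following is only a hypothetical sketch (pick_elbow is an illustrative name, not part of the experiment):

        import numpy as np

        def pick_elbow(ks, wss):
            # For a convex, decreasing WSS curve, the elbow is roughly where the
            # second difference (the change in slope) is largest
            second_diff = np.diff(wss, n=2)
            return ks[int(np.argmax(second_diff)) + 1]

        # e.g. pick_elbow(list(range(2, 10)), wss)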

4. Summary and experience

  1. On simple data sets (such as the watermelon data set), the clustering works well and converges within a few iterations.
  2. Visual analysis of the different k values shows that the "elbow" is reached at k = 3, so k = 3 is the optimal value. A k greater than 3 loses statistical meaning because there are too many categories, while a k that is too small produces too few categories and makes the intra-class distance rise sharply.
  3. Implementing Python programs through C-backed interfaces (such as NumPy) is more efficient than pure Python code.
  4. I have mastered some simple data visualization methods and learned to use some of the pyplot functions in the Matplotlib library. Converting a large amount of data into plots greatly simplifies the analysis and comparison of the result data and makes it easier to derive rules and conclusions from the results.

5. Suggestions for improvement of the experimental process, methods and means

  1. During data set visualization, naively selecting the first two dimensions of a high-dimensional feature space for visual analysis loses the feature information of the other dimensions. Dimensionality-reduction methods such as PCA [4] can be used instead to project the high-dimensional features onto a two-dimensional plane for visual analysis (see the sketch after this list).
  2. You can try more complex data sets.
  3. You can try to think of more distance functions.
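
A minimal sketch of the first suggestion, assuming scikit-learn is available; the file name and columns are hypothetical placeholders for a higher-dimensional data set:

    import pandas as pd
    from sklearn.decomposition import PCA

    # Hypothetical higher-dimensional data set; file name and columns are placeholders
    features = pd.read_csv("highdim.csv").select_dtypes("number")
    # Project onto the first two principal components for 2-D visualization
    projected = PCA(n_components=2).fit_transform(features)
    # `projected` has shape (n_samples, 2) and can be scatter-plotted as before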

References


  1. pandas.pydata.org/
  2. numpy.org/
  3. matplotlib.org/
  4. en.wikipedia.org/wiki/Princi…