This article was written by my colleague Xie Sifa. The Kmeans clustering algorithm is one of the top ten data mining algorithms. Below is a summary of some lessons learned from using Kmeans in practice, mainly around feature processing.

Enumerated features


By the principle of Kmeans, enumerated features are not well suited to Kmeans clustering. Take a character's hero type as an example, with values ranging from 1 to 5. First, the clustering process takes the mean of each feature, and the mean of an enumerated feature is meaningless. Second, Kmeans groups points by their spatial distance, so at the algorithmic level instances with value 1 are closer to instances with value 2 than to instances with value 5, yet in practical terms there is no near-or-far relationship between different types.

In other algorithms, such as LR, enumerated features are one-hot encoded: the hero-type feature would be converted into a feature vector of length 5, with the position corresponding to the type set to 1. However, this method does not suit Kmeans either. In Kmeans, to prevent attributes with large values from dominating the distances between samples, features are normalized, generally to the range 0-1. After normalization, a feature that only takes the values 0/1 becomes a strong feature with a large influence on the clustering. For an intuitive picture, suppose there are only two dimensions, one of which is 0/1. After normalization, the samples will lie on two parallel line segments.

When clustering such data, if the initial centers fall on both segments, the data on the two segments will be separated and clustering will effectively run within each segment on its own. The distance between two points on different segments is at least 1, greater than the distance between any two points within a segment. Similarly, in three dimensions, if one feature only takes the values 0/1, the data lies on two square faces, and clustering will again most likely proceed separately within each face. Therefore, with one-hot encoding, the 0/1 features largely determine the clustering result, and the other features lose importance. But sometimes such enumerated data is very important and describes instances well, so what we do is leave it out of the clustering and cross-analyze the clustering result against the enumerated values afterwards, looking at the distribution of enumerated values within each cluster, as in the sketch below.
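A minimal sketch of this cross-analysis, assuming Python with scikit-learn and pandas; the column names and the synthetic data are hypothetical stand-ins for real user data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for real user data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "login_count": rng.lognormal(2.0, 1.0, 1000),
    "login_duration": rng.lognormal(3.0, 1.0, 1000),
    "hero_type": rng.integers(1, 6, 1000),  # enumerated feature, values 1..5
})

# Cluster on the continuous features only; keep the enumerated feature out.
X = MinMaxScaler().fit_transform(df[["login_count", "login_duration"]])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Cross-analysis: distribution of hero_type within each cluster.
print(pd.crosstab(labels, df["hero_type"], normalize="index"))
```

The enumerated column never enters the distance computation; it is only tabulated against the cluster labels afterwards.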

Long-tailed features


When we analyze user behavior, we use behavior-related data as features, such as login count, login duration, and number of friends. Such data tends to have a long-tail problem. In addition, some features are roughly linearly related: when one feature takes a large value, the others tend to be large as well. For example, an active user who logs in frequently is also likely to have long login sessions and many friends. The resulting feature distribution is shown below; the extreme users distort Kmeans's characterization of ordinary users.

[Figure: scatter plot of the long-tailed feature distribution]

As the distribution shows, 80% of the data sits in 1% of the space, while the remaining 20% spreads over the other 99%. When clustering, most of the data in that 1% of the space is grouped into one cluster, and the rest is split among the others. As K grows, the model mostly keeps subdividing the data in the 99% of the space, because the spatial distances between those points are relatively large. The data packed into the 1% of the space is hard to subdivide further, and even when it is, only a small amount of data gets peeled off. The figure below shows the clustering result from one project: a single cluster of users accounts for more than 90%, and as K increases, only a small portion of that cluster is split off.

[Figure: clustering result from the project; one cluster holds over 90% of users]
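To make this concrete, here is a small self-contained experiment (synthetic lognormal data, not the project's data) that typically reproduces the effect: after min-max normalization, one cluster absorbs the bulk of the points.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.5, size=(5000, 1))  # heavy right tail

X = MinMaxScaler().fit_transform(x)  # bulk of the data squeezed near 0
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Share of points per cluster; one cluster typically dominates.
print(np.bincount(labels) / len(labels))
```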

To address this problem, one option is to take the log of the feature values, which reduces the long tail. After the log transform (combined with the normalization described earlier), the players can be separated much better. The figure below shows the distribution of the data points above after taking the log.

[Figure: the same data points after the log transform]

The figure below shows the clustering result with k=4 after taking the log of the feature values.

 

[Figure: clustering result with k=4 on log-transformed features]

 

The drawback of the log method is that it makes the data unintuitive and harder to interpret; a possible mitigation is sketched below.
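One way to soften that drawback (my suggestion, not part of the original article): cluster on log-transformed features, then map the learned cluster centers back to the original scale for reporting. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
raw = rng.lognormal(2.0, 1.0, (2000, 2))  # synthetic long-tailed features

logged = np.log1p(raw)                    # compress the tail
scaler = MinMaxScaler().fit(logged)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaler.transform(logged))

# Invert the scaling and the log so centers can be read in original units.
centers_raw = np.expm1(scaler.inverse_transform(km.cluster_centers_))
print(centers_raw)
```

Reporting the expm1-inverted centers keeps the clustering benefits of the log transform while letting analysts read the centers in the original units, e.g. login counts.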

Conclusion


Mr. Xie is single, you know.
