Last year, the team wanted to use the company’s intelligent robot platform to build an intelligent customer-service bot. The first step was to collect and organize online user feedback, then classify it for use as training data for the algorithm model. This gave us the opportunity to put a clustering algorithm into practice in a real business setting.

K-means is one of the most commonly used clustering algorithms, and its principle is relatively simple, so it was the first method we tried.

K-means

The core idea of K-means is “birds of a feather flock together”: through repeated iteration, the algorithm partitions N multidimensional observation points into K sets in an increasingly reasonable way.

The algorithm is roughly described as follows:

1. Initialize the model and determine the K value, i.e. the number of sets the multidimensional observation points will be divided into.
2. Select an initial centroid for each of the K sets.
3. Compute the distance between each observation point and the K centroids, and assign each point to the set with the nearest centroid.
4. Recompute the centroid of each of the K sets to improve the clustering.
5. Repeat steps 3 to 4 until the centroids of the sets stabilize or the maximum number of iterations is reached.
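The steps above can be sketched in a few lines of Python. This is a minimal illustration on 2-D points with a fixed initialization, not the production code (which uses scikit-learn, as shown later):

```python
def k_means(points, init_centroids, max_iter=100):
    """Minimal K-means sketch: `points` and centroids are (x, y) tuples."""
    centroids = list(init_centroids)
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Step 4: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = k_means(points, init_centroids=[(0, 0), (9, 9)])
# Two clearly separated clouds converge to two stable centroids.
```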

In conjunction with business requirements, there are two characteristics to focus on in the process:

  • The K value. This value determines how many categories the feedback questions are finally divided into. In our business scenario, however, K cannot be inferred in advance, so repeated testing is required to find the best value.

  • Multidimensional observation points. Our raw data is user feedback, which is essentially text, so we need to find a suitable way to transform the text into multidimensional vectors.

Textual vectorization

As mentioned above, to cluster the feedback questions with K-means we need a way to transform text into multidimensional vectors. This process is text vectorization: it extracts features from the text and describes the content as a multidimensional vector. There are many text vectorization methods; we used TF-IDF, which is relatively simple.

Term Frequency–Inverse Document Frequency (TF-IDF): TF is the term frequency, describing how often a term occurs in a document; IDF is the inverse document frequency, describing a word’s distinguishing power and importance. The fewer documents a word appears in, the greater its IDF value and the higher its importance.
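To make the two factors concrete, here is a small hand-rolled sketch using the plain formulas tf = count / doc_length and idf = log(N / df). The tiny corpus is illustrative, and note that scikit-learn’s `TfidfTransformer` (used later) applies smoothing and normalization, so its numbers differ slightly:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a plain TF-IDF score for every term in every document.

    docs: list of token lists. Returns one {term: score} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        scores.append({
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    ["password", "reset", "how"],
    ["password", "forgot"],
    ["app", "crash", "how"],
]
scores = tf_idf(docs)
# "password" appears in 2 of 3 documents while "reset" appears in only 1,
# so within doc 0 "reset" receives the larger IDF weight.
```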

(We later also tried Word2vec, and the final clustering result was better.)

Clustering problem

Based on K-means and TF-IDF, we clustered the feedback questions. The code is relatively simple; the general logic is as follows:

```python
import sys

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Collect the raw question text from the spreadsheet rows
xls_question = []
for each in xls_result:
    xls_question.append(each.question)

# Bag-of-words counts, tokenized with our stop-word-aware tokenizer
count_vectorizer = CountVectorizer(tokenizer=cut_without_stop_word)
doc_features = count_vectorizer.fit_transform(xls_question)

# tf-idf
tfidf_transformer = TfidfTransformer()
word_tfidf = tfidf_transformer.fit_transform(doc_features)

# k-means
cluster_count = int(sys.argv[1])
max_iteration = int(sys.argv[2])
n_init = int(sys.argv[3])
km_cluster = KMeans(n_clusters=cluster_count,
                    max_iter=max_iteration,
                    n_init=n_init)
cluster_result = km_cluster.fit_predict(word_tfidf)
```

Through constant adjustment of the K value and continuous optimization of the training data set, we obtained the final results:

In the result screenshot, the questions above and below the red box belong to two different categories, and the questions within each category are similar.

In the end, most of the questions within each category are related, but two problems remain:

(1) Some categories show no obvious internal correlation, possibly because the K value is too large or because of shortcomings in the word segmentation or text vectorization algorithms.

(2) Some categories consist mostly of highly similar questions but also contain a few unrelated ones. The unrelated questions include keywords that closely match the category, so they are pulled in.

Category topics

Having obtained the categories, the next question is how to derive a topic for each category.

The idea is simple: find the keywords of each question in the category, and see which words appear most often.

The problem then becomes how to find the keywords. My method is still TF-IDF: within the current category, find the 3–5 words with the highest TF-IDF score in each question. The results are as follows:
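This keyword-voting step can be sketched as follows. The per-question score dicts and the example words are illustrative (in practice they would come from the TF-IDF matrix computed earlier):

```python
from collections import Counter

def category_topic_words(question_scores, per_question=3, top=5):
    """Pick each question's highest-TF-IDF words, then count which
    words recur across the whole category.

    question_scores: list of {word: tfidf} dicts, one per question.
    Returns the `top` most frequent keyword candidates.
    """
    keyword_counts = Counter()
    for scores in question_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        keyword_counts.update(ranked[:per_question])
    return [word for word, _ in keyword_counts.most_common(top)]

category = [
    {"password": 0.9, "retrieve": 0.8, "how": 0.3},
    {"password": 0.7, "forgot": 0.6, "how": 0.4},
    {"retrieve": 0.8, "password": 0.5, "account": 0.2},
]
topics = category_topic_words(category, per_question=2, top=3)
# "password" is in every question's top-2, so it leads the list.
```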

This basically captures the key information of the category; the category topic in the screenshot is “Retrieve password”.

At the same time, the figure also reveals a problem: a word like “how” appears among the topic keywords, because common words can also end up with high TF-IDF values. There are two ways to deal with this:

(1) Filter out common but meaningless words with stop-word filtering, so that such common words do not skew the overall result through their TF-IDF values.

(2) Run K-means once more, clustering the keywords found into just two classes, and take the class with the higher scores as the topic keywords.

Why run K-means again?

Looking at the scores of the keywords found in a category, I noticed that they fall into two groups: a high-score group and a low-score group, i.e. highly relevant words and weakly relevant words. So a second round of grouping separates out the high-score vocabulary.

Existing problems

With the method above, we completed the clustering and topic analysis of the feedback questions, but the final results still show several problems:

  • Data cleaning

The most time-consuming part of the whole process is data cleaning, which requires filtering invalid content out of the feedback, including special characters and emoji. At the same time, for security and compliance reasons, names, phone numbers, geographic locations and the like also need to be removed from the feedback.

  • Word segmentation

Before applying TF-IDF, the feedback questions must be segmented into words. I used the open-source segmentation library jieba, which handles basic segmentation, but its raw output cannot be used directly.

First, the segmentation results contain stop words, which carry no meaning and need to be filtered out to avoid distorting the analysis. Second, each product has its own domain keywords; for Tencent, for example, QQ, Yuewen, Weishi and Tencent Cloud are keywords.
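A minimal sketch of this post-processing. It assumes the raw tokens come from jieba (in real code, `jieba.lcut(text)` produces them, and `jieba.add_word(...)` can teach the tokenizer domain keywords); here the token list, the stop list, and the keyword set are all hard-coded so the filtering logic stands on its own:

```python
STOP_WORDS = {"how", "the", "to", "a", "do", "i"}    # illustrative stop list
PRODUCT_KEYWORDS = {"QQ", "Weishi", "Yuewen", "Tencent Cloud"}

def filter_tokens(tokens):
    """Drop stop words but always keep product keywords, preserving order."""
    return [t for t in tokens
            if t in PRODUCT_KEYWORDS or t.lower() not in STOP_WORDS]

# Tokens as a segmenter such as jieba might emit them:
tokens = ["how", "do", "i", "reset", "the", "QQ", "password"]
cleaned = filter_tokens(tokens)
# → ["reset", "QQ", "password"]
```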

  • K value

During training, the K value also went through many rounds of testing and adjustment. If K is too large, questions with the same meaning may be split across multiple categories; if K is too small, a category may contain many unrelated questions.