First, a brief history of text keyword extraction

An article can be represented at different levels of granularity; from coarse to fine these include its type (classification), its events (topics) and its representative words (keywords), and keywords are an important part of how an article is represented. The keywords of a text can be regarded as a "condensation" of the main ideas of the whole article, a highly compressed natural-language expression of its content. More accurate text keywords yield more accurate content features and let the downstream recommendation system recall higher-quality articles of the same type; at the same time, high-quality keywords can also serve as labels for content operation and user recommendation, improving the efficiency of editors and operations colleagues.

Text keyword extraction has attracted a lot of research attention. From the simplest TF-IDF, to unsupervised methods such as TextRank and LDA, to widely used neural network models such as Seq2Seq, there has been a great deal of practice and exploration in this field.

Second, an overview of game text keyword extraction

Comprehensive game products inside the company, such as the esports app and the game center, have accumulated a large number of game texts of different types: walkthroughs, beginner guides, ranking guides and so on. How to attach the right keywords to the right game text, and push the content to the appropriate users, has become an important topic.

In exploring keyword extraction for game texts, we tried the graph-based unsupervised TextRank method and the supervised Seq2Seq neural network method, and made a preliminary comparison of their performance.

Supervised neural network methods usually need a certain amount of annotated data to learn good parameters. To meet the needs of neural network training, and given the actual situation of the project and the data, we collected about 30,000 game texts with categories and labels from the game center of the mobile QQ platform; after removing near-duplicate texts and filtering out low-quality ones, we obtained about 24,000 samples. The texts range from several hundred to more than one thousand words, and each sample carries 3-6 manually annotated keywords, for a total of about 90,000 <text, keyword> pairs. The corpus covers beginner guides, ranking guides, game introductions and other content, and spans popular games such as King of Glory, Tiantian Dazzle, Zhanshen and Tiantian Run. The 24,000-plus texts were randomly split into a training set of 20,000, a validation set of 2,000 and a test set of 2,000. During pre-processing, the jieba tool is used for word segmentation, and a dictionary of game-related terms is loaded before segmentation to improve the segmentation accuracy of game-specific terms.
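A minimal pre-processing sketch along these lines, assuming jieba in Python; the dictionary file name and the sample sentence are placeholders, not part of the project data:

```python
# Load a user dictionary of game terms before segmentation so that
# game-specific names are kept as single tokens, then segment the text.
import jieba

jieba.load_userdict("game_terms.txt")   # placeholder path: one game term per line
text = "王者峡谷5V5对局中，新手玩家需要掌握英雄的基本操作。"   # made-up sample sentence
tokens = jieba.lcut(text)               # returns the segmented word list
print(tokens)
```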

Third, two types of text keyword extraction models

1. Game text keyword extraction method based on TextRank

The idea of the TextRank algorithm is borrowed directly from the PageRank page-ranking algorithm: the adjacency of words within a window of length k plays the role of the link relations in PageRank, and the iterative formula is essentially the same as PageRank's, namely

S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)

where $S(V_i)$ is the importance weight of word $V_i$, $In(V_i)$ is the set of words pointing to $V_i$, and $Out(V_j)$ is the set of words that $V_j$ points to. The formula is computed iteratively; finally, the importance weights of all words are sorted in descending order to obtain the importance ranking of the words.
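The iteration can be sketched in a few lines of Python. This is a toy illustration of the update above over an undirected co-occurrence graph, ignoring stopword/POS filtering and edge weights:

```python
from collections import defaultdict

def textrank_keywords(words, k=5, d=0.85, iters=50, topn=5):
    # Build the co-occurrence graph: words within a window of length k are linked.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + k, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    neighbors = dict(neighbors)
    # Iterate S(Vi) = (1 - d) + d * sum_{Vj in In(Vi)} S(Vj) / |Out(Vj)|.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[v] / len(neighbors[v]) for v in nbrs)
                 for w, nbrs in neighbors.items()}
    # Sort by importance weight in descending order.
    return sorted(score, key=score.get, reverse=True)[:topn]

print(textrank_keywords(["hero", "lane", "team", "enemy", "hero", "team", "lane", "push"]))
```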

The TextRank-based method is simple and effective, and its speed is acceptable. However, the approach has two obvious shortcomings:

1. The source of candidate keywords is limited: candidates can only be words that appear in the document itself, so keywords outside the document cannot be learned, nor can abstractive keywords be produced by way of "generation".

2. Although TextRank takes into account information such as the co-occurrence of words within a given window, it still tends to assign higher weights to high-frequency words, so in practice it has no great advantage over TF-IDF methods.

TextRank is conceptually simple and easy to implement, and ready-made implementations can be called directly in various NLP toolkits, such as the Python-based jieba and the Java-based HanLP, as in the example below.
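A short usage sketch of jieba's built-in TextRank in Python (the sample text is made up; HanLP exposes a similar interface on the Java side):

```python
import jieba.analyse

text = "在王者峡谷的5V5对局中，合理的阵容搭配和清晰的分工往往比个人操作更重要。"
# Top-K keywords with their importance weights; POS filtering uses jieba's defaults.
for word, weight in jieba.analyse.textrank(text, topK=5, withWeight=True):
    print(word, round(weight, 3))
```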

2. Background of Seq2Seq

Since the Seq2Seq model was proposed, it has been widely applied to neural machine translation, image captioning, text summarization and other fields. Keyword extraction is closely related to text summarization, and researchers have made various attempts at this task with different neural network models.

The Seq2Seq model is often called the Encoder-Decoder model; the two sequences correspond to the Encoder and the Decoder respectively. The Encoder and Decoder are usually built from plain RNN units, or from LSTM or GRU units; Seq2Seq based on CNN or other network structures is not discussed in this article. A typical Seq2Seq model is shown in Figure 1 below.


Figure 1 Schematic diagram of Encoder-Decoder model

In Figure 1, the lower box is the Encoder of the Seq2Seq model; $x_1, x_2, \dots, x_T$ is the source input sequence, corresponding to the source text in keyword extraction. The upper box is the Decoder; $y_1, y_2, \dots, y_{T'}$ is the Decoder output, corresponding to the output keywords. Taking keyword extraction as an example, the output vector at each time step is fed into a Softmax to compute a probability for every word in the vocabulary (because the vocabulary of the corpus is large, computing this for every word is usually expensive, and many researchers have proposed improvements to this problem, e.g. https://arxiv.org/pdf/1412.2007.pdf).

As the Encoder box in the figure shows, the hidden state $h_t$ computed at each time step is the joint result of the previous hidden state $h_{t-1}$ and the current input $x_t$, i.e. $h_t = f(h_{t-1}, x_t)$. The context vector $c$ accepted by the Decoder (Figure 1) comes from the Encoder output; it is generally the hidden state of the Encoder's last time step, $c = h_T$, and in some work it may also be a combination or function transformation of several hidden states, i.e. $c = q(h_1, \dots, h_T)$. The Decoder hidden state is computed as $s_t = f(s_{t-1}, y_{t-1}, c)$, that is, the hidden state $s_t$ at every decoding time step uses the same fixed-length $c$. The computation of the Decoder output $y_t$ also depends on the same information from the Encoder, which, as Figure 1 shows, is still the same $c$.
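A minimal sketch of such an Encoder-Decoder, assuming PyTorch (the article does not name a framework) and made-up vocabulary sizes; here the fixed-length c simply initializes the Decoder state, a common simplification of feeding the same c at every step:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=128, hid=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)      # a Softmax over this gives word probabilities

    def forward(self, src_ids, tgt_ids):
        # Encoder: h_t = f(h_{t-1}, x_t); c is the last hidden state h_T.
        _, c = self.encoder(self.src_emb(src_ids))
        # Decoder: conditioned on the same fixed-length c (used here as its initial state).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), c)
        return self.out(dec_out)                  # logits for every target position

model = Seq2Seq(src_vocab=5000, tgt_vocab=5000)
logits = model(torch.randint(0, 5000, (2, 30)), torch.randint(0, 5000, (2, 4)))
print(logits.shape)  # (batch=2, target_len=4, tgt_vocab=5000)
```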

As the discussion above shows, in text summarization, machine translation and similar tasks, the only input the Decoder receives from the source text is a single fixed-length vector $c$, even when the input text is very long; decoding the translation, summary or keywords from one fixed-length vector often causes a substantial performance loss. The attention mechanism proposed by Bahdanau in 2014 addresses this problem well, and over the following two or three years it was widely adopted in translation, summarization, keyword extraction, image caption generation and even sentiment analysis.

The attention-based Seq2Seq model is similar to the traditional model on the Encoder side; the attention mechanism mainly acts on the Decoder, as shown in Figure 2. The lower part of the figure is the Encoder. Compared with Figure 1, the Encoder here uses a bidirectional RNN, which simply lets the network learn from both directions of the sequence; it is not part of the attention mechanism itself.


Figure 2 Seq2Seq model with attention mechanism

Similar to the ordinary Seq2Seq model, the model in Figure 2 also reads a context vector from the Encoder at the Decoder stage, but this context is no longer a simple transformation of the Encoder's last hidden state; instead it is obtained by reading the hidden states of every Encoder time step and taking a weighted sum, as in the following formula:

c_t = \sum_{j=1}^{T} \alpha_{tj} h_j

Combining the formula with Figure 2, it is clear that $c_t$, the Encoder vector accepted by the Decoder at time $t$, is obtained as a linear weighted sum over the Encoder hidden states $h_j$ of every input time step $j$. Figure 2 also shows that the Decoder hidden state at this time, $s_t$, is computed as $s_t = f(s_{t-1}, y_{t-1}, c_t)$. The weights $\alpha_{tj}$ in the formula are learned from the training corpus and pass through a normalization so that, at decoding time $t$, the weights assigned to the different $h_j$ sum to 1:

\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}
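A minimal sketch of this attention computation, again assuming PyTorch and an additive (Bahdanau-style) scoring function; dimensions are illustrative only:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hid=256):
        super().__init__()
        self.W_s = nn.Linear(hid, hid, bias=False)
        self.W_h = nn.Linear(hid, hid, bias=False)
        self.v = nn.Linear(hid, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, hid) decoder state s_{t-1}; enc_states: (batch, src_len, hid) = h_1..h_T
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_states)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)        # alpha_tj, summing to 1 over j
        c_t = torch.bmm(alpha.unsqueeze(1), enc_states)    # weighted sum of encoder states
        return c_t.squeeze(1), alpha

attn = AdditiveAttention()
c_t, alpha = attn(torch.randn(2, 256), torch.randn(2, 30, 256))
print(c_t.shape, alpha.shape)  # torch.Size([2, 256]) torch.Size([2, 30])
```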

3. Seq2Seq-Attention model based on the Copying mechanism

The attention-based model has been widely used in machine translation, intelligent question answering, text summarization and other tasks. However, for text summarization, keyword extraction and question answering, the design of the Decoder still has some shortcomings that are hard to avoid.

Take keyword extraction as an example. At each time step the Decoder generates one word through the Softmax, and that word is drawn from a vocabulary $V$ of size $N$. For reasons of computational cost, this vocabulary usually does not contain every word in the training set, and the large number of low-frequency training words can only be replaced by UNK. On the other hand, the words in the test set may not all have appeared in the training set, and certainly may not all lie within the predictable vocabulary $V$. This leads to what is commonly called the OOV (out-of-vocabulary) problem in text summarization and keyword extraction: when a new test document contains important out-of-vocabulary terms, those terms can only be predicted as UNK, whether we are extracting keywords or generating a summary.

Having discussed how the OOV problem limits the Seq2Seq model, let us return to the Copying mechanism. If a human is asked to write the summary of a document or pick out its keywords, he will not only rely on his background knowledge and what he has learned in the past, but will also "copy" or "extract" some important words directly from the original text. Compare this with the machine: during generation, the traditional model only uses the parameters learned from the training corpus to predict which word of the vocabulary should appear at a given position; if several important keywords of the original text are not in the vocabulary, then unfortunately those words will never appear in the generated keyword list. Jiatao Gu et al. borrowed this human habit of "copying" and introduced a Copying mechanism into the attention-based Seq2Seq model, which greatly alleviates the impact of OOV on keyword extraction and summarization. The Encoder of this network keeps the traditional form unchanged; the Decoder, however, adds a substantial amount of computation for the Copying mechanism, as shown in Figure 3.


Figure 3 Seq2Seq model with Copying mechanism

As can be seen from Figure 3, both the update of the Decoder hidden state $s_t$ and the computation of the output probability $p(y_t)$ differ noticeably from the previous models. As shown at the lower right of Figure 3, the State Update clearly consists of several parts: the first, the solid blue arrow on the left, is the attentive read $c_t$ coming from the Encoder; the second is $s_{t-1}$, the same as in the previous attention model; the third part, however, is not only the embedding of the word produced at the previous time step, $y_{t-1}$, but that embedding concatenated with a Selective Read vector. The Selective Read vector is computed in a way similar to the attention mechanism, from the position-related information of the Decoder and the hidden states of each Encoder time step, and it provides information for the subsequent copy-mode.

Generate-mode and copy-mode, at the upper right of Figure 3, are the two probability terms used to produce the keyword text. The generate-mode on the left is computed in the same way as in the classical model, while the copy-mode computes the probability of copying each word from the source sequence; the sum of the two probabilities is the probability of the target word at the current position, as in the formula

p(y_t \mid s_t, y_{t-1}, c_t, M) = p(y_t, g \mid s_t, y_{t-1}, c_t, M) + p(y_t, c \mid s_t, y_{t-1}, c_t, M)

Note that $g$ and $c$ in the formula denote the generate-mode and the copy-mode respectively. How each of the two terms is computed depends on the relationship between the generated word $y_t$ and the two vocabularies $V$ and $X$ (the set of words in the source text), which is described clearly in Figure 2 of the original paper and will not be repeated here.
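To make the sum of the two modes concrete, here is a simplified illustration of jointly normalizing generate-mode and copy-mode scores and adding the copy probability onto the ids of the source words; it sketches the idea only and omits the exact score functions of the paper:

```python
import torch

def combine_modes(gen_scores, copy_scores, src_ids, extended_size):
    # gen_scores: (batch, |V|); copy_scores: (batch, src_len); src_ids: word ids of the source
    joint = torch.softmax(torch.cat([gen_scores, copy_scores], dim=1), dim=1)
    p_gen, p_copy = joint[:, :gen_scores.size(1)], joint[:, gen_scores.size(1):]
    p = torch.zeros(gen_scores.size(0), extended_size)
    p[:, :gen_scores.size(1)] = p_gen
    p.scatter_add_(1, src_ids, p_copy)         # p(y_t) = p(y_t, g) + p(y_t, c)
    return p

p = combine_modes(torch.zeros(1, 5), torch.zeros(1, 2),
                  torch.tensor([[2, 6]]), extended_size=8)
print(p)  # word id 2 gets generate + copy mass; id 6 (an OOV word) gets copy mass only
```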

Fourth, the significance of the Copying mechanism for game text keyword extraction

As the description above shows, if an important keyword of a text is not among the high-frequency words of the training set, i.e. not in the Vocabulary, then where the traditional attention Seq2Seq model is powerless, copying directly from the original text offers a better way out. At the same time, as a classical generative model it brings in knowledge learned from tens of thousands of training texts, something TextRank and similar methods lack, so it balances the strengths of "extractive" and "generative" models.

Let’s close with a simple example.

In League of Kings, the 5V5 mode is popular among many players. Players can queue by themselves or invite friends to start a game together. In 5V5, it is not flashy individual play that decides the whole game, but rather tacit cooperation within the team: a reasonable lineup, a clear division of labour on the battlefield, and each player's skill with his hero... (excerpt)

GroundTruth: 5V5 King's Canyon; League of Kings; Novice

Seq2Seq with Copying Mechanism (Top 5)        TextRank (Top 5)
1. King Academy                               1. hero
2. League of Kings                            2. enemy
3. 5V5 King's Canyon                          3. kill
4. Novice                                     4. teammates
5. 5V5 League of Kings                        5. development


On the test set, the neural-network-based Seq2Seq method is 4% higher than TextRank in P@5, 8% higher in R@5 and 5.4% higher in F1@5, which preliminarily demonstrates the effectiveness of the method.
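For reference, a small sketch of how such top-5 metrics are typically computed against the annotated keywords (an assumption about the exact definitions used here), averaged over the test set:

```python
def prf_at_k(predicted, gold, k=5):
    # Precision, recall and F1 of the top-k predictions against the gold keywords.
    top = predicted[:k]
    hits = len(set(top) & set(gold))
    p = hits / len(top) if top else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf_at_k(["5V5 King's Canyon", "League of Kings", "hero", "enemy", "kill"],
               ["5V5 King's Canyon", "League of Kings", "Novice"]))  # (0.4, 0.667, 0.5)
```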

Next, keyword extraction performance needs to be improved further at both the data and the model level, and the robustness of the results can also be increased by ensembling the outputs of multiple models.