Keywords: NLG, decoding strategies
1. Background and problem description
This paper mainly discusses decoding strategies for text generation with language models. It thoroughly compares the two families of decoding strategies, maximization-based decoding and stochastic (sampling-based) decoding, and analyzes the problems of previous methods. It also proposes Nucleus Sampling, which the experimental analysis shows to be a very simple and effective decoding strategy.
2. Existing solutions
Greedy Decoding
A very straightforward decoding strategy: at each step, select the argmax token and feed the generated token back as the input for the next step. This kind of chained decoding accumulates errors, and the results are often very poor.
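A minimal sketch of greedy decoding, assuming a hypothetical `next_token_logits(tokens)` callable (not from the paper) that returns the language model's logits over the vocabulary for a given prefix:

```python
import numpy as np

def greedy_decode(next_token_logits, prefix, max_new_tokens, eos_id=None):
    """Greedy decoding: pick the argmax token at every step and feed it back."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)   # hypothetical LM interface
        next_id = int(np.argmax(logits))     # argmax of logits == argmax of softmax
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return tokens
```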
1. Beam Search
At each decoding step, keep the top-k highest-scoring partial sequences (hypotheses). When decoding finishes (an EOS token is produced or the maximum length is reached), the highest-scoring sequence is selected as the result. Take k=2 as an example (as in the sketch below):
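A compact beam-search sketch under the same assumed `next_token_logits(tokens)` interface; real implementations add details such as length normalization, which are omitted here:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def beam_search(next_token_logits, prefix, max_new_tokens, k=2, eos_id=0):
    """Keep the k highest-scoring partial sequences (hypotheses) at every step."""
    beams = [(list(prefix), 0.0)]                  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            log_probs = log_softmax(next_token_logits(tokens))
            for token_id in np.argsort(log_probs)[-k:]:        # k best continuations
                candidates.append((tokens + [int(token_id)], score + log_probs[token_id]))
        candidates.sort(key=lambda c: c[1], reverse=True)      # rank all candidates
        beams = []
        for tokens, score in candidates[:k]:                   # keep the k best overall
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:                                          # every hypothesis hit EOS
            break
    finished.extend(beams)                                     # include unfinished ones
    return max(finished, key=lambda c: c[1])[0]                # highest-scoring sequence
```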
2. Pure Sampling
At each decoding step, a token is sampled from the full vocabulary according to the model's probability distribution.
3. Top-k Sampling
At each decoding step, a token is randomly sampled from the top-k most probable tokens according to the (renormalized) probability distribution.
k=1 reduces to greedy decoding, and k equal to the vocabulary size is pure sampling. A larger k increases diversity but also admits more errors; a smaller k makes the generation safer but more generic (see the sketch below).
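A top-k sampling sketch under the same assumed `next_token_logits(tokens)` interface; setting k to the vocabulary size gives pure sampling:

```python
import numpy as np

def top_k_sample(next_token_logits, tokens, k, rng=None):
    """Sample the next token from the k most probable tokens, after renormalizing.

    k=1 reduces to greedy decoding; k equal to the vocabulary size is pure sampling.
    """
    rng = rng or np.random.default_rng()
    logits = next_token_logits(tokens)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # softmax over the vocabulary
    top_ids = np.argsort(probs)[-k:]                      # ids of the k most probable tokens
    top_probs = probs[top_ids] / probs[top_ids].sum()     # renormalize within the top-k set
    return int(rng.choice(top_ids, p=top_probs))
```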
- Softmax + Temperature
At each decoding step, the language model applies a softmax to compute a probability distribution over all tokens. The softmax can be augmented with a temperature $t$:

$$p(x_i \mid x_{1:i-1}) = \frac{\exp(u_i / t)}{\sum_{j} \exp(u_j / t)}$$

where $u_j$ are the logits produced by the model. Setting $t < 1$ tilts the distribution toward the higher-probability tokens and reduces the probability of sampling from the tail. Although lowering the temperature improves generation quality, it comes at the cost of reduced diversity.
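A minimal temperature-softmax sketch illustrating the effect of $t$ on a toy logit vector:

```python
import numpy as np

def softmax_with_temperature(logits, t=1.0):
    """Temperature-scaled softmax: t < 1 sharpens the distribution (less tail mass),
    t > 1 flattens it (more diversity)."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()                                   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, t=1.0))     # relatively flat distribution
print(softmax_with_temperature(logits, t=0.5))     # mass shifts toward the first token
```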
3. Solution Overview
3.1 OPEN-ENDED VS DIRECTED GENERATION
The author first distinguishes two types of text generation tasks. Open-ended generation is a freer text generation task in which the input only constrains the scope of the output; examples are conditional story generation and text continuation.
Directed generation is the common, typical encoder-decoder setting, with data defined as (input, output) pairs, such as machine translation and text summarization. Beam search is typically used as the decoding strategy for such problems. Because the output is conditioned on the input, it is easy to produce repetitive and *generic* results, and even increasing the beam size cannot solve these problems well.
3.2 Decoding language model
The author gives the model definition for open-ended generation. Given an input context of $m$ tokens $x_1 \dots x_m$, a typical one-directional (left-to-right) language model generates a continuation of length $n$:

$$P(x_{1:m+n}) = \prod_{i=1}^{m+n} P(x_i \mid x_1 \dots x_{i-1})$$
The above model is then used to generate words one by one using a specific decoding strategy.
3.3 Nucleus Sampling
This is a stochastic decoding strategy. The core idea is to determine the sampling vocabulary dynamically from the probability distribution, instead of always taking a fixed top-k slice. Given the distribution at step $i$, $P(x \mid x_{1:i-1})$, i.e. the softmax-normalized probability of selecting each token at step $i$, the sampling space $V^{(p)}$ is taken as the smallest top-$p$ set of tokens such that

$$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \ge p.$$

Writing $p' = \sum_{x \in V^{(p)}} P(x \mid x_{1:i-1})$, the sampling probabilities are then renormalized:

$$P'(x \mid x_{1:i-1}) = \begin{cases} P(x \mid x_{1:i-1}) / p' & \text{if } x \in V^{(p)} \\ 0 & \text{otherwise.} \end{cases}$$
In effect, this strategy always samples from the highest-probability tokens whose cumulative probability mass reaches the preset threshold p, and the size of the sampling space is dynamically adjusted at each step according to the probability distribution. With a large p, the sampling space at each step is the nucleus of the vocabulary, covering the bulk of the probability mass while cutting off the unreliable tail.
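A nucleus-sampling sketch under the same assumed `next_token_logits(tokens)` interface:

```python
import numpy as np

def nucleus_sample(next_token_logits, tokens, p=0.95, rng=None):
    """Nucleus (top-p) sampling: sample from the smallest set of tokens whose
    cumulative probability reaches p, after renormalizing within that set."""
    rng = rng or np.random.default_rng()
    logits = next_token_logits(tokens)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                               # softmax over the vocabulary
    order = np.argsort(probs)[::-1]                    # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix with mass >= p
    nucleus_ids = order[:cutoff]
    nucleus_probs = probs[nucleus_ids] / probs[nucleus_ids].sum()  # renormalize
    return int(rng.choice(nucleus_ids, p=nucleus_probs))
```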
3.4 Problems of each decoding strategy
- Maximization-based decoding
The authors find that always choosing the token with the maximum conditional probability at each step is not necessarily reasonable!
In the figure above, the author compares the per-token probabilities assigned by the language model, given the same context, to text produced by beam search and to text written by humans. It can be observed that humans do not always choose the token with the maximum conditional probability. Always selecting the maximum conditional probability, however, creates positive feedback and falls into repetition, which is why beam search is prone to repetitive output.
- Top-k sampling
Top-k sampling truncates the probability distribution with a fixed-size slice; the only difference from the Nucleus Sampling proposed in this paper is how the kept portion of the distribution is chosen. The big question for top-k sampling is: what value of k is appropriate? As can be seen from the figure above, when the probability distribution is fairly flat (left), a small k is not appropriate, because many equally plausible tokens are discarded; when the distribution is sharply peaked (right), a large k does not work either. Generally speaking, a small k tends to generate bland, generic tokens, while a large k more easily generates wrong tokens, because the probabilities of the top-k tokens are renormalized and the tail gains weight (see the small demo below).
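A small demo (with made-up toy distributions, not data from the paper) contrasting how many tokens a fixed k keeps versus an adaptive top-p cutoff on a flat and a peaked distribution:

```python
import numpy as np

def kept_by_top_k(probs, k):
    return np.argsort(probs)[-k:]                       # the k most probable token ids

def kept_by_top_p(probs, p):
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    return order[:cutoff]                               # smallest set with mass >= p

flat   = np.full(100, 0.01)                             # nearly uniform distribution
peaked = np.array([0.90, 0.05, 0.03] + [0.02 / 97] * 97)

for name, dist in [("flat", flat), ("peaked", peaked)]:
    print(name,
          "| top-k (k=10) keeps", len(kept_by_top_k(dist, 10)),
          "| top-p (p=0.9) keeps", len(kept_by_top_p(dist, 0.9)))
# A fixed k=10 discards many plausible tokens in the flat case and still admits
# low-probability tokens in the peaked case; the top-p cutoff adapts in both.
```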
4. Result analysis
An explanation of the metrics:
- Perplexity
Perplexity evaluates the quality of a language model. Here it is computed separately on the outputs of the different decoding strategies and on the human-written text; the perplexity of a good decoding strategy's output should be close to that of the human text (see the sketch below).
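A minimal sketch of how perplexity is computed from the model's log-probabilities of the observed tokens (the numbers below are illustrative, not from the paper):

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) of the observed tokens;
    lower values mean the model finds the text less surprising."""
    return float(np.exp(-np.mean(token_log_probs)))

# Example with made-up token probabilities 0.5, 0.25, 0.125 -> perplexity 4.0
print(perplexity(np.log([0.5, 0.25, 0.125])))
```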
5. Innovation or contribution
- A very simple and effective decoding strategy is proposed.
- The disadvantages and causes of various decoding strategies are analyzed.
6. Personal thinking
- The paper's experimental setup is worth learning from; it makes it easy to come up with new ideas.
- The analysis of the problems with the other decoding strategies is very clear, and their root causes are clearly identified.
Finally, let's look at how these decoding strategies are exposed in the Transformers code.
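As a starting point, here is a minimal usage sketch (assuming the Hugging Face `transformers` library and the `gpt2` checkpoint) that enables nucleus sampling through the `top_p` argument of `generate`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("An increasing sequence: one,", return_tensors="pt")

# do_sample=True switches from greedy/beam search to stochastic decoding;
# top_k=0 disables top-k filtering so only the nucleus (top_p) cutoff applies.
output_ids = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.95,
    top_k=0,
    max_new_tokens=40,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```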