This is the 26th day of my participation in the August More Text Challenge
In this article, I will introduce Teacher Forcing, a technique used during the training of sequence generation models
Take Seq2Seq as an example. During training, the Decoder input at time step $t_0$ is the start token "`<SOS>`", but the output may not be the correct result "the"; it may instead be a wrong word such as "like". So the question is: at $t_1$, should we continue to feed the correct word "the" as the input, or should we feed the output "like" from the previous step $t_0$ as the input?
In fact, the question above involves two completely different training methods:
- No matter what the output was at the previous time step, the input at the current time step is always the prescribed token taken from the given target (the ground truth)
- The input at the current time step depends on the output at the previous time step; specifically, the output of the previous time step is fed back as the input of the current time step (a minimal sketch of both strategies follows this list)
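To make the contrast concrete, here is a minimal, self-contained Python sketch of the two input-feeding strategies. The names `decoder_step`, `SOS_TOKEN`, and the toy target sequence are hypothetical stand-ins for a real Seq2Seq decoder, not part of any particular library.

```python
SOS_TOKEN = "<SOS>"

def decoder_step(prev_token, state):
    """Stand-in for one decoder step: returns (predicted_token, new_state).
    A real model would run prev_token and state through the network here."""
    return "like", state

def decode_with_ground_truth_inputs(target_tokens, state=None):
    # Strategy 1: regardless of the previous output, the current input is
    # always the prescribed token from the given target.
    prev, outputs = SOS_TOKEN, []
    for gold in target_tokens:
        pred, state = decoder_step(prev, state)
        outputs.append(pred)
        prev = gold  # always feed the ground-truth token as the next input
    return outputs

def decode_with_own_outputs(target_tokens, state=None):
    # Strategy 2: the current input is the model's own output from the
    # previous time step.
    prev, outputs = SOS_TOKEN, []
    for _ in target_tokens:
        pred, state = decoder_step(prev, state)
        outputs.append(pred)
        prev = pred  # feed back the model's own previous prediction
    return outputs

print(decode_with_ground_truth_inputs(["the", "cat", "sat"]))
print(decode_with_own_outputs(["the", "cat", "sat"]))
```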
A loose analogy for the first training method: imagine a top student sitting next to Xiao Ming while he studies. Whenever Xiao Ming works on a sequence-generation problem, he peeks at the correct answer for each step, so he only needs to follow the idea of the previous step's answer and work out the result of the current step. Compared with guessing every step on his own, this effectively prevents errors from being amplified further, and in the early stage of learning it also lets him pick up a lot of knowledge quickly under the top student's guidance.
However, the first training method has the following problems:
- Every token generated during decoding is constrained by the ground truth: the model is expected to produce output that corresponds to the reference sentence token by token. This constraint reduces divergence and speeds up convergence during training, but it also kills off the possibility of diverse translations
- This constraint can also lead to a problem known as overcorrection. For example:
- Reference: “We should comply with the rule.”
- Partway through decoding, the model has predicted: “We should abide”
- However, under the first training method, the third ground-truth token “comply” must be fed as the input at step 4, so based on the patterns it has learned, the model may then predict “with” at step 4
- The model's final output therefore becomes “We should abide with”
- In fact, “abide with” is an incorrect usage, but because of the interference from the ground-truth token “comply”, the model is pushed into an overcorrected state and produces an incoherent sentence
If instead we use the second training method, then as soon as one prediction is wrong, subsequent predictions drift further and further off track, and the model becomes difficult to train to convergence
Teacher Forcing sits between the two training methods above. Specifically, at each time step during training there is a certain probability of using the previous time step's output as the input, and a certain probability of using the correct target token as the input
Refer to the pseudocode below
```python
import random

teacher_forcing_ratio = 0.5
# Decide (here, once per sequence) whether to use teacher forcing this time
teacher_forcing = random.random() < teacher_forcing_ratio
if teacher_forcing:
    pass  # feed the ground-truth target token as the next decoder input
else:
    pass  # feed the model's own previous output as the next decoder input
```
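For a more complete picture, below is a rough sketch of how such a ratio might be used inside a training-time decoding loop. The tiny decoder (embedding + GRU + linear projection), the tensor shapes, and names such as `SOS_IDX` are illustrative assumptions for this post, not the code of any particular model; here the coin is flipped at every time step rather than once per sequence.

```python
import random
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 32, 64
SOS_IDX = 0
teacher_forcing_ratio = 0.5

# A minimal decoder: embedding -> GRU -> projection to the vocabulary
embedding = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
out_proj = nn.Linear(hidden_dim, vocab_size)

batch_size, target_len = 4, 7
target = torch.randint(0, vocab_size, (batch_size, target_len))     # ground-truth tokens
hidden = torch.zeros(1, batch_size, hidden_dim)                     # e.g. the encoder's final state

decoder_input = torch.full((batch_size, 1), SOS_IDX, dtype=torch.long)  # the first input is <SOS>
step_logits = []

for t in range(target_len):
    emb = embedding(decoder_input)            # (batch, 1, emb_dim)
    output, hidden = gru(emb, hidden)         # (batch, 1, hidden_dim)
    logits = out_proj(output.squeeze(1))      # (batch, vocab_size)
    step_logits.append(logits)

    if random.random() < teacher_forcing_ratio:
        # Teacher forcing: the ground-truth token of step t becomes the next input
        decoder_input = target[:, t].unsqueeze(1)
    else:
        # Free running: the model's own prediction becomes the next input
        decoder_input = logits.argmax(dim=1, keepdim=True)

# Standard cross-entropy loss against the ground-truth tokens
loss = nn.CrossEntropyLoss()(
    torch.stack(step_logits, dim=1).view(-1, vocab_size),
    target.view(-1),
)
```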
Reference
- Teacher Forcing and Exposure Bias
- What is Teacher Forcing?