In the traditional Seq2Seq model, Decoder outputs a fixed number of targets, for example, the number of targets predicted by Decoder during translation is equal to the size of the dictionary. As a result, Seq2Seq cannot be used for some combinatorial optimization problems, such as convex hull problems, triangulation, traveling salesman problems (TSP), etc. The Pointer Network output dictionary size is equal to the length of the Encoder input sequence. Select one of the Encoder inputs based on the Attention value as the Decoder output.
1.Pointer Network
Seq2Seq model is a model containing Encoder and Decoder that can convert one sequence into another. However, the predicted output target size of Seq2Seq model is fixed. For some cases, the output target size will change, such as many combinatorial optimization problems.
The number of output targets of combinatorial optimization problem depends on the length of input sequence. For example, the traveling salesman problem contains 5 cities (1, 2, 3, 4, 5), and the number of targets is 5 when the output prediction is made. Pointer Network changes the way of traditional Attention, so it can be used for these combinational optimization problems. When predicting outputs, Pointer Network will get the probability of each city in the input sequence according to Attention (i.e., outputs are selected from the inputs).
1.1 the traditional Attention
Traditional Attention merges the Encoder’s output at each moment based on the Attention value, then mixes it with the Decoder’s current output to predict the output. As shown in the following formula, E represents the output of Encoder, D represents the output of Decoder, W and V are parameters that can be learned.
1.2 Attention of Pointer Network
Pointer Network does not fuse the Encoder output after calculating the Attention value, but takes Attention as the probability of output at each position in the input sequence P.
The difference between Pointer Network and Seq2Seq is shown in the figure below, which shows the convex hull problem. Whereas Seq2Seq’s Decoder predicts the output of each position (but the number of targets is fixed), Pointer Network’s Decoder directly calculates the probability of each position in the input sequence based on Attention. Take the most probable input position as the current output.
2. Experimental results
All at the bottom of the figure are the experimental results of Pointer Network. M is the number of training data midpoints, and N is the number of test data midpoints. Figure (a) shows Seq2Seq using LSTM. Seq2Seq training and testing must use the same number.
3. References
- Pointer Networks