This article covers the paper Incorporating Global Visual Features into Attention-based Neural Machine Translation, published at ACL in 2017. Lead author Calixto was at the ADAPT Centre at Dublin City University and is currently an assistant professor at the University of Amsterdam.
Preface
Some background: the ACL-WMT 2016 shared tasks introduced multimodal machine translation for the first time, yet most submissions failed to use visual features to improve translation quality. At the same time, the now-famous Transformer had not yet been proposed; in early 2017 the attention mechanism was still an emerging research hotspot and was being applied to more and more fields, such as image description generation and machine translation.
The best-performing model at the time was that of Huang et al., which used global/regional image features extracted with VGG19 and an RCNN.
[Paper Notes] Attention-based Multimodal Neural Machine Translation (juejin.cn)
The author uploaded this paper to arXiv in January 2017, setting a new SOTA at the time, and reported an improved model just a month later that pushed translation quality further.
[Paper Notes] Doubly-Attentive Decoder: a classic of multimodal attention
Motivation
Multimodal machine translation was still a new direction, and the author’s motivation is simple: make better use of visual information to assist machine translation and push translation quality as far as possible. Many earlier attempts had not produced convincing results; typical examples are the author’s own 2016 work as well as the work of Huang et al. mentioned above. Meanwhile, the attention mechanism had begun to shine in image description generation, NMT, and other fields, so the author set out to build an attention-based multimodal machine translation model.
Method
The author extracts visual features from the image with VGG19 and designs three different ways to integrate the image into attention-based NMT (a code sketch follows the list):
- Image feature as a word: the projected image feature is treated as a word in the bidirectional RNN over the source sentence, used as the first and/or last token. This is somewhat similar to the work of Huang et al., except that the author adds the image feature at both the beginning and the end of the sequence. The intuition is that when the image feature is the first word, it is propagated into the source-sentence representation by the forward RNN; likewise, when the image feature is the last word, it is propagated into the representation by the backward RNN.
- Image features initialize the encoder: the original attention-based NMT initializes the encoder’s hidden states with zero vectors; here the author instead uses two single-layer feed-forward networks (one per direction), each with its own projection matrix, to map the image feature into the hidden-state dimension and uses the result as the initial hidden states.
- Image features initialize the decoder: the decoder’s initial hidden state is computed with the image feature as additional input, projected in the same way as above alongside the usual summary of the encoder states.
- In addition, the author also evaluates a fourth approach: using the image as an additional context available at every time step of decoding.
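To make the first three strategies concrete, here is a minimal PyTorch sketch (my own, not the authors’ code). It assumes a pre-extracted 4096-dimensional global VGG19 feature and a GRU-based bidirectional encoder; the layer names, dimensions, and tanh projections are illustrative assumptions rather than the paper’s exact settings.

```python
# Sketch of the three integration strategies; names and sizes are assumptions.
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, IMG_DIM, VOCAB = 620, 1000, 4096, 10000

embedding = nn.Embedding(VOCAB, EMB_DIM)
encoder_rnn = nn.GRU(EMB_DIM, HID_DIM, bidirectional=True, batch_first=True)

# Global image feature, e.g. a pre-extracted VGG19 fully-connected activation.
img_feat = torch.randn(1, IMG_DIM)
src_tokens = torch.randint(0, VOCAB, (1, 12))           # one source sentence
src_emb = embedding(src_tokens)                         # (1, 12, EMB_DIM)

# 1) Image as "word": project the image into the word-embedding space and
#    use it as the first and last token of the source sequence.
img_to_word = nn.Linear(IMG_DIM, EMB_DIM)
img_word = img_to_word(img_feat).unsqueeze(1)           # (1, 1, EMB_DIM)
src_with_img = torch.cat([img_word, src_emb, img_word], dim=1)
enc_out_w, _ = encoder_rnn(src_with_img)

# 2) Image initializes the encoder: two single-layer feed-forward projections,
#    one per direction, replace the usual all-zero initial hidden states.
img_to_h_fwd = nn.Sequential(nn.Linear(IMG_DIM, HID_DIM), nn.Tanh())
img_to_h_bwd = nn.Sequential(nn.Linear(IMG_DIM, HID_DIM), nn.Tanh())
h0 = torch.stack([img_to_h_fwd(img_feat), img_to_h_bwd(img_feat)], dim=0)
enc_out_e, _ = encoder_rnn(src_emb, h0)                 # (1, 12, 2*HID_DIM)

# 3) Image initializes the decoder: project the image together with a summary
#    of the encoder states to obtain the decoder's initial hidden state.
decoder_init = nn.Sequential(nn.Linear(2 * HID_DIM + IMG_DIM, HID_DIM), nn.Tanh())
enc_summary = enc_out_e.mean(dim=1)                     # (1, 2*HID_DIM)
dec_h0 = decoder_init(torch.cat([enc_summary, img_feat], dim=-1))
print(enc_out_w.shape, enc_out_e.shape, dec_h0.shape)
```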
Experiment
On the English-German translation task, Model 2 and Model 3 both improve over the text-only NMT and PBSMT baselines, the first time a purely neural model significantly outperformed PBSMT on all metrics. On the German-English task, all multimodal models beat the text-only baseline except Model 1, whose BLEU score falls below the text-only NMT; the differences among the multimodal models, however, are not significant.
The author also designs several ensemble decoding schemes that combine the models above into one system. The models are trained independently; the ensemble starts from the best-performing model (Model 3) and gradually adds the weaker ones. The results show that ensemble performance keeps improving even as weaker models are added, peaking when up to four models are combined.
I am not sure exactly how the ensembling is done; the paper does not describe the experimental details.
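For reference, a common way to ensemble NMT models is to average their predicted next-token distributions at every decoding step. The sketch below assumes that scheme and a hypothetical encode/step interface; it is not taken from the paper.

```python
# Hedged sketch: ensemble decoding by averaging per-step probabilities.
import torch

def ensemble_greedy_decode(models, src, bos_id, eos_id, max_len=50):
    """Greedy decoding that averages the next-token distributions of several
    independently trained models at every step.

    Each model is assumed (hypothetically) to expose:
      encode(src)        -> decoder state
      step(token, state) -> (probs over target vocab, new state)
    """
    states = [m.encode(src) for m in models]
    token, output = bos_id, []
    for _ in range(max_len):
        step_probs = []
        for i, m in enumerate(models):
            probs, states[i] = m.step(token, states[i])
            step_probs.append(probs)
        avg = torch.stack(step_probs).mean(dim=0)  # ensemble = mean distribution
        token = int(avg.argmax())
        if token == eos_id:
            break
        output.append(token)
    return output
```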
Error Analysis
The author also analyzes concrete examples, showing the translations produced by these models on two samples. However, the explanation offered is not convincing, and even the description of the phenomenon left me confused, so I will not go over it here.
Summary
The paper proposes three new attention-based multimodal machine translation models, two of which innovatively use image features as the initial state of the encoder or decoder. These models achieve SOTA results, and ensembling them improves the results further.
The author also finds that multimodal NMT models can benefit from back-translated data, which points to a direction for further research.