This is my 35th day of participating in the First Challenge 2022

This article covers a best paper of NAACL-HLT 2019: Probing the Need for Visual Context in Multimodal Machine Translation. The authors are from Le Mans University and Imperial College London.

Background

Multimodal machine translation (MMT) has attracted a lot of attention since it was introduced as a shared task at WMT16. Common approaches can be divided into three categories:

  • Multimodal attention over convolutional features;
  • Cross-modal interaction using global image features;
  • Regional features extracted by object detection networks.

However, many researchers have found that the contribution of visual features is not obvious. For example, Grönroos et al. concluded that the multimodal gains in their work were small, and that the largest improvements came from external parallel corpora. Lala et al. found that their multimodal word sense disambiguation method did not differ much from its text-only counterpart; Elliott showed that MMT with unrelated image features also worked well. The WMT18 organisers concluded that MMT schemes had so far yielded only limited gains.

Motivation

Some argue that the limited role of the visual modality is due to the quality of the image features or to the way the images are integrated. The authors disagree: the effect is limited because the source sentences in the Multi30K dataset are inherently simple, so the plain text alone can already be translated well. This paper aims to probe the contribution of visual information in MMT and refute the views above.

Method

The authors test this conjecture in two ways:

  1. Several degradation mechanisms are applied to the source sentences and MMT performance is re-evaluated (see the sketch after this list).
  2. Incongruent, i.e. unrelated, images are fed to the model to further probe its sensitivity to visual information.
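As a concrete illustration of the first probe, here is a minimal sketch of one possible source degradation in the spirit of the paper's progressive masking: keep only the first k source tokens and replace the rest with a placeholder, so that the missing content can only be recovered from the image. The function name, the `<mask>` token, and the choice of k are my own illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the paper's exact implementation):
# "progressive masking" keeps the first k tokens of a source sentence
# and replaces every remaining token with a placeholder.

def progressively_mask(sentence: str, k: int, mask_token: str = "<mask>") -> str:
    """Keep the first k whitespace-separated tokens, mask the rest."""
    tokens = sentence.split()
    kept = tokens[:k]
    masked = [mask_token] * max(0, len(tokens) - k)
    return " ".join(kept + masked)


if __name__ == "__main__":
    src = "a man in a blue shirt is riding a bike"
    for k in (0, 2, 4):
        print(k, "->", progressively_mask(src, k))
```

The degraded sentences are then used to retrain/re-evaluate the MMT models; if visual features matter, the multimodal models should degrade more gracefully than the text-only baseline.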

Model

For the text-only baseline, the authors use the attentive sequence-to-sequence model proposed by Bahdanau et al. (2014).

For the MMT models, the authors choose DIRECT and HIER. The former is based on multimodal attention: the textual and visual context vectors are concatenated and then linearly projected. The latter is a hierarchical extension of the former, replacing the concatenation with an extra attention layer over the two contexts. In addition, the authors test INIT, which initializes both the encoder and the decoder with visual features.
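To make the DIRECT-style fusion concrete, here is a minimal PyTorch sketch of one decoding step: a textual context vector and a visual context vector (each produced by its own attention mechanism) are concatenated and projected back to the decoder hidden size. The class name, tensor shapes, and the single linear layer are my own simplifications of the idea, not the authors' implementation; HIER would instead apply a second attention layer over the two context vectors rather than concatenating them.

```python
import torch
import torch.nn as nn

class DirectFusion(nn.Module):
    """Sketch of DIRECT-style multimodal fusion (assumed shapes/names):
    concatenate the textual and visual context vectors, then project
    back to the decoder hidden size."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, text_ctx: torch.Tensor, visual_ctx: torch.Tensor) -> torch.Tensor:
        # text_ctx, visual_ctx: (batch, hidden_size)
        fused = torch.cat([text_ctx, visual_ctx], dim=-1)  # (batch, 2*hidden)
        return torch.tanh(self.proj(fused))                # (batch, hidden)

# Usage: the fused context feeds the decoder's output layer at each step.
fusion = DirectFusion(hidden_size=256)
c_text = torch.randn(4, 256)    # attention over encoder states
c_visual = torch.randn(4, 256)  # attention over convolutional image features
c_multimodal = fusion(c_text, c_visual)  # (4, 256)
```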

A separate post introducing HIER (a hierarchical multimodal attention approach) may follow; the link will be added here later.


So what are the results? See the next post (●'◡'●)~

Next in this series: Multimodal translation is just like that. Is visual information actually useful? (B) – juejin.cn