
This paper was uploaded to arXiv on August 25, 2021 and was accepted as an oral paper at ACM MM 2021. The first author is from Renmin University of China. Paper title: Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

Previous part: UPOC2: Model Pre-training across Modalities and Languages in the Field of Fashion (Part 1) - Juejin (juejin.cn)

Pre-training Tasks

Three pre-training tasks are designed in this paper:

Multimodal Translation Language Modeling (MTLM)

This task combines translation language modeling (TLM) and masked language modeling (MLM) for the product-oriented machine translation (PMT) task. 15% of the word tokens in the two languages are randomly selected for prediction: 80% of them are replaced with [MASK], 10% are replaced with another random word (including foreign words), and 10% remain unchanged. Each selected token is then predicted from the image context, the surrounding words in the same language, and all the words in the other language.
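The masking rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the `[SEP]` separator, and the toy vocabulary are assumptions, and the image features that the real model also receives are left out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 15% / 80-10-10 rule; returns (corrupted tokens, labels)."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)
            labels.append(None)          # not a prediction target
            continue
        labels.append(tok)               # the model must recover this token
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: random word, either language
        else:
            corrupted.append(tok)                 # 10%: keep unchanged
    return corrupted, labels

def build_mtlm_example(src_tokens, tgt_tokens, vocab, seed=0):
    """Mask both languages and concatenate them (the [SEP] token is an assumption)."""
    rng = random.Random(seed)
    src_in, src_lab = mask_tokens(src_tokens, vocab, rng)
    tgt_in, tgt_lab = mask_tokens(tgt_tokens, vocab, rng)
    return src_in + ["[SEP]"] + tgt_in, src_lab + [None] + tgt_lab
```

Because both sentences sit in one sequence, a masked word in one language can be predicted from the unmasked words of the other language, which is what pushes the model toward cross-lingual alignment.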

However, the authors found that visual information contributes little to translation in this task, so MTLM alone cannot make good use of the visual modality. They therefore designed two additional pre-training tasks.

Image-source Sentence Matching (ISM)

This is a common task in vision-and-language pre-training models for learning semantic alignment between the visual and textual modalities. The model predicts whether the image matches the source sentence. It is worth noting that, to keep the ISM task from being too easy, negative samples are constructed by replacing the sentence with one taken from a similar product, so that the model pays more attention to product details rather than product categories.
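The negative sampling idea can be illustrated as follows. This is a rough sketch under stated assumptions: "similar product" is approximated here by "same category", the dictionary keys `image`, `sentence`, and `category` are hypothetical, and the 50% negative rate is an arbitrary choice; the paper may define similarity differently.

```python
import random

def build_ism_example(idx, products, rng, neg_prob=0.5):
    """Return (image, sentence, label); label 1 = matched pair, 0 = mismatched."""
    item = products[idx]
    if rng.random() >= neg_prob:
        return item["image"], item["sentence"], 1

    # Hard negatives: sentences of *other* products in the same category,
    # so the category alone cannot reveal the mismatch.
    candidates = [
        p for j, p in enumerate(products)
        if j != idx
        and p["category"] == item["category"]
        and p["sentence"] != item["sentence"]
    ]
    if not candidates:
        # No similar product available: fall back to the positive pair.
        return item["image"], item["sentence"], 1
    negative = rng.choice(candidates)
    return item["image"], negative["sentence"], 0
```

With random negatives, a dress image paired with a shoe description is trivially rejected; with same-category negatives the model has to look at details such as color, pattern, or sleeve length.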

Product Attribute Prediction (ATTP)

Pre-training was performed on the FACAD dataset. The input is the source sentence and the image; the words that describe product attributes in the source sentence are masked, forcing the model to rely on the image to predict the attributes.
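A minimal sketch of the attribute masking step is shown below. The attribute vocabulary and the function name are assumptions made for illustration; how the attribute words are actually identified is not described in these notes.

```python
def mask_attributes(tokens, attribute_vocab, mask_token="[MASK]"):
    """Mask every attribute word; returns (masked tokens, {position: original word})."""
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if tok.lower() in attribute_vocab:
            masked.append(mask_token)
            targets[pos] = tok       # to be predicted, mainly from the image
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical attribute vocabulary and source sentence.
attribute_vocab = {"floral", "sleeveless", "cotton"}
tokens = "a sleeveless floral dress".split()
masked, targets = mask_attributes(tokens, attribute_vocab)
```

Unlike MTLM, no target-language sentence is given here, so the masked attribute cannot simply be copied from its translation and the image becomes the main source of evidence.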

Fine-tuning

The image and the source sentence are used as context to generate the target sentence, and the self-attention mask of the target sentence is constrained so that the bidirectional pre-trained model can be adapted to work as a unidirectional generator. Similar to MTLM, 15% of the word tokens in the target language are randomly selected for prediction: 80% of them are replaced with [MASK], 10% are replaced with another random word (including foreign words), and 10% remain unchanged.
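One way to picture the constrained self-attention mask is as a matrix over the sequence [image regions + source tokens | target tokens]. The sketch below is an assumption about the layout, in the style of sequence-to-sequence masks used by unified language models; in particular, blocking context positions from attending to the target is an assumption not stated in these notes.

```python
def seq2seq_attention_mask(n_context, n_target):
    """mask[q][k] == 1 means query position q may attend to key position k.

    Positions 0 .. n_context-1 hold the image regions and source tokens;
    the remaining n_target positions hold the target tokens.
    """
    n = n_context + n_target
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_context:
                mask[q][k] = 1      # every position sees the full context
            elif q >= n_context and k <= q:
                mask[q][k] = 1      # a target token sees itself and earlier targets
    return mask

for row in seq2seq_attention_mask(n_context=2, n_target=3):
    print(row)
# [1, 1, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

The lower-right triangle is what makes the target side behave like a left-to-right decoder, while the context block stays fully bidirectional as in pre-training.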

Experiments

Pretraining task ablation experiment:

Row 1 is fine-tuning without any pre-training.

Rows 2 to 4 add the pre-training tasks one at a time, and model performance improves with each addition.

In row 5, the noisy Fashion-MMT(L) dataset is added in the pre-training stage and the model is fine-tuned on Fashion-MMT(C), which gives a significant improvement over row 4. This suggests that noisy, machine-generated data also contributes to model performance.

Number of encoder layers:

Comparison with baseline:

The authors selected the state-of-the-art text-only NMT model, Transformer, and the state-of-the-art MMT model, Multimodal Graph (see [Paper notes] Graph-based multimodal fusion encoder: when GNN meets multimodal machine translation - Juejin (juejin.cn)). The experimental results are as follows: the UPOC2 model outperforms the state-of-the-art MMT model on both datasets.

Results on the Multi30k dataset show that the UPOC2 model also achieves the best performance on the traditional MMT task.

Low-resource setting:

The authors also explored how fine-tuning on only part of the dataset affects model performance, as shown in the following table:

UPOC2 fine-tuned on only 15,000 triples performs on par with the state-of-the-art MMT model fine-tuned on all 36,000 triples, which indicates that the model effectively reduces the dependence on large amounts of data.

Qualitative analysis:

The authors also selected several examples for qualitative analysis. The following figure shows the translations produced by UPOC2 and by the state-of-the-art NMT and MMT models alongside the ground truth, together with visualizations of the attention over the source sentence and the image during translation.

Summary

This paper builds a large-scale bilingual product description dataset, which is the largest multimodal machine translation dataset in the fashion domain. It also designs UPOC2, a unified pre-training and fine-tuning framework, together with three pre-training tasks that learn the semantic alignment between source and target sentences and between images and text. The results are significantly better than those of the state-of-the-art NMT and MMT models. In addition, the ablation and low-resource experiments show that the model benefits from large-scale noisy data and reduces the dependence on large amounts of manually annotated data.