Goal-oriented visual dialogue is a relatively new task at the intersection of vision and language: it requires a machine to accomplish a specific vision-related goal through multiple rounds of dialogue. The task has both research significance and application value. Recently, Professor Wang Xiaojie and his team from Beijing University of Posts and Telecommunications, in cooperation with the Meituan AI Platform NLP Center, published the paper "Answer-Driven Visual State Estimator for Goal-oriented Visual Dialogue" at ACM MM 2020, a leading international conference in the multimedia field.

In this paper, they share their latest progress on goal-oriented visual dialogue: an Answer-Driven Visual State Estimator (ADVSE) for fusing dialogue-history information and image information in visual dialogue. Its Answer-Driven Focusing Attention (ADFA) mechanism effectively strengthens the influence of the answer, while its Conditional Visual Information Fusion (CVIF) mechanism adaptively selects between global and difference information. The estimator can be used not only for question generation but also for guessing. Experimental results on GuessWhat?!, an international open dataset for visual dialogue, show that the model achieves state-of-the-art performance on both the question generation and the guessing tasks.

Background

A good visual dialogue model not only needs to understand information from both modalities, the visual scene and the natural-language dialogue, but also needs to follow a reasonable strategy to reach the goal as quickly as possible. Goal-oriented visual dialogue also has rich application scenarios, such as intelligent assistants, interactive picking robots, and filtering large amounts of visual media through natural language.

Research status and analysis

To conduct goal-oriented, visually grounded conversations, an AI agent should be able to learn visually sensitive multimodal dialogue representations and dialogue strategies. Strub et al. [1] first proposed using reinforcement learning to explore dialogue strategies, and subsequent work focused on reward design [2,3] or action selection [4,5]. However, most of these methods represent the multimodal dialogue in a simple way: the two modalities are encoded separately, with linguistic features encoded by an RNN and visual features encoded by a pre-trained CNN, and the results are then concatenated.

A good multimodal dialogue representation is the cornerstone of strategy learning. To improve it, various attention mechanisms have been proposed [6,7,8] to enhance multimodal interaction. Although much progress has been made, some important problems remain.

  1. In terms of language encoding, existing methods cannot distinguish between different answers, which are usually encoded after the question. Since an answer is only a single word, Yes or No, while a question contains a much longer word sequence, the influence of the answer is very weak. In fact, however, the answer largely determines how the image focus region changes and how the dialogue develops: Yes and No lead to completely different directions. For example, when the answer to the first question "Is it a vase?" is "Yes", the questioner keeps focusing on the vases and asks about features that best distinguish among the multiple vases; when the answer to the third question "Are parts red?" is "No", the questioner no longer focuses on the red vase but asks about the remaining candidates.
  2. The situation is similar for visual encoding and fusion. Existing methods either use a static visual encoding that stays constant throughout the conversation and is directly concatenated with the dynamically changing language encoding, or use QA-pair encoding to guide attention over the visual content; in either case, different answers cannot be effectively distinguished. As noted above, different answers cause very different changes in the image focus region: in general, when the answer is "Yes", attention should stay on the current object and focus further on its characteristics, whereas when the answer is "No", attention may need to return to the image as a whole to find new candidates.

Answer-driven visual state estimator

To address these issues, this paper proposes an answer-driven visual state estimator, as shown in Figure 2 below. The new framework includes answer-driven attention updating (ADFA-ASU) and a conditional visual information fusion mechanism (CVIF), which solve the two problems above respectively.

In answer-driven attention updating, a threshold function first polarizes the attention guided by the current turn's question; the attention is then reversed or maintained depending on the answer to that question. This yields the influence of the current question-answer pair on the dialogue state, which is accumulated into the dialogue state, effectively emphasizing the impact of the answer on the conversation state. CVIF then fuses the global information of the image and the difference information of the current candidates under the guidance of the current QA pair to obtain the estimated visual state.

Answer-driven Attention Updating (ADFA-ASU)
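
The paper's equations are not reproduced in this article, so the following is only a minimal PyTorch sketch of the mechanism described above: the question-guided attention is polarized by a threshold function, kept or reversed according to the Yes/No answer, and accumulated into the attention state. The function names (`polarize`, `adfa_update`), the soft-step threshold, and the multiplicative accumulation are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def polarize(attn, tau=0.2, sharpness=10.0):
    # Soft threshold: push attention weights above tau toward 1 and below tau toward 0.
    return torch.sigmoid(sharpness * (attn - tau))

def adfa_update(state_attn, question_attn, answer_is_yes):
    """One turn of answer-driven focusing attention (illustrative only).

    state_attn:    (num_regions,) accumulated attention state over image regions
    question_attn: (num_regions,) attention guided by the current question
    answer_is_yes: bool, answer of the current turn
    """
    focused = polarize(question_attn)
    # Keep the question-focused regions on "Yes"; switch to their complement on "No".
    turn_effect = focused if answer_is_yes else 1.0 - focused
    # Accumulate the current QA pair's effect into the attention state
    # (multiplicative accumulation is an assumption, not the paper's exact rule).
    new_state = state_attn * turn_effect
    return new_state / (new_state.sum() + 1e-8)  # renormalize to a distribution

# Toy usage: 5 image regions, question attends mostly to region 1.
state = torch.full((5,), 0.2)
q_attn = torch.tensor([0.05, 0.70, 0.10, 0.10, 0.05])
state = adfa_update(state, q_attn, answer_is_yes=False)  # "No": attention moves away from region 1
print(state)
```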

Conditional Visual Information Fusion (CVIF)
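
Similarly, here is a minimal sketch of the conditional fusion idea: a gate conditioned on the current QA state selects between the global image feature and the difference feature of the remaining candidates. The class `CVIFSketch` and the single sigmoid gate are illustrative assumptions rather than the paper's exact fusion function.

```python
import torch
import torch.nn as nn

class CVIFSketch(nn.Module):
    """Illustrative gate that fuses the global image information with the difference
    information of the remaining candidates, conditioned on the current QA state."""

    def __init__(self, qa_dim):
        super().__init__()
        self.gate = nn.Linear(qa_dim, 1)  # QA-conditioned scalar gate

    def forward(self, global_feat, diff_feat, qa_state):
        # global_feat, diff_feat: (batch, vis_dim); qa_state: (batch, qa_dim)
        g = torch.sigmoid(self.gate(qa_state))          # (batch, 1); near 1 -> prefer difference info
        return g * diff_feat + (1.0 - g) * global_feat  # estimated visual state

# Toy usage
cvif = CVIFSketch(qa_dim=256)
v_global, v_diff, qa = torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 256)
print(cvif(v_global, v_diff, qa).shape)  # torch.Size([2, 512])
```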

Applying the answer-driven visual state estimator to question generation and guessing

ADVSE is a generic framework for goal-oriented visual dialogue, so we apply it to both the question generation (QGen) and guessing (Guesser) tasks of GuessWhat?!. We first combine ADVSE with a classical hierarchical dialogue-history encoder to obtain a multimodal dialogue representation; combining this representation with a decoder yields the ADVSE-based question generation model, and combining it with a classifier yields the ADVSE-based guesser model. A rough sketch of this combination is given below.
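
The sketch below shows what such a combination might look like in PyTorch, assuming the fused multimodal representation initializes a GRU decoder for QGen. It is an illustration of the described architecture, not the authors' released code; all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ADVSEQGenSketch(nn.Module):
    """Rough sketch: fuse the estimated visual state with the hierarchical dialogue
    encoding into a multimodal representation, then decode the next question with a
    GRU language model."""

    def __init__(self, vocab_size, emb_dim=300, hid_dim=512, vis_dim=512, dlg_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fuse = nn.Linear(vis_dim + dlg_dim, hid_dim)  # multimodal dialogue representation
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, visual_state, dialogue_state, question_tokens):
        # visual_state: (batch, vis_dim); dialogue_state: (batch, dlg_dim)
        h0 = torch.tanh(self.fuse(torch.cat([visual_state, dialogue_state], dim=-1)))
        emb = self.embed(question_tokens)            # (batch, seq_len, emb_dim)
        out, _ = self.decoder(emb, h0.unsqueeze(0))  # init decoder with the fused state
        return self.out(out)                         # per-step token logits

# The Guesser variant would replace the GRU decoder with a classifier that scores
# each candidate object against the same fused multimodal representation.

# Toy usage
model = ADVSEQGenSketch(vocab_size=1000)
logits = model(torch.randn(2, 512), torch.randn(2, 512), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```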

Experimental results on the GuessWhat?! dataset show that the model achieves state-of-the-art performance on both question generation and guessing. We first present the results of comparing ADVSE-QGen and ADVSE-Guesser with the latest models.

In addition, we evaluated the performance of the combination of ADVSE-QGen and ADVSE-Guesser. Finally, we give a qualitative analysis of the model. The code for our model will soon be available from ADVSE-GuessWhat.

Conclusion

In this paper, we propose an answer-driven visual state estimator (ADVSE) to emphasize the important influence of different answers on visual information in goal-oriented visual dialogue. First, we capture the effect of the answer on visual attention through answer-driven focusing attention (ADFA), where question-related visual attention is retained or shifted according to the answer in each turn.

In addition, in the conditional visual information fusion mechanism (CVIF), we provide two types of visual information for different QA states and fuse them, depending on the situation, into an estimate of the visual state. Applying the proposed ADVSE to the question generation and guessing tasks of GuessWhat?!, we obtain higher accuracy than existing state-of-the-art models on both tasks, along with promising qualitative results. In the future, we will further explore the potential improvement of using ADVSE-QGen and ADVSE-Guesser jointly.

References

  • [1] Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In International Joint Conference on Artificial Intelligence.

  • [2] Pushkar Shukla, Carlos Elmadjian, Richika Sharan, Vivek Kulkarni, Matthew Turk, and William Yang Wang. 2019. What Should I Ask? Using Conversationally Informative Rewards for Goal-oriented Visual Dialog. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6442–6451. doi.org/10.18653/v1…

  • [3] Junjie Zhang, Qi Wu, Chunhua Shen, Jian Zhang, Jianfeng Lu, and Anton van den Hengel. 2018. Goal-Oriented Visual Question Generation via Intermediate Rewards. In Proceedings of the European Conference on Computer Vision.

  • [4] Ehsan Abbasnejad, Qi Wu, Iman Abbasnejad, Javen Shi, and Anton van den Hengel. 2018. An Active Information Seeking Model for Goal-oriented Vision-and-Language Tasks. CoRR abs/1812.06398 (2018). arXiv:1812.06398 arxiv.org/abs/1812.06…

  • [5] Ehsan Abbasnejad, Qi Wu, Javen Shi, and Anton van den Hengel. 2018. What's to Know? Uncertainty as a Guide to Asking Goal-Oriented Questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4150–4159.

  • [6] Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. 2018. Visual Grounding via Accumulated Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7746–7755.

  • [7] Tianhao Yang, Zheng-Jun Zha, and Hanwang Zhang. 2019. Making History Matter: History-Advantage Sequence Training for Visual Dialog. In Proceedings of the IEEE International Conference on Computer Vision. 2561–2569.

  • [8] Bohan Zhuang, Qi Wu, Chunhua Shen, Ian D. Reid, and Anton van den Hengel. 2018. Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4252–4261.

About the authors

The authors of this article include Xu Zipeng, Feng Xiangxiang, Wang Xiaojie, Yang Yushu, Jiang Huixing, and Wang Zhongyuan. They are from the Intelligent Science and Technology Center of the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, and the Meituan Search and NLP Center.

Recruitment information

The Meituan Search and NLP Department is recruiting search, recommendation, and NLP algorithm engineers on a long-term basis, based in Beijing/Shanghai. Interested candidates are welcome to send their resume to [email protected] (subject: Search and NLP).

For more technical articles, please follow the official WeChat account of MeituanTech.