Abstract: With the continuous development of artificial intelligence technology, the knowledge graph, as a knowledge backbone of the field, has attracted wide attention from academia and industry for its powerful knowledge representation and reasoning capabilities. In recent years, knowledge graphs have been widely applied in semantic search, question answering, knowledge management, and other fields.
Author: Zhu Yushan (Alibaba) | Source: Alibaba Technology official account
1 Background
1. Multimodal knowledge graphs
With the development of artificial intelligence technology, the knowledge graph, as a knowledge backbone of the field, has attracted wide attention from academia and industry for its powerful knowledge representation and reasoning capabilities. In recent years, knowledge graphs have been widely applied in semantic search, question answering, knowledge management, and other fields. The main difference between a multimodal knowledge graph and a traditional one is that the traditional knowledge graph focuses mainly on entities and relations found in text and databases, whereas a multimodal knowledge graph additionally builds entities in multiple modalities (such as the visual modality), as well as the multimodal semantic relations between them. Typical multimodal knowledge graphs today include DBpedia, Wikidata, IMGpedia, and MMKG.
Multimodal knowledge graphs (MKGs) have been widely used in many fields, such as natural language processing and computer vision. Although multimodal structured data is heterogeneous in its low-level representation, different modal data of the same entity are unified in their high-level semantics. Fusing multimodal data therefore provides, at the semantic level, the data support needed to build unified representation models across modalities. In addition, multimodal knowledge graph technology can serve various downstream applications: multimodal entity linking can merge different modalities of the same entity and can be applied to scenarios such as news reading and product identification; multimodal knowledge graph completion and distant supervision can complete and enrich an existing multimodal knowledge graph; and multimodal dialogue systems can be used in e-commerce recommendation and product question answering.
2. Multimodal pre-training
The successful application of pretraining techniques in computer vision (CV), e.g., VGG, Google Inception, and ResNet, and in natural language processing (NLP), e.g., BERT, XLNet, and GPT-3, has inspired more and more researchers to focus on multimodal pretraining. In essence, multimodal pretraining aims to learn the relationships between two or more modalities. Academic multimodal pretraining schemes are mostly built on the Transformer module and focus on image-text tasks; the solutions are largely similar, differing mainly in model structure and in the pretraining tasks combined with it, while the downstream tasks include conventional classification and recognition, visual question answering, visual commonsense reasoning, and so on. VideoBERT, the first multimodal pretraining work, trains on a large number of unlabeled video-text pairs based on BERT. Current image-text multimodal pretraining models can be divided into two types: single-stream and dual-stream. VideoBERT, B2T2, VisualBERT, Unicoder-VL, VL-BERT, and UNITER use a single-stream architecture that models both image and text information with the self-attention of a single Transformer. In contrast, LXMERT, ViLBERT, and FashionBERT introduce a dual-stream architecture: the features of the image and the text are first extracted independently and then fused with a more complex cross-attention mechanism. To further improve performance, VLP uses a shared multilayer Transformer for encoding and decoding for image captioning and VQA. Building on the single-stream architecture, InterBERT adds two separate Transformer streams on top of the single-stream output to capture modality independence.
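To make the single-stream versus dual-stream distinction concrete, here is a minimal PyTorch sketch (hypothetical layer sizes and depths, not the code of any of the models above): the single-stream variant concatenates image and text tokens and runs one self-attention encoder over both, while the dual-stream variant encodes each modality separately and then exchanges information through cross-attention.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: hidden size, #text tokens, #image regions.
D, N_TXT, N_IMG = 256, 16, 8

class SingleStreamFusion(nn.Module):
    """Single-stream: concatenate image and text tokens, run one Transformer."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, txt, img):              # txt: (B, N_TXT, D), img: (B, N_IMG, D)
        joint = torch.cat([txt, img], dim=1)  # self-attention sees both modalities
        return self.encoder(joint)

class DualStreamFusion(nn.Module):
    """Dual-stream: encode each modality separately, then cross-attend."""
    def __init__(self):
        super().__init__()
        txt_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        img_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.txt_encoder = nn.TransformerEncoder(txt_layer, num_layers=2)
        self.img_encoder = nn.TransformerEncoder(img_layer, num_layers=2)
        self.txt2img = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.img2txt = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    def forward(self, txt, img):
        t, v = self.txt_encoder(txt), self.img_encoder(img)
        t_out, _ = self.txt2img(query=t, key=v, value=v)  # text attends to image
        v_out, _ = self.img2txt(query=v, key=t, value=t)  # image attends to text
        return t_out, v_out

if __name__ == "__main__":
    txt, img = torch.randn(2, N_TXT, D), torch.randn(2, N_IMG, D)
    print(SingleStreamFusion()(txt, img).shape)  # (2, 24, 256)
    t, v = DualStreamFusion()(txt, img)
    print(t.shape, v.shape)                      # (2, 16, 256) (2, 8, 256)
```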
3. Pre-training for knowledge enhancement
In recent years, more and more researchers have begun to combine knowledge graphs (KGs) with pretrained language models (PLMs) to achieve better PLM performance. K-BERT injects triples into sentences to produce a unified, knowledge-rich language representation. ERNIE integrates entity representations from the knowledge module into the semantic module, representing heterogeneous information about tokens and entities in a unified feature space. KEPLER encodes the textual descriptions of entities as text embeddings and treats the description embeddings as entity embeddings. KnowBERT uses an integrated entity linker to generate knowledge-enhanced entity-span representations through a form of word-to-entity attention. K-Adapter injects factual knowledge and linguistic knowledge into RoBERTa, providing a neural adapter for each kind of injected knowledge. DKPLM can dynamically select and embed knowledge based on the text context, perceiving both global and local KG information. JAKET proposes a joint pretraining framework in which a knowledge module generates context-aware embeddings for entities in the graph. KALM, ProQA, LIBERT, and others have also explored fusing knowledge graphs with PLMs in different application tasks. However, current knowledge-enhanced pretraining models target only a single modality, especially the text modality, and there is little work that integrates knowledge graphs into multimodal pretraining.
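As a rough illustration of the idea behind K-BERT-style triple injection (a deliberately simplified sketch, not K-BERT's actual implementation, which builds a sentence tree with soft-position embeddings and a visibility matrix), matched entity triples can be appended to the input sentence as extra context before it is fed to a language model. The toy knowledge graph and entities below are hypothetical.

```python
# Hypothetical toy knowledge graph: entity -> list of (relation, tail).
KG = {
    "Beijing": [("capital_of", "China")],
    "T-shirt": [("material", "cotton"), ("season", "spring")],
}

def inject_triples(sentence: str, kg: dict) -> str:
    """Append the triples of entities mentioned in the sentence as extra context."""
    extras = []
    for entity, facts in kg.items():
        if entity.lower() in sentence.lower():
            extras += [f"{entity} {rel} {tail}" for rel, tail in facts]
    return sentence if not extras else sentence + " [SEP] " + " ; ".join(extras)

print(inject_triples("This T-shirt is on sale.", KG))
# This T-shirt is on sale. [SEP] T-shirt material cotton ; T-shirt season spring
```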
2 Multimodal Commodity Knowledge Graph and Its Problems
With the development of artificial intelligence technology, the knowledge graph, as a knowledge backbone of the field, has attracted wide attention from academia and industry for its powerful knowledge representation and reasoning capabilities. The main difference between a multimodal knowledge graph and a traditional one is that the traditional knowledge graph focuses mainly on entities and relations found in text and databases, whereas a multimodal knowledge graph additionally builds entities in multiple modalities (such as the visual modality), as well as the multimodal semantic relations between them. As shown in Figure 1, in the e-commerce domain a multimodal commodity knowledge graph usually includes image, title, and structured knowledge.
The application scenarios of the multimodal commodity knowledge graph are very broad. Although multimodal structured data is heterogeneous in its low-level representation, different modal data of the same entity are unified in their high-level semantics, so fusing multimodal data helps express product information fully. Multimodal commodity knowledge graph technology can serve many downstream applications: for example, multimodal entity linking can merge different modalities of the same entity and is widely used in scenarios such as product alignment and celebrity same-style item search, while multimodal question answering systems play a significant role in e-commerce recommendation and product Q&A. However, effective technical means to fuse these multimodal data and thereby support a wide range of downstream e-commerce applications are still largely lacking.
Figure 1
In recent years, several multimodal pretraining techniques (such as VL-BERT, ViLBERT, LXMERT, and InterBERT) have been proposed, mainly to mine the association between image-modality and text-modality information. However, directly applying these multimodal pretraining methods to e-commerce scenarios causes problems. On the one hand, these models cannot model the structured information in a multimodal commodity knowledge graph; on the other hand, in an e-commerce multimodal knowledge graph, modality missing and modality noise (mainly missing and noisy texts and images) are two major challenges that severely degrade the performance of multimodal information learning. In real e-commerce scenarios, some sellers do not upload an image (or title) for their product, and the image (or title) provided by other sellers may not have the correct topic or semantics. Item-2 and Item-3 in Figure 2 show examples of modality noise and modality missing in Alibaba scenarios, respectively.
Figure 2
3 Solutions
To solve these problems, we treat product structured knowledge as a new modality independent of image and text, called the knowledge modality. That is, when pretraining on product data, we consider three modalities of information: the image modality (product image), the text modality (product title), and the knowledge modality (the product knowledge graph, PKG). As shown in Figure 2, the PKG contains triples of the form <h, r, t>. For example, <Item-1, Material, Cotton> indicates that Item-1 is made of cotton. The reasons for this design are: (1) the PKG describes objective characteristics of a product, is structured and easy to manage, and typically undergoes extensive maintenance and standardization, so it is relatively clean and reliable; (2) the information in the PKG overlaps with and complements the other modalities. Taking Item-1 in Figure 2 as an example, the image, title, and PKG all show that Item-1 is a long-sleeve T-shirt, while the PKG additionally says the T-shirt is suitable for both spring and autumn, which cannot be told from the image and title. Thus, when modality noise or modality missing is present, the PKG can correct or complement the other modalities.
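As a toy illustration (hypothetical data shaped after the Item-1 example above) of how PKG triples <h, r, t> carry attributes that the image or title may miss, the triples of an item can simply be grouped into attribute-value pairs:

```python
# Hypothetical PKG triples for the Item-1 example above.
pkg_triples = [
    ("Item-1", "Material", "Cotton"),
    ("Item-1", "Sleeve", "Long"),
    ("Item-1", "Season", "Spring"),
    ("Item-1", "Season", "Autumn"),
]

def attributes_of(item_id, triples):
    """Aggregate an item's attribute-value pairs from its PKG triples."""
    attrs = {}
    for h, r, t in triples:
        if h == item_id:
            attrs.setdefault(r, []).append(t)
    return attrs

print(attributes_of("Item-1", pkg_triples))
# {'Material': ['Cotton'], 'Sleeve': ['Long'], 'Season': ['Spring', 'Autumn']}
```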
4 Model Architecture
We propose K3M, a novel knowledge-aware multimodal pretraining method for e-commerce applications. The model architecture is shown in Figure 3. K3M learns the multimodal information of a product in three steps: (1) encoding the independent information of each modality, corresponding to the modal-encoding layer; (2) modeling the interaction between modalities, corresponding to the modal-interaction layer; (3) optimizing the model with the supervision signal of each modality, corresponding to the modal-task layer.
Figure 3
(1) Modal-encoding layer. When encoding the individual information of each modality, we use Transformer-based encoders to extract initial features for the image, text, and knowledge modalities from the image, the text, and the surface forms of the triples, respectively. The encoder parameters of the text modality and the knowledge modality are shared.
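A minimal sketch of what such a modal-encoding layer could look like, assuming detector-style image region features and a BERT-like token vocabulary (the dimensions, vocabulary size, and layer counts below are placeholders, not the paper's actual configuration); note how the knowledge encoder simply reuses the text encoder to mirror the parameter sharing described above:

```python
import torch
import torch.nn as nn

D = 256  # hypothetical hidden size

class ModalEncodingLayer(nn.Module):
    """Per-modality encoders: image regions, title tokens, and the surface
    forms of PKG triples. Text and knowledge encoders share their weights."""
    def __init__(self, vocab_size=30522, img_feat_dim=2048):
        super().__init__()
        img_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, D)  # project detector features to D
        self.img_encoder = nn.TransformerEncoder(img_layer, num_layers=2)
        self.token_emb = nn.Embedding(vocab_size, D)
        txt_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(txt_layer, num_layers=2)
        self.kg_encoder = self.text_encoder         # shared parameters

    def forward(self, img_feats, title_ids, triple_ids):
        img = self.img_encoder(self.img_proj(img_feats))    # (B, R, D) image features
        txt = self.text_encoder(self.token_emb(title_ids))  # (B, T, D) title features
        kg = self.kg_encoder(self.token_emb(triple_ids))    # (B, K, D) triple surface forms
        return img, txt, kg
```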
(2) Modal-interaction layer. The interaction between modalities is modeled in two processes. The first is the interaction between the text and image modalities: a co-attention Transformer first learns interactive features from the initial features of the image and text modalities; then, to preserve the independence of each single modality, we propose an initial-interactive feature fusion module that fuses the initial features and the interactive features of the image and text modalities. The second is the interaction between the knowledge modality and the other two modalities: the result of the image-text interaction is used as the initial representation of the target product entity, and the surface-form features of the triples' relations and tail entities are used as the representations of product attributes and attribute values. A structure aggregation module then propagates and aggregates the information of product attributes and attribute values onto the target product entity. The final representation of the product entity can be used for a variety of downstream tasks.
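A simplified sketch of the data flow through these two interaction processes (the fusion and aggregation modules below are plain linear layers standing in for the paper's initial-interactive feature fusion and structure aggregation modules, so this is an illustration under stated assumptions, not K3M's exact implementation):

```python
import torch
import torch.nn as nn

D = 256  # hypothetical hidden size

class ModalInteractionLayer(nn.Module):
    """(a) image-text co-attention plus fusion of initial and interactive
    features; (b) aggregation of attribute (relation) / attribute-value
    (tail entity) representations onto the product entity."""
    def __init__(self):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(D, 4, batch_first=True)
        self.img2txt = nn.MultiheadAttention(D, 4, batch_first=True)
        self.fuse = nn.Linear(2 * D, D)       # initial + interactive feature fusion
        self.aggregate = nn.Linear(2 * D, D)  # fold (relation, tail) pairs into the item

    def forward(self, img, txt, rel, tail):
        # (a) co-attention, then fuse each modality's initial and interactive features
        txt_i, _ = self.txt2img(txt, img, img)
        img_i, _ = self.img2txt(img, txt, txt)
        txt_f = torch.tanh(self.fuse(torch.cat([txt, txt_i], dim=-1)))
        img_f = torch.tanh(self.fuse(torch.cat([img, img_i], dim=-1)))
        # (b) image-text result as the item's initial representation, then
        #     aggregate the (relation, tail) messages onto it
        item = torch.tanh(self.aggregate(
            torch.cat([img_f.mean(dim=1), txt_f.mean(dim=1)], dim=-1)))       # (B, D)
        msg = torch.tanh(self.aggregate(torch.cat([rel, tail], dim=-1)))      # (B, K, D)
        return item + msg.mean(dim=1)   # final product entity representation (B, D)
```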
(3) Modal-task layer. The pretraining tasks for the image, text, and knowledge modalities are the masked object model, the masked language model, and link prediction, respectively.
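The three objectives can then be combined as a weighted sum of per-modality losses. The sketch below uses a TransE-style score for link prediction as one possible choice; the paper's exact scoring function, loss weights, and tensor shapes may differ.

```python
import torch

def pretraining_loss(mom_logits, mom_targets,
                     mlm_logits, mlm_targets,
                     head_emb, rel_emb, tail_emb,
                     w_img=1.0, w_txt=1.0, w_kg=1.0):
    """Weighted sum of the three pre-training objectives (illustrative)."""
    ce = torch.nn.functional.cross_entropy
    loss_mom = ce(mom_logits, mom_targets)   # masked object model: predict masked regions
    loss_mlm = ce(mlm_logits, mlm_targets)   # masked language model: predict masked tokens
    # Link prediction with a TransE-style score ||h + r - t|| (an assumption here).
    loss_lp = (head_emb + rel_emb - tail_emb).norm(dim=-1).mean()
    return w_img * loss_mom + w_txt * loss_mlm + w_kg * loss_lp
```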
5 Experiment and Practice
1. Experiments (from the paper)
K3M was trained on 40 million Taobao products, each of which has a title, an image, and a set of related triples. We evaluated K3M on three downstream tasks: product classification, product alignment, and multimodal question answering, and compared it with several commonly used multimodal pretraining models: the single-stream VL-BERT and the two dual-stream models ViLBERT and LXMERT. The experimental results are as follows.
Figure 3 shows the results of the various models on product classification. We observe that: (1) The baseline models severely lack robustness when modality missing or modality noise is present. When TMR (title missing rate) increases to 20%, 50%, 80%, and 100%, the performance of "ViLBERT", "LXMERT", and "VLBERT" drops on average by 10.2%, 24.4%, 33.1%, and 40.2% relative to TMR=0%. (2) Missing and noisy text hurts performance more than missing and noisy images. Comparing the "title noise" and "image noise" settings of the three baselines, model performance drops by 15.1%-43.9% as TNR (title noise rate) increases and by 2.8%-10.3% as INR (image noise rate) increases, indicating that textual information plays a more important role. (3) Introducing the knowledge graph significantly alleviates the modality missing and modality noise problems. Relative to the baselines without PKG, "ViLBERT+PKG", "LXMERT+PKG", and "VLBERT+PKG" improve on average by 13.0%, 22.2%, 39.9%, 54.4%, and 70.1% as TMR increases from 0% to 100%. (4) K3M achieves state-of-the-art performance, improving over "ViLBERT+PKG", "LXMERT+PKG", and "VLBERT+PKG" by 0.6% to 4.5% under the various modality missing and modality noise settings.
Figure 4 shows the results of the product alignment task, where we make observations similar to those in the product classification task. Moreover, for modality missing, model performance does not necessarily decrease as the missing rate increases but rather fluctuates: when the missing rate (TMR, IMR, or MMR) is 50% or 80%, model performance is sometimes even lower than at 100%. In fact, the essence of this task is to learn a model that evaluates the similarity of the multimodal information of two items. Intuitively, when both items in an aligned pair are missing titles or images, their information looks more similar than when one item is missing a title or image and the other is missing nothing.
Table 2 shows the ranking results of the multimodal question answering task, where we again make observations similar to those in the product classification task.
2. Practice (business application results at Alibaba)
1. For Ele.me's new-retail shopping-guide algorithm, the offline AUC increased by 0.2% in absolute terms; in an online AB test with 5% of traffic over 5 days, CTR increased by 0.296%, CVR by 5.214%, and CTR+CVR by 5.51%.
2. For the find-similar service of Taobao main search, the offline AUC increased by 1%, which the business reported as a substantial improvement; an online AB test is currently underway.
3. For the product bundling algorithm of Alimama's New Year Shopping Festival, deployed online: compared with the other two experimental buckets (5.50% and 5.48%), the embedding-based CTR increased by 0.02% and 0.04% in absolute terms, a relative increase of 0.363% and 0.73%, respectively.
4. For the Xiaomi algorithm team's similar-item recommendation in low-intent scenarios, with overall recall increased, conversion improved by about 2.3% to 2.7%, a relative increase of 12.5% (the previous version achieved an 11% improvement). The approach was later extended to other scenarios.
This article is the original content of Aliyun and shall not be reproduced without permission.