A knowledge graph is a special graph structure that contains both semantic information and graph structure information. It can be used in many fields, such as question answering (QA) systems, recommender systems, drug discovery, and stock market prediction. Both academia and industry have put forward their own knowledge graph construction platforms; Fourth Paradigm has also built Sage Knowledge Base, a low-threshold, whole-process knowledge graph platform. Today's share introduces techniques for automating knowledge graph representation learning, moving from triples to subgraphs.

Today’s presentation is organized around the following four points:

  • Background: What is knowledge representation learning
  • Important directions: Knowledge representation learning and AutoML
  • Model design: From triples to subgraphs
  • Summary

▌ Background

So let me first share with you the background of knowledge representation learning.

A knowledge graph is a special graph structure. It uses entities to represent natural objects or abstract concepts, and relations to model interactions between entities. Its basic storage form is the triple (head entity h, relation r, tail entity t). A knowledge graph is thus a semantic graph that contains both semantic information and graph structure information.

Knowledge representation learning aims to map the symbols in a knowledge graph (both entities and relations) into a low-dimensional vector space. Its advantage is that the learned vectors are continuous, so hidden properties can be discovered; in addition, computing similarity in vector space is very efficient.

The overall framework of knowledge graph representation learning is shown in the figure above. Given a knowledge graph, we treat all triples that exist in the graph as positive samples. We also need to sample negative triples, and we must ensure that these negatives do not exist in the graph; the common practice is to take a positive triple and randomly replace at least one of its head entity, relation, or tail entity. Once the positive and negative triples are obtained, we design a model over the knowledge graph, optimize the loss function through iterative training, and update the embeddings and model parameters. The optimization goal is to retain as much information from the original graph as possible.

Knowledge graph representation learning has the following basic tasks:

  • Knowledge graph completion: currently the most widely studied task in industry. It performs link prediction or triple classification over edges that are missing from the graph.
  • Entity matching: the goal is to match the same entity across two different knowledge graphs. For example, some entities in a Chinese graph and an English graph are synonymous; after matching them, we can further examine the relations between other entities in the two graphs to determine whether they also match.
  • Entity classification: the task is to predict the category of an entity, for example, judging which field a paper belongs to in an academic graph.

Here, the link prediction task is taken as an example to illustrate the evaluation metrics of knowledge graph representation learning. For a given triple (h, r, t), we replace the head entity or the tail entity with every entity in the graph to form a candidate set. The evaluation then looks at the rank of the positive sample after the model scores all candidates. There are three evaluation metrics: Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hit@K. The three metrics have different emphases: MR is strongly affected by outliers, MRR pays more attention to accuracy, and Hit@K pays more attention to recall.
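As a rough illustration, all three metrics can be computed from the rank each positive triple obtains among its corrupted candidates; the helper below is a minimal sketch, not a specific library API:

```python
import numpy as np

def ranking_metrics(ranks, k=10):
    """Compute MR, MRR and Hit@K from the ranks of the positive
    triples among all corrupted candidates (rank 1 = best)."""
    ranks = np.asarray(ranks, dtype=float)
    mr = ranks.mean()            # Mean Rank: strongly affected by outliers
    mrr = (1.0 / ranks).mean()   # Mean Reciprocal Rank: rewards top ranks
    hit_k = (ranks <= k).mean()  # Hit@K: fraction of positives in the top K
    return mr, mrr, hit_k

# Example: ranks collected after scoring all candidate replacements
print(ranking_metrics([1, 3, 120, 2], k=10))  # -> (31.5, 0.46..., 0.75)
```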

▌ Key directions: Knowledge representation learning and AutoML

Knowledge graph representation learning consists of five key modules. First, we need to define a scoring function to model the problem. Then, since the graph contains only positive samples, the design of negative sampling is also critical. Once we have positive and negative samples, we need to define a loss function. Meanwhile, to avoid overfitting, we need to design regularization terms. Finally, we optimize the whole model with respect to the loss function and regularization terms.

Given a set of triples, we need to score them. There are three types of modeling methods. Triple-based methods score the triples directly. Path-based methods find connecting paths between the head and tail entities and reason over the discovered paths. Subgraph-based methods extract a subgraph containing the head and tail entities and reason over the information in that subgraph.

The basic approach to negative sampling is: for a given positive triple (h, r, t), we randomly replace the head or tail entity with another entity in the graph, making sure the new triple does not exist in the graph.
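A minimal sketch of this filtered random replacement, assuming the graph is held in memory as a Python set of triples (all names here are illustrative):

```python
import random

def sample_negative(triple, entities, known_triples):
    """Corrupt (h, r, t) by replacing its head or tail with a random
    entity, resampling until the corrupted triple is absent from the
    graph, so that false negatives are filtered out."""
    h, r, t = triple
    while True:
        e = random.choice(entities)
        neg = (e, r, t) if random.random() < 0.5 else (h, r, e)
        if neg not in known_triples:
            return neg
```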

However, because the set of possible negatives is enormous, simple random sampling yields low-quality negative samples. In 2018, methods based on generative adversarial networks were proposed to generate high-quality negatives. The problem with this class of methods is that a separate generator model must be trained to produce negative samples, and since the samples are discrete, the generator has to be trained with reinforcement learning. These two issues make the whole training process unstable.

To solve the above problems, we proposed a method that stores high-quality negative samples in a cache. During training, negative samples are continuously generated, drawn, and updated, which ensures the model receives high-quality negatives and greatly improves training efficiency. Moreover, this method does not require training an extra generator model. This work was published in ICDE 2019 and VLDB Journal 2021.
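The published procedure has more moving parts, but the core cache update can be sketched roughly as follows; `update_cache` and `score_fn` are illustrative names, and the softmax-style sampling merely stands in for the paper's actual sampling scheme:

```python
import numpy as np

def update_cache(cache, candidates, score_fn, cache_size=50):
    """Keep a cache of hard negatives for one positive triple: merge the
    cache with freshly sampled candidate entities, score them with the
    current model, and retain entries with probability proportional to
    exp(score), so the cache tracks high-quality negatives over time."""
    pool = np.unique(np.concatenate([cache, candidates]))
    scores = score_fn(pool)               # higher score = harder negative
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Assumes len(pool) >= cache_size; otherwise keep the whole pool.
    return np.random.choice(pool, size=cache_size, replace=False, p=probs)
```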

Models need to balance expressiveness and complexity. In knowledge graphs, the embedding tables contain a very large number of parameters, so we must regularize the model. Common regularization methods include: constraining the norm of each embedding to be at most 1; adding a p-norm regularizer to the loss; and applying dropout to the embeddings.
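A minimal PyTorch sketch of the three options; the coefficients and the choice p = 3 are illustrative, not prescribed values:

```python
import torch

emb = torch.nn.Embedding(num_embeddings=10000, embedding_dim=200)

# Option 1: constrain every embedding row to have L2 norm at most 1.
with torch.no_grad():
    emb.weight.data = torch.renorm(emb.weight.data, p=2, dim=0, maxnorm=1.0)

# Option 2: add a p-norm regularizer (p = 3 here) to the training loss.
reg_term = 1e-3 * emb.weight.abs().pow(3).sum()

# Option 3: apply dropout to looked-up embeddings during training.
dropout = torch.nn.Dropout(p=0.2)
h = dropout(emb(torch.tensor([1, 2, 3])))
```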

The model parameters can be optimized by stochastic gradient descent (SGD); the whole procedure is shown in the figure above. We first draw positive samples from the graph and obtain negatives through negative sampling, then use the model to score the triples. There are many hyperparameters in the optimization process, such as the learning rate, initialization, batch size, dimension size, and early stopping.
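Putting the pieces together, one SGD step might look like the sketch below, where `model` is assumed to be any scoring function over batches of triples and a margin ranking loss is used as an example:

```python
import torch

def train_step(model, optimizer, pos_batch, neg_batch, margin=1.0):
    """One SGD step with a margin ranking loss: positive triples should
    score higher than their negatives by at least `margin`."""
    optimizer.zero_grad()
    pos_scores = model(pos_batch)   # shape: (batch,)
    neg_scores = model(neg_batch)
    loss = torch.relu(margin - pos_scores + neg_scores).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```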

Hyperparameter optimization is one of our key research areas. Most hyperparameter optimization algorithms select a configuration from the search space in each iteration and evaluate the model on the whole knowledge graph, which is very time-consuming. At this year's ACL conference, we presented KGTuner, a two-stage hyperparameter search algorithm. In the first stage, we significantly shrink the search space and evaluate configurations on a sampled subgraph; this lets us explore a large number of promising hyperparameter settings cheaply, from which we select the top-10 best-performing configurations as the candidate set for fine-tuning in the second stage. In the second stage, we enlarge the batch size and dimension size, evaluate the model on the complete graph, and finally determine the optimal hyperparameters.

Here is a brief introduction to automated machine learning (AutoML). AutoML wraps the search space and search objective into a bi-level optimization problem, so the whole hyperparameter optimization problem becomes a bi-level optimization process. Specifically, the search space defines the objects we need to search over (such as hyperparameters or model structures), the training objective sits at the inner level of the optimization, and the search objective sits at the outer level. Meanwhile, there are constraints on the search process.
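In symbols (the notation below is ours, chosen for illustration), the bi-level problem can be written as:

```latex
\begin{aligned}
\alpha^{*} &= \arg\max_{\alpha \in \mathcal{A}}
   \; \mathcal{M}\bigl(w^{*}(\alpha), \mathcal{D}_{\mathrm{val}}\bigr)
   &&\text{(outer level: search objective, e.g. MRR)}\\
\text{s.t.}\quad w^{*}(\alpha) &= \arg\min_{w}
   \; \mathcal{L}\bigl(w, \alpha, \mathcal{D}_{\mathrm{train}}\bigr)
   &&\text{(inner level: training objective, the loss)}
\end{aligned}
```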

Different knowledge graph datasets differ greatly in data distribution, and they also contain different types of relations: for example, relations can be symmetric, antisymmetric, inverse, asymmetric, or compositional. In addition, the downstream tasks of knowledge graphs are diverse, such as link prediction, entity matching, and node classification. With such complex data and diverse tasks, unified modeling by hand would require deep expertise, and AutoML can help lower the threshold for modeling.

The components of AutoML search optimization also correspond to the four components of hyperparameter optimization:

  • Search space: the hyperparameters and scoring functions to search over;
  • Search objective: a ranking-based evaluation metric;
  • Training objective: the loss function;
  • Search constraint: the total time budget of the search.

AutoML also offers a number of search algorithms to choose from. Traditional search algorithms include grid search, random search, greedy algorithms, genetic algorithms, and so on. There are also model-based optimization algorithms, such as Bayesian optimization, reinforcement learning, and gradient-based methods.

▌ Model design: From triples to subgraphs

Now I will focus on sharing the model design for knowledge graph representation learning.

1. Triple-based models

Over the past decade, a large number of scoring functions have been proposed for modeling triples, each emphasizing different relation patterns. Let's start with the triple-based models.

The first family is translation distance-based models, such as TransE and its derivatives TransH and RotatE. They treat the relation as a translation: starting from the head entity and translating by relation r in the vector space should land near the tail entity, and the scoring function is built from this vector distance. TransH and RotatE address TransE's inability to model one-to-many and symmetric relations. But overall, translation distance-based models have limited expressiveness, which leads to unsatisfactory results on link prediction.
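For concreteness, the two scoring functions can be sketched as follows; this is a minimal version, and real implementations add constraints such as unit modulus for RotatE's relation entries:

```python
import torch

def transe_score(h, r, t, p=1):
    """TransE: the relation is a translation, so h + r should be close
    to t; the negated p-norm distance serves as the score."""
    return -torch.norm(h + r - t, p=p, dim=-1)

def rotate_score(h, r, t):
    """RotatE: the relation is a rotation in complex space, so the
    elementwise product h * r should land near t (h, r, t are complex
    tensors and each entry of r has modulus 1)."""
    return -(h * r - t).abs().sum(-1)
```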

With the development of neural networks, models based on multi-layer perceptrons (MLP), convolutional networks (ConvE), and recurrent networks (RSN) have been proposed for triple-based knowledge representation learning. However, neural network-based models are often complex and hard to train.

The last class is bilinear models, which are also the most effective class of triple-based knowledge representation learning, combining strong expressiveness with low complexity. RESCAL, proposed in 2011, is a typical bilinear model. In layman's terms, a linear model multiplies a matrix by one vector, while a bilinear model multiplies a matrix by two vectors (the head and tail entity embeddings), yielding a scalar that serves as the triple's score. The relation matrix in RESCAL has dimension d×d, which contains too many parameters and leads to overfitting.

To solve the overfitting problem, DistMult restricts the relation matrix to a diagonal matrix. However, DistMult's scoring function is symmetric in the head and tail entities, which makes it impossible to model asymmetric relations.

To facilitate unified modeling of bilinear models, DistMult can be rewritten in the form shown above. For example, if we split both the head and tail entity embeddings into four chunks, DistMult decomposes into a sum of four inner products.

To address DistMult's inability to model asymmetric relations, ComplEx moves the entity vectors and relation matrices from real space into complex space. ComplEx can then model asymmetric relations, because its score takes the real part of a product involving the complex conjugate of the tail entity, which is not symmetric in the head and tail.
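With d-dimensional embeddings, the three bilinear scoring functions discussed so far can be written as follows (these are the standard formulations; ⟨·,·,·⟩ denotes the three-way inner product and the bar denotes complex conjugation):

```latex
\begin{aligned}
\text{RESCAL:}  \quad & f(h,r,t) = \mathbf{h}^{\top}\mathbf{M}_r\,\mathbf{t},
   && \mathbf{M}_r \in \mathbb{R}^{d\times d} \\
\text{DistMult:}\quad & f(h,r,t) = \mathbf{h}^{\top}\mathrm{diag}(\mathbf{r})\,\mathbf{t}
   = \langle \mathbf{h}, \mathbf{r}, \mathbf{t} \rangle \\
\text{ComplEx:} \quad & f(h,r,t) = \mathrm{Re}\bigl(\langle \mathbf{h}, \mathbf{r}, \bar{\mathbf{t}} \rangle\bigr),
   && \mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{C}^{d}
\end{aligned}
```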

Similarly, we can apply a DistMult-like split to ComplEx, decomposing it into the sum of inner products over four groups of real components and four groups of imaginary components. In fact, all bilinear models can be decomposed into similar expressions; the details are omitted here, but you can read the paper for more information.

The differences between the bilinear models above lie in how the relation matrix is modeled. Although each model has good expressiveness, across the variety of downstream tasks no single one dominates the others of the same type on all tasks, which indicates insufficient generalization. Based on the unified expression form of bilinear models and the problems of previous methods, we proposed AutoSF, which aims to automatically search the structure of the relation matrix and thus achieve unified modeling.

The search problem is defined as shown in the figure above. Since the differences between bilinear models lie in the form of the relation matrix, we abstract the problem as follows: for a K×K block relation matrix, each block is filled with one of the (2K+1) values {-K, …, 0, …, K}, where a nonzero value ±i selects the i-th chunk of the relation embedding (with sign) and 0 leaves the block empty. Given this search space, each time we select a structure (i.e., a relation matrix), which corresponds to a scoring function; we train a model with this scoring function and measure its effect with the evaluation function on the validation set. The results are passed to the optimizer, which then selects structures that can achieve better validation performance, and the optimal scoring function is finally obtained through iterative search. The search space is huge: if we trained and evaluated every candidate, the overhead would be unacceptable, so AutoSF and its improved version AutoSF+ each designed an algorithm to optimize search efficiency.
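The block-structured scoring function can be sketched as below; this is a simplified rendering of the search space, and the function and variable names are ours:

```python
import itertools
import torch

def blockwise_bilinear_score(h, r, t, structure, K=4):
    """Score a triple under an AutoSF-style structure: the embeddings are
    split into K chunks, and structure[i][j] in {-K, ..., K} says which
    (signed) relation chunk fills block (i, j); 0 leaves the block empty."""
    h_parts, r_parts, t_parts = h.chunk(K, -1), r.chunk(K, -1), t.chunk(K, -1)
    score = 0.0
    for i, j in itertools.product(range(K), range(K)):
        s = structure[i][j]
        if s != 0:
            sign = 1.0 if s > 0 else -1.0
            score = score + sign * (h_parts[i] * r_parts[abs(s) - 1] * t_parts[j]).sum(-1)
    return score

# DistMult corresponds to the diagonal structure below (illustrative):
distmult_structure = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
```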

In AutoSF we proposed a progressive search algorithm. As in the figure above, we first search for the best-performing relation matrix while restricting it to contain only four non-zero blocks. On this basis, we increase the number of non-zero blocks in each iteration and search for the matrix with the best evaluation metric under each budget. The inspiration for the progressive algorithm is that each step fine-tunes the best matrix found so far, so the search gradually approaches an optimal structure. However, progressive search is greedy and easily stuck in local optima. To solve this, AutoSF+ adopts a search scheme based on a genetic algorithm: in each iteration, mutation and crossover operations modify the matrices, the better-performing offspring are retained, and better relation matrices are eventually found. Overall, this algorithm is more flexible than the progressive one.

Meanwhile, we take domain properties into account when selecting matrices. We designed a filter to reduce redundant evaluations: for example, matrices containing all-zero rows or columns need not be evaluated, and among matrices that are equivalent to each other, only one needs to be kept for evaluation. In addition, we proved that the symmetry of the relation matrix is an important property governing whether the model can fully express a knowledge graph. We therefore defined a predictor that predicts model performance from the symmetry-related features of the relation matrix, scoring candidates with a two-layer MLP.

The experimental results show that bilinear models outperform translation distance-based and neural network-based models. Moreover, the performance of individual bilinear models varies across datasets, and there is no absolute winner, while AutoSF can find the optimal structure for each dataset through search. We visualized the structures found by the search, as shown in the figure above: the optimal relation structures searched by AutoSF differ across datasets, which shows that our method achieves the goal of being data-dependent.

To briefly summarize triple-based knowledge graph representation learning: bilinear models are the most effective triple-based modeling method. AutoSF(+) is an AutoML-based bilinear scoring model whose search space is the unified expression form of bilinear models, which guarantees strong expressiveness. AutoSF(+) uses progressive and genetic-algorithm-based search, and designs a filter and a predictor that introduce domain properties to further improve the structure search. The related papers were published in ICDE 2020 and TPAMI 2022.

2. Relational path-based models

Next, let's introduce the relational path-based models.

The expressive power of a single triple is limited. If we can connect the head and tail entities of a triple through a path in the graph, we obtain much richer information. First, the triple itself is preserved in the path; second, paths can express more complex relations; in addition, a path contains the long-chain information spanning multiple triples.

PTransE extends TransE by generalizing triples to paths composed of multiple relations. As in TransE, the relation between the head and tail entities is expressed as the sum of the translation vectors along the path; the specific formula is shown in the figure above. However, like TransE, it cannot handle one-to-many and symmetric relations, so PTransE's modeling performance is mediocre.
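For a two-hop path h → e → t, the additive composition can be written as below; this is the translation-style sketch of the idea, not PTransE's full objective:

```latex
\mathbf{h} + \mathbf{r}_1 + \mathbf{r}_2 \approx \mathbf{t},
\qquad
f(h, p, t) = -\Bigl\lVert \mathbf{h} + \textstyle\sum_i \mathbf{r}_i - \mathbf{t} \Bigr\rVert
```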

The Recurrent Skipping Network (RSN) uses an RNN to model the path, adding skip connections at entity nodes, and finally outputs the corresponding entity and relation embeddings. RSN is good at modeling long-range information, but it struggles to effectively capture the semantic information within individual triples.

To address the defects of the above methods, we proposed the Interstellar model. We split each path into triple-sized segments and, for each triple, search the modeling structure. If we cut the connections between triples, the model degrades to triple-based modeling; if we drop the tail entities of the intermediate triples along the path (feeding in zero vectors), the model degrades to PTransE (modeling only the relation vectors). With this strategy, the model can automatically capture the different semantics and properties contained in paths.

As with AutoSF, we need to design a search algorithm for the model. Directly adopting AutoSF's approach and fully training a model for each new structure configuration (the stand-alone approach) would be very expensive, although its advantage is accurate model evaluation. In contrast, one-shot methods share trained model parameters across structures, making the search very efficient, at the cost of unreliable evaluation results. Our algorithm combines the advantages of the two: at the macro level (connections and combinators) we use the stand-alone method to obtain accurate evaluations of structure configurations, and at the micro level (activations and weight matrices) we use the one-shot method to obtain efficient evaluations.

In the entity matching task, comparative experiments show that Interstellar substantially improves over the baseline models, and on different datasets our model finds completely different structure configurations for the triple-level modules. In the link prediction task, we observed that the WN18-RR dataset contains many one-hop relations, while FB15K-237 contains relatively more two-hop relations; this property is well reflected in the triple structures eventually found by Interstellar, as shown in the visualization above. These results demonstrate that the structures learned by our model are strongly data-dependent, and that searching the model structure plays a key role in the final performance.

To summarize relational path-based modeling: first, path-based methods carry more information than triple-based ones. We proposed Interstellar, an algorithm based on neural architecture search that recursively processes the relations between entities along a path. Its search space divides into a macro space (connection and combinator operators) and a micro space (activations and weight matrices), and a hybrid search algorithm is designed by combining the advantages of the stand-alone and one-shot methods. This work was presented at NeurIPS 2020.

3. Graph neural network-based models

The last class is subgraph-based graph neural network (GNN) models. In recent years, GNNs have demonstrated powerful expressiveness on graph-structured data. In knowledge graph representation learning, many models have recently tried to apply GNNs to subgraphs. R-GCN, CompGCN, and KE-GCN all take shallow embeddings as input, use a relational GNN to aggregate nodes into higher-level embeddings, and then apply scoring functions such as TransE or ConvE to obtain the final score. However, this type of method needs to load the complete knowledge graph, which scales poorly. In addition, such models still rely on the scoring function; the GNN can only refine the final result.

In 2020, the GraIL model was proposed. Given head and tail entities, it extracts a subgraph containing both from the original graph, labels the entities based on their distances to the head and tail entities, and then uses a GNN to perform message passing and updates on the subgraph, finally producing a score for the triple formed by the head and tail entities. GraIL can perform inductive reasoning without trained embeddings, which makes it possible to score unseen nodes using graph structure alone. However, extracting subgraphs and generating node labels within them has high time complexity.

Addressing the shortcomings of the above models, we proposed the Red-GNN model. First, we augment the relational paths obtained from the graph to the same length by introducing identity relations; we then stack all the paths into a relational subgraph, which is a directed graph that preserves the propagation direction of information.

Because paths in the relational subgraphs overlap, we can use dynamic programming to model all relational subgraphs sharing the same head entity at once. As shown in the figure above, the left side is the conventional GNN computation, which requires a separate pass for each relational subgraph, while the right side shows Red-GNN's computation: with recursive and parallel computation, the GNN models multiple relational subgraphs in a single pass. Message aggregation is based on the relations between entities, with an attention mechanism for adaptive fusion.
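The layer-wise trick can be illustrated with the simplified sketch below, where mean aggregation stands in for Red-GNN's attention and all names are ours:

```python
import collections
import numpy as np

def propagate_from_head(head, edges, rel_emb, num_layers, dim=32):
    """Layer-wise propagation from one head entity: entities reachable in
    <= num_layers hops are visited layer by layer, so every relational
    subgraph sharing this head is encoded in a single pass.

    edges: dict entity -> list of (relation, neighbor) pairs.
    rel_emb: dict relation -> np.ndarray of shape (dim,)."""
    reps = {head: np.ones(dim)}  # layer-0 representation of the head
    for _ in range(num_layers):
        messages = collections.defaultdict(list)
        for u, rep_u in reps.items():
            for rel, v in edges.get(u, ()):
                messages[v].append(rep_u * rel_emb[rel])  # message along (u, rel, v)
        reps = {v: np.tanh(np.mean(m, axis=0)) for v, m in messages.items()}
    return reps  # reps[t] is then scored to answer the query (head, r, ?)
```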

The figure above shows the comparative results. Red-GNN is a model built purely on subgraph structure with no entity embeddings, so it is suitable for both transductive and inductive reasoning. The experimental results show that even though the model uses no embedding information, it still outperforms most methods. Thanks to its small number of parameters and the dynamic programming-based design, Red-GNN is significantly more efficient than GraIL. To summarize the GNN-based approach: simply applying a GNN to knowledge graphs with minor modifications is not ideal. We proposed the Red-GNN model, which learns from subgraph structure with a GNN and achieves SOTA results in both the transductive and inductive settings. In my opinion, subgraph learning on knowledge graphs is a promising research direction, though it currently faces the challenge of computational efficiency.

▌ Summary

Today we introduced three types of models for knowledge graph representation learning:

  • Triple-based models: focus on modeling semantics such as symmetric or inverse relations, but they cannot model structural information and are poorly interpretable. Their advantage is high computational efficiency, which makes them widely used in practice.
  • Relational path-based models: focus on modeling compositional semantics, can model sequential structural information, and achieve interpretability through simple logical rules. Their computational complexity depends on the number of paths.
  • Subgraph-based models: focus on modeling higher-order semantics, can model topological information, and achieve interpretability through complex logical rules. Their computational complexity is determined by the extracted subgraphs.

Designing model structures with AutoML can greatly improve the performance of knowledge graph representation learning, but when applying it, the prior knowledge and properties of the domain need to be built into the search space and search algorithm. In the future, knowledge graph representation learning can be further explored in the following four directions:

  • Turning the structure design of GNNs into an automated process;
  • Achieving efficient inference on super-scale knowledge graphs;
  • Completing inference on dynamic or event-centric knowledge graphs;
  • Applying knowledge graph representation learning to a wide range of fields, such as biology and finance.

▌ Q&A

  1. There are some long-tail relations in the knowledge graph. How should we design algorithms to learn these relations better?

In practice, we found that AutoML can alleviate the under-learning of long-tail relations. Because AutoML's evaluation indirectly increases the weight of poorly learned relation samples, the optimized model structure takes the learning of these long-tail relations into account; in effect, this works much like upsampling. In addition, few-shot learning can be considered to learn long-tail relations in a more targeted way.