Preface

The field of artificial intelligence has developed rapidly in recent years. Big data analytics in graphics, video, gaming, natural language processing, finance, and other fields has advanced by leaps and bounds and spawned many applications that have changed our daily lives. Recently, graph neural networks have become a research hotspot in artificial intelligence, especially in social networks, knowledge graphs, chemistry, text analysis, combinatorial optimization, and other fields. The strong ability of graph neural networks to discover the relationships implicit in data can help us make better use of data and thus make better decisions. For example, mapping the evolution of human social networks through graph neural networks could help us understand the underlying workings of human society and bring us one step closer to an ideal society.

At this year's ACM SIGKDD conference (KDD for short), the state of graph neural network research was on full display: by a rough count, nearly 40 of the 216 papers accepted to this year's KDD Research Track are related to graph neural networks. A one-day tutorial on graph neural networks accordingly drew strong attention from participants. The tutorial, titled "Deep Graph Learning: Foundations, Advances and Applications" and jointly organized by Tencent AI Lab, Tsinghua University, The Chinese University of Hong Kong, and other institutions, covered the current cutting-edge research progress of graph neural networks, starting from basic graph concepts.

The tutorial is divided into two themes. This article summarizes the history of graph neural networks, the latest research progress, and the application progress, i.e., the core content of Theme II: Advances and Applications. For more details, see the course slides and related papers: ai.tencent.com/ailab/ml/KD…

To solve a series of challenging problems in graph learning, explore the boundaries of graph learning applications, and support the company's various graph-related businesses, Tencent AI Lab began research on graph deep learning in the second half of 2017 and has actively explored its application boundaries. The lab has since published a number of papers at top machine learning and data mining conferences, covering large-scale graph computation, ultra-deep graph neural networks, unsupervised graph learning, adversarial attacks on graphs, few-shot graph learning, and more. In the future, the lab will explore applications of graph deep learning in a wide range of scenarios, such as social recommendation and drug discovery, so that it can truly benefit humanity.

Graphs and Graph Neural Networks

1 What is a graph?

The word "graph" in Chinese (图) can correspond to many different English words: image, picture, map, and the graph this article is concerned with. A graph is a structure that describes the relationships between things. The basic components of a graph are vertices and the edges that join them. According to whether edges have directions, graphs can be divided into directed graphs and undirected graphs. In general, we can represent a graph by points and the lines joining them, which helps us understand it more intuitively, as shown in the figure below:

But for the convenience of computer processing, we can also represent a graph as a matrix. For example, if we define A[i,j]=1 when v_i is connected to v_j and A[i,j]=0 otherwise, then the graph above can be represented by the adjacency matrix A:
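To make this concrete, here is a minimal sketch of building such a matrix from an edge list (using a hypothetical 4-vertex graph, not necessarily the one in the figure):

```python
import numpy as np

# A hypothetical undirected graph with 4 vertices, given as (v_i, v_j) pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

n = 4
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph: the adjacency matrix is symmetric

print(A)
```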

Graphs have strong representational power. Physical system modeling, protein structure prediction, disease classification, and many text and image processing tasks can all be represented as graph-structured data. For example, graphs can represent the dependency relations between words in a sentence or the relative positions of objects in an image; they can also be used to analyze information diffusion and customer relationships in social networks, or to discover new drugs by analyzing the associations between molecules.

In recent years, under the dual impetus of big data and hardware development, deep neural network technology has advanced by leaps and bounds, giving us the ability to analyze and understand large-scale graph data. In general, graph analysis tasks can be divided into three categories: node classification, link prediction, and clustering.

A graph neural network (GNN) is a neural network that processes graph data. Two kinds of operations in GNNs are worth mentioning: graph filtering (divided into spatial and spectral filtering) and graph pooling. Graph filtering refines the node features, while graph pooling generates a representation of the graph itself from the node representations.

Generally speaking, for node-level tasks, the GNN framework is composed of graph filtering and activation layers; for graph-level tasks, modules composed of filtering, activation, and pooling layers are stacked and connected.

In terms of GNN implementation, the most commonly used approach is the Message Passing Framework. In brief, the framework is divided into two steps.

The first step is message generation: the state data is collected from the neighbor nodes, and a message for the current node is generated using the corresponding message function. In the second step, the state of the target node is updated.
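As a minimal sketch of these two steps (with a sum aggregator and a single-weight update, both hypothetical choices rather than the course's specific formulation):

```python
import numpy as np

def message_passing_layer(A, H, W, sigma=np.tanh):
    """One round of message passing (illustrative sketch).

    Step 1 (message generation): each node gathers its neighbors' states
    and combines them, here simply by summation via the adjacency matrix.
    Step 2 (update): each node's state is updated from its old state and
    the aggregated message, through learnable weights W and a nonlinearity.
    """
    messages = A @ H                  # sum of neighbor states per node
    return sigma((H + messages) @ W)  # updated node states

# Toy example with 4 nodes and 3-dimensional states
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
H = np.random.randn(4, 3)
W = np.random.randn(3, 3)
H_next = message_passing_layer(A, H, W)
```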

Most current spatial GNNs can be formulated as some kind of message passing process, and in fact most current deep learning toolkits for graphs, such as Deep Graph Library (DGL) and PyTorch Geometric, adopt this framework.

Graph neural networks are not a new idea; their history can be traced back to 1997. Roughly speaking, the development of GNNs can be divided into three stages.

In the first stage, the main approach was to extend recurrent neural networks (RNNs). RNNs are known to excel at processing sequence data, and sequences can in turn be viewed as graphs of a particular form. Some early work (TNN97 / TNN08) therefore generalized RNNs from sequences to special graph structures such as trees and directed acyclic graphs (DAGs). But the field then nearly ground to a halt for a time. Driven by the current wave of deep learning, however, research on modeling and processing unstructured data has flourished, GNNs have found their own development opportunity, and the number of related papers at top conferences has grown rapidly.

In the second stage, convolution was introduced into the GNN workflow. When represented as matrices, graphs share many similarities with the images that convolution excels at processing, which opened the era of convolution in GNNs. A series of works transformed graph convolution from the spectral domain into approximations in the topological (spatial) domain; on this basis, the graph convolutional network (GCN) was born in 2017, which for the first time used layer-by-layer convolution to expand the receptive field and brought graph neural networks into practical use.

We are now in the third stage of GNN development. Various variants of graph convolution have appeared, attention mechanisms have been introduced into GNNs, and graph pooling techniques and higher-order GNNs have emerged. Important techniques from this stage include:

- Convolution variants: Lanczos network (using the Lanczos algorithm to obtain a low-rank approximation of the graph Laplacian), graph wavelet neural network (replacing the Fourier transform with the wavelet transform), hyperbolic GCN (building GCN in hyperbolic space).

- Attention mechanisms: graph attention network (replacing fixed aggregation weights with learnable self-attention), gated attention network (adding learnable gates to model the importance of each head), spectral graph attention network (applying attention to the high-/low-frequency components of the spectrum).

- Graph pooling: SAGE (self-attentive graph embedding, using self-attention to model node importance during pooling), graph pooling via graph cuts (pooling via subgraphs obtained with a pre-trained graph-cut algorithm), differentiable graph pooling (DiffPool, which aggregates node representations hierarchically by learning cluster assignment matrices), and EigenPooling (which obtains better assignment matrices by jointly using node features and local structure).

- Higher-order GNNs: these encode higher-order proximity into the graph by extending the receptive field. Higher-order proximity describes relationships between nodes at more diverse distances, not just between immediate neighbors. Work in this area includes DCNN (stacking the power series of the transition matrix to extend the adjacency matrix into a tensor, then outputting node embeddings and graph embeddings independently of each other), MixHop (using normalized multi-order adjacency matrices and concatenating the outputs of each order, thereby capturing high- and low-order proximity simultaneously), and APPNP (using personalized PageRank to build better neighbor relationships for the target node).

The following diagram briefly summarizes the relationship between the various GNN variants:

After understanding the basic knowledge and development context of GNN, we will step into the current frontier research field and interpret some recent theoretical research results and design innovations.

1 The Expressive Power of Graph Neural Networks

We know that graphs are good at expressing relationships between things, but what are the limits of graph neural networks? Huang Wenbing, an assistant researcher in the Department of Computer Science at Tsinghua University, a Tsinghua "Shuimu Scholar", and a Tencent "Rhino-Bird Visiting Scholar", introduced the related research progress in the course.

In order to effectively assess GNN expressiveness, evaluation criteria need to be defined first. Currently, there are three typical tasks for evaluation: graph isomorphism, function approximation, and graph detection/optimization/evaluation.

For the graph isomorphism task, the goal of GNN is to determine whether any given two graphs are isomorphic. This is a very important task. For the graph classification task, if two graphs are isomorphic, GNN needs to output the same label for both graphs.

However, no efficient algorithm is known for deciding graph isomorphism in general; the problem is computationally hard. The traditional Weisfeiler-Lehman (WL) test can distinguish most graph structures, failing only on a few corner cases. Can GNN solve this problem better?

Not necessarily. In 2019, Xu et al. and Morris et al. proved that GNNs are at most as powerful as the WL test. Later, Xu et al. further proved that if the aggregation and readout functions of a GNN are injective, the GNN is exactly as powerful as the WL test.
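For readers unfamiliar with it, here is a minimal sketch of the 1-dimensional WL color-refinement test (a standard construction, not code from the course):

```python
from collections import Counter

def wl_colors(adj, iters=3):
    """1-dimensional Weisfeiler-Lehman color refinement (sketch).

    adj: dict mapping each node to a list of its neighbors.
    Two isomorphic graphs always produce the same color multiset;
    a few non-isomorphic graph pairs do as well, which is exactly
    where the WL test (and hence standard GNNs) fails.
    """
    colors = {v: 0 for v in adj}  # start with a uniform color
    for _ in range(iters):
        # New color = own color plus the sorted multiset of neighbor colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return Counter(colors.values())

def maybe_isomorphic(adj1, adj2):
    # Different multisets => definitely not isomorphic; equal => undecided.
    return wl_colors(adj1) == wl_colors(adj2)
```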

For the function approximation task, the goal is to determine whether GNN can approximate any graph-based function with arbitrary accuracy. Since a GNN is itself some function defined on graphs, its performance on this task reflects its capability. DNNs have a similar result: a DNN can approximate any continuous function to arbitrary precision as long as it has enough hidden units, which is called the universal approximation theorem. So it is natural to ask a similar question about GNNs.

Maron et al. proposed an architecture for GNNs built from graph-invariant and graph-equivariant layers: equivariant layers are stacked on top of each other, each followed by a nonlinearity such as ReLU, and a graph-invariant mapping layer is added at the end. Such a model keeps its output unchanged as the ordering (permutation) of the input changes. Models of this kind are known as graph invariant networks (INN).

How powerful is INN? Maron et al. proved that for any continuous invariant graph function, if certain conditions hold, we can find specific INN parameters that approximate the function with arbitrary accuracy. This is one of the strongest theoretical results in graph learning; for INN, it plays the role that the universal approximation theorem plays for DNN.

We’ve seen recent advances in graph isomorphism and function approximation, and further, what’s the relationship between the two? Chen et al. 2019 showed that the two are actually equivalent under certain conditions.

Next, let’s see if GNN is expressive enough to solve more difficult tasks, such as finding the shortest path in a graph or determining whether a loop exists in a graph. These tasks are difficult because they require very fine-grained reasoning at the node level.

Loukas showed that GNNs can solve these tasks, concluding that they can do so as long as the GNN is deep and wide enough and the nodes can be distinguished from one another. Here, depth refers to the number of GNN layers, and width refers to the dimension of the hidden units.

So to sum up, GNN is actually very expressive given the right architecture; therefore, to fully exploit the true power of GNN, we need more research and exploration on architectures.

2 Training Deep Graph Neural Networks

As briefly mentioned before, depth is very important to GNN capability. The same is true of deep neural networks (DNN): deep networks tend to be very expressive, and depth is arguably one of the main drivers of the current AI boom.

So, specifically, does a deeper GNN have the same advantage? Can a deeper GNN obtain a larger receptive field, as a CNN does?

The answer, of course, is yes. For example, the shortest-path problem requires a very large receptive field to find all possible paths, while cycle detection and subgraph finding also require large receptive fields.

So how can we increase the receptive field of GNN to make it more powerful? For this, we need a deeper or wider GNN.

First, let's look at extending GNN by simply adding depth. As shown in the figure below, for six GNNs — GCN, GraphSAGE (improved aggregation), ResGCN (borrowing the idea of residual networks), JKNet (borrowing the idea of DenseNet), IncepGCN (borrowing the idea of Inception-V3), and APPNP (borrowing from PageRank) — simply increasing depth does not necessarily improve accuracy, and may even do the opposite. For example, the accuracy of GCN, GraphSAGE, and ResGCN decreases significantly as depth increases.

This raises the question: what is the root cause of the expressiveness gained by adding depth? And what hinders the deep extension of GNN?

Recent research has identified three fundamental factors that may hinder deepening GNNs: over-smoothing, overfitting, and training dynamics. The latter two are common deep learning problems, while over-smoothing is a problem unique to graph deep learning.

Over-smoothing

First, let's look at over-smoothing. A GNN essentially mixes the representations of neighboring nodes layer by layer. In the extreme case of infinitely many layers, the representations of all nodes converge to a stationary point that is completely independent of the input features, which also leads to vanishing gradients. One symptom of over-smoothing is therefore that both the training loss and the validation loss are difficult to decrease. So why does over-smoothing occur?

Let's analyze this in the setting of a linear GCN. First, how is GCN related to smoothing? Generally speaking, GCN can be considered a special form of Laplacian smoothing, which looks something like this:
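As a reference, the standard GCN layer (Kipf & Welling, 2017), which matches this Laplacian-smoothing reading, is:

$$H^{(l+1)}=\sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),\qquad \tilde{A}=A+I,\quad \tilde{D}_{ii}=\sum\nolimits_j \tilde{A}_{ij}$$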

This process means that a node’s new features are constructed based on a weighted average of itself and its neighbors.

To understand where over-smoothing comes from, let's first ask: when does GCN fail due to over-smoothing? We will discuss three over-smoothing cases. First, with linear activation, the hidden representation H_L converges to a particular point. Second, with ReLU activation, H_L converges to a particular subspace M. Third, with ReLU plus bias, H_L converges to the surface of a particular subcube O(M, r).

In the case of linear activation, why does H_L converge to a particular point? This is related to L-step random walks. The probability of a walker moving from a node to one of its neighbors is 1/(degree of the node). After L steps, the walk forms a sequence of visited nodes. Mathematically, an L-step random walk is essentially the normalized adjacency matrix raised to the power L, multiplied by the initial probability distribution.

Then, if we replace this initial probability with a set of learnable parameters on the node features, it becomes a linear L-layer GCN.

As a result, some results on random walks also apply to linear GCN, one of which is that a random walk converges to a stationary distribution after an infinite number of steps.

In detail, we first perform an eigendecomposition, i.e., decompose the normalized adjacency matrix into n eigenvalues λ_i and their corresponding eigenvectors u_i.

Expanding the L-th power in this eigenbasis yields the following formula:
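(As a reconstruction consistent with the surrounding argument, for the normalized adjacency matrix Â with eigenpairs (λ_i, u_i) applied to an input x:)

$$\hat{A}^{L}x=\sum_{i=1}^{n}\lambda_i^{L}\,\langle u_i,\,x\rangle\,u_i$$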

These eigenvalues have a useful property: if a graph G contains m connected components, then the eigenvalues of the normalized adjacency matrix include m maximal eigenvalues equal to 1, while the remaining λ lie in the open interval (-1, 1).

Therefore, as L approaches infinity, the m largest terms persist because their λ equals 1, while the remaining terms vanish because λ^L approaches zero. This makes the hidden representation H_L tend to a particular point as the network depth increases.
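This convergence is easy to verify numerically; below is a toy demonstration (mine, not from the course) on a small connected graph with self-loops:

```python
import numpy as np

# Toy graph: 4 nodes, connected; self-loops make the walk aperiodic.
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
A_hat = A_tilde / A_tilde.sum(axis=1, keepdims=True)  # row-normalized walk matrix

for L in (1, 10, 100):
    print(L, np.linalg.matrix_power(A_hat, L).round(3))
# As L grows, all rows converge to the same stationary distribution, so
# A_hat^L @ X maps every node's features toward one and the same point.
```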

On the other hand, in the nonlinear case with ReLU activation, H_L converges to a particular subspace M. First, let's define the subspace M:

As the depth increases, the hidden representation gets closer and closer to the subspace M: the distance from H_{L+1} to the subspace is at most:

It should be noted that λ_{m+1} is the largest eigenvalue of the normalized adjacency matrix other than 1, and s_L is the largest singular value of the model parameters W_L.
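(Writing d_M(·) for the distance to the subspace M, the per-layer bound takes the following form, as stated in the paper cited below:)

$$d_{\mathcal{M}}\!\left(H^{(l+1)}\right)\;\le\; s_l\,\lambda_{m+1}\; d_{\mathcal{M}}\!\left(H^{(l)}\right)$$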

Let's now unpack this convergence bound. The convergence of the normalized adjacency matrix satisfies this inequality.

If we assume that the dimension of the subspace is m, then the m largest λ correspond to the subspace, and the remaining eigenvalues are bounded by λ_{m+1}.

Then, the convergence of model parameters W_l and ReLU satisfies the following two inequalities respectively:

For detailed proofs of these inequalities, see the ICLR 2020 paper "Graph Neural Networks Exponentially Lose Expressive Power for Node Classification".

By combining these inequalities, we obtain the convergence of the subspace distance of the hidden representations across layers. As the number of layers goes to infinity, the subspace distance goes to zero, so the hidden representation converges to the subspace M.

Now, what about the more general case of GCN with ReLU plus bias? Here H_L converges to the surface of a particular subcube O(M, r). First, we write the GCN formula with bias:

Obviously, since the distance from b_L to the subspace is constant, its convergence satisfies:

It can be seen that as L approaches infinity, the right-hand side of the inequality is the sum of an infinite geometric sequence:

Therefore, H_L approaches the surface of a subcube whose distance r from the subspace M equals the limit of this geometric series.
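(As a sketch of that limit: assuming each layer satisfies d_M(H^{(l+1)}) ≤ s_max λ_{m+1} d_M(H^{(l)}) + d_M(b_l), with s_max λ_{m+1} < 1 and bounded bias terms, summing the geometric series gives a radius of roughly:)

$$r=\frac{\sup_{l}\, d_{\mathcal{M}}(b_l)}{1-s_{\max}\,\lambda_{m+1}}$$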

In summary, the three cases analyzed above follow a universal pattern, and the following inequality unifies the over-smoothing cases:

The different cases are then recovered by substituting different values of v and r:

Overfitting and training dynamics

Next, let's look at overfitting. Overfitting is a common problem in deep learning; it occurs when the number of data points is small and the number of parameters is large. The model then fits the training data perfectly (essentially memorizing it) but performs poorly on the validation data.

Training dynamics is also a common problem in deep learning. By the chain rule, as the model becomes deeper, gradients vanish when the product of s_{l-1} and λ_{m+1} is less than 1. If we use RGB colors as node features, as in the figure below, we can see that by 500 layers the gradients of these features have dropped to 0 and the nodes become black, i.e., RGB = [0, 0, 0].

Overfitting and training dynamics are common deep learning problems and will not be discussed further here. Let's instead look at what has been done to solve the over-smoothing problem.

How can we solve the over-smoothing problem?

First, how do we quantify over-smoothing? Taking GCN as an example, we first define ε-smoothing: for every layer l greater than some layer L, the subspace distance of the hidden representation H_l is less than ε:

Then, the ε-smoothing layer is defined as the smallest layer at which the subspace distance of H_l falls below ε. However, the exact ε-smoothing layer is difficult to derive, so we use a relaxed ε-smoothing layer as an upper bound. The formula for the relaxed ε-smoothing layer is as follows:
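(One reconstruction consistent with the geometric bound above: solving (s_max λ_{m+1})^l · d_M(H^{(0)}) ≤ ε for l gives:)

$$\hat{l}(\varepsilon)=\left\lceil\frac{\log\!\big(\varepsilon/\,d_{\mathcal{M}}(H^{(0)})\big)}{\log\!\big(s_{\max}\,\lambda_{m+1}\big)}\right\rceil$$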

After quantifying the smoothing with layers, we can find ways to alleviate the smoothing problem.

Note that λ_{m+1} is related to the adjacency matrix and s_max is related to the model weights, which implies two directions for alleviating the over-smoothing problem.

First, we can reduce over-smoothing by manipulating the adjacency matrix, that is, by increasing λ_{m+1}, which in turn increases the relaxed ε-smoothing layer:

So how do we do this? Simply drop some edges at each epoch. Research shows that when some edges are dropped, information propagates more slowly on the graph, and the dimension of the subspace increases as the number of connected components increases. Both of these effects of edge dropping help alleviate the over-smoothing problem.
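A minimal DropEdge-style sketch (illustrative only; the official implementation operates on sparse edge lists and differs in details):

```python
import numpy as np

def drop_edge(A, p=0.1, rng=None):
    """Randomly remove a fraction p of the edges of an undirected graph
    (dense adjacency matrix A) before each training epoch."""
    if rng is None:
        rng = np.random.default_rng()
    i, j = np.triu_indices_from(A, k=1)               # upper-triangle pairs
    drop = (A[i, j] > 0) & (rng.random(len(i)) < p)   # edges selected to drop
    A = A.copy()
    A[i[drop], j[drop]] = 0
    A[j[drop], i[drop]] = 0                           # keep A symmetric
    return A
```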

For more details and experimental demonstrations, see the Tencent AI Lab paper at ICLR 2020: "DropEdge: Towards Deep Graph Convolutional Networks on Node Classification".

Second, we can ease over-smoothing by adjusting the model weights: to increase s_max, we can increase the initial values of the weights W_l. The figure below shows the effect of this approach. See the ICLR 2020 paper "Graph Neural Networks Exponentially Lose Expressive Power for Node Classification".

Another ICLR 2020 paper, "PairNorm: Tackling Oversmoothing in GNNs", proposes a technique that addresses the training dynamics problem.

The idea of PairNorm is to center and rescale (normalize) the GCN output so that the total pairwise squared distance remains constant. As shown in the figure, the outputs of graph convolution are usually closer to each other than the inputs; with PairNorm, the pairwise distances of the outputs remain similar to those of the inputs.
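A sketch of this centering-and-rescaling step (following the scheme described in the PairNorm paper; treat it as an illustrative re-implementation rather than the reference code):

```python
import numpy as np

def pair_norm(H, s=1.0):
    """Center node representations, then rescale so the total pairwise
    squared distance stays (roughly) constant from layer to layer."""
    H = H - H.mean(axis=0, keepdims=True)         # center over nodes
    scale = np.sqrt((H ** 2).sum(axis=1).mean())  # RMS of row norms
    return s * H / scale
```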

Another way to overcome training-dynamics problems is to add shortcut connections to the architecture. This is how GNNs such as JKNet, IncepGCN, and APPNP maintain their performance in deep configurations.

Because all three models contain shortcuts from intermediate aggregation layers to the final output, shallow information can propagate to the final layers, so these models are in effect still "shallow models".

By the way, these different GNN structures still satisfy the general case of over-smoothness:

In summary, with the help of these new theoretical advances, the question of training deeper GNN has been preliminarily answered.

3 Large-Scale Graph Neural Networks

Graphs in the real world can be very large, so enabling GNNs to process large-scale graphs is a very important research topic.

Basic GNNs are often unable to handle large-scale graphs because they cannot meet the large memory requirements and their gradient updates are inefficient.

To enable GNNs to handle large-scale graphs, researchers have proposed three different sampling paradigms: node-wise sampling, layer-wise sampling, and subgraph-based sampling.

Among them, node-wise sampling is performed per target node, layer-wise sampling is performed per convolution layer, and subgraph-based sampling draws subgraphs from the original graph and then uses those subgraphs for model inference.

Under these three paradigms, realizing large-scale GNNs requires solving two problems: how to design an efficient sampling algorithm, and how to guarantee sampling quality.

There have been some achievements in building large-scale GNN in recent years, and the following chart gives a timeline of these achievements:

We will briefly describe the results of these studies in this timeline.

Let's start with GraphSAGE, which can be seen as an extension of the original GCN: to GCN's mean aggregator it adds several generalized aggregators, including a pooling aggregator and an LSTM aggregator. Different aggregators affect the model differently. In addition, after aggregation, GraphSAGE uses concatenation to combine the information of the target node itself with that of its neighbors, instead of the summation used by GCN. These two major improvements are based on a spatial understanding of GCN.

To scale GNNs up, GraphSAGE first adopted mini-batch training, which reduces the cost of each training step: in each iteration, only the nodes needed to compute the current batch's representations are considered. However, mini-batch training can suffer from the neighbor-explosion problem: with many layers, a mini-batch may end up using most or even all of the nodes of the graph!

To solve this problem and further improve performance, GraphSAGE adopts fixed-size neighborhood sampling: at each layer, a fixed-size set of neighboring nodes is sampled.

As can be seen from the figure on the upper right, fixed-size sampling reduces the number of sampled nodes, and the gap is even more significant on large graphs. However, GraphSAGE still cannot avoid neighbor explosion when the number of layers is large, and its sampling quality has no theoretical guarantee.
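A minimal sketch of fixed-size neighborhood sampling (a hypothetical helper written for illustration; it samples with replacement for simplicity and assumes every node has at least one neighbor):

```python
import random

def sample_neighbors(adj, batch_nodes, fanouts=(10, 10)):
    """Per GNN layer, sample a fixed number (fanout) of neighbors for each
    node currently needed, so a mini-batch never expands to the whole graph.

    adj: dict node -> list of neighbors. Returns one sampled-neighbor dict
    per layer, expanding outward from the batch nodes.
    """
    layers, frontier = [], list(batch_nodes)
    for k in fanouts:
        sampled = {v: random.choices(adj[v], k=k) for v in frontier}
        layers.append(sampled)
        frontier = sorted({u for nbrs in sampled.values() for u in nbrs})
    return layers
```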

To further reduce the sample size and obtain theoretical quality guarantees, VR-GCN incorporates a control-variate-based estimator (CV sampler). It maintains historical hidden embeddings to obtain better estimates, which reduces the variance; driving down the variance in turn permits a smaller sample size. The mathematical form of VR-GCN is as follows:

However, VR-GCN has a downside: it requires extra memory to store all the historical hidden embeddings, which makes it difficult to scale up.

As we can see above, node-wise sampling cannot completely solve the neighbor-explosion problem. Let's look at layer-wise sampling instead.

FastGCN proposes to understand GCN from the perspective of functional generalization and provides a layer-wise estimation form for GCN:

Based on this, we can sample a fixed number of nodes at each layer. FastGCN further proposes importance-based sampling to reduce the variance. In this scheme, each layer is sampled independently of the others, and all layers use the same sampling probabilities. Here is a comparison of GCN and FastGCN:

As you can see, the computational cost of FastGCN is significantly lower, and the study shows that this sampling scheme is not expected to lose much information: because it randomizes through importance sampling, with enough training epochs every node and every link is expected to be sampled.

Thus FastGCN's layer-wise sampling completely solves the neighbor-explosion problem, and the sampling method comes with a quality guarantee. The disadvantage of this approach is that the correlations between layers are not captured, which may hurt model performance.

To better capture the correlations between layers, ASGCN proposes an adaptive layer-wise sampling method, which dynamically adjusts the sampling probabilities of lower layers according to the sampling results of higher layers. As shown in the figure below, when sampling the lower layer, ASGCN samples from the neighbors of the nodes sampled in the upper layer, so the correlations between layers are preserved. As shown in the figure on the right below, the entire sampling process is top-down: we first sample the target nodes in the output layer, then sample the middle-layer nodes based on that result, and repeat the process all the way down to the input layer. Throughout, the number of nodes sampled at each layer stays fixed.

In addition, to reduce the sampling variance, ASGCN introduces an explicit variance-reduction method, optimizing the sampling variance as part of the loss function.

On the whole, ASGCN provides better performance and variance control, but the extra inter-layer dependency in the sampling process may reduce sampling efficiency.

Next comes subgraph-based sampling. ClusterGCN first uses a graph-partitioning algorithm to decompose the graph into smaller clustered subgraphs, then forms random batches at the subgraph level and feeds them into the GNN model, thereby reducing the computation required for a single step.

By limiting the subgraph size, this method also effectively avoids the neighbor-explosion problem, because at each layer the sampling range never exceeds the cluster subgraph.

However, ClusterGCN’s paper does not carry out empirical research on the sampling quality of this method.

GraphSAINT does not use a clustering algorithm during sampling (which would introduce additional bias and noise); instead, it samples the subgraphs for mini-batch training directly through a subgraph sampler. The paper presents three sampler constructions, namely node-based sampling, edge-based sampling, and random-walk sampling, as shown below:

GraphSAINT also theoretically analyzes how to control the sampler's bias and variance, including loss normalization and aggregation normalization methods for eliminating sampling bias:

In addition, the paper proposes to reduce the sampling variance by adjusting the edge-sampling probabilities. As the best method in this line to date, GraphSAINT's performance has also been demonstrated experimentally. For details, see the ICLR 2020 paper "GraphSAINT: Graph Sampling Based Inductive Learning Method".

Clearly, there is much room for further research in large-scale graph neural networks, such as more efficient sampling techniques, architectures suitable for heterogeneous or dynamic graphs, and so on.

4 Self-Supervised/Unsupervised Learning of Graph Neural Networks

The expressiveness, depth, and scale of GNN discussed above all assume a supervised setting, i.e., we have labels for the input graphs. But in real life, obtaining these labels is not easy. For example, in molecular property prediction tasks, professional assistance is needed to obtain ground-truth labels. In addition, the training task and the test task are not always consistent: for a social-network recommendation task, training may use data about which items a user has purchased, while at test time we want to know whether the user would like to watch a certain movie, so the training labels may be of no use in testing. Therefore, we need to study how to train GNNs without labels.

There have already been some achievements in self-supervised graph learning. By mechanism, they fall into two broad categories: predictive methods and information-theoretic methods. By the task to be solved, they can be divided into two settings: node classification and graph classification.

Predictive methods

Let's look at predictive methods first. Yann LeCun said: "A system of self-supervised learning is one that predicts some parts of its input from other parts of its input." This means that in self-supervised learning, the structure of the input is important. Graphs, in turn, are highly structured and therefore naturally well suited to self-supervised learning.

For the node classification task, there are usually two ways to realize self-supervised learning: one enforces similarity between adjacent nodes, and the other reconstructs each node's representation from its neighboring nodes.

First, let's look at the first approach, introduced by GraphSAGE, whose basic idea is to force each node to have a representation similar to those of its neighbors. Let h_u be the representation of a node adjacent to the node with representation h_v; our goal is to maximize their inner product. These adjacent nodes are called positive examples. We then minimize the similarity between h_v and nodes obtained by negative sampling, which are usually sampled uniformly from the entire graph. In this way, we can train the parameters of the GNN with backpropagation.
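A sketch of this objective (variable names are mine; the GraphSAGE paper sums over Q uniformly drawn negatives):

```python
import torch
import torch.nn.functional as F

def unsup_graphsage_loss(z_v, z_pos, z_neg):
    """Pull the inner product with a neighboring node (positive example)
    up, push the inner products with sampled negative nodes down.

    z_v: (d,) target node embedding; z_pos: (d,) neighbor embedding;
    z_neg: (Q, d) embeddings of Q negative samples.
    """
    pos = F.logsigmoid(torch.dot(z_v, z_pos))  # similarity to the neighbor
    neg = F.logsigmoid(-(z_neg @ z_v)).sum()   # dissimilarity to negatives
    return -(pos + neg)
```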

The second approach, EP-B from Durán & Niepert, first computes an aggregation of the representations of neighboring nodes. The goal is to minimize the distance between this reconstruction and the node's true representation, while maximizing the distance between the aggregated representation and the representations of other nodes.

The main difference between EP-B and GraphSAGE is that EP-B enforces similarity between each node and the aggregation of its neighbors, whereas GraphSAGE directly enforces similarity between each node and its individual neighbors.

What methods are there for the graph classification task?

The first method we introduce is the N-Gram Graph. The method has two stages: a node representation stage and a graph representation stage. In the node representation stage, a traditional self-supervised node-embedding method, CBoW, is used to learn node representations. In the second stage, given the node representations, a representation is first computed for each path of length n, called an n-gram path; the path representation is the product of the representations of the nodes in the path. The method then enumerates all n-gram paths in the graph and sums the representations of all paths. The final graph representation is the concatenation of the representations for 1-gram through T-gram paths.

In fact, such a computation is equivalent to a GNN without training. The N-Gram Graph is trained without any supervision.

PreGNN, another method for graph classification, also has two stages: the first performs node representation (but with two entirely new methods), and the second uses a simple readout to obtain a graph-level representation from all the nodes. However, it trains the graph-level representations with a supervised strategy based on cross-entropy. The study points out that training at both the node level and the graph level is important for final performance.

Since the second stage is standard, we will only look at the first stage.

PreGNN proposes two loss functions at this stage. Context prediction forces a node's representation to be similar to its surrounding structure; in addition, negative sampling is used to minimize the similarity between h_v and the surrounding structures of other nodes. The idea is simple: the structure around each node defines that node's local topology.

The other loss is attribute masking, whose design was inspired by the powerful NLP model BERT. In simple terms, nodes are randomly masked, and the training loss function is constructed to predict the masked attributes from the input. It's simple, but it works.

Another approach worth mentioning is GCC, which uses contrastive learning to perform unsupervised learning at the graph level. One of the main limitations of GCC, however, is that it has no node-level training.

In summary, for the graph classification task, N-Gram Graph and PreGNN both use node-level self-supervision, whereas GCC uses graph-level self-supervision. The natural question is: can we use self-supervision at both the node level and the graph level?

The answer, of course, is yes. This brings us to GROVER, proposed by a Tencent AI Lab research team.

GROVER is also divided into two stages. At the node stage, edges are considered as well, but for simplicity only the node representation process is discussed here. In this stage, a dictionary pool of local structures is first extracted for each node; call these the keys. Each node is then masked, as in BERT, and the model predicts the local structure around it: if the local structure matches a key in the dictionary, a 1 is output on that dimension, otherwise a 0. Note that this is a multi-label classification problem, because each node usually has multiple local structures. As a result, only one kind of masking is needed to accomplish the two things PreGNN does.

At the graph stage, the prediction is similar. Graph motifs, which are composed of typical functional groups (such as benzene rings), are first extracted. A GNN then produces an output for each graph, and from this output the model predicts whether the graph contains these motifs. Note that this is also a multi-label classification problem.

In addition, in this work Tencent AI Lab also proposed a powerful Transformer-like GNN model: GTransformer. It first uses a newly proposed message passing network with a dynamically extended range, named dyMPN, to obtain the keys, queries, and values for each input graph, and then produces the final output as in a Transformer. Experimental results demonstrate the power of this model.

These are the key components of GROVER. Furthermore, the team implemented a distributed graph training framework and ultimately succeeded in pre-training a model with 100 million parameters on 10 million unlabeled molecules.

Using this pre-trained model, we can fine-tune on downstream tasks. On this basis, GROVER achieves significantly better performance than traditional methods such as MPNN and DMPNN, as well as self-supervised methods such as N-Gram Graph and PreGNN.

Information-theoretic methods

Having introduced the predictive methods, let's now look at the information-theoretic methods.

A good representation should preserve much of the information in the input. Inspired by this, Vincent et al. (2010) proposed using autoencoders for representation learning, meaning that the hidden representation should decode back to the input.

However, autoencoders are resource-intensive, requiring both encoding and decoding, and in the graph domain how to decode a graph is still an open problem. So is there another way to measure the information between the representation and the input directly? Yes: mutual information.

Given two random variables, mutual information is defined as the KL divergence between their joint distribution and the product of their marginal distributions, which can be further rewritten as entropy minus conditional entropy.
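(In standard notation, this definition reads:)

$$I(X;Y)=D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big)=H(X)-H(X\mid Y)$$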

Why does mutual information capture the information relationship? One way to see it: if X and Y are independent, then p(X)p(Y) = p(X,Y), so the mutual information equals 0, meaning X and Y are unrelated, which makes sense because they are independent. Conversely, if the conditional entropy is 0, then X is fully determined by Y, and the mutual information reaches its maximum.

Hjelm et al. (2019) note that autoencoding, through its reconstruction error, corresponds to maximizing a lower bound on the mutual information.

Mutual information is very difficult to compute exactly, and only in recent years have feasible estimators emerged. There are three typical approaches (MINE, JSD MI, and InfoNCE MI); the basic idea is to learn a neural network that maximizes a surrogate function of the mutual information. Please refer to the papers for details.

Going back to graphs: can we use mutual information to achieve self-supervised learning on graphs? DGI is the first research result in this direction. Its goal is to maximize the mutual information between the input (the node features X together with the adjacency matrix A) and the output representations h_i. DGI uses the JSD estimator, which involves both positive and negative examples.

However, it is difficult to compute this mutual information directly, and we might need another GNN as a surrogate. DGI instead uses a readout representation s in place of the input. As shown in the figure below, there are two inputs, the original graph and a corrupted graph serving as the negative example; the same GNN produces outputs for both, and a readout function produces s. Using s in place of X and A in the original objective yields a surrogate objective function.

DGI proves that this operation does not lead to a loss of information, and also proves that this substitution is in fact equivalent to the true mutual information.

But DGI still has some problems. First, it requires a readout function to compute the mutual information, and this readout function needs to be injective, which is not easy to guarantee. Second, it needs to construct a corrupted graph to obtain negative examples, which is inefficient. Moreover, in experiments DGI required different encoders for different tasks, which is impractical.

To solve these problems, Tsinghua University, Xi'an Jiaotong University, and Tencent AI Lab jointly proposed GMI, whose basic idea is to compute the mutual information directly, without readout functions or corrupted samples.

In GMI, the mutual information is defined in two parts. The first is feature mutual information, which measures only the information relationship between node features and representations. The other is topological mutual information, the mutual information between the predicted edges and the original adjacency matrix.

Obviously, this approach accounts for both edges and features without needing a readout function or corrupted samples. More importantly, the feature mutual information can be further decomposed.

It can be shown that the feature mutual information decomposes into a weighted sum of local mutual information terms, each measuring the mutual information between one node and its representation. The weights depend on the setting; setting them equal to the predicted edge weights works well. The JSD mutual information estimator can then be used to compute the feature mutual information and the edge mutual information.

Experimental results on node classification tasks have shown that GMI performs better, and the associated code has been published: github.com/zpeng27/GMI

For an information-based approach to graph classification, see the ICLR 2020 paper InfoGraph: Unsupervised and semi-supervised Graph-Level Representation Learning via Mutual Information Maximization.

As an effective deep learning tool, graph neural networks have been applied in many fields, such as molecular property prediction, biological analysis, and finance. Here, taking Tencent AI Lab's work in social networks and medical imaging as examples, we introduce the application progress of graph neural networks.

1. GNN for Social Networks

In the paper "Semi-Supervised Graph Classification: A Hierarchical Graph Perspective", Tencent AI Lab proposes a semi-supervised graph classification method based on hierarchical graphs.

A hierarchical graph is a set of graph instances connected to each other by edges, as shown in the figure below:

In many real-world applications, data can be modeled as hierarchical graphs, such as social networks with group structures and document collections (e.g., graphs of words connected by citation relationships). As shown above, suppose we have a "user-group" hierarchical graph and we know some of the group labels; how can we predict the labels of the other groups?

If we consider only the connections between groups, the problem reduces to node classification. However, each group has its own internal user graph, and ignoring that information would be inappropriate. To make use of graph information at both the user level and the group level, we face two challenges: how do we represent a graph of arbitrary size as a fixed-length vector, and how do we integrate information at the instance level and at the hierarchical level?

Let's start with the first question. Graph representations and node representations live at different levels: at the node level, a graph G is projected into a hidden space of size n×V, while at the graph level, G is projected to a hidden vector of size V. To transform the node-level space into a graph-level vector, self-attentive graph embedding (SAGE) is introduced here.

First, a single graph is passed through a two-layer GCN to obtain a node-level representation H of size n×V, from which the self-attention scores S are computed as in the figure above. After a softmax, we obtain multi-head self-attention scores with r heads, of size r×n. Applying these scores to the node-level representation then yields a matrix of fixed size r×V. SAGE has three advantages: 1) thanks to self-attention, its output size is constant; 2) thanks to GCN, it is permutation-invariant; and 3) thanks to self-attention, it can exploit node importance.
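A minimal sketch of this pooling step (the weight shapes W1: (a, V) and W2: (r, a) are my assumptions for illustration, following the common self-attentive embedding construction):

```python
import numpy as np

def self_attentive_embedding(H, W1, W2):
    """Turn a size-(n, V) node-level representation H into a fixed-size
    (r, V) graph-level representation using r self-attention heads."""
    logits = W2 @ np.tanh(W1 @ H.T)                 # (r, n) attention logits
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over nodes
    return S @ H                                    # (r, V): independent of n
```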

For the second question, how do we integrate the information at the instance level and at the hierarchical level? Here, the instance level is graph-level learning based on SAGE, while the hierarchical-level model performs node-level learning. Feature sharing is used to connect the output of SAGE to the input of the GCN, and a new disagreement loss is introduced to minimize the inconsistency between the instance classifier and the hierarchical classifier.

In addition, active learning is used to address the problem of small sample sizes: the disagreement loss is used to select instances for external annotation. For details of the two algorithms, SEAL-AI and SEAL-CI, and the related experimental results, please refer to the paper.

Next is another Tencent AI Lab study, "Rumor Detection on Social Media with Bi-Directional Graph Convolutional Networks", accepted by AAAI 2020, which proposes a new method for rumor detection on social networks based on bi-directional graph convolutional networks.

Rumors can be regarded as one of the most persistent diseases facing our society today. The paper proposes to detect rumors on social media through following and reposting relationships. Whether rumor or news, both spread in a tree structure. Generally speaking, however, rumor propagation has two properties: first, as shown in figure (b) below, a rumor propagates deeply along a chain of relationships; second, as shown in figure (c), a rumor spreads widely on social media, for example when a Twitter user has a large number of followers.

To capture both of these properties of rumor propagation simultaneously, the authors designed a new GCN-based model. This bi-directional GCN for rumor detection consists of four components: 1) two different directed graphs describing the propagation and the dispersion of rumors; 2) a two-layer GCN to compute high-level node representations, which learns not only the feature information but also the rumor propagation topology; 3) since, by observation, the root node usually contains the main content of the rumor or news while followers usually just forward it without adding content, the root features are connected to every node in the tree to enhance the hidden features at each layer; 4) the propagation and dispersion representations are obtained by pooling the node representations, and these two representations are then aggregated to produce the final result.

Experiments on Twitter15, Twitter16, and Weibo verify the effectiveness of this method; the results show that the new method performs significantly better.

In addition, the authors evaluate early rumor detection by providing only a very limited number of nodes in the rumor tree and setting a detection deadline; the results show that the graph-based approach is well suited to the early detection of rumors.

2. GNN for Medical Imaging

Medical imaging is another important application scenario for GNN, where Tencent AI Lab has achieved some important research results over the past two years. First, let's look at the Tencent AI Lab MICCAI 2018 paper "Graph CNN for Survival Analysis on Whole Slide Pathological Images", which proposes using graph convolutional networks for survival analysis based on whole-slide pathological images.

The goal of survival analysis is to predict the risk of specific events, including organ failure, adverse drug reactions, and death. Effective analysis results have important clinical value, but in practice there are many difficulties.

First, whole-slide image (WSI) analysis is a computation-intensive process: a single WSI exceeds 0.5 GB of data and contains millions of cells, and the analysis involves both local and global features, so it is very complex. In addition, how to apply the topological features of a WSI to survival analysis is still an open problem.

To this end, the authors propose to model a WSI as a graph and develop a graph convolutional neural network (Graph CNN) with an attention mechanism, which achieves better survival analysis by providing an optimal graph representation of the WSI.

Experimental results show that this new method is superior to other methods.

This part also introduces other recent GNN work on medical images. In "Graph Convolutional Nets for Tool Presence Detection in Surgical Videos", published at IPMI 2019, the authors propose using GCN to detect the tools present in surgical videos, one of the core problems of automated surgical video content analysis, with applications such as evaluating the use of surgical equipment and automatically generating surgical reports. The model uses GCN along the time dimension to learn better features by considering the relationships between consecutive video frames.

In the paper "Graph Attention Multi-Instance Learning for Accurate Colorectal Cancer Staging", published at MICCAI 2020, the authors propose using graph attention multi-instance learning to accurately determine whether colorectal cancer is in an early, middle, or late stage.

Summary and Outlook

This course introduced the development history of graph neural networks, including GNN expressive power, extensions in depth and scale, progress in self-supervised/unsupervised learning, and more; it also briefly introduced some of Tencent AI Lab's preliminary achievements in applying graph neural networks to social networks and medical imaging.

The field of graph deep learning is still developing, with many interesting problems waiting to be solved. For example, inverse graph identification (IGI): in the graph classification problem, can we infer the label of each node from the label of the graph? Other directions include subgraph identification, i.e., how to find the key subgraphs in a graph; the combination of graphs with few-shot learning to form graph few-shot learning problems; and research on the robustness of graph deep learning, including attacks and defenses on graphs. Finally, hierarchical graphs are also a hot research direction. Graph neural networks will play an even more important role in future artificial intelligence research and applications.