“This article covers the basics of graph convolution and the supporting concepts around it. I believe readers will be able to get a smooth start on understanding GCN!”

Daisy Confesses

Source: Zhihu column Yu Zhen machine learning notes.

Editor: happyGirl

Specifically, this article covers:

  • What are the types of graph networks, and what are the differences and connections between them?
  • What role does graph convolution play, and what exactly is graph convolution?
  • Is there a general formula for graph convolution, and how can the different variants be unified under it?
  • Is there a simple example illustrating how such a network is trained?
  • Which code implementations and libraries can help you get started?

This article does not cover:

  • The mathematical heavy lifting from the Fourier transform to the Laplace operator to the Laplacian matrix; interested readers can consult the references.

Other articles in this series:

“One article to understand Transformer and Attention” (zhuanlan.zhihu.com/p/90033981)

“One article to understand the residual network ResNet” (https://zhuanlan.zhihu.com/p/91385516)

Graph Neural Network GNN:

“What is the Weisfeiler-Lehman (WL) algorithm and the WL Test?” (zhuanlan.zhihu.com/p/90645716)

“A zero-prerequisite tutorial for the graph neural network library PyTorch Geometric (PyG)” (zhuanlan.zhihu.com/p/91229616)


Graph network classification

To begin, let us sort out the differences and relationships among three frequently mentioned terms: Graph Embedding, Graph Neural Network, and Graph Convolutional Network.

Graph Embedding

Graph Embedding (GE) belongs to the field of representation learning and is also called Network Embedding, graph representation learning, network representation learning, and so on. It usually carries two levels of meaning:

  • Representing the nodes of a graph as low-dimensional, real-valued, dense vectors, so that the resulting vectors support representation and reasoning in vector space and can be used in specific downstream tasks. For example, in a user social network each user gets a representation vector, which can then be used for node classification.
  • Representing the whole graph as a low-dimensional, real-valued, dense vector, which is used to classify entire graph structures.

There are three main approaches to graph embedding:

  • Matrix factorization: methods based on matrix factorization express the relationships between nodes as a matrix and then factorize that matrix to obtain the embedding vectors. Matrices commonly used to represent node relationships include the adjacency matrix, the Laplacian matrix, the node transition probability matrix, the node attribute matrix, and so on. Different factorization strategies can be applied depending on the properties of the matrix.
  • DeepWalk: DeepWalk builds on the word vector model word2vec. When word2vec trains word vectors it takes a text corpus as input, whereas the input for graph embedding is a whole graph, and the two seem unrelated. However, the authors of DeepWalk observed that both word frequencies in a corpus and the visit counts of nodes under random walks on a graph follow a power-law distribution. DeepWalk therefore treats nodes as words and random-walk sequences of nodes as sentences, and feeds them directly into word2vec to obtain node embeddings (a minimal sketch is given after this list). The resulting node embeddings can also serve as initialization for downstream tasks and improve their performance, which has spawned a great deal of related work.
  • Graph Neural Network: networks that combine graphs with deep learning methods are collectively called Graph Neural Networks (GNN), which is the main topic of the next section. GNNs can therefore be applied to graph embedding to obtain vector representations of a graph or of its nodes.
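To make the DeepWalk idea above concrete, here is a minimal sketch (not the original authors' code) that generates random-walk "sentences" with networkx and trains gensim's word2vec on them; the example graph, walk count, and hyperparameters are arbitrary choices for illustration.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Toy graph; in practice this would be a social network, citation graph, etc.
G = nx.karate_club_graph()

def random_walk(graph, start, length=10):
    """Generate one random-walk 'sentence' of node ids starting from `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(n) for n in walk]  # word2vec expects string tokens

# Treat nodes as words and random walks as sentences, then train word2vec on them.
walks = [random_walk(G, node) for node in G.nodes() for _ in range(20)]
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)

node_vec = model.wv["0"]  # embedding of node 0, usable as downstream features
```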

Graph Neural Network

Graph Neural Network (GNN) is the general name for models that apply neural networks to graphs. Depending on the techniques used and the classification criteria, GNNs can be divided into different types, as shown in the figure below. For example, from the perspective of the propagation mechanism, graph neural networks can be divided into graph convolutional networks (GCN), graph attention networks (GAT, abbreviated this way to distinguish it from GAN), Graph LSTM, and so on. In essence these are new attempts to carry over the network structures and techniques developed for text and images. This article will not cover each of them in detail; as an introduction, we focus on the most classic and instructive basic model, GCN, which is also the foundation for understanding the other models.

Classification of graph neural networks (GNN): existing work on graph models is categorized along three dimensions: graph type, training method, and propagation mechanism

Graph Convolutional Network

As the classification above shows, a Graph Convolutional Network (GCN) is a type of graph neural network that uses graph convolution. Starting from the simplest form of graph convolution, it has developed into numerous variants, and it plays the same role in the field of graph networks that the convolution operation plays in image processing.

Distinctions and connections among the three

As shown in Figure 2, the three concepts can be summarized in one sentence: the graph convolutional network (GCN) is a member of the graph neural network (GNN) family; it is a graph neural network that uses the convolution operation, and it can be applied to graph embedding (GE).

Convolution VS graph convolution

To understand graph convolution, the core operation of a graph convolutional network, it helps to compare it with the role that convolution plays in a CNN.

As shown in the figure below, a digital image is a two-dimensional discrete signal. Convolving an image means sliding a convolution kernel (convolution template) over the image: at each position, the gray value of each pixel under the kernel is multiplied by the corresponding value in the kernel, the products are summed, and the sum becomes the new gray value of the pixel under the center of the kernel. This process is repeated as the kernel slides over the whole image.

Using a (randomly initialized) shared convolution kernel, a weighted sum of pixels is computed to extract a specific feature, and the kernel parameters are then optimized with backpropagation so that features are extracted automatically; this is the cornerstone of feature extraction in CNNs.

Image convolution
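To make the sliding-window description concrete, here is a minimal numpy sketch of a single-channel 2D convolution (strictly speaking cross-correlation, as in most deep learning libraries); the image and kernel values are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` and compute the weighted sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # pixels currently under the kernel
            out[i, j] = np.sum(patch * kernel)  # multiply element-wise and sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "gray-scale image"
kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]], dtype=float)      # a Laplacian edge-detection kernel
print(conv2d(image, kernel))                      # 3x3 feature map
```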

In reality, however, many important datasets are stored in the form of graphs: social networks, knowledge graphs, protein interaction networks, the World Wide Web, and so on. Unlike images, whose pixels form a neatly arranged matrix, these graph networks are unstructured. Is there a general paradigm for feature extraction on graphs, analogous to convolution in the image domain? That is exactly what graph convolution means in graph convolutional networks.

Most graph models share a similar general expression, and models of this form are collectively referred to as GCNs. Graph convolution can therefore be regarded as a powerful tool for processing unstructured data. As research in this area deepens, our handling of knowledge will no longer be limited to regularly structured data (as in CV and NLP), and more attention will be paid to this broader and richer domain.

Let’s take this paradigm apart step by step.

Graph convolution

Example of graph structure

Definition of a graph

For a graph, we have the following definitions:

For a graph $G = (V, E)$, $V$ is the set of nodes and $E$ is the set of edges. Each node has its own features, which can be collected into a matrix $X \in \mathbb{R}^{N \times D}$, where $N$ is the number of nodes and $D$ is the number of features per node, i.e., the dimension of the feature vectors.

A visual understanding of graph convolution

Before plunging into the graph convolution formula, let's first understand the physical meaning of this operation from another angle and build an intuition. If we want to obtain a representation of a node, the most convenient and effective approach that comes to mind is to use the nodes around it: its neighbors, its neighbors' neighbors, and so on. This idea can be summed up in one sentence:

Every node in the graph constantly changes its state under the influence of its neighbors and of more distant nodes, until a final equilibrium is reached; the closer the relationship, the greater the neighbor's influence.

In fact, the idea of gathering information from neighboring nodes appears in many areas, such as word2vec and PageRank. Reference [2] explains this point in great detail.

For the mathematical details of how to go from the Fourier transform to the Laplace operator to the Laplacian matrix, see the blog post [7]. To avoid confusing beginners whose mathematical background is not so strong (such as me), we first establish the big picture without digressing too far.

Definitions of graph-related matrices

So what can we use to measure the relationships between a node and its neighbors? Those who have studied graph theory will naturally think of the adjacency matrix and the Laplacian matrix. As a simple example, for the graph on the left of the figure below (for simplicity, an undirected, unweighted graph is used), its degree matrix $D$, adjacency matrix $A$ and Laplacian matrix $L$ are shown in the figure. The degree matrix $D$ has values only on the diagonal, namely the degree of each node, and 0 elsewhere. The adjacency matrix $A$ is 1 only where two nodes are connected by an edge and 0 elsewhere. The Laplacian matrix is $L = D - A$, the degree matrix minus the adjacency matrix. Note, however, that this is only the simplest kind of Laplacian matrix; several other Laplacian matrices will be introduced later.

A graph's degree matrix, adjacency matrix and Laplacian matrix
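As a small illustration (using a made-up 4-node graph, not necessarily the one in the figure), the following numpy snippet builds the three matrices:

```python
import numpy as np

# Undirected, unweighted toy graph with 4 nodes and edges (0-1, 0-2, 1-2, 2-3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
N = 4

A = np.zeros((N, N))        # adjacency matrix: 1 where an edge exists
for i, j in edges:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))  # degree matrix: node degrees on the diagonal
L = D - A                   # combinatorial Laplacian L = D - A

print(A, D, L, sep="\n\n")
```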

General formula for graph convolution

Any graph convolution layer can be written as a nonlinear function:

$$H^{(l+1)} = f\left(H^{(l)}, A\right)$$

where $H^{(0)} = X \in \mathbb{R}^{N \times D}$ is the input to the first layer, $N$ is the number of nodes in the graph, $D$ is the dimension of each node's feature vector, and $A$ is the adjacency matrix. Different models differ only in how the function $f$ is implemented.

Several specific implementations of $f$ are described below; the propagation matrices they use are all referred to as (variants of) the Laplacian matrix.

Implementation one

$$H^{(l+1)} = \sigma\left(A H^{(l)} W^{(l)}\right)$$

where $W^{(l)}$ is the weight matrix of the $l$-th layer and $\sigma(\cdot)$ is a nonlinear activation function, such as ReLU.

This design is based on the idea that a node's features are related to those of all its neighbors. Multiplying the adjacency matrix by the feature matrix is equivalent to summing, for each node, the features of its neighbors; stacking several such layers then brings in information from multi-hop neighbors.
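A minimal numpy sketch of this layer; the toy adjacency matrix, random features, and random weights below are made up purely for illustration:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # toy adjacency matrix

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                # H^(0) = X: node feature matrix
W = rng.normal(size=(3, 2))                # layer weight matrix

relu = lambda x: np.maximum(x, 0)
H_next = relu(A @ H @ W)                   # implementation one: sigma(A H W)
print(H_next.shape)                        # (4, 2)
```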

But there are two problems with this:

  • The influence of a node on itself is not taken into account;
  • The adjacency matrix $A$ is not normalized, which can cause problems when extracting graph features; for example, nodes with more neighbors tend to exert greater influence.

Implementations two and three are optimizations that address these two points.

Implementation two

$$H^{(l+1)} = \sigma\left(L H^{(l)} W^{(l)}\right), \qquad L = D - A$$

where $L$ is the Laplacian matrix (formally, the combinatorial Laplacian). This is an improvement on problem 1 of implementation one:

  • The degree matrix $D$ is introduced, so a node's own information is no longer ignored during propagation.
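A correspondingly small sketch of implementation two, assuming the propagation matrix is simply the combinatorial Laplacian $L = D - A$ (same kind of toy values as before, for illustration only):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # toy adjacency matrix
D = np.diag(A.sum(axis=1))                 # degree matrix
L = D - A                                  # combinatorial Laplacian

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                # node features H^(0) = X
W = rng.normal(size=(3, 2))                # layer weights
H_next = np.maximum(L @ H @ W, 0)          # implementation two: ReLU(L H W)
```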

Implementation three

$$H^{(l+1)} = \sigma\left(L^{\text{sym}} H^{(l)} W^{(l)}\right), \qquad L^{\text{sym}} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}}$$

Here the propagation matrix is the symmetric normalized Laplacian. The notation varies from source to source, but in essence this is an improvement on both problems of implementation one:

  • The degree matrix is introduced, which solves the self-transmission problem.
  • The adjacency information is normalized by multiplying it on both sides by the inverse square root of the node degrees. For each pair of nodes $i$ and $j$, the matrix elements are given by the following formula (for an undirected, unweighted graph):

$$L^{\text{sym}}_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } \deg(v_i) \neq 0 \\[4pt] -\dfrac{1}{\sqrt{\deg(v_i)\,\deg(v_j)}} & \text{if } i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\[4pt] 0 & \text{otherwise} \end{cases}$$

where $\deg(v_i)$ and $\deg(v_j)$ are the degrees of nodes $i$ and $j$, i.e., the corresponding diagonal entries of the degree matrix.

It may be a little confusing: how does multiplying both sides by the inverse of a matrix perform normalization? Let's briefly review what the inverse of a matrix essentially does.

Recall the definition of the matrix inverse: for an equation $AX = B$, to solve for the matrix $X$ we simply multiply both sides by $A^{-1}$, and the equation becomes $X = A^{-1}B$.

At the level of a single node, normalization means dividing by the node's degree, so that the contribution of each adjacent edge is normalized: a node with 10 edges does not automatically have more influence than a node with one edge, because after normalization each of its edges carries a weight of only 0.1. Going from this single-node view to the two-dimensional matrix view, dividing becomes multiplying by the inverse of the degree matrix; the essence of multiplying by a matrix inverse is matrix "division", i.e., normalization. Multiplying by the square root of the degrees of both node $i$ and node $j$ simply means taking into account the degrees of the nodes at both ends of an edge.
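A minimal numpy sketch of this symmetric normalization and the resulting layer (the toy adjacency matrix and random features/weights are assumptions for illustration, not code from the article):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # toy adjacency matrix
D = np.diag(A.sum(axis=1))                       # degree matrix
L = D - A                                        # combinatorial Laplacian

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))  # D^{-1/2}
L_sym = D_inv_sqrt @ L @ D_inv_sqrt              # symmetric normalized Laplacian

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                      # node features H^(0) = X
W = rng.normal(size=(3, 2))                      # layer weights
H_next = np.maximum(L_sym @ H @ W, 0)            # implementation three: ReLU(L_sym H W)
```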

Besides the two common Laplacian matrices above, there are other variants with different normalization schemes, such as those in [3] and [4]. According to the experiments in paper [5], no single form of convolution kernel performs better than the others in every scenario, so it is worth trying several in a specific application; still, the mainstream choice is implementation three, which is also the one most blog posts discuss.

Another way of saying it

All of the above is expressed in terms of matrix computations, which may look puzzling, so let's look at these formulas from the perspective of a single node (the essence is the same: as explained above, for a single node it is a division, and in matrix form it is multiplication by the inverse of the degree matrix). At layer $l+1$, the features $h_i^{(l+1)}$ of node $i$ are computed from its adjacent nodes, where $\mathcal{N}(i)$ is the set of all neighbors of node $i$:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{d_i d_j}}\, h_j^{(l)} W^{(l)}\right)$$

where $j \in \mathcal{N}(i)$ ranges over the neighbors of node $i$, and $d_i$ and $d_j$ are the degrees of nodes $i$ and $j$, i.e., the values of the degree matrix at those nodes. This is essentially equivalent to the matrix formula above, which is why some sources write the formula one way and others write it the other way.
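As a hedged illustration of this per-node view (same made-up toy graph and random parameters as in the earlier sketches), the aggregation can be written as an explicit loop over neighbors:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # toy adjacency matrix
deg = A.sum(axis=1)                          # node degrees d_i

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                  # node features h_i
W = rng.normal(size=(3, 2))                  # layer weights

H_next = np.zeros((4, 2))
for i in range(4):
    neighbors = np.nonzero(A[i])[0]          # N(i): neighbors of node i
    agg = sum(H[j] / np.sqrt(deg[i] * deg[j]) for j in neighbors)
    H_next[i] = np.maximum(agg @ W, 0)       # h_i^{(l+1)} = ReLU(sum_j h_j W / sqrt(d_i d_j))
```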

Writing all this up was not easy; if you found it helpful, remember to give it a like!

For more content, click "Read the original" at the end of the article, follow the author's column, and feel free to exchange ideas with the author!

References

[1] https://zhuanlan.zhihu.com/p/77729049 — classification of graph embedding methods

[2] www.zhihu.com/question/54… — a physical explanation of graph convolution

[3] www.zhihu.com/question/54…

[4] tkipf.github.io/graph-convo… — discusses the two papers in detail

[5] github.com/conferences…

[6] persagen.com/files/misc/…

[7] zhuanlan.zhihu.com/p/85287578 — the Laplacian matrix and the Laplace operator
