In-depth understanding of the graph convolutional network (GCN)



Preface

Deep learning develops at a rapid pace, from the classic deep networks (DNN, CNN, RNN) to GANs and reinforcement learning, and it covers more and more application scenarios. The graph neural network introduced here is yet another deep learning method. Although it can be counted as part of deep learning, it has its own distinctive application scenarios and algorithmic implementation, which can make it unfriendly to beginners. GCN is short for Graph Convolutional Network. This article aims to build a thorough understanding of GCN and help you quickly grasp its principle and usage.

I. Why is GCN needed

The input of a convolutional neural network (CNN) is data with a Euclidean structure, such as an image. The convolution kernel extracts features from this Euclidean space: because an image is a regular grid, the kernel can be translated across it to extract features. In other words, the core of CNN is its kernel, a small window that slides over the image and extracts features by convolution. The key is the translation invariance of the image structure: no matter where the small window is moved on the image, its internal structure is identical, so CNN can share parameters. That is what CNN is all about. But we often encounter topological networks or social networks, such as the following.

Such a network contains a varying number of nodes, and different nodes have different neighbors, so the traditional CNN cannot operate on this kind of graph structure. Moreover, the nodes in a graph are usually interconnected. GCN was proposed precisely to solve this problem.

II. Principle of GCN

1. Definition of a graph

A graph is written G = (V, E); graphs can be directed or undirected, but GCN considers only undirected graphs. V denotes the set of nodes, E denotes the set of edges, N denotes the number of nodes, and M denotes the number of edges. The notation used in GCN for a graph is introduced below:


  • $v_i \in V$ denotes a node
  • $e_{ij} = (v_i, v_j) \in E$ denotes the edge between node $i$ and node $j$
  • $N(v) = \{u \in V \mid (v, u) \in E\}$ denotes the set of all neighbors of node $v$
  • $A_{ij}$ denotes the adjacency matrix of the graph; $A_{ij} = 1$ means there is an edge between node $i$ and node $j$
  • $D$ denotes the degree matrix of the graph; $D$ is a diagonal matrix
  • $d_{ii} \in D$ denotes the degree of each node in $A$
  • $X \in R^{n \times d}$ denotes the feature vectors of the $n$ nodes, where the feature dimension is $d$
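As a small illustration (not part of the original post), here is a minimal NumPy sketch that builds the adjacency matrix $A$, the degree matrix $D$, and a feature matrix $X$ for a toy undirected graph; the edge list, node count, and feature dimension are arbitrary placeholders.

```python
import numpy as np

# Toy undirected graph with n = 4 nodes and edges (0,1), (0,2), (2,3).
n, d = 4, 5                      # number of nodes, feature dimension (both arbitrary)
edges = [(0, 1), (0, 2), (2, 3)]

A = np.zeros((n, n))             # adjacency matrix: A[i, j] = 1 if nodes i and j share an edge
for i, j in edges:
    A[i, j] = A[j, i] = 1        # undirected graph, so A is symmetric

D = np.diag(A.sum(axis=1))       # degree matrix: d_ii is the degree of node i
X = np.random.randn(n, d)        # feature matrix X in R^{n x d}

print(D.diagonal())              # degrees of the 4 nodes: [2. 1. 2. 1.]
```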

2. Enter GCN

The magic of GCN is that it aggregates the features of the nodes around each node: through weighted aggregation it learns node features that can then be used for a range of prediction tasks.

2.1 Matrix calculation formula

Suppose we have a batch of graph data with $n$ nodes, each with its own feature vector. The features of these nodes form an $n \times d$ feature matrix X, and the relationships between the nodes form an $n \times n$ adjacency matrix A. X and A are the inputs to our model.

For all nodes, $H^{(l)}$ denotes the matrix of node feature vectors at layer $l$, and $H^{(l+1)}$ denotes the feature matrix after one convolution operation. One convolution operation is given by:

$$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad \tilde{A}=A+I$$

In this formula:

  • $\tilde{A}=A+I$, where $I$ is the identity matrix (ones on the diagonal, zeros elsewhere)
  • $\tilde{D}$ is the degree matrix of $\tilde{A}$, computed as $\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}$
  • $H$ is the feature matrix of all nodes at each layer; for the input layer, $H^{(0)}$ equals $X$, with dimensions $[n, d]$
  • $\sigma$ is a nonlinear activation function, such as ReLU
  • $W^{(l)}$ is the trainable parameter matrix of the convolution at the current layer

In effect, this formula computes a weighted sum over each node's neighbors and multiplies the result by a parameter matrix to obtain the node features of the next layer. The node features are computed iteratively in matrix form: each layer propagates and convolves, and the node features are updated step by step.
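To make the propagation rule concrete, here is a minimal NumPy sketch of a single convolution step $H^{(l+1)}=\sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)})$; it is only an illustration, and the graph, feature dimension, and weight shapes are placeholder assumptions.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step: sigma(D~^{-1/2} (A + I) D~^{-1/2} H W), with sigma = ReLU."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                     # degrees of A~
    D_inv_sqrt = np.diag(d ** -0.5)             # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetrically normalized adjacency
    return np.maximum(A_hat @ H @ W, 0)         # ReLU(A_hat H W)

# Toy example: 4 nodes, 5-dimensional input features, 3 output features (arbitrary sizes).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = np.random.randn(4, 5)                      # H^(0) = X
W0 = np.random.randn(5, 3)                      # trainable weights W^(0)
H1 = gcn_layer(A, H0, W0)                       # H^(1), shape (4, 3)
```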

2.2 Explaining the formula with a small matrix example

The formula above is expressed in matrix form. Why does it need such a complicated form? Let's start from the simplest matrix calculation:


$$\begin{array}{c} H^{(l+1)}=f\left(H^{(l)}, A\right)=\sigma\left(A H^{(l)} W^{(l)}\right) \\ H^{(0)}=X \in R^{n \times d} \end{array}$$

In the formula above, W is the parameter of the convolution transformation and can be trained and optimized. A is the adjacency matrix: if $A_{ij}$ is nonzero, nodes i and j are neighbors, i.e. there is an edge between them. H is the feature matrix of all nodes, each row being the feature vector of one node, and $H^{(0)}$ is the matrix X. The product of A and H sums the feature vectors of each node's neighbors, as shown below for $A \times H$.


$$\left[\begin{array}{cccc} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{array}\right]\left[\begin{array}{ccccc} 1 & 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 & 3 \\ 4 & 4 & 4 & 4 & 4 \end{array}\right]=\left[\begin{array}{ccccc} 5 & 5 & 5 & 5 & 5 \\ 1 & 1 & 1 & 1 & 1 \\ 5 & 5 & 5 & 5 & 5 \\ 3 & 3 & 3 & 3 & 3 \end{array}\right]$$

$A$ is the adjacency matrix, and $H$ holds 4 nodes, each with a 5-dimensional feature vector. Multiplying A and H directly gives the matrix AH on the right. Multiplying AH by the trainable parameters W and applying the activation function $\sigma$ yields the feature vectors of the four nodes at the next layer.
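This matrix product can be checked with a couple of lines of NumPy; the numbers are exactly those from the example above.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]])
H = np.array([[1] * 5, [2] * 5, [3] * 5, [4] * 5])   # 4 nodes, each with a constant 5-dim feature
print(A @ H)   # each row sums the neighbors' features: rows of 5, 1, 5 and 3, as above
```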

However, there is a problem with the formula above: AH only gathers a node's neighbor information and ignores the node itself. To fix this, we can set the diagonal entries of matrix A to 1, i.e. let every node point to itself. The new convolution formula is as follows:


$$\begin{aligned} H^{(l+1)} &=\sigma\left(\tilde{A} H^{(l)} W^{(l)}\right) \\ \tilde{A} &=A+I_{n} \end{aligned}$$

I is the identity matrix (ones on the diagonal, zeros elsewhere). With this convolution formula the node's own information is taken into account, but a problem remains: the matrix A is not normalized. AH adds up the vectors of all of a node's neighbors, so after several layers of convolution the feature vectors of some nodes become very large. Because the adjacency matrix is not normalized, feature extraction on the graph can go wrong; for example, nodes with many neighbors tend to end up with larger feature values. Symmetric normalization is therefore usually applied when aggregating a node's neighbor features:


$$\begin{array}{l} A=D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \\ A_{i j}=\frac{A_{i j}}{\sqrt{d_{i}} \sqrt{d_{j}}} \end{array}$$
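A quick numerical illustration (not from the original post) of why this normalization matters: repeatedly multiplying by the unnormalized $\tilde{A}$ inflates the feature values, while the symmetrically normalized version keeps them bounded. The toy graph and the number of repetitions are arbitrary assumptions.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)                                       # self-loops
d = A_tilde.sum(axis=1)                                       # degrees of A~
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)     # A~_ij / sqrt(d_i * d_j)

H = np.ones((4, 1))                                           # constant node features
print(np.linalg.matrix_power(A_tilde, 5) @ H)   # unnormalized: values grow quickly with depth
print(np.linalg.matrix_power(A_hat, 5) @ H)     # normalized: values stay bounded
```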

2.3 To sum up:

First, so that every node's own features are included, we replace $A$ with $\tilde{A}=A+I$. Then, because summing the feature vectors of many neighbors makes the aggregated features grow with the number of neighbors, we apply symmetric normalization. This gives the layer propagation formula of GCN:

$$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

Take a two-layer GCN as an example:


$$\begin{array}{c} \hat{A}=\widetilde{D}^{-\frac{1}{2}} \tilde{A} \widetilde{D}^{-\frac{1}{2}} \\ Z=\operatorname{softmax}\left(\hat{A} \operatorname{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right) \end{array}$$

GCN makes predictions from the node feature vectors produced by the last convolution layer: softmax is applied to the output of the final layer, and ReLU is the nonlinear activation function from layer 0 to layer 1.
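For concreteness, here is a minimal NumPy sketch of the two-layer forward pass above; it is only a sketch (no training loop or loss), and the node count, feature dimension, hidden size, and number of classes are arbitrary assumptions.

```python
import numpy as np

def normalize(A):
    """Return A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))   # subtract row max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def two_layer_gcn(A, X, W0, W1):
    """Forward pass Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalize(A)
    H1 = np.maximum(A_hat @ X @ W0, 0)   # layer 0 -> 1 with ReLU
    return softmax(A_hat @ H1 @ W1)      # per-node class probabilities

# Toy sizes: 4 nodes, 5 input features, 8 hidden units, 3 classes (all arbitrary).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 5)
W0, W1 = np.random.randn(5, 8), np.random.randn(8, 3)
Z = two_layer_gcn(A, X, W0, W1)          # shape (4, 3); each row sums to 1
```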

III. How great is GCN

After reading the formulas and training procedure above, GCN may not seem to be anything special. However, looking at the results reported in the GCN paper, it turns out that GCN works remarkably well: the node representations GCN extracts from the raw data cluster together automatically in the embedding space. The reason GCN is so strong is that it takes the neighbor relations of every node in the graph into account, which sets it apart from the traditional CNN, and the elegance of the mathematics behind it is also worth studying. For a deeper look at the derivation of the GCN formula, see Graph Convolutional Network (GCN).

Conclusion

So far we have taken a deep look at the concepts behind GCN. In effect, every node in the graph is constantly changing its state under the influence of its neighbors and of more distant nodes, until a final equilibrium is reached; the closer the relationship, the greater the neighbor's influence. The features of each node are ultimately learned through the interactions between nodes and the training of the W parameter matrices.

At the same time, the number of GCN layers should not be too large. Simply put, if there are too many GCN layers, the embeddings of all nodes become similar, which hurts subsequent node classification and the loss function; various kinds of normalization can of course offset part of this effect. From a thermodynamic point of view, treat each node's embedding as the node's temperature: each GCN convolution is a round of heat exchange between a node and its surrounding nodes. With no external heat source (no extra label information on the nodes), if the graph is connected, repeated convolution will eventually make every node's temperature the same.

In the next article we will discuss implementing GCN with DGL and introduce GAT.