Machine Learning Algorithms and Natural Language Processing
Original columnist of this public account: Don.Hub
Affiliation | JD.com algorithm engineer
School | Imperial College
- A detailed introduction to graph convolutional networks (GCN)
  - What is GCN
    - Summary of GCN
    - The model definition
  - Mathematical deduction
    - Graph Laplacian
  - ref
The field of graph neural networks is still relatively new and has a lot of room for exploration, so I have been wanting to get started with it, and graph convolutional networks are among the most popular models in the field. I found a tutorial, "Complete Guide to Graph Convolutional Network (GCN): Novice Village", and tried to get out of the novice village, but with my limited background I still did not understand it very well after reading it, so I decided to write an introduction myself. Of course, that article is a very good introduction; I refer to it heavily and cite many parts of it, adding the related derivations as a supplement.
This article is divided into two parts: the first part introduces what GCN is, and the second part gives the detailed mathematical derivation.
What is GCN
Summary of GCN
The GCN discussed in this article comes from the paper Semi-Supervised Classification with Graph Convolutional Networks, which is one of the most classic papers in the GCN field.
As we can see from the GCN diagram in the paper, the network takes a graph with $C$ input channels (input features per node) as input and, after passing through the hidden layers in the middle, produces an output with $F$ output channels (output features per node).
A graph convolutional network can apply transformations at two levels:
Note that all graphs in this article refer specifically to undirected and unweighted graphs.
- Graph level: for example, introducing some form of pooling operation (see, e.g., Duvenaud et al., NIPS 2015) changes the structure of the graph. GCN does not perform operations at this level, so the structure of the output graph is the same as that of the input graph.
- Node level: in general, node-level operations do not change the structure of the graph, they only change the features/signals of the nodes. The input is an $N \times C$ feature matrix ($N$: the number of nodes of the input graph, $C$: the feature dimension of the input), and the output is an $N \times F$ matrix ($F$: the feature dimension of the output).
A) A feature description $x_i$ for every node $i$, i.e. the feature representation of each node.
B) The structure of the graph, which can be represented by an adjacency matrix $A$ (or some other matrix derived from it).
We can easily reconstruct a graph from its adjacency matrix, and vice versa. In the example below, the circles represent nodes and the lines represent edges. From the graph we construct the $N \times N$ adjacency matrix $A$, where $A_{ij} = 1$ if node $i$ and node $j$ are connected and $A_{ij} = 0$ otherwise. So we can obtain $A$ from the graph, and by the same token, we can recover the structure of the graph from $A$.
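For concreteness, here is a minimal sketch (not part of the original article) of going from an edge list to the adjacency matrix, using the six-node example graph that appears later in this article; the 0-based node numbering is an assumption for illustration.

```python
import torch

# Edge list of the six-node example graph used later in this article
# (nodes 1..6 in the figures become indices 0..5 in code).
edges = [(0, 1), (0, 4), (1, 2), (1, 4), (2, 3), (3, 4), (3, 5)]

N = 6
A = torch.zeros(N, N)
for i, j in edges:
    A[i, j] = 1  # undirected, unweighted graph,
    A[j, i] = 1  # so the adjacency matrix is symmetric

print(A)  # matches the matrix A used in the code blocks below
```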
Therefore, each hidden layer of the network can be written as the following nonlinear function:

$$H^{(l+1)} = f\left(H^{(l)}, A\right)$$

with the input layer $H^{(0)} = X$ and the output layer $H^{(L)} = Z$, where $L$ is the number of layers. Different GCN models only differ in how the function $f(\cdot, \cdot)$ is chosen and parameterized.
The model definition
The function used in the paper is the following:

$$f\left(H^{(l)}, A\right) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
When I first saw it, I was scared! This function looks very abstract. But let's look at what it does, then walk through the formula step by step, and see why it is built this way.
First, intuitively, the information of each node in the next layer is obtained as a weighted sum of its own information and the information of its adjacent nodes in the previous layer, followed by a linear transformation $W$ and a nonlinear transformation $\sigma$.
Let's break it down step by step. We start by defining a simple function $f$ as the basic network layer, and the simplest layer-wise propagation rule we can adopt is:

$$f\left(H^{(l)}, A\right) = \sigma\left(A H^{(l)} W^{(l)}\right)$$

We directly multiply $A$ and $H^{(l)}$, apply a linear transformation with the weight matrix $W^{(l)}$, and then pass the result through a nonlinear activation function $\sigma$ (ReLU, for example), which finally gives the input of the next layer $H^{(l+1)}$.
We need to pay special attention to the matrix product $A H^{(l)}$. What does it mean? Let's look at the figure below. Suppose each node $i$ has a feature vector $h_i^{(0)}$; what do we get after the matrix multiplication? Taking the input layer $H^{(0)}$ and following the rules of matrix multiplication, we easily get the representation of node 1 in the next layer, $h_1^{(1)} = h_2^{(0)} + h_5^{(0)}$, and it is easy to see that node 2 and node 5 are exactly the neighbors of node 1. Refer to the code below for the concrete calculation.
import torch

A = torch.tensor([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0]
])
H_0 = torch.tensor([
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3],
    [4, 4, 4, 4],
    [5, 5, 5, 5],
    [6, 6, 6, 6]
])
A.matmul(H_0)
>>> tensor([[ 7,  7,  7,  7],
        [ 9,  9,  9,  9],
        [ 6,  6,  6,  6],
        [14, 14, 14, 14],
        [ 7,  7,  7,  7],
        [ 4,  4,  4,  4]])
So we see that multiplying by the adjacency matrix is a fast way to sum up the information of the adjacent nodes and obtain each node's input for the next layer. But is this perfect?
- Problem 1: although we aggregate the information of the surrounding nodes, we lose the information of the node itself (unless it has an edge pointing to itself). The solution adopted is to manually add a self-loop to each node, i.e. $\hat{A} = A + I$, where $I$ is the identity matrix.
- Problem 2: from the results above we can also see that after multiplying by $A$ the output becomes larger, i.e. the scale of the feature vectors changes, and after multiple layers it drifts further and further away from the scale of the input. So can we normalize the adjacency matrix $A$ so that each row sums to 1? Then multiplying by it yields a weighted average instead of a plain sum.
We can divide each row of $A$ by its row sum to obtain the normalized $\hat{A} = D^{-1} A$, where the sum of each row is exactly the degree of the corresponding node, i.e. $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. Let's compute this for the graph above.
import torch

A = torch.tensor([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0]
], dtype=torch.float32)
D = torch.tensor([
    [2, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 3, 0, 0],
    [0, 0, 0, 0, 3, 0],
    [0, 0, 0, 0, 0, 1],
], dtype=torch.float32)
hat_A = D.inverse().matmul(A)
>>> hat_A
tensor([[0.0000, 0.5000, 0.0000, 0.0000, 0.5000, 0.0000],
        [0.3333, 0.0000, 0.3333, 0.0000, 0.3333, 0.0000],
        [0.0000, 0.5000, 0.0000, 0.5000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.3333, 0.0000, 0.3333, 0.3333],
        [0.3333, 0.3333, 0.0000, 0.3333, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000]])
In practice, however, the symmetric form of normalization is used: $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$. This is related to the Laplacian matrix, which we will see in the next section. We can compute it as follows:
D_minus_sqrt = D.inverse().sqrt()
D_minus_sqrt.matmul(A).matmul(D_minus_sqrt)
>>>
tensor([[0.0000, 0.4082, 0.0000, 0.0000, 0.4082, 0.0000],
        [0.4082, 0.0000, 0.4082, 0.0000, 0.3333, 0.0000],
        [0.0000, 0.4082, 0.0000, 0.4082, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.4082, 0.0000, 0.3333, 0.5774],
        [0.4082, 0.3333, 0.0000, 0.3333, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.5774, 0.0000, 0.0000]])
If we combine these two tricks, we get

$$f\left(H^{(l)}, A\right) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\hat{A} = A + I$, $\hat{D}$ is the degree matrix of $\hat{A}$, and $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ is the symmetric normalization of $\hat{A}$.
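To make the final formula concrete, here is a minimal single-layer sketch of this propagation rule in PyTorch. This is not the authors' reference implementation; the class name, the layer sizes, and the choice of ReLU are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One layer of f(H, A) = sigma(D_hat^{-1/2} A_hat D_hat^{-1/2} H W)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # The weight matrix W^{(l)} as a bias-free linear map
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, H, A):
        # Trick 1: add self-loops, A_hat = A + I
        A_hat = A + torch.eye(A.size(0), dtype=A.dtype)
        # Trick 2: symmetric normalization D_hat^{-1/2} A_hat D_hat^{-1/2}
        deg = A_hat.sum(dim=1)                        # degrees of A_hat
        d_inv_sqrt = deg.pow(-0.5)
        A_norm = A_hat * d_inv_sqrt.unsqueeze(1) * d_inv_sqrt.unsqueeze(0)
        return torch.relu(A_norm @ self.linear(H))    # sigma = ReLU here

# Usage on the six-node toy graph from above: 4 input features -> 2 outputs
A = torch.tensor([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0]
], dtype=torch.float32)
H_0 = torch.rand(6, 4)
layer = GCNLayer(4, 2)
H_1 = layer(H_0, A)   # shape (6, 2)
```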
Mathematical deduction
Graph Laplacian
First of all, a graph can be represented in many ways: we can use an adjacency matrix, or we can use an incidence matrix. In the (oriented) incidence matrix, each row represents an edge and each column represents a node; in each row, the node where the edge starts is marked as 1 and the node where it ends is marked as -1. Let's call this matrix $C$. The details are shown in the figure below.
The graph Laplacian is then defined as: $L = C^{T} C$
C = torch.tensor([
    [1, -1, 0, 0],
    [1, 0, -1, 0],
    [1, 0, 0, -1],
    [0, -1, 1, 0],
    [0, 0, 1, -1],
    [0, -1, 0, 1],
])
C.T.matmul(C)
>>> tensor([[ 3, -1, -1, -1],
        [-1,  3, -1, -1],
        [-1, -1,  3, -1],
        [-1, -1, -1,  3]])
Looking at the diagonal values $L_{ii} = \sum_{e} C_{ei}^{2}$: if $C_{ei} = 0$, the product is 0; if $C_{ei} = 1$ or $C_{ei} = -1$, the product is 1. So the diagonal represents the degree of each node. For the off-diagonal values $L_{ij} = \sum_{e} C_{ei} C_{ej}$: if node $i$ and node $j$ are not connected then $L_{ij} = 0$, otherwise $L_{ij} = -1$. So the off-diagonal values are the negative of the adjacency matrix.
So we can derive that $L = C^{T} C = D - W$, as shown below (note that $W$ in the figure denotes the adjacency matrix, i.e. the same $A$ as above). To sum up:

$$L = D - A$$
Refer to the following code for the calculation on the six-node example graph:
C = torch.tensor([
    [-1, 1, 0, 0, 0, 0],   # edge 1-2
    [-1, 0, 0, 0, 1, 0],   # edge 1-5
    [0, -1, 1, 0, 0, 0],   # edge 2-3
    [0, -1, 0, 0, 1, 0],   # edge 2-5
    [0, 0, -1, 1, 0, 0],   # edge 3-4
    [0, 0, 0, -1, 1, 0],   # edge 4-5
    [0, 0, 0, -1, 0, 1],   # edge 4-6
])
C.T.matmul(C)
>>>
tensor([[ 2, -1,  0,  0, -1,  0],
        [-1,  3, -1,  0, -1,  0],
        [ 0, -1,  2, -1,  0,  0],
        [ 0,  0, -1,  3, -1, -1],
        [-1, -1,  0, -1,  3,  0],
        [ 0,  0,  0, -1,  0,  1]])
We need to know the following properties of the Laplacian $L$:

- $L$ is a symmetric matrix
- $L$ has real, non-negative eigenvalues
- $L$ has real, orthogonal eigenvectors, i.e. $U^{T} U = I$

To see this, let the eigenvalues of $L$ be $\lambda_i$ and the eigenvectors be $u_i$:
- For the eigenvalues, taking unit-norm eigenvectors, we have $\lambda_i = u_i^{T} L u_i = u_i^{T} C^{T} C u_i = \|C u_i\|^{2} \geq 0$, so they are non-negative.
- Symmetric normalized Laplacian: $L^{sym} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$. Its diagonal elements are 1 (for nodes with non-zero degree), and its off-diagonal elements are $-\frac{1}{\sqrt{d_i d_j}}$ where node $i$ and node $j$ are connected, and 0 otherwise. A numerical check of these properties is sketched below.
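Here is a small numerical check of these properties on the six-node example graph, reusing the A and D tensors defined in the code above (a sketch added for illustration, not from the original article):

```python
# L = D - A: the Laplacian of the six-node example graph
L = D - A
eigvals, eigvecs = torch.linalg.eigh(L)   # eigh applies because L is symmetric
print(eigvals)                            # all eigenvalues >= 0 (up to float error)
print(eigvecs.T @ eigvecs)                # ~ identity: eigenvectors are orthogonal

# Symmetric normalized Laplacian: L_sym = I - D^{-1/2} A D^{-1/2}
D_inv_sqrt = D.inverse().sqrt()
L_sym = torch.eye(6) - D_inv_sqrt @ A @ D_inv_sqrt
print(L_sym.diagonal())                   # all ones on the diagonal
```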
We also need to know that the convolution of two functions can be obtained through the Fourier transform:

$$f * g = \mathcal{F}^{-1}\left\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\right\}$$

where $\mathcal{F}$ stands for the Fourier transform. On a graph, the Fourier transform corresponds to:

$$\hat{x} = U^{T} x$$

and the inverse graph Fourier transform corresponds to:

$$x = U \hat{x}$$

where $U$ is the eigenvector matrix of the Laplacian, $L = U \Lambda U^{T}$: the eigenvectors of the Laplacian play the role of the Fourier basis, and the eigenvalues play the role of frequencies.
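As a small sketch of this correspondence (again on the six-node example graph; the signal values are arbitrary), we can decompose $L = U \Lambda U^{T}$, move a node signal into the spectral domain with $U^{T} x$, and move it back with $U \hat{x}$:

```python
# Graph Fourier transform on the six-node example graph
L = D - A
lam, U = torch.linalg.eigh(L)                # L = U diag(lam) U^T

x = torch.tensor([1., 2., 3., 4., 5., 6.])   # a signal: one value per node
x_hat = U.T @ x                              # graph Fourier transform
x_rec = U @ x_hat                            # inverse graph Fourier transform
print(torch.allclose(x, x_rec, atol=1e-5))   # True: the transform round-trips
```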
Applying the ordinary convolution formula above to graphs, the corresponding graph convolution formula is:

$$g * x = U\left(U^{T} g \odot U^{T} x\right)$$

Here $g$ acts as a filter on the graph signal $x$. We want its receptive field to be like that of a CNN, i.e. the area near the central node, so we define $g$ as a function of the Laplacian, $g(L)$; applying it once is then equivalent to propagating information from the neighboring nodes once. And because $L = U \Lambda U^{T}$, we can equivalently write $g$ as a function of the Laplacian eigenvalues, $g_{\theta}(\Lambda)$, with parameters $\theta$.
So the graph convolution in the Fourier domain can be expressed as:

$$g_{\theta} * x = U g_{\theta}(\Lambda) U^{T} x$$
We know that to evaluate this we first need the eigendecomposition of the Laplacian matrix $L$, and multiplying with the eigenvector matrix $U$ is expensive for large graphs, so the paper borrows Chebyshev polynomials to approximate the computation:
- $g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_{k} T_{k}(\tilde{\Lambda})$, where $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$. $T_{k}(\tilde{L})$ is a $k$-th order polynomial of the Laplacian, i.e. it depends only on nodes that are at most $K$ steps away from the central node (the $K$-th order neighborhood).
- The Chebyshev polynomials are defined recursively by $T_{k}(x) = 2 x T_{k-1}(x) - T_{k-2}(x)$, with $T_{0}(x) = 1$ and $T_{1}(x) = x$. A small sketch of this recursion on a graph signal follows this list.
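Below is a minimal sketch of this recursion applied to a signal on the six-node example graph; the order K = 3 and the random filter parameters are illustrative assumptions, and the normalized Laplacian is rescaled as in the formula above:

```python
# Chebyshev filtering of a node signal x, up to order K
L_norm = torch.eye(6) - D.inverse().sqrt() @ A @ D.inverse().sqrt()
lam_max = torch.linalg.eigvalsh(L_norm).max()
L_tilde = (2.0 / lam_max) * L_norm - torch.eye(6)   # rescaled Laplacian

x = torch.tensor([1., 2., 3., 4., 5., 6.])
K = 3
theta = torch.rand(K + 1)                      # illustrative filter parameters

Tx = [x, L_tilde @ x]                          # T_0(L~) x = x, T_1(L~) x = L~ x
for k in range(2, K + 1):
    Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])   # T_k = 2 L~ T_{k-1} - T_{k-2}
out = sum(t * tk for t, tk in zip(theta, Tx))  # sum_k theta'_k T_k(L~) x
```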
Let's go back to the original graph convolution calculation:

$$g_{\theta'} * x \approx \sum_{k=0}^{K} \theta'_{k} T_{k}(\tilde{L})\, x$$

where $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$.
The paper propagates information over only one layer of neighbors, so we take $K = 1$, and we further assume $\lambda_{max} \approx 2$, which gives:

$$g_{\theta'} * x \approx \theta'_{0} x + \theta'_{1}\left(L - I_N\right) x = \theta'_{0} x - \theta'_{1} D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x$$

(here $L$ denotes the symmetric normalized Laplacian $I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$).
In practical applications, to prevent overfitting and reduce the amount of computation, we let $\theta = \theta'_{0} = -\theta'_{1}$ and get:

$$g_{\theta} * x \approx \theta \left(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) x$$
We let $\hat{A} = A + I_N$ as well as $\hat{D}_{ii} = \sum_{j} \hat{A}_{ij}$ (the renormalization trick from the paper), and get:

$$Z = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X \Theta$$
Adding the activation function $\sigma$, we get

$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ corresponds to the input $X$ above, and $W^{(l)}$ corresponds to the parameters $\Theta$.
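Finally, a short numerical sketch showing that the renormalized propagation matrix for the six-node example graph is exactly what the first section multiplied $H^{(l)}$ by (again reusing the A tensor from earlier):

```python
# Renormalization trick: D_hat^{-1/2} A_hat D_hat^{-1/2} for the example graph
A_hat = A + torch.eye(6)                       # A_hat = A + I (self-loops)
D_hat = torch.diag(A_hat.sum(dim=1))           # D_hat_ii = sum_j A_hat_ij
D_hat_inv_sqrt = D_hat.inverse().sqrt()
P = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt    # propagation matrix
# One GCN layer is then H' = sigma(P @ H @ W), matching the first section.
print(P)
```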
ref
- en.wikipedia.org/wiki/L
- T. N. Kipf, M. Welling, Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017) [Link, PDF (arXiv), code, blog]
- math.stackexchange.com/