[Preface]: I had watched a lot of explanations and videos about graph networks and graph convolutional networks on and off. Graph networks can no longer be understood from text alone, so let's look at the code. I am now starting with my first graph-network paper and its code, as a formal entry into graph-network research.

  • ‘GRAPH ATTENTION NETWORKS’
  • Article reposted from the WeChat official account "Alchemy of Machine Learning"
  • Notes by: Brother Alchemist
  • Contact: WeChat CYX645016617 (exchanges are welcome; let's make progress together)
  • Paper portal: arxiv.org/pdf/1710.10…

1 Code implementation

  • Github: github.com/Diego999/py…
  • Comments: the GitHub repo is clean and simple. After downloading the CORA dataset, you can just change the path and run it. My code walkthrough here is also based on that repo.

1.1 Experimental Results

Since this is the first GNN paper I have read, I don't know what happened after 2018 (but I expect an explosion of work).

It can be seen that the accuracy reported on CORA is about 0.83, and the result I got by running the official code is close to that (around 0.84; see the experiments in section 1.5).

At least this is a solid study.

1.2 Data Reading

The Cora dataset consists of machine learning papers and has been a popular dataset for graph deep learning in recent years. Each paper is a sample, and each paper is represented by whether or not each vocabulary word appears in it, i.e. a 0/1 vector. The label of a paper is its category, and there are seven categories:

  • Case Based
  • Genetic Algorithms
  • Neural Networks
  • Probabilistic Methods
  • Reinforcement Learning
  • Rule Learning
  • Theory

A paper is a node, so who are the neighbors of this node? The citation relationships. Papers were selected so that every paper cites or is cited by at least one other paper in the final corpus. There are 2708 papers in the corpus.

After stemming and removing word endings, only 1,433 unique words were left. All words with a document frequency of less than 10 were removed.
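
For reference, the two raw files are laid out roughly as follows (this is the layout described in the dataset's README; the lines below are only a format sketch, not actual rows from the files):

# cora.content -- one paper per line:
# <paper_id> <1433 binary word indicators> <class_label>
#
# cora.cites -- one citation per line:
# <ID of cited paper> <ID of citing paper>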

The following function reads the raw text data files and returns the labels, the features of each sample, and the adjacency matrix between samples.

import numpy as np
import scipy.sparse as sp
import torch


def encode_onehot(labels):
    # The classes must be sorted before encoding to enable static class encoding.
    # In other words, make sure the first class always maps to index 0.
    classes = sorted(list(set(labels)))
    classes_dict = {c: np.identity(len(classes))[i, :] for i, c in enumerate(classes)}
    labels_onehot = np.array(list(map(classes_dict.get, labels)), dtype=np.int32)
    return labels_onehot


def load_data(path="./data/cora/", dataset="cora"):
    """Load citation network dataset (cora only for now)"""
    print('Loading {} dataset... '.format(dataset))

    idx_features_labels = np.genfromtxt("{}/{}.content".format(path, dataset), dtype=np.dtype(str))
    features = sp.csr_matrix(idx_features_labels[:, 1: -1], dtype=np.float32)
    labels = encode_onehot(idx_features_labels[:, -1])

    # build graph
    idx = np.array(idx_features_labels[:, 0], dtype=np.int32)
    idx_map = {j: i for i, j in enumerate(idx)}
    edges_unordered = np.genfromtxt("{}/{}.cites".format(path, dataset), dtype=np.int32)
    edges = np.array(list(map(idx_map.get, edges_unordered.flatten())), dtype=np.int32).reshape(edges_unordered.shape)
    adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), shape=(labels.shape[0], labels.shape[0]), dtype=np.float32)

    # build symmetric adjacency matrix
    adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)

    features = normalize_features(features)
    adj = normalize_adj(adj + sp.eye(adj.shape[0]))

    idx_train = range(140)
    idx_val = range(200, 500)
    idx_test = range(500, 1500)

    adj = torch.FloatTensor(np.array(adj.todense()))
    features = torch.FloatTensor(np.array(features.todense()))
    labels = torch.LongTensor(np.where(labels)[1])

    idx_train = torch.LongTensor(idx_train)
    idx_val = torch.LongTensor(idx_val)
    idx_test = torch.LongTensor(idx_test)

    return adj, features, labels, idx_train, idx_val, idx_test


def normalize_adj(mx):
    """Symmetrically normalize sparse matrix (D^-1/2 * A * D^-1/2)"""
    rowsum = np.array(mx.sum(1))
    r_inv_sqrt = np.power(rowsum, -0.5).flatten()
    r_inv_sqrt[np.isinf(r_inv_sqrt)] = 0.
    r_mat_inv_sqrt = sp.diags(r_inv_sqrt)
    return mx.dot(r_mat_inv_sqrt).transpose().dot(r_mat_inv_sqrt)


def normalize_features(mx):
    """Row-normalize sparse matrix"""
    rowsum = np.array(mx.sum(1))
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    mx = r_mat_inv.dot(mx)
    return mx


def accuracy(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)


The key points are:

  1. sp is scipy.sparse, SciPy's sparse-matrix module.
  2. sp.coo_matrix((data, (row, col)), shape=..., dtype=...) builds a sparse matrix: row is the array of row indices, col is the array of column indices, data holds the value placed at each (row, col) position, and shape is the size of the constructed sparse matrix. If this function is not clear, you can look it up yourself. The return value we get is a matrix whose non-zero entries point from the cited paper's ID to the citing paper's ID.
  3. adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)

This line turns the directed citation edges into a bidirectional (symmetric) adjacency matrix; the result is equivalent to taking the element-wise maximum of adj and its transpose. The added term fills in each missing reverse edge, and the subtracted term avoids double counting edges that already exist in both directions or that connect a node to itself (a paper does not cite itself, but in other graph networks nodes may connect to themselves). A small sketch after this list illustrates the construction.
  4. normalize_features simply divides each sample's feature vector by its sum, so that the feature values of each sample sum to 1.
  5. normalize_adj is a similar process, but it normalizes the adjacency matrix over both rows and columns (symmetric normalization).
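
To make items 2 and 3 concrete, here is a minimal sketch on a toy directed graph of 3 nodes (the edge list is made up purely for illustration):

import numpy as np
import scipy.sparse as sp

# two directed edges: 0 -> 1 and 2 -> 1
edges = np.array([[0, 1], [2, 1]])
adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), shape=(3, 3), dtype=np.float32)
print(adj.todense())
# [[0. 1. 0.]
#  [0. 0. 0.]
#  [0. 1. 0.]]

# symmetrize, i.e. element-wise max(adj, adj.T): every edge now goes both ways
adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)
print(adj.todense())
# [[0. 1. 0.]
#  [1. 0. 1.]
#  [0. 1. 0.]]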

1.3 Model

output = model(features, adj)
loss_train = F.nll_loss(output[idx_train], labels[idx_train])

As you can see, the model takes both the features and the adjacency matrix as input, and the output should be the (log) classification probability of each sample. The loss is then obtained through the cross-entropy (negative log-likelihood) calculation.

model = GAT(nfeat=features.shape[1], nhid=8, nclass=int(labels.max()) + 1, dropout=0.6, nheads=8, alpha=0.2)

When the GAT is built, nfeat is the number of features per sample (1433 in this case), nhid is the hidden size per head (discussed later), nclass is the number of classes, nheads is the number of attention heads (discussed later), and alpha=0.2 is the negative slope of the LeakyReLU.
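
For context, a minimal training step built around these two lines could look like the following. This is only a sketch: it assumes load_data and GAT from this post are in scope and the data sits at the default path, and the optimizer type, learning rate and weight decay are my assumptions, not necessarily what the repo's training script uses.

import torch
import torch.nn.functional as F

adj, features, labels, idx_train, idx_val, idx_test = load_data()
model = GAT(nfeat=features.shape[1], nhid=8, nclass=int(labels.max()) + 1, dropout=0.6, nheads=8, alpha=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)  # assumed hyperparameters

model.train()
optimizer.zero_grad()
output = model(features, adj)                                  # (2708, 7) log-probabilities
loss_train = F.nll_loss(output[idx_train], labels[idx_train])  # cross entropy on the training nodes only
loss_train.backward()
optimizer.step()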

class GAT(nn.Module):
    def __init__(self, nfeat, nhid, nclass, dropout, alpha, nheads):
        """Dense version of GAT."""
        super(GAT, self).__init__()
        self.dropout = dropout

        self.attentions = [GraphAttentionLayer(nfeat, nhid, dropout=dropout, alpha=alpha, concat=True) for _ in range(nheads)]
        for i, attention in enumerate(self.attentions):
            self.add_module('attention_{}'.format(i), attention)

        self.out_att = GraphAttentionLayer(nhid * nheads, nclass, dropout=dropout, alpha=alpha, concat=False)

    def forward(self, x, adj):
        x = F.dropout(x, self.dropout, training=self.training)
        x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
        x = F.dropout(x, self.dropout, training=self.training)
        x = F.elu(self.out_att(x, adj))
        return F.log_softmax(x, dim=1)

Above is the PyTorch model class. A few observations:

  • There are nheads attention heads, so self.attentions contains nheads GraphAttentionLayer modules. Finally there is one more GraphAttentionLayer, self.out_att, and together they constitute the entire network.
  • In the forward pass, the features first go through a random dropout, and the dropout rate is quite large (0.6). I don't know whether graph networks are generally like this; let's leave that as an open question.
  • After the dropout, the features pass through the nheads different GraphAttentionLayer modules, and all the results are concatenated;
  • After one more dropout, self.out_att is applied, followed by a log-softmax, and that's it. (A shape-tracing sketch follows below.)
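
To make the tensor shapes concrete, here is a small sketch. It assumes the GAT class above and the GraphAttentionLayer from section 1.4 are already defined; the adjacency matrix is a random toy one, just so the forward pass runs.

import torch

model = GAT(nfeat=1433, nhid=8, nclass=7, dropout=0.6, alpha=0.2, nheads=8)
x = torch.rand(2708, 1433)                              # node features
adj = (torch.rand(2708, 2708) > 0.99).float()           # toy adjacency matrix
adj = ((adj + adj.t()) > 0).float() + torch.eye(2708)   # symmetrize and add self-loops

out = model(x, adj)
print(out.shape)   # torch.Size([2708, 7]) -- log-probabilities per class
# inside forward: each head maps (2708, 1433) -> (2708, 8);
# concatenating 8 heads gives (2708, 64); out_att maps that to (2708, 7)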

Now the key is how the GraphAttentionLayer itself is built.

1.4 GraphAttentionLayer

class GraphAttentionLayer(nn.Module):
    """Simple GAT layer, similar to https://arxiv.org/abs/1710.10903"""
    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(GraphAttentionLayer, self).__init__()
        self.dropout = dropout
        self.in_features = in_features
        self.out_features = out_features
        self.alpha = alpha
        self.concat = concat

        self.W = nn.Parameter(torch.empty(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        self.a = nn.Parameter(torch.empty(size=(2*out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)

        self.leakyrelu = nn.LeakyReLU(self.alpha)

    def forward(self, h, adj):
        Wh = torch.mm(h, self.W) # h.shape: (N, in_features), Wh.shape: (N, out_features)
        e = self._prepare_attentional_mechanism_input(Wh)

        zero_vec = -9e15*torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)
        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)
        h_prime = torch.matmul(attention, Wh)

        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime

    def _prepare_attentional_mechanism_input(self, Wh):
        # Wh.shape (N, out_feature)
        # self.a.shape (2 * out_feature, 1)
        # Wh1&2.shape (N, 1)
        # e.shape (N, N)
        Wh1 = torch.matmul(Wh, self.a[:self.out_features, :])
        Wh2 = torch.matmul(Wh, self.a[self.out_features:, :])
        # broadcast add
        e = Wh1 + Wh2.T
        return self.leakyrelu(e)

    def __repr__(self):
        return self.__class__.__name__ + ' (' + str(self.in_features) + ' -> ' + str(self.out_features) + ')'

In the forward function of the GraphAttentionLayer (GAL), h is the feature matrix, whose shape should be (2708, 1433), and adj is the adjacency matrix of the nodes, with shape (2708, 2708).

  1. First, h is projected with torch.mm to obtain hidden variables Wh, which is similar to a fully connected layer and reduces the 1433 features to 8 features (nhid=8).
  2. e = self._prepare_attentional_mechanism_input(Wh) should be the innovation of this paper. The e returned by this function has shape (2708, 2708).
  3. torch.where was a new function to me. The in-place operation A[A > x] = 1 is not differentiable, so torch.where(condition, B, A) is used instead: wherever the condition holds, A is replaced by B at the corresponding position. So in this code, wherever the adjacency matrix is greater than 0, the value of e we just computed is kept, and everywhere else it is set to -9e15, which vanishes after the softmax. This is the attention: the notion that different neighboring nodes have different importance to a given node. Then dropout is applied, and the attention is multiplied by Wh. Done. (A small sketch after the summary below illustrates the masking and the broadcast add.)

【Summary】First, the 1433 features are compressed to 8 features through a fully connected layer; then an attention weight matrix is obtained through a certain mechanism; then the adjacency matrix selects which of those weights are kept, and the kept weights are multiplied with the 8-dimensional features.
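
A minimal sketch of the two tricks inside the layer, on a toy graph of 3 nodes with out_features=2 (random numbers, just to show the shapes and the masking):

import torch
import torch.nn.functional as F

N, out_features = 3, 2
Wh = torch.rand(N, out_features)          # transformed features, shape (N, out_features)
a = torch.rand(2 * out_features, 1)       # attention vector, shape (2*out_features, 1)

# broadcast add: Wh1 is (N, 1), Wh2.T is (1, N), so e is (N, N)
Wh1 = Wh @ a[:out_features, :]
Wh2 = Wh @ a[out_features:, :]
e = F.leaky_relu(Wh1 + Wh2.T, negative_slope=0.2)

# mask with the adjacency matrix: non-edges get -9e15, i.e. ~0 after softmax
adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])
attention = torch.where(adj > 0, e, -9e15 * torch.ones_like(e))
attention = F.softmax(attention, dim=1)   # each row sums to 1 over that node's neighbors
h_prime = attention @ Wh                  # (N, out_features): weighted average of neighbors' features
print(attention)
print(h_prime.shape)                      # torch.Size([3, 2])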

1.5 A Point of Confusion

This line of code:

# init part
self.attentions = [GraphAttentionLayer(nfeat, nhid, dropout=dropout, alpha=alpha, concat=True) for _ in range(nheads)]
# forward part
x = torch.cat([att(x, adj) for att in self.attentions], dim=1)

Why build eight GraphAttentionLayer modules with exactly the same configuration? My feeling is that putting 8 identically configured convolution-like layers side by side does not really enrich the features.

So I ran experiments with different values of nheads to see whether it has any effect on the results:

nheads test acc
8 0.84
4 0.848
2 0.841
1 0.8450
12 0.8480

The experiments show that the number of heads (nheads) actually has little influence here. Since we're at it, let's also look at the influence of nhid on the results; here nheads is fixed to 1.

nhid test acc
8 0.84
16 0.8500
32 0.8400
64 0.8350
4 0.7940

The experiments show that too small an nhid loses feature information, while too large an nhid tends to overfit, so the value chosen at the beginning (nhid=8) is a good choice.