preface

Convolutional neural networks (CNNs) have achieved great success in image processing: their convolution and pooling structure extracts image features well. In NLP, recurrent neural networks (RNNs) are more widely used, since RNNs and their variants handle context better thanks to their memory. However, CNNs have also achieved excellent results on many NLP tasks, such as semantic analysis, query retrieval, and text classification. This article looks at how to classify text with a CNN.

Model structure

As the figure below shows, the model consists of four parts: the input layer, the convolution layer, the pooling layer, and the full connection layer.

The input layer

The leftmost part of the figure is the input layer. In general, the input layer is the matrix corresponding to the sentence: instead of one-hot vectors, each word is represented by a k-dimensional distributed word vector, so a sentence of length n forms an n × k matrix.

Let x_i ∈ R^k be the k-dimensional word vector of the i-th word in the sentence. A sentence of length n is then represented as x_{1:n} = x_1 ⊕ x_2 ⊕ … ⊕ x_n, where ⊕ denotes concatenation.
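
As a concrete illustration, here is a minimal NumPy sketch (not part of the original code; the vocabulary, sentence, and dimensions are made up) showing how stacking word vectors produces the n × k input matrix:

import numpy as np

k = 5                                               # word vector dimension (assumed)
vocab = {"i": 0, "like": 1, "this": 2, "movie": 3}
embedding = np.random.uniform(-1.0, 1.0, size=(len(vocab), k))  # one k-dim vector per word

sentence = ["i", "like", "this", "movie"]           # n = 4
ids = [vocab[w] for w in sentence]
sentence_matrix = embedding[ids]                    # shape (n, k): the input-layer matrix
print(sentence_matrix.shape)                        # (4, 5)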

In addition, word vectors can be used in two modes: static and non-static. In static mode, the matrix is initialized with word vectors released by a third party or trained separately, and these vectors stay fixed throughout training; back-propagated errors do not change them. In non-static mode, the matrix is initialized the same way, but the word vectors are fine-tuned by back-propagation in each training step, so they are updated throughout training.
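
The distinction can be made concrete in TensorFlow (a minimal sketch; pretrained_vectors stands for a hypothetical array of pre-trained word vectors, and these lines are not part of the code later in this article):

import tensorflow as tf

# static mode: initialize from pre-trained vectors and keep them fixed (not trainable)
embeddings_static = tf.Variable(pretrained_vectors, trainable=False)
# non-static mode: same initialization, but back-propagation fine-tunes the vectors
embeddings_nonstatic = tf.Variable(pretrained_vectors, trainable=True)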

Convolution layer

The second part of the figure is the convolution layer, which extracts features from the sentence. The convolution operation slides an h × k convolution kernel w over the input matrix from top to bottom; each position produces one feature c_i = f(w · x_{i:i+h-1} + b), so the resulting feature map c = [c_1, c_2, …, c_{n-h+1}] has one column and n − h + 1 rows.

In the figure above, the red box on the input layer is the convolution kernel of the convolution operation. It has dimensions 2 × k and produces one element of the feature map at each position. h can also be set to 3, in which case the kernel is 3 × k, as shown in the yellow box. Several kernels with different parameters can be used for the same height, so several feature maps are obtained for each kernel height.

What is the meaning of the convolution operation? It extracts features from adjacent words over windows whose length is determined by h, which corresponds to n-gram features in a language model.
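
The sliding-window computation can be sketched in NumPy as follows (an illustration only; the random inputs, kernel values, ReLU activation, and dimensions are assumptions):

import numpy as np

n, k, h = 7, 5, 2                                   # sentence length, vector dim, window size
sentence_matrix = np.random.randn(n, k)             # a random n x k matrix standing in for the sentence
w = np.random.randn(h, k)                           # one h x k convolution kernel
b = 0.1

# slide the kernel from top to bottom: one feature per window of h adjacent words
feature_map = np.array([
    np.maximum(0.0, np.sum(w * sentence_matrix[i:i + h]) + b)   # ReLU(w . x_{i:i+h-1} + b)
    for i in range(n - h + 1)
])
print(feature_map.shape)                            # (n - h + 1,) = (6,)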

Pooling layer

The third part of the figure is the pooling layer, whose job is to distil the features further and keep only the most important ones. Max-over-time pooling is adopted here: the maximum value of each feature map is taken as its most important feature, i.e. ĉ = max{c}. Each feature map is therefore pooled down to a single value, and these values together form a one-dimensional vector. Taking the maximum also solves the problem of varying sentence lengths: short sentences are padded with zeros, but the padding is ignored once only the maximum is kept.

The convolution layer produces multiple feature maps, one for each convolution kernel; after the pooling layer each of them contributes one value, and together these values form the one-dimensional feature vector passed to the next layer.
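
A small NumPy sketch of this step (illustrative only; the feature maps are random stand-ins for the outputs of two kernels):

import numpy as np

# suppose two feature maps were produced, e.g. by an h=2 kernel and an h=3 kernel
feature_maps = [np.random.randn(6), np.random.randn(5)]

# max-over-time pooling keeps only the largest value of each feature map
z = np.array([fm.max() for fm in feature_maps])
print(z.shape)                                      # (2,): one pooled value per kernel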

The full connection layer

The last part of the figure is the full connection layer, which uses a softmax classifier to obtain the probability of each class. The output of the pooling layer is fully connected to the softmax layer, whose size is the number of categories.

Prevent overfitting

To prevent overfitting, dropout is applied to the penultimate layer during training: some hidden nodes are randomly disabled so that they do not take part in the forward pass. Writing the penultimate layer as z = [ĉ_1, …, ĉ_m], where m is the number of convolution kernels, dropout replaces the output y = w · z + b with y = w · (z ∘ r) + b, where ∘ denotes element-wise multiplication and r is a masking vector of the same size as z whose entries are randomly 0 or 1; the nodes corresponding to 0 are the ones dropped.
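
A minimal NumPy sketch of the mask (illustrative only; the keep probability and the number of kernels m are assumptions):

import numpy as np

m = 6                                            # number of convolution kernels (assumed)
keep_prob = 0.5
z = np.random.randn(m)                           # pooled feature vector, the penultimate layer
r = np.random.binomial(1, keep_prob, size=m)     # mask of random 0/1 values, same size as z
z_dropped = z * r                                # nodes where r == 0 are discarded for this step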

At the same time, L2 regularization can also be used to constrain the weight vector W of the full connection layer.

Main implementation code

Build the graph

First, we build the required placeholders and constants: the input placeholder, the label placeholder, the dropout placeholder, and the L2 regularization loss constant.

train_inputs = tf.placeholder(tf.int32, [None, sequence_length])
train_labels = tf.placeholder(tf.float32, [None, classes_num])
keep_prob = tf.placeholder(tf.float32)
l2_loss = tf.constant(0.0)

We then need an embedding layer to map words into a space of dimension embedding_size; vocabulary_size is the size of the vocabulary, so every word can be mapped into that space. tf.nn.embedding_lookup performs the mapping: its input is 2-dimensional, [batch, sequence_length], and its output is 3-dimensional, [batch, sequence_length, embedding_size], because each word is now represented as an embedding_size vector. Finally, the result is expanded by one dimension so it can be fed to the convolution operation.

with tf.device('/cpu:0'):
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    conv_inputs = tf.expand_dims(embed, -1)

Next we perform convolution and pooling. Since we define several kernel heights and several kernels for each height, we obtain many different feature maps; we then apply max-over-time pooling to each of them and finally collect the pooled features.

features_pooled = []
for filter_height, filter_num in zip(filters_height, filter_num_per_height):
    conv_filter = tf.Variable(tf.truncated_normal([filter_height, embedding_size, 1, filter_num], stddev=0.1))
    conv = tf.nn.conv2d(conv_inputs, conv_filter, strides=[1, 1, 1, 1], padding="VALID")
    bias = tf.Variable(tf.constant(0.1, shape=[filter_num]))
    feature_map = tf.nn.relu(tf.nn.bias_add(conv, bias))
    feature_pooled = tf.nn.max_pool(feature_map, ksize=[1, sequence_length - filter_height + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID')
    features_pooled.append(feature_pooled)

Now only the full connection layer remains. Here we apply dropout to temporarily disable some nodes, and then perform a linear transformation to obtain the scores used for prediction.

filter_num_total = sum(filter_num_per_height)
features_pooled_flat = tf.reshape(tf.concat(features_pooled, 3), [-1, filter_num_total])
features_pooled_flat_drop = tf.nn.dropout(features_pooled_flat, keep_prob)
W = tf.get_variable("W", shape=[filter_num_total, classes_num], , initializer = tf. Contrib. The layers. Xavier_initializer ()) b = tf. Variable (tf) constant (0.1, shape=[classes_num])) scores = tf.nn.xw_plus_b(features_pooled_flat_drop, W, b)Copy the code

Finally, the loss is computed: the cross-entropy loss and the L2 regularization loss are combined into the total loss. The accuracy is also calculated.

l2_loss += tf.nn.l2_loss(W)
l2_loss += tf.nn.l2_loss(b)
losses = tf.nn.softmax_cross_entropy_with_logits(logits=scores, labels=train_labels)
loss = tf.reduce_mean(losses) + l2_lambda * l2_loss
predictions = tf.argmax(scores, 1)
correct_predictions = tf.equal(predictions, tf.argmax(train_labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"))
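
To make the example end to end, here is a minimal sketch of how the graph above might be trained. The optimizer choice, learning rate, batch generator, and number of steps are all assumptions and are not part of the original code:

optimizer = tf.train.AdamOptimizer(1e-3)            # assumed optimizer and learning rate
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):                         # assumed number of steps
        batch_inputs, batch_labels = next_batch()    # hypothetical batch generator
        feed = {train_inputs: batch_inputs,
                train_labels: batch_labels,
                keep_prob: 0.5}                      # dropout enabled during training
        _, batch_loss, batch_acc = sess.run([train_op, loss, accuracy], feed_dict=feed)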

github

The complete code is available on GitHub.

https://github.com/sea-boat/nlp_lab/tree/master/cnn_text_classify

reference

Kim, Yoon. Convolutional Neural Networks for Sentence Classification. EMNLP 2014.
