Abstract: This article explains in detail the classic deep learning text classification algorithms implemented in Keras, including LSTM, BiLSTM, BiLSTM+Attention, CNN, and TextCNN.

This article is shared from the Huawei Cloud community post “Keras Deep Learning Chinese Text Classification Summary (CNN, TextCNN, BiLSTM, Attention)” by Eastmount.

1. Overview of text classification

Text classification aims to automatically classify and label a collection of texts according to a given classification system or standard. It can be traced back to the 1950s, when it was carried out mainly with rules defined by experts. The 1980s saw the emergence of expert systems based on knowledge engineering. Since the 1990s, text classification has relied on machine learning, combining hand-crafted feature engineering with shallow classification models. Nowadays, word vectors and deep neural networks are commonly used.

Teacher Niu Yafeng summarized the traditional text classification process as shown in the figure below. Most machine learning methods have been applied to traditional text classification, mainly including:

  • Naive Bayes

  • KNN

  • SVM

  • Ensemble methods

  • Maximum entropy

  • Neural networks

The basic process of text classification using Keras framework is as follows:

Step 1: Preprocess the text: word segmentation -> stop-word removal -> select the top N most frequent words as feature words

Step 2: Generate an id for each feature word

Step 3: Convert each text into a sequence of ids and pad it on the left

Step 4: Shuffle the training set

Step 5: Use an Embedding layer to convert the word ids into word vectors

Step 6: Add the model and build the neural network structure

Step 7: Train the model

Step 8: Compute the accuracy, recall, and F1 score
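Steps 1 to 3 can be illustrated without any deep learning library. The following is only a minimal sketch in plain Python: the helper names `build_vocab` and `to_padded_ids` and the toy documents are hypothetical, and id 0 is assumed to be reserved for padding. It mirrors the default behavior of the Keras `Tokenizer` and `pad_sequences` used later in this article (unknown words are dropped, padding and truncation happen on the left).

```python
from collections import Counter

def build_vocab(docs, top_n):
    """Map the top_n most frequent words to ids 1..top_n (0 is reserved for padding)."""
    counts = Counter(word for doc in docs for word in doc)
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(top_n))}

def to_padded_ids(doc, vocab, max_len):
    """Convert words to ids, drop words outside the vocabulary, then pad on the left."""
    ids = [vocab[w] for w in doc if w in vocab]
    ids = ids[-max_len:]                      # keep at most the last max_len ids
    return [0] * (max_len - len(ids)) + ids   # fill the left side with zeros

# Toy corpus that is already segmented and stop-word filtered (step 1)
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match", "team"],
    ["stock", "falls"],
]
vocab = build_vocab(docs, top_n=3)                            # step 2
padded = [to_padded_ids(d, vocab, max_len=4) for d in docs]   # step 3
print(vocab)
print(padded)
```

Every document ends up as a fixed-length list of integers, which is the shape of input an Embedding layer expects.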

Note that if TF-IDF is used instead of word vectors to represent documents, the TF-IDF matrix is generated directly after word segmentation and fed to the model.
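To make the TF-IDF alternative concrete, here is a minimal sketch in plain Python. It assumes the simplest weighting, term frequency times log(N / document frequency), and non-empty documents; the function name `tfidf_matrix` and the toy documents are hypothetical. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF weight of word j in document i."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word appears
    df = Counter(w for doc in docs for w in set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        row = [(tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in vocab]
        matrix.append(row)
    return vocab, matrix

# Toy documents, already segmented into words
docs = [["game", "update", "game"], ["match", "score"], ["game", "score"]]
vocab, matrix = tfidf_matrix(docs)
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Each row is one document and each column one word, so the matrix can be passed directly to a classifier without an Embedding layer.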

Deep learning text classification methods include:

  • Convolutional Neural Network

  • Recurrent Neural Network (TextRNN)

  • TextRNN+Attention

  • TextRCNN(TextRNN+CNN)

  • BiLSTM+Attention

  • Transfer learning

Article recommended by Teacher Niu Yafeng:

  • Text classification based on Word2vec and CNN: A review & Practice

2. Data preprocessing and word segmentation

This article focuses on the code; the background is covered in earlier and later articles. The data set is shown in the figure below:

  • Training set: news_dataset_train.csv

Game theme (10,000), sports theme (10,000), cultural theme (10,000), financial theme (10,000)

  • Test set: news_dataset_test.csv

Game theme (5000), Sports theme (5000), Culture theme (5000), Financial theme (5000)

  • Validation set: news_dataset_val.csv

Game theme (5000), Sports theme (5000), Culture theme (5000), Financial theme (5000)

First, the Chinese text is segmented into words as a preprocessing step, using the Jieba library. The code is as follows:

  • data_preprocess.py

    # -*- coding: utf-8 -*-
    # By: Eastmount CSDN 2021-03-19
    import csv
    import jieba

    # Load the stop-word list (one word per line)
    with open('stop_words.txt', encoding='utf-8') as f:
        stop_list = {line.strip() for line in f if line.strip()}

    #-------------------------------------------------------------------
    # Jieba word segmentation with stop-word removal
    def txt_cut(juzi):
        return [w for w in jieba.lcut(juzi) if w not in stop_list]

    #-------------------------------------------------------------------
    # Read a CSV file, segment the Chinese text, and write the result
    def fenci(filename, result):
        labels = []
        contents = []
        # The output is written as gb18030, as in the original script.
        # pandas.read_csv assumes UTF-8 by default, so the file has to be
        # re-saved as UTF-8 before the later scripts can read it.
        with open(filename, "r", encoding="UTF-8") as f, \
             open(result, "w", newline='', encoding='gb18030') as fw:
            writer = csv.writer(fw)
            writer.writerow(['label', 'cutword'])
            reader = csv.DictReader(f)
            for row in reader:
                # Label
                labels.append(row['label'])
                # Segment the content and join the words with spaces
                seglist = txt_cut(row['content'])
                output = ' '.join(seglist)
                contents.append(output)
                # Write one row to the output file
                writer.writerow([row['label'], output])
        print(labels[:5])
        print(contents[:5])

    #-------------------------------------------------------------------
    # Main function
    if __name__ == '__main__':
        fenci("news_dataset_train.csv", "news_dataset_train_fc.csv")
        fenci("news_dataset_test.csv", "news_dataset_test_fc.csv")
        fenci("news_dataset_val.csv", "news_dataset_val_fc.csv")

The running results are as follows:

Next, we take a quick look at the length distribution of the data and visualize the labels.

  • data_show.py

    # -*- coding: utf-8 -*-
    """
    Created on 2021-03-19
    @author: xiuzhang Eastmount CSDN
    """
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    #--------------------------- Step 1: read the data ---------------------------
    # Read the training, validation, and test sets
    train_df = pd.read_csv("news_dataset_train_fc.csv")
    val_df = pd.read_csv("news_dataset_val_fc.csv")
    test_df = pd.read_csv("news_dataset_test_fc.csv")
    print(train_df.head())

    # Solve the Chinese display problem (render the minus sign correctly)
    plt.rcParams['axes.unicode_minus'] = False

    # See what labels the training set has
    plt.figure()
    sns.countplot(x=train_df.label)
    plt.xlabel('Label', size=10)
    plt.xticks(size=10)
    plt.show()

    # Analyze the distribution of the number of words per text in the training set.
    # The segmented CSV only has the columns label and cutword, so the word
    # count is derived here from the space-separated cutword column.
    train_df['cutwordnum'] = train_df.cutword.astype(str).str.split().str.len()
    print(train_df.cutwordnum.describe())
    plt.figure()
    plt.hist(train_df.cutwordnum, bins=100)
    plt.xlabel("phrase length", size=12)
    plt.ylabel("frequency", size=12)
    plt.title("training dataset")
    plt.show()

The output is shown below, and later in the paper we will describe how to draw nice graphs.

Note: if you encounter an error such as “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 17”, you need to save the CSV file in UTF-8 format, as shown in the following figure.
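The re-saving can also be done in code rather than by hand. The following sketch assumes the file was written as gb18030, which is what data_preprocess.py does; the helper name `convert_to_utf8` and the temporary file names are hypothetical. An alternative is to leave the file alone and pass `encoding="gb18030"` to `pd.read_csv`.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-save a text file as UTF-8 so that readers expecting UTF-8 can open it."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Demonstration on a small temporary file written as gb18030
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "demo_gb18030.csv")
dst = os.path.join(tmp_dir, "demo_utf8.csv")
with open(src, "w", encoding="gb18030", newline="") as f:
    f.write("label,cutword\n体育,比赛 精彩\n")

convert_to_utf8(src, dst)
with open(dst, encoding="utf-8") as f:
    text = f.read()
print(text)
```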

3.CNN Chinese text classification

1. Principle introduction

A convolutional neural network (CNN) is a feedforward neural network with a deep structure that performs convolution operations, and it is one of the representative algorithms of deep learning. It is commonly used, with good results, in areas such as image recognition and speech recognition, and it can also be applied to video analysis, machine translation, natural language processing, drug discovery, and other fields. Famously, AlphaGo used convolutional neural networks to let a computer read the Go board.

  • Convolution means processing a small region of the image at a time rather than each pixel individually. This strengthens the continuity of the image, so the network sees a shape instead of a point, and deepens the neural network’s understanding of the image.

  • Typically, a convolutional neural network follows the process “image -> convolution -> pooling -> convolution -> pooling -> the result is passed to two fully connected layers -> classifier”, which yields a CNN classifier.
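The “convolution -> pooling” part of this process can be sketched in a few lines of plain Python. This is an illustration only: the function names `conv2d_valid` and `max_pool2d`, the 5x5 toy image, and the hand-picked edge kernel are all assumptions. In a real network the kernel values are learned and there are many kernels per layer.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(feature_map, size):
    """Keep the largest value in each non-overlapping size x size block."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# A 5x5 image containing a bright vertical stripe
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
kernel = [[-1, 1], [-1, 1]]   # responds to a dark-to-bright vertical edge

features = conv2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool2d(features, 2)         # 2x2 after pooling
print(features)
print(pooled)
```

The kernel looks at a 2x2 region at a time, and pooling keeps only the strongest response in each block, so the output records where the edge is while being smaller than the input.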

2. Code implementation

Keras implements the CNN code for text classification as follows:

  • Keras_CNN_cnews.py

    # -*- coding: utf-8 -*-
    """
    Created on 2021-03-19
    @author: xiuzhang Eastmount CSDN
    CNN Model
    """
    import os
    import time
    import pickle
    import pandas as pd
    import numpy as np
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import seaborn as sns
    import tensorflow as tf
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from keras.models import Model
    from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
    from keras.layers import Convolution1D, MaxPool1D, Flatten
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing import sequence
    from keras.callbacks import EarlyStopping
    from keras.models import load_model
    from keras.models import Sequential

    # GPU settings (TensorFlow 1.x API); comment this part out to run on a CPU.
    # per_process_gpu_memory_fraction=0.8 lets the process use up to 80% of GPU memory.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

    start = time.time()

    #--------------------------- Step 1: read the data ---------------------------
    # Read the training, validation, and test sets
    train_df = pd.read_csv("news_dataset_train_fc.csv")
    val_df = pd.read_csv("news_dataset_val_fc.csv")
    test_df = pd.read_csv("news_dataset_test_fc.csv")
    print(train_df.head())

    # Solve the Chinese display problem (render the minus sign correctly)
    plt.rcParams['axes.unicode_minus'] = False

    #--------------------------- Step 2: OneHotEncoder() encoding ---------------------------
    # Encode the label data of the data sets as integers
    train_y = train_df.label
    val_y = val_df.label
    test_y = test_df.label
    print("Label:")
    print(train_y[:10])

    le = LabelEncoder()
    train_y = le.fit_transform(train_y).reshape(-1, 1)
    val_y = le.transform(val_y).reshape(-1, 1)
    test_y = le.transform(test_y).reshape(-1, 1)
    print("LabelEncoder")
    print(train_y[:10])
    print(len(train_y))

    # One-hot encode the label data
    ohe = OneHotEncoder()
    train_y = ohe.fit_transform(train_y).toarray()
    val_y = ohe.transform(val_y).toarray()
    test_y = ohe.transform(test_y).toarray()
    print("OneHotEncoder:")
    print(train_y[:10])

    #--------------------------- Step 3: encode the words with Tokenizer ---------------------------
    max_words = 6000
    max_len = 600
    tok = Tokenizer(num_words=max_words)  # keep at most the 6000 most frequent words
    print(train_df.cutword[:5])
    print(type(train_df.cutword))

    # Convert everything to str in case the corpus contains purely numeric entries
    train_content = [str(a) for a in train_df.cutword.tolist()]
    val_content = [str(a) for a in val_df.cutword.tolist()]
    test_content = [str(a) for a in test_df.cutword.tolist()]
    # fit_on_texts() builds the vocabulary and assigns an id to each word
    tok.fit_on_texts(train_content)
    print(tok)

    # Save the fitted Tokenizer and load it again
    with open('tok.pickle', 'wb') as handle:  # saving
        pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open('tok.pickle', 'rb') as handle:  # loading
        tok = pickle.load(handle)

    # word_index gives the id of each word; word_counts gives its frequency
    for ii, iterm in enumerate(tok.word_index.items()):
        if ii < 10:
            print(iterm)
        else:
            break
    print("===================")
    for ii, iterm in enumerate(tok.word_counts.items()):
        if ii < 10:
            print(iterm)
        else:
            break

    #--------------------------- Step 4: convert the data into sequences ---------------------------
    # Once each word has an id, every news text can be represented as a vector of ids
    train_seq = tok.texts_to_sequences(train_content)
    val_seq = tok.texts_to_sequences(val_content)
    test_seq = tok.texts_to_sequences(test_content)

    # Adjust each sequence to the same length using sequence.pad_sequences()
    train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
    print(train_seq_mat.shape)
    print(val_seq_mat.shape)
    print(test_seq_mat.shape)

    #--------------------------- Step 5: build the CNN model ---------------------------
    # There are 4 categories
    num_labels = 4
    inputs = Input(name='inputs', shape=[max_len], dtype='float64')
    # Word embedding layer. trainable=False freezes it; no pre-trained weights
    # are passed here, so the vectors keep their random initialization.
    layer = Embedding(max_words+1, 128, input_length=max_len, trainable=False)(inputs)
    # Convolution and pooling layers (word window size 3, 128 filters)
    cnn = Convolution1D(128, 3, padding='same', strides=1, activation='relu')(layer)
    cnn = MaxPool1D(pool_size=4)(cnn)
    # Flatten, then use Dropout to reduce overfitting
    flat = Flatten()(cnn)
    drop = Dropout(0.3)(flat)
    # Fully connected output layer
    main_output = Dense(num_labels, activation='softmax')(drop)
    model = Model(inputs=inputs, outputs=main_output)

    # Loss function, optimizer, and evaluation metric
    model.summary()
    model.compile(loss="categorical_crossentropy",
                  optimizer='adam',  # RMSprop()
                  metrics=["accuracy"])

    #--------------------------- Step 6: training and prediction ---------------------------
    # Run with flag = "train" first, then switch to "test"
    flag = "train"
    if flag == "train":
        print("Model training")
        # Train; stop early when val_loss improves by less than 0.0001
        model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                              validation_data=(val_seq_mat, val_y),
                              callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])
        # Save the model
        model.save('my_model.h5')
        del model  # deletes the existing model
        # Elapsed time
        elapsed = (time.time() - start)
        print("Time used:", elapsed)
        print(model_fit.history)
    else:
        print("Model prediction")
        # Load the trained model
        model = load_model('my_model.h5')
        # Predict on the test set
        test_pre = model.predict(test_seq_mat)
        # Evaluate the predictions with a confusion matrix
        confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
        print(confm)

        # Confusion matrix visualization
        Labname = ["sports", "culture", "finance", "Game"]
        print(metrics.classification_report(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1)))
        plt.figure(figsize=(8, 8))
        sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False,
                    linewidths=0.8,  # the value is garbled in the source; 0.8 is an assumption
                    cmap="YlGnBu")
        plt.xlabel('True label', size=14)
        plt.ylabel('Predicted label', size=14)
        plt.xticks(np.arange(4)+0.5, Labname, size=12)
        plt.yticks(np.arange(4)+0.5, Labname, size=12)
        plt.savefig('result.png')
        plt.show()

        #--------------------------- Step 7: validate the algorithm ---------------------------
        # Re-encode the validation set with tok (use the str-converted texts)
        val_seq = tok.texts_to_sequences(val_content)
        # Adjust each sequence to the same length
        val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
        # Predict on the validation set
        val_pre = model.predict(val_seq_mat)
        print(metrics.classification_report(np.argmax(val_y, axis=1), np.argmax(val_pre, axis=1)))
        # Elapsed time
        elapsed = (time.time() - start)
        print("Time used:", elapsed)
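Step 2 of the script above turns string labels into integer ids and then into one-hot vectors, which is the target format `categorical_crossentropy` expects. The sketch below reproduces that idea in plain Python. The helper names `label_encode` and `one_hot` and the toy labels are hypothetical, and ids are assumed to follow the sorted order of the distinct labels, as scikit-learn's `LabelEncoder` does.

```python
def label_encode(labels):
    """Assign each distinct label an integer id, following sorted label order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[label] for label in labels]

def one_hot(ids, num_classes):
    """Turn each integer id into a vector with a single 1.0 at that position."""
    return [[1.0 if i == k else 0.0 for k in range(num_classes)] for i in ids]

labels = ["sports", "finance", "game", "sports", "culture"]
classes, ids = label_encode(labels)
targets = one_hot(ids, len(classes))
print(classes)
print(ids)
print(targets[0])

# Going back from a one-hot (or softmax) row to a label uses argmax,
# which is what np.argmax(..., axis=1) does in the evaluation code.
decoded = [classes[row.index(max(row))] for row in targets]
print(decoded)
```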

The following figure shows the run in GPU mode. Note that to run on a CPU you only need to comment out the GPU configuration at the top of the code above; the LSTM section later uses GPU-specific library functions.

The training output model is shown in the figure below:

The training output is as follows:

    Train on 40000 samples, validate on 20000 samples
    Epoch 1/10
    40000/40000 [==============================] - 15s 371us/step - loss: 1.1798 - acc: 0.4772 - val_loss: 0.9878 - val_acc: 0.5977
    Epoch 2/10
    40000/40000 [==============================] - 4s 93us/step - loss: 0.8681 - acc: 0.6612 - val_loss: 0.8167 - val_acc: 0.6746
    Epoch 3/10
    40000/40000 [==============================] - 4s 92us/step - loss: 0.7268 - acc: 0.7245 - val_loss: 0.7084 - val_acc: 0.7330
    Epoch 4/10
    40000/40000 [==============================] - 4s 93us/step - loss: 0.6369 - acc: 0.7643 - val_loss: 0.6462 - val_acc: 0.7617
    Epoch 5/10
    40000/40000 [==============================] - 4s 96us/step - loss: 0.5670 - acc: 0.7957 - val_loss: 0.5895 - val_acc: 0.7867
    Epoch 6/10
    40000/40000 [==============================] - 4s 92us/step - loss: 0.5074 - acc: 0.8226 - val_loss: 0.5530 - val_acc: 0.8018
    Epoch 7/10
    40000/40000 [==============================] - 4s 93us/step - loss: 0.4638 - acc: 0.8388 - val_loss: 0.5105 - val_acc: 0.8185
    Epoch 8/10
    40000/40000 [==============================] - 4s 93us/step - loss: 0.4241 - acc: 0.8545 - val_loss: 0.4836 - val_acc: 0.8304
    Epoch 9/10
    40000/40000 [==============================] - 4s 92us/step - loss: 0.3900 - acc: 0.8692 - val_loss: 0.4599 - val_acc: 0.8403
    Epoch 10/10
    40000/40000 [==============================] - 4s 93us/step - loss: 0.3657 - acc: 0.8761 - val_loss: 0.4472 - val_acc: 0.8457
    Time used: 52.203992899999996

The prediction and verification results are as follows:

    [[3928  472  264  336]
     [ 115 4529  121  235]
     [ 151  340 4279  230]
     [ 145  593  195 4067]]

    Test set:
                 precision    recall  f1-score   support
              0       0.91      0.79      0.84      5000
              1       0.76      0.91      0.83      5000
              2       0.88      0.86      0.87      5000
              3       0.84      0.81      0.82      5000
    avg / total       0.85      0.84      0.84     20000

    Validation set:
                 precision    recall  f1-score   support
              0       0.90      0.77      0.83      5000
              1       0.78      0.92      0.84      5000
              2       0.88      0.85      0.86      5000
              3       0.84      0.85      0.85      5000
    avg / total       0.85      0.85      0.85     20000
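The precision, recall, and F1 values in the first report can be recomputed by hand from the confusion matrix printed above it. The sketch below does this in plain Python, assuming rows are true classes and columns are predicted classes, which is the convention of `metrics.confusion_matrix`. The function name `per_class_metrics` is hypothetical.

```python
confm = [
    [3928, 472, 264, 336],
    [115, 4529, 121, 235],
    [151, 340, 4279, 230],
    [145, 593, 195, 4067],
]

def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple per class; rows are true, columns predicted."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]                                    # correctly predicted as class k
        predicted_k = sum(cm[i][k] for i in range(n))    # column sum
        actual_k = sum(cm[k])                            # row sum
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

scores = per_class_metrics(confm)
accuracy = sum(confm[k][k] for k in range(len(confm))) / sum(map(sum, confm))
for k, (p, r, f) in enumerate(scores):
    print(k, round(p, 2), round(r, 2), round(f, 2))
print("accuracy:", round(accuracy, 4))
```

For class 0, for example, 3928 of the 5000 true samples are recovered (recall 0.79), while 3928 of the 4339 samples predicted as class 0 are correct (precision 0.91), which matches the first row of the report.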

4.TextCNN Chinese text classification

1. Principle introduction

TextCNN is an algorithm for text classification using convolutional neural networks. It was proposed by Yoon Kim in the 2014 paper “Convolutional Neural Networks for Sentence Classification”.

The core idea of a convolutional neural network is to capture local features. For text, a local feature is a sliding window composed of several words, similar to an n-gram. The advantage of a convolutional neural network is that it can automatically combine and filter n-gram features to obtain semantic information at different levels of abstraction. The following figure shows the convolutional neural network architecture used for text classification in that paper.

Another classic TextCNN paper is “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”; its model structure is shown in the figure below. The figure describes the TextCNN structure used for text classification and explains in detail how the TextCNN filters are convolved with the word vector matrix.

Suppose we have some sentences to classify. Each word in a sentence is represented by an n-dimensional word vector, so the input matrix has size m × n, where m is the sentence length. The CNN performs convolution on this input. For text data, a filter no longer slides horizontally; it spans the full width of the word vectors and only moves downward, which is similar to extracting local correlations between words with n-grams.

The figure uses three window sizes, namely 2, 3, and 4, with two filters for each size (many more filters are used in actual training). Applying the filters to word windows of different sizes yields 6 convolution vectors. Each vector is then max-pooled, and the pooled values are concatenated to obtain the feature representation of the sentence. This sentence vector is passed to the classifier, which completes the text classification process.
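The convolution and pooling just described can be traced in plain Python. This is a sketch under stated assumptions: the function name `conv_over_words` is hypothetical, the sentence matrix and filter weights are random stand-ins for learned values, biases are omitted, and ReLU is assumed as the activation.

```python
import random

def conv_over_words(sentence, filt):
    """Slide a filter that covers len(filt) consecutive words down the sentence matrix."""
    h = len(filt)
    outputs = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        total = sum(w * x
                    for f_row, s_row in zip(filt, window)
                    for w, x in zip(f_row, s_row))
        outputs.append(max(0.0, total))   # ReLU activation
    return outputs

random.seed(0)
m, n = 7, 5   # sentence length and word vector dimension
sentence = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]

features = []
for h in (2, 3, 4):        # three window sizes
    for _ in range(2):     # two filters per window size
        filt = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(h)]
        feature_map = conv_over_words(sentence, filt)   # length m - h + 1
        features.append(max(feature_map))               # max pooling over positions

print(len(features))   # one pooled value per filter
print([round(v, 3) for v in features])
```

Because each filter is reduced to a single value, the sentence representation has a fixed length of 6 regardless of how many words the sentence contains.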

Finally, I sincerely recommend the following introductions to TextCNN, especially the one by asia-Lee on CSDN. I like his articles very much; they are really great!

  • Convolutional Neural Networks for Sentence Classification (2014)

  • A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification

  • Blog.csdn.net/asialee_bir…

  • zhuanlan.zhihu.com/p/77634533

2. Code implementation

Keras implements the TextCNN code for text classification as follows:

  • Keras_TextCNN_cnews.py

    # -*- coding: utf-8 -*-
    """
    Created on 2021-03-19
    @author: xiuzhang Eastmount CSDN
    TextCNN Model
    """
    import os
    import time
    import pickle
    import pandas as pd
    import numpy as np
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import seaborn as sns
    import tensorflow as tf
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from keras.models import Model
    from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
    from keras.layers import Convolution1D, MaxPool1D, Flatten
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing import sequence
    from keras.callbacks import EarlyStopping
    from keras.models import load_model
    from keras.models import Sequential
    from keras.layers import concatenate

    # GPU settings (TensorFlow 1.x API); comment this part out to run on a CPU.
    # per_process_gpu_memory_fraction=0.8 lets the process use up to 80% of GPU memory.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

    start = time.time()

    #--------------------------- Step 1: read the data ---------------------------
    # Read the training, validation, and test sets
    train_df = pd.read_csv("news_dataset_train_fc.csv")
    val_df = pd.read_csv("news_dataset_val_fc.csv")
    test_df = pd.read_csv("news_dataset_test_fc.csv")

    # Solve the Chinese display problem (render the minus sign correctly)
    plt.rcParams['axes.unicode_minus'] = False

    #--------------------------- Step 2: OneHotEncoder() encoding ---------------------------
    # Encode the label data of the data sets as integers
    train_y = train_df.label
    val_y = val_df.label
    test_y = test_df.label
    print("Label:")
    print(train_y[:10])

    le = LabelEncoder()
    train_y = le.fit_transform(train_y).reshape(-1, 1)
    val_y = le.transform(val_y).reshape(-1, 1)
    test_y = le.transform(test_y).reshape(-1, 1)
    print("LabelEncoder")
    print(train_y[:10])
    print(len(train_y))

    # One-hot encode the label data
    ohe = OneHotEncoder()
    train_y = ohe.fit_transform(train_y).toarray()
    val_y = ohe.transform(val_y).toarray()
    test_y = ohe.transform(test_y).toarray()
    print("OneHotEncoder:")
    print(train_y[:10])

    #--------------------------- Step 3: encode the words with Tokenizer ---------------------------
    max_words = 6000
    max_len = 600
    tok = Tokenizer(num_words=max_words)  # keep at most the 6000 most frequent words
    print(train_df.cutword[:5])
    print(type(train_df.cutword))

    # Convert everything to str in case the corpus contains purely numeric entries
    train_content = [str(a) for a in train_df.cutword.tolist()]
    val_content = [str(a) for a in val_df.cutword.tolist()]
    test_content = [str(a) for a in test_df.cutword.tolist()]
    tok.fit_on_texts(train_content)
    print(tok)

    # Save the fitted Tokenizer and load it again
    with open('tok.pickle', 'wb') as handle:  # saving
        pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open('tok.pickle', 'rb') as handle:  # loading
        tok = pickle.load(handle)

    #--------------------------- Step 4: convert the data into sequences ---------------------------
    train_seq = tok.texts_to_sequences(train_content)
    val_seq = tok.texts_to_sequences(val_content)
    test_seq = tok.texts_to_sequences(test_content)

    # Adjust each sequence to the same length
    train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
    print(train_seq_mat.shape)
    print(val_seq_mat.shape)
    print(test_seq_mat.shape)

    #--------------------------- Step 5: build the TextCNN model ---------------------------
    # There are 4 categories
    num_labels = 4
    inputs = Input(name='inputs', shape=[max_len], dtype='float64')
    # Word embedding layer. trainable=False freezes it; no pre-trained weights
    # are passed here, so the vectors keep their random initialization.
    layer = Embedding(max_words+1, 256, input_length=max_len, trainable=False)(inputs)
    # Three convolution branches with word window sizes 3, 4, and 5
    cnn1 = Convolution1D(256, 3, padding='same', strides=1, activation='relu')(layer)
    cnn1 = MaxPool1D(pool_size=4)(cnn1)
    cnn2 = Convolution1D(256, 4, padding='same', strides=1, activation='relu')(layer)
    cnn2 = MaxPool1D(pool_size=4)(cnn2)
    cnn3 = Convolution1D(256, 5, padding='same', strides=1, activation='relu')(layer)
    cnn3 = MaxPool1D(pool_size=4)(cnn3)

    # Merge the output vectors of the three branches
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    drop = Dropout(0.2)(flat)
    main_output = Dense(num_labels, activation='softmax')(drop)
    model = Model(inputs=inputs, outputs=main_output)
    model.summary()
    model.compile(loss="categorical_crossentropy",
                  optimizer='adam',  # RMSprop()
                  metrics=["accuracy"])

    #--------------------------- Step 6: training and prediction ---------------------------
    # Run with flag = "train" first, then switch to "test"
    flag = "train"
    if flag == "train":
        print("Model training")
        # Train; stop early when val_loss improves by less than 0.0001
        model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                              validation_data=(val_seq_mat, val_y),
                              callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])
        model.save('my_model.h5')
        del model  # deletes the existing model
        elapsed = (time.time() - start)
        print("Time used:", elapsed)
        print(model_fit.history)
    else:
        print("Model prediction")
        # Load the trained model
        model = load_model('my_model.h5')
        # Predict on the test set
        test_pre = model.predict(test_seq_mat)
        # Evaluate the predictions with a confusion matrix
        confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
        print(confm)

        # Confusion matrix visualization
        Labname = ["sports", "culture", "finance", "Game"]
        print(metrics.classification_report(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1)))
        plt.figure(figsize=(8, 8))
        sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False,
                    linewidths=0.8,  # the value is garbled in the source; 0.8 is an assumption
                    cmap="YlGnBu")
        plt.xlabel('True label', size=14)
        plt.ylabel('Predicted label', size=14)
        plt.xticks(np.arange(4)+0.5, Labname, size=12)
        plt.yticks(np.arange(4)+0.5, Labname, size=12)
        plt.savefig('result.png')
        plt.show()

        #--------------------------- Step 7: validate the algorithm ---------------------------
        # Re-encode the validation set with tok (use the str-converted texts)
        val_seq = tok.texts_to_sequences(val_content)
        # Adjust each sequence to the same length
        val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
        # Predict on the validation set
        val_pre = model.predict(val_seq_mat)
        print(metrics.classification_report(np.argmax(val_y, axis=1), np.argmax(val_pre, axis=1)))
        elapsed = (time.time() - start)
        print("Time used:", elapsed)

The training model is as follows:

    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to
    ==================================================================================================
    inputs (InputLayer)             (None, 600)          0
    __________________________________________________________________________________________________
    embedding_1 (Embedding)         (None, 600, 256)     1536256     inputs[0][0]
    __________________________________________________________________________________________________
    conv1d_1 (Conv1D)               (None, 600, 256)     196864      embedding_1[0][0]
    __________________________________________________________________________________________________
    conv1d_2 (Conv1D)               (None, 600, 256)     262400      embedding_1[0][0]
    __________________________________________________________________________________________________
    conv1d_3 (Conv1D)               (None, 600, 256)     327936      embedding_1[0][0]
    __________________________________________________________________________________________________
    max_pooling1d_1 (MaxPooling1D)  (None, 150, 256)     0           conv1d_1[0][0]
    __________________________________________________________________________________________________
    max_pooling1d_2 (MaxPooling1D)  (None, 150, 256)     0           conv1d_2[0][0]
    __________________________________________________________________________________________________
    max_pooling1d_3 (MaxPooling1D)  (None, 150, 256)     0           conv1d_3[0][0]
    __________________________________________________________________________________________________
    concatenate_1 (Concatenate)     (None, 150, 768)     0           max_pooling1d_1[0][0]
                                                                     max_pooling1d_2[0][0]
                                                                     max_pooling1d_3[0][0]
    __________________________________________________________________________________________________
    flatten_1 (Flatten)             (None, 115200)       0           concatenate_1[0][0]
    __________________________________________________________________________________________________
    dropout_1 (Dropout)             (None, 115200)       0           flatten_1[0][0]
    __________________________________________________________________________________________________
    dense_1 (Dense)                 (None, 4)            460804      dropout_1[0][0]
    ==================================================================================================
    Total params: 2,784,260
    Trainable params: 1,248,004
    Non-trainable params: 1,536,256
    __________________________________________________________________________________________________
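The parameter counts in this summary follow directly from the hyperparameters in the script, and recomputing them is a useful check on the layer shapes. The short calculation below is a sketch that assumes each convolution has one bias per filter and that `padding='same'` keeps the sequence length at 600 before pooling; the variable names are illustrative.

```python
max_words, embed_dim, max_len = 6000, 256, 600
filters, windows, pool_size, num_labels = 256, (3, 4, 5), 4, 4

# Embedding: one embed_dim-sized vector for each of the max_words + 1 ids
embedding = (max_words + 1) * embed_dim

# Conv1D: window * input_channels * filters weights, plus one bias per filter
convs = [w * embed_dim * filters + filters for w in windows]

# After pooling, each branch is (max_len / pool_size, filters); three branches are concatenated
flat = (max_len // pool_size) * filters * len(windows)

# Dense: one weight per flattened input per class, plus one bias per class
dense = flat * num_labels + num_labels

trainable = sum(convs) + dense   # the embedding is frozen (trainable=False)
total = trainable + embedding

print(embedding, convs, flat, dense)
print(trainable, total)
```

The frozen embedding accounts for all 1,536,256 non-trainable parameters, and the final Dense layer holds more than a third of the trainable ones because the flattened vector is 115,200 values wide.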

The predicted results are as follows:

    [[4448  238  182  132]
     [ 151 4572  124  153]
     [ 185  176 4545   94]
     [ 181  394  207 4218]]

    Test set:
                 precision    recall  f1-score   support
              0       0.90      0.89      0.89      5000
              1       0.85      0.91      0.88      5000
              2       0.90      0.91      0.90      5000
              3       0.92      0.84      0.88      5000
    avg / total       0.89      0.89      0.89     20000

    Validation set:
                 precision    recall  f1-score   support
              0       0.90      0.88      0.89      5000
              1       0.86      0.93      0.89      5000
              2       0.91      0.89      0.90      5000
              3       0.92      0.88      0.90      5000
    avg / total       0.90      0.90      0.90     20000

5.LSTM Chinese text classification

1. Principle introduction

Long Short-Term Memory (LSTM) networks are a special kind of recurrent neural network (RNN) that can learn long-term dependencies. LSTM was proposed by Hochreiter & Schmidhuber (1997) and later improved and popularized by Alex Graves. LSTM has achieved considerable success on many problems and is widely used.

Because RNNs suffer from the vanishing gradient problem, the hidden structure at sequence position t has been improved: making the hidden structure more complex with a few techniques avoids vanishing gradients. This special kind of RNN is the LSTM. Because of its design, LSTM is well suited to modeling sequential data such as text. The structure of LSTM is shown as follows:

LSTM is deliberately designed to avoid the long-term dependency problem. Remembering information over long periods is its default behavior in practice, not something it must struggle to learn. LSTM is an improvement on the ordinary RNN. Compared with an RNN, it has three additional controllers:

  • Input controller

  • Output controller

  • Forget controller

The structure has an additional main line, which plays the role of the main plot of a movie, while the original RNN system becomes the side plot. All three controllers sit on the side line.

  • Input gate (write gate): a gate at the input that decides whether to write the input into memory. It has trainable parameters that control whether the current point is remembered.

  • Output gate (read gate): a gate at the output that decides whether to read the current memory.

  • Forget gate: the gate of the forget controller, which decides whether to forget the previous memory.

LSTM works as follows. If the content of the side plot is important for the final result, the input controller writes it into the main plot according to its importance, where it is then analyzed. If the side plot changes our previous thinking, the forget controller forgets part of the main plot and replaces it proportionally with the new content, so the update of the main plot depends on the input and forget controls. The final output is based on both the main plot and the side plot. With these three gates the RNN is well controlled, and thanks to these control mechanisms LSTM is a good remedy for the RNN's fading memory, leading to better results.
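The interplay of the three gates can be made concrete with a single LSTM step on scalar values. This is a sketch of the standard LSTM equations, not the Keras implementation: the function name `lstm_step`, the parameter layout, and the hand-set weights are assumptions for illustration, and real layers use weight matrices learned during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. params maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))          # input gate: how much new content to write
    f = sigmoid(pre("forget"))         # forget gate: how much old memory to keep
    o = sigmoid(pre("output"))         # output gate: how much memory to expose
    g = math.tanh(pre("candidate"))    # candidate content from the side line
    c = f * c_prev + i * g             # main line (cell state) update
    h = o * math.tanh(c)               # output (hidden state)
    return h, c

# Case 1: input gate closed, forget gate open, so the memory is carried over unchanged
keep = {"input": (0.0, 0.0, -50.0), "forget": (0.0, 0.0, 50.0),
        "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
h_keep, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=keep)

# Case 2: forget gate closed, input gate open, so the memory is replaced by the new content
replace = {"input": (0.0, 0.0, 50.0), "forget": (0.0, 0.0, -50.0),
           "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
h_new, c_new = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=replace)

print(c_keep, c_new)
```

In the first case the cell state stays at 0.7, and in the second it becomes tanh(1.0), the candidate value. This is the “main content update depends on input and forget control” behavior described above.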

2. Code implementation

The Keras code that implements LSTM text classification is as follows:

  • Keras_LSTM_cnews.py

    """
    Created on 2021-03-19
    @author: xiuzhang Eastmount CSDN
    LSTM Model
    """
    import os
    import time
    import pickle
    import pandas as pd
    import numpy as np
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import seaborn as sns
    import tensorflow as tf
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from keras.models import Model
    from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
    from keras.layers import Convolution1D, MaxPool1D, Flatten
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing import sequence
    from keras.callbacks import EarlyStopping
    from keras.models import load_model
    from keras.models import Sequential
    from keras.layers import CuDNNLSTM, CuDNNGRU

    ## GPU setup: readers running on a CPU can comment out this part
    ## The fraction caps the GPU memory each process may use (0.8 allows 80%)
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    start = time.perf_counter()

    #---------------------------- Step 1: read the data ----------------------------
    ## Read the data sets
    train_df = pd.read_csv("news_dataset_train_fc.csv")
    val_df = pd.read_csv("news_dataset_val_fc.csv")
    test_df = pd.read_csv("news_dataset_test_fc.csv")
    print(train_df.head())

    ## Fix the display of Chinese text and minus signs in plots
    plt.rcParams['axes.unicode_minus'] = False

    #-------------------------- Step 2: OneHotEncoder() encoding --------------------
    ## Encode the label column of each data set
    train_y = train_df.label
    val_y = val_df.label
    test_y = test_df.label
    print("Label:")
    print(train_y[:10])

    le = LabelEncoder()
    train_y = le.fit_transform(train_y).reshape(-1, 1)
    val_y = le.transform(val_y).reshape(-1, 1)
    test_y = le.transform(test_y).reshape(-1, 1)
    print("LabelEncoder")
    print(train_y[:10])
    print(len(train_y))

    ## One-hot encode the integer labels
    ohe = OneHotEncoder()
    train_y = ohe.fit_transform(train_y).toarray()
    val_y = ohe.transform(val_y).toarray()
    test_y = ohe.transform(test_y).toarray()
    print("OneHotEncoder:")
    print(train_y[:10])

    #----------------------- Step 3: encode words with the Tokenizer --------------------
    max_words = 6000
    max_len = 600
    tok = Tokenizer(num_words=max_words)   # keep at most 6000 words
    print(train_df.cutword[:5])
    print(type(train_df.cutword))

    ## Cast to str in case the corpus contains numeric values
    train_content = [str(a) for a in train_df.cutword.tolist()]
    val_content = [str(a) for a in val_df.cutword.tolist()]
    test_content = [str(a) for a in test_df.cutword.tolist()]
    tok.fit_on_texts(train_content)
    print(tok)

    ## Save the fitted Tokenizer and load it back
    with open('tok.pickle', 'wb') as handle:   # saving
        pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open('tok.pickle', 'rb') as handle:   # loading
        tok = pickle.load(handle)

    #--------------------------- Step 4: convert text to sequences -----------------------------
    train_seq = tok.texts_to_sequences(train_content)
    val_seq = tok.texts_to_sequences(val_content)
    test_seq = tok.texts_to_sequences(test_content)

    ## Pad or truncate every sequence to the same length
    train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
    print(train_seq_mat.shape)
    print(val_seq_mat.shape)
    print(test_seq_mat.shape)

    #------------------------------- Step 5: build the LSTM model --------------------------
    ## Define the LSTM model
    inputs = Input(name='inputs', shape=[max_len], dtype='float64')
    ## Embedding(vocabulary size, embedding dimension, number of words per news item)
    layer = Embedding(max_words+1, 128, input_length=max_len)(inputs)
    #layer = LSTM(128)(layer)
    layer = CuDNNLSTM(128)(layer)

    layer = Dense(128, activation="relu", name="FC1")(layer)
    layer = Dropout(0.1)(layer)
    layer = Dense(4, activation="softmax", name="FC2")(layer)
    model = Model(inputs=inputs, outputs=layer)
    model.summary()
    model.compile(loss="categorical_crossentropy",
                  optimizer='adam',   # RMSprop()
                  metrics=["accuracy"])

    #------------------------------- Step 6: training and prediction --------------------------
    ## Set flag to "train" first, then to "test"
    flag = "train"
    if flag == "train":
        ## Train; stop once val_loss no longer improves by at least 0.0001
        model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                              validation_data=(val_seq_mat, val_y),
                              callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])
        model.save('my_model.h5')
        del model
        elapsed = (time.perf_counter() - start)
        print("Time used:", elapsed)
        print(model_fit.history)

    else:
        print("Model prediction")
        ## Load the trained model
        model = load_model('my_model.h5')
        ## Predict on the test set
        test_pre = model.predict(test_seq_mat)
        ## Evaluate the predictions with a confusion matrix
        confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
        print(confm)

        ## Confusion matrix visualization
        Labname = ["sports", "culture", "finance", "game"]
        print(metrics.classification_report(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1)))
        plt.figure(figsize=(8, 8))
        sns.heatmap(confm.T, square=True, annot=True,
                    fmt='d', cbar=False,
                    linewidths=0.5,   # value truncated in the source; 0.5 is an assumed placeholder
                    cmap="YlGnBu")
        plt.xlabel('True label', size=14)
        plt.ylabel('Predicted label', size=14)
        plt.xticks(np.arange(4)+0.8, Labname, size=12)
        plt.yticks(np.arange(4)+0.4, Labname, size=12)
        plt.savefig('result.png')
        plt.show()

        #---------------------------------- Step 7: verify the algorithm --------------------------
        ## Re-encode the validation set with tok and predict with the trained model
        val_seq = tok.texts_to_sequences(val_df.cutword)
        ## Pad or truncate every sequence to the same length
        val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
        ## Predict on the validation set
        val_pre = model.predict(val_seq_mat)
        print(metrics.classification_report(np.argmax(val_y, axis=1), np.argmax(val_pre, axis=1)))
        elapsed = (time.perf_counter() - start)
        print("Time used:", elapsed)

The training output model is as follows:

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    inputs (InputLayer)          (None, 600)               0
    _________________________________________________________________
    embedding_1 (Embedding)      (None, 600, 128)          768128
    _________________________________________________________________
    cu_dnnlstm_1 (CuDNNLSTM)     (None, 128)               132096
    _________________________________________________________________
    FC1 (Dense)                  (None, 128)               16512
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 128)               0
    _________________________________________________________________
    FC2 (Dense)                  (None, 4)                 516
    =================================================================
    Total params: 917252
    Trainable params: 917252
    Non-trainable params: 0
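The parameter counts in this summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below uses hypothetical helper names; the two-bias formula for `CuDNNLSTM` is stated as an assumption that happens to match the 132096 reported above (a single-bias LSTM of the same size would give 131584).

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and
    # (assumed for the cuDNN variant) two bias vectors
    return 4 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

max_words, embed_dim, hidden, num_labels = 6000, 128, 128, 4
counts = {
    "embedding_1": embedding_params(max_words + 1, embed_dim),
    "cu_dnnlstm_1": cudnn_lstm_params(embed_dim, hidden),
    "FC1": dense_params(hidden, 128),
    "FC2": dense_params(128, num_labels),
}
total = sum(counts.values())
for name, n in counts.items():
    print(name, n)
print("Total params:", total)
```

Most of the parameters sit in the embedding table, so the vocabulary size and the embedding dimension dominate the model size.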

The predicted results are as follows:

    [[4539  153  188  120]
     [  47 4628  181  144]
     [ 113  133 4697   57]
     [ 101  292  157 4450]]

                 precision    recall  f1-score   support

              0       0.95      0.91      0.93      5000
              1       0.89      0.93      0.91      5000
              2       0.90      0.94      0.92      5000
              3       0.93      0.89      0.91      5000

    avg / total       0.92      0.92      0.92     20000

                 precision    recall  f1-score   support

              0       0.96      0.89      0.92      5000
              1       0.89      0.94      0.92      5000
              2       0.90      0.93      0.92      5000
              3       0.94      0.92      0.93      5000

    avg / total       0.92      0.92      0.92     20000
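To see how the first report follows from the confusion matrix, the per-class precision, recall and F1 can be recomputed directly. This is a minimal sketch in plain Python; `per_class_metrics` is a hypothetical helper written for this illustration, and it assumes the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above for the LSTM model on the test set
confm = [
    [4539, 153, 188, 120],
    [47, 4628, 181, 144],
    [113, 133, 4697, 57],
    [101, 292, 157, 4450],
]

def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum: true members of class k
        predicted = sum(cm[r][k] for r in range(n))  # column sum: predicted as class k
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

results = per_class_metrics(confm)
for k, (p, r, f1) in enumerate(results):
    print(k, round(p, 2), round(r, 2), round(f1, 2))

accuracy = sum(confm[k][k] for k in range(4)) / sum(map(sum, confm))
print("accuracy:", round(accuracy, 4))
```

The rounded values agree with the first classification report, which confirms that it describes the test set.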

VI. BiLSTM Chinese text classification

1. Principle introduction

BiLSTM stands for Bi-directional Long Short-Term Memory. It combines a forward LSTM with a backward LSTM, and, like LSTM, it is often used to model context in natural language processing tasks. For example, we encode the sentence “I love China” as shown in the model.

Modeling sentences with a plain LSTM has one drawback: it cannot encode information from back to front. In finer-grained classification, such as a five-class task that separates strongly positive, weakly positive, neutral, weakly negative and strongly negative sentiment, the interaction between sentiment words, degree words and negation words needs attention. Take “This restaurant is not as dirty as the one next door”: here “not” modifies the degree of “dirty”. BiLSTM captures such two-way semantic dependencies better.

  • Refer to the article: zhuanlan.zhihu.com/p/47802053
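The forward-plus-backward idea can be sketched in NumPy as follows. To keep the example short, a plain tanh RNN cell stands in for the LSTM cell (an assumption for illustration only), and the helper names `run_rnn` and `bidirectional_encode` are hypothetical. The point is only that the sequence is read in both directions and the two final states are concatenated.

```python
import numpy as np

def run_rnn(xs, W, U):
    """Run a simple tanh RNN over xs (T, D) and return the last hidden state."""
    h = np.zeros(U.shape[0])
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h)
    return h

def bidirectional_encode(xs, fwd_params, bwd_params):
    """Concatenate the final states of a forward and a backward pass."""
    h_fwd = run_rnn(xs, *fwd_params)        # reads token 1 .. T
    h_bwd = run_rnn(xs[::-1], *bwd_params)  # reads token T .. 1
    return np.concatenate([h_fwd, h_bwd])

rng = np.random.default_rng(1)
T, D, H = 6, 3, 4   # toy sizes: 6 tokens, 3-dim embeddings, 4 hidden units
xs = rng.normal(size=(T, D))
fwd = (rng.normal(scale=0.5, size=(H, D)), rng.normal(scale=0.5, size=(H, H)))
bwd = (rng.normal(scale=0.5, size=(H, D)), rng.normal(scale=0.5, size=(H, H)))
encoded = bidirectional_encode(xs, fwd, bwd)
print(encoded.shape)   # twice the hidden size
```

The doubled output size is the same effect that turns 128 units per direction into a 256-dimensional output when a recurrent layer is wrapped in Keras's `Bidirectional`.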

2. Code implementation

The Keras code that implements BiLSTM text classification is as follows:

  • Keras_BiLSTM_cnews.py

    """
    Created on 2021-03-19
    @author: xiuzhang Eastmount CSDN
    BiLSTM Model
    """
    import os
    import time
    import pickle
    import pandas as pd
    import numpy as np
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import seaborn as sns
    import tensorflow as tf
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from keras.models import Model
    from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
    from keras.layers import Convolution1D, MaxPool1D, Flatten
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing import sequence
    from keras.callbacks import EarlyStopping
    from keras.models import load_model
    from keras.models import Sequential
    from keras.layers import CuDNNLSTM, CuDNNGRU
    from keras.layers import Bidirectional

    ## GPU setup: readers running on a CPU can comment out this part
    ## The fraction caps the GPU memory each process may use (0.8 allows 80%)
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    start = time.perf_counter()

    #---------------------------- Step 1: read the data ----------------------------
    ## Read the data sets
    train_df = pd.read_csv("news_dataset_train_fc.csv")
    val_df = pd.read_csv("news_dataset_val_fc.csv")
    test_df = pd.read_csv("news_dataset_test_fc.csv")
    print(train_df.head())

    ## Fix the display of Chinese text and minus signs in plots
    plt.rcParams['axes.unicode_minus'] = False

    #-------------------------- Step 2: OneHotEncoder() encoding --------------------
    ## Encode the label column of each data set
    train_y = train_df.label
    val_y = val_df.label
    test_y = test_df.label
    print("Label:")
    print(train_y[:10])

    le = LabelEncoder()
    train_y = le.fit_transform(train_y).reshape(-1, 1)
    val_y = le.transform(val_y).reshape(-1, 1)
    test_y = le.transform(test_y).reshape(-1, 1)
    print("LabelEncoder")
    print(train_y[:10])
    print(len(train_y))

    ## One-hot encode the integer labels
    ohe = OneHotEncoder()
    train_y = ohe.fit_transform(train_y).toarray()
    val_y = ohe.transform(val_y).toarray()
    test_y = ohe.transform(test_y).toarray()
    print("OneHotEncoder:")
    print(train_y[:10])

    #----------------------- Step 3: encode words with the Tokenizer --------------------
    max_words = 6000
    max_len = 600
    tok = Tokenizer(num_words=max_words)   # keep at most 6000 words
    print(train_df.cutword[:5])
    print(type(train_df.cutword))

    ## Cast to str in case the corpus contains numeric values
    train_content = [str(a) for a in train_df.cutword.tolist()]
    val_content = [str(a) for a in val_df.cutword.tolist()]
    test_content = [str(a) for a in test_df.cutword.tolist()]
    tok.fit_on_texts(train_content)
    print(tok)

    ## Save the fitted Tokenizer and load it back
    with open('tok.pickle', 'wb') as handle:   # saving
        pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open('tok.pickle', 'rb') as handle:   # loading
        tok = pickle.load(handle)

    #--------------------------- Step 4: convert text to sequences -----------------------------
    train_seq = tok.texts_to_sequences(train_content)
    val_seq = tok.texts_to_sequences(val_content)
    test_seq = tok.texts_to_sequences(test_content)

    ## Pad or truncate every sequence to the same length
    train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
    print(train_seq_mat.shape)
    print(val_seq_mat.shape)
    print(test_seq_mat.shape)

    #------------------------------- Step 5: build the BiLSTM model --------------------------
    num_labels = 4
    model = Sequential()
    model.add(Embedding(max_words+1, 128, input_length=max_len))
    model.add(Bidirectional(CuDNNLSTM(128)))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(num_labels, activation='softmax'))
    model.summary()
    model.compile(loss="categorical_crossentropy",
                  optimizer='adam',   # RMSprop()
                  metrics=["accuracy"])

    #------------------------------- Step 6: training and prediction --------------------------
    ## Set flag to "train" first, then to "test"
    flag = "train"
    if flag == "train":
        ## Train; stop once val_loss no longer improves by at least 0.0001
        model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                              validation_data=(val_seq_mat, val_y),
                              callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])
        model.save('my_model.h5')
        del model
        elapsed = (time.perf_counter() - start)
        print("Time used:", elapsed)
        print(model_fit.history)

    else:
        print("Model prediction")
        ## Load the trained model
        model = load_model('my_model.h5')
        ## Predict on the test set
        test_pre = model.predict(test_seq_mat)
        ## Evaluate the predictions with a confusion matrix
        confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
        print(confm)

        ## Confusion matrix visualization
        Labname = ["sports", "culture", "finance", "game"]
        print(metrics.classification_report(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1)))
        plt.figure(figsize=(8, 8))
        sns.heatmap(confm.T, square=True, annot=True,
                    fmt='d', cbar=False,
                    linewidths=0.5,   # value truncated in the source; 0.5 is an assumed placeholder
                    cmap="YlGnBu")
        plt.xlabel('True label', size=14)
        plt.ylabel('Predicted label', size=14)
        plt.xticks(np.arange(4)+0.5, Labname, size=12)
        plt.yticks(np.arange(4)+0.5, Labname, size=12)
        plt.savefig('result.png')
        plt.show()

        #---------------------------------- Step 7: verify the algorithm --------------------------
        ## Re-encode the validation set with tok and predict with the trained model
        val_seq = tok.texts_to_sequences(val_df.cutword)
        ## Pad or truncate every sequence to the same length
        val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
        ## Predict on the validation set
        val_pre = model.predict(val_seq_mat)
        print(metrics.classification_report(np.argmax(val_y, axis=1), np.argmax(val_pre, axis=1)))
        elapsed = (time.perf_counter() - start)
        print("Time used:", elapsed)

The model summary and training log are shown below. Training on the GPU is still very fast.

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding_1 (Embedding)      (None, 600, 128)          768128
    _________________________________________________________________
    bidirectional_1 (Bidirection (None, 256)               264192
    _________________________________________________________________
    dense_1 (Dense)              (None, 128)               32896
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 128)               0
    _________________________________________________________________
    dense_2 (Dense)              (None, 4)                 516
    =================================================================
    Total params: 1065732
    Trainable params: 1065732
    Non-trainable params: 0

    Train on 40000 samples, validate on 20000 samples
    Epoch 1/10
    40000/40000 [==============================] - 23s 587us/step - loss: 0.5825 - acc: 0.8038 - val_loss: 0.2321 - val_acc: 0.9246
    Epoch 2/10
    40000/40000 [==============================] - 21s 521us/step - loss: 0.1433 - acc: 0.9542 - val_loss: 0.2422 - val_acc: 0.9228
    Time used: 52.763230400000005
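The log ends after epoch 2 of 10 because the `EarlyStopping` callback saw `val_loss` rise from 0.2321 to 0.2422. The rule can be sketched in plain Python as below. This is a simplified stand-in written for illustration, not the Keras implementation; `early_stop_epoch` is a hypothetical helper, and a patience of 0 is assumed because the script does not set one.

```python
def early_stop_epoch(val_losses, min_delta=0.0001, patience=0):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the loss drops by more
    than min_delta below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None   # every epoch improved enough, so training ran to the end

# val_loss values taken from the training log above
print(early_stop_epoch([0.2321, 0.2422]))
```

With such an aggressive setting, a single noisy epoch ends training, so a larger patience may be worth trying when the validation loss fluctuates.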

The prediction results are as follows:

    [[4593  143  113  151]
     [  81 4679   60  180]
     [ 110  199 4590  101]
     [  73  254   82 4591]]

                 precision    recall  f1-score   support

              0       0.95      0.92      0.93      5000
              1       0.89      0.94      0.91      5000
              2       0.95      0.92      0.93      5000
              3       0.91      0.92      0.92      5000

    avg / total       0.92      0.92      0.92     20000

                 precision    recall  f1-score   support

              0       0.94      0.90      0.92      5000
              1       0.89      0.95      0.92      5000
              2       0.95      0.90      0.93      5000
              3       0.91      0.94      0.93      5000

    avg / total       0.92      0.92      0.92     20000

VII. BiLSTM+Attention

1. Principle introduction

The attention mechanism solves problems by imitating human attention: put simply, it quickly screens high-value information out of a large amount of information. It mainly addresses the difficulty of obtaining a reasonable final vector representation when the input sequence of an LSTM/RNN model is long. The approach is to retain the intermediate outputs of the LSTM, learn over them with a new model, and associate them with the output, thereby filtering the information.

What is attention?

Let’s briefly describe what the attention mechanism is. Anyone working in NLP will be familiar with it: it shone in the paper “Attention is All You Need”, where it greatly improved the performance of deep models on machine translation and produced the state-of-the-art model of the time. Of course, besides attention, that model also used many other useful tricks to improve performance. But there is no denying that attention is its core.

As the name implies, the attention mechanism is a technique that lets the model focus on important information and learn from it fully. It is not a complete model, but a technique that can be applied to any sequence model.

Why attention?

Why was attention introduced? Take the seq2seq model as an example: for a text sequence, we usually encode the sequence into a fixed-length vector through dimensionality reduction and similar methods, and feed that vector to the following fully connected layer. Typically a CNN or RNN (including GRU or LSTM) encodes the sequence data, and then either some form of pooling is applied or the hidden state at the last time step t is taken directly as the sentence vector.

However, there is a problem here: conventional encoding methods cannot reflect how much attention each word in the sequence deserves. In natural language, different parts of a sentence differ in meaning and importance. Take the sentence “I hate this movie”: for sentiment analysis, the word “hate” clearly deserves more attention. CNNs and RNNs can encode such information to some extent, but their capacity is limited, and for long texts the improvement is small.

  • Reference and recommend articles: zhuanlan.zhihu.com/p/46313756
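The weighting idea can be sketched in NumPy as an attention pooling step over per-token hidden states. This is a minimal sketch: the shapes and the names `W`, `b` and `v` are assumptions chosen for this illustration. It follows the common recipe of scoring each time step, applying a softmax over the time axis, and taking the weighted sum of the hidden states.

```python
import numpy as np

def attention_pool(X, W, b, v):
    """Pool per-token states X (T, hidden) into one vector.

    W: (hidden, A), b: (A,), v: (A,) are the attention parameters.
    Returns the pooled vector (hidden,) and the weights (T,).
    """
    H = np.tanh(X @ W + b)            # (T, A) hidden representation of each token
    scores = H @ v                    # (T,) one relevance score per token
    scores = scores - scores.max()    # stabilize the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ X              # weighted sum over the time axis
    return pooled, weights

rng = np.random.default_rng(2)
T, hidden, A = 5, 8, 3   # toy sizes: 5 tokens, 8-dim states, attention size 3
X = rng.normal(size=(T, hidden))
W = rng.normal(size=(hidden, A))
b = np.zeros(A)
v = rng.normal(size=A)
pooled, weights = attention_pool(X, W, b, v)
print(pooled.shape, weights.round(3))
```

Tokens with higher scores contribute more to the sentence vector, which is how a word such as “hate” can dominate the representation in sentiment analysis. With all-zero parameters the weights become uniform and the result reduces to mean pooling.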

Attention has a wide range of applications, including text, images and speech.

  • Text: applied to seq2seq models, most commonly for translation

  • Image: image feature extraction in convolutional neural networks

  • Speech

The figure below shows a classic BiLSTM+Attention model, which is also the model we build next.

2. Code implementation

The Keras code that implements BiLSTM+Attention text classification is as follows:

  • Keras_Attention_BiLSTM_cnews.py

    """
    Created on 2021-03-19
    @author: xiuzhang Eastmount CSDN
    BiLSTM+Attention Model
    """
    import os
    import time
    import pickle
    import pandas as pd
    import numpy as np
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import seaborn as sns
    import tensorflow as tf
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from keras.models import Model
    from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
    from keras.layers import Convolution1D, MaxPool1D, Flatten
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing import sequence
    from keras.callbacks import EarlyStopping
    from keras.models import load_model
    from keras.models import Sequential
    from keras.layers import CuDNNLSTM, CuDNNGRU
    from keras.layers import Bidirectional

    ## GPU setup: readers running on a CPU can comment out this part
    ## The fraction caps the GPU memory each process may use (0.8 allows 80%)
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    start = time.perf_counter()

    #---------------------------- Step 1: read the data ----------------------------
    ## Read the data sets
    train_df = pd.read_csv("news_dataset_train_fc.csv")
    val_df = pd.read_csv("news_dataset_val_fc.csv")
    test_df = pd.read_csv("news_dataset_test_fc.csv")
    print(train_df.head())

    ## Fix the display of Chinese text and minus signs in plots
    plt.rcParams['axes.unicode_minus'] = False

    #-------------------------- Step 2: OneHotEncoder() encoding --------------------
    ## Encode the label column of each data set
    train_y = train_df.label
    val_y = val_df.label
    test_y = test_df.label
    print("Label:")
    print(train_y[:10])

    le = LabelEncoder()
    train_y = le.fit_transform(train_y).reshape(-1, 1)
    val_y = le.transform(val_y).reshape(-1, 1)
    test_y = le.transform(test_y).reshape(-1, 1)
    print("LabelEncoder")
    print(train_y[:10])
    print(len(train_y))

    ## One-hot encode the integer labels
    ohe = OneHotEncoder()
    train_y = ohe.fit_transform(train_y).toarray()
    val_y = ohe.transform(val_y).toarray()
    test_y = ohe.transform(test_y).toarray()
    print("OneHotEncoder:")
    print(train_y[:10])

    #----------------------- Step 3: encode words with the Tokenizer --------------------
    max_words = 6000
    max_len = 600
    tok = Tokenizer(num_words=max_words)   # keep at most 6000 words
    print(train_df.cutword[:5])
    print(type(train_df.cutword))

    ## Cast to str in case the corpus contains numeric values
    train_content = [str(a) for a in train_df.cutword.tolist()]
    val_content = [str(a) for a in val_df.cutword.tolist()]
    test_content = [str(a) for a in test_df.cutword.tolist()]
    tok.fit_on_texts(train_content)
    print(tok)

    ## Save the fitted Tokenizer and load it back
    with open('tok.pickle', 'wb') as handle:   # saving
        pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open('tok.pickle', 'rb') as handle:   # loading
        tok = pickle.load(handle)

    #--------------------------- Step 4: convert text to sequences -----------------------------
    train_seq = tok.texts_to_sequences(train_content)
    val_seq = tok.texts_to_sequences(val_content)
    test_seq = tok.texts_to_sequences(test_content)

    ## Pad or truncate every sequence to the same length
    train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
    print(train_seq_mat.shape)
    print(val_seq_mat.shape)
    print(test_seq_mat.shape)

    #--------------------------- Step 5: build the Attention mechanism ----------------------
    """
    Keras has no ready-made Attention layer that can be used directly,
    so we need to build a new layer ourselves. The custom layer below defines:
      build: creates the layer weights
      call: the core part, defines how the tensors are computed
      compute_output_shape: defines the output shape of the layer

    Recommended articles:
      Blog.csdn.net/huanghaocs/…
      zhuanlan.zhihu.com/p/29201491
    """
    # Hierarchical Model with Attention
    from keras import initializers
    from keras import constraints
    from keras import activations
    from keras import regularizers
    from keras import backend as K
    from keras.engine.topology import Layer

    K.clear_session()

    class AttentionLayer(Layer):
        def __init__(self, attention_size=None, **kwargs):
            self.attention_size = attention_size
            super(AttentionLayer, self).__init__(**kwargs)

        def get_config(self):
            config = super().get_config()
            config['attention_size'] = self.attention_size
            return config

        def build(self, input_shape):
            assert len(input_shape) == 3

            self.time_steps = input_shape[1]
            hidden_size = input_shape[2]
            if self.attention_size is None:
                self.attention_size = hidden_size

            self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size),
                                    initializer='uniform', trainable=True)
            self.b = self.add_weight(name='att_bias', shape=(self.attention_size,),
                                    initializer='uniform', trainable=True)
            self.V = self.add_weight(name='att_var', shape=(self.attention_size,),
                                    initializer='uniform', trainable=True)
            super(AttentionLayer, self).build(input_shape)

        def call(self, inputs):
            # Use a local tensor so the weight attribute is not overwritten
            V = K.reshape(self.V, (-1, 1))
            H = K.tanh(K.dot(inputs, self.W) + self.b)
            # Softmax over the time axis gives one weight per time step
            score = K.softmax(K.dot(H, V), axis=1)
            # Weighted sum of the inputs over the time axis
            outputs = K.sum(score * inputs, axis=1)
            return outputs

        def compute_output_shape(self, input_shape):
            return input_shape[0], input_shape[2]

    #------------------------------- Step 6: build the BiLSTM model --------------------------
    ## Define the BiLSTM+Attention model
    num_labels = 4
    inputs = Input(name='inputs', shape=[max_len], dtype='float64')
    layer = Embedding(max_words+1, 256, input_length=max_len)(inputs)
    #lstm = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.1, return_sequences=True))(layer)
    bilstm = Bidirectional(CuDNNLSTM(128, return_sequences=True))(layer)   # return_sequences keeps 3 dimensions
    layer = Dense(128, activation='relu')(bilstm)
    layer = Dropout(0.2)(layer)

    ## Attention mechanism
    attention = AttentionLayer(attention_size=50)(layer)
    output = Dense(num_labels, activation='softmax')(attention)
    model = Model(inputs=inputs, outputs=output)
    model.summary()
    model.compile(loss="categorical_crossentropy",
                  optimizer='adam',   # RMSprop()
                  metrics=["accuracy"])

    #------------------------------- Step 7: training and prediction --------------------------
    ## Set flag to "train" first, then to "test"
    flag = "test"
    if flag == "train":
        ## Train; stop once val_loss no longer improves by at least 0.0001
        model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                              validation_data=(val_seq_mat, val_y),
                              callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])
        ## Save the model
        model.save('my_model.h5')
        del model   # delete the in-memory model
        elapsed = (time.perf_counter() - start)
        print("Time used:", elapsed)
        print(model_fit.history)

    else:
        print("Model prediction")
        ## Load the trained model; the custom layer must be registered
        model = load_model('my_model.h5', custom_objects={'AttentionLayer': AttentionLayer})
        ## Predict on the test set
        test_pre = model.predict(test_seq_mat)
        ## Evaluate the predictions with a confusion matrix
        confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
        print(confm)

        ## Confusion matrix visualization
        Labname = ["sports", "culture", "finance", "game"]
        print(metrics.classification_report(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1)))
        plt.figure(figsize=(8, 8))
        sns.heatmap(confm.T, square=True, annot=True,
                    fmt='d', cbar=False,
                    linewidths=0.5,   # value truncated in the source; 0.5 is an assumed placeholder
                    cmap="YlGnBu")
        plt.xlabel('True label', size=14)
        plt.ylabel('Predicted label', size=14)
        plt.xticks(np.arange(4)+0.5, Labname, size=12)
        plt.yticks(np.arange(4)+0.5, Labname, size=12)
        plt.savefig('result.png')
        plt.show()

        #---------------------------------- Step 8: verify the algorithm --------------------------
        ## Re-encode the validation set with tok and predict with the trained model
        val_seq = tok.texts_to_sequences(val_df.cutword)
        ## Pad or truncate every sequence to the same length
        val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
        ## Predict on the validation set
        val_pre = model.predict(val_seq_mat)
        print(metrics.classification_report(np.argmax(val_y, axis=1), np.argmax(val_pre, axis=1)))
        elapsed = (time.perf_counter() - start)
        print("Time used:", elapsed)

The training output model is as follows:

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    inputs (InputLayer)          (None, 600)               0
    _________________________________________________________________
    embedding_1 (Embedding)      (None, 600, 256)          1536256
    _________________________________________________________________
    bidirectional_1 (Bidirection (None, 600, 256)          395264
    _________________________________________________________________
    dense_1 (Dense)              (None, 600, 128)          32896
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 600, 128)          0
    _________________________________________________________________
    attention_layer_1 (Attention (None, 128)               6500
    _________________________________________________________________
    dense_2 (Dense)              (None, 4)                 516
    =================================================================
    Total params: 1971432
    Trainable params: 1971432
    Non-trainable params: 0

The prediction results are as follows:

    [[4625  138  100  137]
     [  63 4692   77  168]
     [ 129  190 4589   92]
     [  82  299   78 4541]]

                 precision    recall  f1-score   support

              0       0.94      0.93      0.93      5000
              1       0.88      0.94      0.91      5000
              2       0.95      0.92      0.93      5000
              3       0.92      0.91      0.91      5000

    avg / total       0.92      0.92      0.92     20000

                 precision    recall  f1-score   support

              0       0.95      0.91      0.93      5000
              1       0.88      0.95      0.91      5000
              2       0.95      0.90      0.92      5000
              3       0.92      0.93      0.93      5000

    avg / total       0.92      0.92      0.92     20000
