Abstract: This article walks through an RNN text classification case implemented with Keras, explains the underlying principles of recurrent neural networks in detail, and compares the results with traditional machine learning methods.
This article is shared by Eastmount from the Huawei Cloud community post "Text Classification Based on Keras+RNN vs. Text Classification Based on Traditional Machine Learning".
RNN text classification
1.RNN
RNN stands for Recurrent Neural Network. The essential idea of an RNN is to use sequential (timing) information. Traditional neural networks assume that all inputs (and outputs) are independent of one another, but for many tasks this is very limiting. For example, to predict the next word of an unfinished sentence, the best approach is to use the context. RNNs are called "recurrent" because they perform the same task on each element of the sequence, and each result depends on the previous computations.
Assume there is a group of data data0, data1, data2 and data3, and the same neural network is used to predict each of them and obtain the corresponding results. If there is a relationship between the data items — such as the order of steps when cooking, or the order of words in an English sentence — how can a neural network learn that relationship? That is exactly what an RNN is for.
For example, if the numbers A, B, C, D are given and the next number E needs to be predicted, the prediction is made according to the preceding order A, B, C, D; this is what we call memory. Before making the prediction, the network reviews the previous memory, adds the new memory of the current step, and finally produces the output. A recurrent neural network (RNN) makes use of exactly this principle.
First, let’s think about how humans analyze correlations or sequences of things. Humans usually remember past events to help us decide what to do later, but could computers do the same?
When analyzing data0, we store the analysis result in memory; then, when analyzing data1, the neural network (NN) generates a new memory, but at this point the new memory is not associated with the old one, as shown in the figure above. In an RNN, we simply call up the old memory while analyzing the new data. If we continue to analyze more data, the NN accumulates all of the previous memories.
The following is a typical RNN structure. Along the time steps t-1, t and t+1, each moment has a different input x. Each computation considers the state of the previous step together with x(t) of the current step, and then outputs the value y. In mathematical form, a state s(t) is produced after every RNN step. When the RNN analyzes x(t+1), the output y(t+1) is produced jointly by s(t) and s(t+1), where s(t) can be regarded as the memory of the previous step. Stacking multiple NN computations in this way yields a recurrent neural network, whose simplified diagram is shown on the left side of the figure below. For example, if a sentence in the sequence has five words, then the unrolled network has five layers, one layer per word.
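To make the recurrence concrete, the core computation of a single RNN step can be sketched in a few lines of NumPy. This is a simplified textbook illustration, not the exact formulation Keras uses later; the weight matrices U, W, V and the tanh/sigmoid choices are generic assumptions.

```python
import numpy as np

# Toy dimensions: 4-dimensional input x(t), 3-dimensional hidden state s(t)
U = np.random.randn(3, 4)   # input  -> hidden weights
W = np.random.randn(3, 3)   # hidden -> hidden weights (the "memory" connection)
V = np.random.randn(1, 3)   # hidden -> output weights

def rnn_step(x_t, s_prev):
    """One RNN time step: the new state mixes the current input with the previous state."""
    s_t = np.tanh(U @ x_t + W @ s_prev)      # s(t) depends on x(t) and s(t-1)
    y_t = 1 / (1 + np.exp(-(V @ s_t)))       # y(t) is read out from s(t)
    return s_t, y_t

s = np.zeros(3)                              # empty memory before the sequence starts
for x in np.random.randn(5, 4):              # a sequence of 5 inputs: data0 ... data4
    s, y = rnn_step(x, s)                    # the state is carried forward at every step
```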
In short, RNN can be used as long as your data is sequential, such as the order of human speech, the order of phone numbers, the order of image pixels, the order of ABC letters, etc. RNN is commonly used in natural language processing, machine translation, speech recognition, image recognition and other fields.
2. Text classification
Text classification aims to automatically assign documents to categories according to a given classification scheme or standard. It can be traced back to the 1950s, when text classification was mainly carried out with rules defined by experts. The 1980s saw the emergence of expert systems based on knowledge engineering. Since the 1990s, text classification has relied on manual feature engineering and shallow classification models with the help of machine learning. Nowadays, word vectors and deep neural networks are used for text classification.
Teacher Niu Yafeng summarized the traditional text classification process as shown in the figure below. In traditional text classification, most machine learning methods have been applied to the text classification field, mainly including:

- Naive Bayes
- KNN
- SVM
- Ensemble methods
- Maximum entropy
- Neural networks
The basic process of text classification with the Keras framework is as follows:

- Step 1: Preprocess the text: word segmentation -> remove stop words -> select the top-N most frequent words as feature words
- Step 2: Generate an id for each feature word
- Step 3: Convert the text to a sequence of ids and pad each sequence on the left (a short Tokenizer sketch follows this list)
- Step 4: Shuffle the training set
- Step 5: Use an Embedding layer to convert words into word vectors
- Step 6: Add layers to the model and construct the neural network structure
- Step 7: Train the model
- Step 8: Compute the precision, recall and F1 value
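Steps 1-3 can be sketched with the Keras Tokenizer as follows. The example sentences and the num_words value are made up for illustration only; the real experiments below use jieba for Chinese segmentation.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ['millet porridge is light and fragrant',
         'the new xiaomi phone is a flagship']     # already-segmented example sentences

tokenizer = Tokenizer(num_words=1000)              # keep the top-N words as feature words
tokenizer.fit_on_texts(texts)                      # build the word -> id mapping
seqs = tokenizer.texts_to_sequences(texts)         # text -> sequence of ids
padded = pad_sequences(seqs, maxlen=10)            # left-pad every sequence to length 10
print(tokenizer.word_index)
print(padded)
```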
Note that if TF-IDF is used instead of word vectors to represent the documents, the TF-IDF matrix is generated directly after word segmentation and fed into the model. In this article, both word vectors and TF-IDF are used in the experiments.
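For the TF-IDF route, a minimal sketch looks like this. The sentences are hypothetical; the experiments below use CountVectorizer followed by TfidfTransformer, which is roughly equivalent to TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents are space-separated tokens (e.g. the output of jieba segmentation)
docs = ['millet porridge is light and fragrant',
        'the new xiaomi phone is a flagship']

tfidf = TfidfVectorizer()
weight = tfidf.fit_transform(docs).toarray()   # document-term TF-IDF matrix
print(weight.shape)                            # (2, number_of_feature_words)
print(weight)                                  # this matrix is fed directly to the classifier
```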
Deep learning text classification methods include:

- Convolutional Neural Network (TextCNN)
- Recurrent Neural Network (TextRNN)
- TextRNN+Attention
- TextRCNN (TextRNN+CNN)
Text classification based on Word2vec and CNN: A Review & Practice
Text classification based on traditional machine learning Bayesian algorithm
1.MultinomialNB+TFIDF text classification
The data set uses teacher Ji Ji Wei's self-defined text, with a total of 21 rows of data belonging to 2 categories (Xiaomi phones, millet porridge). The basic process is as follows:

- Obtain the data set: data and target
- Call the Jieba library to perform Chinese word segmentation
- Compute TF-IDF values and convert the word frequency matrix into a TF-IDF vector matrix
- Call a machine learning algorithm for training and prediction
- Evaluate the experiment and perform visual analysis
The complete code is as follows:
```python
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 22:10:20 2020
@author: Eastmount CSDN
"""
from jieba import lcut

# ---------------------------- Load data and preprocess ----------------------------
data = [
    [0, 'Millet porridge is made with millet as the main ingredient; it has a light taste and fragrance, and is simple to cook and good for digestion'],
    [0, 'When making porridge, the water must be boiled first and then the washed millet is added'],
    [0, 'Protein and amino acids, fat, vitamins, minerals'],
    [0, 'Millet is a traditional health food that can be used on its own for braised rice and porridge'],
    [0, 'Apple is a kind of fruit'],
    [0, 'Porridge has high nutritional value, rich in minerals and vitamins and rich in calcium, which helps metabolize excess salt in the body'],
    [0, 'Eggs have high nutritional value and are a good source of high-quality protein and B vitamins'],
    [0, 'The apples in this supermarket are very fresh'],
    [0, 'Millet is one of the staple foods in the north; many areas have the custom of eating millet porridge for dinner'],
    [0, 'Millet is high in nutritional value, with comprehensive and balanced nutrition, mainly containing carbohydrates'],
    [0, 'Protein and amino acids, fat, vitamins, salt'],
    [1, 'Xiaomi, Samsung and Huawei, the flagships of the three Android phone makers'],
    [1, 'Forget Xiaomi and Huawei! This one is really perfect'],
    [1, 'Apple may be stuck in 2016 again, but this time it cannot raise prices much'],
    [1, 'If Samsung wants to keep holding Huawei down, the A70 is not enough'],
    [1, 'Samsung will hit a new high in screen-to-body ratio'],
    [1, 'Huawei P30 and Samsung A70 sold out and won the best mobile phone marketing award from Suning'],
    [1, 'Lei Jun tells you with one picture: where the gap between Xiaomi and Samsung lies'],
    [1, 'The official Linux version of the Mi Chat APP is online, adapted to the Deepin system'],
    [1, 'Samsung has just updated their own wearable device APP'],
    [1, 'Huawei and Xiaomi crossing over is not terrible'],
]

# Chinese word segmentation: join the tokens with spaces so CountVectorizer can split them
X, Y = [' '.join(lcut(i[1])) for i in data], [i[0] for i in data]
print(X)
print(Y)

# ---------------------------- Compute word frequency and TF-IDF ----------------------------
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Convert the words in the text into a word frequency matrix
vectorizer = CountVectorizer()
X_data = vectorizer.fit_transform(X)
print(X_data)

# Get all the feature words
word = vectorizer.get_feature_names()
print('[Feature words]')
for w in word:
    print(w, end=' ')
print('')
print(X_data.toarray())

# Compute TF-IDF values: tfidf[i][j] is the TF-IDF weight of word j in document i
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_data)
weight = tfidf.toarray()
print(weight)

# ---------------------------- Train and predict ----------------------------
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(weight, Y)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

clf = MultinomialNB().fit(X_train, y_train)
pre = clf.predict(X_test)
print('Predicted labels:', pre)
print('True labels:', y_test)
print(classification_report(y_test, pre))

# ---------------------------- Visual analysis ----------------------------
# Reduce to two dimensions and plot
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
newData = pca.fit_transform(weight)
print(newData)

L1 = [n[0] for n in newData]
L2 = [n[1] for n in newData]
plt.scatter(L1, L2, c=Y, s=200)
plt.show()
```
The following output is displayed:
- Accuracy on the 6 test samples ===> 0.67
The resulting plot is as follows:
2.GaussianNB+Word2Vec text classification
The difference between this method and the previous one is that Word2Vec is used to compute word vectors: each row of the data set is segmented into words, the word vector of each feature word is computed, and the result is converted into a word-vector matrix. For example, with 15 rows of data, 40 feature words per row, and a 20-dimensional vector per feature word, the shape is (15, 40, 20). Also, GaussianNB must replace MultinomialNB because word vectors can be negative.
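The flattening step used later in this section is just a NumPy reshape; a quick sketch with random data of the same shapes:

```python
import numpy as np

X_train = np.random.rand(15, 40, 20)             # 15 samples, 40 tokens, 20-dim word vectors
X_flat = X_train.reshape(len(X_train), 40 * 20)  # one row of 800 features per sample
print(X_flat.shape)                              # (15, 800) -- the 2-D input GaussianNB expects
```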
Recommended reading: Word2Vec installation and similarity calculation of Chinese short texts. The main parameters of gensim's Word2Vec are as follows:

- sentences: the list of token lists to train on; the default is None
- size: dimension of the word vectors; the default is 100
- window: maximum distance between the current word and the predicted word within a sentence; the default is 5
- min_count: minimum word frequency; rarer words are filtered out; the default is 5
- workers: number of training threads; the default is 3
- sg: training algorithm; 0 means CBOW and 1 means skip-gram; the default is 0
- hs: if 1, hierarchical softmax is used; if 0, negative sampling is used; the default is 0
- negative: number of negative samples; the default is 5
- ns_exponent: exponent used to shape the negative-sampling distribution; the default is 0.75
- cbow_mean: if 0, use the sum of the context word vectors; if 1, use their mean; the default is 1
- alpha: initial learning rate; the default is 0.025
- min_alpha: minimum learning rate; the default is 0.0001
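A minimal call that touches the main parameters might look like the sketch below. The corpus here is a made-up token list, and note that in gensim 4.x the `size` argument was renamed `vector_size`.

```python
from gensim.models import Word2Vec

corpus = [['millet', 'porridge', 'light'],
          ['xiaomi', 'phone', 'flagship']]   # lists of tokens (one list per sentence)

model = Word2Vec(corpus,
                 size=20,        # word vector dimension (vector_size in gensim >= 4.0)
                 window=5,       # max distance between current and predicted word
                 min_count=1,    # keep words that appear at least once
                 sg=0,           # 0 = CBOW, 1 = skip-gram
                 workers=3)
print(model.wv['millet'])        # the 20-dimensional vector learned for a word
```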
The complete code is as follows:
```python
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 22:10:20 2020
@author: Eastmount CSDN
"""
from jieba import lcut
from numpy import zeros
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

max_features = 20    # word vector dimension
maxlen = 40          # maximum sequence length

# ---------------------------- Load data and preprocess ----------------------------
data = [
    [0, 'Millet porridge is made with millet as the main ingredient; it has a light taste and fragrance, and is simple to cook and good for digestion'],
    [0, 'When making porridge, the water must be boiled first and then the washed millet is added'],
    [0, 'Protein and amino acids, fat, vitamins, minerals'],
    [0, 'Millet is a traditional health food that can be used on its own for braised rice and porridge'],
    [0, 'Apple is a kind of fruit'],
    [0, 'Porridge has high nutritional value, rich in minerals and vitamins and rich in calcium, which helps metabolize excess salt in the body'],
    [0, 'Eggs have high nutritional value and are a good source of high-quality protein and B vitamins'],
    [0, 'The apples in this supermarket are very fresh'],
    [0, 'Millet is one of the staple foods in the north; many areas have the custom of eating millet porridge for dinner'],
    [0, 'Millet is high in nutritional value, with comprehensive and balanced nutrition, mainly containing carbohydrates'],
    [0, 'Protein and amino acids, fat, vitamins, salt'],
    [1, 'Xiaomi, Samsung and Huawei, the flagships of the three Android phone makers'],
    [1, 'Forget Xiaomi and Huawei! This one is really perfect'],
    [1, 'Apple may be stuck in 2016 again, but this time it cannot raise prices much'],
    [1, 'If Samsung wants to keep holding Huawei down, the A70 is not enough'],
    [1, 'Samsung will hit a new high in screen-to-body ratio'],
    [1, 'Huawei P30 and Samsung A70 sold out and won the best mobile phone marketing award from Suning'],
    [1, 'Lei Jun tells you with one picture: where the gap between Xiaomi and Samsung lies'],
    [1, 'The official Linux version of the Mi Chat APP is online, adapted to the Deepin system'],
    [1, 'Samsung has just updated their own wearable device APP'],
    [1, 'Huawei and Xiaomi crossing over is not terrible'],
]

# Chinese word segmentation
X, Y = [lcut(i[1]) for i in data], [i[0] for i in data]
X_train, X_test, y_train, y_test = train_test_split(X, Y)
#print(X_train)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))
"""['Samsung', 'has', 'just', 'updated', 'their', 'own', 'wearable', 'device', 'APP']"""

# ---------------------------- Word2Vec word vectors ----------------------------
word2vec = Word2Vec(X_train, size=max_features, min_count=1)
print(word2vec)
# Mapping from word to index
w2i = {w: i for i, w in enumerate(word2vec.wv.index2word)}
print('[Vocabulary]')
print(word2vec.wv.index2word)
print(w2i)

# The word vector matrix
vectors = word2vec.wv.vectors
print('[Word vector matrix]')
print(vectors.shape)
print(vectors)

# Look up the vector of a single word (a zero vector for unknown words)
def w2v(w):
    i = w2i.get(w)
    return vectors[i] if i else zeros(max_features)

# Turn token lists into padded sequences of word vectors
def pad(ls_of_words):
    a = [[w2v(i) for i in x] for x in ls_of_words]
    a = pad_sequences(a, maxlen, dtype='float')
    return a

X_train, X_test = pad(X_train), pad(X_test)
print(X_train.shape)   # (15, 40, 20)
print(X_test.shape)    # (6, 40, 20)

# Flatten the 3-D word-vector tensors into 2-D feature matrices:
# (15, 40, 20) => (15, 40*20), (6, 40, 20) => (6, 40*20)
X_train = X_train.reshape(len(y_train), maxlen * max_features)
X_test = X_test.reshape(len(y_test), maxlen * max_features)
print(X_train.shape)
print(X_test.shape)

# ---------------------------- Modeling and training ----------------------------
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

clf = GaussianNB().fit(X_train, y_train)
pre = clf.predict(X_test)
print('Predicted labels:', pre)
print('True labels:', y_test)
print(classification_report(y_test, pre))
```
The following output is displayed:
- Accuracy on the 6 test samples ===> 0.83
Keras implements RNN text classification
1.IMDB data set and sequence preprocessing
(1) IMDB data set. The Keras framework provides several commonly used built-in data sets, for example the MNIST data set for handwritten digit recognition in the image field and the IMDB movie review data set for text classification. These data sets can be loaded with a single line of code:
- (trainX, trainY), (testX, testY) = imdb.load_data(path="imdb.npz", num_words=max_features)
The data set is downloaded from s3.amazonaws.com, but that site is sometimes unavailable, in which case the data has to be downloaded manually and analyzed locally. Keras data set Baidu Cloud link:
- Pan.baidu.com/s/1aZRp0uMk… , extraction code: 3A2U
The author placed the downloaded data in the folder C:\Users\Administrator\.keras\datasets, as shown in the figure below.
The data set comes from the Internet Movie Database (IMDb), an online database of movie actors, movies, TV shows, TV stars and movie productions.
The data and format in the imdb.npz file are as follows:
[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, ...])
list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, ...])
list([1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5974, 54, 61, 369, 13, 71, 149, 14, 22, 112, 4, 2401, 311, 12, 16, 3711, 33, 75, 43, 1829, 296, 4, 86, 320, 35, ...])
...
list([1, 11, 6, 230, 245, 6401, 9, 6, 1225, 446, 2, 45, 2174, 84, 8322, 4007, 21, 4, 912, 84, 14532, 325, 725, 134, 15271, 1715, 84, 5, 36, 28, 57, 1099, 21, 8, 140, ...])
list([1, 1446, 7079, 69, 72, 3305, 13, 610, 930, 8, 12, 582, 23, 5, 16, 484, 685, 54, 349, 11, 4120, 2959, 45, 58, 1466, 13, 197, 12, 16, 43, 23, 2, 5, 62, 30, 145, ...])
list([1, 17, 6, 194, 337, 7, 4, 204, 22, 45, 254, 8, 106, 14, 123, 4, 12815, 270, 14437, 5, 16923, 12255, 732, 2098, 101, 405, 39, 14, 1034, 4, 1310, 9, 115, 50, 305, ...])] train sequences
Each list is a sentence in which each number represents the number of the word. So, how do I get the word that corresponds to the number? In this case, you need to use the imdb_word_index.json file in the following format:
{"fawn": 34701, "tsukino": 52006,... , "paget": 18509, "expands": 20597}Copy the code
A total of 88,584 words are stored in key-value format, where the key is a word and the value is the word's index. The higher the word frequency (the number of occurrences of the word in the corpus), the smaller the index. For example, "the": 1 has the highest frequency and therefore index 1.
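A small sketch of mapping the ids in a review back to words. Note that the Keras loader reserves indices 0-2 for padding/start/unknown markers, so the ids stored in the reviews are offset by 3 from the indices in imdb_word_index.json; this is the usual convention of the built-in loader.

```python
from keras.datasets import imdb

word_index = imdb.get_word_index()                      # word -> index, as in imdb_word_index.json
id_to_word = {i + 3: w for w, i in word_index.items()}  # ids in the reviews are offset by 3
id_to_word[0], id_to_word[1], id_to_word[2] = '<pad>', '<start>', '<unk>'

(trainX, trainY), _ = imdb.load_data(num_words=20000)
print(' '.join(id_to_word.get(i, '<unk>') for i in trainX[0]))  # the first review as text
```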
(2) Sequence preprocessing. When converting text to vectors for deep learning, pad_sequences() is usually used to pad the sequences. Its signature is as follows:
```python
keras.preprocessing.sequence.pad_sequences(
    sequences,
    maxlen=None,
    dtype='int32',
    padding='pre',
    truncating='pre',
    value=0.0
)
```
The meanings of the parameters are as follows:
- sequences: a nested list (list of lists) of integers or floats
- maxlen: None or an integer, the maximum sequence length; longer sequences are truncated and shorter ones are padded with zeros
- dtype: the data type of the returned NumPy array
- padding: 'pre' or 'post'; determines whether zeros are added at the beginning or at the end of each sequence
- truncating: 'pre' or 'post'; determines whether sequences are truncated from the beginning or from the end
- value: a float that replaces the default padding value of 0
- The return value is a 2-dimensional tensor of length maxlen
The basic usage is as follows:
```python
from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[1, 2, 3], [1]], maxlen=2))
"""[[2 3]
    [0 1]]"""
print(pad_sequences([[1, 2, 3], [1]], maxlen=3, value=9))
"""[[1 2 3]
    [9 9 1]]"""
print(pad_sequences([[2, 3, 4]], maxlen=10))
"""[[0 0 0 0 0 0 0 2 3 4]]"""
print(pad_sequences([[1, 2, 3, 4, 5], [6, 7]], maxlen=10))
"""[[0 0 0 0 0 1 2 3 4 5]
    [0 0 0 0 0 0 0 0 6 7]]"""
print(pad_sequences([[1, 2, 3], [1]], maxlen=2, padding='post'))
"""Pad at the end:
   [[2 3]
    [1 0]]"""
print(pad_sequences([[1, 2, 3], [1]], maxlen=4, truncating='post'))
"""Pad at the start:
   [[0 1 2 3]
    [0 0 0 1]]"""
```
In natural language processing it is usually used together with a tokenizer:
```python
>>> tokenizer.texts_to_sequences(["下雨我加班"])
[[4, 5, 6, 7]]
>>> keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(["下雨我加班"]), maxlen=20)
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 5, 6, 7]], dtype=int32)
```
2. Word embedding model training
Next we train a model with word embeddings. The specific process includes:

- Import the IMDB data set
- Convert the data set to sequences
- Create an Embedding model to embed the words
- Train the neural network
The complete code is as follows:
```python
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 17:08:28 2020
@author: Eastmount CSDN
"""
from keras.datasets import imdb        # movie review database
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

# ---------------------------- Define parameters ----------------------------
max_features = 20000       # vocabulary size kept from the data set
input_dim = max_features   # vocabulary size of the Embedding layer, must be >= max_features
maxlen = 80                # maximum sequence length
batch_size = 128           # batch size (value assumed to match the RNN example below)
output_dim = 40            # word vector dimension
epochs = 2                 # training epochs

# ---------------------------- Load data and preprocess ----------------------------
(trainX, trainY), (testX, testY) = imdb.load_data(path="imdb.npz", num_words=max_features)
print(trainX.shape, trainY.shape)   # (25000,) (25000,)
print(testX.shape, testY.shape)     # (25000,) (25000,)

# Pad the sequences to the same length
trainX = sequence.pad_sequences(trainX, maxlen=maxlen)
testX = sequence.pad_sequences(testX, maxlen=maxlen)
print('trainX shape:', trainX.shape)
print('testX shape:', testX.shape)

# ---------------------------- Create the model ----------------------------
model = Sequential()
# Embedding layer: word id => word vector
model.add(Embedding(input_dim, output_dim, input_length=maxlen))
# Flatten the (maxlen, output_dim) output to maxlen*output_dim features
model.add(Flatten())
# Output layer: binary classification
model.add(Dense(units=1, activation='sigmoid'))

# Compile with the RMSprop optimizer and binary cross-entropy loss
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
# Train the model
model.fit(trainX, trainY, batch_size=batch_size, epochs=epochs)
# Model visualization
model.summary()
```
The following output is displayed:
```
(25000,) (25000,)
(25000,) (25000,)
trainX shape: (25000, 80)
testX shape: (25000, 80)
Epoch 1/2
25000/25000 [==============================] - 2s 98us/step - loss: 0.6111 - acc: 0.6956
Epoch 2/2
25000/25000 [==============================] - 2s 69us/step - loss: 0.3578 - acc: 0.8549
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 80, 40)            800000
_________________________________________________________________
flatten_2 (Flatten)          (None, 3200)              0
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 3201
=================================================================
Total params: 803201
Trainable params: 803201
Non-trainable params: 0
_________________________________________________________________
```
The display matrix is as follows:
3.RNN text classification
The complete code of RNN for text classification of the IMDB movie dataset is shown below:
```python
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 17:08:28 2020
@author: Eastmount CSDN
"""
from keras.datasets import imdb        # movie review database
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import SimpleRNN

# ---------------------------- Define parameters ----------------------------
max_features = 20000       # vocabulary size kept from the data set
input_dim = max_features   # vocabulary size of the Embedding layer, must be >= max_features
maxlen = 40                # maximum sequence length
batch_size = 128           # batch size
output_dim = 40            # word vector dimension
epochs = 3                 # training epochs
units = 32                 # number of RNN neurons

# ---------------------------- Load data and preprocess ----------------------------
(trainX, trainY), (testX, testY) = imdb.load_data(path="imdb.npz", num_words=max_features)
print(trainX.shape, trainY.shape)   # (25000,) (25000,)
print(testX.shape, testY.shape)     # (25000,) (25000,)

# Pad the sequences to the same length
trainX = sequence.pad_sequences(trainX, maxlen=maxlen)
testX = sequence.pad_sequences(testX, maxlen=maxlen)
print('trainX shape:', trainX.shape)
print('testX shape:', testX.shape)

# ---------------------------- Create the RNN model ----------------------------
model = Sequential()
# Embedding layer
model.add(Embedding(input_dim, output_dim, input_length=maxlen))
# RNN cells: the first layer returns the whole sequence, the second only the final state
model.add(SimpleRNN(units, return_sequences=True))
model.add(SimpleRNN(units, return_sequences=False))
# Output layer: binary classification
model.add(Dense(units=1, activation='sigmoid'))
# Model visualization
model.summary()

# ---------------------------- Compile and train ----------------------------
model.compile(optimizer='rmsprop',            # RMSprop optimizer
              loss='binary_crossentropy',     # binary cross-entropy loss
              metrics=['accuracy'])
history = model.fit(trainX, trainY,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=2,
                    validation_split=0.1)     # hold out 10% of the samples for validation

# ---------------------------- Prediction and visualization ----------------------------
import matplotlib.pyplot as plt
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
plt.plot(range(epochs), accuracy)
plt.plot(range(epochs), val_accuracy)
plt.show()
```
The output is shown below; three epochs were trained.
- Accuracy on the training data ===> 0.9075
- Val_accuracy on the validation data ===> 0.7844
The per-epoch training process can also be plotted, as shown in the figure below.
```
(25000,) (25000,)
(25000,) (25000,)
trainX shape: (25000, 40)
testX shape: (25000, 40)
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 40, 40)            800000
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 40, 32)            2336
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 32)                2080
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33
=================================================================
Total params: 804449
Trainable params: 804449
Non-trainable params: 0
_________________________________________________________________
Train on 22500 samples, validate on 2500 samples
Epoch 1/3
 - 11s - loss: 0.5741 - accuracy: 0.6735 - val_loss: 0.4462 - val_accuracy: 0.7876
Epoch 2/3
 - 14s - loss: 0.3572 - accuracy: 0.8430 - val_loss: 0.4928 - val_accuracy: ...
Epoch 3/3
 - 12s - loss: 0.2329 - accuracy: 0.9075 - val_loss: 0.5050 - val_accuracy: 0.7844
```
The curves of accuracy and VAL_accuracy drawn are as follows:
- loss: 0.2329 - accuracy: 0.9075 - val_loss: 0.5050 - val_accuracy: 0.7844
4.RNN implements text classification of Chinese data sets
1.RNN+Word2Vector text classification
The first step is to import the text data set and convert it to a word vector.
```python
# (Fragment of the complete script given later in this section; imports and parameters are omitted here)
# ---------------------------- Load data and preprocess ----------------------------
data = [
    [0, 'Millet porridge is made with millet as the main ingredient; it has a light taste and fragrance, and is simple to cook and good for digestion'],
    [0, 'When making porridge, the water must be boiled first and then the washed millet is added'],
    [0, 'Protein and amino acids, fat, vitamins, minerals'],
    [0, 'Millet is a traditional health food that can be used on its own for braised rice and porridge'],
    [0, 'Apple is a kind of fruit'],
    [0, 'Porridge has high nutritional value, rich in minerals and vitamins and rich in calcium, which helps metabolize excess salt in the body'],
    [0, 'Eggs have high nutritional value and are a good source of high-quality protein and B vitamins'],
    [0, 'The apples in this supermarket are very fresh'],
    [0, 'Millet is one of the staple foods in the north; many areas have the custom of eating millet porridge for dinner'],
    [0, 'Millet is high in nutritional value, with comprehensive and balanced nutrition, mainly containing carbohydrates'],
    [0, 'Protein and amino acids, fat, vitamins, salt'],
    [1, 'Xiaomi, Samsung and Huawei, the flagships of the three Android phone makers'],
    [1, 'Forget Xiaomi and Huawei! This one is really perfect'],
    [1, 'Apple may be stuck in 2016 again, but this time it cannot raise prices much'],
    [1, 'If Samsung wants to keep holding Huawei down, the A70 is not enough'],
    [1, 'Samsung will hit a new high in screen-to-body ratio'],
    [1, 'Huawei P30 and Samsung A70 sold out and won the best mobile phone marketing award from Suning'],
    [1, 'Lei Jun tells you with one picture: where the gap between Xiaomi and Samsung lies'],
    [1, 'The official Linux version of the Mi Chat APP is online, adapted to the Deepin system'],
    [1, 'Samsung has just updated their own wearable device APP'],
    [1, 'Huawei and Xiaomi crossing over is not terrible'],
]

# Chinese word segmentation
X, Y = [lcut(i[1]) for i in data], [i[0] for i in data]
X_train, X_test, y_train, y_test = train_test_split(X, Y)
#print(X_train)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))
"""['Samsung', 'has', 'just', 'updated', 'their', 'own', 'wearable', 'device', 'APP']"""

# ---------------------------- Word2Vec word vectors ----------------------------
word2vec = Word2Vec(X_train, size=max_features, min_count=1)
print(word2vec)
# Mapping from word to index
w2i = {w: i for i, w in enumerate(word2vec.wv.index2word)}
print('[Vocabulary]')
print(word2vec.wv.index2word)
print(w2i)

# The word vector matrix
vectors = word2vec.wv.vectors
print('[Word vector matrix]')
print(vectors.shape)
print(vectors)

# Look up the vector of a single word (a zero vector for unknown words)
def w2v(w):
    i = w2i.get(w)
    return vectors[i] if i else zeros(max_features)

# Turn token lists into padded sequences of word vectors
def pad(ls_of_words):
    a = [[w2v(i) for i in x] for x in ls_of_words]
    a = pad_sequences(a, maxlen, dtype='float')
    return a

X_train, X_test = pad(X_train), pad(X_test)
```
The output is as follows:
```
15 6
15 6
Word2Vec(vocab=120, size=20, alpha=0.025)
[Vocabulary]
[',', 'of', '、', 'millet', 'Samsung', 'is', 'vitamin', 'protein', 'and', 'fat', 'Huawei', 'apple', 'can', 'APP', 'amino acid', ...]
{',': 0, 'of': 1, '、': 2, 'millet': 3, 'Samsung': 4, 'is': 5, 'vitamin': 6, 'protein': 7, ...}
[Word vector matrix]
(120, 20)
[[ 0.00219526  0.00936278  0.00390177 ... -0.00422463  0.01543128  0.02481441]
 [ 0.02346811 -0.01520025 -0.00563479 ... -0.01656673 -0.02222313  0.00438196]
 [-0.02253242 -0.01633896 -0.02209039 ...  0.01301584 -0.01016752  0.01147605]
 ...
 [ 0.0050361  -0.00848463 -0.0235001  ...  0.01531716 -0.02348576  0.01051775]]
```
The second step is to build the RNN structure — here a bidirectional GRU (Bi-GRU) model — and then train and predict.
```python
# ---------------------------- Modeling and training ----------------------------
model = Sequential()
# Bidirectional RNN (GRU)
model.add(Bidirectional(GRU(units), input_shape=(maxlen, max_features)))
# Output layer: binary classification
model.add(Dense(units=1, activation='sigmoid'))
# Model visualization
model.summary()

model.compile(optimizer='rmsprop',            # RMSprop optimizer
              loss='binary_crossentropy',     # binary cross-entropy loss
              metrics=['acc'])                # report the accuracy

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=verbose,
                    validation_data=(X_test, y_test))

# ---------------------------- Prediction and visualization ----------------------------
score = model.evaluate(X_test, y_test, batch_size=batch_size)
print('test loss:', score[0])
print('test accuracy:', score[1])

# Plot the accuracy curves
acc = history.history['acc']
val_acc = history.history['val_acc']
plt.xlabel("Iterations")
plt.plot(range(epochs), acc, "bo-", linewidth=2, markersize=12, label="accuracy")
plt.plot(range(epochs), val_acc, "gs-", linewidth=2, markersize=12, label="val_accuracy")
plt.legend(loc="upper left")
plt.title("RNN-Word2vec")
plt.show()
```
The output results are shown in the figure below; the accuracy and val_accuracy values are very unsatisfactory. How can this be solved?
The neural network model and Epoch training results are shown in the figure below:
- test loss: 0.7160684466362
- test accuracy: 0.33333334
One answer is EarlyStopping. EarlyStopping is a kind of Callback; Callbacks specify operations to be performed at the beginning and end of each epoch and expose ready-made quantities such as acc, val_acc, loss and val_loss. EarlyStopping is a callback for stopping training early: it stops training when the monitored quantity stops improving (that is, when the improvement is smaller than a certain threshold). In the program above, the callback can be invoked to stop training once the loss no longer decreases. Recommended reading: "[Deep learning] tips and techniques for using Keras' EarlyStopping" – zwqjoy
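A sketch of wiring EarlyStopping into training, assuming the model and data variables defined in the complete script below; the parameter values are illustrative.

```python
from tensorflow.python.keras.callbacks import EarlyStopping

# Stop when validation accuracy has not improved for `patience` consecutive epochs
callbacks = [EarlyStopping(monitor='val_acc', patience=1)]

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(X_test, y_test),
                    callbacks=callbacks)   # training may stop before `epochs` is reached
```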
Finally, the complete code of this part is given:
```python
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 22:10:20 2020
@author: Eastmount CSDN
"""
from jieba import lcut
from numpy import zeros
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Bidirectional
from tensorflow.python.keras.callbacks import EarlyStopping

# ---------------------------- Define parameters ----------------------------
max_features = 20    # word vector dimension
units = 30           # number of RNN neurons
maxlen = 40          # maximum sequence length
epochs = 9           # training epochs
batch_size = 12      # batch size
verbose = 1          # verbosity of the training log
patience = 1         # patience of EarlyStopping

callbacks = [EarlyStopping('val_acc', patience=patience)]

# ---------------------------- Load data and preprocess ----------------------------
data = [
    [0, 'Millet porridge is made with millet as the main ingredient; it has a light taste and fragrance, and is simple to cook and good for digestion'],
    [0, 'When making porridge, the water must be boiled first and then the washed millet is added'],
    [0, 'Protein and amino acids, fat, vitamins, minerals'],
    [0, 'Millet is a traditional health food that can be used on its own for braised rice and porridge'],
    [0, 'Apple is a kind of fruit'],
    [0, 'Porridge has high nutritional value, rich in minerals and vitamins and rich in calcium, which helps metabolize excess salt in the body'],
    [0, 'Eggs have high nutritional value and are a good source of high-quality protein and B vitamins'],
    [0, 'The apples in this supermarket are very fresh'],
    [0, 'Millet is one of the staple foods in the north; many areas have the custom of eating millet porridge for dinner'],
    [0, 'Millet is high in nutritional value, with comprehensive and balanced nutrition, mainly containing carbohydrates'],
    [0, 'Protein and amino acids, fat, vitamins, salt'],
    [1, 'Xiaomi, Samsung and Huawei, the flagships of the three Android phone makers'],
    [1, 'Forget Xiaomi and Huawei! This one is really perfect'],
    [1, 'Apple may be stuck in 2016 again, but this time it cannot raise prices much'],
    [1, 'If Samsung wants to keep holding Huawei down, the A70 is not enough'],
    [1, 'Samsung will hit a new high in screen-to-body ratio'],
    [1, 'Huawei P30 and Samsung A70 sold out and won the best mobile phone marketing award from Suning'],
    [1, 'Lei Jun tells you with one picture: where the gap between Xiaomi and Samsung lies'],
    [1, 'The official Linux version of the Mi Chat APP is online, adapted to the Deepin system'],
    [1, 'Samsung has just updated their own wearable device APP'],
    [1, 'Huawei and Xiaomi crossing over is not terrible'],
]

# Chinese word segmentation
X, Y = [lcut(i[1]) for i in data], [i[0] for i in data]
X_train, X_test, y_train, y_test = train_test_split(X, Y)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

# ---------------------------- Word2Vec word vectors ----------------------------
word2vec = Word2Vec(X_train, size=max_features, min_count=1)
print(word2vec)
w2i = {w: i for i, w in enumerate(word2vec.wv.index2word)}
print('[Vocabulary]')
print(word2vec.wv.index2word)
print(w2i)

vectors = word2vec.wv.vectors
print('[Word vector matrix]')
print(vectors.shape)
print(vectors)

# Look up the vector of a single word (a zero vector for unknown words)
def w2v(w):
    i = w2i.get(w)
    return vectors[i] if i else zeros(max_features)

# Turn token lists into padded sequences of word vectors
def pad(ls_of_words):
    a = [[w2v(i) for i in x] for x in ls_of_words]
    a = pad_sequences(a, maxlen, dtype='float')
    return a

X_train, X_test = pad(X_train), pad(X_test)

# ---------------------------- Modeling and training ----------------------------
model = Sequential()
# Bidirectional RNN (GRU)
model.add(Bidirectional(GRU(units), input_shape=(maxlen, max_features)))
# Output layer: binary classification
model.add(Dense(units=1, activation='sigmoid'))
# Model visualization
model.summary()

model.compile(optimizer='rmsprop',            # RMSprop optimizer
              loss='binary_crossentropy',     # binary cross-entropy loss
              metrics=['acc'])                # report the accuracy

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=verbose,
                    validation_data=(X_test, y_test))

# ---------------------------- Prediction and visualization ----------------------------
score = model.evaluate(X_test, y_test, batch_size=batch_size)
print('test loss:', score[0])
print('test accuracy:', score[1])

# Plot the accuracy curves
acc = history.history['acc']
val_acc = history.history['val_acc']
plt.xlabel("Iterations")
plt.plot(range(epochs), acc, "bo-", linewidth=2, markersize=12, label="accuracy")
plt.plot(range(epochs), val_acc, "gs-", linewidth=2, markersize=12, label="val_accuracy")
plt.legend(loc="upper left")
plt.title("RNN-Word2vec")
plt.show()
```
2.LSTM+Word2Vec text classification
Next we use LSTM and Word2Vec for text classification. The network structure is very simple: the first layer is an embedding layer that converts the words in the text into vectors; then comes one LSTM layer, of which only the hidden state at the last time step is used; finally, a fully connected layer completes the network.
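The three-layer structure just described can be sketched as follows; the dimensions are illustrative, and the full script below shows the actual values used.

```python
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=20000, output_dim=128))   # word id -> 128-dim vector
model.add(LSTM(128))                                    # only the last hidden state is kept
model.add(Dense(1, activation='sigmoid'))               # fully connected binary output
```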
Notice the transformation of the matrix shape.
- X_train = X_train.reshape(len(y_train), maxlen*max_features)
- X_test = X_test.reshape(len(y_test), maxlen*max_features)
The complete code is shown below:
```python
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 22:10:20 2020
@author: Eastmount CSDN
"""
from jieba import lcut
from numpy import zeros
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, LSTM, Embedding
from tensorflow.python.keras.callbacks import EarlyStopping

# ---------------------------- Define parameters ----------------------------
max_features = 20    # word vector dimension
units = 30           # number of RNN neurons
maxlen = 40          # maximum sequence length
epochs = 9           # training epochs
batch_size = 12      # batch size
verbose = 1          # verbosity of the training log
patience = 1         # patience of EarlyStopping

callbacks = [EarlyStopping('val_acc', patience=patience)]

# ---------------------------- Load data and preprocess ----------------------------
data = [
    [0, 'Millet porridge is made with millet as the main ingredient; it has a light taste and fragrance, and is simple to cook and good for digestion'],
    [0, 'When making porridge, the water must be boiled first and then the washed millet is added'],
    [0, 'Protein and amino acids, fat, vitamins, minerals'],
    [0, 'Millet is a traditional health food that can be used on its own for braised rice and porridge'],
    [0, 'Apple is a kind of fruit'],
    [0, 'Porridge has high nutritional value, rich in minerals and vitamins and rich in calcium, which helps metabolize excess salt in the body'],
    [0, 'Eggs have high nutritional value and are a good source of high-quality protein and B vitamins'],
    [0, 'The apples in this supermarket are very fresh'],
    [0, 'Millet is one of the staple foods in the north; many areas have the custom of eating millet porridge for dinner'],
    [0, 'Millet is high in nutritional value, with comprehensive and balanced nutrition, mainly containing carbohydrates'],
    [0, 'Protein and amino acids, fat, vitamins, salt'],
    [1, 'Xiaomi, Samsung and Huawei, the flagships of the three Android phone makers'],
    [1, 'Forget Xiaomi and Huawei! This one is really perfect'],
    [1, 'Apple may be stuck in 2016 again, but this time it cannot raise prices much'],
    [1, 'If Samsung wants to keep holding Huawei down, the A70 is not enough'],
    [1, 'Samsung will hit a new high in screen-to-body ratio'],
    [1, 'Huawei P30 and Samsung A70 sold out and won the best mobile phone marketing award from Suning'],
    [1, 'Lei Jun tells you with one picture: where the gap between Xiaomi and Samsung lies'],
    [1, 'The official Linux version of the Mi Chat APP is online, adapted to the Deepin system'],
    [1, 'Samsung has just updated their own wearable device APP'],
    [1, 'Huawei and Xiaomi crossing over is not terrible'],
]

# Chinese word segmentation
X, Y = [lcut(i[1]) for i in data], [i[0] for i in data]
X_train, X_test, y_train, y_test = train_test_split(X, Y)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

# ---------------------------- Word2Vec word vectors ----------------------------
word2vec = Word2Vec(X_train, size=max_features, min_count=1)
print(word2vec)
w2i = {w: i for i, w in enumerate(word2vec.wv.index2word)}
print('[Vocabulary]')
print(word2vec.wv.index2word)
print(w2i)

vectors = word2vec.wv.vectors
print('[Word vector matrix]')
print(vectors.shape)
print(vectors)

# Look up the vector of a single word (a zero vector for unknown words)
def w2v(w):
    i = w2i.get(w)
    return vectors[i] if i else zeros(max_features)

# Turn token lists into padded sequences of word vectors
def pad(ls_of_words):
    a = [[w2v(i) for i in x] for x in ls_of_words]
    a = pad_sequences(a, maxlen, dtype='float')
    return a

X_train, X_test = pad(X_train), pad(X_test)
print(X_train.shape)   # (15, 40, 20)
print(X_test.shape)    # (6, 40, 20)

# Flatten the word-vector tensors: (15, 40, 20)=>(15, 40*20), (6, 40, 20)=>(6, 40*20)
X_train = X_train.reshape(len(y_train), maxlen * max_features)
X_test = X_test.reshape(len(y_test), maxlen * max_features)

# ---------------------------- Modeling and training ----------------------------
model = Sequential()
# Embedding layer
model.add(Embedding(max_features, 128))
# LSTM layer: only the output of the last node is used;
# set return_sequences=True to output a result at every time step
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# Output layer: binary classification
model.add(Dense(units=1, activation='sigmoid'))

model.compile(optimizer='rmsprop',            # RMSprop optimizer
              loss='binary_crossentropy',     # binary cross-entropy loss
              metrics=['acc'])                # report the accuracy

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=verbose,
                    validation_data=(X_test, y_test))

# ---------------------------- Prediction and visualization ----------------------------
score = model.evaluate(X_test, y_test, batch_size=batch_size)
print('test loss:', score[0])
print('test accuracy:', score[1])

# Plot the accuracy curves
acc = history.history['acc']
val_acc = history.history['val_acc']
plt.xlabel("Iterations")
plt.plot(range(epochs), acc, "bo-", linewidth=2, markersize=12, label="accuracy")
plt.plot(range(epochs), val_acc, "gs-", linewidth=2, markersize=12, label="val_accuracy")
plt.legend(loc="upper left")
plt.title("LSTM-Word2vec")
plt.show()
```
The output, as shown below, is still not ideal.
- test loss: 0.712007462978363
- test accuracy: 0.33333334
The corresponding graph is shown below.
3.LSTM+TFIDF text classification
For comparison, the LSTM+TFIDF text classification code is also given.
```python
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 22:10:20 2020
@author: Eastmount CSDN
"""
from jieba import lcut
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, LSTM, Embedding
from tensorflow.python.keras.callbacks import EarlyStopping

# ---------------------------- Define parameters ----------------------------
max_features = 20    # word vector dimension
units = 30           # number of RNN neurons
maxlen = 40          # maximum sequence length
epochs = 9           # training epochs
batch_size = 12      # batch size
verbose = 1          # verbosity of the training log
patience = 1         # patience of EarlyStopping

callbacks = [EarlyStopping('val_acc', patience=patience)]

# ---------------------------- Load data and preprocess ----------------------------
data = [
    [0, 'Millet porridge is made with millet as the main ingredient; it has a light taste and fragrance, and is simple to cook and good for digestion'],
    [0, 'When making porridge, the water must be boiled first and then the washed millet is added'],
    [0, 'Protein and amino acids, fat, vitamins, minerals'],
    [0, 'Millet is a traditional health food that can be used on its own for braised rice and porridge'],
    [0, 'Apple is a kind of fruit'],
    [0, 'Porridge has high nutritional value, rich in minerals and vitamins and rich in calcium, which helps metabolize excess salt in the body'],
    [0, 'Eggs have high nutritional value and are a good source of high-quality protein and B vitamins'],
    [0, 'The apples in this supermarket are very fresh'],
    [0, 'Millet is one of the staple foods in the north; many areas have the custom of eating millet porridge for dinner'],
    [0, 'Millet is high in nutritional value, with comprehensive and balanced nutrition, mainly containing carbohydrates'],
    [0, 'Protein and amino acids, fat, vitamins, salt'],
    [1, 'Xiaomi, Samsung and Huawei, the flagships of the three Android phone makers'],
    [1, 'Forget Xiaomi and Huawei! This one is really perfect'],
    [1, 'Apple may be stuck in 2016 again, but this time it cannot raise prices much'],
    [1, 'If Samsung wants to keep holding Huawei down, the A70 is not enough'],
    [1, 'Samsung will hit a new high in screen-to-body ratio'],
    [1, 'Huawei P30 and Samsung A70 sold out and won the best mobile phone marketing award from Suning'],
    [1, 'Lei Jun tells you with one picture: where the gap between Xiaomi and Samsung lies'],
    [1, 'The official Linux version of the Mi Chat APP is online, adapted to the Deepin system'],
    [1, 'Samsung has just updated their own wearable device APP'],
    [1, 'Huawei and Xiaomi crossing over is not terrible'],
]

# Chinese word segmentation: join the tokens with spaces
X, Y = [' '.join(lcut(i[1])) for i in data], [i[0] for i in data]
print(X)
print(Y)

# ---------------------------- Compute word frequency and TF-IDF ----------------------------
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Convert the words in the text into a word frequency matrix
vectorizer = CountVectorizer()
X_data = vectorizer.fit_transform(X)
print(X_data)

# Get all the feature words
word = vectorizer.get_feature_names()
print('[Feature words]')
for w in word:
    print(w, end=' ')
print('')
print(X_data.toarray())

# Compute TF-IDF values: tfidf[i][j] is the TF-IDF weight of word j in document i
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_data)
weight = tfidf.toarray()
print(weight)

# Split the data set
X_train, X_test, y_train, y_test = train_test_split(weight, Y)
print(X_train.shape, X_test.shape)
print(len(y_train), len(y_test))
#(15, 117) (6, 117) 15 6

# ---------------------------- Modeling and training ----------------------------
model = Sequential()
# Embedding layer
model.add(Embedding(max_features, 128))
# LSTM layer: only the output of the last node is used;
# set return_sequences=True to output a result at every time step
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# Output layer: binary classification
model.add(Dense(units=1, activation='sigmoid'))

model.compile(optimizer='rmsprop',            # RMSprop optimizer
              loss='binary_crossentropy',     # binary cross-entropy loss
              metrics=['acc'])                # report the accuracy

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=verbose,
                    validation_data=(X_test, y_test))

# ---------------------------- Prediction and visualization ----------------------------
score = model.evaluate(X_test, y_test, batch_size=batch_size)
print('test loss:', score[0])
print('test accuracy:', score[1])

# Plot the accuracy curves
acc = history.history['acc']
val_acc = history.history['val_acc']
plt.xlabel("Iterations")
plt.plot(range(epochs), acc, "bo-", linewidth=2, markersize=12, label="accuracy")
plt.plot(range(epochs), val_acc, "gs-", linewidth=2, markersize=12, label="val_accuracy")
plt.legend(loc="upper left")
plt.title("LSTM-TFIDF")
plt.show()
```
The following output is displayed:
- test loss: 0.7694947719573975
- test accuracy: 0.33333334
The corresponding graph is as follows:
4. Comparative analysis of machine learning and deep learning
Finally, a simple comparison shows that, on this small data set, the traditional machine learning methods outperform deep learning. Why is that, and what improvements can be made?
- MultinomialNB+TFIDF: test accuracy = 0.67
- GaussianNB+Word2Vec: test accuracy = 0.83
- RNN+Word2Vec: test accuracy = 0.33333334
- LSTM+Word2Vec: test accuracy = 0.33333334
- LSTM+TFIDF: test accuracy = 0.33333334
Based on related articles and the author's own experience, a brief analysis follows. The reasons are as follows:
- First, data set preprocessing: the code above does not filter stop words, and the large number of punctuation marks and stop words hurts the classification results (a stop-word filtering sketch follows this list). The dimension of the word vectors also needs to be tuned.
- Second, data set size. With little data, CNN is recommended; RNN's overfitting will make you want to cry. With a large amount of data, RNN may work better. For novelty, RLSTM and RCNN have been good choices in recent years. But if the goal is simply a working application (especially on news data sets), ordinary machine learning methods are already good enough, and Bayes is really the best choice once the time cost is taken into account.
- Third, CNN and RNN have different applicability: CNN is good at learning and capturing spatial features, while RNN is good at capturing temporal features. Structurally, RNN is superior. The introduction of seq2seq (two cooperating RNNs) for mainstream NLP problems such as translation and text generation broke many previous benchmarks, and attention was introduced to solve the long-sentence problem. The essence of attention is that an extra softmax is added to learn the mapping between words, a bit like plug-in storage; the idea of attention traces back to the paper "Neural Turing Machines".
- Fourth, different data sets suit different methods, and each method has its own strengths. For some sentiment analysis tasks GRU beats CNN, while CNN may have an advantage in news classification and text classification contests. CNN also has a speed advantage; with large data it can fit more kinds of local phrase frequencies by adding parameters and achieve better results. If you want to build a production system and both kinds of algorithms have their strengths, it is ensemble time.
- Fifth, in text sentiment classification, GRU is better than CNN, and this advantage grows as sentence length increases. When the sentiment of a sentence is determined by the whole sentence, GRU classifies it correctly more easily; when the sentiment is determined by a few local key phrases, CNN does.
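Regarding the first point, stop-word filtering can be added before vectorization with only a few lines. The stop-word set below is a hypothetical, minimal example; in practice a full Chinese stop-word file would be loaded.

```python
from jieba import lcut

# A small, made-up stop-word set for illustration only
stopwords = {'的', '是', '和', '、', '，', '。', '！', '：'}

def tokenize(text):
    """Segment a sentence with jieba and drop stop words and empty tokens."""
    return [w for w in lcut(text) if w not in stopwords and w.strip()]

print(tokenize('小米粥是以小米作为主要食材熬制而成的粥'))
```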
In short, in real experiments we try to choose the algorithm that suits our data set; that comparison is itself part of the experiment. We need to compare different algorithms, parameters and learning models to find a better solution. In the future, the author will further study TextCNN, Attention, BiLSTM, GAN and other algorithms, and hopes to make progress together with everyone.
For Chinese long text classification, CNN or RNN is better? – zhihu
Summary
This is the end of the article. I hope it helps you; where the article is lacking or wrong, readers are welcome to point it out. These experiments are common questions in my thesis research and project evaluation, and I hope readers will take these questions, think about them in depth according to their own needs, and put what they have learned into practice.
In short, this article implements an RNN text classification case with Keras, explains the principles of recurrent neural networks in detail, and compares them with traditional machine learning. Finally, as a newcomer to artificial intelligence, I hope to keep improving and to apply these techniques to image recognition, network security, adversarial examples and other fields, and to guide everyone to write simple academic papers. Let's go!
Thank you for meeting us at Huawei Cloud! I hope to grow together with you in the Huawei Cloud community. Original address: blog.csdn.net/Eastmount/a… (By: Zhang's House Eastmount, 2021-11-09, at night in Wuhan)
References:
Thanks again for the contributions of my predecessors and teachers, as well as the author's Python artificial intelligence and data analysis series, which can be downloaded on GitHub.
[1] Keras’ IMDB and MNIST datasets cannot be downloaded
[2] 4.42 (imdb.py) – Keras Learning Notes iv – WYx100
[3] Keras Text Classification – Strong Promotion of Kiki Wei teacher article
[4] TextCNN text classification (keras implementation) – strong promotion of asia-lee’s article
[5] Keras text classification: A strong inference from Wang Yilei’s article
[6] BiLSTM+ Attentional headline text classification by Keras (Natural language processing) — Ilivecode
[7] Solving large scale text classification problems with deep learning (CNN RNN Attention) — Review and Practice — Zhihuqing Song
[8] Serena And Niu Yafeng: An Overview of Text categorization based on Word2vec and CNN
[9] github.com/keras-team/…
[10] [Deep Learning] How to Use Keras at the End of The Track – ZWqjoy
[11] Why do validation accuracy outnumber train accuracy in deep learning? – ICOZ
[12] Keras implementation of CNN text classification – vivian_ll
[13] Keras Text Classification Practice (PART 1) – Weixin_34351321
[14] For Chinese long text classification, CNN or RNN is better? – zhihu