Original link: tecdat.cn/?p=8640
Original source: Tuoduan Data Tribe WeChat public account
Introduction
In this article, we'll see how to develop a text classification model with multiple outputs: a model that analyzes a text comment and predicts multiple labels associated with it. Multi-label classification is in fact a special case of multi-output models. By the end of this article, you will be able to perform multi-label text classification on your own data.
The data set
The dataset contains comments from Wikipedia talk page edits. Each comment can belong to any subset of six categories (toxic, severe_toxic, obscene, threat, insult, identity_hate), which makes this a multi-label classification problem.
Now we import the required libraries and load the data set into our application. The following script imports the required libraries:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
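The model-building steps later in the article also rely on Keras and scikit-learn. The original article does not list these imports, so the following set is an assumption (it presumes the standalone keras package with a TensorFlow backend; under TensorFlow 2 the same classes live under tensorflow.keras):

from numpy import asarray, zeros

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding
from keras.utils import plot_model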
Now load the data set into memory:
toxic_comments = pd.read_csv("/comments.csv")
The following script displays the shape of the dataset and its first few rows:
print(toxic_comments.shape)
toxic_comments.head()
Output:
(159571, 8)
The dataset contains 159,571 records and 8 columns. The first few rows of the dataset look like this:
Let's delete any records that contain a null value or an empty comment string.
filter = toxic_comments["comment_text"] != ""
toxic_comments = toxic_comments[filter]
toxic_comments = toxic_comments.dropna()
The comment_text column contains the text of the comments. Let's print a sample comment:
print(toxic_comments["comment_text"][168])
Output:
You should be fired, you're a moronic wimp who is too lazy to do research. It makes me sick that people like you exist in this world.
Let’s take a look at the tag associated with this comment:
print("Toxic:" + str(toxic_comments["toxic"][168]))
print("Severe_toxic:" + str(toxic_comments["severe_toxic"][168]))
print("Obscene:" + str(toxic_comments["obscene"][168]))
print("Threat:" + str(toxic_comments["threat"][168]))
print("Insult:" + str(toxic_comments["insult"][168]))
print("Identity_hate:" + str(toxic_comments["identity_hate"][168]))
Output:
Toxic:1
Severe_toxic:0
Obscene:0
Threat:0
Insult:1
Identity_hate:0
First, we will select the label (output) columns.
toxic_comments_labels = toxic_comments[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]]
toxic_comments_labels.head()
Output:
Using the toxic_comments_labels dataframe, we will draw a bar chart showing the total number of comments for each label.
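The plotting code itself is not shown above; a minimal sketch using pandas and matplotlib (the figure size and title are arbitrary choices) might be:

# Sum each label column to count how many comments carry that label
label_counts = toxic_comments_labels.sum(axis=0)
label_counts.plot.bar(figsize=(8, 6))
plt.title("Number of comments per label")
plt.show()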
Output:
As you can see, “toxic” comments appear the most frequently, followed by “insulting” comments.
Create multi-label text classification model
There are two ways to create a multi-label classification model: using a single dense output layer and multiple dense output layers.
In the first approach, we can use a single dense layer with six outputs, a sigmoid activation function, and a binary cross-entropy loss function.
In the second approach, we will create a dense output layer for each tag.
Multi-label text classification model with single output layer
In this section, we will create a multi-label text classification model with a single output layer.
In the next step, we will create the input and output sets. The input is the comments from the comment_text column.
We don't need to perform any one-hot encoding here, because our output labels already form a binary indicator vector for each comment.
Next, we split the data into training and test sets, and convert the text input into padded integer sequences that can be fed to an embedding layer.
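These steps are not shown in code above. A minimal sketch, assuming the toxic_comments and toxic_comments_labels dataframes from earlier, a default Keras Tokenizer, and maxlen=200 (the sequence length implied by the model summary below):

X = toxic_comments["comment_text"].values   # raw comment text as input
y = toxic_comments_labels.values            # labels are already binary indicator vectors

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Turn each comment into a sequence of integer word indices
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1
maxlen = 200
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)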
We will use the GloVe word embedding to convert text input to numeric input.
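The embedding-loading code is also omitted; a sketch assuming the pre-trained glove.6B.100d.txt file is available locally (the file path is a placeholder) and the tokenizer defined above:

# Map each word to its 100-dimensional GloVe vector
embeddings_dictionary = dict()
with open('glove.6B.100d.txt', encoding='utf8') as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

# Build the embedding matrix: row i holds the GloVe vector of the word with index i
embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector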
The following script creates the model. Our model will have an input layer, an embedding layer, an LSTM layer with 128 neurons, and an output layer with 6 neurons, because we have 6 labels in the output.
deep_inputs = Input(shape=(maxlen,))
# 100-dimensional GloVe vectors, kept frozen (non-trainable) during training
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)
dense_layer_1 = Dense(6, activation='sigmoid')(LSTM_Layer_1)
model = Model(inputs=deep_inputs, outputs=dense_layer_1)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])  # optimizer choice assumed
Let’s output the model summary:
print(model.summary())
Output:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 200)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 200, 100)          14824300
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 774
=================================================================
Total params: 14,942,322
Trainable params: 118,022
Non-trainable params: 14,824,300
_________________________________________________________________
The following script outputs the structure of our neural network:
plot_model(model, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)
Output:
As can be seen from the figure above, the output layer contains only one dense layer with six neurons. Now let’s train the model:
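The call to fit is not included in the text; based on the training log below (five epochs, an 80/20 train/validation split), it would look roughly like the following sketch. The batch size of 128 is an assumption:

history = model.fit(X_train, y_train, batch_size=128, epochs=5, verbose=1, validation_split=0.2)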
You can spend more time training the model to see whether the results improve.
The results are as follows:
Train on 102124 samples, validate on 25532 samples
Epoch 1/5
102124/102124 [==============================] - 245s 2ms/step - loss: 0.1437 - acc: 0.9634 - val_loss: 0.1361 - val_acc: 0.9631
Epoch 2/5
102124/102124 [==============================] - 245s 2ms/step - loss: 0.0763 - acc: 0.9753 - val_loss: 0.0621 - val_acc: 0.9788
Epoch 3/5
102124/102124 [==============================] - 243s 2ms/step - loss: 0.0588 - acc: 0.9800 - val_loss: 0.0578 - val_acc: 0.9802
Epoch 4/5
102124/102124 [==============================] - 246s 2ms/step - loss: 0.0559 - acc: 0.9807 - val_loss: 0.0571 - val_acc: 0.9801
Epoch 5/5
102124/102124 [==============================] - 245s 2ms/step - loss: 0.0528 - acc: 0.9813 - val_loss: 0.0554 - val_acc: 0.9807
Now let's evaluate the model on the test set:
print("Test Score:", score[0])
print("Test Accuracy:", score[1])
Copy the code
Output:
31915/31915 [==============================] - 108s 3ms/step
Test Score: 0.054090796736467786
Test Accuracy: 0.9810642735274182
Our model achieves an accuracy of about 98%.
Finally, we will plot the losses and accuracy of the training and test sets to see if our model overfits.
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# An analogous block using history.history['acc'] and ['val_acc'] plots the accuracy curves
Output:
You can see that the model is not overfitted on the validation set.
Multi-label text classification model with multiple output layers
In this section, we will create a multi-label text classification model where each output label will have an output dense layer. Let’s first define the preprocessing function:
def preprocess_text(sen):
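    # The body of the function did not survive in the original text; the cleaning
    # steps below are an assumption based on common practice: strip punctuation
    # and numbers, drop single-character tokens, and collapse extra whitespace.
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence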
The second step is to create the inputs and outputs for the model. The input is the comment text and the output is the six labels. The following script creates the input texts and the combined label set:
X = [preprocess_text(sen) for sen in toxic_comments["comment_text"]]  # preprocessed comment text as input
y = toxic_comments[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]]
Let’s divide the data into training sets and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
The y variable contains a combined output of six labels. However, we will create a separate output layer for each label. We will create six variables that store the individual labels from the training data, and six variables that store the individual label values of the test data.
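The code for this step is not shown; a sketch consistent with the fit call used later (the column-to-variable mapping is an assumption) would be:

# One output array per label, for the training and the test split respectively
y1_train, y1_test = y_train[["toxic"]].values, y_test[["toxic"]].values
y2_train, y2_test = y_train[["severe_toxic"]].values, y_test[["severe_toxic"]].values
y3_train, y3_test = y_train[["obscene"]].values, y_test[["obscene"]].values
y4_train, y4_test = y_train[["threat"]].values, y_test[["threat"]].values
y5_train, y5_test = y_train[["insult"]].values, y_test[["insult"]].values
y6_train, y6_test = y_train[["identity_hate"]].values, y_test[["identity_hate"]].values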
The next step is to convert the text input into embedding vectors.
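As in the previous section, the comments are first tokenized; a short sketch, reusing the same Tokenizer approach and maxlen=200:

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1
maxlen = 200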
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
We will again use the GloVe word embeddings, building the embedding matrix exactly as in the previous section:
embedding_matrix = zeros((vocab_size, 100))
Our model will have an input layer, an embedding layer, and an LSTM layer with 128 neurons. The output of the LSTM layer will be used as the input to six dense output layers. Each output layer has one neuron with a sigmoid activation function.
The following script creates our model:
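The model-definition code did not survive in the original text; the following is a sketch consistent with the summary below (six single-neuron sigmoid outputs on top of a shared LSTM, trained with binary cross-entropy; the optimizer choice is an assumption):

deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)

# One single-neuron sigmoid output per label
output1 = Dense(1, activation='sigmoid')(LSTM_Layer_1)
output2 = Dense(1, activation='sigmoid')(LSTM_Layer_1)
output3 = Dense(1, activation='sigmoid')(LSTM_Layer_1)
output4 = Dense(1, activation='sigmoid')(LSTM_Layer_1)
output5 = Dense(1, activation='sigmoid')(LSTM_Layer_1)
output6 = Dense(1, activation='sigmoid')(LSTM_Layer_1)

model = Model(inputs=deep_inputs, outputs=[output1, output2, output3, output4, output5, output6])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])  # optimizer choice assumed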
The following script outputs a summary of the model:
print(model.summary())
Output:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 200)          0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 200, 100)     14824300    input_1[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 128)          117248      embedding_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1)            129         lstm_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            129         lstm_1[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 1)            129         lstm_1[0][0]
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 1)            129         lstm_1[0][0]
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 1)            129         lstm_1[0][0]
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 1)            129         lstm_1[0][0]
==================================================================================================
Total params: 14,942,322
Trainable params: 118,022
Non-trainable params: 14,824,300
__________________________________________________________________________________________________
The following script shows the architecture of our model:
plot_model(model, to_file='model_plot4b.png', show_shapes=True, show_layer_names=True)
Output:
You can see that we have six different output layers. The figure above clearly illustrates the difference between the model from the previous section, which had a single output layer, and this model with multiple output layers.
Now let’s train the model:
history = model.fit(x=X_train, y=[y1_train, y2_train, y3_train, y4_train, y5_train, y6_train], batch_size=8192, epochs=5, verbose=1, validation_split=0.2)
The training process and results are as follows:
Output:
Train on 102124 samples, validate on 25532 samples
Epoch 1/5
102124/102124 [==============================] - 24s 239us/step - loss: 3.5116 - dense_1_loss: 0.6017 - dense_2_loss: 0.5806 - dense_3_loss: 0.6150 - dense_4_loss: 0.5585 - dense_5_loss: 0.5828 - dense_6_loss: 0.5730 - dense_1_acc: 0.9029 - dense_2_acc: 0.9842 - dense_3_acc: 0.9444 - dense_4_acc: 0.9934 - dense_5_acc: 0.9508 - dense_6_acc: 0.9870 - val_loss: 1.0369 - val_dense_1_loss: 0.3290 - val_dense_2_loss: 0.0983 - val_dense_3_loss: 0.2571 - val_dense_4_loss: 0.0595 - val_dense_5_loss: 0.1972 - val_dense_6_loss: 0.0959 - val_dense_1_acc: 0.9037 - val_dense_2_acc: 0.9901 - val_dense_3_acc: 0.9469 - val_dense_4_acc: 0.9966 - val_dense_5_acc: 0.9509 - val_dense_6_acc: 0.9901
Epoch 2/5
102124/102124 [==============================] - 20s 197us/step - loss: 0.9084 - dense_1_loss: 0.3324 - dense_2_loss: 0.0679 - dense_3_loss: 0.2172 - dense_4_loss: 0.0338 - dense_5_loss: 0.1983 - dense_6_loss: 0.0589 - dense_1_acc: 0.9043 - dense_2_acc: 0.9899 - dense_3_acc: 0.9474 - dense_4_acc: 0.9968 - dense_5_acc: 0.9510 - dense_6_acc: 0.9915 - val_loss: 0.8616 - val_dense_1_loss: 0.3164 - val_dense_2_loss: 0.0555 - val_dense_3_loss: 0.2127 - val_dense_4_loss: 0.0235 - val_dense_5_loss: 0.1981 - val_dense_6_loss: 0.0554 - val_dense_1_acc: 0.9038 - val_dense_2_acc: 0.9900 - val_dense_3_acc: 0.9469 - val_dense_4_acc: 0.9965 - val_dense_5_acc: 0.9509 - val_dense_6_acc: 0.9900
Epoch 3/5
102124/102124 [==============================] - 20s 199us/step - loss: 0.8513 - dense_1_loss: 0.3179 - dense_2_loss: 0.0566 - dense_3_loss: 0.2103 - dense_4_loss: 0.0216 - dense_5_loss: 0.1960 - dense_6_loss: 0.0490 - dense_1_acc: 0.9043 - dense_2_acc: 0.9899 - dense_3_acc: 0.9474 - dense_4_acc: 0.9968 - dense_5_acc: 0.9510 - dense_6_acc: 0.9915 - val_loss: 0.8552 - val_dense_1_loss: 0.3158 - val_dense_2_loss: 0.0566 - val_dense_3_loss: 0.2074 - val_dense_4_loss: 0.0225 - val_dense_5_loss: 0.1960 - val_dense_6_loss: 0.0568 - val_dense_1_acc: 0.9038 - val_dense_2_acc: 0.9900 - val_dense_3_acc: 0.9469 - val_dense_4_acc: 0.9965 - val_dense_5_acc: 0.9509 - val_dense_6_acc: 0.9900
Epoch 4/5
102124/102124 [==============================] - 20s 198us/step - loss: 0.8442 - dense_1_loss: 0.3153 - dense_2_loss: 0.0570 - dense_3_loss: 0.2061 - dense_4_loss: 0.0213 - dense_5_loss: 0.1952 - dense_6_loss: 0.0493 - dense_1_acc: 0.9043 - dense_2_acc: 0.9899 - dense_3_acc: 0.9474 - dense_4_acc: 0.9968 - dense_5_acc: 0.9510 - dense_6_acc: 0.9915 - val_loss: 0.8527 - val_dense_1_loss: 0.3156 - val_dense_2_loss: 0.0558 - val_dense_3_loss: 0.2074 - val_dense_4_loss: 0.0226 - val_dense_5_loss: 0.1951 - val_dense_6_loss: 0.0561 - val_dense_1_acc: 0.9038 - val_dense_2_acc: 0.9900 - val_dense_3_acc: 0.9469 - val_dense_4_acc: 0.9965 - val_dense_5_acc: 0.9509 - val_dense_6_acc: 0.9900
Epoch 5/5
102124/102124 [==============================] - 20s 197us/step - loss: 0.8410 - dense_1_loss: 0.3146 - dense_2_loss: 0.0561 - dense_3_loss: 0.2055 - dense_4_loss: 0.0213 - dense_5_loss: 0.1948 - dense_6_loss: 0.0486 - dense_1_acc: 0.9043 - dense_2_acc: 0.9899 - dense_3_acc: 0.9474 - dense_4_acc: 0.9968 - dense_5_acc: 0.9510 - dense_6_acc: 0.9915 - val_loss: 0.8501 - val_dense_1_loss: 0.3153 - val_dense_2_loss: 0.0553 - val_dense_3_loss: 0.2069 - val_dense_4_loss: 0.0226 - val_dense_5_loss: 0.1948 - val_dense_6_loss: 0.0553 - val_dense_1_acc: 0.9038 - val_dense_2_acc: 0.9900 - val_dense_3_acc: 0.9469 - val_dense_4_acc: 0.9965 - val_dense_5_acc: 0.9509 - val_dense_6_acc: 0.9900
For each epoch, the loss and accuracy of all six output dense layers are reported, along with the corresponding validation metrics.
Now let’s evaluate the performance of the model on the test set:
print("Test Score:", score[0])
print("Test Accuracy:", score[1])
Copy the code
Output:
31915/31915 [==============================] - 111s 3ms/step
Test Score: 0.8471985269747015
Test Accuracy: 0.31425264998511726
With multiple output layers, the reported accuracy is only about 31%. (Note that for a multi-output Keras model, score[1] corresponds to the first output layer's loss rather than an overall accuracy, so this figure is not directly comparable to the single-output result.)
The following script plots the loss values of the training and validation sets for the first dense output layer.
plt.plot(history.history['dense_1_loss'])
plt.plot(history.history['val_dense_1_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
Output:
As you can see from the output, the test (validation) metrics barely improve after the first epoch.
Conclusion
Multi-label text classification is one of the most common text classification problems. In this article, we studied two deep learning approaches to it. In the first approach, we used a single dense output layer with multiple neurons, each representing one label.
In the second approach, we created a separate dense output layer with a single neuron for each label. The results show that, in our case, a single output layer with multiple neurons works better than multiple output layers.