Convolutional and recurrent neural networks are used here to classify Chinese text. For background, see the paper Convolutional Neural Networks for Sentence Classification, dennybritz's blog post Implementing a CNN for Text Classification in TensorFlow, and the character-level CNN paper Character-level Convolutional Networks for Text Classification.
This project is a simplified TensorFlow implementation applied to a Chinese data set: character-level CNN and RNN models are used to classify Chinese text, and both achieve good results.
Environment
- Python 3.5
- TensorFlow 1.3
- numpy
- scikit-learn
Data set
A subset of THUCNews is used for training and testing. The full data set can be downloaded from THUCTC: an efficient Chinese text classification toolkit; please comply with the data provider's open-source license.
Ten categories are used in this training, with 6,500 examples per category.
Categories are as follows:
Sports, Finance, House, Home, Education, Technology, Fashion, Politics, Games, Entertainment
The subset can be downloaded here: pan.baidu.com/s/1bpq9Eub (password: ycyw)
The data set is divided as follows:
- Training set: 5000 x 10
- Validation set: 500 x 10
- Test set: 1000 x 10
See the two scripts in the helper directory for how the subset is generated from the original data set: copy_data.sh copies 6500 files from each category, and cnews_group.py consolidates the multiple files into a single file per split. After running these scripts, you get three data files:
- cnews.train.txt: training set (50,000 entries)
- cnews.val.txt: validation set (5,000 entries)
- cnews.test.txt: test set (10,000 entries)
Preprocessing
data/cnews_loader.py handles data preprocessing with the helpers below; a usage sketch follows the shape table.
- read_file(): reads file data;
- build_vocab(): builds the vocabulary using character-level representations, and stores it so it is not rebuilt on every run;
- read_vocab(): reads the stored vocabulary and converts it to a {word: id} mapping;
- read_category(): maps the fixed categories to a {category: id} mapping;
- to_words(): converts a piece of id-encoded data back to text;
- process_file(): converts the data set from text to fixed-length id sequences;
- batch_iter(): prepares shuffled batches of data for neural network training.
After preprocessing, the data has the following format:
Data | Shape | Data | Shape |
---|---|---|---|
x_train | [50000, 600] | y_train | [50000, 10] |
x_val | [5000, 600] | y_val | [5000, 10] |
x_test | [10000, 600] | y_test | [10000, 10] |
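A minimal usage sketch of these helpers, as promised above. The signatures, module path and file locations are assumptions inferred from the descriptions and may not match cnews_loader.py exactly:

```python
from data.cnews_loader import read_vocab, read_category, process_file, batch_iter

categories, cat_to_id = read_category()                  # fixed {category: id} mapping
words, word_to_id = read_vocab('data/cnews/vocab.txt')   # assumed vocabulary path

# pad/truncate every text to 600 character ids and one-hot encode the labels
x_train, y_train = process_file('data/cnews/cnews.train.txt',
                                word_to_id, cat_to_id, 600)
print(x_train.shape, y_train.shape)  # (50000, 600) (50000, 10)

# shuffled mini-batches for training
for x_batch, y_batch in batch_iter(x_train, y_train, 64):
    pass  # feed each batch to the model
```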
Convolutional neural network (CNN)
Configuration items
The configurable parameters of the CNN, defined in cnn_model.py, are shown below.
```python
class TCNNConfig(object):
    """CNN configuration parameters"""

    embedding_dim = 64        # word vector dimension
    seq_length = 600          # sequence length
    num_classes = 10          # number of classes
    num_filters = 128         # number of convolution filters
    kernel_size = 5           # convolution kernel size
    vocab_size = 5000         # vocabulary size

    hidden_dim = 128          # neurons in the fully connected layer

    dropout_keep_prob = 0.5   # dropout keep probability
    learning_rate = 1e-3      # learning rate

    batch_size = 64           # training batch size
    num_epochs = 10           # total number of epochs

    print_per_batch = 100     # print a result every N batches
    save_per_batch = 10       # write to tensorboard every N batches
```
CNN model
See the implementation of cnn_model.py for details.
The general structure is: embedding → 1-D convolution → global max pooling → fully connected layer (dropout + ReLU) → softmax classifier.
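As a rough illustration, here is a minimal TensorFlow 1.x sketch of such a model built from the configuration above; variable names are illustrative and may differ from those in cnn_model.py:

```python
import tensorflow as tf

config = TCNNConfig()

# placeholders: character-id sequences and one-hot labels
input_x = tf.placeholder(tf.int32, [None, config.seq_length], name='input_x')
input_y = tf.placeholder(tf.float32, [None, config.num_classes], name='input_y')
keep_prob = tf.placeholder(tf.float32, name='keep_prob')

# character embedding: [batch, 600] -> [batch, 600, 64]
embedding = tf.get_variable('embedding', [config.vocab_size, config.embedding_dim])
embedding_inputs = tf.nn.embedding_lookup(embedding, input_x)

# 1-D convolution over the sequence, then global max pooling over time
conv = tf.layers.conv1d(embedding_inputs, config.num_filters, config.kernel_size)
gmp = tf.reduce_max(conv, axis=1)  # [batch, num_filters]

# fully connected layer with dropout and ReLU
fc = tf.layers.dense(gmp, config.hidden_dim)
fc = tf.nn.relu(tf.nn.dropout(fc, keep_prob))

# softmax classifier
logits = tf.layers.dense(fc, config.num_classes)
y_pred = tf.argmax(tf.nn.softmax(logits), 1)

# cross-entropy loss and Adam optimizer
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=input_y))
optim = tf.train.AdamOptimizer(config.learning_rate).minimize(loss)
```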
Training and validation
Run python run_cnn.py train to begin training.
If you have trained this model before, delete the tensorboard/textcnn directory first to avoid overlapping TensorBoard training results.
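Under the hood, the training loop in run_cnn.py roughly does the following. This is a condensed sketch; names such as model.input_x and model.optim are assumptions, and the periodic evaluation/saving logic is only indicated in comments:

```python
import tensorflow as tf

session = tf.Session()
session.run(tf.global_variables_initializer())

for epoch in range(config.num_epochs):
    # batch_iter yields shuffled mini-batches (see cnews_loader above)
    for x_batch, y_batch in batch_iter(x_train, y_train, config.batch_size):
        feed_dict = {model.input_x: x_batch,
                     model.input_y: y_batch,
                     model.keep_prob: config.dropout_keep_prob}
        session.run(model.optim, feed_dict=feed_dict)
        # every print_per_batch steps: evaluate on the validation set,
        # save the best model so far, and stop early when validation
        # accuracy has not improved for a long time
```

The console output looks like this: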
```
Configuring CNN model...
Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:14
Training and evaluating...
Epoch: 1
Iter:      0, Train Loss:    2.3, Train Acc:  10.94%, Val Loss:    2.3, Val Acc:   8.92%, Time: 0:00:01 *
Iter:    100, Train Loss:   0.88, Train Acc:  73.44%, Val Loss:    1.2, Val Acc:  68.46%, Time: 0:00:04 *
Iter:    200, Train Loss:   0.38, Train Acc:  92.19%, Val Loss:   0.75, Val Acc:  77.32%, Time: 0:00:07 *
Iter:    300, Train Loss:   0.22, Train Acc:  92.19%, Val Loss:   0.46, Val Acc:  87.08%, Time: 0:00:09 *
Iter:    400, Train Loss:   0.24, Train Acc:  90.62%, Val Loss:    0.4, Val Acc:  88.62%, Time: 0:00:12 *
Iter:    500, Train Loss:   0.16, Train Acc:  96.88%, Val Loss:   0.36, Val Acc:  90.38%, Time: 0:00:15 *
Iter:    600, Train Loss:  0.084, Train Acc:  96.88%, Val Loss:   0.35, Val Acc:  91.36%, Time: 0:00:17 *
Iter:    700, Train Loss:   0.21, Train Acc:  93.75%, Val Loss:   0.26, Val Acc:  92.58%, Time: 0:00:20 *
Epoch: 2
Iter:    800, Train Loss:   0.07, Train Acc:  98.44%, Val Loss:   0.24, Val Acc:  94.12%, Time: 0:00:23 *
Iter:    900, Train Loss:  0.092, Train Acc:  96.88%, Val Loss:   0.27, Val Acc:  92.86%, Time: 0:00:25
Iter:   1000, Train Loss:   0.17, Train Acc:  95.31%, Val Loss:   0.28, Val Acc:  92.82%, Time: 0:00:28
Iter:   1100, Train Loss:    0.2, Train Acc:  93.75%, Val Loss:   0.23, Val Acc:  93.26%, Time: 0:00:31
Iter:   1200, Train Loss:  0.081, Train Acc:  98.44%, Val Loss:   0.25, Val Acc:  92.96%, Time: 0:00:33
Iter:   1300, Train Loss:  0.052, Train Acc: 100.00%, Val Loss:   0.24, Val Acc:
Iter:   1400, Train Loss:    0.1, Train Acc:  95.31%, Val Loss:   0.22, Val Acc:  94.12%, Time: 0:00:39
Iter:   1500, Train Loss:   0.12, Train Acc:  98.44%, Val Loss:   0.23, Val Acc:  93.58%, Time: 0:00:41
Epoch: 3
Iter:   1600, Train Loss:    0.1, Train Acc:  96.88%, Val Loss:   0.26, Val Acc:  92.34%, Time: 0:00:44
Iter:   1700, Train Loss:  0.018, Train Acc: 100.00%, Val Loss:   0.22, Val Acc:  93.46%, Time: 0:00:47
Iter:   1800, Train Loss:  0.036, Train Acc: 100.00%, Val Loss:   0.28, Val Acc:  92.72%, Time: 0:00:50
No optimization for a long time, auto-stopping...
```
The best accuracy on the validation set was 94.12%, and training stopped automatically after only 3 epochs.
The accuracy and loss curves can be viewed in TensorBoard.
Test
Run python run_cnn.py test to evaluate on the test set.
```
Configuring CNN model...
Loading test data...
Testing...
Test Loss:   0.14, Test Acc:  96.04%
Precision, Recall and F1-Score...
               precision    recall  f1-score   support

       Sports       0.99      0.99      0.99      1000
      Finance       0.96      0.99      0.97      1000
        House       1.00      1.00      1.00      1000
         Home       0.95      0.91      0.93      1000
    Education       0.95      0.89      0.92      1000
   Technology       0.94      0.97      0.95      1000
      Fashion       0.95      0.97      0.96      1000
     Politics       0.94      0.94      0.94      1000
        Games       0.97      0.96      0.97      1000
Entertainment       0.95      0.98      0.97      1000

  avg / total       0.96      0.96      0.96     10000

Confusion Matrix...
[[991 0 0 2 1 0 4 1]
 [0 992 0 0 2 1 0 50 0]
 [0 1 996 0 11 0 0 1]
 [0 14 0 912 7 15 9 29 3 11]
 [29 0 12 892 22 18 21 10 14]
 [0 0 10 1 968 4 3 12 2]
 [10 0 94 4 971 0 2 9]
 [1 16 0 4 18 12 1 941 16]
 [2 41 5 4 5 10 1 962 6]
 [1 0 1 6 4 3 5 0 1 979]]
Time usage: 0:00:05
```
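The report above follows the layout of scikit-learn's classification_report. A sketch of how such a report can be produced, assuming y_test_cls holds the true class ids and y_pred_cls the predicted ones (both 1-D arrays):

```python
from sklearn import metrics

print(metrics.classification_report(y_test_cls, y_pred_cls, target_names=categories))
print(metrics.confusion_matrix(y_test_cls, y_pred_cls))
```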
The accuracy on the test set reaches 96.04%, and precision, recall and F1-score exceed 0.9 for every category.
The confusion matrix also shows that the classification quality is very good.
Recurrent neural network (RNN)
Configuration items
The configurable parameters of the RNN, defined in rnn_model.py, are shown below.
```python
class TRNNConfig(object):
    """RNN configuration parameters"""

    # model parameters
    embedding_dim = 64        # word vector dimension
    seq_length = 600          # sequence length
    num_classes = 10          # number of classes
    vocab_size = 5000         # vocabulary size

    num_layers = 2            # number of hidden layers
    hidden_dim = 128          # neurons in the hidden layer
    rnn = 'gru'               # lstm or gru

    dropout_keep_prob = 0.8   # dropout keep probability
    learning_rate = 1e-3      # learning rate

    batch_size = 128          # training batch size
    num_epochs = 10           # total number of epochs

    print_per_batch = 100     # print a result every N batches
    save_per_batch = 10       # write to tensorboard every N batches
```
RNN model
See the rnn_model.py implementation for details.
The general structure is: embedding → multi-layer GRU/LSTM → output at the last time step → fully connected layer (dropout + ReLU) → softmax classifier.
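As with the CNN, here is a minimal TensorFlow 1.x sketch of the recurrent part, built from the configuration above; names are illustrative and may differ from those in rnn_model.py:

```python
import tensorflow as tf

config = TRNNConfig()

input_x = tf.placeholder(tf.int32, [None, config.seq_length], name='input_x')
keep_prob = tf.placeholder(tf.float32, name='keep_prob')

# character embedding lookup: [batch, 600] -> [batch, 600, 64]
embedding = tf.get_variable('embedding', [config.vocab_size, config.embedding_dim])
embedding_inputs = tf.nn.embedding_lookup(embedding, input_x)

def rnn_cell():
    # GRU or LSTM according to the config, with dropout on the cell outputs
    if config.rnn == 'gru':
        cell = tf.contrib.rnn.GRUCell(config.hidden_dim)
    else:
        cell = tf.contrib.rnn.BasicLSTMCell(config.hidden_dim)
    return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

# stack num_layers recurrent layers and run them over the embedded sequence
cells = tf.contrib.rnn.MultiRNNCell([rnn_cell() for _ in range(config.num_layers)])
outputs, _ = tf.nn.dynamic_rnn(cells, embedding_inputs, dtype=tf.float32)
last = outputs[:, -1, :]  # output at the last time step: [batch, hidden_dim]

# fully connected layer with dropout and ReLU, then softmax classifier
fc = tf.nn.relu(tf.nn.dropout(tf.layers.dense(last, config.hidden_dim), keep_prob))
logits = tf.layers.dense(fc, config.num_classes)
y_pred = tf.argmax(tf.nn.softmax(logits), 1)
```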
Training and validation
This part of the code is very similar to run_cnn.py; only the model and a few directory names change.
Run python run_rnn.py train to begin training.
If you have trained this model before, delete the tensorboard/textrnn directory first to avoid overlapping TensorBoard training results.
```
Configuring RNN model...
Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:14
Training and evaluating...
Epoch: 1
Iter:      0, Train Loss:    2.3, Train Acc:   8.59%, Val Loss:    2.3, Val Acc:  11.96%, Time: 0:00:08 *
Iter:    100, Train Loss:   0.95, Train Acc:  64.06%, Val Loss:    1.3, Val Acc:  53.06%, Time: 0:01:15 *
Iter:    200, Train Loss:   0.61, Train Acc:  79.69%, Val Loss:   0.94, Val Acc:  69.88%, Time: 0:02:22 *
Iter:    300, Train Loss:   0.49, Train Acc:  85.16%, Val Loss:   0.63, Val Acc:  81.44%, Time: 0:03:29 *
Epoch: 2
Iter:    400, Train Loss:   0.23, Train Acc:  92.97%, Val Loss:    0.6, Val Acc:  82.86%, Time: 0:04:36 *
Iter:    500, Train Loss:   0.27, Train Acc:  92.97%, Val Loss:   0.47, Val Acc:  86.72%, Time: 0:05:43 *
Iter:    600, Train Loss:   0.13, Train Acc:  98.44%, Val Loss:   0.43, Val Acc:  87.46%, Time: 0:06:50 *
Iter:    700, Train Loss:   0.24, Train Acc:  91.41%, Val Loss:   0.46, Val Acc:  87.12%, Time: 0:07:57
Epoch: 3
Iter:    800, Train Loss:   0.11, Train Acc:  96.09%, Val Loss:   0.49, Val Acc:  87.02%, Time: 0:09:03
Iter:    900, Train Loss:   0.15, Train Acc:  96.09%, Val Loss:   0.55, Val Acc:  85.86%, Time: 0:10:10
Iter:   1000, Train Loss:   0.17, Train Acc:  96.09%, Val Loss:   0.43, Val Acc:  89.44%, Time: 0:11:18 *
Iter:   1100, Train Loss:   0.25, Train Acc:  93.75%, Val Loss:   0.42, Val Acc:  88.98%, Time: 0:12:25
Epoch: 4
Iter:   1200, Train Loss:   0.14, Train Acc:  96.09%, Val Loss:   0.39, Val Acc:  89.82%, Time: 0:13:32 *
Iter:   1300, Train Loss:    0.2, Train Acc:  96.09%, Val Loss:   0.43, Val Acc:  88.68%, Time: 0:14:38
Iter:   1400, Train Loss:  0.012, Train Acc: 100.00%, Val Loss:   0.37, Val Acc:  90.58%, Time: 0:15:45 *
Iter:   1500, Train Loss:   0.15, Train Acc:  96.88%, Val Loss:   0.39, Val Acc:  90.58%, Time: 0:16:52
Epoch: 5
Iter:   1600, Train Loss:  0.075, Train Acc:  97.66%, Val Loss:   0.41, Val Acc:  89.90%, Time: 0:17:59
Iter:   1700, Train Loss:  0.042, Train Acc:  98.44%, Val Loss:   0.41, Val Acc:  90.08%, Time: 0:19:06
Iter:   1800, Train Loss:   0.08, Train Acc:  97.66%, Val Loss:   0.38, Val Acc:  91.36%, Time: 0:20:13 *
Iter:   1900, Train Loss:  0.089, Train Acc:  98.44%, Val Loss:   0.39, Val Acc:  90.18%, Time: 0:21:20
Epoch: 6
Iter:   2000, Train Loss:  0.092, Train Acc:  96.88%, Val Loss:   0.36, Val Acc:  91.42%, Time: 0:22:27 *
Iter:   2100, Train Loss:  0.062, Train Acc:  98.44%, Val Loss:   0.39, Val Acc:  90.56%, Time: 0:23:34
Iter:   2200, Train Loss:  0.053, Train Acc:  98.44%, Val Loss:   0.39, Val Acc:  90.02%, Time: 0:24:41
Iter:   2300, Train Loss:   0.12, Train Acc:  96.09%, Val Loss:   0.37, Val Acc:  90.84%, Time: 0:25:48
Epoch: 7
Iter:   2400, Train Loss:  0.014, Train Acc: 100.00%, Val Loss:   0.41, Val Acc:  90.38%, Time: 0:26:55
Iter:   2500, Train Loss:   0.14, Train Acc:  96.88%, Val Loss:   0.37, Val Acc:  91.22%, Time: 0:28:01
Iter:   2600, Train Loss:   0.11, Train Acc:  96.88%, Val Loss:   0.43, Val Acc:  89.76%, Time: 0:29:08
Iter:   2700, Train Loss:  0.089, Train Acc:  97.66%, Val Loss:   0.37, Val Acc:  91.18%, Time: 0:30:15
Epoch: 8
Iter:   2800, Train Loss: 0.0081, Train Acc: 100.00%, Val Loss:   0.44, Val Acc:  90.66%, Time: 0:31:22
Iter:   2900, Train Loss:  0.017, Train Acc: 100.00%, Val Loss:   0.44, Val Acc:  89.62%, Time: 0:32:29
Iter:   3000, Train Loss:  0.061, Train Acc:  96.88%, Val Loss:   0.43, Val Acc:  90.04%, Time: 0:33:36
No optimization for a long time, auto-stopping...
```
The best accuracy on the validation set was 91.42%, reached after 8 epochs of training, which is much slower than the CNN.
The accuracy and loss curves can be viewed in TensorBoard.
Test
Run python run_rnn.py test to evaluate on the test set.
```
Testing...
Test Loss:   0.21, Test Acc:  94.22%
Precision, Recall and F1-Score...
               precision    recall  f1-score   support

       Sports       0.99      0.99      0.99      1000
      Finance       0.91      0.99      0.95      1000
        House       1.00      1.00      1.00      1000
         Home       0.97      0.73      0.83      1000
    Education       0.91      0.92      0.91      1000
   Technology       0.93      0.96      0.94      1000
      Fashion       0.89      0.97      0.93      1000
     Politics       0.93      0.93      0.93      1000
        Games       0.95      0.97      0.96      1000
Entertainment       0.97      0.96      0.97      1000

  avg / total       0.94      0.94      0.94     10000

Confusion Matrix...
[0 990 1 1 1 1 0 6 0 0]
[0 2 996 1 1 0 0 0 0 0 0]
[2 71 1 731 51 20 88 28 3 5]
[1 3 0 7 918 23 4 31 9 4]
[1 3 0 3 0 964 3 5 21 0]
[1 0 1 7 1 3 972 0 6 9]
[0 16 0 0 22 26 0 931 23]
[23 0 0 22 12 0 972 7]
[0 3 1 1 7 3 11 5 9 960]]
Time usage: 0:00:33
```
The accuracy on the test set reaches 94.22%, and precision, recall and F1-score exceed 0.9 for every category except Home.
The confusion matrix also shows good classification quality.
Comparing the two models, the RNN performs noticeably worse on the Home category, while the other categories are close to the CNN's results.
The parameters could be tuned further for even better results.
Source code: github.com/gaussic/tex…
This article is from the Global Artificial Intelligence WeChat public account.