Author: Jclian, who likes algorithms and loves to share, and hopes to make more like-minded friends to go further together on the path of learning Python!
background
Text classification is one of the most common and important tasks in NLP. The goal is to train a model on input texts and their categories so that it generalizes and can predict the category of new text well. It is widely used in many fields, such as spam filtering, public opinion analysis, and news classification.
project
First, we need a sample dataset. We choose THUCNews, which was built by filtering the historical data of the Sina News RSS subscription channels from 2005 to 2011. It contains 740,000 news documents (2.19 GB), all in UTF-8 plain text format. Based on the original Sina news classification system, we selected 10 candidate categories: sports, entertainment, home, real estate, education, fashion, current politics, games, technology, and finance and economics.
[Figure: structure of the text classification project]
Next, we train a text classification model with Kashgari-TF, using the CNN-LSTM model. The complete Python code (text_classification_model_train.py) is as follows:
# -*- coding: utf-8 -*-
# time: 2019-08-13 11:16
# place: Pudong Shanghai
from kashgari.tasks.classification import CNN_LSTM_Model

def load_data(data_type):
    # Each non-empty line of cnews.<data_type>.txt holds a label, whitespace, then the text.
    with open('./data/cnews.%s.txt' % data_type, 'r', encoding='utf-8') as f:
        content = [_.strip() for _ in f.readlines() if _.strip()]
    x, y = [], []
    for line in content:
        label, text = line.split(maxsplit=1)
        y.append(label)
        x.append([_ for _ in text])  # split the text into single characters
    return x, y

train_x, train_y = load_data('train')
valid_x, valid_y = load_data('val')
test_x, test_y = load_data('test')

# Train the CNN-LSTM classifier, evaluate it on the test set, then save it.
model = CNN_LSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, batch_size=16, epochs=5)
model.evaluate(test_x, test_y)
model.save('text_classification_model')
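The data-loading step above splits each line into a label and a list of single characters. A minimal sketch of that parsing on one in-memory line follows; the helper name `parse_line` and the sample line are hypothetical, invented for illustration and not taken from the dataset.

```python
def parse_line(line):
    """Split one 'label<whitespace>text' line into (label, list of characters)."""
    # maxsplit=1 keeps any later whitespace inside the text itself.
    label, text = line.split(maxsplit=1)
    return label, list(text)

# Hypothetical sample line (not from THUCNews).
sample = "sports Zhou Qi wins the title"
label, chars = parse_line(sample)
print(label)
print(chars[:7])
```

Because the split happens only once, spaces inside the text survive as their own tokens, which matches what `x.append([_ for _ in text])` produces in the training script.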
The model summary printed during training is as follows:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 2544)              0
_________________________________________________________________
layer_embedding (Embedding)  (None, 2544, 100)         553200
_________________________________________________________________
conv1d (Conv1D)              (None, 2544, 32)          9632
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 1272, 32)          0
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM)       (None, 100)               53600
_________________________________________________________________
dense (Dense)                (None, 10)                1010
=================================================================
Total params: 617,442
Trainable params: 617,442
Non-trainable params: 0
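The parameter counts in the summary can be checked by hand. The sketch below recomputes them under stated assumptions: the vocabulary size of 5532 is inferred from 553200 / 100, and a Conv1D kernel size of 3 is assumed because it reproduces the reported 9632. Neither value is stated in the original.

```python
# Values read off the summary; vocab_size and kernel_size are inferred assumptions.
vocab_size, embed_dim = 5532, 100
filters, kernel_size = 32, 3
lstm_units, num_classes = 100, 10

embedding = vocab_size * embed_dim
# One kernel per filter over all embedding channels, plus one bias per filter.
conv1d = kernel_size * embed_dim * filters + filters
# Four gates; CuDNNLSTM keeps two bias vectors per gate, hence 2 * lstm_units.
cudnn_lstm = 4 * (filters * lstm_units + lstm_units * lstm_units + 2 * lstm_units)
dense = lstm_units * num_classes + num_classes

total = embedding + conv1d + cudnn_lstm + dense
print(embedding, conv1d, cudnn_lstm, dense, total)
```

Under these assumptions the four layer counts sum to the reported total of 617,442 parameters, with the embedding layer accounting for most of them.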
The model was trained for 5 epochs with a batch size of 16. After training, the results on the training set and validation set are as follows:
Dataset | Accuracy | Loss |
---|---|---|
Training set | 0.9661 | 0.1184 |
Validation set | 0.9204 | 0.2567 |
The results on the test set are as follows:
                       precision    recall  f1-score   support

               sports     0.9852    0.9970    0.9911      1000
        entertainment     0.9938    0.9690    0.9813      1000
                 home     0.9384    0.8830    0.9098      1000
          real estate     0.9490    0.9680    0.9584      1000
            education     0.9650    0.8820    0.9216      1000
              fashion     0.9418    0.9710    0.9562      1000
     current politics     0.9732    0.9450    0.9589      1000
                games     0.9454    0.9700    0.9576      1000
           technology     0.8910    0.9560    0.9223      1000
finance and economics     0.9566    0.9920    0.9740      1000

             accuracy                         0.9533     10000
            macro avg     0.9539    0.9533    0.9531     10000
         weighted avg     0.9539    0.9533    0.9531     10000
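The summary rows of the report follow directly from the per-class rows. As a rough check, the sketch below recomputes the macro averages from the per-class values listed above. The variable names are illustrative, and because the inputs are already rounded to four decimals, small differences in the last digit are possible.

```python
# (precision, recall, f1) per class, copied from the test-set report.
per_class = {
    'sports': (0.9852, 0.9970, 0.9911),
    'entertainment': (0.9938, 0.9690, 0.9813),
    'home': (0.9384, 0.8830, 0.9098),
    'real estate': (0.9490, 0.9680, 0.9584),
    'education': (0.9650, 0.8820, 0.9216),
    'fashion': (0.9418, 0.9710, 0.9562),
    'current politics': (0.9732, 0.9450, 0.9589),
    'games': (0.9454, 0.9700, 0.9576),
    'technology': (0.8910, 0.9560, 0.9223),
    'finance and economics': (0.9566, 0.9920, 0.9740),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

n = len(per_class)
# Macro average: unweighted mean of each metric over the classes.
macro_precision = sum(p for p, _, _ in per_class.values()) / n
macro_recall = sum(r for _, r, _ in per_class.values()) / n
macro_f1 = sum(f for _, _, f in per_class.values()) / n
sports_f1 = f1_score(*per_class['sports'][:2])

print(round(macro_precision, 4), round(macro_recall, 4), round(macro_f1, 4))
```

Since every class has the same support (1000), the macro and weighted averages coincide, and the macro recall equals the overall accuracy.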
Overall, the trained model performs quite well. Next, it’s time to test its predictive power to see whether it generalizes to new text.
test
Now that we have the trained model text_classification_model, let’s use it to predict new data with the following code (model_predictor.py):
# -*- coding: utf-8 -*-
# time: 2019-08-14 00:21
# place: Pudong Shanghai
import kashgari

# Load the model saved by the training script.
loaded_model = kashgari.utils.load_model('text_classification_model')

text = 'CFLD was founded in 1998, formerly known as Langfang Huaxia Real Estate Development Co., LTD., with an initial registered capital of 2 million yuan, of which Wang Wenxue contributed 1.6 million yuan and Langfang Rongtong Material Trading Co., Ltd. contributed 400,000 yuan. After several equity transfers and capital increases, the company was restructured into a joint-stock company in 2007. Completed backdoor listing in 2011.'
# The model expects a list of character sequences, one per text.
x = [[_ for _ in text]]
label = loaded_model.predict(x)
print('Predicted category: %s' % label)
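To classify several texts in one call, the same character-level preprocessing can be wrapped in a small helper. This is only a sketch: `predict_categories` and `DummyModel` are hypothetical names, and the dummy stands in for the loaded Kashgari model so the example runs without it. It assumes only that the model exposes `predict` on a list of character sequences, as used above.

```python
def predict_categories(model, texts):
    """Character-tokenize each text and return one predicted label per text."""
    x = [list(text) for text in texts]
    return list(model.predict(x))

class DummyModel:
    """Stand-in for the loaded model; always answers 'unknown'."""
    def predict(self, x):
        return ['unknown' for _ in x]

labels = predict_categories(DummyModel(), ['first text', 'second text'])
print(labels)
```

With the real model, `DummyModel()` would be replaced by the object returned from `kashgari.utils.load_model('text_classification_model')`.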
Here are the results:
Original text 1: CFLD was founded in 1998, formerly known as Langfang Huaxia Real Estate Development Co., LTD., with an initial registered capital of 2 million yuan, of which Wang Wenxue contributed 1.6 million yuan and Langfang Rongtong Material Trading Co., LTD. contributed 400,000 yuan. After several equity transfers and capital increases, the company was restructured into a joint-stock company in 2007. Completed backdoor listing in 2011.
Original text 2: Today’s popular short-sleeved shirts can be roughly divided into Hawaiian shirts, Cuban shirts, and bowling shirts. The styles differ slightly, but a single shirt can sometimes combine more than one of them. The most obvious feature of the “Cuban (collar) shirt” is its collar, which is usually designed as a V-shaped collar turned slightly outward, so it lacks the “first button” common to shirt collars. The body and collar are cut as one piece, making the whole shirt loose and comfortable.
Original text 3: Zhou Qi joined Xinjiang Guanghui Basketball Club in 2014 and, playing for the club’s youth team, won the National Youth Basketball League and the National Club Youth League in succession. After being promoted to the first team, Zhou won the 25th Asian Champions League Cup with the team in 2016. In the 2016-2017 season, Zhou played a major role in winning the first championship for Xinjiang Guanghui. He played through an injury in the finals, which became a celebrated story.
Original text 4: On the 13th, the racing film “Kick Ass”, with Jay Chou as executive producer, released a behind-the-scenes featurette from the director. It not only shows real racing footage, with dozens of racing cars worth millions speeding on international professional tracks and mountain roads in huge, striking scenes, but also revealed
Original text 5: According to a report released by the Intellectual Property Owners Association in the United States, the number of Property Owners in the United States has increased in recent years
conclusion
Although the classification results in the tests above are quite good, some texts are still misclassified.
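In the test-set report, home (0.8830) and education (0.8820) have the lowest recall, so those are natural places to look for errors. One way to inspect them is to tally the (true label, predicted label) pairs that disagree. The sketch below does this with hypothetical labels invented for illustration; `confusion_pairs` is not part of Kashgari.

```python
from collections import Counter

def confusion_pairs(y_true, y_pred):
    """Count (true_label, predicted_label) pairs where the prediction is wrong."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

# Hypothetical labels, invented for illustration only.
y_true = ['home', 'home', 'education', 'sports']
y_pred = ['home', 'real estate', 'technology', 'sports']

errors = confusion_pairs(y_true, y_pred)
for (true_label, predicted), count in errors.most_common():
    print(true_label, '->', predicted, count)
```

Run on the real test labels and predictions, the most frequent pairs would show which categories the model confuses with each other.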