Python natural language processing: An easy start with text classification


Author: Jclian, who likes algorithms, loves to share, and hopes to make more like-minded friends on the path of learning Python.

Background

Text classification is one of the most common and important tasks in NLP. A model is trained on input texts and their categories so that it generalizes and can predict the category of new text well. It is widely used in many fields, such as spam filtering, public opinion analysis and news classification.

Project

First, we need a sample dataset. We choose THUCNews, which was built by filtering the historical data of the Sina News RSS subscription channels from 2005 to 2011 and contains 740,000 news documents (2.19 GB), all in UTF-8 plain text format. Based on the original Sina news classification system, we select 10 candidate categories: sports, entertainment, home, real estate, education, fashion, current politics, games, technology, finance and economics.

(Figure: text classification project structure)

Next, we train a text classification model with Kashgari-TF. We use the CNN-LSTM model, and the complete Python code (text_classification_model_train.py) is as follows:

# -*- coding: utf-8 -*-
# time: 2019-08-13 11:16
# place: Pudong Shanghai

from kashgari.tasks.classification import CNN_LSTM_Model

def load_data(data_type):
    with open('./data/cnews.%s.txt' % data_type, 'r', encoding='utf-8') as f:
        content = [_.strip() for _ in f.readlines() if _.strip()]

    x, y = [], []
    for line in content:
        # Each line holds a label, whitespace, then the text
        label, text = line.split(maxsplit=1)
        y.append(label)
        x.append([_ for _ in text])  # character-level tokens

    return x, y

# Load the training, validation and test sets
train_x, train_y = load_data('train')
valid_x, valid_y = load_data('val')
test_x, test_y = load_data('test')

# Train the CNN-LSTM model
model = CNN_LSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, batch_size=16, epochs=5)

# Evaluate on the test set, then save the model
model.evaluate(test_x, test_y)
model.save('text_classification_model')
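To make the expected input format concrete, here is a minimal sketch of the parsing step in load_data, run on two made-up lines instead of the real cnews files. The sample labels and sentences are invented for illustration, and the helper name parse_lines is hypothetical. The only assumption taken from the training script is that each line holds a label, whitespace, then the text.

```python
import io

# Two made-up sample lines in the assumed "label<whitespace>text" layout.
# They are illustrative only and are not taken from THUCNews.
sample_file = io.StringIO("sports\tgreat match today\n\nfinance\tstocks rose\n")


def parse_lines(f):
    """Split each non-empty line into a label and a list of characters."""
    x, y = [], []
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines, as load_data does
        label, text = line.split(maxsplit=1)
        y.append(label)
        x.append(list(text))  # character-level tokens
    return x, y


x, y = parse_lines(sample_file)
print(y)         # labels, one per non-empty line
print(x[1][:6])  # first six characters of the second text
```

Because split(maxsplit=1) splits only on the first run of whitespace, spaces inside the text itself are preserved and become tokens of their own.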

The model summary printed during training is as follows:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 2544)              0
_________________________________________________________________
layer_embedding (Embedding)  (None, 2544, 100)         553200
_________________________________________________________________
conv1d (Conv1D)              (None, 2544, 32)          9632
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 1272, 32)          0
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM)       (None, 100)               53600
_________________________________________________________________
dense (Dense)                (None, 10)                1010
=================================================================
Total params: 617,442
Trainable params: 617,442
Non-trainable params: 0
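The parameter counts in the summary can be reproduced by hand, which is a quick way to check how the layers fit together. The sketch below is an assumption-laden reconstruction: the vocabulary size (5532) and the convolution kernel width (3) are not stated in the article and are inferred from the counts themselves.

```python
# Sizes read off the model summary; vocab_size and kernel_size are
# inferred from the parameter counts, not stated in the article.
vocab_size, embed_dim = 5532, 100
kernel_size, filters = 3, 32
lstm_units, num_classes = 100, 10

# Embedding: one embed_dim vector per vocabulary entry
embedding = vocab_size * embed_dim

# Conv1D: kernel weights plus one bias per filter
conv1d = kernel_size * embed_dim * filters + filters

# CuDNNLSTM: 4 gates, each with input weights, recurrent weights,
# and two bias vectors (hence the 2 * lstm_units term)
lstm = 4 * (filters * lstm_units + lstm_units * lstm_units + 2 * lstm_units)

# Dense: weights plus one bias per class
dense = lstm_units * num_classes + num_classes

total = embedding + conv1d + lstm + dense
print(embedding, conv1d, lstm, dense, total)
```

The embedding layer accounts for roughly 90% of the parameters, so most of the model's size comes from the character vocabulary rather than from the CNN-LSTM layers.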

The model was trained for 5 epochs with batch_size set to 16. After training, the results on the training set and validation set are as follows:

Dataset	accuracy	loss
Training set	0.9661	0.1184
Validation set	0.9204	0.2567

The results on the test set are as follows:

                       precision     recall   f1-score    support

               sports     0.9852     0.9970     0.9911       1000
        entertainment     0.9938     0.9690     0.9813       1000
                 home     0.9384     0.8830     0.9098       1000
          real estate     0.9490     0.9680     0.9584       1000
            education     0.9650     0.8820     0.9216       1000
              fashion     0.9418     0.9710     0.9562       1000
     current politics     0.9732     0.9450     0.9589       1000
                games     0.9454     0.9700     0.9576       1000
           technology     0.8910     0.9560     0.9223       1000
finance and economics     0.9566     0.9920     0.9740       1000

             accuracy                           0.9533      10000
            macro avg     0.9539     0.9533     0.9531      10000
         weighted avg     0.9539     0.9533     0.9531      10000
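The summary rows of the report follow directly from the per-class rows. As a small check, the sketch below recomputes the macro averages as unweighted means of the ten per-class scores. The numbers are copied from the report, and the variable names are illustrative.

```python
# Per-class (precision, recall, f1) copied from the test-set report,
# in the order the classes are listed there.
rows = [
    (0.9852, 0.9970, 0.9911),  # sports
    (0.9938, 0.9690, 0.9813),  # entertainment
    (0.9384, 0.8830, 0.9098),  # home
    (0.9490, 0.9680, 0.9584),  # real estate
    (0.9650, 0.8820, 0.9216),  # education
    (0.9418, 0.9710, 0.9562),  # fashion
    (0.9732, 0.9450, 0.9589),  # current politics
    (0.9454, 0.9700, 0.9576),  # games
    (0.8910, 0.9560, 0.9223),  # technology
    (0.9566, 0.9920, 0.9740),  # finance and economics
]

n = len(rows)
# Macro average: plain mean over classes, ignoring class sizes
macro_precision = round(sum(r[0] for r in rows) / n, 4)
macro_recall = round(sum(r[1] for r in rows) / n, 4)
macro_f1 = round(sum(r[2] for r in rows) / n, 4)

print(macro_precision, macro_recall, macro_f1)
```

Because every class has the same support (1000 samples), the weighted average equals the macro average, and the macro recall equals the overall accuracy.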

Overall, the trained model performs quite well. Next, we test its predictive power to see whether it generalizes to new texts from specific categories.

Test

Now that we have the trained text_classification_model, let's use it to predict new data with the following code (model_predictor.py):

# -*- coding: utf-8 -*-
# time: 2019-08-14 00:21
# place: Pudong Shanghai

import kashgari

# Load the saved model
loaded_model = kashgari.utils.load_model('text_classification_model')

text = 'CFLD was founded in 1998, formerly known as Langfang Huaxia Real Estate Development Co., Ltd., with an initial registered capital of 2 million yuan, of which Wang Wenxue contributed 1.6 million yuan and Langfang Rongtong Material Trading Co., Ltd. contributed 400,000 yuan. After several equity transfers and capital increases, the company was restructured into a joint-stock company in 2007. It completed a backdoor listing in 2011.'

# Split the text into characters, matching the training input
x = [[_ for _ in text]]

label = loaded_model.predict(x)
print('Predicted category: %s' % label)
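When several texts need to be classified, the character splitting and the predict call can be wrapped in one helper. The sketch below is illustrative only. It assumes the model's predict method accepts a list of token lists and returns one label per text, as the script above suggests. predict_texts and FakeModel are hypothetical names, and FakeModel is a stand-in so the example runs without a trained model.

```python
def predict_texts(model, texts):
    """Split each text into characters and return one label per text."""
    batch = [list(text) for text in texts]
    return model.predict(batch)


class FakeModel:
    """Hypothetical stand-in for a loaded model; not part of Kashgari."""

    def predict(self, batch):
        # Toy rule: label each text by its length in characters
        return ['long' if len(tokens) > 5 else 'short' for tokens in batch]


labels = predict_texts(FakeModel(), ['hello world', 'hi'])
print(labels)
```

With a real model, FakeModel() would be replaced by the object returned from load_model.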

Here are the results:

Original text 1: CFLD was founded in 1998, formerly known as Langfang Huaxia Real Estate Development Co., Ltd., with an initial registered capital of 2 million yuan, including 1.6 million yuan invested by Wang Wenxue and 400,000 yuan invested by Langfang Rongtong Material Trading Co., Ltd. After several equity transfers and capital increases, the company was transformed into a joint-stock company in 2007. It completed a backdoor listing in 2011.

Original text 2: Today’s popular short-sleeved shirts can be roughly divided into Hawaiian shirts, Cuban shirts, and bowling shirts. The styles differ slightly, but a single shirt can sometimes combine more than one. The most obvious feature of the “Cuban (collar) shirt” is its collar, which is usually designed as a V-shaped collar turned slightly outward, so it lacks the “first button” common on shirt collars. The body and collar are cut as one piece, which makes the whole shirt loose and comfortable.

Original text 3: Zhou Qi joined Xinjiang Guanghui Basketball Club in 2014 and successively won the National Youth Basketball League and the National Club Youth League championships with the club’s youth team. After being promoted to the first team, Zhou won the 25th Asian Champions League Cup with the team in 2016. In the 2016-2017 season, Zhou played a major role in winning Xinjiang Guanghui’s first championship. He played through an injury in the finals, which became a celebrated story.

Original text 4: Jay Chou, the executive producer of the racing film “Kick Ass”, released the director’s sidelight on the 13th, not only the real racing pictures are exposed, dozens of millions of racing cars in the international professional track, mountain road speed, the scene is huge and shocking, more revealed

Original text 5: According to a report released by the Intellectual Property Owners Association in the United States, the number of Property Owners in the United States has increased in recent years

Conclusion

Although the model classifies the test texts above reasonably well, it still makes some classification errors.
