This article shows how to use Keras-Bert to implement a multi-class text classification task, fine-tuning BERT in the process.

The project structure

(Figure: Keras-BERT text multi-class classification project structure)

The Python third-party modules are as follows:

pandas==0.23.4
Keras==2.3.1
keras_bert==0.83.0
numpy==1.16.4
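If any of these are missing, they can be installed with pip in one go (version numbers as reconstructed above; the PyPI package name for keras_bert is keras-bert):

pip install pandas==0.23.4 Keras==2.3.1 keras-bert==0.83.0 numpy==1.16.4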

The data set

The multi-class datasets used in this article are the Sougou small classification dataset and the THUCNews dataset, introduced below:

  • Sougou small classification dataset

There are five categories: sports, health, military, education, and automobile. The data is split into a training set and a test set, with 800 training samples and 100 test samples per category.

  • THUCNews data set

There are 10 categories: sports, finance, real estate, home furnishing, education, technology, fashion, politics, games, and entertainment. The split is 5000 training samples and 1000 test samples per category. Both corpora are stored as CSV files, whose layout is sketched below.
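Both datasets are two-column CSV files, label first and news text second; this is the layout the training script below relies on when it unpacks each row with train_df.iloc[i, :]. A minimal sketch of inspecting the data (the file path is the one the script uses; the row contents shown in the comment are hypothetical):

import pandas as pd

train_df = pd.read_csv("data/cnews/cnews_train.csv").fillna(value="")
# each row unpacks into (label, content), e.g. a label such as "sports" plus the article text
label, content = train_df.iloc[0, :]
print(label, content[:30])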

Model training

The complete code for the model training script model_train.py is as follows:

# -*- coding: utf-8 -*-
import json
import codecs
import pandas as pd
import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam

# suggested maximum text length <= 510
maxlen = 300
BATCH_SIZE = 8
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'


token_dict = {}
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)


class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            else:
                R.append('[UNK]')   # remaining characters are mapped to [UNK]
        return R


tokenizer = OurTokenizer(token_dict)


def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])


class DataGenerator:

    def __init__(self, data, batch_size=BATCH_SIZE):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append(y)
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    X1, X2, Y = [], [], []


def create_cls_model(num_labels):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

    # fine-tune: make all BERT layers trainable
    for layer in bert_model.layers:
        layer.trainable = True

    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))

    x = bert_model([x1_in, x2_in])
    cls_layer = Lambda(lambda x: x[:, 0])(x)                   # take the vector corresponding to [CLS] for classification
    p = Dense(num_labels, activation='softmax')(cls_layer)     # softmax output over the classes

    model = Model([x1_in, x2_in], p)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(1e-5),   # use a small enough learning rate
        metrics=['accuracy']
    )
    # model.summary()

    return model


if __name__ == '__main__':

    # data processing: read the training set and test set
    print("begin data processing...")
    train_df = pd.read_csv("data/cnews/cnews_train.csv").fillna(value="")
    test_df = pd.read_csv("data/cnews/cnews_test.csv").fillna(value="")

    labels = train_df["label"].unique()
    with open("label.json"."w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        train_data.append((content, label_id))

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        test_data.append((content, label_id))

    print("finish data processing!"Model = create_cls_model()len(labels))
    train_D = DataGenerator(train_data)
    test_D = DataGenerator(test_data)

    print("begin model training...")
    model.fit_generator(
        train_D.__iter__(),
        steps_per_epoch=len(train_D),
        epochs=3,
        validation_data=test_D.__iter__(),
        validation_steps=len(test_D)
    )

    print("finish model training!") # Save the model.save('cls_cnews.h5')
    print("Model saved!")

    result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))
    print("Model evaluation Results :", result)
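Note the two inputs the script feeds the model: tokenizer.encode(first=text) returns x1, the token ids (with [CLS] and [SEP] added), and x2, the segment ids, which are all zero for single-sentence input. A quick sanity check, assuming the pre-trained BERT files have been downloaded to the chinese_L-12_H-768_A-12 path above:

from model_train import tokenizer

x1, x2 = tokenizer.encode(first="这是一个测试")
print(x1)   # token ids: [CLS] id, one id per character, [SEP] id
print(x2)   # segment ids: all 0 for a single sentence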

The model structure is as follows:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
model_2 (Model)                 (None, None, 768)    101677056   input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 768)          0           model_2[1][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 10)           7690        lambda_1[0][0]                   
==================================================================================================
Total params: 101,684,746
Trainable params: 101,684,746
Non-trainable params: 0

In the model above, we take the output vector corresponding to [CLS] and feed it into a fully connected layer with a softmax activation. That completes the multi-class model: simple and convenient.
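To see why the Lambda layer works, it may help to trace the tensor shapes through the classification head (a minimal numpy sketch with a hypothetical batch of 2; the real model does this inside Keras):

import numpy as np

batch, seq_len, hidden, num_labels = 2, 300, 768, 10
bert_out = np.random.rand(batch, seq_len, hidden)   # stand-in for bert_model([x1, x2]): (2, 300, 768)
cls_vec = bert_out[:, 0]                            # what Lambda(lambda x: x[:, 0]) does: (2, 768)
print(cls_vec.shape)                                # Dense(10, softmax) then maps (2, 768) -> (2, 10)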

Model evaluation

The complete code for the model evaluation script model_evaluate.py is as follows:

# -*- coding: utf-8 -*-
# model evaluation script
import json
import numpy as np
import pandas as pd
from keras.models import load_model
from keras_bert import get_custom_objects
from sklearn.metrics import classification_report

from model_train import token_dict, OurTokenizer

maxlen = 300

# load the trained model
model = load_model("cls_cnews.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json"."r", encoding="utf-8"Loads (f.read()) def predict_single_text(text): Text =text [:maxlen] x1, x2 = tokenizer.encode(first=text) x1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
    X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen elsePredicted = model.predict([[X1], [x2]]) y = np.argmax(predicted[[X1], [x2]])0])
    returnLabel_dict [STR (y)] def evaluate(): test_df = pd.read_csv("data/cnews/cnews_test.csv").fillna(value="")
    true_y_list, pred_y_list = [], []
    for i in range(test_df.shape[0]):
        print("predict %d samples" % (i + 1))
        true_y, content = test_df.iloc[i, :]
        pred_y = predict_single_text(content)
        true_y_list.append(true_y)
        pred_y_list.append(pred_y)

    return classification_report(true_y_list, pred_y_list, digits=4)


output_data = evaluate()
print("model evaluate result:\n")
print(output_data)

Run the code above to evaluate the two datasets, and the results are as follows:

  • Sougou data set

Model parameters: BATCH_SIZE = 8, maxlen = 256, epochs = 10

Evaluation Results:

              precision    recall  f1-score   support

      sports     0.9802    1.0000    0.9900        99
      health     0.9495    0.9495    0.9495        99
    military     1.0000    1.0000    1.0000        99
   education     0.9307    0.9495    0.9400        99
         car     0.9895    0.9495    0.9691        99

    accuracy                         0.9697       495
   macro avg     0.9700    0.9697    0.9697       495
weighted avg     0.9700    0.9697    0.9697       495
  • THUCNews data set

Model parameters: BATCH_SIZE = 8, maxlen = 300, epochs = 3

Evaluation Results:

               precision    recall  f1-score   support

       sports     0.9970    0.9990    0.9980      1000
entertainment     0.9890    0.9890    0.9890      1000
    household     0.9949    0.7820    0.8757      1000
  real estate     0.8006    0.8710    0.8343      1000
    education     0.9753    0.9480    0.9615      1000
      fashion     0.9708    0.9980    0.9842      1000
     politics     0.9318    0.9560    0.9437      1000
         game     0.9851    0.9950    0.9900      1000
   technology     0.9689    0.9970    0.9828      1000
      finance     0.9377    0.9930    0.9645      1000

    accuracy                         0.9528     10000
   macro avg     0.9551    0.9528    0.9524     10000
weighted avg     0.9551    0.9528    0.9524     10000

Model prediction

The complete code for the model prediction script model_predictor.py is as follows:

# -*- coding: utf-8 -*-
# @Time : 2020/12/23 15:28
# @Author : Jclian91
# @File : model_predictor.py
# @Place : Yangpu, Shanghai
import time
import json
import numpy as np

from model_train import token_dict, OurTokenizer
from keras.models import load_model
from keras_bert import get_custom_objects

maxlen = 256

# load the trained model
model = load_model("cls_sougou.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json"."r", encoding="utf-8") as f: label_dict = json.loads(f.read()) s_time = time.time() #"When it comes to hard-drive suVs, what models come to mind? By what is known as the "hegemony" Toyota prado (configuration) | inquiry, or called "lynx" pajero car, or is "deceit philandering man car" feelings Mercedes G," \
       "The Prince of the Desert. In short, as countries around the world pay more and more attention to environmental protection, those big-engine off-road SUVs will gradually disappear in the near future, so rather than miss it," \
       "Why don't you get your hands on one of those hard-core suVs you want while you're still young? And what I want to talk to you about today is the world's top 10 hard-hitting suVs," \
       "Off-road fans might want to think about which one is your favorite, and get started without further ado."Text =text [:maxlen] x1, x2 = tokenizer.encode(first=text) x1 = x1 + [0] * (maxlen-len(x1)) if len(x1) < maxlen else x1
X2 = x2 + [0] * (maxlen-len(x2)) if len(x2) < maxlen elsePredicted = model.predict([[X1], [x2]]) y = np.argmax(predicted[[X1], [x2]])0])


print("Original: %s" % text)
print("Prediction label: %s" % label_dict[str(y)])
e_time = time.time()
print("cost time:", e_time-s_time)
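Loading the model dominates the cost measured above, so if many texts need classifying it is cheaper to tokenize them all and call model.predict once. A minimal sketch (not part of the original script), reusing the tokenizer, maxlen, model, and label_dict defined above:

def predict_batch(texts):
    X1, X2 = [], []
    for t in texts:
        t = t[:maxlen]
        x1, x2 = tokenizer.encode(first=t)
        X1.append(x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1)
        X2.append(x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2)
    predicted = model.predict([np.array(X1), np.array(X2)])    # one forward pass for the whole batch
    return [label_dict[str(np.argmax(p))] for p in predicted]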

We now use the trained models to make predictions on new samples.

  • Sougou data set
Original text: What models come to mind when you think of hard-core off-road SUVs? The Toyota Prado, known as the "overlord"; the Pajero, nicknamed the "lynx"; the Mercedes G-Class; or the "Desert Prince". In short, as countries around the world pay more and more attention to environmental protection, those large-displacement off-road SUVs will gradually disappear from our sight in the near future, so rather than miss out, it is better to get, while still young, a hard-core off-road SUV you can admire. Today, I'd like to talk to you about the world's top 10 hard-core SUVs. After watching them, you might want to think about which one is your favorite. Let's get started.
Prediction label: car

Original text: According to the Elson Air Force Base website, the U.S. Air Force recently held an "Elephant Walk" exercise off the coast of Alaska: 30 fighter jets and 2 tankers took off from Elson Air Force Base and completed the drill.
Prediction label: military

Original text: During the 13th Five-Year Plan period (2016-2020), the unified compilation of textbooks for compulsory education in China covered all grades, and the unified senior high school textbooks for three subjects already cover 20 provinces; full coverage of all provinces is expected by 2022, and of all grades by 2025. The revision of the compulsory education curriculum plan and curriculum standards will be completed next year, Tian Huisheng, director of the Ministry of Education's textbook bureau, said at a press conference yesterday.
Prediction label: education
  • THUCNews data set
Original text: On December 26, Beijing time, the 2020-21 NBA Christmas games tipped off as promised. The Los Angeles Lakers faced the Dallas Mavericks at home. Going all out, the Lakers swept past the Mavericks 138-115, picking up their first win of the season and handing their opponents a second straight loss.
Prediction label: sports

Original text: In the past two years, phone screens have been upgraded constantly, and high refresh rates have become a trend. However good the performance, it cannot be shown smoothly without a good screen, so the screen largely determines how a phone feels to use; it is the window onto the phone's hardware. It is foreseeable that future high-end phones, while pushing performance, will also keep raising the bar for screens.
Prediction label: technology

Original text: The Songjiang Sheshan area has gone without a luxury housing market for too long, leaving many high-end buyers in the region out in the cold. Fortunately, the International Trade Sheshan Original Villa, launched not long ago, has satisfied this demand in one stroke. Whether in community planning, product types, decoration and fittings, or design, Sheshan Original Villa holds its own against the surrounding villas worth tens of millions; even more surprising, these quality villas cost only the price of an apartment!
Prediction label: real estate

Conclusion

The project code is available on GitHub: github.com/percent4/ke… .

In a follow-up, we'll see how to use Keras-Bert for multi-label text classification.
