Regular season: Chinese news text title classification

I. Solution overview

1.1 Introduction:

Text classification automatically assigns category labels to a set of texts (or other entities) according to a given classification scheme, with the help of a computer. This competition targets news title classification: contestants train a news classifier on the provided news titles and category labels, and then classify the titles in the test set. The evaluation metric is accuracy = number of correctly classified titles / total number of titles to classify. PaddleNLP is the core text-domain development library of the PaddlePaddle ecosystem: it offers a simple, easy-to-use API covering the whole text-domain workflow, multi-scenario application examples, and a very rich collection of pre-trained models, and it is deeply adapted to PaddlePaddle 2.x.
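For concreteness, the metric can be sketched in a few lines (a hypothetical helper for illustration, not part of the competition code):

# Sketch of the evaluation metric: accuracy = correct / total
def accuracy(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# accuracy(['sports', 'stock'], ['sports', 'finance']) -> 0.5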

1.2 Data Introduction:

THUCNews was generated by filtering the historical data of the Sina News RSS subscription channels from 2005 to 2011. It contains 740,000 news documents (2.19 GB), all in UTF-8 plain text. For this competition, the original Sina News classification scheme was reorganized into 14 candidate categories: finance, lottery, real estate, stock, home, education, technology, society, fashion, politics, sports, constellation, games, and entertainment. A total of 832,471 labeled samples are provided.

The training and validation sets are provided as one sample per line in the format original title + \t + label; the test set contains only the original titles.
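Schematically, one training/validation line therefore looks like the following, where \t denotes the tab character (the title is one of the samples printed later in this notebook):

US says it supports emergency humanitarian assistance to North Korea\tpolitics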

1.3 Baseline:

The task is a fairly conventional short-text multi-class classification problem. This project is built mainly on PaddleNLP: the pre-trained model RoBERTa is fine-tuned on the provided training data to train and optimize a 14-class news classifier, and the trained model is then used to predict the test data and generate the submission file.

Note that this project requires a premium GPU environment! If video memory is insufficient, remember to reduce the batch size accordingly!

【Principle】Classic pre-trained model: BERT

II. Data reading and analysis

1. Data analysis

# Enter the directory where the competition dataset is stored
%cd /home/aistudio/data/data103654/

/home/aistudio/data/data103654
# Use pandas to read the dataset
import pandas as pd
train = pd.read_table('train.txt', sep='\t', header=None)  # training set
dev = pd.read_table('dev.txt', sep='\t', header=None)      # validation set
test = pd.read_table('test.txt', sep='\t', header=None)    # test set

print(f"train set size: {len(train)}\tdev set size: {len(dev)}\ttest set size: {len(test)}")
train set size: 752471    dev set size: 80000    test set size: 83599
# Add column names for easier data handling
train.columns = ["text_a", "label"]
dev.columns = ["text_a", "label"]
test.columns = ["text_a"]

# Concatenate the training and validation sets for statistical analysis
total = pd.concat([train, dev], axis=0)
# Create the font directory .fonts and copy the SimHei font into it
%cd ~
!mkdir -p .fonts
!cp data/data61659/simhei.ttf .fonts/

/home/aistudio
# Label distribution statistics over train + dev
print(total['label'].value_counts())
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# specify the default font
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['font.family'] ='sans-serif'
# Fix the issue where the minus sign '-' appears as a box
mpl.rcParams['axes.unicode_minus'] = False

total['label'].value_counts().plot.bar()
plt.show()
technology       162245
stock            153949
sports           130982
entertainment     92228
politics          62867
society           50541
education         41680
finance           36963
home              32363
games             24283
real estate       19922
fashion           13335
lottery            7598
constellation      3515
Name: label, dtype: int64
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
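The distribution is clearly imbalanced: the largest class (technology, 162,245) has roughly 46 times as many samples as the smallest (constellation, 3,515). The baseline below ignores this, but one optional mitigation is to weight the cross-entropy loss by inverse class frequency; a minimal sketch, assuming the id_label_dict mapping defined later in this notebook:

# Optional sketch (not used in this baseline): inverse-frequency class weights
import paddle

counts = total['label'].value_counts()
freq = [counts[id_label_dict[i]] for i in range(14)]
weights = paddle.to_tensor([max(freq) / f for f in freq], dtype='float32')
weighted_criterion = paddle.nn.CrossEntropyLoss(weight=weights)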

# Maximum text length
max(total['text_a'].str.len())
48
# Text length statistics: the titles are short, the longest in train + dev is 48 characters
total['text_a'].map(len).describe()

count    832471.000000
mean         19.388112
std           4.097139
min           2.000000
25%          17.000000
50%          20.000000
75%          23.000000
max          48.000000
Name: text_a, dtype: float64
# Length statistics for the test set show a similar distribution, though the longest test title is 84 characters
test['text_a'].map(len).describe()

count    83599.000000
mean        19.815022
std          3.883845
min          3.000000
25%         17.000000
50%         20.000000
75%         23.000000
max         84.000000
Name: text_a, dtype: float64
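Note that the longest test title (84) exceeds the training maximum (48), so a truncation length chosen from train + dev alone will truncate a handful of test titles. If you want to rule that out, take the maximum (or a high percentile) over all splits; a sketch:

# Sketch: choose the truncation length over all splits instead of train + dev only
all_len = pd.concat([total['text_a'], test['text_a']]).str.len()
print(all_len.max())                  # 84: nothing would be truncated
print(int(all_len.quantile(0.999)))   # a cheaper compromise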

2. Save the data

# Save the processed dataset files
train.to_csv('train.csv', sep='\t', index=False)  # training set, format: text_a \t label
dev.to_csv('dev.csv', sep='\t', index=False)      # validation set, format: text_a \t label
test.to_csv('test.csv', sep='\t', index=False)    # test set, format: text_a

III. Build the baseline model with PaddleNLP

1. PaddleNLP environment preparation

# Install the latest version of PaddleNLP
# !pip install --upgrade paddlenlp
# Import the required third-party libraries
import math
import numpy as np
import os
import collections
from functools import partial
import random
import time
import inspect
import importlib
from tqdm import tqdm
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import IterableDataset
from paddle.utils.download import get_path_from_url
# Import the packages required by PaddleNLP
import paddlenlp as ppnlp
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from paddlenlp.datasets import MapDataset
from paddle.dataset.common import md5file
from paddlenlp.datasets import DatasetBuilder

2. Data set definition

# The 14 target categories, taken from the training labels
label_list=list(train.label.unique())
print(label_list)
['technology', 'sports', 'politics', 'stock', 'entertainment', 'education', 'home', 'finance', 'real estate', 'society', 'games', 'lottery', 'constellation', 'fashion']
# id-to-label mapping
id_label_dict = {}
for i in range(len(label_list)):
    id_label_dict[i] = label_list[i]
print(id_label_dict)

# label-to-id mapping
label_id_dict = {}
for i in range(len(label_list)):
    label_id_dict[label_list[i]] = i
print(label_id_dict)

{0: 'technology', 1: 'sports', 2: 'politics', 3: 'stock', 4: 'entertainment', 5: 'education', 6: 'home', 7: 'finance', 8: 'real estate', 9: 'society', 10: 'games', 11: 'lottery', 12: 'constellation', 13: 'fashion'}
{'technology': 0, 'sports': 1, 'politics': 2, 'stock': 3, 'entertainment': 4, 'education': 5, 'home': 6, 'finance': 7, 'real estate': 8, 'society': 9, 'games': 10, 'lottery': 11, 'constellation': 12, 'fashion': 13}
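The two loops can also be written more compactly with enumerate; this is equivalent:

# Equivalent, more idiomatic construction of both mappings
id_label_dict = dict(enumerate(label_list))
label_id_dict = {label: i for i, label in enumerate(label_list)}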
def read(pd_data):
    for index, item in pd_data.iterrows():
        yield {'text_a': item['text_a'], 'label': label_id_dict[item['label']]}
# Build the training and validation datasets
from paddle.io import Dataset, Subset
from paddlenlp.datasets import MapDataset
from paddlenlp.datasets import load_dataset

train_dataset = load_dataset(read, pd_data=train, lazy=False)
dev_dataset = load_dataset(read, pd_data=dev, lazy=False)
for i in range(5):
    print(train_dataset[i])

{'text_a': 'US says it supports emergency humanitarian assistance to North Korea', 'label': 2}
...

3. Load the pre-trained model

# This time we use roberta-wwm-ext-large, which performs well on Chinese tasks. With pre-trained models, scale tends to work wonders: large variants usually beat their base counterparts
MODEL_NAME = "roberta-wwm-ext-large"

# Specifying the model name and the number of classes completes the fine-tuning network definition: a fully connected classification layer is stacked on top of the pre-trained model
model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=14)  # 14 classes

# The tokenizer converts raw input text into the input format the model accepts. Note that the tokenizer class must match the chosen model; see the PaddleNLP documentation
tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained(MODEL_NAME)

PaddleNLP supports not only the RoBERTa pre-trained models but also ERNIE, BERT, ELECTRA, and others; see the PaddleNLP model documentation for details.

The table below summarizes the pre-trained models currently supported by PaddleNLP. They can be used for tasks such as question answering, sequence classification, and token classification. PaddleNLP also provides 22 sets of pre-trained parameter weights, 11 of which are Chinese language models.

Model   | Tokenizer                          | Supported Task                                                                                             | Model Name
BERT    | BertTokenizer                      | BertModel, BertForQuestionAnswering, BertForSequenceClassification, BertForTokenClassification             | bert-base-uncased, bert-large-uncased, bert-base-multilingual-uncased, bert-base-cased, bert-base-chinese, bert-base-multilingual-cased, bert-large-cased, bert-wwm-chinese, bert-wwm-ext-chinese
ERNIE   | ErnieTokenizer, ErnieTinyTokenizer | ErnieModel, ErnieForQuestionAnswering, ErnieForSequenceClassification, ErnieForTokenClassification         | ernie-1.0, ernie-tiny, ernie-2.0-en, ernie-2.0-large-en
RoBERTa | RobertaTokenizer                   | RobertaModel, RobertaForQuestionAnswering, RobertaForSequenceClassification, RobertaForTokenClassification | roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3
ELECTRA | ElectraTokenizer                   | ElectraModel, ElectraForSequenceClassification, ElectraForTokenClassification                              | electra-small, electra-base, electra-large, chinese-electra-small, chinese-electra-base

Note: Among these, the Chinese pre-trained models include bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, etc.
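Swapping in one of these models only requires changing the model/tokenizer classes and the model name. For example, a sketch of the same fine-tuning setup with ernie-1.0 (not used in this baseline):

# Sketch: fine-tune ERNIE instead of RoBERTa
ernie_model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(
    "ernie-1.0", num_classes=14)
ernie_tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-1.0")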

4. Define data processing functions

# Define the data loading and processing function
def convert_example(example, tokenizer, max_seq_length=128, is_test=False):
    qtconcat = example["text_a"]
    encoded_inputs = tokenizer(text=qtconcat, max_seq_len=max_seq_length)  # tokenize into a format the model accepts
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids

# Define the dataloader factory
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    # The training set is shuffled; validation and test sets are not
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
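To see what convert_example feeds the model, you can run it on a made-up title (the exact ids depend on the vocabulary):

# Sketch: inspect the output of convert_example on a made-up sample
sample = {"text_a": "这是一个测试标题", "label": 0}
ids, type_ids, label = convert_example(sample, tokenizer, max_seq_length=48)
print(ids)       # token ids, wrapped in [CLS] ... [SEP]
print(type_ids)  # all zeros for a single-sentence input
print(label)     # array([0])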

IV. Model training

1. Set the hyperparameters

# Hyperparameter settings
# Reduce the batch size if video memory is insufficient
batch_size = 360
# Maximum truncation length of the text sequence; choose it from the actual text lengths, with an upper bound of 512. The length analysis above showed a maximum of 48 in train + dev, so 48 is used here
max_seq_length = max(total['text_a'].str.len())

2. Data processing

# Convert the data into a format the model can read
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]

# Training set iterator
train_data_loader = create_dataloader(
    train_dataset,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# Validation set iterator
dev_data_loader = create_dataloader(
    dev_dataset,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
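Tuple, Pad, and Stack compose per-field collate functions: Pad pads every sample's id list to the longest in the batch, and Stack stacks the label arrays. A toy illustration with made-up ids:

# Toy illustration of batchify_fn (made-up ids)
toy = [
    ([1, 5, 9, 2], [0, 0, 0, 0], np.array([3], dtype="int64")),
    ([1, 7, 2],    [0, 0, 0],    np.array([0], dtype="int64")),
]
ids, type_ids, labels = batchify_fn(toy)
print(ids)     # second row padded to length 4 with pad_token_id
print(labels)  # shape (2, 1)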

3. Set the learning rate, optimizer, loss, and evaluation metric

BERT-style Transformer models are usually trained with a dynamic learning rate schedule that includes a warmup phase.
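Concretely, the schedule ramps the learning rate linearly from 0 to the peak over the first warmup_proportion of the steps, then decays it linearly back to 0. A sketch of the formula (my reading of LinearDecayWithWarmup, for intuition only):

# Sketch of the warmup-then-linear-decay schedule
def lr_at(step, peak_lr, total_steps, warmup_proportion):
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# lr_at(0, 4e-5, 1000, 0.1) == 0.0; lr_at(100, 4e-5, 1000, 0.1) == 4e-5 (peak)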

# Define hyperparameters, loss, optimizer, etc.
from paddlenlp.transformers import LinearDecayWithWarmup


# Peak learning rate during training
learning_rate = 4e-5
# Number of training epochs
epochs = 32
# Learning rate warmup proportion
warmup_proportion = 0.1
# Weight decay coefficient, a regularization strategy to avoid overfitting
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()  # cross-entropy loss
metric = paddle.metric.Accuracy()              # accuracy metric
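The apply_decay_param_fun above restricts weight decay to parameters whose names contain neither "bias" nor "norm": biases and LayerNorm weights are conventionally exempt from decay. A quick way to inspect which parameters are exempt:

# Sketch: list a few parameters that are exempt from weight decay
exempt = [n for n, p in model.named_parameters()
          if any(nd in n for nd in ["bias", "norm"])]
print(len(exempt), exempt[:3])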

4. Model evaluation

PS: While the model is training, you can run the nvidia-smi command in a terminal, or open the "performance monitoring" panel at the bottom, to check video memory usage and adjust the batch size appropriately, so that training does not stop unexpectedly due to insufficient memory.

# Define the evaluation function used on the validation set
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.8f, accu: %.8f" % (np.mean(losses), accu))  # report validation loss and accuracy
    model.train()
    metric.reset()
    return np.mean(losses), accu  # return mean loss and accuracy

5. Model training

# Fix random seeds for reproducibility
# seed = 1024
seed = 512
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
%cd ~

/home/aistudio

PS: While the model is training, check video memory usage with nvidia-smi or the performance monitoring panel at the bottom right, and reduce the batch_size value if memory is running short.

# Model training:
import paddle.nn.functional as F
from visualdl import LogWriter

save_dir = "best_checkpoint"
writer = LogWriter("./log")
global_step = 0
best_val_acc = 0
tic_train = time.time()
accu = 0

for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 40 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.8f, accu: %.8f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    40 / (time.time() - tic_train)))
            tic_train = time.time()

        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        # Evaluate every 200 steps once past step 3000; evaluating too early or too rarely is not informative
        if global_step % 200 == 0 and global_step >= 3000:
            # Evaluate the current model
            eval_loss, eval_accu = evaluate(model, criterion, metric, dev_data_loader)
            print("eval on dev loss: {:.8}, accu: {:.8}".format(eval_loss, eval_accu))
            # Log the evaluation metrics
            writer.add_scalar(tag="eval/loss", step=global_step, value=eval_loss)
            writer.add_scalar(tag="eval/acc", step=global_step, value=eval_accu)
            # Log the training metrics
            writer.add_scalar(tag="train/loss", step=global_step, value=loss)
            writer.add_scalar(tag="train/acc", step=global_step, value=acc)
            # Save the best checkpoint
            if eval_accu > best_val_acc:
                if not os.path.exists(save_dir):
                    os.mkdir(save_dir)
                best_val_acc = eval_accu
                print(f"Model saved at step {global_step}, best eval accuracy: {best_val_acc:.8f}!")
                save_param_path = os.path.join(save_dir, 'best_model.pdparams')
                paddle.save(model.state_dict(), save_param_path)
                fh = open('best_checkpoint/best_model.txt', 'w', encoding='utf-8')
                fh.write(f"Model saved at step {global_step}, best eval accuracy: {best_val_acc:.8f}!")
                fh.close()

tokenizer.save_pretrained(save_dir)
global step 40, epoch: 1, batch: 40, loss: 2.66168261, accu: 0.03826389, speed: 0.61 step/s
global step 80, epoch: 1, batch: 80, loss: 2.55217147, accu: 0.06201389, speed: 0.60 step/s
global step 120, epoch: 1, batch: 120, loss: 2.35012722, accu: 0.13446759, speed: 0.60 step/s
save_dir = 'best_checkpoint'
tokenizer.save_pretrained(save_dir)
# Evaluate the model on the validation set
evaluate(model, criterion, metric, dev_data_loader)
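If you run this evaluation in a fresh session, or after further training steps, load the best checkpoint first so the score corresponds to the saved model; a sketch using the paths from the training loop above:

# Sketch: evaluate the saved best checkpoint rather than the live model
state_dict = paddle.load('best_checkpoint/best_model.pdparams')
model.set_dict(state_dict)
evaluate(model, criterion, metric, dev_data_loader)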

V. Prediction

Restart the kernel, or run killall -9 python in a terminal, to release the video memory cache.

1. Import the required libraries

# Import the required third-party libraries
import math
import numpy as np
import os
import collections
from functools import partial
import random
import time
import inspect
import importlib
from tqdm import tqdm
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import IterableDataset
from paddle.utils.download import get_path_from_url
# Import the packages required by PaddleNLP
import paddlenlp as ppnlp
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from paddlenlp.datasets import MapDataset
from paddle.dataset.common import md5file
from paddlenlp.datasets import DatasetBuilder

2. Load the model

# Load roberta-wwm-ext-large again, as in the training section
MODEL_NAME = "roberta-wwm-ext-large"

# Specifying the model name and the number of classes defines the fine-tuning network
model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=14)  # 14 classes

# The tokenizer must match the chosen model
tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained(MODEL_NAME)
# Load the model parameters of the checkpoint that scored best on the validation set
import os
import paddle

seed = 1024
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)

params_path = '88.73671/best_checkpoint/best_model.pdparams'
if params_path and os.path.isfile(params_path):
    # Load model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

3. Load the test dataset

# Read the test set file for prediction
import pandas as pd
test = pd.read_csv('~/data/data103654/test.txt', header=None, names=['text_a'])

# Longest test title
print(max(test['text_a'].str.len()))
# The label list must be in the same order as during training
label_list = ['technology', 'sports', 'politics', 'stock', 'entertainment', 'education', 'home', 'finance', 'real estate', 'society', 'games', 'lottery', 'constellation', 'fashion']
print(label_list)

# Rebuild the id-to-label mapping
id_label_dict = {}
for i in range(len(label_list)):
    id_label_dict[i] = label_list[i]
print(id_label_dict)
!head -n5 ~/data/data103654/test.txt

4. Data processing

# Define the data loading and processing function
def convert_example(example, tokenizer, max_seq_length=48, is_test=False):
    qtconcat = example["text_a"]
    encoded_inputs = tokenizer(text=qtconcat, max_seq_len=max_seq_length)  # tokenize into a format the model accepts
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids

# Define the model prediction function
def predict(model, data, tokenizer, label_map, batch_size=1):
    examples = []
    # Convert the input data (a list of dicts) into a format the model accepts
    for text in data:
        input_ids, segment_ids = convert_example(
            text,
            tokenizer,
            max_seq_length=48,
            is_test=True)
        examples.append((input_ids, segment_ids))

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment ids
    ): fn(samples)

    # Separate the data into batches
    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        # The last batch, whose size may be smaller than batch_size
        batches.append(one_batch)

    results = []
    model.eval()
    for batch in batches:
        input_ids, segment_ids = batchify_fn(batch)
        input_ids = paddle.to_tensor(input_ids)
        segment_ids = paddle.to_tensor(segment_ids)
        logits = model(input_ids, segment_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results  # return the predicted labels
# Define a preprocessing function that wraps the raw text list into the model input format
def preprocess_prediction_data(data):
    examples = []
    for text_a in data:
        examples.append({"text_a": text_a})
    return examples

# Format the test set data
data1 = list(test.text_a)
examples = preprocess_prediction_data(data1)

5. Start predicting

# Make predictions for the test set
results = predict(model, examples, tokenizer, id_label_dict, batch_size=128)   
print(results)
# Save the list of predictions as a txt file. Submission format: one category per line
def write_results(labels, file_path):
    with open(file_path, "w", encoding="utf8") as f:
        f.writelines("\n".join(labels))

write_results(results, "./result.txt")
# Compress the result into a zip file, as required by the submission format
!zip 'submission.zip' 'result.txt'

!head result.txt

Note: the submission must be a ZIP file. Find the generated submission.zip in the home directory, download it to your machine, and submit it on the competition page.