
Quick install of SimpleTransformers

SimpleTransformers project address: hub.fastgit.org/ThilinaRaja…

SimpleTransformers official site: simpletransformers.ai/

Quick installation:

  • Install using Conda

1) Create a virtual environment

conda create -n st python pandas tqdm
conda activate st

2) Install the CUDA environment

conda install pytorch>=1.6 cudatoolkit=11.0 -c pytorch

3) Install SimpleTransformers

pip install simpletransformers
Copy the code
4) Install wandb

wandb (Weights and Biases) is used to track and visualize training in the web browser. A quick verification sketch follows the install command below.

pip install wandb
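To verify the installation, you can import the packages and check that CUDA is visible. SimpleTransformers will log a run to Weights and Biases when a wandb_project entry is set in the model's args dict; in this minimal sketch the project name "st-demo" is just a placeholder.

import torch
import simpletransformers
import wandb

# True if the CUDA environment installed above is usable
print(torch.cuda.is_available())

# SimpleTransformers logs to Weights & Biases when wandb_project is set
# in the args dict; "st-demo" is a placeholder project name.
model_args = {"wandb_project": "st-demo"}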

Currently supported tasks:

  • Binary and multi-class text classification: ClassificationModel
  • Conversational AI (chatbot training): ConvAIModel
  • Language generation: LanguageGenerationModel
  • Language model training/fine-tuning: LanguageModelingModel
  • Multi-label text classification: MultiLabelClassificationModel
  • Multimodal classification (combined text and image data): MultiModalClassificationModel
  • Named entity recognition: NERModel
  • Question answering: QuestionAnsweringModel
  • Regression: ClassificationModel
  • Sentence-pair classification: ClassificationModel
  • Text representation generation: RepresentationModel
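Whichever task class you pick, the basic interface is the same: construct the model with a model_type and a model_name, then call train_model / eval_model / predict. A minimal sketch with NERModel (the checkpoint name is only an example; use_cuda=False keeps it runnable on CPU):

from simpletransformers.ner import NERModel

# Same constructor pattern as every other task class: (model_type, model_name)
ner = NERModel("bert", "bert-base-cased", use_cuda=False)
predictions, raw_outputs = ner.predict(["SimpleTransformers supports many tasks"])
print(predictions)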

Where can I download pretrained models?

For available pretrained models, see the Hugging Face documentation.

Given a model_type from the documentation, a pretrained model can be loaded as long as the model_name value in args is set correctly, as in the sketch below.
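For example, loading a Chinese BERT checkpoint for classification might look like this ("bert-base-chinese" is one valid Hugging Face model name; any checkpoint matching the model_type works):

from simpletransformers.classification import ClassificationModel

# model_type picks the architecture; model_name picks the checkpoint
# to download from the Hugging Face hub.
model = ClassificationModel("bert", "bert-base-chinese", use_cuda=False)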

[Practice 01] Text classification

Dataset

We use CLUE as the benchmark.

Selected dataset: IFLYTEK long text classification

Chinese Language Understanding Evaluation (CLUE): www.cluebenchmarks.com/dataSet_sea…

To better serve Chinese language understanding tasks and industry, and as a complement to general language model evaluation, CLUE improves the infrastructure by collecting, organizing, and publishing Chinese tasks and standardized evaluations, ultimately promoting the development of Chinese NLP.

The CLUE paper was accepted at the International Conference on Computational Linguistics (COLING 2020).

  • IFLYTEK long text classification

Download: github.com/CLUEbenchma…

The dataset contains more than 17,000 annotated long texts of app descriptions, covering application topics related to daily life, in 119 categories: "Taxi": 0, "Map navigation": 1, "Free WiFi": 2, "Car rental": 3, …, "Women": 115, "Business": 116, "Receipt": 117, "Other": 118 (labels 0-118). Each record has three fields, in order: category ID, category name, and text content.

Amount of data: training set (12,133), validation set (2,599), test set (2,600).

{" label ":" 110 ", "label_des" : "community supermarket", "sentence" : "Pupukuai Delivery Supermarket was founded in 2016, focusing on creating a mobile terminal 30-minute instant delivery one-stop shopping platform, commodity categories include fruits, vegetables, meat, poultry, eggs and milk, seafood and aquatic products, food and oil seasoning, drinks, leisure food, daily necessities, takeout, etc. Pupu company hopes to become a faster, better, more and more economical online retail platform with a new business model and a more efficient warehousing and distribution model, bringing consumers a better consumption experience, and at the same time promoting the process of Food safety in China and becoming an Internet company respected by the society. , simple and simple, good and fast,1. More clear and friendly delivery time prompt 2. Some optimizations to ensure user privacy 3. 4. Fixed some known bugs "}Copy the code

Data processing

Simple Transformers requires the data to be in a Pandas DataFrame with at least two columns. Simply name the columns "text" and "labels" and SimpleTransformers will handle the rest. The first column contains the text, of type str; the second column contains the labels, of type int. For multi-class classification, labels should be integers starting at 0. For instance:
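A minimal hand-built DataFrame in the required format (the rows are made-up examples):

import pandas as pd

# Two columns: "text" (str) and "labels" (int, starting at 0)
train_df = pd.DataFrame(
    [["this app is for hailing taxis", 0],
     ["this app offers map navigation", 1]],
    columns=["text", "labels"],
)

The loader below builds exactly this format from the IFLYTEK JSON lines: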

import json
import pandas as pd

def load_clue_iflytek(path, mode=None):
    data = []
    with open(path, "r", encoding="utf-8") as fp:
        if mode == 'train' or mode == 'dev':
            for idx, line in enumerate(fp):
                line = json.loads(line.strip())
                label = int(line["label"])
                text = line['sentence']
                data.append([text, label])
            data_df = pd.DataFrame(data, columns=["text", "labels"])
            return data_df
        elif mode == 'test':
            for idx, line in enumerate(fp):
                line = json.loads(line.strip())
                text = line['sentence']
                data.append([text])
            data_df = pd.DataFrame(data, columns=["text"])
            return data_df
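A usage sketch of the loader, assuming the IFLYTEK files from the CLUE repository sit under ./data/iflytek/ (the same paths used in the complete code at the end):

train = load_clue_iflytek("./data/iflytek/train.json", mode='train')
print(train.shape)   # expected (12133, 2) for the IFLYTEK training set
print(train.head())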

Model building and training

First, configure the parameters. Simple Transformers uses an args dict; for details on each argument, see: simpletransformers.ai/docs/tips-a…

1) Parameter configuration

import argparse

def data_config(parser):
    parser.add_argument("--trainset_path", type=str, default="data/Chinese_Spam_Message/train.json",
                        help="training set path")
    parser.add_argument("--testset_path", type=str, default="data/Chinese_Spam_Message/test.txt",
                        help="test set path")
    parser.add_argument("--reprocess_input_data", type=bool, default=True,
                        help="reprocess the input data even if a cache file for it exists in cache_dir")
    parser.add_argument("--overwrite_output_dir", type=bool, default=True,
                        help="if True, the trained model is saved to output_dir, overwriting any existing saved model there")
    parser.add_argument("--use_cached_eval_features", type=bool, default=True,
                        help="evaluation during training uses cached features; False recomputes features at every evaluation step")
    parser.add_argument("--output_dir", type=str, default="outputs/",
                        help="stores all outputs, including model checkpoints and evaluation results")
    parser.add_argument("--best_model_dir", type=str, default="outputs/best_model/",
                        help="stores the best model found during evaluation while training")
    return parser

def model_config(parser):
    parser.add_argument("--max_seq_length", type=int, default=64,
                        help="maximum sequence length supported by the model")
    parser.add_argument("--model_type", type=str, default="bert",
                        help="model type: bert/roberta")
    # To load a previously saved model instead of the default one,
    # set model_name to the path of the directory containing the saved model.
    parser.add_argument("--model_name", type=str, default="./outputs/bert",
                        help="the pretrained model to use")
    parser.add_argument("--manual_seed", type=int, default=0,
                        help="seed for reproducible results")
    parser.add_argument("--learning_rate", type=float, default=4e-5,
                        help="learning rate")
    return parser

def train_config(parser):
    parser.add_argument("--num_train_epochs", type=int, default=3,
                        help="number of training epochs")
    parser.add_argument("--wandb_kwargs", type=dict, default={"name": "bert"}, help="")
    parser.add_argument("--n_gpu", type=int, default=1,
                        help="number of GPUs used for training")
    parser.add_argument("--train_batch_size", type=int, default=64)
    parser.add_argument("--eval_batch_size", type=int, default=32)
    return parser

def set_args():
    parser = argparse.ArgumentParser()
    parser = data_config(parser)
    parser = model_config(parser)
    parser = train_config(parser)
    args, _ = parser.parse_known_args()
    return args

2) Model building and training

from simpletransformers.classification import ClassificationModel
from sklearn.metrics import f1_score, accuracy_score
import logging

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')

# Build the model
model = ClassificationModel(args.model_type, args.model_name,
                            num_labels=num_labels, args=vars(args))
# Train the model
model.train_model(train, eval_df=dev)
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(dev, f1=f1_multiclass)
# Model prediction (predict expects a list of texts)
predictions, raw_outputs = model.predict([test["text"][0]])
print(predictions)
print(raw_outputs)
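predict returns integer label ids; to recover the human-readable category, you can build an id-to-name mapping from the label/label_des fields of the raw JSON. The helper below is illustrative, not part of SimpleTransformers:

import json

def build_id2name(path):
    # Map each label id to its label_des from the raw IFLYTEK lines
    id2name = {}
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            record = json.loads(line.strip())
            id2name[int(record["label"])] = record["label_des"]
    return id2name

id2name = build_id2name("./data/iflytek/train.json")
print(id2name[predictions[0]])  # category name of the first prediction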

Prediction results

My laptop's performance is limited, so to make sure training could run, max_seq_length was set to just 64 and the model was trained for only 3 epochs; the F1 score is therefore not very good. Besides F1, other evaluation metrics can be added, such as accuracy, precision, and recall (see the sketch after the results below).

{"eval_loss" = 1.8086365330510024,"f1" = 0.5917660638707195," MCC "= 0.5727319886339782}Copy the code

The complete code

import json
import logging
import argparse
import pandas as pd
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import f1_score, accuracy_score


def load_clue_iflytek(path, mode=None):
    data = []
    with open(path, "r", encoding="utf-8") as fp:
        if mode == 'train' or mode == 'dev':
            for idx, line in enumerate(fp):
                line = json.loads(line.strip())
                label = int(line["label"])
                text = line['sentence']
                data.append([text, label])
            data_df = pd.DataFrame(data, columns=["text", "labels"])
            return data_df
        elif mode == 'test':
            for idx, line in enumerate(fp):
                line = json.loads(line.strip())
                text = line['sentence']
                data.append([text])
            data_df = pd.DataFrame(data, columns=["text"])
            return data_df


# Parameter configuration
def data_config(parser):
    parser.add_argument("--trainset_path", type=str, default="data/Chinese_Spam_Message/train.json",
                        help="training set path")
    parser.add_argument("--testset_path", type=str, default="data/Chinese_Spam_Message/test.txt",
                        help="test set path")
    parser.add_argument("--reprocess_input_data", type=bool, default=True,
                        help="reprocess the input data even if a cache file for it exists in cache_dir")
    parser.add_argument("--overwrite_output_dir", type=bool, default=True,
                        help="if True, the trained model is saved to output_dir, overwriting any existing saved model there")
    parser.add_argument("--use_cached_eval_features", type=bool, default=True,
                        help="evaluation during training uses cached features; False recomputes features at every evaluation step")
    parser.add_argument("--output_dir", type=str, default="outputs/",
                        help="stores all outputs, including model checkpoints and evaluation results")
    parser.add_argument("--best_model_dir", type=str, default="outputs/best_model/",
                        help="stores the best model found during evaluation while training")
    return parser


def model_config(parser):
    parser.add_argument("--max_seq_length", type=int, default=64,
                        help="maximum sequence length supported by the model")
    parser.add_argument("--model_type", type=str, default="bert",
                        help="model type: bert/roberta")
    # To load a previously saved model instead of the default one,
    # set model_name to the path of the directory containing the saved model.
    parser.add_argument("--model_name", type=str, default="./outputs/bert",
                        help="the pretrained model to use")
    parser.add_argument("--manual_seed", type=int, default=0,
                        help="seed for reproducible results")
    parser.add_argument("--learning_rate", type=float, default=4e-5,
                        help="learning rate")
    return parser


def train_config(parser):
    parser.add_argument("--num_train_epochs", type=int, default=3,
                        help="number of training epochs")
    parser.add_argument("--wandb_kwargs", type=dict, default={"name": "bert"}, help="")
    parser.add_argument("--n_gpu", type=int, default=1,
                        help="number of GPUs used for training")
    parser.add_argument("--train_batch_size", type=int, default=64)
    parser.add_argument("--eval_batch_size", type=int, default=32)
    return parser


def set_args():
    parser = argparse.ArgumentParser()
    parser = data_config(parser)
    parser = model_config(parser)
    parser = train_config(parser)
    args, _ = parser.parse_known_args()
    return args


def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')


if __name__ == "__main__":
    args = set_args()
    logging.basicConfig(level=logging.INFO)
    transformers_logger = logging.getLogger("transformers")
    transformers_logger.setLevel(logging.WARNING)

    # Load the data
    train = load_clue_iflytek("./data/iflytek/train.json", mode='train')
    dev = load_clue_iflytek("./data/iflytek/dev.json", mode='dev')
    test = load_clue_iflytek("./data/iflytek/test.json", mode='test')
    num_labels = len(train["labels"].unique())
    print(train.shape)
    print(dev.shape)

    # Build the model
    model = ClassificationModel(args.model_type, args.model_name,
                                num_labels=num_labels, args=vars(args))
    # Train the model
    model.train_model(train, eval_df=dev)
    # Evaluate the model
    result, model_outputs, wrong_predictions = model.eval_model(dev, f1=f1_multiclass)

SimpleTransformers is fast to get started with, but it is best suited for quick applications or baselines. If you need to modify the model structure or combine components flexibly, you should master Hugging Face Transformers or another Python library with a higher degree of freedom.

I'm an NLP newcomer with limited knowledge; if there are mistakes or imperfections, please point them out!