PaddleNLP Taobao Product Review Sentiment Analysis
AI Studio address: aistudio.baidu.com/aistudio/pr…
GitHub address: hub.fastgit.org/livingbody/…
With the rapid development of e-commerce in China, almost every e-commerce site lets consumers score and comment on the workmanship, delivery, and price of products. Posting reviews on these platforms has become a popular habit, and it floods the Internet with a massive amount of information.
For sellers, reviews reveal customers' actual needs and help them improve product quality and competitiveness. For products they have never tried, customers tend to seek other buyers' opinions to reduce the risk of a purchase, so these reviews are a treasure trove for potential buyers and an important basis for decision making.
For customers, other people's purchase histories and reviews help them make better purchase decisions.
Therefore, applying data mining to large volumes of customer reviews can surface the characteristics hidden in this information; the results help manufacturers improve their products and related services and strengthen their core competitiveness.
The data labels are {0: 'negative', 1: 'neutral', 2: 'positive'}.
!pip install --upgrade paddlenlp
Second, data processing
1. Data viewing
Each record contains a comment and its corresponding label: 0 represents a negative review, 1 a neutral review, and 2 a positive review.
# Decompress the dataset
# !unzip data/data94812/Chinese Taobao review data set.zip

!head -n9 train.txt
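Before converting the file, a quick sanity check of the 3-lines-per-record layout can be useful. The snippet below is a small addition (not part of the original notebook) that counts the sentiment labels in train.txt under that assumption, reusing the file name and encoding from the cells above.

from collections import Counter

def label_counts(path):
    # assumes every 3 lines form one record: review text, review category, sentiment label
    with open(path, 'r', encoding='utf-8-sig') as f:
        lines = [line.strip('\n') for line in f]
    return Counter(lines[i * 3 + 2] for i in range(len(lines) // 3))

print(label_counts('train.txt'))  # expected keys: '0', '1', '2'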
2. Data set format conversion
def read(data_path):
    data = ['label' + '\t' + 'text_a\n']
    with open(data_path, 'r', encoding='utf-8-sig') as f:
        lines = f.readlines()
        # every 3 lines form one record
        for i in range(int(len(lines) / 3)):
            # the first line is the review text
            word = lines[i * 3].strip('\n')
            # the third line is the sentiment label
            label = lines[i * 3 + 2].strip('\n')
            data.append(label + '\t' + word + '\n')
    return data

with open('formated_train.txt', 'w') as f:
    f.writelines(read('train.txt'))
with open('formated_test.txt', 'w') as f:
    f.writelines(read('test.txt'))
!head formated_train.txt
label	text_a
0	quality is great! Very good, very beautiful, quality problems do not know, have to use to know, this express is too rubbish
0	price above anyway after the discount is the market sales price, engage in activities also casually say, very delicious
0	beautiful good quality is express not to force, send a piece to send two days
0	baby is very satisfied, send small just good, quality is good, worth buying, is the express is too slow, yesterday just arrived
0	suitcase is quite good! It looks like it's pretty solid and you'll know if it's really good when you use it! And the zipper is smooth! Express is too bad, too slow
0	quality is very good, delivery is fast, decisive to praise! The courier was annoyed and asked him to wait for me for five minutes, but I called him three times in five minutes!
0	The color is very small and fresh, the same as the picture, the sticker sent is very cute. The wheels are very quiet on the tiles, which would be perfect if the details were better. Cost-effective. The express didn't get a full mark one because they didn't deliver the goods, and two because they gave me a look when I said I was signing the goods after inspecting them.
0	Although installation master not to force, but things are really very force! Love the praise!
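To confirm the conversion worked, one can check that every line after the header splits into exactly two tab-separated fields (this assumes the reviews themselves contain no tab characters; the snippet is an illustrative addition, not from the original post).

with open('formated_train.txt', 'r', encoding='utf-8') as f:
    header = next(f)
    bad = [i for i, line in enumerate(f, start=2)
           if len(line.rstrip('\n').split('\t')) != 2]
print(header.strip(), '| malformed lines:', bad[:5])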
!head formated_test.txt
label	text_a
0	very beautiful, very light, as described by the seller, the delivery speed is not good, the design is very beautiful, the customer service attitude is very good, well, the quality is not good, but it will pail, and express good slow ah
0	value ah ha ha ha, is the color random I have to why the color, the results are all the same color, but good things can, value for money. The seller is very nice, warm and thoughtful, packed tightly, and tried to solve the delivery problems. I want to complain about the delivery. It's terrible.
0	much better than imagined, sound is particularly good, but also with surround sound, in short, very like, is this price is not very affordable
0	bought a little bigger, but also can wear, children like it very much, three sets, spring and autumn can wear, the quality is ok, the blue velvet feels a bit old and a bit expensive. The express company made a little mistake, but the boss soon fixed it.
0	The quality is average, the side board is multi-layer board, the express delivery is too slow, the goods will be delivered on February 8th, and it will arrive on February 21st.
0	very good, the store is very careful, the packaging is very good, but the express is too slow, I hope this aspect or improve!
0	yummy, but the portion is too small and a little expensive
3. Override the read method to load the custom dataset
Looking at the raw file, every 3 lines form one record (the review text, the review category, and the sentiment label). After the conversion above, each record is a single label\ttext_a line, so we define a custom read() that yields one sample per line and load it with load_dataset().
from paddlenlp.datasets import load_dataset

def read(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        # skip the header line
        next(f)
        for line in f:
            label, word = line.strip('\n').split('\t')
            yield {'text': word, 'label': label}

# data_path is passed through load_dataset() to the read() function
train_ds = load_dataset(read, data_path='formated_train.txt', lazy=False)
test_ds = load_dataset(read, data_path='formated_test.txt', lazy=False)
dev_ds = load_dataset(read, data_path='formated_test.txt', lazy=False)
print(len(train_ds))
print(train_ds.label_list)
for idx in range(10):
    print(train_ds[idx])
13339
None
{'text': 'Good quality! ', 'label': '0'}
{'text': …, 'label': '0'}
{'text': …, 'label': '0'}
{'text': …, 'label': '0'}
{'text': 'The baby is very satisfied, the size is just right, the quality is good, it is worth buying, but the express is too slow, arrived yesterday ', 'label': '0'}
{'text': 'luggage is good! It looks like it's pretty solid and you'll know if it's really good when you use it! And the zipper is smooth!', 'label': '0'}
{'text': '… The courier was annoyed and asked him to wait for me for five minutes, but I called him three times in five minutes!', 'label': '0'}
{'text': 'the color is small and fresh, just like the picture, the sticker is very cute. The wheels are very quiet on the tiles, which would be perfect if the details were better. Cost-effective. The express didn't get a full mark one because they didn't deliver the goods, and two because they gave me a look when I said I was signing the goods after inspecting them.', 'label': '0'}
{'text': '… Love the praise!', 'label': '0'}
{'text': 'Express delivery at the supermarket door to make a phone call to go, the result was taken away by others, at least sent back. … What a mess!', 'label': '0'}
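label_list prints as None because the custom read() yields plain string labels and does not register a label list with the dataset. A small illustrative addition (not from the original post) to check the class balance of the loaded datasets:

from collections import Counter

print(Counter(sample['label'] for sample in train_ds))
print(Counter(sample['label'] for sample in dev_ds))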
Third, use the pre-trained model
1. Select the pre-trained model
import paddlenlp as ppnlp
# Specify the name of the model to use
MODEL_NAME = "ernie-1.0"
ernie_model = ppnlp.transformers.ErnieModel.from_pretrained(MODEL_NAME)
model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=3)
[2021-06-15 23:40:55.574] [INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-06-15 23:40:55.577] [INFO] - Downloading ernie_v1_chn_base.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams
100%|██████████| 392507/392507 [00:13<00:00, 29747.03it/s]
[2021-06-15 23:41:15.870] [INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.layer_norm.weight', 'cls.predictions.decoder_bias', 'cls.predictions.transform.bias', 'cls.predictions.transform.weight', 'cls.predictions.layer_norm.bias']
[2021-06-15 23:41:16.263] [INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
2. Call ppnlp.transformers.ErnieTokenizer for data processing
The pre-trained ERNIE model processes Chinese text character by character. PaddleNLP has a built-in tokenizer for each of its pre-trained models; specify the model name to load the matching tokenizer.
The tokenizer converts the raw input text into the input format the model can accept.
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME)
[2021-06-15 23:41:18.286] [INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 90/90 [00:00<00:00, 3163.87it/s]
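To see what the tokenizer produces, the short sketch below (an illustrative addition, assuming the PaddleNLP 2.x tokenizer call) encodes one review and prints the fields it returns; input_ids and token_type_ids are the two the model needs.

sample = train_ds[0]['text']
encoded = tokenizer(text=sample, max_seq_len=128)
print(list(encoded.keys()))       # typically ['input_ids', 'token_type_ids']
print(len(encoded['input_ids']))  # number of tokens after adding special tokens and truncation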
The ERNIE model outputs two tensors (see the sketch below):
- sequence_output is the semantic representation of each input token, with shape (1, num_tokens, hidden_size). It is typically used for sequence labeling, question answering, and similar token-level tasks.
- pooled_output is the semantic representation of the whole sentence, with shape (1, hidden_size). It is typically used for text classification, information retrieval, and similar sentence-level tasks.
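As a rough illustration of these two outputs (an added sketch, assuming the tokenizer call shown above), a single encoded sentence can be fed through ernie_model to inspect the shapes:

import paddle

encoded = tokenizer(text=train_ds[0]['text'], max_seq_len=128)
input_ids = paddle.to_tensor([encoded['input_ids']])
token_type_ids = paddle.to_tensor([encoded['token_type_ids']])
sequence_output, pooled_output = ernie_model(input_ids, token_type_ids)
print(sequence_output.shape)  # [1, num_tokens, hidden_size]
print(pooled_output.shape)    # [1, hidden_size]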
NOTE:
- To use the ernie-tiny pre-trained model, the matching tokenizer must be loaded with paddlenlp.transformers.ErnieTinyTokenizer.from_pretrained('ernie-tiny').
- The code above shows the data processing steps needed when using a Transformer pre-trained model directly. For ease of use, PaddleNLP also provides a higher-level API that returns the data format required by the model in one call.
3. Read data
Use the paddle.io.DataLoader interface to load data asynchronously with multiple workers.
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import convert_example, create_dataloader

# batch size used by the model
batch_size = 200
max_seq_length = 128

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment ids
    Stack(dtype="int64")                               # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
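utils.py is not shown in this post, so the helper below is only a guess at what a convert_example like the imported one typically does in PaddleNLP examples: tokenize the text, return the input ids and segment ids, and cast the label to an integer array (the name and signature here are hypothetical).

import numpy as np

def convert_example_sketch(example, tokenizer, max_seq_length=128, is_test=False):
    # hypothetical stand-in for utils.convert_example
    encoded = tokenizer(text=example['text'], max_seq_len=max_seq_length)
    input_ids = encoded['input_ids']
    token_type_ids = encoded['token_type_ids']
    if is_test:
        return input_ids, token_type_ids
    label = np.array([example['label']], dtype='int64')
    return input_ids, token_type_ids, label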
4. Set up the fine-tuning optimization strategy and the evaluation metric
from paddlenlp.transformers import LinearDecayWithWarmup
import paddle

# maximum learning rate during training
learning_rate = 5e-5
# number of training epochs
epochs = 5  # 3
# proportion of steps used for learning rate warmup
warmup_proportion = 0.1
# weight decay coefficient, similar to a regularization term, to reduce overfitting
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])
criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
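To get a feel for the warmup-then-linear-decay schedule, a standalone probe scheduler (a separate instance, so the one used for training is not advanced) can print the learning rate at a few points. This is an added sketch assuming paddle's LRScheduler API, where step() advances the schedule and get_lr() returns the current rate.

probe = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
for step in range(num_training_steps):
    probe.step()
    if step % max(1, num_training_steps // 5) == 0:
        print('step', step, 'lr', probe.get_lr())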
Fourth, model training and evaluation
The model training process usually includes the following steps:
- Pull a batch of data from the DataLoader.
- Feed the batch to the model for a forward pass.
- Pass the forward outputs to the loss function to compute the loss, and to the metric to update the evaluation score.
- Back-propagate the loss and update the parameters, then repeat the steps above.
After each training epoch, the program evaluates the current model on the dev set.
# The checkpoint folder is used to save the trained model
!mkdir /home/aistudio/checkpoint
mkdir: cannot create directory '/home/aistudio/checkpoint': File exists
import paddle.nn.functional as F
from utils import evaluate

global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    evaluate(model, criterion, metric, dev_data_loader)
global step 270, epoch: 5, batch: 2, loss: 0.43869, acc: 0.80500
global step 280, epoch: 5, batch: 12, loss: 0.41893, acc: 0.80792
global step 290, epoch: 5, batch: 22, loss: 0.44744, acc: 0.80523
global step 300, epoch: 5, batch: 32, loss: 0.38977, acc: 0.80797
global step 310, epoch: 5, batch: 42, loss: 0.39923, acc: 0.80774
global step 320, epoch: 5, batch: 52, loss: 0.36724, acc: 0.81029
global step 330, epoch: 5, batch: 62, loss: …
eval loss: 0.51191, accu: 0.74444
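The checkpoint directory created earlier is where the fine-tuned weights would normally be kept. The original post does not show the save step; a typical way to do it with PaddleNLP (a hedged sketch, not the author's exact code) is:

save_dir = '/home/aistudio/checkpoint'
model.save_pretrained(save_dir)      # saves the fine-tuned weights and model config
tokenizer.save_pretrained(save_dir)  # saves the vocabulary files alongside them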
Fifth, model prediction
The saved, fine-tuned model can now be used for prediction. Call the custom predict() function on new data, as shown in the example code below.
from utils import predict

data = [
    {"text": 'The hotel is rather old and the rooms on sale are mediocre. In general '},
    {"text": 'I was very excited to show it, but then I saw that at the end of the screening, there was an animated Mickey Mouse episode.'},
    {"text": 'For an old four-star hotel, the rooms are still neat and quite nice. The airport pick-up service is very good, you can check in in the car, save time. '},
]
label_map = {0: 'negative', 1: 'neutral', 2: 'positive'}

results = predict(
    model, data, tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
Data: {'text': 'The hotel is rather old and the rooms on sale are mediocre. In general '} 	 Label: neutral
Data: {'text': 'I was very excited to show it, but then I saw that at the end of the screening, there was an animated Mickey Mouse episode.'} 	 Label: negative
Data: {'text': 'For an old four-star hotel, the rooms are still neat and quite nice. The airport pick-up service is very good, you can check in in the car, save time. '} 	 Label: positive
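Like convert_example, predict comes from the author's utils.py, which is not included here. The sketch below only outlines the usual PaddleNLP pattern such a helper follows (the name and signature are hypothetical): switch the model to eval mode, encode each text, take the argmax of the softmax probabilities, and map the class id back to a label name.

import paddle
import paddle.nn.functional as F

def predict_sketch(model, data, tokenizer, label_map, max_seq_len=128):
    # hypothetical stand-in for utils.predict
    model.eval()
    results = []
    with paddle.no_grad():
        for example in data:
            encoded = tokenizer(text=example['text'], max_seq_len=max_seq_len)
            input_ids = paddle.to_tensor([encoded['input_ids']])
            token_type_ids = paddle.to_tensor([encoded['token_type_ids']])
            logits = model(input_ids, token_type_ids)
            probs = F.softmax(logits, axis=1)
            idx = int(paddle.argmax(probs, axis=1).numpy()[0])
            results.append(label_map[idx])
    return results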