“This is the fourth day of my participation in the November Gwen Challenge. See the event details: The Last Gwen Challenge in 2021.”

This post provides a baseline for the competition. It uses a Hugging Face pre-trained model with the pipeline approach to produce text summaries quickly. After submission, the Rouge-L score is 0.22158458, good for 8th place on the leaderboard.

Competition information

Link: www.datafountain.cn/competition…

Problem introduction

Background of the problem

With the rapid development of the Internet and social media, all kinds of news articles emerge one after another, and readers find it difficult to effectively discover what they are interested in when faced with massive news information. In order to help readers quickly understand news articles and find the news content they are interested in, this training competition aims to build a high-quality summary generation model to help readers quickly understand news content by generating short summaries for news documents.

Task

Based on real news articles, use machine learning to build an efficient summarization model that generates a content summary for each news document.

Schedule

This is a training competition and remains open permanently unless otherwise announced.

Online registration officially opened at 15:00 on September 27, 2021; the training-data leaderboard opened before October 11, 2021; rankings and prizes will be announced on December 6, 2021.

This training competition has no prize money. As of December 6, the top 3 teams receive CCF membership plus a commemorative medal, and the top 50 teams receive an e-certificate issued by the platform.

Code requirements

Submission requirements

Participants submit results as a CSV file encoded in UTF-8, using "\t" as the field delimiter. The submit.csv file fields are as follows:

| Field name | Type | Value range | Description |
| --- | --- | --- | --- |
| Index | Int | | The index |
| Target | Str | string | The generated summary |

Submission sample

The following is an example, with fields separated by a tab:

0 I like it
1 It is good
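A file in this exact format can be produced with pandas; as a minimal sketch, the two sample rows above are written tab-separated with no header or index column (the in-memory buffer stands in for the real submit.csv):

```python
import io

import pandas as pd

# Two sample rows matching the submission format: Index \t Target
sub = pd.DataFrame({"Index": [0, 1], "Target": ["I like it", "It is good"]})

# Tab-separated, no header row, no index column (UTF-8 is pandas' default)
buf = io.StringIO()
sub.to_csv(buf, index=False, header=False, sep="\t")
content = buf.getvalue()
print(content)
```

For a real submission, replace the buffer with a file path such as "submit.csv".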

Evaluation standard

The Rouge-L value is used for evaluation. The score is computed as:

R_lcs = LCS(X, Y) / m
P_lcs = LCS(X, Y) / n
F_lcs = ((1 + β²) · R_lcs · P_lcs) / (R_lcs + β² · P_lcs)

where LCS(X, Y) is the length of the longest common subsequence of X and Y; m and n are the lengths (typically the number of words) of the human-annotated summary and the machine-generated summary, respectively; R_lcs and P_lcs are recall and precision; and F_lcs is the Rouge-L score.
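The LCS-based Rouge-L computation described above can be sketched in a few lines. This is illustrative only, not the competition's official scorer; `lcs_len` and `rouge_l` are hypothetical helpers, and the default β = 1.0 (plain F1) is an assumption since the competition does not state its β:

```python
def lcs_len(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(reference, candidate, beta=1.0):
    """F-measure over the LCS of whitespace-tokenized texts."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_len(x, y)
    if lcs == 0:
        return 0.0
    r = lcs / len(x)  # recall:    LCS / reference length (m)
    p = lcs / len(y)  # precision: LCS / candidate length (n)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```

With β = 1.0 this reduces to the familiar F1 = 2·P·R / (P + R).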

Data description

The data comes from CNN and Daily Mail news stories and includes articles with their abstracts. Datasets of this kind are widely used in summarization, reading comprehension, and similar applications.

The data folder contains four files, in order:

| File type | File name | File content |
| --- | --- | --- |
| Training set | train.csv | Training data, with corresponding summaries |
| Test set | test.csv | Test data, without summaries |
| Field description | Field Description.xlsx | Details of the training/test set fields |
| Submission sample | submission.csv | Only two fields: Index \t Target |

The official baseline

Click here for the official baseline

Rouge F (Rouge-L): 0.24

Rouge p: 0.34

Rouge r: 0.19

Baseline

Install the Rouge and Transformers dependencies first.
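Assuming a standard pip environment (package names as published on PyPI; versions unpinned), the dependencies can be installed with:

```shell
# rouge and transformers are named in the post; torch, pandas, numpy,
# and tqdm are also imported by the code below
pip install rouge transformers torch pandas numpy tqdm
```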

Here is the code flow:

Preparation

```python
# -------------------- imports --------------------
import re
import json
import torch
import random
import numpy as np
import pandas as pd
from rouge import Rouge
from tqdm import tqdm
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def set_seed(seed):
    torch.manual_seed(seed)                    # seed the CPU RNG for reproducible results
    torch.cuda.manual_seed(seed)               # seed the GPU RNG
    torch.backends.cudnn.deterministic = True  # make cuDNN deterministic
    np.random.seed(seed)                       # numpy
    random.seed(seed)                          # python random

def print_rouge_L(output, label):
    rouge = Rouge()
    rouge_score = rouge.get_scores(output, label)
    rouge_L_f1 = 0
    rouge_L_p = 0
    rouge_L_r = 0
    for d in rouge_score:
        rouge_L_f1 += d["rouge-l"]["f"]
        rouge_L_p += d["rouge-l"]["p"]
        rouge_L_r += d["rouge-l"]["r"]
    print("rouge_f1:%.2f" % (rouge_L_f1 / len(rouge_score)))
    print("rouge_p:%.2f" % (rouge_L_p / len(rouge_score)))
    print("rouge_r:%.2f" % (rouge_L_r / len(rouge_score)))

set_seed(0)
```

Data reading

Read the files into pandas DataFrames for convenient manipulation.

```python
# -------------------- read data --------------------
train_path = './CCFNewsSummary/train_dataset.csv'  # training set path
with open(train_path, 'r', encoding='utf-8') as f:
    train_data_all = f.readlines()

test_path = './CCFNewsSummary/test_dataset.csv'    # test set path
with open(test_path, 'r', encoding='utf-8') as f:
    test_data = f.readlines()

train = pd.DataFrame([], columns=["Index", "Text", "Abstract"])
test = pd.DataFrame([], columns=["Index", "Text"])
for idx, rows in enumerate(train_data_all):
    train.loc[idx] = rows.split("\t")
for idx, rows in enumerate(test_data):
    test.loc[idx] = rows.split("\t")
```
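An equivalent, typically faster way to build the same DataFrames is to let pandas parse the tab-separated files directly, assuming they are plain TSV with no header row (`read_tsv` is a hypothetical helper; the in-memory buffer below stands in for a real file):

```python
import io

import pandas as pd

def read_tsv(path_or_buf, columns):
    # pandas reads the \t-separated file straight into a DataFrame
    return pd.read_csv(path_or_buf, sep="\t", header=None, names=columns)

# In-memory stand-in for one line of train_dataset.csv
buf = io.StringIO("0\tsome article text\tsome abstract\n")
train = read_tsv(buf, ["Index", "Text", "Abstract"])
```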

Load the multilingual T5 model

Pre-trained model download address: huggingface.co/csebuetnlp/…

Open the Files tab on the model page and download all of the files, including .gitattributes and README.md, into the same folder.

You can then load the pre-trained model from that local path:

```python
# -------------------- load the mT5 model --------------------
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

model_name = "./PretrainModel/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```
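The `WHITESPACE_HANDLER` used here simply collapses newlines and runs of whitespace into single spaces before tokenization; a quick standalone check:

```python
import re

# Same cleaner as in the baseline: strip, collapse newlines, collapse whitespace
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

cleaned = WHITESPACE_HANDLER("Line one.\n\nLine   two.\t End. ")
```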

Sample test

```python
# -------------------- single-sample prediction --------------------
i = 0
article_text = train["Text"][i]
article_abstract = train["Abstract"][i]

input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=512,
    min_length=int(len(article_text) / 32),
    no_repeat_ngram_size=3,
    num_beams=5
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(f"Generate: \n{summary}")
print(f"Label: \n{article_abstract}")
print_rouge_L(summary, article_abstract)
```

Although the generated sentences read a bit oddly, the Rouge-L score of 0.34 is respectable compared with the leaderboard.

Evaluate Rouge_L for 10 samples

```python
# -------------------- evaluate 10 samples --------------------
multi_sample = 10
for idx, article_text in tqdm(enumerate(train["Text"][:multi_sample]), total=multi_sample):
    input_ids = tokenizer(
        [WHITESPACE_HANDLER(article_text)],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )["input_ids"]
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=512,
        min_length=int(len(article_text) / 32),
        no_repeat_ngram_size=3,
        num_beams=5
    )[0]
    summary = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    train.loc[idx, "summary"] = summary

print_rouge_L(train["summary"][:multi_sample], train["Abstract"][:multi_sample])
```

Rouge_f: 0.22

Rouge_p: 0.17

Rouge_r: 0.31

Evaluating multiple samples also holds up: 10 samples take about 4 minutes, so 1,000 samples should take close to 7 hours.

Prediction

Running the full test set takes about 6 to 7 hours.

```python
# -------------------- prediction --------------------
for idx, article_text in tqdm(enumerate(test["Text"]), total=1000):
    input_ids = tokenizer(
        [WHITESPACE_HANDLER(article_text)],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=768
    )["input_ids"]
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=512,
        min_length=int(len(article_text) / 32),
        no_repeat_ngram_size=3,
        num_beams=5
    )[0]
    summary = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    test.loc[idx, "Text"] = summary

# write the submission file
test[["Index", "Text"]].to_csv("T5summit01.csv", index=False, header=False, sep="\t")
```

Leaderboard results