Introduction

Recently I have been learning about convolutional neural networks, and I wanted to start a small project for practice. The dataset comes from GitHub and consists of positive and negative reviews of automobile after-sales service. PyTorch is used to train the model and perform binary classification of the reviews in the test set.

Principle: use convolution to extract local features and capture key information, similar to N-grams.
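
To make this concrete, here is a minimal sketch (not taken from the project code; all sizes are illustrative) of how a single convolution kernel of height 3 slides over a matrix of word vectors and produces one value per three-word window, which is what makes it behave like a trigram feature detector:

import torch
import torch.nn as nn

embedding_dim = 128                                      # illustrative embedding size
x = torch.randn(1, 1, 10, embedding_dim)                 # (batch=1, channel=1, seq_len=10, embedding_dim)
conv = nn.Conv2d(1, 1, kernel_size=(3, embedding_dim))   # kernel spans 3 words and the full embedding width
out = conv(x)
print(out.shape)                                         # torch.Size([1, 1, 8, 1]): one value per 3-word window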

1. Data preprocessing

In natural language processing, word vectors are an unavoidable topic. I use the torchtext library to build the vocabulary and word vectors.

Word segmentation

import re
import jieba

def tokenizer(text):  # replace everything except Chinese characters, letters and digits with spaces, then segment with jieba
    regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]

The tokenizer uses the Chinese word segmentation library jieba to split the text and returns the resulting words as a list.
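
For example, feeding a short review through the tokenizer produces something like the following (the exact segmentation depends on jieba's dictionary, so the output shown is only indicative):

print(tokenizer('这款车的售后服务非常好！'))
# a possible result: ['这款', '车', '的', '售后服务', '非常', '好']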

Stop-word removal

def get_stop_words():
    stop_words = []
    with open('D:\\MyStudy\\program\\text-classification-master\\text-cnn\\data\\stopwords.txt', encoding='UTF-8') as file_object:
        for line in file_object.readlines():
            stop_words.append(line.strip())  # drop the trailing newline and surrounding whitespace
    return stop_words

The stop-word list is downloaded in advance; the function reads it and returns the stop words as a list.

Data loading and processing

from torchtext import data
from torchtext.vocab import Vectors

def load_data(args):
    print('Loading data...')
    stop_words = get_stop_words()  # load the stop-word table
    # If you need to fix the text length, pass fix_length:
    # text = data.Field(sequential=True, tokenize=tokenizer, fix_length=args.max_len, stop_words=stop_words)
    text = data.Field(sequential=True, lower=True, tokenize=tokenizer, stop_words=stop_words)
    label = data.Field(sequential=False)

    text.tokenize = tokenizer  # redundant, since tokenize is already set in the Field above
    train, val = data.TabularDataset.splits(
            path='D:\\MyStudy\\program\\text-classification-master\\text-cnn\\data\\',
            skip_header=True,
            train='train.tsv',
            validation='validation.tsv',
            format='tsv',
            fields=[('index', None), ('label', label), ('text', text)],
        )

    if args.static:
        # Change this to your own pretrained word vectors
        text.build_vocab(train, val, vectors=Vectors(name="data\\eco_article.vector"))
        args.embedding_dim = text.vocab.vectors.size()[-1]
        args.vectors = text.vocab.vectors
    else:
        text.build_vocab(train, val)

    label.build_vocab(train, val)

    train_iter, val_iter = data.Iterator.splits(
            (train, val),
            sort_key=lambda x: len(x.text),
            # batch_size for the training set; the whole validation set as a single batch for testing
            batch_sizes=(args.batch_size, len(val)),
            device=-1
    )
    args.vocab_size = len(text.vocab)
    args.label_num = len(label.vocab)
    return train_iter, val_iter


Torchtext is generally used in the following steps: 1. define Field objects with data.Field() and preset their parameters (here text and label are defined separately); 2. read the train and validation sets with data.TabularDataset.splits(); 3. build the vocabulary (and, optionally, word vectors) for the text and labels with text.build_vocab(train, val) and label.build_vocab(train, val); 4. split the data into batches with data.Iterator.splits().
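
As a quick sanity check (a sketch, assuming load_data has been called with your own args), you can pull one batch from the iterator and inspect its shapes; note that torchtext returns text batches as (max_len, batch_size), which is why the training code later transposes them:

train_iter, val_iter = load_data(args)
batch = next(iter(train_iter))
print(batch.text.shape)   # (max_len, batch_size) - transposed later with feature.t_()
print(batch.label.shape)  # (batch_size,)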

With that, data preprocessing is complete.

2. Model building

A CNN architecture is adopted. Built with PyTorch, the overall network consists of: an embedding layer, dimension reshaping, convolution layers, activation functions, pooling layers, multi-channel feature concatenation, a dropout layer, and a fully connected layer.

Embedding layer

The embedding layer takes the vocabulary size and the embedding dimension as its parameters.
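
As a quick illustration with made-up sizes (not part of the model code), nn.Embedding maps each word index to a dense vector:

import torch
import torch.nn as nn

emb = nn.Embedding(5000, 128)           # vocab_size=5000, embedding_dim=128
ids = torch.LongTensor([[2, 17, 356]])  # a batch of one sentence with 3 word indices
print(emb(ids).shape)                   # torch.Size([1, 3, 128])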

Convolution layer

Reshape the output of the embedding layer to the dimensions expected by the convolution layer, and store the parallel convolution layers (one per kernel size) with self.convs = nn.ModuleList([nn.Conv2d(...) for fsz in filter_sizes]), which holds a list of convolution layers.

The activation function

x = [F.relu(conv(x)) for conv in self.convs] introduces nonlinearity.

Pooling, down-sampling

Multi-channel feature extraction and combination

x = [x_item.view(x_item.size(0), -1) for x_item in x] flattens the output of each convolution kernel.

Dropout prevents overfitting

Fully connected layer output

The model building part of the code is as follows

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    # Multi-channel TextCNN
    def __init__(self, args):
        super(TextCNN, self).__init__()
        self.args = args

        label_num = args.label_num    # number of labels
        filter_num = args.filter_num  # number of convolution kernels per size
        filter_sizes = [int(fsz) for fsz in args.filter_sizes.split(',')]
        vocab_size = args.vocab_size
        embedding_dim = args.embedding_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if args.static:  # if pretrained word vectors are used, load them here; set freeze=True when no fine-tuning is needed
            self.embedding = self.embedding.from_pretrained(args.vectors, freeze=not args.fine_tune)

        self.convs = nn.ModuleList(
            [nn.Conv2d(1, filter_num, (fsz, embedding_dim)) for fsz in filter_sizes])
        self.dropout = nn.Dropout(args.dropout)
        self.linear = nn.Linear(len(filter_sizes) * filter_num, label_num)

    def forward(self, x):
        # x has dimension (batch_size, max_len); max_len can be fixed via torchtext
        # or defaults to the length of the longest sample in the batch
        x = self.embedding(x)  # after embedding, x has dimension (batch_size, max_len, embedding_dim)

        # reshape to (batch_size, input_channel=1, w=max_len, h=embedding_dim)
        x = x.view(x.size(0), 1, x.size(1), self.args.embedding_dim)

        # after convolution, each element of x has dimension (batch_size, out_channel, w, h=1)
        x = [F.relu(conv(x)) for conv in self.convs]

        # after max pooling, the dimension becomes (batch_size, out_channel, w=1, h=1)
        x = [F.max_pool2d(input=x_item, kernel_size=(x_item.size(2), x_item.size(3))) for x_item in x]

        # flatten each (batch_size, out_channel, w=1, h=1) result to (batch_size, out_channel)
        x = [x_item.view(x_item.size(0), -1) for x_item in x]

        # concatenate the features extracted by the different kernels; dimension becomes (batch_size, sum(out_channel * w * h))
        x = torch.cat(x, 1)

        # dropout layer
        x = self.dropout(x)

        # fully connected layer
        logits = self.linear(x)
        return logits

3. Model training and optimization

Once the model is defined, we move on to training; first, the hyperparameters are set.

import argparse

parser = argparse.ArgumentParser(description='TextCNN text classifier')

parser.add_argument('-lr', type=float, default=0.001, help='Learning rate')
parser.add_argument('-batch-size', type=int, default=128)
parser.add_argument('-epoch', type=int, default=20)
parser.add_argument('-filter-num', type=int, default=200, help='Number of convolution kernels per size')
parser.add_argument('-filter-sizes', type=str, default='3,4,5', help='Sizes of the different convolution kernels')
parser.add_argument('-embedding-dim', type=int, default=128, help='Dimension of the word vectors')
parser.add_argument('-dropout', type=float, default=0.4)
parser.add_argument('-label-num', type=int, default=2, help='Number of labels')
parser.add_argument('-static', type=bool, default=False, help='Whether to use pretrained word vectors')
parser.add_argument('-fine-tune', type=bool, default=True, help='Whether to fine-tune the pretrained word vectors')
parser.add_argument('-cuda', type=bool, default=False)
parser.add_argument('-log-interval', type=int, default=1, help='Log the training status every this many iterations')
parser.add_argument('-test-interval', type=int, default=100, help='Evaluate on the validation set every this many iterations')
parser.add_argument('-early-stopping', type=int, default=1000, help='Number of iterations without improvement before early stopping')
parser.add_argument('-save-best', type=bool, default=True, help='Whether to save the model when accuracy improves')
parser.add_argument('-save-dir', type=str, default='model_dir', help='Directory in which to store the trained model')

args = parser.parse_args()

import sys
import torch
import torch.nn.functional as F

import data_processor  # the module containing load_data above

def train(args):
    train_iter, dev_iter = data_processor.load_data(args)  # split the data into training and validation sets
    print('Loading data completed')
    model = TextCNN(args)
    if args.cuda: model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
    steps = 0
    best_acc = 0
    last_step = 0
    model.train()
    for epoch in range(1, args.epoch + 1):
        for batch in train_iter:
            feature, target = batch.text, batch.label
            # t_() transposes (max_len, batch_size) to (batch_size, max_len)
            with torch.no_grad():
                feature.t_()
                target.sub_(1)  # shift labels from 1..n to 0..n-1
            if args.cuda:
                feature, target = feature.cuda(), target.cuda()
            optimizer.zero_grad()
            logits = model(feature)
            loss = F.cross_entropy(logits, target)
            loss.backward()
            optimizer.step()
            steps += 1
            if steps % args.log_interval == 0:
                # torch.max(logits, 1)[1] returns the column index of the largest element in each row (the predicted class)
                corrects = (torch.max(logits, 1)[1] == target).sum()
                train_acc = 100.0 * corrects / batch.batch_size
                sys.stdout.write(
                    '\rBatch[{}] - loss: {:.6f} acc: {:.4f}%({}/{})'.format(steps,
                                                                             loss.item(),
                                                                             train_acc,
                                                                             corrects,
                                                                             batch.batch_size))
            if steps % args.test_interval == 0:
                dev_acc = eval(dev_iter, model, args)
                if dev_acc > best_acc:
                    best_acc = dev_acc
                    last_step = steps
                    if args.save_best:
                        print('Saving best model, acc: {:.4f}%\n'.format(best_acc))
                        save(model, args.save_dir, 'best', steps)
                else:
                    if steps - last_step >= args.early_stopping:
                        print('\nearly stop by {} steps, acc: {:.4f}%'.format(args.early_stopping, best_acc))
                        raise KeyboardInterrupt


The training process starts by instantiating the model, then defining the optimizer (I used Adam), followed by the basic operations of a PyTorch training loop:

for epoch in range(epoch_num):
    for batch in batches:
        optimizer.zero_grad()                    # clear the gradients
        logits = model(feature)
        loss = F.cross_entropy(logits, targets)  # cross-entropy loss
        loss.backward()                          # backpropagation
        optimizer.step()                         # update the parameters
        steps += 1

Evaluation on the validation set is similar to the training process.

def eval(data_iter, model, args):
    corrects, avg_loss = 0, 0
    for batch in data_iter:
        feature, target = batch.text, batch.label
        with torch.no_grad():
            feature.t_()
            target.sub_(1)
        if args.cuda:
            feature, target = feature.cuda(), target.cuda()
        logits = model(feature)
        loss = F.cross_entropy(logits, target)
        avg_loss += loss.item()
        corrects += (torch.max(logits, 1)
                     [1].view(target.size()) == target).sum()
    size = len(data_iter.dataset)
    avg_loss /= size
    accuracy = 100.0 * corrects / size
    print('\nEvaluation - loss: {:.6f} acc: {:.4f}%({}/{}) \n'.format(avg_loss,
                                                                       accuracy,
                                                                       corrects,
                                                                       size))
    return accuracy

import os

def save(model, save_dir, save_prefix, steps):
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    save_prefix = os.path.join(save_dir, save_prefix)
    save_path = '{}_steps_{}.pt'.format(save_prefix, steps)
    torch.save(model.state_dict(), save_path)

train(args)



After training, the accuracy on the validation set reaches about 90%.
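
For completeness, here is a minimal inference sketch showing how a single review could be classified with the saved model. It is not part of the original code: it assumes you kept references to the text and label Field objects built in load_data, that args matches the training configuration, and the checkpoint file name is only a placeholder; the input should contain at least as many tokens as the largest kernel size.

def predict(sentence, model, text_field, label_field):
    model.eval()
    tokens = tokenizer(sentence)
    indices = [text_field.vocab.stoi[t] for t in tokens]  # map words to vocabulary indices
    x = torch.tensor(indices).unsqueeze(0)                # shape (1, seq_len)
    with torch.no_grad():
        logits = model(x)
    pred = torch.max(logits, 1)[1].item()
    return label_field.vocab.itos[pred + 1]               # +1 undoes the target.sub_(1) shift used in training

model = TextCNN(args)
model.load_state_dict(torch.load('model_dir/best_steps_100.pt'))  # placeholder checkpoint name
print(predict('这款车的售后服务非常好，很满意', model, text, label))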