Author | Dr. Vaibhav Kumar | Compiled by | Vitamin K | Source | Analytics India Magazine

Text classification is one of the most important applications of natural language processing. There are several ways to classify text in machine learning, but most of these techniques require heavy preprocessing and considerable computing resources. In this article, we use PyTorch for multi-class text classification because it offers the following advantages:

  • PyTorch provides a powerful way to implement complex model architectures and algorithms with relatively little preprocessing and relatively low consumption of computational resources, including execution time.
  • The basic unit of PyTorch is the tensor, which brings the benefit of changing the architecture at run time and distributing training across GPUs.
  • PyTorch provides a powerful library called TorchText that contains scripts for preprocessing text and source code for loading some popular NLP datasets.

In this article, we will demonstrate multi-class text classification using TorchText, a powerful natural language processing library in PyTorch.

For this classification, a model consisting of an EmbeddingBag layer and a linear layer will be used. The EmbeddingBag handles variable-length text entries by calculating the average of the embedded values.
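As a minimal standalone illustration (not part of the final model), nn.EmbeddingBag in its default 'mean' mode takes a flat tensor of token ids together with an offsets tensor marking where each sequence starts, and returns one averaged embedding vector per sequence:

import torch
import torch.nn as nn

# Toy EmbeddingBag: vocabulary of 10 tokens, 3-dimensional embeddings.
# The default mode="mean" averages the embeddings within each bag (sequence).
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3)

# Two sequences of different lengths concatenated into one flat tensor:
# sequence 1 -> [1, 2, 4, 5], sequence 2 -> [4, 3]
text = torch.tensor([1, 2, 4, 5, 4, 3])
offsets = torch.tensor([0, 4])   # start index of each sequence

print(bag(text, offsets).shape)  # torch.Size([2, 3]) - one vector per sequence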

This model will be trained on the DBpedia data set, where text belongs to 14 classes. After successful training, the model predicts the class label of the input text.

DBpedia dataset

DBpedia is a popular benchmark dataset in the field of natural language processing. It contains texts in 14 categories, such as companies, educational institutions, artists, films, etc.

It is a structured collection of content extracted from the information created by the Wikipedia project. The DBpedia dataset provided by TorchText has 630,000 text instances belonging to 14 classes: 560,000 training instances and 70,000 test instances.

Text classification with TorchText

First, we need to install TorchText version 0.4, which provides the text_classification datasets used below.

!pip install torchtext==0.4

After that, we’ll import all the necessary libraries.

import torch
import torchtext
from torchtext.datasets import text_classification
import os
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import time
from torch.utils.data.dataset import random_split
import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

In the next step, we will define the n-grams and the batch size. The n-grams feature is used to capture important information about local word order.

We are using bigrams, so the sample text in the dataset will be a list of single words plus the bigram strings.

NGRAMS = 2
BATCH_SIZE = 16
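To make the bigram features concrete, here is a small sketch of what ngrams_iterator produces for a short example sentence, using the basic_english tokenizer imported above:

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("PyTorch makes text classification easy")
print(list(ngrams_iterator(tokens, NGRAMS)))
# ['pytorch', 'makes', 'text', 'classification', 'easy',
#  'pytorch makes', 'makes text', 'text classification', 'classification easy']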

Now we will read the DBpedia data set provided by TorchText.

if not os.path.isdir('./.data'):
    os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['DBpedia'](
    root='./.data', ngrams=NGRAMS, vocab=None)

After downloading the dataset, we will verify the length and number of labels of the downloaded dataset.

print(len(train_dataset))
print(len(test_dataset))

print(len(train_dataset.get_labels()))
print(len(test_dataset.get_labels()))
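Each element of these datasets is a (label, token-id tensor) pair, so we can also peek at a single sample; the exact tensor values depend on the vocabulary that was built:

label, token_ids = train_dataset[0]
print(label)           # an integer class label in the range 0-13
print(token_ids[:10])  # ids of the first ten tokens/n-grams of the sample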

We will use the CUDA architecture to speed up execution if a GPU is available.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In the next step, we will define the classification model.

class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

Now we will initialize the hyperparameters, instantiate the model, and define the function that generates training batches.

VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
print(model)

def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label
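The offsets tensor produced by generate_batch simply records the starting position of every sample inside the concatenated text tensor. A minimal sketch with a hypothetical batch of three short samples (made-up labels and token ids):

batch = [(3, torch.tensor([11, 12, 13])),
         (7, torch.tensor([14, 15])),
         (0, torch.tensor([16, 17, 18, 19]))]

text, offsets, label = generate_batch(batch)
print(text)     # tensor([11, 12, 13, 14, 15, 16, 17, 18, 19])
print(offsets)  # tensor([0, 3, 5]) - where each sample starts
print(label)    # tensor([3, 7, 0])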

In the next step, we will define functions to train and test the model.

def train_func(sub_train_):
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate once per epoch
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            batch_loss = criterion(output, cls)
            loss += batch_loss.item()
            acc += (output.argmax(1) == cls).sum().item()
    return loss / len(data_), acc / len(data_)

We will train the model for five epochs.

N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs // 60
    secs = secs % 60

    print('Epoch: %d' % (epoch + 1), " | time in %d minutes, %d seconds" % (mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
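Note that StepLR(optimizer, 1, gamma=0.9) multiplies the learning rate by 0.9 after every epoch, so the initial learning rate of 4.0 decays to roughly 4.0 × 0.9^4 ≈ 2.6 by the final epoch.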

Next, we will test our model on the test data set and check the accuracy of the model.

print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Now we will test our model on individual text strings and predict the class label of each given text.

DBpedia_label = {1: 'Company',
                2: 'EducationalInstitution',
                3: 'Artist',
                4: 'Athlete',
                5: 'OfficeHolder',
                6: 'MeanOfTransportation',
                7: 'Building',
                8: 'NaturalPlace',
                9: 'Village',
                10: 'Animal',
                11: 'Plant',
                12: 'Album',
                13: 'Film',
                14: 'WrittenWork'}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1
vocab = train_dataset.get_vocab()
model = model.to("cpu")

Now, we will randomly extract some text from the test data and examine the predicted class labels.

The first prediction:

ex_text_str = "Brekke Church (Norwegian: Brekke kyrkje) is a parish church in Gulen Municipality in Sogn og Fjordane county, Norway. It is located in the village of Brekke. The church is part of the Brekke parish in the Nordhordland deanery in The Diocese of BjA Rgvin. The White, wooden Church, which has 390 seats was consecrated on 19 November 1862 by the local Dean Thomas Erichsen. The architect Christian Henrik Grosch made the designs for the church, which is the third church on the site." print("This is a %s news" %DBpedia_label[predict(ex_text_str, model, vocab, 2)])Copy the code

Second prediction:

ex_text_str2 = "Cerithiella superba is a species of very small sea snail, a marine gastropod mollusk in the family Newtoniellidae. This species is known from European waters. It was described by  Thiele, 1912." print("This text belongs to %s class" %DBpedia_label[predict(ex_text_str2, model, vocab, 2)])Copy the code

Third prediction:

ex_text_str3 = "  Nithari is a village in the western part of the state of Uttar Pradesh India bordering on New Delhi. Nithari forms part of the New Okhla Industrial Development Authority's planned industrial city Noida falling in Sector 31. Nithari made international news headlines in December 2006 when the skeletons of a number of apparently murdered women and children were unearthed in the village."

print("This text belongs to %s class" %DBpedia_label[predict(ex_text_str3, model, vocab, 2)])Copy the code

In this way, we have implemented multi-class text classification using TorchText.

This is a simple and easy way to classify text with very little preprocessing using the PyTorch library. Training the model for five epochs on the DBpedia training instances takes only a few minutes on a GPU.

Rerun the code with NGRAMS changed from 2 to 3 and see whether the results improve. The same can be done with the other datasets provided by TorchText.

References:

  1. ‘Text Classification with TorchText’, PyTorch Tutorial
  2. Allen Nie, ‘A Tutorial on TorchText’

The original link: analyticsindiamag.com/multi-class…
