Classification of papers

Mission statement

  • Study topic: paper classification (a data modeling task): build a model from existing data and use it to classify new papers;
  • Content: use the title of the paper to complete the category classification;
  • Learning outcomes: the basic methods of text classification, such as TF-IDF.

Data processing steps

In the original arXiv data, each paper has a corresponding category, which is filled in by the author. In this task, we can use the title and abstract of each paper to complete the following steps:

  • Process the title and abstract of the paper;
  • Process the paper categories;
  • Build a text classification model;

Approaches to text classification

  • Idea 1: TF-IDF + machine learning classifier

    • Use TF-IDF directly to extract text features, then feed them to a classifier. SVM, LR, XGBoost, etc. can all serve as the classifier.
  • Idea 2: FastText

    • FastText is an entry-level word-vector approach; the FastText tool provided by Facebook lets you build a classifier quickly (a minimal sketch follows this list).
  • Idea 3: Word2Vec + deep learning classifier

    • Word2Vec is a more advanced word-vector approach; classification is done by building a deep learning classifier. The network structure can be TextCNN, TextRNN or BiLSTM.
  • Idea 4: BERT word vectors

    • BERT provides high-end word vectors with strong modeling and learning ability.
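
As a quick illustration of Idea 2, here is a minimal, untuned sketch using the fasttext package. It assumes a hypothetical file train.txt in FastText's supervised format (one paper per line, prefixed with "__label__<category>"):

import fasttext

# train.txt is a hypothetical, pre-built training file in FastText's "__label__" format
model = fasttext.train_supervised(input='train.txt', epoch=25, wordNgrams=2)

# Predict the category of a new title (returns labels and their probabilities)
print(model.predict('a survey of deep learning methods for text classification'))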

Specific code implementation and explanation

To help you get started with text classification, we walk through Idea 1 and Idea 2 in code. First, read in the required fields:


# Import the required packages
import seaborn as sns # For drawing
from bs4 import BeautifulSoup # used to crawl arxiv data
import re # Used in regular expressions to match the pattern of strings
import requests # For network connection, send network request, use domain name to get corresponding information
import json # Read data, our data is in JSON format
import pandas as pd # Data processing, data analysis
import matplotlib.pyplot as plt # Drawing tool
def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed'], count=None):
    """
    Read the metadata file.
    path: file path
    columns: columns to select
    count: number of rows to read
    """
    
    data  = []
    with open(path, 'r') as f: 
        for idx, line in enumerate(f): 
            if idx == count:
                break
                
            d = json.loads(line)
            d = {col : d[col] for col in columns}
            data.append(d)

    data = pd.DataFrame(data)
    return data

data = readArxivFile('arxiv-metadata-oai-2020.json', ['id', 'title', 'categories', 'abstract'], 200000)
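
Assuming the metadata file exists at that path, a quick sanity check on what was loaded helps catch path or format problems early:

# Quick check of the loaded DataFrame
print(data.shape)    # (200000, 4) if the file contains at least 200,000 records
print(data.columns)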

To simplify data processing, we concatenate the title and abstract and perform classification on the combined text.

data['text'] = data['title'] + data['abstract']

data['text'] = data['text'].apply(lambda x: x.replace('\n', ' '))
data['text'] = data['text'].apply(lambda x: x.lower())
data = data.drop(['abstract', 'title'], axis=1)

Since a paper may belong to multiple categories, the category field also needs to be processed:

# Multiple categories, including subcategories
data['categories'] = data['categories'].apply(lambda x : x.split(' '))

# Major categories only, subcategories stripped
data['categories_big'] = data['categories'].apply(lambda x : [xx.split('.')[0] for xx in x])
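
To make the two new columns concrete: a hypothetical raw value such as 'cs.CL stat.ML' becomes ['cs.CL', 'stat.ML'] in categories and ['cs', 'stat'] in categories_big. A quick look at the result:

# Example (hypothetical value): 'cs.CL stat.ML' -> ['cs.CL', 'stat.ML'] -> ['cs', 'stat']
print(data[['categories', 'categories_big']].head())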

Then encode the categories; since a paper can have several, multi-label encoding is required:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_big'].iloc[:])
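
As a quick check on the encoding (assuming the code above ran as shown), you can inspect which major categories were found and the shape of the binary label matrix:

print(mlb.classes_)      # names of the major categories, e.g. 'cs', 'math', ...
print(data_label.shape)  # (number of papers, number of major categories)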

Idea 1

Idea 1: use TF-IDF to extract features, keeping at most 4,000 words in the vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
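
data_tfidf is a sparse matrix with one row per paper and at most 4,000 columns; a short check, assuming the code above:

print(data_tfidf.shape)             # (number of papers, up to 4000)
print(len(vectorizer.vocabulary_))  # number of words actually kept by TF-IDF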

Since this is multi-label classification, we can use sklearn's multi-label wrapper around a base classifier:

# Divide training set and verification set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,
                                                    test_size=0.2, random_state=1)

# Build a multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(x_test)))
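
Idea 1 also mentions SVM and LR as candidate classifiers. Because MultiOutputClassifier only wraps a base estimator, swapping one in is a one-line change; a minimal, untuned sketch using a linear SVM:

from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC

# Same TF-IDF features and labels, different base classifier
clf_svm = MultiOutputClassifier(LinearSVC()).fit(x_train, y_train)
print(classification_report(y_test, clf_svm.predict(x_test)))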

Idea 2

Idea 2: use a deep learning model in which the words are embedded and then trained. First split the data, then encode the texts and truncate/pad them to a fixed length:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['text'].iloc[:100000],
                                                    data_label[:100000],
                                                    test_size=0.95, random_state=1)

# Parameters
max_features = 500
max_len = 150
embed_size = 100
batch_size = 128
epochs = 5

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokens = Tokenizer(num_words = max_features)
tokens.fit_on_texts(list(data['text'].iloc[:100000]))

y_train = data_label[:100000]
x_sub_train = tokens.texts_to_sequences(data['text'].iloc[:100000])
x_sub_train = sequence.pad_sequences(x_sub_train, maxlen=max_len)
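
To see what the tokenizer and padding do, you can run a single made-up title through the same steps:

# Hypothetical example; words outside the top-500 vocabulary are dropped by the Tokenizer
example = tokens.texts_to_sequences(['deep learning for paper classification'])
example = sequence.pad_sequences(example, maxlen=max_len)
print(example.shape)  # (1, 150): one sequence, padded/truncated to max_len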

Define the model and complete the training:


# Bidirectional GRU + Conv1D model
# Keras layers:
from keras.layers import Dense,Input,LSTM,Bidirectional,Activation,Conv1D,GRU
from keras.layers import Dropout,Embedding,GlobalMaxPooling1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D
# Keras callback functions:
from keras.callbacks import Callback
from keras.callbacks import EarlyStopping,ModelCheckpoint
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras.models import Model
from keras.optimizers import Adam

sequence_input = Input(shape=(max_len, ))
x = Embedding(max_features, embed_size, trainable=True)(sequence_input)
x = SpatialDropout1D(0.2)(x)
x = Bidirectional(GRU(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool]) 
preds = Dense(19, activation="sigmoid")(x)  # 19 output units: one per major category (should equal data_label.shape[1])

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=['accuracy'])
model.fit(x_sub_train, y_train, 
          batch_size=batch_size, 
          validation_split=0.2,
          epochs=epochs)
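
Since the output layer is a sigmoid over the category columns, model.predict returns per-category probabilities. A common way to turn them into multi-label predictions (a sketch, assuming the model and the mlb fitted above) is to threshold at 0.5:

# Illustrative only: predictions on a few training sequences; a held-out set would be better
pred_prob = model.predict(x_sub_train[:10])
pred_label = (pred_prob > 0.5).astype(int)

# Map the 0/1 vectors back to category names with the fitted MultiLabelBinarizer
print(mlb.inverse_transform(pred_label))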