In the previous article, a clustering algorithm was used, without any prior assumption, to examine the authorship of A Dream of Red Mansions. In this article, we use Naive Bayes classification, a supervised learning method, and assume that the last forty chapters were not written by Cao Xueqin, in order to test that assumption.

A good introduction to Bayes that I have read: A Simple Explanation of Naive Bayes Classification.
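In one line, a Naive Bayes classifier picks the class with the largest posterior probability under the "naive" assumption that features are independent given the class:

$$\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

Here P(c) is the prior of class c and P(x_i | c) is the per-feature likelihood; GaussianNB, used below, models each P(x_i | c) as a normal distribution.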

The basic process

Let’s start from the assumption that the first 80 chapters and the last 40 were not written by the same person. Using the word vectors constructed in the previous article, we draw a stratified random sample from the data set, for example 60 of the first 80 chapters and 30 of the last 40, train on the sample, and then predict on the rest. If the predictions are good, our assumption holds. As a control test, we use only the first 80 chapters, label the first 40 as 0 and the second 40 as 1, and repeat the exercise. Since the same person wrote both halves, prediction should become harder; if the accuracy is indeed lower than in the previous step, the hypothesis is supported from another angle.

The preparatory work

import os
import numpy as np
import pandas as pd
import re
import sys  
import matplotlib.pyplot as plt

text = pd.read_csv("./datasets/hongloumeng.csv")
import jieba
import jieba.analyse

# Extract the top keywords of each chapter with jieba, then join them with
# spaces so that CountVectorizer can tokenize them
vorc = [jieba.analyse.extract_tags(i, topK=1000) for i in text["text"]]
vorc = [" ".join(i) for i in vorc]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)  # keep the 5000 most frequent terms
train_data_features = vectorizer.fit_transform(vorc)

train_data_features = train_data_features.toarray()

train_data_features.shape

This is the preparatory work, generating word vectors.
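As a quick sanity check (a hypothetical extra step, assuming the CSV holds one row per chapter, 120 in total), you can look at the shape of the matrix and a few of the terms the vectorizer kept:

# Hypothetical inspection step: one row per chapter, one column per term
print(train_data_features.shape)  # expected: (120, 5000) at most
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print(vectorizer.get_feature_names_out()[:20])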

Generating the labels

labels = np.array([0] * 80 + [1] * 40)  # target: 0 for the first 80 chapters, 1 for the last 40
labels.shape

This step generates the target labels.
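A quick check (not in the original) that the labels match the 80/40 split:

# Each class count should match the number of chapters in that part
print(np.bincount(labels))  # expected: [80 40]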

Stratified random sampling

# Stratified sampling
from sklearn.model_selection import train_test_split
# train_data_features = train_data_features[0:80]
X_train, X_test, Y_train, Y_test = train_test_split(train_data_features, labels, 
                                                    test_size = 0.2, stratify=labels)

Not much to explain here: stratify=labels means the random split preserves the proportion of each target category.
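If you want to see the effect, compare the class counts after the split; with stratify=labels the 2:1 ratio between the two parts should carry over to both subsets (the numbers below assume test_size=0.2 on 120 chapters):

# The 80:40 ratio should be preserved in both subsets
print(np.bincount(Y_train))  # e.g. [64 32]
print(np.bincount(Y_test))   # e.g. [16  8]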

Model training and prediction

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, Y_train)

y_pred = gnb.predict(X_test)
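A side note on the model choice: GaussianNB assumes continuous, roughly normal features, while a word-count matrix is discrete, so MultinomialNB is usually the more natural Naive Bayes variant for this kind of data. A minimal sketch if you want to compare the two (a suggestion, not part of the original experiment):

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()            # multinomial Naive Bayes for word-count features
mnb.fit(X_train, Y_train)
y_pred_mnb = mnb.predict(X_test)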

Calculate the accuracy of the GaussianNB predictions:

from sklearn.metrics import accuracy_score
accuracy_score(Y_test, y_pred)
# 0.875
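Accuracy alone does not show which side gets misclassified; a confusion matrix (a hypothetical extra check) makes that visible:

from sklearn.metrics import confusion_matrix

# Rows are the true classes (0 = first 80 chapters, 1 = last 40), columns are predictions
print(confusion_matrix(Y_test, y_pred))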

Stratified cross-validation

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)  # 10 splits, 20% test each time
scores = []
for train_index, test_index in sss.split(train_data_features, labels):
    X_train, X_test = train_data_features[train_index], train_data_features[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
    gnb = GaussianNB()
    gnb.fit(X_train, Y_train)
    Y_pred = gnb.predict(X_test)
    scores.append(accuracy_score(Y_test, Y_pred))
print(scores)
print(np.array(scores).mean())

The following output is displayed:

[0.9166666666666666, 0.8333333333333334, 1.0, 0.875, 0.75, 0.8333333333333334, 0.8333333333333334, 0.9583333333333334, 0.875, 0.8333333333333334]
0.8708333333333333

With an average accuracy of about 0.87 across the ten splits, the classifier separates the two parts well, which preliminarily supports the hypothesis stated at the beginning of the article.

Control cross-validation (first 80 chapters only)

labels_val = np.array([0] * 40 + [1] * 40)  # target: chapters 1-40 vs. chapters 41-80
sss_val = StratifiedShuffleSplit(n_splits=5, test_size=0.2)  # 5 splits, 20% test, 80% train
scores = []
train_data_features_val = train_data_features[0:80]
for train_index, test_index in sss_val.split(train_data_features_val, labels_val):
    X_train, X_test = train_data_features_val[train_index], train_data_features_val[test_index]
    Y_train, Y_test = labels_val[train_index], labels_val[test_index]
    gnb = GaussianNB()
    gnb.fit(X_train, Y_train)
    Y_pred = gnb.predict(X_test)
    scores.append(accuracy_score(Y_test, Y_pred))
print(scores)
print(np.array(scores).mean())

The following output is displayed:

[0.8125, 0.875, 0.75, 0.875, 0.75]
0.8125

Repeating the calculation and averaging, the control experiment scores about 0.81, consistently lower than the roughly 0.87 of the first experiment, which again supports the hypothesis stated at the beginning of the article.

Full code: Github