In the previous article, a clustering algorithm was used to investigate the authorship of A Dream of Red Mansions without making any assumptions. In this article, we use Bayesian classification, a supervised learning method, and start from the assumption that the last forty chapters were not written by Cao Xueqin, then try to verify it.
A good explanation of Bayes that I have read: A simple explanation of Naive Bayes Classification
The basic process
We start from the assumption that the first 80 chapters and the last 40 chapters were not written by the same person. Using the word vectors constructed in the previous article, we apply stratified random sampling to the data set: for example, 60 of the first 80 chapters and 30 of the last 40 are randomly selected for training, and predictions are then made on the rest. If the predictions are good, that supports the assumption. As a follow-up test, we use only the first 80 chapters, labeling the first 40 as 0 and the next 40 as 1, and repeat the same exercise. Since these chapters were written by the same person, prediction should be harder; if the accuracy is lower than in the previous step, that indirectly supports the hypothesis.
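The reasoning behind the two experiments can be sketched on synthetic data. This is only an illustration under stated assumptions: the Poisson-distributed counts are made-up stand-ins for the real word vectors, and the names `X_diff`, `X_same`, and `mean_cv_accuracy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Experiment 1 stand-in: 80 + 40 samples drawn from two DIFFERENT distributions
X_diff = np.vstack([rng.poisson(5.0, size=(80, 50)),
                    rng.poisson(7.0, size=(40, 50))]).astype(float)
y_diff = np.array([0] * 80 + [1] * 40)

# Experiment 2 stand-in: 80 samples from ONE distribution, labeled 40/40 by position
X_same = rng.poisson(5.0, size=(80, 50)).astype(float)
y_same = np.array([0] * 40 + [1] * 40)

def mean_cv_accuracy(X, y):
    # Mean accuracy of Gaussian Naive Bayes over 5 stratified folds
    return cross_val_score(GaussianNB(), X, y, cv=5).mean()

acc_diff = mean_cv_accuracy(X_diff, y_diff)
acc_same = mean_cv_accuracy(X_same, y_same)
print(acc_diff, acc_same)
```

On data this clean, the first accuracy should be close to 1 and the second close to chance (0.5). Real chapters are far noisier, so the gap between the two experiments is expected to be smaller.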
The preparatory work
import os
import numpy as np
import pandas as pd
import re
import sys
import matplotlib.pyplot as plt
text = pd.read_csv("./datasets/hongloumeng.csv")
import jieba
import jieba.analyse
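# Extract up to 1000 keywords per chapter
vorc = [jieba.analyse.extract_tags(i, topK=1000) for i in text["text"]]
# Join the keywords with spaces so that CountVectorizer can split them into tokens
vorc = [" ".join(i) for i in vorc]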
from sklearn.feature_extraction.text import CountVectorizer
vertorizer = CountVectorizer(max_features=5000)
train_data_features = vertorizer.fit_transform(vorc)
train_data_features = train_data_features.toarray()
train_data_features.shape
This preparatory step generates the word vectors: each chapter becomes one row of keyword counts.
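As a small illustration of what `CountVectorizer` produces, here is a sketch on a toy corpus. The two strings are invented stand-ins for the keyword strings of two chapters, not data from the novel. It also shows that the tokens need separators, since the vectorizer splits on word boundaries.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the space-joined keyword strings of two chapters (hypothetical data)
docs = ["baoyu daiyu garden", "daiyu poem garden garden"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

print(vectorizer.vocabulary_)   # maps each token to its column index
print(counts)                   # one row per document, one column per token

# Without separators the whole string is treated as a single token
unsplit = CountVectorizer().fit(["baoyudaiyugarden"])
print(len(unsplit.vocabulary_))
```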
Generating the labels
labels = np.array([[0] * 80 + [1] * 40]).reshape(-1 ,1) # the target
labels.shape
This step generates the target labels: 0 for the first 80 chapters and 1 for the last 40.
Stratified random sampling
# Stratified sampling
from sklearn.model_selection import train_test_split
# train_data_features = train_data_features[0:80]
X_train, X_test, Y_train, Y_test = train_test_split(
    train_data_features, labels, test_size=0.2, stratify=labels)
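Not much needs explaining here: stratify=labels means the random split is made within each target class, so the training and test sets keep the same class proportions as the full data set.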
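The effect of `stratify` can be checked by counting the classes on each side of the split. A minimal sketch, assuming placeholder features (only the 80/40 label layout matches the article) and a fixed `random_state` added for repeatability:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((120, 3))                 # placeholder features, only the shape matters
y = np.array([0] * 80 + [1] * 40)      # same 80/40 class layout as the labels above

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

train_counts = np.bincount(y_tr)       # samples per class in the training set
test_counts = np.bincount(y_te)        # samples per class in the test set
print(train_counts, test_counts)
```

Both sides keep the 2:1 ratio of the full data set: 64 and 32 chapters for training, 16 and 8 for testing.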
Model training and prediction
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
gnb = GaussianNB()
# Flatten the (n, 1) label column to the 1-D shape scikit-learn expects
gnb.fit(X_train, Y_train.ravel())
y_pred = gnb.predict(X_test)
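To make clearer what `GaussianNB` computes, the decision rule can be written out by hand: for each class, add the log prior to the sum of per-feature Gaussian log-likelihoods, then pick the class with the largest total. The sketch below uses synthetic data and a hypothetical helper `gaussian_nb_predict`; the small `eps` term is meant to mirror scikit-learn's default `var_smoothing` of 1e-9.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic stand-in data: two classes with different feature means
X = np.vstack([rng.normal(0.0, 1.0, size=(80, 5)),
               rng.normal(2.0, 1.0, size=(40, 5))])
y = np.array([0] * 80 + [1] * 40)

def gaussian_nb_predict(X_train, y_train, X_new):
    classes = np.unique(y_train)
    # Variance smoothing, scaled by the largest feature variance
    eps = 1e-9 * X_train.var(axis=0).max()
    log_scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + eps
        log_prior = np.log(len(Xc) / len(X_train))
        # Sum of per-feature Gaussian log densities (features treated as independent)
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1)
        log_scores.append(log_prior + log_lik)
    # Pick, for each sample, the class with the highest log posterior score
    return classes[np.argmax(np.array(log_scores), axis=0)]

manual_pred = gaussian_nb_predict(X, y, X)
sklearn_pred = GaussianNB().fit(X, y).predict(X)
print((manual_pred == sklearn_pred).mean())
```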
To calculate the accuracy of the predicted value:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, y_pred)
# 0.875
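With test_size=0.2 and 120 chapters, the test set holds 24 chapters, so the accuracy can only move in steps of 1/24; a score of 0.875 corresponds to 21 correct predictions. The labels below are invented purely to illustrate that arithmetic.

```python
from sklearn.metrics import accuracy_score

# Hypothetical test set of 24 chapters: 16 labeled 0 and 8 labeled 1
y_true = [0] * 16 + [1] * 8
# Hypothetical predictions with 3 mistakes (two in class 0, one in class 1)
y_hat = [0] * 14 + [1] * 2 + [1] * 7 + [0] * 1

acc = accuracy_score(y_true, y_hat)   # fraction of positions where the two lists agree
print(acc)
```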
Stratified cross-validation
from sklearn.model_selection import StratifiedShuffleSplit
# 10 random stratified splits, each holding out 20% of the chapters for testing
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
scores = []
for train_index, test_index in sss.split(train_data_features, labels):
    X_train, X_test = train_data_features[train_index], train_data_features[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
    gnb = GaussianNB()
    gnb.fit(X_train, Y_train.ravel())
    Y_pred = gnb.predict(X_test)
    scores.append(accuracy_score(Y_test, Y_pred))
print(scores)
print(np.array(scores).mean())
The following output is displayed:
[0.91666666666666, 0.8333333333333334, 1.0, 0.875, 0.75, 0.8333333333333334, 0.8333333333333334, 0.9583333333333334, 0.875, 0.8333333333333334]
0.8708333333333333
With a mean accuracy of about 0.87, these results give preliminary support to the hypothesis stated at the beginning of the article.
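Because `StratifiedShuffleSplit` draws its splits at random, the scores change from run to run. A minimal sketch, assuming placeholder features and a hypothetical helper `held_out_indices`, of how a fixed `random_state` could make the splits repeatable while each test set still keeps the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.zeros((120, 1))                 # placeholder features
y = np.array([0] * 80 + [1] * 40)      # same 80/40 class layout as the labels above

def held_out_indices(seed):
    # Return the test indices of each of the 10 random stratified splits
    splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=seed)
    return [test_idx for _, test_idx in splitter.split(X, y)]

folds_a = held_out_indices(seed=0)
folds_b = held_out_indices(seed=0)

# Same seed, same splits: the run is reproducible
same = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
# Every test set keeps the 2:1 class ratio (16 and 8 chapters)
ratios_ok = all(np.bincount(y[idx]).tolist() == [16, 8] for idx in folds_a)
# Number of distinct chapters that appear in at least one test set
n_distinct = len(np.unique(np.concatenate(folds_a)))
print(same, ratios_ok, n_distinct)
```

Unlike K-fold cross-validation, the splits are drawn independently, so a chapter may appear in several test sets or in none.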
Cross-validation on the first 80 chapters
labels_val = np.array([[0] * 40 + [1] * 40]).reshape(-1, 1)  # the target
# 5 random stratified splits; test ratio is 0.2, training ratio is 0.8
sss_val = StratifiedShuffleSplit(n_splits=5, test_size=0.2)
scores = []
train_data_features_val = train_data_features[0:80]
for train_index, test_index in sss_val.split(train_data_features_val, labels_val):
    X_train, X_test = train_data_features_val[train_index], train_data_features_val[test_index]
    Y_train, Y_test = labels_val[train_index], labels_val[test_index]
    gnb = GaussianNB()
    gnb.fit(X_train, Y_train.ravel())
    Y_pred = gnb.predict(X_test)
    scores.append(accuracy_score(Y_test, Y_pred))
print(scores)
print(np.array(scores).mean())
The following output is displayed:
[0.8125, 0.875, 0.75, 0.875, 0.75]
After repeated runs and averaging, the mean accuracy of the second experiment (0.8125 for the run shown) is lower than that of the first (about 0.87). This supports the hypothesis stated at the beginning of the article.
Full code: Github