The naive Bayes implementation in the scikit-learn package provides three models: the Gaussian model, the multinomial model, and the Bernoulli model. Please refer to the Naive Bayes section of the scikit-learn 0.18.1 documentation for details. This article uses the multinomial naive Bayes model (MultinomialNB) to solve the problem of classifying English emails.
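For reference, here is a minimal sketch of the three classes on made-up toy data (this snippet is not part of the original notebook; only MultinomialNB is used in the rest of this article):

# toy illustration of the three naive Bayes variants in sklearn
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4]])  # made-up count features
y = np.array([0, 1, 0])                          # made-up labels

for model in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X))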

Importing various packages

import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tqdm import tqdm_notebook
from wordcloud import WordCloud
from sklearn.metrics import roc_curve, auc
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer

%matplotlib inline

The data set

The data comes from the Spam Dataset on Kaggle, where normal mail is labeled ham / 0 and junk mail is labeled spam / 1

data = pd.read_csv('spam_ham_dataset.csv')
data = data.iloc[:, 1:]
data.head()
label text label_num
0 ham Subject: enron methanol ; meter # : 988291\r\n… 0
1 ham Subject: hpl nom for january 9 , 2001\r\n( see… 0
2 ham Subject: neon retreat\r\nho ho ho , we ' re ar… 0
3 spam Subject: photoshop , windows , office . cheap … 1
4 ham Subject: re : indian springs\r\nthis deal is t… 0
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 3 columns):
label        5171 non-null object
text         5171 non-null object
label_num    5171 non-null int64
dtypes: int64(1), object(2)
memory usage: 121.3+ KB
print('This data contains {} messages'.format(data.shape[0]))
This data contains 5171 messages
print('Normal mail has {} items'.format(data['label_num'].value_counts()[0]))
print('There are {} spam messages'.format(data['label_num'].value_counts()[1]))

plt.style.use('seaborn')
plt.figure(figsize=(6, 4), dpi=100)
data['label'].value_counts().plot(kind='bar')
Normal mail has 3672 items
There are 1499 spam messages

New DataFrame

Create a new DataFrame where all the processing takes place

# only text and label_num are required
new_data = data.iloc[:, 1:]
length = len(new_data)
print('Mail quantity length =', length)
new_data.head()
Mail quantity length = 5171
text label_num
0 Subject: enron methanol ; meter # : 988291\r\n… 0
1 Subject: hpl nom for january 9 , 2001\r\n( see… 0
2 Subject: neon retreat\r\nho ho ho , we ' re ar… 0
3 Subject: photoshop , windows , office . cheap … 1
4 Subject: re : indian springs\r\nthis deal is t… 0

View some details

for i in range(3):
    print(i, '\n', data['text'][i])
0 
 Subject: enron methanol ; meter # : 988291
this is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary
flow data provided by daren } .
please override pop ' s daily volume { presently zero } to reflect daily
activity you can obtain from gas control .
this change is needed asap for economics purposes .
1 
 Subject: hpl nom for january 9 , 2001
( see attached file : hplnol 09 . xls )
- hplnol 09 . xls
2 
 Subject: neon retreat
ho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time !
i know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute .
on the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .
i think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up with a potential alternative for how we can get together on that weekend , and then you can let me know which you prefer .
the first option would be to have a retreat similar to what we ' ve done the past several years . this year we could go to the heartland country inn ( www . . com ) outside of brenham . it ' s a nice place , where we ' d have a 13 - bedroom and a 5 - bedroom house side by side . it ' s in the country , real relaxing , but also close to brenham and only about one hour and 15 minutes from here . we can golf , shop in the antique and craft stores in brenham , eat dinner together at the ranch , and spend time with each other . we ' d meet on saturday , and then return on sunday morning , just like what we ' ve done in the past .
the second option would be to stay here in houston , have dinner together at a nice restaurant , and then have dessert and a time for visiting and recharging at one of our homes on that saturday evening . this might be easier , but the trade off would be that we wouldn ' t have as much time together . i ' ll let you decide .
email me back with what would be your preference , and of course if you ' re available on that weekend . the democratic process will prevail - - majority vote will rule ! let me hear from you as soon as possible , preferably by the end of the weekend . and if the vote doesn ' t go your way , no complaining allowed ( like i tend to do ! )
have a great weekend , great golf , great fishing , great shopping , or whatever makes you happy !
bobby

Preprocessing

Lower case

The messages contain mixed case, so first convert all the text to lower case

new_data['text'] = new_data['text'].str.lower()
new_data.head()
text label_num
0 subject: enron methanol ; meter # : 988291\r\n… 0
1 subject: hpl nom for january 9 , 2001\r\n( see… 0
2 subject: neon retreat\r\nho ho ho , we ' re ar… 0
3 subject: photoshop , windows , office . cheap … 1
4 subject: re : indian springs\r\nthis deal is t… 0

Stop words

Words such as you, me, and be carry no information for classification, so they can be removed as stop words. Also note that every message starts with the word subject, so we add it to the stop word list as well. The stopwords corpus from the natural language processing toolkit NLTK is used here

stop_words = set(stopwords.words('english'))
stop_words.add('subject')

Tokenization

To extract each word from a long sentence and filter out symbols, we use the RegexpTokenizer() class from NLTK, which takes a regular expression, for example:

string = 'I have a pen,I have an apple. (Uhh~)Apple-pen! ' # lyrics from PPAP
RegexpTokenizer('[a-zA-Z]+').tokenize(string) # Filter all symbols, return a list
['I', 'have', 'a', 'pen', 'I', 'have', 'an', 'apple', 'Uhh', 'Apple', 'pen']

Lemmatization

In English a word can appear in different forms, such as love and loves, which mean the same thing but differ in inflection. Lemmatization and stemming are the two common ways to handle this; this article uses lemmatization. See ZMonster's Blog for more details: Comparison of lemmatization tools

Use the WordNetLemmatizer() class from the NLTK package, for example:

word = 'loves'
print('{} is derived from {}'.format(word, WordNetLemmatizer().lemmatize(word)))
loves is derived from love
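Note that lemmatize() treats the word as a noun by default; passing a part-of-speech tag can change the result. A small illustrative sketch (not in the original notebook):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('loves'))            # 'love' (the default pos is noun)
print(lemmatizer.lemmatize('loving'))           # 'loving' is left unchanged as a noun
print(lemmatizer.lemmatize('loving', pos='v'))  # 'love' when treated as a verb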

To apply all of the above operations together, use pandas' apply

def text_process(text):
    tokenizer = RegexpTokenizer('[a-z]+') # matches only words; since the text is already lower case, [a-z]+ is enough
    lemmatizer = WordNetLemmatizer()
    token = tokenizer.tokenize(text) # tokenize
    token = [lemmatizer.lemmatize(w) for w in token if lemmatizer.lemmatize(w) not in stop_words] # remove stop words + lemmatize
    return token
new_data['text'] = new_data['text'].apply(text_process)

So now we have a cleaner data set

new_data.head()
text label_num
0 [enron, methanol, meter, follow, note, gave, m… 0
1 [hpl, nom, january, see, attached, file, hplno… 0
2 [neon, retreat, ho, ho, ho, around, wonderful,… 0
3 [photoshop, window, office, cheap, main, trend… 1
4 [indian, spring, deal, book, teco, pvr, revenu… 0

Training set and test set

The processed data set is split into a training set and a test set at a ratio of 3:1

seed = 20190524 # Make the experiment repeatable
X = new_data['text']
y = new_data['label_num']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed) # 75% for training set and 25% for test set
train = pd.concat([X_train, y_train], axis=1) # training set
test = pd.concat([X_test, y_test], axis=1) # test set

train.reset_index(drop=True, inplace=True) # reset the index
test.reset_index(drop=True, inplace=True) # same as above
print('Training set contains {} messages, test set contains {} messages'.format(train.shape[0], test.shape[0]))
Training set contains 3878 messages, test set contains 1293 messages

The number of spam and normal messages in the training set

print(train['label_num'].value_counts())
plt.figure(figsize=(6, 4), dpi=100)
train['label_num'].value_counts().plot(kind='bar')
0    2769
1    1109
Name: label_num, dtype: int64

The number of spam and normal messages in the test set

print(test['label_num'].value_counts())
plt.figure(figsize=(6, 4), dpi=100)
test['label_num'].value_counts().plot(kind='bar')
0    903
1    390
Name: label_num, dtype: int64

Feature engineering

If we counted every word, the vocabulary would still be very large and the model would run slowly. Therefore, the words from 10 randomly selected normal emails and 10 randomly selected spam emails are used as the vocabulary

ham_train = train[train['label_num'] == 0] # normal email
spam_train = train[train['label_num'] == 1] # spam

ham_train_part = ham_train['text'].sample(10, random_state=seed) # 10 normal emails randomly selected
spam_train_part = spam_train['text'].sample(10, random_state=seed) # 10 randomly selected spam messages

part_words = [] # words from the sampled emails
for text in pd.concat([ham_train_part, spam_train_part]):
    part_words += text
part_words_set = set(part_words)
print('There are {} words in the word list'.format(len(part_words_set)))
There are 1528 words in the word list

This greatly reduces the vocabulary

CountVectorizer

Next we count the number of occurrences of each word using sklearn's CountVectorizer(), as in:

words = ['This is the first sentence', 'And this is the second sentence']
cv = CountVectorizer() # the lowercase=True default turns letters to lower case, but our data is already lower case
count = cv.fit_transform(words)
print('cv.vocabulary_:\n', cv.vocabulary_) # returns a dictionary
print('cv.get_feature_names:\n', cv.get_feature_names()) # returns a list
print('count.toarray:\n', count.toarray()) # returns the count matrix as an array
cv.vocabulary_:
 {'this': 6, 'is': 2, 'the': 5, 'first': 1, 'sentence': 4, 'and': 0, 'second': 3}
cv.get_feature_names:
 ['and', 'first', 'is', 'second', 'sentence', 'the', 'this']
count.toarray:
 [[0 1 1 0 1 1 1]
 [1 0 1 1 1 1 1]]

Each sentence is thus converted into a vector of word counts, with one column per word in the vocabulary.

TfidfTransformer

TF-IDF is then calculated, which reflects the importance of a word in the text. Use TfidfTransformer() from the sklearn package, as in:

tfidf = TfidfTransformer()
tfidf_matrix = tfidf.fit_transform(count)
print('idf:\n', tfidf.idf_) # view the idf values
print('tfidf:\n', tfidf_matrix.toarray()) # view the tf-idf matrix
idf:
 [1.40546511 1.40546511 1.         1.40546511 1.         1.         1.        ]
tfidf:
 [[0.         0.57496187 0.4090901  0.         0.4090901  0.4090901  0.4090901 ]
 [0.49844628 0.         0.35464863 0.49844628 0.35464863 0.35464863 0.35464863]]

[0 1 1 0 1 1 1] becomes [0. 0.57496187 0.4090901 0. 0.4090901 0.4090901 0.4090901]
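To see where these numbers come from: with its default settings (smooth_idf=True, norm='l2'), TfidfTransformer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, then multiplies the counts by the idf and L2-normalizes each row. A quick check of the values above:

import numpy as np

n = 2                                       # two sentences
idf_rare = np.log((1 + n) / (1 + 1)) + 1    # word in one sentence, e.g. 'first' -> 1.4054...
idf_common = np.log((1 + n) / (1 + 2)) + 1  # word in both sentences, e.g. 'is' -> 1.0

# first sentence: counts [0 1 1 0 1 1 1] weighted by idf, then L2-normalized
row = np.array([0, idf_rare, idf_common, 0, idf_common, idf_common, idf_common])
print(row / np.linalg.norm(row))            # gives 0.57496187 and 0.4090901 as above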

Add a new column

Now let's begin the calculations. But before we do, we join the word lists back into sentences, the format CountVectorizer expects

# put the sampled normal and spam word lists into sentences separated by spaces; CountVectorizer() treats spaces as word separators
train_part_texts = [' '.join(text) for text in np.concatenate((spam_train_part.values, ham_train_part.values))]
# join all the training set word lists into sentences
train_all_texts = [' '.join(text) for text in train['text']]
# join all the test set word lists into sentences
test_all_texts = [' '.join(text) for text in test['text']]
cv = CountVectorizer()
part_fit = cv.fit(train_part_texts) # Use part of the sentence as a reference
train_all_count = cv.transform(train_all_texts) # Count the number of words for all messages in the training set
test_all_count = cv.transform(test_all_texts) # count the number of words for all messages in the test set
tfidf = TfidfTransformer()
train_tfidf_matrix = tfidf.fit_transform(train_all_count)
test_tfidf_matrix = tfidf.fit_transform(test_all_count) # note: strictly, tfidf.transform(test_all_count) with the idf fitted on the training counts would be the more standard choice
print('Training set', train_tfidf_matrix.shape)
print('Test set', test_tfidf_matrix.shape)
Training set (3878, 1513)
Test set (1293, 1513)

Build a model

mnb = MultinomialNB()
mnb.fit(train_tfidf_matrix, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
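The alpha=1.0 shown above is the additive (Laplace) smoothing parameter. If you want to check whether a different amount of smoothing helps, here is a small sketch (the candidate values are arbitrary and not part of the original experiment):

for alpha in (0.1, 0.5, 1.0, 2.0):
    model = MultinomialNB(alpha=alpha)
    model.fit(train_tfidf_matrix, y_train)
    print(alpha, model.score(test_tfidf_matrix, y_test))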

The accuracy of the model on the test set

mnb.score(test_tfidf_matrix, y_test)
0.9265274555297757
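Accuracy alone does not show how the errors are split between the two classes. As a quick extra check, here is a short sketch using sklearn's confusion_matrix (this import is not in the original notebook):

from sklearn.metrics import confusion_matrix

y_pred_label = mnb.predict(test_tfidf_matrix)  # hard 0/1 predictions
print(confusion_matrix(y_test, y_pred_label))  # rows are true ham/spam, columns are predicted ham/spam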
y_pred = mnb.predict_proba(test_tfidf_matrix)
fpr, tpr, thresholds = roc_curve(y_test, y_pred[:, 1])
roc_auc = auc(fpr, tpr) # use a new name to avoid shadowing the imported auc function
# roc curve
plt.figure(figsize=(6, 4), dpi=100)
plt.plot(fpr, tpr)
plt.title('auc = {:.4f}'.format(roc_auc))
plt.xlabel('fpr')
plt.ylabel('tpr')
Text(0, 0.5, 'tpr')

The complete .ipynb file has been uploaded to GitHub

References

  1. Naive Bayes, scikit-learn 0.18.1 documentation
  2. Comparison of lemmatization tools, ZMonster's Blog
  3. CountVectorizer sklearn example, A Data Analyst
  4. sklearn naive Bayes text classification
  5. Spam classification of a Chinese data set with sklearn