- The Email Spam Detector in Python
- George Pipis
- The Nuggets translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: JohnieXu
- Proofreader: luochen1992, zenblo
Spam vs. Valid Mail (Ham)
One of the most common applications of a predictive text model is spam detection. The data set used here, spam.csv, has a header row followed by two columns: text, which holds the message content, and target, which is either spam or ham (non-spam).
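As a small illustration of that layout, the sketch below builds a made-up three-row stand-in for the file in memory and encodes the label column as 1 for spam and 0 for ham. The rows and the names toy_csv, toy_data and spam_share are invented for this example and are not taken from the real data set.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for spam.csv: a header row, then text,target pairs.
toy_csv = io.StringIO(
    "text,target\n"
    "WIN a FREE prize now,spam\n"
    "See you at lunch,ham\n"
    "Claim your $100 reward,spam\n"
)
toy_data = pd.read_csv(toy_csv)

# Encode the label as 1 for spam and 0 for ham.
toy_data["target"] = np.where(toy_data["target"] == "spam", 1, 0)

# Share of spam messages; a quick way to spot class imbalance.
spam_share = toy_data["target"].mean()
print(toy_data)
print(spam_share)
```

Checking the spam share before modelling is optional, but it helps when interpreting the score later, since spam data sets are often imbalanced.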
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)
spam_data.head(10)
Output result:
Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'],
spam_data['target'],
random_state=0)
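For reference, train_test_split holds out 25% of the rows as the test set when no size is given. The sketch below shows this on made-up data and also shows the optional stratify argument, which the article does not use. It is included here only as an assumption about what might help when spam is rare; all variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 40 messages, 8 of them labelled spam (1).
texts = pd.Series([f"message {i}" for i in range(40)])
labels = pd.Series([1] * 8 + [0] * 32)

# Default split: 75% train, 25% test.
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=0)

# Stratified split: train and test keep the same 20% spam share.
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    texts, labels, random_state=0, stratify=labels
)

print(len(X_tr), len(X_te))
print(ys_tr.mean(), ys_te.mean())
```

Without stratify, the spam share in the test set depends on the random shuffle, which can make scores on small data sets noisier.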
Build TF-IDF features on n-grams
Use TfidfVectorizer from scikit-learn to fit a vocabulary on the training data X_train and transform it, ignoring terms that appear in fewer than 5 documents (min_df=5) and using n-grams of length 1 to 3 (single words, bigrams, and trigrams).
vect = TfidfVectorizer(min_df=5, ngram_range=(1, 3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
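To see what min_df and ngram_range do, here is a minimal sketch on a made-up three-document corpus. It uses min_df=2 instead of 5 so that something survives the filter; the corpus and the names docs, toy_vect, vocab and matrix are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
docs = [
    "free prize now",
    "free prize today",
    "see you now",
]

# min_df=2 keeps only terms found in at least 2 documents;
# ngram_range=(1, 3) considers unigrams, bigrams and trigrams.
toy_vect = TfidfVectorizer(min_df=2, ngram_range=(1, 3)).fit(docs)
vocab = sorted(toy_vect.vocabulary_)
matrix = toy_vect.transform(docs)

print(vocab)         # ['free', 'free prize', 'now', 'prize']
print(matrix.shape)  # (3, 4)
```

The bigram "free prize" is kept because it occurs in two documents, while every trigram occurs only once and is dropped.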
Add extra features
In addition to the TF-IDF features, we add four more features per message: its length, the number of digits, the number of dollar signs, and the number of non-word characters (anything other than letters, digits, and underscores). Let's write a function to do this:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
# Training data
add_length=X_train.str.len()
add_digits=X_train.str.count(r'\d')
add_dollars=X_train.str.count(r'\$')
add_characters=X_train.str.count(r'\W')
X_train_transformed = add_feature(X_train_vectorized , [add_length, add_digits, add_dollars, add_characters])
# Test data
add_length_t=X_test.str.len()
add_digits_t=X_test.str.count(r'\d')
add_dollars_t=X_test.str.count(r'\$')
add_characters_t=X_test.str.count(r'\W')
X_test_transformed = add_feature(vect.transform(X_test), [add_length_t, add_digits_t, add_dollars_t, add_characters_t])
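The stacking step can be checked on a tiny example. The sketch below uses made-up messages and a hand-written 3x2 sparse matrix standing in for the TF-IDF output; it mirrors what add_feature does, but all names and values here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Made-up messages and a stand-in 3x2 sparse matrix in place of TF-IDF output.
msgs = pd.Series(["Win $100 now!", "ok", "Call 0800 1234"])
X_sparse = csr_matrix(np.array([[0.5, 0.0], [0.0, 1.0], [0.2, 0.3]]))

length = msgs.str.len()          # characters per message
digits = msgs.str.count(r"\d")   # digits per message

# Stacking the two Series gives shape (2, 3); transposing yields
# one row per message, ready to be appended as new columns.
extra = csr_matrix(np.array([length, digits])).T
X_combined = hstack([X_sparse, extra], format="csr")

print(X_combined.shape)  # (3, 4)
print(X_combined.toarray())
```

Note that the extra columns are on a very different scale from TF-IDF weights (which lie between 0 and 1), which may be one reason the article's classifier is given a high max_iter.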
Train a logistic regression model
Next, we fit a logistic regression model and compute its AUC score on the test set.
clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000)
clf.fit(X_train_transformed, y_train)
y_predicted = clf.predict(X_test_transformed)
auc = roc_auc_score(y_test, y_predicted)
auc
Output result:
0.9674528462047772
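One detail worth knowing: the score above is computed from hard 0/1 predictions (clf.predict). roc_auc_score also accepts continuous scores, and the two inputs can give different values. The sketch below shows the difference on four made-up messages; the numbers are invented and are not the article's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted spam probabilities for four messages.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.45, 0.8])

# Hard labels at a 0.5 threshold, similar to what clf.predict returns.
y_label = (y_score >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_label)
auc_from_scores = roc_auc_score(y_true, y_score)

print(auc_from_labels)  # 0.75
print(auc_from_scores)  # 1.0
```

With the article's classifier, the probabilities would come from clf.predict_proba(X_test_transformed)[:, 1]. Scoring those instead of the hard labels would likely change the reported figure, so treat this as an alternative rather than a correction.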
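Get the features that have the greatest impact on the results
The code below extracts the 50 features with the largest coefficients, which push a prediction most strongly toward spam, and the 50 with the smallest coefficients, which push it most strongly toward ham.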
# scikit-learn >= 1.0; on older versions use vect.get_feature_names() instead
feature_names = np.array(list(vect.get_feature_names_out()) + ['lengthc', 'digit', 'dollars', 'n_char'])
sorted_coef_index = clf.coef_[0].argsort()
smallest = feature_names[sorted_coef_index[:50]]
largest = feature_names[sorted_coef_index[:-51: -1]]
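The slicing used here is easier to follow on a small example. The sketch below applies the same argsort pattern to five made-up feature names and coefficients; the values are invented for illustration.

```python
import numpy as np

# Made-up feature names and coefficients for illustration.
names = np.array(["free", "dad", "txt", "sorry", "the"])
coefs = np.array([2.5, -1.8, 1.2, -0.9, 0.1])

# argsort returns indices ordered from most negative to most positive.
order = coefs.argsort()

most_ham = names[order[:2]]        # two smallest coefficients
most_spam = names[order[:-3:-1]]   # two largest, biggest first

print(most_ham)   # ['dad' 'sorry']
print(most_spam)  # ['free' 'txt']
```

The slice [:-3:-1] walks backwards from the end and stops before the third-from-last element, just as [:-51:-1] returns the last 50 entries in reverse order.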
The 50 features that most strongly indicate spam:
largest
Output result:
array(['text', 'sale', 'free', 'uk', 'content', 'tones', 'sms', 'reply', 'order', 'won', 'ltd', 'girls', 'ringtone', 'to', 'comes', 'darling', 'this message', 'what you', 'new', 'www', 'co uk', 'std', 'co', 'about the', 'strong', 'txt', 'your', 'user', 'all of', 'choose', 'service', 'wap', 'mobile', 'the new', 'with', 'sexy', 'sunshine', 'xxx', 'this', 'hot', 'freemsg', 'ta', 'waiting for your', 'asap', 'stop', 'll have', 'hello', 'http', 'vodafone', 'of the'], dtype='<U31')
The 50 features that most strongly indicate normal mail (ham):
smallest
Output result:
array(['ì_ wan', 'for 1st', 'park', '1st', 'ah', 'wan', 'got', 'say', 'tomorrow', 'if', 'my', 'ì_', 'call', 'opinion', 'days', 'gt', 'its', 'lt', 'lovable', 'sorry', 'all', 'when', 'can', 'hope', 'face', 'she', 'pls', 'lt gt', 'hav', 'he', 'smile', 'wife', 'for my', 'trouble', 'me', 'went', 'about me', 'hey', '30', 'sir', 'lovely', 'small', 'sun', 'silent', 'me if', 'happy', 'only', 'them', 'my dad', 'dad'], dtype='<U31')
Conclusion
This article walked through a practical, reproducible example of a spam detection algorithm. Prediction of this kind is one of the main tasks in natural language processing (NLP). The model developed above reaches an AUC score of 0.97, which is pretty good. It could be extended by adding and testing further features, such as words that appear more frequently in spam than in ham, and vice versa.