Building a spam detection application in Python

The Email Spam Detector in Python

George Pipis

The Nuggets Translation Project

Permanent link to this article: github.com/xitu/gold-m…

Translator: JohnieXu

Proofreader: luochen1992, zenblo

Spam vs. Valid Mail (Ham)

Spam detection is one of the most common applications of predictive text models. The dataset used here is the file spam.csv, which has a header row and two columns: text, which holds the message content, and target, which is either spam or ham for spam and non-spam messages respectively.

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

spam_data = pd.read_csv('spam.csv')

# Encode the label as 1 for spam and 0 for ham
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)
spam_data.head(10)
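As a minimal sketch of the encoding step above, the snippet below applies the same np.where call to a three-row stand-in for spam.csv. The rows and the names toy and spam_share are made up for illustration and are not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for spam.csv (not the real data).
toy = pd.DataFrame({
    "text": ["WIN a FREE prize now", "see you at lunch", "call now to claim $500"],
    "target": ["spam", "ham", "spam"],
})

# Same encoding step as above: 1 for spam, 0 for everything else.
toy["target"] = np.where(toy["target"] == "spam", 1, 0)

# Share of spam messages; worth checking on the real data too,
# because spam datasets are often imbalanced.
spam_share = toy["target"].mean()
print(toy)
print(spam_share)
```

If the share is far from 0.5 on the real data, accuracy alone can be misleading, which is one reason to prefer a metric such as AUC.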

Split the data into training sets and test sets

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                 spam_data['target'], 
                 random_state=0)
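The split above relies on the default behaviour of train_test_split, which holds out 25% of the rows when test_size is not given. The sketch below shows this on hypothetical toy data. The stratify argument is an optional extra that the article does not use; it keeps the spam/ham ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 messages, half spam (1) and half ham (0).
texts = pd.Series([f"message {i}" for i in range(8)])
labels = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

# No test_size given, so 25% of the rows (2 of 8) go to the test set.
# stratify=labels is an assumption added here, not part of the article's code.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, random_state=0, stratify=labels
)

print(len(X_tr), len(X_te))
print(sorted(y_te.tolist()))
```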

Build TF-IDF features on n-grams

Use the TfidfVectorizer from scikit-learn to fit and transform the training data X_train, ignoring terms that appear in fewer than 5 documents (min_df=5) and using n-grams of length 1 to 3 (unigrams, bigrams, and trigrams).

vect = TfidfVectorizer(min_df=5, ngram_range=(1, 3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
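To see what min_df and ngram_range do, here is a small sketch on a hypothetical three-document corpus. It uses min_df=2 in place of the article's min_df=5 because the corpus is so small; the names docs, toy_vect, kept, and matrix are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from spam.csv.
docs = [
    "free prize call now",
    "free prize inside",
    "call me later",
]

# Keep only n-grams (length 1 to 3) that occur in at least 2 documents.
toy_vect = TfidfVectorizer(min_df=2, ngram_range=(1, 3)).fit(docs)

# vocabulary_ maps each kept n-gram to its column index.
kept = sorted(toy_vect.vocabulary_)
print(kept)    # ['call', 'free', 'free prize', 'prize']

matrix = toy_vect.transform(docs)
print(matrix.shape)    # (3, 4)
```

The bigram "free prize" survives because it appears in two documents, while every trigram is dropped because each one appears only once.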

Add extra features

In addition to the TF-IDF features, we add four more features for each message: its length, the number of digits, the number of dollar signs, and the number of non-word characters (anything other than letters, digits, and underscores). Let’s write a function to do this:

def add_feature(X, feature_to_add):
    """ Returns sparse feature matrix with added feature. feature_to_add can also be a list of features. """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

# Training data
add_length=X_train.str.len()
add_digits=X_train.str.count(r'\d')
add_dollars=X_train.str.count(r'\$')
add_characters=X_train.str.count(r'\W')

X_train_transformed = add_feature(X_train_vectorized , [add_length, add_digits,  add_dollars, add_characters])

# Test data
add_length_t=X_test.str.len()
add_digits_t=X_test.str.count(r'\d')
add_dollars_t=X_test.str.count(r'\$')
add_characters_t=X_test.str.count(r'\W')

X_test_transformed = add_feature(vect.transform(X_test), [add_length_t, add_digits_t,  add_dollars_t, add_characters_t])
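The following sketch shows what the four counts look like on two made-up messages and how stacking them widens the feature matrix. The base matrix is a hypothetical stand-in for the TF-IDF output, and the stacking is written out inline rather than calling add_feature, so the snippet runs on its own.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Two hypothetical messages.
msgs = pd.Series(["Win $1000 now!!", "see you at 5"])

length = msgs.str.len()            # characters per message
digits = msgs.str.count(r"\d")     # digit characters
dollars = msgs.str.count(r"\$")    # literal dollar signs
non_word = msgs.str.count(r"\W")   # anything except letters, digits, underscore

# Stand-in for a TF-IDF matrix: 2 documents x 3 made-up columns.
base = csr_matrix([[0.5, 0.0, 0.1], [0.0, 0.7, 0.0]])

# Same stacking idea as add_feature: one new column per extra feature.
extra = csr_matrix(
    [length.tolist(), digits.tolist(), dollars.tolist(), non_word.tolist()]
).T
combined = hstack([base, extra], "csr")

print(extra.toarray())   # [[15 4 1 5], [12 1 0 3]]
print(combined.shape)    # (2, 7)
```

Note that \W also counts spaces, so the non-word count grows with the number of words as well as with punctuation.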

Train logistic regression models

Next, we fit a logistic regression model and compute its AUC score on the test set.

clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000)

clf.fit(X_train_transformed, y_train)

y_predicted = clf.predict(X_test_transformed)

auc = roc_auc_score(y_test, y_predicted)
auc

Output result:

0.9674528462047772
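The score above is computed from the hard 0/1 labels returned by predict. AUC is more commonly computed from continuous scores, and the two can differ. The sketch below illustrates this on four hypothetical predictions; the numbers are made up and say nothing about the article's model. For the model above, the analogous scores would come from clf.predict_proba(X_test_transformed)[:, 1], which was not run here and may give a value different from the one reported.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted spam probabilities for four messages.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.9])

# Hard 0/1 labels at a 0.5 threshold, comparable to what predict() returns
# for a binary logistic regression.
hard_labels = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)
print(auc_from_labels, auc_from_scores)   # 0.75 1.0
```

The scores rank both spam messages above both ham messages, so the score-based AUC is 1.0, while thresholding throws away that ranking for the message scored 0.3.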

Get the feature words that have the greatest impact on the results

The code below extracts the 50 features with the largest coefficients, which push a prediction toward spam, and the 50 with the smallest coefficients, which push it toward ham.

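# get_feature_names() was removed in scikit-learn 1.2; on newer versions
# use list(vect.get_feature_names_out()) instead.
feature_names = np.array(vect.get_feature_names() + ['lengthc', 'digit', 'dollars', 'n_char'])
sorted_coef_index = clf.coef_[0].argsort()
smallest = feature_names[sorted_coef_index[:50]]      # most negative coefficients (ham)
largest = feature_names[sorted_coef_index[:-51:-1]]   # most positive coefficients (spam)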
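The slicing in that code is easier to follow on a small array. This sketch uses five hypothetical coefficients and picks the two smallest and two largest with the same argsort pattern; names, coefs, bottom2, and top2 are illustrative only.

```python
import numpy as np

# Hypothetical feature names and coefficients.
names = np.array(["a", "b", "c", "d", "e"])
coefs = np.array([0.5, -2.0, 3.0, 0.0, -0.1])

order = coefs.argsort()        # indices from most negative to most positive
bottom2 = names[order[:2]]     # two most negative coefficients
top2 = names[order[:-3:-1]]    # two most positive, largest first

print(bottom2)   # ['b' 'e']
print(top2)      # ['c' 'a']
```

The slice [:-3:-1] walks backwards from the end and stops before the third-from-last element, which is why the article uses [:-51:-1] to get 50 items.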

The 50 features that most strongly push a prediction toward spam:

largest

Output result:

array(['text', 'sale', 'free', 'uk', 'content', 'tones', 'sms', 'reply', 'order', 'won', 'ltd', 'girls', 'ringtone', 'to', 'comes', 'darling', 'this message', 'what you', 'new', 'www', 'co uk', 'std', 'co', 'about the', 'strong', 'txt', 'your', 'user', 'all of', 'choose', 'service', 'wap', 'mobile', 'the new', 'with', 'sexy', 'sunshine', 'xxx', 'this', 'hot', 'freemsg', 'ta', 'waiting for your', 'asap', 'stop', 'll have', 'hello', 'http', 'vodafone', 'of the'], dtype='<U31')

The 50 features that most strongly push a prediction toward normal mail (ham):

smallest

Output result:

array(['ì_ wan', 'for 1st', 'park', '1st', 'ah', 'wan', 'got', 'say', 'tomorrow', 'if', 'my', 'ì_', 'call', 'opinion', 'days', 'gt', 'its', 'lt', 'lovable', 'sorry', 'all', 'when', 'can', 'hope', 'face', 'she', 'pls', 'lt gt', 'hav', 'he', 'smile', 'wife', 'for my', 'trouble', 'me', 'went', 'about me', 'hey', '30', 'sir', 'lovely', 'small', 'sun', 'silent', 'me if', 'happy', 'only', 'them', 'my dad', 'dad'], dtype='<U31')

Conclusion

This was a practical, reproducible example of spam detection, a predictive text task that is one of the main applications of natural language processing (NLP). The model we built above reaches an AUC score of 0.97 on the test set, which is pretty good. The model could be extended by adding and testing more features, for example flags for words that appear frequently in spam or, conversely, in ham.

If you find any mistakes in this translation or other areas that need improvement, you are welcome to revise it and open a PR at the Nuggets Translation Project, which also earns the corresponding reward points. The permanent link at the beginning of this article is the Markdown link to this article on GitHub.

The Nuggets Translation Project is a community that translates high-quality technical articles from the Internet, drawn from English articles shared on Nuggets. The content covers Android, iOS, front-end, back-end, blockchain, products, design, artificial intelligence, and other fields. To see more high-quality translations, follow the Nuggets Translation Project, its official Weibo, and its Zhihu column.
