This post teaches you how to implement your own spam filter in under 100 lines of Python code.
While doing this hands-on exercise, you’ll work with natural language data, learn how to automatically detect the words spammers use, and learn how to use a Naive Bayes classifier for binary classification.
Classification
Classification is ubiquitous – many things around us can be divided into two or more classes based on their characteristics. In keeping with the theme of bikes from the previous post on regression: we call a vehicle with one wheel a “unicycle”, a vehicle with two wheels and no motor a “bicycle”, and a vehicle with two wheels and a motor a “motorcycle”. In other words, we use the number of wheels and the presence of a motor to classify the vehicle. Machines can do the same: being exposed to a sufficient amount of data describing the instances of different classes, they can learn about their characteristic properties and detect the classes of new instances. Binary classification assumes that the choice is between just two classes.
Spam filtering
Spam filtering is a classic example of a binary classification task familiar to anybody who has ever used email. Now you’ll learn how to implement your own spam filter.
The task is to distinguish between two types of emails: “spam” and “non-spam”, often called “ham”. The machine learning classifier will detect that an email is spam if it is characterised by certain features. The textual content of the email (words like “Viagra” or “lottery”, or phrases like “You’ve won a $100000000! Click here!” and “Join now!”) is crucial in spam detection and offers some of the strongest cues.
Since you’ll be working with text data, you’ll be using the Python-based Natural Language Toolkit (NLTK), which offers rich functionality for natural language processing tasks. It is well supported, and there is a helpful online book.
Start by importing the toolkit:
import nltk
To train the classifier, we need a representative dataset with both spam and ham emails. We’ll be using the Enron email dataset which contains emails of both types stored in plain text format.
Download and unarchive any of the folders into the directory with your code. For example, the “enron1/” folder contains 3672 legitimate (ham) emails and 1500 spam emails.
The spam detection algorithm will involve 5 steps:
- loading the data;
- preprocessing the data;
- extracting the features;
- training a classifier;
- evaluating the classifier’s performance.
Let’s get started!
Step 1: Loading the data
You need to read in the files from the spam and ham subfolders and keep them in two separate lists. To be able to iteratively read the files in a folder, add the following import statement:
import os
Then define the function init_lists as follows:
def init_lists(folder):
    a_list = []
    file_list = os.listdir(folder)
    for a_file in file_list:
        # some Enron emails contain bytes that are not valid UTF-8, so ignore decoding errors
        f = open(folder + a_file, 'r', errors='ignore')
        a_list.append(f.read())
        f.close()
    return a_list
Now you can use this function to create spam and ham lists. For that, add the following two lines of code to the main part of your program:
spam = init_lists('enron1/spam/')
ham = init_lists('enron1/ham/')
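As a quick sanity check, you can print out the sizes of the two lists; with the “enron1/” data they should match the counts mentioned above:

print(len(spam))   # should print 1500
print(len(ham))    # should print 3672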
Next, let’s combine the two lists while keeping the labels. You can do this by creating a list of two-place tuples, where the first member of each pair stores the text of the email and the second one stores its label. To make your code compact, use list comprehensions:
all_emails = [(email, 'spam') for email in spam]
all_emails += [(email, 'ham') for email in ham]
Now, the first tuple in the all_emails list contains the first spam email read in from the spam subfolder, the following 1499 tuples are spam emails as well, while the 1501st tuple contains the first ham email read in from the ham subfolder. You can always check whether your data is loaded correctly by checking the size of the data structure:
print(len(all_emails))
This should print out 5172. Before starting to build the classifier, let’s shuffle the spam and ham examples together. This way it will be easier to organise the training data, because any portion of all_emails will then contain examples of both categories. Add the following import statement to the import statements block:
import random
You can then shuffle the examples using the following code in the main part of the program:
random.shuffle(all_emails)
Step 2: Preprocessing the data
Currently, the data is stored as lines of text, for example:
The purpose of the email is to recap the kickoff meeting held on yesterday. (ham email #013 in enron1/)
People are getting rich using this system! Now it’s your turn! We’ve cracked the code and will show you… (spam email #046 in enron1/)
Subject: popular meds at lowest prices (spam email #838 in enron1/)
To be able to use the words in these texts as features for your classifier, you need to preprocess the data and normalise it (so that different forms of the same word are treated as the same word) by:
- splitting the text by whitespaces and punctuation marks. The tools used for this purpose are called tokenizers, and you can use a tokenizer provided with NLTK. If you run an NLTK tokenizer on the ham sentence from the example above, you’ll get: ['The', 'purpose', 'of', 'the', 'email', 'is', 'to', 'recap', 'the', 'kickoff', 'meeting', 'held', 'on', 'yesterday', '.'];
- linking the different forms of the same word (for example, price and prices, or is and are) to each other. The tools that can do that are called lemmatizers, and you can again use one of those that come with NLTK. For example, a lemmatizer will reduce the word prices in the spam subject line above to price;
- converting all words to lowercase so that the classifier does not treat People, people and PEOPLE as three separate features.
Add the following import statement to the import statements block to include the tokenizer and lemmatizer:
from nltk import word_tokenize, WordNetLemmatizer
Then define the preprocess function that will take a sentence as an input and will return the result of all the preprocessing operations:
def preprocess(sentence):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]
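Note that if this is the first time you use NLTK’s tokenizer and lemmatizer, you may need to download the underlying resources once (the exact resource names can depend on your NLTK version). As a quick check, running preprocess on the ham sentence from above should give you something like the following:

# one-off downloads, only needed the first time you use these tools:
# nltk.download('punkt')
# nltk.download('wordnet')

print(preprocess('The purpose of the email is to recap the kickoff meeting held on yesterday.'))
# ['the', 'purpose', 'of', 'the', 'email', 'is', 'to', 'recap', 'the',
#  'kickoff', 'meeting', 'held', 'on', 'yesterday', '.']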
Step 3: Extracting the features
Once the text is preprocessed, you can extract the features characterising spam and ham emails. The first thing to notice is that some words like “the”, “is” or “of” appear in all emails, don’t have much content to them and are therefore not going to help you distinguish spam from ham. Such words are called stopwords, and they can be disregarded during classification. NLTK has a corpus of stopwords for several languages including English, which you can import as:
from nltk.corpus import stopwords
Now you can access the stopword list for English as follows:
stoplist = stopwords.words('english')
For example, if you remove the stopwords from the ham sentence above, you’ll get: ['purpose', 'email', 'recap', 'kickoff', 'meeting', 'held', 'yesterday', '.'] (punctuation marks are not in the stoplist, so the full stop survives). This set of words summarises the content of the original sentence perfectly.
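Here is a minimal check of the stopword filtering, reusing the preprocess function defined above (the exact output may vary slightly with your NLTK version):

spam_tokens = preprocess('People are getting rich using this system!')
print([word for word in spam_tokens if word not in stoplist])
# roughly: ['people', 'getting', 'rich', 'using', 'system', '!']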
To extract the features (words that can tell the program whether the email is spam or ham), you’ll need to do the following:
- read in the text of the email;
- preprocess it using the function preprocess(sentence) defined above;
- for each word that is not in the stopword list, either calculate how frequently it occurs in the text, or simply register the fact that the word occurs in the email. The former approach is called bag-of-words (bow, for short), and it allows the classifier to notice that certain keywords may occur in both types of emails but with different frequencies: for example, the word “Viagra” is much more frequent in spam than in ham emails. Python’s Counter subclass makes it easy to apply the bow model.
To use the Counter subclass, add the import statement:
from collections import Counter
You can control which model you want to use with the parameter setting in the following code (the simple word occurrence model is the default):
def get_features(text, setting):
    if setting == 'bow':
        return {word: count for word, count in Counter(preprocess(text)).items() if not word in stoplist}
    else:
        return {word: True for word in preprocess(text) if not word in stoplist}
The code above uses dictionary comprehensions.
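For example, calling get_features on a made-up one-line email (this sentence is not from the dataset) shows the difference between the two settings:

print(get_features('You have won the lottery ! Click here now !', ''))
# roughly: {'won': True, 'lottery': True, '!': True, 'click': True}

print(get_features('You have won the lottery ! Click here now !', 'bow'))
# roughly: {'won': 1, 'lottery': 1, '!': 2, 'click': 1}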
Now you can extract the features from the emails and pair them with the email class label (“spam” or “ham”). Add the following line of code to the main part of the program if you want to use the bow model:
all_features = [(get_features(email, 'bow'), label) for (email, label) in all_emails]
and the following one for the default model:
all_features = [(get_features(email, ''), label) for (email, label) in all_emails]
Step 4: Training a classifier
Now that the data is in the correct format, you can split it into a training set that will be used to train the classifier, and a test set that will be used to evaluate it. Typically, the data is split using 80% for training and the other 20% for testing.
Define a function train that will take the set of features and the proportion of the examples assigned to the training set as arguments:
def train(features, samples_proportion):
    train_size = int(len(features) * samples_proportion)
    train_set, test_set = features[:train_size], features[train_size:]
    print('Training set size = ' + str(len(train_set)) + ' emails')
    print('Test set size = ' + str(len(test_set)) + ' emails')
The print statements will help you make sure that the data is split correctly: if you use 80% of the data in “enron1/” to train the classifier, the training set should contain 4137 emails and the test set 1035.
You can apply any classifier of your choice; a number of them come with NLTK. This post will show you how to use the Naive Bayes classifier, a simple yet powerful classification algorithm that has been widely applied to spam filtering before.
The classifier tries to choose the most probable class, or label, among the two classes, spam and ham, based on what it has learned about the features (the presence or frequency of words in the emails of each type). More precisely, it tries to choose the most probable class given the words in the email:

ĉ = argmax_{c ∈ {spam, ham}} P(c | words)
Don’t worry if the formula looks a bit complicated at first. It simply says that the classifier will assign the class (denoted as c with a hat) by checking which of the two probabilities is higher: “spam” given the words in the email, P(spam | words), or “ham” given the words in the email, P(ham | words); that is what argmax stands for. These probabilities cannot be estimated directly, but Bayes’ rule allows you to swap the conditions and get:

P(c | words) = P(words | c) × P(c) / P(words)
Now the classifier will need to compare these fractions for the two classes. When comparing the two fractions, you can disregard the denominator, because it stays the same for both classes, and directly compare the product P(words | spam) × P(spam) with P(words | ham) × P(ham). The probabilities P(spam) and P(ham) are called the prior probabilities, and they reflect the distribution of the “spam” and “ham” classes in the training set. The probabilities P(words | spam) and P(words | ham) are called the conditional probabilities of seeing a particular set of features if the email is “spam” or if it is “ham”. The Naive Bayes classifier assumes that each feature (word) occurs in a text independently of all other words, so the conditional probabilities for the individual words can simply be multiplied together. In short, the algorithm will say that an email with words w1, …, wn is spam if:

P(w1 | spam) × … × P(wn | spam) × P(spam) > P(w1 | ham) × … × P(wn | ham) × P(ham)
and ham otherwise.
The probabilities are calculated on the training data: for example, P(spam) in the “enron1” dataset equals 0.29 (since there are 1500 spam emails out of 5172). But you don’t have to estimate the probabilities yourself; the NLTK implementation of the algorithm will do it for you.
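To make the decision rule concrete, here is a minimal sketch with made-up numbers; the conditional word probabilities below are purely illustrative and are not estimated from the dataset:

# priors, reflecting the class distribution in 'enron1'
p_spam, p_ham = 0.29, 0.71

# purely illustrative conditional probabilities for a two-word email
p_words_given_spam = 0.010 * 0.050    # e.g. P('viagra' | spam) * P('click' | spam)
p_words_given_ham = 0.0001 * 0.005    # e.g. P('viagra' | ham) * P('click' | ham)

# compare P(words | spam) * P(spam) with P(words | ham) * P(ham)
if p_words_given_spam * p_spam > p_words_given_ham * p_ham:
    print('spam')    # this branch wins with the numbers above
else:
    print('ham')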
To use this classifier, make sure that it has been imported:
from nltk import NaiveBayesClassifier, classify
Then add the following code to the train function to train a model based on the training dataset:
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier
Now you have a working spam filter, which you can use by calling the train function from the main part of the program with the appropriate data split, for example:
train_set, test_set, classifier = train(all_features, 0.8)
Now you have the training and test sets, as well as a classifier ready to detect spam emails.
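You can also try the classifier on a brand-new email. The text below is made up for illustration; just make sure you extract the features with the same setting you used during training:

test_email = 'Congratulations! You have won a free prescription. Click here now!'
print(classifier.classify(get_features(test_email, '')))
# most likely prints 'spam', although the exact output depends on the trained model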
Step 5: Evaluating your classifier performance
So far so good! But how do you know whether your classifier is doing a good job at detecting spam messages?
One way to check this is to evaluate its performance on the test dataset – the part of the data you’ve set apart (20% if you’ve been following the settings from above). Let’s add a new function called evaluate to your program. It will take the train_set, test_set and classifier as arguments:
def evaluate(train_set, test_set, classifier):
    print('Accuracy on the training set = ' + str(classify.accuracy(classifier, train_set)))
    print('Accuracy on the test set = ' + str(classify.accuracy(classifier, test_set)))
Accuracy on the training set will tell you how good the classifier is at learning the relevant information about the features and detecting the classes on the very same data it has been trained on. The accuracy on the test set shows how well it generalises this knowledge when applying it to the emails it hasn’t seen before.
To see how well the classifier performs on both sets, call this function from the main part of the program:
evaluate(train_set, test_set, classifier)
Remember that you shuffled the data early on to avoid any bias, so every time you run your program you’ll get slightly different results; for example, the accuracy on the training set might vary between 95% and 98%, and the accuracy on the test set might vary between 92% and 95%.
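If you want a more stable estimate, a minimal sketch is to repeat the shuffle/split/train cycle several times and average the test accuracy (this reuses the train function and the all_features list from above):

# average the test accuracy over several random shuffles
accuracies = []
for _ in range(5):
    random.shuffle(all_features)
    train_set, test_set, classifier = train(all_features, 0.8)
    accuracies.append(classify.accuracy(classifier, test_set))
print('Mean test accuracy = ' + str(sum(accuracies) / len(accuracies)))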
From a natural language perspective, another thing to check is which words the classifier finds most discriminative when detecting a spam message. You can do this by adding just one line of code to the evaluate function:
classifier.show_most_informative_features(20)
This will show you the top 20 most informative words (you can change this number).
For example, a typical run might tell you that in this dataset an email containing the word “prescription” is around 160 times more likely to be spam than ham, and an email containing the word “pain” is around 101 times more likely to be spam than ham. Again, the set of words will vary from one run to another since you shuffle the data before splitting.
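For reference, the printout of show_most_informative_features looks roughly like this (the words and ratios here are illustrative and will differ between runs):

Most Informative Features
    prescription = True    spam : ham = 160.1 : 1.0
            pain = True    spam : ham = 101.6 : 1.0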
A recap of what you’ve learned in this post:
- Classification is involved in many tasks in machine learning
- Spam filtering is a binary classification task where you need to detect whether an email belongs to a “spam” or “ham” class
- Word occurrence and frequency are some of the most informative features for spam detection
- Before you can use words as features you need to preprocess the text data
- NLTK provides wide functionality for natural language processing
- Naive Bayes is a simple yet quite powerful machine learning algorithm that can be used for binary classification.
Want to learn more? Check out our two-day Data Science Bootcamp:
https://cambridgecoding.com/datascience-bootcamp
The full script:
import nltk
import os
import random
from collections import Counter
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier, classify

stoplist = stopwords.words('english')

def init_lists(folder):
    a_list = []
    file_list = os.listdir(folder)
    for a_file in file_list:
        # some Enron emails contain bytes that are not valid UTF-8, so ignore decoding errors
        f = open(folder + a_file, 'r', errors='ignore')
        a_list.append(f.read())
        f.close()
    return a_list

def preprocess(sentence):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]

def get_features(text, setting):
    if setting == 'bow':
        return {word: count for word, count in Counter(preprocess(text)).items() if not word in stoplist}
    else:
        return {word: True for word in preprocess(text) if not word in stoplist}

def train(features, samples_proportion):
    train_size = int(len(features) * samples_proportion)
    # initialise the training and test sets
    train_set, test_set = features[:train_size], features[train_size:]
    print('Training set size = ' + str(len(train_set)) + ' emails')
    print('Test set size = ' + str(len(test_set)) + ' emails')
    # train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

def evaluate(train_set, test_set, classifier):
    # check how the classifier performs on the training and test sets
    print('Accuracy on the training set = ' + str(classify.accuracy(classifier, train_set)))
    print('Accuracy on the test set = ' + str(classify.accuracy(classifier, test_set)))
    # check which words are most informative for the classifier
    classifier.show_most_informative_features(20)

if __name__ == '__main__':
    # initialise the data
    spam = init_lists('enron1/spam/')
    ham = init_lists('enron1/ham/')
    all_emails = [(email, 'spam') for email in spam]
    all_emails += [(email, 'ham') for email in ham]
    random.shuffle(all_emails)
    print('Corpus size = ' + str(len(all_emails)) + ' emails')

    # extract the features
    all_features = [(get_features(email, ''), label) for (email, label) in all_emails]
    print('Collected ' + str(len(all_features)) + ' feature sets')

    # train the classifier
    train_set, test_set, classifier = train(all_features, 0.8)

    # evaluate its performance
    evaluate(train_set, test_set, classifier)
ABOUT THE AUTHOR
EKATERINA KOCHMAR
Ekaterina is a research associate at the Computer Laboratory of the University of Cambridge. Her research focuses on Natural Language Processing (NLP) applications and Machine Learning methods. She holds a PhD in Natural Language and Information Processing and an MPhil in Advanced Computer Science from the University of Cambridge.