Original link: tecdat.cn/?p=4027
Business background
E-mail is now used almost everywhere and has made daily life far more convenient. Its by-product, spam, however, causes considerable trouble for users, network administrators, and Internet service providers (ISPs). The spam problem keeps growing and has drawn wide attention from researchers. Spam usually refers to e-mail that is pushed into users’ mailboxes without their permission. Because spam is sent in bulk, countering it requires technical measures. Current anti-spam techniques mainly include spam filtering, security management of mail servers, and improvements to the Simple Mail Transfer Protocol (SMTP).
WEKA text segmentation preprocessing
First, the two classes of mail documents in the training-set folder are analyzed. Their characteristics can be examined automatically from several angles, and an algorithm is then written to build the classification model.
To begin, set the working directory and read in the classified text files.
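The reading step might look roughly like the following sketch in plain Python (not WEKA itself). It assumes a hypothetical layout in which each class has its own subfolder, for example `train/spam/` and `train/ham/`; the folder names and the `load_corpus` helper are illustrative and do not come from the original post.

```python
from pathlib import Path

def load_corpus(root):
    """Read every file under each class subfolder of `root`.

    Returns (texts, labels), where the label of a document is the name
    of the subfolder it was found in (an assumed convention).
    """
    texts, labels = [], []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        for file_path in files:
            # errors="replace" keeps going when a mail has odd encoding
            texts.append(file_path.read_text(encoding="utf-8", errors="replace"))
            labels.append(class_dir.name)
    return texts, labels
```

Calling `load_corpus("train")` on such a layout would return the message bodies together with one label per message, which is the input the later steps need.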
The frequency histogram of spam and non-spam messages can then be plotted.
Next, the original corpus is segmented into words to produce a word-frequency matrix file.
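To make the word-frequency matrix concrete, here is a minimal sketch in plain Python. The tokenizer (lowercase alphabetic runs) and the two sample messages are assumptions made for illustration; the tokenization settings actually used in WEKA are not given in the text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequency_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical sample messages, not taken from the original corpus.
docs = ["Free money, free prizes", "Meeting about money"]
vocabulary, rows = term_frequency_matrix(docs)
print(vocabulary)  # ['about', 'free', 'meeting', 'money', 'prizes']
for row in rows:
    print(row)     # [0, 2, 0, 1, 1] then [1, 0, 1, 1, 0]
```

Each row is one document and each column one word, so the matrix can be handed to any classifier that expects numeric attributes.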
A per-class histogram is then obtained for each word frequency.
Once the word-frequency matrix is available, the classifiers are trained on it.
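As an illustration of what training a classifier on word counts involves, the sketch below implements a small multinomial Naive Bayes with add-one smoothing in plain Python. It is a stand-in for the idea, not WEKA's NaiveBayes implementation, and the training messages and the function names `train_naive_bayes` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(token_lists, labels):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocabulary

def predict(model, tokens):
    """Return the class with the highest log posterior for `tokens`."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # log prior
        # Add-one (Laplace) smoothing: every vocabulary word gets +1.
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # skip words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, already tokenized training messages.
train_tokens = [["free", "money"], ["free", "prize", "now"],
                ["meeting", "today"], ["project", "meeting", "notes"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_tokens, train_labels)
print(predict(model, ["free", "prize"]))     # spam
print(predict(model, ["meeting", "notes"]))  # ham
```

Working in log space avoids numerical underflow when a message contains many words.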
2. Analyze the attributes in the corpus and identify those that contribute to classification: words that appear only in the positive class, words that appear only in the negative class, and words that appear in both.
3. Find the classification rules that distinguish positive from negative, i.e., which combinations of words lead to a positive result and which lead to a negative one.
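The attribute analysis in step 2 can be sketched with plain set operations, as below. The class labels `"positive"` and `"negative"`, the helper name `partition_vocabulary`, and the sample token lists are assumptions for illustration only; the sketch does not cover the rule mining described in step 3.

```python
def partition_vocabulary(token_lists, labels, positive="positive"):
    """Split the vocabulary by the classes in which each word occurs."""
    positive_words, negative_words = set(), set()
    for tokens, label in zip(token_lists, labels):
        target = positive_words if label == positive else negative_words
        target.update(tokens)
    return {
        "positive_only": positive_words - negative_words,
        "negative_only": negative_words - positive_words,
        "both": positive_words & negative_words,
    }

# Hypothetical tokenized documents with their class labels.
token_lists = [["good", "results", "cell"], ["however", "cell", "rates"]]
labels = ["positive", "negative"]
groups = partition_vocabulary(token_lists, labels)
for name in ("positive_only", "negative_only", "both"):
    print(name, sorted(groups[name]))
```

Words that fall in `positive_only` or `negative_only` are perfect separators on the training data, while words in `both` need their frequencies compared before they help classification.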
The results show that the words “cell”, “efficiengcy”, “however”, “breast”, and “rates” have a strong influence on the final classification; for example, “however” generally indicates the negative class.
WEKA text word segmentation results comparison
The accuracy and confusion matrix of each classifier are obtained below:
Classifier |
---|
NaiveBayes |
Logistic |
J48 |
RandomForest |
SVM |
OneR |
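For readers who want to reproduce the two metrics by hand, the following plain-Python sketch computes accuracy and a confusion matrix from true and predicted labels. The label lists are made-up examples and are not results from the classifiers listed above.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """matrix[i][j] = number of items of classes[i] predicted as classes[j]."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(a, p)] for p in classes] for a in classes]

def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up labels for illustration only.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(actual, predicted, ["spam", "ham"])
print(matrix)                       # [[1, 1], [1, 2]]
print(accuracy(actual, predicted))  # 0.6
```

Rows correspond to the actual class and columns to the predicted class, so the diagonal holds the correctly classified messages.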
Conclusion
Spam filtering based on discriminant methods has attracted little attention in modern research. The results clearly show that classification based on the random forest and SVM models improves the accuracy of spam filtering compared with the traditional method.