Identification and classification of spam messages

Background

Spam messages are sent in bulk from base stations or programs, and they drown out the normal messages we actually want to receive (wake up, no girl is confessing to you... just kidding). The phone vibrates, you can't wait to open it, and the greeting is yet another "buy lottery tickets" message. It's so annoying.

So, after a ton of analysis, the conclusion is obvious: it must be the flood of junk messages that keeps my goddess's texts from reaching me, which is why I'm still single. Hateful spam messages, today I'm going to fight you with code.

Data source: Teddy Cup, edu.tipdm.org/course/4255…

The general idea is as follows:

1. Remove the masking character 'x' from the text (the dataset uses it to anonymize numbers)

2. Use jieba to segment the Chinese text into words

3. Remove stop words from the text

4. Join each token list back into a string (for the later analysis steps)

5. Separate text data from labels

6. Vectorize the strings with TF-IDF to obtain each word and a weight for it. (One-hot encoding only tells you whether a word occurs; CountVectorizer also counts how often each word occurs; TF-IDF additionally weights those counts.) The formulas, in scikit-learn's smoothed form, are: TF = (occurrences of the word in a text) / (total words in that text); IDF = log((N + 1) / (N(x) + 1)) + 1, where N is the total number of texts (lines) in the training set and N(x) is the number of texts containing word x; TF-IDF = TF * IDF. A small sanity check of these formulas follows this list.

7. With the data structured, run machine-learning prediction
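As a quick sanity check on the IDF formula in step 6, here is a minimal sketch with toy documents of my own (nothing below comes from the Teddy Cup data) comparing scikit-learn's smoothed IDF against log((N + 1) / (N(x) + 1)) + 1 computed by hand:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['buy lottery now', 'buy phone now', 'hello friend']  # three toy documents
counts = CountVectorizer().fit_transform(docs)

tfidf = TfidfTransformer()  # smooth_idf=True by default, matching the formula above
tfidf.fit(counts)

N = counts.shape[0]  # total number of documents
Nx = np.asarray((counts > 0).sum(axis=0)).ravel()  # documents containing each word
idf_manual = np.log((N + 1) / (Nx + 1)) + 1

print(np.allclose(tfidf.idf_, idf_manual))  # True: the two match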

Basic data processing

import pandas as pd
import re
import jieba

def data_process(file='message80W1.csv'):
    data = pd.read_csv(file, header=None, index_col=0)
    data.columns = ['label', 'message']
    n = 5000

    a = data[data['label'] == 0].sample(n)  # sample 5000 normal messages
    b = data[data['label'] == 1].sample(n)  # sample 5000 spam messages
    data_new = pd.concat([a, b], axis=0)
    data_dup = data_new['message'].drop_duplicates()  # drop duplicate messages
    data_qumin = data_dup.apply(lambda x: re.sub('x', '', x))  # strip the masking character 'x' from every sample

    jieba.load_userdict('newdic1.txt')  # load the user dictionary
    data_cut = data_qumin.apply(lambda x: jieba.lcut(x))  # lcut returns a list; cut_all defaults to False
    stopWords = pd.read_csv('stopword.txt', encoding='GB18030', sep='hahaha', header=None)  # sep='hahaha' never matches, so each line is read as one field
    stopWords = ['≮', '≯', '表示', '≮', ' ', '会', '月', '日', '-'] + list(stopWords.iloc[:, 0])  # prepend a few extra symbols and words
    data_after_stop = data_cut.apply(lambda x: [i for i in x if i not in stopWords])  # drop stop words from every token list
    labels = data_new.loc[data_after_stop.index, 'label']  # the shared index keeps labels aligned with the remaining rows
    adata = data_after_stop.apply(lambda x: ' '.join(x))  # join each token list into a space-separated string

    return adata, data_after_stop, labels  # adata: strings; data_after_stop: token lists; labels: 0/1
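A quick smoke test, assuming message80W1.csv, newdic1.txt and stopword.txt sit next to the script (the exact output varies because of the random sampling):

adata, data_after_stop, labels = data_process()
print(adata.head())           # space-separated token strings
print(labels.value_counts())  # roughly 5000 messages per class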

Prediction of results

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

adata, data_after_stop, labels = data_process()
data_tr, data_te, labels_tr, labels_te = train_test_split(adata, labels, test_size=0.2)

countVectorizer = CountVectorizer()
data_tr = countVectorizer.fit_transform(data_tr)  # raw word counts for the training set
X_tr = TfidfTransformer().fit_transform(data_tr.toarray()).toarray()  # TF-IDF weights for the training set

data_te = CountVectorizer(vocabulary=countVectorizer.vocabulary_).fit_transform(data_te)  # reuse the training vocabulary so the feature columns line up
X_te = TfidfTransformer().fit_transform(data_te.toarray()).toarray()  # TF-IDF weights for the test set

model = MLPClassifier()
model.fit(X_tr, labels_tr)  # train on the training set
model.score(X_te, labels_te)  # accuracy on the test set

The neural network here is called with its default parameters; you can tune them, or swap in another model, for better results.
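For example, a hedged sketch of that idea, continuing from the block above (the hyper-parameter values are illustrative, and MultinomialNB is just one common baseline for word-frequency features, not something the original uses):

from sklearn.naive_bayes import MultinomialNB

model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300)  # illustrative values, not tuned
model.fit(X_tr, labels_tr)
print(model.score(X_te, labels_te))

nb = MultinomialNB()  # a classic baseline for text features
nb.fit(X_tr, labels_tr)
print(nb.score(X_te, labels_te))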

Score after running

Word cloud drawing

# Draw word clouds
from data_process import data_process
from wordcloud import WordCloud
import matplotlib.pyplot as plt

adata, data_after_stop, labels = data_process()

word_fre = {}
for i in data_after_stop[labels == 1]:  # iterate over the token lists whose label is 1 (spam)
    for j in i:
        if j not in word_fre.keys():  # first occurrence: create the key
            word_fre[j] = 1
        else:
            word_fre[j] += 1

mask = plt.imread('duihuakuan.jpg')
wc = WordCloud(mask=mask, background_color='white', font_path=r'C:\Windows\Fonts\simhei.ttf')
wc.fit_words(word_fre)  # feed the frequency dictionary
plt.imshow(wc)  # draw the picture

If the label in the for loop is 1, you get the word cloud for spam messages; if it is 0, you get the one for normal messages. The tricky part is font_path: it must point to a font that can actually render Chinese characters (SimHei here), otherwise the words show up as empty boxes. A compact Counter-based variant is sketched below.
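For reference, a purely illustrative variant (reusing data_after_stop and labels from above) that draws both clouds at once, with collections.Counter doing the same counting as the if/else above:

from collections import Counter

fig, axes = plt.subplots(1, 2)
for ax, label, title in zip(axes, [1, 0], ['spam', 'normal']):
    freq = Counter(w for tokens in data_after_stop[labels == label] for w in tokens)
    wc = WordCloud(background_color='white',
                   font_path=r'C:\Windows\Fonts\simhei.ttf').fit_words(freq)
    ax.imshow(wc)   # fit_words returns the WordCloud itself
    ax.axis('off')
    ax.set_title(title)
plt.show()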

Conclusion

Now that spam messages can be identified, surely my goddess will text me right away ~~