3.6 Naive Bayes practice
3.6.1 Naive Bayes for microblog comment screening
Take Weibo (microblog) comments as an example. To keep the community healthy we want to block vulgar comments, so we need to build a fast filter: if a comment uses negative or insulting language, it is marked as inappropriate content. Filtering this kind of content is a common requirement. We create two classes for this problem: the vulgar class and the non-vulgar class, represented by 1 and 0 respectively.
3.6.1.1 Implementation of Naive Bayes for microblog comment screening
We treat a text as a vector of words or terms, that is, we turn sentences into vectors. Look at the words that occur across all the documents, decide which of them to include in the vocabulary (the word set), and then convert each document into a vector over that vocabulary. For simplicity, we first assume that the text has already been segmented and stored in a list, and that each term list has been labeled with its class. Write the code as follows:
# -*- coding: UTF-8 -*-

"""
Function description: create the experiment dataset
Returns:
    postingList - the tokenized comments
    classVec - the class label vector
"""
def loadDataSet():
    # tokenized comments
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # class labels: 1 - vulgar, 0 - non-vulgar
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

if __name__ == '__main__':
    postingList, classVec = loadDataSet()
    for each in postingList:
        print(each)
    print(classVec)
The output shows that postingList stores the tokenized comments and classVec stores the class of each comment: 1 represents the vulgar class and 0 the non-vulgar class.
Moving on to the code, we have already said that we need to create a vocabulary and convert the segmented terms into term vectors.
# -*- coding: UTF-8 -*-

"""
Function description: create the experiment dataset
Returns:
    postingList - the tokenized comments
    classVec - the class label vector
"""
def loadDataSet():
    # tokenized comments
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # class labels: 1 - vulgar, 0 - non-vulgar
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

"""
Function description: collect the tokenized sample terms into a list of unique terms, i.e. the vocabulary
Parameters:
    dataSet - the tokenized sample dataset
Returns:
    vocabSet - the list of unique terms (the vocabulary)
"""
def createVocabList(dataSet):
    # create an empty set
    vocabSet = set([])
    for document in dataSet:
        # take the union of the two sets
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

"""
Function description: vectorize inputSet according to the vocabulary vocabList; each element of the vector is 1 or 0
Parameters:
    vocabList - the list returned by createVocabList
    inputSet - the tokenized comment to vectorize
Returns:
    returnVec - the document vector (set-of-words model)
"""
def setOfWords2Vec(vocabList, inputSet):
    # create a vector whose elements are all 0
    returnVec = [0] * len(vocabList)
    # iterate over every term
    for word in inputSet:
        # if the term exists in the vocabulary, set the corresponding position to 1
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec  # return the document vector

if __name__ == '__main__':
    postingList, classVec = loadDataSet()
    print('postingList:\n', postingList)
    myVocabList = createVocabList(postingList)
    print('myVocabList:\n', myVocabList)
    trainMat = []
    for postinDoc in postingList:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    print('trainMat:\n', trainMat)
As the output shows, postingList is the original list of tokenized comments and myVocabList is the vocabulary: the set of all words that appear anywhere in the data, with no repeated elements. What is the vocabulary for? It is used to vectorize the comments: if a word from the vocabulary appears in a document, the corresponding position is set to 1, and if it does not, the position is 0. trainMat is the list of all document vectors built from myVocabList.
We’ve got the vector of terms. Next, we can train the naive Bayes classifier with the entry vector.
# -*- coding: UTF-8 -*-
import numpy as np

"""Create the experiment dataset: tokenized comments and their class labels (1 - vulgar, 0 - non-vulgar)."""
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

"""Build the vocabulary: the list of unique terms appearing in the dataset."""
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union of the two sets
    return list(vocabSet)

"""Vectorize inputSet against vocabList: each element of the returned vector is 1 or 0 (set-of-words model)."""
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

"""
Function description: naive Bayes classifier training function
Parameters:
    trainMatrix - training document matrix, i.e. the matrix of returnVec vectors returned by setOfWords2Vec
    trainCategory - training class label vector, i.e. the classVec returned by loadDataSet
Returns:
    p0Vect - conditional probability array of the non-vulgar class
    p1Vect - conditional probability array of the vulgar class
    pAbusive - probability that a document belongs to the vulgar class
"""
def trainNB0(trainMatrix, trainCategory):
    # number of training documents
    numTrainDocs = len(trainMatrix)
    # number of terms per document (vocabulary size)
    numWords = len(trainMatrix[0])
    # probability that a document belongs to the vulgar class
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # create numpy.zeros arrays; word counts are initialized to 0
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    # denominators are initialized to 0.0
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        # collect the data needed for the conditional probabilities of the vulgar class, i.e. P(w0|1), P(w1|1), P(w2|1)...
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        # collect the data needed for the conditional probabilities of the non-vulgar class, i.e. P(w0|0), P(w1|0), P(w2|0)...
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # element-wise division
    p0Vect = p0Num / p0Denom
    # return the two conditional probability arrays and the prior probability of the vulgar class
    return p0Vect, p1Vect, pAbusive

if __name__ == '__main__':
    postingList, classVec = loadDataSet()
    myVocabList = createVocabList(postingList)
    trainMat = []
    for postinDoc in postingList:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(trainMat, classVec)
    print('p0V:\n', p0V)
    print('p1V:\n', p1V)
    print('classVec:\n', classVec)
    print('pAb:\n', pAb)
The result is as follows: p0V stores the probability that each word belongs to class 0, the non-vulgar class. For example, the 6th-from-last entry of p0V is the probability that stupid belongs to the non-vulgar class, which is 0. Similarly, the 6th-from-last entry of p1V is 0.15789474, i.e. there is roughly a 15.79% probability that stupid belongs to the vulgar class; obviously, this word belongs in the vulgar category. pAb is the proportion of vulgar samples among all samples. As classVec shows, there are three vulgar comments and three non-vulgar comments, so the prior probability of the vulgar class is 0.5. In other words, p0V stores the conditional probability of each vocabulary word given the non-vulgar class, for example P(him | non-vulgar) = 0.0833 and P(is | non-vulgar) = 0.0417, all the way to P(dog | non-vulgar) = 0.0417. Similarly, p1V stores the conditional probability of each word given the vulgar class, and pAb is the prior probability.
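If you would rather look a word up than count positions by hand, a minimal sketch like the following does the job. It assumes the myVocabList, p0V and p1V variables produced by the code above are still in scope:

idx = myVocabList.index('stupid')             # position of 'stupid' in the vocabulary
print('P(stupid | non-vulgar) =', p0V[idx])   # 0.0 before smoothing
print('P(stupid | vulgar)     =', p1V[idx])   # about 0.1579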
Now that you’ve trained the classifier, use it to classify.
# -*- coding: UTF-8 -*-
import numpy as np
from functools import reduce

"""Create the experiment dataset: tokenized comments and their class labels (1 - vulgar, 0 - non-vulgar)."""
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

"""Build the vocabulary: the list of unique terms appearing in the dataset."""
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union of the two sets
    return list(vocabSet)

"""Vectorize inputSet against vocabList: each element of the returned vector is 1 or 0 (set-of-words model)."""
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

"""
Function description: naive Bayes classifier training function
Parameters:
    trainMatrix - training document matrix, i.e. the matrix of returnVec vectors returned by setOfWords2Vec
    trainCategory - training class label vector, i.e. the classVec returned by loadDataSet
Returns:
    p0Vect - conditional probability array of the non-vulgar class
    p1Vect - conditional probability array of the vulgar class
    pAbusive - probability that a document belongs to the vulgar class
"""
def trainNB(trainMatrix, trainCategory):
    # number of training documents
    numTrainDocs = len(trainMatrix)
    # number of terms per document (vocabulary size)
    numWords = len(trainMatrix[0])
    # probability that a document belongs to the vulgar class
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # create numpy.zeros arrays; word counts are initialized to 0
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    # denominators are initialized to 0.0
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        # collect the data needed for the conditional probabilities of the vulgar class, i.e. P(w0|1), P(w1|1), P(w2|1)...
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        # collect the data needed for the conditional probabilities of the non-vulgar class, i.e. P(w0|0), P(w1|0), P(w2|0)...
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # element-wise division
    p0Vect = p0Num / p0Denom
    return p0Vect, p1Vect, pAbusive

"""
Function description: naive Bayes classifier classification function
Parameters:
    vec2Classify - the term vector to classify
    p0Vec - conditional probability array of the non-vulgar class
    p1Vec - conditional probability array of the vulgar class
    pClass1 - probability that a document belongs to the vulgar class
Returns:
    1 - vulgar, 0 - non-vulgar
"""
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # reduce applies a two-argument function cumulatively to the items of a sequence from left to right,
    # merging the sequence into a single value;
    # for example, reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]) computes ((((1 + 2) + 3) + 4) + 5)
    p1 = reduce(lambda x, y: x * y, vec2Classify * p1Vec) * pClass1
    p0 = reduce(lambda x, y: x * y, vec2Classify * p0Vec) * (1.0 - pClass1)
    print('p0:', p0)
    print('p1:', p1)
    if p1 > p0:
        return 1
    else:
        return 0

"""Test the naive Bayes classifier."""
def testingNB():
    ## Step 1: load data
    print("Step 1: load data...")
    listPosts, listClasses = loadDataSet()
    # build the vocabulary
    myVocabList = createVocabList(listPosts)
    # vectorize the training samples
    trainMat = []
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    ## Step 2: training...
    print("Step 2: training...")
    p0V, p1V, pAb = trainNB(np.array(trainMat), np.array(listClasses))
    ## Step 3: testing
    print("Step 3: testing...")
    testEntry = ['love', 'my', 'dalmation']
    # vectorize the test sample
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    ## Step 4: show the result
    print("Step 4: show the result...")
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, 'belongs to the vulgar class')
    else:
        print(testEntry, 'belongs to the non-vulgar class')
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, 'belongs to the vulgar class')
    else:
        print(testEntry, 'belongs to the non-vulgar class')

if __name__ == '__main__':
    testingNB()
We test two samples; they also need to be vectorized before the classifier can use them. The classifyNB() function then applies the naive Bayes formula to compute the probability that each vector belongs to the vulgar and non-vulgar classes. The running results are as follows:
You can see that the algorithm written this way cannot classify anything: p0 and p1 both evaluate to zero, so something is clearly wrong. What is the solution?
Some of the word probabilities in the calculation are zero. If the new text contains even one word whose conditional probability is zero, then the product, and hence the probability that the text belongs to that class, is zero. This is clearly unreasonable. To reduce this effect, we can initialize every word count to 1 and every denominator to 2. This technique is called Laplace smoothing (add-one smoothing) and is commonly used to deal with the zero-probability problem.
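To make the effect concrete, here is a minimal sketch (not part of the book code; the counts are only for illustration) comparing the raw frequency estimate with the Laplace-smoothed estimate for a word that never occurs in a class:

# a word that never appears in the non-vulgar class
count = 0          # occurrences of the word in non-vulgar documents
total = 24         # total number of words in non-vulgar documents

p_raw = count / total                   # 0.0  -> wipes out the whole product
p_smoothed = (count + 1) / (total + 2)  # about 0.038: small, but no longer zero

print(p_raw, p_smoothed)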
Another problem is underflow, which is caused by multiplying many very small numbers together. When you keep multiplying numbers smaller than one, the product gets smaller and smaller, and eventually the floating-point result rounds down to 0. To solve this problem we take the natural logarithm of the product: working with log probabilities avoids underflow and floating-point rounding errors, and nothing is lost by doing so. The graph below shows the curves of f(x) and ln(f(x)).
Figure 1
If you examine the two curves, you will find that they increase and decrease together over the same regions and reach their extrema at the same points. Their values differ, but because the logarithm is monotonic, that does not affect which class wins.
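Before looking at the final code, here is a quick numerical check of the underflow problem (a minimal sketch, not part of the original program):

import numpy as np

# multiply 200 small probabilities directly: the product underflows to 0.0 in 64-bit floats
probs = np.full(200, 0.01)
print(np.prod(probs))          # 0.0

# the same computation in log space stays perfectly usable
print(np.sum(np.log(probs)))   # about -921.0; comparing log scores still ranks the classes correctly

With Laplace smoothing and the log transform in place, the final code looks like this.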
# -*- coding: UTF-8 -*-
import numpy as np

"""Create the experiment dataset: tokenized comments and their class labels (1 - vulgar, 0 - non-vulgar)."""
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

"""Build the vocabulary: the list of unique terms appearing in the dataset."""
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union of the two sets
    return list(vocabSet)

"""Vectorize inputSet against vocabList: each element of the returned vector is 1 or 0 (set-of-words model)."""
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

"""
Function description: naive Bayes classifier training function
Parameters:
    trainMatrix - training document matrix, i.e. the matrix of returnVec vectors returned by setOfWords2Vec
    trainCategory - training class label vector, i.e. the classVec returned by loadDataSet
Returns:
    p0Vect - conditional probability array of the non-vulgar class
    p1Vect - conditional probability array of the vulgar class
    pAbusive - probability that a document belongs to the vulgar class
"""
def trainNB(trainMatrix, trainCategory):
    # number of training documents
    numTrainDocs = len(trainMatrix)
    # number of terms per document (vocabulary size)
    numWords = len(trainMatrix[0])
    # probability that a document belongs to the vulgar class
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # create numpy.ones arrays; word counts are initialized to 1 (Laplace smoothing)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    # denominators are initialized to 2.0 (Laplace smoothing)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        # collect the data needed for the conditional probabilities of the vulgar class, i.e. P(w0|1), P(w1|1), P(w2|1)...
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        # collect the data needed for the conditional probabilities of the non-vulgar class, i.e. P(w0|0), P(w1|0), P(w2|0)...
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take the logarithm to avoid underflow
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

"""
Function description: naive Bayes classifier classification function
Parameters:
    vec2Classify - the term vector to classify
    p0Vec - conditional probability array of the non-vulgar class
    p1Vec - conditional probability array of the vulgar class
    pClass1 - probability that a document belongs to the vulgar class
Returns:
    1 - vulgar, 0 - non-vulgar
"""
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # in log space the product of probabilities becomes a sum
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    print('p0:', p0)
    print('p1:', p1)
    if p1 > p0:
        return 1
    else:
        return 0

"""Test the naive Bayes classifier."""
def testingNB():
    ## Step 1: load data
    print("Step 1: load data...")
    listPosts, listClasses = loadDataSet()
    # build the vocabulary
    myVocabList = createVocabList(listPosts)
    # vectorize the training samples
    trainMat = []
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    ## Step 2: training...
    print("Step 2: training...")
    p0V, p1V, pAb = trainNB(np.array(trainMat), np.array(listClasses))
    ## Step 3: testing
    print("Step 3: testing...")
    testEntry = ['love', 'my', 'dalmation']
    # vectorize the test sample
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    ## Step 4: show the result
    print("Step 4: show the result...")
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, 'belongs to the vulgar class')
    else:
        print(testEntry, 'belongs to the non-vulgar class')
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, 'belongs to the vulgar class')
    else:
        print(testEntry, 'belongs to the non-vulgar class')

if __name__ == '__main__':
    testingNB()
The results are as follows.
For the complete code, see NB_word2vec_v1.py under 1.NB_word2vec.
3.6.1.2 Naive Bayes microblog comment screening – calling the sklearn library
When calling the sklearn library, the data input is the same as in the previous section; only the library-calling workflow changes. The code below is heavily commented, so it is shown directly without a line-by-line walkthrough.
# -*- coding: UTF-8 -*-
import numpy as np
from sklearn.naive_bayes import GaussianNB

"""Create the experiment dataset: tokenized comments and their class labels (1 - vulgar, 0 - non-vulgar)."""
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

"""Build the vocabulary: the list of unique terms appearing in the dataset."""
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union of the two sets
    return list(vocabSet)

"""Vectorize inputSet against vocabList: each element of the returned vector is 1 or 0 (set-of-words model)."""
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

"""Test the sklearn naive Bayes classifier."""
def testingNB():
    ## Step 1: load data
    print("Step 1: load data...")
    listPosts, listClasses = loadDataSet()
    # build the vocabulary
    myVocabList = createVocabList(listPosts)
    # vectorize the training samples
    trainMat = []
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    ## Step 2: init NB
    print("Step 2: init NB...")
    gnb = GaussianNB()
    ## Step 3: training...
    print("Step 3: training...")
    gnb.fit(trainMat, listClasses)
    ## Step 4: testing
    print("Step 4: testing...")
    testEntry = ['love', 'my', 'dalmation']
    #testEntry = ['stupid', 'garbage']
    # vectorize the test sample
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    predictedLabel = gnb.predict([thisDoc])
    ## Step 5: show the result
    print("Step 5: show the result...")
    print(predictedLabel)
    if predictedLabel == 0:
        print('belongs to the non-vulgar class')
    else:
        print('belongs to the vulgar class')

if __name__ == '__main__':
    testingNB()
The result is as follows.
For the complete code, see NB_word2vec_v2.py under 1.NB_word2vec.
3.6.2 Naive Bayes classification of Iris flowers
In the last chapter we used KNN to classify iris flowers. Can naive Bayes do the job as well? The answer is yes. As before, we complete the task in two ways: first by writing the whole algorithm ourselves, then by calling the sklearn API.
3.6.2.1 Implementation of Naive Bayes for iris flower classification
Let’s just go to the code.
# -*- coding: utf-8 -*-
import numpy as np
import csv      # for handling csv files
import random   # for random numbers

"""
Function description: load the data

Parameters:
    filename - file name
    split - train/test split ratio
    trainSet - training set
    testSet - test set
Returns:
    None
"""
def loadDataset(filename, split, trainSet = [], testSet = []):
    with open(filename, 'rt') as csvfile:
        # read the rows from the csv file
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            # distribute the samples between the training set and the test set
            # random.random() returns a random float in [0, 1)
            if random.random() < split:
                trainSet.append(dataset[x])
            else:
                # put the remaining samples into the test set
                testSet.append(dataset[x])
"""
函数说明:分割数据
Parameters:
dataSet - 数据集
Returns:
data_X - 特征数据集
data_Y - 标签数据集
"""
def segmentation_Data(dataSet):
#得到文件行数
Lines = len(dataSet)
#返回的NumPy矩阵,解析完成的数据:4列
data_X = np.zeros((Lines,4))
data_Y = []
for x in range(Lines):
data_X[x,:] = dataSet[x][0:4]
if dataSet[x][-1] == 'Iris-setosa':
data_Y.append(1)
elif dataSet[x][-1] == 'Iris-versicolor':
data_Y.append(2)
elif dataSet[x][-1] == 'Iris-virginica':
data_Y.append(3)
return data_X, data_Y
"""
函数说明:将切分的实验样本词条整理成不重复的词条列表,也就是词汇表
Parameters:
dataSet - 整理的样本数据集
Returns:
vocabSet - 返回不重复的词条列表,也就是词汇表
"""
#创建一个包含在所有文档中出现的不重复的列表 |用于求两个集合并集,词集
def createVocabList(dataSet):
#创建一个空的不重复列表
vocabSet = set([])
#遍历数据集
for document in dataSet:
vocabSet = vocabSet | set(document) #取并集
#print(vocabSet)
return list(vocabSet)#生成一个包含所有单词的列表
"""
函数说明:根据vocabList词汇表,将inputSet向量化,向量的每个元素为1或0
Parameters:
vocabList - createVocabList返回的列表
inputSet - 切分的词条列表
Returns:
returnVec - 文档向量,词集模型
"""
def setOfWords2Vec(vocabList, inputSet):
#创建一个其中所含元素都为0的向量
returnVec = [0] * len(vocabList)
#遍历每个词条
for word in inputSet:
#如果词条存在于词汇表中,则置1
if word in vocabList:
returnVec[vocabList.index(word)] = 1
else:
pass
#print("the word: %s is not in my Vocabulary!" % word)
#print(returnVec)
return returnVec #返回向量
"""
函数说明:朴素贝叶斯分类器训练函数
Parameters:
trainMatrix - 训练文档矩阵,即setOfWords2Vec返回的returnVec构成的矩阵
trainCategory - 训练类别标签向量,即loadDataSet返回的classVec
Returns:
p1Vect,p2Vect,p3Vect,pAbusive,pBbusive,pCbusive
"""
def trainNB(trainMatrix,trainCategory):
#计算训练的文档条数
numTrainDocs = len(trainMatrix)
#print("numTrainDocs:" + str(numTrainDocs))
#计算每篇文档的词条数
numWords = len(trainMatrix[0])
#print("numWords:" + str(numWords))
count = np.full(3, 0.0)
for i in range(len(trainCategory)):
if trainCategory[i] == 1:
count[0] += 1
elif trainCategory[i] == 2:
count[1] += 1
else:
count[2] += 1
pbusive = []
#计算先验概率
for i in range(3):
pb = count[i] /float(numTrainDocs)
pbusive.append(pb)
#print(pbusive)
#创建numpy.ones数组,词条出现数初始化为1,拉普拉斯平滑
pNum = np.ones((3,numWords))
#print(pNum)
#分母初始化为0.0#避免其中一项为0的影响
pDenom = np.full(3, 2.0)
#print(pDenom)
for i in range(numTrainDocs):
#统计属于低俗类的条件概率所需的数据,即P(w0|1),P(w1|1),P(w2|1)···
if trainCategory[i] == 1:
pNum[0] += trainMatrix[i]
pDenom[0] += sum(trainMatrix[i])
elif trainCategory[i] == 2:
pNum[1] += trainMatrix[i]
pDenom[1] += sum(trainMatrix[i])
else:
pNum[2] += trainMatrix[i]
pDenom[2] += sum(trainMatrix[i])
pVect = []
#避免下溢出问题
for i in range(3):
pV = np.log(pNum[i]/pDenom[i]) #相除
pVect.append(pV)
return pVect, pbusive #返回条件概率数组
"""
函数说明:朴素贝叶斯分类器分类函数
Parameters:
vec2Classify - 待分类的词条数组
pVec
pClass
lables - 标签
Returns:
最大概率的标签
"""
def classifyNB(vec2Classify, pVec, pClass,lables):
#概率列表
p = []
#从左到右对一个序列的项累计地应用有两个参数的函数,以此合并序列到一个单一值
for i in range(len(lables)):
result = sum(vec2Classify * pVec[i]) + np.log(pClass[i])
p.append(result)
#返回p中元素从小到大排序后的索引值
# 按照升序进行快速排序,返回的是原数组的下标。
# 比如,x = [30, 10, 20, 40]
# 升序排序后应该是[10,20,30,40],他们的原下标是[1,2,0,3]
# 那么,numpy.argsort(x) = [1, 2, 0, 3]
sortedpIndices = np.argsort(p)
#返回最大概率标签
return lables[sortedpIndices[-1]]
"""
函数说明:计算准确率
Parameters:
testSet - 测试集
predictions - 预测值
Returns:
返回准确率
"""
#计算准确率
def getAccuracy(testSet, predictions):
correct = 0
for x in range(len(testSet)):
if testSet[x] == predictions[x]:
correct += 1
return (correct/float(len(testSet)))*100.0
"""
函数说明:测试朴素贝叶斯分类器
Parameters:
无
Returns:
无
"""
def testingNB():
## Step 1: load data
print("Step 1: load data...")
#创建数据
#prepare data
trainSet = []#训练数据集
testSet = []#测试数据集
split = 0.8#分割的比例
#lables = ['Iris-setosa','Iris-versicolor','Iris-virginica']
lables = [1, 2, 3]
loadDataset('C:/TensorFlow/irisdata.txt', split, trainSet, testSet)
#数据集分割
train_X,train_Y = segmentation_Data(trainSet)
test_X,test_Y = segmentation_Data(testSet)
print('Train set: ' + repr(len(trainSet)))
print('Test set: ' + repr(len(testSet)))
#print(train_X)
#print(train_Y)
#创建实验样本
#listPosts,listClasses = loadDataSet()
#创建词汇表
myVocabList = createVocabList(train_X)
#向量化样本
trainMat=[]
for postinDoc in train_X:
#将实验样本向量化
trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
#print(trainMat)
## Step 2: training...
print("Step 2: training...")
#训练朴素贝叶斯分类器
pV,pb = trainNB(np.array(trainMat),np.array(train_Y))
## Step 3: testing
print("Step 3: testing...")
#测试样本
#testEntry = [5.1,3.5,1.4,0.2]
#testEntry = [6.8,2.8,4.8,1.4]
thisDoc = []
predictedLabel = []
for postinDoc in test_X:
#将实验样本向量化
thisDoc.append(setOfWords2Vec(myVocabList, postinDoc))
#测试样本向量化
#thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
for i in range(len(thisDoc)):
result = classifyNB(thisDoc[i],pV,pb,lables)
predictedLabel.append(result)
## Step 4: show the result
print("Step 4: show the result...")
#print(predictedLabel)
#print(test_Y)
#准确率
accuracy = getAccuracy(test_Y, predictedLabel)
print('\nAccuracy: ' + repr(accuracy) + '%')
if __name__ == '__main__':
testingNB()
Copy the code
The results are shown below.
For the complete code, see NB_Iris_Classify_v1.py under 2.NB_Iris_Classify.
3.6.2.2 Naive Bayes iris flower classification – calling the sklearn library
Next, we use Sklearn to classify, as follows.
# -*- coding: UTF-8 -*-
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Step 1: load data
print("Step 1: load data...")
iris = datasets.load_iris()
## Step 2: split data
print("Step 2: split data...")
# X = features, Y = labels
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.6)
## Step 3: init NB
print("Step 3: init NB...")
gnb = GaussianNB()
## Step 4: training...
print("Step 4: training...")
gnb.fit(X_train, Y_train)
## Step 5: testing
print("Step 5: testing...")
predictedLabel = gnb.predict(X_test)
## Step 6: show the result
print("Step 6: show the result...")
# accuracy: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
print(accuracy_score(Y_test, predictedLabel))
print("predictedLabel is :")
print(predictedLabel)
The results are shown below.
For the complete code, see NB_Iris_Classify_v2.py under 2.NB_Iris_Classify.
3.6.3 Naive Bayes spam filtering
3.6.3.1 Naive Bayes spam filtering implementation
One of the best-known applications of naive Bayes is e-mail spam filtering. For English text, we can split on anything that is not a letter or a digit, using a regular-expression split. Write the code as follows:
# -*- coding: UTF-8 -*-
import re

"""Accept a long string and parse it into a list of word strings."""
def textParse(bigString):
    # split on runs of non-word characters, i.e. anything that is not a letter or a digit
    listOfTokens = re.split(r'\W+', bigString)
    # lowercase every word and drop strings shorter than 3 characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

"""
Function description: collect the tokenized sample terms into a list of unique terms, i.e. the vocabulary
Parameters:
    dataSet - the tokenized sample dataset
Returns:
    vocabSet - the list of unique terms (the vocabulary)
"""
def createVocabList(dataSet):
    # create an empty set
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union
    return list(vocabSet)

if __name__ == '__main__':
    docList = []
    classList = []
    for i in range(1, 26):
        # read every spam message and turn the string into a list of words
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())
        docList.append(wordList)
        classList.append(1)   # mark spam; 1 means spam
        # read every ham (non-spam) message and turn the string into a list of words
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())
        docList.append(wordList)
        classList.append(0)   # mark ham; 0 means non-spam
    # build the vocabulary without duplicates
    vocabList = createVocabList(docList)
    print(vocabList)

According to the vocabulary, we can vectorize each text. We divide the dataset into a training set and a test set, and test the accuracy of the naive Bayes classifier by cross validation. The code is as follows:

# -*- coding: UTF-8 -*-
import numpy as np
import random
import re

"""Build the vocabulary: the list of unique terms appearing in the dataset."""
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union
    return list(vocabSet)

"""Vectorize inputSet against vocabList: each element of the returned vector is 1 or 0 (set-of-words model)."""
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

"""
Function description: naive Bayes classifier training function
Parameters:
    trainMatrix - training document matrix, i.e. the matrix of returnVec vectors returned by setOfWords2Vec
    trainCategory - training class label vector
Returns:
    p0Vect - conditional probability array of the ham class
    p1Vect - conditional probability array of the spam class
    pAbusive - probability that a document is spam
"""
def trainNB(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # create numpy.ones arrays; word counts are initialized to 1 and denominators to 2 (Laplace smoothing)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        # collect the data needed for the conditional probabilities of each class, i.e. P(w0|1), P(w1|1), P(w2|1)...
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take the logarithm to avoid underflow
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

"""Naive Bayes classification function: returns 1 (spam) or 0 (ham)."""
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

"""Accept a long string and parse it into a list of word strings."""
def textParse(bigString):
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

"""Test the naive Bayes spam classifier."""
def spamTest():
    ## Step 1: load data
    print("Step 1: load data...")
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        # read every spam message and turn the string into a list of words
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())
        docList.append(wordList)
        fullText.append(wordList)
        # mark spam; 1 means spam
        classList.append(1)
        # read every ham message and turn the string into a list of words
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())
        docList.append(wordList)
        fullText.append(wordList)
        # mark ham; 0 means non-spam
        classList.append(0)
    # build the vocabulary without duplicates
    vocabList = createVocabList(docList)
    # indices of the 50 messages; 40 are randomly chosen for training and 10 for testing
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):
        # pick a random index
        randIndex = int(random.uniform(0, len(trainingSet)))
        # add it to the test set and remove it from the training set
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    ## Step 2: training...
    print("Step 2: training...")
    for docIndex in trainingSet:
        # vectorize the training documents
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # train the naive Bayes model
    p0V, p1V, pSpam = trainNB(np.array(trainMat), np.array(trainClasses))
    # error counter
    errorCount = 0
    ## Step 3: testing
    print("Step 3: testing...")
    for docIndex in testSet:
        # vectorize the test documents
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        # if the classification is wrong, increase the error count and print the misclassified document
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("misclassified test document:", docList[docIndex])
    ## Step 4: show the result
    print("Step 4: show the result...")
    print('error rate: %.2f%%' % (float(errorCount) / len(testSet) * 100))

if __name__ == '__main__':
    spamTest()
For the complete code, see NB_email_Classify_v1.py under 3.NB_email_Classify.
The spamTest() function outputs the error rate on 10 randomly selected e-mail messages. Since the e-mails are chosen at random, the output varies a little from run to run. Whenever an error occurs, the function prints the word list of the misclassified document so that you can see which message went wrong. For a better estimate of the error rate, you should repeat the whole procedure several times, say 10 times, and average the results. Also note that misclassifying spam as normal mail is far less harmful than classifying normal mail as spam.
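A minimal sketch of that averaging, assuming spamTest() is modified to return the error rate instead of only printing it:

# assume spamTest() has been changed to end with:
#     return float(errorCount) / len(testSet)
numRuns = 10
errorSum = 0.0
for _ in range(numRuns):
    errorSum += spamTest()
print('average error rate over %d runs: %.2f%%' % (numRuns, errorSum / numRuns * 100))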
3.6.3.2 Naive Bayes spam filtering – calling the sklearn library
As in the previous examples, we call the sklearn library.
# -*- coding: UTF-8 -*-
import random
import re
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

"""Build the vocabulary: the list of unique terms appearing in the dataset."""
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union
    return list(vocabSet)

"""Vectorize inputSet against vocabList: each element of the returned vector is 1 or 0 (set-of-words model)."""
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

"""Accept a long string and parse it into a list of word strings."""
def textParse(bigString):
    # split on runs of non-word characters, i.e. anything that is not a letter or a digit
    listOfTokens = re.split(r'\W+', bigString)
    #print(listOfTokens)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

"""Test the sklearn naive Bayes spam classifier."""
def spamTest():
    ## Step 1: load data
    print("Step 1: load data...")
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        # read every spam message and turn the string into a list of words
        wordList = textParse(open('email/spam/%d.txt' % i, 'rt').read())
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(1)   # mark spam; 1 means spam
        # read every ham message and turn the string into a list of words
        wordList = textParse(open('email/ham/%d.txt' % i, 'rt').read())
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(0)   # mark ham; 0 means non-spam
    # build the vocabulary without duplicates
    vocabList = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    # from the 50 messages, randomly pick 10 as a held-out set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    ## Step 2: training...
    print("Step 2: training...")
    for docIndex in trainingSet:
        # vectorize the training documents
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    X_train, X_test, Y_train, Y_test = train_test_split(trainMat, trainClasses, test_size=.6)
    ## Step 3: init NB
    print("Step 3: init NB...")
    gnb = GaussianNB()
    ## Step 4: training...
    print("Step 4: training...")
    gnb.fit(X_train, Y_train)
    ## Step 5: testing
    print("Step 5: testing...")
    predictedLabel = gnb.predict(X_test)
    ## Step 6: show the result
    print("Step 6: show the result...")
    # accuracy: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
    print(accuracy_score(Y_test, predictedLabel))
    print("predictedLabel is :")
    print(predictedLabel)

if __name__ == '__main__':
    spamTest()
The results are shown below.
For the complete code, see NB_email_Classify_v2.py under 3.NB_email_Classify.
3.6.4 Summary of naive Bayes in sklearn
References — Chinese documentation: sklearn.apachecn.org/cn/0.19.0/m… English documentation: scikit-learn.org/stable/modu… API: scikit-learn.org/stable/modu…
Naive Bayes is a relatively simple algorithm, and the naive Bayes classes in scikit-learn are equally simple to use. Compared with algorithms such as decision trees and KNN, naive Bayes has fewer parameters to worry about, so it is easier to master. scikit-learn provides three naive Bayes classifier classes: GaussianNB, MultinomialNB, and BernoulliNB. GaussianNB is naive Bayes with Gaussian-distributed features, MultinomialNB assumes multinomially distributed features, and BernoulliNB assumes Bernoulli-distributed (binary) features; the probability model described earlier in this chapter corresponds to the multinomial case. In general, GaussianNB is preferable when the sample features are mostly continuous values; MultinomialNB is appropriate when most features are multinomial discrete values; and BernoulliNB should be used when the features are binary or very sparse multivariate discrete values.
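A minimal sketch showing the three classes side by side on tiny made-up data (the arrays below are only for illustration):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.0], [6.9, 3.1]])  # continuous features
X_cnt  = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 4]])       # word counts
X_bin  = (X_cnt > 0).astype(int)                                      # binary presence/absence
y = np.array([0, 1, 0, 1])

print(GaussianNB().fit(X_cont, y).predict(X_cont))     # continuous features -> GaussianNB
print(MultinomialNB().fit(X_cnt, y).predict(X_cnt))    # count features -> MultinomialNB
print(BernoulliNB().fit(X_bin, y).predict(X_bin))      # binary features -> BernoulliNB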
GaussianNB class
class sklearn.naive_bayes.GaussianNB(priors=None)
Parameter Description:
- priors: the prior probabilities of the classes. Optional, defaults to None. If priors are specified, they are used as given and are not adjusted according to the data.
GaussianNB assumes that each feature, given the class, follows a Gaussian (normal) distribution:
P(x_j | Y = C_k) = 1 / sqrt(2π σ_k²) · exp( −(x_j − μ_k)² / (2 σ_k²) )
Here C_k is the k-th class of Y, and μ_k and σ_k² are parameters estimated from the training set.
GaussianNB estimates μ_k and σ_k² from the training set: μ_k is the mean of x_j over the samples of class C_k, and σ_k² is the variance of x_j over the samples of class C_k.
The GaussianNB class has only one major parameter, priors, which corresponds to the prior probability P(Y = C_k) of each class of Y. This value is not given by default; if it is not given, then P(Y = C_k) = m_k / m, where m is the total number of training samples and m_k is the number of training samples whose output is the k-th class. If priors is given, it takes precedence.
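For example (a minimal sketch; the 0.6/0.4 priors are made up purely for illustration):

from sklearn.naive_bayes import GaussianNB

# let the model estimate P(Y = C_k) from the class frequencies in the training data
gnb_default = GaussianNB()

# or fix the class priors ourselves, e.g. 60% for class 0 and 40% for class 1
gnb_fixed = GaussianNB(priors=[0.6, 0.4])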
After fitting the data with GaussianNB's fit method, we can make predictions. There are three prediction methods: predict, predict_proba, and predict_log_proba.
predict is the most commonly used method: it directly outputs the predicted class for each test sample. predict_proba instead returns the predicted probability of every class for each test sample; the class with the largest predict_proba value is exactly the class returned by predict.
predict_log_proba is similar to predict_proba, except that it returns the logarithm of the predicted probability of every class for each test sample; again, the class with the largest log probability is the class returned by predict.
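A minimal sketch of the three methods, reusing the iris data (the exact numbers will vary with the random split):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

gnb = GaussianNB().fit(X_train, y_train)
print(gnb.predict(X_test[:3]))            # predicted classes
print(gnb.predict_proba(X_test[:3]))      # per-class probabilities
print(gnb.predict_log_proba(X_test[:3]))  # logarithm of the probabilities above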
BernoulliNB class
class sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
Parameter Description:
- alpha: optional float, default 1.0. Used for Laplace (additive) smoothing; setting it to 0 disables smoothing.
- binarize: optional, default 0.0. Threshold for binarizing (mapping to Boolean) the sample features. If None, the input is assumed to consist of binary vectors already.
- fit_prior: optional Boolean, default True. Whether to learn class prior probabilities from the data; if False, every class gets the same prior probability.
- class_prior: optional, default None. The class prior probabilities, if you want to supply them yourself.
BernoulliNB assumes that each feature follows a binary Bernoulli distribution, namely:
P(X_jl | Y = C_k) = P(j | Y = C_k) · X_jl + (1 − P(j | Y = C_k)) · (1 − X_jl)
Each feature can take only two values: x_jl can only be 0 or 1.
BernoulliNB has four parameters, three of which have the same names and meanings as in MultinomialNB. The only additional parameter is binarize. It helps BernoulliNB work with binary features and can either be a number or be left as None. If it is None, BernoulliNB assumes every feature is already binary; otherwise, feature values no greater than binarize are mapped to one value and values greater than binarize to the other.
After fitting the data with BernoulliNB's fit or partial_fit method, we can make predictions. The three prediction methods, predict, predict_proba, and predict_log_proba, behave exactly as in GaussianNB.
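A minimal sketch of binarize (the data and the 1.0 threshold are made up for illustration):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# word-count features; anything above the threshold will be treated as "present"
X = np.array([[3, 0, 1],
              [0, 2, 0],
              [1, 1, 0],
              [0, 0, 5]])
y = np.array([0, 1, 0, 1])

# counts > 1.0 become 1 and counts <= 1.0 become 0 before the Bernoulli model is fit
bnb = BernoulliNB(binarize=1.0)
bnb.fit(X, y)
print(bnb.predict(X))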
MultinomialNB class
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
Parameter Description:
- alpha: optional float, default 1.0. Used for Laplace (additive) smoothing; setting it to 0 disables smoothing.
- fit_prior: optional Boolean, default True. Whether to take class prior probabilities into account; if False, every class gets the same prior probability. Otherwise, you can either supply the priors yourself through the third parameter, class_prior, or leave class_prior unset and let MultinomialNB compute the priors from the training data as P(Y = C_k) = m_k / m, where m is the total number of training samples and m_k is the number of training samples whose output is the k-th class.
- class_prior: optional, default None. The class prior probabilities, if you want to supply them yourself.
MultinomialNB assumes that the features follow a multinomial distribution, with the smoothed estimate
P(X_j = x_jl | Y = C_k) = (x_jl + λ) / (m_k + nλ)
Here P(X_j = x_jl | Y = C_k) is the conditional probability of the l-th value of the j-th feature in the k-th class, and m_k is the number of training samples whose output is the k-th class. λ is a constant greater than 0, usually taken as 1, which gives Laplace smoothing; other values can also be used.
Another important feature of MultinomialNB is the partial_fit method. It is useful when the training set is too large to load into memory at once: we can split the training set into several parts and call partial_fit repeatedly to learn from it step by step, which is very convenient.
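A minimal sketch of incremental training with partial_fit (the chunks below are made up; note that the full list of classes must be passed on the first call):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB(alpha=1.0)

# pretend these are two chunks of a dataset too large to hold in memory at once
X_chunk1 = np.array([[2, 1, 0], [0, 3, 1]]);  y_chunk1 = np.array([0, 1])
X_chunk2 = np.array([[1, 0, 2], [0, 1, 4]]);  y_chunk2 = np.array([0, 1])

mnb.partial_fit(X_chunk1, y_chunk1, classes=np.array([0, 1]))  # first call: declare all classes
mnb.partial_fit(X_chunk2, y_chunk2)                            # later calls: just pass more data
print(mnb.predict(np.array([[0, 2, 3]])))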
GaussianNB and BernoulliNB offer similar capabilities. After fitting the data with MultinomialNB's fit or partial_fit method, we can make predictions using the same three methods: predict, predict_proba, and predict_log_proba. predict directly outputs the predicted class for each test sample; predict_proba returns the predicted probability of every class, and the class with the largest probability is the class returned by predict.
predict_log_proba is similar to predict_proba, except that it returns the logarithm of each class probability; the class with the largest log probability is again the class returned by predict. For further details, please refer to the official documentation.
[1] H. Zhang (2004). The Optimality of Naive Bayes. Proc. FLAIRS. Home page: www.cs.unb.ca/~hzhang/
[2] sklearn.apachecn.org/cn/0.19.0/m…
[3] scikit-learn.org/stable/modu…
[4] C.D. Manning, P. Raghavan and H. Schutze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265.
[5] A. McCallum and K. Nigam (1998). A comparison of event models for Naive Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
[6] V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam Filtering with Naive Bayes – Which Naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
[Note] More reference code is provided in the attachment to this chapter; please read it on your own.
Attached is the reference code for this chapter