Before learning the naive Bayes classification model, let's review the KNN and decision tree algorithms covered earlier, along with my own summary: different machine learning methods rest on different assumptions and theory, and these assumptions largely determine each algorithm's strengths and weaknesses.

- KNN: assumes that samples of the same class cluster together in the sample space, i.e. their mutual distances are small. Under this assumption we only need to compute the distance between a test sample and the training samples; the class of the nearest samples is, to a large extent, the class of the test sample.
- Decision tree: based on information theory. The sample data starts out disordered, but it has features; using those features effectively so that the samples become orderly and separable is exactly what a decision tree is supposed to do.
- Naive Bayes, introduced today, is based on Bayes' rule. Why call it "naive"? Because modelling the dependencies between features simply cannot be done here, so the algorithm makes a simplifying assumption: the features are independent of one another and equally important.
Naive Bayes’ principle
The figure below shows our sample data set.
We use p1(x,y) to denote the probability that the data point (x,y) belongs to category 1 (the category represented by dots in the figure) and p2(x,y) to denote the probability that it belongs to category 2 (the category represented by triangles). Then, for a new data point (x,y), the following rules determine its category:
- If p1(x,y) > p2(x,y), the category is 1
- If p2(x,y) > p1(x,y), the category is 2
In other words, we choose the category with the highest probability. So that’s the idea, but how do you compute p1(x,y) and p2(x,y)? That’s where conditional probability and Bayes’ rule come in.
Conditional probability and Bayes' rule
Conditional probability is the probability of event A given that event B has occurred, written P(A|B). Let's look at the example in the book: a jar holds seven stones, three white and four black. The probability of drawing a white stone is 3/7 and the probability of drawing a black one is 4/7. That much is easy.
Now suppose the seven stones are split between two buckets, as shown here. What is the probability of drawing a white stone given that we draw from bucket B? From the picture it is obviously 1/3, but once the setting gets more complicated it becomes hard to read the answer off directly, so we use the conditional probability formula:

P(A|B) = P(A and B) / P(B)

In the book's figure, bucket B holds three of the seven stones, one of which is white, so P(white and bucket B) = 1/7 and P(bucket B) = 3/7, giving P(white | bucket B) = (1/7) / (3/7) = 1/3, which matches the direct count.
Another way to compute a conditional probability is Bayes' rule, which lets us swap the event and the condition:

P(A|B) = P(B|A) · P(A) / P(B)
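To make these formulas concrete, here is a minimal numeric sketch of the two-bucket example (my addition, not from the original article; the per-bucket split is assumed to match the book's figure: bucket A holds 2 white and 2 black stones, bucket B holds 1 white and 2 black). It checks that the conditional probability formula and Bayes' rule both give 1/3:

```python
# Assumed split: bucket A holds 2 white + 2 black stones, bucket B holds 1 white + 2 black.
buckets = {
    'A': {'white': 2, 'black': 2},
    'B': {'white': 1, 'black': 2},
}
total = sum(sum(b.values()) for b in buckets.values())          # 7 stones in all

p_b = sum(buckets['B'].values()) / total                        # P(bucket B) = 3/7
p_white_and_b = buckets['B']['white'] / total                   # P(white and bucket B) = 1/7
p_white = sum(b['white'] for b in buckets.values()) / total     # P(white) = 3/7

# Conditional probability formula: P(white | B) = P(white and B) / P(B)
p_white_given_b = p_white_and_b / p_b                           # 1/3

# Bayes' rule: P(white | B) = P(B | white) * P(white) / P(B)
p_b_given_white = buckets['B']['white'] / sum(b['white'] for b in buckets.values())
print(p_white_given_b, p_b_given_white * p_white / p_b)         # both print 0.3333...
```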
Algorithm principle
Let’s continue with the previous example:
- If p1(x,y) > p2(x,y), the category is 1
- If p2(x,y) > p1(x,y), the category is 2
What we actually need here are the conditional probabilities p(c1|x,y) and p(c2|x,y), i.e. the probability of each class given the observed point (x,y). In practice these are hard to estimate directly, so we apply Bayes' rule:

p(ci|x,y) = p(x,y|ci) · p(ci) / p(x,y)
So the decision rule becomes:

- If p(c1|x,y) > p(c2|x,y), the point belongs to class c1
- If p(c2|x,y) > p(c1|x,y), the point belongs to class c2

Note that the denominator p(x,y) is the same on both sides, so in practice we only need to compare the numerators p(x,y|ci) · p(ci).
Naive Bayes text classification
Problem description and data
Take the message board of an online community as an example: posts are labelled as insulting (1) or normal (0). This data set is also one we make up ourselves.
```python
def loadDataSet():
    """Return the toy posts and their class labels (1 = abusive, 0 = normal)."""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postingList, classVec
```
Word vector construction
As we all know, a computer cannot recognize text directly, so we need to convert the text into numbers. How do we convert it? Take these two sentences as an example:
- I love China
- I don’t like apples
First the text is segmented into words, and a list without duplicates, the vocabulary, is built: "I, love, China, don't like, apples". Each sentence is then vectorized against this vocabulary: a word that appears is assigned the value 1, and a word that does not appear gets 0. The two sentences are thus converted into the following two word vectors:
- [1, 1, 1, 0, 0]
- [1, 0, 0, 1, 1]
The following code builds the word vector:
```python
def createVocabList(dataSet):
    """Build the vocabulary: all unique words across all documents."""
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with the words of this document
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Convert a document into a 0/1 vector over the vocabulary."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:   # original had `if word in inputSet`, which is a bug
            returnVec[vocabList.index(word)] = 1
    return returnVec
```
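To see the vectorization in action, here is a quick usage sketch (my addition, not in the original article) on the toy posts. Note that the vocabulary order depends on Python's set iteration, so the exact positions of the 1s will vary between runs:

```python
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print(len(myVocabList))                              # number of unique words in the six posts
print(setOfWords2Vec(myVocabList, listOPosts[0]))    # 0/1 vector for the first post
```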
Training algorithm
Since a document has many features, we write its word vector as w and apply Bayes' rule in vector form:

p(ci|w) = p(w|ci) · p(ci) / p(w)

Under the naive independence assumption, p(w|ci) = p(w1|ci) · p(w2|ci) · ... · p(wn|ci), so the class-conditional probability of the whole vector is just the product of the per-word probabilities.
The code is as follows:
```python
from numpy import *

def trainNB0(trainMatrix, trainCategory):
    """Estimate p(w|c0), p(w|c1) and p(c1) from the training word vectors."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # p(c1)
    # initialize counts to 1 and denominators to 2 (Laplace smoothing), so a word
    # never seen in a class does not zero out the whole product
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs to avoid numerical underflow when many small probabilities are multiplied
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
```
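As a quick sanity check (a usage sketch I've added, not from the original text), training on the six toy posts should give p(c1) = 0.5, since three of the six posts are labelled abusive:

```python
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setOfWords2Vec(myVocabList, post) for post in listOPosts]
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
print(pAb)          # 0.5
print(p1V.shape)    # one log-probability per vocabulary word
```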
Testing the algorithm
Finally, the algorithm is tested:
```python
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log p(w|c1) + log p(c1): the (log of the) Bayes numerator for class 1
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    # log p(w|c0) + log p(c0): the Bayes numerator for class 0
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
```
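Running testingNB() on the toy data should classify the first test entry as normal and the second as abusive, since 'love' and 'dalmation' appear only in non-abusive posts while 'stupid' and 'garbage' appear only in abusive ones. The expected output looks roughly like this:

```python
testingNB()
# ['love', 'my', 'dalmation'] classified as:  0
# ['stupid', 'garbage'] classified as:  1
```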
Advantages and disadvantages of the algorithm
- Advantages: works even with relatively little training data and can handle multi-class problems
- Disadvantages: sensitive to how the input data is prepared