Machine Learning Algorithms (II) : Naive Bayes


1. Laboratory introduction

1.1 Experimental Environment

1. Python 3.7
2. numpy >= 1.16.4
3. sklearn >= 0.23.1

1.2 Introduction to Naive Bayes

Naive Bayes (NB) is one of the most widely used classification algorithms. It is a classifier based on Bayes' theorem together with the assumption of conditional independence between features. Because the method rests on the calculation of Bayes' formula, it has a solid mathematical foundation and stable classification performance. An NB model needs few estimated parameters, is not sensitive to missing data, and the algorithm is relatively simple. Early spam filtering was built on naive Bayes classifiers.

What is conditional probability? Let's think about it in terms of drawing a ball. We have two buckets: a gray bucket and a green bucket, holding 7 balls in total, 4 blue and 3 purple. The gray bucket contains 4 balls (2 blue, 2 purple) and the green bucket contains 3 balls (2 blue, 1 purple).

What is the probability that a ball drawn at random from these 7 is purple? The selection process is as follows:

  1. To choose the bucket
  2. Then choose a ball from the selected bucket

$$P(\text{ball}=\text{purple}) = P(\text{gray bucket}) \cdot P(\text{purple} \mid \text{gray bucket}) + P(\text{green bucket}) \cdot P(\text{purple} \mid \text{green bucket}) = \frac{1}{2}\cdot\frac{2}{4} + \frac{1}{2}\cdot\frac{1}{3}$$

The ball-drawing process above is a conditional-probability process: the probability of drawing purple depends on which bucket was chosen. Bayes' rule gives us another way to compute conditional probabilities.
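
As a quick check, the calculation above can be reproduced in a few lines of Python (a minimal sketch; the bucket contents follow the description above):

# Total probability of drawing a purple ball via the two-step process above
p_gray, p_green = 1/2, 1/2                 # probability of choosing each bucket
p_purple_given_gray = 2/4                  # gray bucket: 2 purple out of 4 balls
p_purple_given_green = 1/3                 # green bucket: 1 purple out of 3 balls

p_purple = p_gray * p_purple_given_gray + p_green * p_purple_given_green
print(p_purple)                            # 0.4166... = 1/2 * 2/4 + 1/2 * 1/3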

Bayes' formula is named after the British mathematician Thomas Bayes:

$$P(A \mid B) = \frac{P(A,B)}{P(B)} = \frac{P(B \mid A)\cdot P(A)}{\sum_{a \in \mathcal{F}_A} P(B \mid a)\cdot P(a)}$$

$P(A,B)$: the probability that events A and B occur simultaneously.

$P(B)$: the probability that event B occurs; $P(A)$: the probability that event A occurs, called the prior probability.

$P(A \mid B)$: the probability that event A occurs given that event B has occurred, called the posterior probability.

$P(B \mid A)$: the probability that event B occurs given that event A has occurred.
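
With these definitions, Bayes' rule also lets us turn the ball example around: given that the drawn ball is purple, how likely is it that it came from the gray bucket? A small sketch using the numbers from the example above:

# Posterior probability that a purple ball came from the gray bucket
p_gray, p_green = 1/2, 1/2
p_purple_given_gray, p_purple_given_green = 2/4, 1/3

# Denominator of Bayes' formula: the total probability of drawing purple
p_purple = p_gray * p_purple_given_gray + p_green * p_purple_given_green

p_gray_given_purple = p_purple_given_gray * p_gray / p_purple
print(p_gray_given_purple)   # 0.6: a purple ball more likely came from the gray bucket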

We can capture the Bayesian idea in one sentence: many things in the world are related; consider two events A and B. People often use an event that has already happened to infer the probability of what they actually want to know. For example, when a doctor makes a diagnosis, he or she judges what is wrong with the patient from the coating on the tongue, the heartbeat and so on. The patient cares about what the illness is, while the doctor reasons from what has already been observed. This is the Bayesian idea: A is the symptom that has already occurred, and we compute the probability of each candidate cause $B_i$ given that A has occurred.

1.3 Applications of Naive Bayes

The naive Bayes algorithm assumes that all features are independent of one another and equally important. Thanks to its simplicity and good interpretability, and compared with other carefully designed and more complex classification algorithms, naive Bayes is one of the classifiers with the best trade-off between learning efficiency and classification performance. It is commonly used for text classification, spam filtering, credit evaluation, phishing-website detection and so on.

2. Laboratory manual

2.1 Learning Objectives

  1. Master Bayes’ formula
  2. Study Bayesian parameter estimation through two examples
  3. Master Bayesian estimation

2.2 Code Flow

  • Part 1. Iris data set — Bayesian classification

    • Step1: library function import
    • Step2: Data import & analysis
    • Step3: Model training
    • Step4: Model prediction
    • Step5: Brief analysis of principle
  • Part 2. Simulating discrete Data sets — Bayesian classification

    • Step1: library function import
    • Step2: Data import & analysis
    • Step3: Model training & visualization
    • Step4: Brief analysis of principle

2.3 Algorithm in Practice

Iris data set — Bayesian classification

Step1: library function import

import warnings
warnings.filterwarnings('ignore')
import numpy as np
# Load the iris data set
from sklearn import datasets
# Import the Gaussian naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

Step2: Data import & analysis

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We need to compute two probabilities: the conditional probability $P(X^{(i)}=x^{(i)} \mid Y=c_k)$ and the prior probability of class $c_k$: $P(Y=c_k)$.

Analysis shows that the training data are numerical. We assume that each feature follows a Gaussian distribution, so we choose Gaussian naive Bayes for the classification.
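
A quick inspection of the training data confirms that the features are continuous numerical values (a small sketch; the output shapes assume the 80/20 split above):

# Shapes, dtype and class counts of the training split
print(X_train.shape, X_train.dtype)             # (120, 4) float64
print(np.unique(y_train, return_counts=True))   # the three iris classes and their counts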

Step3: Model training

# Use Gaussian naive Bayes for calculation
clf = GaussianNB(var_smoothing=1e-8)
clf.fit(X_train, y_train)
GaussianNB(var_smoothing=1e-08)

Step4: Model prediction

# evaluation
y_pred = clf.predict(X_test)
acc = np.sum(y_test == y_pred) / X_test.shape[0]
print("Test Acc : %.3f" % acc)

# prediction
y_proba = clf.predict_proba(X_test[:1])
print(clf.predict(X_test[:1]))
print("Predicted probability value :", y_proba)

Test Acc : 0.967
[2]
Predicted probability value : [[1.63542393e-232 2.18880483e-006 9.99997811e-001]]

Step5: Brief analysis of principle

Gaussian naive Bayes assumes that every feature follows a Gaussian distribution, i.e. the distribution of a random variable X with mathematical expectation μ and variance σ² is a Gaussian (normal) distribution. For each feature and each class, we estimate μ with the sample mean and σ² with the sample variance.

$$P(X^{(i)}=x^{(i)} \mid Y=c_k) = \frac{1}{\sqrt{2\pi\sigma^2_{c_k}}} \exp\left(-\frac{(x^{(i)} - \mu_{c_k})^2}{2\sigma^2_{c_k}}\right)$$

From the prediction results in the example above, we can see that class 2 has the largest posterior probability, so class 2 is taken as the predicted result.
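
The per-class means and variances that the model learned can be compared with values estimated by hand from the training data. A sketch, assuming a scikit-learn version that exposes the fitted parameters as theta_ (per-class means) and var_ (per-class variances; older releases name it sigma_):

# Compare hand-computed per-class statistics with the fitted GaussianNB parameters
for k, c in enumerate(clf.classes_):
    X_c = X_train[y_train == c]
    print("class", c)
    print("  mean by hand   :", X_c.mean(axis=0))
    print("  GaussianNB mean:", clf.theta_[k])
    print("  var by hand    :", X_c.var(axis=0))
    print("  GaussianNB var :", clf.var_[k])   # slightly larger due to var_smoothing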

Simulation of discrete Data sets — Bayesian classification

Step1: library function import + Step2: data import & analysis + Step3: model training & visualization + Step4: principle analysis

import numpy as np
# Use naive Bayes for categorical features
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

Step2: Data import & analysis

# Simulated data
rng = np.random.RandomState(1)
# Generate 600 samples with 100 features; each feature is an integer in [0, 4]
X = rng.randint(5, size=(600, 100))
y = np.array([1, 2, 3, 4, 5, 6] * 100)
data = np.c_[X, y]
# Shuffle X and y together (shuffles the rows of data in place)
rng.shuffle(data)
X = data[:, :-1]
y = data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

All data features are discrete features. We introduce a naive Bayes classifier based on discrete features.

Step3: Model training & prediction

clf = CategoricalNB(alpha=1)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Test Acc : %.3f" % acc)
Test Acc : 0.683

# Test on a random sample and analyze the prediction: Bayes selects the class with the highest probability
# For example, the prediction here is class 4, which has the highest probability; since the data are random,
# readers may get different results when they run this
x = rng.randint(5, size=(1, 100))
print(clf.predict_proba(x))
print(clf.predict(x))
[[3.48859652e-04 4.34747491e-04 2.23077189e-03 9.90226387e-01
  5.98248900e-03 7.76745425e-04]]
[4]

2.4 Brief analysis of principle

2.4.1 Result analysis

Looking at the result on the test sample, we can see that Bayes selects the prediction with the highest probability: here the predicted class is 4, which has the highest posterior probability. Since the data are randomly generated, readers may get different results when they run the code.

The test accuracy here is not meaningful, because the data are randomly generated and need not match the model's assumptions; the example is only meant to show how to use the API.

What does the parameter alpha=1 mean?

We know that the Bayesian method requires computing two probabilities: the conditional probability $P(X^{(i)}=x^{(i)} \mid Y=c_k)$ and the prior probability of class $c_k$: $P(Y=c_k)$.

For discrete features:

$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)+\alpha}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\alpha}$$

We can see that α is added to the numerator and $S_j\alpha$ to the denominator. When α = 0 this reduces to the maximum likelihood estimate. With maximum likelihood estimation, if a feature value never appears in the training data, its estimated probability is 0, which would force the whole product to 0; Bayesian estimation (smoothing with α > 0) avoids this problem.

Among them:

$S_j$: the number of distinct values that the $j$-th feature can take.

$a_{jl}$: the $l$-th possible value of the $j$-th feature.

$x_i^{(j)}$: the $j$-th feature of the $i$-th sample.

$y_i$: the label of the $i$-th sample.
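
To see the effect of α on the counts directly, here is a minimal sketch of the smoothed estimate for a single feature (the toy arrays and the helper name smoothed_cond_prob are made up purely for illustration):

import numpy as np

def smoothed_cond_prob(feature_column, labels, value, c_k, alpha=1.0):
    # P(X^(j) = value | Y = c_k) with smoothing parameter alpha
    mask = labels == c_k
    count_value = np.sum(feature_column[mask] == value)   # count of (value, c_k) pairs
    count_class = np.sum(mask)                            # count of samples in class c_k
    S_j = len(np.unique(feature_column))                  # number of possible feature values
    return (count_value + alpha) / (count_class + S_j * alpha)

# Toy feature with 3 possible values; value 2 never occurs together with class 0
feat = np.array([0, 1, 1, 2, 0, 1])
lab  = np.array([0, 0, 0, 1, 1, 1])
print(smoothed_cond_prob(feat, lab, value=2, c_k=0, alpha=0))   # 0.0  -> maximum likelihood estimate
print(smoothed_cond_prob(feat, lab, value=2, c_k=0, alpha=1))   # 1/6  -> smoothing avoids the zero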

2.4.2 Naive Bayes algorithm

Naive Bayes method = Bayes’ theorem + characteristic condition independence.

The input space $\mathcal{X} \subseteq \mathbb{R}^n$ is a set of $n$-dimensional vectors and the output space is $\mathcal{Y}=\{c_1,c_2,\dots,c_K\}$. $X$ and $Y$ are random variables on the corresponding spaces, and $P(X,Y)$ is their joint probability distribution. The training data set, drawn independently and identically from $P(X,Y)$, is $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$.

To classify a test sample $x$, we compute $P(Y=c_k \mid X=x)$ for each $k$ in turn and take the class with the highest probability as the prediction for $x$.

$P(Y=c_k \mid X=x)$ is read as the probability that $Y=c_k$ given $X=x$; this is a conditional probability. The idea is simple: for each $x$ we compute the probability of every class $c_k$, $k \in \{1,2,\dots,K\}$, and select the class with the maximum probability as the class of $x$.

The prediction formula is obtained by rearranging Bayes' formula:

$$P(Y=c_k \mid X=x)=\frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{\sum_{k}P(X=x \mid Y=c_k)\,P(Y=c_k)}$$

We therefore only need to compute two kinds of probabilities, and since naive Bayes assumes the features are conditionally independent, the conditional probability of each feature can be computed separately: the conditional probability $P(X^{(i)}=x^{(i)} \mid Y=c_k)$ and the class prior $P(Y=c_k)$.

Among them:

$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$

Naive Bayes makes a strong assumption, called the conditional independence assumption (this is where the word naive comes from). When there are several conditioning features, the formula is $P(A,B \mid Y) = P(A \mid Y)\cdot P(B \mid Y)$. This assumption is the foundation of naive Bayes: given the class, the features are conditionally independent of each other, so A does not affect B and B does not affect A. Here, assume $X=[x_1, x_2, \dots, x_n]$.

$$P(X=x \mid Y=c_k) = P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k) = \prod_{i=1}^{n} P(x_i \mid Y=c_k)$$

Thus, the original formula can be equivalent to:

$$P(Y=c_k \mid X=x)=\frac{\prod_{i=1}^{n} P(x_i \mid Y=c_k)\,P(Y=c_k)}{\sum_{k}\prod_{i=1}^{n} P(x_i \mid Y=c_k)\,P(Y=c_k)}$$

To select the result with the highest posterior probability we only need to compare the probabilities, and since the denominator is the same for every class, we can drop it and obtain the final decision rule:

$$y=\arg\max_{c_k} P(Y=c_k)\prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
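
For intuition, the decision rule above can be implemented from scratch for discrete features with plain frequency counts. This is only a minimal sketch (no smoothing; the function names nb_fit and nb_predict are my own), conceptually similar to what CategoricalNB does before smoothing is added:

import numpy as np

def nb_fit(X, y):
    # Estimate class priors P(Y=c_k) and per-feature conditionals P(X^(j)=v | Y=c_k) by counting
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    conds = {c: [{v: np.mean(X[y == c][:, j] == v) for v in np.unique(X[:, j])}
                 for j in range(X.shape[1])]
             for c in classes}
    return classes, priors, conds

def nb_predict(x, classes, priors, conds):
    # Return argmax_k P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)
    scores = {c: priors[c] * np.prod([conds[c][j].get(x[j], 0.0) for j in range(len(x))])
              for c in classes}
    return max(scores, key=scores.get)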

Let’s look at an example to better understand the Bayesian process of predicting whether a person will go out based on the weather and whether it’s a weekend.

index   X_1: weather quality   X_2: weekend?   Y: go out?
1       good                   yes             go out
2       good                   no              go out
3       good                   yes             don't go out
4       good                   no              go out
5       bad                    yes             go out
6       bad                    no              don't go out

According to the above data, in order to better understand the calculation process, we give several calculation formulas:

A. The probability that the weather is bad ($X_1$ = bad) given that the person goes out:

$$p(X_1=\text{bad} \mid Y=\text{go out}) = \frac{p(X_1=\text{bad},\ Y=\text{go out})}{p(Y=\text{go out})} = \frac{1}{4}$$

B. The probability of going out:

$$p(Y=\text{go out})=\frac{4}{6}$$

C. The probability that the weather is bad:

$$p(X_1=\text{bad})=\frac{2}{6}$$

D. The probability of going out given that the weather is bad:

$$p(Y=\text{go out} \mid X_1=\text{bad}) = \frac{p(X_1=\text{bad} \mid Y=\text{go out}) \cdot p(Y=\text{go out})}{p(X_1=\text{bad})} = \frac{\frac{1}{4}\cdot\frac{4}{6}}{\frac{2}{6}} = \frac{1}{2}$$

E. The probability of not going out given that the weather is bad:

$$p(Y=\text{don't go out} \mid X_1=\text{bad}) = 1 - p(Y=\text{go out} \mid X_1=\text{bad}) = 1 - \frac{1}{2} = \frac{1}{2}$$
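
These numbers can be reproduced by counting over the six rows of the table (a small sketch; the two arrays simply encode the weather column and the go-out column of the table above):

import numpy as np

weather = np.array(["good", "good", "good", "good", "bad", "bad"])   # X_1
go_out  = np.array(["yes",  "yes",  "no",   "yes",  "yes",  "no"])   # Y (yes = goes out)

p_out           = np.mean(go_out == "yes")                        # B: 4/6
p_bad           = np.mean(weather == "bad")                       # C: 2/6
p_bad_given_out = np.mean(weather[go_out == "yes"] == "bad")      # A: 1/4
p_out_given_bad = p_bad_given_out * p_out / p_bad                 # D: 1/2
print(p_out, p_bad, p_bad_given_out, p_out_given_bad, 1 - p_out_given_bad)   # E: 1 - 1/2 = 1/2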

2.4.3 Advantages and disadvantages of Naive Bayes

Advantages: The naive Bayes algorithm is derived directly from the classical Bayes formula, so it has a solid mathematical foundation. It also performs well with small amounts of data and supports incremental computation on large data sets. Because naive Bayes combines prior probabilities to estimate posterior probabilities, the model has good interpretability.

Disadvantages: In theory, the naive Bayes model has the lowest error rate compared with other classification methods. In practice this is not always the case, because the model assumes that the attributes are mutually independent given the class, an assumption that often does not hold in real applications. When the number of attributes is large or the attributes are strongly correlated, the classification performance degrades; naive Bayes works best when the attribute correlations are small. For this reason there are semi-naive Bayes and similar algorithms that improve on it by modeling partial correlations, for example by assuming that each attribute depends on at most one other attribute so that the computation does not become too complex. To deal with correlated features, we can also apply dimensionality reduction (e.g., PCA) to remove the correlations before running naive Bayes.
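
As a sketch of that last suggestion, correlated features can be decorrelated with PCA before applying Gaussian naive Bayes, for example with a scikit-learn pipeline (shown here on the iris data purely as an illustration; the choice of two components is arbitrary):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# PCA rotates the features onto uncorrelated components, then GaussianNB classifies them
model = make_pipeline(PCA(n_components=2), GaussianNB())
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validated accuracy: %.3f" % scores.mean())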