- Naive Bayes Classification With Sklearn
- Original article by Martin Muller
- The Nuggets translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: sisibeloved
- Proofreader: Rockyzhengwu
Gaussian distribution with bean machine
This tutorial details the naive Bayes classifier: how the algorithm works, its advantages and disadvantages, and an example using the Sklearn library.
Background
Take the famous Titanic data set, which collects personal information about the passengers of the Titanic and whether they survived the shipwreck. Let’s try to predict a passenger’s survival from the price of their ticket.
The 500 passengers on the Titanic
Suppose you pick 500 passengers at random. Of those samples, 30 percent survived. The average fare for surviving passengers was $100, while the average fare for those who died was $50. Now, suppose you have a new passenger. You don’t know if he survived, but you know he bought a $30 ticket to cross the Atlantic. Please predict whether the passenger survived.
Principle
Well, you would probably answer that the passenger did not survive. Why? Because, based on the information in the random subset of passengers above, the overall chance of survival is already low (30%), and the chance of survival for someone who paid a low fare is even lower. You would place this passenger in the most likely group (the low-fare group). That is exactly what a naive Bayes classifier does.
Analysis
Naive Bayes classifiers use conditional probability to aggregate information and assume relative independence of features. What does that mean? This means, for example, that we must assume that the comfort of the Titanic’s rooms had nothing to do with ticket prices. This assumption is clearly wrong, which is why we call it Naive. The naive assumption simplifies calculations, even on very large data sets. Let’s find out.
The naive Bayes classifier is essentially a function that describes the probability that a sample belongs to a category given its features, written P(Survival | f_1, …, f_n). We use Bayes’ theorem to simplify the calculation:
P(Survival | f_1, …, f_n) = P(Survival) · P(f_1, …, f_n | Survival) / P(f_1, …, f_n)
Formula 1: Bayes’ theorem
P(Survival) is easy to compute, and the denominator P(f_1, …, f_n) is the same for both classes, so the problem comes down to computing P(f_1, …, f_n | Survival). We apply the conditional probability formula to simplify the calculation once again:
P(f_1, …, f_n | Survival) = P(f_1 | Survival) · P(f_2 | Survival, f_1) · … · P(f_n | Survival, f_1, …, f_{n-1})
Formula 2: Preliminary expansion
Each factor in the last line of the expansion above requires a data set that covers all of its conditioning variables. To compute the probability of f_n conditioned on {Survival, f_1, …, f_{n-1}} (i.e., P(f_n | Survival, f_1, …, f_{n-1})), we need enough samples with different values of f_n that satisfy the condition {Survival, f_1, …, f_{n-1}}. This requires a huge amount of data and leads to the curse of dimensionality. This is where the naive assumption pays off: by assuming that the features are independent, we can treat the probability conditioned on {Survival, f_1, …, f_{n-1}} as equal to the probability conditioned on {Survival} alone, which greatly simplifies the calculation:
P(f_1, …, f_n | Survival) = P(f_1 | Survival) · P(f_2 | Survival) · … · P(f_n | Survival)
Formula 3: The naive assumption applied
Finally, to classify a feature vector, we only need to choose the value of Survival (1 or 0) for which P(Survival) · P(f_1, …, f_n | Survival) is the highest; that value is the final classification result:
classifier(f_1, …, f_n) = argmax_{Survival ∈ {0, 1}} P(Survival) · P(f_1 | Survival) · … · P(f_n | Survival)
Formula 4: Argmax classifier
Note: a common mistake is to take the probabilities output by the classifier at face value. Naive Bayes is known to be a bad estimator, so don’t take these output probabilities too seriously.
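To make Formula 4 concrete, here is a minimal generic sketch of the argmax decision rule in Python (the function name, the dictionaries and the callable likelihoods are illustrative choices, not part of the original article):

# Generic argmax decision rule (Formula 4). The per-feature likelihoods are
# assumed to be estimated elsewhere; all names in this sketch are illustrative.
def naive_bayes_predict(features, priors, likelihoods):
    """
    features:    dict mapping feature name -> observed value
    priors:      dict mapping class label  -> P(class)
    likelihoods: dict mapping class label  -> {feature name: callable P(value | class)}
    """
    best_label, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior
        for name, value in features.items():
            score *= likelihoods[label][name](value)  # naive independence assumption
        if score > best_score:
            best_label, best_score = label, score
    return best_label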
Find the right distribution function
The last step is to implement the classifier. How do we model the probability function P(f_i | Survival)? The Sklearn library provides three models, listed below (a minimal usage sketch follows the list):
- Gaussian distribution: assumes that features are continuous and normally distributed.
Normal distribution
- Multinomial distribution: suitable for discrete features.
- Bernoulli distribution: suitable for binary features.
Binomial distribution
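As a minimal usage sketch (X, y and X_new stand for a feature matrix, a label vector and new samples, and are not defined here), the three variants are created and used the same way:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gaussian_clf = GaussianNB()        # continuous features, assumed normally distributed
multinomial_clf = MultinomialNB()  # discrete features (e.g. counts)
bernoulli_clf = BernoulliNB()      # binary features
# Typical usage: gaussian_clf.fit(X, y) followed by gaussian_clf.predict(X_new)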
Python code
Next, we implement a classic Gaussian naive Bayes classifier on the Titanic data set. We will use the passenger class, sex, age, number of siblings/spouses aboard, number of parents/children aboard, fare, and port of embarkation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
# import data set
data = pd.read_csv("data/train.csv")
# Convert categorical variables to numbers
data["Sex_cleaned"]=np.where(data["Sex"] = ="male".0.1)
data["Embarked_cleaned"]=np.where(data["Embarked"] = ="S".0,
np.where(data["Embarked"] = ="C".1,
np.where(data["Embarked"] = ="Q".2.3)))# Clear data set of non-numeric values (NaN)
data=data[[
"Survived"."Pclass"."Sex_cleaned"."Age"."SibSp"."Parch"."Fare"."Embarked_cleaned"
]].dropna(axis=0, how='any')
# Split the data set into training set and test set
X_train, X_test = train_test_split(data, test_size=0.5, random_state=int(time.time()))
# instantiate the classifier
gnb = GaussianNB()
used_features = [
    "Pclass", "Sex_cleaned", "Age", "SibSp", "Parch", "Fare", "Embarked_cleaned"
]
# Train the classifier
gnb.fit(
X_train[used_features].values,
X_train["Survived"]
)
y_pred = gnb.predict(X_test[used_features])
# Print the result
print("Number of mislabeled points out of a total {} points: {}, performance {:05.2f}%"
.format(
X_test.shape[0],
(X_test["Survived"] != y_pred).sum(),
100* (1-(X_test["Survived"] != y_pred).sum()/X_test.shape[0)))Copy the code
Number of mislabeled points out of a total 357 points: 68, performance 80.95%
The classifier was correct 80.95% of the time.
Using a single feature
Let’s try to restrict the classifier to the fare information only. Here we compute the prior probabilities P(Survival = 1) and P(Survival = 0):
mean_survival=np.mean(X_train["Survived"])
mean_not_survival=1-mean_survival
print("Survival prob = {:03.2f}%, Not survival prob = {:03.2f}%"
.format(100*mean_survival,100*mean_not_survival))
Survival prob = 39.50%, Not survival prob = 60.50%
Then, according to Formula 3, we only need the probability distribution functions P(Fare | Survival = 0) and P(Fare | Survival = 1). We are using the Gaussian naive Bayes classifier, so we must assume that the data follows a Gaussian distribution.
P(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²))
Formula 5: Gaussian probability density (σ: standard deviation / μ: mean)
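As a small sketch, Formula 5 can be written directly with NumPy; the helper name gaussian_pdf is our own (it is not part of the original article) and is reused in the sketches further down:

# Gaussian probability density function (Formula 5)
def gaussian_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))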
Then, we need to compute the mean and standard deviation of the fares for each value of Survival. We obtain the following results:
mean_fare_survived = np.mean(X_train[X_train["Survived"] == 1]["Fare"])
std_fare_survived = np.std(X_train[X_train["Survived"] == 1]["Fare"])
mean_fare_not_survived = np.mean(X_train[X_train["Survived"] == 0]["Fare"])
std_fare_not_survived = np.std(X_train[X_train["Survived"] == 0]["Fare"])
print("mean_fare_survived = {:03.2f}".format(mean_fare_survived))
print("std_fare_survived = {:03.2f}".format(std_fare_survived))
print("mean_fare_not_survived = {:03.2f}".format(mean_fare_not_survived))
print("std_fare_not_survived = {:03.2f}".format(std_fare_not_survived))
mean_fare_survived = 54.75
std_fare_survived = 66.91
mean_fare_not_survived = 24.61
std_fare_not_survived = 36.29
Let’s look at the histograms of fares for survivors and non-survivors, along with the fitted distributions:
Figure 1: Histograms and fitted Gaussian distributions of fares for survivors and non-survivors (scales do not match)
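A rough sketch of how such a figure could be reproduced with matplotlib, reusing gaussian_pdf and the statistics computed above (the bin count, plot range and labels are our own choices):

fares_survived = X_train[X_train["Survived"] == 1]["Fare"]
fares_not_survived = X_train[X_train["Survived"] == 0]["Fare"]
x = np.linspace(0, 300, 500)

plt.hist(fares_not_survived, bins=30, density=True, alpha=0.5, label="Not survived")
plt.hist(fares_survived, bins=30, density=True, alpha=0.5, label="Survived")
plt.plot(x, gaussian_pdf(x, mean_fare_not_survived, std_fare_not_survived), label="Gaussian (not survived)")
plt.plot(x, gaussian_pdf(x, mean_fare_survived, std_fare_survived), label="Gaussian (survived)")
plt.xlabel("Fare")
plt.ylabel("Density")
plt.legend()
plt.show()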
We can see that the distributions do not fit the data set very well. Before implementing a model, it is best to check that the feature distributions follow one of the three models described above. If a continuous feature is not normally distributed, it should be made so with a transformation, or handled with a different method. Here, for the sake of illustration, we will treat the distributions as normal. Applying Bayes’ theorem from Formula 1, we obtain the following classifier:
Figure 2: Gaussian classifier
If the fare is about 78 or higher, P(Fare | Survival = 1) · P(Survival = 1) exceeds P(Fare | Survival = 0) · P(Survival = 0), and we classify the passenger as a survivor; otherwise, we classify them as a non-survivor. We obtain a classifier that is correct 64.15% of the time.
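As a sketch, this by-hand classifier can be written as follows, reusing gaussian_pdf, mean_survival and the fare statistics computed above (the function name classify_fare is our own):

# Single-feature Gaussian classifier built by hand (sketch).
# Reuses gaussian_pdf, mean_survival and the fare statistics computed above.
def classify_fare(fare):
    p_survived = mean_survival * gaussian_pdf(fare, mean_fare_survived, std_fare_survived)
    p_not_survived = (1 - mean_survival) * gaussian_pdf(fare, mean_fare_not_survived, std_fare_not_survived)
    return 1 if p_survived >= p_not_survived else 0

y_pred_manual = X_test["Fare"].apply(classify_fare)
print("Manual classifier performance: {:05.2f}%"
      .format(100 * (y_pred_manual == X_test["Survived"]).mean()))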
If we train the Sklearn Gaussian Naive Bayes classifier on the same data set, we will get exactly the same result:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
used_features =["Fare"]
y_pred = gnb.fit(X_train[used_features].values, X_train["Survived"]).predict(X_test[used_features])
print("Number of mislabeled points out of a total {} points: {}, performance {:05.2f}%"
.format(
X_test.shape[0],
(X_test["Survived"] != y_pred).sum(),
100* (1-(X_test["Survived"] != y_pred).sum()/X_test.shape[0])
))
print("Std Fare not_survived {: 05.2 f}".format(np.sqrt(gnb.sigma_)[0] [0]))
print("Std Fare survived: {: 05.2 f}".format(np.sqrt(gnb.sigma_)[1] [0]))
print("Mean Fare not_survived {: 05.2 f}".format(gnb.theta_[0] [0]))
print("Mean Fare survived: {: 05.2 f}".format(gnb.theta_[1] [0]))
Number of mislabeled points out of a total 357 points: 128, performance 64.15%
Std Fare not_survived 36.29
Std Fare survived: 66.91
Mean Fare not_survived 24.61
Mean Fare survived: 54.75
Advantages and disadvantages of naive Bayes classifier
Advantages:
- Fast to compute
- Simple to implement
- Performs well on small data sets
- Performs well on high dimensional data
- Performs well even when the naive independence assumption is not fully satisfied. In many cases, a rough approximation is enough to build a good classifier.
Disadvantages:
- Correlated features need to be removed, because they are effectively counted twice in the model, which overestimates their importance.
- If a category of a categorical variable appears in the test set but not in the training set, the model assigns it zero probability and cannot make a prediction. This is often referred to as the “zero frequency” problem, and it can be solved with smoothing techniques. One of the simplest is Laplace smoothing, which Sklearn applies by default when training a naive Bayes classifier (see the sketch below).
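For reference, in Sklearn’s discrete naive Bayes variants the smoothing strength is exposed through the alpha parameter; a minimal sketch:

from sklearn.naive_bayes import MultinomialNB

# alpha is the additive smoothing parameter; alpha=1.0 (the default) is Laplace smoothing
clf = MultinomialNB(alpha=1.0)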
Conclusion
Thank you so much for reading this article. I hope it helps you understand the concept of naive Bayes classifier and its advantages.
Thanks to Antoine Toubhans, Flavian Hautbois, Adil Baaj and Raphael Meudec.
If you find any mistakes in the translation or other areas that could be improved, you are welcome to revise it and open a PR in the Nuggets Translation Project, for which you can earn reward points. The permanent link at the beginning of this article is the MarkDown link to this article on GitHub.
The Nuggets Translation Project is a community that translates quality technical articles from the Internet and shares them on Nuggets. It covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence and other fields. For more high-quality translations, please follow the Nuggets Translation Project and its official Weibo and Zhihu column.