1. Why is Naive Bayes "naive"?
The naive Bayes algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence between features. It is called "naive" because it simplifies the problem by assuming that all features are independent of each other. For example, a fruit may be considered an apple if it is red, round and about 3 inches in diameter. Even if these features depend on each other or on other features, naive Bayes treats each of them as contributing independently to the probability that the fruit is an apple, which is where the name "naive" comes from. The algorithm is easy to build and very useful for large data sets, and it is often used for text classification with multiple categories.
It is based on Bayes' rule, shown below; Bayesian estimation in general rests on conditional probability.
The probability that event c occurs is P(c), i.e. the prior probability; the probability that event x occurs is P(x); the probability of event x given that c has occurred is P(x|c); the probability of event c given that x has occurred is P(c|x); and P(x|c) * P(c) = P(c, x), the probability that c and x occur at the same time.
Then, according to Bayes' rule, the probability of event c given event x, i.e. the posterior probability, is P(c|x) = P(c, x) / P(x) = P(x|c) * P(c) / P(x).
The naive Bayes algorithm targets the multivariate classification problem: with features x1, x2, ..., xn assumed to be conditionally independent given class c, the likelihood P(x|c) can be computed as P(x|c) = P(x1|c) * P(x2|c) * ... * P(xn|c).
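As a concrete illustration of this factorization, the small sketch below computes the posterior for the apple example by hand; all of the probability values are hypothetical numbers chosen only to show the arithmetic.

# Hypothetical class priors and per-feature conditional probabilities
# (the numbers are made up purely to illustrate the naive Bayes arithmetic).
priors = {'apple': 0.3, 'other': 0.7}
likelihoods = {
    'apple': {'red': 0.7, 'round': 0.9, 'about_3_inches': 0.6},
    'other': {'red': 0.2, 'round': 0.4, 'about_3_inches': 0.3},
}

observed = ['red', 'round', 'about_3_inches']

# P(c|x) is proportional to P(c) * P(x1|c) * P(x2|c) * ... * P(xn|c)
scores = {}
for c in priors:
    score = priors[c]
    for feature in observed:
        score *= likelihoods[c][feature]
    scores[c] = score

# Normalize so that the posteriors sum to 1
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors)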
2. Advantages and disadvantages of Naive Bayes and application scenarios
Advantages: (1) Prediction is simple, fast and efficient, especially in multi-class classification tasks. (2) When the assumption that features are independent of each other holds, its predictive performance can exceed that of logistic regression and similar algorithms, and it lends itself to incremental training: when the amount of data exceeds memory, the model can be trained batch by batch (a minimal sketch of this is given after the disadvantages below). (3) It performs well with categorical input variables compared with numerical ones; for numerical variables, a normal (Gaussian) distribution is assumed.
Disadvantages: (1) The independence assumption of naive Bayes is often hard to satisfy in practice, and classification quality degrades when the number of attributes is large or the attributes are strongly correlated. (2) The prior probability must be known, and in most cases it depends on assumptions; the assumed prior model can take many forms, so in some cases prediction quality suffers because of a poorly chosen prior. (3) It is sensitive to the representation of the input data.
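Regarding advantage (2), here is a minimal sketch of batch-wise incremental training using scikit-learn's partial_fit; the data is randomly generated purely for illustration.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
model = GaussianNB()
classes = np.array([0, 1])

# Train batch by batch, e.g. when the full data set does not fit in memory
for batch in range(5):
    X_batch = rng.randn(100, 3)                # 100 samples, 3 features
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy labels for illustration
    # classes must be supplied on the first call to partial_fit
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.randn(3, 3)))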
Application Scenarios:
(1) Real-time prediction: the naive Bayes algorithm is simple and fast, so it can be used for real-time prediction.
(2) Multi-category prediction: it is suitable for tasks whose target variable has multiple classes, and we can predict the probability of each class.
(3) Text classification / spam filtering / sentiment analysis: naive Bayes classifiers are mainly used for text classification (thanks to good results on multi-class problems and the independence assumption) and achieve a higher success rate than many other algorithms. As a result, they are widely used for spam filtering (identifying spam email) and sentiment analysis (identifying positive and negative customer sentiment in social media analysis); a minimal spam-filtering sketch is given after this list.
(4) Recommendation systems: a naive Bayes classifier combined with collaborative filtering can build a recommendation system, which uses machine learning and data mining techniques to filter unseen information and predict whether a user will like a given resource. A simple example is product recommendation on Taobao.
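As a rough sketch of the spam-filtering scenario in (3), the snippet below trains a multinomial naive Bayes classifier on word counts; the tiny corpus and its labels are invented purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A toy corpus: 1 = spam, 0 = normal mail (labels are invented for illustration)
texts = [
    "win a free prize now",
    "limited offer win money",
    "meeting scheduled for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Turn each document into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Multinomial naive Bayes works directly on these count features
clf = MultinomialNB()
clf.fit(X, labels)

new_mail = ["free money offer", "report for the monday meeting"]
print(clf.predict(vectorizer.transform(new_mail)))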
3. Simple Python implementation of naive Bayes
Python's scikit-learn library provides naive Bayes models of three types:
(1) Gaussian: used for classification; it assumes that the features follow a normal distribution.
(2) Multinomial: used for discrete counts. For example, suppose we have a text classification problem. Instead of the Bernoulli view of "whether a word appears in the document", we can go one step further and count "how often a word appears in the document", which you can think of as "the number of times outcome x_i is observed over n trials".
(3) Bernoulli: the binomial model is useful if your feature vectors are binary (0/1). For example, in text classification, "1" and "0" mean "the word appears in the document" and "the word does not appear in the document" respectively. A small sketch contrasting the multinomial and Bernoulli models follows this list.
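To make the difference concrete, the sketch below feeds the same toy document-term counts to both models; MultinomialNB uses the word frequencies directly, while BernoulliNB (with binarize=0.0) only looks at whether each word appears. The data is invented purely for illustration.

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Toy document-term counts: rows are documents, columns are word counts
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 3, 2],
])
y = np.array([0, 0, 1, 1])

test = np.array([[1, 0, 2]])

# MultinomialNB models the word frequencies directly
print(MultinomialNB().fit(counts, y).predict(test))

# BernoulliNB binarizes the counts and models presence/absence only
print(BernoulliNB(binarize=0.0).fit(counts, y).predict(test))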
The following takes the Gaussian model as an example, and its Python code is:
# Import the Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
import numpy as np
import matplotlib.pyplot as plt

# Assume a two-feature variable x with corresponding labels Y
x = np.array([[3, 7], [1, 5], [1, 2], [2, 0], [2, 3], [4, 0],
              [1, 1], [1, 1], [2, 2], [2, 7], [4, 1], [-2, 7]])
Y = np.array([3, 3, 4, 4, 4, 3, 3, 4, 3, 3, 3, 4])

# Indices of the samples in each class
id3 = np.where(Y == 3)[0]
id4 = np.where(Y == 4)[0]

# Scatter plot of the two features, coloured by class
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x[id3, 0], x[id3, 1], s=50, c='b', marker='o', label='Y=3')
ax.scatter(x[id4, 0], x[id4, 1], s=50, c='r', marker='x', label='Y=4')
ax.legend()
ax.set_xlabel('X[:,0]')
ax.set_ylabel('X[:,1]')
plt.show()
As can be seen from the resulting figure, the scatter plot of the two-feature variable x we constructed, split by Y category, shows that the Y=3 and Y=4 classes have a certain degree of linear separability.
Next, the Gaussian naive Bayes model is trained on this data set, class predictions are made for the two test points [1,2] and [3,7], and the results are compared with the predictions of logistic regression.
# Create a Gaussian naive Bayes classifier
model = GaussianNB()
# Train the model using the training set
model.fit(x, Y)
# Predict the classes of the two test points
predicted = model.predict([[1, 2], [3, 7]])
print(predicted)

# Compare with logistic regression on the same data
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x, Y)
predictions = classifier.predict([[1, 2], [3, 7]])
print(predictions)
In the output, the naive Bayes result is array([4, 3]), i.e. the predicted Y values for the feature vectors [1,2] and [3,7] are 4 and 3 respectively. The logistic regression result is array([4, 4]), i.e. both feature vectors are assigned Y=4.
Since the data was constructed by us, we cannot rigorously judge the accuracy of the predictions on the test set. However, if we place the feature vectors [1,2] and [3,7] into the earlier classification scatter plot of x, it can be seen preliminarily that the naive Bayes predictions are relatively accurate.
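To compare how confident the two models are on these points, one could additionally inspect their predicted class probabilities; this short sketch assumes the model and classifier objects from the code above are still in scope.

# Posterior probabilities for the two test points, columns ordered as classes_
print(model.classes_, model.predict_proba([[1, 2], [3, 7]]))
print(classifier.classes_, classifier.predict_proba([[1, 2], [3, 7]]))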