Frequency and Bayes

According to the frequentist school, although the parameter $\theta$ of the distribution from which the sample is drawn is unknown, it is a fixed value, and an estimate $\hat{\theta}$ can be computed from the sample. Bayesians instead regard the parameter $\theta$ as a random variable rather than a fixed value. Before the sample is generated, a distribution $\pi(\theta)$ is specified based on experience or other knowledge; this is called the prior distribution. The sample is then used to adjust and revise it, and the resulting distribution $\pi(\theta \mid x_1, x_2, x_3, \ldots)$ is called the posterior distribution.
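
In symbols (a sketch, writing $p(x_1, \ldots, x_n \mid \theta)$ for the likelihood of the observed sample), the posterior is obtained from the prior via Bayes' theorem:

$$\pi(\theta \mid x_1, \ldots, x_n) = \frac{p(x_1, \ldots, x_n \mid \theta)\,\pi(\theta)}{\int p(x_1, \ldots, x_n \mid \theta')\,\pi(\theta')\,d\theta'} \;\propto\; p(x_1, \ldots, x_n \mid \theta)\,\pi(\theta)$$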

Derivation of Bayes’ formula
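
A minimal sketch of the derivation, starting from the definition of conditional probability for two events $A$ and $B$ with $P(A) > 0$ and $P(B) > 0$:

$$P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}$$

Solving both for the joint probability $P(A, B)$ and equating the two expressions gives Bayes' formula:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In the classification setting below, $A$ plays the role of the class $Y$ and $B$ the role of the attribute vector $X$.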

Why naive Bayes

Assume the attributes of the training data are represented by an n-dimensional random vector $X$ and the classification result by a random variable $Y$. The statistical behaviour of $X$ and $Y$ is then described by their joint probability distribution $P(X, Y)$, and each concrete sample $(x_i, y_i)$ is generated independently and identically distributed from $P(X, Y)$. This joint distribution is the starting point of the Bayesian classifier. By the definition of conditional probability,

$$P(X, Y) = P(Y)\,P(X \mid Y) = P(X)\,P(Y \mid X)$$

$P(Y)$ is the probability with which each category occurs, i.e. the prior probability; $P(X \mid Y)$ is the probability of the different attribute values given the category, i.e. the likelihood. The prior is very easy to compute: just count how many samples fall into each category. The likelihood, however, is hard to estimate because it is affected by the number of attributes. For example, if each sample has 100 attributes and each attribute can take 100 values, then for each classification result the joint likelihood ranges over $100^{100}$ possible attribute combinations (even with only two such attributes there would already be $100^2 = 10\,000$ of them), which is far too many to estimate. This is why naive Bayes was introduced.
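
To make the explosion concrete, here is a small sketch (the attribute and value counts are just the illustrative numbers from the paragraph above) comparing how many class-conditional probabilities have to be estimated with and without the conditional independence assumption introduced in the next section:

```python
# Illustrative parameter count for the class-conditional distribution P(X | Y = c).
n_attributes = 100  # attributes per sample (the example above)
n_values = 100      # possible values per attribute

# Without any assumption: one probability per joint attribute configuration.
joint_configurations = n_values ** n_attributes       # 100^100

# Under conditional independence: one small table per attribute.
independent_parameters = n_attributes * n_values      # 100 * 100 = 10,000

print(f"joint configurations per class: {joint_configurations:.3e}")
print(f"parameters per class under independence: {independent_parameters}")
```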

What is naive Bayes

Naive Bayes is, as the word "naive" suggests, a simpler form of Bayes. It assumes that the different attributes of a sample satisfy the conditional independence assumption and then applies Bayes' theorem to perform the classification task. For a given sample $x$ to be classified, the posterior probability of each category is computed, and the category with the largest posterior probability is taken as the category to which $x$ belongs. To solve the problem that the likelihood is hard to estimate, the conditional independence assumption is required: given the class, all attributes are independent of one another, none of them influences the others, and each attribute acts on the classification result independently. The class-conditional probability then factors into a product of per-attribute conditional probabilities:

$$P(X = x \mid Y = c) = P(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, \ldots, X^{(n)} = x^{(n)} \mid Y = c) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c)$$

This is the naive Bayes method. From the training set we can easily estimate the prior probability $P(Y)$ and the likelihood $P(X \mid Y)$, and from these we obtain the posterior probability $P(Y \mid X)$.
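
A minimal sketch of a categorical naive Bayes classifier following the factorization above (the class and function names are my own, not from any particular library; Laplace smoothing is added so that unseen attribute values do not zero out the product):

```python
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes for discrete attributes: P(y|x) is proportional to P(y) * prod_j P(x_j | y)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        n = len(y)
        # Prior P(Y = c): relative frequency of each class in the training set.
        self.class_counts_ = Counter(y)
        self.priors_ = {c: self.class_counts_[c] / n for c in self.classes_}
        # cond_[c][j][v] = number of class-c samples whose j-th attribute equals v.
        self.cond_ = {c: defaultdict(Counter) for c in self.classes_}
        self.values_ = defaultdict(set)  # distinct values seen per attribute
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.cond_[yi][j][v] += 1
                self.values_[j].add(v)
        return self

    def predict(self, x):
        best_class, best_score = None, -1.0
        for c in self.classes_:
            score = self.priors_[c]
            for j, v in enumerate(x):
                num = self.cond_[c][j][v] + self.alpha
                den = self.class_counts_[c] + self.alpha * len(self.values_[j])
                score *= num / den  # P(X_j = v | Y = c) with Laplace smoothing
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Tiny made-up example (not the watermelon data): attributes are (color, texture).
X = [("green", "clear"), ("green", "blurry"), ("black", "clear"), ("pale", "blurry")]
y = ["good", "bad", "good", "bad"]
model = CategoricalNaiveBayes().fit(X, y)
print(model.predict(("green", "clear")))  # expected: "good"
```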

Example – Page 151 of the Watermelon Book

First, we have the watermelon data set 3.0.

Now the question: is the following test sample a good melon or a bad one?

First, we work out the prior probabilities.
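
As a sketch, assuming the counts given in the book for watermelon data set 3.0 (17 training samples, of which 8 are good melons and 9 are bad ones), the priors are just the class frequencies:

$$P(\text{good}) = \frac{8}{17} \approx 0.471, \qquad P(\text{bad}) = \frac{9}{17} \approx 0.529$$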

Then we work out the conditional probabilities.
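
These estimates are ordinary frequency ratios; for the two continuous attributes (density and sugar content) the book uses a Gaussian density instead. A sketch of both forms, where $D_c$ denotes the training samples of class $c$ and $D_{c,x_j}$ the subset of $D_c$ whose $j$-th attribute takes the value $x_j$:

$$P(x_j \mid c) = \frac{|D_{c,x_j}|}{|D_c|}, \qquad p(x_j \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,j}} \exp\!\left(-\frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2}\right)$$

with $\mu_{c,j}$ and $\sigma_{c,j}$ the mean and standard deviation of the $j$-th attribute among the class-$c$ samples.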

Then we calculate the (unnormalized) posterior probability of a good melon and of a bad melon.
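
A small sketch of this last step (the numbers below are placeholders, not the book's actual figures): multiply the prior by the per-attribute conditionals for each class and keep the larger product.

```python
import math

# Hypothetical factors per class: the prior followed by the per-attribute
# conditional probabilities (placeholder values, for illustration only).
factors = {
    "good": [0.471, 0.5, 0.6, 0.7],
    "bad":  [0.529, 0.3, 0.2, 0.1],
}

scores = {c: math.prod(v) for c, v in factors.items()}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)  # the class with the larger product wins
```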

0.063 is significantly larger, so the test sample is classified as a good melon.