preface

This time we’re going to look at Naive Bayes. This article is about 1.6K words and is expected to read for 10 minutes.

The profile

Naive Bayes algorithm is a kind of “classification algorithm” suitable for binary and multi-classification classification problems. In the framework of Bayesian probability, it can be deduced that “the minimization of expected risk is equivalent to the maximization of posterior probability”. For the calculation of posterior probability, the posterior probability can be obtained through “joint probability distribution modeling” (” generation model “**);

For the generation model, according to ** “Bayes’ theorem” **, it can be written as:

In Naive Bayes, as the conditional probability is difficult to calculate, a strong assumption is proposed: “feature independence hypothesis”, that is, each feature is independent and distributed without interaction, which is also the origin of the name of Naive Bayes. The estimation of prior probability and conditional probability is not expanded. If you are interested, you can refer to Statistical Learning Methods. Through transformation, you can finally maximize a posteriori probability (MAP), i.e. ** “product of prior probability and quasi-conditional probability” ** :

[Note] The derivation that expected risk minimization is equivalent to posteriori probability maximization can be referred to Machine Learning, and statistical Learning Methods can be referred to for detailed derivation.

Algorithm for the interview

In algorithmic interviews, questions related to designing naive Bayes include:

  1. Why is Naive Bayes so naive?
  2. Naive Bayes basic principle and prediction process;
  3. A little bit about Bayes’ theorem;
  4. How to classify garbage using Naive Bayes?

Today’s question is:

Algorithm implementation of naive Bayes.

For naive Bayes, this is a test of both our algorithm principle and programming ability. I started with the establishment of the whole naive Bayes algorithm model class, which is mainly divided into:

  1. Determine the type of naive Bayes (Gaussian naive Bayes or Bernoulli naive Bayes, etc.);
  2. The key of model fitting lies in what the model preserves.
  3. Calculation of posterior probability;
  4. Output of maximum posteriori probability;

1. Model type

For a class of conditional probability parameters estimation, we using the maximum likelihood estimation method, first of all, the most important thing is that * * “what distribution of random variables (features) are assumed to be”, for different hypothesis, also correspond to different naive bayes, such as Bernoulli naive bayes, gaussian naive bayes, multinomial distribution naive bayes. The most common is by assuming a “Gaussian distribution”, in model fitting, “only the mean and standard deviation of the training data (each feature of all samples of each class) needs to be estimated” **.

Therefore, we can get a preliminary framework:

Class Gaussian_NaiveBayes: def init(self): def mean(self, X): #return sum(X) / float(len(X)) def stdev(self, X):return math.sqrt(sum([pow(x-avg, 2for x in X]) / float(len(X)- 1))
Copy the code

2. Model fitting

By understanding the naive Bayes principle, we know that learning the joint probability model requires maximum likelihood method to estimate prior probability (assuming Bernoulli distribution is obeyed) and conditional probability-like parameters. For Gaussian Naive Bayes, we need to save the whole training data set:

  1. The number corresponding to each class (which can calculate prior probabilities) and the total number of samples (which can be an attribute of the model);
  2. The mean and standard deviation of each feature of each class (class conditional probability can be calculated);

To sum up, we can save in the form of ** “dictionary” ** :

Therefore:

 def summarize(self, train_data):
   summaries = [(self.mean(column), self.stdev(column), len(column)) for column in zip(*train_data)]
  return summaries

 def fit(self, X, y):
   self.total_rows = len(X)
   labels = list(set(y))
   data = {label: [] for label in labels}
   for f, label in zip(X, y):
     data[label].append(f)
   self.model = {label: self.summarize(value) for label, value in data.items()}
Copy the code

3. Calculation of posterior probability

For the posterior probability calculation of each category, the prior probability and quasi-conditional probability need to be obtained. Firstly, for the quasi-conditional probability, according to Gauss formula,

 def gaussian_probabality(self, x, mean, stdev):
   exponent = math.exp(-math.pow(x - mean, 2)/(2 * math.pow(stdev, 2)))
   return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
Copy the code

Then iteratively calculate the posterior probability of all classes:

 def calculate_probabalities(self, input_data):
      probalalities = {}
      for label, summaries in self.model.items():
        probalalities[label] = summaries[0] [2] / self.total_rows # priorifor i in range(len(summaries)): Typecar Mean, stdev, _ = typecar [I] probalalities[label] *= sell.gaussian_probabality (# prior * conditional probability input_data[I], mean, stdev)return probalalities
Copy the code

4. Maximum posterior probability

The calculation of the maximum posteriori probability is to sort and get the label value corresponding to the maximum probability. For the test set, all samples are iterated for calculation.

 def predict(self, X_test):
  labels = []
  for i in X_test:
   label = sorted(
    self.calculate_probabalities(i).items(),
    key=lambda x: x[- 1[])- 1] [0]
   labels.append(label)
  return np.array(labels)
Copy the code

conclusion

The above is the summary of “naive Bayes algorithm implementation”. The most important thing is to need a perfect class framework to be able to combine the principles of code reproduction. As for the other questions, they are relatively simple and will not be discussed in detail here.

Highlights of past For beginners entry route of artificial intelligence and data download machine learning and deep learning notes such as printing machine learning online manual deep learning notes album "statistical learning method" code retrieval based album album download AI based machine learning mathematics This knowledge planet "Huang Bo circle of machine learning" (92416895) Site QQ group704220115. To join the wechat group, scan the code:Copy the code