1. Bayesian knowledge mapping

2. Some easy-to-understand examples

This article briefly introduces the application of Bayes' theorem with four examples.

3. A serious look: Bayesian classifiers

3.1 Bayes’ theorem

Suppose a data set has $N$ possible class labels, with a label denoted $c$, and the feature values corresponding to each label denoted $\mathbf{x}$ (a vector). Bayes' theorem then gives the following formula:

$$P(c \mid \mathbf{x}) = \frac{P(c)\,P(\mathbf{x} \mid c)}{P(\mathbf{x})} \tag{3.1}$$
By applying Formula 3.1 to each label, the probability corresponding to each label can be calculated, and the label with the highest calculated probability is the predicted value.
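As a minimal sketch of this "compute the posterior for every label and take the largest" step, here is a Python snippet with made-up priors and likelihoods for the disease example (none of these numbers come from the data sets used later; the evidence factor $P(\mathbf{x})$ is dropped because it is the same for every label):

```python
# Minimal sketch of Formula 3.1 with made-up priors and likelihoods.
# P(x) is identical for every label, so it can be ignored when comparing.
priors = {'cold': 0.5, 'allergy': 0.3, 'concussion': 0.2}          # P(c)
likelihoods = {'cold': 0.09, 'allergy': 0.20, 'concussion': 0.01}  # P(x|c)

posteriors = {c: priors[c] * likelihoods[c] for c in priors}        # proportional to P(c|x)
predicted = max(posteriors, key=posteriors.get)
print(predicted)  # 'allergy'
```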

3.2 Bayesian decision theory

In classification, the goal of Bayesian decision theory is to select the optimal class label under the given conditions. So what is the criterion for "optimal"?

It must be the price of getting the category wrong!

Suppose a sample with feature vector $\mathbf{x}$ is labelled as class $c_i$, while its true class is $c_j$. The loss caused by mistaking $c_j$ for $c_i$ is denoted $\lambda_{ij}$. Then, for any sample in the sample space, given the feature $\mathbf{x}$, the expected loss of labelling it as $c_i$ (the conditional risk) is denoted $R(c_i \mid \mathbf{x})$:

$$R(c_i \mid \mathbf{x}) = \sum_{j=1}^{N} \lambda_{ij}\, P(c_j \mid \mathbf{x}) \tag{3.2}$$
Our task is to find a method that minimizes Formula 3.2. This method $h$ maps features to class labels, and its total risk is denoted $R(h)$; over the whole sample space, the expected loss of this function is:

$$R(h) = \mathbb{E}_{\mathbf{x}}\big[R(h(\mathbf{x}) \mid \mathbf{x})\big]$$
Obviously, when the conditional risk $R(h(\mathbf{x}) \mid \mathbf{x})$ is minimized for each sample, the total risk $R(h)$ is also minimized. So choosing, for each feature vector, the class label that minimizes its conditional risk achieves the optimum. This method is expressed as:

$$h^{*}(\mathbf{x}) = \arg\min_{c \in \mathcal{Y}} R(c \mid \mathbf{x})$$

Specifically, if the goal is to minimize the classification error rate, the misclassification loss $\lambda_{ij}$ can be written as:

$$\lambda_{ij} = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{otherwise} \end{cases}$$

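This step is not spelled out above, but it follows directly from substituting the 0-1 loss into Formula 3.2: the conditional risk becomes one minus the posterior, so minimizing risk is exactly the "pick the label with the highest probability" rule from Section 3.1:

$$R(c_i \mid \mathbf{x}) = \sum_{j \ne i} P(c_j \mid \mathbf{x}) = 1 - P(c_i \mid \mathbf{x}) \quad\Longrightarrow\quad h^{*}(\mathbf{x}) = \arg\max_{c \in \mathcal{Y}} P(c \mid \mathbf{x})$$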
3.3 Bayesian calculation method

Our goal is to obtain the posterior probability $P(c \mid \mathbf{x})$. There are two strategies (a minimal sketch follows the list):

  1. Discriminative models: given $\mathbf{x}$, model $P(c \mid \mathbf{x})$ directly to obtain the prediction.
  2. Generative models: first model the joint probability $P(\mathbf{x}, c)$, then compute $P(c \mid \mathbf{x})$ from it.
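As a minimal illustration (assuming scikit-learn is available; it is not used elsewhere in this article), LogisticRegression is a discriminative model that fits $P(c \mid \mathbf{x})$ directly, while GaussianNB is a generative model that estimates $P(c)$ and $P(\mathbf{x} \mid c)$ and combines them via Bayes' theorem:

```python
# Minimal sketch contrasting the two strategies (assumes scikit-learn is installed).
import numpy as np
from sklearn.linear_model import LogisticRegression  # discriminative: models P(c|x) directly
from sklearn.naive_bayes import GaussianNB           # generative: models P(c) and P(x|c)

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])

for model in (LogisticRegression(), GaussianNB()):
    model.fit(X, y)
    print(type(model).__name__, model.predict_proba([[1.2, 1.9]]))
```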

3.4 What does this have to do with maximum likelihood estimation?

3.4.1 Maximum likelihood estimation

The article mentioned above introduces the core idea of Bayes with four examples, all of which are small enough that you can count the probabilities on your fingers, without ever needing the concept of maximum likelihood estimation. So what is a likelihood function?

In statistics, the working assumption is that, given enough data, each feature follows a distribution that can be described by a mathematical expression with unknown parameters. Finding that expression is parameter estimation, and maximum likelihood estimation is one way of choosing the best parameters.

Therefore, the training process of a probabilistic model is a process of parameter estimation. Let $D_c$ denote the set of class-$c$ samples in the training set $D$. Assuming these samples are independent and identically distributed, the likelihood of the parameter $\theta_c$ with respect to the data set $D_c$ is:

$$P(D_c \mid \theta_c) = \prod_{\mathbf{x} \in D_c} P(\mathbf{x} \mid \theta_c) \tag{3.8}$$

3.4.2 Logarithmic likelihood

The product in Formula 3.8 easily underflows, so the log-likelihood is used instead:

$$LL(\theta_c) = \log P(D_c \mid \theta_c) = \sum_{\mathbf{x} \in D_c} \log P(\mathbf{x} \mid \theta_c)$$
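A minimal sketch of why the logarithm matters, using made-up Gaussian data and NumPy (not part of this article's code): the product from Formula 3.8 underflows to zero, while the log-likelihood stays usable.

```python
# Made-up Gaussian data: the product of many small densities underflows to 0,
# while the sum of their logs is a perfectly usable log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=2000)

mu, sigma = samples.mean(), samples.std()   # MLE of a Gaussian's parameters
densities = np.exp(-(samples - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

print(np.prod(densities))         # 0.0 -- the product of Formula 3.8 underflows
print(np.sum(np.log(densities)))  # a finite log-likelihood (roughly -4200 here)
```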
3.5 Soul-searching questions

3.5.1 Why assume “attribute conditional independence”?

It seems simple enough, but there are two problems:

  1. When the number of features is large, the number of feature combinations grows explosively. Take patient classification as an example, with 2 features: Symptom has two values (sneeze, headache) and Occupation has four values (nurse, farmer, construction worker, teacher). That already gives 2 × 4 = 8 feature combinations (sneeze + nurse, sneeze + farmer, sneeze + construction worker, ...), and with realistic feature counts these numbers become intolerable.
  2. Because there are so many feature combinations, some combinations may never appear in the actual data set at all. For example, what disease does (headache + nurse) correspond to? (Note that this cannot simply be written as 0, because **“not observed”** and **“probability of occurrence is 0”** are two different things.)

Therefore, naive Bayes adds a general premise: assume that all attributes are conditionally independent of each other given the class. On this basis, Formula 3.1 can be written as follows (a small worked instance follows the list):

$$P(c \mid \mathbf{x}) = \frac{P(c)\,P(\mathbf{x} \mid c)}{P(\mathbf{x})} = \frac{P(c)}{P(\mathbf{x})} \prod_{i=1}^{d} P(x_i \mid c)$$

  • Here $d$ is the number of features and $x_i$ is the value of $\mathbf{x}$ on the $i$-th attribute.

  • $P(c)$ is called the class prior probability: $P(c) = \dfrac{|D_c|}{|D|}$.

  • $P(\mathbf{x})$ is the evidence factor, i.e. the product of the feature probabilities; it is the same for every class label (e.g. $P(\text{sneeze})\,P(\text{nurse})$).

  • $P(x_i \mid c)$ is the class-conditional probability (likelihood) of sample $\mathbf{x}$ on the $i$-th attribute given class label $c$.

    • For discrete data, let $D_{c,x_i}$ denote the set of class-$c$ samples whose value on the $i$-th attribute is $x_i$; then $P(x_i \mid c) = \dfrac{|D_{c,x_i}|}{|D_c|}$ (e.g. $P(\text{sneeze} \mid \text{cold})$).

    • For continuous data, assume $p(x_i \mid c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^{2})$, where $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ are respectively the mean and variance of the class-$c$ samples on the $i$-th attribute:

$$p(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)$$
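A quick worked instance of the decomposition, with made-up probabilities for the disease example (the evidence factor $P(\mathbf{x})$ is dropped since it is the same for every class):

$$P(\text{cold} \mid \text{sneeze}, \text{nurse}) \;\propto\; P(\text{cold})\,P(\text{sneeze} \mid \text{cold})\,P(\text{nurse} \mid \text{cold}) = 0.5 \times 0.6 \times 0.1 = 0.03$$

Computing the same product for allergy and concussion and comparing the three numbers gives the prediction.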

3.5.2 What if an unseen feature value appears?

When estimating probability values, “smoothing” is needed; the commonly used method is the “Laplace correction”, with the formula:

$$\hat{P}(c) = \frac{|D_c| + 1}{|D| + N}$$

where $N$ is the number of possible classes in the training set (3 in the disease example: cold, allergy, concussion).

$$\hat{P}(x_i \mid c) = \frac{|D_{c,x_i}| + 1}{|D_c| + N_i}$$

where $N_i$ is the number of possible values of the $i$-th attribute (in the disease classification, the Occupation attribute has $N_i = 4$: nurse, construction worker, teacher, farmer).
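A quick worked instance with made-up counts (not from either data set used below): suppose the training set has 12 patients, 5 of them labelled cold, and none of those 5 is a nurse. Then:

$$\hat{P}(\text{cold}) = \frac{5 + 1}{12 + 3} = 0.4, \qquad \hat{P}(\text{nurse} \mid \text{cold}) = \frac{0 + 1}{5 + 4} \approx 0.11$$

so the unseen combination no longer forces the whole product in Formula 3.1 to zero.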


4. Python code implementation

This uses data from Statistical Learning Methods and the Watermelon Book

After writing the code, I found two issues in Zhou Zhihua's watermelon book:

  1. A probability calculation error.

  2. Continuous-valued probabilities use the standard deviation rather than the variance.


The watermelon book data comes from Table 4.3 on page 84; the leading index column was removed before use in the code.

To get the data

# Imports used by all of the snippets below
import math

import pandas as pd


def getData():
    '''Get the data. :return: the data set, feature info, label column name, and the target sample to classify'''
    # Data set from Statistical Learning Methods (Li Hang)
    dataset = pd.DataFrame({
        'x1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
        'x2': ['S', 'M', 'M', 'S', 'S', 'S', 'M', 'M', 'L', 'L', 'L', 'M', 'M', 'L', 'L'],
        'Y': [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]})
    # Some features are continuous ('series'); those use the probability density function
    features_info = {
        'x1': 'dispersed',
        'x2': 'dispersed'
    }
    label_names = 'Y'

    target = {
        'x1': 2,
        'x2': 'S'
    }

    # Data set from the watermelon book; this block overrides the one above
    dataset = pd.read_csv('./data/WaterMelonDataset3.csv')
    # dataset = dataset[1:]
    features_info = {
        'color': 'dispersed', 'roots': 'dispersed', 'knock sound': 'dispersed',
        'grain': 'dispersed', 'time': 'dispersed', 'touch': 'dispersed',
        'density': 'series', 'Sugar content': 'series',
    }
    label_names = 'good melon'
    target = {
        'color': 'green', 'roots': '蜷缩', 'knock sound': 'cloud ring',
        'grain': 'clear', 'time': 'depression', 'touch': 'hard sliding',
        'density': 0.697, 'Sugar content': 0.460,
    }
    return dataset, features_info, label_names, target

Compute the probability density of continuous attributes

def calNormalDistribution(x_value, var_value, mean_value):
    '''Probability density for continuous attributes. :param x_value: target feature value
    :param var_value: variance of class-c samples on the i-th attribute :param mean_value: mean of class-c samples on the i-th attribute :return: probability density'''
    # var_value is a variance, so the normalizing constant is sqrt(2 * pi * variance)
    return math.exp(-(x_value - mean_value) ** 2 / (2 * var_value)) / math.sqrt(2 * math.pi * var_value)

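A quick sanity check of this density, using made-up numbers and the variance-based version above: at the mean of a Gaussian with variance 0.1, the density should be $1/\sqrt{2\pi \times 0.1} \approx 1.26$.

```python
# Sanity check with made-up numbers: the density of a Gaussian with variance 0.1,
# evaluated at its own mean, should be 1 / sqrt(2 * pi * 0.1) ≈ 1.26.
# Assumes the calNormalDistribution defined above is in scope.
import math

print(calNormalDistribution(x_value=0.5, var_value=0.1, mean_value=0.5))  # ≈ 1.2616
```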

class NativeBayesModel(object):
    def __init__(
            self,
            dataset: pd.DataFrame,
            features_info: dict,
            label_names: str,
    ):
        '''Get the training data. :param features_info: column names of the data's feature values :param label_names: column name of the data's label values'''
        self.features_info = features_info
        self.label_names = label_names

        # Class prior probability
        self.prior_prob = {}
        # Evidence factor
        self.evidence_prob = {}
        # Class-conditional probability
        self.class_conditional_prob = {}
        # Initialize the evidence factor
        for ifeature in features_info:
            self.evidence_prob[ifeature] = {}

        # Collect statistics for the dataset's features and labels
        self.features_stat, self.label_stat = self.getStatistic(dataset)

    def getStatistic(self, dataset: pd.DataFrame):
        '''Collect statistics for the features and labels, stored in features_stat and label_stat. :param dataset: training data :return: statistics of the feature values and of the label values'''
        # Column names of the data's feature values
        features_name = [ifeature for ifeature in self.features_info.keys()]
        features = dataset[features_name]
        # Column name of the data's label values
        labels = dataset[self.label_names]

        # Convert the label statistics into dictionary form
        label_stat = dict(labels.value_counts())

        features_stat = {}
        # Convert the feature statistics into dictionary form, feature by feature
        for ifeature in self.features_info.keys():
            features_stat[ifeature] = dict(features[ifeature].value_counts())
        return features_stat, label_stat

    def getPriorProb(self, dataset_nums: int, regular=False):
        '''Calculate the prior (class) probabilities. :param dataset_nums: number of training samples :param regular: whether the Laplace correction is needed :return:'''
        # if you don't use Laplace correction
        if regular is False:
            for iclass, counts in self.label_stat.items():
                self.prior_prob[iclass] = counts / dataset_nums
        else:
            for iclass, counts in self.label_stat.items():
                self.prior_prob[iclass] = (counts+1) / (dataset_nums+len(self.label_stat))


    def getEvidenceProb(self, dataset_nums: int):
        '''Calculate the evidence factors. :param dataset_nums: number of training samples :return:'''
        for ifeature in self.features_info.keys():
            for ifeature_name, counts in self.features_stat[ifeature].items():
                self.evidence_prob[ifeature][ifeature_name] = counts / dataset_nums

    def getConditionData(self, dataset: pd.DataFrame):
        '''Split the data set by class (label value). :param dataset: the training data :return: dict of sub-datasets keyed by class label'''
        new_dataset = {}
        for iclass in self.label_stat:
            # Class conditional probability initialization
            self.class_conditional_prob[iclass] = {}
            # Divide the data set by class
            new_dataset[iclass] = dataset[dataset[self.label_names] == iclass]
        return new_dataset

    def getClassConditionalProb(self, dataset, target, iclass, regular=False):
        '''Calculate the class-conditional probability P(feature_i = ifeature | class = iclass). :param dataset: sub-dataset containing only class-iclass samples :param target: the sample to classify, as {feature_name: feature_value} :param iclass: the class label :param regular: whether the Laplace correction is needed'''
        for target_feature_name, target_feature in target.items():
            # Initialize the class-conditional probability, stored as "class -> feature column name -> feature value"
            if target_feature_name not in self.class_conditional_prob[iclass]:
                self.class_conditional_prob[iclass][target_feature_name] = {}

            if target_feature not in self.class_conditional_prob[iclass][target_feature_name]:
                self.class_conditional_prob[iclass][target_feature_name][target_feature] = {}

            # Determine whether the feature is continuous or discrete
            if self.features_info[target_feature_name] == 'dispersed':
                # Filter the data set
                condition_dataset = dataset[dataset[target_feature_name] == target_feature]
                # If the Laplace correction is not used
                if regular is False:
                    prob = condition_dataset.shape[0] / dataset.shape[0]
                else:
                    prob = (condition_dataset.shape[0]+1) / (dataset.shape[0]+len(self.features_stat[target_feature_name]))
            # if this is continuous
            else:
                x_value = target_feature
                var_value = dataset[target_feature_name].var()
                mean_value = dataset[target_feature_name].mean()
                prob = calNormalDistribution(x_value, var_value, mean_value)
            self.class_conditional_prob[iclass][target_feature_name][target_feature] = prob

    def getPredictClass(self, target):
        # Compute each class's posterior (up to the evidence factor) and pick the largest
        max_prob = 0
        predict_class = None
        for iclass in self.label_stat:
            prob = self.prior_prob[iclass]
            for target_feature_name, target_feature in target.items():
                prob *= self.class_conditional_prob[iclass][target_feature_name][target_feature]
            print('label', iclass, '\'s probability is:', prob)
            if prob > max_prob:
                predict_class = iclass
                max_prob = prob
        return predict_class

A function call

if __name__ == '__main__':
    import pandas as pd
    import math
    # Whether a Laplace correction is needed
    regular_state = False
    dataset, features_info, label_names, target = getData()
    dataset_nums = dataset.shape[0]

    nb = NativeBayesModel(dataset, features_info, label_names)


    # Calculate prior probabilities
    nb.getPriorProb(dataset_nums, regular=regular_state)
    # Calculate the evidence factor
    nb.getEvidenceProb(dataset_nums)
    # Divide the dataset by class tag into multiple datasets containing only one class tag
    subDataset = nb.getConditionData(dataset)
    # Calculate the conditional probability of each type of tag in turn
    for iclass, subdata in subDataset.items():
        nb.getClassConditionalProb(subdata, target, iclass, regular=regular_state)

    predict_class = nb.getPredictClass(target)
    print('predict label is :', predict_class)
    print('==============prior prob===================')
    print(nb.prior_prob)
    print('==============ClassConditionalProb===================')
    print(nb.class_conditional_prob)



A further example

How to write a spell checker

References

  • Statistical Learning Methods – Li Hang
  • Machine Learning – Zhou Zhihua
  • Bayes: an easy-to-understand derivation