1. Bayesian knowledge mapping
2. Some easy to understand examples
This article briefly introduces the application of Bayes' theorem with four examples.
3. Taking a serious look: Bayesian classifiers
3.1 Bayes’ theorem
Suppose a data set has $N$ possible class labels, denoted $c_1, c_2, \dots, c_N$, and each sample is described by a feature vector $\boldsymbol{x}$. Bayes' theorem then gives:

$$P(c \mid \boldsymbol{x}) = \frac{P(c)\,P(\boldsymbol{x} \mid c)}{P(\boldsymbol{x})} \tag{3.1}$$
By applying Formula (3.1) to each label, we obtain a posterior probability for every label; the label with the highest posterior probability is the prediction.
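As a minimal sketch of this prediction step (using the cold / allergy / concussion labels from the earlier examples, with made-up priors and likelihoods, and plain Python rather than the article's later implementation):

```python
# Minimal sketch: compute a posterior for each label with Bayes' theorem and pick the largest.
# The priors and likelihoods are made-up numbers for illustration only.
priors = {'cold': 0.5, 'allergy': 0.3, 'concussion': 0.2}          # P(c)
likelihoods = {'cold': 0.04, 'allergy': 0.10, 'concussion': 0.01}  # P(x | c) for one observed x

evidence = sum(priors[c] * likelihoods[c] for c in priors)          # P(x)
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)  # 'allergy' has the largest posterior here
```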
3.2 Bayesian decision theory
In classification, the goal of Bayesian decision theory is to select the optimal class label given the observed features. So what is the criterion for "optimal"?
It must be the cost of getting the class wrong!
Suppose there are $N$ possible class labels $\mathcal{Y} = \{c_1, c_2, \dots, c_N\}$, and let $\lambda_{ij}$ denote the loss incurred by labelling a sample as $c_i$ when its true class is $c_j$. Then, for a sample with feature vector $\boldsymbol{x}$, the expected loss of labelling it as $c_i$, i.e. the conditional risk, is:

$$R(c_i \mid \boldsymbol{x}) = \sum_{j=1}^{N} \lambda_{ij}\, P(c_j \mid \boldsymbol{x}) \tag{3.2}$$

Our task is to find a decision rule $h: \mathcal{X} \mapsto \mathcal{Y}$, mapping features to class labels, that minimizes Formula (3.2). Over the whole sample space, the total risk (expected loss) of such a rule is:

$$R(h) = \mathbb{E}_{\boldsymbol{x}}\big[R\big(h(\boldsymbol{x}) \mid \boldsymbol{x}\big)\big]$$

Clearly, when the conditional risk $R(h(\boldsymbol{x}) \mid \boldsymbol{x})$ is minimized for every sample, the total risk $R(h)$ is minimized as well. So, selecting for each feature vector the class label that minimizes its conditional risk achieves the optimum. This decision rule, the Bayes optimal classifier, is:

$$h^*(\boldsymbol{x}) = \arg\min_{c \in \mathcal{Y}} R(c \mid \boldsymbol{x})$$
Specifically, if the goal is to minimize the classification error rate, the misclassification loss $\lambda_{ij}$ can be written as:

$$\lambda_{ij} = \begin{cases} 0, & \text{if } i = j;\\ 1, & \text{otherwise}. \end{cases}$$

Under this 0/1 loss the conditional risk becomes $R(c \mid \boldsymbol{x}) = 1 - P(c \mid \boldsymbol{x})$, so the Bayes optimal classifier reduces to choosing the label with the largest posterior probability: $h^*(\boldsymbol{x}) = \arg\max_{c \in \mathcal{Y}} P(c \mid \boldsymbol{x})$.
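To make the decision rule concrete, here is a minimal sketch with a hypothetical loss matrix and hypothetical posteriors; it only illustrates how the conditional risk of Formula (3.2) is evaluated and minimized:

```python
# Minimal sketch of Bayesian decision theory with an explicit loss matrix.
# All labels, losses and posteriors below are hypothetical illustration values.
labels = ['cold', 'allergy', 'concussion']

# loss[i][j] = cost of predicting labels[i] when the true class is labels[j];
# missing a concussion is made deliberately expensive.
loss = [[0, 1, 10],
        [1, 0, 10],
        [1, 1, 0]]

posterior = {'cold': 0.5, 'allergy': 0.1, 'concussion': 0.4}  # P(c_j | x)

# Conditional risk R(c_i | x) = sum_j loss[i][j] * P(c_j | x), Formula (3.2)
risk = {labels[i]: sum(loss[i][j] * posterior[labels[j]] for j in range(len(labels)))
        for i in range(len(labels))}

best = min(risk, key=risk.get)  # Bayes optimal decision: minimize conditional risk
print(risk, best)  # 'concussion' wins despite 'cold' having the higher posterior
```

With an asymmetric loss like this, the risk-minimizing label can differ from the label with the largest posterior; the two coincide only under the 0/1 loss described above.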
3.3 Bayesian calculation method
Our goal is to obtain the posterior probability $P(c \mid \boldsymbol{x})$. There are two strategies:
- Discriminative models: model $P(c \mid \boldsymbol{x})$ directly from the given $\boldsymbol{x}$.
- Generative models: first model the joint probability $P(\boldsymbol{x}, c)$, then derive $P(c \mid \boldsymbol{x})$ from it.
3.4 What does this have to do with maximum likelihood estimation?
3.4.1 Maximum likelihood estimation
The earlier examples in this article introduce the core idea of Bayes with probabilities you can literally count on your fingers, seemingly without ever needing maximum likelihood estimation. So where does it come in, and what exactly is a likelihood function?
In statistics, the working assumption is that, given enough data, each feature follows a distribution that can be described by a mathematical expression with some unknown parameters. Finding that expression is parameter estimation, and maximum likelihood estimation chooses the parameters under which the observed data are most probable.
Therefore, training a probabilistic model is a process of parameter estimation. Let $D_c$ denote the set of samples of class $c$ in the training set $D$, and assume these samples are independent and identically distributed. Then the likelihood of the parameter $\theta_c$ for the data set $D_c$ is:

$$P(D_c \mid \theta_c) = \prod_{\boldsymbol{x} \in D_c} P(\boldsymbol{x} \mid \theta_c) \tag{3.8}$$
3.4.2 Logarithmic likelihood
The product in Formula (3.8) easily underflows, so the log-likelihood is used instead:

$$LL(\theta_c) = \log P(D_c \mid \theta_c) = \sum_{\boldsymbol{x} \in D_c} \log P(\boldsymbol{x} \mid \theta_c)$$

The maximum likelihood estimate of $\theta_c$ is the value that maximizes $LL(\theta_c)$.
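As a small illustration (with made-up observations, not the article's data), the maximum likelihood estimates for a Gaussian-distributed feature are the sample mean and variance, and working with log-densities turns the fragile product into a stable sum:

```python
import math

# Hypothetical observations of one continuous feature within a single class
samples = [0.70, 0.77, 0.63, 0.61, 0.56, 0.40, 0.48, 0.44]

# Maximum likelihood estimates for a Gaussian: sample mean and (biased) sample variance
mu = sum(samples) / len(samples)
var = sum((x - mu) ** 2 for x in samples) / len(samples)

def log_gaussian(x, mu, var):
    """Log of the Gaussian density, used instead of the raw density to avoid underflow."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Log-likelihood of the data under the fitted parameters: a sum, not a product
log_likelihood = sum(log_gaussian(x, mu, var) for x in samples)
print(mu, var, log_likelihood)
```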
3.5 Soul-searching questions
3.5.1 Why assume "attribute conditional independence"?
It seems simple enough, but there’s a problem:
- When the number of features grows, the number of feature combinations explodes. Take the patient-classification example with its 2 features: the symptom takes two values (sneezing, headache) and the occupation takes four values (nurse, farmer, construction worker, teacher), which already gives 2 × 4 = 8 feature combinations (sneeze + nurse, sneeze + farmer, sneeze + construction worker, ...), and the count multiplies with every additional feature or value. These numbers quickly become intolerable.
- Because there are so many feature combinations, some of them may never appear in the actual data set at all. For example, if (headache + nurse) never occurs, what disease should be predicted? (Note that this probability cannot simply be written as 0, because **"not observed"** and **"has probability 0"** are two different things.)
Therefore, a general premise is introduced: assume that all attributes are conditionally independent of each other given the class. On this basis, Formula (3.1) can be rewritten as:

$$P(c \mid \boldsymbol{x}) = \frac{P(c)\,P(\boldsymbol{x} \mid c)}{P(\boldsymbol{x})} = \frac{P(c)}{P(\boldsymbol{x})} \prod_{i=1}^{d} P(x_i \mid c)$$
- Here $d$ is the number of attributes (features), and $x_i$ is the value of $\boldsymbol{x}$ on the $i$-th attribute.
- $P(c)$ is called the class prior probability.
- $P(\boldsymbol{x})$ is the evidence factor, i.e. the product of the feature probabilities; it is the same for every class label, so it can be ignored when comparing classes.
- $P(x_i \mid c)$ is the class-conditional probability (likelihood) of attribute value $x_i$ given the class label $c$.
- For discrete attributes, let $D_{c,x_i}$ denote the set of samples in $D_c$ whose value on the $i$-th attribute is $x_i$; then

  $$P(x_i \mid c) = \frac{|D_{c,x_i}|}{|D_c|}$$

- For continuous attributes, assume $p(x_i \mid c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2)$, i.e.

  $$p(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right),$$

  where $\mu_{c,i}$ and $\sigma_{c,i}^2$ are, respectively, the mean and variance of the $i$-th attribute over the samples of class $c$.
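Putting these pieces together, a compressed sketch of the naive Bayes computation might look as follows; the function name and structure are my own, and the full implementation the article actually uses follows in Section 4:

```python
import math
import pandas as pd

def naive_bayes_posteriors(df, label_col, feature_types, x):
    """Unnormalized P(c | x): class prior times the product of per-feature likelihoods.
    feature_types maps each feature column to 'discrete' or 'continuous'; x is a dict of feature values."""
    scores = {}
    for c, subset in df.groupby(label_col):
        score = len(subset) / len(df)  # class prior P(c)
        for col, kind in feature_types.items():
            if kind == 'discrete':
                # empirical frequency of the observed value within class c
                score *= (subset[col] == x[col]).mean()
            else:
                # Gaussian density using the class-wise mean and standard deviation
                mu, sigma = subset[col].mean(), subset[col].std()
                score *= math.exp(-(x[col] - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        scores[c] = score
    return scores
```

The label with the largest score is the prediction; the evidence factor $P(\boldsymbol{x})$ is dropped because it is identical for every class.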
3.5.2 What if an unseen feature value appears?
When estimating the probabilities, "smoothing" is needed; the most common method is the "Laplace correction", whose formulas are:

$$\hat{P}(c) = \frac{|D_c| + 1}{|D| + N}, \qquad \hat{P}(x_i \mid c) = \frac{|D_{c,x_i}| + 1}{|D_c| + N_i}$$

- $N$ is the number of possible classes in the training set (3 in the disease example: cold, allergy, concussion).
- $N_i$ is the number of possible values of the $i$-th attribute (in the disease example, the occupation attribute has 4 possible values: nurse, construction worker, teacher, farmer).
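A minimal sketch of the correction (the helper names and the count of 6 cold patients are assumptions for illustration; the article's own implementation appears in the `regular=True` branches of the code in Section 4):

```python
def laplace_prior(class_count, total_count, num_classes):
    """Smoothed class prior: (|D_c| + 1) / (|D| + N)."""
    return (class_count + 1) / (total_count + num_classes)

def laplace_conditional(value_count, class_count, num_values):
    """Smoothed class-conditional probability: (|D_{c,x_i}| + 1) / (|D_c| + N_i)."""
    return (value_count + 1) / (class_count + num_values)

# Unseen combination such as (headache + nurse): the raw estimate would be 0,
# but the correction keeps the probability small and non-zero.
# Here 6 is an assumed number of cold patients and 4 is the number of occupations.
print(laplace_conditional(0, 6, 4))  # 1 / 10 = 0.1
```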
4. Python code implementation
This uses data from Statistical Learning Methods and the Watermelon Book
After writing the code, I found two problems in Zhou Zhihua's Watermelon Book:

- a probability calculation error;
- continuous-valued probabilities should use the standard deviation rather than the variance.
The Watermelon Book data is Table 4.3 on page 84; the leading index column was removed before the data is used in the code.
Getting the data
```python
def getData():
    """Get the data.
    :return: data set, feature info (discrete/continuous), label column name, and the sample to classify
    """
    # Data set from "Statistical Learning Methods" (example 4.1)
    dataset = pd.DataFrame({
        'x1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
        'x2': ['S', 'M', 'M', 'S', 'S', 'S', 'M', 'M', 'L', 'L', 'L', 'M', 'M', 'L', 'L'],
        'Y': [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
    })
    # Mark each feature as discrete or continuous; continuous features use the probability density function
    features_info = {
        'x1': 'discrete',
        'x2': 'discrete'
    }
    label_names = 'Y'
    target = {
        'x1': 2,
        'x2': 'S'
    }
    # The Watermelon Book data below overwrites the data above; comment it out to use the first data set
    dataset = pd.read_csv('./data/WaterMelonDataset3.csv')
    # dataset = dataset[1:]
    features_info = {
        'color': 'discrete', 'roots': 'discrete', 'knock sound': 'discrete', 'grain': 'discrete',
        'time': 'discrete', 'touch': 'discrete', 'density': 'continuous', 'Sugar content': 'continuous',
    }
    label_names = 'good melon'
    target = {
        'color': 'green', 'roots': 'curled', 'knock sound': 'cloud ring', 'grain': 'clear',
        'time': 'depression', 'touch': 'hard sliding', 'density': 0.697, 'Sugar content': 0.460,
    }
    return dataset, features_info, label_names, target
```
Compute the probability density of continuous attributes
```python
def calNormalDistribution(x_value, std_value, mean_value):
    """Probability density of a continuous attribute.
    :param x_value: target feature value
    :param std_value: standard deviation of the class-c samples on the i-th attribute
    :param mean_value: mean of the class-c samples on the i-th attribute
    :return: probability density"""
    return math.exp(-(x_value - mean_value) ** 2 / (2 * (std_value ** 2))) / (math.sqrt(2 * math.pi) * std_value)
```
```python
class NativeBayesModel(object):
    def __init__(
            self,
            dataset: pd.DataFrame,
            features_info: dict,
            label_names: str,
    ):
        """Take the training data.
        :param dataset: training data
        :param features_info: feature column names and whether each feature is discrete or continuous
        :param label_names: column name of the label
        """
        self.features_info = features_info
        self.label_names = label_names
        # Class prior probabilities
        self.prior_prob = {}
        # Evidence factors
        self.evidence_prob = {}
        # Class-conditional probabilities
        self.class_conditional_prob = {}
        # Initialize the evidence factors
        for ifeature in features_info:
            self.evidence_prob[ifeature] = {}
        # Collect statistics for the dataset features and labels
        self.features_stat, self.label_stat = self.getStatistic(dataset)

    def getStatistic(self, dataset: pd.DataFrame):
        """Count the occurrences of each feature value and each label value.
        :param dataset: training data
        :return: statistics of the feature values and of the label values
        """
        # Column names of the features
        features_name = [ifeature for ifeature in self.features_info.keys()]
        features = dataset[features_name]
        # Column of the label values
        labels = dataset[self.label_names]
        # Convert the label statistics into dictionary form
        label_stat = dict(labels.value_counts())
        features_stat = {}
        # Convert the statistics into dictionary form, feature by feature
        for ifeature in self.features_info.keys():
            features_stat[ifeature] = dict(features[ifeature].value_counts())
        return features_stat, label_stat

    def getPriorProb(self, dataset_nums: int, regular=False):
        """Calculate the prior (class) probabilities.
        :param dataset_nums: number of samples in the training set
        :param regular: whether the Laplace correction is applied
        """
        # If the Laplace correction is not used
        if regular is False:
            for iclass, counts in self.label_stat.items():
                self.prior_prob[iclass] = counts / dataset_nums
        else:
            for iclass, counts in self.label_stat.items():
                self.prior_prob[iclass] = (counts + 1) / (dataset_nums + len(self.label_stat))

    def getEvidenceProb(self, dataset_nums: int):
        """Calculate the evidence factor of each feature value.
        :param dataset_nums: number of samples in the training set
        """
        for ifeature in self.features_info.keys():
            for ifeature_name, counts in self.features_stat[ifeature].items():
                self.evidence_prob[ifeature][ifeature_name] = counts / dataset_nums

    def getConditionData(self, dataset: pd.DataFrame):
        """Split the dataset by label value.
        :param dataset: training data
        :return: dict of sub-datasets, one per class label
        """
        new_dataset = {}
        for iclass in self.label_stat:
            # Initialize the class-conditional probabilities
            self.class_conditional_prob[iclass] = {}
            # Divide the data set by class
            new_dataset[iclass] = dataset[dataset[self.label_names] == iclass]
        return new_dataset

    def getClassConditionalProb(self, dataset, target, iclass, regular=False):
        """Class-conditional probability P(feature_i = x_i | class = iclass).
        :param dataset: sub-dataset containing only samples of class iclass
        :param target: the sample to classify, as {feature name: feature value}
        :param iclass: the class label
        :param regular: whether the Laplace correction is applied
        """
        for target_feature_name, target_feature in target.items():
            # Initialize the class-conditional probabilities, stored as "class -> feature column -> feature value"
            if target_feature_name not in self.class_conditional_prob[iclass]:
                self.class_conditional_prob[iclass][target_feature_name] = {}
            if target_feature not in self.class_conditional_prob[iclass][target_feature_name]:
                self.class_conditional_prob[iclass][target_feature_name][target_feature] = {}
            # Determine whether the feature is discrete or continuous
            if self.features_info[target_feature_name] == 'discrete':
                # Filter the sub-dataset by the target feature value
                condition_dataset = dataset[dataset[target_feature_name] == target_feature]
                # If the Laplace correction is not used
                if regular is False:
                    prob = condition_dataset.shape[0] / dataset.shape[0]
                else:
                    prob = (condition_dataset.shape[0] + 1) / (dataset.shape[0] + len(self.features_stat[target_feature_name]))
            # If the feature is continuous
            else:
                x_value = target_feature
                std_value = dataset[target_feature_name].std()
                mean_value = dataset[target_feature_name].mean()
                prob = calNormalDistribution(x_value, std_value, mean_value)
            self.class_conditional_prob[iclass][target_feature_name][target_feature] = prob

    def getPredictClass(self, target):
        """Compare the classes and return the most probable one."""
        max_prob = 0
        predict_class = None
        for iclass in self.label_stat:
            prob = self.prior_prob[iclass]
            for target_feature_name, target_feature in target.items():
                prob *= self.class_conditional_prob[iclass][target_feature_name][target_feature]
            print('label', iclass, '\'s probability is:', prob)
            if prob > max_prob:
                predict_class = iclass
                max_prob = prob
        return predict_class
```
Calling the code
```python
if __name__ == '__main__':
    import pandas as pd
    import math

    # Whether a Laplace correction is needed
    regular_state = False
    dataset, features_info, label_names, target = getData()
    dataset_nums = dataset.shape[0]
    nb = NativeBayesModel(dataset, features_info, label_names)
    # Calculate the prior probabilities
    nb.getPriorProb(dataset_nums, regular=regular_state)
    # Calculate the evidence factors
    nb.getEvidenceProb(dataset_nums)
    # Divide the dataset by class label into sub-datasets, each containing a single class
    subDataset = nb.getConditionData(dataset)
    # Calculate the class-conditional probabilities for each class in turn
    for iclass, subdata in subDataset.items():
        nb.getClassConditionalProb(subdata, target, iclass, regular=regular_state)
    predict_class = nb.getPredictClass(target)
    print('predict label is :', predict_class)
    print('==============prior prob===================')
    print(nb.prior_prob)
    print('==============ClassConditionalProb===================')
    print(nb.class_conditional_prob)
```
Example
How to write a spell checker
References

- Statistical Learning Methods, Li Hang
- Machine Learning, Zhou Zhihua
- An easy-to-understand derivation of Bayes