This is the 18th day of my participation in the August Genwen Challenge. More challenges in August.
Naive Bayes
Based on conditional probability; assumes the statistical attributes are independent of each other. Typical uses in text classification: spam filtering, sentiment prediction, recommendation systems.
The specific process
- Provide the training data
- Provide the class label for each sample
- Compute the class prior probabilities and the conditional probabilities
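The prior and conditional probabilities above combine through Bayes' rule. A minimal numeric sketch, with all probability values invented purely for illustration (toy spam-filter numbers, not from the article):

```python
# Toy posterior computation with Bayes' rule.
# All numbers are made-up illustration values.
p_spam = 0.3                 # prior P(spam)
p_ham = 0.7                  # prior P(ham)
p_word_given_spam = 0.8      # conditional P("free" appears | spam)
p_word_given_ham = 0.1       # conditional P("free" appears | ham)

# Total probability of seeing the word at all
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

Seeing the word raises the spam probability from the 0.3 prior to roughly 0.77, which is exactly the "inverse probability" reasoning Naive Bayes is built on.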
Bayesian principle, Bayesian classification and Naive Bayes
- Bayes' principle – the overarching concept – solves the "inverse probability" problem in probability theory
- Bayes classifier – a classifier designed on the basis of Bayes' principle
- Naive Bayes classifier – the simplest Bayes classifier
Working principle of naive Bayes classification
Discrete data case
Continuous data case
- Assume each feature follows a normal distribution
- Compute the mean and variance – obtain the probability density function – plug in a value and evaluate the density at that point
- Alternatively, discretize the continuous data into intervals and then apply the discrete formula directly
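The continuous-data steps above can be sketched directly; the feature values below are hypothetical (e.g. heights for one class), chosen only to show the mean/variance/density computation:

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal probability density, as used by Naive Bayes for continuous features."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical observed feature values for one class
values = [1.70, 1.75, 1.80, 1.65]

# Step 1: mean and variance of the class's feature values
mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)

# Step 2: evaluate the density function at a new observation
density = gaussian_pdf(1.72, mean, var)
```

The resulting density is not itself a probability (it can exceed 1); Naive Bayes only compares these densities across classes, so relative magnitude is what matters.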
Naive Bayes workflow
- Preparation stage
Determine the feature attributes – partition each feature attribute appropriately – manually label a portion of the data – form the training samples
- The stage of training
Generate the classifier – compute the frequency of each class in the training samples and the conditional probability of each feature-attribute partition for each class. Input: feature attributes and training samples. Output: classifier
- Application stage
Input: classifier and new data. Output: the class of the new data
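The three stages map directly onto scikit-learn's API. A minimal sketch with an invented toy dataset (the count features and labels below are made up for illustration):

```python
from sklearn.naive_bayes import MultinomialNB

# Preparation stage: manually labelled training samples,
# each row is word-count features for one document (toy values)
train_features = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]]
train_labels = ["spam", "ham", "spam", "ham"]

# Training stage: fit() computes the class frequencies and the
# conditional probabilities of each feature per class
clf = MultinomialNB(alpha=0.001)
clf.fit(train_features, train_labels)

# Application stage: classify new data
prediction = clf.predict([[3, 0, 1]])
print(prediction[0])
```

Input to training: feature attributes and training samples; output: the fitted classifier. Input to application: the classifier and new data; output: the predicted class.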
TF-IDF
Creation of the TfidfVectorizer class
- Create a TfidfVectorizer instance
- Fit a list of texts – fit_transform – returns the text matrix
- The output
- All distinct (non-repeating) words – get_feature_names()
- The corresponding ID of each word – vocabulary_
- The TF-IDF value of each word – toarray()
Classify documents
- Preparation stage
- Document segmentation
- English – nltk
- word_tokenize(text) – tokenization
- pos_tag() – marks the parts of speech
- Chinese – jieba
- Load stop words
- Use TfidfVectorizer's fit_transform to fit the TF-IDF feature space: tf = TfidfVectorizer(stop_words=stop_words, max_df=max_df); features = tf.fit_transform(train_contents)
- Classification stage
- Generate the multinomial Naive Bayes classifier: clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)
- Apply the classifier to the test set: predicted_labels = clf.predict(test_features)
- Compute the accuracy: from sklearn import metrics; print(metrics.accuracy_score(test_labels, predicted_labels))
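Putting the preparation, training, and classification stages together in one runnable sketch. The tiny corpus and labels are invented for illustration, and `stop_words="english"` stands in for a loaded stop-word list:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Invented toy training and test sets
train_contents = ["win free money now", "cheap free offer",
                  "meeting at noon", "project review notes"]
train_labels = ["spam", "spam", "ham", "ham"]
test_contents = ["free money offer", "notes from the meeting"]
test_labels = ["spam", "ham"]

# Preparation stage: fit the TF-IDF feature space on the training set
tf = TfidfVectorizer(stop_words="english", max_df=0.5)
train_features = tf.fit_transform(train_contents)
# Reuse the fitted vocabulary for the test set (transform, not fit_transform)
test_features = tf.transform(test_contents)

# Training stage
clf = MultinomialNB(alpha=0.001)
clf.fit(train_features, train_labels)

# Classification stage: predict, then score
predicted_labels = clf.predict(test_features)
print(metrics.accuracy_score(test_labels, predicted_labels))
```

Using `transform` (not `fit_transform`) on the test set is the key detail: the test documents must be mapped into the same feature space the classifier was trained on.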