Naive Bayes for text classification
Introduction
Naive Bayes is a simple but powerful probabilistic model derived from Bayes' theorem: it computes the probability that an object belongs to a certain class from the probabilities of its individual features. The method rests on the assumption that all features are conditionally independent of one another, that is, the value of any feature is unrelated to the values of the other features. Although this conditional independence assumption is rarely well satisfied in practice, and is often plainly untenable, the simplified Bayesian classifier still achieves good classification accuracy in many real applications. Training the model can be viewed as estimating the relevant conditional probabilities, which can be done by counting how often each feature occurs together with each class. One of the most successful applications of naive Bayes is natural language processing, where text documents annotated with labels can serve as training data for a machine learning algorithm. In this section we use the naive Bayes method for text classification: we train a naive Bayes classifier on a set of text documents labeled with categories, and then predict the categories of unseen documents. The same approach can be used for spam filtering.
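To make the idea concrete, here is a minimal sketch on a made-up two-class toy corpus (the texts, labels, and helper names below are invented for illustration; this is not the scikit-learn classifier we build later): training just counts how often each word occurs in each class, and prediction combines the class prior with the per-word conditional probabilities.

```python
import math
from collections import Counter, defaultdict

# toy labeled corpus: (text, class) pairs, made up for illustration
train = [("win money now", "spam"), ("cheap money offer", "spam"),
         ("meeting schedule today", "ham"), ("project meeting notes", "ham")]

# "training": count class frequencies and per-class word frequencies
class_counts = Counter()
word_counts = defaultdict(Counter)
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text, alpha=1.0):
    # score each class by log P(class) + sum of log P(word | class),
    # estimating P(word | class) from the counts with add-alpha smoothing
    vocab = set(w for counts in word_counts.values() for w in counts)
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / float(sum(class_counts.values())))
        for w in text.split():
            score += math.log((word_counts[label][w] + alpha) /
                              float(total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("cheap money"))  # prints: spam
```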
The data set
The data for this experiment is a collection of newsgroup posts that can be fetched through scikit-learn. The dataset contains roughly 19,000 news posts covering 20 different topics, including politics, sports, science and more. It is divided into a training set and a test set, with the split based on the date of the messages.
Data can be loaded in two ways:
- sklearn.datasets.fetch_20newsgroups: returns a list of raw texts, which can be fed to a text feature extraction interface such as sklearn.feature_extraction.text.CountVectorizer
- sklearn.datasets.fetch_20newsgroups_vectorized: returns features that are ready to use, so no separate feature extraction step is needed
```python
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print news.keys()
print type(news.data), type(news.target), type(news.target_names)
print news.target_names
print len(news.data)
print len(news.target)
```
The output:
```
['DESCR', 'data', 'target', 'target_names', 'filenames']

<type 'list'> <type 'numpy.ndarray'> <type 'list'>
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
18846
18846
```
Let’s take a look at the first item and the corresponding category:
```python
print news.data[0]
print news.target[0], news.target_names[news.target[0]]
```
The printed news content is omitted here; the category is 10 and the category name is rec.sport.hockey.
Data preprocessing
Machine learning algorithms can only work on numerical data: they expect fixed-length numerical feature vectors rather than raw text files of varying length. Our next step is therefore to convert the text dataset into a numerical one. At the moment we have only one feature, the textual content of a news post, so we need a function that transforms a piece of text into a meaningful set of numerical features. Intuitively, we can look at the individual strings (more precisely, tokens) that occur in the texts of each category, and describe the frequency distribution of those tokens for each category. The sklearn.feature_extraction.text module has some useful tools for building numerical feature vectors from text documents.
Divide training and test data
Before we can do the transformation work, we need to divide the data into training and test data sets. Since the loaded data appears in random order, we can divide the data into two parts, 75% for training data and 25% for test data:
```python
SPLIT_PERC = 0.75
split_size = int(len(news.data)*SPLIT_PERC)
X_train = news.data[:split_size]
X_test = news.data[split_size:]
Y_train = news.target[:split_size]
Y_test = news.target[split_size:]
```
Alternatively, because sklearn.datasets.fetch_20newsgroups can select the training or test portion via the subset parameter, we can use the built-in split: 11,314 training documents, roughly 60% of the full dataset, with the remaining 40% used as test data. It can be obtained as follows:
```python
news_train = fetch_20newsgroups(subset='train')
news_test = fetch_20newsgroups(subset='test')
X_train = news_train.data
X_test = news_test.data
Y_train = news_train.target
Y_test = news_test.target
```
Bag of Words
The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model a text (a paragraph or a document) is treated as an unordered collection of words, ignoring grammar and even word order. The bag-of-words model is used in several approaches to text classification; in particular, when traditional Bayesian classification is applied to text, the conditional independence assumption leads naturally to a bag-of-words representation. Scikit-learn provides utilities for extracting numerical features from text content in the most common ways, for example:
- Tokenizing strings and assigning an integer ID to each possible token, for instance using whitespace and punctuation as token separators (for Chinese this involves word segmentation)
- Counting the occurrences of tokens in each document
- Normalizing and weighting tokens, with diminishing importance for tokens that appear in the majority of samples/documents

This strategy of turning a collection of text documents into numerical feature vectors is called the bag of words.
Under this strategy, features and samples are defined as follows: the occurrence frequency of each individual token (normalized or not) is treated as a feature, and the vector of all token frequencies for a given document is treated as a multivariate sample. A corpus of texts can then be represented as a matrix in which each row corresponds to a document and each column corresponds to a token (word) appearing in the corpus.
Representing text by word occurrence frequencies means that the relative positions of words in the text are ignored entirely, which is in keeping with the conditional independence assumption of naive Bayes.
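As a minimal sketch of what this document-term matrix looks like, here is a made-up three-document corpus vectorized with CountVectorizer (the corpus and variable names are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

vect = CountVectorizer()
X = vect.fit_transform(docs)     # scipy.sparse matrix with one row per document
print(vect.get_feature_names())  # the tokens that make up the columns
print(X.toarray())               # raw counts; word order within each document is lost
```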
Sparsity
Most documents typically use only a small subset of all the words in the corpus, so the resulting matrix has many feature values equal to zero (usually more than 99% of them). For example, a collection of 10,000 short texts (such as emails) might use a vocabulary of 100,000 distinct words in total, while each individual document uses only 100 to 1,000 of them. To store such a matrix in memory and to speed up matrix/vector operations, sparse representations such as those provided in the scipy.sparse package are typically used.
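A quick check of this sparsity on our own data (a sketch assuming the X_train list from the split above) counts the fraction of non-zero entries in the vectorized training set:

```python
from sklearn.feature_extraction.text import CountVectorizer

X_train_counts = CountVectorizer().fit_transform(X_train)
print(X_train_counts.shape)  # (number of documents, number of distinct tokens)
n_rows, n_cols = X_train_counts.shape
density = X_train_counts.nnz / float(n_rows * n_cols)
print("fraction of non-zero entries: %f" % density)  # typically well below 1%
```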
Interface for text feature extraction
Sklearn.feature_extraction.text provides the following tools for constructing feature vectors:
- feature_extraction.text.CountVectorizer(...): Convert a collection of text documents to a matrix of token counts
- feature_extraction.text.HashingVectorizer(...): Convert a collection of text documents to a matrix of token occurrences
- feature_extraction.text.TfidfTransformer(...): Transform a count matrix to a normalized tf or tf-idf representation
- feature_extraction.text.TfidfVectorizer(...): Convert a collection of raw documents to a matrix of TF-IDF features
Explanation:
- CountVectorizer builds a dictionary of words and converts each document into a feature vector in which each element is the number of times a particular word appears in the text
- HashingVectorizer uses a hash function to map tokens to feature indices, and then counts them in the same way as CountVectorizer
- TfidfVectorizer uses a more advanced weighting called Term Frequency Inverse Document Frequency (TF-IDF), a statistical measure of how important a word is to a text in a corpus. Intuitively, it looks for words that are frequent in the current document while comparing against the frequency of those words across the whole corpus. This normalizes the results and avoids the situation where some words appear so frequently everywhere that they contribute little to characterizing an instance (I would guess "a" and "and" appear very frequently in English, but they contribute little to characterizing a text); see the small sketch after this list.
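Here is a small sketch of this downweighting (again with a made-up corpus): the word "the" appears in every document, so its tf-idf weight in the first document ends up lower than that of the more discriminative word "cat":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat",
        "the dog ran",
        "the bird flew"]

vect = TfidfVectorizer()
X = vect.fit_transform(docs)
# pair each token with its tf-idf weight in the first document
weights = dict(zip(vect.get_feature_names(), X.toarray()[0]))
print(weights["the"])  # occurs in every document, so it gets a low idf and a low weight
print(weights["cat"])  # occurs only in this document, so it gets a higher weight
```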
Construction of naive Bayes classifier
Since we use word occurrence counts as features, we can describe them with a multinomial distribution. In scikit-learn the classifier is built with the MultinomialNB class of the sklearn.naive_bayes module, and we use the Pipeline class to combine a vectorizer and a classifier into a compound classifier.
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

# nbc means naive bayes classifier
nbc_1 = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
nbc_2 = Pipeline([('vect', HashingVectorizer(non_negative=True)), ('clf', MultinomialNB())])
nbc_3 = Pipeline([('vect', TfidfVectorizer()), ('clf', MultinomialNB())])
nbcs = [nbc_1, nbc_2, nbc_3]
```
Cross validation
Let’s design a cross-validation function to test the performance of the classifier:
```python
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem
import numpy as np

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross validation iterator of K folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print ("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores))
```
Split the training data into 5 folds and print the validation scores:
```python
for nbc in nbcs:
    evaluate_cross_validation(nbc, X_train, Y_train, 5)
```
The output is:
```
[0.82589483 0.83473266 0.8272205  0.84136103 0.83377542]
Mean score: 0.833 (+/-0.003)
[0.76358816 0.72337605 0.72293416 0.74370305 0.74977896]
Mean score: 0.741 (+/-0.008)
[0.84975696 0.83517455 0.82545294 0.83870968 0.84615385]
Mean score: 0.839 (+/-0.004)
```
As can be seen from the above results, CountVectorizer and TfidfVectorizer are better than HashingVectorizer for feature extraction.
Optimizing feature extraction to improve classification
Next, we improve how the text is parsed into tokens using regular expressions.
Optimizing the token extraction rule
The token_pattern parameter of TfidfVectorizer specifies the rule used to extract tokens. The default regular expression is ur"\b\w\w+\b", which matches runs of two or more letters, digits or underscores between word boundaries. The new regular expression is ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b", which additionally allows hyphens and dots inside a token.
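Before plugging the new pattern into the pipeline below, a quick sketch on a made-up lowercase sample string (the vectorizer lowercases text before tokenizing) shows the difference: the default pattern breaks names like comp.graphics and ms-windows apart, while the new one keeps them as single tokens.

```python
import re

sample = "discussion about ms-windows drivers on comp.graphics, mail me at user@host.com"
default_pattern = r"\b\w\w+\b"
new_pattern = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"
print(re.findall(default_pattern, sample))  # ..., 'ms', 'windows', ..., 'comp', 'graphics', ...
print(re.findall(new_pattern, sample))      # ..., 'ms-windows', ..., 'comp.graphics', ...
```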
```python
nbc_4 = Pipeline([
    ('vect', TfidfVectorizer(
        token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])

evaluate_cross_validation(nbc_4, X_train, Y_train, 5)
```
```
[0.86478126 0.85461776 0.84489616 0.85505966 0.85234306]
Mean score: 0.854 (+/-0.003)
```
That’s up from the previous score of 0.839.
Optimizing the stop words parameter
The stop_words argument of TfidfVectorizer specifies words to be left out of the token list, such as words that occur very frequently but provide no prior support for any particular topic.
```python
def get_stop_words():
    # read one stop word per line from a local file
    result = set()
    for line in open('stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result

stop_words = get_stop_words()
nbc_5 = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words=stop_words,
        token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])

evaluate_cross_validation(nbc_5, X_train, Y_train, 5)
```
```
[0.88731772 0.88731772 0.878038   0.88466637 0.88107869]
Mean score: 0.884 (+/-0.002)
```
The score went up to 0.884.
Optimizing the alpha parameter of the Bayes classifier
MultinomialNB has an alpha parameter, which is a smoothing parameter. Its default value is 1.0; here we set it to 0.01.
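Roughly speaking, alpha is the additive (Laplace/Lidstone) smoothing term: MultinomialNB estimates P(word | class) as (count + alpha) / (total + alpha * n_features), so a smaller alpha lets rare but informative words keep probabilities closer to their raw frequencies instead of being flattened toward a uniform distribution. A tiny sketch with made-up counts:

```python
def smoothed_prob(count, total, n_features, alpha):
    # additive smoothing as used by MultinomialNB for P(word | class)
    return (count + alpha) / float(total + alpha * n_features)

# a word seen 3 times in a class with 1,000 word occurrences and a 100,000-word vocabulary
print(smoothed_prob(3, 1000, 100000, alpha=1.0))   # about 4e-05: heavily flattened
print(smoothed_prob(3, 1000, 100000, alpha=0.01))  # about 0.0015: closer to the raw 3/1000
```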
```python
nbc_6 = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words=stop_words,
        token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])

evaluate_cross_validation(nbc_6, X_train, Y_train, 5)
```
```
[0.91073796 0.92532037 0.91604065 0.91294741 0.91202476]
Mean score: 0.915 (+/-0.003)
```
With these adjustments the score is about as optimized as it is going to get.
Evaluate classifier performance
Through cross-validation we have found classifier parameters that perform well; now we can evaluate the classifier on our test data.
```python
from sklearn import metrics

nbc_6.fit(X_train, Y_train)
print "Accuracy on training set:"
print nbc_6.score(X_train, Y_train)
print "Accuracy on testing set:"
print nbc_6.score(X_test, Y_test)
y_predict = nbc_6.predict(X_test)
print "Classification Report:"
print metrics.classification_report(Y_test, y_predict)
print "Confusion Matrix:"
print metrics.confusion_matrix(Y_test, y_predict)
```
Only accuracy is output here:
```
Accuracy on training set:
0.997701962171
Accuracy on testing set:
0.846919808816
```
References
- Wikipedia: Bag-of-words model