When using machine learning to solve practical problems, what we get is often not a clean data file. The useful information is more likely to be buried in complex data such as pictures, text, or video, so we first need to extract numerical features from that data before we can analyze it and train models. This section introduces some of the scikit-learn features that can be used for text classification, including:
- Reading text content and category labels from disk
- Extracting feature vectors from the text that can be used in machine learning
The specific process of using text features for model training, evaluation, and optimization will be discussed in the next section.
Getting the data files
Let's start with the data set we're going to use. The data set used in this article is called “20 newsgroups” and is often used for machine learning and natural language processing experiments. It contains nearly 20,000 newsgroup posts in 20 categories, and its official description can be found at http://qwone.com/~jason/20Newsgroups/.
There are several ways to get this data set. One simple way is to use the scikit-learn function sklearn.datasets.fetch_20newsgroups. This function automatically downloads and reads the “20 newsgroups” data from the web, as shown in the following example. To save computation and processing time, we select only four of the 20 categories for the subsequent analysis.
Note: Because the data package is large and hosted on an overseas server, the following example is slow to run, so no runnable version is provided.
from sklearn.datasets import fetch_20newsgroups
# Select the news categories to download
categories = ["alt.atheism", "soc.religion.christian", "comp.graphics", "sci.med"]
# Download and load the training data
twenty_train = fetch_20newsgroups(subset="train",
categories=categories, shuffle=True, random_state=42)
# Show the classification of training data
twenty_train.target_names
Of course, a more general approach is to download the data we need directly from the Internet. We can use Python's urllib library to download the archive; the function for downloading a file from the network is urllib.request.urlretrieve. The files we download are usually compressed archives, which can be unpacked with the tarfile library, as shown in the following example.
# Download the archive from the network
from urllib import request
request.urlretrieve("http://jizhi-10061919.cos.myqcloud.com/sklearn/20news-bydate.tar.gz", "data.tar.gz")
# Unpack the downloaded archive
import tarfile
tar = tarfile.open("data.tar.gz", "r:gz")
tar.extractall()
tar.close()
# Select the news categories to load
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# Load the training data from disk
from sklearn.datasets import load_files
twenty_train = load_files('20news-bydate/20news-bydate-train',
categories=categories,
load_content = True,
encoding='latin1',
decode_error='strict',
shuffle=True,random_state=42)
# Show the classification of training data
print(twenty_train.target_names)
As shown above, once the archive has been downloaded and unpacked, we can use sklearn.datasets.load_files to read the data. Again we load only four categories of text for analysis, and we print some information about the loaded data to confirm that it was read correctly.
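A quick check of what load_files returns can be sketched as follows. This is a minimal sketch, assuming a tiny made-up corpus written to a temporary directory in place of the real 20 newsgroups folder; the toy_docs contents and the file names are invented for illustration.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical stand-in corpus: one sub-folder per category, one file per document
toy_docs = {
    "comp.graphics": ["How do I render a 3D image?", "Which file format stores images?"],
    "sci.med": ["What are the symptoms of the flu?"],
}

with tempfile.TemporaryDirectory() as root:
    for category, texts in toy_docs.items():
        os.makedirs(os.path.join(root, category))
        for n, text in enumerate(texts):
            path = os.path.join(root, category, "doc{0}.txt".format(n))
            with open(path, "w", encoding="latin1") as f:
                f.write(text)

    # File contents are read here, so the data survives after the folder is deleted
    toy_train = load_files(root, encoding="latin1", shuffle=True, random_state=42)

# Basic sanity checks on what was read
print(toy_train.target_names)                        # category names, taken from the folder names
print(len(toy_train.data))                           # number of documents loaded
print(toy_train.data[0])                             # text of the first document
print(toy_train.target_names[toy_train.target[0]])   # category of the first document
```

The same four checks can be run on twenty_train once the real archive has been unpacked.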
Extracting text features
Whatever the machine learning method, it can only analyze numerical feature vectors (that is, fixed-length sequences of numbers), so after reading the text we need to convert it into numerical feature vectors.
Bag of words
One of the most commonly used methods for extracting text features is the bag-of-words model. First, every word that occurs in the training documents is assigned a fixed integer index, forming a dictionary. Then, for each document #i, the number of occurrences of each word w is counted and stored in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.
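The bag-of-words idea can be sketched by hand in a few lines of plain Python. This is only an illustration on two made-up sentences with whitespace tokenization, not how scikit-learn implements it.

```python
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Step 1: build the dictionary, mapping each distinct word to an integer index j
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: X[i][j] = number of times the word with index j occurs in document i
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocabulary[word]] += 1

print(vocabulary)
print(X)
```

Each row of X is the feature vector of one document, and most entries are already zero even in this tiny example.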
The bag-of-words model implies that n_features is the number of distinct words in the data set, and this number often exceeds 100,000. What's the problem with that? Consider that if the number of samples (that is, the number of documents) is 10,000 and the features are stored as 32-bit floating-point numbers, the full feature matrix needs 10,000 × 100,000 × 4 bytes = 4 GB, all of which must be held in memory, which is hard to manage on an ordinary computer.
Fortunately, most of the values in a feature matrix built this way are zero, because each document actually uses only a few hundred distinct words. For this reason, we say that bag-of-words features form a high-dimensional sparse data set. We can save a lot of memory by storing only the non-zero values.
The scipy.sparse module provides data structures that do exactly this, and scikit-learn supports them.
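The memory argument can be checked with a short sketch. The first part redoes the dense-size arithmetic from the text; the second part builds a small scipy.sparse matrix from made-up counts (the row, column, and count values are invented) to show that only the non-zero entries are stored.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense cost from the text: 10,000 documents x 100,000 words x 4 bytes (float32)
n_samples, n_features = 10_000, 100_000
dense_bytes = n_samples * n_features * 4
print(dense_bytes / 1e9, "GB")  # 4.0 GB

# Small sparse example: 3 documents, 8 vocabulary words, 5 non-zero counts
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([1, 4, 4, 0, 7])
counts = np.array([2, 1, 3, 1, 1], dtype=np.float32)
X = csr_matrix((counts, (rows, cols)), shape=(3, 8))

# CSR keeps only the non-zero values plus two index arrays
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(X.nnz, "non-zero values out of", X.shape[0] * X.shape[1])
print(sparse_bytes, "bytes sparse vs", X.toarray().nbytes, "bytes dense")
```

The saving is modest at this size, but it grows with the share of zero entries, which is very large for real text data.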
Tokenizing text with scikit-learn
scikit-learn provides an efficient text-processing module that covers text preprocessing, tokenizing, and stop-word filtering. It helps us build a feature dictionary from the data and transform the text into feature vectors.
CountVectorizer is a class that counts the words in text. Fitting it to the text data builds the feature dictionary and returns the matrix of word counts. The dictionary, stored in the vocabulary_ attribute, maps each word to the index of its column in that matrix, as shown in the following example.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
# Print the shape of the feature matrix
print("There are {0} training documents and {1} words in the vocabulary.".format(X_train_counts.shape[0], X_train_counts.shape[1]))
# Look up the column index of a word in the vocabulary
index = count_vect.vocabulary_.get(u'algorithm')
print("The column index of 'algorithm' is {0}".format(index))
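Note that vocabulary_ holds column indices, not counts. To get the total number of occurrences of a word, sum its column in the count matrix. A minimal sketch on a made-up two-sentence corpus (toy_docs is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the dog sat on the mat"]

vect = CountVectorizer()
X = vect.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index in X
col = vect.vocabulary_.get("the")

# The total number of occurrences is the sum of that column
total = int(X[:, col].sum())

print("column index of 'the':", col)
print("total occurrences of 'the':", total)
```

The same pattern applies to the real data: X_train_counts[:, index].sum() gives the corpus-wide count of the word whose column index was looked up.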
From occurrences to frequencies
Counting occurrences is a good start, but it has a problem: longer documents will have higher average counts than shorter ones, even when they are about the same topic.
To address these potential differences, we can try dividing the number of times each word appears in a document by the total number of words in the document. This will result in a new feature that we can call TF, or Term Frequencies.
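The effect of this normalization can be seen in a small plain-Python sketch. The two documents are made up: the long one simply repeats the short one five times, so both are on the same topic.

```python
from collections import Counter

short_doc = "graphics card renders image"
long_doc = " ".join([short_doc] * 5)  # same wording, five times longer

def term_frequencies(text):
    """Return {word: count / total number of words in the document}."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Raw counts differ with document length
print(Counter(short_doc.split())["image"], Counter(long_doc.split())["image"])
# Term frequencies are the same for both documents
print(term_frequencies(short_doc)["image"], term_frequencies(long_doc)["image"])
```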
Another optimization based on TF is to reduce the weight of some less important words, which tend to appear in many documents and have less impact on classification than words that appear in only a small number of documents.
We call this term-frequency-plus-weighting model TF-IDF, short for “Term Frequency times Inverse Document Frequency”. Here is a brief description of the mathematics behind it.
tf
Term frequency (TF) is how often a word $t_i$ occurs in a document $d_j$, relative to the length of that document. It can be calculated with the following formula:
$$\textrm{tf}_{i,j}=\frac{n_{i,j}}{\sum_k n_{k,j}}$$
where $n_{i,j}$ is the number of times the word $t_i$ occurs in the document $d_j$, and $\sum_k n_{k,j}$ is the total number of occurrences of all words in the document $d_j$.
IDF, or inverse document frequency, is a measure of the general importance of a word, indicating how much information it carries. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm:
$$\textrm{idf}_i=\log\frac{|D|}{|\{j: t_i\in d_j\}|}$$
where $|D|$ is the total number of documents in the data set, and $|\{j: t_i\in d_j\}|$ is the number of documents that contain the word $t_i$.
And finally:
$$\textrm{tfidf}_{i,j}=\textrm{tf}_{i,j}\times\textrm{idf}_i$$
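The definitions above can be tried out directly in plain Python. This is a sketch of the textbook formulas on three made-up documents, and it assumes every queried word occurs in at least one document. scikit-learn's TfidfTransformer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

def tf(word, doc):
    # Occurrences of the word divided by the total number of words in the document
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    # log(total documents / documents containing the word)
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tfidf("the", docs[0], docs))  # "the" is in every document, so idf = log(1) = 0
print(tfidf("cat", docs[0], docs))  # "cat" is in two of the three documents
print(tfidf("mat", docs[0], docs))  # "mat" is in one document only, so it scores highest
```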
Both TF and TF-IDF can be calculated in the following way.
from sklearn.feature_extraction.text import TfidfTransformer
# Extract term-frequency (TF) features only (use_idf=False disables the IDF weighting)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
# View feature results
X_train_tf.shape
In the code above, we first use the fit(..) method to fit the transformer to the data, and then use the transform(..) method to convert the count matrix into a TF or TF-IDF representation. The two steps can be combined to avoid redundant processing, using the fit_transform(..) method as shown below.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# View feature results
X_train_tfidf.shape
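That the two-step and one-step forms give the same result can be checked on a small example. The count matrix below is made up for illustration and stands in for X_train_counts.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 4 words
counts = csr_matrix(np.array([
    [2, 1, 0, 0],
    [1, 0, 3, 0],
    [0, 0, 1, 4],
], dtype=float))

# Two steps: learn the IDF weights, then apply them
two_step = TfidfTransformer().fit(counts).transform(counts)

# One step: learn and apply in a single call
one_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```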
At this point, we have completed the whole process of extracting data features from text data using TF-IDF model. We will discuss how to use these data features for text classification model training, evaluation and optimization in the next section.
Reference: scikit-learn tutorial, “Working With Text Data”