Feature extraction

Goals

- Use DictVectorizer to convert categorical (discrete) features to numbers
- Use CountVectorizer to convert text features to numbers
- Use TfidfVectorizer to convert text features to numbers
- Understand the difference between the two text feature extraction methods
Definition

Feature extraction is the conversion of arbitrary data, such as text or images, into numerical features that can be used for machine learning.

Note: features are turned into numbers so that the computer can understand the data better.
- Dictionary feature extraction (feature discretization)
- Text feature extraction
- Image feature extraction (deep learning)
Feature extraction API
sklearn.feature_extraction
Dictionary feature extraction

Purpose: vectorize dictionary data.

- sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
- DictVectorizer.fit_transform(X): X is a dictionary or an iterable of dictionaries; returns a sparse matrix
- DictVectorizer.inverse_transform(X): X is an array or sparse matrix; returns the data in its pre-conversion format
- DictVectorizer.get_feature_names_out(): returns the feature names (named get_feature_names() before scikit-learn 1.0)
Application

Extract features from the following data:

```python
data = [{'city': 'Beijing', 'temperature': 100},
        {'city': 'Shanghai', 'temperature': 60},
        {'city': 'shenzhen', 'temperature': 30}]
```
Process analysis

- Instantiate the class DictVectorizer
- Call the fit_transform() method with the data and convert it (note the return format)
```python
from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    """Dictionary feature extraction."""
    data = [{'city': 'Beijing', 'temperature': 100},
            {'city': 'Shanghai', 'temperature': 60},
            {'city': 'shenzhen', 'temperature': 30}]
    # 1. Instantiate a converter. The default (sparse=True) returns a sparse
    #    matrix that records only the non-zero values and their positions,
    #    which saves memory and speeds up loading; sparse=False returns a
    #    dense array instead.
    transfer = DictVectorizer(sparse=False)
    # Typical scenario: the data set has many categorical features, so each
    # record is converted to a dictionary and DictVectorizer encodes it.
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    print("Feature names:\n", transfer.get_feature_names_out())
    return None
```
Note the result when the sparse=False parameter is omitted: a sparse matrix that only lists the non-zero positions. Since that is not the view we want here, add the parameter to get a dense array. The encoding technique used is called "one-hot" encoding.

Conclusion

One-hot encoding is applied to categorical features.
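To make the one-hot result concrete, here is a small sketch (assuming scikit-learn is installed) comparing the sparse and dense return formats:

```python
from sklearn.feature_extraction import DictVectorizer

data = [{'city': 'Beijing', 'temperature': 100},
        {'city': 'Shanghai', 'temperature': 60},
        {'city': 'shenzhen', 'temperature': 30}]

# sparse=True (the default) returns a scipy sparse matrix that stores only
# the non-zero entries together with their (row, column) positions
sparse_result = DictVectorizer(sparse=True).fit_transform(data)
print(sparse_result)

# sparse=False returns a dense array where the one-hot columns are visible:
# one 0/1 column per 'city' value, plus the numeric 'temperature' column
dense = DictVectorizer(sparse=False).fit_transform(data)
print(dense)
```

The dense array has one column per distinct city plus one for temperature, so its shape here is (3, 4); both formats hold the same values.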
Text feature extraction

Purpose: vectorize text data.

- sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
  - Returns a word frequency matrix
- CountVectorizer.fit_transform(X): X is text or an iterable of text strings; returns a sparse matrix
- CountVectorizer.inverse_transform(X): X is an array or sparse matrix; returns the data in its pre-conversion format
- CountVectorizer.get_feature_names_out(): returns a list of words
- sklearn.feature_extraction.text.TfidfVectorizer
Application

Extract features from the following data:

```python
data = ["life is short, i like python",
        "life is too long i dislike python"]
```
Process analysis

- Instantiate CountVectorizer
- Call the fit_transform() method with the data and convert it (note the return format: use toarray() to convert the sparse matrix to an array)
```python
from sklearn.feature_extraction.text import CountVectorizer


def count_demo():
    """Text feature extraction."""
    data = ["life is short, i like python",
            "life is too long i dislike python"]
    # 1. Instantiate a converter class
    transfer = CountVectorizer()
    # Stop-word demo:
    # transfer = CountVectorizer(stop_words=["is", "too"])
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None
```
Q: What happens if we replace the data with Chinese?

English words are separated by spaces, so a word-segmentation effect comes for free; Chinese has no such separators, so we must segment the text ourselves.

The following code requires the text to be separated by spaces ahead of time:
```python
from sklearn.feature_extraction.text import CountVectorizer


def count_chinese_demo():
    """Chinese text feature extraction (text segmented by hand)."""
    # Unsegmented text: CountVectorizer cannot split it into words
    data = ["我爱北京天安门", "天安门上太阳升"]
    # The same sentences pre-segmented with spaces
    data2 = ["我 爱 北京 天安门", "天安门 上 太阳 升"]
    # 1. Instantiate a converter class
    transfer = CountVectorizer()
    # 2. Call fit_transform() on the pre-segmented text
    data_new = transfer.fit_transform(data2)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None
```
For a better approach, see the following scheme.

jieba word segmentation

- jieba.cut()
  - Returns a generator of words

The jieba library must be installed first:

```shell
pip install jieba
```
Case analysis

```python
data = ["In the past two months, I've spent an hour talking one-to-one with more than 60 people.",
        "They are mostly friends who want to try to monetize a side business.",
        "From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]
```
Analysis

- Prepare the sentences and segment them with jieba.cut
- Instantiate CountVectorizer
- Join each segmentation result into a space-separated string and use it as the input to fit_transform
```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer


def count_word(text):
    """Segment Chinese text, e.g. '我爱北京天安门' -> '我 爱 北京 天安门'."""
    # Join the generator returned by jieba.cut with spaces
    a = " ".join(list(jieba.cut(text)))
    print(a)
    return a


def count_chinese_demo2():
    """Chinese text feature extraction with automatic word segmentation."""
    data = ["In the past two months, I've spent an hour talking one-to-one with more than 60 people.",
            "They are mostly friends who want to try to monetize a side business.",
            "From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]
    # 1. Instantiate a converter class (the stop word below is carried over
    #    from the original Chinese text of this example)
    transfer = CountVectorizer(stop_words=["From Mama."])
    # 2. Call fit_transform() on the segmented sentences
    data_new = transfer.fit_transform(count_word(item) for item in data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None
```
Question: how should we handle a word or phrase that appears many times across many different articles?
Tf-idf text feature extraction

The main idea of TF-IDF: if a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good discriminating power and to be well suited for classification.

Purpose of TF-IDF: to evaluate how important a word is to a document in a document set or corpus.

Formula

- Term frequency (tf): the frequency with which a given word appears in the document.
- Inverse document frequency (idf): a measure of how much discriminating information a word carries. The idf of a given term is the base-10 logarithm of the total number of documents divided by the number of documents containing the term:

tfidf(t, d) = tf(t, d) × idf(t), where idf(t) = lg(total documents / documents containing t)

The final product can be understood as the word's degree of importance to the document.

Example: if a document contains 100 words in total and the word "very" appears 5 times, the term frequency of "very" in that document is 5/100 = 0.05. If "very" appears in 10,000 documents out of a total of 10,000,000, its inverse document frequency is lg(10,000,000 / 10,000) = 3. The tf-idf score of "very" for this document is therefore 0.05 × 3 = 0.15.
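The worked example can be checked in a few lines (numbers taken from the example above; note that sklearn's TfidfVectorizer uses a smoothed natural-log variant, so its values differ slightly from this textbook formula):

```python
import math

tf = 5 / 100                     # "very" occurs 5 times in a 100-word document
n_docs = 10_000_000              # total number of documents in the corpus
docs_with_term = 10_000          # documents containing "very"

idf = math.log10(n_docs / docs_with_term)  # lg(10,000,000 / 10,000) = 3
tf_idf = tf * idf                          # 0.05 * 3 = 0.15
print(tf_idf)
```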
Case

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_demo():
    """Text feature extraction with the TF-IDF method."""
    data = ["In the past two months, I've spent an hour talking one-to-one with more than 60 people.",
            "They are mostly friends who want to try to monetize a side business.",
            "From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]
    # The stop word is carried over from the original Chinese text
    transfer = TfidfVectorizer(stop_words=["From Mama."])
    data_new = transfer.fit_transform(count_word(item) for item in data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None
```
The importance of TF-IDF

TF-IDF is used to preprocess the data in the early stage of article classification, before a classification machine learning algorithm is applied.