Contents

  • Datasets
    • Available datasets

    • sklearn datasets

  • Feature extraction
    • Dictionary feature extraction

    • Text feature extraction

  • Feature preprocessing
    • Dimensionless scaling
      • Normalization
      • Standardization
  • Feature dimensionality reduction
    • Feature selection

    • Principal Component Analysis (PCA)

Datasets

Here are some examples of what counts as a data set:

  • A table or CSV file that contains some data

  • An organized collection of tables

  • A file in a proprietary format that contains data

  • A group of files that together constitute a meaningful data set

  • Structured objects that contain data in other formats that you might want to load into special tools for processing

  • An image that captures the data

  • Files related to machine learning, such as trained parameters or neural network structure definitions

  • Anything that looks like a data set

scikit-learn (imported as sklearn) is a powerful third-party Python machine learning library that covers everything from data preprocessing to model training. Using scikit-learn in practice saves a great deal of coding time and reduces the amount of code we write, letting us spend more time analyzing data distributions, tuning models, and adjusting hyperparameters.
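As a quick, optional illustration of working with sklearn's built-in datasets (the iris dataset here is chosen arbitrarily and is not part of the walkthrough below):

from sklearn.datasets import load_iris

# Load one of the small datasets bundled with scikit-learn
iris = load_iris()
print(iris.data.shape)      # (150, 4) feature matrix
print(iris.target.shape)    # (150,) class labels
print(iris.feature_names)   # names of the four features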

Feature extraction

Goals

  • Use DictVectorizer to numericize dictionary (categorical) features

  • Use CountVectorizer to numericize text features

  • Use TfidfVectorizer to numericize text features

  • Understand the difference between the two text feature extraction methods

Definition

Feature extraction is the conversion of arbitrary data, such as text or images, into numerical features that can be used for machine learning

Note: converting data to feature values helps the computer understand the data better

  • Dictionary feature extraction (feature discretization)

  • Text feature extraction

  • Image feature extraction (deep learning)

Feature extraction API

sklearn.feature_extraction

Dictionary feature extraction

Purpose: vectorize (numericize) dictionary data

  • sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
    • DictVectorizer.fit_transform(X) X: dictionary or an iterator of dictionaries. Return value: sparse matrix
    • DictVectorizer.inverse_transform(X) X: array or sparse matrix. Return value: data in the format before conversion
    • DictVectorizer.get_feature_names() returns the category names

Application

Feature extraction is carried out on the following data

    data = [{'city': 'Beijing', 'temperature': 100}, {'city': 'Shanghai', 'temperature': 60}, {'city': 'shenzhen', 'temperature': 30}]

Process analysis

  • Instantiate the class DictVectorizer

  • Call the fit_transform method to input the data and transform it (note the return format)

from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    """Dictionary feature extraction."""
    data = [{'city': 'Beijing', 'temperature': 100}, {'city': 'Shanghai', 'temperature': 60}, {'city': 'shenzhen', 'temperature': 30}]
    # 1. Instantiate a converter. By default it returns a sparse matrix, which stores only
    #    the non-zero values and their positions to save memory and speed up loading.
    transfer = DictVectorizer(sparse=False)

    # Typical scenario: the data set has many categorical features -> convert it to a list
    # of dictionaries -> let DictVectorizer do the transformation.

    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    print("Feature names:\n", transfer.get_feature_names())
    return None

Notice the result when the sparse=False parameter is omitted: the converter returns a sparse matrix.

That sparse representation is not what we want to see here, so we pass the parameter to get a dense result. The data processing technique used here is called "one-hot" encoding.
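For reference, a sketch of what the two output forms look like for the data above (the exact formatting and column order come from DictVectorizer; treat this as an illustration rather than verbatim program output):

# sparse=True (default): a sparse matrix listing only the non-zero entries,
# e.g. coordinate-style output such as "(0, 1)  1.0".
# sparse=False: the same information as a dense one-hot array:
#
#   feature names: ['city=Beijing', 'city=Shanghai', 'city=shenzhen', 'temperature']
#   data_new:
#   [[  1.   0.   0. 100.]
#    [  0.   1.   0.  60.]
#    [  0.   0.   1.  30.]]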

Conclusion

One-hot encoding is performed on features that carry category information

Text feature extraction

Purpose: vectorize text data

  • sklearn.feature_extraction.text.CountVectorizer(stop_words=[])

    • Returns the word frequency matrix
  • CountVectorizer.fit_transform(X) X: text or an iterable of text strings. Return value: sparse matrix

  • CountVectorizer.inverse_transform(X) X: array or sparse matrix. Return value: data in the format before conversion

  • CountVectorizer.get_feature_names() Return value: list of words

  • sklearn.feature_extraction.text.TfidfVectorizer

Application

Feature extraction is carried out on the following data

data = ["life is short, i like python"."life is too long i dislike python"]

Process analysis

  • Instantiate CountVectorizer

  • Call fit_transform to input the data and transform it (note the return format: use toarray() to convert the sparse matrix to an array)

from sklearn.feature_extraction.text import CountVectorizer


def count_demo():
    """Text feature extraction."""
    data = ["life is short, i like python", "life is too long i dislike python"]
    # 1. Instantiate a converter class
    transfer = CountVectorizer()
    # Demo of stop words:
    # transfer = CountVectorizer(stop_words=["is", "too"])

    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
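For reference, roughly what the word-frequency matrix looks like for these two sentences (single-letter tokens such as "i" are dropped by CountVectorizer's default token pattern; treat the exact output as illustrative):

# vocabulary (sorted): ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
# data_new.toarray():
# [[0 1 1 1 0 1 1 0]     <- "life is short, i like python"
#  [1 1 1 0 1 1 0 1]]    <- "life is too long i dislike python"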

Q: What happens if we replace the data with Chinese text?

English is split on whitespace by default, which already gives a word-segmentation effect, so for Chinese we need to segment the words ourselves.

The following code requires the text to be segmented with spaces ahead of time.

def count_chinese_demo():
    """Chinese text feature extraction."""
    data = ["I love Tian'anmen Square in Beijing", "The sun rises over Tian'anmen."]
    data2 = ["I love Tian'anmen Square in Beijing", "The sun rises over Tian'anmen."]
    # 1. Instantiate a converter class
    transfer = CountVectorizer()

    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

For a better way to deal with it, see the following scheme

Jieba word segmentation processing

  • jieba.cut()
    • Returns a generator composed of words

The jieba library needs to be installed

pip install jieba
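A minimal sketch of what jieba.cut returns; the segmentation shown in the comments is typical output, but the exact result can vary with the jieba version and dictionary:

import jieba

words = jieba.cut("我爱北京天安门")            # returns a generator of words
print(list(words))                             # e.g. ['我', '爱', '北京', '天安门']
print(" ".join(jieba.cut("我爱北京天安门")))    # e.g. '我 爱 北京 天安门'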

Case analysis

    data = ["In the past two months, I've spent an hour talking to more than 60 people, one-to-one."."They are mostly friends who want to try and monetize their side business."."From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]

Analysis

  • Prepare the sentences and segment them with jieba.cut

  • Instantiate CountVectorizer

  • Join the segmented words into a space-separated string and use it as the input to fit_transform

import jieba


def count_word(text):
    """Segment the text with jieba and join the words with spaces."""
    a = " ".join(list(jieba.cut(text)))
    print(a)
    return a


def count_chinese_demo2():
    """Chinese text feature extraction with automatic word segmentation."""
    data = ["In the past two months, I've spent an hour talking to more than 60 people, one-to-one.", "They are mostly friends who want to try and monetize their side business.", "From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]
    # 1. Instantiate a converter class
    transfer = CountVectorizer(stop_words=["From Mama."])

    # 2. Call fit_transform() on the segmented sentences
    data_new = transfer.fit_transform(count_word(item) for item in data)

    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

Question: how should we handle a word or phrase that appears frequently across many articles?

TF-IDF text feature extraction

The main idea of TF-IDF: if a word or phrase appears with high probability in one article but rarely in other articles, it is considered to have good discriminating power and to be suitable for classification.

Purpose of TF-IDF: to evaluate how important a word is to a document in a document collection or corpus.

The formula

Term frequency (TF) refers to the frequency of occurrence of a given word in the document

Inverse document frequency (IDF) is a measure of how much general importance a word carries. The IDF of a given term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the base-10 logarithm of the quotient.

tfidf = tf × idf

The final value can be understood as the term's degree of importance for the document.

Note: if a document contains 100 words in total and the word "very" appears 5 times, then the term frequency of "very" in that document is 5/100 = 0.05. The inverse document frequency (IDF) is obtained from the total number of documents in the collection divided by the number of documents containing the word "very". So if "very" appears in 10,000 documents and the collection contains 10,000,000 documents, the inverse document frequency is lg(10,000,000 / 10,000) = 3. Finally, the TF-IDF score of "very" for this document is 0.05 × 3 = 0.15.
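A minimal sketch that reproduces the arithmetic in the note above in plain Python (note that sklearn's TfidfVectorizer uses a smoothed, normalized variant of this formula, so its numbers will differ):

import math

tf = 5 / 100                             # "very" appears 5 times in a 100-word document
idf = math.log10(10_000_000 / 10_000)    # 10,000 of 10,000,000 documents contain "very"
print(tf)                 # 0.05
print(idf)                # 3.0
print(round(tf * idf, 2)) # 0.15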

Case

from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_demo():
    """Text feature extraction with the TF-IDF method."""
    data = ["In the past two months, I've spent an hour talking to more than 60 people, one-to-one.", "They are mostly friends who want to try and monetize their side business.", "From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]
    transfer = TfidfVectorizer(stop_words=["From Mama."])
    data_new = transfer.fit_transform(count_word(item) for item in data)

    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

The importance of TF-IDF

TF-IDF is an important preprocessing step when classifying articles with machine-learning classification algorithms.

Feature preprocessing

Goals

  • Understand the characteristics of numerical data and categorical data

  • MinMaxScaler is used to normalize the feature data

  • StandardScaler is used to standardize the feature data

What is feature preprocessing

Feature preprocessing: the process of converting feature data, through certain transformation functions, into feature data that is more suitable for the algorithm and model


What it covers

Dimensionless scaling of numerical data:

  • Normalization

  • Standardization

Feature preprocessing API

sklearn.preprocessing

Why do we normalize/standardize?

The units or scales of different features often differ greatly, or the variance of one feature may be several orders of magnitude larger than that of the others. That feature then easily dominates the target result and prevents some algorithms from learning from the other features.

We need a dimensionless-scaling method to convert data of different scales to the same scale.

Normalization

Definition

Transforms the original data to map it into a given interval (default [0, 1])

The formula

For each column:

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

where max is the maximum of the column, min is the minimum of the column, X'' is the final result, and mx and mi are the upper and lower bounds of the target interval (by default mx = 1 and mi = 0).
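A small sketch applying the formula by hand to a single column with the default interval (mx = 1, mi = 0); this mirrors what MinMaxScaler does column by column:

import numpy as np

x = np.array([90.0, 60.0, 75.0])              # one feature column
x_std = (x - x.min()) / (x.max() - x.min())   # X' = (x - min) / (max - min)
mx, mi = 1, 0                                 # bounds of the target interval
x_scaled = x_std * (mx - mi) + mi             # X'' = X' * (mx - mi) + mi
print(x_scaled)                               # [1.  0.  0.5]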

API

  • sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), ...)
    • MinMaxScaler.fit_transform(X)

      • X: numpy array data of shape [n_samples, n_features]
    • Return value: the converted array with the same shape

Data calculation

We apply the calculation to the data in dating.txt, which contains the dating data from earlier.

milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
38344,1.669788,0.134296,1
75136,13.147394,0.428964,1

Analysis

  • Instantiate MinMaxScaler

  • Call fit_transform

import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_demo():
    """Normalization."""
    # 1. Get the data
    data = pd.read_csv("dating.txt")
    data = data.iloc[:, :3]
    print(data)

    # 2. Instantiate a converter class
    transform = MinMaxScaler()
    # transform = MinMaxScaler(feature_range=[2, 3])

    # 3. Call fit_transform()
    data_new = transform.fit_transform(data)
    print("data_new:\n", data_new)
    return None

Normalization summary

Note that the maximum and minimum values can change and are easily affected by outliers, so this method is not very robust and is only suitable for traditional, precise, small-data scenarios.

Standardization

Definition

Transforms the original data so that each column has a mean of 0 and a standard deviation of 1

The formula

For each column:

X' = (x - mean) / σ

where mean is the column mean and σ is the column standard deviation.

Returning to the earlier point about outliers, let's compare the two approaches again:

  • Normalization: if an outlier changes the maximum or minimum, the result obviously changes

  • Standardization: with a sufficient amount of data, a small number of outliers has little effect on the mean, so the variance changes only slightly
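A small sketch illustrating the two bullet points, assuming a synthetic column with many normal samples plus one extreme outlier (the exact numbers depend on the random seed):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
clean = rng.normal(50, 10, size=(10_000, 1))   # plenty of "normal" samples
dirty = np.vstack([clean, [[500.0]]])          # the same samples plus one outlier

# Spread (max - min) of the normal samples after scaling, with and without the outlier
for scaler in (MinMaxScaler(), StandardScaler()):
    spread_clean = np.ptp(scaler.fit_transform(clean))
    spread_dirty = np.ptp(scaler.fit_transform(dirty)[:-1])
    print(type(scaler).__name__, round(spread_clean, 2), "->", round(spread_dirty, 2))

# Typical result: min-max scaling squeezes the normal samples from a spread of 1.0
# down to roughly 0.15, because the outlier alone defines the new maximum, while the
# standardized spread shrinks only slightly, since one outlier has only a modest effect
# on the mean and standard deviation of 10,000 samples.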

API

  • sklearn.preprocessing.StandardScaler( )
    • Transforms each column so that the data is centered on mean 0 with standard deviation 1
    • StandardScaler.fit_transform(X)
      • X: numpy array data of shape [n_samples, n_features]
    • Return value: the converted array with the same shape

Data calculation

The following data is processed in the same way

[[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]]

Analysis

  • Instantiate StandardScaler

  • Call fit_transform

from sklearn.preprocessing import StandardScaler


def stand_demo():
    """Standardization."""
    # 1. Get the data
    data = pd.read_csv("dating.txt")
    data = data.iloc[:, :3]
    print(data)

    # 2. Instantiate a converter class
    transform = StandardScaler()

    # 3. Call fit_transform()
    data_new = transform.fit_transform(data)
    print("data_new:\n", data_new)
    return None

Standardization summary

When there are enough samples, standardization is relatively stable and is suitable for modern, noisy, big-data scenarios.

Feature dimensionality reduction

Goals

  • Know the filter, embedded, and wrapper methods of feature selection

  • Use VarianceThreshold to remove low-variance features

  • Understand the characteristics and calculation of the correlation coefficient

  • Apply the correlation coefficient to implement feature selection

Dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints to obtain a set of "uncorrelated" principal variables

  • Reduce the number of random variables

  • Correlated features: e.g. the correlation between relative humidity and rainfall

Because we learn from features during training, problems in the features themselves, or strong correlation between features, will have a great impact on how the algorithm learns and predicts

Two ways to reduce dimension

  • Feature selection

  • Principal component analysis (which can be understood as a feature extraction method)

Feature selection

What is feature selection

Definition: the data contains redundant or related variables (also called features, attributes, indicators, etc.); feature selection aims to find the main features among the original features.

Methods:

  • Filter: mainly examines the features themselves, the correlation between features, and the correlation between features and the target values

    • Variance selection: low-variance feature filtering
    • Correlation coefficient
  • Embedded: the algorithm automatically selects features (based on the association between features and the target values)

    • Decision trees: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolution, etc.
  • Wrapper

The module

sklearn.feature_selection

Filter methods

Low variance feature filtering

Remove features with low variance. Think about what different magnitudes of variance mean:

  • Small feature variance: most samples have similar values for that feature

  • Large feature variance: many samples have different values for that feature

API

  • sklearn.feature_selection.VarianceThreshold(threshold=0.0)
    • Removes all low-variance features
    • VarianceThreshold.fit_transform(X)
      • X: numpy array data of shape [n_samples, n_features]
      • Return value: features whose variance in the training set is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. it removes the features that have the same value in every sample.
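A minimal, self-contained sketch of VarianceThreshold on made-up toy data (the real stock data is used in the demo below):

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

# With the default threshold=0.0, columns whose value never changes
# (the first and the last) have zero variance and are removed.
selector = VarianceThreshold()
print(selector.fit_transform(X))
# [[2 0]
#  [1 4]
#  [1 1]]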

Data calculation

We filter some of the indicator features of certain stocks.

The full set of features:

pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,262.7461,7.0003,517045520.0,0.5678,0.5943,770517752.56,0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
4,000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,0.0026808154146886697
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """Filter out low-variance features."""
    # 1. Get the data
    data = pd.read_csv("factor_returns.csv")
    data = data.iloc[:, 1:-2]
    print(data)

    # 2. Instantiate a converter
    transfer = VarianceThreshold(threshold=5)

    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new", data_new, data_new.shape)
    return None


if __name__ == '__main__':
    # Low variance feature filtering
    variance_demo()

The correlation coefficient

Pearson correlation coefficient: a statistical indicator reflecting how closely two variables are correlated

Formula calculation example (understand it, no need to memorize it)

Formula:

r = (n Σxy − Σx Σy) / sqrt( (n Σx² − (Σx)²) (n Σy² − (Σy)²) )

For example, computing the correlation coefficient between annual advertising expenditure and average monthly sales gives

r = 0.9942

so we conclude that there is a high positive correlation between advertising expenditure and average monthly sales.

Characteristics

The value of the correlation coefficient lies between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties are as follows:

  • When r>0, the two variables are positively correlated; when r<0, the two variables are negatively correlated

  • When the | r | = 1, said two variables for completely related, when r = 0, said no correlation between two variables

  • When the 0 < | r | < 1, said the two variables are related to a certain extent. And the closer the | r | 1, linear relationship between the two variables are more closely; | r | is close to 0, said two variables linear correlation is weak

According to three-tiered commonly: | r | < 0.4 for low-grade related; 0.4 or less | r | < 0.7 is significant related; 0.7 or less | r | < 1 for highly linear correlation

The absolute value of the symbol: | r | r, | | – 5 = 5

API

from scipy.stats import pearsonr
# x: (N,) array_like
# y: (N,) array_like
# Returns: (Pearson's correlation coefficient, p-value)
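A short usage sketch, assuming the factor_returns.csv file from the variance-filtering example above is available; the column pairs are chosen arbitrarily for illustration:

import pandas as pd
from scipy.stats import pearsonr

data = pd.read_csv("factor_returns.csv")

# Correlation between pairs of stock indicators
r, p_value = pearsonr(data["pe_ratio"], data["pb_ratio"])
print("pe_ratio vs pb_ratio: r =", r, "p =", p_value)

r, p_value = pearsonr(data["revenue"], data["total_expense"])
print("revenue vs total_expense: r =", r, "p =", p_value)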

Principal component analysis

Goals

  • Apply PCA to reduce the dimensionality of features

  • Application: principal component analysis of user and item categories

What is Principal Component Analysis (PCA)

Definition: the process of transforming high-dimensional data into low-dimensional data; in the process some of the original data may be discarded and new variables created

Purpose: compress the data dimensions, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.

Application: regression analysis or cluster analysis


API

  • sklearn.decomposition.PCA(n_components=None)
    • Projects the data into a lower-dimensional space
    • n_components:
      • Decimal: the fraction of information (variance) to retain
      • Integer: the number of features to reduce to
    • PCA.fit_transform(X) X: numpy array data of shape [n_samples, n_features]
    • Return value: the converted array with the specified number of dimensions

Data calculation

[[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
from sklearn.decomposition import PCA


def pca():
    """Dimensionality reduction with principal component analysis."""
    # Retain 70% of the information (variance)
    transfer = PCA(n_components=0.7)
    data = transfer.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)

    return None

Case: exploring users' preferences for item categories with dimensionality reduction

Data

  • order_products__prior.csv: order and product information

    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information

    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: user order information

    • Fields: order_id, user_id, eval_set, order_number, ...
  • aisles.csv: the specific category each product belongs to

    • Fields: aisle_id, aisle

Analysis

  • Join the tables so that user_id and aisle end up in the same table

  • Build a cross table (pd.crosstab)

  • Apply dimensionality reduction with PCA

import pandas as pd
from sklearn.decomposition import PCA


def pca_case_study():
    """Principal component analysis case study."""
    # 1. Read the data of the four tables
    prior = pd.read_csv("./instacart/order_products__prior.csv")
    products = pd.read_csv("./instacart/products.csv")
    orders = pd.read_csv("./instacart/orders.csv")
    aisles = pd.read_csv("./instacart/aisles.csv")

    print(prior)

    # 2. Merge the four tables
    mt = pd.merge(prior, products, on="product_id")
    mt1 = pd.merge(mt, orders, on="order_id")
    mt2 = pd.merge(mt1, aisles, on="aisle_id")

    # 3. pd.crosstab counts the co-occurrences of users and item categories
    cross = pd.crosstab(mt2['user_id'], mt2['aisle'])

    # 4. Run PCA, retaining 95% of the information (variance)
    pc = PCA(n_components=0.95)
    data_new = pc.fit_transform(cross)
    print("data_new:\n", data_new.shape)

    return None