Contents

  • Datasets
    • Available datasets

    • sklearn datasets

  • Feature extraction
    • Dictionary feature extraction

    • Text feature extraction

  • Feature preprocessing
    • Dimensionless scaling
      • Normalization
      • Standardization
  • Feature dimensionality reduction
    • Feature selection

    • Principal Component Analysis (PCA)

Datasets

Here are some examples of what counts as a data set:

  • A table or CSV file that contains some data

  • An organized collection of tables

  • A file in a proprietary format that contains data

  • A group of files that together constitute a meaningful data set

  • Structured objects that contain data in other formats that you might want to load into special tools for processing

  • An image that captures the data

  • Files related to machine learning, such as trained parameters or neural network structure definitions

  • Anything that looks like a data set

scikit-learn (imported as sklearn) is a powerful third-party Python machine learning library that covers everything from data preprocessing to model training. Using scikit-learn in practice saves a great deal of coding time and reduces the amount of code we write, letting us spend more time analyzing data distributions, tuning models, and adjusting hyperparameters.
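As a quick, optional illustration of working with sklearn's built-in datasets (the iris dataset here is chosen arbitrarily and is not part of the walkthrough below):

from sklearn.datasets import load_iris

# Load one of the small datasets bundled with scikit-learn
iris = load_iris()
print(iris.data.shape)      # (150, 4) feature matrix
print(iris.target.shape)    # (150,) class labels
print(iris.feature_names)   # names of the four features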

Feature extraction

Goals

  • Use DictVectorizer to numericize dictionary (categorical) features

  • Use CountVectorizer to numericize text features

  • Use TfidfVectorizer to numericize text features

  • Understand the difference between the two text feature extraction methods

Definition

Feature extraction is the conversion of arbitrary data, such as text or images, into numerical features that can be used for machine learning

Note: converting data to feature values helps the computer understand the data better

  • Dictionary feature extraction (feature discretization)

  • Text feature extraction

  • Image feature extraction (deep learning)

Feature extraction API

sklearn.feature_extraction

Dictionary feature extraction

Purpose: vectorize (numericize) dictionary data

  • sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
    • DictVectorizer.fit_transform(X) X: dictionary or an iterator of dictionaries. Return value: sparse matrix
    • DictVectorizer.inverse_transform(X) X: array or sparse matrix. Return value: data in the format before conversion
    • DictVectorizer.get_feature_names() returns the category names

Application

Feature extraction is carried out on the following data

    data = [{'city': 'Beijing', 'temperature': 100}, {'city': 'Shanghai', 'temperature': 60}, {'city': 'shenzhen', 'temperature': 30}]

Process analysis

  • Instantiate the class DictVectorizer

  • Call the fit_transform method to input the data and transform it (note the return format)

from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    """Dictionary feature extraction."""
    data = [{'city': 'Beijing', 'temperature': 100}, {'city': 'Shanghai', 'temperature': 60}, {'city': 'shenzhen', 'temperature': 30}]
    # 1. Instantiate a converter. By default it returns a sparse matrix, which stores only
    #    the non-zero values and their positions to save memory and speed up loading.
    transfer = DictVectorizer(sparse=False)

    # Typical scenario: the data set has many categorical features -> convert it to a list
    # of dictionaries -> let DictVectorizer do the transformation.

    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    print("Feature names:\n", transfer.get_feature_names())
    return None

Notice the result when the sparse=False parameter is omitted: the converter returns a sparse matrix.

That sparse representation is not what we want to see here, so we pass the parameter to get a dense result. The data processing technique used here is called "one-hot" encoding.
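For reference, a sketch of what the two output forms look like for the data above (the exact formatting and column order come from DictVectorizer; treat this as an illustration rather than verbatim program output):

# sparse=True (default): a sparse matrix listing only the non-zero entries,
# e.g. coordinate-style output such as "(0, 1)  1.0".
# sparse=False: the same information as a dense one-hot array:
#
#   feature names: ['city=Beijing', 'city=Shanghai', 'city=shenzhen', 'temperature']
#   data_new:
#   [[  1.   0.   0. 100.]
#    [  0.   1.   0.  60.]
#    [  0.   0.   1.  30.]]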

Conclusion

One-hot encoding is performed on features that carry category information

Text feature extraction

Purpose: vectorize text data

  • sklearn.feature_extraction.text.CountVectorizer(stop_words=[])

    • Returns the word frequency matrix
  • CountVectorizer.fit_transform(X) X: text or an iterable of text strings. Return value: sparse matrix

  • CountVectorizer.inverse_transform(X) X: array or sparse matrix. Return value: data in the format before conversion

  • CountVectorizer.get_feature_names() Return value: list of words

  • sklearn.feature_extraction.text.TfidfVectorizer

Application

Feature extraction is carried out on the following data

data = ["life is short, i like python"."life is too long i dislike python"]

Process analysis

  • Instantiate CountVectorizer

  • Call fit_transform to input the data and transform it (note the return format: use toarray() to convert the sparse matrix to an array)

from sklearn.feature_extraction.text import CountVectorizer


def count_demo():
    """Text feature extraction."""
    data = ["life is short, i like python", "life is too long i dislike python"]
    # 1. Instantiate a converter class
    transfer = CountVectorizer()
    # Demo of stop words:
    # transfer = CountVectorizer(stop_words=["is", "too"])

    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
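For reference, roughly what the word-frequency matrix looks like for these two sentences (single-letter tokens such as "i" are dropped by CountVectorizer's default token pattern; treat the exact output as illustrative):

# vocabulary (sorted): ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
# data_new.toarray():
# [[0 1 1 1 0 1 1 0]     <- "life is short, i like python"
#  [1 1 1 0 1 1 0 1]]    <- "life is too long i dislike python"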

Q: What happens if we replace the data with Chinese text?

English is split on whitespace by default, which already gives a word-segmentation effect, so for Chinese we need to segment the words ourselves.

The following code requires the text to be segmented with spaces ahead of time.

def count_chinese_demo():
    """Chinese text feature extraction."""
    data = ["I love Tian'anmen Square in Beijing", "The sun rises over Tian'anmen."]
    data2 = ["I love Tian'anmen Square in Beijing", "The sun rises over Tian'anmen."]
    # 1. Instantiate a converter class
    transfer = CountVectorizer()

    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

For a better way to deal with it, see the following scheme

Jieba word segmentation processing

  • jieba.cut()
    • Returns a generator composed of words

The jieba library needs to be installed

pip install jieba
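A minimal sketch of what jieba.cut returns; the segmentation shown in the comments is typical output, but the exact result can vary with the jieba version and dictionary:

import jieba

words = jieba.cut("我爱北京天安门")            # returns a generator of words
print(list(words))                             # e.g. ['我', '爱', '北京', '天安门']
print(" ".join(jieba.cut("我爱北京天安门")))    # e.g. '我 爱 北京 天安门'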

Case analysis

    data = ["In the past two months, I've spent an hour talking to more than 60 people, one-to-one."."They are mostly friends who want to try and monetize their side business."."From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]

Analysis

  • Prepare the sentences and segment them with jieba.cut

  • Instantiate CountVectorizer

  • Join the segmented words into a space-separated string and use it as the input to fit_transform

import jieba


def count_word(text):
    """Segment the text with jieba and join the words with spaces."""
    a = " ".join(list(jieba.cut(text)))
    print(a)
    return a


def count_chinese_demo2():
    """Chinese text feature extraction with automatic word segmentation."""
    data = ["In the past two months, I've spent an hour talking to more than 60 people, one-to-one.", "They are mostly friends who want to try and monetize their side business.", "From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]
    # 1. Instantiate a converter class
    transfer = CountVectorizer(stop_words=["From Mama."])

    # 2. Call fit_transform() on the segmented sentences
    data_new = transfer.fit_transform(count_word(item) for item in data)

    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

Question: how should we handle a word or phrase that appears frequently across many articles?

TF-IDF text feature extraction

The main idea of TF-IDF: if a word or phrase appears with high probability in one article but rarely in other articles, it is considered to have good discriminating power and to be suitable for classification.

Purpose of TF-IDF: to evaluate how important a word is to a document in a document collection or corpus.

The formula

Term frequency (TF) refers to the frequency of occurrence of a given word in the document

Inverse document frequency (IDF) is a measure of how much general importance a word carries. The IDF of a given term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the base-10 logarithm of the quotient.

tfidf = tf × idf

The final value can be understood as the term's degree of importance for the document.

Note: if a document contains 100 words in total and the word "very" appears 5 times, then the term frequency of "very" in that document is 5/100 = 0.05. The inverse document frequency (IDF) is obtained from the total number of documents in the collection divided by the number of documents containing the word "very". So if "very" appears in 10,000 documents and the collection contains 10,000,000 documents, the inverse document frequency is lg(10,000,000 / 10,000) = 3. Finally, the TF-IDF score of "very" for this document is 0.05 × 3 = 0.15.
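A minimal sketch that reproduces the arithmetic in the note above in plain Python (note that sklearn's TfidfVectorizer uses a smoothed, normalized variant of this formula, so its numbers will differ):

import math

tf = 5 / 100                             # "very" appears 5 times in a 100-word document
idf = math.log10(10_000_000 / 10_000)    # 10,000 of 10,000,000 documents contain "very"
print(tf)                 # 0.05
print(idf)                # 3.0
print(round(tf * idf, 2)) # 0.15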

Case

from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_demo():
    """Text feature extraction with the TF-IDF method."""
    data = ["In the past two months, I've spent an hour talking to more than 60 people, one-to-one.", "They are mostly friends who want to try and monetize their side business.", "From first-tier cities to third-tier cities, from mothers to workers, from the workplace to the system."]
    transfer = TfidfVectorizer(stop_words=["From Mama."])
    data_new = transfer.fit_transform(count_word(item) for item in data)

    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

The importance of TF-IDF

TF-IDF is an important preprocessing step when classifying articles with machine-learning classification algorithms.

Feature preprocessing

Goals

  • Understand the characteristics of numerical data and categorical data

  • MinMaxScaler is used to normalize the feature data

  • StandardScaler is used to standardize the feature data

What is feature preprocessing

Feature preprocessing: the process of converting feature data, through certain transformation functions, into feature data that is more suitable for the algorithm and model


What it covers

Dimensionless scaling of numerical data:

  • Normalization

  • Standardization

Feature preprocessing API

sklearn.preprocessing

Why do we normalize/standardize?

The units or scales of different features often differ greatly, or the variance of one feature may be several orders of magnitude larger than that of the others. That feature then easily dominates the target result and prevents some algorithms from learning from the other features.

We need a dimensionless-scaling method to convert data of different scales to the same scale.

Normalization

Definition

Transforms the original data to map it into a given interval (default [0, 1])

The formula

For each column:

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

where max is the maximum of the column, min is the minimum of the column, X'' is the final result, and mx and mi are the upper and lower bounds of the target interval (by default mx = 1 and mi = 0).
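A small sketch applying the formula by hand to a single column with the default interval (mx = 1, mi = 0); this mirrors what MinMaxScaler does column by column:

import numpy as np

x = np.array([90.0, 60.0, 75.0])              # one feature column
x_std = (x - x.min()) / (x.max() - x.min())   # X' = (x - min) / (max - min)
mx, mi = 1, 0                                 # bounds of the target interval
x_scaled = x_std * (mx - mi) + mi             # X'' = X' * (mx - mi) + mi
print(x_scaled)                               # [1.  0.  0.5]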

API

  • sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), ...)
    • MinMaxScaler.fit_transform(X)

      • X: numpy array data of shape [n_samples, n_features]
    • Return value: the converted array with the same shape

Data calculation

We apply the calculation to the data in dating.txt, which contains the dating data from earlier.

milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
38344,1.669788,0.134296,1
75136,13.147394,0.428964,1

Analysis

  • Instantiate MinMaxScaler

  • Call fit_transform

import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_demo():
    """Normalization."""
    # 1. Get the data
    data = pd.read_csv("dating.txt")
    data = data.iloc[:, :3]
    print(data)

    # 2. Instantiate a converter class
    transform = MinMaxScaler()
    # transform = MinMaxScaler(feature_range=[2, 3])

    # 3. Call fit_transform()
    data_new = transform.fit_transform(data)
    print("data_new:\n", data_new)
    return None

Normalization summary

Note that the maximum and minimum values can change and are easily affected by outliers, so this method is not very robust and is only suitable for traditional, precise, small-data scenarios.

Standardization

Definition

Transforms the original data so that each column has a mean of 0 and a standard deviation of 1

The formula

For each column:

X' = (x - mean) / σ

where mean is the column mean and σ is the column standard deviation.

Returning to the earlier point about outliers, let's compare the two approaches again:

  • Normalization: if an outlier changes the maximum or minimum, the result obviously changes

  • Standardization: with a sufficient amount of data, a small number of outliers has little effect on the mean, so the variance changes only slightly
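A small sketch illustrating the two bullet points, assuming a synthetic column with many normal samples plus one extreme outlier (the exact numbers depend on the random seed):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
clean = rng.normal(50, 10, size=(10_000, 1))   # plenty of "normal" samples
dirty = np.vstack([clean, [[500.0]]])          # the same samples plus one outlier

# Spread (max - min) of the normal samples after scaling, with and without the outlier
for scaler in (MinMaxScaler(), StandardScaler()):
    spread_clean = np.ptp(scaler.fit_transform(clean))
    spread_dirty = np.ptp(scaler.fit_transform(dirty)[:-1])
    print(type(scaler).__name__, round(spread_clean, 2), "->", round(spread_dirty, 2))

# Typical result: min-max scaling squeezes the normal samples from a spread of 1.0
# down to roughly 0.15, because the outlier alone defines the new maximum, while the
# standardized spread shrinks only slightly, since one outlier has only a modest effect
# on the mean and standard deviation of 10,000 samples.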

API

  • sklearn.preprocessing.StandardScaler( )
    • Transforms each column so that the data is centered on mean 0 with standard deviation 1
    • StandardScaler.fit_transform(X)
      • X: numpy array data of shape [n_samples, n_features]
    • Return value: the converted array with the same shape

Data calculation

The following data is processed in the same way

[[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]]

Analysis

  • Instantiate StandardScaler

  • Call fit_transform

from sklearn.preprocessing import StandardScaler


def stand_demo():
    """Standardization."""
    # 1. Get the data
    data = pd.read_csv("dating.txt")
    data = data.iloc[:, :3]
    print(data)

    # 2. Instantiate a converter class
    transform = StandardScaler()

    # 3. Call fit_transform()
    data_new = transform.fit_transform(data)
    print("data_new:\n", data_new)
    return None

Standardization summary

When there are enough samples, standardization is relatively stable and is suitable for modern, noisy, big-data scenarios.

Feature dimensionality reduction

Goals

  • Know the filter, embedded, and wrapper methods of feature selection

  • Use VarianceThreshold to remove low-variance features

  • Understand the characteristics and calculation of the correlation coefficient

  • Apply the correlation coefficient to implement feature selection

Dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints to obtain a set of "uncorrelated" principal variables

  • Reduce the number of random variables

  • Correlated features: e.g. the correlation between relative humidity and rainfall

Because we learn from features during training, problems in the features themselves, or strong correlation between features, will have a great impact on how the algorithm learns and predicts

Two ways to reduce dimension

  • Feature selection

  • Principal component analysis (which can be understood as a feature extraction method)

Feature selection

What is feature selection

Definition: the data contains redundant or related variables (also called features, attributes, indicators, etc.); feature selection aims to find the main features among the original features.

Methods:

  • Filter: mainly examines the features themselves, the correlation between features, and the correlation between features and the target values

    • Variance selection: low-variance feature filtering
    • Correlation coefficient
  • Embedded: the algorithm automatically selects features (based on the association between features and the target values)

    • Decision trees: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolution, etc.
  • Wrapper

The module

sklearn.feature_selection

Filter methods

Low variance feature filtering

Remove features with low variance. Think about what different magnitudes of variance mean:

  • Small feature variance: most samples have similar values for that feature

  • Large feature variance: many samples have different values for that feature

API

  • sklearn.feature_selection.VarianceThreshold(threshold=0.0)
    • Removes all low-variance features
    • VarianceThreshold.fit_transform(X)
      • X: numpy array data of shape [n_samples, n_features]
      • Return value: features whose variance in the training set is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. it removes the features that have the same value in every sample.
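A minimal, self-contained sketch of VarianceThreshold on made-up toy data (the real stock data is used in the demo below):

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

# With the default threshold=0.0, columns whose value never changes
# (the first and the last) have zero variance and are removed.
selector = VarianceThreshold()
print(selector.fit_transform(X))
# [[2 0]
#  [1 4]
#  [1 1]]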

Data calculation

We filter some of the indicator features of certain stocks.

The full set of features:

pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,262.7461,7.0003,517045520.0,0.5678,0.5943,770517752.56,0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
4,000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,0.0026808154146886697
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """Filter out low-variance features."""
    # 1. Get the data
    data = pd.read_csv("factor_returns.csv")
    data = data.iloc[:, 1:-2]
    print(data)

    # 2. Instantiate a converter
    transfer = VarianceThreshold(threshold=5)

    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new", data_new, data_new.shape)
    return None


if __name__ == '__main__':
    # Low variance feature filtering
    variance_demo()

The correlation coefficient

Pearson correlation coefficient: a statistical indicator reflecting how closely two variables are correlated

Formula calculation example (understand it, no need to memorize it)

Formula:

r = (n Σxy − Σx Σy) / sqrt( (n Σx² − (Σx)²) (n Σy² − (Σy)²) )

For example, computing the correlation coefficient between annual advertising expenditure and average monthly sales gives

r = 0.9942

so we conclude that there is a high positive correlation between advertising expenditure and average monthly sales.

Characteristics

The value of the correlation coefficient lies between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties are as follows:

  • When r>0, the two variables are positively correlated; when r<0, the two variables are negatively correlated

  • When the | r | = 1, said two variables for completely related, when r = 0, said no correlation between two variables

  • When the 0 < | r | < 1, said the two variables are related to a certain extent. And the closer the | r | 1, linear relationship between the two variables are more closely; | r | is close to 0, said two variables linear correlation is weak

According to three-tiered commonly: | r | < 0.4 for low-grade related; 0.4 or less | r | < 0.7 is significant related; 0.7 or less | r | < 1 for highly linear correlation

The absolute value of the symbol: | r | r, | | – 5 = 5

API

from scipy.stats import pearsonr
# x: (N,) array_like
# y: (N,) array_like
# Returns: (Pearson's correlation coefficient, p-value)
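A short usage sketch, assuming the factor_returns.csv file from the variance-filtering example above is available; the column pairs are chosen arbitrarily for illustration:

import pandas as pd
from scipy.stats import pearsonr

data = pd.read_csv("factor_returns.csv")

# Correlation between pairs of stock indicators
r, p_value = pearsonr(data["pe_ratio"], data["pb_ratio"])
print("pe_ratio vs pb_ratio: r =", r, "p =", p_value)

r, p_value = pearsonr(data["revenue"], data["total_expense"])
print("revenue vs total_expense: r =", r, "p =", p_value)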

Principal component analysis

Goals

  • Apply PCA to reduce the dimensionality of features

  • Application: principal component analysis of user and item categories

What is Principal Component Analysis (PCA)

Definition: the process of transforming high-dimensional data into low-dimensional data; in the process some of the original data may be discarded and new variables created

Purpose: compress the data dimensions, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.

Application: regression analysis or cluster analysis


API

  • sklearn.decomposition.PCA(n_components=None)
    • Projects the data into a lower-dimensional space
    • n_components:
      • Decimal: the fraction of information (variance) to retain
      • Integer: the number of features to reduce to
    • PCA.fit_transform(X) X: numpy array data of shape [n_samples, n_features]
    • Return value: the converted array with the specified number of dimensions

Data calculation

[[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
from sklearn.decomposition import PCA


def pca():
    """Dimensionality reduction with principal component analysis."""
    # Retain 70% of the information (variance)
    transfer = PCA(n_components=0.7)
    data = transfer.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)

    return None

Case: exploring users' preferences for item categories with dimensionality reduction

Data

  • order_products__prior.csv: order and product information

    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information

    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: user order information

    • Fields: order_id, user_id, eval_set, order_number, ...
  • aisles.csv: the specific category each product belongs to

    • Fields: aisle_id, aisle

Analysis

  • Join the tables so that user_id and aisle end up in the same table

  • Build a cross table (pd.crosstab)

  • Apply dimensionality reduction with PCA

import pandas as pd
from sklearn.decomposition import PCA


def pca_case_study():
    """Principal component analysis case study."""
    # 1. Read the data of the four tables
    prior = pd.read_csv("./instacart/order_products__prior.csv")
    products = pd.read_csv("./instacart/products.csv")
    orders = pd.read_csv("./instacart/orders.csv")
    aisles = pd.read_csv("./instacart/aisles.csv")

    print(prior)

    # 2. Merge the four tables
    mt = pd.merge(prior, products, on="product_id")
    mt1 = pd.merge(mt, orders, on="order_id")
    mt2 = pd.merge(mt1, aisles, on="aisle_id")

    # 3. pd.crosstab counts the co-occurrences of users and item categories
    cross = pd.crosstab(mt2['user_id'], mt2['aisle'])

    # 4. Run PCA, retaining 95% of the information (variance)
    pc = PCA(n_components=0.95)
    data_new = pc.fit_transform(cross)
    print("data_new:\n", data_new.shape)

    return None