Feature dimension reduction

Objectives

  • Know the filter, wrapper, and embedded methods of feature selection

  • Use VarianceThreshold to remove low variance features

  • Understand the characteristics and calculation of correlation coefficients

  • Apply the correlation coefficient to perform feature selection

Dimension reduction

Dimension reduction is the process of reducing the number of random variables (features) under certain constraints to obtain a set of "uncorrelated" principal variables

  • Reduce the number of random variables

  • Correlated features: for example, relative humidity and rainfall are correlated

When we train a model, we learn from the features. If a feature itself is problematic, or features are strongly correlated with one another, this has a large impact on how the algorithm learns and predicts

Two ways to reduce dimension

  • Feature selection

  • Principal component analysis (which can be understood as a form of feature extraction)

Feature selection

What is feature selection

Definition: the data often contains redundant or irrelevant variables (also called features, attributes, indicators, etc.); feature selection aims to identify the main features from the original features.

Methods:

  • Filter: mainly examines the features themselves, the correlation between features, and the correlation between features and the target value

    • Variance selection method: low variance feature filtering
    • Correlation coefficient
  • Embedded: the algorithm automatically selects features (based on associations between features and the target value)

    • Decision tree: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolution, etc.
  • Wrapper

The module

sklearn.feature_selection

Filter methods

Low variance feature filtering

Low-variance feature filtering removes features whose variance is small. Consider what the magnitude of the variance means:

  • Small feature variance: most samples have similar values for that feature

  • Large feature variance: samples take many different values for that feature

API

  • sklearn.feature_selection.VarianceThreshold(threshold=0.0)
    • Removes all low-variance features
    • VarianceThreshold.fit_transform(X)
      • X: numpy array data of shape [n_samples, n_features]
      • Return value: features whose training-set variance is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. it removes the features that have the same value in every sample.
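As a quick illustration before the stock example below, here is a minimal sketch with made-up numbers showing the default behaviour (threshold=0.0), which only drops features that are constant across all samples:

from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Toy data: the first and last columns are constant (zero variance)
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

transfer = VarianceThreshold()  # threshold=0.0 by default
print(transfer.fit_transform(X))
# Only the two varying middle columns remain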

Data calculation

We filter the indicator features of certain stocks.

The features are:

pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
000008.XSHE,262.7461,7.0003,517045520.0,0.5678,0.5943,770517752.56,0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,0.0026808154146886697
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """Filter out low-variance features"""
    # 1. Get the data
    data = pd.read_csv("factor_returns.csv")
    data = data.iloc[:, 1:-2]  # keep only the numeric indicator columns
    print(data)

    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=5)

    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new", data_new, data_new.shape)
    return None


if __name__ == '__main__':
    # Low variance feature filtering
    variance_demo()

The correlation coefficient

Pearson correlation coefficient: a statistical indicator reflecting how closely two variables are correlated

Formula and worked example (for understanding; no need to memorize)

Formula:

r = (n·Σxy − Σx·Σy) / ( sqrt(n·Σx² − (Σx)²) · sqrt(n·Σy² − (Σy)²) )

For example, computing the correlation between annual advertising expenditure and average monthly sales gives

r = 0.9942

So we conclude that there is a strong positive correlation between advertising spending and average monthly sales.
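To make the formula concrete, here is a minimal sketch that implements it directly in plain Python. The advertising and sales figures below are hypothetical placeholders (the original example data is not reproduced here), so the printed value is only illustrative rather than the 0.9942 above:

import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed with the formula above"""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

ad_spend = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]  # hypothetical
sales = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]     # hypothetical
print(pearson_r(ad_spend, sales))  # a value close to 1, i.e. strong positive correlation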

Properties

The value of the correlation coefficient lies between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties are as follows:

  • When r>0, the two variables are positively correlated; when r<0, the two variables are negatively correlated

  • When |r| = 1, the two variables are perfectly correlated; when r = 0, there is no linear correlation between the two variables

  • When 0 < |r| < 1, the two variables are correlated to some degree. The closer |r| is to 1, the stronger the linear relationship between the two variables; the closer |r| is to 0, the weaker the linear correlation

A common three-tier rule of thumb: |r| < 0.4 indicates weak correlation; 0.4 ≤ |r| < 0.7 indicates significant correlation; 0.7 ≤ |r| < 1 indicates strong linear correlation

Here | | denotes the absolute value, e.g. |-5| = 5

API

  • scipy.stats.pearsonr(x, y)
    • x: (N,) array_like
    • y: (N,) array_like
    • Returns: (Pearson's correlation coefficient, p-value)
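As a sketch, assuming the factor_returns.csv file from the low-variance example above is available, pearsonr can be applied to pairs of the stock indicator features:

from scipy.stats import pearsonr
import pandas as pd

data = pd.read_csv("factor_returns.csv")

# Correlation between two indicator features, and the p-value
r, p_value = pearsonr(data["pe_ratio"], data["pb_ratio"])
print("pe_ratio vs pb_ratio:", r, p_value)

r, p_value = pearsonr(data["revenue"], data["total_expense"])
print("revenue vs total_expense:", r, p_value)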

Principal component analysis

Objectives

  • Apply PCA to reduce the dimensionality of features

  • Application: principal component analysis of the relationship between users and item categories

What is Principal Component Analysis (PCA)

Definition: the process of transforming high-dimensional data into low-dimensional data; in the process the original variables may be discarded and new variables created

Purpose: compress the dimensionality of the data, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.

Application: regression analysis or cluster analysis

How can we understand this process intuitively? Think of projecting two-dimensional points onto a single direction chosen so that the projection preserves as much of the original spread (variance) as possible.

API

  • sklearn.decomposition.PCA(n_components=None)
    • Decomposes the data into a lower-dimensional space
    • n_components:
      • Decimal: the fraction of information (variance) to retain
      • Integer: the number of dimensions to reduce to
    • PCA.fit_transform(X), X: numpy array data of shape [n_samples, n_features]
    • Return value: the transformed array with the specified number of dimensions
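A minimal sketch contrasting the two forms of n_components, using the same 3x4 array as the example below:

from sklearn.decomposition import PCA

X = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# Integer: keep exactly 2 components
print(PCA(n_components=2).fit_transform(X))

# Decimal: keep enough components to retain about 95% of the variance
print(PCA(n_components=0.95).fit_transform(X))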

Data calculation

[[2,8,4,5], [6,3,0,8], [5,4,9,1]]
from sklearn.decomposition import PCA


def pca_demo():
    """Principal component analysis for dimensionality reduction"""
    # Retain 70% of the information (variance)
    pca = PCA(n_components=0.7)
    data = pca.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)

    return None

Case study: explore the segmentation of users' preferences for item categories and reduce the dimensionality

Data

  • order_products__prior.csv: order and product information

    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information

    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: user order information

    • Fields: order_id, user_id, eval_set, order_number, ...
  • aisles.csv: the specific category (aisle) each product belongs to

    • Fields: aisle_id, aisle

Analysis

  • Join the tables so that user_id and aisle end up in the same table

  • Build a cross table (crosstab) of users against aisles

  • Apply PCA to reduce the dimensionality

import pandas as pd
from sklearn.decomposition import PCA


def pca_case_study():
    """Principal component analysis case study"""
    # Read the data from the four tables
    prior = pd.read_csv("./instacart/order_products__prior.csv")
    products = pd.read_csv("./instacart/products.csv")
    orders = pd.read_csv("./instacart/orders.csv")
    aisles = pd.read_csv("./instacart/aisles.csv")

    print(prior)

    # Merge the four tables
    mt = pd.merge(prior, products, on="product_id")
    mt1 = pd.merge(mt, orders, on="order_id")
    mt2 = pd.merge(mt1, aisles, on="aisle_id")

    # pd.crosstab counts the relationship between users and item categories (counts)
    cross = pd.crosstab(mt2['user_id'], mt2['aisle'])

    # PCA: perform principal component analysis, keeping 95% of the variance
    pc = PCA(n_components=0.95)
    data_new = pc.fit_transform(cross)
    print("data_new:\n", data_new.shape)

    return None