Feature dimension reduction

Objectives

  • Know the filter, wrapper, and embedded methods of feature selection

  • Use VarianceThreshold to remove low variance features

  • Understand the characteristics and calculation of correlation coefficients

  • Apply the correlation coefficient to perform feature selection

Dimension reduction

Dimension reduction is the process of reducing the number of random variables (features) under certain constraints to obtain a set of "uncorrelated" principal variables

  • Reduce the number of random variables

  • Correlated features: for example, relative humidity and rainfall are correlated

When we train a model, we learn from the features. If a feature itself is problematic, or features are strongly correlated with one another, this has a large impact on how the algorithm learns and predicts

Two ways to reduce dimension

  • Feature selection

  • Principal component analysis (which can be understood as a form of feature extraction)

Feature selection

What is feature selection

Definition: the data often contains redundant or irrelevant variables (also called features, attributes, indicators, etc.); feature selection aims to identify the main features from the original features.

Methods:

  • Filter: mainly examines the features themselves, the correlation between features, and the correlation between features and the target value

    • Variance selection method: low variance feature filtering
    • Correlation coefficient
  • Embedded: the algorithm automatically selects features (based on associations between features and the target value)

    • Decision tree: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolution, etc.
  • Wrapper

The module

sklearn.feature_selection

Filter methods

Low variance feature filtering

Low-variance feature filtering removes features whose variance is small. Consider what the magnitude of the variance means:

  • Small feature variance: most samples have similar values for that feature

  • Large feature variance: samples take many different values for that feature

API

  • sklearn.feature_selection.VarianceThreshold(threshold=0.0)
    • Removes all low-variance features
    • VarianceThreshold.fit_transform(X)
      • X: numpy array data of shape [n_samples, n_features]
      • Return value: features whose training-set variance is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. it removes the features that have the same value in every sample.
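As a quick illustration before the stock example below, here is a minimal sketch with made-up numbers showing the default behaviour (threshold=0.0), which only drops features that are constant across all samples:

from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Toy data: the first and last columns are constant (zero variance)
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

transfer = VarianceThreshold()  # threshold=0.0 by default
print(transfer.fit_transform(X))
# Only the two varying middle columns remain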

Data calculation

We filter the indicator features of certain stocks.

The features are:

pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
000008.XSHE,262.7461,7.0003,517045520.0,0.5678,0.5943,770517752.56,0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,0.0026808154146886697
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """Filter out low-variance features"""
    # 1. Get the data
    data = pd.read_csv("factor_returns.csv")
    data = data.iloc[:, 1:-2]  # keep only the numeric indicator columns
    print(data)

    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=5)

    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new", data_new, data_new.shape)
    return None


if __name__ == '__main__':
    # Low variance feature filtering
    variance_demo()

The correlation coefficient

Pearson correlation coefficient: a statistical indicator reflecting how closely two variables are correlated

Formula and worked example (for understanding; no need to memorize)

Formula:

r = (n·Σxy − Σx·Σy) / ( sqrt(n·Σx² − (Σx)²) · sqrt(n·Σy² − (Σy)²) )

For example, computing the correlation between annual advertising expenditure and average monthly sales gives

r = 0.9942

So we conclude that there is a strong positive correlation between advertising spending and average monthly sales.
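To make the formula concrete, here is a minimal sketch that implements it directly in plain Python. The advertising and sales figures below are hypothetical placeholders (the original example data is not reproduced here), so the printed value is only illustrative rather than the 0.9942 above:

import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed with the formula above"""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

ad_spend = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]  # hypothetical
sales = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]     # hypothetical
print(pearson_r(ad_spend, sales))  # a value close to 1, i.e. strong positive correlation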

Properties

The value of the correlation coefficient lies between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties are as follows:

  • When r>0, the two variables are positively correlated; when r<0, the two variables are negatively correlated

  • When |r| = 1, the two variables are perfectly correlated; when r = 0, there is no linear correlation between the two variables

  • When 0 < |r| < 1, the two variables are correlated to some degree. The closer |r| is to 1, the stronger the linear relationship between the two variables; the closer |r| is to 0, the weaker the linear correlation

A common three-tier rule of thumb: |r| < 0.4 indicates weak correlation; 0.4 ≤ |r| < 0.7 indicates significant correlation; 0.7 ≤ |r| < 1 indicates strong linear correlation

Here | | denotes the absolute value, e.g. |-5| = 5

API

  • scipy.stats.pearsonr(x, y)
    • x: (N,) array_like
    • y: (N,) array_like
    • Returns: (Pearson's correlation coefficient, p-value)
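As a sketch, assuming the factor_returns.csv file from the low-variance example above is available, pearsonr can be applied to pairs of the stock indicator features:

from scipy.stats import pearsonr
import pandas as pd

data = pd.read_csv("factor_returns.csv")

# Correlation between two indicator features, and the p-value
r, p_value = pearsonr(data["pe_ratio"], data["pb_ratio"])
print("pe_ratio vs pb_ratio:", r, p_value)

r, p_value = pearsonr(data["revenue"], data["total_expense"])
print("revenue vs total_expense:", r, p_value)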

Principal component analysis

Objectives

  • Apply PCA to reduce the dimensionality of features

  • Application: principal component analysis of the relationship between users and item categories

What is Principal Component Analysis (PCA)

Definition: the process of transforming high-dimensional data into low-dimensional data; in the process the original variables may be discarded and new variables created

Purpose: compress the dimensionality of the data, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.

Application: regression analysis or cluster analysis

How can we understand this process intuitively? Think of projecting two-dimensional points onto a single direction chosen so that the projection preserves as much of the original spread (variance) as possible.

API

  • sklearn.decomposition.PCA(n_components=None)
    • Decomposes the data into a lower-dimensional space
    • n_components:
      • Decimal: the fraction of information (variance) to retain
      • Integer: the number of dimensions to reduce to
    • PCA.fit_transform(X), X: numpy array data of shape [n_samples, n_features]
    • Return value: the transformed array with the specified number of dimensions
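A minimal sketch contrasting the two forms of n_components, using the same 3x4 array as the example below:

from sklearn.decomposition import PCA

X = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# Integer: keep exactly 2 components
print(PCA(n_components=2).fit_transform(X))

# Decimal: keep enough components to retain about 95% of the variance
print(PCA(n_components=0.95).fit_transform(X))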

Data calculation

[[2,8,4,5], [6,3,0,8], [5,4,9,1]]
from sklearn.decomposition import PCA


def pca_demo():
    """Principal component analysis for dimensionality reduction"""
    # Retain 70% of the information (variance)
    pca = PCA(n_components=0.7)
    data = pca.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)

    return None

Case study: explore the segmentation of users' preferences for item categories and reduce the dimensionality

Data

  • order_products__prior.csv: order and product information

    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information

    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: user order information

    • Fields: order_id, user_id, eval_set, order_number, ...
  • aisles.csv: the specific category (aisle) each product belongs to

    • Fields: aisle_id, aisle

Analysis

  • Join the tables so that user_id and aisle end up in the same table

  • Build a cross table (crosstab) of users against aisles

  • Apply PCA to reduce the dimensionality

import pandas as pd
from sklearn.decomposition import PCA


def pca_case_study():
    """Principal component analysis case study"""
    # Read the data from the four tables
    prior = pd.read_csv("./instacart/order_products__prior.csv")
    products = pd.read_csv("./instacart/products.csv")
    orders = pd.read_csv("./instacart/orders.csv")
    aisles = pd.read_csv("./instacart/aisles.csv")

    print(prior)

    # Merge the four tables
    mt = pd.merge(prior, products, on="product_id")
    mt1 = pd.merge(mt, orders, on="order_id")
    mt2 = pd.merge(mt1, aisles, on="aisle_id")

    # pd.crosstab counts the relationship between users and item categories (counts)
    cross = pd.crosstab(mt2['user_id'], mt2['aisle'])

    # PCA: perform principal component analysis, keeping 95% of the variance
    pc = PCA(n_components=0.95)
    data_new = pc.fit_transform(cross)
    print("data_new:\n", data_new.shape)

    return None