Public account: You and the cabin | Author: Peter | Editor: Peter

Hello, I’m Peter

In the past two weeks, I have been reading a book “Introduction and Practice of Feature Engineering”, which has inspired me a lot.

Feature engineering is a very important step when data practitioners build models. Finding or selecting, from numerous candidate features, the ones that are highly correlated with the target is particularly important.

The book mentions correlation coefficients; I read a lot of material on the topic and organized it into this article, which I hope will help you.

What is the correlation coefficient

The correlation coefficient is a statistical index designed by the statistician Karl Pearson, and the most commonly used variant is the Pearson correlation coefficient. The correlation coefficient describes the relationship between two variables and the direction of the correlation, but on its own it does not precisely express the degree to which the two variables are related.

The correlation coefficient is between -1 and 1:

  • -1 indicates that the two variables are perfectly negatively correlated
  • 0 means there is no linear correlation between the two variables
  • 1 indicates that the two variables are perfectly positively correlated

Finally, the correlation coefficient is a concept from statistics.

Three common correlation coefficients

The Pearson correlation coefficient, Spearman correlation coefficient and Kendall correlation coefficient are known as the three major correlation coefficients in statistics.

The most common tool for analyzing feature correlation is pandas.DataFrame.corr:

DataFrame.corr(method="pearson", min_periods=1)

The methods included are as follows:

  • Pearson: Pearson correlation coefficient
  • Spearman: Spearman rank correlation coefficient
  • Kendall: Kendall rank correlation coefficient

Pearson correlation coefficient

The Pearson correlation coefficient, also called the simple correlation coefficient or linear correlation coefficient, is used to measure the degree of linear correlation between two continuous variables.

The population Pearson correlation coefficient is denoted ρ, and its calculation formula is

$$\rho_{X, Y}=\frac{\operatorname{cov}(X, Y)}{\sigma_{X} \sigma_{Y}}=\frac{E\left[\left(X-\mu_{X}\right)\left(Y-\mu_{Y}\right)\right]}{\sigma_{X} \sigma_{Y}}$$

Or as follows:


$$\rho_{X, Y}=\frac{E(X Y)-E(X) E(Y)}{\sqrt{E\left(X^{2}\right)-(E(X))^{2}} \sqrt{E\left(Y^{2}\right)-(E(Y))^{2}}}$$

The sample Pearson correlation coefficient is denoted by the letter r and measures the linear relationship between the two variables. Its calculation formula is as follows:

$$r(X, Y)=\frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}[X] \operatorname{Var}[Y]}}$$

The covariance of the two variables divided by the product of the standard deviations of the two variables

  • Cov(X,Y) is the covariance
  • Var[X] is the variance; taking its square root gives the standard deviation

⚠️ Summary: the Pearson correlation coefficient of two variables is defined as their covariance divided by the product of their standard deviations.

Spearman correlation coefficient

The Spearman correlation coefficient, i.e. the Spearman rank correlation coefficient, is named after Charles Edward Spearman. It is usually denoted by the Greek letter ρ and is defined as the Pearson correlation coefficient between the rank variables.

The calculation formula is:


$$\rho=\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i}\left(x_{i}-\bar{x}\right)^{2} \sum_{i}\left(y_{i}-\bar{y}\right)^{2}}}$$

In practice, when the ranks contain no ties, ρ can be computed in a simpler way from the differences between the ranks of the two variables for each observation:


$$\rho=1-\frac{6 \sum d_{i}^{2}}{n\left(n^{2}-1\right)}$$

where n is the number of data points and $d_i$ is the difference between the ranks $(r_{x_i}, r_{y_i})$ of the data point $(x_i, y_i)$: $d_i = r_{x_i} - r_{y_i}$.

If we already have the Pearson correlation coefficient, why do we need the Spearman correlation coefficient? The Pearson correlation coefficient has certain limitations: first, the variables must be continuous, and second, they should follow a normal distribution.

Spearman’s correlation coefficient only cares about the monotonic relationship between the variables, does not depend on their specific values, and is tolerant of outliers. In general, Spearman’s correlation coefficient can be used wherever Pearson’s correlation coefficient can be used.

Quick understanding of rank order and rank sum:

For the following two groups of data, A and B, how do we find the ranks and the rank sums?

1. Arrange the two groups of data, A and B, together in order:

2. Mark their positions in that order, i.e. the ranks. If two values are equal, each takes the mean of their positions

3. The rank sum is the sum of the ranks in each group (a small pandas sketch reproducing these rank sums follows the example):

A: 3.5 + 5 + 8 + 9 + 10 = 35.5

B: 1 + 2 + 3.5 + 6 + 7 = 19.5
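
For reference, here is a minimal pandas sketch of the same procedure. The A/B values below are hypothetical (the article's original table is shown as an image), but they are chosen so that they reproduce the rank sums above:

import pandas as pd

a = pd.Series([12, 15, 20, 22, 25])  # hypothetical group A
b = pd.Series([3, 5, 12, 16, 18])    # hypothetical group B

combined = pd.concat([a, b], keys=["A", "B"])
ranks = combined.rank(method="average")  # tied values get the mean of their positions

print(ranks["A"].sum())  # 35.5
print(ranks["B"].sum())  # 19.5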

Special case: when the two variables contain repeated (tied) values, the Spearman correlation coefficient between the variables is the Pearson correlation coefficient computed between their ranks:


$$\rho_{s}=\rho_{r_{x}, r_{y}}=\frac{\operatorname{cov}\left(r_{x}, r_{y}\right)}{\sigma_{r_{x}} \sigma_{r_{y}}}$$

where $r_x$ denotes the rank of variable x after rank transformation. As the definition above shows, Spearman’s correlation coefficient is in fact Pearson’s correlation coefficient applied to the rank-transformed data.
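
As a quick check of this statement, here is a small sketch (hypothetical data, using scipy) showing that the Spearman coefficient equals the Pearson coefficient computed on the ranks:

import numpy as np
from scipy import stats

x = np.array([10, 20, 20, 35, 50])  # contains a tie
y = np.array([3.0, 1.5, 2.5, 6.0, 9.0])

rho, _ = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(rho, r_on_ranks)  # the two values are identical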

Kendall rank correlation coefficient

The Kendall rank correlation coefficient is a rank correlation coefficient used to measure the strength of the monotonic relationship between two ordinal variables. Its value ranges from -1 to 1; the larger the absolute value, the stronger the monotonic correlation, and a value of 0 means no monotonic correlation at all. The Kendall coefficient is usually denoted by the Greek letter τ (tau).

Ordinal variables are a kind of categorical variable. Categorical variables can be unordered, such as gender (male/female), or ordered, such as grades (excellent, good, medium, poor).

Kendall’s coefficient is usually used to find the correlation between ordered categorical (ordinal) variables.
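
A minimal sketch of computing Kendall’s τ in Python; the grade encoding below is a made-up example:

from scipy import stats
import pandas as pd

# hypothetical ordinal data, e.g. grades encoded as 4=excellent, 3=good, 2=medium, 1=poor
x = [4, 3, 3, 2, 1, 4, 2]
y = [3, 3, 2, 2, 1, 4, 1]

tau, p_value = stats.kendalltau(x, y)
print(tau, p_value)

# equivalently with pandas:
print(pd.Series(x).corr(pd.Series(y), method="kendall"))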

How to compute the Pearson correlation coefficient

Here we introduce several ways to compute the Pearson correlation coefficient, based on plain Python or third-party libraries; the Pearson correlation coefficient is still the most frequently used.

Import libraries

import pandas as pd
import numpy as np
import math
import random

Simulated data

Simulate a simple piece of data with two columns (variables)
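
The original snippet is shown as an image, so here is a minimal, hypothetical stand-in: a DataFrame df with two numeric columns x and y that the following methods will use:

np.random.seed(0)
df = pd.DataFrame({
    "x": np.random.randint(1, 100, 10),
    "y": np.random.randint(1, 100, 10),
})
df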

Method 1: Based on pandas

The pandas library has a corr() function that directly computes the correlation coefficients between variables of numeric type.

In the following results, we see:

  • The coefficients on the main diagonal are all 1, since the correlation of a variable with itself is always 1
  • The matrix is symmetric, so the coefficients on either side of the diagonal are equal

The corr function has a method parameter; by passing different values we can compute the different correlation coefficients listed above. The default is the Pearson coefficient.
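
A minimal usage sketch (assuming the simulated DataFrame df from above):

df.corr()                   # Pearson correlation matrix (the default method)
df.corr(method="spearman")  # Spearman rank correlation
df.corr(method="kendall")   # Kendall rank correlation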

Method 2: Use Python custom functions

1. Find the mean of the two variables first

2. Find the expected value of xy

3. Find the covariance of x and y

The covariance of x and y can be calculated from the above results:

4. Find the standard deviations of x and y

5. Compute the Pearson coefficient

Dividing the covariance by the product of the standard deviations gives the Pearson coefficient. A compact sketch of these five steps is shown below.
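
A compact sketch of the five steps, assuming the two columns x and y of the df simulated above and using population formulas for the covariance and standard deviations:

x, y = df["x"].values, df["y"].values
n = len(x)

x_mean, y_mean = x.mean(), y.mean()          # 1. means of the two variables
exy = (x * y).mean()                         # 2. expected value E[xy]
cov_xy = exy - x_mean * y_mean               # 3. covariance: E[xy] - E[x]E[y]
std_x = np.sqrt(((x - x_mean) ** 2).mean())  # 4. standard deviations
std_y = np.sqrt(((y - y_mean) ** 2).mean())
r = cov_xy / (std_x * std_y)                 # 5. Pearson coefficient
print(r)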

Method 3: Based on Numpy

The Numpy library has a corrcoef function that directly computes the correlation coefficient between two variables and returns a matrix: the values on the main diagonal are all 1 (the correlation of a variable with itself is 1), and the off-diagonal entries are the correlation coefficients between the two different variables:
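
A minimal sketch, again assuming the simulated df:

corr_matrix = np.corrcoef(df["x"], df["y"])
print(corr_matrix)        # [[1.0, r], [r, 1.0]]
print(corr_matrix[0, 1])  # the Pearson coefficient between x and y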

Method 4: Based on Scipy

Scipy is also a powerful scientific computing library for Python; its scipy.stats module includes a pearsonr function that can also compute the Pearson correlation coefficient.

This function returns two values: the first is the Pearson correlation coefficient and the second is the p-value.

If the p-value is less than 0.05, i.e. below the significance level, the two variables are considered to be significantly correlated.
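
A minimal sketch, assuming the simulated df:

from scipy import stats

r, p = stats.pearsonr(df["x"], df["y"])
print(r, p)  # if p < 0.05, the correlation is considered statistically significant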

A practical case

1. We first simulate a piece of data: A, B, C, D, E are 5 features, and cat is the class column at the end (each value is either 0 or 1).

import random

# Simulate a data set: 5 features + 1 class column
df1 = pd.DataFrame({"A": np.random.randint(1, 5, 20),
                    "B": np.random.randint(4, 10, 20),
                    "C": np.random.randint(3, 6, 20),
                    "D": np.random.randint(22, 50, 20),
                    "E": np.random.randint(15, 60, 20),
                    "cat": np.random.randint(0, 2, 20),
                    }, index=list(range(20)))
df2 = np.array(df1)

2. Calculate the mean of features and classes

def calcMean(x, y):
    """Calculate the means of a feature column and the class column.
    Parameters: x: feature data; y: class data. Returns: the two means."""
    x_sum = sum(x)
    y_sum = sum(y)
    n = len(x)
    x_mean = float(x_sum) / n
    y_mean = float(y_sum) / n
    return x_mean, y_mean  # return the two means

3. Calculate Pearson coefficient

def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)  # call the function above to get the means
    n = len(x)
    sumTop = 0.0
    sumBottom = 0.0
    x_pow = 0.0
    y_pow = 0.0

    # numerator: sum of cross-products (proportional to the covariance)
    for i in range(n):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
    # denominator: sums of squared deviations (proportional to the standard deviations)
    for i in range(n):
        x_pow += math.pow(x[i] - x_mean, 2)
    for i in range(n):
        y_pow += math.pow(y[i] - y_mean, 2)
    sumBottom = np.sqrt(x_pow * y_pow)
    p = sumTop / sumBottom  # covariance term / standard-deviation term -> Pearson coefficient
    return p

4. Calculate the correlation of each attribute with the class

def calcAttribute(dataSet):
    prr = []  # empty list used to collect the coefficients
    n, m = np.shape(dataSet)  # get the numbers of rows and columns
    x = [0] * n  # initialize feature vector x and class vector y
    y = [0] * n
    for i in range(n):
        y[i] = dataSet[i][m-1]  # take the class column (the last column)
    for j in range(m-1):
        for k in range(n):
            x[k] = dataSet[k][j]
        prr.append(calcPearson(x, y))  # correlation coefficient between feature j and class y
    return prr

5. Calculation results

prr = calcAttribute(df2)
prr

# the results
[-0.12335134242111898,
 -0.05860090386731199,
 -0.39038619785678985,
 -0.14989060907230156,
 -0.03952841713829405]

These are the correlation coefficients between each feature and the class cat.

We can also display the Pearson correlation coefficients with the pandas corr function; the return value is a DataFrame:

  • The correlation coefficient on the diagonal is 1
  • The last row gives the correlation coefficient between each variable and the class cat (the same values as the custom function computed above)

Also, we can look at the correlation between any two variables:
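
For example, a small sketch using Series.corr on the df1 simulated above (the original output is shown as an image):

print(df1["A"].corr(df1["B"]))                     # Pearson by default
print(df1["A"].corr(df1["B"], method="spearman"))  # or another method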
