Public account: You and the cabin by: Peter Editor: Peter
Hello, I’m Peter
Over the past two weeks I have been reading the book “Introduction and Practice of Feature Engineering”, which has given me a lot of ideas.
Feature engineering is a very important step in the modeling work of data practitioners. Finding or selecting, from a large number of features, those that are highly correlated with the target is particularly important.
The book mentions correlation coefficients. I read a good deal of material on the topic and put together this article, which I hope will help you.
What is the correlation coefficient
The correlation coefficient is a statistical index devised by the statistician Karl Pearson; the most commonly used version is the Pearson correlation coefficient. A correlation coefficient describes the relationship between two variables and the direction of that relationship. However, its value on its own is not an exact measure of the degree of association: for instance, a coefficient of 0.8 does not mean the association is twice as strong as one of 0.4.
The correlation coefficient lies between -1 and 1:
- -1 indicates a perfect negative correlation between the two variables
- 0 indicates no linear correlation between the two variables
- 1 indicates a perfect positive correlation between the two variables
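As a quick illustration of these three reference values, here is a minimal sketch using NumPy's `corrcoef` on made-up data (the arrays are hypothetical, chosen only so the results are easy to check by hand). Note that the zero case is a perfect quadratic relationship: a coefficient of 0 rules out a linear trend, not every kind of relationship.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical data

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient
# between the two inputs.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # increasing straight line
r_neg = np.corrcoef(x, -3 * x)[0, 1]      # decreasing straight line
r_zero = np.corrcoef(x, x ** 2)[0, 1]     # symmetric parabola, no linear trend

print(r_pos, r_neg, r_zero)
```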
Finally, correlation coefficient is a concept in statistics. How do you get started in statistics? Peter suggests you look at this chart below:
Three common correlation coefficients
The Pearson, Spearman and Kendall correlation coefficients are known as the three major correlation coefficients in statistics.
The most common way to analyze feature correlation is pandas.DataFrame.corr:
DataFrame.corr(method="pearson", min_periods=1)
The supported values of method are:
- pearson: Pearson correlation coefficient
- spearman: Spearman rank correlation coefficient
- kendall: Kendall rank correlation coefficient
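To make the `method` parameter concrete, here is a small sketch on a hypothetical two-column frame (the column names and values are made up for illustration). It assumes SciPy is installed, since pandas relies on it for the Kendall method in the versions I am aware of.

```python
import pandas as pd

# Hypothetical data: y rises with x overall, but not along a straight line.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 3, 2, 10, 50]})

# Compute the x-y coefficient once per supported method.
results = {
    method: df.corr(method=method).loc["x", "y"]
    for method in ("pearson", "spearman", "kendall")
}

for method, value in results.items():
    print(f"{method:>8}: {value:.4f}")
```

The two rank-based methods only see that one pair of y values is out of order, so they stay high even though the points are far from a straight line.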
Pearson correlation coefficient
The Pearson correlation coefficient, also called the simple correlation coefficient or linear correlation coefficient, measures the degree of linear correlation between two continuous variables.
The population Pearson correlation coefficient is denoted $\rho$ and is calculated as
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$
Or as follows:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$$
The sample Pearson correlation coefficient is denoted $r$ and measures the linear relationship between the two variables in a sample. It is calculated as
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
In words: the covariance of the two variables divided by the product of the standard deviations of the two variables
- Cov(X,Y) is the covariance
- Var[X] is the variance; taking its square root gives the standard deviation
⚠️ Summary: the Pearson correlation coefficient between two variables is defined as their covariance divided by the product of their standard deviations
Spearman correlation coefficient
The Spearman correlation coefficient is the Spearman rank correlation coefficient, named after Charles Edward Spearman and usually denoted by the Greek letter $\rho$. It is defined as the Pearson correlation coefficient between the rank variables.
The calculation formula is:
$$\rho = \frac{\sum_{i}(r_{xi}-\bar{r}_x)(r_{yi}-\bar{r}_y)}{\sqrt{\sum_{i}(r_{xi}-\bar{r}_x)^2\,\sum_{i}(r_{yi}-\bar{r}_y)^2}}$$
In practice, ties between values are often of little consequence, so $\rho$ can be calculated in one simple step. When all ranks are distinct, $\rho$ follows directly from the differences between the ranks of the two variables:
$$\rho = 1 - \frac{6\sum_{i} d_i^2}{n(n^2-1)}$$
where $n$ is the number of data points and $d_i$ is the difference between the ranks $(r_{xi}, r_{yi})$ of data point $(x_i, y_i)$: $d_i = r_{xi} - r_{yi}$
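The rank-difference formula can be checked with a short sketch. The data below is hypothetical and deliberately free of repeated values, since the shortcut is only exact in that case; `scipy.stats.rankdata` and `scipy.stats.spearmanr` are used for the ranks and for the reference value.

```python
import numpy as np
from scipy import stats

# Hypothetical data with no repeated values in either variable.
x = np.array([10, 20, 30, 40, 50])
y = np.array([1, 3, 2, 10, 50])

rx = stats.rankdata(x)   # ranks of x: 1..5
ry = stats.rankdata(y)   # ranks of y: 1, 3, 2, 4, 5
d = rx - ry              # rank differences d_i
n = len(x)

# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, _ = stats.spearmanr(x, y)

print(rho_formula, rho_scipy)
```

Here the squared rank differences sum to 2, so both routes give 1 - 12/120 = 0.9.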
If there is already a Pearson correlation coefficient, why do we also need a Spearman correlation coefficient? Because the Pearson correlation coefficient has limitations: first, the variables must be continuous; second, they should follow a normal distribution.
Spearman’s correlation coefficient only looks at the monotonic relationship between the variables. It depends on ranks rather than on the specific values, so it is more tolerant of outliers. In general, Spearman’s correlation coefficient can be used wherever Pearson’s correlation coefficient can be used.
A quick look at ranks and rank sums:
Given two groups of data, A and B, how do we find the ranks and the rank sums?
1. Pool the two groups A and B and sort the values in ascending order.
2. Label each value with its position in the sorted order, i.e. its rank. If two values are equal, each takes the mean of the positions they occupy.
3. The rank sum of a group is the sum of the ranks of its members:
A: 3.5 + 5 + 8 + 9 + 10 = 35.5
B: 1 + 2 + 3.5 + 6 + 7 = 19.5
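The three steps can be sketched with `scipy.stats.rankdata`, which assigns average ranks to tied values by default. The two groups below are hypothetical stand-ins, picked only because they reproduce the rank sums 35.5 and 19.5 shown above; they are not necessarily the article's original data.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical groups; the value 4 appears once in each group (a tie).
A = [4, 5, 8, 9, 10]
B = [1, 2, 4, 6, 7]

# Steps 1 and 2: pool both groups and rank them together.
# Tied values share the mean of the positions they occupy (here 3.5).
pooled = np.array(A + B)
ranks = rankdata(pooled)

# Step 3: add up the ranks belonging to each group.
rank_sum_a = ranks[:len(A)].sum()
rank_sum_b = ranks[len(A):].sum()

print(rank_sum_a, rank_sum_b)  # 35.5 19.5
```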
Special case: when the variables contain repeated values (ties), the Spearman correlation coefficient is computed as the Pearson correlation coefficient between the ranks of the variables:
$$\rho = \frac{\mathrm{Cov}(r_x, r_y)}{\sigma_{r_x}\,\sigma_{r_y}}$$
where $r_x$ denotes the ranks of variable x after conversion (and $r_y$ those of y). As the definition shows, Spearman’s correlation coefficient is simply Pearson’s correlation coefficient applied to rank-transformed data
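A short sketch can confirm this on data with ties. The values are hypothetical; the ranks come from `scipy.stats.rankdata` (average ranks for ties), and the result is compared with `scipy.stats.spearmanr` and with the rank-difference shortcut, which is no longer exact once ties are present.

```python
import numpy as np
from scipy import stats

# Hypothetical data with repeated values in both variables.
x = [1, 2, 2, 3, 5]
y = [2, 2, 4, 5, 5]

rx = stats.rankdata(x)   # tied values receive the mean of their positions
ry = stats.rankdata(y)

# Pearson coefficient computed on the ranks.
r_on_ranks, _ = stats.pearsonr(rx, ry)

# SciPy's Spearman coefficient on the raw values.
rho, _ = stats.spearmanr(x, y)

# Rank-difference shortcut, shown only for comparison.
n = len(x)
rho_shortcut = 1 - 6 * np.sum((rx - ry) ** 2) / (n * (n ** 2 - 1))

print(r_on_ranks, rho, rho_shortcut)
```

The first two values coincide, while the shortcut drifts slightly away from them because of the ties.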
Kendall rank correlation coefficient
The Kendall rank correlation coefficient measures the strength of the monotonic relationship between two ordered variables. Its value ranges from -1 to 1: the larger the absolute value, the stronger the monotonic correlation; a value of 0 indicates no correlation. The Kendall coefficient is usually denoted by the Greek letter $\tau$ (tau).
The ordered variables here are categorical variables. A categorical variable can be unordered, such as gender (male, female), or ordered, such as grades: excellent, good, medium, poor.
Kendall’s coefficient is usually computed for ordered categorical variables.
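As an illustration for ordered categories, the sketch below maps grades to integer codes and calls `scipy.stats.kendalltau`. The grade order, the second variable (`satisfaction`) and all values are hypothetical; SciPy's default variant of the statistic (tau-b) accounts for ties, which are common with categorical data.

```python
from scipy import stats

# Hypothetical ordering of the grade categories, lowest to highest.
order = {"poor": 0, "medium": 1, "good": 2, "excellent": 3}

grade = ["poor", "medium", "good", "good", "excellent", "medium"]
satisfaction = [1, 2, 2, 3, 3, 1]   # hypothetical ordinal scores from 1 to 3

# Encode the ordered categories as integers that preserve their order.
grade_codes = [order[g] for g in grade]

tau, p_value = stats.kendalltau(grade_codes, satisfaction)
print(tau, p_value)
```

Only the ordering of the codes matters here: any other increasing encoding of the grades would give the same tau.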
How to compute the Pearson correlation coefficient
Here we introduce several ways to compute the Pearson correlation coefficient, using plain Python or third-party libraries. The Pearson correlation coefficient is the one used most often in practice.
Import libraries
import pandas as pd
import numpy as np
import math
import random
Simulated data
Simulate a simple piece of data with two columns (variables)
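The original data for this step is not shown here. A minimal stand-in, assuming two small hand-picked columns named `x` and `y` (both the names and the values are hypothetical), could look like this; small integers keep any correlation computed from it easy to verify by hand.

```python
import pandas as pd

# Hypothetical two-column data set standing in for the simulated data.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
})

print(df)
```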
Method 1: Based on pandas
pandas DataFrames have a corr() method that directly computes the correlation coefficients between all numeric columns.
The resulting correlation matrix has two properties:
- The entries on the main diagonal are all 1, because the correlation of a variable with itself is always 1
- The matrix is symmetric: entries mirrored across the main diagonal are equal
The corr method has a method parameter that selects which correlation coefficient to compute. The default is the Pearson coefficient; the accepted values are "pearson", "spearman" and "kendall".
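A small sketch of the pandas route, using a hypothetical two-column frame (names and values are made up), shows the two matrix properties and the default method:

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

corr = df.corr()                      # Pearson is the default method
corr_explicit = df.corr(method="pearson")

diagonal_is_one = np.allclose(np.diag(corr), 1.0)   # self-correlation
is_symmetric = np.allclose(corr, corr.T)            # mirrored entries match
r_xy = corr.loc["x", "y"]                           # 6 / sqrt(60), about 0.7746

print(corr)
print(diagonal_is_one, is_symmetric)
```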
Method 2: Use Python custom functions
1. Find the means of the two variables x and y
2. Find the expected value of xy
3. Compute the covariance of x and y
The covariance follows from the results above: Cov(x, y) = E[xy] - E[x]E[y]
4. Find the standard deviations of x and y
5. Compute the Pearson coefficient
The covariance divided by the product of the two standard deviations is the Pearson coefficient
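The five steps can be written out in plain Python as follows. This is a sketch on hypothetical data, using population (divide-by-n) covariance and standard deviations; the 1/n factors cancel in the final ratio, so the divide-by-(n-1) versions would give the same coefficient.

```python
import math

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# 1. Means of the two variables
x_mean = sum(x) / n
y_mean = sum(y) / n

# 2. Expected value of the product xy
xy_mean = sum(a * b for a, b in zip(x, y)) / n

# 3. Covariance: E[xy] - E[x] * E[y]
cov_xy = xy_mean - x_mean * y_mean

# 4. Standard deviations (population form)
x_std = math.sqrt(sum((a - x_mean) ** 2 for a in x) / n)
y_std = math.sqrt(sum((b - y_mean) ** 2 for b in y) / n)

# 5. Pearson coefficient
r = cov_xy / (x_std * y_std)
print(round(r, 4))  # 0.7746
```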
Method 3: Based on Numpy
The NumPy library has a function called corrcoef that directly computes the correlation coefficient between two variables and returns an array (the correlation matrix). The values on the main diagonal are all 1 (each variable's correlation with itself), and the off-diagonal values are the correlation coefficient between the two different variables.
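A minimal sketch of the NumPy route on hypothetical data:

```python
import numpy as np

# Hypothetical data.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

matrix = np.corrcoef(x, y)   # 2x2 correlation matrix
r = matrix[0, 1]             # off-diagonal entry: correlation of x with y

print(matrix)
print(round(r, 4))  # 0.7746
```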
Method 4: Based on Scipy
SciPy is another powerful scientific computing library for Python. Its scipy.stats module includes a pearsonr function that also computes the Pearson correlation coefficient.
This function returns two values: the first is the Pearson correlation coefficient and the second is the p-value.
If the p-value is below the chosen significance level (commonly 0.05), the correlation between the two variables is considered statistically significant
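A minimal sketch of the SciPy route, again on hypothetical data, with the 0.05 threshold written as an explicit (assumed) significance level:

```python
from scipy import stats

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, p_value = stats.pearsonr(x, y)

alpha = 0.05                      # assumed significance level
significant = p_value < alpha

print(r, p_value, significant)
```

With only five points, even a coefficient of about 0.77 does not fall below the 0.05 threshold, which is a useful reminder to read the coefficient and the p-value together.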
A practical case
1. We first simulate a data set: A, B, C, D and E are five variables (features), and cat, the last column, is the category (either 0 or 1).
# Simulate a data set: 5 features + 1 class label
df1 = pd.DataFrame({"A": np.random.randint(1, 5, 20),
                    "B": np.random.randint(4, 10, 20),
                    "C": np.random.randint(3, 6, 20),
                    "D": np.random.randint(22, 50, 20),
                    "E": np.random.randint(15, 60, 20),
                    "cat": np.random.randint(0, 2, 20),
                    }, index=list(range(20)))
df2 = np.array(df1)  # plain NumPy array used by the functions below
2. Calculate the mean of features and classes
# Calculate the means of a feature and of the class labels
def calcMean(x, y):
    """
    Parameters: x: data of one feature; y: data of the class labels
    Return value: the mean of x and the mean of y
    """
    x_sum = sum(x)
    y_sum = sum(y)
    n = len(x)
    x_mean = float(x_sum) / n
    y_mean = float(y_sum) / n
    return x_mean, y_mean  # return both means
3. Calculate Pearson coefficient
def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)  # call the function above to get the means
    n = len(x)
    sumTop = 0.0
    sumBottom = 0.0
    x_pow = 0.0
    y_pow = 0.0
    # Numerator: sum of products of deviations (n times the covariance)
    for i in range(n):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
    # Denominator: square root of the product of the summed squared deviations
    for i in range(n):
        x_pow += math.pow(x[i] - x_mean, 2)
    for i in range(n):
        y_pow += math.pow(y[i] - y_mean, 2)
    sumBottom = np.sqrt(x_pow * y_pow)
    p = sumTop / sumBottom  # the 1/n factors cancel: covariance / (std_x * std_y)
    return p
4. Calculate the contribution of each attribute
def calcAttribute(dataSet):
    prr = []                  # empty list to collect the coefficients
    n, m = np.shape(dataSet)  # number of rows and columns
    x = [0] * n               # feature vector x
    y = [0] * n               # class vector y
    for i in range(n):
        y[i] = dataSet[i][m-1]  # the last column holds the class labels
    for j in range(m-1):
        for k in range(n):
            x[k] = dataSet[k][j]
        # Correlation coefficient between feature j and the class vector y
        prr.append(calcPearson(x, y))
    return prr
5. Calculation results
prr = calcAttribute(df2)
prr
# Output:
[-0.12335134242111898,
-0.05860090386731199,
-0.39038619785678985,
-0.14989060907230156,
-0.03952841713829405]
The list above gives the correlation coefficient between each variable and the category cat, in column order A to E. Because the data is generated randomly, the exact values differ from run to run.
The pandas corr function can also be used to display the Pearson correlation coefficients. The return value is a DataFrame.
- The correlation coefficient on the diagonal is 1
- The last row shows the correlation coefficient between each variable and the category cat (the same as the values calculated by the custom function above)
We can also look at the correlation between any two variables, for example with Series.corr: df1["A"].corr(df1["B"]).
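The pandas checks described here can be sketched as follows. The data is regenerated with a fixed seed so the block runs on its own; the seed is an assumption for repeatability, so the numbers will not match the output listed earlier. NumPy's `corrcoef` stands in for the hand-written function as an independent cross-check.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable
df1 = pd.DataFrame({
    "A": np.random.randint(1, 5, 20),
    "B": np.random.randint(4, 10, 20),
    "C": np.random.randint(3, 6, 20),
    "D": np.random.randint(22, 50, 20),
    "E": np.random.randint(15, 60, 20),
    "cat": np.random.randint(0, 2, 20),
})

corr = df1.corr()                       # Pearson by default, returns a DataFrame
cat_row = corr.loc["cat"].drop("cat")   # last row without its diagonal entry

# Cross-check against NumPy, one feature at a time.
manual = [np.corrcoef(df1[c], df1["cat"])[0, 1] for c in "ABCDE"]

# Correlation between a single pair of variables.
r_ab = df1["A"].corr(df1["B"])

print(cat_row)
print(r_ab)
```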