Preface
References: https://zhuanlan.zhihu.com/p/60059869 and https://blog.csdn.net/yanjiangdi/article/details/100939969
Statistics has three commonly used correlation coefficients: Pearson, Spearman and Kendall. All three describe the direction and strength of the trend between two variables, and all take values from -1 to +1. A value of 0 means the two variables are not correlated, a positive value means positive correlation, a negative value means negative correlation, and a larger absolute value means a stronger correlation.
1/ What is correlation
Correlation analysis examines characteristics of a population that are genuinely related to each other, often with a focus on characteristics suspected of being causally linked (although correlation by itself does not prove causation). It describes how closely two observable quantities are related and expresses that closeness with a suitable statistical measure.
For example, during one period the birth rate rises as the economic level rises, which indicates a positive correlation between the two characteristics. In a later period, as the economy develops further, the birth rate falls, and the two characteristics are negatively correlated.
To put it simply, correlation asks whether there is some kind of relationship between two variables. If one variable gets bigger and the other gets bigger too, that is a positive correlation. If one variable gets bigger and the other gets smaller, that is a negative correlation. If neither pattern appears, the two variables are uncorrelated. Independent variables are always uncorrelated, but uncorrelated variables are not necessarily independent, because they can still be related in a way the coefficient does not capture.
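To make the two directions concrete, here is a minimal sketch on synthetic data. The variables x, y_up and y_down are made up for illustration and do not come from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.normal(size=200)
noise = rng.normal(scale=0.5, size=200)

y_up = 2.0 * x + noise     # tends to grow when x grows
y_down = -2.0 * x + noise  # tends to shrink when x grows

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(f"positive example: r = {r_up:.3f}")
print(f"negative example: r = {r_down:.3f}")
```

The first coefficient comes out close to +1 and the second close to -1, matching the two cases described above.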
2/ Correlation coefficient r
The strength of the correlation between two variables is expressed by the correlation coefficient r. Its value range is [-1, 1], and it can take any value in that range.
When r is between 0 and 1 the correlation is positive: the scatter plot slopes upward, and as one variable goes up the other tends to go up. When r is between -1 and 0 the correlation is negative: the scatter plot slopes downward, and as one variable goes up the other tends to go down. The closer the absolute value of r is to 1, the stronger the correlation (whether positive or negative); the closer it is to 0, the weaker the correlation.
A common rule of thumb for the absolute value of r:
|r| < 0.2: essentially no correlation
0.2 < |r| < 0.4: weak correlation
0.4 < |r| < 0.7: moderate correlation
|r| > 0.7: strong correlation
These cut-offs are conventions rather than fixed rules.
However large the correlation coefficient is, it means little when the p-value is above 0.05, because in that case the result may simply be due to chance.
The p-value (p_value, pvalue, also called the Sig. value) is the probability of seeing a correlation at least this strong in the sample if the two variables were in fact uncorrelated. p < 0.01 means the result is significant at the 1% level, and p < 0.05 (and > 0.01) means it is significant at the 5% level. This is often described loosely as "99% certainty" or "95% certainty", but strictly speaking the p-value is not the probability that the correlation is real. When p < 0.01 or p < 0.05, the result is considered significant, and the correlation obtained is meaningful.
The correlation coefficient r answers the question of how strong the correlation is. The significance answers the question of whether there is a relationship at all, that is, whether the result could have been produced by chance.
If p < 0.05 and r = 0.279, there is a statistically significant correlation between the two variables, but at 0.279 it is not strong.
If p > 0.05 and r = 0.799, the sample shows a strong correlation, but that high value may have arisen by chance, so it is not statistically significant.
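The difference between strength and significance can be sketched as follows. The helper describe_strength and its labels are assumptions made for this illustration, and both datasets are made up; scipy.stats.pearsonr is the only library call relied on.

```python
import numpy as np
from scipy import stats

def describe_strength(r):
    """Hypothetical helper: map |r| to rule-of-thumb labels."""
    a = abs(r)
    if a < 0.2:
        return "negligible"
    if a < 0.4:
        return "weak"
    if a < 0.7:
        return "moderate"
    return "strong"

# Four paired observations: r is high, but the sample is tiny
x_small = [1, 2, 3, 4]
y_small = [1, 3, 2, 4]
r_small, p_small = stats.pearsonr(x_small, y_small)

# 200 noisy observations with a modest underlying linear relationship
rng = np.random.default_rng(1)
x_large = rng.normal(size=200)
y_large = 0.3 * x_large + rng.normal(size=200)
r_large, p_large = stats.pearsonr(x_large, y_large)

for name, r, p in (("small sample", r_small, p_small),
                   ("large sample", r_large, p_large)):
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: r = {r:.3f} ({describe_strength(r)}), p = {p:.4f}, {verdict}")
```

For the four-point sample r is 0.8 while p is 0.2, which is the "high r, not significant" case described above.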
3/ How to calculate correlation
<1>dataframe.corr()
The .corr() method belongs to the pandas DataFrame object. It computes the correlation coefficient between every pair of variables (columns) in the DataFrame, but it returns only the correlation coefficient r, not the p-value.
The correlation coefficient can only be computed for numeric columns. Older pandas versions silently dropped non-numeric columns; since pandas 2.0 you need to pass numeric_only=True (or select the numeric columns yourself), otherwise non-numeric columns raise an error.
Example call: DataFrame.corr(method='pearson', min_periods=500)
method: one of {'pearson', 'spearman', 'kendall'}.
1) pearson: also known as the product-moment correlation coefficient. It measures how closely two data sets fall on a straight line, so it suits linear relationships and can be misleading for non-linear data.
2) spearman: a rank correlation coefficient computed from the order of the values rather than the values themselves, so it does not require a linear relationship or normally distributed data.
3) kendall: also a rank correlation coefficient. It is used for ordered categorical variables and for data that are not normally distributed.
min_periods: the minimum number of observations required for each pair of columns to produce a valid result (optional, default 1).
Return value: a DataFrame holding the correlation coefficient between each pair of columns. No p-value is returned.
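A minimal sketch of the call on a made-up DataFrame follows. The column names and the data are hypothetical, and min_periods=30 is an arbitrary choice for the illustration. Numeric columns are selected explicitly so the example behaves the same across pandas versions, and method='kendall' relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
height = rng.normal(170, 8, size=n)

df = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height - 90 + rng.normal(0, 6, size=n),
    "score": rng.normal(60, 10, size=n),
    "group": rng.choice(["a", "b"], size=n),  # non-numeric column
})

# Keep only numeric columns instead of relying on version-specific defaults
numeric = df.select_dtypes(include="number")

for method in ("pearson", "spearman", "kendall"):
    corr = numeric.corr(method=method, min_periods=30)
    print(method)
    print(corr.round(3))
```

Each result is a square DataFrame with one row and one column per numeric variable and 1.0 on the diagonal. To get p-values, the scipy.stats functions are needed.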
<2> Pearson (product-moment) correlation coefficient
1) Applicable conditions. Computing Pearson's correlation coefficient is only meaningful when the following conditions are met:
<1> Each of the two variables follows a normal distribution (not necessarily the standard normal distribution) or a near-normal unimodal distribution.
<2> The standard deviation of both variables is non-zero. Pearson's correlation coefficient is the covariance divided by the product of the two standard deviations, so the denominator must not be 0.
<3> The relationship between the two variables is linear.
<4> Both variables are continuous.
<5> The observations of the two variables are paired, and each pair is independent of the others. The sample size should be more than 500.
2) Why does Pearson need normally distributed data? After Pearson's correlation coefficient is computed, it is usually tested for significance with a method such as the t-test, and the t-test assumes normally distributed data. So when Pearson is used for correlation analysis, the data should be normally or approximately normally distributed.
3) Syntax: from scipy import stats, then stats.pearsonr(x, y), which returns the correlation coefficient and the p-value.
4) In Python 3.6:
DataFrame.corr(method='pearson') returns the correlation matrix but cannot give the p-value.
from scipy.stats import normaltest, probplot
normaltest(a) returns the test statistic and the p-value of a normality test; the sample should contain more than 20 observations.
probplot(x, dist="norm", plot=pylab) draws a normal probability plot of x (pylab comes from matplotlib). If the points lie close to the diagonal line, x is close to normally distributed.
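The workflow above (check normality, then compute Pearson's r with its p-value) might look like the sketch below. The data are synthetic and the variable names are assumptions. probplot is called without a plot argument so that the example runs without matplotlib; in that form it returns the plotting positions and a fitted line instead of drawing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(50, 10, size=600)
y = 0.8 * x + rng.normal(0, 6, size=600)

# D'Agostino-Pearson test: the null hypothesis is that the sample is normal,
# so a large p-value means there is no evidence against normality.
for name, values in (("x", x), ("y", y)):
    stat, p_norm = stats.normaltest(values)
    print(f"{name}: normaltest statistic = {stat:.3f}, p = {p_norm:.3f}")

# Both standard deviations must be non-zero for r to be defined
print(f"std(x) = {x.std():.2f}, std(y) = {y.std():.2f}")

r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p:.3g}")

# Probability plot data: r_fit near 1 means the points lie close to the
# fitted straight line, i.e. the sample looks approximately normal.
(theoretical_q, ordered_x), (slope, intercept, r_fit) = stats.probplot(x, dist="norm")
print(f"probplot fit r = {r_fit:.4f}")
```

Passing plot=pylab (or a matplotlib Axes) to probplot draws the same information as a figure.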
<3> Spearman rank correlation coefficient
1) Spearman's coefficient is a non-parametric rank correlation coefficient: it does not depend on the actual values of the two variables, only on their relative order. When there are no tied ranks it can be written as
$$ \rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)} $$
where d_i is the difference between the two ranks of the i-th pair after the two variables have been ranked separately, and N is the number of samples. Working with ranks rather than raw values reduces the influence of outliers.
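As a sanity check, the textbook no-ties formula rho = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)), with d_i and N as described above, can be evaluated by hand and compared with scipy.stats.spearmanr. The five data points are made up, and the last x value is a deliberate outlier.

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])  # last value is an outlier
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Rank each variable separately (1 = smallest); no ties in this data
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

d = rank_x - rank_y  # rank difference of each pair
n = len(x)
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, p_value = stats.spearmanr(x, y)
r_pearson, _ = stats.pearsonr(x, y)

print(f"rho from formula: {rho_formula:.3f}")
print(f"rho from scipy:   {rho_scipy:.3f}")
print(f"Pearson r:        {r_pearson:.3f}")
```

Both Spearman values are 0.8. The outlier only contributes its rank (5), whereas Pearson's r works on the raw value 1000.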
2) Spearman places looser requirements on the data than Pearson's correlation coefficient. As long as the observations of the two variables are paired rank (grade) data, or ranks converted from observations of continuous variables, Spearman's rank correlation coefficient can be used regardless of the overall distribution of the two variables and the sample size.
3) Syntax: spearmanr(x, y)
4) In Python 3.6:
A) DataFrame.corr(method='spearman')
B) from scipy.stats import spearmanr, then spearmanr(array), which returns the Spearman coefficient (a coefficient matrix when the array has more than two columns) and the p-value of the test; the sample should contain more than 20 observations.
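Both calling styles of spearmanr can be sketched as follows, assuming three synthetic variables a, b and c (hypothetical names). With two 1-D inputs the function returns a single coefficient; with a 2-D array of three or more columns it returns a coefficient matrix and a matrix of p-values.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
a = rng.normal(size=50)
b = np.exp(a) + rng.normal(scale=0.1, size=50)  # roughly monotonic in a, not linear
c = rng.normal(size=50)                          # unrelated to a and b

# Two 1-D inputs: a single coefficient and p-value
rho, p = spearmanr(a, b)
print(f"rho(a, b) = {rho:.3f}, p = {p:.3g}")

# One 2-D input: each column is treated as a variable
data = np.column_stack([a, b, c])
rho_matrix, p_matrix = spearmanr(data)
print(np.round(rho_matrix, 3))
```

Entry [0, 1] of the matrix matches the coefficient from the two-argument call.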
<4> Kendall rank correlation coefficient
Kendall's correlation coefficient, also known as Kendall's tau, is likewise a rank correlation coefficient, but it is designed for categorical variables. Categorical variables come in two kinds:
Unordered, such as gender (male, female) or blood type (A, B, O, AB).
Ordered, such as obesity grade (severe obesity, moderate obesity, mild obesity, normal, underweight) or performance rating (S, A, B, C, D).
Computing a correlation coefficient normally requires ordered categorical variables, that is, categories with a size relationship.
In Python 3.6:
A) DataFrame.corr(method='kendall') returns the correlation matrix.
B) from scipy.stats import kendalltau, then kendalltau(x, y) returns the correlation coefficient and the p-value.
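A small sketch with ordered categories follows. The eight observations and the integer codes assigned to each category are made up for illustration; only their order matters to kendalltau.

```python
from scipy.stats import kendalltau

# Hypothetical ordinal codes: a larger number means a higher grade
obesity_order = {"underweight": 0, "normal": 1, "mild": 2, "moderate": 3, "severe": 4}
grade_order = {"D": 0, "C": 1, "B": 2, "A": 3, "S": 4}

# Made-up paired observations (one person per position)
obesity = ["normal", "mild", "severe", "underweight",
           "moderate", "normal", "mild", "severe"]
performance = ["A", "B", "D", "S", "C", "S", "B", "C"]

x = [obesity_order[v] for v in obesity]
y = [grade_order[v] for v in performance]

tau, p = kendalltau(x, y)  # handles tied ranks
print(f"Kendall tau = {tau:.3f}, p = {p:.4f}")
```

In this toy data a higher obesity grade always goes with an equal or lower performance rating, so tau is strongly negative. Unordered categories such as blood type have no meaningful coding of this kind, which is why they are not suitable here.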
<5> When to use Pearson and when to use Spearman
1) When two continuous variables are linearly related (draw a scatter plot first to check), use Pearson's product-moment correlation coefficient. When the conditions for product-moment correlation analysis are not met, use Spearman's rank correlation coefficient instead.
2) Spearman's rank correlation applies linear correlation analysis to the ranks of the two variables and places no requirement on the distribution of the original variables. It is a non-parametric method with a wider range of application. It can also be computed for data that satisfy Pearson's conditions, but its statistical efficiency is lower. Pearson's formula carries over to Spearman unchanged: x and y in the formula are simply replaced by their ranks.
3) What the two have in common is that both need paired observations whose values can be ordered. Pearson requires continuous variables; Spearman also accepts rank (grade) data, as noted in the Spearman section. Neither is meaningful for unordered categories.
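The statement that Spearman is Pearson's formula applied to ranks can be checked directly. The sketch below uses a synthetic relationship that is strictly increasing but far from linear; the data and variable names are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, size=200)
y = np.exp(x)  # strictly increasing in x, but not linear

r_pearson, _ = stats.pearsonr(x, y)
rho_spearman, _ = stats.spearmanr(x, y)

# Pearson's formula applied to the ranks instead of the raw values
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(f"Pearson r on raw values: {r_pearson:.3f}")
print(f"Spearman rho:            {rho_spearman:.3f}")
print(f"Pearson r on ranks:      {r_on_ranks:.3f}")
```

Spearman's rho and Pearson's r on the ranks agree and equal 1 because the order is preserved perfectly, while Pearson's r on the raw values is lower because the relationship is not a straight line.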