While I was working on a financial data project, a friend asked me whether there was a good way to measure stock returns. The question stuck with me for a long time. After all, a stock's return is, at heart, just data: we want it to be as high as possible, and we reason about it the way we reason about any other data. The usual tools for summarizing data are the mean, variance, skewness, and kurtosis. Mean and variance are the most familiar of the four, appearing even in middle-school textbooks, so today I will focus on the two less commonly used measures, skewness and kurtosis, and walk through their basic application in data analysis with Python code.

First, I’ll introduce the concepts of skewness and kurtosis.

Figure 1. Skewness and kurtosis formulas

Skewness, also called the skewness coefficient, is a statistic that describes the direction and degree of asymmetry of a data distribution; it is a numerical measure of how asymmetric the distribution is. For a random variable X, the skewness is the third standardized moment of the sample, and the calculation formula is shown as Equation (1) in Figure 1.
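Figure 1 itself is not reproduced here, so for reference, here is the standard definition that Equation (1) most likely corresponds to, written in LaTeX (a reconstruction, not the article's own figure):

$$ g_1 = E\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right], \qquad \hat{g}_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{3}}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right]^{3/2}} $$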

Skewness is measured relative to the normal distribution, whose skewness is 0. If the data distribution is symmetric, the skewness is zero. If the skewness is greater than 0, the distribution is right-skewed (also called positively skewed), meaning it has a long tail on the right. If the skewness is less than 0, the distribution is left-skewed (negatively skewed), meaning it has a long tail on the left. Both cases are shown in Figure 2: the left panel is positive skew and the right panel is negative skew.

Figure 2. Schematic diagram of skewness
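As a quick illustration (this snippet is mine, not part of the original article), the code below draws samples from a symmetric normal distribution and a right-skewed exponential distribution and compares their sample skewness with scipy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

normal_sample = rng.normal(size=10_000)    # symmetric distribution
exp_sample = rng.exponential(size=10_000)  # long right tail

# stats.skew computes the sample third standardized moment
print(stats.skew(normal_sample))  # close to 0
print(stats.skew(exp_sample))     # close to 2, the theoretical skewness of the exponential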

Kurtosis is a statistic that describes how peaked or flat a data distribution is. By computing kurtosis, we can judge whether the distribution is steeper or flatter than the normal distribution. For a random variable X, the kurtosis is the fourth standardized central moment of the sample (usually reported as excess kurtosis, i.e., with 3 subtracted so that the normal distribution scores 0), and the calculation formula is shown as Equation (2) in Figure 1.
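Again, Figure 1 is not reproduced here; the standard excess-kurtosis definition that Equation (2) most likely corresponds to is (a reconstruction):

$$ g_2 = E\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right] - 3, \qquad \hat{g}_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{4}}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right]^{2}} - 3 $$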

When the kurtosis coefficient is greater than 0, the distribution is steeper than the normal distribution, or has heavier tails; when it is less than 0, the distribution is flatter, or has thinner tails. In the real world, a heavy-tailed distribution carries more probability "mass" in its tails — that is, more extreme values — than the normal distribution does. Among commonly used distributions, the (excess) kurtosis of the normal distribution is 0, that of the uniform distribution is -1.2, and that of the exponential distribution is 6.
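These reference values are easy to verify numerically. A minimal sketch (not in the original article), using scipy's kurtosis function, which returns excess kurtosis by default:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000

# stats.kurtosis returns excess kurtosis (normal distribution = 0) by default
print(stats.kurtosis(rng.normal(size=n)))       # approximately 0
print(stats.kurtosis(rng.uniform(size=n)))      # approximately -1.2
print(stats.kurtosis(rng.exponential(size=n)))  # approximately 6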

The schematic diagram of kurtosis is shown in Figure 3: the first subplot shows a kurtosis of 0, the second a kurtosis greater than 0, and the third a kurtosis less than 0.

Figure 3. Schematic diagram of kurtosis

With the basic concepts covered, let's talk about how to run normality tests based on skewness and kurtosis. There are two main methods: the Omnibus test and the Jarque-Bera test.

Figure 4. Formulas for Omnibus and JB tests

The formula of the Omnibus test is shown as Formula (3) in Figure 4. Z1 and Z2 are two normalizing transformations, and g1 and g2 are the skewness and kurtosis respectively. Under the action of Z1 and Z2, the resulting statistic K approximately follows a chi-square distribution, which is what the test relies on. The derivation of this formula is fairly involved; interested readers can consult the relevant literature.
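Figure 4 is likewise not reproduced here; based on the description above, Formula (3) most likely corresponds to the D'Agostino-Pearson K-squared statistic (a reconstruction, not the article's own figure):

$$ K^2 = Z_1(g_1)^2 + Z_2(g_2)^2 \;\sim\; \chi^2(2) $$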

The formula of the Jarque-Bera test is shown as Formula (4) in Figure 4, where n is the sample size. The resulting statistic also approximately follows a chi-square distribution; its derivation is likewise omitted here.
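Formula (4) most likely corresponds to the standard Jarque-Bera statistic, written here with g2 denoting excess kurtosis (again a reconstruction, since Figure 4 is not shown):

$$ JB = \frac{n}{6}\left(g_1^{2} + \frac{g_2^{2}}{4}\right) \;\sim\; \chi^2(2) $$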

Both tests test the same pair of hypotheses.

Null hypothesis H0: the data are normally distributed.

Alternative hypothesis H1: the data are not normally distributed.
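Before moving to the article's dataset, here is a self-contained sketch (illustrative, not part of the original article) of how these two tests can be run directly with scipy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)

# D'Agostino-Pearson omnibus test, which combines skewness and kurtosis
k2, p_omnibus = stats.normaltest(x)
# Jarque-Bera test
jb, p_jb = stats.jarque_bera(x)

# Large p-values mean we cannot reject H0 (normality)
print(f"Omnibus K^2 = {k2:.3f}, p = {p_omnibus:.3f}")
print(f"Jarque-Bera = {jb:.3f}, p = {p_jb:.3f}")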

Now let’s use code to illustrate skewness and kurtosis.

Let's look at the data first. The dataset is very simple: only 15 rows and 2 columns. It describes fire-accident losses and the distance between the fire and the nearest fire station; the loss is in thousands of yuan and the distance in kilometers, as shown in Figure 5. The distance column is the distance from the fire to the nearest fire station, and the loss column is the loss caused by the fire accident.

Figure 5. Sample data

Here’s the code, first importing the required libraries.

# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt

# Statistical tests (omni_normtest, jarque_bera) and the formula-based regression API
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
from statsmodels.compat import lzip  # a zip variant that returns a list
from statsmodels.graphics.tsaplots import plot_acf  # imported here but not used below

The next step is to read the data and plot it. The code is very simple, so I won't explain it at length.

file = r'C:\Users\data.xlsx'
df = pd.read_excel(file)

fig, ax = plt.subplots(figsize=(8, 6))
plt.ylabel('Loss')
plt.xlabel('Distance')
plt.plot(df['distance'], df['loss'], 'bo-', label='loss')
plt.legend()
plt.show()
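As an optional aside (not in the original article), pandas can report the sample skewness and excess kurtosis of each column directly, which is a quick way to connect this dataset back to the concepts above:

# Series.skew() is the sample skewness; Series.kurt() is the sample excess kurtosis
print(df['loss'].skew(), df['loss'].kurt())
print(df['distance'].skew(), df['distance'].kurt())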

The result is shown in Figure 6, from which we can see that the points lie roughly on a straight line, so we fit the data with simple (one-variable) linear regression.

Figure 6. Line plot of the data

Next we fit the model and print its summary.

expr = 'loss ~ distance'
results = smf.ols(expr, df).fit()
print(results.summary())
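If you prefer to read individual numbers programmatically rather than from the printed summary, the fitted results object exposes them as attributes. A small sketch (not in the original article):

# F-statistic and its p-value, i.e. Prob (F-statistic) in the summary
print(results.fvalue, results.f_pvalue)
# Estimated intercept and slope
print(results.params)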

The result is shown in Figure 7. As the figure shows, Prob (F-statistic) is 1.25e-08, which is very small, indicating that the linear relationship between loss and distance is significant and our simple linear regression model is reasonable. Skew = -0.003, meaning the residuals are very nearly symmetric, like the normal distribution. Kurtosis = 1.706; note that the summary reports plain (non-excess) kurtosis, for which the normal distribution scores 3, so 1.706 actually means the residual distribution is flatter than the normal distribution, not steeper.

The figure also shows Omnibus = 2.551, Prob(Omnibus) = 0.279, Jarque-Bera (JB) = 1.047, and Prob(JB) = 0.592. It is hard to judge the hypotheses directly from the Omnibus and Jarque-Bera statistics themselves, but we can judge from Prob(Omnibus) and Prob(JB): because both p-values are large, we cannot reject the null hypothesis H0. In other words, the data are consistent with a normal distribution.

Figure 7. Description of model results

Next we verify the Skew, Kurtosis, Omnibus, and Jarque-Bera (JB) values using statsmodels' own test functions. Here's the code.

omnibus_label = ['Omnibus K-squared test', 'Chi-squared(2) p-value']
omnibus_test = sms.omni_normtest(results.resid)
omnibus_results = lzip(omnibus_label, omnibus_test)

jb_label = ['Jarque-Bera test', 'Chi-squared(2) p-value', 'Skewness', 'Kurtosis']
jb_test = sms.jarque_bera(results.resid)
jb_results = lzip(jb_label, jb_test)

print(omnibus_results)
print(jb_results)

Here, omnibus_label and jb_label are two lists containing the names of the quantities we want to check. sms.omni_normtest is the omnibus test provided by statsmodels, and sms.jarque_bera is statsmodels' own Jarque-Bera test. results.resid holds the residuals, 15 values in total; our data itself has only 15 points, and each residual corresponds to one of those data points. Both sms.omni_normtest and sms.jarque_bera operate on the residuals. lzip is a little-known helper, similar to Python's built-in zip except that it returns a list; I use it here so you get to know statsmodels a bit better, but plain zip would work just as well. The result is shown in Figure 8. As you can see, we get exactly the same values as in Figure 7 above. This verification with sms.omni_normtest and sms.jarque_bera mainly serves to explain the numbers in Figure 7 and to help you learn statsmodels.
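For reference, the equivalent with the built-in zip (an illustrative one-liner, not in the original code):

# zip returns an iterator, so wrap it in list() to mirror lzip's output
omnibus_results = list(zip(omnibus_label, omnibus_test))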

Figure 8. Results of omnibus and JB tests

This article has used statsmodels to explain some basic applications of skewness and kurtosis in data analysis. Readers who want to learn more about skewness, kurtosis, and statsmodels can consult the relevant materials on their own.

About the author: Mort is a data analysis enthusiast who is good at data visualization, focuses on machine learning, and hopes to learn from and exchange ideas with more friends in the industry.
