This is the 28th day of my participation in the August Text Challenge.

What are statistical indicators

Concepts and values that represent the quantitative characteristics of a population

Statistical indicators can vary depending on the purpose of the data analysis

When analyzing hiring data: skills, salary, years of experience

When analyzing user conversion: page views, landing pages, bounce rate

When analyzing financial products: past performance, risk coefficient, annualized return

Statistical indicators are divided into two categories according to the content they embody: total indicators and relative indicators

Total indicator

An indicator that describes the total size, level, or amount of work under specific conditions

GDP, total sales, total population

Relative indicators

Describe relative relationships rather than overall totals

A relative indicator is the ratio between the values of two related phenomena

Proportion: each data item / total, expressed as a percentage

Ratio: data item : data item

Multiple: emphasizes the size of a rise or growth

Period-over-period growth rate (short term): (current period value - previous period value) / previous period value * 100%

Year-on-year growth rate (long term): (current period value - same period last year value) / same period last year value * 100%

Note: period-over-period growth focuses on short-term changes, while year-on-year growth focuses on long-term changes
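The proportion and growth-rate formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sales figures are made up for the example and are not from any particular data set.

```python
def proportion(part, total):
    """Share of `part` in `total`, as a percentage."""
    return part / total * 100

def growth_rate(current, base):
    """Growth of `current` relative to `base`, as a percentage."""
    return (current - base) / base * 100

# Hypothetical monthly sales figures, made up for illustration.
sales_this_month = 120.0
sales_last_month = 100.0
sales_same_month_last_year = 80.0

# Period-over-period: compare with the immediately preceding period.
period_over_period = growth_rate(sales_this_month, sales_last_month)

# Year-on-year: compare with the same period one year earlier.
year_on_year = growth_rate(sales_this_month, sales_same_month_last_year)

# Proportion: one product's share of total sales (hypothetical numbers).
share = proportion(30.0, sales_this_month)

print(f"period-over-period: {period_over_period:.1f}%")
print(f"year-on-year: {year_on_year:.1f}%")
print(f"share of total: {share:.1f}%")
```

Both growth rates use the same function; the only difference is which earlier period is taken as the base.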

In addition to the two categories above, there are three kinds of statistical indicators worth learning: central tendency indicators, dispersion indicators, and distribution shape

Statistical indicators: central tendency indicators (average indicators)

The average

An average indicator, also known as a central tendency indicator, uses a single number to show the general level of the population. The most commonly used central tendency indicator is the average (the mean)

Average = sum of all data/number of data

Abnormal values often appear in everyday data processing. A value that is very large or very small pulls the average toward it, so the average can be misleading to a certain extent: it is sensitive to outliers

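A classic example: "On average, my monthly salary is over 100 million yuan." This can be arithmetically true of a group that contains one extremely high earner, yet it describes no typical member of that group.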

And because of that, there’s another indicator, the median, that needs to be looked at

The median

The median is the middle value once the data are sorted

Odd number of data items: the value in position (n+1) / 2 is the median

Even number of data items: the median is the sum of the two middle values / 2

When outliers are present, the median is more representative than the average

The mode

The mode is the value that occurs most often; it reflects local characteristics and where the data are most concentrated
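To make the contrast between the average and the median concrete, here is a small Python sketch. The salary figures are invented for the example, and `median_by_position` is a hypothetical helper that simply follows the position rule described above; the standard library's `statistics.median` does the same job.

```python
import statistics

# Hypothetical monthly salaries (yuan); the last value is an extreme outlier.
salaries = [5000, 6000, 7000, 8000, 1_000_000]

mean_salary = statistics.mean(salaries)      # pulled far upward by the outlier
median_salary = statistics.median(salaries)  # middle value of the sorted data

def median_by_position(values):
    """Median computed with the odd/even position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd count: the value in (1-based) position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even count: the two middle values, summed and divided by 2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print("mean:", mean_salary)
print("median:", median_salary)
print("median by position:", median_by_position(salaries))
```

With one very large salary in the list, the average lands far above what four of the five people earn, while the median stays at a typical value.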
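As a rough illustration, the mode can be found by counting how often each value occurs. The shoe-size data below are made up for the example; the sketch returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter

# Hypothetical shoe sizes sold in one day.
sizes = [40, 41, 41, 42, 42, 42, 43]

counts = Counter(sizes)           # value -> number of occurrences
top_count = max(counts.values())  # highest frequency in the data

# Every value that reaches the highest frequency is a mode.
modes = sorted(value for value, count in counts.items() if count == top_count)

print("modes:", modes)
```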

Statistical indicators: dispersion indicators

Now that we have covered central tendency indicators, what are dispersion indicators?

A dispersion indicator reflects the degree of difference within the data. There are three main types: range, mean difference, and standard deviation

Range

Range represents the greatest variation within the data

Range = maximum minus minimum

But the range only looks at the two extreme values, so it does not show the typical variation within the data. For that we use the mean difference

Mean difference

The mean difference (also called the mean absolute deviation) is the average distance between each data item and the average

Mean difference = sum of |each data item - average| / number of data items

The larger the mean difference, the more scattered the data; the smaller it is, the more concentrated the data

Note, however, that the mean difference weights every deviation equally, so it can understate the effect of outliers in a data set. In that case the standard deviation, which squares each deviation, is more sensitive to extreme values
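The range and the mean difference can be sketched as follows. The two data sets are invented so that they share the same range but differ in how spread out the values are; the function names are just illustrative.

```python
def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

def mean_difference(values):
    """Average absolute distance between each item and the average."""
    average = sum(values) / len(values)
    return sum(abs(v - average) for v in values) / len(values)

# Two hypothetical data sets with the same minimum and maximum.
concentrated = [10, 50, 50, 50, 90]
scattered = [10, 20, 50, 80, 90]

print("ranges:", value_range(concentrated), value_range(scattered))
print("mean differences:", mean_difference(concentrated), mean_difference(scattered))
```

Both lists have the same range, but the second has the larger mean difference because more of its values sit far from the average.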

The standard deviation

Standard deviation is a better indicator of dispersion than mean difference

Standard deviation = square root of ( sum of (each data item - average)^2 / number of data items )

Standard deviation gives a more intuitive sense of the degree of difference, and it is the most commonly used dispersion indicator
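A minimal sketch of the standard deviation as described here, dividing by the number of data items (the population form). The data are made up; `statistics.pstdev` from the Python standard library computes the same quantity and is used as a cross-check.

```python
import math
import statistics

def standard_deviation(values):
    """Square root of the average squared distance from the average."""
    average = sum(values) / len(values)
    squared = sum((v - average) ** 2 for v in values)
    return math.sqrt(squared / len(values))

# Hypothetical data set with an average of 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sd = standard_deviation(data)
sd_library = statistics.pstdev(data)  # population standard deviation

print("standard deviation:", sd)
print("statistics.pstdev:", sd_library)
```

Note that `statistics.stdev` (without the `p`) divides by n - 1 instead of n, so it gives a slightly larger value on the same data.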

Statistical indicators: distribution shape

Distribution shape refers to the shape the data take when plotted

The common shapes are: left-skewed distribution, right-skewed distribution, and normal distribution

Left-skewed distribution: the average lies to the left and the mode (that is, the peak) lies to the right

Right-skewed distribution: the average lies to the right and the mode (that is, the peak) lies to the left

Normal distribution: the average is in the middle and the mode (that is, the peak) is in the middle
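The relationship between the average and the mode can be turned into a rough check of skew direction. This is only a heuristic sketch under the description above, not a formal skewness measure; the function name, the labels, and the data are assumptions made for the example.

```python
import statistics

def skew_direction(values, tolerance=1e-9):
    """Rough label based on where the average sits relative to the mode."""
    average = statistics.mean(values)
    mode = statistics.mode(values)  # the most frequent value (the peak)
    if abs(average - mode) <= tolerance:
        return "symmetric"
    # Average to the right of the peak: long tail on the right, and vice versa.
    return "right-skewed" if average > mode else "left-skewed"

# Hypothetical data sets, each with a single clear peak.
right_tail = [1, 2, 2, 2, 3, 3, 4, 10]
left_tail = [1, 7, 8, 8, 9, 9, 9, 10]
balanced = [1, 2, 3, 3, 3, 4, 5]

for name, data in [("right_tail", right_tail),
                   ("left_tail", left_tail),
                   ("balanced", balanced)]:
    print(name, "->", skew_direction(data))
```

A symmetric result here does not prove the data are normally distributed; it only says the average and the peak coincide.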

Outliers

In introducing the indicators above, one concept kept affecting our judgment: outliers. So how do we identify them?

Outliers, also known as anomalous values, are generally values that deviate greatly or extremely from the average

In general, the criteria for what counts as an outlier depend on the business object being analyzed

For example, cyclical industries such as tourism are divided into a peak season and an off-season, and the data in the peak season can usually reach more than twice that in the off-season. Such data cannot be regarded as outliers

Identifying outliers

For typical business data, outliers are identified by looking at the gap between a value and the rest of the data

In general, the method is to divide each data item by the average. The multiple obtained for an outlier is usually much larger (or much smaller) than the multiples obtained for the other data items, which makes it easy to see which items are outliers.
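The multiple-of-the-average check described above might look like the sketch below. The thresholds `high` and `low` are assumptions chosen for the example; as noted earlier, the right cut-off depends on the business being analyzed. The order counts are invented.

```python
def multiples_of_average(values):
    """Each data item divided by the average of the whole data set."""
    average = sum(values) / len(values)
    return [v / average for v in values]

def flag_outliers(values, high=3.0, low=0.1):
    """Items whose multiple of the average is above `high` or below `low`.

    `high` and `low` are illustrative thresholds, not fixed rules.
    """
    multiples = multiples_of_average(values)
    return [v for v, m in zip(values, multiples) if m > high or m < low]

# Hypothetical daily order counts; the last day looks suspicious.
orders = [100, 110, 95, 105, 90, 1000]

multiples = multiples_of_average(orders)
flagged = flag_outliers(orders)

print("multiples:", [round(m, 2) for m in multiples])
print("flagged:", flagged)
```

Because the outlier itself pulls the average upward, the multiples of the normal values shrink; with several large outliers a more robust baseline may be needed.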

How to handle outliers generally depends on the specific business analysis

Outlier determination

1. For the outliers recorded incorrectly, we can directly modify them to normal data. For example, if the salary data is recorded incorrectly as negative, we can directly modify them

2. For the outliers added incorrectly, we can delete them directly. For example, in the pre-processing, age data was mixed into salary data, so we can delete them directly

3. For correct and true outliers, this needs to be analyzed according to the specific business and determine whether the outliers reflect special events.

For example, in a fund's price chart, a dividend payment can cause a large fluctuation in the trend. If we are analyzing the trend of the fund itself, we should leave this outlier as it is. If we are analyzing the data for quantitative trading, we need to adjust the outlier

In addition, we do not deal with periodic data, such as the tourism data mentioned above

Handling outliers

1. For erroneous data, we can replace the value with a null value or with the sample average

2. For correct and true data, we can adjust according to the actual situation: adjusted value = value * adjustment ratio

For example, in the fund example above, when the fund fell by 8% on the day because of a dividend, we can adjust the subsequent prices to closing price * (1 + 0.08). This is a simple approximation; exactly undoing an 8% drop would use closing price / (1 - 0.08).
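Both handling approaches can be sketched briefly. Everything here is illustrative: the helper names, the rule that a negative salary is an error, and the prices are assumptions for the example. The adjustment ratio follows the (1 + 0.08) figure used in the text.

```python
def replace_errors_with_average(values, is_error):
    """Replace erroneous items with the average of the valid items."""
    valid = [v for v in values if not is_error(v)]
    average = sum(valid) / len(valid)
    return [average if is_error(v) else v for v in values]

def adjust_from(values, ratio, start):
    """Multiply every item from index `start` onward by `ratio`."""
    return [v * ratio if i >= start else v for i, v in enumerate(values)]

# Hypothetical salary records; a negative salary is treated as a recording error.
salaries = [5000, -6000, 7000, 6000]
cleaned = replace_errors_with_average(salaries, lambda v: v < 0)

# Hypothetical fund closing prices; the dividend is paid on day index 2,
# where the price drops by 8%.
closing = [1.00, 1.00, 0.92, 0.93]
adjusted = adjust_from(closing, 1 + 0.08, start=2)

print("cleaned salaries:", cleaned)
print("adjusted prices:", [round(p, 4) for p in adjusted])
```

Replacing with the average keeps the number of records unchanged; using a null value instead leaves the gap visible for later steps to handle.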