What are statistical indicators
Concepts and values that represent the quantitative characteristics of a population
Statistical indicators can vary depending on the purpose of the data analysis
Analyzing hiring data: skills, salary, years of experience
Analyzing user conversion: site views, landing pages, bounce rates
Analyzing financial products: past performance, risk coefficient, annualized return
Statistical indicators are divided into two categories according to the content they embody: total indicators and relative indicators
Total indicator
An indicator that describes the total size, level, or amount of work under specific conditions
GDP, total sales, total population
Relative indicator
An indicator that describes the relationship between values rather than the overall scale
It is the ratio of two related values
Proportion: each part / total, expressed as a percentage
Ratio: data item : data item
Multiple: one value divided by another, used to highlight the size of a rise or growth
Sequential growth rate (short term): (current period value - previous period value) / previous period value * 100%
Year-on-year growth rate (long term): (current period value - value for the same period last year) / value for the same period last year * 100%
Note: the sequential growth rate focuses on short-term growth, while the year-on-year growth rate focuses on long-term growth
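The proportion and growth-rate formulas above can be sketched in Python as follows. The function names and the sales figures are made up for illustration; this is a minimal sketch, not a definitive implementation.

```python
def growth_rate(current, base):
    """Return (current - base) / base * 100, i.e. growth as a percentage."""
    if base == 0:
        raise ValueError("the base period value must be non-zero")
    return (current - base) / base * 100


def proportion(part, total):
    """Return part / total * 100, i.e. the share as a percentage."""
    if total == 0:
        raise ValueError("the total must be non-zero")
    return part / total * 100


# Hypothetical monthly sales figures (illustrative only).
sales_this_month = 120
sales_last_month = 100
sales_same_month_last_year = 80

# Sequential: compare with the previous period.
sequential = growth_rate(sales_this_month, sales_last_month)
# Year-on-year: compare with the same period of the previous year.
year_on_year = growth_rate(sales_this_month, sales_same_month_last_year)

print(f"sequential growth: {sequential:.1f}%")
print(f"year-on-year growth: {year_on_year:.1f}%")
print(f"share of 30 in 120: {proportion(30, 120):.1f}%")
```

The same function serves both growth rates because only the base period changes.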
In addition to the above two categories, there are three more kinds of statistical indicators worth learning: central trend indicators, discrete trend indicators and distribution patterns
Statistical indicators: Central trend indicator – average indicator
The average
An indicator that uses a single number to show the general level of a population is an average indicator, also known as a central tendency indicator. The most commonly used central tendency indicator is the average
Average = sum of all data/number of data
Abnormal data often appears in everyday data processing. A value that is very large or very small pulls the average toward it, so the average is sensitive to outliers and can be misleading to a certain extent
For example, averaged together with a few extremely high earners, my monthly salary could come out at over 100 million yuan.
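The effect of a single extreme value on the average can be seen in a short Python sketch. The salary figures are invented for illustration.

```python
from statistics import mean

# Hypothetical monthly salaries in yuan (illustrative only).
salaries = [8_000, 9_000, 10_000, 11_000, 12_000]

# The same group after one extremely high earner is added.
with_outlier = salaries + [1_000_000_000]

print(mean(salaries))      # average of the ordinary salaries
print(mean(with_outlier))  # average once the extreme value is included
```

The five ordinary salaries average 10,000 yuan, yet one extreme value is enough to push the group average above 100 million.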
Because of that, there is another indicator that needs to be looked at: the median
The median
The median is the middle value once the data has been sorted
Odd number of data items (n): the value at position (n+1) / 2 is the median
Even number of data items: the median is the sum of the two middle values / 2
When outliers are present, the median is more representative than the average
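The odd and even rules above can be written out directly. The helper name `median_by_position` is hypothetical; Python's standard `statistics.median` does the same job and is used here only as a cross-check.

```python
from statistics import median


def median_by_position(values):
    """Median following the odd/even position rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("cannot take the median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: position (n + 1) / 2 counted from 1 is index n // 2 counted from 0.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical salaries with one extreme value (illustrative only).
salaries = [8_000, 9_000, 10_000, 11_000, 12_000, 1_000_000_000]

print(median_by_position(salaries))
print(median(salaries))  # standard-library result for comparison
```

Unlike the average, the median of this list stays close to the ordinary salaries even with the extreme value included.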
The mode
The mode is the value that occurs most often. It reflects local characteristics and where the data is most concentrated
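A minimal sketch of finding the mode with `collections.Counter`. Returning every tied value is a choice made here, not something the text specifies, and the function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest count, in first-seen order."""
    counts = Counter(values)
    if not counts:
        raise ValueError("cannot take the mode of empty data")
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Hypothetical shoe sizes sold in one day (illustrative only).
sizes_sold = [40, 41, 41, 42, 42, 42, 43]

print(modes(sizes_sold))
print(modes([1, 2, 2, 3, 3]))  # two values tie for the highest count
```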
Statistical indicators: Discrete trend indicator
Now that we have covered the central trend indicator, what is the discrete trend indicator?
The discrete trend indicator (a measure of dispersion) reflects the degree of difference within the data. It has three main types: range, mean difference and standard deviation
Range
Range represents the greatest variation within the data
Range = maximum minus minimum
But the range depends only on the two extreme values, so it does not show how the data actually varies internally. For the actual variation within the data we use the mean difference
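A short sketch of the range calculation, using two made-up data sets that have the same range even though their values are spread very differently.

```python
def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)


# Two hypothetical data sets (illustrative only).
clustered = [0, 50, 50, 50, 100]   # most values sit in the middle
spread_out = [0, 0, 50, 100, 100]  # most values sit at the extremes

print(value_range(clustered))
print(value_range(spread_out))
```

Both calls return 100, which is why the range alone cannot describe the variation inside the data.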
Mean difference
The mean difference represents the average distance between each data item and the mean
Mean difference = sum of |each data item - mean| / number of data items
The greater the difference between the data items and the mean, the more scattered the data, and vice versa
Note, however, that outliers in a data set can easily make the mean difference misleading. In that case the standard deviation is the better choice, because it is more sensitive to values far from the mean
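The mean difference formula can be sketched as follows. The two data sets are made up and share the same range of 100; the function name is hypothetical.

```python
from statistics import mean


def mean_difference(values):
    """Sum of |each data item - mean| divided by the number of data items."""
    center = mean(values)
    return sum(abs(value - center) for value in values) / len(values)


# Two hypothetical data sets with the same range (illustrative only).
clustered = [0, 50, 50, 50, 100]
spread_out = [0, 0, 50, 100, 100]

print(mean_difference(clustered))   # 20.0
print(mean_difference(spread_out))  # 40.0
```

The second data set has twice the mean difference of the first, which matches the intuition that its values are more scattered.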
The standard deviation
Standard deviation is a better indicator of dispersion than mean difference
Standard deviation = square root of (sum of (each data item - mean)^2 / number of data items)
Standard deviation can be used to more intuitively understand the degree of difference, and it is the most commonly used discrete indicator
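A sketch of the standard deviation as described above, dividing by the number of data items. This is the population form, which matches `statistics.pstdev` in Python's standard library; the sample form, `statistics.stdev`, divides by n - 1 instead. The data values are invented.

```python
from math import sqrt
from statistics import mean, pstdev


def standard_deviation(values):
    """Square root of the average squared distance from the mean."""
    center = mean(values)
    squared = sum((value - center) ** 2 for value in values)
    return sqrt(squared / len(values))


# Hypothetical data set (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(standard_deviation(data))  # 2.0
print(pstdev(data))              # standard-library result for comparison
```

Squaring the distances gives values far from the mean more weight than they get in the mean difference.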
Statistical indicators: Distribution pattern
The distribution pattern is the shape the data takes when it is plotted
The common shapes are: left-skewed distribution, right-skewed distribution, normal distribution
Left-skewed distribution: the mean lies to the left and the mode (that is, the peak) lies to the right
Right-skewed distribution: the mean lies to the right and the mode (that is, the peak) lies to the left
Normal distribution: the mean is in the middle and the mode (that is, the peak) is in the middle
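The rule above, which compares the position of the mean with the position of the mode, can be turned into a rough check. This is a crude heuristic and not a formal skewness measure: the function name, the tolerance parameter and the data sets are all assumptions made for this sketch. It only makes sense for data with repeated values and a single clear peak.

```python
from statistics import mean, mode


def describe_shape(values, tolerance=1e-9):
    """Label the shape by where the mean sits relative to the mode (the peak)."""
    gap = mean(values) - mode(values)
    if gap > tolerance:
        return "right-skewed"  # mean lies to the right of the peak
    if gap < -tolerance:
        return "left-skewed"   # mean lies to the left of the peak
    return "roughly symmetric"


# Hypothetical data sets (illustrative only).
long_right_tail = [1, 2, 2, 2, 3, 3, 4, 10]
long_left_tail = [1, 7, 8, 8, 9, 9, 9, 10]
balanced = [1, 2, 2, 3, 3, 3, 4, 4, 5]

print(describe_shape(long_right_tail))
print(describe_shape(long_left_tail))
print(describe_shape(balanced))
```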
Outliers
Every indicator introduced above can be distorted by one thing that affects our judgment: outliers. So how do we identify them?
Outliers, also known as anomalous values, are generally values that deviate greatly or extremely from the average
As mentioned above, the criteria for what counts as an outlier generally depend on the business object being analyzed
For example, cyclical industries such as tourism are divided into a peak season and an off-season, and the data in the peak season can usually reach more than twice that in the off-season. Such data cannot be regarded as outliers
Identifying outliers
For general business data, outliers are identified by looking at the gap between a data item and the overall data
The usual method is to calculate each data item as a multiple of the average. For an outlier this multiple is much larger (or much smaller) than it is for the other data items, so it is easy to see which data items are outliers.
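The multiple-of-the-average method can be sketched like this. The thresholds `high=3.0` and `low=0.2` are arbitrary assumptions for illustration, since the text does not give a cutoff, and the function names and order counts are made up. The sketch also assumes the average is positive.

```python
from statistics import mean


def multiples_of_mean(values):
    """Express each data item as a multiple of the average."""
    center = mean(values)
    if center == 0:
        raise ValueError("the average is zero, so multiples are undefined")
    return [value / center for value in values]


def flag_outliers(values, high=3.0, low=0.2):
    """Return items whose multiple is above `high` or below `low` (assumed cutoffs)."""
    flagged = []
    for value, multiple in zip(values, multiples_of_mean(values)):
        if multiple > high or multiple < low:
            flagged.append(value)
    return flagged


# Hypothetical daily order counts (illustrative only).
daily_orders = [100, 110, 95, 105, 90, 100, 1000]

for value, multiple in zip(daily_orders, multiples_of_mean(daily_orders)):
    print(f"{value:>5} is {multiple:.2f}x the average")
print("flagged:", flag_outliers(daily_orders))
```

Because the outlier itself inflates the average, the ordinary values all land well below 1x. In practice the cutoffs would need to be chosen for the specific business data.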
How to handle outliers generally depends on the specific business analysis
Outlier determination
1. Outliers that were recorded incorrectly can be corrected directly. For example, if a salary was recorded as a negative number, we can simply fix it
2. Outliers that were added by mistake can be deleted directly. For example, if age data was mixed into the salary data during pre-processing, we can simply delete it
3. Outliers that are correct and true need to be analyzed according to the specific business, to determine whether they reflect special events.
For example, in the trend chart of a fund, a dividend can cause a large fluctuation in the fund's price. If we are analyzing the trend of the fund, we should leave this outlier as it is. If we are analyzing the data for quantitative trading, we need to adjust the outlier
In addition, we do not adjust periodic data, such as the tourism data mentioned above
Handling outliers
1. For erroneous data, we can replace the value with a null value or with the sample average
2. For correct and true data, we can adjust according to the actual situation: adjusted value = value * adjustment ratio
For example, in the fund example above, when the fund fell by 8% on the day because of a dividend, we can adjust each subsequent price to closing price * (1 + 0.08).
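Both handling approaches can be sketched in a few lines. The function names, the error test (a negative salary) and the price series are assumptions made for this example.

```python
from statistics import mean


def fill_errors_with_mean(values, is_error):
    """Replace erroneous items with the average of the valid items."""
    valid = [value for value in values if not is_error(value)]
    if not valid:
        raise ValueError("no valid data left to average")
    replacement = mean(valid)
    return [replacement if is_error(value) else value for value in values]


def adjust_from(prices, start, ratio):
    """Multiply every price from index `start` onward by `ratio`."""
    return [price * ratio if i >= start else price
            for i, price in enumerate(prices)]


# 1. Hypothetical salary data where a negative value is a recording error.
salaries = [8_000, -9_000, 10_000, 12_000]
print(fill_errors_with_mean(salaries, lambda value: value < 0))

# 2. Hypothetical fund closing prices; the dividend drop of 8% happens at index 2.
closing_prices = [1.00, 1.00, 0.92, 0.93]
print(adjust_from(closing_prices, start=2, ratio=1 + 0.08))
```

Multiplying by (1 + 0.08) follows the text and only approximately offsets an 8% drop, since 0.92 * 1.08 = 0.9936. A ratio of 1 / (1 - 0.08) would restore the pre-dividend level exactly.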