This is the fifth day of my participation in the August More Text Challenge.
Hello everyone, I am the talented brother.
Besides cleaning and filtering, data processing often involves statistical calculations. This article introduces some commonly used statistical functions in pandas.
1. Preview data
The demo data in this article is the gross regional product (GDP) of each region over the last 5 years, from the national data center. Reply "GDP" in the background to get the data file, so you can try it yourself.
In [1]: df.head()  # preview the first 5 rows
Out[1]:
           Region     2020     2019     2018     2017     2016
0         Beijing  36102.6  35445.1  33106.0  29883.0  27041.2
1         Tianjin  14083.7  14055.5  13362.9  12450.6  11477.2
2           Hebei  36206.9  34978.6  32494.6  30640.8  28474.1
3          Shanxi  17651.9  16961.6  15958.1  14484.3  11946.4
4  Inner Mongolia  17359.8  17212.5  16140.8  14898.1  13789.3
In [2]: df.info() # check the data type of each field, the number of items and the number of null values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Region  32 non-null     object
 1   2020    31 non-null     float64
 2   2019    31 non-null     float64
 3   2018    31 non-null     float64
 4   2017    31 non-null     float64
 5   2016    31 non-null     float64
dtypes: float64(5), object(1)
memory usage: 1.6+ KB
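Since the data file itself is not included here, a minimal way to follow along is to rebuild the previewed rows by hand. This sketch uses only the five rows and values shown in the preview above (the real file has 32 rows); the column name `Region` is an English stand-in for the original field:

```python
import pandas as pd

# Rebuild the 5 previewed rows by hand (values copied from the preview above);
# the real file obtained from the backend has 32 rows.
df = pd.DataFrame({
    "Region": ["Beijing", "Tianjin", "Hebei", "Shanxi", "Inner Mongolia"],
    "2020": [36102.6, 14083.7, 36206.9, 17651.9, 17359.8],
    "2019": [35445.1, 14055.5, 34978.6, 16961.6, 17212.5],
    "2018": [33106.0, 13362.9, 32494.6, 15958.1, 16140.8],
})

print(df.head())
df.info()  # column dtypes and non-null counts
```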
2. Descriptive statistics
The describe method returns descriptive statistics for a data set:
Signature:
df.describe(
    percentiles=None,
    include=None,
    exclude=None,
    datetime_is_numeric=False,
) -> 'FrameOrSeries'
Docstring:
Generate descriptive statistics.
For a DataFrame, each row of the result corresponds to one statistic: count, mean, standard deviation, minimum, quartiles (default: 25/50/75), and maximum.
In [3]: df.describe()
Out[3]:
                2020           2019          2018          2017          2016
count      31.000000      31.000000     31.000000     31.000000     31.000000
mean    32658.551613   31687.758065  29487.661290  26841.819355  24224.148387
std     26661.811640   25848.652250  24136.181387  22161.575235  20008.278500
min      1902.700000    1697.800000   1548.400000   1349.000000   1173.000000
25%     13940.650000   13826.300000  13104.700000  12381.800000  11634.800000
50%     25115.000000   24667.300000  22716.500000  20210.800000  18388.600000
75%     42612.500000   41110.350000  37508.750000  33835.250000  30370.250000
max    110760.900000  107986.900000  99945.200000  91648.700000  82163.200000
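To see exactly which statistics describe produces, here is a quick check on a toy Series (made-up numbers, not the GDP data):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
d = s.describe()

# The default result always contains these 8 statistics, in this order
print(list(d.index))  # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(d["mean"])      # 2.5
```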
From the descriptive statistics table above, we can see that in 2020 there were 31 regions with data, with an average GDP of 3.26 trillion yuan, a maximum of 11.07 trillion yuan, and a minimum of 0.19 trillion yuan.
As you can see, there are also parameters that can be customized:
percentiles can be set to specify custom quantiles
In [4]: df.describe(percentiles=[.2, .4, .6, .8])
Out[4]:
                2020           2019          2018          2017          2016
count      31.000000      31.000000     31.000000     31.000000     31.000000
mean    32658.551613   31687.758065  29487.661290  26841.819355  24224.148387
std     26661.811640   25848.652250  24136.181387  22161.575235  20008.278500
min      1902.700000    1697.800000   1548.400000   1349.000000   1173.000000
20%     13698.500000   13544.400000  12809.400000  11159.900000  10427.000000
40%     22156.700000   21237.100000  19627.800000  17790.700000  16116.600000
50%     25115.000000   24667.300000  22716.500000  20210.800000  18388.600000
60%     36102.600000   34978.600000  32494.600000  29676.200000  26307.700000
80%     43903.900000   45429.000000  42022.000000  37235.000000  33138.500000
max    110760.900000  107986.900000  99945.200000  91648.700000  82163.200000
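The effect of percentiles is easy to verify on a toy Series. Note that the median (50%) is always included in the result even if you do not request it:

```python
import pandas as pd

s = pd.Series(range(1, 101))  # 1..100
d = s.describe(percentiles=[.2, .8])

# The requested quantiles appear as '20%' and '80%';
# '50%' is always added automatically
print(d["20%"], d["50%"], d["80%"])
```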
include and exclude specify which data types to include and exclude, respectively. For example:
df.describe(include=[np.number])   # include only numeric columns
df.describe(exclude=[np.float64])  # exclude floating-point columns
By default, describe only covers numeric columns, so the Region field is left out. If you want all columns to participate, specify include='all'.
In [5]: df.describe(include='all')
Out[5]:
         Region       2020       2019      2018      2017      2016
count        32      31.00      31.00     31.00     31.00     31.00
unique       32        NaN        NaN       NaN       NaN       NaN
top     Beijing        NaN        NaN       NaN       NaN       NaN
freq          1        NaN        NaN       NaN       NaN       NaN
...         ...        ...        ...       ...       ...       ...
25%         NaN   13940.65   13826.30  13104.70  12381.80  11634.80
50%         NaN   25115.00   24667.30  22716.50  20210.80  18388.60
75%         NaN   42612.50   41110.35  37508.75  33835.25  30370.25
max         NaN  110760.90  107986.90  99945.20  91648.70  82163.20

[11 rows x 6 columns]
In this data set, the Region field is of object type, not numeric. In the descriptive statistics result it adds three new indicators, unique, top and freq, which are not produced for purely numeric columns. They correspond to the number of distinct values, the most frequent value, and its frequency, as in the following small example:
In [6]: s = pd.Series(['red', 'blue', 'black', 'grey', 'red', 'grey'])
In [7]: s.describe()
Out[7]:
count 6
unique 4
top red
freq 2
dtype: object
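For object columns, top and freq are simply the head of value_counts(); a quick way to confirm this on the same color Series (note that with a tie, 'red' and 'grey' both appear twice, so which one is reported as top is not guaranteed):

```python
import pandas as pd

s = pd.Series(["red", "blue", "black", "grey", "red", "grey"])
d = s.describe()

vc = s.value_counts()
# unique is the number of distinct values, freq the count of the most frequent one
print(d["unique"])  # 4
print(d["freq"])    # 2
print(d["top"])     # either 'red' or 'grey' (tied at 2 occurrences)
```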
In describe, there is another parameter, datetime_is_numeric, which must be set to True to get numeric-style statistics for datetime types.
In [8]: s = pd.Series([np.datetime64("2000-01-01"),
...: np.datetime64("2010-01-01"),
...: np.datetime64("2010-01-01")
...: ])
In [9]: s.describe()
FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
s.describe()
Out[9]:
count 3
unique 2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object
In [10]: s.describe(datetime_is_numeric=True)
Out[10]:
count 3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
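One caveat (an addition beyond the original text): datetime_is_numeric exists in pandas 1.x only; in pandas 2.0+ the parameter was removed and datetime data is described numerically by default. A version-tolerant sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.datetime64("2000-01-01"),
               np.datetime64("2010-01-01"),
               np.datetime64("2010-01-01")])

# On pandas 2.0+ this already produces the numeric-style summary;
# on 1.x, pass datetime_is_numeric=True to get the same behavior.
d = s.describe()
print(d)
```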
In daily data processing, besides the dimensions covered by descriptive statistics, we also use some other statistical calculations, such as variance, mode and so on.
3. Statistical calculations
Here we demonstrate commonly used statistical functions. By default, statistics are computed by column; we can also compute by row via the axis parameter, as shown below.
# Maximum
In [11]: df.max(numeric_only=True)
Out[11]:
2020    110760.9
2019    107986.9
2018     99945.2
2017     91648.7
2016     82163.2
dtype: float64
# Minimum
In [12]: df.min(numeric_only=True)
Out[12]:
2020    1902.7
2019    1697.8
2018    1548.4
2017    1349.0
2016    1173.0
dtype: float64
# Mean (for statistical calculations it is recommended to restrict to numeric data via numeric_only; the direction can be set via axis, default is by column)
In [13]: df.mean(axis=1, numeric_only=True)
Out[13]:
0     32315.58
1     13085.98
2     32559.00
3     15400.46
        ...
28     2683.66
29     3432.18
30    12198.96
31         NaN
Length: 32, dtype: float64
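The difference between the default column-wise calculation and axis=1 is easiest to see on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

col_means = df.mean()        # by column (axis=0, the default)
row_means = df.mean(axis=1)  # by row

print(col_means.tolist())  # [1.5, 3.5]
print(row_means.tolist())  # [2.0, 3.0]
```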
The following functions are not demonstrated individually; only their purposes are introduced. When using them, pay attention to the original data types: non-numeric types may raise errors.
df.sum()          # sum
df.corr()         # correlation coefficients
df.cov()          # covariance
df.count()        # non-null count
df.abs()          # absolute value
df.median()       # median
df.mode()         # mode
df.std()          # standard deviation
df.var()          # unbiased variance
df.sem()          # standard error of the mean
df.mad()          # mean absolute deviation
df.prod()         # product
df.cumprod()      # cumulative product
df.cumsum()       # cumulative sum
df.nunique()      # number of distinct values
df.idxmax()       # index label of the maximum (like argmax)
df.idxmin()       # index label of the minimum
df.sample(5)      # randomly sample 5 rows
df.skew()         # sample skewness (3rd moment)
df.kurt()         # sample kurtosis (4th moment)
df.quantile()     # sample quantile
df.rank()         # rank
df.pct_change()   # rate of change
df.value_counts() # distinct values and their counts
s.argmax()        # integer position of the maximum (Series only, not DataFrame)
s.argmin()        # integer position of the minimum (Series only, not DataFrame)
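A few of the functions above, exercised on a small made-up Series:

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])

print(s.nunique())          # 3 distinct values
print(s.idxmax())           # index label of the (first) maximum -> 0
print(s.argmax())           # integer position of the maximum -> 0
print(s.cumsum().tolist())  # running total: [3, 4, 7, 9]
print(s.mode().tolist())    # most frequent value(s): [3]
```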
In fact, every function has other parameters that can make it more powerful; you can try them out on your own. Here are a few examples.
>>> s = pd.Series([90, 91, 85])
>>> s
0 90
1 91
2 85
dtype: int64
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
>>> s.pct_change(periods=2)  # change rate over every 2 rows (default: 1 row)
0 NaN
1 NaN
2 -0.055556
dtype: float64
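pct_change computes (current - previous) / previous, so the values above can be checked by hand:

```python
import pandas as pd

s = pd.Series([90, 91, 85])
r = s.pct_change()

# Row 0 has no predecessor, so it is NaN;
# row 1 is (91 - 90) / 90, row 2 is (85 - 91) / 91
print(r.tolist())
```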
In addition to these functions, the following are also commonly used.
# The 5 rows with the largest values in a given column
In [14]: df.nlargest(5, columns='2020')
Out[14]:
       Region      2020      2019     2018     2017     2016
18  Guangdong  110760.9  107986.9  99945.2  91648.7  82163.2
9     Jiangsu  102719.0   98656.8  93207.6  85869.8  77350.9
14   Shandong   73129.0   70540.5  66648.9  63012.1  58762.5
10   Zhejiang   64613.3   62462.0  58002.8  52403.1  47254.0
15      Henan   54997.1   53717.8  49935.9  44824.9  40249.3
# The 5 rows with the smallest values in a given column
In [15]: df.nsmallest(5, columns='2020')
Out[15]:
     Region    2020    2019    2018    2017    2016
25    Tibet  1902.7  1697.8  1548.4  1349.0  1173.0
28  Qinghai  3005.9  2941.1  2748.0  2465.1  2258.2
29  Ningxia  3920.5  3748.5  3510.2  3200.3  2781.4
20   Hainan  5532.4  5330.8  4910.7  4497.5  4090.2
27    Gansu  9016.7  8718.3  8104.1  7336.7  6907.9
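A minimal nlargest/nsmallest sketch on a toy frame (hypothetical region names and values, not the GDP data):

```python
import pandas as pd

df = pd.DataFrame({"Region": ["A", "B", "C", "D"],
                   "2020": [3.0, 1.0, 4.0, 2.0]})

top2 = df.nlargest(2, columns="2020")      # the 2 rows with the largest 2020 values
bottom2 = df.nsmallest(2, columns="2020")  # the 2 rows with the smallest

print(top2["Region"].tolist())     # ['C', 'A']
print(bottom2["Region"].tolist())  # ['B', 'D']
```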
Addition, subtraction, multiplication and division
You can use operator symbols or the equivalent function methods, and pass a scalar, a DataFrame or a Series.
'''
Among flexible wrappers (`add`, `sub`, `mul`, `div`, `mod`, `pow`)
to arithmetic operators: `+`, `-`, `*`, `/`, `%`, `**`.
'''
df + 1
df.add(1)
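The operator form and the method form give identical results; the method form additionally accepts a fill_value for missing data (an extra not shown above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [np.nan, 4.0]})

# Operator and method are equivalent
same = df.add(1).equals(df + 1)
print(same)  # True

# fill_value substitutes for missing values before the operation
filled = df["y"].add(10, fill_value=0)
print(filled.tolist())  # [10.0, 14.0]
```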
That's all for this time. Interested readers can run the code to try it out, or add the author's WeChat to chat!