Source: Python Data Analysis
### Directory
- DIKW model and data engineering
- Scientific calculation tool Numpy
- Data analysis tool Pandas
- Pandas function applications, hierarchical indexing, and statistical calculations
- Pandas groups and aggregates
- Data cleansing, merging, transforming, and refactoring
- Data sets are split into groups, and each group is then analyzed statistically
- SQL can filter and aggregate data
- Pandas can use groupby for more complex grouping operations
- Group operation procedure: split -> apply -> combine
- Split: the basis for grouping
- Apply: the calculation rule run on each group
- Combine: combine the calculation results of each group
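The three steps can be sketched by hand, without groupby, to make the procedure concrete (a minimal illustration; the column names and data here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'val': [1, 2, 3, 4]})

# split: partition the rows by key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# apply: run a calculation rule on each group
sums = {k: piece['val'].sum() for k, piece in pieces.items()}

# combine: assemble the per-group results
result = pd.Series(sums)
print(result)  # a -> 4, b -> 6

# groupby performs the same three steps in one call
print(df.groupby('key')['val'].sum())
```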
Sample code:
import pandas as pd
import numpy as np
dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
Running results:
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
DataFrameGroupBy, SeriesGroupBy
###1. Grouping
groupby() is used to group data. The GroupBy object performs no actual computation; it only holds intermediate data about the groups.
Group by a column name: obj.groupby('label')
Sample code:
# group the DataFrame by key1
print(type(df_obj.groupby('key1')))
# group the data1 column of the DataFrame by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))
Running results:
<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>
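To see the "intermediate data" that a GroupBy object holds before any calculation runs, you can inspect its groups attribute and pull out one group with get_group() (a small illustrative check added here, not part of the original sample):

```python
import pandas as pd
import numpy as np

df_obj = pd.DataFrame({'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
                       'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'data1': np.random.randn(8),
                       'data2': np.random.randn(8)})

grouped = df_obj.groupby('key1')

# .groups maps each group name to the row labels it contains;
# no aggregation has been computed at this point
print(grouped.groups)  # row labels: a -> 0, 2, 4, 6, 7; b -> 1, 3, 5

# get_group() extracts one group's rows as a DataFrame
print(grouped.get_group('a'))
```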
###2. Grouping operations
Call an aggregation method on the GroupBy object, e.g. mean()
Non-numeric columns are excluded from the aggregation
Sample code:
# Grouping operation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
Running results:
         data1     data2
key1
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64
size() returns the number of elements in each group
Sample code:
# size
print(grouped1.size())
print(grouped2.size())
Running results:
key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64
###3. Group by custom key
obj.groupby(self_def_key)
A custom key can be a list, or multiple lists
obj.groupby(['label1', 'label2']) -> DataFrame with a multi-level index
# group by a custom key (a list)
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())

# group by a custom key (multiple lists)
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# multi-level grouping by multiple columns
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())

# multi-level grouping follows the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())

# unstack converts the multi-index result into a single-level DataFrame
print(grouped3.mean().unstack())
Running results:
0    1
1    1
2    1
3    2
4    1
5    1
7    1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
              data1     data2
key2  key1
one   a    0.174988 -0.110804
      b   -0.214324  0.758372
three a   -1.019229 -1.143825
      b    0.522911  0.630814
two   a    1.428099  0.107465
      b   -0.264616  1.017155
          data1               data2
key1          a         b         a         b
key2
one    0.174988 -0.214324 -0.110804  0.758372
three -1.019229  0.522911 -1.143825  0.630814
two    1.428099 -0.264616  0.107465  1.017155
The GroupBy object supports iteration; each iteration returns a tuple (group_name, group_data), which can be used for group-specific operations.
###1. Single-level grouping
Sample code:
# single-level grouping, by key1
for group_name, group_data in grouped1:
print(group_name)
print(group_data)
Running results:
a
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
b
      data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two
###2. Multi-level grouping
Sample code:
# Multilevel grouping, based on key1 and key2
for group_name, group_data in grouped2:
print(group_name)
print(group_data)
Running results:
('a', 'one')
      data1     data2 key1 key2
0  0.974685 -0.672494    a  one
6 -0.624708  0.450885    a  one
('a', 'three')
      data1     data2 key1   key2
7 -1.019229 -1.143825    a  three
('a', 'two')
      data1     data2 key1 key2
2  1.508838  0.392787    a  two
4  1.347359 -0.177858    a  two
('b', 'one')
      data1     data2 key1 key2
1 -0.214324  0.758372    b  one
('b', 'three')
      data1     data2 key1   key2
3  0.522911  0.630814    b  three
('b', 'two')
      data1     data2 key1 key2
5 -0.264616  1.017155    b  two
GroupBy objects can be converted to lists or dictionaries
Sample code:
# convert the GroupBy object to a list
print(list(grouped1))
# convert the GroupBy object to a dict
print(dict(list(grouped1)))
Running results:
[('a',       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three),
 ('b',       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two)]
{'a':       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three,
 'b':       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two}
###1. Group by column, and by data type
Sample code:
# dtype of each column
print(df_obj.dtypes)
# group columns (axis=1) by data type
print(df_obj.groupby(df_obj.dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())
Running results:
data1 float64
data2 float64
key1 object
key2 object
dtype: object
float64 2
object 2
dtype: int64
    float64  object
0  0.302191    aone
1  0.544048    bone
2  1.901626    atwo
3  1.153725  bthree
4  1.169501    atwo
5  0.752539    btwo
6 -0.173823    aone
7 -2.163054  athree
###2. Other grouping methods
Sample code:
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan
print(df_obj2)
Running results:
   a    b    c    d  e
A  7  2.0  4.0  5.0  8
B  4  NaN  NaN  NaN  1
C  3  2.0  5.0  4.0  6
D  3  1.0  9.0  7.0  3
E  6  1.0  6.0  8.0  1
###3. Group by dictionary
Sample code:
# Group by dictionary
mapping_dict = {'a': 'Python', 'b': 'Python', 'c': 'Java', 'd': 'C', 'e': 'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count())  # number of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())
Running results:
C         1
Java      2
Python    2
dtype: int64
   C  Java  Python
A  1     2       2
B  0     1       1
C  1     2       2
D  1     2       2
E  1     2       2
     C  Java  Python
A  5.0  12.0     9.0
B  NaN   1.0     4.0
C  4.0  11.0     5.0
D  7.0  12.0     4.0
E  8.0   7.0     7.0
###4. Group by function; the function receives the row index or column index as its argument
Sample code:
# group by function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])
#print(df_obj3)

def group_key(idx):
    """idx is the column index or the row index"""
    #return idx
    return len(idx)

print(df_obj3.groupby(group_key).size())
# the custom function above is equivalent to
#df_obj3.groupby(len).size()
Running results:
1 1
2 3
3 1
dtype: int64
###5. Group by index level
Sample code:
# group by index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

# group by the language level
print(df_obj4.groupby(level='language', axis=1).sum())

# group by the index level
print(df_obj4.groupby(level='index', axis=1).sum())
Running results:
language Python Java Python Java Python
index A A B C B
0 2 7 8 4 3
1 5 2 6 1 2
2 6 4 4 5 2
3 4 7 4 3 1
4 7 4 3 4 8
language Java Python
0 11 13
1 3 13
2 9 12
3 10 9
4 8 18
index A B C
0 9 11 4
1 7 8 1
2 10 6 5
3 11 5 3
4 11 11 4
## Aggregation
- The process of producing a scalar from an array, e.g. mean(), count(), etc.
- Used to compute statistics on grouped data
Sample code:
dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
Running results:
data1 data2 key1 key2
0 3 7 a one
1 1 5 b one
2 7 4 a two
3 2 4 b three
4 6 4 a two
5 9 9 b two
6 3 5 a one
7 8 4 a three
###1. Built-in aggregate functions
sum(), mean(), max(), min(), count(), size(), describe()
Sample code:
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
Running results:
      data1  data2
key1
a        27     24
b        12     18
      data1  data2 key2
key1
a         8      7  two
b         9      9  two
      data1  data2 key2
key1
a         3      4  one
b         1      4  one
      data1  data2
key1
a       5.4    4.8
b       4.0    6.0
key1
a    5
b    3
dtype: int64
      data1  data2  key2
key1
a         5      5     5
b         3      3     3
               data1     data2
key1
a    count  5.000000  5.000000
     mean   5.400000  4.800000
     std    2.302173  1.303840
     min    3.000000  4.000000
     25%    3.000000  4.000000
     50%    6.000000  4.000000
     75%    7.000000  5.000000
     max    8.000000  7.000000
b    count  3.000000  3.000000
     mean   4.000000  6.000000
     std    4.358899  2.645751
     min    1.000000  4.000000
     25%    1.500000  4.500000
     50%    2.000000  5.000000
     75%    5.500000  7.000000
     max    9.000000  9.000000
###2. Custom functions, passed to the agg() method
grouped.agg(func)
The parameter func receives the records of each group
Sample code:
# custom aggregate function
def peak_range(df):
    """Return the value range (max - min)"""
    #print(type(df))
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))

# equivalent, using a lambda
print(df_obj.groupby('key1').agg(lambda df: df.max() - df.min()))
Running results:
      data1  data2
key1
a         5      3
b         8      5
         data1     data2
key1
a     2.528067  1.594711
b     0.787527  0.386341
###3. Apply multiple aggregate functions at once, using a list of functions
Sample code:
# apply multiple aggregate functions at once
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))  # the default column name is the function name
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))  # provide new column names via tuples
Running results:
         data1                                 data2
          mean       std count peak_range      mean       std count peak_range
key1
a     0.437389  1.174151     5   2.528067 -0.230101  0.686488     5   1.594711
b     0.014657  0.440878     3   0.787527  0.802114  0.196850     3   0.386341
         data1                            data2
          mean       std count     range   mean       std count     range
key1
a     0.437389  1.174151     5  2.528067 -0.230101  0.686488     5  1.594711
b     0.014657  0.440878     3  0.787527  0.802114  0.196850     3  0.386341
###4. Apply different aggregate functions to different columns, using dict
Sample code:
# apply a different aggregate function to each column
dict_mapping = {'data1': 'mean', 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1': ['mean', 'max'], 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
Running results:
         data1     data2
key1
a     0.437389 -1.150505
b     0.014657  2.406341
         data1               data2
          mean       max       sum
key1
a     0.437389  1.508838 -1.150505
b     0.014657  0.522911  2.406341
###5. Common built-in aggregate functions
## Group-wise operations on data
import pandas as pd
import numpy as np
dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
# after grouping by key1, compute sums of data1 and data2, and append them to the original table with a column-name prefix
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
print(k1_sum)
Running results:
data1 data2 key1 key2
0 5 1 a one
1 7 8 b one
2 1 9 a two
3 2 6 b three
4 9 8 a two
5 8 3 b two
6 3 5 a one
7 8 3 a three
sum_data1 sum_data2
key1
a 26 26
b 17 17
Aggregation changes the shape of the original data; how can the original shape be preserved?
###1. merge
Join the aggregated results back onto the original data with merge; this is relatively cumbersome
Sample code:
# method 1: merge
k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
print(k1_sum_merge)
Running results:
data1 data2 key1 key2 sum_data1 sum_data2
0 5 1 a one 26 26
2 1 9 a two 26 26
4 9 8 a two 26 26
6 3 5 a one 26 26
7 8 3 a three 26 26
1 7 8 b one 17 17
3 2 6 b three 17 17
5 8 3 b two 17 17
###2. transform
The result of transform keeps the same shape as the original data,
e.g. grouped.transform(np.sum)
Sample code:
# method 2: transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)
Running results:
   data1  data2 key1   key2  sum_data1  sum_data2           sum_key2
0      5      1    a    one         26         26  onetwotwoonethree
1      7      8    b    one         17         17        onethreetwo
2      1      9    a    two         26         26  onetwotwoonethree
3      2      6    b  three         17         17        onethreetwo
4      9      8    a    two         26         26  onetwotwoonethree
5      8      3    b    two         17         17        onethreetwo
6      3      5    a    one         26         26  onetwotwoonethree
7      8      3    a  three         26         26  onetwotwoonethree
Sample code:
# pass a custom function to transform
def diff_mean(s):
    """Return the difference between each value and the group mean"""
    return s - s.mean()
print(df_obj.groupby('key1').transform(diff_mean))
Running results:
      data1     data2  sum_data1  sum_data2
0 -0.200000 -4.200000          0          0
1  1.333333  2.333333          0          0
2 -4.200000  3.800000          0          0
3 -3.666667  0.333333          0          0
4  3.800000  2.800000          0          0
5  2.333333 -2.666667          0          0
6 -2.200000 -0.200000          0          0
7  2.800000 -2.200000          0          0
## groupby.apply(func)
func is called separately on each group, and the final results are assembled together via pd.concat (data merging)
Sample code:
import pandas as pd
import numpy as np
dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
    """Return the top n records of each group, sorted by column"""
    return df.sort_values(by=column, ascending=False)[:n]
print(df_data.groupby('LeagueIndex').apply(top_n))
Running results:
LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex
1 2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
2 3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
3 1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
4 2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
5 3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
6 734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
7 3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
8 3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518
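As noted above, apply stitches the per-group results together with pd.concat. A rough manual equivalent (an illustrative sketch on made-up data, not the library's actual implementation) would be:

```python
import pandas as pd

df = pd.DataFrame({'grp': [1, 1, 2, 2],
                   'APM': [10.0, 30.0, 20.0, 40.0]})

def top_n(df, n=1, column='APM'):
    """Return the top n rows of a group, sorted by column."""
    return df.sort_values(by=column, ascending=False)[:n]

# roughly what groupby(...).apply(top_n) does:
# call the function on each group, then concatenate the pieces
pieces = [top_n(group) for _, group in df.groupby('grp')]
manual = pd.concat(pieces)
print(manual)

# the apply version produces the same rows
print(df.groupby('grp', group_keys=False).apply(top_n))
```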
###1. A hierarchical index is produced: the outer level is the group name, the inner level is the original row index of df_data
Sample code:
# extra arguments passed to apply are forwarded to the custom function
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))
Running results:
LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex
1 3146 1 40.0 12.0 150.0 38.5590
3040 1 39.0 10.0 500.0 29.8764
2 920 2 43.0 10.0 730.0 86.0586
2437 2 41.0 4.0 200.0 54.2166
3 1258 3 41.0 14.0 800.0 77.6472
2972 3 40.0 10.0 500.0 60.5970
4 1696 4 44.0 6.0 500.0 89.5266
1729 4 39.0 8.0 500.0 86.7246
5 202 5 37.0 14.0 800.0 327.7218
2745 5 37.0 18.0 1000.0 123.4098
6 3069 6 31.0 8.0 800.0 133.1790
2706 6 31.0 8.0 700.0 66.9918
7 2813 7 26.0 36.0 1300.0 188.5512
1992 7 26.0 24.0 1000.0 219.6690
8 3340 8 NaN NaN NaN 189.7404
3341 8 NaN NaN NaN 287.8128
###2. Disable the hierarchical index with group_keys=False
Sample code:
print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))
Running results:
LeagueIndex Age HoursPerWeek TotalHours APM
2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518
apply can also be used to fill missing data differently in each group, e.g. filling NaNs with that group's mean.
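A minimal sketch of per-group mean filling (the small frame here is made up for the illustration; transform is used because it guarantees the result keeps the original shape, and apply with group_keys=False behaves similarly):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, np.nan, 3.0, np.nan]})

# fill each group's NaNs with that group's own mean,
# rather than the mean of the whole column
df['val'] = df.groupby('key')['val'].transform(lambda s: s.fillna(s.mean()))
print(df)  # group a filled with 1.0, group b filled with 3.0
```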