Source: Python Data Analysis

### Directory

  • DIKW model and data engineering
  • Scientific calculation tool Numpy
  • Data analysis tool Pandas
  • Pandas function applications, hierarchical indexing, and statistical calculations
  • Pandas groups and aggregates
  • Data cleaning, merging, transforming, and reshaping
  • Grouping data sets and computing statistics on each group

  • SQL can filter and aggregate data

  • Pandas can use groupby for more complex grouping operations

  • Group operation procedure: split -> apply -> combine

    • Split: the basis on which the data is grouped

    • Apply: the calculation rule run on each group

    • Combine: the per-group results merged into a final result
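The split -> apply -> combine steps above can also be done by hand with a plain loop; a minimal sketch (the toy DataFrame and the names `pieces`, `sums`, `result` are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'val': [1, 2, 3, 4]})

# Split: partition the rows according to the grouping key
pieces = {name: group for name, group in df.groupby('key')}

# Apply: run the calculation rule on each group
sums = {name: group['val'].sum() for name, group in pieces.items()}

# Combine: merge the per-group results into one object
result = pd.Series(sums)
print(result)

# groupby performs all three steps in a single call
print(df.groupby('key')['val'].sum())
```

Both prints give a -> 4 and b -> 6; groupby simply packages the three steps into one call.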

Sample code:

import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

Running results:

      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three

## GroupBy objects: DataFrameGroupBy and SeriesGroupBy

### 1. Grouping

groupby() is used to group data. A GroupBy object performs no actual computation; it only holds intermediate information about the groups.

Group by column name: obj.groupby('label')

Sample code:

# Group the DataFrame by key1
print(type(df_obj.groupby('key1')))

# Group the data1 column of the DataFrame by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))

Running results:

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>

### 2. Group operations

Run group operations (single or multiple) on a GroupBy object, e.g. mean().

Non-numeric columns are excluded from the calculation.

Sample code:

# Grouping operation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())

Running results:

         data1     data2
key1                    
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64
Copy the code

size() returns the number of elements in each group

Sample code:

# size
print(grouped1.size())
print(grouped2.size())

Running results:

key1
a    5
b    3
dtype: int64
key1
a    5
b    3
dtype: int64

### 3. Group by custom key

obj.groupby(self_def_key)

The custom key can be a list, or a list of lists for multiple keys.

obj.groupby(['label1', 'label2']) -> DataFrame with a hierarchical index

# Group by a custom key (list)
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())

# Group by a custom key (multiple Series)
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# Multi-level grouping by multiple columns
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())

# Multi-level grouping follows the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
# unstack() pivots the inner index level into columns, giving a single-level index
print(grouped3.mean().unstack())

Running results:

0    1
1    1
2    1
3    2
4    1
5    1
7    1
dtype: int64
key1  key2 
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
key1  key2 
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
               data1     data2
key2  key1                    
one   a     0.174988 -0.110804
      b    -0.214324  0.758372
three a    -1.019229 -1.143825
      b     0.522911  0.630814
two   a     1.428099  0.107465
      b    -0.264616  1.017155
          data1               data2          
key1          a         b         a         b
key2                                         
one    0.174988 -0.214324 -0.110804  0.758372
three -1.019229  0.522911 -1.143825  0.630814
two    1.428099 -0.264616  0.107465  1.017155

The GroupBy object supports iteration; each iteration yields a tuple (group_name, group_data), which can be used to operate on each group's data.

### 1. Single-level grouping

Sample code:

# Single-level grouping by key1
for group_name, group_data in grouped1:
    print(group_name)
    print(group_data)

Running results:

a
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
b
      data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two

### 2. Multi-level grouping

Sample code:

# Multi-level grouping by key1 and key2
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)

Running results:

('a', 'one')
      data1     data2 key1 key2
0  0.974685 -0.672494    a  one
6 -0.624708  0.450885    a  one
('a', 'three')
      data1     data2 key1   key2
7 -1.019229 -1.143825    a  three
('a', 'two')
      data1     data2 key1 key2
2  1.508838  0.392787    a  two
4  1.347359 -0.177858    a  two
('b', 'one')
      data1     data2 key1 key2
1 -0.214324  0.758372    b  one
('b', 'three')
      data1     data2 key1   key2
3  0.522911  0.630814    b  three
('b', 'two')
      data1     data2 key1 key2
5 -0.264616  1.017155    b  two

GroupBy objects can be converted to lists or dictionaries

Sample code:

# Convert the GroupBy object to a list
print(list(grouped1))

# Convert the GroupBy object to a dict
print(dict(list(grouped1)))

Running results:

[('a',       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three), ('b',       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two)]
{'a':       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three, 'b':       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two}

### 1. Group columns by data type

Sample code:

# Column data types
print(df_obj.dtypes)

# Group columns by data type
print(df_obj.groupby(df_obj.dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())

Running results:

data1    float64
data2    float64
key1      object
key2      object
dtype: object

float64    2
object     2
dtype: int64

    float64  object
0  0.302191    aone
1  0.544048    bone
2  1.901626    atwo
3  1.153725  bthree
4  1.169501    atwo
5  0.752539    btwo
6 -0.173823    aone
7 -2.163054  athree

### 2. Other grouping methods

Sample code:

df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan
print(df_obj2)

Running results:

   a    b    c    d  e
A  7  2.0  4.0  5.0  8
B  4  NaN  NaN  NaN  1
C  3  2.0  5.0  4.0  6
D  3  1.0  9.0  7.0  3
E  6  1.0  6.0  8.0  1

### 3. Group by dictionary

Sample code:

# Group columns by dictionary
mapping_dict = {'a': 'Python', 'b': 'Python', 'c': 'Java', 'd': 'C', 'e': 'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count())  # count of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())

Running results:

C         1
Java      2
Python    2
dtype: int64
   C  Java  Python
A  1     2       2
B  0     1       1
C  1     2       2
D  1     2       2
E  1     2       2
     C  Java  Python
A  5.0  12.0     9.0
B  NaN   1.0     4.0
C  4.0  11.0     5.0
D  7.0  12.0     4.0
E  8.0   7.0     7.0

### 4. Group by function; the function receives the row or column index as its argument

Sample code:

# Group by function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])
#df_obj3

def group_key(idx):
    """idx is a row index or column index label"""
    #return idx
    return len(idx)

print(df_obj3.groupby(group_key).size())

# The custom function above is equivalent to
#df_obj3.groupby(len).size()

Running results:

1    1
2    3
3    1
dtype: int64

### 5. Group by index level

Sample code:

# Group by index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

# Group by the language level
print(df_obj4.groupby(level='language', axis=1).sum())
# Group by the index level
print(df_obj4.groupby(level='index', axis=1).sum())

Running results:

language Python Java Python Java Python
index         A    A      B    C      B
0             2    7      8    4      3
1             5    2      6    1      2
2             6    4      4    5      2
3             4    7      4    3      1
4             7    4      3    4      8

language  Java  Python
0           11      13
1            3      13
2            9      12
3           10       9
4            8      18

index   A   B  C
0       9  11  4
1       7   8  1
2      10   6  5
3      11   5  3
4      11  11  4

## Aggregation

  • The process of producing a scalar from an array, e.g. mean(), count(), etc.

  • Used to compute statistics on grouped data

Sample code:

dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)

Running results:

   data1  data2 key1   key2
0      3      7    a    one
1      1      5    b    one
2      7      4    a    two
3      2      4    b  three
4      6      4    a    two
5      9      9    b    two
6      3      5    a    one
7      8      4    a  three

### 1. Built-in aggregate functions

sum(), mean(), max(), min(), count(), size(), describe()

Sample code:

print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())

Running results:

      data1  data2
key1              
a        27     24
b        12     18
      data1  data2 key2
key1                   
a         8      7  two
b         9      9  two
      data1  data2 key2
key1                   
a         3      4  one
b         1      4  one
      data1  data2
key1              
a       5.4    4.8
b       4.0    6.0
key1
a    5
b    3
dtype: int64
      data1  data2  key2
key1                    
a         5      5     5
b         3      3     3
               data1     data2
key1                          
a    count  5.000000  5.000000
     mean   5.400000  4.800000
     std    2.302173  1.303840
     min    3.000000  4.000000
     25%    3.000000  4.000000
     50%    6.000000  4.000000
     75%    7.000000  5.000000
     max    8.000000  7.000000
b    count  3.000000  3.000000
     mean   4.000000  6.000000
     std    4.358899  2.645751
     min    1.000000  4.000000
     25%    1.500000  4.500000
     50%    2.000000  5.000000
     75%    5.500000  7.000000
     max    9.000000  9.000000

### 2. Custom functions passed to the agg() method

grouped.agg(func)

The parameter func receives each group's records (each column of the group, as a Series).
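To see exactly what the function receives, a tiny self-contained sketch (the frame and the `record_and_range` helper are invented for illustration) that records the argument's type while computing a value range:

```python
import pandas as pd

df = pd.DataFrame({'k': ['a', 'a', 'b'],
                   'v': [1, 5, 2]})

seen = []

def record_and_range(s):
    """Record the argument type, then return the value range."""
    seen.append(type(s).__name__)
    return s.max() - s.min()

# agg hands the function each group's data as a Series
result = df.groupby('k')['v'].agg(record_and_range)
print(result)
print(set(seen))  # the function only ever received Series objects
```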

Sample code:

# Custom aggregate function
def peak_range(df):
    """Return the value range (max - min)"""
    # print(type(df))
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df : df.max() - df.min()))

Running results:

      data1  data2
key1              
a         5      3
b         8      5
         data1     data2
key1                    
a     2.528067  1.594711
b     0.787527  0.386341

### 3. Apply multiple aggregate functions at once with a list of functions

Sample code:

# Apply multiple aggregate functions simultaneously
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))  # the default column name is the function name

print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))  # provide a new column name via a tuple

Running results:

         data1                                data2                           
          mean       std count peak_range     mean       std count peak_range
key1                                                                         
a     0.437389  1.174151     5   2.528067 -0.230101  0.686488     5   1.594711
b     0.014657  0.440878     3   0.787527  0.802114  0.196850     3   0.386341
         data1                           data2                          
          mean       std count     range  mean       std count     range
key1                                                                    
a     0.437389  1.174151     5  2.528067 -0.230101  0.686488     5  1.594711
b     0.014657  0.440878     3  0.787527  0.802114  0.196850     3  0.386341

### 4. Apply different aggregate functions to different columns using a dict

Sample code:

# Apply a different aggregate function to each column
dict_mapping = {'data1': 'mean', 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1': ['mean', 'max'], 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

Running results:

         data1     data2
key1                    
a     0.437389 -1.150505
b     0.014657  2.406341
         data1               data2
          mean       max       sum
key1                              
a     0.437389  1.508838 -1.150505
b     0.014657  0.522911  2.406341

### 5. Common built-in aggregate functions

## Group-wise operations on data

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a'.'b'.'a'.'b'.'a'.'b'.'a'.'a'].'key2' : ['one'.'one'.'two'.'three'.'two'.'two'.'one'.'three'].'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

# Group by key1, sum data1 and data2, and add a prefix to the resulting column names
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
print(k1_sum)

Running results:

   data1  data2 key1   key2
0      5      1    a    one
1      7      8    b    one
2      1      9    a    two
3      2      6    b  three
4      9      8    a    two
5      8      3    b    two
6      3      5    a    one
7      8      3    a  three

      sum_data1  sum_data2
key1                      
a            26         26
b            17         17

Aggregation changes the shape of the original data. How can the original shape be preserved?

### 1. merge

Use merge to join the aggregated result back onto the original table; this is relatively cumbersome.

Sample code:

# Method 1: merge
k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
print(k1_sum_merge)

Running results:

   data1  data2 key1   key2  sum_data1  sum_data2
0      5      1    a    one         26         26
2      1      9    a    two         26         26
4      9      8    a    two         26         26
6      3      5    a    one         26         26
7      8      3    a  three         26         26
1      7      8    b    one         17         17
3      2      6    b  three         17         17
5      8      3    b    two         17         17

### 2. transform

The result of transform has the same shape as the original data.

For example: grouped.transform(np.sum)

Sample code:

# Method 2: transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)

Running results:

   data1  data2 key1   key2  sum_data1  sum_data2           sum_key2
0      5      1    a    one         26         26  onetwotwoonethree
1      7      8    b    one         17         17        onethreetwo
2      1      9    a    two         26         26  onetwotwoonethree
3      2      6    b  three         17         17        onethreetwo
4      9      8    a    two         26         26  onetwotwoonethree
5      8      3    b    two         17         17        onethreetwo
6      3      5    a    one         26         26  onetwotwoonethree
7      8      3    a  three         26         26  onetwotwoonethree

Sample code:

# Pass a custom function to transform
def diff_mean(s):
    """Return the difference between the data and the mean."""
    return s - s.mean()

print(df_obj.groupby('key1').transform(diff_mean))

Running results:

      data1     data2  sum_data1  sum_data2
0 -0.200000 -4.200000          0          0
1  1.333333  2.333333          0          0
2 -4.200000  3.800000          0          0
3 -3.666667  0.333333          0          0
4  3.800000  2.800000          0          0
5  2.333333 -2.666667          0          0
6 -2.200000 -0.200000          0          0
7  2.800000 -2.200000          0          0

## groupby.apply(func)

func is called separately on each group, and the final results are assembled together via pd.concat (data merging).
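Before the real data set below, a tiny self-contained sketch of that equivalence (the frame and the `sort_vals` helper are invented for illustration): applying a function per group matches calling it on each group and gluing the pieces together with pd.concat.

```python
import pandas as pd

df = pd.DataFrame({'grp': ['x', 'x', 'y'],
                   'val': [3, 1, 2]})

def sort_vals(s):
    """Sort one group's values."""
    return s.sort_values()

# apply calls the function once per group
applied = df.groupby('grp', group_keys=False)['val'].apply(sort_vals)

# ...which matches calling it per group and concatenating the results
manual = pd.concat([sort_vals(s) for _, s in df.groupby('grp')['val']])
print(applied.equals(manual))
```

Both objects are identical here; apply is essentially the loop plus pd.concat, with extras such as group keys handled for you.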

Sample code:

import pandas as pd
import numpy as np

dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
    """Return the top n rows of each group, sorted by column"""
    return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))

Running results:

                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex                                                            
1           2214            1  20.0          12.0       730.0  172.9530
            2246            1  27.0           8.0       250.0  141.6282
            1753            1  20.0          28.0       100.0  139.6362
2           3062            2  20.0           6.0       100.0  179.6250
            3229            2  16.0          24.0       110.0  156.7380
            1520            2  29.0           6.0       250.0  151.6470
3           1557            3  22.0           6.0       200.0  226.6554
            484             3  19.0          42.0       450.0  220.0692
            2883            3  16.0           8.0       800.0  208.9500
4           2688            4  26.0          24.0       990.0  249.0210
            1759            4  16.0           6.0        75.0  229.9122
            2637            4  23.0          24.0       650.0  227.2272
5           3277            5  18.0          16.0       950.0  372.6426
            93              5  17.0          36.0       720.0  335.4990
            202             5  37.0          14.0       800.0  327.7218
6           734             6  16.0          28.0       730.0  389.8314
            2746            6  16.0          28.0      4000.0  350.4114
            1810            6  21.0          14.0       730.0  323.2506
7           3127            7  23.0          42.0      2000.0  298.7952
            104             7  21.0          24.0      1000.0  286.4538
            1654            7  18.0          98.0       700.0  236.0316
8           3393            8   NaN           NaN         NaN  375.8664
            3373            8   NaN           NaN         NaN  364.8504
            3372            8   NaN           NaN         NaN  355.3518

### 1. apply produces a hierarchical index

The outer index level is the group name; the inner level is the original row index of df_data.

Sample code:

# Extra arguments passed to apply are forwarded to the custom function
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))

Running results:

                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex                                                            
1           3146            1  40.0          12.0       150.0   38.5590
            3040            1  39.0          10.0       500.0   29.8764
2           920             2  43.0          10.0       730.0   86.0586
            2437            2  41.0           4.0       200.0   54.2166
3           1258            3  41.0          14.0       800.0   77.6472
            2972            3  40.0          10.0       500.0   60.5970
4           1696            4  44.0           6.0       500.0   89.5266
            1729            4  39.0           8.0       500.0   86.7246
5           202             5  37.0          14.0       800.0  327.7218
            2745            5  37.0          18.0      1000.0  123.4098
6           3069            6  31.0           8.0       800.0  133.1790
            2706            6  31.0           8.0       700.0   66.9918
7           2813            7  26.0          36.0      1300.0  188.5512
            1992            7  26.0          24.0      1000.0  219.6690
8           3340            8   NaN           NaN         NaN  189.7404
            3341            8   NaN           NaN         NaN  287.8128
Copy the code

### 2. Disable the hierarchical index with group_keys=False

Sample code:

print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))

Running results:

      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214            1  20.0          12.0       730.0  172.9530
2246            1  27.0           8.0       250.0  141.6282
1753            1  20.0          28.0       100.0  139.6362
3062            2  20.0           6.0       100.0  179.6250
3229            2  16.0          24.0       110.0  156.7380
1520            2  29.0           6.0       250.0  151.6470
1557            3  22.0           6.0       200.0  226.6554
484             3  19.0          42.0       450.0  220.0692
2883            3  16.0           8.0       800.0  208.9500
2688            4  26.0          24.0       990.0  249.0210
1759            4  16.0           6.0        75.0  229.9122
2637            4  23.0          24.0       650.0  227.2272
3277            5  18.0          16.0       950.0  372.6426
93              5  17.0          36.0       720.0  335.4990
202             5  37.0          14.0       800.0  327.7218
734             6  16.0          28.0       730.0  389.8314
2746            6  16.0          28.0      4000.0  350.4114
1810            6  21.0          14.0       730.0  323.2506
3127            7  23.0          42.0      2000.0  298.7952
104             7  21.0          24.0      1000.0  286.4538
1654            7  18.0          98.0       700.0  236.0316
3393            8   NaN           NaN         NaN  375.8664
3373            8   NaN           NaN         NaN  364.8504
3372            8   NaN           NaN         NaN  355.3518

apply can also be used to fill missing data group by group, for example filling each group's NaN values with that group's mean.
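A minimal sketch of that idea (the toy frame and the `fill_group_mean` helper are invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, np.nan, 3.0, np.nan]})

def fill_group_mean(s):
    """Fill NaN in one group with that group's own mean."""
    return s.fillna(s.mean())

# Each group's NaN is replaced by its own group mean, not the global mean
filled = df.groupby('key', group_keys=False)['val'].apply(fill_group_mean)
print(filled)
```

The same effect is often written with transform, e.g. df.groupby('key')['val'].transform(lambda s: s.fillna(s.mean())).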