Introduction

DataFrames in Pandas support GroupBy operations, much like database tables. In general, a GroupBy operation consists of three steps: splitting the data into groups, applying a function to each group, and combining the results.

This article explains the groupby operation in Pandas in detail.
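As a minimal sketch of the split-apply-combine idea (using a small hypothetical DataFrame), grouping by a key column and summing is equivalent to splitting the frame into per-key pieces, summing each piece, and combining the results:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-frame per distinct key
pieces = {k: g for k, g in df.groupby("key")}

# Apply + Combine: sum each piece's "value" column and collect the results
combined = pd.Series({k: g["value"].sum() for k, g in pieces.items()})

print(combined)                           # a -> 4, b -> 6
print(df.groupby("key")["value"].sum())   # the same result in one step
```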

Splitting the data

The purpose of splitting is to divide the DataFrame into separate groups. Before we can perform a GroupBy operation, we need a DataFrame with the corresponding labels:

In [60]: df = pd.DataFrame(
   ....:     {
   ....:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ....:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ....:         "C": np.random.randn(8),
   ....:         "D": np.random.randn(8),
   ....:     }
   ....: )

In [61]: df
Out[61]: 
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, GroupBy operates along axis 0 (the rows). We can group by a single column or by several:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

MultiIndex

Since version 0.24, if a DataFrame has a MultiIndex, we can group by specific levels of the index:

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938

get_group

get_group retrieves the data of a single group:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2

dropna

By default, rows whose group key is NaN are excluded from GroupBy. They can be kept by setting dropna=False:

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

Groups attribute

The GroupBy object has a groups attribute, a dictionary whose keys are the computed group names and whose values are the row labels belonging to each group.

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6

Hierarchical indexes

For an object with a hierarchical (MultiIndex) index, groupby can group by a specific index level:

In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Group by the first level:

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Group by the second level:

In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64

Iterating over groups

Once we have a GroupBy object, we can iterate over its groups with a for loop:

In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....: 
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

If we group by multiple columns, the group name is a tuple:

In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....: 
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652

Aggregation operations

After grouping, you can perform aggregations on the groups:

In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

When grouping by multiple columns, the result has a MultiIndex by default. If you want a flat, new index instead, pass as_index=False:

In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

The effect above is equivalent to calling reset_index:

In [74]: df.groupby(["A", "B"]).sum().reset_index()
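The equivalence can be checked directly on a small hypothetical frame: as_index=False and a trailing reset_index produce the same result:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})

left = df.groupby("A", as_index=False).sum()
right = df.groupby("A").sum().reset_index()

# Both produce a flat integer index with "A" as an ordinary column
print(left.equals(right))  # True
```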

grouped.size() computes the size of each group:

In [75]: grouped.size()
Out[75]: 
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2

grouped.describe() produces descriptive statistics per group:

In [76]: grouped.describe()
Out[76]: 
      C                                                    ...         D                                                  
  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

Common aggregation methods

The following are commonly used aggregation methods:

function     description
mean()       mean of the group values
sum()        sum of the group values
size()       size of each group
count()      count of the group values
std()        standard deviation
var()        variance
sem()        standard error of the mean
describe()   descriptive statistics
first()      first of the group values
last()       last of the group values
nth()        nth value of the group
min()        minimum
max()        maximum
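A short hypothetical sketch showing a few of these methods on a small grouped Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "a", "b", "b"])
g = s.groupby(level=0)

print(g.first())  # first value in each group: a -> 10, b -> 30
print(g.last())   # last value in each group:  a -> 20, b -> 40
print(g.nth(1))   # second value (position 1) of each group: 20 and 40
print(g.size())   # number of items per group: 2 and 2
```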

Use multiple aggregation methods at the same time

Multiple aggregation methods can be specified at the same time:

In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

You can rename the resulting columns:

In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....: 
Out[84]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

NamedAgg

pd.NamedAgg provides a more precise definition of an aggregation; it takes two fields, column and aggfunc.

In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....: 

In [89]: animals
Out[89]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....: 
Out[90]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Or just use a tuple:

In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....: 
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Different columns specify different aggregation methods

By passing a dictionary to the agg method, we can specify a different aggregation for each column:

In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

Transformation operations

A transformation is an operation that returns an object of the same size as the one being grouped. Such transformations are often needed during data analysis.

We can pass a lambda to transform:

In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

Filling NA values:

In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
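The fillna pattern above can be shown end to end on a small hypothetical Series with missing values; each NaN is replaced by the mean of its own group:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
              index=["a", "a", "a", "b", "b", "b"])
grouped = s.groupby(level=0)

# Each NaN is filled with the mean of the group it belongs to:
# group "a" mean of [1.0, 3.0] is 2.0; group "b" mean of [4.0, 6.0] is 5.0
transformed = grouped.transform(lambda x: x.fillna(x.mean()))
print(transformed)
```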

Filtering operation

The filter method uses a lambda expression to drop the groups we don't need:

In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]: 
3    3
4    3
5    3
dtype: int64
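filter works on DataFrames as well; a small hypothetical sketch that keeps only the groups containing more than one row:

```python
import pandas as pd

dff = pd.DataFrame({"A": ["x", "x", "y", "z"], "B": [1, 2, 3, 4]})

# Keep only the groups of column "A" that contain more than one row;
# the "y" and "z" groups each have a single row and are dropped
result = dff.groupby("A").filter(lambda g: len(g) > 1)
print(result)  # only the two "x" rows survive
```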

The apply operation

Some operations may not fit the aggregation or transformation mold. For these, Pandas provides the more flexible apply method.

In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...   
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

You can also apply a named function:

In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

This article is archived at http://www.flydean.com/11-python-pandas-groupby/
