Pandas would never have become the dominant data analysis tool if all it could do was store data in the elegant form of a DataFrame. In data analysis, data is described through a series of statistical indicators, and the results also need to be compared both horizontally and vertically across specific groups.

GroupBy is one such weapon. In fact, SQL became the standard tool of advanced data analysts decades before pandas appeared, in large part because it offers the standard SELECT ... FROM ... WHERE condition GROUP BY ... HAVING condition paradigm.

Thanks to Wes McKinney and his team, we have a tool more flexible and adaptable than SQL, instead of trudging through a SQL shell or raw Python.

[Example] Expressing an SQL statement in pandas

SQL

SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
WHERE Condition1
GROUP BY Column1, Column2
HAVING Condition2

Pandas

df[Condition1].groupby([Column1, Column2], as_index=False).agg({Column3: "mean", Column4: "sum"}).query(Condition2)
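To make the mapping concrete, here is a minimal runnable sketch. All column names and the Flag column (standing in for Condition1) are hypothetical:

```python
import pandas as pd

# Hypothetical toy table standing in for SomeTable
df = pd.DataFrame({
    "Column1": ["x", "x", "y", "y"],
    "Column2": ["a", "a", "a", "b"],
    "Column3": [1.0, 3.0, 5.0, 7.0],
    "Column4": [10, 20, 30, 40],
    "Flag":    [True, True, True, False],  # plays the role of Condition1
})

# WHERE -> boolean indexing, GROUP BY -> groupby, SELECT aggregates -> agg
result = (df[df["Flag"]]
          .groupby(["Column1", "Column2"], as_index=False)
          .agg({"Column3": "mean", "Column4": "sum"}))
print(result)
```

The HAVING clause then corresponds to filtering this aggregated result afterwards, e.g. with boolean indexing on result.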


Group By: split – apply – combine

GroupBy can be broken down into three steps:

  • Splitting: split the data into groups by keys
  • Applying: apply a function independently to each group
  • Combining: combine the results obtained

So how is this pipeline implemented?

  • Splitting: implemented by groupby
  • Applying: the specific operations are implemented by agg, apply, transform and filter
  • Combining: implemented by concat and the like
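The three steps can be spelled out by hand on a toy frame; this is only a sketch of what the one-line groupby does internally:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Splitting: groupby yields (name, sub-frame) pairs
pieces = []
for name, group in df.groupby("key"):
    # Applying: run a function independently on each group
    pieces.append(pd.DataFrame({"key": [name], "val_sum": [group["val"].sum()]}))

# Combining: concat glues the per-group results back together
manual = pd.concat(pieces, ignore_index=True)

# The equivalent one-liner
direct = (df.groupby("key", as_index=False)["val"].sum()
            .rename(columns={"val": "val_sum"}))
print(manual)
```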

In the Apply step, there are usually four types of operations:

  • Aggregation: perform some statistical calculation
  • Apply: perform some data conversion
  • Transformation: perform some data-processing transformation
  • Filtration: perform some group-level filtering

Note that the apply, agg, transform and filter methods discussed here all live on pandas.core.groupby.DataFrameGroupBy; do not confuse them with the same-named methods on pandas.DataFrame.


Import the required modules first

import numpy as np
import pandas as pd
import sys, traceback
from itertools import chain

Part 1: Groupby

df_0 = pd.DataFrame({'A': list(chain(*[['foo', 'bar']] * 4)),
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': np.random.randn(8),
                     'D': np.random.randn(8)})
df_0
A B C D
0 foo one 1.145852 0.210586
1 bar one 1.343518 2.064735
2 foo two 0.544624 1.125505
3 bar three 1.090288 0.296160
4 foo two 1.854274 1.348597
5 bar two 0.246072 0.598949
6 foo one 0.348484 0.429300
7 bar three 1.477379 0.917027

Talk 1: Things to note when creating a GroupBy object

Good Practice

df_01 = df_0.copy()
df_01.groupby(["A", "B"], as_index=False, sort=False).agg({"C": "sum", "D": "mean"})
A B C D
0 foo one 1.494336 0.319943
1 bar one 1.343518 2.064735
2 foo two 1.309649 1.237051
3 bar three 2.567667 0.310433
4 bar two 0.246072 0.598949

Poor Practice

df_02 = df_0.copy()
df_02.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"}).reset_index()
A B C D
0 bar one 1.343518 2.064735
1 bar three 2.567667 0.310433
2 bar two 0.246072 0.598949
3 foo one 1.494336 0.319943
4 foo two 1.309649 1.237051
  • Using as_index=False directly is a good habit: if the DataFrame is very large (say, over a GB), creating the GroupBy object and calling reset_index() afterwards incurs extra time cost.
  • Sorting is a luxury in any operation that involves data. If you are simply grouping and do not care about order, turn off sorting when creating the GroupBy object, since it is enabled by default. This is especially important on large data sets.
  • Note that GroupBy keeps the rows within each new group in the same order as in the original DataFrame, regardless of whether sorting is requested.
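A small sketch of the last two points, using a toy frame with keys deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame({"k": ["b", "a", "b", "a"], "v": [1, 2, 3, 4]})

# Default: group keys come out sorted ("a" before "b")
sorted_keys = list(df.groupby("k")["v"].sum().index)

# sort=False: group keys keep first-appearance order ("b" before "a")
unsorted_keys = list(df.groupby("k", sort=False)["v"].sum().index)

# Within each group, rows stay in original order either way
b_rows = df.groupby("k", sort=False).get_group("b")["v"].tolist()

print(sorted_keys, unsorted_keys, b_rows)
```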

To get a multi-indexed DataFrame, use the default as_index=True, as in the following example:

df_03 = df_0.copy()
df_03.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})
C D
A B
bar one 1.343518 2.064735
three 2.567667 0.310433
two 0.246072 0.598949
foo one 1.494336 0.319943
two 1.309649 1.237051

Note that as_index takes effect only for aggregation operations. If you specify it for other operations, such as transform, the parameter has no effect:

df_04 = df_0.copy()
df_04.groupby(["A", "B"], as_index=True).transform(lambda x: x * x)
C D
0 1.312976 0.044347
1 1.805040 4.263130
2 0.296616 1.266761
3 1.188727 0.087711
4 3.438331 1.818714
5 0.060552 0.358740
6 0.121441 0.184298
7 2.182650 0.840938

As you can see, we got a new DataFrame of the same length as df_0; we also wanted A and B to become the index, but that did not happen.


Talk 2: Use pd.Grouper

pd.Grouper is more powerful and flexible than a plain groupby key: it supports not only ordinary grouping but also up-sampling and down-sampling grouping by time.

df_1 = pd.read_excel(r"dataset\sample-salesv3.xlsx")
df_1["date"] = pd.to_datetime(df_1["date"])
df_1.head()
  account number  name  sku  quantity  unit price  ext price  date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55

Calculate the sum of ext prices for each month

df_1.set_index("date").resample("M")["ext price"].sum()

date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
Freq: M, Name: ext price, dtype: float64
df_1.groupby(pd.Grouper(key="date", freq="M"))["ext price"].sum()

date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
Freq: M, Name: ext price, dtype: float64

Both get the same result, and the second seems a little hard to understand. Let’s do another example

Calculate the ext price sum for each customer each month

df_1.set_index("date").groupby("name")["ext price"].resample("M").sum().head(20)

name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
Name: ext price, dtype: float64
df_1.groupby(["name", pd.Grouper(key="date", freq="M")])["ext price"].sum().head(20)

name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
Name: ext price, dtype: float64

This time the second version is far cleaner and easier to understand than the first. For such mixed grouping by a specific field plus time-based sampling, pd.Grouper is preferred.
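pd.Grouper takes any resampling frequency, not just monthly. A small self-contained sketch (with synthetic data standing in for the sales file above), grouping by customer and quarter:

```python
import pandas as pd

# Synthetic stand-in for df_1
df = pd.DataFrame({
    "name": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2014-01-15", "2014-05-20", "2014-02-10", "2014-08-01"]),
    "ext price": [100.0, 200.0, 50.0, 75.0],
})

# Mixed grouping: by customer, down-sampled to quarters
out = df.groupby(["name", pd.Grouper(key="date", freq="Q")])["ext price"].sum()
print(out)
```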


Talk 3: How do I access groups

If you just split without applying, you get a Groupby object. The main methods to access the split group are as follows:

  • groups
  • get_group
  • Iteration traversal
df_2 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
df_2
X Y
0 A 1
1 B 4
2 A 3
3 B 2
  1. Use groups to see all the groups
df_2.groupby("X").groups

{'A': Int64Index([0, 2], dtype='int64'),
 'B': Int64Index([1, 3], dtype='int64')}
  2. Use get_group to access a specified group
df_2.groupby("X", as_index=True).get_group(name="A")

X Y
0 A 1
2 A 3

Note that the name argument of get_group accepts only a single key such as a str; it cannot take a list, even though lists are accepted in many other places in pandas. If the split was done on multiple key columns, pass a tuple.
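A short sketch of the tuple form for a multi-key split (the columns here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"X": ["A", "B", "A", "B"],
                   "Z": ["u", "u", "v", "v"],
                   "Y": [1, 4, 3, 2]})

# Multi-column keys: pass a tuple, not a list
group_av = df.groupby(["X", "Z"]).get_group(("A", "v"))
print(group_av)
```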

  3. Iterate over the groups
for name, group in df_2.groupby("X"):
    print(name)
    print(group,"\n")

A
   X  Y
0  A  1
2  A  3 

B
   X  Y
1  B  4
3  B  2 

Here is a little trick: if you have a pandas.core.groupby.DataFrameGroupBy object and want to turn it back into its original DataFrame, there is a very simple method worth mentioning:

grouped_object.apply(lambda x: x)
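A quick sanity check of the trick (a sketch; on newer pandas versions, group_keys=False is needed so apply does not prepend the group key as an extra index level):

```python
import pandas as pd

df = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# Identity apply reassembles the split groups into a DataFrame again
restored = df.groupby("X", group_keys=False).apply(lambda x: x)

# Same rows as the original frame (sort_index in case groups were reordered)
print(restored.sort_index().shape)
```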

For lack of space I will not explain the API parameter by parameter; instead, here are the three parameters that are easiest to overlook and most likely to cause mistakes:

parameter   notes
level       valid only for DataFrames with a hierarchical index
as_index    valid only for agg operations
group_keys  valid only when calling apply
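A sketch of the group_keys point, using a reducing apply (taking the top row of each group) so the difference shows up in the index:

```python
import pandas as pd

df = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# group_keys=True (the default): apply prepends the group key as an index level
with_keys = df.groupby("X", group_keys=True).apply(lambda g: g.nlargest(1, "Y"))

# group_keys=False: only the original row index survives
without_keys = df.groupby("X", group_keys=False).apply(lambda g: g.nlargest(1, "Y"))

print(with_keys.index.nlevels, without_keys.index.nlevels)
```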

Part 2: Explain the Apply phase

After splitting, you can do some operations for each group. Generally speaking, it can be divided into the following four categories:

  • aggregation
  • apply
  • transform
  • filter

Let’s compare these four types of operations

  1. Anything that compresses a Series down to a scalar value is an agg operation, such as sums, means, extrema and other statistics
  2. Transforming a DataFrame or groupby object to obtain a subset or a new DataFrame is apply or transform
  3. Filtering at the level of aggregated results is filter

apply and transform have a few similarities, which are highlighted below.

Talk 4: agg VS apply

Both agg and apply can pass a function to a particular column and evaluate it according to that function. The difference is that agg is more flexible and efficient and can finish in one pass, whereas apply needs multiple passes to complete the same operation.

df_3 = pd.DataFrame({"name": ["Foo", "Bar", "Foo", "Bar"],
                     "score": [80, 80, 95, 70]})
df_3
name score
0 Foo 80
1 Bar 80
2 Foo 95
3 Bar 70

We need to calculate each person’s total score, highest score, lowest score

(1) Use apply method

df_3.groupby("name", sort=False).score.apply(lambda x: x.sum())

name
Foo    175
Bar    150
Name: score, dtype: int64
df_3.groupby("name", sort=False).score.apply(lambda x: x.max())

name
Foo    95
Bar    80
Name: score, dtype: int64
df_3.groupby("name", sort=False).score.apply(lambda x: x.min())

name
Foo    80
Bar    70
Name: score, dtype: int64

Obviously, we had to make three passes, and an extra step is still needed to glue the three results together.

(2) Use the agg method

df_3.groupby("name", sort=False).agg({"score": [np.sum, np.max, np.min]})

score
sum amax amin
name
Foo 175 95 80
Bar 150 80 70

agg can call different functions on multiple columns independently in one pass, whereas apply can only call the same function on multiple columns at a time.


Talk 5: transform VS agg

transform applies to the DataFrame itself and returns the transformed values. The returned object has the same number of rows as the original, though its columns can be expanded. Note that it does not modify the original object in place; it creates a new object, so the original stays unchanged.

df_4 = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
df_4
A B
0 0 1
1 1 2
2 2 3
df_4.transform(lambda x: x + 1)

A B
0 1 2
1 2 3
2 3 4

You can also group a DataFrame first and then apply a transformation to each group, such as adding 1 to every element. This is a trivial example, but it shows the mechanics:

df_2.groupby("X").transform(lambda x: x + 1)

Y
0 2
1 5
2 4
3 3

Let’s take a more practical example

df_5 = pd.read_csv(r"dataset\tips.csv")
df_5.head()

total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Now we want the mean of each numeric column; compare the following two operations, agg and transform:

df_5.groupby("day").aggregate("mean")

total_bill tip size
day
Fri 17.151579 2.734737 2.105263
Sat 20.441379 2.993103 2.517241
Sun 21.410000 3.255132 2.842105
Thur 17.682742 2.771452 2.451613
df_5.groupby('day').transform(lambda x : x.mean()).total_bill.unique()

array([21.41, 20.44137931, 17.68274194, 17.15157895])

The observation: the two operations compute the same thing, the mean of each group. The difference is that agg returns only 4 rows (compressed statistics), while transform returns a new DataFrame of the same length as the original.
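This same-length property is what makes transform the natural tool for broadcasting a group statistic back onto every row, for example to flag rows above their group's average (a sketch with made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                   "total_bill": [10.0, 20.0, 30.0, 50.0]})

# agg compresses to one row per group...
group_means = df.groupby("day")["total_bill"].agg("mean")

# ...while transform aligns the group mean back onto every row,
# making row-level comparisons trivial
df["day_mean"] = df.groupby("day")["total_bill"].transform("mean")
df["above_avg"] = df["total_bill"] > df["day_mean"]
print(df)
```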


Talk 6: transform VS apply

Transform and Apply differ in two main ways:

  1. For each group, apply calls the function on all columns at once, while transform calls the function on each column in turn
  2. As a consequence of the above: apply can return a scalar, a Series or a DataFrame, depending on what you call inside it; transform can only return a one-dimensional array-like sequence such as a Series, array or list, and, most importantly, one of the same length as the original group, otherwise an error is raised.

Compare the objects the two methods work on by printing their types:

df_6 = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                     'a': [4, 5, 1, 3],
                     'b': [6, 10, 3, 11]})
df_6
State a b
0 Texas 4 6
1 Texas 5 10
2 Florida 1 3
3 Florida 3 11
def inspect(x):
    print(type(x))
    print(x)

df_6.groupby("State").apply(inspect)

<class 'pandas.core.frame.DataFrame'>
     State  a   b
2  Florida  1   3
3  Florida  3  11
<class 'pandas.core.frame.DataFrame'>
     State  a   b
2  Florida  1   3
3  Florida  3  11
<class 'pandas.core.frame.DataFrame'>
   State  a   b
0  Texas  4   6
1  Texas  5  10

Two points are clear from the print-out: apply works on whole DataFrames, and the first group is computed twice, because pandas uses this mechanism to check whether there is a faster way to compute the remaining groups.

df_6.groupby("State").transform(inspect)

<class 'pandas.core.series.Series'>
2    1
3    3
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
2     3
3    11
Name: b, dtype: int64
<class 'pandas.core.frame.DataFrame'>
   a   b
2  1   3
3  3  11
<class 'pandas.core.series.Series'>
0    4
1    5
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
0     6
1    10
Name: b, dtype: int64

Two points are also clear from the print-out: transform evaluates one column (Series) at a time; the case where it receives a whole group as a DataFrame is an implementation quirk worth studying on its own.

From the comparison above we get a useful caveat: do not pass a function that involves multiple columns to transform, because doing so will only get you an error, as the following code shows:

def subtract(x):
    return x["a"] - x["b"]

try:
    df_6.groupby("State").transform(subtract)
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])

KeyError: ('a', 'occurred at index a')

Another caveat: when using transform, do not try to change the length of the returned result. Not only does this raise an error, the traceback is so cryptic that it may take you a long time to find the cause.

def return_more(x):
    return  np.arange(3)

try:
    df_6.groupby("State").transform(return_more)
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])

ValueError: Length mismatch: Expected axis has 6 elements, new values have 4 elements

This error message is a bit backwards: it says six elements were expected but only four were returned. In fact it is the other way around: six elements were returned (three per group, two groups) where four, the original row count, were expected.

Finally, a useful lesson to end this talk: if you are sure you want to operate on multiple columns at once, and you want it fast, do not use the transform method. Talk 9 has a good example of this.


Talk 7: agg usage summary

(1) Call multiple functions on all columns at once

df_0.groupby("A").agg([np.sum, np.mean, np.min])

C D
sum mean amin sum mean amin
A
bar 0.978077 0.244519 1.343518 2.042817 0.510704 2.064735
foo 0.184686 0.046172 1.854274 3.113988 0.778497 0.210586

(2) Call more than one function on a particular column at a time

df_0.groupby("A")["C"].agg([np.sum, np.mean, np.min])

sum mean amin
A
bar 0.978077 0.244519 1.343518
foo 0.184686 0.046172 1.854274

(3) Call different functions for different columns

df_0.groupby("A").agg({"C": [np.sum, np.mean], "D": [np.max, np.min]})

C D
sum mean amax amin
A
bar 0.978077 0.244519 0.917027 2.064735
foo 0.184686 0.046172 1.348597 0.210586
df_0.groupby("A").agg({"C": "sum", "D": "min"})

C D
A
bar 0.978077 2.064735
foo 0.184686 0.210586

(4) Call different functions for the same column and rename them directly

df_0.groupby("A")["C"].agg([("Largest", "max"), ("Smallest", "min")])

Largest Smallest
A
bar 1.477379 1.343518
foo 1.145852 1.854274

(5) Call the same function for multiple columns

agg_keys = {}.fromkeys(["C", "D"], "sum")
df_0.groupby("A").agg(agg_keys)

C D
A
bar 0.978077 2.042817
foo 0.184686 3.113988

(6) Note that agg ignores missing values, which matters when counting:

df_7 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B"],
                     "Num": [1, np.nan, 1, 1, 1]})
df_7

ID Num
0 A 1.0
1 A NaN
2 A 1.0
3 B 1.0
4 B 1.0
df_7.groupby("ID").agg({"Num": "count"})

Num
ID
A 2
B 2

Note: in pandas, functions such as count, sum, mean, median, std, var, min and max are implemented in optimized C. So, once again, if you use agg on a big data set, prefer these built-in names over borrowing np.sum from numpy; slowness accumulates one step at a time.
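A small sketch of the point; the results are identical, and the timing difference only shows up on large data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "B"], "Num": [1.0, 2.0, 4.0]})

# The string name dispatches to pandas' optimized (Cython) implementation
fast = df.groupby("ID")["Num"].agg("sum")

# np.sum produces the same numbers, but the string spelling is the one
# guaranteed to hit the fast path on large data
also = df.groupby("ID")["Num"].agg(np.sum)

print(fast)
```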


Talk 8: Understanding filter

Typically, after grouping a DataFrame and performing a given operation, the result can be returned directly or filtered as required. This filtering generally refers to the filter operation, but it is important to understand whether you need to filter whole groups or individual rows within each group. This Talk is about filtering.

Find the students whose score in each course is below that course's average.

df_8 = pd.DataFrame({"Subject": list(chain(*[["Math"] * 3, ["Computer"] * 3])),
                     "Student": list(chain(*[["Chan", "Ida", "Ada"]] * 2)),
                     "Score": [80, 90, 85, 90, 85, 95]})
df_8
Subject Student Score
0 Math Chan 80
1 Math Ida 90
2 Math Ada 85
3 Computer Chan 90
4 Computer Ida 85
5 Computer Ada 95

Is such a requirement appropriate to handle with a filter? Let’s try:

try:
    df_8.groupby("Subject").filter(lambda x: x["Score"] < x["Score"].mean())
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])

TypeError: filter function returned a Series, but expected a scalar bool

Obviously not, because what filter actually does is either keep a whole group or discard it. The confusion here is the same one we met between WHERE and HAVING when first learning SQL. Just as HAVING operates at the group level, remember that filter is a group-level method.

Let’s solve this example first. The correct way to do this is:

df_8.groupby("Subject").apply(lambda g: g[g.Score < g.Score.mean()])

Subject Student Score
Subject
Computer 4 Computer Ida 85
Math 0 Math Chan 80

As for filter itself, here is the example from the official documentation for comparison:

df_9 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                     'B': [1, 2, 3, 4, 5, 6],
                     'C': [2.0, 5., 8., 1., 2., 9.]})
df_9
A B C
0 foo 1 2.0
1 bar 2 5.0
2 foo 3 8.0
3 bar 4 1.0
4 foo 5 2.0
5 bar 6 9.0
df_9.groupby('A').filter(lambda x: x['B'].mean() > 3.)

A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0

Part 3: groupby application examples

Talk 9: Missing value padding within a group

df_10 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B", "B"],
                      "Num": [100, np.nan, 300, np.nan, 500, 600]})
df_10
ID Num
0 A 100.0
1 A NaN
2 A 300.0
3 B NaN
4 B 500.0
5 B 600.0
df_10.groupby("ID", as_index=False).Num.transform(lambda x: x.fillna(method="ffill")).transform(lambda x: x.fillna(method="bfill"))

Num
0 100.0
1 100.0
2 300.0
3 500.0
4 500.0
5 600.0

If the DataFrame is large (over 1 GB), the transform + lambda approach above is slow. The following method is roughly 100 times faster than that combination:

df_10.groupby("ID", as_index=False).ffill().groupby("ID", as_index=False).bfill()

ID Num
0 A 100.0
1 A 100.0
2 A 300.0
3 B 500.0
4 B 500.0
5 B 600.0
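One property that makes this method safe: the group-wise ffill/bfill never leak values across group boundaries. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "B", "B"],
                   "Num": [100.0, np.nan, np.nan, 500.0]})

# B's leading NaN is NOT filled from A's last value:
# forward-filling stops at the group boundary
filled = df.groupby("ID").ffill()
print(filled)
```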
