Pandas would never have become the dominant tool for data analysis if all it offered were an elegant way to put data into a DataFrame. Data analysis means describing data through a series of statistical indicators, and comparing results across specific groups, both horizontally and longitudinally.
GroupBy is exactly that weapon. In fact, SQL became the standard tool of advanced data analysts decades before pandas appeared, in large part because of its standard SELECT xx FROM xx WHERE condition GROUP BY xx HAVING condition paradigm.
Thanks to Wes McKinney and his team, we now have a tool more flexible and adaptable than SQL, instead of trudging through a SQL shell or plain Python.
[Example] Expressing an SQL statement in pandas
SQL

```sql
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
WHERE Condition1
GROUP BY Column1, Column2
HAVING Condition2
```

Pandas

```python
df[Condition1].groupby([Column1, Column2], as_index=False).agg(
    {Column3: "mean", Column4: "sum"}).filter(Condition2)
```
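To make the mapping concrete, here is a runnable sketch of the same shape; the column names `c1`–`c4` and both thresholds are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"c1": ["x", "x", "y"], "c2": ["p", "q", "p"],
                   "c3": [1.0, 2.0, 3.0], "c4": [10, 20, 30]})

# WHERE c4 > 15  ->  boolean indexing before the groupby
out = (df[df["c4"] > 15]
       .groupby(["c1", "c2"], as_index=False)
       .agg({"c3": "mean", "c4": "sum"}))

# HAVING sum(c4) > 25  ->  filter the aggregated result
out = out[out["c4"] > 25]
```

The WHERE clause becomes a row filter applied before grouping, while the HAVING clause becomes a filter on the aggregated result.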
Group By: split – apply – combine
GroupBy can be broken down into three steps:
- Splitting: divide the data into groups by some key
- Applying: apply a function to each group independently
- Combining: combine the per-group results into a single structure
So how does this pipeline actually run?
- Splitting is implemented by `groupby`
- Applying is implemented by `agg`, `apply`, `transform`, and `filter`, depending on the specific operation
- Combining is implemented by functions such as `concat`
In the Apply step, there are usually four types of operations:
- Aggregation: compute summary statistics
- Apply: perform some data conversion
- Transformation: perform some data processing
- Filtration: perform group-level filtering
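The three steps above can be sketched by hand; this is a toy illustration of the idea, not how pandas implements it internally:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Split: one sub-frame per key
pieces = {name: g for name, g in df.groupby("key")}
# Apply: aggregate each piece independently
sums = {name: g["val"].sum() for name, g in pieces.items()}
# Combine: glue the per-group results back together
result = pd.Series(sums)

# The one-liner equivalent
expected = df.groupby("key")["val"].sum()
assert list(result.sort_index()) == list(expected.sort_index())
```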
Note that the `apply`, `agg`, `transform`, and `filter` methods discussed here all belong to `pandas.core.groupby.DataFrameGroupBy`; do not confuse them with the `pandas.DataFrame` methods of the same names.
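A quick way to see the distinction for yourself:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
grouped = df.groupby("key")

# The result of groupby is a DataFrameGroupBy, not a DataFrame;
# its agg/apply/transform/filter act per group.
print(type(grouped).__name__)
```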
Import the required modules first
```python
import numpy as np
import pandas as pd
import sys, traceback
from itertools import chain
```
Part 1: Groupby
```python
df_0 = pd.DataFrame({'A': list(chain(*[['foo', 'bar']] * 4)),
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': np.random.randn(8),
                     'D': np.random.randn(8)})
df_0
```
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 1.145852 | 0.210586 |
1 | bar | one | 1.343518 | 2.064735 |
2 | foo | two | 0.544624 | 1.125505 |
3 | bar | three | 1.090288 | 0.296160 |
4 | foo | two | 1.854274 | 1.348597 |
5 | bar | two | 0.246072 | 0.598949 |
6 | foo | one | 0.348484 | 0.429300 |
7 | bar | three | 1.477379 | 0.917027 |
Talk 1: Things to note when creating a `GroupBy` object
Good Practice
```python
df_01 = df_0.copy()
df_01.groupby(["A", "B"], as_index=False, sort=False).agg({"C": "sum", "D": "mean"})
```
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 1.494336 | 0.319943 |
1 | bar | one | 1.343518 | 2.064735 |
2 | foo | two | 1.309649 | 1.237051 |
3 | bar | three | 2.567667 | 0.310433 |
4 | bar | two | 0.246072 | 0.598949 |
Poor Practice
```python
df_02 = df_0.copy()
df_02.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"}).reset_index()
```
A | B | C | D | |
---|---|---|---|---|
0 | bar | one | 1.343518 | 2.064735 |
1 | bar | three | 2.567667 | 0.310433 |
2 | bar | two | 0.246072 | 0.598949 |
3 | foo | one | 1.494336 | 0.319943 |
4 | foo | two | 1.309649 | 1.237051 |
- Passing `as_index=False` up front is a good habit: if the DataFrame is very large (say, over a GB), building the GroupBy object and then calling `reset_index()` afterwards costs extra time.
- Sorting is a luxury in any data operation. If you are simply grouping and do not care about the order of the keys, turn sorting off (`sort=False`) when creating the GroupBy object, since it is on by default. This matters especially on large data sets.
- Note that GroupBy keeps the rows within each group in the same order as in the original DataFrame, regardless of whether sorting is specified.
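A small demonstration of the `sort` parameter (sorted keys vs. order of first appearance) and of row order within groups:

```python
import pandas as pd

df = pd.DataFrame({"k": ["b", "a", "b", "a"], "v": [1, 2, 3, 4]})

sorted_keys = list(df.groupby("k", sort=True).groups)   # keys sorted: ['a', 'b']
appearance = list(df.groupby("k", sort=False).groups)   # first-appearance order: ['b', 'a']

# Rows inside each group keep their original order either way
group_b = df.groupby("k", sort=False).get_group("b")
```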
To get a multi-indexed result, use the default `as_index=True`, as in the following example:

```python
df_03 = df_0.copy()
df_03.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})
```
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 1.343518 | 2.064735 |
three | 2.567667 | 0.310433 | |
two | 0.246072 | 0.598949 | |
foo | one | 1.494336 | 0.319943 |
two | 1.309649 | 1.237051 |
Note that `as_index` only takes effect for aggregation operations; for other operations such as `transform`, the parameter is ignored:

```python
df_04 = df_0.copy()
df_04.groupby(["A", "B"], as_index=True).transform(lambda x: x * x)
```
C | D | |
---|---|---|
0 | 1.312976 | 0.044347 |
1 | 1.805040 | 4.263130 |
2 | 0.296616 | 1.266761 |
3 | 1.188727 | 0.087711 |
4 | 3.438331 | 1.818714 |
5 | 0.060552 | 0.358740 |
6 | 0.121441 | 0.184298 |
7 | 2.182650 | 0.840938 |
As you can see, we got a new DataFrame of the same length as df_0; we also wanted A and B to become the index, but that did not happen.
Talk 2: Use `pd.Grouper`
`pd.Grouper` is more powerful and flexible than a plain column key: it supports not only ordinary grouping, but also grouping by time with up-sampling or down-sampling.

```python
df_1 = pd.read_excel(r"dataset\sample-salesv3.xlsx")
df_1["date"] = pd.to_datetime(df_1["date"])
```
```python
df_1.head()
```
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 |
Calculate the sum of ext price for each month:

```python
df_1.set_index("date").resample("M")["ext price"].sum()
```

```
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
Freq: M, Name: ext price, dtype: float64
```
```python
df_1.groupby(pd.Grouper(key="date", freq="M"))["ext price"].sum()
```

```
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
Freq: M, Name: ext price, dtype: float64
```
Both get the same result, and the second seems a little hard to understand. Let’s do another example
Calculate the ext price sum for each customer each month:

```python
df_1.set_index("date").groupby("name")["ext price"].resample("M").sum().head(20)
```

```
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
Name: ext price, dtype: float64
```
```python
df_1.groupby(["name", pd.Grouper(key="date", freq="M")])["ext price"].sum().head(20)
```

```
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
Name: ext price, dtype: float64
```
This time, the second version is far cleaner and easier to understand than the first. For such mixed grouping by a specific field plus time resampling, prefer `pd.Grouper`.
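Since the Excel file above may not be at hand, here is a self-contained sketch of the same pattern on made-up data (the names, dates, and amounts are invented); it uses the month-start frequency alias `"MS"`, so each bin is labeled by the first day of the month:

```python
import pandas as pd

sales = pd.DataFrame({
    "name": ["x", "x", "y", "y"],
    "date": pd.to_datetime(["2014-01-05", "2014-02-10", "2014-01-20", "2014-02-25"]),
    "ext_price": [10.0, 20.0, 30.0, 40.0],
})

# Group by customer and by calendar month in one go
monthly = sales.groupby(["name", pd.Grouper(key="date", freq="MS")])["ext_price"].sum()
```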
Talk 3: How to access the groups
If you only split without applying, you get a GroupBy object. The main ways to access the resulting groups are:
- the `groups` attribute
- the `get_group` method
- iteration

```python
df_2 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
df_2
```
X | Y | |
---|---|---|
0 | A | 1 |
1 | B | 4 |
2 | A | 3 |
3 | B | 2 |
- Use the `groups` attribute to see all groups:

```python
df_2.groupby("X").groups
```

```
{'A': Int64Index([0, 2], dtype='int64'),
 'B': Int64Index([1, 3], dtype='int64')}
```
- Use the `get_group` method to access a specific group:

```python
df_2.groupby("X", as_index=True).get_group(name="A")
```
X | Y | |
---|---|---|
0 | A | 1 |
2 | A | 3 |
Note that the `name` parameter of `get_group` only accepts a single value: you can pass a `str` but not a `list`, even though lists are accepted in many other places in pandas. If you grouped by multiple columns, pass a `tuple`.
- Iterate over the groups:

```python
for name, group in df_2.groupby("X"):
    print(name)
    print(group, "\n")
```
```
A
   X  Y
0  A  1
2  A  3

B
   X  Y
1  B  4
3  B  2
```
Here is a little trick: if you have a `pandas.core.groupby.DataFrameGroupBy` object and want to turn it back into the original DataFrame, there is a very simple way:

```python
grouped_object.apply(lambda x: x)
```
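One caveat worth noting: in recent pandas versions, `apply` prepends the group keys to the result index by default, so to truly recover the original frame you may need `group_keys=False`; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# group_keys=False keeps the original index labels instead of
# prepending the group key as an extra index level
restored = df.groupby("X", group_keys=False).apply(lambda x: x)
restored = restored.sort_index()  # rows come back in group order
```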
For lack of space I won't walk through the whole API; instead, here are the three parameters that are easiest to overlook and most likely to cause mistakes:

parameter | notes |
---|---|
level | Valid only for DataFrames with a hierarchical index |
as_index | Takes effect only for aggregation (`agg`) operations |
group_keys | Takes effect only when calling `apply` |
Part 2: The Apply phase explained
After splitting, you can operate on each group. Generally speaking, the operations fall into four categories:
- aggregation
- apply
- transform
- filter
Let's compare these four types of operations:
- Any operation that compresses a `Series` down to a scalar, such as summing, averaging, or taking extrema, is an `agg` operation.
- Transforming a DataFrame or GroupBy object to obtain a subset or a new DataFrame is an `apply` or `transform` operation.
- Filtering groups based on aggregated results is a `filter` operation.

`apply` and `transform` have several similarities, which are highlighted below.
Talk 4: `agg` vs `apply`
Both `agg` and `apply` can pass a function to a particular column and evaluate it. The difference is that `agg` is more flexible and efficient and can do everything in one pass, whereas `apply` needs several passes to accomplish the same work.
```python
df_3 = pd.DataFrame({"name": ["Foo", "Bar", "Foo", "Bar"],
                     "score": [80, 80, 95, 70]})
df_3
```
name | score | |
---|---|---|
0 | Foo | 80 |
1 | Bar | 80 |
2 | Foo | 95 |
3 | Bar | 70 |
We need to calculate each person’s total score, highest score, lowest score
(1) Using the `apply` method:

```python
df_3.groupby("name", sort=False).score.apply(lambda x: x.sum())
```

```
name
Foo    175
Bar    150
Name: score, dtype: int64
```
```python
df_3.groupby("name", sort=False).score.apply(lambda x: x.max())
```

```
name
Foo    95
Bar    80
Name: score, dtype: int64
```
```python
df_3.groupby("name", sort=False).score.apply(lambda x: x.min())
```

```
name
Foo    80
Bar    70
Name: score, dtype: int64
```
Clearly we had to make three separate passes, plus an extra step to glue the three results together.

(2) Using the `agg` method:

```python
df_3.groupby("name", sort=False).agg({"score": [np.sum, np.max, np.min]})
```
score | |||
---|---|---|---|
sum | amax | amin | |
name | |||
Foo | 175 | 95 | 80 |
Bar | 150 | 80 | 70 |
`agg` can call different functions on multiple columns independently in a single pass, whereas `apply` can only call the same function on multiple columns at a time.
Talk 5: `transform` vs `agg`
`transform` applies a function to the DataFrame and returns the transformed values. The returned object has the same number of rows as the original, though columns can be expanded. Note that it does not modify the original object in place; a new object is created, so the original stays unchanged.

```python
df_4 = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
df_4
```
A | B | |
---|---|---|
0 | 0 | 1 |
1 | 1 | 2 |
2 | 2 | 3 |
```python
df_4.transform(lambda x: x + 1)
```
A | B | |
---|---|---|
0 | 1 | 2 |
1 | 2 | 3 |
2 | 3 | 4 |
You can also group a DataFrame and then apply a transformation to each group, such as adding 1 to every element. This is a trivial example, but it shows the mechanics:

```python
df_2.groupby("X").transform(lambda x: x + 1)
```
Y | |
---|---|
0 | 2 |
1 | 5 |
2 | 4 |
3 | 3 |
Let's take a more practical example:

```python
df_5 = pd.read_csv(r"dataset\tips.csv")
df_5.head()
```
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Now suppose we want the mean of each numeric column per day. Compare the following two operations, `agg` and `transform`:

```python
df_5.groupby("day").aggregate("mean")
```
total_bill | tip | size | |
---|---|---|---|
day | |||
Fri | 17.151579 | 2.734737 | 2.105263 |
Sat | 20.441379 | 2.993103 | 2.517241 |
Sun | 21.410000 | 3.255132 | 2.842105 |
Thur | 17.682742 | 2.771452 | 2.451613 |
```python
df_5.groupby('day').transform(lambda x: x.mean()).total_bill.unique()
```

```
array([21.41      , 20.44137931, 17.68274194, 17.15157895])
```
Observe that the two operations compute the same thing, the mean of each group. The difference is that `agg` returns only 4 rows (compressed statistics), while `transform` returns a new DataFrame of the same length as the original.
Talk 6: `transform` vs `apply`
`transform` and `apply` differ in two main ways:
- For each group, `apply` calls the function on all columns at once, whereas `transform` calls the function on each column in turn.
- As a consequence of the above: `apply` can return a scalar, a `Series`, or a `DataFrame`, depending on what you call it with; `transform` can only return an array-like one-dimensional sequence (`Series`, `array`, `list`), and, most importantly, one with the same length as the group, otherwise an error is raised.
Compare what the two methods operate on by printing the type of the object they receive:

```python
df_6 = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                     'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
df_6
```
State | a | b | |
---|---|---|---|
0 | Texas | 4 | 6 |
1 | Texas | 5 | 10 |
2 | Florida | 1 | 3 |
3 | Florida | 3 | 11 |
```python
def inspect(x):
    print(type(x))
    print(x)
```

```python
df_6.groupby("State").apply(inspect)
```

```
<class 'pandas.core.frame.DataFrame'>
     State  a   b
2  Florida  1   3
3  Florida  3  11
<class 'pandas.core.frame.DataFrame'>
     State  a   b
2  Florida  1   3
3  Florida  3  11
<class 'pandas.core.frame.DataFrame'>
   State  a   b
0  Texas  4   6
1  Texas  5  10
```
Two things are clear from the print-out: `apply` works on whole DataFrames, and the first group is computed twice, because pandas uses that first pass to decide whether there is a faster code path for the remaining groups.
```python
df_6.groupby("State").transform(inspect)
```

```
<class 'pandas.core.series.Series'>
2    1
3    3
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
2     3
3    11
Name: b, dtype: int64
<class 'pandas.core.frame.DataFrame'>
   a   b
2  1   3
3  3  11
<class 'pandas.core.series.Series'>
0    4
1    5
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
0     6
1    10
Name: b, dtype: int64
```
Two things are also clear from this print-out: `transform` passes one column at a time; and there is one case where it receives the whole group as a DataFrame, a quirk worth investigating.
From this comparison we get a useful caveat: do not pass a function that involves multiple columns to `transform`; doing so will only get you an error, as the following code shows:

```python
def subtract(x):
    return x["a"] - x["b"]
```
```python
try:
    df_6.groupby("State").transform(subtract)
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])
```

```
KeyError: ('a', 'occurred at index a')
```
Another caveat: when using `transform`, do not try to change the length of the result. Not only does this raise an error, but the traceback is so cryptic that it may take you a long time to realize what went wrong.

```python
def return_more(x):
    return np.arange(3)
```
```python
try:
    df_6.groupby("State").transform(return_more)
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])
```

```
ValueError: Length mismatch: Expected axis has 6 elements, new values have 4 elements
```
This error message reads a bit backwards: it claims the expected axis has six elements while the new values have four. It would be easier to understand stated the other way around: six elements were returned when only four were expected.
Finally, a practical lesson to close this talk: if you are sure you want to operate on multiple columns at once, and preferably fast, do not use the `transform` method. Talk 9 gives a good example of this.
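If you do need a calculation that spans several columns, `apply` (which sees the whole group as a DataFrame) is the tool; a sketch using the same df_6-style data:

```python
import pandas as pd

df = pd.DataFrame({"State": ["Texas", "Texas", "Florida", "Florida"],
                   "a": [4, 5, 1, 3], "b": [6, 10, 3, 11]})

# apply receives each group as a DataFrame, so multi-column logic works
diff_sum = df.groupby("State").apply(lambda g: (g["a"] - g["b"]).sum())
```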
Talk 7: `agg` usage summary
(1) Call multiple functions on all columns at once:

```python
df_0.groupby("A").agg([np.sum, np.mean, np.min])
```
C | D | |||||
---|---|---|---|---|---|---|
sum | mean | amin | sum | mean | amin | |
A | ||||||
bar | 0.978077 | 0.244519 | 1.343518 | 2.042817 | 0.510704 | 2.064735 |
foo | 0.184686 | 0.046172 | 1.854274 | 3.113988 | 0.778497 | 0.210586 |
(2) Call multiple functions on a particular column at once:

```python
df_0.groupby("A")["C"].agg([np.sum, np.mean, np.min])
```
sum | mean | amin | |
---|---|---|---|
A | |||
bar | 0.978077 | 0.244519 | 1.343518 |
foo | 0.184686 | 0.046172 | 1.854274 |
(3) Call different functions on different columns:

```python
df_0.groupby("A").agg({"C": [np.sum, np.mean], "D": [np.max, np.min]})
```
C | D | |||
---|---|---|---|---|
sum | mean | amax | amin | |
A | ||||
bar | 0.978077 | 0.244519 | 0.917027 | 2.064735 |
foo | 0.184686 | 0.046172 | 1.348597 | 0.210586 |
```python
df_0.groupby("A").agg({"C": "sum", "D": "min"})
```
C | D | |
---|---|---|
A | ||
bar | 0.978077 | 2.064735 |
foo | 0.184686 | 0.210586 |
(4) Call different functions on the same column and rename the results directly:

```python
df_0.groupby("A")["C"].agg([("Largest", "max"), ("Smallest", "min")])
```
Largest | Smallest | |
---|---|---|
A | ||
bar | 1.477379 | 1.343518 |
foo | 1.145852 | 1.854274 |
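In pandas 0.25 and later, the same rename-on-aggregate can be written with "named aggregation", which is arguably clearer; a small sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"A": ["bar", "foo", "bar", "foo"],
                   "C": [1.0, 4.0, 3.0, 2.0]})

# keyword = (column, aggregation function)
out = df.groupby("A").agg(Largest=("C", "max"), Smallest=("C", "min"))
```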
(5) Call the same function on multiple columns:

```python
agg_keys = {}.fromkeys(["C", "D"], "sum")
df_0.groupby("A").agg(agg_keys)
```
C | D | |
---|---|---|
A | ||
bar | 0.978077 | 2.042817 |
foo | 0.184686 | 3.113988 |
(6) Note that `agg` ignores missing values; keep this in mind when counting:

```python
df_7 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B"],
                     "Num": [1, np.nan, 1, 1, 1]})
df_7
```
ID | Num | |
---|---|---|
0 | A | 1.0 |
1 | A | NaN |
2 | A | 1.0 |
3 | B | 1.0 |
4 | B | 1.0 |
```python
df_7.groupby("ID").agg({"Num": "count"})
```
Num | |
---|---|
ID | |
A | 2 |
B | 2 |
Note: pandas' built-in `count`, `sum`, `mean`, `median`, `std`, `var`, `min`, `max`, and similar functions are optimized in C. So, once again, if you use `agg` on a big data set, it is better to use these built-ins (passed as strings) than to borrow `np.sum` and friends from numpy: slowness is the slow accumulation of each step.
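The string names dispatch to the optimized built-in implementations; numerically they match the numpy versions, so nothing is lost by preferring them:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 2.0, 3.0]})

fast = df.groupby("g")["v"].agg("sum")   # optimized built-in path
slow = df.groupby("g")["v"].agg(np.sum)  # same result via a numpy callable
```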
Talk 8: `filter` pitfall analysis
Typically, after grouping a DataFrame and performing a given operation, the results can be returned directly or filtered as required. Such filtering generally refers to the `filter` operation, but it is important to understand whether you need to filter out whole groups or individual rows within a group. This talk is about filtering.
Example: find the students whose score in each course is below that course's average.
```python
df_8 = pd.DataFrame({"Subject": list(chain(*[["Math"] * 3, ["Computer"] * 3])),
                     "Student": list(chain(*[["Chan", "Ida", "Ada"]] * 2)),
                     "Score": [80, 90, 85, 90, 85, 95]})
df_8
```
Subject | Student | Score | |
---|---|---|---|
0 | Math | Chan | 80 |
1 | Math | Ida | 90 |
2 | Math | Ada | 85 |
3 | Computer | Chan | 90 |
4 | Computer | Ida | 85 |
5 | Computer | Ada | 95 |
Is such a requirement appropriate for `filter`? Let's try:

```python
try:
    df_8.groupby("Subject").filter(lambda x: x["Score"] < x["Score"].mean())
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])
```

```
TypeError: filter function returned a Series, but expected a scalar bool
```
Obviously not, because what `filter` really does is keep or discard entire groups. The confusion here is the same as the confusion between WHERE and HAVING when first learning SQL: just as HAVING operates at the group level, so does `filter`.

The correct way to solve this example is:

```python
df_8.groupby("Subject").apply(lambda g: g[g.Score < g.Score.mean()])
```
Subject | Student | Score | ||
---|---|---|---|---|
Subject | ||||
Computer | 4 | Computer | Ida | 85 |
Math | 0 | Math | Chan | 80 |
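An alternative to the `apply` solution that avoids the extra index level: build a row-level boolean mask with `transform`, which preserves the original index:

```python
import pandas as pd

df_8 = pd.DataFrame({"Subject": ["Math"] * 3 + ["Computer"] * 3,
                     "Student": ["Chan", "Ida", "Ada"] * 2,
                     "Score": [80, 90, 85, 90, 85, 95]})

# transform("mean") broadcasts each group's mean back to the group's rows
mask = df_8["Score"] < df_8.groupby("Subject")["Score"].transform("mean")
below_average = df_8[mask]
```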
As for `filter` itself, here is the example from the official documentation for comparison:

```python
df_9 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                     'B': [1, 2, 3, 4, 5, 6],
                     'C': [2.0, 5., 8., 1., 2., 9.]})
df_9
```
A | B | C | |
---|---|---|---|
0 | foo | 1 | 2.0 |
1 | bar | 2 | 5.0 |
2 | foo | 3 | 8.0 |
3 | bar | 4 | 1.0 |
4 | foo | 5 | 2.0 |
5 | bar | 6 | 9.0 |
```python
df_9.groupby('A').filter(lambda x: x['B'].mean() > 3.)
```
A | B | C | |
---|---|---|---|
1 | bar | 2 | 5.0 |
3 | bar | 4 | 1.0 |
5 | bar | 6 | 9.0 |
Part 3: `groupby` application examples
Talk 9: Filling missing values within a group

```python
df_10 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B", "B"],
                      "Num": [100, np.nan, 300, np.nan, 500, 600]})
df_10
```
ID | Num | |
---|---|---|
0 | A | 100.0 |
1 | A | NaN |
2 | A | 300.0 |
3 | B | NaN |
4 | B | 500.0 |
5 | B | 600.0 |
```python
df_10.groupby("ID", as_index=False).Num.transform(
    lambda x: x.fillna(method="ffill")).transform(
    lambda x: x.fillna(method="bfill"))
```
Num | |
---|---|
0 | 100.0 |
1 | 100.0 |
2 | 300.0 |
3 | 500.0 |
4 | 500.0 |
5 | 600.0 |
If the DataFrame is large (over 1 GB, say), the transform + lambda approach is slow; the following method is roughly 100 times faster than the combination above:

```python
df_10.groupby("ID", as_index=False).ffill().groupby("ID", as_index=False).bfill()
```
ID | Num | |
---|---|---|
0 | A | 100.0 |
1 | A | 100.0 |
2 | A | 300.0 |
3 | B | 500.0 |
4 | B | 500.0 |
5 | B | 600.0 |
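A related group-wise filling strategy: replace missing values with the group mean via `transform`, which also stays on the fast vectorized path:

```python
import pandas as pd
import numpy as np

df_10 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B", "B"],
                      "Num": [100, np.nan, 300, np.nan, 500, 600]})

# fillna accepts a Series aligned on the index; transform("mean")
# broadcasts each group's mean to that group's rows
filled = df_10["Num"].fillna(df_10.groupby("ID")["Num"].transform("mean"))
```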
References:
Stackoverflow.com/questions/2…
Stackoverflow.com/questions/4…
Stackoverflow.com/questions/2…
Pandas.pydata.org/pandas-docs…