Pandas would never have become the dominant tool for data analysis if all it offered were an elegant way to put data into a DataFrame. Data analysis means describing data through a series of statistical indicators, and comparing results across specific groups, both horizontally and longitudinally.
GroupBy is exactly that weapon. In fact, SQL became the standard tool of advanced data analysts decades before pandas appeared, in large part because of its standard SELECT xx FROM xx WHERE condition GROUP BY xx HAVING condition paradigm.
Thanks to Wes McKinney and his team, we now have a tool more flexible and adaptable than SQL, instead of trudging through a SQL shell or plain Python.
[Example] Expressing an SQL statement in pandas
SQL

```sql
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
WHERE Condition1
GROUP BY Column1, Column2
HAVING Condition2
```

Pandas

```python
df[Condition1].groupby([Column1, Column2], as_index=False).agg(
    {Column3: "mean", Column4: "sum"}).filter(Condition2)
```
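To make the mapping concrete, here is a runnable sketch of the same shape; the column names `c1`–`c4` and both thresholds are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"c1": ["x", "x", "y"], "c2": ["p", "q", "p"],
                   "c3": [1.0, 2.0, 3.0], "c4": [10, 20, 30]})

# WHERE c4 > 15  ->  boolean indexing before the groupby
out = (df[df["c4"] > 15]
       .groupby(["c1", "c2"], as_index=False)
       .agg({"c3": "mean", "c4": "sum"}))

# HAVING sum(c4) > 25  ->  filter the aggregated result
out = out[out["c4"] > 25]
```

The WHERE clause becomes a row filter applied before grouping, while the HAVING clause becomes a filter on the aggregated result.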
Group By: split – apply – combine
GroupBy can be broken down into three steps:
- Splitting: divide the data into groups by some key
- Applying: apply a function to each group independently
- Combining: combine the per-group results into a single structure
So how does this pipeline actually run?
- Splitting is implemented by `groupby`
- Applying is implemented by `agg`, `apply`, `transform`, and `filter`, depending on the specific operation
- Combining is implemented by functions such as `concat`
In the Apply step, there are usually four types of operations:
- Aggregation: compute summary statistics
- Apply: perform some data conversion
- Transformation: perform some data processing
- Filtration: perform group-level filtering
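The three steps above can be sketched by hand; this is a toy illustration of the idea, not how pandas implements it internally:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Split: one sub-frame per key
pieces = {name: g for name, g in df.groupby("key")}
# Apply: aggregate each piece independently
sums = {name: g["val"].sum() for name, g in pieces.items()}
# Combine: glue the per-group results back together
result = pd.Series(sums)

# The one-liner equivalent
expected = df.groupby("key")["val"].sum()
assert list(result.sort_index()) == list(expected.sort_index())
```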
Note that the `apply`, `agg`, `transform`, and `filter` methods discussed here all belong to `pandas.core.groupby.DataFrameGroupBy`; do not confuse them with the `pandas.DataFrame` methods of the same names.
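A quick way to see the distinction for yourself:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
grouped = df.groupby("key")

# The result of groupby is a DataFrameGroupBy, not a DataFrame;
# its agg/apply/transform/filter act per group.
print(type(grouped).__name__)
```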
Import the required modules first
```python
import numpy as np
import pandas as pd
import sys, traceback
from itertools import chain
```
Part 1: Groupby
```python
df_0 = pd.DataFrame({'A': list(chain(*[['foo', 'bar']] * 4)),
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': np.random.randn(8),
                     'D': np.random.randn(8)})
df_0
```
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 1.145852 | 0.210586 |
1 | bar | one | 1.343518 | 2.064735 |
2 | foo | two | 0.544624 | 1.125505 |
3 | bar | three | 1.090288 | 0.296160 |
4 | foo | two | 1.854274 | 1.348597 |
5 | bar | two | 0.246072 | 0.598949 |
6 | foo | one | 0.348484 | 0.429300 |
7 | bar | three | 1.477379 | 0.917027 |
Talk 1: Things to note when creating a `GroupBy` object
Good Practice
```python
df_01 = df_0.copy()
df_01.groupby(["A", "B"], as_index=False, sort=False).agg({"C": "sum", "D": "mean"})
```
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 1.494336 | 0.319943 |
1 | bar | one | 1.343518 | 2.064735 |
2 | foo | two | 1.309649 | 1.237051 |
3 | bar | three | 2.567667 | 0.310433 |
4 | bar | two | 0.246072 | 0.598949 |
Poor Practice
```python
df_02 = df_0.copy()
df_02.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"}).reset_index()
```
A | B | C | D | |
---|---|---|---|---|
0 | bar | one | 1.343518 | 2.064735 |
1 | bar | three | 2.567667 | 0.310433 |
2 | bar | two | 0.246072 | 0.598949 |
3 | foo | one | 1.494336 | 0.319943 |
4 | foo | two | 1.309649 | 1.237051 |
- Passing `as_index=False` up front is a good habit: if the DataFrame is very large (say, over a GB), building the GroupBy object and then calling `reset_index()` afterwards costs extra time.
- Sorting is a luxury in any data operation. If you are simply grouping and do not care about the order of the keys, turn sorting off (`sort=False`) when creating the GroupBy object, since it is on by default. This matters especially on large data sets.
- Note that GroupBy keeps the rows within each group in the same order as in the original DataFrame, regardless of whether sorting is specified.
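A small demonstration of the `sort` parameter (sorted keys vs. order of first appearance) and of row order within groups:

```python
import pandas as pd

df = pd.DataFrame({"k": ["b", "a", "b", "a"], "v": [1, 2, 3, 4]})

sorted_keys = list(df.groupby("k", sort=True).groups)   # keys sorted: ['a', 'b']
appearance = list(df.groupby("k", sort=False).groups)   # first-appearance order: ['b', 'a']

# Rows inside each group keep their original order either way
group_b = df.groupby("k", sort=False).get_group("b")
```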
To get a multi-indexed result, use the default `as_index=True`, as in the following example:

```python
df_03 = df_0.copy()
df_03.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})
```
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 1.343518 | 2.064735 |
three | 2.567667 | 0.310433 | |
two | 0.246072 | 0.598949 | |
foo | one | 1.494336 | 0.319943 |
two | 1.309649 | 1.237051 |
Note that `as_index` only takes effect for aggregation operations; for other operations such as `transform`, the parameter is ignored:

```python
df_04 = df_0.copy()
df_04.groupby(["A", "B"], as_index=True).transform(lambda x: x * x)
```
C | D | |
---|---|---|
0 | 1.312976 | 0.044347 |
1 | 1.805040 | 4.263130 |
2 | 0.296616 | 1.266761 |
3 | 1.188727 | 0.087711 |
4 | 3.438331 | 1.818714 |
5 | 0.060552 | 0.358740 |
6 | 0.121441 | 0.184298 |
7 | 2.182650 | 0.840938 |
As you can see, we got a new DataFrame of the same length as df_0; we also wanted A and B to become the index, but that did not happen.
Talk 2: Use `pd.Grouper`
`pd.Grouper` is more powerful and flexible than a plain column key: it supports not only ordinary grouping, but also grouping by time with up-sampling or down-sampling.

```python
df_1 = pd.read_excel(r"dataset\sample-salesv3.xlsx")
df_1["date"] = pd.to_datetime(df_1["date"])
```
```python
df_1.head()
```
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 |
Calculate the sum of ext price for each month:

```python
df_1.set_index("date").resample("M")["ext price"].sum()
```

```
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
Freq: M, Name: ext price, dtype: float64
```
```python
df_1.groupby(pd.Grouper(key="date", freq="M"))["ext price"].sum()
```

```
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
Freq: M, Name: ext price, dtype: float64
```
Both get the same result, and the second seems a little hard to understand. Let’s do another example
Calculate the ext price sum for each customer each month:

```python
df_1.set_index("date").groupby("name")["ext price"].resample("M").sum().head(20)
```

```
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
Name: ext price, dtype: float64
```
```python
df_1.groupby(["name", pd.Grouper(key="date", freq="M")])["ext price"].sum().head(20)
```

```
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
Name: ext price, dtype: float64
```
This time, the second version is far cleaner and easier to understand than the first. For such mixed grouping by a specific field plus time resampling, prefer `pd.Grouper`.
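Since the Excel file above may not be at hand, here is a self-contained sketch of the same pattern on made-up data (the names, dates, and amounts are invented); it uses the month-start frequency alias `"MS"`, so each bin is labeled by the first day of the month:

```python
import pandas as pd

sales = pd.DataFrame({
    "name": ["x", "x", "y", "y"],
    "date": pd.to_datetime(["2014-01-05", "2014-02-10", "2014-01-20", "2014-02-25"]),
    "ext_price": [10.0, 20.0, 30.0, 40.0],
})

# Group by customer and by calendar month in one go
monthly = sales.groupby(["name", pd.Grouper(key="date", freq="MS")])["ext_price"].sum()
```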
Talk 3: How to access the groups
If you only split without applying, you get a GroupBy object. The main ways to access the resulting groups are:
- the `groups` attribute
- the `get_group` method
- iteration

```python
df_2 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
df_2
```
X | Y | |
---|---|---|
0 | A | 1 |
1 | B | 4 |
2 | A | 3 |
3 | B | 2 |
- Use the `groups` attribute to see all groups:

```python
df_2.groupby("X").groups
```

```
{'A': Int64Index([0, 2], dtype='int64'),
 'B': Int64Index([1, 3], dtype='int64')}
```
- Use the `get_group` method to access a specific group:

```python
df_2.groupby("X", as_index=True).get_group(name="A")
```
X | Y | |
---|---|---|
0 | A | 1 |
2 | A | 3 |
Note that the `name` parameter of `get_group` only accepts a single value: you can pass a `str` but not a `list`, even though lists are accepted in many other places in pandas. If you grouped by multiple columns, pass a `tuple`.
- Iterate over the groups:

```python
for name, group in df_2.groupby("X"):
    print(name)
    print(group, "\n")
```
```
A
   X  Y
0  A  1
2  A  3

B
   X  Y
1  B  4
3  B  2
```
Here is a little trick: if you have a `pandas.core.groupby.DataFrameGroupBy` object and want to turn it back into the original DataFrame, there is a very simple way:

```python
grouped_object.apply(lambda x: x)
```
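One caveat worth noting: in recent pandas versions, `apply` prepends the group keys to the result index by default, so to truly recover the original frame you may need `group_keys=False`; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# group_keys=False keeps the original index labels instead of
# prepending the group key as an extra index level
restored = df.groupby("X", group_keys=False).apply(lambda x: x)
restored = restored.sort_index()  # rows come back in group order
```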
For lack of space I won't walk through the whole API; instead, here are the three parameters that are easiest to overlook and most likely to cause mistakes:

parameter | notes |
---|---|
level | Valid only for DataFrames with a hierarchical index |
as_index | Takes effect only for aggregation (`agg`) operations |
group_keys | Takes effect only when calling `apply` |
Part 2: The Apply phase explained
After splitting, you can operate on each group. Generally speaking, the operations fall into four categories:
- aggregation
- apply
- transform
- filter
Let's compare these four types of operations:
- Any operation that compresses a `Series` down to a scalar, such as summing, averaging, or taking extrema, is an `agg` operation.
- Transforming a DataFrame or GroupBy object to obtain a subset or a new DataFrame is an `apply` or `transform` operation.
- Filtering groups based on aggregated results is a `filter` operation.

`apply` and `transform` have several similarities, which are highlighted below.
Talk 4: `agg` vs `apply`
Both `agg` and `apply` can pass a function to a particular column and evaluate it. The difference is that `agg` is more flexible and efficient and can do everything in one pass, whereas `apply` needs several passes to accomplish the same work.
```python
df_3 = pd.DataFrame({"name": ["Foo", "Bar", "Foo", "Bar"],
                     "score": [80, 80, 95, 70]})
df_3
```
name | score | |
---|---|---|
0 | Foo | 80 |
1 | Bar | 80 |
2 | Foo | 95 |
3 | Bar | 70 |
We need to calculate each person’s total score, highest score, lowest score
(1) Using the `apply` method:

```python
df_3.groupby("name", sort=False).score.apply(lambda x: x.sum())
```

```
name
Foo    175
Bar    150
Name: score, dtype: int64
```
```python
df_3.groupby("name", sort=False).score.apply(lambda x: x.max())
```

```
name
Foo    95
Bar    80
Name: score, dtype: int64
```
```python
df_3.groupby("name", sort=False).score.apply(lambda x: x.min())
```

```
name
Foo    80
Bar    70
Name: score, dtype: int64
```
Clearly we had to make three separate passes, plus an extra step to glue the three results together.

(2) Using the `agg` method:

```python
df_3.groupby("name", sort=False).agg({"score": [np.sum, np.max, np.min]})
```
score | |||
---|---|---|---|
sum | amax | amin | |
name | |||
Foo | 175 | 95 | 80 |
Bar | 150 | 80 | 70 |
`agg` can call different functions on multiple columns independently in a single pass, whereas `apply` can only call the same function on multiple columns at a time.
Talk 5: `transform` vs `agg`
`transform` applies a function to the DataFrame and returns the transformed values. The returned object has the same number of rows as the original, though columns can be expanded. Note that it does not modify the original object in place; a new object is created, so the original stays unchanged.

```python
df_4 = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
df_4
```
A | B | |
---|---|---|
0 | 0 | 1 |
1 | 1 | 2 |
2 | 2 | 3 |
```python
df_4.transform(lambda x: x + 1)
```
A | B | |
---|---|---|
0 | 1 | 2 |
1 | 2 | 3 |
2 | 3 | 4 |
You can also group a DataFrame and then apply a transformation to each group, such as adding 1 to every element. This is a trivial example, but it shows the mechanics:

```python
df_2.groupby("X").transform(lambda x: x + 1)
```
Y | |
---|---|
0 | 2 |
1 | 5 |
2 | 4 |
3 | 3 |
Let's take a more practical example:

```python
df_5 = pd.read_csv(r"dataset\tips.csv")
df_5.head()
```
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Now suppose we want the mean of each numeric column per day. Compare the following two operations, `agg` and `transform`:

```python
df_5.groupby("day").aggregate("mean")
```
total_bill | tip | size | |
---|---|---|---|
day | |||
Fri | 17.151579 | 2.734737 | 2.105263 |
Sat | 20.441379 | 2.993103 | 2.517241 |
Sun | 21.410000 | 3.255132 | 2.842105 |
Thur | 17.682742 | 2.771452 | 2.451613 |
```python
df_5.groupby('day').transform(lambda x: x.mean()).total_bill.unique()
```

```
array([21.41      , 20.44137931, 17.68274194, 17.15157895])
```
Observe that the two operations compute the same thing, the mean of each group. The difference is that `agg` returns only 4 rows (compressed statistics), while `transform` returns a new DataFrame of the same length as the original.
Talk 6: `transform` vs `apply`
`transform` and `apply` differ in two main ways:
- For each group, `apply` calls the function on all columns at once, whereas `transform` calls the function on each column in turn.
- As a consequence of the above: `apply` can return a scalar, a `Series`, or a `DataFrame`, depending on what you call it with; `transform` can only return an array-like one-dimensional sequence (`Series`, `array`, `list`), and, most importantly, one with the same length as the group, otherwise an error is raised.
Compare what the two methods operate on by printing the type of the object they receive:

```python
df_6 = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                     'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
df_6
```
State | a | b | |
---|---|---|---|
0 | Texas | 4 | 6 |
1 | Texas | 5 | 10 |
2 | Florida | 1 | 3 |
3 | Florida | 3 | 11 |
```python
def inspect(x):
    print(type(x))
    print(x)
```

```python
df_6.groupby("State").apply(inspect)
```

```
<class 'pandas.core.frame.DataFrame'>
     State  a   b
2  Florida  1   3
3  Florida  3  11
<class 'pandas.core.frame.DataFrame'>
     State  a   b
2  Florida  1   3
3  Florida  3  11
<class 'pandas.core.frame.DataFrame'>
   State  a   b
0  Texas  4   6
1  Texas  5  10
```
Two things are clear from the print-out: `apply` works on whole DataFrames, and the first group is computed twice, because pandas uses that first pass to decide whether there is a faster code path for the remaining groups.
```python
df_6.groupby("State").transform(inspect)
```

```
<class 'pandas.core.series.Series'>
2    1
3    3
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
2     3
3    11
Name: b, dtype: int64
<class 'pandas.core.frame.DataFrame'>
   a   b
2  1   3
3  3  11
<class 'pandas.core.series.Series'>
0    4
1    5
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
0     6
1    10
Name: b, dtype: int64
```
Two things are also clear from this print-out: `transform` passes one column at a time; and there is one case where it receives the whole group as a DataFrame, a quirk worth investigating.
From this comparison we get a useful caveat: do not pass a function that involves multiple columns to `transform`; doing so will only get you an error, as the following code shows:

```python
def subtract(x):
    return x["a"] - x["b"]
```
```python
try:
    df_6.groupby("State").transform(subtract)
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])
```

```
KeyError: ('a', 'occurred at index a')
```
Another caveat: when using `transform`, do not try to change the length of the result. Not only does this raise an error, but the traceback is so cryptic that it may take you a long time to realize what went wrong.

```python
def return_more(x):
    return np.arange(3)
```
```python
try:
    df_6.groupby("State").transform(return_more)
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])
```

```
ValueError: Length mismatch: Expected axis has 6 elements, new values have 4 elements
```
This error message reads a bit backwards: it claims the expected axis has six elements while the new values have four. It would be easier to understand stated the other way around: six elements were returned when only four were expected.
Finally, a practical lesson to close this talk: if you are sure you want to operate on multiple columns at once, and preferably fast, do not use the `transform` method. Talk 9 gives a good example of this.
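If you do need a calculation that spans several columns, `apply` (which sees the whole group as a DataFrame) is the tool; a sketch using the same df_6-style data:

```python
import pandas as pd

df = pd.DataFrame({"State": ["Texas", "Texas", "Florida", "Florida"],
                   "a": [4, 5, 1, 3], "b": [6, 10, 3, 11]})

# apply receives each group as a DataFrame, so multi-column logic works
diff_sum = df.groupby("State").apply(lambda g: (g["a"] - g["b"]).sum())
```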
Talk 7: `agg` usage summary
(1) Call multiple functions on all columns at once:

```python
df_0.groupby("A").agg([np.sum, np.mean, np.min])
```
C | D | |||||
---|---|---|---|---|---|---|
sum | mean | amin | sum | mean | amin | |
A | ||||||
bar | 0.978077 | 0.244519 | 1.343518 | 2.042817 | 0.510704 | 2.064735 |
foo | 0.184686 | 0.046172 | 1.854274 | 3.113988 | 0.778497 | 0.210586 |
(2) Call multiple functions on a particular column at once:

```python
df_0.groupby("A")["C"].agg([np.sum, np.mean, np.min])
```
sum | mean | amin | |
---|---|---|---|
A | |||
bar | 0.978077 | 0.244519 | 1.343518 |
foo | 0.184686 | 0.046172 | 1.854274 |
(3) Call different functions on different columns:

```python
df_0.groupby("A").agg({"C": [np.sum, np.mean], "D": [np.max, np.min]})
```
C | D | |||
---|---|---|---|---|
sum | mean | amax | amin | |
A | ||||
bar | 0.978077 | 0.244519 | 0.917027 | 2.064735 |
foo | 0.184686 | 0.046172 | 1.348597 | 0.210586 |
```python
df_0.groupby("A").agg({"C": "sum", "D": "min"})
```
C | D | |
---|---|---|
A | ||
bar | 0.978077 | 2.064735 |
foo | 0.184686 | 0.210586 |
(4) Call different functions on the same column and rename the results directly:

```python
df_0.groupby("A")["C"].agg([("Largest", "max"), ("Smallest", "min")])
```
Largest | Smallest | |
---|---|---|
A | ||
bar | 1.477379 | 1.343518 |
foo | 1.145852 | 1.854274 |
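In pandas 0.25 and later, the same rename-on-aggregate can be written with "named aggregation", which is arguably clearer; a small sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"A": ["bar", "foo", "bar", "foo"],
                   "C": [1.0, 4.0, 3.0, 2.0]})

# keyword = (column, aggregation function)
out = df.groupby("A").agg(Largest=("C", "max"), Smallest=("C", "min"))
```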
(5) Call the same function on multiple columns:

```python
agg_keys = {}.fromkeys(["C", "D"], "sum")
df_0.groupby("A").agg(agg_keys)
```
C | D | |
---|---|---|
A | ||
bar | 0.978077 | 2.042817 |
foo | 0.184686 | 3.113988 |
(6) Note that `agg` ignores missing values; keep this in mind when counting:

```python
df_7 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B"],
                     "Num": [1, np.nan, 1, 1, 1]})
df_7
```
ID | Num | |
---|---|---|
0 | A | 1.0 |
1 | A | NaN |
2 | A | 1.0 |
3 | B | 1.0 |
4 | B | 1.0 |
```python
df_7.groupby("ID").agg({"Num": "count"})
```
Num | |
---|---|
ID | |
A | 2 |
B | 2 |
Note: pandas' built-in `count`, `sum`, `mean`, `median`, `std`, `var`, `min`, `max`, and similar functions are optimized in C. So, once again, if you use `agg` on a big data set, it is better to use these built-ins (passed as strings) than to borrow `np.sum` and friends from numpy: slowness is the slow accumulation of each step.
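The string names dispatch to the optimized built-in implementations; numerically they match the numpy versions, so nothing is lost by preferring them:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 2.0, 3.0]})

fast = df.groupby("g")["v"].agg("sum")   # optimized built-in path
slow = df.groupby("g")["v"].agg(np.sum)  # same result via a numpy callable
```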
Talk 8: `filter` pitfall analysis
Typically, after grouping a DataFrame and performing a given operation, the results can be returned directly or filtered as required. Such filtering generally refers to the `filter` operation, but it is important to understand whether you need to filter out whole groups or individual rows within a group. This talk is about filtering.
Example: find the students whose score in each course is below that course's average.
```python
df_8 = pd.DataFrame({"Subject": list(chain(*[["Math"] * 3, ["Computer"] * 3])),
                     "Student": list(chain(*[["Chan", "Ida", "Ada"]] * 2)),
                     "Score": [80, 90, 85, 90, 85, 95]})
df_8
```
Subject | Student | Score | |
---|---|---|---|
0 | Math | Chan | 80 |
1 | Math | Ida | 90 |
2 | Math | Ada | 85 |
3 | Computer | Chan | 90 |
4 | Computer | Ida | 85 |
5 | Computer | Ada | 95 |
Is such a requirement appropriate for `filter`? Let's try:

```python
try:
    df_8.groupby("Subject").filter(lambda x: x["Score"] < x["Score"].mean())
except Exception:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    formatted_lines = traceback.format_exc().splitlines()
    print(formatted_lines[-1])
```

```
TypeError: filter function returned a Series, but expected a scalar bool
```
Obviously not, because what `filter` really does is keep or discard entire groups. The confusion here is the same as the confusion between WHERE and HAVING when first learning SQL: just as HAVING operates at the group level, so does `filter`.

The correct way to solve this example is:

```python
df_8.groupby("Subject").apply(lambda g: g[g.Score < g.Score.mean()])
```
Subject | Student | Score | ||
---|---|---|---|---|
Subject | ||||
Computer | 4 | Computer | Ida | 85 |
Math | 0 | Math | Chan | 80 |
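An alternative to the `apply` solution that avoids the extra index level: build a row-level boolean mask with `transform`, which preserves the original index:

```python
import pandas as pd

df_8 = pd.DataFrame({"Subject": ["Math"] * 3 + ["Computer"] * 3,
                     "Student": ["Chan", "Ida", "Ada"] * 2,
                     "Score": [80, 90, 85, 90, 85, 95]})

# transform("mean") broadcasts each group's mean back to the group's rows
mask = df_8["Score"] < df_8.groupby("Subject")["Score"].transform("mean")
below_average = df_8[mask]
```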
As for `filter` itself, here is the example from the official documentation for comparison:

```python
df_9 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                     'B': [1, 2, 3, 4, 5, 6],
                     'C': [2.0, 5., 8., 1., 2., 9.]})
df_9
```
A | B | C | |
---|---|---|---|
0 | foo | 1 | 2.0 |
1 | bar | 2 | 5.0 |
2 | foo | 3 | 8.0 |
3 | bar | 4 | 1.0 |
4 | foo | 5 | 2.0 |
5 | bar | 6 | 9.0 |
```python
df_9.groupby('A').filter(lambda x: x['B'].mean() > 3.)
```
A | B | C | |
---|---|---|---|
1 | bar | 2 | 5.0 |
3 | bar | 4 | 1.0 |
5 | bar | 6 | 9.0 |
Part 3: `groupby` application examples
Talk 9: Filling missing values within a group

```python
df_10 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B", "B"],
                      "Num": [100, np.nan, 300, np.nan, 500, 600]})
df_10
```
ID | Num | |
---|---|---|
0 | A | 100.0 |
1 | A | NaN |
2 | A | 300.0 |
3 | B | NaN |
4 | B | 500.0 |
5 | B | 600.0 |
```python
df_10.groupby("ID", as_index=False).Num.transform(
    lambda x: x.fillna(method="ffill")).transform(
    lambda x: x.fillna(method="bfill"))
```
Num | |
---|---|
0 | 100.0 |
1 | 100.0 |
2 | 300.0 |
3 | 500.0 |
4 | 500.0 |
5 | 600.0 |
If the DataFrame is large (over 1 GB, say), the transform + lambda approach is slow; the following method is roughly 100 times faster than the combination above:

```python
df_10.groupby("ID", as_index=False).ffill().groupby("ID", as_index=False).bfill()
```
ID | Num | |
---|---|---|
0 | A | 100.0 |
1 | A | 100.0 |
2 | A | 300.0 |
3 | B | 500.0 |
4 | B | 500.0 |
5 | B | 600.0 |
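A related group-wise filling strategy: replace missing values with the group mean via `transform`, which also stays on the fast vectorized path:

```python
import pandas as pd
import numpy as np

df_10 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B", "B"],
                      "Num": [100, np.nan, 300, np.nan, 500, 600]})

# fillna accepts a Series aligned on the index; transform("mean")
# broadcasts each group's mean to that group's rows
filled = df_10["Num"].fillna(df_10.groupby("ID")["Num"].transform("mean"))
```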
References:
Stackoverflow.com/questions/2…
Stackoverflow.com/questions/4…
Stackoverflow.com/questions/2…
Pandas.pydata.org/pandas-docs…