This article is participating in Python Theme Month. See the link for details
Introduction
The DataFrame type in Pandas supports groupby operations similar to those on database tables. Generally speaking, a groupby operation has three parts: splitting the data, applying a function, and combining the results.
This article explains the groupby operation in Pandas in detail.
Splitting data
The purpose of the split step is to divide the DataFrame into groups. To perform a groupby operation, first create a DataFrame containing the columns to group by:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

In [61]: df
Out[61]:
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380  0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744
By default, groupby operates along axis 0 (the row index). The grouping key can be a single column or multiple columns:
In [8]: grouped = df.groupby("A")
In [9]: grouped = df.groupby(["A", "B"])
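As a self-contained sketch of the split step (the data below is illustrative, with deterministic values in place of `np.random.randn`):

```python
import numpy as np
import pandas as pd

# Illustrative data; values are arbitrary but deterministic
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.arange(8.0),
        "D": np.arange(8.0, 16.0),
    }
)

# Split by a single column, then by two columns
single = df.groupby("A")
multi = df.groupby(["A", "B"])

print(single.ngroups)  # 2 distinct values in A
print(multi.ngroups)   # 6 distinct (A, B) pairs
```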
Multiple indexes
Since version 0.24, if a DataFrame has a MultiIndex, we can group by a subset of its index levels:
In [10]: df2 = df.set_index(["A", "B"])
In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
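A runnable sketch of grouping by every index level except one (the data here is illustrative and deterministic):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)
df2 = df.set_index(["A", "B"])

# Group by every index level except "B"
grouped = df2.groupby(level=df2.index.names.difference(["B"]))
result = grouped.sum()
print(result)
# "bar" rows sum to 6.0, "foo" rows to 4.0
```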
get_group
get_group() retrieves the data of a single group:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
In [25]: df3.groupby(["X"]).get_group("A")
Out[25]:
X Y
0 A 1
2 A 3
In [26]: df3.groupby(["X"]).get_group("B")
Out[26]:
X Y
1 B 4
3 B 2
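For a multi-column grouping, get_group expects a tuple of key values. A small sketch (the second frame, df4, is a hypothetical example):

```python
import pandas as pd

df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# Single-key group: pass the key value
a_rows = df3.groupby("X").get_group("A")
print(a_rows)

# Multi-key group: pass a tuple of key values (df4 is illustrative)
df4 = pd.DataFrame({"X": ["A", "A", "B"], "Y": ["u", "v", "u"], "Z": [1, 2, 3]})
row = df4.groupby(["X", "Y"]).get_group(("A", "v"))
print(row)
```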
dropna
NaN values in the grouping keys are excluded from groupby by default. Setting dropna=False keeps them:
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
Groups attribute
The groupby object has a groups attribute, which is a dictionary: its keys are the computed group labels, and its values are the row indexes belonging to each group.
In [34]: grouped = df.groupby(["A", "B"])
In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
In [36]: len(grouped)
Out[36]: 6
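A minimal sketch of inspecting the groups attribute (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "B": [1, 2, 3]})
grouped = df.groupby("A")

# groups maps each group label to the row indexes of its members
print(grouped.groups)  # {'bar': [1], 'foo': [0, 2]}
print(len(grouped))    # number of groups: 2
```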
Hierarchical indexes
For objects with a hierarchical (multi-level) index, groupby can group by a specific index level:
In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]:
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64
Group by the first level:
In [44]: grouped = s.groupby(level=0)
In [45]: grouped.sum()
Out[45]:
first
bar -0.962232
baz 1.237723
foo 0.785980
qux 1.911055
dtype: float64
Group by the second level:
In [46]: s.groupby(level="second").sum()
Out[46]:
second
one    0.980950
two    1.991575
dtype: float64
Iterating over groups
Once we have a groupby object, we can iterate over the groups with a for loop:
In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....:
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580
If the grouping uses multiple columns, the group name is a tuple:
In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....:
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C         D
7  foo  three -0.862495  0.024580
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652
Aggregation operations
After grouping, you can aggregate:
In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]:
            C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429
For multi-key grouping, the aggregated result has a MultiIndex by default. If you want a flat, new index instead, pass as_index=False:
In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]:
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590
The effect above is equivalent to calling reset_index():
In [74]: df.groupby(["A", "B"]).sum().reset_index()
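A quick runnable check of this equivalence (illustrative data; the comparison uses DataFrame.equals):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# as_index=False produces a flat index directly
flat = df.groupby(["A", "B"], as_index=False).sum()

# grouping with a MultiIndex, then flattening via reset_index
via_reset = df.groupby(["A", "B"]).sum().reset_index()

print(flat.equals(via_reset))  # True
```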
grouped.size() counts the size of each group:
In [75]: grouped.size()
Out[75]:
A B size
0 bar one 1
1 bar three 1
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
describe()
In [76]: grouped.describe()
Out[76]:
      C                                          ...         D
  count      mean       std       min       25% ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161 ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897 ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389 ...

[6 rows x 16 columns]
Common aggregation methods
The following are commonly used aggregation methods:
function | description
---|---
mean() | Mean of the values
sum() | Sum of the values
size() | Group sizes
count() | Count of values in each group
std() | Standard deviation
var() | Variance
sem() | Standard error of the mean
describe() | Summary statistics
first() | First value in each group
last() | Last value in each group
nth() | nth value in each group
min() | Minimum value
max() | Maximum value
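A small sketch exercising a few of these methods on illustrative data:

```python
import pandas as pd

g = pd.DataFrame(
    {"key": ["a", "a", "b", "b", "b", "a"], "val": [1, 2, 3, 4, 5, 6]}
).groupby("key")

print(g["val"].sum())    # a: 9, b: 12
print(g["val"].mean())   # a: 3.0, b: 4.0
print(g["val"].first())  # a: 1, b: 3
print(g["val"].nth(1))   # second row of each group
```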
Use multiple aggregation methods simultaneously
Multiple aggregation methods can be specified simultaneously:
In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]:
          sum      mean       std
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
The resulting columns can be renamed:
In [84]: (
....: grouped["C"]
....: .agg([np.sum, np.mean, np.std])
....: .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
....: )
....:
Out[84]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
NamedAgg
NamedAgg gives more precise control over aggregation. It has two fields: column (the column to aggregate) and aggfunc (the aggregation function).
In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )

In [89]: animals
Out[89]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
Out[90]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
Or just use a tuple:
In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
Out[91]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
Different columns specify different aggregation methods
By passing a dictionary to the agg method, you can apply different aggregations to different columns:
In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]:
            C         D
A
bar  0.392940  1.366330
foo -1.796421  0.884785
Transformation operations
A transformation returns an object that is the same size as the object being grouped. Data transformation is a common need in data analysis.
transform can take a lambda:
In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
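The ts series above is not defined in this excerpt; here is a self-contained sketch with made-up dates and values showing the same per-year range transform:

```python
import pandas as pd

# Illustrative time series spanning two years
ts = pd.Series(
    [1.0, 5.0, 2.0, 8.0],
    index=pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01"]),
)

# For each year, broadcast that year's (max - min) back to every row
spread = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
print(spread)
# 2020 rows get 4.0, 2021 rows get 6.0
```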
Filling NA values:
In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
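A self-contained sketch of group-wise NA filling (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, np.nan, 10.0, 30.0]})
grouped = df.groupby("key")["val"]

# Replace each NaN with the mean of its own group
filled = grouped.transform(lambda x: x.fillna(x.mean()))
print(filled)
# the NaN in group "a" becomes 1.0; group "b" is untouched
```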
Filtering operations
The filter method drops groups that do not satisfy a condition, which can be expressed as a lambda:
In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])
In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]:
3 3
4 3
5 3
dtype: int64
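filter works on DataFrame groupbys too; a sketch that keeps only groups whose column sum exceeds a threshold (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "B": [1, 2, 3, 4]})

# Keep only the groups whose B values sum to more than 5
kept = df.groupby("A").filter(lambda g: g["B"].sum() > 5)
print(kept)
# only the "bar" rows survive (2 + 4 = 6 > 5); "foo" sums to 4 and is dropped
```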
The Apply operation
Some operations fit neither aggregation nor transformation. For these cases, Pandas provides an apply method for flexible, group-wise operations:
In [156]: df
Out[156]:
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]:
A
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
               ...
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64
External functions can be added:
In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:

In [161]: grouped.apply(f)
Out[161]:
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
This article is available at www.flydean.com/11-python-p…