WeChat official account: “Python reading money”. If you have any questions or suggestions, please leave a message.

In day-to-day data analysis, data often has to be split into groups by one or more fields and analyzed group by group. In e-commerce, for example, national total sales are broken down by province to analyze how each province's sales change. In social products, users are segmented by profile (gender, age) to study usage and preferences. In pandas, this kind of grouping is handled by groupby. This article introduces the basic principles of groupby and the corresponding agg, transform, and apply operations.

To make the later diagrams easier to follow, a small simulated sample data set is used. The code and the data are as follows:

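import numpy as np
import pandas as pd

company = ["A", "B", "C"]

# Values are drawn at random, so they differ on every run
data = pd.DataFrame({
    "company": [company[x] for x in np.random.randint(0, len(company), 10)],
    "salary": np.random.randint(5, 50, 10),
    "age": np.random.randint(15, 50, 10)})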
  company  salary  age
0       C      43   35
1       C      17   25
2       C       8   30
3       A      20   22
4       B      10   17
5       B      21   40
6       A      23   33
7       C      49   19
8       B       8   30
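
Because the sample is drawn at random, rerunning the generating code will not reproduce the rows printed above. As a convenience, the following sketch simply types those nine printed rows in literally, so that the later outputs can be checked by hand. It is a copy of the printed table, not a new data set.

```python
import pandas as pd

# Literal copy of the nine sample rows printed in the article.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

print(data)
```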

I. Basic principles of groupby

In pandas, grouping takes a single line of code. Here the data set above is grouped by the company field:

In [5]: group = data.groupby("company")

Typing the code above into IPython gives a DataFrameGroupBy object:

In [6]: group
Out[6]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B7E2650240>

So what is this DataFrameGroupBy object, and what happens to data after groupby? IPython only prints the object's memory address, which is not very informative. To see what is inside group, let's convert it to a list:

In [8]: list(group)
Out[8]: [('A',   company  salary  age
  3       A      20   22
  6       A      23   33), 
 ('B',   company  salary  age
  4       B      10   17
  5       B      21   40
  8       B       8   30), 
 ('C',   company  salary  age
  0       C      43   35
  1       C      17   25
  2       C       8   30
  7       C      49   19)]

The list consists of three tuples. In each tuple, the first element is the group key (the grouping is by company, so the keys are A, B, and C), and the second element is the DataFrame of the corresponding group. The whole process can be illustrated as follows:

To sum up, groupby splits the original DataFrame into several group DataFrames according to the groupby field (company in this case), and there are as many group DataFrames as there are groups. Therefore, all operations after groupby (agg, apply, and so on) operate on these sub-DataFrames. Once this is clear, the main principle of groupby in pandas is understood. Below are some common operations after groupby.
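
The idea that every operation after groupby works on sub-DataFrames can be sketched by doing the split, the per-group computation, and the recombination by hand. This is only an illustration on the nine sample rows; the names manual_means, manual, and builtin are made up for this sketch.

```python
import pandas as pd

# Literal copy of the nine sample rows printed in the article.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# Split: iterating a GroupBy object yields (group key, sub-DataFrame) pairs.
manual_means = {}
for key, sub_df in data.groupby("company"):
    # Apply: compute the statistic on each sub-DataFrame separately.
    manual_means[key] = sub_df["salary"].mean()

# Combine: gather the per-group results into one Series.
manual = pd.Series(manual_means, name="salary")

# The built-in aggregation computes the same numbers in one call.
builtin = data.groupby("company")["salary"].mean()

# A single group can also be pulled out directly by its key.
group_a = data.groupby("company").get_group("A")

print(manual)
print(builtin)
print(group_a)
```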

II. The agg aggregation operation

Aggregation is a very common operation after groupby, and anyone who writes SQL will be familiar with it. An aggregation computes one value per group, such as the sum, mean, maximum, or minimum. The table below lists common aggregation functions in pandas.

function    use
min         minimum
max         maximum
sum         sum
mean        mean
median      median
std         standard deviation
var         variance
count       count
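
As a small sketch of how the functions in the table are used, agg also accepts a list of function names and returns one column per function. The example below runs all eight on the salary column of the nine sample rows; the variable name summary is made up for this sketch.

```python
import pandas as pd

# Literal copy of the nine sample rows printed in the article.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# Passing a list of names applies every function to each group;
# the result has one row per company and one column per function.
summary = data.groupby("company")["salary"].agg(
    ["min", "max", "sum", "mean", "median", "std", "var", "count"]
)

print(summary)
```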

For the sample data set, to find the average age and average salary of employees at each company, use the following code:

In [12]: data.groupby("company").agg('mean')
Out[12]:
         salary    age
company
A         21.50  27.50
B         13.00  29.00
C         29.25  27.25

To compute a different statistic for each column, such as the average age and the median salary of employees at each company, pass a dictionary that maps each column to its aggregation:

In [17]: data.groupby('company').agg({'salary': 'median', 'age': 'mean'})
Out[17]:
         salary    age
company
A          21.5  27.50
B          10.0  29.00
C          30.0  27.25
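
A possible variation on the dictionary form is named aggregation, available in pandas 0.25 and later, which lets the output columns carry descriptive names. The sketch below repeats the same computation on the nine sample rows; the output names median_salary and mean_age are chosen for this example only.

```python
import pandas as pd

# Literal copy of the nine sample rows printed in the article.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# Each keyword is an output column name (hypothetical names here);
# each value is a (source column, aggregation function) pair.
result = data.groupby("company").agg(
    median_salary=("salary", "median"),
    mean_age=("age", "mean"),
)

print(result)
```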

The agg aggregation process can be illustrated as follows (using the second example):

III. transform

What kind of operation is transform, and how does it differ from agg? To make the difference clearer, the comparison below is based on a practical scenario.

With agg above, we learned how to find the average salary of employees at each company. What if we need to add a new column avg_salary to the original data set, holding the average salary of the employee's company (employees of the same company share the same value)? Done step by step, we first compute the average salary of each company and then fill in each row according to the employee's company. Without transform, the code is as follows:

In [21]: avg_salary_dict = data.groupby('company')['salary'].mean().to_dict()

In [22]: data['avg_salary'] = data['company'].map(avg_salary_dict)

In [23]: data
Out[23]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00

If you use transform, all you need is one line of code:

In [24]: data['avg_salary'] = data.groupby('company')['salary'].transform('mean')

In [25]: data
Out[25]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00

The following figure shows how transform works after groupby (to make it more intuitive, the company column is included in the figure; with the code above, only the salary column is actually involved):

The large box in the figure marks the difference between transform and agg. agg computes the mean for companies A, B, and C and returns those three values directly. transform instead produces a result for every row: after the mean is computed within each group, the results are returned in the order of the original index, so rows in the same group share the same value. If this is unclear, compare this figure with the one for agg.
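
The difference in output shape can be checked directly. The sketch below runs agg and transform on the same grouped salary column of the nine sample rows and then uses the transform result for a row-level calculation; the column name salary_vs_avg is made up for this example.

```python
import pandas as pd

# Literal copy of the nine sample rows printed in the article.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

grouped = data.groupby("company")["salary"]

# agg returns one value per group, indexed by the group key.
agg_result = grouped.agg("mean")

# transform returns one value per original row, aligned to the original index.
transform_result = grouped.transform("mean")

print(len(agg_result), len(transform_result))

# Because the result lines up with the original rows, it can be combined
# with other columns directly (salary_vs_avg is a hypothetical column name).
data["salary_vs_avg"] = data["salary"] - transform_result
print(data)
```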

IV. apply

apply is an old friend. It is much more flexible than agg and transform, and accepts arbitrary custom functions for complex data manipulation. A previous article introduced apply in pandas, so how does apply after groupby differ from the apply described there?

There are differences, but the overall principle is basically the same. The difference is that, for apply after groupby, each sub-DataFrame produced by the grouping is passed to the specified function as its argument, so the basic unit of operation is a DataFrame, whereas the basic unit of operation for the apply introduced before is a Series. The following example illustrates the use of apply after groupby.

Suppose we need the data of the oldest employee at each company. How can this be done? The following code does it:

In [38]: def get_oldest_staff(x):
    ...:     df = x.sort_values(by='age', ascending=True)
    ...:     return df.iloc[-1, :]
    ...:

In [39]: oldest_staff = data.groupby('company', as_index=False).apply(get_oldest_staff)

In [40]: oldest_staff
Out[40]:
  company  salary  age  
0       A      23   33       
1       B      21   40       
2       C      43   35      

The result is data on the oldest employees in each company, and the process is illustrated as follows:

As you can see, the principle of apply here is basically the same as described in the previous article, except that the argument passed to the function changes from a Series to the DataFrame of each group.
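
To confirm what the function actually receives, the sketch below records the type of each argument while repeating the oldest-employee example on the nine sample rows. One assumption differs from the article's code: the salary and age columns are selected explicitly before apply, because recent pandas versions changed how the grouping column is passed to the function. The list seen_types is a helper invented for this illustration.

```python
import pandas as pd

# Literal copy of the nine sample rows printed in the article.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

seen_types = []  # hypothetical helper: records what apply passes in


def get_oldest_staff(x):
    # x is the sub-DataFrame of one company, not a single column.
    seen_types.append(type(x).__name__)
    df = x.sort_values(by="age", ascending=True)
    return df.iloc[-1, :]


# Assumption: select the non-grouping columns explicitly so the function
# sees the same columns regardless of the pandas version.
oldest_staff = data.groupby("company")[["salary", "age"]].apply(get_oldest_staff)

print(set(seen_types))
print(oldest_staff)
```

Here the result is indexed by company rather than carrying company as a regular column, which is a side effect of the column selection and not of apply itself.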

Finally, a tip on using apply. Although apply is more flexible, it runs more slowly than agg and transform. So if a groupby problem can be solved with agg or transform, prefer those two methods, and fall back on apply only when they cannot solve it.
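
As one illustration of this tip, the oldest-employee example can also be written without apply. The sketch below is an alternative formulation, not the article's method: it uses the built-in idxmax aggregation to find the row label of the maximum age in each group and then selects those rows from the nine sample rows.

```python
import pandas as pd

# Literal copy of the nine sample rows printed in the article.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# idxmax returns, for each company, the index label of the row
# with the largest age.
oldest_idx = data.groupby("company")["age"].idxmax()

# Select those rows from the original DataFrame.
oldest_staff = data.loc[oldest_idx]

print(oldest_staff)
```

If two employees of one company share the maximum age, idxmax keeps the first one it meets, which may differ from the row picked by the sort-based function.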
