Compared with SQL, learn the groupby mechanism in Pandas

Public account: You and the cabin by: Peter Editor: Peter

Everybody, I’m Peter

I wish you a happy National Day 🌹

In SQL or MySQL database query operations, we often use the keyword group by to indicate the group by a certain field, and then perform the subsequent aggregation statistics operation.

A similar function can be implemented in Pandas using the keyword groupby.

The aggregation function -aggregation function

In Pandas and SQL databases, the following operations are used to aggregate statistics:

Peace: the sum
Maximum value: Max
Minimum value: min
Average: avg
Number of statistics: count

SQL multi-table query

We now have two tables: Student and Score, which are joined by s_id

SQL > select * from ‘SQL’; select * from ‘SQL’;

The name
gender
Course number
results

select 
	s1.s_name  Name -
	,s1.s_sex  - gender
	,s2.c_id   -- Course No.
	,s2.s_score  -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id;
Copy the code

We will use the above results as temporary table T before we proceed with the subsequent query operations.

Pymysql gets the data

To use Pandas for data analysis, we connect to the Pandas database, read the data, and generate the corresponding DataFrame:

1. Connect to the database

import pandas as pd
import numpy as np
import pymysql  Connect to database

con1 = pymysql.connect(
    host="127.0.0.1".# database IP
    port=3306.# port
    user="root".# username
    password="password".# your password
    charset="utf8".# character encoding
    db="test"  # database name
)

cur1 = con1.cursor()  Create a cursor

SQL statement to be executed
sql1 = Select * from Student s1 join Score s2 on s1.s_id select * from Student s1 join Score s2 on s1.s_id select * from Student s1 join Score s2 on s1.s_id  = s2.s_id; "" "

Execute SQL statement in cursor
cur1.execute(sql1)
Copy the code

It happens to be 18, which is the result of our temporary table T above, which is also 18

2. Generate data

data1 = []
for i in cur1.fetchall():
    data1.append(i)
    
df = pd.DataFrame(data1,columns=["Name"."Gender"."Course id"."Score"])
df
Copy the code

This df data is manipulated in the subsequent operations in PANDAS.

Single aggregate function

We want to know the total score of each student

SQL implementation:

select
	t.s_name
	,sum(t.s_score)
from(
  select 
    s1.s_name  Name -
    ,s1.s_sex  - gender
    ,s2.c_id   -- Course No.
    ,s2.s_score  -- Grades
	from Student s1
	join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code

Pandas implementation:

There is another way to write Pandas:

2. Calculate the avG of each student

SQL implementation:

select
	t.s_name
	,avg(t.s_score)   Let's call it a mean function
from(
  select 
    s1.s_name  Name -
    ,s1.s_sex  - gender
    ,s2.c_id   -- Course No.
    ,s2.s_score  -- Grades
	from Student s1
	join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code

There are two different ways to write Pandas:

3, calculate each student’s score maximum (minimum) that subject

SQL implementation:

select
	t.s_name
	,max(t.s_score)   Let's change that to the maximum function
from(
  select 
    s1.s_name  Name -
    ,s1.s_sex  - gender
    ,s2.c_id   -- Course No.
    ,s2.s_score  -- Grades
	from Student s1
	join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code

Pandas

4, statistics: find how many subjects each student participated in the exam

SQL implementation:

select
	t.s_name
	,count(t.c_id)   -- Change to the count function with course c_id as the field
from(
  select 
    s1.s_name  Name -
    ,s1.s_sex  - gender
    ,s2.c_id   -- Course No.
    ,s2.s_score  -- Grades
	from Student s1
	join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code

Pandas

Multiple aggregate functions

When querying data, we can use multiple aggregation functions at the same time. For example, we want to view the total score, average score, number of students and maximum score of students of different genders

SQL implementation process:

select
	t.s_sex
	,sum(t.s_score)  Total grade -
	,avg(t.s_score)  -- Average grade
	,count(t.s_name)  -- Statistics by gender
	,max(t.s_score)  -- Maximum score
from(
  select 
    s1.s_name  Name -
    ,s1.s_sex  - gender
    ,s2.c_id   -- Course No.
    ,s2.s_score  -- Grades
	from Student s1
	join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_sex;   -- By gender
Copy the code

Pandas

Changing field Aliases

In the table below, the names of the fields used in Pandas and SQL are the default. We can modify the names to be more descriptive:

Single aggregate function

Or the total score of each student as an example

SQL implementation: alias through AS, as can be omitted

select
	t.s_name  as"Name"-- As
	,sum(t.s_score) as"Overall score"-- As
from(
  select 
    s1.s_name  Name -
    ,s1.s_sex  - gender
    ,s2.c_id   -- Course No.
    ,s2.s_score  -- Grades
	from Student s1
	join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code

Pandas

Method 1: The rename function is used to rename an alias in pandas

Method 2: Modify the columns attribute directly

Multiple aggregate functions

SQL > select alias from as;

select
	t.s_sex  as"Gender",sum(t.s_score)  as"Overall score"-- As
	,avg(t.s_score)  as"Average grade",max(t.s_score)  as"Top score",count(t.s_name)  as"Total head count"from(
  select 
    s1.s_name  Name -
    ,s1.s_sex  - gender
    ,s2.c_id   -- Course No.
    ,s2.s_score  -- Grades
	from Student s1
	join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_sex;   -- By gender
Copy the code

2, Pandas

df2 = df.groupby("Gender").agg(total result = pd.namedagg (column="Score", aggfunc="sum"), average score = pd.namedagg (column="Score", aggfunc="mean"), the highest score = pd.namedagg (column="Score", aggfunc="max"), total number of people = pd.namedagg (column="Course id", aggfunc="count")
)
df2
Copy the code

Classical figure

Finally, a classic graphic. In this graph is recorded how we:

Select the grouping field Month
The same field (duration) performs multiple aggregation operations: Max, min, sum
Alias (rename) the results: max_duration, min_duration, total_duration

Compared with SQL, learn the groupby mechanism in Pandas

The aggregation function -aggregation function

SQL multi-table query

Pymysql gets the data

Single aggregate function

Multiple aggregate functions

Changing field Aliases

Single aggregate function

Multiple aggregate functions

Classical figure

Related Posts

Fundamentals of machine learning – Univariate linear regression

SQL advanced – sort and window functions

MAC Python PIP download speed is slow