Public account: You and the cabin by: Peter Editor: Peter
Everybody, I’m Peter
I wish you a happy National Day 🌹
In SQL or MySQL database query operations, we often use the keyword group by to indicate the group by a certain field, and then perform the subsequent aggregation statistics operation.
A similar function can be implemented in Pandas using the keyword groupby.
The aggregation function -aggregation function
In Pandas and SQL databases, the following operations are used to aggregate statistics:
- Peace: the sum
- Maximum value: Max
- Minimum value: min
- Average: avg
- Number of statistics: count
SQL multi-table query
We now have two tables: Student and Score, which are joined by s_id
SQL > select * from ‘SQL’; select * from ‘SQL’;
- The name
- gender
- Course number
- results
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id;
Copy the code
We will use the above results as temporary table T before we proceed with the subsequent query operations.
Pymysql gets the data
To use Pandas for data analysis, we connect to the Pandas database, read the data, and generate the corresponding DataFrame:
1. Connect to the database
import pandas as pd
import numpy as np
import pymysql Connect to database
con1 = pymysql.connect(
host="127.0.0.1".# database IP
port=3306.# port
user="root".# username
password="password".# your password
charset="utf8".# character encoding
db="test" # database name
)
cur1 = con1.cursor() Create a cursor
SQL statement to be executed
sql1 = Select * from Student s1 join Score s2 on s1.s_id select * from Student s1 join Score s2 on s1.s_id select * from Student s1 join Score s2 on s1.s_id = s2.s_id; "" "
Execute SQL statement in cursor
cur1.execute(sql1)
Copy the code
It happens to be 18, which is the result of our temporary table T above, which is also 18
2. Generate data
data1 = []
for i in cur1.fetchall():
data1.append(i)
df = pd.DataFrame(data1,columns=["Name"."Gender"."Course id"."Score"])
df
Copy the code
This df data is manipulated in the subsequent operations in PANDAS.
Single aggregate function
We want to know the total score of each student
SQL implementation:
select
t.s_name
,sum(t.s_score)
from(
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code
Pandas implementation:
There is another way to write Pandas:
2. Calculate the avG of each student
SQL implementation:
select
t.s_name
,avg(t.s_score) Let's call it a mean function
from(
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code
There are two different ways to write Pandas:
3, calculate each student’s score maximum (minimum) that subject
SQL implementation:
select
t.s_name
,max(t.s_score) Let's change that to the maximum function
from(
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code
Pandas
4, statistics: find how many subjects each student participated in the exam
SQL implementation:
select
t.s_name
,count(t.c_id) -- Change to the count function with course c_id as the field
from(
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code
Pandas
Multiple aggregate functions
When querying data, we can use multiple aggregation functions at the same time. For example, we want to view the total score, average score, number of students and maximum score of students of different genders
SQL implementation process:
select
t.s_sex
,sum(t.s_score) Total grade -
,avg(t.s_score) -- Average grade
,count(t.s_name) -- Statistics by gender
,max(t.s_score) -- Maximum score
from(
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_sex; -- By gender
Copy the code
Pandas
Changing field Aliases
In the table below, the names of the fields used in Pandas and SQL are the default. We can modify the names to be more descriptive:
Single aggregate function
Or the total score of each student as an example
SQL implementation: alias through AS, as can be omitted
select
t.s_name as"Name"-- As
,sum(t.s_score) as"Overall score"-- As
from(
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_name;
Copy the code
Pandas
Method 1: The rename function is used to rename an alias in pandas
Method 2: Modify the columns attribute directly
Multiple aggregate functions
SQL > select alias from as;
select
t.s_sex as"Gender",sum(t.s_score) as"Overall score"-- As
,avg(t.s_score) as"Average grade",max(t.s_score) as"Top score",count(t.s_name) as"Total head count"from(
select
s1.s_name Name -
,s1.s_sex - gender
,s2.c_id -- Course No.
,s2.s_score -- Grades
from Student s1
join Score s2 on s1.s_id = s2.s_id
)t
group by t.s_sex; -- By gender
Copy the code
2, Pandas
df2 = df.groupby("Gender").agg(total result = pd.namedagg (column="Score", aggfunc="sum"), average score = pd.namedagg (column="Score", aggfunc="mean"), the highest score = pd.namedagg (column="Score", aggfunc="max"), total number of people = pd.namedagg (column="Course id", aggfunc="count")
)
df2
Copy the code
Classical figure
Finally, a classic graphic. In this graph is recorded how we:
- Select the grouping field Month
- The same field (duration) performs multiple aggregation operations: Max, min, sum
- Alias (rename) the results: max_duration, min_duration, total_duration