This is the fourth day of my participation in Gwen Challenge
Antecedents to review
- Find xiaoming and Li Xiaohua’s student numbers with different scores (half completed)
- Find columns with different scores (done)
- Find the student number for that column
- Group and sum the scores of students who scored incorrectly (incomplete).
Find the student number for that column
Let’s recall the data structure briefly introduced below. Data of a data structure is a one-dimensional array data for a series and a two-dimensional array data for a Dataframe. So we can use the following operations directly
data_clean.values[:,0]
//data_clean.values[:3.0] // Here I only take the first three lines, because I also made up the single subject data.Copy the code
Group and sum the scores of the students who scored wrong
The data for individual subjects are as follows
filepath_danke = "Single subject data.csv"
data_table_danke = pd.read_csv(filepath_danke, encoding='utf-8')
data_table_danke
Copy the code
The output is as follows:
So how do you do the grouping sum?
First consider the group summation scenario:
According to [gender grouping], the [average] of the national population [life span] was counted. According to [class grouping], the [average] of the [math score] in the group was more than 80 points
It is clear that grouping requires three concepts:
- By what grouping
- What data to get
- What do you do with the data
Here we need to calculate the [total score] of the students according to the [student number group].
So pd gives us a group by function.
Df.groupby (by what group)[what data to get]. What to do with the dataCopy the code
For example
student_sum = data_table_danke.groupby('student id') ['results'].sum(a)Copy the code
Student ID 20160003 548 20160006 489 20160010 221 Name: result, DType: INT64Copy the code
So far we have learned filtering and grouping summation through a simple case study, but for some common data analysis operations, there is still a sorting. So let’s add a requirement.
Additional requirements: the highest and second highest marks for each subject
Logically, you should group, sort, and sum.
But it’s hard to sort the grouped data and pick the first two (at least I don’t yet)
So we can use a property of grouping
Grouping does not affect sorting within a group
So you can sort and group and sum
namely
filepath_danke = "Single subject data.csv"
data_table_danke = pd.read_csv(filepath_danke, encoding='utf-8')
data_table_danke.sort_values('results', ascending = False, inplace=True)
res = data_table_danke.groupby('student id').head(2).groupby('student id') ['results'].sum()
res
Copy the code