Public account: Youerhuts author: Peter editor: Peter
Hello, I’m Peter
In our life, we often encounter various ranking problems: student performance ranking, sales performance ranking, various competition ranking and so on. In a previous ARTICLE on SQL, “Interview Essentials: SQL Rankings and Window Functions,” I mentioned how to use SQL to implement three main ranking methods: sequential ranking, jump ranking, and dense ranking.
Pandas is a powerful library that allows you to quickly implement a variety of rankings, primarily through the Rank function, which is shown in several examples.
Rank parameters
Here are the main arguments to the rank function:
DataFrame.rank(axis=0,
method='average',
numeric_only=None,
na_option='keep',
ascending=True,
pct=False)
Copy the code
The specific explanation of parameters is as follows:
- Axis: indicates the axis on which the ranking is based. Axis =0 indicates the horizontal axis and axis=1 indicates the vertical axis
- Method: The value can be ‘average’, ‘first’, ‘min’, ‘Max’, ‘dense’; The default value is Average
- Numeric_only: specifies whether to calculate only numeric columns
- Na_optiaon: Whether the NaN value participates in the ranking and how to rank it. The value can be keep, top, or bottom
- Ascending: ascending or descending; The default is ascending
- PCT: Whether to display the ranking as a percentage of the ranking; Percentage of all rankings and maximum rankings
This article will show you how to use the rank function for both Series and DataFrame data types.
The Series number
import pandas as pd
import numpy as np
Copy the code
First let’s simulate a simple piece of data:
Parameter method
Method =”average”;
2, the method = “first”
The values are ranked according to the order in which they appear in the original data, and the same values are ranked by one:
Explain the above two results:
- First: Directly rank the values in order of size
- Average: Indicates that if two values are the same, the ranking is the average of them
We see the use of first as the natural order in which the values appear; In the case of the use of Average explained as follows:
-5 is ranked 1.0, 0 is ranked 2.0, 3 is ranked 3.0, 5(index 3 position) is ranked 4.0, 5(index 6 position) is ranked 5.0, 8(index 0 position) is ranked 6.0, and 8(index 2 position) is ranked 7.0
Through the use of average, the rank average of the same value will be taken out. The rank of 5 is unified into 4.5, and the rank of 8 is unified into 6.5
3. Use of Max and min
For example: method= “Max” : If the values are the same, the highest value is used. For example, the largest ranking of 5 is 5, so the ranking of both 5’s in the original data is 5; Both 8’s are ranked 7 (the two 8’s are ranked 6 and 7, so take the highest value 7)
4, method = “dense”
The same number ranks the same, and the next number does not jump
This time when the ranking is not a jump
Parameters of the ascending
The default is ascending, but you can use descending order: The larger the value, the higher the rank:
If method= “first”, the ranking is 1 and 2. If average is used, the ranking will become 1.5. Other values rank similarly. Take a look at Max:
Parameters of PCT
Whether to display the ranking as a percentage of the ranking; Percentage of all rankings and maximum rankings
How was the above ranking calculated? Our biggest ranking is 7:
For example, the PCT parameter in dense situation is similar:
Parameter na_option
This parameter indicates whether null values participate in the ranking. The value can be keep, top, or bottom. Let’s simulate another data with null values:
Look at three different scenarios:
DataFrame ranking
Simulated data
Let’s start with a simulation:
df0 = pd.DataFrame({"Subject": ["Chinese"."Chinese"."Chinese"."Chinese"."Chinese"."Mathematics"."Mathematics"."Mathematics"."Mathematics"."Mathematics"]."Name": ["Xiao Ming"."Little su"."Chou"."Note"."Wang"."Xiao Ming"."Little su"."Chou"."Note"."Wang"]."Score": [137.125.125.115.115.80.111.130.130.140]})
df = df0.copy() # Generate a copy of df
df
Copy the code
Individual Subject Ranking
For example, we want to see the ranking of Chinese and take out students’ Chinese scores:
Use sequential ranking, jump ranking, and dense ranking to show the ranking:
# Default ranking
df1["Mean ranking _ Default"] = df1["Score"].rank(ascending=False)
df1["Jump rank _min"] = df1["Score"].rank(method="min",ascending=False)
df1["Jumping _max"] = df1["Score"].rank(method="max",ascending=False)
df1["Dense"] = df1["Score"].rank(method="dense",ascending=False)
df1
Copy the code
Total ranking of students
First, generate the total score of each student through the transform:
df["Total"] = df.groupby("Name") ["Score"].transform("sum")
df
Copy the code
We ranked the total scores using an intensive ranking method:
Group retrieves the specified ranking
We now see the second place students in each subject, if they have the same score and the same ranking (no jumps), we use dense ranking:
# Define a second function
def rank_second(x) :
return x[x["Score"].rank(method="dense",ascending=False) = =2]
Copy the code
Let’s look at the real data for the second place students in each subject:
The second function above is a two-step custom;
1. Implement dense ranking first
2. Specify a rank equal to 2
When we use this custom function, we need to first group according to the subject, and then use this custom function in each group separately, to get the second place under each subject.
conclusion
After explaining the use of rank function, can be compared to SQL window functions:
- Row_number: order ranking, method=first in the rank function
- Rank: jump rank (method=min
- Dense_rank: dense rank, method=dense
Rank function (rank function) :
Pandas.pydata.org/docs/refere…