If you want to pull the information of the top 10 students by total score, you might sort by the total and then take head(10). But what if more than 10 students tie for those top 10 spots?
Today, we'll take a look at some of the convenience functions pandas provides to handle exactly these situations.
1. Find the largest or smallest N rows of data
In data processing, we often need to find the largest or smallest N rows of a dataset. Normally we might reach for df.sort_values(columns, ascending=False).head(n), but rows that tie with the last-place value often get truncated out of the result. So today, let's try the following methods instead.
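A minimal sketch of the tie problem, using made-up student scores (Ann, Bob, Cat, and Dan are hypothetical):
>>> import pandas as pd
>>> scores = pd.Series([98, 95, 95, 90], index=['Ann', 'Bob', 'Cat', 'Dan'])
# sort_values(ascending=False).head(2) would silently drop one of the two students tied at 95
>>> scores.nlargest(2, keep='all')  # keep='all' keeps everyone tied at the cutoff
Ann    98
Bob    95
Cat    95
dtype: int64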
Below, we use finding the largest N rows as the example:
DataFrame.nlargest(n, columns, keep='first')
Series.nlargest(n=5, keep='first')
The keep argument accepts 'first' (the default), 'last', or 'all'.
Let's construct some sample data first:
>>> import pandas as pd
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, 434000, 434000, 337000, 11300, 11300, 11300],
...                    'GDP': [1937894, 2583560, 12011, 4520, 12128, 17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN", "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta", "Maldives", "Brunei", "Iceland", "Nauru", "Tuvalu", "Anguilla"])
>>> df
population GDP alpha-2
Italy 59000000 1937894 IT
France 65000000 2583560 FR
Malta 434000 12011 MT
Maldives 434000 4520 MV
Brunei 434000 12128 BN
Iceland 337000 17036 IS
Nauru 11300 182 NR
Tuvalu 11300 38 TV
Anguilla 11300 311 AI
For the data above, suppose we want the three rows with the largest population. The third-largest value, 434000, is shared by three countries, so if we use head(3) we actually miss two rows that qualify; df.nlargest(3, 'population', keep='all') gets us exactly what we need.
>>> df.head(3)
population GDP alpha-2
Italy 59000000 1937894 IT
France 65000000 2583560 FR
Malta 434000 12011 MT
>>> df.nlargest(3, 'population')
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Malta 434000 12011 MT
# keep='all' returns every row tied with the last place in the ranking
>>> df.nlargest(3, 'population', keep='all')
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Malta 434000 12011 MT
Maldives 434000 4520 MV
Brunei 434000 12128 BN
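For completeness, keep='last' resolves the tie in favor of the last occurrence instead; a quick sketch on the same data:
>>> df.nlargest(3, 'population', keep='last')
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Brunei 434000 12128 BN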
Of course, requirements can be more complex, such as taking the largest N rows according to multiple fields. Here we take the three rows with the largest population, using GDP to break ties:
>>> df.nlargest(3, ['population', 'GDP'])
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Brunei 434000 12128 BN
For the smallest N rows, the functions are as follows (the parameters have the same meanings):
DataFrame.nsmallest(n, columns, keep='first')
Series.nsmallest(n=5, keep='first')
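As a quick sketch on the same country data, asking for the two smallest populations with keep='all' returns all three countries tied at 11300:
>>> df.nsmallest(2, 'population', keep='all')
population GDP alpha-2
Nauru 11300 182 NR
Tuvalu 11300 38 TV
Anguilla 11300 311 AI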
2. Find the percentage change between the current element and the previous element
Sometimes our data is a time series, and to see how a row or column changes over time we can use the pct_change method to get the rate of change directly.
pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
Let’s start with a Series:
>>> s = pd.Series([90, 91, 85])
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
# With periods=2, each element is compared two steps back, here 85 against 90
>>> s.pct_change(periods=2)
0 NaN
1 NaN
2 -0.055556
dtype: float64
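Under the hood, pct_change essentially divides by the values shifted periods steps back and subtracts 1. A minimal sketch of the equivalence on the same Series (no missing values involved):
>>> s / s.shift(1) - 1
0         NaN
1    0.011111
2   -0.065934
dtype: float64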
If the data contains missing values, we can either fill them in before the calculation or set the fill_method parameter when computing the percentage change:
>>> s = pd.Series([90, 91, None, 85])
>>> s
0 90.0
1 91.0
2 NaN
3 85.0
dtype: float64
>>> s.pct_change(fill_method='bfill')
0 NaN
1 0.011111
2 -0.065934
3 0.000000
dtype: float64
>>> s.pct_change(fill_method='ffill')
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
# The default fill_method='pad' behaves like 'ffill'
>>> s.pct_change()
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
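Equivalently, fill the gap yourself and then compute; this sketch reproduces the fill_method='bfill' result above:
>>> s.bfill().pct_change()
0         NaN
1    0.011111
2   -0.065934
3    0.000000
dtype: float64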
It can also process DataFrame data directly:
>>> df = pd.DataFrame({
...     'FR': [4.0405, 4.0963, 4.3149],
...     'GR': [1.7246, 1.7482, 1.8519],
...     'IT': [804.74, 810.01, 860.13]},
...     index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
                FR      GR      IT
1980-01-01  4.0405  1.7246  804.74
1980-02-01  4.0963  1.7482  810.01
1980-03-01  4.3149  1.8519  860.13
>>> df.pct_change()
                  FR        GR        IT
1980-01-01       NaN       NaN       NaN
1980-02-01  0.013810  0.013684  0.006549
1980-03-01  0.053365  0.059318  0.061876
>>> df = pd.DataFrame({
...     '2016': [1769950, 30586265],
...     '2015': [1500923, 40912316],
...     '2014': [1371819, 41403351]},
...     index=['GOOG', 'APPL'])
>>> df
2016 2015 2014
GOOG 1769950 1500923 1371819
APPL 30586265 40912316 41403351
# axis=1 (or axis='columns') computes the change across columns instead of down rows
>>> df.pct_change(axis=1)
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
>>> df.pct_change(axis='columns')
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
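One caveat: the columns above run newest to oldest, so each percentage measures a year against the following year. Reordering the columns chronologically first gives conventional year-over-year growth; a sketch on the same frame:
>>> df[['2014', '2015', '2016']].pct_change(axis='columns')
      2014      2015      2016
GOOG   NaN  0.094112  0.179241
APPL   NaN -0.011860 -0.252395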
3. Convert each element in a list to its own row
Sometimes an element in our raw data is itself a list that we need to expand into separate rows; this is where the explode method comes in.
Series.explode(ignore_index=False)
DataFrame.explode(column, ignore_index=False)
Let’s start with a Series:
>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
# By default, the original index is repeated for each exploded element
>>> s.explode()
0 1
0 2
0 3
1 foo
2 NaN
3 3
3 4
dtype: object
# Set parameter ignore_index=True to reset the index
>>> s.explode(ignore_index=True)
0 1
1 2
2 3
3 foo
4 NaN
5 3
6 4
dtype: object
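A common companion idiom (a sketch with made-up data): split delimited strings into lists with str.split, then explode them into rows:
>>> tags = pd.Series(['a,b,c', 'd,e'])
>>> tags.str.split(',').explode()
0    a
0    b
0    c
1    d
1    e
dtype: object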
Now let's see how it behaves on DataFrame data:
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
A B
0 [1, 2, 3] 1
1 foo 1
2 [] 1
3 [3, 4] 1
# By default, the original index is repeated for each exploded row
>>> df.explode('A')
A B
0 1 1
0 2 1
0 3 1
1 foo 1
2 NaN 1
3 3 1
3 4 1
# Set parameter ignore_index=True to reset the index
>>> df.explode('A', ignore_index=True)
A B
0 1 1
1 2 1
2 3 1
3 foo 1
4 NaN 1
5 3 1
6 4 1
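Since pandas 1.3, DataFrame.explode also accepts a list of columns, provided the lists in each row have matching lengths. A minimal sketch (df2 is hypothetical):
>>> df2 = pd.DataFrame({'A': [[1, 2], [3, 4]], 'B': [['a', 'b'], ['c', 'd']]})
>>> df2.explode(['A', 'B'])
   A  B
0  1  a
0  2  b
1  3  c
1  4  d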