If you want to pull the information of the top 10 students by total score, you might sort by the total and then take head(10). But what if more than 10 students tie for those top 10 spots?
Today, we'll take a look at some of the convenience functions pandas provides to handle exactly these situations.
1. Find the largest or smallest N rows of data
In data processing, we often need to find the largest or smallest N rows of a dataset. Normally we might reach for df.sort_values(columns, ascending=False).head(n), but rows that tie with the last-place value often get truncated out of the result. So today, let's try the following methods instead.
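A minimal sketch of the tie problem, using made-up student scores (Ann, Bob, Cat, and Dan are hypothetical):
>>> import pandas as pd
>>> scores = pd.Series([98, 95, 95, 90], index=['Ann', 'Bob', 'Cat', 'Dan'])
# sort_values(ascending=False).head(2) would silently drop one of the two students tied at 95
>>> scores.nlargest(2, keep='all')  # keep='all' keeps everyone tied at the cutoff
Ann    98
Bob    95
Cat    95
dtype: int64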
Below, we use finding the largest N rows as the example:
DataFrame.nlargest(n, columns, keep='first')
Series.nlargest(n=5, keep='first')
The keep argument accepts 'first' (the default), 'last', or 'all'.
Let's construct some sample data first:
>>> import pandas as pd
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, 434000, 434000, 337000, 11300, 11300, 11300],
...                    'GDP': [1937894, 2583560, 12011, 4520, 12128, 17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN", "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta", "Maldives", "Brunei", "Iceland", "Nauru", "Tuvalu", "Anguilla"])
>>> df
population GDP alpha-2
Italy 59000000 1937894 IT
France 65000000 2583560 FR
Malta 434000 12011 MT
Maldives 434000 4520 MV
Brunei 434000 12128 BN
Iceland 337000 17036 IS
Nauru 11300 182 NR
Tuvalu 11300 38 TV
Anguilla 11300 311 AI
For the data above, suppose we want the three rows with the largest population. The third-largest value, 434000, is shared by three countries, so if we use head(3) we actually miss two rows that qualify; df.nlargest(3, 'population', keep='all') gets us exactly what we need.
>>> df.head(3)
population GDP alpha-2
Italy 59000000 1937894 IT
France 65000000 2583560 FR
Malta 434000 12011 MT
>>> df.nlargest(3, 'population')
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Malta 434000 12011 MT
# keep='all' returns every row tied with the last place in the ranking
>>> df.nlargest(3, 'population', keep='all')
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Malta 434000 12011 MT
Maldives 434000 4520 MV
Brunei 434000 12128 BN
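For completeness, keep='last' resolves the tie in favor of the last occurrence instead; a quick sketch on the same data:
>>> df.nlargest(3, 'population', keep='last')
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Brunei 434000 12128 BN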
Of course, requirements can be more complex, such as taking the largest N rows according to multiple fields. Here we take the three rows with the largest population, using GDP to break ties:
>>> df.nlargest(3, ['population', 'GDP'])
population GDP alpha-2
France 65000000 2583560 FR
Italy 59000000 1937894 IT
Brunei 434000 12128 BN
For the smallest N rows, the functions are as follows (the parameters have the same meanings):
DataFrame.nsmallest(n, columns, keep='first')
Series.nsmallest(n=5, keep='first')
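As a quick sketch on the same country data, asking for the two smallest populations with keep='all' returns all three countries tied at 11300:
>>> df.nsmallest(2, 'population', keep='all')
population GDP alpha-2
Nauru 11300 182 NR
Tuvalu 11300 38 TV
Anguilla 11300 311 AI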
2. Find the percentage change between the current element and the previous element
Sometimes our data is a time series, and to see how a row or column changes over time we can use the pct_change method to get the rate of change directly.
pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
Let’s start with a Series:
>>> s = pd.Series([90, 91, 85])
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
# With periods=2, each element is compared two steps back, here 85 against 90
>>> s.pct_change(periods=2)
0 NaN
1 NaN
2 -0.055556
dtype: float64
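Under the hood, pct_change essentially divides by the values shifted periods steps back and subtracts 1. A minimal sketch of the equivalence on the same Series (no missing values involved):
>>> s / s.shift(1) - 1
0         NaN
1    0.011111
2   -0.065934
dtype: float64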
If the data contains missing values, we can either fill them in before the calculation or set the fill_method parameter when computing the percentage change:
>>> s = pd.Series([90, 91, None, 85])
>>> s
0 90.0
1 91.0
2 NaN
3 85.0
dtype: float64
>>> s.pct_change(fill_method='bfill')
0 NaN
1 0.011111
2 -0.065934
3 0.000000
dtype: float64
>>> s.pct_change(fill_method='ffill')
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
# The default fill_method='pad' behaves like 'ffill'
>>> s.pct_change()
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
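Equivalently, fill the gap yourself and then compute; this sketch reproduces the fill_method='bfill' result above:
>>> s.bfill().pct_change()
0         NaN
1    0.011111
2   -0.065934
3    0.000000
dtype: float64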
It can also process DataFrame data directly:
>>> df = pd.DataFrame({
...     'FR': [4.0405, 4.0963, 4.3149],
...     'GR': [1.7246, 1.7482, 1.8519],
...     'IT': [804.74, 810.01, 860.13]},
...     index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
                FR      GR      IT
1980-01-01  4.0405  1.7246  804.74
1980-02-01  4.0963  1.7482  810.01
1980-03-01  4.3149  1.8519  860.13
>>> df.pct_change()
                  FR        GR        IT
1980-01-01       NaN       NaN       NaN
1980-02-01  0.013810  0.013684  0.006549
1980-03-01  0.053365  0.059318  0.061876
>>> df = pd.DataFrame({
...     '2016': [1769950, 30586265],
...     '2015': [1500923, 40912316],
...     '2014': [1371819, 41403351]},
...     index=['GOOG', 'APPL'])
>>> df
2016 2015 2014
GOOG 1769950 1500923 1371819
APPL 30586265 40912316 41403351
# axis=1 (or axis='columns') computes the change across columns instead of down rows
>>> df.pct_change(axis=1)
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
>>> df.pct_change(axis='columns')
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
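One caveat: the columns above run newest to oldest, so each percentage measures a year against the following year. Reordering the columns chronologically first gives conventional year-over-year growth; a sketch on the same frame:
>>> df[['2014', '2015', '2016']].pct_change(axis='columns')
      2014      2015      2016
GOOG   NaN  0.094112  0.179241
APPL   NaN -0.011860 -0.252395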
3. Convert each element in a list to its own row
Sometimes an element in our raw data is itself a list that we need to expand into separate rows; this is where the explode method comes in.
Series.explode(ignore_index=False)
DataFrame.explode(column, ignore_index=False)
Let’s start with a Series:
>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
# By default, the original index is repeated for each exploded element
>>> s.explode()
0 1
0 2
0 3
1 foo
2 NaN
3 3
3 4
dtype: object
# Set parameter ignore_index=True to reset the index
>>> s.explode(ignore_index=True)
0 1
1 2
2 3
3 foo
4 NaN
5 3
6 4
dtype: object
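A common companion idiom (a sketch with made-up data): split delimited strings into lists with str.split, then explode them into rows:
>>> tags = pd.Series(['a,b,c', 'd,e'])
>>> tags.str.split(',').explode()
0    a
0    b
0    c
1    d
1    e
dtype: object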
Now let's see how it behaves on DataFrame data:
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
A B
0 [1, 2, 3] 1
1 foo 1
2 [] 1
3 [3, 4] 1
# By default, the original index is repeated for each exploded row
>>> df.explode('A')
A B
0 1 1
0 2 1
0 3 1
1 foo 1
2 NaN 1
3 3 1
3 4 1
# Set parameter ignore_index=True to reset the index
>>> df.explode('A', ignore_index=True)
A B
0 1 1
1 2 1
2 3 1
3 foo 1
4 NaN 1
5 3 1
6 4 1
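Since pandas 1.3, DataFrame.explode also accepts a list of columns, provided the lists in each row have matching lengths. A minimal sketch (df2 is hypothetical):
>>> df2 = pd.DataFrame({'A': [[1, 2], [3, 4]], 'B': [['a', 'b'], ['c', 'd']]})
>>> df2.explode(['A', 'B'])
   A  B
0  1  a
0  2  b
1  3  c
1  4  d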