WeChat official account: "Python reading money". If you have any questions or suggestions, please leave a message.

Pandas is a Python library for data processing and analysis. There are some tricks and caveats to using it, especially on large data sets; used improperly, pandas can be very slow.

For pandas work, time is of the essence. This article summarizes some of the most effective performance-tuning techniques for pandas.

1. Optimizing data reading

The pd.read_csv() function is the usual way to load data from a CSV file, but pandas can also read Excel, HDF, and pickle files. Which format is fastest to read? Let's do an experiment to compare.

A data set of 590,000 rows was used here, saved in XLSX, CSV, HDF, and PKL formats. Each format was read 10 times, with the following results.

As you can see, for the same data, the PKL format is the fastest to read, nearly 6 times faster than CSV, followed by HDF; XLSX is the slowest (and this is only a data set of about 15 MB).

To save time on every read of a data set (most raw data arrives as CSV), you can have pandas save and reload the data in PKL or HDF format. The code is as follows:

import pandas as pd

# read CSV
df = pd.read_csv('xxx.csv')

# PKL format
df.to_pickle('xxx.pkl')            # save
df = pd.read_pickle('xxx.pkl')     # read

# HDF format
df.to_hdf('xxx.hdf', 'df')         # save
df = pd.read_hdf('xxx.hdf', 'df')  # read
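The read-time comparison above can be reproduced with a small timing harness like the following sketch. It builds a throwaway sample frame (a stand-in for the 590,000-row data set; the file names are placeholders) and times 10 reads per format, as in the experiment:

```python
import timeit

import numpy as np
import pandas as pd

# Small synthetic stand-in for the real data set.
df = pd.DataFrame(np.random.rand(10_000, 4), columns=list('abcd'))
df.to_csv('demo.csv', index=False)
df.to_pickle('demo.pkl')

# Time 10 reads of each format.
csv_time = timeit.timeit(lambda: pd.read_csv('demo.csv'), number=10)
pkl_time = timeit.timeit(lambda: pd.read_pickle('demo.pkl'), number=10)

print(f'CSV:    {csv_time:.3f} s')
print(f'Pickle: {pkl_time:.3f} s')
```

On most machines the pickle read finishes well ahead of the CSV parse, because unpickling restores the binary data directly instead of parsing text.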

2. Optimizing aggregation operations

When using agg and transform, try to use built-in functions rather than custom Python functions to improve efficiency. (The tests below use the same data set as above.)

1. agg + built-in functions

2. agg + non-built-in functions

3. transform + built-in functions

4. transform + non-built-in functions

Using built-in functions makes the transform method about twice as efficient.
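The four combinations above can be sketched as follows on a small made-up frame (the column names and sizes here are illustrative, not the article's data set). Passing the string name of a built-in aggregation lets pandas use its optimized code path, while a lambda falls back to a Python call per group:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 100, 100_000),
                   'val': np.random.rand(100_000)})
g = df.groupby('key')['val']

# agg: built-in (string-named) function vs. a Python lambda
t_agg_builtin = timeit.timeit(lambda: g.agg('sum'), number=20)
t_agg_custom = timeit.timeit(lambda: g.agg(lambda x: x.sum()), number=20)

# transform shows the same pattern
t_tr_builtin = timeit.timeit(lambda: g.transform('mean'), number=20)
t_tr_custom = timeit.timeit(lambda: g.transform(lambda x: x.mean()), number=20)

print(t_agg_builtin, t_agg_custom, t_tr_builtin, t_tr_custom)
```

Both variants return the same numbers; only the speed differs.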

3. Optimizing row-by-row operations

Suppose we now have such a power consumption data set and the electricity price for the corresponding period, as shown below:

The data set records power consumption per hour; for example, the first row means 0.586 kWh was consumed at 00:00 on January 13, 2001. The electricity rate differs by time of day. Our goal is to compute the total electricity cost, so we need to multiply the consumption in each period by the corresponding unit price. Here are three ways to write this; let's test each and compare their efficiency.

# Write a function that yields the corresponding result
def get_cost(kwh, hour):
    if 0 <= hour < 7:
        rate = 0.6
    elif 7 <= hour < 17:
        rate = 0.68
    elif 17 <= hour < 24:
        rate = 0.75
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

# Method 1: Simple loop
def loop(df):
    cost_list = []
    for i in range(len(df)):
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = get_cost(energy_used, hour)
        cost_list.append(energy_cost)
    df['cost'] = cost_list

# Method 2: apply
def apply_method(df):
     df['cost'] = df.apply(
         lambda row: get_cost(
             kwh=row['energy_kwh'],
             hour=row['date_time'].hour),
         axis=1)

# Method 3: use isin() to select each time band and process it in segments
df.set_index('date_time', inplace=True)
def isin_method(df):
    peak_hours = df.index.hour.isin(range(17, 24))
    simple_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    df.loc[peak_hours, 'cost'] = df.loc[peak_hours, 'energy_kwh'] * 0.75
    df.loc[simple_hours, 'cost'] = df.loc[simple_hours, 'energy_kwh'] * 0.68
    df.loc[off_peak_hours, 'cost'] = df.loc[off_peak_hours, 'energy_kwh'] * 0.6

Test results:

As you can see, using isin() to select the rows for each band and computing them separately is nearly 606 times faster than the simple loop. This is not because isin() itself is special; method 3 is fast because it uses vectorized processing (isin() is just one way to do this; there are others you can try). That's the point. What does that mean?

I've drawn a small diagram here; reading it alongside the code should make clear which is faster: processing rows one by one, or splitting the data into groups and computing each batch at once.
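As an illustration of "there are others you can try": the same segmented pricing can be expressed with numpy.select, which picks one rate per row from the time-band conditions in a single vectorized pass. This variant is not from the original article; the synthetic data below is just a stand-in for the electricity data set:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the electricity data set.
rng = pd.date_range('2001-01-13', periods=48, freq='h')
df = pd.DataFrame({'energy_kwh': np.random.rand(len(rng))}, index=rng)

hour = df.index.hour
# Conditions are checked in order, so each row gets exactly one rate.
df['cost'] = df['energy_kwh'] * np.select(
    [hour < 7, hour < 17, hour < 24],
    [0.6, 0.68, 0.75])
```

Like the isin() version, this touches every row in one batch instead of calling a Python function per row.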

4. Using numba to accelerate

If your data processing involves a lot of numerical computation, numba can make your code run much faster, and it is very simple to use, as shown below. (The code here is not a practical task; it just demonstrates the effect.)

The numba module needs to be installed first:

pip install numba

Let's test numba's speed-up effect with a simple example.

import numba

@numba.vectorize
def f_with_numba(x): 
    return x * 2

def f_without_numba(x): 
    return x * 2

# Method 1: apply row by row
df["double_energy"] = df.energy_kwh.apply(f_without_numba)

# Method 2: Vectorized operation
df["double_energy"] = df.energy_kwh*2

# Method 3: numba speed-up
# the input must be passed as a numpy array, otherwise an error is raised
df["double_energy"] = f_with_numba(df.energy_kwh.to_numpy())

The test results again highlight the advantage of vectorization, and numba more than doubles the efficiency of the already fast vectorized operation. For more details on how to use numba, see the numba documentation.

References:

1. pandas.pydata.org/pandasdocs/…

2. realpython.com/fast-flexib…

3. www.cnblogs.com/wkang/p/979…

Writing original content is not easy. If you found this at all useful, I hope you'll give it a like. Thanks!

Scan the QR code to follow the official account "Python reading money" to get new material first, and to join a Python learning and exchange group!