Pandas is a Python library for data processing and analysis. There are some tricks and caveats to using it, especially on large data sets: used improperly, pandas can be very slow. Since time is of the essence, this article summarizes some of the most effective performance tuning methods for pandas.
1. Optimization of data reading
pandas can read data from several file formats, such as CSV, XLSX, HDF, and PKL. Which is fastest? Let's run an experiment to compare them.
A data set of 590,000 rows was used here, saved in XLSX, CSV, HDF, and PKL formats. Each format was read 10 times, with the following results.
For the same data, PKL is the fastest format to read, nearly 6 times faster than CSV, followed by HDF; XLSX is by far the slowest (and this is only a data set of about 15 MB).
To save time on every read (most data sets come as CSV), you can have pandas convert the data to PKL or HDF format once and read that from then on. The code is as follows:
```python
import pandas as pd

# read CSV
df = pd.read_csv('xxx.csv')

# PKL format
df.to_pickle('xxx.pkl')         # save
df = pd.read_pickle('xxx.pkl')  # read

# HDF format
df.to_hdf('xxx.hdf', 'df')         # save
df = pd.read_hdf('xxx.hdf', 'df')  # read
```
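If you want to reproduce the comparison on your own machine, here is a minimal timing sketch. The file names `tmp.csv` / `tmp.pkl` and the small synthetic DataFrame are placeholders, not the article's 590,000-row data set:

```python
import time
import pandas as pd

# small synthetic stand-in for the article's 590,000-row data set
df = pd.DataFrame({'a': range(10_000), 'b': [x * 0.5 for x in range(10_000)]})

def timed_read(save, load):
    save()                        # write the file once
    t0 = time.perf_counter()
    out = load()                  # time only the read
    return time.perf_counter() - t0, out

csv_time, csv_df = timed_read(lambda: df.to_csv('tmp.csv', index=False),
                              lambda: pd.read_csv('tmp.csv'))
pkl_time, pkl_df = timed_read(lambda: df.to_pickle('tmp.pkl'),
                              lambda: pd.read_pickle('tmp.pkl'))

print(f'CSV read: {csv_time:.4f}s, PKL read: {pkl_time:.4f}s')
```

On a larger data set the gap widens further, because pickle skips the text parsing and type inference that `read_csv` must do.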
2. Optimization of aggregation operations
When using agg and transform, try to use Python built-in functions to improve efficiency. (The data from the test case above is used.)
1. agg + Python built-in function
2. agg + non-built-in function
You can see a roughly 60% increase in efficiency with the agg method when using built-in functions.
3. transform + Python built-in function
4. transform + non-built-in function
The transform method is about twice as efficient when using built-in functions.
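As an illustration of the difference (on synthetic data, not the article's test set): passing a built-in such as `sum` to agg lets pandas dispatch to its optimized internal reduction, while an equivalent anonymous function falls back to a slow per-group Python call.

```python
import numpy as np
import pandas as pd

# toy grouped data standing in for the article's test set
df = pd.DataFrame({'group': np.repeat(['a', 'b'], 5_000),
                   'value': np.random.rand(10_000)})

# fast: pandas recognizes the built-in and uses its optimized reduction
fast = df.groupby('group')['value'].agg(sum)

# slow: an anonymous function is called in pure Python
slow = df.groupby('group')['value'].agg(lambda x: sum(x))

# both give the same answer; only the speed differs
assert np.allclose(fast, slow)
```

transform behaves the same way: `transform(sum)` takes the fast path, `transform(lambda x: sum(x))` does not.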
3. Optimization of row-by-row operations
Suppose we have an hourly power consumption data set, along with the electricity price for each period, as shown below:
The data set records power consumption per hour; for example, the first row says 0.586 kWh was consumed at 00:00 on January 13, 2001. The electricity price differs by time of use. Our goal is to compute the total electricity cost, so we need to multiply each hour's consumption by the unit price of its period. Here are three ways to write this; let's test each one and compare their efficiency.
```python
def get_cost(kwh, hour):
    if 0 <= hour < 7:
        rate = 0.6
    elif 7 <= hour < 17:
        rate = 0.68
    elif 17 <= hour < 24:
        rate = 0.75
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

# Method 1: a plain Python loop over the rows
def loop(df):
    cost_list = []
    for i in range(len(df)):
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = get_cost(energy_used, hour)
        cost_list.append(energy_cost)
    df['cost'] = cost_list

# Method 2: apply with a lambda over each row
def apply_method(df):
    df['cost'] = df.apply(
        lambda row: get_cost(
            kwh=row['energy_kwh'],
            hour=row['date_time'].hour),
        axis=1)

# Method 3: vectorized selection with isin
df.set_index('date_time', inplace=True)

def isin_method(df):
    peak_hours = df.index.hour.isin(range(17, 24))
    simple_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))
    df.loc[peak_hours, 'cost'] = df.loc[peak_hours, 'energy_kwh'] * 0.75
    df.loc[simple_hours, 'cost'] = df.loc[simple_hours, 'energy_kwh'] * 0.68
    df.loc[off_peak_hours, 'cost'] = df.loc[off_peak_hours, 'energy_kwh'] * 0.6
```
Test result:
As you can see, using isin() to select each period's rows and pricing them in one batch is nearly 606 times faster than the simple loop. This is not because isin() itself is special; method 3 is fast because it processes the data in a vectorized way (isin() is just one such tool, and there are others you can try). That is the key point. What does it mean?
I've drawn a small diagram here; comparing it with the code, you can see which is faster: processing the rows one by one, or splitting the rows into groups and processing each group in a single batch.
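To try method 3 yourself, here is a self-contained sketch on synthetic hourly data (one year of random readings, a stand-in for the real data set; the rates are the ones from the article):

```python
import numpy as np
import pandas as pd

# synthetic hourly readings for one year (stand-in for the real data set)
df = pd.DataFrame({
    'date_time': pd.date_range('2001-01-13', periods=8760, freq='h'),
    'energy_kwh': np.random.rand(8760),
}).set_index('date_time')

# vectorized: select each tariff period in one batch and price it
peak = df.index.hour.isin(range(17, 24))
shoulder = df.index.hour.isin(range(7, 17))
off_peak = df.index.hour.isin(range(0, 7))
df.loc[peak, 'cost'] = df.loc[peak, 'energy_kwh'] * 0.75
df.loc[shoulder, 'cost'] = df.loc[shoulder, 'energy_kwh'] * 0.68
df.loc[off_peak, 'cost'] = df.loc[off_peak, 'energy_kwh'] * 0.6

total_cost = df['cost'].sum()
```

Each `df.loc[mask, ...]` line touches thousands of rows at once in compiled code, which is exactly the "batch" path in the diagram.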
4. Use numba for acceleration
If your data processing involves a lot of numerical computation, numba can make your code run much faster, and it is very simple to use, as shown below. (The operation in the example has no real meaning; it is only there to show the effect.)
The numba module needs to be installed first:

```python
pip install numba
```
Let's test numba's speed-up effect with a simple example:
```python
import numba

@numba.vectorize
def f_with_numba(x):
    return x * 2

def f_without_numba(x):
    return x * 2

# apply, element by element
df["double_energy"] = df.energy_kwh.apply(f_without_numba)

# vectorized with pandas/numpy
df["double_energy"] = df.energy_kwh * 2

# vectorized with numba, on the underlying numpy array
df["double_energy"] = f_with_numba(df.energy_kwh.to_numpy())
```
The test results again highlight the advantage of vectorization, and numba more than doubled the efficiency of the already fast vectorized version. For more on numba, see the numba usage documentation.