Write high-performance Pandas code
Python is one of the most commonly used languages for scientific computing, a field that involves large amounts of data crunching; if the language is too slow, the trial-and-error style of scientific computing takes too long. So I've always wondered: how slow is Python, really, that people hesitate to use it? And how fast can it be made to compute gigabytes of data?
Pandas is the most commonly used framework for processing scientific-computing data. Experimenting step by step, I found that the answer depends heavily on how the code is written. Let's compare the performance cost of several approaches to traversing a data set.
The data is from a random data set found on the Internet:
import pandas as pd
import numpy as np
data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/EuStockMarkets.csv")
data.head()
| | Unnamed: 0 | DAX | SMI | CAC | FTSE |
|---|---|---|---|---|---|
| 0 | 1 | 1628.75 | 1678.1 | 1772.8 | 2443.6 |
| 1 | 2 | 1613.63 | 1688.5 | 1750.5 | 2460.2 |
| 2 | 3 | 1606.51 | 1678.6 | 1718.0 | 2448.2 |
| 3 | 4 | 1621.04 | 1684.1 | 1708.1 | 2470.4 |
| 4 | 5 | 1618.16 | 1686.6 | 1723.1 | 2484.7 |
data.describe()
| | Unnamed: 0 | DAX | SMI | CAC | FTSE |
|---|---|---|---|---|---|
| count | 1860.000000 | 1860.000000 | 1860.000000 | 1860.000000 | 1860.000000 |
| mean | 930.500000 | 2530.656882 | 3376.223710 | 2227.828495 | 3565.643172 |
| std | 537.080069 | 1084.792740 | 1663.026465 | 580.314198 | 976.715540 |
| min | 1.000000 | 1402.340000 | 1587.400000 | 1611.000000 | 2281.000000 |
| 25% | 465.750000 | 1744.102500 | 2165.625000 | 1875.150000 | 2843.150000 |
| 50% | 930.500000 | 2140.565000 | 2796.350000 | 1992.300000 | 3246.600000 |
| 75% | 1395.250000 | 2722.367500 | 3812.425000 | 2274.350000 | 3993.575000 |
| max | 1860.000000 | 6186.090000 | 8412.000000 | 4388.500000 | 6179.000000 |
I'm going to use the DAX column for the test, and the function I'll apply is sin(DAX) * 1.1. It has no special meaning; it just consumes time.
We'll test the following approaches:
- Naive for loop
- The iterrows loop
- The apply method
- Vector method
Each approach will be timed one by one, using the same measurement method throughout: timeit.
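For reference, %%timeit is an IPython cell magic. Outside IPython you can get comparable numbers with the standard timeit module; here is a minimal sketch (the statement and run count are arbitrary illustrative choices, not what %%timeit picks automatically):

```python
import timeit

# Setup code runs once and is excluded from the measurement,
# mirroring what %%timeit does for previously executed cells.
setup = """
import pandas as pd
import numpy as np
data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/EuStockMarkets.csv")
"""
# An arbitrary statement to time; number=100 is an illustrative choice,
# not the adaptive run count %%timeit selects on its own.
stmt = "data['DAX'].sum()"
print(timeit.timeit(stmt, setup=setup, number=100) / 100, "s per loop")
```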
Let's look at the first approach.
Naive for loop
%%timeit
result = [np.sin(data.iloc[i]['DAX']) * 1.1 for i in range(len(data))]
data['target'] = result
422 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using a regular for loop, 1860 iterations take 400+ ms. The exact number will vary from machine to machine, but it serves as the baseline for the comparisons that follow.
This is very slow, and very typical of naive Python. This style of writing is strongly discouraged, so let's improve on it.
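As a quick aside before moving on: part of the cost above comes from data.iloc[i]['DAX'], which builds a full row object (a Series) just to read one value. Indexing the column first avoids that. A minimal sketch of this variant, untimed here, which I'd expect to be noticeably faster while still being a Python-level loop:

```python
%%timeit
# Grab the column once, then index positions on it: this avoids
# constructing a Series for every row just to read a single value.
dax = data['DAX']
result = [np.sin(dax.iloc[i]) * 1.1 for i in range(len(data))]
data['target'] = result
```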
The iterrows loop
Pandas provides the iterrows method, which behaves like enumerate: each iteration yields the index and the current row object.
iterrows is much more efficient than the plain for loop while being just as easy to use. Pandas also provides an itertuples method that trades convenience for even higher performance: it yields namedtuples rather than Series, so inside the loop values are accessed by position (or by attribute, when the column name is a valid Python identifier) instead of by column label.
%%timeit
result = [np.sin(row['DAX']) * 1.1 for index, row in data.iterrows()]
data['target'] = result
103 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
result = [np.sin(row[2]) * 1.1 for row in data.itertuples()]
data['target'] = result
9.67 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
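Incidentally, since itertuples yields namedtuples, a column whose name is a valid identifier can also be read by attribute; row[2] above is simply the positional form (field 0 is the row's Index, field 1 the "Unnamed: 0" column). A sketch of the attribute form, which I'd expect to perform about the same:

```python
%%timeit
# Same computation, reading DAX by attribute instead of by position.
# itertuples yields namedtuples whose first field is the row's Index.
result = [np.sin(row.DAX) * 1.1 for row in data.itertuples()]
data['target'] = result
```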
Besides iterrows, there is yet another approach.
The apply method
This method traverses the DataFrame by applying a callback to each slice, with the axis parameter specifying which axis to traverse (rows or columns).
apply performs better than iterrows but worse than itertuples, while being just as convenient as iterrows.
%%timeit
data['target'] = data.apply(lambda row: np.sin(row['DAX']) * 1.1, axis=1)
60.7 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
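A side note: apply also accepts a raw=True flag, which passes each row to the callback as a plain NumPy array instead of a Series, skipping the per-row Series construction. I haven't benchmarked it here, but it typically closes much of the gap to itertuples. A sketch (column access becomes positional; DAX sits at index 1 of the row array, after "Unnamed: 0"):

```python
%%timeit
# raw=True hands the lambda a bare NumPy array per row (no Series),
# so the DAX column must be addressed by position.
data['target'] = data.apply(lambda row: np.sin(row[1]) * 1.1, axis=1, raw=True)
```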
With itertuples, the time drops to about 9 ms, roughly 10 times better than iterrows. Can it be improved further?
Vector method
If you do scientific computing, you should be familiar with vectors. The data inside a DataFrame is essentially stored as vectors, and most operations, including filtering, work on whole vectors at once (filtering is sketched a little further below).
So is it faster to compute directly on DataFrame vectors?
%%timeit
data['target'] = np.sin(data['DAX']) * 1.1
427 µs ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As you can see, it is both simpler and more efficient: only 400+ µs, roughly 1000 times faster than the original 400 ms, already well under a millisecond.
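The filtering mentioned earlier works the same way: a comparison on a column runs over the whole vector in one pass and yields a boolean mask, which then selects rows. A small illustrative sketch (the 2000 threshold is arbitrary):

```python
# Vectorized filtering: one pass builds a boolean mask over the whole
# DAX column, and the mask then selects the matching rows.
high_dax = data[data['DAX'] > 2000]
print(len(high_dax))
```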
Thinking
Thinking it through, it becomes clear why each step was able to improve performance.
The naive for loop carries too much information per iteration, most of it unnecessary, and it runs in Python's slow interpreted loop. Both problems can be attacked. First, the Python-level loop: the apply method pushes the iteration into Pandas' internals, which already multiplies performance several times over. Then, reducing the information carried per row by using plain immutable tuples improves performance considerably further.
An itertuples row is no longer a DataFrame structure; building a full Series for every row costs a lot of time, and skipping that per-row construction is where the gain comes from.
Computing on whole vectors directly then removes the per-row loop entirely, improving performance by roughly another 20 times.
The DataFrame vector still carries some extra information, such as the DataFrame's index, so dropping to the lowest-level unit of computation improves performance yet again.
Pandas uses NumPy arrays as its underlying data storage, so extracting the raw NumPy array from the vector removes that remaining overhead.
%%timeit
data['target'] = np.sin(data['DAX'].values) * 1.1
25 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
On these numbers, performance improves nearly 20 times again, more than 10,000 times faster than where we began.
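One last aside: newer Pandas versions recommend Series.to_numpy() over the .values attribute for extracting the underlying array; as far as I know it performs the same for a plain numeric column like this one:

```python
%%timeit
# to_numpy() is the currently recommended accessor for the raw NumPy
# array; for a plain float column it returns the same data as .values.
data['target'] = np.sin(data['DAX'].to_numpy()) * 1.1
```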
Conclusion
Many systems worry about performance from the very beginning of their design, and I think that is completely unnecessary.
The hard part of system design lies in finding the best, or at least a better, way to implement the functionality; raw language performance is rarely the system's bottleneck. Python lends itself to trial and error and fast iteration, and it offers many ways to optimize performance. If those methods don't meet your needs, you can use Cython, or write parts of your code in C.
Of course, performance has always been Python's Achilles heel; it has been labeled slow from the start. Still, don't worry too much about it. There's nothing wrong with getting something working in a few dozen lines of Python and only later rewriting the hot parts in thousands of lines of C.