How can I iterate 150x faster in Pandas?

Let’s face it, Python’s speed has caused quite a bit of controversy when compared to C or Go.

This has led me to doubt Python’s ability to handle tasks quickly for some time.

At the moment, I’m trying to do data science in Go. It’s possible, but it’s not nearly as enjoyable as it is in Python, mainly because Go is statically typed while data science is largely exploratory.

That’s not to say that a solution rewritten in Go can’t improve performance, but that’s the subject of another article.

Until now, I had overlooked Python’s ability to do tasks faster. I had been suffering from tunnel vision: seeing only one solution and completely ignoring the others. I doubt I am alone in this.

That’s why I want to briefly describe how to make everyday work with Pandas faster and more enjoyable. More precisely, the example will focus on iterating over rows and performing some data manipulation along the way. So, without delay, let’s get down to business.

Make a data set

The easiest way to make this point is to declare a single-column DataFrame with integer values ranging from 1 to 100,000:
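The original snippet did not survive, so here is a minimal sketch of what such a declaration could look like. The column name Values comes from the text further down; the variable name df is an assumption.

```python
import pandas as pd

# One column named "Values" holding the integers 1..100000.
# range() excludes its upper bound, hence 100001.
df = pd.DataFrame(data={"Values": range(1, 100001)})
```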

It really doesn’t take anything more complicated to demonstrate the Pandas speed problem. To verify that everything went well, here are the first few rows and the overall shape of the dataset:

Now that the preparation is done, let’s take a look at how to traverse, and how not to traverse, the rows of a DataFrame. First, how not to do it.
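A quick check along these lines might look as follows. This is a sketch that rebuilds the same assumed DataFrame so it runs on its own; first_rows and shape are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(data={"Values": range(1, 100001)})

first_rows = df.head()  # first five rows by default
shape = df.shape        # (number of rows, number of columns)

print(first_rows)
print(shape)
```

The shape should report 100,000 rows and a single column.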

Here’s what you shouldn’t do

Ah, iterrows(). I have used (and overused) this method so many times. It is slow by default, but why would I bother looking for an alternative? (Tunnel vision again.)

To prove that you shouldn’t use the iterrows() method to traverse a DataFrame, I’ll do a quick demonstration: declare a variable, set it to 0 initially, then increment it by the current value of the Values column at each iteration.

In case you are wondering, the %%time magic command reports how many seconds or milliseconds the cell needed to complete all of its operations.

Let’s see it in action:
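The demonstration described above can be sketched roughly like this. Because %%time only works inside a Jupyter/IPython cell, this version swaps in time.perf_counter() so it also runs as a plain script; the names total and elapsed_iterrows are assumptions, and the timing will differ from machine to machine.

```python
import time

import pandas as pd

df = pd.DataFrame(data={"Values": range(1, 100001)})

total = 0
start = time.perf_counter()

# iterrows() yields (index, Series) pairs; building a Series
# for every row is what makes this loop slow.
for _, row in df.iterrows():
    total += row["Values"]

elapsed_iterrows = time.perf_counter() - start
print(f"sum = {total}, iterrows() took {elapsed_iterrows:.2f} s")
```

In a notebook, the same loop could instead be placed in a cell that starts with %%time.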

You might be thinking that 15 seconds is not much to go through 100,000 rows and increment the value of some external variable. But it actually is; see why in the next section.

Here’s what you should do

There is a far better way to do this: itertuples(). As the name implies, itertuples() loops through the rows of a DataFrame and returns each one as a named tuple. That is why the values cannot be accessed by column name with square brackets []; you need to use dot notation (.) instead.
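To make the access rules concrete, here is a small sketch that inspects the first row produced by itertuples(). It rebuilds the assumed single-column DataFrame; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(data={"Values": range(1, 100001)})

# Take just the first named tuple from the iterator.
first = next(iter(df.itertuples()))

value_by_dot = first.Values    # dot notation works
value_by_position = first[1]   # integer position works too (0 is the index)

# Looking a field up by column name with [] does not work on a tuple.
try:
    first["Values"]
    name_lookup_failed = False
except TypeError:
    name_lookup_failed = True

print(first, value_by_dot, value_by_position, name_lookup_failed)
```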

I’ll now demonstrate the same example as a few minutes ago, but using the itertuples() method:
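A sketch of the same summing loop with itertuples() could look like this. As before, time.perf_counter() stands in for the notebook-only %%time, and total and elapsed_itertuples are assumed names.

```python
import time

import pandas as pd

df = pd.DataFrame(data={"Values": range(1, 100001)})

total = 0
start = time.perf_counter()

# Each row arrives as a lightweight named tuple,
# so the column is read with dot notation.
for row in df.itertuples():
    total += row.Values

elapsed_itertuples = time.perf_counter() - start
print(f"sum = {total}, itertuples() took {elapsed_itertuples:.4f} s")
```

The sum matches the iterrows() version; only the time needed to get there changes.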

Look at that! itertuples() performs the same operation about 154 times faster! Now imagine your daily work scenario where you are dealing with millions of rows: itertuples() can save you a lot of time.

In this simple example, we’ve seen how small changes to the code can have a huge impact on the overall result.

That doesn’t mean itertuples() will be 150 times faster than iterrows() in every scenario, but it should generally be faster to some degree.
