Recently I have been working on a Python simulator. Even though I was using NumPy, which is supposed to be the fastest numerical module in Python, the result was surprisingly slow: a single simulation took an hour, and testing at that pace was unbearable. So a question began to surface in my head.

  • I know Pandas is slower than NumPy, so I already avoid Pandas. But why is NumPy still so slow?

Being picky about my code, I gave Google a good look. The first article that came up was Getting the Best Performance out of NumPy, so I'm going to share some of the tips I learned from it, plus a few additions of my own.

Why Numpy?

We all know that Python is slow, largely because the interpreter performs a lot of behind-the-scenes checks every time it runs your code, for example when you assign values:

b = 1; a = b / 0.5

This may seem simple, but inside the computer, b has to be converted from an integer to a float before `b / 0.5` can be evaluated, since the result is a decimal. There are many other reasons and detailed explanations (such as how Python manages memory), which can be found here: Why Python is Slow: Looking Under the Hood

This is where NumPy becomes Python's savior. It combines the simplicity of Python with the performance of C: when you call NumPy, a lot of the work actually happens in compiled C code rather than in pure Python. That's why everyone loves NumPy.
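As a quick illustration (a minimal sketch, not a rigorous benchmark), you can feel the difference by summing a large array in a pure-Python loop versus letting NumPy's compiled loop do it:

import time
import numpy as np

data = np.random.rand(1000000)

t0 = time.time()
total = 0.0
for x in data:        # pure-Python loop: type checks on every iteration
    total += x
t1 = time.time()
total = np.sum(data)  # one call into NumPy's compiled C loop
t2 = time.time()

print(t1 - t0, t2 - t1)   # the NumPy call is typically orders of magnitude faster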

The structure of a NumPy array

In fact, NumPy follows C's logic: its storage container, the array, is created by reserving one contiguous region of memory, whereas Python stores objects in scattered regions, which makes indexing into a Python container less efficient. NumPy only needs to walk back and forth over this fixed contiguous block to retrieve data, without much effort. The image below, from Why Python is Slow: Looking Under the Hood, explains it well.

When working with NumPy, we usually store data not in a one-dimensional array but in two- or three-dimensional blocks (speaking for my machine learning friends).

And thanks to NumPy's fast matrix multiplication, multiplications can be distributed across multiple cores of your computer and run in parallel. These days we want everything multithreaded or multiprocessed, and this parallelism, which greatly speeds up the computation, is one more reason NumPy is so popular.
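For example (a small sketch; how many cores are actually used depends on the BLAS library your NumPy build links against, such as OpenBLAS or MKL):

x = np.random.rand(2000, 2000)
y = np.random.rand(2000, 2000)
z = x @ y   # dispatches to the underlying BLAS library, which
            # typically spreads the multiplication across several cores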

So for the 2D/3D arrays we use every day, we usually don't think about how they come about, because in our minds a matrix is a matrix and there's nothing deeper to it. But that's not true! Otherwise I wouldn't be writing this post. No matter whether an array is 1D, 2D, or 3D, underneath it is always a 1D array!

As the figure in that blog shows, what we think of as a 2D array is, if you trace it back to computer memory, actually stored in one contiguous block. And depending on how we create the array, the ordering within that contiguous block differs, which affects everything that follows! We will measure the elapsed time in Python later.
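To make this concrete, here is a minimal sketch: ravel reads out the underlying buffer in a given order, and strides reports how many bytes NumPy steps per axis:

a = np.arange(6).reshape(2, 3)
print(a.ravel(order='C'))   # row by row:       [0 1 2 3 4 5]
print(a.ravel(order='F'))   # column by column: [0 3 1 4 2 5]
print(a.strides)            # e.g. (24, 8): 24 bytes to the next row, 8 to the next column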

In NumPy, a 2D array can be laid out in memory row by row ("C-type", which is the default) or column by column ("Fortran"):

row_major = np.zeros((10, 10), order='C')    # C-type: rows are contiguous
col_major = np.zeros((10, 10), order='F')    # Fortran: columns are contiguous

Operations along an axis

When your calculations involve merging matrices, the way a matrix was created changes how long the merge takes, because concatenation and similar operations in NumPy happen in one dimension, not in the two dimensions we picture!

import time
import numpy as np

a = np.zeros((200, 200), order='C')
b = np.zeros((200, 200), order='F')
N = 9999

def f1(a):
    for _ in range(N):
        np.concatenate((a, a), axis=0)

def f2(b):
    for _ in range(N):
        np.concatenate((b, b), axis=0)

t0 = time.time()
f1(a)
t1 = time.time()
f2(b)
t2 = time.time()

print((t1-t0)/N)     # 0.000040
print((t2-t1)/N)     # 0.000070

As you can imagine from the figure above, row-major storage is faster when matrices are merged along rows. In the test above, f1 is faster because, thinking in terms of the underlying 1D array, appending a row just means appending to the end of that 1D array. In a column-major layout, however, inserting a row into the 1D array is more complicated and takes longer. If we merged with axis=1 instead, f2 in "F" mode would beat f1 in "C" mode.
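If you want to check the axis=1 claim yourself, here is a quick sketch in the same timing style (the exact numbers will vary by machine):

t0 = time.time()
for _ in range(N):
    np.concatenate((a, a), axis=1)   # a is 'C' order: rows must be interleaved
t1 = time.time()
for _ in range(N):
    np.concatenate((b, b), axis=1)   # b is 'F' order: just extend the 1D buffer
t2 = time.time()
print((t1-t0)/N, (t2-t1)/N)          # now the 'F' array should win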

One more thing: for convenience I sometimes use `np.vstack` instead of `np.concatenate`, because it takes less code to write, but timing it in the same way as above shows that it is slower. So for speed, I recommend using `np.concatenate` whenever possible.

np.vstack((a, a))                # 0.000063
np.concatenate((a, a), axis=0)   # 0.000040
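The results are the same either way; as far as I know `np.vstack` is essentially a convenience wrapper that ends in a `concatenate` call, so the gap above is wrapper overhead:

out1 = np.vstack((a, a))
out2 = np.concatenate((a, a), axis=0)
print(np.array_equal(out1, out2))   # True: identical result, different overhead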

The same applies when you operate along an axis in other ways, for example selecting rows from the matrix `a` created with "C-type" order above:

indices = np.random.randint(0, 100, size=10, dtype=np.int32)
a[indices, :]     # 0.000003
a[:, indices]     # 0.000006

Because `a` is stored row by row, selecting data by rows is much faster than selecting by columns! Other axis operations give similar results. So there you have it: figure out which axis you fiddle with most, and create your matrix in the matching layout ("C-type"/"Fortran").
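And if an array already exists in the wrong layout for your workload, NumPy can convert it at the cost of a one-time copy; a small sketch:

a_c = np.zeros((200, 200), order='C')
a_f = np.asfortranarray(a_c)        # one-time copy into column-major layout
print(a_c.flags['C_CONTIGUOUS'])    # True
print(a_f.flags['F_CONTIGUOUS'])    # True: column selections are now the fast ones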

Copy is slow, view is fast

In NumPy there are two important concepts: copy and view. A copy, as the name implies, duplicates the data and stores it at another location in memory, whereas a view copies nothing and simply refers to the source data. The figure below is from Understanding SettingwithCopyWarning in Pandas.

What does that mean? Let’s just look at the code.

a = np.arange(1, 7).reshape((3, 2))
a_view = a[:2]
a_copy = a[:2].copy()

a_copy[1, 1] = 0
print(a)
"""
[[1 2]
 [3 4]
 [5 6]]
"""

a_view[1, 1] = 0
print(a)
"""
[[1 2]
 [3 0]
 [5 6]]
"""

In short, everything in a_view is everything in a: wherever you change a_view, a changes with it, because they occupy exactly the same memory; essentially they are the same thing. a_copy, by contrast, is a duplicate of a placed somewhere else in memory, so changing a_copy cannot change a.
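If you are ever unsure which of the two you are holding, np.shares_memory can tell you; a minimal sketch using the arrays above:

print(np.shares_memory(a, a_view))   # True: same underlying buffer
print(np.shares_memory(a, a_copy))   # False: separate memory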

So why bring this up? Because views don't copy anything, they're fast! Let's test the speed. In the example below, `a *= 2` modifies the data in place (the same meaning as `a[:] *= 2`), without creating anything new. In `b = 2*b`, we compute a brand-new array and rebind b to it.

a = np.zeros((1000, 1000))
b = np.zeros((1000, 1000))
N = 9999

def f1(a):
    for _ in range(N):
        a *= 2           # same as a[:] *= 2

def f2(b):
    for _ in range(N):
        b = 2*b

t0 = time.time()
f1(a)
t1 = time.time()
f2(b)
t2 = time.time()

print('%f' % ((t1-t0)/N))     # f1: 0.000837
print('%f' % ((t2-t1)/N))     # f2: 0.001346

One more thing about views: do you occasionally flatten a matrix with `flatten()` or `ravel()`? They are different! `ravel` returns a view (thank you @fayi for the comment), or more precisely, it copies only when it has to; I believe that is when the memory order must be converted, such as "C-type" -> "Fortran". `flatten`, on the other hand, always returns a copy. Now you know who's holding you back! The tests below show that `ravel` is faster than `flatten`.

def f1(a):
    for _ in range(N):
        a.flatten()      # always copies

def f2(b):
    for _ in range(N):
        b.ravel()        # view when possible

t0 = time.time()
f1(a)
t1 = time.time()
f2(b)
t2 = time.time()

print('%f' % ((t1-t0)/N))    # flatten: 0.001059
print('%f' % ((t2-t1)/N))    # ravel:   0.000000
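You can verify the copy-only-when-necessary behavior of ravel directly; a small sketch:

c = np.zeros((10, 10), order='C')
f = np.zeros((10, 10), order='F')
print(np.shares_memory(c, c.ravel()))   # True: layout already matches, view returned
print(np.shares_memory(f, f.ravel()))   # False: order conversion forces a copy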

Select data

When we select data, the result comes back as either a view or a copy. Knowing what we now know, if a view will do, we should use a view and avoid copying data. So when do we get a view? Here are some of the forms that return views:

a_view1 = a[1:2, 3:6]    # slice
a_view2 = a[:100]        # same as above
a_view3 = a[::2]         # every other element
a_view4 = a.ravel()      # mentioned above
...                      # That's all I can think of. If there are more, please post them in the comments

And which operations hand us a copy?

a_copy1 = a[[1, 4, 6], [2, 4, 6]]   # select by index lists
a_copy2 = a[[True, True], [False, True]]  # select with a mask
a_copy3 = a[[1, 2], :]        # although rows 1 and 2 are contiguous, this is still a copy
a_copy4 = a[a[1, :] != 0, :]  # fancy indexing
a_copy5 = a[np.isnan(a[:, 0]), :]  # fancy indexing
...                          # That's all I can think of. If there are more, please post them in the comments

NumPy gives us a lot of freedom in selecting data, and it is all very convenient, but if you can avoid the copying forms, your speed can fly.
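Here is a rough sketch, in the same timing style as above, of what avoiding a copy buys you when both forms select the same rows (the names big and even are just for this illustration):

big = np.random.rand(100000, 10)
even = np.arange(0, 100000, 2)

t0 = time.time()
for _ in range(1000):
    _ = big[::2]       # view: no data moved
t1 = time.time()
for _ in range(1000):
    _ = big[even]      # fancy indexing: copies every selected row
t2 = time.time()
print(t1 - t0, t2 - t1)   # the view version should be far faster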

The blog above also mentions that if you still like fancy indexing, there are ways to speed that up too. It points out two approaches:

1. Use `np.take()` instead of selecting data by index.

As mentioned above, if you select data by index, as in `a_copy1 = a[[1,4,6], [2,4,6]]`, `np.take` will be faster in most cases than that kind of indexing.

a = np.random.rand(1000000, 10)
N = 99
indices = np.random.randint(0, 1000000, size=10000)

def f1(a):
    for _ in range(N):
        _ = np.take(a, indices, axis=0)

def f2(b):
    for _ in range(N):
        _ = b[indices]

t0 = time.time()
f1(a)
t1 = time.time()
f2(a)
t2 = time.time()

print('%f' % ((t1-t0)/N))    # take:  0.000393
print('%f' % ((t2-t1)/N))    # index: 0.000569

2. Use `np.compress()` instead of selecting data with a mask.

This targets mask selection, as in `a_copy2 = a[[True, True], [False, True]]` above. The tests are as follows:

mask = a[:, 0] < 0.5

def f1(a):
    for _ in range(N):
        _ = np.compress(mask, a, axis=0)

def f2(b):
    for _ in range(N):
        _ = b[mask]

t0 = time.time()
f1(a)
t1 = time.time()
f2(a)
t2 = time.time()

print('%f' % ((t1-t0)/N))    # compress: 0.028109
print('%f' % ((t2-t1)/N))    # mask:     0.031013

The very useful out parameter

If you don't know NumPy deeply, you have probably ignored the out parameter that many of its functions accept (I had never used it before either). But once I dug into it, it turned out to be very useful! For example, `a = a + 1` first has to be translated into a call like `np.add()`, so the former takes a little longer.

a = a + 1         # 0.035230
a = np.add(a, 1)  # 0.032738

Written this way, we trigger the copy principle mentioned earlier: the a assigned on the left is each time a new copy built from the original a, not a view of it. But these functions take an out parameter, with which we don't have to recreate a at all. So the two lines below do the same thing and neither creates another copy; still, perhaps for the translation reason mentioned above, there is a small difference in their run times.

a += 1                # 0.011219
np.add(a, 1, out=a)   # 0.008843

All the NumPy functions that accept out are listed here: Universal functions. So as long as a holder already exists (such as a), there is no need to create another one. Using out is convenient and effective.
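The same trick pays off inside a loop, where you can allocate one buffer up front and have every call write into it; a minimal sketch (buf is a hypothetical name for this illustration):

buf = np.empty_like(a)           # allocated once, reused every iteration
for _ in range(100):
    np.multiply(a, 2, out=buf)   # writes into buf, no new allocation
    np.sqrt(buf, out=buf)        # chains in place on the same buffer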

Give the data a name

I like pandas because pandas lets you name your data; names are much easier to remember and use than indices when you have many kinds of data. But pandas is slower than NumPy. Fortunately, NumPy also has a way to index by name: structured arrays. The a and b below hold data with the same structure, a as a NumPy structured array and b as a pandas DataFrame.

import pandas as pd

a = np.zeros(3, dtype=[('foo', np.int32), ('bar', np.float16)])
b = pd.DataFrame(np.zeros((3, 2), dtype=np.int32), columns=['foo', 'bar'])
b['bar'] = b['bar'].astype(np.float16)

"""
# a
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype=[('foo', '<i4'), ('bar', '<f2')])

# b
   foo  bar
0    0  0.0
1    0  0.0
2    0  0.0
"""

def f1(a):
    for _ in range(N):
        a['bar'] *= a['foo']

def f2(b):
    for _ in range(N):
        b['bar'] *= b['foo']

t0 = time.time()
f1(a)
t1 = time.time()
f2(b)
t2 = time.time()

print('%f' % ((t1-t0)/N))    # structured array: 0.000003
print('%f' % ((t2-t1)/N))    # DataFrame:        0.000508

As you can see, NumPy is significantly faster than pandas here. If you need named data, NumPy can do the job while keeping the computation fast. Pandas is slower partly because a DataFrame carries many extra features on top of the raw data. Here's a more comprehensive comparison: Numpy Vs Pandas Performance Comparison

If you have any other tips or speed tricks, feel free to discuss them below.

As a final note, if you’re interested in machine learning, there are plenty of awesome short video tutorials on machine learning methods and lots of Python hands-on tutorials on machine learning to get you up to speed in your spare time: Never Mind Python Tutorials