01 Why Does Python Perform Poorly?

When we talk about the efficiency of a programming language, we usually mean one of two things. The first is development efficiency: the time it takes a programmer to finish coding. The other is operational efficiency: the time it takes a computer to complete a computation. The two are often at odds, and it is difficult to achieve both at once. Different languages have different priorities, and Python is no doubt more concerned with development efficiency. Life is short, use Python.

Programmers who choose Python have, to some extent, accepted its inefficiency. Yet Python is being used in an increasing number of fields, from scientific computing to web servers, and programmers naturally want it to be faster and more powerful.

First, just how slow is Python compared to other languages? The results will certainly vary from scenario to scenario and from test case to test case. [Figure omitted: benchmark comparison of Python 3 against C++ across several test programs]

As you can see from the benchmark figure, Python is several times to tens of times slower than C++, depending on the case.

What are some of the reasons why Python is inefficient?

  • Python is a dynamic language

The type of the object a variable points to is determined at run time, so the compiler cannot predict anything and therefore cannot optimize. Take a simple example: r = a + b. Here a and b are added, but their types are known only at runtime. Addition is handled differently for different types, so every execution must first determine the types of a and b and then perform the corresponding operation. In static languages such as C++, the code to run is determined at compile time.
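A minimal illustration of this runtime dispatch: one and the same function body serves both integers and strings, because the interpreter inspects the operand types on every call.

```python
def add(a, b):
    # The same bytecode sequence runs here every time; the interpreter
    # checks the operands' types at run time and dispatches to int.__add__,
    # str.__add__, etc. -- nothing is resolved at compile time.
    return a + b

print(add(1, 2))      # integer addition
print(add("1", "2"))  # string concatenation, same function object
```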

Another example is attribute lookup, whose order is explained in Python Attribute Lookups. In short, accessing an attribute of an object is a fairly complex process, and the object reached through the same variable name can differ from one access to the next (see the Lazy Property example). In C, by contrast, an attribute is accessed using the object's address plus the attribute's fixed offset.
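A sketch of the lazy-property idea just mentioned (the `LazyProperty` descriptor and `Circle` class are made up for illustration): the first lookup of `c.area` goes through the descriptor's `__get__`, while later lookups hit the instance `__dict__` directly, so the "same" attribute access takes two different paths.

```python
class LazyProperty(object):
    """Non-data descriptor: compute the value once, then cache it on the instance."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls):
        if obj is None:
            return self
        value = self.func(obj)
        # After this line the instance __dict__ shadows the descriptor,
        # so subsequent lookups never reach __get__ again.
        obj.__dict__[self.func.__name__] = value
        return value

class Circle(object):
    def __init__(self, r):
        self.r = r

    @LazyProperty
    def area(self):
        return 3.14159 * self.r ** 2

c = Circle(2.0)
print(c.area)  # first access runs the computation via the descriptor
print(c.area)  # second access is a plain instance-dict lookup
```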

  • Python is interpreted execution

CPython does not support JIT (just-in-time compilation). Unladen Swallow, a Google-sponsored project, was the most famous attempt to add one, but it did not succeed either.

  • Everything in Python is an object

A reference count must be maintained for every object, which adds extra work.
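You can observe this bookkeeping with `sys.getrefcount` (CPython-specific; the returned value includes the temporary reference held by the call itself):

```python
import sys

x = []
before = sys.getrefcount(x)   # includes getrefcount's own argument reference
y = x                         # a second name now refers to the same list
after = sys.getrefcount(x)
print(before, after)          # in CPython, after == before + 1
```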

  • Python GIL

The GIL is one of the most criticized aspects of Python: because of the GIL, multithreaded Python code is never truly concurrent. In an IO-bound scenario this is not a big problem, but in a CPU-bound scenario it can be fatal. For that reason I rarely use Python multithreading in my work, preferring pre-fork processes or coroutines. Even in a single thread the GIL has a noticeable performance impact, because by default Python attempts a thread switch every 100 opcodes (configurable with sys.setcheckinterval()); the relevant source is in ceval.c::PyEval_EvalFrameEx.
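A quick way to see the CPU-bound case for yourself (numbers and function names are illustrative): two threads running pure-Python countdown loops take interleaved turns on one core under the GIL, so the wall-clock time is roughly the same as running the loops sequentially.

```python
import threading
import time

def countdown(n, results, idx):
    # Pure-Python CPU-bound loop: the GIL prevents two such threads
    # from executing bytecode in parallel.
    while n > 0:
        n -= 1
    results[idx] = "done"

results = [None, None]
start = time.time()
threads = [threading.Thread(target=countdown, args=(2000000, results, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results, "in %.2fs" % (time.time() - start))
```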

  • The garbage collection

This is probably a problem common to all garbage-collected languages. Python uses a mark-and-sweep plus generational garbage collection strategy, and every collection stops the world, causing the so-called stall. An article on InfoQ reported that Instagram's performance improved by 10% after disabling Python's GC. Interested readers can look it up.

02 Write Pythonic Code

We all know that premature optimization is the root of all evil, and that everything should be driven by profiling.

However, as a Python developer you should write Pythonic code, and Pythonic code tends to be more efficient than non-Pythonic code. For example:

  • Use iterators: for example, dict.iteritems() instead of items() (likewise itervalues()/iterkeys()); use generators, especially when the loop may break out early

  • To check whether two names refer to the same object, use is instead of ==

  • To determine whether an object is in a collection, use a set instead of a list

  • Exploit short-circuit evaluation: in a logical expression, put the condition most likely to short-circuit first. Other lazy-evaluation ideas apply as well

  • To accumulate a large number of strings, use the join operation

  • Use the for else (while else) syntax

  • Swap two variables using: a, b = b, a
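Several of the idioms above in one runnable sketch (Python 3 syntax; `dict.iteritems` is Python 2 only, since Python 3's `dict.items` already returns a view):

```python
words = ["a", "b", "c"]

# str.join instead of repeated += (avoids quadratic string copying)
joined = "".join(words)

# set membership is O(1); list membership is O(n)
allowed = set(words)
ok = "b" in allowed

# swap two variables without a temporary
x, y = 1, 2
x, y = y, x

# for/else: the else clause runs only when the loop was not broken out of
found = True
for w in words:
    if w == "zzz":
        break
else:
    found = False

print(joined, ok, x, y, found)
```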


03 Profile-based optimization

Even though our code is very Pythonic, it may not run as efficiently as expected. We also know the 80/20 rule: most of the time is spent in a small number of code fragments, so the key to optimization is finding the bottleneck code. There are many ways to do this: scatter logs that print timestamps, or test suspected functions individually with timeit, but a profiling tool is the most effective.

▍ 1. Python profilers

There are three well-known profilers for Python programs: profile, cProfile, and hotshot.

Among them, profile is implemented in pure Python, while cProfile reimplements much of profile in C. hotshot is also implemented in C; compared with cProfile, hotshot has less impact on the execution of the profiled code at the cost of more post-processing time, and hotshot is no longer maintained.

It is important to note that profile (and cProfile, hotshot) is only suitable for single-threaded Python programs. For multithreaded programs you can use yappi, which not only supports multiple threads but can also measure CPU time accurately.

For greenlet-based code, you can use the greenlet profiler, which builds on yappi and hooks greenlet context switches.

Here is a piece of deliberately "inefficient" code; we will use cProfile to illustrate the concrete approach to profiling and the performance bottlenecks we may encounter.

# -*- coding: UTF-8 -*-

from cProfile import Profile
import math
def foo():
    return foo1() 

def foo1():
    return foo2()

def foo2():
    return foo3()

def foo3():
    return foo4()

def foo4():
    return "this call tree seems ugly, but it always happen"

def bar():
    ret = 0
    for i in xrange(10000):
        ret += i * i + math.sqrt(i)
    return ret

def main():
    for i in range(100000):
        if i % 10000 == 0:
            bar()
        else:
            foo()

if __name__ == '__main__':
    prof = Profile()
    prof.runcall(main)
    prof.print_stats()
    #prof.dump_stats('test.prof') # dump profile result to test.prof

The running results are as follows:

For the output above, each field has the following meaning:

  • ncalls: number of times the function was called
  • tottime: total time spent in the function itself, excluding subfunctions
  • percall (first): tottime / ncalls
  • cumtime: total time spent in the function and all its subfunctions
  • percall (second): cumtime / ncalls
  • filename:lineno(function): file name, line number, and function name

The code above simply prints the result; in practice you can use the pstats module to process and present profile results in richer ways. See the official documentation: https://docs.python.org/2/library/profile.html
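For example (Python 3; a self-contained sketch that generates its own `test.prof`, and the `work` function is made up for illustration):

```python
import cProfile
import io
import pstats

def work():
    return sum(i * i for i in range(1000))

prof = cProfile.Profile()
prof.runcall(work)
prof.dump_stats('test.prof')   # same dump_stats call as in the example above

# pstats can strip directory paths, sort by any column, and limit the output
stream = io.StringIO()
stats = pstats.Stats('test.prof', stream=stream)
stats.strip_dirs().sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```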

▍ 2. Profile GUI tools

Although cProfile's output is fairly intuitive, we often prefer to save the profile results and analyze them with graphical tools from different dimensions, or to compare the code before and after an optimization.

There are many tools for viewing profile results, such as visualpytune, qcachegrind, and runsnakerun; visualpytune is used for this analysis. For the code above, re-run it to generate the test.prof file and open it directly with visualpytune, as follows:

The fields mean essentially the same as in the text output, and it is convenient to sort by clicking a field name. The lower left lists the callers of the current function; the lower right lists the time consumed within the current function and its subfunctions, sorted by cumtime (the time spent inside the function and its children).

Performance bottlenecks are usually caused by functions that are called frequently, functions whose single invocation is very expensive, or a combination of the two. In our earlier example, foo is a case of high-frequency calls and bar is a case of expensive single invocations; these are what we need to optimize.

The article python-profiling-tools introduces the use of qcachegrind and runsnakerun, two graphical tools that are much more powerful than visualpytune. For how to open test.prof with qcachegrind, see the following figure.

qcachegrind is indeed more powerful than visualpytune. As the figure shows, the window is roughly divided into three parts. The first part, similar to visualpytune, lists each function's time consumption, where Incl corresponds to cumtime and Self corresponds to tottime. The second and third parts each have many tabs, with different tabs showing the results from different perspectives; for example, the third part's "Call Graph" shows the function's call tree with the time percentage of each subfunction, which is very intuitive.

▍ 3. Optimizing with profile results

Once you know the hot spots, you can optimize in a targeted way. Such optimization is often closely tied to the specific business: there is no skeleton key, and specific problems need specific analysis. In my experience, the most effective "optimization" is to go back to the product manager and discuss the requirements; it may be possible to meet them in a different way, or a small compromise may be acceptable. The second most effective is to change the implementation: perhaps a simple but inefficient algorithm was used initially, and if it has become the bottleneck, consider switching to a more efficient but possibly harder-to-implement algorithm, or applying a dirty-flag pattern. These methods must also be combined with the specific case, so this article will not elaborate further.

Next, drawing on Python's language features, I'll introduce some ways to make Python code less Pythonic but better performing.

First: reduce the level of function calls

Each layer of function call adds overhead. For call trees with high call frequency but little work per call, the cost of multiple call layers is especially significant, so consider flattening them.

In the code we profiled earlier, foo's call tree is very simple but called very frequently. Modify the code to add a plain_foo() function that returns the final result directly; the key output is as follows:
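The flattened version might look like this, returning the same string foo4 did:

```python
def plain_foo():
    # foo() -> foo1() -> foo2() -> foo3() -> foo4(), collapsed into one call
    return "this call tree seems ugly, but it always happen"
```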


Comparison with previous results:


As you can see, this is nearly a 3x improvement.

Second: optimize attribute lookup

As mentioned above, Python's attribute lookup is inefficient. If an attribute is accessed frequently in a piece of code (such as inside a for loop), consider caching it in a local variable.
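A minimal sketch of this transformation (function names are illustrative; both variants compute the same result):

```python
import math

def slow(n):
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)   # attribute lookup of 'sqrt' on every iteration
    return total

def fast(n):
    sqrt = math.sqrt            # hoist the attribute lookup into a local variable
    total = 0.0
    for i in range(n):
        total += sqrt(i)        # local-variable access is much cheaper
    return total

print(slow(1000), fast(1000))
```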

Third: close the GC

As mentioned in the first section of this article, turning off GC can improve Python's performance, and the pauses caused by GC are unacceptable in applications with strict latency requirements. But turning off GC is not easy: we know that reference counting alone cannot reclaim circular references, which is exactly what the GC handles, and in Python it is very easy to create circular references. For example:

# case 1
a, b = SomeClass(), SomeClass()
a.b, b.a = b, a

# case 2
lst = []
lst.append(lst)

# case 3
self.handler = self.some_func

Of course, you might say: who would be so stupid as to write code like this? True, the examples above are obvious, but add a few levels of indirection in between and you get an "indirect" circular reference. OrderedDict in Python's standard library collections is an instance of case 2:

To break circular references, the first method is to use weakref; the second is to break the cycle manually.
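A sketch of the weakref approach (the `Node` class and names are made up for illustration): the child holds only a weak back-reference to its parent, so no cycle keeps the pair alive and reference counting alone can free the parent.

```python
import weakref

class Node(object):
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

root = Node("root")
child = Node("child")
root.children.append(child)
child.parent = weakref.ref(root)   # weak back-reference: no cycle is kept alive

assert child.parent() is root      # calling the ref yields the target (or None)
del root                           # refcount hits zero at once; no GC pass needed
print(child.parent())              # -> None once root is gone
```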

Fourth: setcheckinterval

If the program is known to be single-threaded, set the check interval to a larger value, as discussed in the GIL section above.
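For reference (the Python 2 call is shown commented out, since Python 3 replaced it with a time-based interval):

```python
import sys

# Python 2: raise the bytecode check interval (default 100) so the
# interpreter attempts thread switches far less often:
#     sys.setcheckinterval(10000)

# Python 3 equivalent: the switch interval is time-based (default 0.005s)
sys.setswitchinterval(0.05)
print(sys.getswitchinterval())
```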

Number five: Use __slots__

__slots__ is designed primarily to save memory, but it can also improve performance to some extent. We know that an instance of a class defining __slots__ reserves just enough space for the declared attributes and no longer creates a __dict__ automatically. Of course, there are many caveats to using __slots__; the most important is that every class on the inheritance chain must define __slots__, as described in the Python docs. Here is a simple test example:

# -*- coding: UTF-8 -*-
import time

class BaseSlots(object):
    __slots__ = ['e', 'f', 'g']

class Slots(BaseSlots):
    __slots__ = ['a', 'b', 'c', 'd']
    def __init__(self):
        self.a = self.b = self.c = self.d = self.e = self.f = self.g = 0

class BaseNoSlots(object):
    pass

class NoSlots(BaseNoSlots):
    def __init__(self):
        super(NoSlots, self).__init__()
        self.a = self.b = self.c = self.d = self.e = self.f = self.g = 0

def log_time(s):
    begin = time.time()
    for i in xrange(10000000):
        s.a, s.b, s.c, s.d, s.e, s.f, s.g
    return time.time() - begin

if __name__ == '__main__':
    print 'Slots cost', log_time(Slots())
    print 'NoSlots cost', log_time(NoSlots())

The output is as follows:

Slots cost 3.12999987602
NoSlots cost 3.48100018501
