This article was originally published on the official account [Python Cat]. Please do not reprint without authorization.

Original address: mp.weixin.qq.com/s/8KvQemz0S…

Cat talk: One of the most widely criticized aspects of Python is probably its GIL. Because of the GIL, Python threads cannot truly run in parallel, and many people consider this Python's biggest weakness.

After the introduction of PEP-554 (September 2017), there seemed to be a glimmer of hope. But can the GIL really be killed off completely? If so, how would it be done? And why, more than a year later, hasn't it happened yet?


Original | Has the Python GIL been slain? [1]

Author | Anthony Shaw

Translator | Cat Under the Pea Flowers

Disclaimer: This article has been translated with the original author's authorization. Please retain the source when reprinting, and do not use it for commercial or illegal purposes.

In early 2003, Intel introduced the new Pentium 4 "HT" processor, which ran at 3 GHz and introduced "hyper-threading" technology.

Over the next few years, Intel and AMD competed fiercely for the best desktop performance by increasing bus speeds, enlarging L2 caches, and shrinking dies to minimize latency. The 3 GHz "HT" was succeeded in 2004 by the Prescott "580" model, which clocked up to 4 GHz.

It seemed that the way forward for performance was raw clock speed, but CPUs struggled with high power consumption and earth-warming levels of heat output.

Do you have a 4 GHz CPU in your computer? Not likely, because the way forward in performance turned out to be higher bus speeds and more cores. The Intel Core 2, which replaced the Pentium 4 in 2006, ran at much lower clock speeds.

Besides the release of consumer multicore CPUs, something else happened in 2006: Python 2.5 was released! Python 2.5 shipped an early version of the beloved with statement.
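
For those who missed that era, a minimal sketch (the with statement needed a __future__ import in Python 2.5 and became default syntax in 2.6):

from __future__ import with_statement  # required in Python 2.5 only

# The with statement guarantees cleanup: the file is closed when the
# block exits, even if an exception is raised inside it.
with open('example.txt', 'w') as f:
    f.write('hello')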

Python 2.5 had one important limitation when used on Intel's Core 2 or AMD's Athlon X2: the GIL.

What is the GIL?

The GIL, or Global Interpreter Lock, is a boolean value in the Python interpreter, protected by a mutex. The lock is used by CPython's core bytecode evaluation loop to set which thread is currently executing statements.

CPython supports multiple threads within a single interpreter, but a thread must hold the GIL to execute opcodes (low-level operations). The benefit is that Python developers writing asynchronous or multithreaded code don't have to worry about acquiring locks on variables, or about processes crashing from deadlocks.

The GIL makes multithreaded programming in Python easy.

The GIL also means that while CPython can be multithreaded, only one thread executes at any given time. Your quad-core CPU ends up running like a single-core machine (minus the blue screen, hopefully).
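
Here's a minimal sketch (mine, not from the article) that makes the effect visible: splitting pure-Python, CPU-bound work across two threads doesn't finish any faster, because the threads take turns holding the GIL.

import threading
import time

def countdown(n):
    # Pure-Python, CPU-bound work: the thread holds the GIL while it runs
    while n > 0:
        n -= 1

N = 10_000_000

start = time.perf_counter()
countdown(N)
print(f"sequential: {time.perf_counter() - start:.2f}s")

# Two threads, each doing half the work, take about as long (or longer):
# only one of them can execute bytecode at any moment.
t1 = threading.Thread(target=countdown, args=(N // 2,))
t2 = threading.Thread(target=countdown, args=(N // 2,))
start = time.perf_counter()
t1.start(); t2.start()
t1.join(); t2.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")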

The current version of the GIL was written in 2009 [2] to support asynchronous functionality and has survived virtually untouched, despite numerous attempts to remove it or reduce dependency on it.

The recurring requirement for any attempt to remove the GIL is that it must not degrade the performance of single-threaded code. Anyone who enabled hyper-threading in 2003 will understand why that's important [3].

Avoiding the GIL in CPython

If you want truly concurrent code in CPython, you have to use multiple processes.

In CPython 2.6, the multiprocessing module was added to the standard library. multiprocessing is a wrapper around spawning CPython processes (each with its own GIL):

from multiprocessing import Process

def f(name):
    print 'hello', name

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

Processes are spawned from the main process, sent commands via pickled Python modules or functions, and then rejoin the main process.

The multiprocessing module also supports sharing variables through queues and pipes. It also has a Lock object, for locking objects in the main process while other processes write to them.
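
Here's a minimal sketch (mine, not the author's) of sharing data through a Queue:

from multiprocessing import Process, Queue

def worker(q):
    # Objects put on the queue are pickled and sent back to the parent
    q.put('result from child')

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # blocks until the child sends its result
    p.join()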

Multiprocessing has one major drawback: it is expensive in time and memory. CPython's startup time, even with no-site (skipping the site module), is 100-200 ms (see this link [4]).
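
You can get a rough feel for that startup cost yourself. A sketch (numbers vary by machine):

import subprocess
import sys
import time

# Time a bare interpreter starting and exiting; -S skips the site module.
start = time.perf_counter()
subprocess.run([sys.executable, '-S', '-c', 'pass'])
print(f"startup: {(time.perf_counter() - start) * 1000:.0f} ms")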

So yes, you can run concurrent code in CPython, but you have to plan the application carefully around long-running processes that share few objects between them.

Another alternative is a third-party library like Twisted.

PEP-554 and the death of the GIL?

To summarize: multithreading in CPython is easy, but it isn't truly concurrent; multiprocessing is concurrent, but carries significant overhead.

Is there a better solution?

The clue to bypassing the GIL lies in its name: the global interpreter lock is part of the global interpreter state. CPython processes can have multiple interpreters, and therefore multiple locks, but this feature is rarely used because it is exposed only through the C-API.

Among the features proposed for CPython 3.8 is PEP-554, a proposed implementation of sub-interpreters along with a new interpreters module exposing the API in the standard library.

This makes it possible to create multiple interpreters within a single Python process. Another proposed change is that each interpreter will have its own GIL.

Because interpreter state contains the memory allocation arena, a collection of all pointers to Python objects (local and global), sub-interpreters in PEP-554 cannot access other interpreters' global variables.

As with multiple processes, sharing objects between interpreters means serializing them over some form of IPC (network, disk, or shared memory). There are many ways to serialize objects in Python: the marshal module, the pickle module, and more standardized methods like JSON and SimpleXML. Each has its pros and cons, but all of them carry a cost.
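
As a rough illustration (my sketch, not the author's), here's a round-trip timing of the serializers just mentioned, with json standing in for the standardized formats:

import json
import marshal
import pickle
import time

data = list(range(100_000))

for name, dumps, loads in [
    ('marshal', marshal.dumps, marshal.loads),
    ('pickle', pickle.dumps, pickle.loads),
    ('json', lambda obj: json.dumps(obj).encode(), lambda raw: json.loads(raw)),
]:
    start = time.perf_counter()
    loads(dumps(data))  # serialize and deserialize once
    print(f"{name}: {(time.perf_counter() - start) * 1000:.2f} ms")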

The best solution is to carve out a shared, mutable memory space controlled by the main process. Objects could then be sent from the main interpreter and received by other interpreters. This would be a memory-managed space of PyObject pointers, accessible by every interpreter and controlled by the main process.

Such an API is still in the works, but it might look something like this:

import _xxsubinterpreters as interpreters
import threading
import textwrap as tw
import marshal

# Create a sub-interpreter
interpid = interpreters.create()

# If you had a function that generated some data
arry = list(range(0, 100))

# Create a channel
channel_id = interpreters.channel_create()

# Pre-populate the interpreter with a module
interpreters.run_string(interpid, "import marshal; import _xxsubinterpreters as interpreters")

# Define a runner function to execute inside a thread
def run(interpid, channel_id):
    interpreters.run_string(interpid,
                            tw.dedent("""
                            arry_raw = interpreters.channel_recv(channel_id)
                            arry = marshal.loads(arry_raw)
                            result = [1, 2, 3, 4, 5]  # where you would do some calculating
                            result_raw = marshal.dumps(result)
                            interpreters.channel_send(channel_id, result_raw)
                            """),
                            shared=dict(channel_id=channel_id),
                            )

inp = marshal.dumps(arry)
interpreters.channel_send(channel_id, inp)

# Run inside a thread
t = threading.Thread(target=run, args=(interpid, channel_id))
t.start()

# Sub interpreter will process. Feel free to do anything else now.
output = interpreters.channel_recv(channel_id)
interpreters.channel_release(channel_id)
output_arry = marshal.loads(output)

print(output_arry)

This example sends an array (here, a plain list) over the channel by serializing it with the marshal module. The data is then processed by the sub-interpreter (on a separate GIL), making this the kind of computationally intensive (CPU-bound) concurrency problem that sub-interpreters handle well.

This seems inefficient

The marshal module is reasonably fast, but still not as fast as sharing objects directly from memory.

PEP-574 proposes a new pickle protocol (v5) [5] that supports handling memory buffers separately from the rest of the pickle stream. For large data objects, serializing the whole object once and deserializing it again in the sub-interpreter would add a lot of overhead.
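
As a taste of that protocol (a sketch of mine, using the protocol-5 API as it eventually shipped in Python 3.8): wrapping a large payload in pickle.PickleBuffer keeps its bytes out of the pickle stream entirely.

import pickle

big = bytearray(b'x') * 10_000_000  # ~10 MB payload
buffers = []

# With buffer_callback, the raw buffer is handed out-of-band instead of
# being copied into the pickle stream.
stream = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                      buffer_callback=buffers.append)
print(len(stream), len(buffers))  # a tiny stream, plus 1 out-of-band buffer

# The receiver supplies the buffers back at load time, with no extra copy.
restored = pickle.loads(stream, buffers=buffers)
print(memoryview(restored).nbytes)  # 10000000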

The new API could (hypothetically, none of this has been merged yet) provide an interface like this:

import _xxsubinterpreters as interpreters
import threading
import textwrap as tw
import pickle

# Create a sub-interpreter
interpid = interpreters.create()

# If you had a function that generated some data
arry = [5, 4, 3, 2, 1]

# Create a channel
channel_id = interpreters.channel_create()

# Pre-populate the interpreter with a module
interpreters.run_string(interpid, "import pickle; import _xxsubinterpreters as interpreters")

buffers=[]

# Define a runner function to execute inside a thread
def run(interpid, channel_id):
    interpreters.run_string(interpid,
                            tw.dedent("""
                            arry_raw = interpreters.channel_recv(channel_id)
                            arry = pickle.loads(arry_raw)
                            print(f"Got: {arry}")
                            result = arry[::-1]
                            result_raw = pickle.dumps(result, protocol=5)
                            interpreters.channel_send(channel_id, result_raw)
                            """),
                            shared=dict(channel_id=channel_id),
                            )

inp = pickle.dumps(arry, protocol=5, buffer_callback=buffers.append)
interpreters.channel_send(channel_id, inp)

# Run inside a thread
t = threading.Thread(target=run, args=(interpid, channel_id))
t.start()

# Sub interpreter will process. Feel free to do anything else now.
output = interpreters.channel_recv(channel_id)
interpreters.channel_release(channel_id)
output_arry = pickle.loads(output)

print(f"Got back: {output_arry}")

This looks like a lot of boilerplate

Indeed. This example uses the low-level sub-interpreter API. If you've used the multiprocessing library, you'll recognize some of these problems. It's not as simple as threading: you can't (yet) just say "run this function with this list of inputs" across separate interpreters.
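
For contrast, here's a minimal sketch of what "run this function with these inputs" looks like with threading today; sub-interpreters offer no such shortcut yet, since code has to travel as a string and data has to be serialized over a channel:

import threading

def work(data):
    print(sorted(data))

# Threads share the interpreter's object space, so a plain function and
# its arguments can be handed over directly.
t = threading.Thread(target=work, args=([3, 1, 2],))
t.start()
t.join()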

Once this PEP is merged, I expect some of the other APIs on PyPI will adopt it.

How much overhead does a sub-interpreter have?

The short answer: more than a thread, less than a process.

Interpreters have their own state, so although PEP-554 makes it easy to create sub-interpreters, each one still needs to clone and initialize the following:

  • Modules in the __main__ namespace and importlib
  • The contents of the sys dictionary
  • Built-in functions (print(), assert, and so on)
  • Threads
  • The core configuration

The core configuration can be cloned easily from memory, but the imported modules are not so simple. Importing modules in Python is slow, so if creating a sub-interpreter means importing modules into another namespace every time, the benefits diminish.
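
A quick sketch of that import cost (run it as a fresh script, so json isn't already cached in sys.modules):

import time

start = time.perf_counter()
import json  # first import in this process; repeats would hit the cache
print(f"import json: {(time.perf_counter() - start) * 1000:.2f} ms")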

What about asyncio?

The current implementation of the asyncio event loop in the standard library creates the frames that need evaluating, but shares state (and therefore the GIL) within the main interpreter.

After PEP-554 is merged, most likely in Python 3.9, an alternative event-loop implementation could (though no one has built one yet) run async methods inside sub-interpreters, and therefore concurrently.

Sounds great, ship it!

Well, not yet.

Because CPython has had a single-interpreter implementation for so long, many parts of the code base use "runtime state" instead of "interpreter state," so merging PEP-554 in its current form would cause a lot of problems.

For example, the garbage collector's state (in versions prior to 3.7) belonged to the runtime.

During the PyCon sprints (translator's note: PyCon US 2019 was held from May 1 to May 9; sprints are 1-4 day events where developers volunteer to work on a project as a "sprint"; the word is more commonly used by agile development teams, with a slightly different meaning and form), changes began [6] to move the garbage collector's state over to the interpreter, so that each sub-interpreter will have its own GC (as it should).

Another problem is that some "global" variables linger in the CPython code base, and in many C extensions. So when people suddenly start writing properly concurrent code, we may start to see some problems.

Another problem is that file handles belong to the process, so if one interpreter has a file open for reading or writing, a sub-interpreter won't be able to access that file (without further changes to CPython).

In short, there are many other things that need to be addressed.

Conclusion: Is the GIL dead?

For single-threaded applications, the GIL lives on. So even once PEP-554 is merged, your single-threaded code won't suddenly become concurrent.

If you want to write concurrent code in Python 3.8 and you have computationally intensive (CPU-bound) concurrency problems, then this could be the ticket you've been waiting for!

When?

Pickle v5 and shared memory for multiprocessing will likely land in Python 3.8 (October 2019), and sub-interpreters will land somewhere between 3.8 and 3.9.

If you want to play with my examples now, I've built a branch containing all the necessary code [7].

References

[1] Has the Python GIL been slain?: hackernoon.com/has-the-pyt…
[2] written in 2009: github.com/python/cpyt…
[3] why this is important: arstechnica.com/features/20…
[4] this link: hackernoon.com/which-is-th…
[5] PEP-574, the new pickle protocol: www.python.org/dev/peps/pe…
[6] changes have begun: github.com/python/cpyt…
[7] the necessary code: github.com/tonybaloney…

The official account [Python Cat] publishes serial high-quality articles, including the philosophy cat series, the Python advanced series, book recommendations, technical writing, and quality English recommendations and translations. You're welcome to follow. Reply "love learning" in the background to receive a free learning gift pack.