Interviewer: I heard you learned Python? Can you tell me how Python does memory management?
Me: ?? I'm not really clear on memory management…
Interviewer: What do you know about Python garbage collection?
Me: Python's garbage collection relies mainly on reference counting, supplemented by mark-and-sweep and generational collection.
Interviewer: Can you tell me more about these three garbage collection techniques?
Me: …I'm done for.
Let’s start with memory management
Memory management boils down to two things: allocation (malloc) + reclamation (free).
Before reading on, think about it: if you were the designer, how would you implement memory management? Honestly, I couldn't design it either, so let's set that aside and look at how Python does it. To be efficient, it has to answer two questions:
- How to allocate efficiently?
- How to recycle effectively?
What is memory
When you buy a computer advertised as "4G + 500G/1T", the 4G refers to the computer's memory (RAM) capacity, and the 500G/1T refers to its hard disk.
Memory here means internal memory; its counterpart is external storage such as the hard disk. Both are hardware devices.
Memory is one of the most important parts of a computer. It is the bridge between external storage and the CPU: every program in a computer runs in memory, so memory performance has a great impact on the whole machine.
Memory is like a blank book
First, picture your computer's storage space as a blank book of short stories. Nothing has been written on its pages yet. Eventually, different authors will show up, and each one needs some space to write their story.
Since they are not allowed to write over each other's work, they must be careful about which pages they write on. Before they start writing, they consult the book's manager, who decides where in the book they are allowed to write.
Because the book has been around for a long time, many of its stories are no longer relevant. When nobody reads or references a story anymore, it is removed to make room for new stories.
In essence, computer memory is like an empty book. In fact, fixed-length blocks of contiguous memory are commonly called pages, so the analogy works well.
Authors are like different applications or processes that need to store data in memory. The administrators who decide where an author writes in a book are like various memory management roles, and the people who delete old stories to make room for new ones are garbage collectors.
The above analogy comes from the article Memory Management in Python (see the references).
Memory management: from hardware to software
Why can a computer with only 4 GB of memory efficiently analyze many gigabytes of data, and keep a program running indefinitely?
What does Python do with that 4 GB of memory to help us?
Memory management is the process by which applications read and write data. The memory manager determines where to place application data.
Because memory is limited, like pages in a book, an administrator must find some free space and provide it to the application. The process of providing memory is often called memory allocation.
In fact, if we understand the memory management mechanism, we can solve the problem in a faster and better way.
After reading this article, you will understand a little about Python’s memory management design philosophy.
Object management
We've probably all heard Python's famous "everything is an object." Indeed, in Python numbers are objects, strings are objects, everything is an object. In CPython, the core of the object implementation is a C struct called PyObject.
```c
typedef struct _object {
    int ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
```
PyObject, which every object must carry and which is the ancestor of all objects in Python, consists of only two things (both can be glimpsed from Python, as sketched after the list):
- ob_refcnt: the reference count
- ob_type: a pointer to the object's actual type
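A minimal sketch (note that sys.getrefcount() reports one more than the "true" count, because passing the object as an argument temporarily adds a reference):

```python
import sys

x = [1, 2, 3]
print(sys.getrefcount(x))  # roughly ob_refcnt (plus 1 for the temporary argument reference)
print(type(x))             # the type object that ob_type points to: <class 'list'>
```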
So CPython is written in C, and it interprets Python bytecode. What does this have to do with memory management?
Well, there are memory management algorithms and structures in the CPython code in C. To understand Python’s memory management, you must have a basic understanding of CPython itself. We won’t go into the details of the others, but those who are interested in them can learn by themselves.
CPython memory management
Note: I found a lot of material about this online and still struggled with it for a long time (my own shortcoming). The only part I could really follow is Alexander VanTol's article, so the relevant text has been moved here in abridged form. If you are interested, I suggest reading the original.
In the diagram from the original article, the dark gray boxes are memory owned by the Python process.
Python uses part of that memory for internal use and non-object memory. The rest is dedicated to object storage (your ints, dicts, and so on). Note that this picture is simplified; if you want the full story, you can read the CPython source code, where all of this memory management happens.
CPython has an object allocator that allocates memory within the object-memory region. This allocator is where most of the magic happens: it is invoked whenever a new object needs space allocated or freed.
In general, adding and removing data to Python objects such as list and int doesn’t involve too much data at once. As a result, the allocator has been designed to process small amounts of data at once. It also tries not to allocate memory until absolutely necessary.
Now, let’s look at CPython’s memory allocation strategy. First, we’ll discuss these three main parts and how they relate to each other.
Python’s memory allocator
Memory structure
In Python, allocating memory is not just a matter of calling malloc/free: three separate layers are stacked on top of them so that allocation can be done efficiently.
The layers below layer 0 are the OS's responsibility. Layer -2 is the part that deals with the machine's physical memory, handled by the OS's virtual memory manager. Layer -1 is where the OS actually interacts with the machine. Since this is beyond the scope of this article, we won't go into further detail. Representative functions used at layers 0 through 3 are shown in the layer diagram from the original article.
Layer 0: the general-purpose base allocator
On Linux, layer 0 refers to allocators such as glibc's malloc(), which request memory from the OS.
Python does not call malloc() for every object it creates; instead, it picks an allocation strategy based on how much memory is requested. If the requested size is greater than 256 bytes (raised to 512 bytes in newer CPython versions), it honestly calls malloc(); if the size is less than or equal to that threshold, layers 1 and 2 take over.
For a more detailed walkthrough, see Garbage Collection: Algorithms and Implementation (listed in the references).
Layer 1: Python's low-level memory allocator
Almost all objects used in Python are 256 bytes or less, and many of them are created and discarded almost immediately. Take a look at the following example.
```python
for x in range(100):
    print(x)
```
The Python script above converts each non-negative integer from 0 to 99 into a string and prints it. It makes heavy use of small, one-off strings.
In this case, if every request went to the layer 0 allocator individually, malloc() and free() would be called very frequently, which hurts efficiency.
As a result, Python does special processing internally when allocating very small objects. The layer 1 and layer 2 memory allocators are what actually perform this processing.
When an object of 256 bytes or less needs to be allocated, the layer 1 allocator is used. This layer reserves comparatively large chunks of memory from layer 0 in advance and keeps them on hand; layer 1's job is to manage this reserved space.
The memory structures managed at layer 1
Depending on the role and size of the memory being managed, the smallest unit is called a block; the address of a block is ultimately what gets returned to the requester. The unit above a block is a pool, which contains blocks, and a group of pools in turn lives inside an arena.
Arena > pool > block, like Russian nesting dolls. To avoid frequent calls to malloc() and free(), memory is reserved from layer 0 in arena-sized chunks, the largest unit. A pool is the unit used to manage free blocks efficiently. The word arena means "stadium": you can picture pools sitting inside an arena, with blocks floating around inside the pools, which may make it a little easier to understand.
arena
Arenas are the largest block of memory and are aligned on page boundaries in memory. A page boundary is the edge of a continuous block of fixed-length memory used by the operating system. Python assumes a system page size of 256 KB.
Each arena contains pools, which are 4 KB virtual memory pages. These are like the pages in our book analogy. The pools are in turn divided into smaller blocks of memory.
All blocks in a given pool belong to the same "size class." A size class defines a specific block size for a given range of request sizes; the mapping table can be found in the comments of the pymalloc source code.
With pymalloc:
- For small objects (<= 512 bytes), pymalloc serves the request from a block in a memory pool
- For objects larger than 512 bytes, PyMem_RawMalloc() and PyMem_RawRealloc() are used to request fresh memory from the system
For example, if 42 bytes are requested, the data is placed in a 48-byte block.
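As a rough sketch of the rounding rule implied by that example (assuming the classic 8-byte alignment; newer 64-bit CPython builds round to 16 bytes instead):

```python
def block_size_for(request_bytes, alignment=8):
    """Round a small request up to the next multiple of the alignment,
    i.e. the size of the block the small-object allocator would hand out."""
    return ((request_bytes + alignment - 1) // alignment) * alignment

print(block_size_for(42))   # 48
print(block_size_for(100))  # 104
```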
pool
The size of each pool within an arena is fixed at 4 KB. Since the virtual memory page size is 4 KB on almost every OS, the pool size was set to 4 KB to match.
Layer 1 summary
Layer 1's task can be summed up in one sentence: "manage arenas."
Layer 2: Python's object allocator
The layer 2 allocator manages the blocks inside a pool. This is the layer that actually returns a block's starting address to the requester, frees blocks, and so on. Let's look at how this layer manages blocks.
block
A pool is split into blocks. When we create an object in Python (one that requires 256 bytes or less), what we ultimately get is one of these blocks. The block size is decided when the pool is initialized: when a pool is first put to use, it is designated as, say, "the pool for 8-byte blocks," and it is then filled entirely with blocks of that size. How does a pool keep track of block state? A block can be in only one of three states:
- allocated: the block currently holds data
- free: the block was used but has since been freed by CPython
- untouched: the block has never been allocated
Layer 3: object-specific allocators
Objects come in a variety of types, such as lists and tuples, and each is created using its own type-specific allocator.
Summary of allocator
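To recap: small requests flow through arena > pool > block, while larger requests go straight to the layer 0 allocator. On CPython builds with pymalloc enabled (the default), you can dump the allocator's current state yourself; a minimal sketch:

```python
import sys

# Prints pymalloc statistics (arenas, pools, and blocks per size class) to stderr.
sys._debugmallocstats()
```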
Assignment statement memory analysis
We can check the memory address of an object by using the id() function. Everyone’s computer has a different memory address.
```python
a = 1
id(a)  # Output: 4566652048
b = 2
id(b)  # Output: 4566652080
c = 8
id(c)  # Output: 4566652272
d = 8
id(d)  # Output: 4566652272
```
Use == to check whether two objects have equal values; is checks whether they are the same object.
```python
c == d  # Output: True
c is d  # Output: True

e = 888888888
id(e)   # Output: 4569828784
f = 888888888
id(f)   # Output: 4569828880
e == f  # Output: True
e is f  # Output: False
```
Explanation: as we can see, `c == d` outputs True and `c is d` also outputs True, because for small ints Python gives c and d the same memory address out of the pool. Meanwhile `e == f` is True since the values are equal, but `e is f` prints False, because e and f are not the same object.
This comes down to how the Python memory pool hands out space and initializes objects. Small integers from -5 to 256 are used so frequently in Python scripts, and are immutable, that they are created only once and then reused.
e and f are large numbers outside that range, so each assignment gets a freshly allocated address. The integers between -5 and 256, on the other hand, have already been set up for us by Python in advance.
```python
>>> i = 256
>>> j = 256
>>> i is j
True
>>> i = 257
>>> j = 257
>>> i is j
False
>>> i = -5
>>> j = -5
>>> i is j
True
>>> i = -6
>>> j = -6
>>> i is j
False
```
(Java has a similar mechanism: a cache supports the object identity semantics of autoboxing for values between -128 and 127, inclusive, as required by the JLS.)
Next, look at the memory analysis of the object:
```python
li1 = []
li2 = []
li1 == li2  # Output: True
li1 is li2  # Output: False

x = 1
y = x
id(x)   # Output: 4566652048
id(y)   # Output: 4566652048
y = 2
id(y)   # Output: 4566652080
x == y  # Output: False
x is y  # Output: False
```
Consider recycling
Garbage collection mechanism
Take a look at garbage collection techniques in Python:
- Reference counting as the main mechanism
- Mark-and-sweep and generational collection as supplements
When an object's reference count drops to 0, the Python interpreter reclaims its memory. The weakness of reference counting is that it cannot handle circular references, which is why mark-and-sweep and generational collection are needed.
What is reference counting
- Each object keeps a count of the references pointing to it
- You can view an object's reference count with sys.getrefcount()
- You can remove a reference with the del keyword
```python
import sys

l = []
print(sys.getrefcount(l))  # Output: 2
l2 = l
l3 = l
l4 = l3
print(sys.getrefcount(l))  # Output: 5
del l2
print(sys.getrefcount(l))  # Output: 4

i = 1
print(sys.getrefcount(i))  # Output: 140
a = i
print(sys.getrefcount(i))  # Output: 141
```
When an object's reference count reaches zero, the interpreter immediately deallocates it, along with any objects that are reachable only through it. In other words, the count hitting zero is what triggers reclamation.
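A minimal illustration (Widget is a made-up class): in CPython, dropping the last reference deallocates the object on the spot.

```python
class Widget:
    def __del__(self):
        print("Widget deallocated")

w = Widget()
w = None  # the last reference disappears, the count hits 0, and __del__ runs immediately
print("after rebinding w")  # printed after "Widget deallocated"
```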
However, reference counting cannot solve the problem of circular references. The following code, if left running, will fill up the computer's memory:
```python
>>> a = []
>>> b = []
>>> while True:
...     a.append(b)
...     b.append(a)
...
[1] 31962 killed python
```
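A more direct sketch of a cycle that reference counting alone cannot free (Node is a hypothetical class; the cycle detector described below is what eventually reclaims it):

```python
import gc

class Node:
    pass

a = Node()
b = Node()
a.partner = b   # a -> b
b.partner = a   # b -> a: a reference cycle

del a
del b           # the counts never reach zero, so reference counting leaks the pair

unreachable = gc.collect()   # the cyclic garbage collector finds and frees them
print(unreachable)           # number of unreachable objects collected (at least 2)
```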
Mark-and-sweep
As Python's auxiliary garbage collection technique, the mark-and-sweep algorithm mainly deals with container objects such as list, dict, tuple, and class instances, because strings and numbers can never form circular references. Mark-and-sweep and generational collection exist precisely to solve circular references.
It works in two phases: the first is the mark phase, in which the GC marks all live objects; the second is the sweep phase, in which the unmarked, non-live objects are reclaimed.
Objects connected by references (pointers) form a directed graph: objects are the nodes and references are the edges. Starting from the root objects, the collector traverses along the directed edges; reachable objects are marked as live, and unreachable objects are the dead ones to be swept away. The root objects are things like global variables, the call stack, and registers.
In the diagram from the original article, block 1 can be reached directly from program variables, and blocks 2 and 3 can be reached indirectly. The program cannot reach blocks 4 and 5. The first step marks block 1 and remembers blocks 2 and 3 for later; the second step marks block 2; the third step marks block 3 but does not revisit block 2, since it is already marked. The sweep phase skips blocks 1, 2, and 3 because they are marked, and reclaims blocks 4 and 5.
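The traversal can be sketched in a few lines of Python over a toy object graph that mirrors the block 1-5 example (an illustration of the idea, not CPython's implementation):

```python
# Each "object" is an id mapping to the ids it references; 4 and 5 form an unreachable cycle.
heap = {1: [2, 3], 2: [3], 3: [], 4: [5], 5: [4]}
roots = [1]

def mark(heap, roots):
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue              # already marked (like block 2 being skipped)
        marked.add(obj)
        stack.extend(heap[obj])   # follow outgoing references
    return marked

def sweep(heap, marked):
    for obj in list(heap):
        if obj not in marked:
            del heap[obj]         # reclaim unreachable objects (blocks 4 and 5)

live = mark(heap, roots)
sweep(heap, live)
print(sorted(heap))               # [1, 2, 3]
```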
Python organizes these container objects in a doubly linked list. A clear drawback of this straightforward mark-and-sweep algorithm, however, is that it must scan the entire heap sequentially before it can remove the inactive objects, even when only a small number of objects are still alive.
Generational collection (automatic)
Generational collection is built on top of mark-and-sweep; it trades space for time.
- Python divides all objects into generations 0, 1, and 2
- Every newly created object starts in generation 0
- When an object survives a garbage collection pass, it is promoted to the next generation
Like mark-and-sweep, generational collection is an auxiliary garbage collection technique that works on container objects; the sketch below shows promotion in action.
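A small sketch of promotion (assumes Python 3.8+, where gc.get_objects() accepts a generation argument):

```python
import gc

x = []  # a brand-new container object starts out in generation 0
print(any(obj is x for obj in gc.get_objects(generation=0)))  # expected: True

gc.collect(0)  # run a generation-0 collection; survivors are promoted to generation 1

print(any(obj is x for obj in gc.get_objects(generation=1)))  # expected: True
```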
While Python runs, it keeps count of object allocations and object deallocations.
Garbage collection kicks in when the difference between the two exceeds a certain threshold.
- Check the thresholds with gc.get_threshold()
```python
import gc
print(gc.get_threshold())  # Output: (700, 10, 10)
```
The two 10s in the (700, 10, 10) returned by gc.get_threshold() mean that every 10 generation-0 collections are matched by one generation-1 collection, and every 10 generation-1 collections by one generation-2 collection. The idea is that the longer an object has already survived, the less likely it is to become garbage, so older generations are examined less often; that is the design philosophy behind generational collection.
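The gc module also exposes the live counters and lets you tune the thresholds; a small sketch:

```python
import gc

print(gc.get_count())  # current collection counters for generations 0, 1 and 2

# Example: make generation-0 collections less frequent by raising the first threshold.
gc.set_threshold(1000, 10, 10)
print(gc.get_threshold())  # (1000, 10, 10)
```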
Manual collection
See the gc module for details.
- gc.collect() triggers a manual collection
- objgraph.count() in the objgraph module reports how many instances of a given class are currently alive
```python
import gc

result = gc.collect()
print(result)
```
```python
import sys
import gc
import objgraph

class Person(object):
    pass

class Cat(object):
    pass

p = Person()
c = Cat()
p.name = 'yuzhou1su'
c.master = p          # the Cat instance holds a reference to the Person instance
print(sys.getrefcount(p))
print(sys.getrefcount(c))

del p
del c
gc.collect()
print(objgraph.count('Person'))  # live Person instances after collection
print(objgraph.count('Cat'))     # live Cat instances after collection
```
Once you have identified an object that is leaking memory, you can use objgraph.show_backrefs() to inspect the chain of references that keeps it alive.
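A minimal sketch of that (assumes the objgraph package and Graphviz are installed; 'Person' and the output filename are just examples):

```python
import objgraph

# Render a graph of everything still holding a reference to live Person instances.
objgraph.show_backrefs(objgraph.by_type('Person'), max_depth=3, filename='person_backrefs.png')
```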
Memory pool mechanism
Frequent allocation and release of small chunks of memory produces a large number of memory fragments and hurts efficiency.
The idea of a memory pool is to request a certain number of equally sized memory blocks in advance and keep them in reserve.
A pool consists of blocks of a single size class. Each pool maintains a doubly linked list to the other pools of the same size class, so the algorithm can easily find free space for a given block size, even across different pools.
When new memory is needed, it is served from the memory pool; only if the pool does not have enough space is new memory requested from the system.
A pool itself must be in one of three states:
- used: partially filled with allocated blocks
- full: all of its blocks are allocated
- empty: none of its blocks are allocated
Advantages: Reduces memory fragmentation and improves efficiency.
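To make the idea concrete, here is a toy free-list in Python (not CPython's implementation, just the reuse-instead-of-reallocate idea):

```python
class ToyPool:
    """Hands out fixed-size blocks and reuses freed ones instead of reallocating."""

    def __init__(self, block_size, capacity):
        self.block_size = block_size
        self.free_blocks = [bytearray(block_size) for _ in range(capacity)]  # reserve up front

    def alloc(self):
        if self.free_blocks:
            return self.free_blocks.pop()      # reuse a reserved block
        return bytearray(self.block_size)      # pool exhausted: fall back to a fresh allocation

    def free(self, block):
        self.free_blocks.append(block)         # hand the block back for reuse

pool = ToyPool(block_size=64, capacity=4)
buf = pool.alloc()
pool.free(buf)
```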
Conclusion
Memory management is a very important part of a computer. Python, like Java and Go, helps developers solve this problem from the language design level, eliminating the need to manually allocate and free memory, which is one of the advantages of such languages.
This article mainly covered:
- What memory management is and the ways it can be organized
- CPython's approach to memory management
- The garbage collection mechanism
- Python's automatic garbage collection: reference counting, mark-and-sweep, and generational collection
- Finally, manual collection with the gc module and the memory pool mechanism that improves how efficiently memory is used
Hopefully this article has sparked your interest in memory management and garbage collection. See you in the next article!
Reference article:
- Highly recommended reading for those interested in Memory Management: Memory Management in Python
- Algorithm and implementation of garbage collection mechanism
- www.cnblogs.com/TM0831/p/10…
- www.cnblogs.com/xybaby/p/74…
- www.jianshu.com/p/c2c960481…