An overview of Python memory optimization

A program that processes a lot of data, or is structurally complex, can consume a lot of memory while it runs. Once memory usage crosses a certain threshold, the operating system may terminate the process, and problems appear even sooner in environments that cap the amount of memory a program may use. Below are a few ways to reduce the memory footprint of Python programs.

Note: the following code runs on Python 3.

An example

Let's take a simple scenario: using Python to store three-dimensional coordinate data with fields x, y, and z.

Dict

Implementing the requirement above with Python's built-in dict data structure is simple:

>>> import sys
>>> ob = {'x': 1, 'y': 2, 'z': 3}
>>> x = ob['x']    # read a field
>>> ob['y'] = 5    # write a field

Check the size of the ob object:

>>> print(sys.getsizeof(ob))
240

That is a lot of memory for three small integers, and if you have millions of such records to store, the total quickly adds up:

Number of instances    Memory footprint
1 000 000              240 MB
10 000 000             2.4 GB
100 000 000            24 GB
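
If you want a rough check of these aggregate numbers, the standard-library tracemalloc module can measure what a run actually allocates. A minimal sketch (the containing list and the integer objects add some overhead on top of the 240 bytes per dict; the variable names are only illustrative):

import tracemalloc

tracemalloc.start()
points = [{'x': i, 'y': i, 'z': i} for i in range(1000000)]
current, peak = tracemalloc.get_traced_memory()   # bytes allocated now, and the peak
print("current: %.1f MiB" % (current / 2**20))
tracemalloc.stop()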

Class

Programmers who prefer object-oriented code would rather wrap the data in a class. The same requirement implemented with a class:

class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

>>> ob = Point(1, 2, 3)

The in-memory structure of a class instance is quite different from that of a dict. Let's look at the memory footprint in this case:

Field            Size (bytes)
PyGC_Head        24
PyObject_HEAD    16
__weakref__      8
__dict__         8
TOTAL            56

__weakref__ is a reference to the list of weak references to the object (see the documentation), and __dict__ points to the instance dictionary that holds the attributes assigned via self.xxx. Starting with Python 3.3, the keys of instance dictionaries are shared among all instances of a class, which reduces the per-instance footprint in RAM.
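
A rough way to see the effect of key sharing is to compare the instance dictionary with a standalone dict holding the same keys (the numbers below come from the runs above and vary across Python versions):

>>> ob = Point(1, 2, 3)
>>> print(sys.getsizeof(ob.__dict__))               # instance dict, keys shared with other instances
112
>>> print(sys.getsizeof({'x': 1, 'y': 2, 'z': 3}))  # standalone dict with its own key table
240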

>>> print(sys.getsizeof(ob), sys.getsizeof(ob.__dict__)) 
56 112
Number of instances    Memory footprint
1 000 000              168 MB
10 000 000             1.68 GB
100 000 000            16.8 GB

As you can see, a class instance takes less memory than a dict, but it is still far from compact.

__slots__

From the memory layout above we can see that eliminating __dict__ and __weakref__ would significantly reduce the size of a class instance in RAM. We can achieve this with __slots__:

class Point:
    __slots__ = 'x', 'y', 'z'

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

>>> ob = Point(1, 2, 3)
>>> print(sys.getsizeof(ob))
64

You can see that the memory footprint is significantly reduced:

Field            Size (bytes)
PyGC_Head        24
PyObject_HEAD    16
x                8
y                8
z                8
TOTAL            64

Number of instances    Memory footprint
1 000 000              64 MB
10 000 000             640 MB
100 000 000            6.4 GB

By default, instances of both new-style and classic Python classes have a __dict__ for storing the instance's attributes. This is fine in general and gives you the flexibility to set new attributes at will in your program. However, for small classes whose attributes are fixed and known up front, that per-instance dict is a waste of memory.

This problem becomes particularly acute when a large number of instances must be created. One solution is to define a __slots__ attribute on the class.

The __slots__ declaration lists the instance variables and reserves just enough space in each instance to hold them; Python saves space by no longer creating a __dict__ for every instance.

So is __slots__ always the right choice? Not necessarily; using __slots__ has side effects (illustrated by the sketch after this list):

  1. Every subclass has to declare __slots__ again; a subclass without its own __slots__ gets a __dict__ back and loses the savings.
  2. Instances can only have the attributes listed in __slots__, which reduces flexibility. If you later need to set a new attribute, say instance.a = 1, and a is not in __slots__, you have to keep extending __slots__ or work around it some other way.
  3. Instances cannot be the target of weak references unless '__weakref__' is added to __slots__.
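
The sketch below illustrates the first two points; the class names and the extra attribute are only illustrative:

class Point:
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class Point4D(Point):
    __slots__ = ('t',)        # a subclass must declare its own __slots__,
                              # otherwise its instances get a __dict__ again

p = Point(1, 2, 3)
try:
    p.label = 'origin'        # 'label' is not listed in __slots__
except AttributeError as e:
    print(e)                  # 'Point' object has no attribute 'label'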

Finally, for those interested, the namedlist and attrs libraries can create classes with __slots__ automatically.
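
As a taste of what that looks like, here is a minimal sketch with attrs (assuming the attrs package is installed); attrs generates __init__ and the __slots__ layout for you:

import attr

@attr.s(slots=True)
class Point:
    x = attr.ib()
    y = attr.ib()
    z = attr.ib()

ob = Point(1, 2, 3)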

Tuple

Python also has a built-in tuple type for representing immutable data. A tuple is a fixed structure or record without field names; fields are accessed by index. The fields of a tuple are bound to their value objects once and for all, when the instance is created:

>>> ob = (1, 2, 3)
>>> x = ob[0]
>>> ob[1] = y  # ERROR

A tuple instance is quite compact:

>>> print(sys.getsizeof(ob))
72

It is only 8 bytes larger than the __slots__ version:

Field            Size (bytes)
PyGC_Head        24
PyObject_HEAD    16
ob_size          8
[0]              8
[1]              8
[2]              8
TOTAL            72

Namedtuple

With namedtuple we can also access the elements of a tuple by name:

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ('x', 'y', 'z'))
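
Fields of the resulting class can be read by name or by index, and the instance is the same size as the plain tuple above:

>>> ob = Point(1, 2, 3)
>>> ob.x, ob[1]
(1, 2)
>>> print(sys.getsizeof(ob))
72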

It creates a subclass of tuple that defines descriptors for accessing the fields by name. For our example it looks roughly like this:

class Point(tuple):

    @property
    def x(self):
        return self[0]

    @property
    def y(self):
        return self[1]

    @property
    def z(self):
        return self[2]

    def __new__(cls, x, y, z):
        return tuple.__new__(cls, (x, y, z))

Instances of this class have the same memory footprint as plain tuples; for large amounts of data the totals are:

Number of instances    Memory footprint
1 000 000              72 MB
10 000 000             720 MB
100 000 000            7.2 GB

Recordclass

Python's third-party recordclass library provides the recordclass.mutabletuple data structure, which is almost identical to the built-in tuple but is mutable and uses less memory.

>>> from recordclass import recordclass
>>> Point = recordclass('Point', ('x', 'y', 'z'))
>>> ob = Point(1, 2, 3)
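
Unlike a namedtuple, the fields of such an instance can be reassigned in place:

>>> ob.x = 10
>>> ob.x
10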

Compared with the tuple layout, only PyGC_Head is gone:

Field            Size (bytes)
PyObject_HEAD    16
ob_size          8
x                8
y                8
z                8
TOTAL            48

Here the memory footprint is further reduced compared to the __slots__ version:

Number of instances    Memory footprint
1 000 000              48 MB
10 000 000             480 MB
100 000 000            4.8 GB

Dataobject

Recordclass offers another solution: a class whose in-memory layout is the same as a __slots__-based class, but whose instances do not participate in cyclic garbage collection. Such a class is created with recordclass.make_dataclass:

>>> from recordclass import make_dataclass
>>> Point = make_dataclass('Point', ('x', 'y', 'z'))
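
The generated class is used like any other; a quick sketch (the field values are illustrative):

>>> ob = Point(1, 2, 3)
>>> ob.y
2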

Another way is to inherit from dataobject:

from recordclass import dataobject

class Point(dataobject):
    x: int
    y: int
    z: int

Instances of classes created this way do not participate in cyclic garbage collection. Their in-memory layout is the same as with __slots__, but without PyGC_Head:

Field            Size (bytes)
PyObject_HEAD    16
x                8
y                8
z                8
TOTAL            40

>>> ob = Point(1, 2, 3)
>>> print(sys.getsizeof(ob))
40

Fields are accessed through special descriptors stored in the class dictionary; each descriptor looks the value up by its offset from the start of the object:

mappingproxy({'__new__': <staticmethod at 0x7f203c4e6be0>,
              .......................................
              'x': <recordclass.dataobject.dataslotgetset at 0x7f203c55c690>,
              'y': <recordclass.dataobject.dataslotgetset at 0x7f203c55c670>,
              'z': <recordclass.dataobject.dataslotgetset at 0x7f203c55c410>})
Number of instances    Memory footprint
1 000 000              40 MB
10 000 000             400 MB
100 000 000            4.0 GB

Cython

Yet another approach is based on Cython. Its advantage is that fields can hold values of atomic C types. For example:

cdef class Point:
    cdef public int x, y, z

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
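
One way to try this out, assuming the cdef class above is saved in a file called point.pyx (the file name is just an assumption of this sketch), is to compile it on the fly with pyximport; Cython must be installed:

# a minimal sketch; point.pyx sits next to this script
import pyximport
pyximport.install()

from point import Point

ob = Point(1, 2, 3)
print(ob.x, ob.y, ob.z)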

In this case, the memory footprint is smaller:

>>> ob = Point(1, 2, 3)
>>> print(sys.getsizeof(ob))
32

The memory structure is as follows:

Field            Size (bytes)
PyObject_HEAD    16
x                4
y                4
z                4
(padding)        4
TOTAL            32
Number of instances    Memory footprint
1 000 000              32 MB
10 000 000             320 MB
100 000 000            3.2 GB

However, every access from Python code requires a conversion from the C int to a Python object and back.

Numpy

If you stay in pure Python, NumPy gives even better results for this kind of data. For example:

>>> import numpy
>>> Point = numpy.dtype([('x', numpy.int32), ('y', numpy.int32), ('z', numpy.int32)])

Create an array of N such points, initialized with zeros:

>>> points = numpy.zeros(N, dtype=Point)
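
Records and whole columns can then be addressed by field name, and nbytes reports the size of the underlying buffer; a quick sketch assuming N = 1 000 000:

>>> points[0] = (1, 2, 3)      # set one record
>>> points[0]['x']
1
>>> points['y'].sum()          # operate on a whole column at once
2
>>> points.nbytes              # 3 * int32 = 12 bytes per point
12000000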
Number of instances    Memory footprint
1 000 000              12 MB
10 000 000             120 MB
100 000 000            1.2 GB

Conclusion

As you can see, there is still plenty of room for optimizing how much memory Python uses. Python's convenience sometimes comes at the cost of extra resources; when memory becomes the bottleneck, choosing one of the data structures above can give a much smaller footprint and a better overall experience.