Introduction

Since Python 3, str has held Unicode text (note that this is not UTF-8, although .py files default to UTF-8). Storing every Unicode code point at full width would take 4 bytes per character, which would undoubtedly waste memory.

Unicode is a character set. To make text easier to transmit and to save storage space, encoding schemes such as UTF-8 and UTF-16 were derived from it. Python stores strings internally in a similar way.
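As a rough illustration of how one character can occupy a different number of bytes under each encoding scheme, the sketch below encodes a few sample characters with the built-in str.encode. The helper name encoded_sizes and the sample characters are chosen here for illustration only.

```python
def encoded_sizes(ch):
    """Return the byte length of ch under three common encoding schemes."""
    # The "-le" codec variants are used so that no byte-order mark is added.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, the Chinese character U+4E2D, and an emoji (U+1F600).
for ch in ("z", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always spends 4 bytes, while UTF-8 and UTF-16 spend fewer bytes on the characters that need fewer.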

Three internal representations of Unicode strings

To reduce memory consumption, Python uses three different unit lengths to represent strings:

  • 1 byte per character (Latin-1)
  • 2 bytes per character (UCS-2)
  • 4 bytes per character (UCS-4)

The string structure is defined in the source code:

/* Include/unicodeobject.h */
typedef uint32_t Py_UCS4;
typedef uint16_t Py_UCS2;
typedef uint8_t Py_UCS1;

/* Include/cpython/unicodeobject.h */
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

If all characters in a string fall within the Latin-1 range (which includes ASCII), the string is stored with one byte per character. If any character requires two bytes (such as a Chinese character), the entire string is stored in the two-byte UCS-2 form.
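The selection rule can be approximated in pure Python by looking at the largest code point in the string. This is only a sketch of the rule described above, not CPython's actual C implementation; the function name unit_width is made up for this example.

```python
def unit_width(s):
    """Guess how many bytes per character CPython would use to store s."""
    if not s:
        return 1  # assumption: treat the empty string as the narrowest form
    largest = max(map(ord, s))
    if largest < 0x100:      # fits in Latin-1
        return 1
    if largest < 0x10000:    # fits in UCS-2
        return 2
    return 4                 # needs UCS-4


print(unit_width("hello"))            # ASCII only
print(unit_width("z\u4e2d"))          # contains a Chinese character
print(unit_width("z\U0001F600"))      # contains an emoji
```

A single wide character is enough to raise the width used for every character in the string.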

This can be verified with the sys.getsizeof function:

For example, 'zh' requires 1 byte more storage space than 'z', so h takes up 1 byte;

Storing 'z中' takes 2 bytes more than storing '中', so z takes up 2 bytes here.
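A minimal sketch of that measurement, assuming CPython 3.3 or later (sys.getsizeof reports implementation-specific sizes, so other interpreters may behave differently). The helper extra_bytes is a name invented for this example, and the strings are built at run time so that no cached data distorts the numbers.

```python
import sys


def extra_bytes(base, ch):
    """Bytes added to the in-memory size of base by appending ch."""
    return sys.getsizeof(base + ch) - sys.getsizeof(base)


han = chr(0x4E2D)  # the Chinese character 中, created at run time

print(extra_bytes("z", "h"))   # 'zh' compared with 'z'
print(extra_bytes(han, "z"))   # '中z' compared with '中'
```

On CPython the first difference should be 1 and the second 2: the same ASCII letter costs twice as much once the string has switched to the two-byte form.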

For most natural languages, the 2-byte form is sufficient. But if a gigabyte of ASCII text is loaded into memory and a single emoji is then inserted into it, the space required for the string, surprisingly, quadruples.
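The effect can be reproduced on a smaller scale. The following sketch, again assuming CPython, uses one million ASCII characters instead of a gigabyte and appends one emoji; the variable names are illustrative.

```python
import sys

ascii_text = "a" * 1_000_000
with_emoji = ascii_text + chr(0x1F600)  # append a single emoji

# Compare the in-memory size of the two strings.
ratio = sys.getsizeof(with_emoji) / sys.getsizeof(ascii_text)
print(f"{ratio:.2f}")
```

The ratio should come out very close to 4, because every character in the second string is now stored in 4 bytes.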

Why not use UTF-8 internally

UTF-8 is the most popular Unicode encoding scheme, yet Python doesn't use it internally. Why?

This is where the downside of UTF-8 shows. In this scheme the byte length of each character varies, which makes it impossible to jump directly to an individual character: evaluating string[n] on UTF-8 data requires scanning the bytes of the first n characters. Indexing would therefore go from O(1) to O(n), which is unacceptable.
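To make the cost concrete, here is a hypothetical sketch of what indexing into raw UTF-8 bytes involves. The functions sequence_length and utf8_char_at are invented for this illustration and assume well-formed UTF-8 input; they are not part of Python or CPython.

```python
def sequence_length(first_byte):
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not the first byte of a UTF-8 sequence")


def utf8_char_at(data, n):
    """Return character number n of UTF-8 encoded bytes by scanning from the start."""
    pos = 0
    for _ in range(n):  # must step over every earlier character: O(n)
        pos += sequence_length(data[pos])
    return data[pos:pos + sequence_length(data[pos])].decode("utf-8")


encoded = "a\u4e2db\U0001F600c".encode("utf-8")
print(utf8_char_at(encoded, 4))
```

With fixed-width storage the same lookup is a single multiplication and a memory read, regardless of n.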

So Python internally stores each string with a fixed width per character.

String interning mechanism

Another way to save memory is to keep some short strings in a pool and to check whether an equal string is already in the pool before creating a new string object. Internally, only strings that consist of underscores (_), letters, and digits and are at most 20 characters long can be interned. Interning is done during code compilation, and the following kinds of strings are checked for interning:

  • The empty string '';
  • Variable names;
  • Parameter names;
  • String constants (all strings defined in code);
  • Dictionary keys;
  • Attribute names.
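Strings built at run time are not covered by the compile-time cases listed above, but the same pool can be reached explicitly through the standard sys.intern function. The sketch below, assuming CPython, shows two equal strings that start out as separate objects and end up sharing one object after interning; the helper build is a name chosen for this example.

```python
import sys


def build(parts):
    """Join parts at run time so the result is not a compile-time constant."""
    return "".join(parts)


a = build(["intern", "_", "demo"])
b = build(["intern", "_", "demo"])

print(a == b)                          # equal contents
print(a is b)                          # but two separate objects
print(sys.intern(a) is sys.intern(b))  # one shared object after interning
```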

The interning mechanism saves a lot of memory when the same strings are repeated. Internally, the intern pool is maintained by a global dict that uses the strings as keys:

void PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;

    // Type and state checks on the string object
    if (s == NULL || !PyUnicode_Check(s))
        return;
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;

    // Create the dict used for interning
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }

    // Check whether an equal object already exists in interned
    t = PyDict_SetDefault(interned, s, s);
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }

    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

The call PyDict_SetDefault(interned, s, s) stores the string as both the key and the value, so the reference count of the string object is incremented twice. Left that way, the count of an object stored in the dictionary would never reach 0 before the program ends, which is why Py_REFCNT(s) -= 2 subtracts 2 from the count.

As the function's argument shows, the string object has already been created by the time interning happens. A string object is always created first; if the intern check finds an equal string in the pool, the temporary object is destroyed once its reference count drops to 0. The temporary object is short-lived and disappears from memory quickly.

Single-character buffer pool

In addition to the intern pool, Python also caches the single-character strings for every code point in the Latin-1 range, which includes all ASCII characters:

static PyObject *unicode_latin1[256] = {NULL};

If the string consists of a single character, it is fetched from the buffer pool first:

[unicodeobject.c]
PyObject * PyUnicode_DecodeUTF8Stateful(const char *s,
                             Py_ssize_t size,
                             const char *errors,
                             Py_ssize_t *consumed)
{
    ...

    /* ASCII is equivalent to the first 128 ordinals in Unicode. */
    if (size == 1 && (unsigned char)s[0] < 128) {
        return get_latin1_char((unsigned char)s[0]);
    }
    ...
}

The character is then also stored in the intern pool, so the intern pool and the buffer pool both refer to the same string object.

Strictly speaking, this single-character buffer pool is not a memory-saving scheme, because almost every object taken from it ends up in the intern pool anyway. Its purpose is to reduce the number of string objects that have to be created.
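The cache can be observed from Python by comparing object identity. The sketch below relies on a CPython implementation detail and may behave differently on other interpreters; the helper same_object is a name made up for this example.

```python
def same_object(code_point):
    """Report whether two chr() calls return the very same object."""
    return chr(code_point) is chr(code_point)


print(same_object(ord("a")))   # within the cached Latin-1 range
print(same_object(0x4E2D))     # the Chinese character 中, outside that range
```

On CPython the first call should report True, since both results come from the buffer pool, while the second should report False, since each chr call builds a new object.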

Conclusion

This article introduced two ways in which Python saves memory for strings. Every character in a string occupies the same amount of space, and that amount is determined by the widest character in the string.

Short strings are placed in a global dictionary so that each one exists only once, which saves memory.