1. Introduction

It was recently discovered that pickle is faster than joblib at loading a large dict.

I searched online for comparisons of pickle and joblib and found that little has been written on the subject.

This article therefore tests and analyzes the topic "the difference between pickle and joblib when loading a dict".

2. Benchmarking pickle and joblib dump/load of a dict

Use the following code to first create a large dict, then run dump/load tests with pickle and joblib.

import time
import pickle
import joblib as jl


def get_big_dict():
    d = {}
    for i in range(10000000):
        key = 'k' + str(i)
        value = i
        d[key] = value
    return d


def test_pickle_dump_load_big_dict(d):
    dump_time_cost_list = []
    load_time_cost_list = []
    for i in range(100):  # dump/load 100 times, take the average
        # dump
        t1 = time.time()
        fw = open('tmpfile.bin', 'wb')
        pickle.dump(d, fw)
        fw.close()
        t2 = time.time()
        # load
        fr = open('tmpfile.bin', 'rb')
        d2 = pickle.load(fr)
        fr.close()
        t3 = time.time()
        dump_time_cost_list.append(t2 - t1)
        load_time_cost_list.append(t3 - t2)
    print('pickle dump time cost ave: {0}'.format(sum(dump_time_cost_list) / len(dump_time_cost_list)))
    print('pickle load time cost ave: {0}'.format(sum(load_time_cost_list) / len(load_time_cost_list)))


def test_joblib_dump_load_big_dict(d):
    dump_time_cost_list = []
    load_time_cost_list = []
    for i in range(100):  # dump/load 100 times, take the average
        # dump
        t1 = time.time()
        jl.dump(d, 'tmpfile.bin')
        t2 = time.time()
        # load
        d2 = jl.load('tmpfile.bin')
        t3 = time.time()
        dump_time_cost_list.append(t2 - t1)
        load_time_cost_list.append(t3 - t2)
    print('joblib dump time cost ave: {0}'.format(sum(dump_time_cost_list) / len(dump_time_cost_list)))
    print('joblib load time cost ave: {0}'.format(sum(load_time_cost_list) / len(load_time_cost_list)))


if __name__ == '__main__':
    d = get_big_dict()
    test_pickle_dump_load_big_dict(d)
    test_joblib_dump_load_big_dict(d)

This code first generates a large dict, then runs pickle/joblib dump/load on the dict 100 times each, measures the dump/load time, and averages the 100 results. The output of one run was as follows:

pickle dump time cost ave: 0.00247844934463501
pickle load time cost ave: 0.0010942578315734862
joblib dump time cost ave: 0.006253149509429932
joblib load time cost ave: 0.0012739634513854981

Across several runs on Python 3.6, pickle is about 20% faster than joblib at loading the data. pickle is also faster than joblib at dumping the data. The result is stable and reproducible.

So why is pickle faster at loading data such as a dict?

3. How joblib loads data

Let’s look at the source code to find the logic that joblib uses to load data.

Reference 2 contains the entry point joblib uses to load data, condensed below:


def load(filename, mmap_mode=None):
    fobj = filename
    filename = getattr(fobj, 'name', '')
    with _read_fileobject(fobj, filename, mmap_mode) as fobj:
        obj = _unpickle(fobj)
    return obj

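The mmap_mode argument above hints at joblib's memory-map support. The underlying OS facility is available directly from the standard library; here is a minimal sketch (independent of joblib, with arbitrary file names) of mapping a file read-only:

```python
import mmap
import os
import tempfile

# Write a small binary file to map.
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(path, 'wb') as f:
    f.write(b'hello mmap')

# Map the file read-only: pages are faulted in on access rather than
# copied up front, and several processes can map the same file.
with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    head = bytes(mm[:5])  # slicing reads only the touched pages
    mm.close()
```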

As you can see, joblib supports a memory-map mechanism. Reference 1 notes that this lets joblib share memory across processes, which pickle does not. The actual deserialization is done by _unpickle(), but before that the file object passes through _read_fileobject(), whose source (see Reference 3) simplifies as follows:

def _read_fileobject(fileobj, filename, mmap_mode=None):
    compressor = _detect_compressor(fileobj)

    if compressor == 'compat':
        pass  # legacy path, omitted here
    else:
        if compressor in _COMPRESSORS:
            inst = compressor_wrapper.decompressor_file(fileobj)
            fileobj = _buffered_read_file(inst)
        yield fileobj
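For illustration, the kind of magic-byte sniffing that _detect_compressor() performs can be sketched with the standard library alone (gzip only; detect_compressor here is a hypothetical stand-in, not joblib's API):

```python
import gzip
import os
import tempfile

_GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream

def detect_compressor(fileobj):
    # peek() returns buffered bytes without moving the file position
    prefix = fileobj.peek(2)[:2]
    return 'gzip' if prefix == _GZIP_MAGIC else 'not-compressed'

path = os.path.join(tempfile.mkdtemp(), 'demo.gz')
with open(path, 'wb') as f:
    f.write(gzip.compress(b'payload'))

with open(path, 'rb') as f:  # open(..., 'rb') already yields a BufferedReader
    kind = detect_compressor(f)
```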

As the source above shows, before reading the data joblib detects the file's compression format and decompresses the file if needed. The other key function here is _buffered_read_file(). Its source (see Reference 4) simplifies as follows:

def _buffered_read_file(fobj):
    """Return a buffered version of a read file object."""
    return io.BufferedReader(fobj, buffer_size=_IO_BUFFER_SIZE)

So at the lowest level, joblib loads data through the standard library's io.BufferedReader(). See Reference 5 for its usage.
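To see what _buffered_read_file() amounts to, here is a sketch that wraps an unbuffered raw file in io.BufferedReader (the 1 KiB buffer size and file name are arbitrary choices for illustration):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'buf.bin')
with open(path, 'wb') as f:
    f.write(b'abcdef' * 1000)

raw = open(path, 'rb', buffering=0)                  # raw io.FileIO, no buffer
buffered = io.BufferedReader(raw, buffer_size=1024)  # explicit buffering layer
first = buffered.read(6)                             # served from the buffer
buffered.close()                                     # closes the raw file too
```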

4. How pickle loads data

See Reference 6 for the source code of pickle's load, simplified as follows:

def load(self):
    while True:
        key = read(1)
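The opcode stream that this load() loop walks one byte at a time can be inspected with the standard pickletools module:

```python
import io
import pickle
import pickletools

# Disassemble the pickle of a small dict into its opcode listing.
data = pickle.dumps({'a': 1})
out = io.StringIO()
pickletools.dis(data, out=out)
listing = out.getvalue()
# load() reads one opcode byte at a time and dispatches on it;
# the listing shows e.g. EMPTY_DICT for building the dict and STOP at the end.
```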

See Reference 7 for the implementation of read(); simplified:

def read(self, size=-1):
    b = bytearray(size.__index__())
    n = self.readinto(b)
    return bytes(b)

We won't follow readinto() any further here, but it is already clear that pickle does not perform decompression, memory mapping, or similar steps before loading the data, as joblib does.
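A consequence is that pickle can deserialize straight from any binary stream with no extra layers in between; a minimal round-trip sketch:

```python
import io
import pickle

d = {'k' + str(i): i for i in range(5)}

buf = io.BytesIO()
pickle.dump(d, buf)
buf.seek(0)
d2 = pickle.load(buf)  # reads the opcode stream directly, no decompression step
```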

5. Summary

This article benchmarked joblib and pickle loading a dict and analyzed the relevant source code, reaching the following conclusions:

  1. pickle is about 20 percent faster than joblib at loading dict data, a result reproduced in almost every trial.
  2. pickle is also faster than joblib at dumping dict data, though the margin varies between trials and is harder to reproduce precisely.
  3. Source analysis shows that joblib does extra work when loading data (listed below), which is probably why it loads data more slowly:
    • detecting the compression format and decompressing the data
    • memory-mapping-related processing

Note that this article has explored only one topic: the difference between pickle and joblib when loading a dict. The procedures and conclusions presented here are valid only for that topic; for loading data such as NumPy arrays, or other scenarios different from this one, the conclusions may change.

Reference 1 also reports the following conclusions:

  1. joblib is faster at dumping/loading large NumPy arrays.
    • This article does not test that claim.
  2. pickle can be faster when large NumPy arrays are not being dumped/loaded.
    • This article supports that claim with test results and source code analysis.

References

  1. Stackoverflow.com/questions/1…

  2. Github.com/joblib/jobl…

  3. Github.com/joblib/jobl…

  4. Github.com/joblib/jobl…

  5. Docs.python.org/zh-cn/3/lib…

  6. Github.com/python/cpyt…

  7. Github.com/python/cpyt…