Image from HBO series Silicon Valley

1. Where is the training bottleneck?

  • Low GPU utilization: GPU memory is full during training, but GPU utilization is unstable, jumping between 0% and 90%, never staying high.

  • Large training set: with millions or tens of millions of samples, a single epoch takes a long time and the model iteration cycle becomes too long.

2. Improve GPU utilization: CPU vs GPU

GPU utilization is low because the CPU cannot prepare data as fast as the GPU can consume it.

2.1 CPU vs GPU Communication

  • The CPU is responsible for loading and preprocessing data, constantly moving data between host memory and GPU memory
  • The GPU is responsible for the actual model training (a rough timing sketch to check which side is the bottleneck follows this list)
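In practice you can watch the utilization fluctuate with nvidia-smi; to see which side is the bottleneck, a rough timing split helps. The following is only a sketch: data_iter and step_fn are placeholders for your batch iterator and one training step, and on an asynchronous GPU framework the step function should synchronize before returning (e.g. torch.cuda.synchronize() in PyTorch) for the timing to be meaningful.

    import time

    def profile_pipeline(data_iter, step_fn, num_steps=50):
        """Split wall time between CPU-side data loading and the GPU training step.

        data_iter yields ready-to-train batches (CPU work); step_fn(batch) runs one
        forward/backward pass (GPU work) and should synchronize before returning.
        """
        load_time = step_time = 0.0
        for _ in range(num_steps):
            t0 = time.perf_counter()
            batch = next(data_iter)      # CPU: load + preprocess
            t1 = time.perf_counter()
            step_fn(batch)               # GPU: forward + backward
            t2 = time.perf_counter()
            load_time += t1 - t0
            step_time += t2 - t1
        print(f"data loading: {load_time:.1f}s  |  GPU steps: {step_time:.1f}s")

If data loading dominates, the fixes below apply; if the GPU step dominates, distributed training (section 3) is the next lever.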

2.2 Solutions

Use multi-process parallelism to speed up CPU-side data loading.

  • Keras provides the workers and use_multiprocessing arguments to load and preprocess data in parallel across processes and push batches into a queue for the GPU to consume. workers is typically set to 2, 4 or 8; since processes compete with each other for resources, larger is not always better. (A sketch of a suitable Sequence-based generator appears after these two examples.)

    run_model.fit_generator(
                generator=training_generator,
                class_weight={0: config.weights, 1: 1},
                epochs=epochs,
                verbose=1,
                steps_per_epoch=steps_per_epoch,
                callbacks=callbacks_list,
                validation_data=valid_generator,
                validation_steps=validation_steps,
                shuffle=True,
                workers=8,
                use_multiprocessing=True,
                max_queue_size=20)
  • PyTorch provides the similar num_workers parameter in its DataLoader. Setting pin_memory=True places batches in page-locked (pinned) host memory, which speeds up the copy into GPU memory.

    torch.utils.data.DataLoader(image_datasets[x],
                                batch_size=batch_size, 
                                shuffle=True,
                                num_workers=8,
                                pin_memory=True)
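The training_generator above is not shown; with use_multiprocessing=True, Keras recommends a keras.utils.Sequence subclass rather than a plain generator, because a Sequence is index-based and can be split safely across worker processes. A minimal sketch (the array names and the normalization step are just examples):

    import numpy as np
    from keras.utils import Sequence

    class TrainingSequence(Sequence):
        """Index-based batches; safe with workers > 0 and use_multiprocessing=True."""
        def __init__(self, x, y, batch_size=128):
            self.x, self.y, self.batch_size = x, y, batch_size

        def __len__(self):
            # Number of batches per epoch
            return int(np.ceil(len(self.x) / self.batch_size))

        def __getitem__(self, idx):
            sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
            # CPU-side preprocessing lives here (e.g. scaling to [0, 1])
            return self.x[sl] / 255.0, self.y[sl]

    training_generator = TrainingSequence(x_train, y_train)

On the PyTorch side, pin_memory=True pairs naturally with tensor.to(device, non_blocking=True) inside the training loop, letting the host-to-GPU copy overlap with computation.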

3. Distributed parallel training

3.1 Parallel Modes

When the training data is very large, training can be sped up with multiple GPUs on multiple machines. Unlike distributed data-processing frameworks such as Hadoop and Spark, deep learning training involves forward and backward propagation of parameters, and there are two parallel modes:

  • Model parallelism: different machines (GPUs/CPUs, etc.) in the distributed system are responsible for different parts of the network, typically by assigning different layers, or different parameters within a layer, to different machines. This is used when the model is too large to fit on a single card, such as large NLP models. Its drawback is that dependencies between layers prevent fully parallel execution (a minimal sketch follows this list).

  • Data parallelism: every machine holds a complete copy of the same model, each machine is fed different data, and the results from all machines are then combined in some way. This suits large datasets; the problems to solve are how to split and transfer the data and how to update the parameters.
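To make the difference concrete, here is a minimal model-parallel sketch in PyTorch, assuming two visible GPUs: each block of layers lives on its own device and the activations are moved between them, which is exactly where the inter-layer dependency shows up.

    import torch
    import torch.nn as nn

    class TwoDeviceNet(nn.Module):
        """Toy model parallelism: first block on cuda:0, second block on cuda:1."""
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to('cuda:0')
            self.part2 = nn.Linear(512, 10).to('cuda:1')

        def forward(self, x):
            x = self.part1(x.to('cuda:0'))
            return self.part2(x.to('cuda:1'))   # activations hop between GPUs

    model = TwoDeviceNet()
    logits = model(torch.randn(32, 1024))       # output tensor lives on cuda:1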

3.2 Data Parallelism

In "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", Facebook describes data-parallel training of the ResNet-50 network on 256 GPUs.

  • Data splitting: a large batch size is chosen, split according to the number of workers, and distributed to the workers for execution (a small worked example follows this list)
  • Parameter update: there are two modes, (1) a parameter server, or (2) ring all-reduce (no central server)
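As a worked example of the data-splitting step, here is the headline configuration reported in the Facebook paper (a global minibatch of 8192 images on 256 GPUs) together with its linear learning-rate scaling rule; the 0.1 reference rate for a 256-image batch is the paper's baseline setting.

    num_workers = 256                 # GPUs used in the ResNet-50 experiment
    global_batch = 8192               # large minibatch chosen up front
    per_worker_batch = global_batch // num_workers         # 32 images per GPU

    # Linear scaling rule: multiply the reference learning rate by the
    # ratio of the large batch to the reference batch size.
    base_lr, base_batch = 0.1, 256
    scaled_lr = base_lr * global_batch / base_batch         # 3.2
    print(per_worker_batch, scaled_lr)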

3.2.1 Parameter Server Mode

Parameter-server mode is shown below. After each worker finishes a batch and back-propagates, all workers send their parameters to the parameter server, which aggregates and averages them and then sends the updated parameters back to every worker for the next batch.

The parameter server can consist of one or more nodes. Whether this form of data parallelism actually improves efficiency depends on the communication between the parameter server and the workers: a step is bounded by the training time of the slowest worker plus the time for the server to receive, update and send back the parameters. With many workers, the parameter server can become a bottleneck.
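A toy NumPy simulation of one synchronous round in parameter-server mode (the gradients and learning rate are stand-ins): every worker computes its own gradient, the server averages them, applies the update once and sends the new parameters back to all workers.

    import numpy as np

    def parameter_server_round(params, worker_grads, lr=0.1):
        """One synchronous step: aggregate, average, update, broadcast back."""
        avg_grad = np.mean(worker_grads, axis=0)          # server aggregates and averages
        new_params = params - lr * avg_grad               # server applies the update
        return [new_params.copy() for _ in worker_grads]  # send back to every worker

    params = np.zeros(4)
    worker_grads = [np.random.randn(4) for _ in range(8)]   # 8 workers, toy gradients
    worker_params = parameter_server_round(params, worker_grads)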

3.2.2 Ring All-reduce

Ring all-reduce, proposed by Baidu, abandons the parameter server and uses a ring structure to update parameters. All workers are arranged in a ring and each worker exchanges parameters only with its neighbours; after several rounds of exchange, every worker holds the parameter information of all the others, completing the update.


Some of the steps are illustrated in the images below. To speed things up, ring all-reduce does not exchange all parameters at once: the parameters are first divided into chunks, and the chunks are exchanged round by round (a toy simulation follows).
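Here is a toy NumPy simulation of the whole procedure, under the usual formulation: each worker's vector is split into N chunks; during the scatter-reduce phase each worker keeps passing one chunk to its right-hand neighbour, which adds it to its own copy, and during the all-gather phase the fully reduced chunks are circulated the same way, so after 2(N-1) neighbour-to-neighbour steps every worker holds the complete sum.

    import numpy as np

    def ring_allreduce(worker_data):
        """Simulate ring all-reduce over a list of equal-length vectors."""
        n = len(worker_data)
        chunks = [np.array_split(v.astype(float), n) for v in worker_data]

        # Scatter-reduce: at step s, worker i sends chunk (i - s) % n to worker
        # (i + 1) % n, which adds it to its own copy of that chunk.
        for s in range(n - 1):
            sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy()) for i in range(n)]
            for i, c, data in sends:
                chunks[(i + 1) % n][c] += data
        # Worker i now holds the fully reduced chunk (i + 1) % n.

        # All-gather: circulate the finished chunks so everyone gets all of them.
        for s in range(n - 1):
            sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy()) for i in range(n)]
            for i, c, data in sends:
                chunks[(i + 1) % n][c] = data

        return [np.concatenate(ch) for ch in chunks]

    workers = [np.arange(8) * (w + 1) for w in range(4)]   # 4 workers, toy "gradients"
    result = ring_allreduce(workers)
    assert all(np.allclose(r, sum(workers)) for r in result)  # everyone ends with the sum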

4. Implementation framework: Horovod

Horovod is Uber's open-source distributed deep learning training framework. It builds on Facebook's "Training ImageNet in 1 Hour" work and Baidu's ring all-reduce to help users run distributed training easily: https://github.com/horovod/horovod

Horovod uses NCCL in place of Baidu's original ring all-reduce implementation. NCCL is NVIDIA's collective communications library and provides a highly optimized ring all-reduce; NCCL 2 allows it to run across multiple machines.

Converting single-machine training code into distributed training code takes only a few steps:

  • Installation: the Horovod Docker image is recommended, which saves the trouble of setting up the environment by hand. Horovod depends on NCCL 2 and Open MPI.

    $ mkdir horovod-docker-gpu
    $ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
    $ docker build -t horovod:latest horovod-docker-gpu
  • Set up SSH connections between the worker machines so that they can reach each other

  • Horovod supports deep learning frameworks such as TensorFlow, Keras, PyTorch and MXNet. Using Keras as an example, the modification takes six steps, all visible in the code below (a condensed PyTorch equivalent is sketched at the end of this section): (1) initialize Horovod with hvd.init(); (2) pin each process to one GPU with config.gpu_options.visible_device_list = str(hvd.local_rank()); (3) scale the learning rate by hvd.size() and wrap the optimizer with opt = hvd.DistributedOptimizer(opt); (4) broadcast the initial model state so all workers start consistently, via hvd.callbacks.BroadcastGlobalVariablesCallback(0); (5) save checkpoints only on worker 0 (hvd.rank() == 0); (6) adjust the number of epochs according to hvd.size().

    from __future__ import print_function
    import keras
    from keras.datasets import mnist
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Flatten
    from keras.layers import Conv2D, MaxPooling2D
    from keras import backend as K
    import math
    import tensorflow as tf
    import horovod.keras as hvd
    
    # Horovod: initialize Horovod.
    hvd.init()
    
    # Horovod: pin GPU to be used to process local rank (one GPU per process)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    K.set_session(tf.Session(config=config))
    
    batch_size = 128
    num_classes = 10
    
    # Horovod: adjust number of epochs based on number of GPUs.
    epochs = int(math.ceil(12.0 / hvd.size()))
    
    # Input image dimensions
    img_rows, img_cols = 28, 28
    
    # The data, shuffled and split between train and test sets
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
        x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
        input_shape = (1, img_rows, img_cols)
    else:
        x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
        x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
        input_shape = (img_rows, img_cols, 1)
    
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    print('x_train shape:', x_train.shape)
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')
    
    # Convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)
    
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                    activation='relu',
                    input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    
    # Horovod: adjust learning rate based on number of GPUs.
    opt = keras.optimizers.Adadelta(1.0 * hvd.size())
    
    # Horovod: add Horovod Distributed Optimizer.
    opt = hvd.DistributedOptimizer(opt)
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                optimizer=opt,
                metrics=['accuracy'])
    
    callbacks = [
        # Horovod: broadcast initial variable states from rank 0 to all other processes.
        # This is necessary to ensure consistent initialization of all workers when
        # training is started with random weights or restored from a checkpoint.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]
    
    # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
    if hvd.rank() == 0:
        callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
    
    model.fit(x_train, y_train,
            batch_size=batch_size,
            callbacks=callbacks,
            epochs=epochs,
            verbose=1,
            validation_data=(x_test, y_test))
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])
  • Launch distributed training with horovodrun (here 16 processes in total, 4 on each of the 4 servers)

horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
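Horovod's PyTorch API follows the same pattern. The following is a condensed sketch based on Horovod's documented PyTorch interface; build_model() and the rest of the training loop are placeholders, not part of the original example.

    import torch
    import horovod.torch as hvd

    hvd.init()                                   # (1) initialize Horovod
    torch.cuda.set_device(hvd.local_rank())      # (2) pin each process to one GPU

    model = build_model().cuda()                 # build_model() is a placeholder
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=0.01 * hvd.size())      # (3) scale the learning rate

    # (3) wrap the optimizer so gradients are averaged across workers
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())

    # (4) start all workers from the same model and optimizer state
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # (5) save checkpoints only on rank 0
    if hvd.rank() == 0:
        torch.save(model.state_dict(), 'checkpoint.pt')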

5. Summary

This article shared two ways to speed up deep learning training: improving GPU utilization and distributed training with the Horovod framework.

  • Load and preprocess data in parallel on the CPU so that the GPU no longer waits for data
  • Use Horovod data parallelism to shorten iteration time on large training sets

About the author: WeDO Experimental Jun, data analyst; loves life and loves writing.
