AdaptDL is a resource-adaptive deep learning training and scheduling framework, and is part of the CASL open source project. Its goal is to make distributed DL easy and efficient in dynamic resource environments.

This article was originally posted on the WeChat public account "PyTorch Developer Community".

EDL (Elastic Deep Learning), incubated by the LF AI Foundation, is a deep neural network training framework that can dynamically adjust parallelism. It supports multi-tenant cluster management, balances training wait and completion times, and improves resource utilization.

Training deep learning models is usually time-consuming and expensive in terms of compute, storage, and other resources.

Taking the BERT model as an example, training on a GPU often exceeds 2,000 hours, while training ResNet and VGG models takes at least 100 hours.

At current cloud computing prices, training a model may cost thousands or even tens of thousands of yuan. To keep training costs under control, shared compute clusters have emerged. This article introduces AdaptDL, developed by the Petuum CASL team, which greatly improves elastic deep learning (EDL) on GPU clusters.

Challenges faced by shared clusters

With shared clusters, multiple users can individually submit model training tasks.

This not only reduces the waste caused by over-provisioning computing resources, but also lets users take advantage of idle resources to train complex models in days or even hours.

However, shared clusters have their own problems.

Typical challenges faced by shared clusters include:

1. Resource allocation: Multiple tasks share one cluster, so GPU allocation must be planned carefully. For example, training a model on GPUs in the same machine is usually much faster than training on GPUs spread across multiple machines, and different distributed training tasks should be assigned GPUs on different machines so that they do not compete for network bandwidth.

2. Uneven training speed and scalability: Choosing the right GPU configuration for a training task requires continuously monitoring the model's training speed and scalability, both of which change over time. In particular, large batch sizes tend to pay off only near convergence, so it is usually better to use fewer GPU resources early in training.

3. Training configuration: We usually need to know in advance which GPUs are available before configuring important training runs, which is not always possible in a shared cluster. For example, the batch size and learning rate are usually determined by the number of GPUs, and gradient accumulation may be needed to overcome network bottlenecks when the GPUs sit on different machines.

4. Fairness and availability: At peak GPU usage, some users may have to queue for idle GPUs, while users whose tasks are already running may want more GPUs to speed them up. Balancing these two competing demands is difficult.

AdaptDL simplifies and speeds up model training on local machines and shared clusters

How AdaptDL solves the problems of shared clusters


To address these shortcomings of pooled and shared computing resources, the Petuum CASL team created AdaptDL to simplify and speed up distributed training on shared clusters.

AdaptDL is a resource-adaptive deep learning (DL) training and scheduling framework. It monitors the performance of training tasks in real time and elastically adjusts resource allocation (such as GPUs and compute instances) while tasks are running.

It offers the following advantages for the shared-cluster problems described above:

1. Higher utilization of shared GPU clusters: AdaptDL analyzes all model training tasks and learns how different tasks perform under different GPU resource configurations. Using this knowledge, the AdaptDL scheduler can allocate GPU resources to training tasks fairly and efficiently. As more training tasks run and the performance characteristics of different tasks become better understood, AdaptDL learns to reconfigure GPUs flexibly.

2. Lower cloud training costs: AdaptDL provisions a moderate number of GPU instances in the cloud to avoid unnecessary expense, and automatically scales the cluster up when training benefits from larger batch sizes.

3. Easier large-batch training: Using a large batch size can accelerate training on many GPUs, but applying it is not straightforward. For some models, an overly large batch size reduces statistical efficiency and lengthens training, while an overly small batch size leaves the GPUs underutilized. AdaptDL automatically selects an appropriate batch size on shared clusters, in cloud environments, and on local machines.

Compared with Optimus and Tiresias, models trained with AdaptDL have a shorter average training time

AdaptDL can automatically adjust the batch size, learning rate, and gradient accumulation for each model training task, and can control the number of spot instances used on cloud platforms.

Experience at Petuum shows that training models on an AdaptDL shared cluster completes on average 2-3x faster, and the cost of using AWS spot instances is about 3x lower.

Getting started


AdaptDL can be used in two modes.

1. Cluster scheduling: Allows multiple tasks to run on a Kubernetes cluster. Using the AdaptDL Python library, the AdaptDL scheduler can be integrated with PyTorch code to automatically select the optimal number of GPUs and the training batch size.

2. Standalone training: Train a model with an adaptive batch size and learning rate on any cluster or on a local multi-GPU machine. AdaptDL automatically figures out when a larger batch size can speed up model training.

Training with the AdaptDL Python library:

The AdaptDL Python library simplifies PyTorch training code so that the batch size and learning rate adapt automatically, without additional settings.

python3 -m pip install adaptdl

In the case of the PyTorch MNIST example, only a few lines of code need to be changed, as shown in the two steps and code sketches below:

AdaptDL provides a distributed data-parallel interface similar to PyTorch's native one, making it easy to modify existing distributed training code.

Step 1:

Replace torch.utils.data.DataLoader with adaptdl.torch.AdaptiveDataLoader.

AdaptiveDataLoader automatically selects the best batch size during training, based on the program's throughput and statistical efficiency. It also saves its state at checkpoints, so training can restart and resume from where it left off.

Calling train_loader.autoscale_batch_size(1024) lets AdaptDL automatically select the most effective batch size for training, with a maximum global batch size of 1024 across all training processes.
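As a rough illustration, here is a minimal sketch of this step based on the adaptdl.torch interface described above (train_dataset is a placeholder for any standard PyTorch dataset; exact signatures may differ between AdaptDL versions):

import adaptdl.torch

# train_dataset is any standard PyTorch Dataset (placeholder name).
# AdaptiveDataLoader is a drop-in replacement for torch.utils.data.DataLoader.
train_loader = adaptdl.torch.AdaptiveDataLoader(
    train_dataset, batch_size=64, shuffle=True, drop_last=True)

# Allow AdaptDL to grow the global batch size up to 1024 across all replicas.
train_loader.autoscale_batch_size(1024)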

Step 2:

Wrap the model with adaptdl.torch.AdaptiveDataParallel.

adaptdl.torch.AdaptiveDataParallel computes the gradient noise scale during training, which is used to estimate statistical efficiency. When the batch size changes, AdaptiveDataParallel automatically adjusts the learning rate according to a scaling rule.

By default, AdaptiveDataParallel uses AdaScale, which performs well across a variety of tasks.

At checkpoints, AdaptiveDataParallel automatically saves the model parameters, optimizer state, and LR scheduler state, and restores them after training restarts.
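Putting the pieces together, here is a minimal sketch of this step following the library's documented adaptdl.torch interface (Net, device, and the Adadelta/StepLR choices are illustrative placeholders, not requirements of AdaptDL; exact signatures may differ between versions):

import torch
import adaptdl.torch
from torch.optim.lr_scheduler import StepLR

# Initialize the elastic process group before wrapping the model.
adaptdl.torch.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

model = Net().to(device)                      # any torch.nn.Module (placeholder)
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
scheduler = StepLR(optimizer, step_size=1, gamma=0.7)

# Wrap model, optimizer, and LR scheduler so AdaptDL can rescale the learning
# rate when the batch size changes and checkpoint/restore their state.
model = adaptdl.torch.AdaptiveDataParallel(model, optimizer, scheduler)

# remaining_epochs_until() makes the loop resume at the right epoch after a restart.
for epoch in adaptdl.torch.remaining_epochs_until(10):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.nll_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()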

With these changes, users can run their training code on a local machine or on a distributed cluster. AdaptDL selects the right batch size and learning rate for faster distributed training, and automatically performs gradient accumulation to overcome network bottlenecks.

Training time of YOLOv3 with adaptive versus manually chosen batch sizes; the adaptive approach has a significant advantage across batch sizes

Without AdaptDL, choosing a batch size that is too small lengthens training because the GPUs are not fully utilized, while choosing one that is too large requires more epochs to converge, which also lengthens training. AdaptDL achieves better training performance automatically, without committing to a fixed batch size.

Cluster management with the AdaptDL scheduler:

The AdaptDL scheduler automatically determines the GPU resources to allocate to each training task, making training on shared clusters more intelligent.

Thanks to this elasticity, a training task expands to use additional GPUs when the cluster has spare capacity, and shrinks to use fewer GPUs when cluster usage is high, instead of being suspended.

The AdaptDL scheduler also provides other features, such as placing tasks to avoid network contention between them and maintaining fairness among competing training tasks.

Because the scheduler and each training task coordinate with one another, AdaptDL keeps the shared cluster efficient.

When a task can make effective use of a larger batch size, AdaptDL automatically shifts more GPUs to that job to speed up training. Conversely, when only smaller batch sizes are effective, idle GPUs are reallocated to other tasks that can use them more efficiently.

The AdaptDL scheduler can be installed on any Kubernetes instance using Helm with the following command:

helm install adaptdl adaptdl-sched --repo https://github.com/petuum/adaptdl/raw/helm-repo --namespace adaptdl --create-namespace --set docker-registry.enabled=true

Once the AdaptDL scheduler is installed, you can submit training tasks using the AdaptDL CLI. A training task starts on a single GPU and is then restarted several times with different numbers of GPUs while AdaptDL determines the optimal number to use. No matter how many GPUs are used, AdaptDL always chooses the most efficient batch size and adjusts the learning rate accordingly.

AdaptDL cluster tracing example

The colored bars show the number of compute instances assigned to different tasks; AdaptDL dynamically optimizes the number of instances each task receives

With AdaptDL, PyTorch training tasks run two to three times faster on a shared cluster. The AdaptDL scheduler also supports AWS spot instances, reducing costs by up to three times.

Finally, you can use AdaptDL and NNI to speed up hyperparameter tuning workloads (AdaptDL + NNI Post).

Project address: https://github.com/petuum/adaptdl

Translated from the PyTorch Medium blog.
