Preface

This article describes how to use PyTorch Lightning to build efficient and fast deep learning pipelines, including why it is important to optimize deep learning pipelines, six ways to speed up the experimentation cycle with PyTorch Lightning, and a summary of the experiments.

This article is from the Technical Summary series of the public account CV Technical Guide.

When Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton designed AlexNet in 2012, it took five to six days to train the 60-million-parameter model. Eight years later, in 2020, the Microsoft DeepSpeed team successfully trained a 350-million-parameter BERT-Large model in less than 44 minutes!

Nine years on, we can see that AlexNet was just the tip of the iceberg of the machine learning revolution. Today we know that many training techniques and deep learning model architectures with untapped potential are within our grasp!

Unfortunately, because of the scale of today's datasets and of new deep learning model architectures, many of these advances are as inaccessible to the average researcher as juicy apples are to a fruit picker without a ladder. With so many fruitful model architectures hanging from the tree of deep learning's potential, we should ask ourselves: “How can we reach them?”

The answer is simple: to reach these fruitful architectures, we need ladders! Alex Krizhevsky built his own ladder to AlexNet piece by piece, but today, solutions like PyTorch Lightning hand you an off-the-shelf ladder, even an escalator!

This article describes how to use PyTorch Lightning to build efficient and fast deep learning pipelines, and explains how these optimizations let you quickly try out various research ideas by significantly speeding up the experimentation cycle! It covers:

  • Why it is important to optimize deep learning pipelines

  • Six ways to speed up your experimentation cycle with PyTorch Lightning

  • A summary of the results

Why it is important to optimize deep learning pipelines

Whether in academia or industry, the time and resources available to explore and try out new ideas are always limited. As dataset sizes and deep learning model complexity grow, experimenting with the latest machine learning models and techniques becomes increasingly complex and time-consuming. Addressing these challenges (and making the R&D cycle more efficient) is critical to the overall success of a project.

Today, various solutions exist to overcome these barriers, such as Grid.ai, WandB, and PyTorch Lightning. This article focuses on PyTorch Lightning and explains how to use it to make deep learning pipelines faster and more memory-efficient behind the scenes with minimal code changes. These solutions make experiments more scalable and iteration faster while minimizing potential errors, reducing the time needed per experiment and freeing up time to try out more ideas.

Six ways to speed up your experimentation cycle with PyTorch Lightning

Six ways to optimize the deep learning pipeline:

  • Parallel data loading

  • Multi-GPU training

  • Mixed precision training

  • Sharded training

  • Early stopping

  • Optimization during model evaluation and inference

For each approach, we briefly explain how it works, how to implement it, and, finally, whether we found it helpful for our project!

Parallel data loading

It is common for the data loading and augmentation steps to become bottlenecks in the training pipeline.

A typical data pipeline consists of the following steps:

  • Load data from disk

  • Apply random augmentations on the fly

  • Collate the samples into batches

The data loading and augmentation process is easy to parallelize and can be optimized by loading data with multiple CPU processes in parallel. This way, expensive GPU resources are not held back by the CPU during training and inference.

To load data as quickly as possible for training the deep learning model, do the following (see the sketch after the supplementary notes below):

  1. Set the ‘num_workers’ parameter in the DataLoader to the number of CPU cores.

  2. When using a GPU, set the ‘pin_memory’ parameter in the DataLoader to True. This allocates the data in page-locked memory, which speeds up transfers to the GPU.

Supplementary notes:

  • If you process streaming data (that is, an ‘IterableDataset’), you also need to configure each worker to process the incoming data independently.

  • Worker seeding bugs plague many open-source deep learning projects. To avoid them, define the random seed for each worker process in ‘worker_init_fn’. Starting with PyTorch Lightning 1.3, this is handled automatically using ‘seed_everything(123, workers=True)’.

  • Starting with PyTorch 1.8, you can use the optional ‘prefetch_factor’ parameter to better control loading behavior. Set it to a higher integer to load more batches ahead of time, at the cost of more memory.
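
Putting the steps and notes above together, here is a minimal sketch of such a DataLoader configuration; the toy TensorDataset and the batch-size and prefetch values are illustrative stand-ins, not part of the original pipeline:

```python
import multiprocessing

import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import seed_everything

# Seed the main process and, since PyTorch Lightning 1.3, the workers too.
seed_everything(123, workers=True)

# Toy dataset standing in for real images and labels (illustrative only).
train_dataset = TensorDataset(
    torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,))
)

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=multiprocessing.cpu_count(),  # step 1: one worker per CPU core
    pin_memory=torch.cuda.is_available(),     # step 2: page-locked memory for faster GPU copies
    prefetch_factor=4,                        # PyTorch >= 1.8: batches loaded ahead per worker
)
```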

Multi-GPU training using DistributedDataParallel

GPUs provide a huge speedup over CPUs for both training and inference. What’s better than a GPU? More than one GPU!

PyTorch offers several paradigms for training models with multiple GPUs. The two most common are DataParallel and DistributedDataParallel, of which DistributedDataParallel is the more scalable approach.

Modifying a training pipeline in PyTorch (and in other platforms) is not trivial. One must consider issues such as loading data in a distributed manner and the synchronization of weights, gradients, and metrics.

With PyTorch Lightning, it is very easy to train a PyTorch model on multiple GPUs with almost no code changes, as the sketch below shows!
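
A minimal sketch of multi-GPU training with Lightning follows. The ‘LitClassifier’ module is a hypothetical stand-in, ‘train_loader’ is the loader from the previous sketch, and the Trainer arguments (‘gpus’, ‘accelerator="ddp"’) follow the Lightning 1.x API, which has been renamed in newer releases:

```python
import torch
from torch import nn
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    """Hypothetical minimal model, just enough to illustrate the Trainer."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

model = LitClassifier()

# Lightning inserts the DistributedSampler and synchronizes gradients for you.
trainer = pl.Trainer(gpus=4, accelerator="ddp")  # Lightning 1.x-style arguments
trainer.fit(model, train_loader)
```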

Mixed precision

By default, input tensors and model weights are stored in single precision (float32). However, many mathematical operations can be performed in half precision (float16). This significantly increases speed and lowers the model's memory bandwidth requirements without sacrificing model performance.

By setting the mixed precision flag in PyTorch Lightning, the framework automatically uses half precision where possible while keeping single precision elsewhere. With minimal code modifications, we sped up model training by a factor of 1.5 to 2 (see the sketch below).
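
Enabling it is a one-flag change on the Trainer; a sketch reusing the hypothetical ‘model’ and ‘train_loader’ from the previous sketches:

```python
# precision=16 enables automatic mixed precision: float16 where it is safe,
# float32 elsewhere (Lightning 1.x-style API, requires a GPU).
trainer = pl.Trainer(gpus=1, precision=16)
trainer.fit(model, train_loader)
```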

Early stopping

We often budget a large number of epochs for training, but in practice the model is likely to overfit the training data early in the training process. Early stopping therefore needs to be built into the training pipeline: it is configured to end training when the validation loss has stopped decreasing for a predefined number of evaluations. This not only prevents overfitting but also lets you find the best model within dozens of epochs instead of hundreds (see the sketch below).
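
A sketch using Lightning's built-in EarlyStopping callback; it assumes the LightningModule logs a metric named ‘val_loss’ in its validation step and has a validation loader, and the patience value is illustrative:

```python
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",  # assumes the module calls self.log("val_loss", ...) in validation
    patience=3,          # validation checks without improvement before stopping
    mode="min",          # lower validation loss is better
)

# Budget many epochs; early stopping usually ends training far sooner.
trainer = pl.Trainer(callbacks=[early_stop], max_epochs=1000)
trainer.fit(model, train_loader, val_loader)  # val_loader: hypothetical validation DataLoader
```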

Sharded training

Sharded training is based on Microsoft's ZeRO research and the DeepSpeed library, which make training large models scalable and simple through a variety of memory and inter-process communication optimizations. In practice, sharded training makes it possible to train large models that would otherwise not fit on a single GPU, or to use larger batch sizes during training and inference.

PyTorch Lightning introduced support for sharded training in its 1.2 release. In our use case, we did not observe any significant improvement in training time or memory footprint. However, our insights may not generalize to other problems and settings, and it may be worth a try, especially when dealing with large models that do not fit on a single GPU (see the sketch below).
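
In the Lightning releases around 1.2, sharded training was enabled through a Trainer plugin; a sketch, assuming the ‘ddp_sharded’ plugin name from that era (newer releases expose sharding through different strategy arguments):

```python
# Shards optimizer state and gradients across the GPUs (ZeRO-style);
# usually combined with mixed precision for the largest memory savings.
trainer = pl.Trainer(gpus=4, precision=16, plugins="ddp_sharded")
trainer.fit(model, train_loader)
```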

Optimization during model evaluation and inference

Gradients are not needed for the model's forward pass during evaluation and inference. Therefore, the evaluation code can be wrapped in a ‘torch.no_grad()’ context manager, which prevents gradients from being stored during the forward pass and reduces the memory footprint. As a result, larger batches can be fed into the model for faster evaluation and inference.

By default, PyTorch Lightning manages these optimizations behind the scenes; in plain PyTorch, the equivalent looks like the sketch below.
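
For reference, this is roughly what the equivalent looks like in plain PyTorch; the evaluation batch below is an illustrative stand-in:

```python
eval_batch = torch.randn(256, 3, 32, 32)  # larger batches fit once gradients are off

model.eval()           # dropout off, batch norm uses running statistics
with torch.no_grad():  # no autograd graph is built, so no gradients are stored
    predictions = model(eval_batch)
```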

Results summary

In our experiments, we found that each of these optimizations independently reduced the time needed to train the deep learning model, with the exception of sharded training, for which we observed no speed or memory improvements.

The table below shows each of the optimizations made to improve the deep learning pipeline, as well as the performance improvements observed.

With these optimizations, we made our deep learning pipeline 10 times faster, going from two weeks to just 10 hours.

Author: Georgian

Translation: CV Technical Guide

Original link: devblog.pytorchlightning.ai/how-we-2…

Welcome to follow the public account CV Technical Guide, which focuses on computer vision technique summaries, the latest technology tracking, and classic paper interpretation.

Reply with the keyword “technical summary” in the public account to get a PDF collection of the account's original technical summary articles.

Other articles

Incremental learning in deep neural networks

Overview of human pose estimation in deep learning

Summary of common methods for small object detection

CV Technical Guide – Summary and classification of essential articles

Normalization methods summary | Underfitting and overfitting

NMS summary | Loss function summary

Attention mechanism summary | Feature pyramid summary

Pooling methods summary | Data augmentation methods summary

Common ideas for innovation in papers | Multi-GPU parallel training summary

Summary of CNN structure evolution (I): Classical models

Summary of CNN structure evolution (II): Lightweight models

Summary of CNN structure evolution (III): Design principles

Summary of CNN visualization technology (I): Feature map visualization

Summary of CNN visualization technology (II): Convolution kernel visualization

Summary of CNN visualization technology (III): Class visualization

Summary of CNN visualization technology (IV): Visualization tools and projects

Summary of image annotation tools in computer vision

A review and summary of gradient descent optimization algorithms and optimizers

Summary | Classic open-source datasets in China and abroad

The Softmax function and its misconceptions

Common strategies for improving machine learning model performance

Resource sharing | SAHI: a slicing-aided inference library for small object detection

The effect of batch size on neural network training

Summary of tuning methods for hyperparameters of neural networks

Using Ray to load PyTorch models 340 times faster

A review of the latest research on small object detection in 2021

Capsule Networks: The New Deep Learning Network

Summary of computer vision terminology (I): Building a knowledge system for computer vision

A review of few-shot learning in computer vision