PyTorch recently released version 1.5, a major feature upgrade to the increasingly popular machine learning framework. In addition, Facebook and AWS have collaborated on two important PyTorch libraries.
As PyTorch sees more and more production use, providing the community with better tools and platforms to efficiently scale training and deploy models has become a priority.
PyTorch 1.5 was released recently, upgrading the main TorchVision, TorchText, and TorchAudio libraries and introducing features such as converting models from the Python API to the C++ API.
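That conversion path runs through TorchScript: a model authored in Python is compiled to a serialized archive that the C++ frontend can load. A minimal sketch, assuming a scriptable TorchVision model (the model choice and file name are illustrative):

```python
import torch
import torchvision.models as models

# Compile a Python-defined model to TorchScript and serialize it;
# the archive can then be loaded from C++ via torch::jit::load("resnet18.pt").
model = models.resnet18(pretrained=True).eval()
scripted = torch.jit.script(model)
scripted.save("resnet18.pt")
```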
In addition, Facebook has partnered with Amazon to launch two major tools: the TorchServe model serving framework and the TorchElastic Kubernetes controller.
TorchServe is designed to provide a clean, compatible, industrial-grade path for deploying PyTorch models for inference at scale.
The TorchElastic Kubernetes controller allows developers to quickly use Kubernetes clusters to create fault-tolerant distributed training jobs in PyTorch.
This appears to be a move by Facebook and Amazon to declare war on TensorFlow in the market for large-scale, high-performance AI frameworks.
TorchServe: For inference tasks
Deploying machine learning models for inference at scale is not easy. Developers must collect and package model artifacts, create a secure serving stack, install and configure software libraries for prediction, create and expose APIs and endpoints, generate logs and metrics for monitoring, and manage multiple model versions across potentially many servers.
Each of these tasks takes significant time and can slow model deployment by weeks or even months. In addition, optimizing the service for low-latency online applications is a must.
Until now, PyTorch developers lacked an officially supported way to deploy PyTorch models. The release of TorchServe, a production model serving framework, will change that, making it easier to put models into production.
The following example shows how to take a trained model from TorchVision and deploy it using TorchServe.
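A minimal sketch of that workflow, assuming TorchServe and its command-line tools are installed (the model choice, file names, and paths are illustrative):

```python
import torch
import torchvision.models as models

# 1. Export a pretrained TorchVision model as a TorchScript archive,
#    so no separate model-definition file is needed at serving time.
model = models.densenet161(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)
torch.jit.trace(model, example).save("densenet161.pt")

# 2. From a shell, package the archive into a .mar file and start the server:
#
#    torch-model-archiver --model-name densenet161 --version 1.0 \
#        --serialized-file densenet161.pt --handler image_classifier \
#        --export-path model_store
#
#    torchserve --start --model-store model_store \
#        --models densenet161=densenet161.mar
#
# 3. The inference API then serves predictions on port 8080:
#
#    curl -X POST http://127.0.0.1:8080/predictions/densenet161 -T kitten.jpg
```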
A beta version of TorchServe is now available. Its features include:
- Clean APIs: Supports an inference API for predictions and a management API for managing the model server.
- Secure deployment: Includes HTTPS support for secure deployment.
- Powerful model management capabilities: Allows full configuration of models, versions, and individual worker threads via a command-line interface, configuration file, or runtime API.
- Model archiving: Provides tools to perform “model archiving,” the process of packaging a model, its parameters, and supporting files into a single persistent artifact. Using a simple command-line interface, you can package and export everything needed to serve a PyTorch model as a single “.mar” file, which can then be shared and reused.
- Built-in model handlers: Supports handlers that cover the most common use cases, such as image classification, object detection, text classification, and image segmentation. TorchServe also supports custom handlers (a sketch follows this list).
- Logging and metrics: Support for reliable logging and real-time metrics to monitor inference services and endpoints, performance, resource utilization, and errors. You can also generate custom logs and define custom metrics.
- Model management: Supports managing multiple models, or multiple versions of the same model, at the same time. You can use model versions to roll back to earlier versions, or route traffic to different versions for A/B testing (see the management API sketch after this list).
- Pre-built images: Dockerfiles and Docker images are provided for deploying TorchServe in CPU- and NVIDIA GPU-based environments. The latest Dockerfiles and images can be found in the TorchServe repository.
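As an example of the custom handler mechanism mentioned above, here is a hedged sketch modeled on TorchServe’s `BaseHandler` base class (class and method names follow the TorchServe handler convention, but treat the details as illustrative):

```python
# custom_handler.py -- passed to torch-model-archiver via --handler.
# A sketch of a custom TorchServe handler; the preprocess/postprocess hooks
# follow ts.torch_handler.base_handler.BaseHandler.
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler


class TopKImageHandler(BaseHandler):
    """Image classifier that returns the top-5 class indices per request."""

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        # Each request row carries its payload under "data" or "body".
        images = [
            self.transform(Image.open(io.BytesIO(row.get("data") or row.get("body"))))
            for row in data
        ]
        return torch.stack(images)

    def postprocess(self, inference_output):
        # One list of top-5 class indices per request in the batch.
        _, top5 = inference_output.topk(5, dim=1)
        return top5.tolist()
```

Likewise, a hedged sketch of the versioning workflow through the management API (default port 8081; the endpoint paths follow the TorchServe management API, but the model name and version are illustrative):

```python
# Register a second version of a model and route new traffic to it.
import requests

BASE = "http://127.0.0.1:8081"

# Register a new archive sitting in the model store.
requests.post(f"{BASE}/models", params={"url": "densenet161_v2.mar", "initial_workers": 2})

# Inspect the available versions, then make version 2.0 the default.
print(requests.get(f"{BASE}/models/densenet161/all").json())
requests.put(f"{BASE}/models/densenet161/2.0/set-default")
```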
Installation instructions, tutorials, and documentation are also available at pytorch.org/serve.
TorchElastic: Integrated Kubernetes controller
Machine learning models such as RoBERTa and Turing-NLG keep getting larger, and the need to scale out to distributed clusters is becoming increasingly important. To meet this need, preemptible instances (such as Amazon EC2 Spot Instances) are often used.
But these preemptible instances are inherently unpredictable, which is why a second tool, TorchElastic, has emerged.
The integration of Kubernetes and TorchElastic allows PyTorch developers to train machine learning models on a set of compute nodes that can change dynamically without disrupting the model training process.
Even if a node fails, TorchElastic’s built-in fault tolerance can pause training at the node level and resume once the node is healthy again.
In addition, using the Kubernetes controller with TorchElastic lets you run mission-critical distributed training jobs on clusters whose nodes may be replaced due to hardware failures or node reclamation.
Training jobs can start with only part of the requested resources and scale dynamically as more resources become available, without being stopped or restarted.
To take advantage of these features, users specify training parameters in a simple job definition, and the Kubernetes-TorchElastic package manages the job’s life cycle.
Here is a simple example of a TorchElastic configuration for an ImageNet training job:
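A hedged sketch of such a job definition, following the ElasticJob custom resource from the TorchElastic Kubernetes project (the API group, image, endpoint, and training arguments are illustrative):

```yaml
apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: imagenet
  namespace: elastic-job
spec:
  # Rendezvous backend the workers use to discover each other.
  rdzvEndpoint: "etcd-service:2379"
  # The job keeps running as long as 1 to 4 workers are alive.
  minReplicas: 1
  maxReplicas: 4
  replicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: elasticjob-worker
              image: torchelastic/examples:0.2.0
              imagePullPolicy: Always
              args:
                - "--nproc_per_node=1"
                - "/workspace/examples/imagenet/main.py"
                - "--arch=resnet18"
                - "--epochs=20"
                - "--batch-size=32"
```

The minReplicas/maxReplicas range is what lets the job shrink and grow as preemptible capacity comes and goes, without stopping training.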
Should Microsoft and Google be worried?
There may be a deeper meaning behind the new PyTorch libraries, because this is not the first such move in the history of deep learning frameworks.
In December 2017, AWS, Facebook, and Microsoft announced that they would jointly develop ONNX for production environments, to counter Google TensorFlow’s monopoly on industrial use.
Later, mainstream deep learning frameworks such as Apache MXNet, Caffe2, and PyTorch all added ONNX support to varying degrees, making it easier to migrate algorithms and models between frameworks.
But ONNX’s vision of bridging the gap between academia and industry has not lived up to its original expectations: each framework still uses its own serving architecture, and MXNet and PyTorch are essentially the only ones that adopted ONNX in depth.
Now that PyTorch has launched its own serving solution, ONNX is left with little purpose (and MXNet is left at a loss).
On the other hand, with each update PyTorch is approaching, and even surpassing, its closest competitor TensorFlow in compatibility and ease of use.
Google has its own cloud services and frameworks, but the combination of AWS’s cloud resources and Facebook’s framework makes it difficult for Google to compete.
As for Microsoft, it has already been kicked out of the group chat by its two former ONNX partners. What will it do next?