By Wang Bam Ping

As cloud-native technology matures and sees increasingly broad industry adoption, machine learning on the cloud is rapidly moving toward large-scale, industrial deployment.

Recently, Morphling was accepted as a Cloud Native Computing Foundation (CNCF) Sandbox project, as a standalone sub-project of Alibaba’s open-source KubeDL. Morphling provides automated configuration tuning, testing, and recommendation for large-scale industrial deployments of machine learning model inference services. As GPU virtualization and sharing technologies grow increasingly mature, Morphling helps enterprises take full advantage of cloud native, optimize the performance of online machine learning services, and reduce deployment costs, efficiently addressing the performance and cost challenges of machine learning in real industrial deployments. In addition, the academic paper associated with the Morphling project, “Morphling: Fast, Near-optimal Auto-configuration for Cloud-native Model Serving”, was accepted by the ACM Symposium on Cloud Computing 2021 (ACM SoCC 2021).

Morphling takes its name from the Dota hero, a water elemental who flexibly changes form to suit its environment and optimize its combat performance. Through the Morphling project, we hope to bring the same flexibility and intelligence to the deployment configuration of machine learning inference jobs, optimizing service performance and reducing deployment costs.

Morphling on GitHub: github.com/kubedl-io/m… IO/Tuning/Intr…

Background

The machine learning workflow on the cloud can be divided into model training and model inference. Once offline training and tuning are complete, a model is deployed as an online application in container form, providing users with uninterrupted, high-quality inference services such as object recognition in live video streams, online language translation, and online image classification. For example, the Machine Vision Application Platform (MVAP), which serves Alibaba’s internal Taobao content and social platform, supports product highlight recognition, live-stream cover image removal, and classification of browsed text through an online machine learning inference engine. According to Intel, the era of “Inference at Scale” is coming: by 2020, the ratio of inference cycles to training cycles exceeded 5:1. According to Amazon, AWS infrastructure spending on model inference services accounted for more than 90 percent of its total spending on machine learning tasks in 2019. Machine learning inference has become the key to putting artificial intelligence into practice and turning it into value.

Inference tasks on the cloud

An inference service is itself a special form of long-running microservice. As the volume of inference services deployed on the cloud grows, their cost and performance become crucial optimization targets. This requires the operations team to properly optimize the configuration of inference containers before deployment, including hardware resource configuration and service runtime parameters. These configurations play a critical role in balancing service performance (such as response time and throughput) against resource efficiency. In practice, our tests have found that different deployment configurations can lead to a ten-fold difference in throughput relative to resource usage.

Drawing on Alibaba’s extensive experience with AI inference services, we first summarize the characteristics of inference workloads compared with traditional service deployment:

  • Expensive GPU resources, but low GPU memory usage: the growing maturity of GPU virtualization and time-sharing technology gives us the opportunity to run multiple inference services on the same GPU, significantly reducing cost. Unlike a training task, an inference task uses an already-trained neural network model to process user input and produce output through the network. This involves only forward propagation, so the demand on GPU memory is low. By contrast, model training involves backward propagation and must store a large number of intermediate results, placing much greater pressure on GPU memory. Our extensive cluster data show that assigning an entire GPU to a single inference task leads to considerable resource waste. How to choose appropriate GPU resource specifications for an inference service, especially the incompressible GPU memory, thus becomes a key problem.
  • Diverse resource bottlenecks: besides GPU resources, inference tasks also involve complex data pre-processing (turning user input into parameters that match the model’s input) and result post-processing (producing output formats that match user expectations). These steps typically run on the CPU, while model inference typically runs on the GPU. Depending on the service, GPU, CPU, or other hardware resources may each become the dominant factor in response time, and hence the resource bottleneck.
  • Container runtime parameters are another tuning dimension: beyond computing resources, runtime parameters such as the number of concurrent service threads in the container and the batch size of the inference service also directly affect performance metrics such as RT and QPS, and thus become a further dimension that service deployers need to tune (a rough sketch of the resulting configuration space follows this list).
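
To make the configuration space concrete, the sketch below lists the kind of tuning dimensions discussed above as a simple Python structure. The dimension names and candidate values are illustrative assumptions, not Morphling’s actual API or defaults.

```python
# Illustrative tuning dimensions (hypothetical names and values, not Morphling's API):
# hardware resources plus container runtime parameters, each with a few candidates.
CONFIG_DIMENSIONS = {
    # hardware resources
    "cpu_cores":     [1, 2, 4, 8],      # pre-/post-processing runs on the CPU
    "gpu_memory_gb": [2, 4, 8, 16],     # incompressible GPU memory share
    # container runtime parameters
    "batch_size":    [1, 4, 16, 32],    # inference batch size
    "concurrency":   [1, 2, 4, 8],      # service threads inside the container
}

# Every point in this grid is one deployable configuration; picking a good one
# by hand quickly becomes impractical as the dimensions multiply.
```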

Optimizing the deployment configuration of inference services

Kubernetes has become the mainstream cloud-native technology and is widely used to host a rich variety of new application workloads. Building machine learning tasks (both training and inference) on Kubernetes and deploying them stably, efficiently, and at low cost has become the focus of major companies promoting AI projects and services on the cloud. When it comes to configuring inference containers on Kubernetes, however, the industry is still exploring and experimenting.

  • The most common approach is to configure parameters manually based on experience, which is simple but inefficient. In practice, service deployers, from the cluster administrator’s perspective, tend to over-provision resources to guarantee service quality, choosing stability over efficiency and causing substantial resource waste. Alternatively, runtime parameters are simply left at their default values, giving up the opportunity for performance optimization.
  • Another approach is to refine resource allocation based on the container’s historical resource usage profile. However, our observation and practice show that day-to-day resource usage does not reflect the traffic peaks seen under load testing and cannot be used to evaluate the upper limit of service capacity. Moreover, newly launched services generally lack reliable historical usage data to draw on. In addition, owing to the characteristics of machine learning frameworks, historical GPU memory consumption does not accurately reflect the application’s real GPU memory demand. Finally, historical data offers no basis for tuning the runtime parameters of the program inside the container.

Overall, while the Kubernetes community has research efforts and products that automate parameter recommendation for general hyperparameter tuning, the industry still lacks a cloud-native configuration-tuning system aimed directly at machine learning inference services.

Drawing on Alibaba’s extensive experience with AI inference services, we summarize the pain points of tuning inference service configurations as follows:

  • Lack of a framework for automated performance testing and parameter tuning: iterating through manual configuration and load testing imposes a heavy human burden on deployment testing, making this approach impractical in reality.
  • The performance testing process must be stable and non-intrusive: load-testing online services directly in the production environment affects the user experience.
  • An efficient algorithm for optimizing parameter combinations is needed: as the number of parameters to configure grows, the multi-dimensional parameter combinations must be tuned jointly, which places higher demands on the efficiency of the optimization algorithm.

Morphling

To address these problems, the Alibaba Cloud native cluster management team developed and open-sourced Morphling, a Kubernetes-based configuration-tuning framework for machine learning inference services. Morphling automates the entire process of tuning parameter combinations and pairs it with an efficient, intelligent tuning algorithm to streamline configuration tuning for inference workloads. It runs efficiently on top of Kubernetes and addresses the performance and cost challenges of machine learning in real industrial deployments.

Morphling abstracts the parameter-tuning process at different cloud-native levels, providing users with a concise and flexible configuration interface while encapsulating the underlying container operations, data communication, sampling algorithms, and storage management in a controller. Specifically, Morphling’s parameter tuning and performance testing use an experiment-trial workflow:

  • An experiment, the abstraction layer closest to the user, defines a specific tuning job by specifying, through its interface, the storage location of the machine learning model, the configuration parameters to be tuned, and an upper limit on the number of tests.
  • For each tuning experiment, Morphling defines another layer of abstraction: the trial. A trial encapsulates a one-off test of a particular parameter combination and covers the underlying Kubernetes container operations: in each trial, Morphling configures and starts an inference service container with the parameter combination under test, checks the service’s availability and health, and then stress-tests the service to measure its performance under that configuration, such as response latency, throughput, and resource efficiency. The test results are stored in a database and fed back to the experiment.
  • Morphling uses an intelligent hyperparameter-tuning algorithm to select only a small number of configuration combinations for trials, with the results of each round fed back to efficiently choose the next set of parameters to test. To avoid exhaustively sampling every specification, we use Bayesian optimization as the core driver of the sampling algorithm. By iteratively refining the fitted surrogate function, Morphling produces a near-optimal container specification recommendation at a low sampling rate (<20%), greatly reducing stress-testing overhead (a minimal sketch of this loop follows below).
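
The loop can be pictured with the minimal, self-contained sketch below. It is not Morphling’s controller code: `deploy_and_stress_test()` is a stub standing in for the real container deployment and load test, and a random proposer stands in for the Bayesian-optimization sampler; only the overall sample-test-feedback structure is the point.

```python
# A minimal sketch of the experiment-trial loop (NOT Morphling's actual code):
# propose a configuration, run one trial, feed the result back, repeat.
import random

SEARCH_SPACE = {                       # hypothetical tuning dimensions
    "cpu_cores": [1, 2, 4, 8],
    "gpu_memory_gb": [2, 4, 8],
    "batch_size": [1, 8, 32],
}
MAX_TRIALS = 12                        # the experiment's test budget

def propose(history):
    """Pick the next configuration to try. Morphling conditions this choice on
    previous results via Bayesian optimization; here we just sample at random."""
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def deploy_and_stress_test(config):
    """Stub for one trial: start the inference container with `config`, wait
    until it is healthy, run the load test, and return the measured RPS."""
    return random.uniform(10, 100)     # placeholder measurement

def run_experiment():
    history = []                                 # (config, rps) pairs fed back to the sampler
    for _ in range(MAX_TRIALS):
        config = propose(history)                # sampling step
        rps = deploy_and_stress_test(config)     # trial step
        history.append((config, rps))
    return max(history, key=lambda t: t[1])      # recommended configuration

best_config, best_rps = run_experiment()
print("recommended configuration:", best_config, "RPS:", round(best_rps, 1))
```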

Through this iterative sample-and-test process, the service deployer ultimately receives a recommended, optimized configuration combination.

Morphling also provides a control suite, Morphling-UI, which lets the service deployment team launch a tuning experiment, monitor the tuning process, and compare tuning results through a simple, easy-to-use interface.

Morphling in practice on Taobao’s content and social platform

Alibaba’s rich set of online machine learning inference scenarios, and the large number of inference service instances they require, provide first-hand practice and test feedback for validating Morphling. Among them, the Machine Vision Application Platform (MVAP) team of Alibaba’s Taobao division uses an online machine learning inference engine to support product highlight recognition, live-stream cover image removal, and classification of browsed text.

During Double 11 in 2020, we used Morphling to test and optimize the specifications of AI inference containers, finding the best trade-off between performance and cost. Meanwhile, the algorithm engineering team further quantified and analyzed these resource-intensive inference models, such as the Taobao video viewing service, and optimized them from the perspective of AI model design, supporting the Double 11 traffic peak with minimal resources while keeping business performance intact. This greatly improved GPU utilization and reduced cost.

Academic exploration

To improve the efficiency of the parameter-tuning process for inference services, the Alibaba Cloud native cluster management team further explored the use of meta-learning and few-shot regression, tailored to the characteristics of inference workloads, to build a tuning algorithm that is more efficient and requires fewer samples, answering the industry’s practical demand for “fast, small-sample, low-cost” tuning with a cloud-native, automated tuning framework. The resulting paper, “Morphling: Fast, Near-optimal Auto-configuration for Cloud-native Model Serving”, was accepted by the ACM Symposium on Cloud Computing 2021 (ACM SoCC 2021).

In recent years, optimizing the deployment of AI inference tasks on the cloud has been an active topic in major cloud computing and systems journals and conferences, becoming a hot area of academic exploration. Topics explored include dynamic selection of AI models, dynamic scaling of deployed instances, traffic scheduling for user requests, and full utilization of GPU resources (such as dynamic model loading and batch size optimization). However, ours is the first study to optimize container-level inference service deployment from the standpoint of large-scale industrial practice.

In algorithmic terms, performance tuning is a classic hyperparameter-tuning problem, but traditional hyperparameter-tuning methods such as Bayesian optimization struggle with high-dimensional problems (many configuration items) and large search spaces. For AI inference tasks, for example, we jointly tune the number of CPU cores, GPU memory size, batch size, and GPU model as a combinatorial optimization problem, with each configuration item offering five to eight candidate values; the combined parameter search space then contains more than 700 points. Based on our testing experience in production clusters, it takes several minutes for an AI inference container to test one set of parameters, from service start-up through stress testing to data reporting. At the same time, AI inference services are diverse and updated frequently, while deployment engineers and testing-cluster budgets are limited. Efficiently finding the best configuration in such a large search space poses a new challenge for hyperparameter-tuning algorithms (the back-of-the-envelope sketch below illustrates the cost of an exhaustive sweep).
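
As a rough illustration of that cost, the snippet below multiplies the figures quoted above; the exact numbers (720 combinations, 5 minutes per trial) are illustrative assumptions consistent with the ranges mentioned in the text, not measured values.

```python
# Back-of-the-envelope cost of an exhaustive sweep vs. a low-sampling budget
# (illustrative numbers, consistent with the figures quoted in the text above).
combinations = 720            # e.g. 4 config items with ~5-8 candidates each
minutes_per_trial = 5         # container start-up + stress test + reporting

exhaustive_hours = combinations * minutes_per_trial / 60
sampled_trials = int(combinations * 0.05)          # ~5% few-shot budget
sampled_hours = sampled_trials * minutes_per_trial / 60

print(f"exhaustive sweep: {exhaustive_hours:.0f} hours per service")
print(f"5% sampling:      {sampled_trials} trials, about {sampled_hours:.1f} hours")
```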

The core observation of our paper is that, across different AI inference workloads, the way the configurations to be tuned (for example, GPU memory and batch size) affect container performance (for example, throughput) follows stable and similar trends. Visualized as a “configuration-performance” surface, different AI inference models yield surfaces of similar shape, but the degree to which each configuration affects performance, and the positions of the key turning points, differ numerically:

The figure above visualizes, for three AI inference models, how the two-dimensional configuration of <CPU cores, GPU memory size> affects container throughput (RPS). The paper proposes using Model-Agnostic Meta-Learning (MAML) to learn these commonalities in advance and train a meta-model, so that for a new AI inference service the key points of its performance surface can be found quickly: starting from the meta-model, an accurate fit can be obtained from only a small sample (about 5%) of configurations (a toy sketch of this few-shot idea follows below).
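
The intuition can be conveyed with the toy few-shot regression sketch below. It is not the MAML implementation from the paper: instead of a meta-trained neural network, a hand-written reference surface plays the role of the meta-learned prior, and the adaptation step fits just a scale and an offset to a handful of measurements of a hypothetical new model.

```python
# Toy illustration of the few-shot idea (NOT the paper's MAML implementation):
# performance surfaces of different models share a shape, so a prior surface
# learned offline can be adapted to a new model from only a few measurements.
import numpy as np

def reference_surface(cpu, mem):
    """Stand-in for the meta-learned 'configuration-performance' shape:
    throughput saturates as CPU cores and GPU memory grow."""
    return (1 - np.exp(-cpu / 4.0)) * (1 - np.exp(-mem / 8.0))

def new_model_rps(cpu, mem):
    """Pretend true (unknown) surface of a NEW model: same shape,
    different scale and offset."""
    return 80.0 * reference_surface(cpu, mem) + 5.0

# Few-shot adaptation: measure only a handful of configurations ...
samples = [(1, 2), (4, 8), (8, 16)]
x = np.array([reference_surface(c, m) for c, m in samples])
y = np.array([new_model_rps(c, m) for c, m in samples])

# ... and fit just two parameters (scale a, offset b) by least squares.
A = np.vstack([x, np.ones_like(x)]).T
a, b = np.linalg.lstsq(A, y, rcond=None)[0]

# The adapted surface now predicts unmeasured configurations accurately.
cpu, mem = 2, 4
predicted = a * reference_surface(cpu, mem) + b
print(f"predicted RPS at (2 cores, 4 GB): {predicted:.1f}  true: {new_model_rps(cpu, mem):.1f}")
```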

Conclusion

Morphling, a Kubernetes-based configuration-tuning framework for machine learning inference services, combined with a “fast, small-sample, low-cost” tuning algorithm, delivers a cloud-native, automated, stable, and efficient tuning process for AI inference deployment, enabling faster optimization and iteration of the deployment process and accelerating the launch of machine learning applications. The combination of Morphling and KubeDL will also make the AI experience smoother, from model training to configuration tuning for inference deployment.

References

Morphling GitHub: github.com/kubedl-io/m…

IO/Tuning/Intr…

KubeDL GitHub: github.com/kubedl-io/k…

KubeDL website: kubedl.io/

Check out the Morphling project on GitHub!

