


Introduction: Most Internet companies now have their own machine learning platforms. Although the names vary, the problems these platforms solve and the technologies they choose are broadly similar. The “broad similarity” means everyone faces comparable problems, and the architectures and technology choices do not differ much: GPU clusters, Spark, or Kubernetes, for example. The “small differences” come from each company’s scale; each builds its platform according to its own situation, stage, and characteristics. The main governing idea behind Didi’s machine learning platform is to reduce duplication and improve efficiency. This article gives a comprehensive overview of Didi’s machine learning platform, focusing on the problems to be solved at each stage of the platform’s evolution and the ideas and technical solutions used to solve them. We hope it is helpful.

Machine learning platform 1.0: Transition from “workshop” to “centralized”

The construction of Didi’s machine learning platform began in 2016, when Didi’s algorithm teams gradually started AI-related research and practical applications such as machine learning and deep learning. Most of these algorithms are compute-intensive and generally run on expensive GPU servers. But as the business grew, each algorithm team planned only for its own problems, which resulted in a “small workshop” style of production.

The workshop mode had its positive side in the early stage: it preserved the flexibility needed for innovation. Over time, however, its limitations became increasingly obvious: no overall resource scheduling, no economies of scale, a great deal of repeated work, and limited computing power. As this small-workshop mode of algorithm production spread, the overall input-output ratio deteriorated sharply.

Didi’s machine learning platform was born in this context, and this stage was mainly devoted to solving these problems. The architecture and technology choices of this period were aimed squarely at the problems of workshop-style production, namely improving reusability and the ability to scale.

The first problem to solve was unified resource management, which covers both offline and online aspects.

The offline “unification” problem focuses on centralizing GPU server selection, testing, introduction, and rollout. On one hand, centralization improves the quality of the servers brought online; on the other hand, compared with the workshop model, GPU specialists take part in selection, so the choice of GPUs avoids blindly chasing the newest hardware or diverging across teams.

Moreover, centralization allows selection and procurement to take the company’s overall situation into account, so the optimal plan can be chosen.

The online “unification” problem breaks down into resource management and task scheduling: users should be able to apply for resources and release them when finished, keeping the whole resource pool in circulation, while the platform provides resource isolation and management.

With unified resource management in place, the next problem at this stage was how to avoid repeated work. Each algorithm team should not have to build Caffe, TensorFlow, PyTorch, and so on by itself; each environment should be built once and made available to all users. For the platform, this means managing application environments and supporting environment customization and rapid deployment.

After sorting out these requirements and weighing the maturity of the technologies available at the time, the platform chose the then-popular Docker to handle environment management, weak resource isolation, and task scheduling at once.

However, no resource manager at the time supported GPU scheduling, so the platform extended Yarn to support GPU resource management and task scheduling. On the environment side, the platform also provided interactive interfaces such as Jupyter Notebook.
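To illustrate the idea behind the Yarn extension, the sketch below treats the GPU count as a first-class, countable resource alongside CPU and memory when placing a task on a node. It is only a toy illustration of the scheduling decision, not Didi’s or Yarn’s actual code.

```python
# Toy illustration of GPU-aware scheduling: a task is placed on the first node
# whose free CPU, memory, and GPU counts all cover the request. This only
# sketches the idea of adding a GPU dimension to a Yarn-style scheduler.
def find_node(nodes, request):
    """nodes: list of dicts with free 'cpu', 'mem_gb', 'gpu'; request: same keys."""
    for node in nodes:
        if all(node[key] >= request[key] for key in ("cpu", "mem_gb", "gpu")):
            return node
    return None  # no node can host the task; it stays queued


def allocate(node, request):
    for key in ("cpu", "mem_gb", "gpu"):
        node[key] -= request[key]


# Example: a 1-GPU training task lands on the only node with free GPUs.
cluster = [{"name": "node-a", "cpu": 16, "mem_gb": 64, "gpu": 0},
           {"name": "node-b", "cpu": 32, "mem_gb": 128, "gpu": 2}]
task = {"cpu": 4, "mem_gb": 16, "gpu": 1}
chosen = find_node(cluster, task)
if chosen is not None:
    allocate(chosen, task)  # chosen["name"] == "node-b"
```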

With unified resource and environment management in place, users still need to share data across multiple resource nodes: when new resources are requested after the current ones are released, users often depend on the data from the previous round.

During the workshop period, multi-node data sharing was limited to a small scale and the problem was not prominent, but after centralization, as the number of users grows it becomes increasingly acute and even a major technical challenge, because:

  • Machine learning tasks depend on high-speed GPU computation and have strict requirements on data access latency, which must be backed by sufficiently high I/O bandwidth.

  • As the number of users increases, the aggregate demand for storage bandwidth becomes very large.

  • The storage system must offer a POSIX interface, which narrows the set of existing technical solutions, and it must strike a balance between high reliability, high performance, and cost.

For the storage system, Didi’s machine learning platform borrowed the PFS used by traditional supercomputers as the overall data storage layer. The underlying network infrastructure uses high-bandwidth Ethernet with the RoCE protocol for RDMA support, and continues to evolve in this direction.



Machine learning platform architecture: Yarn

Generally speaking, the problems faced at this stage were mainly internal. Moving from workshop-style to centralized production, the duplication problems to be solved were relatively simple. Some of these problems were already centralized in nature, but the solutions were still workshop-style, and the limitations of the technology choices had not yet been fully exposed.

Machine learning platform 2.0: Platform evolution

As the workshops disappeared, the machine learning platform was presented to all of the company’s algorithm teams as a centralized mode of production. Platform functions began to be completed and improved: a monitoring system, an operation and maintenance system, and more refined resource isolation, management, and optimization, along with support for the different kinds of tasks users run.

Although the previous stage effectively reduced the duplicated work of workshop-style production, it almost inevitably produced new forms of duplication. As more users came on board, the nature of their tasks also diversified: some are experimental, some are online production tasks, some are single-card tasks, some are multi-card, multi-machine training tasks, and so on.

Each kind of task has its own specific form of duplication. For example, after a model is produced, users have to deploy the model service themselves and solve production issues such as service HA and load balancing, and every online model has to solve these same issues again.

For another example, training usually requires parameter tuning. These parameters are homomorphic: only their values change, and after each change the tasks are independent, yet users have to go through the submission process for each one, which is again repetitive work.

For another example, multi-machine, multi-card training requires a parameter server, and an inefficient parameter server wastes a great deal of time on communication, which effectively duplicates resource consumption. A similar form of repetition appears when model services go online: to meet latency, QPS, and resource constraints, full-stack optimization from the service layer down through the deep learning framework and the computing libraries is needed, and basically every model has to go through this optimization process before going online.

To solve these new problems, the platform needs more powerful resource management and task scheduling capabilities.

Yarn, chosen in the previous period as the resource manager and task scheduler, began to show its age. Meanwhile K8S was becoming more mature: its integration with Docker was more natural and complete, it could manage resources along multiple dimensions, and it provided the environment and conditions for automatically deploying model services while also reducing their operation and maintenance cost.

Weighing the advantages and disadvantages of K8S and Yarn, the Didi machine learning platform began migrating from the Yarn architecture to the K8S architecture.
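As a small illustration of what running a training task looks like under the K8S architecture, the sketch below launches a single-GPU training pod with the official Kubernetes Python client. The image name, command, and namespace are placeholders, not the platform’s actual configuration.

```python
# Minimal sketch: launching a single-GPU training pod on Kubernetes.
# The image name, command, and namespace below are placeholders.
from kubernetes import client, config


def launch_training_pod():
    config.load_kube_config()  # use load_incluster_config() inside the cluster
    container = client.V1Container(
        name="trainer",
        image="registry.example.com/ml/tensorflow:latest",  # shared, pre-built framework image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1", "cpu": "4", "memory": "16Gi"}
        ),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-train-demo", labels={"app": "training"}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="ml-platform", body=pod)


if __name__ == "__main__":
    launch_training_pod()
```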



Machine learning platform architecture: K8S

To make homomorphic parameter tuning efficient, the platform performs semantic analysis on the user’s Python code to automatically identify which parameters may need tuning. The user only needs to set a range and step size to automatically obtain the full set of parameterized training tasks and their final results.
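A minimal sketch of that idea: once the tunable parameters are identified, only ranges and step sizes are needed to expand a full grid of independent training tasks. The `submit_training_task` function below is a hypothetical platform API, not the real interface.

```python
# Sketch of expanding a homomorphic parameter sweep into independent tasks.
# Only parameter values change between tasks; the training code is identical.
import itertools


def expand_grid(param_ranges):
    """param_ranges: {name: (start, stop, step)} -> list of parameter dicts."""
    names, value_lists = [], []
    for name, (start, stop, step) in param_ranges.items():
        values, v = [], start
        while v <= stop + 1e-12:
            values.append(round(v, 10))
            v += step
        names.append(name)
        value_lists.append(values)
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def submit_sweep(submit_training_task, code_path, param_ranges):
    """submit_training_task is a hypothetical platform API that queues one task."""
    for params in expand_grid(param_ranges):
        submit_training_task(code=code_path, params=params)


# Example: 3 learning rates x 3 dropout values expand into 9 independent tasks.
grid = expand_grid({"learning_rate": (0.001, 0.01, 0.0045),
                    "dropout": (0.1, 0.5, 0.2)})
```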

To make multi-machine, multi-card training efficient, the platform developed its own Didi parameter server, designed around its hardware characteristics and communication patterns. The Didi parameter server uses a ring structure to implement an efficient RDMA-based Allreduce algorithm.

The ring structure, rather than a centralized server-client mode, eliminates bandwidth competition and network congestion. Rewriting the RDMA library avoids the performance loss of the user-mode Verbs library provided by the device manufacturer; the rewritten library implements the SIG/READ and POST/RECV modes, avoids RDMA’s inherent communication overhead, and fully exploits the hardware’s properties to improve performance.

In addition, the self-developed Allreduce algorithm carefully overlaps computation and transmission, minimizes unnecessary memory copies to reduce overhead, and takes full account of hardware attributes such as GPU topology and CPU affinity to improve performance.
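Conceptually, ring Allreduce splits each gradient into one chunk per worker and passes chunks around the ring in two phases (reduce-scatter, then allgather), so every node talks only to its neighbors and bandwidth use stays balanced. The NumPy sketch below shows only this data flow; it omits the RDMA transport, the computation/transfer overlap, and the topology awareness described above.

```python
# Conceptual sketch of ring Allreduce data flow (single process, NumPy only).
import numpy as np


def ring_allreduce(node_grads):
    n = len(node_grads)  # number of workers in the ring
    chunks = [np.array_split(g.astype(float), n) for g in node_grads]

    # Phase 1: reduce-scatter. After n-1 steps, each node holds the full sum
    # of exactly one chunk.
    for step in range(n - 1):
        sends = [(node, (node - step) % n, chunks[node][(node - step) % n].copy())
                 for node in range(n)]
        for node, idx, data in sends:
            chunks[(node + 1) % n][idx] += data

    # Phase 2: allgather. Fully reduced chunks are forwarded around the ring.
    for step in range(n - 1):
        sends = [(node, (node + 1 - step) % n, chunks[node][(node + 1 - step) % n].copy())
                 for node in range(n)]
        for node, idx, data in sends:
            chunks[(node + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]


# Example: three workers, each with its own gradient vector of length 6;
# afterwards every worker holds the elementwise sum.
grads = [np.arange(6, dtype=float) * (i + 1) for i in range(3)]
reduced = ring_allreduce(grads)
```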

In actual tests on the data center’s 40G RoCE v2 RDMA network, the Didi parameter server showed clear advantages over the industry’s OpenMPI and Nvidia’s NCCL2.



For model service deployment and optimization, the platform developed the DDL (DiDi Deep Learning) Serving framework, the IFX framework, and an Autotuning optimization library based on its own scenarios, which greatly accelerate model deployment and optimization.

DDL Serving features an original adaptive batch mechanism, optimizes the RPC protocol, and addresses shortcomings of TensorFlow Serving. Compared with TensorFlow Serving, DDL Serving delivers the following speedups:





With DDL Serving, the serving framework itself is no longer the bottleneck in the serving link. For some lightweight models there is roughly a threefold improvement in both RT and QPS, while for typical models the performance hotspot moves down to the deep learning framework layer.
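As an illustration of the adaptive-batch idea, the sketch below merges queued requests into one forward pass, closing the batch either when it is full or when the oldest request is about to exhaust its latency budget. The `model.infer`, `req.payload`, and `req.reply` names are hypothetical stand-ins, not DDL Serving’s actual interface.

```python
# Conceptual sketch of adaptive batching in a serving loop.
import queue
import time


def serving_loop(request_queue, model, max_batch_size=32, max_wait_ms=5.0):
    while True:
        first = request_queue.get()  # block until the first request arrives
        batch = [first]
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch_size:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break  # latency budget of the oldest request is nearly spent
            try:
                batch.append(request_queue.get(timeout=timeout))
            except queue.Empty:
                break
        inputs = [req.payload for req in batch]   # hypothetical request field
        outputs = model.infer(inputs)             # one fused forward pass
        for req, out in zip(batch, outputs):
            req.reply(out)                        # hypothetical response callback
```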

Therefore, at the framework layer we independently developed the deep learning framework IFX, which runs on both GPU servers and mobile platforms. On GPU servers, CUDA’s context management causes problems, so we designed and implemented a concurrency mechanism on the GPU that effectively avoids the resulting overhead, and we optimized a large number of OPs. As a result, IFX’s performance is much higher than TensorFlow’s and TensorRT’s.

On mobile, IFX rearranges instructions and optimizes memory according to the hardware configuration of the device, such as pipeline length, out-of-order execution, and superscalar width. Combined with the computational characteristics of the business, IFX achieves remarkable performance:



During IFX optimization, a large amount of repeated work went into tuning BLAS computations. Because hardware architectures differ, different models have different amounts of computation, different compute-to-memory-access ratios, and different access patterns; under extreme performance requirements, optimization has to be targeted at these specifics. These optimizations are low-level and the tuning is fairly tedious, and upper-layer service users should not have to worry about such details.

To solve this problem, the platform developed the Autotuning tool chain, which includes native assemblers for the Kepler, Pascal, and Volta architectures.

For users, it is only necessary to submit the binary code to be run on the GPU, and the platform generates binary code that is nearly optimal for that GPU platform, that is, the highest-performance optimized binary currently achievable.
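Conceptually, the Autotuning flow enumerates candidate kernel configurations for the target GPU architecture, benchmarks each candidate on the actual problem shape, and keeps the fastest binary. A highly simplified sketch, where `generate_candidates`, `assemble`, and `benchmark` are placeholders standing in for the native assembler and profiling tools described above:

```python
# Highly simplified sketch of an autotuning search over GPU kernel variants.
def autotune(kernel_source, problem_shape, gpu_arch,
             generate_candidates, assemble, benchmark):
    """generate_candidates/assemble/benchmark stand in for the real toolchain."""
    best_binary, best_time = None, float("inf")
    for params in generate_candidates(gpu_arch):      # tile sizes, unrolling, etc.
        binary = assemble(kernel_source, params, arch=gpu_arch)
        elapsed = benchmark(binary, problem_shape)    # measured on the target GPU
        if elapsed < best_time:
            best_binary, best_time = binary, elapsed
    return best_binary, best_time
```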

Apart from NVIDIA itself, the Didi machine learning platform team supports the most versions of native assemblers for NVIDIA GPUs and has one of the deepest understandings of NVIDIA GPU microarchitecture.



These “duplication problems” arise along with centralization and platformization, and it is only in a platform environment that solving them makes sense.

The second benefit of centralization and platformization is that, on this basis, common requirements gradually precipitate into platform services.

Take similarity retrieval as an example: map POI optimization, face retrieval, and video and image content retrieval all share this need, so the platform can gather enough business information to develop it into a platform-level service. In the workshop era it was hard to see such cross-business needs and spontaneously distill them into platform services; each team mostly swept the snow from its own doorstep.

Machine learning platform 2.1: Shaping an internal and external cloud platform

The second impact of centralized production is that, as the platform’s capabilities grow and the algorithms it incubates become richer, and as Didi’s internal data, AI engineering, and algorithms gradually accumulate and mature, the functions and positioning of the machine learning platform become more diversified.

In addition to serving Didi’s internal machine learning users well and further consolidating its resource scheduling, task management, and monitoring and operations capabilities, the platform began to take on the job of exporting internal capabilities. During this period, the machine learning platform worked with Didi Cloud to build solutions on the public cloud ranging from underlying resources to the upper platform layer, and from the public cloud to the private cloud.

Centralized production within machine learning also laid the groundwork for exporting the capabilities of Didi’s machine learning platform, but the technical and product requirements of external customers are considerably more complex.

This complexity shows up first in the multiple levels of product requirements: some customers want resources or even hardware directly, some want specific services, and some want platform capabilities in a private cloud, for example. Second, the product factors are multi-dimensional: the cost-effectiveness of resources is only one aspect; security, stability, and the ability to integrate with other infrastructure also influence users’ decisions. Finally, there is horizontal comparison against competitors’ products.

All of these are problems Didi’s machine learning platform has encountered in external services, and they cannot be solved in a single battle; they are being solved in stages and steps, each with its own emphasis.

The first step is to solve the basic problems: how to expose capabilities, how to guarantee customer security, and, on top of those two, how to minimize the repeated work of external users (user cost) and the repeated work of the Didi machine learning platform itself (product cost-effectiveness).

GPU Resources: Reduce repetitive work on resources

Compared with internal users, external users need a securely isolated environment. Docker’s weak isolation alone cannot provide that, so the GPU cloud resources on Didi Cloud use KVM and GPU passthrough to pass GPU resources directly through to users.

The technical team of the Didi machine learning platform is very experienced with GPUs; its members were among the first in the industry to try them, and they have accumulated rich knowledge and experience in GPU usage that has proven effective within Didi. They pay particular attention to GPU resources, topology, and related support, so with the same GPU model users often get better performance, as shown in the figure below. Distilling this knowledge also reduces the repeated work external users would otherwise do exploring how to use GPUs, and lowers the hidden cost of usage.



Elastic Inference Service (EIS): Reduce repeated work in service deployment and optimization

All algorithm models ultimately need to serve production. There are many PAML platforms abroad that can deploy machine learning model services, and the machine learning platform likewise provides a model deployment service, EIS (Elastic Inference Service), on Didi Cloud.

The EIS service is rooted in the DDL Serving service we use internally, but because we insisted on certain principles for the cloud version, you might wonder whether we “got up early but arrived at the market late” with Serving.

In fact, it is not too late for EIS, grown out of DDL inside Didi, to appear. The market for this kind of service is only in its infancy; product differentiation and diversification are an inevitable trend, and users will have better and broader choices.

At present, most vendors offering PA services in the market have their own characteristics, but on the whole they still position the product merely as a complement to their resource products, focused on solving resource and deployment problems for users. This supporting role has its advantages, mainly:

  • The model is simple: the service is converted into the smallest granularity of resource cost and billed by the smallest unit of resource consumption.

  • The demands on infrastructure capability are reduced; everything simplifies to resource cost, so in essence it is just one more way of selling resources.

  • The service vendor has minimal work to do. Users can choose among multiple resource types, each with its own theoretical computing power, but how well users exploit those resources is their own business.

The problem with this model is that although the service provider solves part of the problem for the customer, it still does not consider the user’s actual service deployment. Why?

The reason was already mentioned in the description of DDL: deploying a model service requires users to optimize the service themselves to meet RT and QPS requirements. Beyond that, the cost of using cloud services is something users almost inevitably weigh carefully.

From this point of view, the disadvantages of the PA providers’ resource-oriented, service-as-a-complement model are also obvious:

  • For model services, the smallest resource granularity is still relatively coarse; with GPUs the problem is even more prominent.

  • The theoretical computing power of a resource is, for the user, often only a number on paper. Constrained by hardware limits and by the customer’s own technical ability, customers usually cannot fully exploit the computing power of the PA vendor’s resources, and utilization is generally limited; the gap between actual usage and the nominal, theoretical figure is paid for by the user. Worse, users still face two kinds of duplicated work: duplicated optimization of resource usage, and duplicated operations and maintenance work for service deployment.

Based on the experience of our internal users and some external users, the core technical metrics of a service are QPS and RT, and therefore the deployment cost and usage cost incurred while meeting those two metrics. These costs must be reduced by minimizing users’ repeated work and by “selling by actual usage”. Besides the HA and operations support required by any service deployment, EIS focuses its technical architecture on solving these two problems.

Starting with RT: the RT of a user’s service is bounded by the network link overhead and the actual forward-computation overhead. To reduce the network overhead, Didi Cloud spent a great deal of effort implementing a pure public-cloud Gateway: on the one hand it supports user-defined operations such as authentication, and on the other it minimizes network hops to reduce network overhead and guarantee the RT of the user’s service.

In terms of QPS, EIS uses DDL Serving as its service engine framework. With DDL Serving, users can ignore the details of the underlying hardware and avoid repeating known optimization work at the service-framework level. This also creates the conditions for “selling by actual usage”. The following architecture diagram illustrates this:

To achieve “selling by actual usage”, another crucial link is knowing the actual computing demand of the user’s model and its compute utilization on given hardware.

We developed an automated pressure-testing module: the user provides the model and deployment inputs, and the module measures computational performance with DDL Serving on specific hardware and further regresses the QPS capacity under a given RT requirement.

For users, this means they can convert the total QPS their service needs and scale horizontally according to QPS, so they bear only the cost of the resources their computation actually consumes. This is far finer-grained resource control than the previous model.
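A simplified sketch of that sizing logic, where `run_load_test` stands in for the automated pressure-test module described above: measure the per-instance QPS that still meets the RT target, then derive the replica count from the service’s total QPS requirement.

```python
# Simplified capacity sizing from pressure-test results.
import math


def size_deployment(run_load_test, model, hardware, rt_target_ms, total_qps):
    """run_load_test is a stand-in for the automated pressure-test module: it
    returns the highest per-instance QPS at which the RT target is still met."""
    per_instance_qps = run_load_test(model, hardware, rt_target_ms)
    replicas = math.ceil(total_qps / per_instance_qps)
    return {"per_instance_qps": per_instance_qps, "replicas": replicas}


# Example: if one instance sustains 120 QPS within a 50 ms RT target and the
# service needs 1000 QPS overall, ceil(1000 / 120) = 9 replicas are required.
```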

As for reducing users’ repeated optimization work: as mentioned earlier, beyond optimization of the service framework, part of the effort goes into optimizing computational performance, which depends on the computational characteristics of the program and the associated hardware, and every model has its own characteristics.

For this part, EIS also provides an Autotuning optimization service. The user only needs to provide the binary code; after the Autotuning service runs, performance code that is nearly optimal for the given model, framework, and specified hardware is generated.

Besides cutting out repetitive, low-level, and tedious optimization work, the Autotuning service also improves the RT of the user’s model service and reduces the actual resource consumption per unit of QPS.

At present, EIS already serves a large number of Didi’s internal businesses. Its full set of functional modules is shown below. Due to certain restrictions, external customers currently access Didi Cloud’s EIS service by submitting work orders; a self-service mode will be launched soon.



Jianshu: Reduce users’ repeated work in platform construction

Like EIS, the platform-level product draws on rich internal platform experience. Based on this, the machine learning platform developed a platform-level product, Jianshu, on Didi Cloud.

Jianshu packages a variety of platform capabilities: weakly isolated resource management, management of multiple task types, monitoring and alerting, rapid deployment of online services, and so on. It can help other companies avoid pitfalls in building their own platforms, quickly acquire platform capabilities, and improve production efficiency.



Future

For machine learning, computing power is still the most revolutionary force: just as the GPU drove the wave of deep learning that began in 2011, computing power will remain the constraint on engineering in the future.

As Jeff Dean puts it, “It turns out that what we really need is a million times more computing power, not just a few dozen times more.” So for the platform, how to manage the explosive growth of computing power, how to release it effectively, and how to keep it under control are questions the platform must keep exploring and practicing, and for which it must keep upgrading its technology.

The vitality of any platform comes from improving overall production efficiency and lowering overall cost. For the Didi machine learning platform, the primary internal goal is to use the latest machine learning, deep learning, and reinforcement learning technologies to improve Didi’s overall efficiency and control its costs, while preserving the vitality of innovation.

Externally, we adhere to the principle of continuously creating value for customers, deepening the functionality, quality, and cost advantages of the cloud platform products to deliver inexpensive, high-quality technical products to customers.

Machine learning platform 3.0

Specifically, the Didi machine learning platform aims to reach the 3.0 stage: unifying the whole stack, from hardware selection through infrastructure to the upper software layers, so that internal and external architectures are unified and repeated work on both sides is reduced.

At the same time, we will tackle AI problems of efficiency and scale from both directions: the platform will provide richer functions, such as an algorithm marketplace, model and data markets, and GUI interfaces, to help users apply various learning technologies more efficiently, and it will continue to distill more concrete services, such as face alignment, speech recognition, and translation.

If you are interested in Didi Cloud’s GPU cloud hosts, Elastic Inference Service (EIS), machine learning platform, or other products and technical solutions, please visit the official Didi Cloud website.

END