Author | Liu Dongyang, Wu Ziyang
At the end of 2018, the vivo AI Research Institute began building an AI computing platform to address pain points around a unified high-performance training environment, large-scale distributed training, and efficient utilization and scheduling of computing resources. After more than two years of continuous iteration, the platform has made great progress in construction and adoption, becoming a core foundational platform for vivo's AI work. It has evolved from deep learning training into VTraining, VServing, and VContainer, providing model training, model inference, and container capabilities. The platform's container cluster has thousands of nodes and hundreds of PFLOPS of GPU computing power, and runs thousands of training tasks and hundreds of online services simultaneously. This article is one of a series on vivo AI computing platform practice, and mainly shares the platform's hybrid cloud construction practice.
Background
Hybrid cloud has been one of the new focus areas in the cloud native space in recent years. It refers to solutions that combine private and public cloud services. Several major public cloud vendors now provide their own hybrid cloud solutions, such as AWS Outposts, Google Anthos, and Alibaba Cloud ACK hybrid cloud. Most vendors use Kubernetes and containers to mask differences in the underlying infrastructure and provide unified services. The AI computing platform chose to build a hybrid cloud mainly for the following two reasons.
Elastic resources of the public cloud
The platform's cluster uses bare-metal servers in the company's self-built data centers. The procurement process for new resources is complicated and slow, and cannot respond in time to sudden, large computing power demands from the business, such as training models with very large parameter counts or scaling out online services for holiday events. In addition, given this year's severe server supply chain situation, hardware such as network cards, hard disks, and GPU cards is in short supply, so server procurement and delivery carry great risk. Public cloud resources can be applied for and released on demand, so a hybrid cloud can use them to meet temporary computing demands and effectively reduce costs.
Advanced features of the public cloud
The public cloud offers advanced features such as the AI high-performance storage CPFS, high-performance RDMA networking, and the deep learning acceleration engine AIACC. The company's private cloud currently lacks these solutions, and replicating them in-house would be very expensive in both time and money. A hybrid cloud can adopt these features quickly and at low cost.
Solution
Solution selection
Preliminary investigation identified three candidate solutions that could meet the hybrid cloud requirements.
Solution 1 has a low implementation cost, does not change the current resource application process, and can be implemented quickly; services can be scaled by the hour. We therefore chose Solution 1.
Overall architecture
The overall architecture of the hybrid cloud is shown in the figure below. The management plane of the K8s cluster is deployed in the company's self-built data center, while the worker plane includes both physical machines in the data center and cloud hosts on Alibaba Cloud. The data center and Alibaba Cloud are connected by a dedicated line, so physical machines and cloud hosts can reach each other. The solution is transparent to the upper-layer platforms; for example, the VTraining platform can use the computing power of cloud hosts without any changes.
Implementation practice
Registering the cluster
First, register the self-built cluster with Alibaba Cloud. Note that the VPC CIDR must not conflict with the cluster's Service CIDR, otherwise the cluster cannot be registered; likewise, the CIDRs of the vSwitch and Pod vSwitch in the VPC must not overlap with network segments already used in the data center, otherwise routing conflicts may occur. After registration succeeds, deploy the ACK Agent. Its role is to proactively establish a persistent connection from the data center to Alibaba Cloud, receive requests from the console, and forward them to the apiserver. Without a dedicated line, this mechanism avoids exposing the apiserver to the public network. The link between the console and the apiserver is as follows:
Alibaba Cloud ACK console <--> ACK Stub (deployed on Alibaba Cloud) <--> ACK Agent (deployed in the K8s cluster) <--> K8s apiserver
Requests from the console to the cluster are secure and controllable: when the Agent connects to the Stub, it carries the configured token and certificate, and the link is encrypted with TLS 1.2. The console's access to the K8s cluster can be restricted through a ClusterRole.
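For illustration, console access could be limited to read-only with a ClusterRole along these lines (the role name and resource list here are assumptions for the sketch, not ACK's actual defaults; it would be bound to the console's identity with a ClusterRoleBinding):

```yaml
# Minimal read-only ClusterRole sketch; names and resources are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: console-readonly
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "nodes", "deployments", "daemonsets"]
    verbs: ["get", "list", "watch"]   # no create/update/delete from the console
```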
Container network configuration
The container network of K8s must allow Pod-to-Pod and Pod-to-host communication. The platform adopts a Calico + Terway scheme. Worker nodes in the data center use Calico BGP, and Route Reflectors synchronize Pod routing information to the switches, so both physical machines and cloud hosts can reach Pod IPs. Worker nodes on Alibaba Cloud use Terway in shared-ENI mode; Pods are assigned IPs from the network segment of the Pod vSwitch and are reachable from the data center. The platform labels the cloud hosts and configures the nodeAffinity of the calico-node component so that it is not scheduled onto cloud hosts, and configures the Terway component's nodeAffinity so that it runs only on cloud hosts. This lets physical machines and cloud hosts use different network components. While deploying and using Terway, we encountered and resolved the following three issues:
1. The Terway container fails to be created, reporting that the /opt/cni/bin directory does not exist.
The fix is to change the hostPath type of this path in the Terway DaemonSet from Directory to DirectoryOrCreate.
2. Business containers fail to be created, reporting that the loopback plugin cannot be found.
Unlike calico-node, Terway does not deploy the loopback plugin (which creates the loopback network interface) into /opt/cni/bin/. We solved this by adding an initContainer to the Terway DaemonSet to install the loopback plugin.
3. The IP address assigned to a business container belongs to the network segment of the host vSwitch.
This happened because we added a new availability zone during operation but did not add that zone's Pod vSwitch information to the Terway configuration. The fix is to add the Pod vSwitch of the new availability zone to the vswitches field of the Terway configuration.
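Pulling the pieces together, the nodeAffinity split and the three fixes above might look roughly like the following sketch (the node label node-type=cloud-host, the image names, and the zone names and vSwitch IDs are all illustrative assumptions, not the platform's actual values):

```yaml
# Sketch of a Terway DaemonSet incorporating the fixes described above.
# Labels, images, zones, and vSwitch IDs are illustrative only.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: terway
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: terway
  template:
    metadata:
      labels:
        app: terway
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type            # assumed cloud-host label
                    operator: In
                    values: ["cloud-host"]    # Terway only on cloud hosts;
                                              # calico-node would use NotIn
      initContainers:
        - name: install-loopback              # fix 2: install loopback plugin
          image: example.com/cni-plugins:v1.0 # hypothetical plugin image
          command: ["sh", "-c", "cp /plugins/loopback /host/opt/cni/bin/"]
          volumeMounts:
            - name: cni-bin
              mountPath: /host/opt/cni/bin
      containers:
        - name: terway
          image: example.com/terway:v1.0      # hypothetical image
          volumeMounts:
            - name: cni-bin
              mountPath: /opt/cni/bin
      volumes:
        - name: cni-bin
          hostPath:
            path: /opt/cni/bin
            type: DirectoryOrCreate           # fix 1: was Directory
---
# fix 3: list the Pod vSwitch of every availability zone in Terway's config
apiVersion: v1
kind: ConfigMap
metadata:
  name: eni-config
  namespace: kube-system
data:
  eni_conf: |
    {
      "vswitches": {
        "zone-a": ["vsw-aaaaaaaa"],
        "zone-b": ["vsw-bbbbbbbb"]
      }
    }
```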
Adding cloud hosts to the cluster
The process of adding a cloud host to the cluster is the same as for a physical server: apply for the cloud hosts on the enterprise cloud platform, initialize them on VContainer's automation platform, and add them to the cluster. Finally, give the cloud hosts a dedicated cloud-host label. For an introduction to the automation platform, see "vivo AI Computing Platform: Cloud-Native Automation Practice".
Reducing pressure on the dedicated line
The dedicated line between the data center and Alibaba Cloud is shared by all of the company's services; if the platform takes too much of its bandwidth, the stability of other services will be affected. During rollout, we found that deep learning training tasks pulling data from the storage cluster in the data center did put pressure on the dedicated line, so the platform took the following measures:
1. Monitor the network usage of the cloud hosts; the network team helps monitor the impact on the dedicated line.
2. Use the tc tool to limit the inbound bandwidth of the eth0 network adapter on the cloud hosts.
3. Help services use the cloud hosts' data disks to preload training data, avoiding repeated fetches from the data center.
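As an illustration of the tc measure, inbound traffic on eth0 can be policed roughly as follows (the 2gbit rate is a placeholder, not the platform's actual limit; the real value would depend on the dedicated line's capacity). The sketch only prints the commands so they can be reviewed before being run with root privileges:

```shell
#!/bin/sh
# Sketch: police inbound (ingress) traffic on eth0 with tc.
# The rate and burst values below are placeholders.
DEV="eth0"
RATE="2gbit"
BURST="1mbit"

# Attach an ingress qdisc, then a policing filter that drops packets
# exceeding the rate. Printed for review; run the printed commands
# as root to apply them on the cloud host.
echo "tc qdisc add dev $DEV handle ffff: ingress"
echo "tc filter add dev $DEV parent ffff: protocol ip prio 1 u32 match u32 0 0 police rate $RATE burst $BURST drop flowid :1"
```

Note that a plain tbf qdisc on the root of eth0 would shape outbound traffic only; the ingress qdisc plus a police filter is one common way to cap the download direction, which is what matters when training data is pulled from the data center.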
Results
Several business teams temporarily needed large amounts of computing power to train deep learning models. With the hybrid cloud capability, the platform added dozens of GPU cloud hosts to the cluster and provided them to users through the VTraining platform, meeting the computing power demand in time. The user experience was exactly the same as before. Depending on the service, these resources were used for periods ranging from one to several months. By our estimates, the cost was far lower than purchasing physical machines, effectively reducing costs.
Future work
The construction and rollout of the hybrid cloud has achieved phased results. Going forward, we will continue to improve its mechanisms and explore new features:
1. Use the hybrid cloud capability to deploy AI online services on cloud hosts, meeting their temporary computing power demands.
2. Establish a simple, effective process for resource application, release, and renewal to improve cross-team communication and collaboration.
3. Measure and assess the cost and utilization of cloud hosts to encourage business teams to use resources well.
4. Automate the entire process of applying for cloud hosts and adding them to the cluster, reducing manual operations and improving efficiency.
5. Explore advanced features on the cloud to improve the performance of large-scale distributed training.
Acknowledgements
Thanks to Huaxang, Jianming, Liusheng, and others on the Alibaba Cloud container team, and to Yang Xin, Huang Haiting, Wang Wei, and others in the company's Basic Platform Department, for their strong support during the design and implementation of the hybrid cloud solution.
About the author:
Liu Dongyang is a senior engineer in the Computing Platform Group of the vivo AI Research Institute. He previously worked at Kingdee and Ant Financial, and focuses on cloud native technologies such as K8s and containers.
Wu Ziyang is a senior engineer in the Computing Platform Group of the vivo AI Research Institute. He previously worked at Oracle, Rancher, and other companies, is a contributor to Kube-batch, TF-Operator, and other projects, and focuses on cloud native technology and machine learning systems.
Related links in the article:
1) Vivo AI computing platform: www.infoq.cn/theme/93
2) AWS Outposts: aws.amazon.com/cn/outposts…
3) Google Anthos: cloud.google.com/anthos
4) ACK hybrid cloud: help.aliyun.com/document_de…
5) AI high-performance storage CPFS: www.alibabacloud.com/help/zh/doc…
6) Deep learning acceleration engine AIACC: market.aliyun.com/products/57…
7) Vivo AI computing platform cloud native automation practice: www.infoq.cn/article/9vB…