Abstract:

Author: Ying Hui, Senior Technical Expert, Alibaba Computing Platform

As Alibaba's main computing platform, MaxCompute once again lived up to expectations during the 2018 Double 11, withstanding the test of massive data volumes and high concurrency while providing strong computing power to the group's various business lines, true to its role over the years as the computing powerhouse behind Alibaba's Double 11.

This article describes how MaxCompute, running on tens of thousands of servers deployed across multiple clusters, safeguards the group's rapidly growing services.

Challenges

Double 11 is the natural deadline for MaxCompute projects. In the run-up to this year's Double 11, the northward migration and online/offline co-location projects moved the Hangzhou clusters, except Ant Financial's, to Zhangbei. This involved most business projects, their stored data, and their computing tasks, adding to the challenge of guaranteeing data computing services for this year's Double 11.

Scale

MaxCompute now runs on tens of thousands of servers, including the co-located online/offline clusters. Total data storage is at the EB level, nearly a million jobs run every day, and the total data processed by all jobs each day is in the hundreds of PB. The clusters are distributed across three geographical regions connected by long-haul transmission links. Because the group's data businesses are inherently interconnected, there are constant large-volume data dependencies between clusters and heavy demands on inter-cluster bandwidth.

Cost

A large number of servers means high cost. To reduce cost, we need to make full use of each cluster's computing and storage capacity and improve resource utilization. At the same time, different businesses have different characteristics: some are storage-heavy and compute-light, some are compute-heavy and storage-light, some run I/O-intensive large-scale ETL, and some run CPU-intensive machine learning and scientific computing.

Making full use of each cluster's capacity; improving CPU, memory, I/O, and storage utilization; balancing load across clusters; managing the pressure on long-haul bandwidth between regions; keeping operation stable under ultra-high resource utilization; and, on top of all that, supporting a change of the magnitude of the Hangzhou relocation: for MaxCompute, these challenges are not a one-off battle fought for the Double 11 push, but daily routine.

The following sections describe, from several perspectives, what MaxCompute has done to address these challenges.

Cluster migration

This year's northward migration and online/offline co-location projects, which moved the Hangzhou clusters to Zhangbei, also involved migrating MaxCompute's control clusters and computing clusters. A physical move of this scale brings its own problems and challenges to MaxCompute's service guarantees.

Transparent Project cluster migration

Many users may have run into the "AllDenied" error when a Project was being migrated to another cluster. In the past, migrating a Project affected its users: both users and operations engineers had to be notified before the operation, which caused considerable disruption. This year, MaxCompute made it possible for jobs to keep running and committing normally while their Project is being migrated, making the migration transparent to users.

Lightweight migration

Because clusters host different businesses, their compute-to-storage ratios are unbalanced. A normal migration requires the target cluster to have enough of both storage and computing capacity, so we would encounter clusters whose storage water level was already high but whose computing capacity sat underused, and a large Project still could not be migrated there.

The lightweight migration mechanism launched this year migrates only the computing workload and some hot data to the new cluster, leaving cold data in the original cluster. This balances computing resources without introducing too many cross-cluster reads and writes.
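As an illustration only, the sketch below shows the kind of hot/cold split such a mechanism implies. The recency threshold and partition metadata shape are assumptions, not MaxCompute's actual policy.

```python
from datetime import datetime, timedelta

# Hypothetical partition metadata: (partition_name, last_access_time, size_bytes).
# Illustrates splitting "hot" data (migrated with the compute) from "cold" data
# (left in the source cluster and read cross-cluster when needed).
HOT_WINDOW = timedelta(days=14)  # assumed recency threshold

def plan_lightweight_migration(partitions, now=None):
    """Return (migrate, keep): hot partitions move with the Project's compute,
    cold partitions stay behind in the original cluster."""
    now = now or datetime.now()
    migrate, keep = [], []
    for name, last_access, size in partitions:
        (migrate if now - last_access <= HOT_WINDOW else keep).append((name, size))
    return migrate, keep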

The OTS that could not be moved

MaxCompute stores its core metadata in OTS, so any OTS failure affects the entire MaxCompute service. Worse, for a long time MaxCompute's dependence on OTS did not support active/standby switchover, making the OTS cluster the one component MaxCompute could not move.

This year, as part of the northward migration planning, we carefully prepared and verified an OTS switchover scheme and sorted out the dependencies between the control services and the OTS cluster. The goal was not just an OTS active/standby switchover, but a direct cut-over from Hangzhou to Zhangbei.

After further optimization and drills, we shortened the switchover time from the originally planned minutes to a few seconds, and carried it out successfully in the public cloud production environment. The actual switchover produced no abnormal feedback and was imperceptible to users.
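The switchover procedure itself has not been published; as a toy sequence only, the sketch below shows why the user-visible pause can be kept to seconds: writes are fenced, the standby catches up, and the metadata endpoint is flipped. FakeOTS and all internals here are hypothetical stand-ins.

```python
import time

class FakeOTS:
    """Toy stand-in for an OTS cluster; the real internals are not public."""
    def __init__(self, lag=0.0):
        self.lag = lag
        self.writes_fenced = False
    def replication_lag(self):
        self.lag = max(0.0, self.lag - 0.2)   # pretend the standby catches up
        return self.lag

def switch_ots(primary, standby, timeout_s=10.0):
    """Minimal switchover sketch: fence writes, wait for catch-up, flip."""
    primary.writes_fenced = True              # queue new metadata writes
    deadline = time.time() + timeout_s
    while standby.replication_lag() > 0:      # wait until standby is current
        if time.time() > deadline:
            primary.writes_fenced = False     # abort, resume on old primary
            raise TimeoutError("standby did not catch up; switchover aborted")
        time.sleep(0.05)
    standby.writes_fenced = False             # new primary accepts writes
    return standby                            # control services now point here

new_primary = switch_ots(FakeOTS(), FakeOTS(lag=1.0))
```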

Being able to replace the most critical single point in the MaxCompute service without damage greatly reduces the global risk of the entire service.

Cross-cluster scheduling

Various global job scheduling mechanisms

Due to factors such as job types and business characteristics, computing resources can end up unevenly used across clusters. For example, different businesses peak at different times of day and hold resources for different durations; clusters running task types that request large blocks of resources leave gaps that can be oversold and backfilled with small jobs; and there are even special cases where resources need to be borrowed temporarily.

For this purpose, MaxCompute provides several global job scheduling mechanisms. A batch of jobs can be scheduled to run in a specified cluster, or, when a job's home cluster is busy, the system can automatically schedule it onto the resources of an idle cluster.
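A minimal sketch of this fall-over idea, with an assumed utilization threshold and made-up cluster names:

```python
# Jobs default to their home cluster (for data locality) but fall over to the
# least-loaded cluster when the home cluster is busy. The threshold and
# cluster names are illustrative assumptions, not MaxCompute's actual policy.
BUSY_THRESHOLD = 0.9  # assumed CPU-utilization cutoff

def pick_cluster(job_home, utilization):
    """utilization: dict of cluster name -> current CPU utilization (0.0-1.0)."""
    if utilization[job_home] < BUSY_THRESHOLD:
        return job_home                        # prefer data locality
    return min(utilization, key=utilization.get)  # fall over to most idle

print(pick_cluster("AY1", {"AY1": 0.97, "ZB1": 0.55, "SH1": 0.80}))  # -> ZB1
```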

Besides balancing resource utilization, these mechanisms also leave room for manual intervention, and a scheduling mechanism that works together with data layout to make decisions based on real-time cluster state is under development.

Topology-aware, data-driven bridgehead

A job that needs table data from another cluster can either read directly from the remote cluster (direct read) or wait for the remote data to be copied to the local cluster first (wait-for-replication). Each approach has its own advantages, disadvantages, and applicable scenarios. The network topology between clusters also affects the choice: long-haul inter-region bandwidth is expensive and scarce, while same-city, same-core-network bandwidth has relatively large capacity, although under big-data traffic the peaks coincide and congestion can still occur. We therefore need to exploit the bandwidth advantage within a city without shifting the bottleneck onto the same-city links, which requires a globally planned strategy.
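The trade-off can be pictured as a small decision function. The thresholds and link classes below are illustrative assumptions, not the actual policy:

```python
# Sketch of the direct-read vs. wait-for-replication choice described above.
def access_plan(link, data_gb, expected_readers):
    """link: 'long_haul' (inter-region) or 'same_city' (intra-region)."""
    if link == "long_haul":
        # Long-haul bandwidth is expensive and scarce: pay the copy cost once
        # unless the data is small and read only once.
        if expected_readers == 1 and data_gb < 10:
            return "direct_read"
        return "replicate"
    # Same-city links are wide but congest at shared peaks: direct-read by
    # default, replicate only when heavy fan-out would multiply the traffic.
    return "replicate" if expected_readers > 10 else "direct_read"

print(access_plan("long_haul", data_gb=500, expected_readers=3))  # replicate
```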

Because the business changes every day and data dependencies change with it, we use analysis of historical tasks to continuously optimize and update the replication strategy. In each region we select a bridgehead cluster to receive the long-haul replication, and within the region we then use chain replication or short-distance direct reads. With Bridgehead 2.0 we cut data replication traffic between two of the regions by more than 30%.
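A sketch of the bridgehead pattern under assumed region and cluster names: the data crosses the long-haul link exactly once, into the designated bridgehead, then fans out inside the region over cheaper links.

```python
# Region topology and per-region bridgehead choices are illustrative only.
REGIONS = {"east": ["E1", "E2", "E3"], "north": ["N1", "N2"]}
BRIDGEHEAD = {"east": "E1", "north": "N1"}

def replication_plan(src_cluster, dst_region):
    """Return the ordered list of copy hops for one cross-region table."""
    head = BRIDGEHEAD[dst_region]
    hops = [(src_cluster, head)]              # single long-haul transfer
    prev = head
    for c in REGIONS[dst_region]:             # chain replication inside region
        if c != head:
            hops.append((prev, c))
            prev = c
    return hops

print(replication_plan("E1", "north"))  # [('E1', 'N1'), ('N1', 'N2')]
```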

New machine models, new problems

Just as each emperor brings his own courtiers, each generation of machines brings its own generation of bottlenecks.

Today a MaxCompute cluster is still capped at the standard 10,000 machines, but today's 10,000 machines are not the 10,000 machines of a few years ago. CPU core counts per machine have gone from 24 and 32 cores to 96 cores in the new clusters, tripling the cores per machine. Yet no matter how many cores a machine has, the CPUs in MaxCompute clusters still run at full capacity for several hours every day, and overall average daily CPU utilization reaches 65%.

Beyond CPU utilization there is also the disk count: per-machine data I/O capability is still provided by the same number of local mechanical hard disks. Even though the new helium-filled disks tripled the capacity per drive, IOPS per drive stayed almost the same, so DiskUtil became a huge bottleneck.
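A back-of-the-envelope calculation makes the squeeze concrete; the drive figures below are illustrative assumptions, not measured MaxCompute numbers:

```python
# Helium drives tripled capacity but not IOPS, so random-I/O capability per
# TB of stored data fell to roughly a third. Figures are assumed for scale.
old_tb, new_tb = 4, 12          # assumed capacity per drive, before/after
iops = 150                      # rough random IOPS of a 7200 rpm HDD

print(iops / old_tb)            # ~37.5 IOPS per TB on the old drives
print(iops / new_tb)            # ~12.5 IOPS per TB on the helium drives
```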

After a series of optimizations, this year's large fleet of 96-core machines no longer causes the panic that last year's 64-core machines did, and DiskUtil stays at a more manageable level.

Transparent file merge

A running job fails with FILE_NOT_FOUND, or a job scanning a long partition range keeps hitting the same failure after repeated retries.

To relieve the pressure of ever-growing file counts in the cluster, the platform automatically merges small files in the background. For a long time, to guarantee data consistency and efficiency, this operation could hardly avoid interrupting jobs that were reading the affected files, so it could only merge the colder partitions. But file-count pressure kept forcing the "cold" threshold to shrink, from a month to two weeks and shorter, while on the other hand there were always some jobs that still read earlier partitions and got interrupted by the merge.

This year the platform implemented a new merge mechanism: for a certain window of time after a merge, jobs that are already running can still read the pre-merge files, which eliminates this stubborn problem.
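A minimal sketch of the idea, assuming a fixed grace window and simplified metadata; the real mechanism's internals are not public:

```python
import time

# The merged file is published alongside the originals, and the originals are
# only physically deleted after a grace window long enough to outlive any job
# that could still hold their paths. Window and structures are assumptions.
GRACE_SECONDS = 24 * 3600       # assumed retention window for pre-merge files

def merge_partition(meta, partition, small_files, merged_file):
    """Publish the merged file; return tombstones for the pre-merge files."""
    meta[partition] = [merged_file]           # new jobs see only the merged file
    expiry = time.time() + GRACE_SECONDS
    return [(f, expiry) for f in small_files]  # old jobs can still read these

def gc_tombstones(tombstones):
    """Keep pre-merge files whose grace window has not yet expired."""
    now = time.time()
    return [(f, t) for f, t in tombstones if t > now]
```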

The new mechanism has already delivered good results on the public cloud and is in gray-release trial operation within the group.

Platform performance improvement

As a computing platform, MaxCompute takes computing power as its core indicator and supports the group's rapid business growth by continuously improving it. Compared with the 2017 Double 11, the number of MaxCompute jobs on this year's Double 11 nearly doubled. Over the past year, MaxCompute continued to build a highly available, high-performance, adaptive big data platform through NewSQL, rich structured data, the federated computing platform, and AliORC. In the TPC-BB benchmark results released at the Computing Conference in September, MaxCompute outperformed the open source system by more than 3x at the 10 TB scale, and its 100 TB score rose from 7,800+ last year to 18,000+, leading the world.

Conclusion

MaxCompute successfully passed the test again on November 11, 2018. At the same time, we saw that the platform needs to keep improving its overall multi-cluster capabilities for distributed computing, keep raising its computing power, guarantee the stability of large-scale computing, and support the continued rapid growth of the business. Through continuous engine optimization, development framework construction, intelligent data warehouse building, and other efforts, MaxCompute is evolving into an intelligent, open, ecosystem-oriented platform ready to support the next doubling of the business.