Introduction: This paper introduces MaxCompute and Alibaba's internal big data development suite built on it, and describes the problems frequently encountered during data development together with their solutions. The main challenges are:
- Frequent business change: the business develops very fast, and requirements are numerous and change frequently.
- Need for fast delivery: development is business-driven, and results must be delivered quickly.
- Frequent releases: the iteration cycle is measured in days, with several releases per day.
- Heavy operation and maintenance load: in the group common layer, each developer is responsible for more than 100 tasks on average.
- Complex system environment: most of Alibaba's platform systems are self-developed and iterate rapidly to keep up with the business, which puts great pressure on platform stability.
I. Unified computing platform
MaxCompute consists of four parts, namely, the MaxCompute Client, the MaxCompute Front End, the MaxCompute Server, and the Apsara Core.
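To make the layering concrete, here is a minimal sketch of submitting a SQL job, assuming the publicly available PyODPS SDK; the credentials, project, and table names are placeholders, and the comments map the steps onto the components described above.

```python
# Minimal sketch: submitting a SQL job to MaxCompute via the PyODPS SDK.
# Credentials, project name, and table name below are placeholders.
from odps import ODPS

o = ODPS(
    access_id="<your-access-id>",
    secret_access_key="<your-access-key>",
    project="my_project",                           # hypothetical project
    endpoint="http://service.odps.aliyun.com/api",  # public service endpoint
)

# The client sends the statement to the front end, which compiles and
# optimizes it; the server layer then schedules execution on the Apsara core.
instance = o.execute_sql(
    "SELECT COUNT(*) FROM my_table WHERE ds = '20161110';"  # hypothetical table
)
with instance.open_reader() as reader:
    for record in reader:
        print(record)
```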
On 10 November 2016, Sort Benchmark published the final results of CloudSort 2016 on its official website. Alibaba Cloud won the world championship in both the Indy (special-purpose sorting) and Daytona (general-purpose sorting) categories at a cost of $1.44/TB, breaking the $4.51/TB record that AWS had held since 2014. This means Alibaba Cloud has turned world-leading computing capability into inclusive cloud products available to everyone. CloudSort, also known as the "battle for cloud computing cost-efficiency", is a contest over who can sort 100 TB of data at the lowest cost, and is one of the most practically minded events at Sort Benchmark.
II. Unified development platform
Rules are distilled from recurring problems, for example SQL written by users that is of poor quality, performs badly, or violates specifications; potential faults are then caught in advance through the system and the development process instead of being handled after the fact. SQLSCAN, combined with D2, is embedded in the development process: a SQLSCAN check is triggered whenever a user submits code. The SQLSCAN workflow is shown in the following figure. SQLSCAN rules cover three categories:
- Code specification rules, such as table naming conventions, lifecycle settings, table comments, and so on.
- Code quality rules, such as scheduling-parameter usage checks, zero-denominator alerts, NULL values participating in calculations and affecting results, wrong insert field order, etc.
- Code performance rules, such as partition-pruning failure, large-table scan alerts, and repeated-computation detection.

SQLSCAN rules are classified as strong rules and weak rules. When a strong rule is triggered, the task is blocked and the code must be fixed before it can be submitted again; when a weak rule is triggered, only a violation message is displayed and the user may continue to submit the task. A minimal sketch of this strong/weak mechanism follows.
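The sketch below shows how a SQLSCAN-style checker might classify rules as strong (blocking) or weak (warning only). The rule names, regular expressions, and the `ds` partition-column convention are illustrative assumptions, not the actual SQLSCAN implementation.

```python
# Hypothetical sketch of a SQLSCAN-style rule engine; not the real SQLSCAN.
import re
from typing import Optional

def check_table_naming(sql: str) -> Optional[str]:
    """Code specification rule: target tables should carry a layer prefix
    (the ods_/dwd_/dws_/ads_ prefixes here are an assumed convention)."""
    m = re.search(r"INSERT\s+(?:OVERWRITE|INTO)\s+TABLE\s+(\w+)", sql, re.I)
    if m and not re.match(r"(ods|dwd|dws|ads)_", m.group(1), re.I):
        return f"table '{m.group(1)}' does not follow the naming convention"
    return None

def check_partition_filter(sql: str) -> Optional[str]:
    """Code performance rule: scans should filter on the partition column
    (assumed here to be named 'ds') to avoid full-table scans."""
    if re.search(r"\bFROM\b", sql, re.I) and "ds=" not in sql.replace(" ", ""):
        return "no partition filter found; full-table scan suspected"
    return None

# (rule, is_strong): strong rules block submission, weak rules only warn.
RULES = [(check_table_naming, True), (check_partition_filter, False)]

def sqlscan(sql: str) -> bool:
    """Run all rules; return False if any strong rule blocks submission."""
    blocked = False
    for rule, is_strong in RULES:
        msg = rule(sql)
        if msg:
            print(f"[{'STRONG' if is_strong else 'WEAK'}] {msg}")
            blocked = blocked or is_strong
    return not blocked

if __name__ == "__main__":
    ok = sqlscan("INSERT OVERWRITE TABLE tmp_orders SELECT * FROM orders")
    print("submission allowed" if ok else "submission blocked; fix strong violations")
```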
You can configure data quality verification rules to automatically monitor data quality during data processing tasks. These rules mainly verify whether the target data meets expectations; a hypothetical sketch of one such rule is given below.
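As one illustrative example, a common verification rule compares today's row count against a recent baseline and raises an alert on large fluctuations. The table sizes, window length, and threshold here are assumptions for the sketch, not values prescribed by the platform.

```python
# Hypothetical data quality rule: flag large row-count fluctuations
# against a 7-day baseline. Thresholds and data are illustrative.
def check_row_count_fluctuation(today_count: int, history: list,
                                threshold: float = 0.5) -> bool:
    """Return True if today's count deviates from the recent average
    by more than `threshold` (e.g. 0.5 = 50%)."""
    if not history:
        return False  # no baseline yet; nothing to compare against
    avg = sum(history) / len(history)
    return abs(today_count - avg) / avg > threshold

# Example: recent loads averaged ~1.0M rows, but today only 0.3M arrived.
history = [1_020_000, 980_000, 1_050_000, 990_000, 1_010_000, 1_000_000, 970_000]
if check_row_count_fluctuation(300_000, history):
    print("ALERT: row count fluctuation exceeds threshold; block downstream tasks")
```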
The main scenarios are as follows: