

For more content, see the big data channel of the Cloud Community at yq.aliyun.com/big-data. In addition, with MaxCompute and its supporting products, low-cost big data analysis takes only a few steps; visit www.aliyun.com/product/odp for details.


Abstract: Looking back at the milestones in big data technology, the earliest can be traced to the official launch of Hadoop in 2006; looking around today, the industry is full of big data technologies built around databases and data processing engines. This is a good time to be a technologist: there are more than 300 popular databases in the database field alone, and big data processing technologies around the Hadoop ecosystem are blossoming. At the Big Data Technology Summit of the Cloud Computing Community 2017 Online Technology Summit, Lin Wei, architect of the Alibaba Cloud big data computing platform, gave a talk entitled “MaxCompute Brain: Cost-based Optimizer”, sharing the design and architecture of the brain of Alibaba's big data computing service — the cost-based optimizer.


MaxCompute overview


Big Data Computing Service (MaxCompute) is a fast, fully managed data warehouse solution at the PB/EB scale. MaxCompute can scale out to 10,000 servers and supports cross-region disaster recovery. It is Alibaba's internal core big data platform, which undertakes the vast majority of computing tasks within the group and supports millions of jobs per day. MaxCompute provides users with a comprehensive data import solution and a variety of classical distributed computing models, so that users can quickly solve the problem of computing over massive data, effectively reduce enterprise costs, and keep data secure.


MaxCompute architecture


The basic architecture of MaxCompute is shown in the figure above. At the bottom, the Pangu distributed file system provides unified storage on top of the physical machines. Above Pangu is the Fuxi distributed scheduling system, which manages all computing resources, including CPU, memory, network, and disk. The next level up is the unified MaxCompute execution engine. On top of the execution engine, various computing modes are built, such as stream computing, graph computing, offline processing, in-memory computing, and machine learning. Above that sits the MaxCompute language. With this language, we hope to provide a good platform for all applications, so that data engineers can develop applications on it and have those applications quickly deployed and run in distributed scenarios.


Research and development of MaxCompute


The development of MaxCompute focuses on the following four aspects:
  1. High performance, low cost, and large scale. The MaxCompute platform should improve computing performance, reduce user costs as much as possible, and reach the scale of 10,000 machines across multiple clusters.
  2. Stability and service orientation. The MaxCompute platform should be stable and offered as a service, so that users do not have to worry about the difficulties of distributed applications and only need to focus on what computation they want to perform, while the system itself serves them through a stable, service-oriented interface.
  3. Ease of use, for data developers. We want the MaxCompute platform to be easy to use and easily accessible to data developers, who should not need a deep understanding of distributed scenarios but can focus on what computations need to be performed on their data. The MaxCompute platform then helps them execute their ideas efficiently and inexpensively.
  4. Versatility. MaxCompute should support not only stream computing, graph computing, batch processing, and machine learning, but allow more kinds of computation to run well on the platform.
The brain of MaxCompute — the optimizer


Based on the development directions above, the MaxCompute platform needs a more powerful brain that understands the user's data, the user's computation, and the users themselves. The MaxCompute brain must help users optimize more effectively: by understanding at the system level what operations the user really needs, it frees users from the distributed scenario, so they do not have to think about how to make their jobs run efficiently. Part of that work is handed over to the MaxCompute brain, which provides users with a more intelligent platform. This is the value that MaxCompute brings to its users.


So what is the brain of MaxCompute? It is the optimizer. The optimizer ties all the information together: by understanding the relationships in the data, the user's intent, and the capabilities of the machines, it fully analyzes the various environments and executes the user's operations in the most efficient way in a distributed setting. This talk uses offline computing as the main example to introduce the optimizer, the brain of MaxCompute.


The MaxCompute offline computing architecture is shown in the figure below. MaxCompute provides an SQL-like scripting language. A script is submitted through the FrontEnd and converted into a logical execution plan. Guided by the Optimizer, the logical execution plan is translated into a more efficient physical execution plan, which is connected to the Runtime and then decomposed by the Fuxi distributed scheduling system onto the worker nodes.


The core of the above process is how to fully understand the user's query plan and optimize it into an efficient physical execution plan; this is the job of the Optimizer. The Hive and Spark optimizers in the open source community are basically rule-based. In fact, this classification already exists for single-machine systems: optimizers are divided into rule-based optimizers and cost-based optimizers.


In standalone scenarios, Oracle 6 through 9i used a rule-based optimizer, Oracle 8 began to introduce a cost-based optimizer, and Oracle 10g completely replaced the rule-based optimizer. In big data scenarios, Hive started with a rule-based optimizer; newer versions of Hive have begun to adopt a cost-based optimizer as well, though it is not yet a fully cost-based one. MaxCompute uses a fully cost-based optimizer. So what is the difference between the two kinds of optimizer? A rule-based optimizer recognizes patterns in the logical plan and triggers transformation rules: when pattern A is recognized, a rule rewrites plan A into plan B. This method, however, is not sensitive to the data, and its optimization is locally greedy. It is like a climber who only looks at the ten meters ahead and always walks upward, ignoring the fact that sometimes you have to go downhill first to reach a higher peak. Rule-based optimizers therefore tend to fall into plans that are locally good but globally bad, and they produce different execution plans depending on the order in which the rules are applied, so the results are often not optimal. A cost-based optimizer, by contrast, uses the Volcano model to try out various equivalent execution plans, calculates the cost of these equivalent plans from statistics about the data, and finally selects the execution plan with the lowest cost, thereby achieving a global optimum.




A concrete example helps explain why a rule-based optimizer cannot achieve a global optimum. The script in the figure above joins A, B, and C, then groups the join result by a column of A and computes an average. The query can be described as a tree-shaped logical execution plan, which in the database world is usually drawn bottom-up: the leaf nodes of the plan tree are the inputs and the root node is the final output, so the data flows from bottom to top.
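

The figure itself is not reproduced here, so the following sketch illustrates what such a plan tree looks like. The SQL text, table names, and column names are assumptions made for illustration, not the exact script from the talk.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative query (assumed schema): join A, B, and C, then group by a
# column of A and compute an average over the joined result.
QUERY = """
SELECT A.key, AVG(C.value)
FROM A JOIN B ON A.key = B.key
       JOIN C ON B.key = C.key
GROUP BY A.key
"""

@dataclass
class Plan:
    op: str                                    # e.g. 'Scan', 'Join', 'Aggregate'
    children: List["Plan"] = field(default_factory=list)
    info: str = ""                             # join keys, group keys, table name, ...

# Bottom-up logical plan: the leaves are the inputs, the root is the final output.
logical_plan = Plan("Aggregate", info="GROUP BY A.key, AVG(C.value)", children=[
    Plan("Join", info="B.key = C.key", children=[
        Plan("Join", info="A.key = B.key", children=[
            Plan("Scan", info="A"),
            Plan("Scan", info="B"),
        ]),
        Plan("Scan", info="C"),
    ]),
])
```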




The cost-based optimizer takes a different approach: it first expands the query into multiple equivalent executable plans through the Volcano model. In this example, you can join A with B before joining C, or join B with C before joining A. Because the second plan requires an additional Exchange, the cost-based optimizer, applying its cost model at the end, calculates that the first plan has the lower cost and selects that optimal plan for execution. The cost-based optimizer also includes many cost models unique to distributed scenarios and takes non-SQL workloads into account: many of the scenarios are Internet applications, and users need a great deal of non-SQL support, so user-defined functions are used to help users combine non-SQL logic with relational data in query optimization. Finally, there are optimizations for various distributed scenarios. These are some of the things that a cost-based optimizer does differently from a stand-alone optimizer.
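

The choice between the two equivalent plans comes down to comparing their accumulated costs. The sketch below is a toy illustration: every size and cost formula in it is invented and is not MaxCompute's actual cost model.

```python
# Toy comparison of two equivalent join orders; all numbers and formulas
# are invented purely for illustration.
ROWS = {"A": 1_000_000, "B": 1_000_000, "C": 1_000}

def shuffle_cost(rows: int) -> int:
    return rows              # pretend cost of re-distributing `rows` over the network

def local_join_cost(left: int, right: int) -> int:
    return left + right      # pretend cost of joining two co-located inputs

# Plan 1: (A join B) join C, assuming A and B are already co-partitioned on the key.
plan1 = local_join_cost(ROWS["A"], ROWS["B"]) + local_join_cost(ROWS["A"], ROWS["C"])

# Plan 2: (B join C) join A, assuming the intermediate result needs an extra Exchange.
plan2 = (local_join_cost(ROWS["B"], ROWS["C"])
         + shuffle_cost(ROWS["B"])                     # the extra Exchange
         + local_join_cost(ROWS["B"], ROWS["A"]))

print("plan1" if plan1 <= plan2 else "plan2")          # the cheaper equivalent plan wins
```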




Next, I would like to share the Volcano model. The Volcano model is, in effect, the engine behind the cost-based optimizer, and it was first proposed for single-machine systems. The Volcano model also has rules, but unlike the rules in a rule-based optimizer, they behave more like transformation functions. The model first groups the logical execution plan; within a group it explores the local expressions and then applies rules to them according to algebraic equivalence principles. Each equivalence transformation does not replace the original logical plan tree; instead, it splits a new tree on top of the original one. In the end there are many equivalent execution plan trees, and the cost-based optimizer selects the best among them. The design principle of the Volcano model is to keep every rule as local as possible: the more local and orthogonal the rules are, the more thoroughly the space can be explored. For example, if you define the four directions on a plane, you can reach any point in the two-dimensional plane by searching along those four directions. Optimization is the same kind of problem, selecting the best plan within a space, so we want the rules to be orthogonal to each other; this lets us explore the whole space with fewer rules and leaves it to the engine to explore the space and choose the optimal path.
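

A minimal sketch of this idea follows; the names and data structures are assumptions for illustration, not MaxCompute internals. Equivalent expressions accumulate in a group instead of overwriting each other, so one shared structure describes many trees at once.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Expr:
    op: str
    inputs: List["Group"] = field(default_factory=list)

@dataclass
class Group:
    # All expressions in a group are equivalent; new alternatives are appended,
    # and the original expression is never replaced.
    exprs: List[Expr] = field(default_factory=list)

    def add(self, expr: Expr) -> None:
        self.exprs.append(expr)

# A "rule" is just a transformation function: if the pattern matches,
# it returns an equivalent expression, otherwise None.
Rule = Callable[[Expr], Optional[Expr]]

def apply_rules(group: Group, rules: List[Rule]) -> None:
    for expr in list(group.exprs):
        for rule in rules:
            alt = rule(expr)
            if alt is not None:
                group.add(alt)   # the group now describes one more equivalent tree
```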




The previous discussion was rather abstract, so here is a further example to deepen the understanding of the optimization process. Suppose there is a very complex logical execution plan tree, which is the task the user really needs to run, and we extract a small part of it. During plan optimization, the optimizer first checks which existing rules the current plan matches. Suppose the two nodes in the figure match a pattern: one is a Filter and the other is a Project. In principle we want to push the Filter toward the leaf nodes, that is, the earlier filtering happens the better. Now there is a pattern: if the Filter appears above the Project, swap them so that the Filter is done first and the Project afterwards. This converts the plan into another one in which the two nodes become new nodes; in other words, the Filter and the Project exchange their order. That is the process of applying a rule. Likewise, another node, such as an Aggregate, may match some other pattern, and the corresponding rule converts it into an equivalent node. By reusing tree nodes in this way, one structure can maintain many trees: in this example two rules are applied, and what looks like a single store of nodes actually describes four equivalent trees. The cost of these four equivalent trees is then calculated, and the tree with the lowest cost is selected as the execution plan. The overall cost-based optimization process works like this, but you can see that when the logical plan tree is very large and there are many rule transformations, the exploration space becomes huge, so the optimization process has to take many factors into account.
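

As an illustration of such a rule (the node shape and rule signature are assumptions, not MaxCompute's rule API), a filter-past-project transposition can be written as a small pattern match that returns an equivalent subtree:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                     # 'Filter', 'Project', 'Scan', ...
    child: Optional["Node"] = None
    detail: str = ""            # predicate or projection list, kept as plain text here

def filter_project_transpose(node: Node) -> Optional[Node]:
    """If a Filter sits directly above a Project, produce the equivalent plan
    with the Filter pushed below the Project (filter as early as possible).
    Simplification: assumes the predicate only uses columns the Project keeps."""
    if node.op == "Filter" and node.child is not None and node.child.op == "Project":
        project = node.child
        pushed_filter = Node("Filter", child=project.child, detail=node.detail)
        return Node("Project", child=pushed_filter, detail=project.detail)
    return None                 # pattern does not match, no equivalent plan produced

# Example: Filter(x > 20) over Project(x, y) over Scan(T)
plan = Node("Filter", detail="x > 20",
            child=Node("Project", detail="x, y", child=Node("Scan", detail="T")))
alternative = filter_project_transpose(plan)   # Project over Filter over Scan
```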


Next, let us introduce the general algorithm of the optimization engine. The figure below shows a simplified version of it; a real implementation of the optimization engine has to consider many more factors that are not shown in the figure.




First, all the logical nodes in a logical execution plan are registered, and at the same time the rules are matched against the existing logical patterns; matched rules are pushed into a rule queue, and then the rules in the queue are applied one by one. Applying a rule may produce an equivalent tree, that is, split a different kind of subtree out of a local part of the tree, and the newly split subtree may in turn match other patterns. Once all rules within the local scope have been matched and applied, the cost can be computed. After computing the cost and obtaining a better plan, the inferior local alternatives can be abandoned; if the current plan is still not optimal, its cost is recorded and other parts of the tree are optimized, until the optimal plan is finally found.
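

A drastically simplified sketch of that loop is shown below (an assumed structure, not the actual MaxCompute engine; plans are assumed to be hashable so the set of equivalents can deduplicate them).

```python
from collections import deque

def optimize(root, rules, estimate_cost, pattern_matches):
    """Simplified cost-based optimization loop: register the plan, queue the
    rules whose patterns match, apply them to grow the set of equivalent
    plans, then keep the cheapest plan."""
    equivalents = {root}                        # all equivalent plans found so far
    queue = deque((rule, root) for rule in rules if pattern_matches(rule, root))

    while queue:
        rule, plan = queue.popleft()
        new_plan = rule(plan)                   # apply the transformation
        if new_plan is None or new_plan in equivalents:
            continue
        equivalents.add(new_plan)
        # a newly produced plan may itself match further rules
        queue.extend((r, new_plan) for r in rules if pattern_matches(r, new_plan))

    return min(equivalents, key=estimate_cost)  # plan with the lowest estimated cost
```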


Examples of optimization problems in distributed queries


Here are some examples of optimization problems in distributed queries that are different from those in stand-alone systems.




As an example, suppose table T1 joins table T2: T1 is partitioned by (a, b), T2 is partitioned by a, and the join condition is t1.a = t2.a. Since the join condition is on a, one method is to re-partition T1 by a and then join it with T2. But if T1 is very large, much larger than T2, we may not want to re-partition T1; an alternative plan is to broadcast T2 as a whole, sending all of T2's data to every shard of T1. Because the join condition is on a, this choice is valid, and it avoids reshuffling a large amount of data. In this scenario, the key is how the optimizer perceives the join condition. Neither of the two plans in this example is absolutely optimal; the decision depends on the size of the data, the amount of data in T2, and how T1's data shards are distributed. Many papers at SIGMOD 2012 discussed this problem, so it is not elaborated further here.
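

A hedged sketch of that decision is below; the threshold, sizes, and SQL text are illustrative assumptions rather than MaxCompute's real heuristics, which are driven by the cost model.

```python
# Illustrative query: T1 is partitioned by (a, b), T2 is partitioned by a.
QUERY = "SELECT * FROM T1 JOIN T2 ON t1.a = t2.a"

def choose_join_strategy(t1_bytes: int, t2_bytes: int,
                         broadcast_limit: int = 512 * 1024 * 1024) -> str:
    """Pick between re-partitioning T1 on `a` and broadcasting T2.
    The threshold is a made-up knob, not a real MaxCompute parameter."""
    if t2_bytes <= broadcast_limit and t2_bytes < t1_bytes:
        # T2 is small: ship all of T2 to every shard of T1 and join locally,
        # avoiding a full reshuffle of the much larger T1.
        return "broadcast T2"
    # Otherwise re-partition T1 by `a` so that both sides are co-partitioned.
    return "repartition T1 on a, then partitioned join"

print(choose_join_strategy(t1_bytes=10 * 2**40, t2_bytes=100 * 2**20))
```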




As shown in the figure, T1 and T2 again join on a, and after the join there is a condition t1.a > 20; then a Project is carried out, whose result, computed by foo(), is treated as a new column b, and finally all the results are to be ordered by b. Both T1 and T2 are range-partitioned rather than hash-partitioned. Because a global sort has already been performed, the range partition boundaries of the two tables can be used when doing the join, so there is no need to reshuffle the data: for instance, it is already known which partitions contain values greater than 20, so only the data selected by those boundaries needs to be read and processed, avoiding data shuffling as much as possible. The results then have to be ordered by b. Assuming foo() is monotonically increasing, we can take advantage of the existing range partitioning: the order of b is preserved through the join and the project, so the ORDER BY b can be satisfied directly without introducing an extra Exchange. This is query optimization in a distributed setting: a distributed query can be optimized when the optimizer understands the data shards, how the data is distributed, the user's custom functions, and how those functions interact with the optimizer. In fact, through user annotations, the optimizer can learn what characteristics the user's functions have and what data properties they preserve.
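

A sketch of that piece of reasoning follows (an assumed representation, not MaxCompute's planner API): if the input is already range-partitioned and sorted on a, and b = foo(a) with foo() monotonically increasing, the sort on b can be elided.

```python
def can_skip_sort(input_sorted_on: str, order_by_col: str,
                  derived_from: dict, monotonic_udfs: set) -> bool:
    """Return True if ORDER BY `order_by_col` is already satisfied.
    `derived_from` maps a derived column to (udf_name, source_column)."""
    if order_by_col == input_sorted_on:
        return True
    if order_by_col in derived_from:
        udf, source = derived_from[order_by_col]
        # a monotonically increasing UDF preserves the ordering of its input
        return udf in monotonic_udfs and source == input_sorted_on
    return False

# b = foo(a), data already range-partitioned and sorted on a, foo is monotonic:
print(can_skip_sort("a", "b", {"b": ("foo", "a")}, {"foo"}))   # True -> no Exchange
```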


User-defined functions (UDFs)




In distributed systems, especially with non-SQL workloads, a large number of user-defined functions are needed for extension, because many processing steps are not as simple as a Join or an Aggregate: they require modeling with special-purpose functions, so user-defined functions have to be implemented. Once user-defined functions appear, the optimizer often has difficulty understanding them, and the scope of optimization becomes greatly restricted. For example, the yellow node in the center of the figure above contains a user-defined function, but the system may not know what this function does, so during optimization the plan may only be split into three smaller fragments, each optimized on its own. If the optimizer can understand what the user-defined function is doing, it can optimize across (penetrate) the UDF and achieve a much wider range of optimization. So which features of a UDF help the optimizer penetrate it? We can analyze whether the UDF is row-wise, that is, whether it processes rows one by one with no cross-row state; whether it is a monotonic function; whether some columns pass through it unchanged, which allows operations to be pushed through it; whether it preserves data sharding or sort order; and some information for the cost model, such as whether its selectivity is high or low and what the data distribution of its output looks like. All of this lets the optimizer optimize better, opens up more optimization space, makes the optimization more flexible, and helps the cost model select a better plan. This is something Alibaba is currently doing with the MaxCompute optimizer.
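

As a purely hypothetical illustration (this decorator and its attribute names are invented, not MaxCompute's actual UDF annotation API), such properties could be declared alongside the UDF so that the optimizer can read them:

```python
def udf_properties(row_wise=False, monotonic=False,
                   preserves_partitioning=False, preserves_order=False,
                   selectivity=None):
    """Attach optimizer-visible hints to a user-defined function.
    Purely illustrative: the names and mechanism are assumptions."""
    def wrap(fn):
        fn.optimizer_hints = {
            "row_wise": row_wise,                       # processed row by row, no cross-row state
            "monotonic": monotonic,                     # preserves input ordering
            "preserves_partitioning": preserves_partitioning,
            "preserves_order": preserves_order,
            "selectivity": selectivity,                 # estimated fraction of rows kept
        }
        return fn
    return wrap

@udf_properties(row_wise=True, monotonic=True, preserves_partitioning=True,
                preserves_order=True, selectivity=1.0)
def foo(x: int) -> int:
    return x * 10 + 7    # a monotonically increasing, row-wise transformation
```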


Optimization rules


The MaxCompute cost-based optimizer implements many optimizations, which are listed in the figure below and will not be covered one by one here. As the figure shows, there are many optimizations to perform on a query. All of them act as operators on the optimization engine: each transforms the plan graph and generates many equivalent trees, and the optimization engine selects the best plan among them through the cost model.




Cost model


What is the cost model? Each cost model only needs to focus on its local part, for example what the cost of the inputs is and what cost results after a join, without paying attention to the whole; the cost of the overall plan is accumulated by the engine from these local costs. A good cost model tries to reflect the actual physical execution, but it does not need to match reality exactly: its ultimate purpose is to distinguish good plans from bad ones. The cost models of traditional databases are still essentially the models designed long ago; even though the hardware has changed, as long as the von Neumann architecture remains, the cost model can still be used to select the optimal plan.
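

A minimal sketch of the accumulation idea follows; the cost dimensions and numbers are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cost:
    cpu: float = 0.0
    io: float = 0.0
    network: float = 0.0

    def __add__(self, other: "Cost") -> "Cost":
        # the engine accumulates local operator costs into a plan-level cost
        return Cost(self.cpu + other.cpu, self.io + other.io,
                    self.network + other.network)

    def total(self) -> float:
        # the absolute value only needs to rank plans, not match reality exactly
        return self.cpu + self.io + self.network

scan_t1 = Cost(cpu=1.0, io=10.0)
scan_t2 = Cost(cpu=0.1, io=1.0)
shuffle = Cost(network=8.0)                     # only paid if the plan reshuffles data
partitioned_join = scan_t1 + scan_t2 + shuffle + Cost(cpu=2.0)
broadcast_join   = scan_t1 + scan_t2 + Cost(network=1.0, cpu=2.0)
best = min([partitioned_join, broadcast_join], key=Cost.total)
```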




The optimizer also has many other factors to consider. For example, it must perform equivalence transformations according to the rules and finally select the optimal plan according to the cost model; but as the size of the logical plan increases, enumerating all possible plans becomes extremely time-consuming. In MaxCompute especially, we want the logical execution plan to be as large as possible, because that gives the optimization engine more room; yet when enumerating plans, some of the enumeration may be unnecessary because that part of the space is already known not to lead to a better plan. How to prune effectively and avoid exploring unnecessary space is something a good optimizer must consider. In addition, when choosing where to explore, it may be better to spend the time on the part of the space most likely to contain the optimal plan, because the goal is not to find the optimum in NP-hard time but to find a good execution plan within a limited time. In the field of optimization, the point is not necessarily to find the best plan but to avoid the worst ones, because optimization always operates under a time constraint.
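

The sketch below illustrates the pruning idea under simplifying assumptions (left-deep join orders only, invented cost callbacks): any partial order whose accumulated cost already exceeds the best complete plan found so far is abandoned.

```python
def cheapest_join_order(tables, scan_cost, join_cost):
    """Illustrative branch-and-bound search over left-deep join orders:
    abandon any partial order whose accumulated cost already exceeds the
    best complete plan found so far. A real optimizer would also cap the
    total exploration time."""
    best_order, best_cost = None, float("inf")

    def extend(order, remaining, cost_so_far):
        nonlocal best_order, best_cost
        if cost_so_far >= best_cost:
            return                              # prune: this branch cannot win
        if not remaining:
            best_order, best_cost = order, cost_so_far
            return
        for t in remaining:
            step = scan_cost(t) if not order else join_cost(order[-1], t)
            extend(order + [t], remaining - {t}, cost_so_far + step)

    extend([], set(tables), 0.0)
    return best_order, best_cost
```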


Why are cost-based optimizers increasingly important to the MaxCompute platform?


That is because Alibaba wants to go beyond Hive-style query statements and provide more complex stored procedures. As shown in the figure above, with variable assignment, preprocessing, and if-else one can write more complex query flows and stored procedures. A rule-based optimizer, because of its greedy algorithm, strays further and further from the optimum and is unlikely to reach the globally optimal plan in the end, while more complex logical plans enlarge the space that can be optimized; this also places more demands on the optimizer, so a better cost-based optimizer is needed to help select better execution plans. In distributed and non-SQL scenarios, cost-based optimizers differ from traditional stand-alone optimizers, so a deeper understanding of the data, the operations, and the users is needed to make the cost-based optimizer more intelligent.


Understand the data


So what does it mean to understand data? In terms of data format, understanding data requires supporting richer data indexes and heterogeneous data types: structured, unstructured, and semi-structured data. In big data scenarios the data also has power-law properties, and there are tables with millions of sparse columns; better optimization is needed for such scenarios. Understanding data also means understanding rich data sharding: in a distributed scenario, data can be sharded by Range, Hash, or DirectHash, storage can be column storage or column grouping, and Hierarchy Partition is used for hierarchical partitioning. The optimizer also needs to understand sophisticated data statistics and runtime data, such as histograms, distinct values, and data volumes.
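

A small sketch of the kind of metadata and statistics the optimizer consumes is shown below; the field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ColumnStats:
    distinct_values: int
    null_count: int
    histogram: List[float] = field(default_factory=list)  # bucket boundaries

@dataclass
class TableMeta:
    name: str
    row_count: int
    size_bytes: int
    partitioning: str                         # e.g. 'range(a)', 'hash(a)', 'direct_hash'
    storage: str = "column"                   # e.g. 'column', 'column_group'
    sorted_by: Optional[str] = None
    column_stats: Dict[str, ColumnStats] = field(default_factory=dict)

t2 = TableMeta(name="T2", row_count=1_000, size_bytes=64 * 2**20,
               partitioning="hash(a)",
               column_stats={"a": ColumnStats(distinct_values=900, null_count=0)})
```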


Understand the operation


From the perspective of understanding operations, the optimizer needs to understand user-defined functions better and interact with them: through annotations, users can expose the characteristics of their functions and the data attributes they preserve, so that global optimization can be carried out. More optimization can also happen at runtime: for example, when an intermediate stage is reached, the system can judge the size of the data volume and choose the degree of parallelism accordingly, and it can choose an optimization strategy for the network topology based on where the data is located. There are also trade-offs to balance among real-time responsiveness, scale, performance, cost, and reliability, and network shuffling can be adapted for in-memory computing and stream computing.
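

As a sketch of one such runtime decision (the target size per task is an invented knob, not a real MaxCompute parameter), the degree of parallelism for the next stage can be derived from the observed size of the intermediate data:

```python
def pick_parallelism(intermediate_bytes: int,
                     target_bytes_per_task: int = 256 * 1024 * 1024,
                     max_tasks: int = 10_000) -> int:
    """Choose how many tasks to launch for the next stage based on the
    actual size of the intermediate result observed at runtime (illustrative)."""
    tasks = max(1, intermediate_bytes // target_bytes_per_task)
    return min(tasks, max_tasks)

print(pick_parallelism(3 * 2**30))   # ~3 GB of intermediate data -> 12 tasks
```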


Understand the user


From the perspective of understanding users, the optimizer needs to understand user scenarios: in multi-tenant scenarios users have different demands on scale, performance, latency, and cost, and the optimizer should select the best plan for each such scenario. In terms of ecosystem, the optimizer is the core optimization engine; we hope to be more open at the language level, to connect with more languages and ecosystems, and to provide a powerful IDE that gives developers a complete development experience. Finally, we hope to provide multiple computing modes on a unified platform, so that the optimizer can truly become the brain of computing.

