Brief introduction: Alibaba Cloud databases provide a distinctive Autoscaling capability, jointly built by the database kernel, the management/control team, and the DAS (Database Autonomy Service) team. The kernel and control teams provide the basic scaling capability of the database, while DAS is responsible for monitoring performance data, implementing the scaling decision algorithms, and presenting the scaling results. This article elaborates on Autoscaling in detail, covering both how it works and how it has been put into production.
1. Introduction
Gartner predicts that by 2023, three quarters of the world's databases will run on the cloud. One of the biggest advantages of cloud-native databases is the natural elasticity of cloud computing: like water, electricity, and gas, database resources can be consumed on demand, and Autoscaling is the ultimate embodiment of that elasticity. Database Autoscaling means that instance resources are automatically expanded when the database is at its business peak, and automatically released to reduce costs when the business load falls.
Cloud vendors such as AWS and Azure have implemented Autoscaling on some of their cloud databases. Alibaba Cloud has also built its own Autoscaling capability, jointly developed by the database kernel, the management/control team, and the DAS (Database Autonomy Service) team. The kernel and control teams provide the basic scaling capability, while DAS is responsible for monitoring performance data, implementing the scaling decision algorithm, and presenting the scaling results. DAS is a cloud service that uses machine learning and expert experience to make databases self-aware, self-healing, self-optimizing, self-operating, and self-securing. It helps users eliminate the complexity of database management and the service failures caused by manual operation, and effectively guarantees the stability, security, and efficiency of database services. The Autoscaling/Serverless capability belongs to the "self-operating" part of the solution architecture, as shown in Figure 1.
Figure 1. Solution architecture for the DAS
2. Autoscaling workflow
As shown in Figure 2, the overall workflow of database Autoscaling can be divided into three stages: "When: when scaling is triggered", "How: which scaling method is used", and "What: which specification to scale to".
- When scaling is triggered determines the timing of scaling out/up and scaling back for a database instance. The usual practice is to observe the instance's performance indicators, scale up when the instance load peaks, and scale back when the load falls. This is the common reactive (passive) trigger. In addition, we have implemented proactive triggers based on prediction. Trigger timing is discussed in detail in Section 2.1.
- The scaling method usually takes one of two forms: ScaleOut (horizontal scaling) and ScaleUp (vertical scaling). Taking the distributed database PolarDB as an example, ScaleOut increases the number of read-only nodes, for instance from 2 to 4, and is mainly suitable when the instance load is dominated by read traffic. ScaleUp upgrades the instance's CPU and memory, for example from 2 cores / 4 GB to 8 cores / 16 GB, and is mainly suitable when the load is dominated by write traffic. The scaling method is introduced in detail in Section 2.2.
- Once the scaling method is determined, an appropriate specification must be selected so that the instance load drops to a reasonable level. For ScaleOut, this means deciding how many nodes to add; for ScaleUp, it means deciding how many CPU cores and how much memory to upgrade to, i.e., which instance specification to target. Specification selection is introduced in detail in Section 2.3.
Figure 2. Diagram of the Autoscaling workflow
2.1 Trigger timing of Autoscaling
2.1.1 Reactive passive trigger (based on observation)
Observation-based reactive triggering is currently the main form of Autoscaling. Users set different triggering conditions for scaling out and scaling back on different instances. For compute scaling, users can configure conditions matching their business load by setting the CPU utilization threshold, the length of the observation window, the upper limit of the specification, the upper limit on the number of read-only nodes, and the quiet period. For storage expansion, users can set the trigger threshold and the upper limit of space expansion to accommodate business growth while avoiding wasted disk resources. The configuration options for reactive triggering are described in detail in Section 3.2.
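To make the observation-based condition more concrete, the sketch below checks a CPU-utilization threshold over an observation window and honours a quiet period before another scaling action is allowed. The parameter names, the 30-second sampling interval, and the thresholds are illustrative assumptions, not DAS's actual configuration schema.

```python
from dataclasses import dataclass
from typing import List, Optional
import time

@dataclass
class ReactiveTriggerConfig:
    # Illustrative parameters; names do not reflect DAS's real schema.
    cpu_threshold: float = 80.0        # trigger when CPU utilization exceeds this (%)
    observation_window_s: int = 300    # how long the threshold must be exceeded
    quiet_period_s: int = 1800         # minimum gap between two scaling actions
    sample_interval_s: int = 30        # assumed monitoring sampling interval

def should_trigger_scaling(cpu_samples: List[float],
                           last_scale_ts: Optional[float],
                           cfg: ReactiveTriggerConfig,
                           now: Optional[float] = None) -> bool:
    """Return True if the observed CPU samples satisfy the scaling trigger condition."""
    now = now or time.time()
    # Respect the quiet period after the previous scaling action.
    if last_scale_ts is not None and now - last_scale_ts < cfg.quiet_period_s:
        return False
    # Take only the samples that fall inside the observation window.
    n = cfg.observation_window_s // cfg.sample_interval_s
    window = cpu_samples[-n:]
    if len(window) < n:
        return False            # not enough observations yet
    # Every sample in the window must exceed the threshold.
    return all(s > cfg.cpu_threshold for s in window)

# Example: 10 samples at 30 s intervals, all above 80% CPU, no recent scaling.
samples = [85.0, 88.2, 90.1, 86.7, 91.3, 89.0, 92.5, 87.4, 90.8, 93.1]
print(should_trigger_scaling(samples, last_scale_ts=None,
                             cfg=ReactiveTriggerConfig()))  # True
```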
The advantage of reactive triggering is that it is relatively easy to implement and widely accepted by users. However, as shown in Figure 3, passive triggering also has drawbacks. In general, the actual scaling operation does not start until the observation conditions configured by the user are met, and the scaling operation itself takes time to execute. During this period the instance may already have been under high load for a while, which affects the stability of the user's business to some extent.
Figure 3. Comparison of resources under passive (observation-based) trigger scaling
2.1.2 Proactive trigger (based on prediction)
The best way to address the shortcomings of reactive triggering is proactive triggering, as shown in Figure 4. By predicting the instance load, the scaling operation is carried out in advance, before the predicted load peak, so that the instance rides through the entire business peak smoothly. Periodic workloads are the most typical scenario for prediction-based triggering (instances with periodic characteristics account for roughly 40% of online instances). DAS uses a periodicity detection algorithm developed by the intelligent database lab of the DAMO Academy; the algorithm combines frequency-domain and time-domain information and reaches an accuracy of over 80%. For example, for an online instance with day-level periodicity, the Autoscaling service expands the instance before its daily peak hours begin, so that it can better cope with the periodic peak.
Figure 4. Comparison of resources under proactive (prediction-based) trigger scaling
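As a rough illustration of combining frequency-domain and time-domain information for periodicity detection, the sketch below picks the dominant FFT frequency as a candidate period and confirms it with the autocorrelation at that lag. This is only a simplified stand-in for the DAMO Academy algorithm mentioned above; the 0.5 autocorrelation cutoff is an assumed value.

```python
from typing import Optional
import numpy as np

def detect_period(series: np.ndarray, min_period: int = 2) -> Optional[int]:
    """Rough periodicity check: FFT proposes a period, autocorrelation confirms it."""
    x = series - series.mean()
    n = len(x)
    # Frequency domain: dominant non-zero frequency from the FFT spectrum.
    spectrum = np.abs(np.fft.rfft(x))
    spectrum[0] = 0.0                       # drop the DC component
    k = int(np.argmax(spectrum))
    if k == 0:
        return None
    candidate = round(n / k)                # candidate period length in samples
    if candidate < min_period or candidate >= n // 2:
        return None
    # Time domain: confirm with the autocorrelation at the candidate lag.
    acf = np.correlate(x, x, mode="full")[n - 1:]
    acf = acf / acf[0]
    return candidate if acf[candidate] > 0.5 else None

# Example: a noisy daily pattern sampled hourly over 7 days (period = 24 samples).
np.random.seed(0)
t = np.arange(24 * 7)
load = 50 + 30 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 3, t.size)
print(detect_period(load))  # expected to print 24
```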
We have also implemented prediction-based triggering for storage expansion on RDS for MySQL. Based on the instance's disk usage over a recent period, a machine learning algorithm predicts the maximum storage space the instance will need in the coming period, and the expansion size is chosen according to the predicted value. This avoids the impact of rapid growth in the instance's storage usage.
Figure 5. Forecasts based on disk usage trends
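A minimal sketch of the idea, assuming a simple linear-trend model in place of the machine learning model actually used by DAS: recent disk-usage samples are extrapolated to estimate the peak usage in the next window, from which the expansion size could be derived.

```python
import numpy as np

def forecast_max_disk_usage(usage_gb, horizon_points):
    """Fit a linear trend to recent disk usage and return the predicted maximum
    over the next `horizon_points` samples. A deliberately simple stand-in for
    the machine-learning predictor mentioned above."""
    y = np.asarray(usage_gb, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, deg=1)      # least-squares linear fit
    future_x = np.arange(len(y), len(y) + horizon_points)
    forecast = slope * future_x + intercept
    return float(forecast.max())

# Example: hourly disk usage (GB) over the past 12 hours, growing ~5 GB/hour.
history = [400, 405, 411, 415, 421, 426, 430, 436, 441, 445, 451, 456]
predicted_peak = forecast_max_disk_usage(history, horizon_points=6)
print(round(predicted_peak, 1))   # roughly 486 GB six hours from now

# A pre-emptive expansion could then be sized from the prediction,
# e.g. target_capacity = predicted_peak * 1.2 (20% headroom, illustrative).
```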
2.2 Autoscaling method decision
DAS supports two Autoscaling modes: ScaleOut and ScaleUp. When producing a scaling plan, the global decision analysis module of Workload Management also provides further diagnostic suggestions (such as automatic SQL throttling and SQL index advice). Figure 6 shows a schematic of the scaling-mode decision, using PolarDB as an example. PolarDB adopts a distributed cluster architecture with one writer, multiple readers, and separated compute and storage: a cluster contains one primary node and multiple read-only nodes, where the primary node handles both read and write requests and the read-only nodes handle only read requests. The performance data monitoring module shown in Figure 6 continuously monitors the cluster's performance indicators and judges whether the current instance load meets the Autoscaling trigger conditions described in Section 2.1. When a trigger condition is met, the workload analysis module analyzes the instance's current workload and determines why the instance is under high load, based on indicators such as the number of sessions, QPS, CPU utilization, and locks. If the high load is caused by deadlocks, a large number of slow SQL statements, or large transactions, the Autoscaling recommendation will be accompanied by SQL throttling or SQL optimization suggestions, so that the instance can quickly self-heal and reduce risk.
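A highly simplified sketch of how such a root-cause judgment might branch on the monitored indicators is shown below; the indicator names and thresholds are assumptions made for illustration and do not reflect DAS's actual rules.

```python
def diagnose_high_load(metrics: dict) -> str:
    """Classify the likely cause of high load from a few monitored indicators.
    Indicator names and thresholds are illustrative assumptions only."""
    if metrics.get("deadlock_count", 0) > 0:
        return "deadlock: recommend SQL optimization / transaction review"
    if metrics.get("slow_sql_per_min", 0) > 50:
        return "slow SQL flood: recommend SQL throttling and index advice"
    if metrics.get("long_txn_seconds", 0) > 300:
        return "large transaction: recommend splitting the transaction"
    if metrics.get("cpu_util", 0) > 80 and \
            metrics.get("qps", 0) > metrics.get("qps_baseline", 1) * 2:
        return "genuine traffic growth: recommend Autoscaling"
    return "no clear single cause: keep observing"

print(diagnose_high_load({"cpu_util": 92, "qps": 5200, "qps_baseline": 2000}))
# -> "genuine traffic growth: recommend Autoscaling"
```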
The Autoscaling decision generation module then judges which scaling method is more effective. Taking PolarDB as an example, the module determines the cluster's read/write load distribution from the instance's performance indicators and the characteristics of the traffic routed to the primary node (such as aggregate functions, system protection, transaction splitting, and custom cluster routing rules). If the instance is currently dominated by read traffic, a ScaleOut operation is performed to increase the number of read-only nodes in the cluster; if it is dominated by write traffic, a ScaleUp operation is performed to upgrade the cluster's specification. Choosing between ScaleOut and ScaleUp is a fairly complex problem: besides the current load distribution, the user-configured upper limits on the specification and on the number of read-only nodes must also be taken into account. We therefore also introduced an effect tracking and decision feedback module, which analyzes the historical scaling methods and their effects on the instance and uses them to adjust the current scaling-method selection algorithm.
Fig. 6. Schematic diagram of PolarDB’s Scaling mode decision
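The sketch below illustrates the core of the method decision in a few lines: choose ScaleOut when read traffic dominates and the read-only node limit has not been reached, otherwise fall back to ScaleUp. The 70% dominance ratio and the fallback behaviour are illustrative assumptions rather than DAS's actual policy.

```python
def choose_scaling_method(read_qps: float, write_qps: float,
                          readonly_nodes: int, max_readonly_nodes: int,
                          read_dominance_ratio: float = 0.7) -> str:
    """Pick ScaleOut vs ScaleUp from the read/write traffic split (illustrative)."""
    total = read_qps + write_qps
    if total == 0:
        return "no-op"
    read_share = read_qps / total
    if read_share >= read_dominance_ratio and readonly_nodes < max_readonly_nodes:
        return "ScaleOut"      # read-heavy: add read-only nodes
    return "ScaleUp"           # write-heavy (or node limit reached): upgrade the spec

print(choose_scaling_method(read_qps=8000, write_qps=1500,
                            readonly_nodes=2, max_readonly_nodes=8))  # ScaleOut
```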
2.3 Selection of Autoscaling specifications
2.3.1 ScaleUp decision algorithm
The ScaleUp decision algorithm means that, once it has been decided to perform a ScaleUp on a database instance, an appropriate specification is chosen for that instance according to its workload, instance metadata, and other information, so that the workload satisfies the given constraints. Initially, the ScaleUp decision algorithm in DAS Autoscaling was rule-based. Taking PolarDB as an example, a PolarDB cluster currently has 8 instance specifications, and a rule-based decision algorithm was sufficient in the early stage. We have also explored classification models based on machine learning and deep learning, because as database technology evolves toward the Serverless state, the number of available specifications becomes very large, and a classification algorithm is very useful in that scenario. As shown in Figures 7 and 8, we have implemented offline training and real-time recommendation of database specifications based on performance data. By labeling data with custom CPU-utilization ranges, and drawing on the AutoTune automatic parameter-tuning algorithm previously implemented by DAS, we train a classification model on the annotated data set; a proxy traffic-forwarding tool is used to verify the results. The current classification algorithm achieves an accuracy of more than 80%.
Figure 7. Offline training of the database specification ScaleUp model based on performance data
Figure 8. Real-time recommendation of database specifications with the ScaleUp model based on performance data
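A minimal sketch of the classification idea, under stated assumptions: a classifier maps workload features sampled under known "right-sized" specifications to a specification label, and the trained model is then queried with the current workload of an instance that has triggered ScaleUp. The feature set, the synthetic training data, and the random-forest model are all illustrative; they are not the model DAS uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Offline training: each row is (qps, cpu_util, iops, active_sessions) observed
# under a known "right-sized" specification (the label). Data here is synthetic.
X_train = np.array([
    [1000, 40,  500,  20],
    [3000, 75, 2000,  60],
    [8000, 85, 6000, 150],
    [1500, 45,  800,  25],
    [4000, 70, 2500,  80],
    [9000, 90, 7000, 180],
])
y_train = ["2c4g", "4c8g", "8c16g", "2c4g", "4c8g", "8c16g"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Online recommendation: feed the current workload features of an instance that
# has triggered ScaleUp and use the predicted class as the target specification.
current_workload = np.array([[5000, 82, 3500, 110]])
print(model.predict(current_workload)[0])   # e.g. "4c8g" or "8c16g"
```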
2.3.2 ScaleOut decision algorithm
The ScaleOut decision algorithm is similar to the ScaleUp decision algorithm: the essential problem is to determine how many read-only nodes to add so that the instance's current workload falls back to a reasonable level. For ScaleOut we also implemented both a rule-based algorithm and a classification-based algorithm. The idea of the classification algorithm is essentially the same as in Section 2.3.1; the idea of the rule-based algorithm is shown in Figure 9. First we determine the indicators most relevant to read traffic; here we choose COM_SELECT, QPS, and ROWS_READ. S_i denotes the representative value of the read-related indicators of the i-th node, C_i denotes the target constraint value of the i-th node (usually indicators that directly reflect business performance, such as CPU utilization and RT), and f is the objective function. The goal of the algorithm is to determine how many additional read-only nodes X are needed to reduce the load of the whole cluster into the range determined by f. This calculation is clear and effective: after the algorithm went online, its accuracy exceeded 85%, using as the evaluation criterion whether the cluster's CPU load drops to a reasonable level after scaling. Once the ScaleOut target is determined, the newly added read-only nodes produced by the decision algorithm can basically run at a "just saturated" workload, effectively improving the throughput of the database instance.
Figure 9. Rule-based ScaleOut recommendation of read-only node count based on performance data
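The rule-based computation can be sketched as follows, assuming read traffic redistributes roughly evenly across read-only nodes after ScaleOut, so that per-node utilization scales with n/(n+X); solving for the smallest X that brings utilization under the target gives the recommended node count. The target utilization and node cap are illustrative values, not DAS's real constraints.

```python
import math

def recommend_added_nodes(readonly_node_cpu_utils, target_cpu_util=60.0,
                          max_added_nodes=8):
    """Estimate how many read-only nodes X to add so per-node load drops to the target.

    Assumes read traffic redistributes evenly after ScaleOut, so that
    per-node utilization scales roughly with n / (n + X)."""
    n = len(readonly_node_cpu_utils)                    # current read-only node count
    current = sum(readonly_node_cpu_utils) / n          # average read-node CPU utilization
    if current <= target_cpu_util:
        return 0
    # util_after ≈ current * n / (n + X) <= target  =>  X >= n * (current / target - 1)
    x = math.ceil(n * (current / target_cpu_util - 1))
    return min(x, max_added_nodes)

# Example: 2 read-only nodes running at ~90% CPU on average, target is 60%.
print(recommend_added_nodes([92.0, 88.0]))   # -> 1 additional read-only node
```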
3. Implementation in production
3.1 Implementation architecture
The Autoscaling capability is integrated into the DAS service. The overall service involves several modules, including anomaly detection, global decision, the Autoscaling service, and the underlying control-plane execution; Figure 10 shows the service capability architecture of DAS Autoscaling. The anomaly detection module is the entry point for all of DAS's diagnostic and optimization services (Autoscaling, SQL throttling, SQL optimization, space optimization, and so on). It performs 24/7 real-time detection on monitoring indicators, SQL, locks, logs, and operation and maintenance events, and uses AI algorithms to detect and predict patterns such as spikes, seasonality, trends, and mean shifts. The global decision module of DAS then makes the best diagnostic recommendation based on the instance's current workload. When the global decision module decides to perform an Autoscaling operation, the workflow described in Chapter 2 is entered, and the underlying database control service finally carries out the actual scaling of the instance.
Figure 10. Service capability architecture for DAS and Autoscaling
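As one tiny example of the kind of real-time check the anomaly detection module performs, the sketch below flags a spike when the newest sample deviates strongly (by z-score) from recent history; DAS's detectors for spikes, seasonality, trends, and mean shifts are of course far more elaborate.

```python
import numpy as np

def detect_spike(series, z_threshold=3.0):
    """Flag the latest sample as a spike if it deviates strongly from recent history.
    A toy z-score check, only to illustrate the idea."""
    x = np.asarray(series, dtype=float)
    history, latest = x[:-1], x[-1]
    mu, sigma = history.mean(), history.std()
    if sigma == 0:
        return False
    return abs(latest - mu) / sigma > z_threshold

cpu = [41, 43, 40, 44, 42, 45, 43, 41, 44, 95]    # sudden jump in the last sample
print(detect_spike(cpu))   # True
```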
3.2 Product configuration
This section describes how to enable Autoscaling in DAS. Figure 11 shows the home page of the DAS product on the Alibaba Cloud official website; all the functions provided by DAS, such as "instance monitoring", "request analysis", and "intelligent stress testing", can be seen on this page. Clicking "instance monitoring" shows all the database instances the user has connected. Click the link of a specific instance ID and select the "Autonomy Center" option, as shown in Figure 12. For PolarDB instances, users can set options such as the upper limit of the specification, the upper limit on the number of read-only nodes, the observation window, and the quiet period. For RDS for MySQL instances, users can set options such as the trigger threshold, the specification upper limit, and the storage capacity upper limit.
Figure 11. DAS product home page
Figure 12. Automatic scaling settings for PolarDB
Figure 13. Automatic scaling settings for RDS for MySQL
3.3 Online cases
This section introduces two specific online cases. Figure 14 shows the Autoscaling of compute specifications for an online PolarDB instance. From 05:00 to 07:00 the instance load rose slowly, and the CPU utilization eventually exceeded 80%, triggering automatic scaling at 07:00. The Autoscaling service in the background judged that the instance was dominated by read traffic, so it performed a ScaleOut operation and added two read-only nodes to the cluster. As the figure shows, the cluster load dropped markedly after the nodes were added, with CPU utilization falling to about 50%. Around two hours later, business traffic continued to grow and the instance load slowly rose again, reaching the scaling trigger condition once more at 09:00. This time the backend service judged that the instance was dominated by write traffic, so it performed a ScaleUp operation and upgraded the cluster specification from 4 cores / 8 GB to 8 cores / 16 GB. As the figure shows, the load stabilized for about 17 hours after the upgrade; afterwards the load decreased and automatic scale-back was triggered, and the backend Autoscaling service reduced the instance from 8 cores / 16 GB back to 4 cores / 8 GB and removed the two read-only nodes. The Autoscaling service runs automatically in the background without human intervention, expanding capacity at peak load and shrinking at low load, improving business stability while reducing user costs.
Figure 14. Horizontal and vertical scaling of an online PolarDB instance
Figure 15 shows the automatic storage expansion of an online RDS for MySQL instance. The left side of the figure shows that within roughly three hours the instance triggered three automatic disk expansions, adding nearly 300 GB in total; the right side shows the growth of disk usage. Even though the instance's storage usage grew rapidly, the automatic expansion operations were executed seamlessly, truly achieving pay-as-you-go: the instance never ran out of space, while the user's costs were kept down.
Figure 15. Storage expansion of an online RDS for MySQL instance
This article is original content of Alibaba Cloud and may not be reproduced without permission.