Brief introduction:Recently, the paper of intelligent parameter tuning Restune system developed by the intelligent database and DAS team was accepted by SIGMOD 2021. SIGMOD is the first of the three top conferences of the database and the only one of the three top conferences of the Double Blind Review, whose authority is beyond doubt.
Recently, the paper of intelligent parameter tuning Restune system developed by the intelligent database and DAS team was accepted by SIGMOD 2021. SIGMOD is the first of the three top conferences of the database and the only one of the three top conferences of the Double Blind Review, whose authority is beyond doubt.
The acceptance of Restune’s paper shows our technical accumulation and depth in the direction of intelligent database management and control, and is also a milestone step for Alibaba Cloud’s autonomous database and intelligent operation and maintenance. At present, the intelligent parameter tuning function has been implemented on the Database Autonomy Service (DAS), which is the first intelligent parameter tuning function of database configuration officially launched in the industry, further demonstrating the technological leadership in the direction of Ali Cloud Autonomous Database.
1. An overview of the
The parameter tuning service is widely used in Alibaba’s rich business scenarios, such as optimization of database system performance and configuration parameters, selection of machine learning model/deep neural network overparameter, adaptive adjustment of parameters in recommendation system and cloud scheduling system, simulation optimization and parameter optimization in industrial control and supply chain, etc. How to support the actual needs of customers in the production environment is a research hotspot of AI for system in academia.
This year, the Restune intelligent tuning work developed by the Dharma Institute – Database and Storage Lab – Intelligent Database Team (Restune: Resource Oriented Tuning 1951 by Meta-learning for Cloud Databases
https://dl.acm.org/doi/pdf/10.1145/3448016.3457291__), which mainly aimed at the performance of the OLTP database system parameter tuning, It involves database systems such as RDS MySQL, RDS PostgreSQL, POLARDB MySQL, POLARDB-O, etc. The work was published in SIGMOD2021(Research Track), a top-level conference in the database field. And in the AliCloud database autonomous service DAS product technology implementation.
Background 2.
Database systems such as MySQL provide more than 200 configuration parameters. Different parameter combinations and constantly changing business load characteristics together determine the performance and resource utilization of the database system. For businesses within a group, DBAs usually manually select a set of appropriate parameters based on manual experience according to different businesses. As the cloud on the database accelerates, the business becomes more and more diversified, and relying only on the DBA’s manual parameter tuning encounters the bottleneck of horizontal scaling. At the same time, due to the diversity of DBA experiences, it is difficult to find the optimal parameters for various business loads. Automated parameter tuning is crucial for cloud vendors to “put customers first” : adaptively provide personalized optimization parameters for time-varying and diverse business loads in different instance environments.
Database system tuning needs to consider both performance (such as Transactions per second/TPS, Latency) and resource usage (CPU, Memory, IO). Of course, performance optimization is important, but the TPS of real load is often limited by the user’s request rate, and it is difficult to reach peak performance. Figure 1 shows TPS and CPU utilization with different values under the two parameters. It can be seen that the CPU utilization in the red region with the highest TPS varies greatly from 15% to 75%. In the case of the same TPS, there is a lot of room for optimization of resource utilization. From the perspective of Cost, TCO(Total Cost of Ownership) is an important indicator of cloud database and also the main advantage of cloud database.
Optimizing resource use is of great significance to reduce TCO of cloud database and improve cost advantage. In fact, we found over-provision in most instances on the cloud. In addition, excessive resource usage may lead to cloud database exceptions and performance degradation caused by resource contention; Optimizing database resource usage can effectively reduce or even avoid failures caused by such situations and improve stability.
3. The challenge
We analyzed that the goal of tuning parameters is to consider optimizing resource utilization and performance at the same time. As mentioned above, performance such as TPS is often limited by the request rate of the client and cannot reach the peak performance. Therefore, we need to find the database configuration parameters with the minimum resource utilization and meet the requirements of the SLA.
On the other hand, tuning the parameter itself needs to be done as quickly as possible (otherwise it violates reduced resource usage). A typical tuning system requires hundreds of iterations to find a good configuration. Each iteration takes about 3-5 minutes to play back the workload, which usually takes a day of tuning training. But to solve the online Troubleshoot’s needs, it often takes less than an hour to find the problem and recover. As a cloud vendor, we adopt knowledge transfer learning based on the historical data of the existing business load parameter tuning, which can effectively accelerate the parameter tuning process, so as to find out the good database parameter configuration as quickly as possible.
4. Relevant work
Database tuning is a relatively hot area of research recently, and a lot of work has been published in the past few years. These works can be mainly divided into three categories according to technical ideas: heuristic method based on search, method based on Bayesian optimization, method based on Reinforcement Learning model.
- Search-based heuristic method: This kind of method is usually based on the heuristic idea and searches through a given rule algorithm to find out the optimization parameters. The representative of this work is the BestConfig[3] system. This type of approach relies on prior assumptions about the workload and the impact of parameters on performance, but in practice, especially in cloud scenarios, it is often difficult to do specific optimization and feature engineering for each workload. When searching for a new set of parameters, this kind of method does not take into account the distribution of sampled data before, so it is not efficient.
- Bayesian optimization based methods: This class of methods is represented by Ituned [4] and SigMod17 working Ottertune [5] of CMU’s Andy Pavlo’s lab. Bayesian optimization treats the tuning parameter as a black box optimization problem, simulates the function between the parameter and the target by proxy function, and designs the acquisition function to minimize the number of sampling steps. This type of approach does not consider tuning parameters with the goal of optimizing resources, only optimizing peak performance. In practice, with the exception of extreme scenarios such as pressure test and push, TPS is generally not felt by users, and TPS tends not to reach its peak, so it is not enough to consider performance as the goal. Ottertune system also proposes a mapping scheme based on Internel Metric (index of database state table) to make use of existing data. This mapping method utilizes historical data from the same hardware type, and does not make full use of the rich data resources of cloud vendors. On the other hand, this method relies on the similarity calculation of the predicted Internel Metric, which is easy to be inaccurate in the case of fewer data points.
- Reinforcement learning-based approaches: This type of approach is a popular direction of database tuning recently, mainly including the work of SIGMOD18 CDBTUNE [6] and VLDB19 QTUNE [7]. By abstracting the relationship between Internal Metrics(state) and Knobs(action) into a policy neural network and a value network for feedback, the database tuning problem is transformed into a Markov decision process, and self-training is carried out continuously. Learn the optimal parameters. For one thing, this kind of work does not consider optimizing resources. On the other hand, more importantly, the parameter tuning problem is not a Markov decision process with state, because the parameters directly determine the performance of the database. It does not need a complex state space, which is different from reinforcement learning, which requires solving the Bellman equation to optimize the Reward accumulated by the model. In these works, it often takes thousands of iterations to find good parameters, which is difficult to meet the requirements of tuning parameters in production environment.
5. Problem definition and algorithm overview
We define the problem as a limited optimization problem as follows, where the limiting condition constants can be set to the TPS and Latency values under the default configuration parameters.
Restune turns optimizing resource use and satisfying SLA into a Constrained Bayesian Optimization problem. Compared with the traditional Bayesian optimization algorithm, the Constrained Ei Function (Constrained Ei, CEI) was adopted here, and the limiting information was added into the commonly used Ei utility Function (Acqusition Function). See the fifth chapter for details.
On the other hand, in order to make better use of the existing data, Restune also designed a Gaussian weighting model that combines static weights and dynamic weights. Through the Ensemble history Gaussian process model, the beat function of target workload is weighted and averaged. The central question here is how to define weights.
At cold start (when there is no observed data), static weight learning will assign weights according to the meta-feature distance of the task workload. The calculation of meta-feature needs to obtain the workload feature vector through workload analysis.
When accumulated a certain data, such as 10 data), ResTune using dynamic weight learning strategy, through the partial order relation (as shown in the figure below, although TPS absolute value is different, but the trend of curved surface is the same, so the partial order relation also similar), compare the prediction of history study and the similarity between the target task real observations. With a dynamic assignment strategy, weights are dynamically updated as the number of observations of the target workload increases. Through these two strategies, Restune finally gets a meta-learner, which can be used as an experienced agent model. For more details, please refer to Chapter 6 of this paper.
6. Restune system design
Restune abstracts the tuning problem into a constrained optimization problem that minimizes resource utilization while meeting SLA constraints. The following diagram shows the system architecture design for Restune. The Restune system consists of two main parts: the Restune Client and the Restune Server.
- The Restune Client runs in the user’s VPC environment and is responsible for the preprocessing of Target tasks and the execution of recommended parameter configuration. The Restune Client consists of the Meta-Data Processing module and the Target Workload Replay module.
- Restune Server runs in the back-end parameter cluster and is responsible for recommending parameter configuration in each training iteration, including the Knowledge Extraction module and the Knobs Recommendation module.
An iterative step in a parameter task flows as follows: When a parameter task is started, the system first copies the target database and collects the target workload for a certain period of time into the user environment for future playback.
In each iteration, the target task first obtains the meta-feature and base model through the meta-data Processing module, which are used as the input of the Knowledge Extraction module. The Knowledge Extraction module is responsible for calculating the static and dynamic weights of the current task and the historical task when integrating the base model, and the weighted sum of the base models to obtain the Meta model. In Knobs Recommendation module, a set of parameter configuration is recommended according to Meta Learner; The Target Workload Replay module validates the recommended parameters and writes the results into the historical observations of the Target task.
The above training process is repeated several iterative steps, and terminates when the maximum training step is reached or the lifting effect converges. After the training of the target task, Restune will collect the meta-feature and observation Data of the current task into a Data Repository as historical Data.
The specific functions of each module are as follows:
- Meta-data Processing: At the initial start of the parameter tuning task, the metadata Processing module analyzes the workload of the target task and uses TF-IDF method to count the SQL reserved words as the meta-feature of the target task. In each iteration, the metadata processing module takes the historical observation data as input, and after normalization, the utilization rate, TPS and Latency of resources (CPU, memory, IO, etc.) are fitted to the Gaussian model, which serves as the base model of the target task.
- Knowledge Extraction: In order to extract and utilize historical Knowledge, we put forward an integrated method of weighted summation using Gaussian model, that is, the key parameter u of metamodel M is calculated by weighting the base model. Static and dynamic methods are used to calculate the weight of the base model. During initialization, the weight is calculated in a static way, with the feature vector as the input, and the probability distribution vector of resource utilization is obtained through the pre-trained random forest. Finally, the static weight is determined according to the distance between the probability distribution vectors as the task similarity. When the data volume is sufficient, Restune uses a dynamic weighted learning strategy to compare the similarity between the prediction of the base learner and the actual observations of the target task. With a dynamic assignment strategy, the weights are updated as the number of observations of the target workload increases. With these two strategies, we end up with a meta-learner that can be used as an experienced proxy model.
- Knobs Recommendation: The parameter Recommendation module recommends a set of parameter configurations based on the metamodel; Collecting function We used the Constrained Ei function (Constrained Ei, CEI), which rewrote the utility function of Ei according to the constraints: when the parameters do not meet the SLA limitation, we set 0, and the current optimal parameter is defined as the optimal parameter meeting the SLA limitation. The CEI acquisition function can better guide the exploration of the optimal region satisfying the restriction.
- The Target Workload playback module first recommends parameters applied to the backup database and triggers the playback of the Workload. After a period of validation, the validation results (including resource utilization, TPS, and latency) along with the recommended parameters are written into the observation history of the Target task.
7. Experimental evaluation
We compared the performance and speed of Restune and other SOTA (State-of-the – Art) systems in multiple scenarios.
7.1. Single-task scenarios
First, in the single-task scenario, we selected CPU utilization as the optimization target to verify the effectiveness of Restune in solving optimization problems with SLA constraints. Here we test Sysbench, Twitter, TPC-C, and two real workload: Hotel Booking and Sales. As you can see, the Restune approach works best and is most efficient on all loads.
7.2. Migration scenarios
Due to the large number of various instances of users on the cloud database, it is very important that the method we proposed can be migrated between different workloads and different hardware. Also taking CPU utilization as the optimization goal, we test the migration effect between different machine hardware, and it can be seen that the meta-learning algorithm we proposed brings significant improvement in training speed and training effect. This allows the entire Restune parameter tuning process to be completed in about 30-50 steps, as opposed to the hundreds of iterations typically required for non-migration scenarios.
Similarly, in a migration experiment between different workloads, our meta-learning approach resulted in a significant increase in training speed.
7.3. Memory and I/O resource optimization
In addition to CPU resources, we test the parameter optimization effect of memory resources and IO resources. As you can see from the chart below, Restune reduces IOPS by 84% to 90% for IO resource optimization and parameter tuning tasks. For the memory resource optimization tuning task, Restune reduced the memory utilization from 22.5 GB to 16.34 GB. We also estimate the cost reduction of TCO in the paper.
8. DAS business landing
Intelligent parameter tuning technology was implemented on DAS(Database Autonomy Service) products. We went online in different stages and detailed functions. It mainly includes template function and intelligent parameter adjustment function based on pressure test. Alibaba Cloud is the first company in the industry to launch the parameter adjustment function, ahead of Tencent and Huawei.
8.1. Template parameter function
The template parameter function is a parameter tuning scenario that we launched in the first issue. Prior to this, the RDS MySQL database on the cloud had only one uniform set of parameter templates, which made it difficult to meet the varying user business loads on the cloud. Therefore, we selected different kinds of benchmarks to adjust the off-line training of parameters in the RDS Instance type most frequently used by users.
We divided the user load into 6 typical scenarios, such as trading, social network, pressure test, etc. Through offline training, we trained the optimal configuration for each typical scenario, and provided users to make choices according to their business characteristics. In this way, we extend the previous uniform set of parameter templates for RDS to a variety of typical OLTP business scenarios.
The table below shows the results of our off-line scheduling training, with a 13%-50% increase in different workload. Here we use TPS performance as the optimization goal.
The name of the Workload | TPS in RDS default configuration | TPS after tuning parameters | Percentage increase |
TPCC (Order Processing) | 620 | 940 | < / span > < span > write 52% |
Smallbank (banking business processing) | 17464 | 22109 | < / span > < span > write 26.6% |
Sysbench (stress test) | 7950 | 10017 | < / span > < span > write 26% |
Twitter (Social Network) | 41031 | 48946 | < / span > < span > write 19.2% |
TATP (Communications) | 18155 | 21773 | < / span > < span > write 19% |
YCSB (Stress Test) | 41553 | 55696 | < / span > < span > write 34% |
Wikipedia (encyclopedia of knowledge) | 600 | 678 | < / span > < span > write 13% |
9. Future jobs
10. Brief introduction of other research work on intelligent database
reference
It’s July 7th at 14:00
The annual blockbuster release of DAS, the database autonomy service
The following two-dimensional code
Here,
Copyright Notice: