Introduction: Since promotional events such as “Double 11” and “618” proved their value, more and more Internet companies have been using irregular marketing campaigns to stimulate consumption and increase revenue. Behind every shopping carnival, however, how to provision the right amount of computing resources for the promotion remains a persistent problem for developers. In addition, according to Gartner statistics, under the influence of the epidemic more and more enterprises have begun to accelerate the migration of key business modules from on-premises environments to the public cloud, in order to improve the stability and disaster resilience of their services. How to effectively evaluate and plan the capacity of key resources such as computing power, compute engines and bandwidth has become a technical challenge in cloud scenarios.
1. Background
To address these scenarios, the Alibaba Cloud Database Autonomy Service (DAS) team has launched an intelligent pressure test service. It is designed to solve three kinds of problems: evaluating computing resources before a big promotion, planning resource capacity when migrating offline workloads to the cloud, and evaluating database selection for cross-engine migration and similar cases. DAS (Database Autonomy Service) is a cloud service that uses machine learning and expert experience to give databases self-awareness, self-repair, self-optimization, self-operation and self-security. It helps users eliminate the complexity of database management and the service failures caused by manual operation, and effectively guarantees the stability, security and efficiency of database services. The solution architecture is shown in Figure 1.
2. Components of an intelligent pressure test
Pressure testing, or stress testing, is a test method used to establish the stability of a system. It is usually performed outside the normal operating range of the system to examine its functional limits and hidden risks. In the traditional sense, pressure testing a network server means continuously applying load to it in order to find the system's bottlenecks or points of unacceptable performance, and thereby the maximum service level it can provide. In the database scenario, a pressure test usually refers to a performance test of the database. By continuously increasing the number and concurrency of SQL statements executed by the database server, the test checks whether a database of the specified specification can serve requests continuously and stably, and corresponding decisions are made based on the test results. These include adjusting database specifications, changing deployment patterns, optimizing business SQL, and more. Generally, a pressure test involves three key parts: pressure test data preparation, traffic playback and result analysis, as shown in Figure 2.
Figure 2. Key components of intelligent pressure test
Pressure test data: In the database scenario, the traffic data consists of SQL statements, but SQL statements alone are not sufficient. When a SQL statement is executed, the actual data distribution and the indexes of the tables affect its execution time. Therefore, in the database scenario, the test data includes the database and table structure, the data in the tables, the indexes, and the SQL statements to execute. In addition, in some special scenarios with strict security requirements, only the table structure may be reused, and the original data cannot be used for traffic playback. For this situation, we propose an algorithm that generates data intelligently, producing simulated data that follows the original data distribution for playback.
Traffic replay technology: In traditional performance pressure testing, the concurrency and execution order of SQL statements are not constrained to match the original traffic, so the result of a single pressure test differs from the behavior of the original business traffic. As a result, a single database resource evaluation task typically requires several pressure test runs, whose results are then averaged to evaluate resource performance. This approach takes a lot of testing time and requires the tester to have some database experience, so it usually has to be done by a DBA. To solve this problem, DAS has improved the single pressure test: using idempotent technology, it ensures that the performance observed during playback is close to that of the original business traffic without multiple playbacks, which greatly shortens resource evaluation and lowers the database pressure test experience required.
Pressure test result analysis: Effective result analysis helps users choose reasonable resource specifications and uncover hidden risks during business traffic playback. Data such as key database parameters, comparisons of key performance indicators, and SQL optimization suggestions help users understand resource differences and potential optimization points, and support subsequent decisions.
3. Inside intelligent pressure test technology
3.1. Intelligent data generation technology
There are many open source tools in the industry for database performance testing, such as Sysbench, mysqlslap and TPCC. Such tools create a certain amount of SQL traffic by opening many concurrent database connections and running predefined query statements, which simulates heavy use of the database by a business. However, the performance of the simulated scenarios is usually quite different from the actual performance of the business, so a simulated pressure test cannot meet the requirements of computing resource evaluation. Running the pressure test with real data from the business database is the basic condition for resource evaluation. Users of Alibaba Cloud databases can conveniently obtain the data required by the pressure test through the SQL audit function. For users who run self-built databases elsewhere or on Alibaba Cloud ECS, it is difficult to obtain the historical table data or traffic data needed for the pressure test, and in some scenarios with strict data security requirements neither the original data nor the SQL traffic data may be used at all.
At present, we use intelligent data generation technology in the single-table query scenario to produce data that conforms to the business data distribution, which can then be used to pressure test and evaluate resources. The premise of the algorithm is that we know a set of SQL templates and their corresponding execution indicators, such as RT, ROWS_SENT and ROWS_AFFECTED. We want to instantiate these templates into concrete SQL statements that, when executed on the target table, show similar execution indicators (here we assume that SQL statements of the same template are executed with the same execution plan). As shown in Figure 3, we need to search for parameters a and b to instantiate the SQL template so that, on the given data, the number of rows returned is 1.
Figure 3. SQL templates
When searching for SQL parameters, point queries and point updates can be instantiated directly from primary keys and unique keys. When the number of returned or updated rows is greater than 1, we use a sample-based cardinality estimation method to estimate the number of rows an instantiated SQL statement would return or update, and then search for the parameters that instantiate the SQL template.
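The parameter search described above can be illustrated with a minimal sketch. It is not the DAS implementation: the range template, the table and column names (`orders`, `amount`) and the helper functions `estimate_rows` and `search_range_params` are hypothetical, and the estimator simply scales the hit count of a uniform random sample to the table size.

```python
import random
from bisect import bisect_left, bisect_right


def estimate_rows(sorted_sample, table_rows, low, high):
    """Estimate rows matching `low <= col <= high` by scaling the sample hit count."""
    hits = bisect_right(sorted_sample, high) - bisect_left(sorted_sample, low)
    return hits * table_rows / len(sorted_sample)


def search_range_params(sample, table_rows, target_rows):
    """Pick (low, high) from sampled values whose estimated row count is closest to target_rows."""
    s = sorted(sample)
    # Number of sample points the range should cover (at least one, at most all).
    width = max(1, min(len(s), round(target_rows * len(s) / table_rows)))
    best = None
    for start in range(len(s) - width + 1):
        low, high = s[start], s[start + width - 1]
        est = estimate_rows(s, table_rows, low, high)
        err = abs(est - target_rows)
        if best is None or err < best[0]:
            best = (err, low, high, est)
    _, low, high, est = best
    return low, high, est


# Hypothetical usage: the list stands in for one column of the target table.
rng = random.Random(42)
column = [rng.randint(0, 9999) for _ in range(100_000)]
sample = rng.sample(column, 1000)
low, high, est = search_range_params(sample, len(column), target_rows=500)
# Hypothetical template: SELECT * FROM orders WHERE amount BETWEEN ? AND ?
print(f"SELECT * FROM orders WHERE amount BETWEEN {low} AND {high}  -- estimated rows: {est:.0f}")
```

The search only considers ranges whose endpoints are sampled values, which keeps it linear in the sample size; a production system would presumably also need to handle multi-column predicates and skewed samples.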
Figure 4 shows a pressure test generated from the traffic of a read-write service during the morning peak. The generated pressure test and the real service perform similarly on multiple indicators, which shows that the generated data can effectively simulate the real online data.
Figure 4. Pressure test based on the generated data
3.2. Idempotent pressure test technology
How to play back traffic effectively and repeatably once the data is prepared is another core technology of the intelligent pressure test. Existing open source tools can create a certain amount of SQL traffic by opening a large number of concurrent database connections and running predefined query statements, simulating heavy use of the database by a business. However, once a real business model with some data skew is used, a serious problem appears: if the same model and the same data are tested several times on RDS MySQL, the performance curves of the runs may not match because of the skew. For example, a certain row may be queried at time point A in the first round and at time point B in the second round, which greatly interferes with problem analysis. As shown in Figure 5, although the pressure of the two curves is similar, their jitter frequency is completely different, which makes analysis difficult.
Figure 5. The effect of running the same test model twice on the same database instance
In response, we propose the concept of an idempotent pressure test, meaning that the same test produces exactly the same SQL no matter how many times it is run. In the idempotent case, the SQL text generated at each point in time is identical (assuming the database processing power is identical), and all SQL is executed in the same order throughout the pressure test. Currently, alignment is complete within each thread; for performance reasons and given actual requirements, strict alignment across different threads is not enforced.
With the help of idempotent technology, DAS intelligent pressure test can achieve consistent pressure test for the scene described above, and the effect is shown in Figure 6.
Figure 6. The effect of running the same smart pressure test twice on the same database instance
Idempotent testing is achieved from three aspects: the generation logic of the test threads, the total number of requests, and the final consistency of writes. During the test, each thread is guaranteed to see the same sequence of random numbers on every run, while different threads see different sequences. Keeping the number of requests per thread constant fixes the total number of requests. Combined with custom primary keys and agreed update intervals, this avoids conflicts on auto-increment primary keys and between updates, and ensures the final consistency of the data after the pressure test.
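The three aspects above can be sketched in a few lines. This is an illustration under stated assumptions, not the DAS implementation: the function `thread_statements`, the table `t`, the statement mix and the way key ranges are assigned per thread are all hypothetical. Each thread gets its own seeded generator, a fixed request count, explicit primary keys for inserts, and a private key interval for updates.

```python
import random


def thread_statements(base_seed, thread_id, requests_per_thread,
                      key_space=1_000_000, insert_base=10_000_000):
    """Return the deterministic SQL sequence for one worker thread (hypothetical workload)."""
    # One generator per thread: the same seed gives the same sequence on every run,
    # and different thread ids give different sequences.
    rng = random.Random(base_seed * 1_000_003 + thread_id)
    # Assumed "agreed update interval": each thread only updates its own slice of keys.
    slice_size = 1000
    update_low = thread_id * slice_size
    statements = []
    for i in range(requests_per_thread):  # fixed request count per thread
        roll = rng.random()
        if roll < 0.7:
            key = rng.randrange(key_space)
            statements.append(f"SELECT * FROM t WHERE id = {key}")
        elif roll < 0.9:
            key = update_low + rng.randrange(slice_size)
            statements.append(f"UPDATE t SET v = {rng.randrange(100)} WHERE id = {key}")
        else:
            # Explicit primary key from a thread-private range instead of auto-increment.
            pk = insert_base + thread_id * requests_per_thread + i
            statements.append(f"INSERT INTO t (id, v) VALUES ({pk}, {rng.randrange(100)})")
    return statements


# Two runs of the same thread produce identical SQL; another thread produces different SQL.
run_1 = thread_statements(base_seed=7, thread_id=0, requests_per_thread=50)
run_2 = thread_statements(base_seed=7, thread_id=0, requests_per_thread=50)
other = thread_statements(base_seed=7, thread_id=1, requests_per_thread=50)
print(run_1 == run_2, run_1 == other)
```

Because inserted keys and updated keys never overlap between threads in this sketch, the final table contents do not depend on how the threads interleave, which is one way to obtain the final write consistency the text describes.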
4. Product implementation
4.1. Product process
After introducing the components of intelligent pressure test and the corresponding core technologies, let’s look at how DAS has implemented intelligent pressure test into a product. From the point of view of the process of pressure test, the whole process of intelligent pressure test can be divided into preparation stage, SQL processing stage, playback stage and effect evaluation stage, as shown in Figure 7.
Figure 7. Intelligent pressure test product flow
The preparation stage sets up the machine environment for the test: purchasing the ECS machine, preparing the target instance, configuring the runtime environment on the ECS machine, and so on. At present, the DAS intelligent pressure test can automatically select a suitable ECS machine and configure its runtime environment according to the QPS of the pressure test traffic and the playback duration. It also allows users to run the pressure test on their own machines. DAS can prepare the target instance for users through RDS backup and recovery or DTS synchronization, and also lets users specify the target instance freely.
The SQL processing stage prepares the full SQL detail data used in the pressure test before the test starts. It preprocesses the SQL data produced by SQL Insight or by the intelligent algorithm, including de-duplicating Prepared Statement records, filtering out log statements, and merging the statements of a transaction.
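The preprocessing steps above might look roughly like the sketch below. The record format `(session_id, kind, sql)`, the `kind` values, and the rules for what counts as noise are assumptions made for illustration; the actual SQL Insight record layout and the DAS rules are not described in this article.

```python
def preprocess(records):
    """Turn raw audit records into replay units: a list of (session_id, [sql, ...]).

    `records` is an iterable of (session_id, kind, sql) tuples, where kind is one of
    "prepare", "execute" or "query" (a hypothetical format).
    """
    units = []
    open_txn = {}  # session_id -> statements of the transaction in progress
    for session_id, kind, sql in records:
        text = sql.strip()
        upper = text.upper()
        # De-duplication: drop the "prepare" record, keep the "execute" that carries values.
        if kind == "prepare":
            continue
        # Filtering: skip statements assumed to be monitoring or heartbeat noise.
        if upper.startswith("SHOW ") or upper == "SELECT 1":
            continue
        # Merging: collect everything between BEGIN and COMMIT/ROLLBACK into one unit.
        if upper in ("BEGIN", "START TRANSACTION"):
            open_txn[session_id] = [text]
        elif session_id in open_txn:
            open_txn[session_id].append(text)
            if upper in ("COMMIT", "ROLLBACK"):
                units.append((session_id, open_txn.pop(session_id)))
        else:
            units.append((session_id, [text]))
    return units


# Hypothetical usage with a handful of interleaved records from two sessions.
raw = [
    (1, "prepare", "SELECT * FROM t WHERE id = ?"),
    (1, "execute", "SELECT * FROM t WHERE id = 5"),
    (2, "query", "BEGIN"),
    (1, "query", "SELECT 1"),
    (2, "query", "UPDATE t SET v = 1 WHERE id = 5"),
    (2, "query", "COMMIT"),
    (1, "query", "SHOW VARIABLES"),
]
units = preprocess(raw)
```

Grouping a transaction into one unit lets the replayer send its statements on a single connection in order, even when records from other sessions are interleaved in the log.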
The playback stage uses the idempotent technology to play back the traffic. It provides real-time database performance data and the load of the test machine, so that users can follow the progress of the test. In this stage, DAS also combines its intelligent parameter tuning algorithm with the pressure test, so users can tune database parameters as part of the test. The algorithm itself will be described separately in later articles.
The effect evaluation stage interprets the metrics collected during the pressure test. DAS compares the performance parameters and key performance indicators commonly used in business tuning to help users make resource evaluation decisions. For slow SQL, locks and other problems found during the pressure test, DAS provides corresponding improvement suggestions and remedies, which also help users optimize their business.
4.2. Product use
Users can open the service from the “Intelligent Pressure Test” entry in the left menu of the DAS console, as shown in Figure 8. DAS currently supports pressure tests for RDS MySQL and PolarDB MySQL, and support for other relational database engines is under development.
Figure 8. Intelligent pressure test interface
After the end of the pressure test, the user can view the performance data comparison between the target instance and the source instance as well as the comparison of key parameters through task details, as shown in Figure 9.
Figure 9. Comparison of effect after pressure test
4.3. Product billing
At present, there is no separate charge for the DAS intelligent pressure test function. ECS and RDS instances newly created during the pressure test are billed on a pay-as-you-go basis at the rates published on the official website of the corresponding products, with no additional service charge. As mentioned earlier, the pressure test relies on the full SQL detail data on the source side or on the corresponding database and table structure data, so the service only requires DAS Pro to be enabled on the source instance.
4.4. Customer cases
Since the DAS intelligent pressure test service was launched in 2020, its main users have been leading cloud customers. It has served nearly 100 customers in total, in scenarios including resource assessment for cloud migration, business promotion assessment, engine switching assessment and database operation verification.
5. Planning for the future
Next, the intelligent pressure test will extend its supported database engines to cover all relational database engines on the cloud. At the same time, it will stay close to customers' real business problems, integrating tightly with scenarios such as cloud migration, resource assessment and engine recommendation, and will provide the corresponding pressure test evaluation suggestions and reports, so as to build database capacity planning capabilities for large-scale scenarios together with enterprise customers.
This article is original content of Aliyun and may not be reproduced without permission.